The challenges of a 100% uptime solution

The year 2018 ushered in the era of DDoS attacks exceeding 1 terabit per second (Tbps) in a single attack. A “DDoS” (Distributed Denial-of-Service) attack is defined as “a malicious attempt to disrupt normal traffic of a targeted server, service or network by overwhelming the target with a flood of internet traffic”. The most famous instance was the attack on GitHub that caused downtime of 15-20 minutes[1]. Two days later, NETSCOUT Arbor confirmed a 1.7 Tbps DDoS attack, but this one managed to fly under the radar as there were no reported service disruptions.

1 terabit per second equates to 125,000 MB per second (in decimal, at 8 bits per byte), so these attacks amounted to over 125,000 MB every second!

These numbers fascinate me as the topic of system uptime is a particularly sensitive one for the financial industry. What is the best possible uptime percentage and how can financial institutions achieve it? I will share some of my thoughts on that controversial topic.

First, I want to establish the best uptime that providers are willing to commit to contractually. These are the highest Service Level Agreement (SLA) figures that I was able to find among enterprise-level service providers:

Rackspace – 99.99% uptime[2]
Equinix – 99.99% uptime[3] in the SLA and 99.9999% in marketing brochures
CenturyLink – 99.99% uptime[4]

Let’s look a little closer and do some simple math:

Uptime Percentage	Annual SLA Downtime
99.90%	8 hours 45 minutes 36 seconds
99.99%	52 minutes 34 seconds
99.999%	5 minutes 15 seconds
99.9999%	32 seconds

If we apply this to the FX industry, the numbers look slightly different. Based on 6,864 hours of trading production each year:

Uptime Percentage	Annual SLA Downtime
99%	68 hours 38 minutes
99.90%	6 hours 52 minutes
99.99%	41 minutes
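
For readers who want to reproduce the two tables, here is a minimal Python sketch. The function names are my own for illustration, and the only inputs are the figures already used above: a calendar year of 365 × 24 hours and an FX trading year of 6,864 hours. Results are rounded to the nearest second.

```python
def allowed_downtime_seconds(uptime_percent: float, period_hours: float) -> float:
    """Downtime budget, in seconds, that an uptime percentage permits over a period."""
    return period_hours * 3600 * (1 - uptime_percent / 100)


def format_duration(seconds: float) -> str:
    """Render a duration as hours, minutes and seconds, rounded to the nearest second."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours} hours {minutes} minutes {secs} seconds"


CALENDAR_YEAR_HOURS = 365 * 24  # 8,760 hours
FX_TRADING_HOURS = 6864         # trading hours per year, as used in the text

if __name__ == "__main__":
    periods = (("calendar year", CALENDAR_YEAR_HOURS), ("FX trading year", FX_TRADING_HOURS))
    for label, hours in periods:
        for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
            budget = format_duration(allowed_downtime_seconds(pct, hours))
            print(f"{label}: {pct}% uptime allows {budget} of downtime")
```

Changing `period_hours` is all it takes to adapt the calculation to a monthly SLA window or to a different trading schedule.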

A Perfect 100% Uptime

My best advice to any client who is promised 100% uptime: make sure that this number is recorded in your SLA, with punitive damages payable to you should the company fail to deliver on its promise.

What are the most common causes of outages (trading server and price feed/execution-related) among FX companies?

1. Human errors

The human factor plays a big role in every industry, but it is particularly significant in FX because:

(a) many FX companies simply don’t have the budget for their own IT team and often outsource technical support to third-party vendors or individuals. Time zones and communication are the cornerstones of a proper support infrastructure, as is the professionalism of the service providers

(b) high employee turnover raises the potential for configuration errors

2. Third-party provider downtime

Hosting Providers:

Not every FX company has its own dedicated cage^[5] and network engineers on site in premium data centers. This is an expensive proposition, so they work with resellers, or resellers of resellers. While taking this route may be cost-efficient when the business is small, it becomes dangerous when the business grows. The lag in communication between three or four companies can significantly lengthen issue resolution time as well as cause errors. Some may think that short-term savings are more important, but are they really? On average, a company will lose $100k per hour of downtime.^[6]

Do your own math.
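
As a rough sketch of that math, the snippet below multiplies the downtime an SLA permits by an hourly cost. It assumes the $100k per hour figure quoted above and the 6,864 trading hours used earlier; the function name and defaults are illustrative, and you should substitute your own numbers.

```python
COST_PER_HOUR_USD = 100_000  # hourly figure quoted in the text; replace with your own estimate
FX_TRADING_HOURS = 6864      # trading hours per year, as used earlier in the article


def annual_downtime_cost(uptime_percent: float,
                         period_hours: float = FX_TRADING_HOURS,
                         cost_per_hour: float = COST_PER_HOUR_USD) -> float:
    """Cost per year if the provider uses up the entire downtime budget of its SLA."""
    downtime_hours = period_hours * (1 - uptime_percent / 100)
    return downtime_hours * cost_per_hour


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        print(f"{pct}% uptime: up to ${annual_downtime_cost(pct):,.0f} per year")
```

This is a worst-case figure: it assumes every permitted minute of downtime actually occurs and that each hour costs the same, which is a simplification.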

Pricing Providers:

The majority of FX businesses nowadays have both a main and a back-up Liquidity Provider (LP). In the event that one LP has a disconnect (outage), the FX business can reconnect to the alternative one. Any decent liquidity provider has real-time replication between data centers by default, and a back-up pricing feed is provided by virtually any LP or Bridge/Aggregator provider. I also believe that a back-up LP is not just a back-up in the event of an emergency but rather a broker’s best tool for negotiation and comparison. Furthermore, one should take a very conservative approach to anything that has to do with clients’ money. Diversification is key for any business, particularly in FX.
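
The main/back-up arrangement can be pictured with a small Python sketch. Everything here is hypothetical: `LiquidityProvider` is a stand-in for a real LP connection, and the `connected` flag would in practice come from the bridge or session layer rather than being set by hand.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class LiquidityProvider:
    """Hypothetical stand-in for a real LP connection."""
    name: str
    connected: bool = True


def select_active_lp(providers: Sequence[LiquidityProvider]) -> Optional[LiquidityProvider]:
    """Return the first connected provider in priority order, or None if all are down."""
    for lp in providers:
        if lp.connected:
            return lp
    return None


if __name__ == "__main__":
    main_lp = LiquidityProvider("main")
    backup_lp = LiquidityProvider("backup")
    priority = [main_lp, backup_lp]

    print(select_active_lp(priority).name)  # main
    main_lp.connected = False               # simulate a disconnect at the main LP
    print(select_active_lp(priority).name)  # backup
```

The priority list is the only configuration, so adding a third provider or swapping main and back-up is a one-line change.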

3. Network failure/Usage spikes/surges/Software malfunction

No one can prevent severe weather, a rare software or network malfunction, or mistakes in the code on the server or pricing side. For example, Cisco had a 5-hour outage in October 2019 caused by an “internal system change”.

What can a financial company do to minimize the downtime?

The “one sentence” IT answer is: “to minimize downtime, one must eliminate every single point of failure and have hot-hot redundancy”.

(a) Setting up clear and straightforward infrastructure

(b) Staying on top of network-monitoring implementations and updates, and using the latest technology where possible.^[7]
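
To see why removing single points of failure matters so much, here is a minimal Python sketch of the standard availability arithmetic. It assumes component failures are independent, which real systems often violate (shared power, a shared network path, or one DDoS attack hitting both sites), so treat the results as an optimistic upper bound. The function names are my own.

```python
from math import prod
from typing import Iterable


def series_availability(availabilities: Iterable[float]) -> float:
    """Every component must be up, so the availabilities multiply."""
    return prod(availabilities)


def parallel_availability(availabilities: Iterable[float]) -> float:
    """One working component is enough: 1 minus the chance that all are down at once."""
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    single = 0.999  # one server at 99.9% uptime
    pair = parallel_availability([single, single])
    chain = series_availability([0.9999, 0.9999, 0.9999])
    print(f"hot-hot pair of 99.9% servers: {pair:.6%}")
    print(f"three 99.99% components in a chain: {chain:.6%}")
```

Under the independence assumption, two 99.9% servers in a hot-hot pair reach roughly 99.9999%, while chaining three 99.99% components that all have to work pulls the total below 99.99%.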

With the right technology tools and expertise, an FX business can get really close to the unobtainable 100%. Then again, as Charles Dickens wrote back in 1837, “Never say never”.

Best of luck and may the highest possible uptime be with you!

Sources:

^[1] https://www.wired.com/story/github-ddos-memcached/

^[2] https://www.rackspace.com/information/legal/cloud/sla

^[3] https://www.equinix.com/industries/cloud-providers/infrastructure/

^[4] https://apps.centurylink.com/slas

^[5] I will use some technical terms in this article. You can brush up on them here: http://aboutcolocation.info/us-cabinets-racks-and-cages/

^[6] https://itic-corp.com/blog/2019/05/hourly-downtime-costs-rise-86-of-firms-say-one-hour-of-downtime-costs-300000-34-of-companies-say-one-hour-of-downtime-tops-1million/

^[7] https://uptimeinstitute.com/uptime_assets/c7994b3638025e429eb22a4e0ba873803cf1ee993902ef0d983e5a4302901a3e-data-center-outages-are-common-costly-and-preventable.pdf
