Mar 9, 2021

Running FIX on Kubernetes with Microservices

Clear Street Engineering

Background

At Clear Street, trading is what we do. Making it easy for customers to submit trades to us is of paramount importance. The FIX protocol is ubiquitous across the financial services industry as a means to facilitate exchange of financial data. FIX engines allow clients and servers to stream trades and responses back and forth continuously with many built-in features, such as heartbeats and status messages. Although we already accept trades via file drops and our API, customers expect to have the ability to use the same FIX engine they already have in place.

As part of integrating with FIX, we adapted our existing API spec to fit the FIX protocol and created a new spec.

Problem: Outdated Assumptions

FIX engines are well understood and have been implemented a number of times by various companies in the industry. In fact, we were able to leverage an existing open source Go library to handle the implementation of the FIX protocol. However, the FIX protocol was released in 1992; it makes assumptions about the way users will run it based on the technology that was available at the time. As such, it is much more easily implemented within a monolith running on a single real machine, rather than the distributed system deployed on Kubernetes that we have at Clear Street.

Traditional Setup

The traditional FIX setup is relatively simple. A client will connect via a fixed IP and port to a single application that runs on a real machine. The FIX engine maintains a live socket connection over which the client and server stream requests and responses respectively. Multiple clients can connect to a single instance and the traffic is differentiated based on a tuple of (SenderID, SessionQualifier) that is sent along with the trade details.
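To make that keying concrete, here is a minimal Go sketch; the type and field names are ours for illustration and not taken from any particular FIX engine:

package fix

// SessionKey identifies a FIX session on a shared engine instance.
type SessionKey struct {
    SenderID         string // SenderCompID supplied by the client
    SessionQualifier int    // distinguishes multiple sessions from the same sender
}

// dispatch hands a raw message to the handler registered for its session,
// returning false if no session with that key is connected.
func dispatch(handlers map[SessionKey]func(raw []byte), key SessionKey, raw []byte) bool {
    handler, ok := handlers[key]
    if ok {
        handler(raw)
    }
    return ok
}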

Overview

There were a few major considerations that factored into our design:

High throughput: Our FIX servers interface directly with Order Management Systems (OMSs), which manage the trades of thousands of clients. While we may only have a connection to a dozen different OMSs, they can each be submitting millions of trades a day.

High connection uptime: Our users need to be able to maintain their connection for the entirety of the trading day. Downstream failures of the trade processing should not impact the connection.

Reliability: Clients must receive all responses to their requests.

To ensure that these requirements were met, we came up with a design that includes three major components:

Routing layer: This uses nginx to route each client based on their incoming IP.

Session management layer: This uses Kubernetes StatefulSets so that each pod has a unique identity, which lets nginx route each client to the same pod deterministically.

Trade parsing layer: This is entirely separate from the session management layer and communicates back and forth via dynamically generated Kafka topics.

[Figure: Overview of the system]

Routing Layer

Before clients can initiate a FIX session, their connection must be routed to a session manager. This layer uses information from the incoming connection to deterministically route each client to the same session manager on each connection.

An initial proposed solution excluded this nginx routing layer entirely. Clients could connect to any instance of the session manager, submit their trades, and receive responses from whatever instance they happened to connect to. A problem emerged when we considered the following situation: a client connects to session-manager-0, sends in a request, disconnects, and then reconnects but is routed to session-manager-1. The response to the original request comes back through session-manager-0, where the client is no longer connected, and is dropped completely. Although there are other ways of resolving this response issue, deterministic routing offers two other key advantages.

First, this allows us to segment off each client from one another. That way a problem with handling one client’s inputs will not impact any other. If one session manager crashes, it impacts no one else and trades can continue flowing.

Second, this allows us to cope with the volume in a more fine-grained way. The normal routing strategies, such as round robin, least conns, etc., can’t take into account the differences in load that our varied clients may present. Some OMSs may send in millions of trades per day while others send in only hundreds of thousands. We can keep the largest clients isolated by giving them one or two instances to themselves, while other clients could potentially be lumped together (though that loses the isolation benefits).

nginx provides powerful tools for handling this sort of situation. The FIX protocol uses TCP, so nginx must be built with the --with-stream module. From there, out-of-the-box nginx can accomplish the task; specifically, nginx's ability to route based on the incoming IP is key here.

Ordinarily in Kubernetes, pods have no unique identity. If there are two instances of the service session-manager, you can reach it via the address session-manager.prod.svc.cluster.local, but you would not know which of the instances a given connection would hit. By redeploying session-manager as a StatefulSet (governed by a headless Service), each pod gets its own unique identity and stable DNS name. The instance session-manager-1 can now be targeted specifically with the URL session-manager-1.prod.svc.cluster.local.

Let’s say we have two clients with IPs 10.0.0.0 and 10.0.0.1 and two instances of our session manager: session-manager-0 and session-manager-1. We can ensure IP 10.0.0.0 always connects to session-manager-0 and 10.0.0.1 always connects to session-manager-1 using the following nginx configuration:

stream {
    upstream session0 {
        # backend port on the session-manager pods is illustrative
        server session-manager-0.prod.svc.cluster.local:8008;
    }
    upstream session1 {
        server session-manager-1.prod.svc.cluster.local:8008;
    }

    map $proxy_protocol_addr $backend_svr {
        10.0.0.0 session0;
        10.0.0.1 session1;
    }

    server {
        listen 8008 proxy_protocol;
        proxy_pass $backend_svr;
        proxy_protocol on;
    }
}

(Note: proxy_protocol must be used because there is a load balancer in front of our nginx instances.)

Thus, we have our consistent routing. This piece scales easily: we can always add more nginx instances if the load grows too large, and clients will still be routed immediately to “their” server through any of them.

[Figure: nginx workflow]

Session Management Layer

In order to maximize session uptime, we decided to minimize downstream dependencies. We want to maintain a connection with our clients and continue heartbeating even if the rest of the system is down. To this end, it made sense to separate the session manager entirely from any part of parsing and submitting the trades.

The session manager has no responsibilities beyond receiving requests, forwarding them further into the system, and returning responses (more on that later). It simply takes the raw message and emits it as an event on Kafka, with no parsing or verification done.
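As a sketch of that pass-through, assuming the segmentio/kafka-go client (the function names here are hypothetical):

package sessionmanager

import (
    "context"

    "github.com/segmentio/kafka-go"
)

// newInputWriter creates a producer for this session's input topic,
// e.g. "{SenderID}_{SessionQualifier}.input".
func newInputWriter(brokers []string, topic string) *kafka.Writer {
    return &kafka.Writer{
        Addr:  kafka.TCP(brokers...),
        Topic: topic,
    }
}

// publishRaw forwards an unmodified FIX message onto the session's input
// topic. No parsing or validation happens in the session manager.
func publishRaw(ctx context.Context, w *kafka.Writer, raw []byte) error {
    return w.WriteMessages(ctx, kafka.Message{Value: raw})
}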

The unique identity provided by the StatefulSet also benefits the session management beyond just the routing. A FIX server has a configuration for what messages it is ready to accept based on the sender ID, their target ID, and the “session qualifier,” which is just an integer. To ensure that each session manager uses the proper config to match up with its given client, its unique pod identity signals which config it should be reading from. So, session-manager-0 would read from fix-0.cfg while session-manager-1 would read from fix-1.cfg.
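A minimal Go sketch of that lookup, assuming the pod's hostname matches its StatefulSet name (the function name is ours for illustration):

package sessionmanager

import (
    "fmt"
    "os"
    "strings"
)

// configPath derives the FIX config file from the pod's StatefulSet
// identity: the pod session-manager-1 has hostname "session-manager-1",
// so it reads fix-1.cfg.
func configPath() (string, error) {
    host, err := os.Hostname()
    if err != nil {
        return "", err
    }
    ordinal := host[strings.LastIndex(host, "-")+1:]
    return fmt.Sprintf("fix-%s.cfg", ordinal), nil
}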

This configuration is used both to establish the connection and to generate the Kafka topic that this session will emit events onto. Using the sender ID and the session qualifier, a Kafka topic is generated of the form {SenderID}_{SessionQualifier}.input.
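In Go, that naming is just string formatting; the helper names below are ours for illustration, along with the matching output topic the trade parser writes back to (described in the next section):

package sessionmanager

import "fmt"

// inputTopic is where this session's raw requests are emitted,
// e.g. sender "client" with qualifier 0 -> "client_0.input".
func inputTopic(senderID string, sessionQualifier int) string {
    return fmt.Sprintf("%s_%d.input", senderID, sessionQualifier)
}

// outputTopic is the matching topic the session manager reads responses
// from, e.g. "client_0.output".
func outputTopic(senderID string, sessionQualifier int) string {
    return fmt.Sprintf("%s_%d.output", senderID, sessionQualifier)
}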

[Figure: Kafka request and response routing]

Trade Parsing Layer

All that remains is to parse the trade and return a response. The trade parsing engine reads from all of the input topics, parses the trades into our internal format, submits them, and returns the responses. In order to communicate the responses back to the session manager, the parser simply takes the topic the trade came from and modifies its name to form an output topic that the session manager knows to read from. For instance, if the input topic is client_0.input, the output topic would be client_0.output.

This strategy guarantees that the client will receive their response at least once. The offset is not committed until the response has been sent out. Even if the program were to crash after sending but before committing the offset, the program would simply send the response twice, which is fine. At this point, the session manager simply needs to read off its own output topic, and convey those responses over the wire. This isolates the session from all other parts of the system.
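Putting the pieces together, here is a hedged sketch of the parser's consume-process-commit loop, again assuming the segmentio/kafka-go client; the handle callback (parse, submit, build the response) and the consumer group name are hypothetical:

package parser

import (
    "context"
    "strings"

    "github.com/segmentio/kafka-go"
)

// run consumes raw FIX messages from one input topic, lets handle parse and
// submit the trade, produces the response onto the matching output topic,
// and only then commits the offset. A crash between producing and committing
// means the response is re-sent on restart: at-least-once delivery.
func run(ctx context.Context, brokers []string, inputTopic string,
    handle func(raw []byte) (response []byte, err error)) error {

    outputTopic := strings.TrimSuffix(inputTopic, ".input") + ".output"

    reader := kafka.NewReader(kafka.ReaderConfig{
        Brokers: brokers,
        GroupID: "trade-parser", // hypothetical consumer group
        Topic:   inputTopic,
    })
    defer reader.Close()

    writer := &kafka.Writer{Addr: kafka.TCP(brokers...), Topic: outputTopic}
    defer writer.Close()

    for {
        msg, err := reader.FetchMessage(ctx) // fetch without committing
        if err != nil {
            return err
        }
        resp, err := handle(msg.Value)
        if err != nil {
            return err
        }
        if err := writer.WriteMessages(ctx, kafka.Message{Value: resp}); err != nil {
            return err
        }
        // Commit only after the response is safely on the output topic.
        if err := reader.CommitMessages(ctx, msg); err != nil {
            return err
        }
    }
}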

Conclusion

By utilizing a few different features in well understood technologies, we were able to adapt the FIX protocol for a more modern age. Each individual component is now scalable, allowing us far more flexibility than past systems have had, while maintaining the reliability that customers have come to expect when communicating over FIX.

Future Work

There are still areas that need improvement in the current design.

  • There is a potential fault in the system if the parser crashes between sending the trade to the trade capture system and putting the response back onto the Kafka stream. An improvement here would be to have both the sending of the trade and the sending of the response use the transaction outbox pattern, ensuring that the entire operation either succeeds or fails as a unit.
  • The session-manager layer is not scalable: currently there must be a single manager for each client. Since it acts merely as a pass-through, this should be acceptable, but it is still the one aspect of the system that cannot grow to accommodate increased usage.
