Engineering · 11 min read
May 10, 2021

3 Key Data Tools We Chose for our Stack — and Why

Clear Street Engineering


The first version of the data platform for our trading software was about one thing only: getting things done. If you’re at all acquainted with building software in a startup environment, then you’re familiar with the reality that you need to “ship it or ship out.” Something is always better than nothing. The winning pieces of software are the ones built quickly that gracefully meet the requirements, give a solid nod to expectations, and ignore the features no one asked for.

Our Data Operations team sits at the center of the business, and our top priority has always been to serve the needs of our various stakeholders: operations staff, internal departments, clients, regulators, and other engineering groups. Most workflows start as manual processes, move to Jupyter notebooks, and then graduate into productionized Python applications backed by system libraries. We work as fast as we can to keep up with the growth in clients, volume, and new business features, and so far we have.

For engineers, it’s common to start fantasizing about the next major version of your system soon after (or even before, let’s be honest) the current version has launched. If you could start from scratch, what would you do differently? In all the other startups where I’ve worked, the bottleneck has consistently been on the business side, so it was a pleasant and galvanizing surprise here at Clear Street when our need to break ground on a second version of our data processing architecture became a reality so quickly.

For the next version, the need to ship results quickly remains paramount, so we will still be leaning on existing tools and frameworks rather than inventing our own. Below are the three main tools we are pulling into our stack to help us scale to the next rung:

Snowflake

Purpose: Data warehouse, for centralizing, aggregating, and analyzing data from various sources

Main alternative considered: AWS Redshift

Overview:

Snowflake ($SNOW) is a vendor that offers a Data-Warehouse-as-a-Service. The product is proprietary, fully managed, and somewhat of a black box, optimizing for usability over configurability. They provide a ready-to-go, columnar, analytical database solution with a straightforward ANSI-compliant SQL interface.

Pros:

  • Easy to query with SQL
  • Total horizontal scalability is provided as part of the service
  • Nice user portal with shared worksheets; easy for business stakeholders to interact with
  • Minimal configuration required; no need (or ability) to “tune” many performance parameters
  • Non-opinionated and supports a variety of data warehousing use cases
  • Technical reps who will provide support and assistance with your implementation

Cons:

  • No arbitrary-precision decimal type
  • No local mock version for building unit tests against
  • Proprietary and costs money; cost scales with usage and can get expensive

Summary:

Snowflake had been under discussion as an obvious data warehouse candidate since the very beginning. The offering makes it easy to get real data warehouse capabilities up and running quickly, even at an early stage. It also offers many add-on features and a deep ecosystem of vendors that can help you ramp up as quickly as you want, without worrying about the nitty-gritty aspects of data warehouse scaling. As with any as-a-service offering, the cost scales with you, but then again so does the revenue.

Within our stack, Snowflake allows us to pull all of our data into one central location that we can use to run our reporting and analyses, rather than requiring each individual reporting job to scrape its data from disparate sources. As a SQL-compliant data warehouse, it also lowers the barrier to our data: anyone who can write SQL can pull what they need without an engineer’s intervention. Its worksheets interface is particularly useful for iterating with stakeholders on queries and reports, which is a strong substitute for burdensome meetings to gather qualitative requirements.
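For a sense of what that looks like in practice, here is a minimal sketch of pulling an aggregate out of Snowflake from Python using the snowflake-connector-python package; the warehouse, database, and positions table names are hypothetical placeholders rather than our actual schema.

```python
# Minimal sketch: run an aggregate query against Snowflake from Python.
# The warehouse, database, and "positions" table are hypothetical.
import os

import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="REPORTING_WH",   # hypothetical virtual warehouse
    database="ANALYTICS",       # hypothetical database
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    # Aggregate a hypothetical positions table by account for a single date.
    cur.execute(
        """
        SELECT account_id, SUM(quantity) AS total_quantity
        FROM positions
        WHERE as_of_date = %s
        GROUP BY account_id
        """,
        ("2021-05-10",),
    )
    for account_id, total_quantity in cur:
        print(account_id, total_quantity)
finally:
    conn.close()
```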

Snowflake’s offering is priced based on compute usage (the CPU costs of the instances running the queries) and storage. It has efficient (and proprietary) ways of storing data at rest, so the bulk of the costs are generally on the compute side. For the next version of our data platform, having our costs scale up with us is a good problem to have.

Argo

Purpose: Task scheduling and orchestration framework for managing productionized batch processing workflows

Main alternative considered: Apache Airflow

Overview:

Argo is a CNCF project for running workflows directly in a native Kubernetes (k8s) environment. It allows you to express your batch processing workflows as DAGs, where the nodes are containerized jobs and the edges are dependencies between the outputs/inputs of those jobs. It is a k8s-first (and k8s-only) solution: workflows are defined in YAML and represented as custom k8s resources in the cluster.

Pros:

  • Kubernetes-first solution, fits well into a k8s stack
  • Language-agnostic; nodes can be anything that runs in a container, and the DAG definition lives entirely in the YAML k8s config
  • Workflows can be triggered with cron jobs, webhooks, Kafka messages, SNS notifications, etc.
  • Freely available and open source

Cons:

  • Kubernetes-only solution: you need k8s in production, a fully set-up local k8s cluster for prototyping, etc.
  • Event triggering is early stage
  • YAML syntax can be brittle and require a meta templating framework on top of it

Summary:

Argo is a rapidly maturing alternative to Airflow and other open-source orchestration and scheduling frameworks. It is implemented in Golang and built for k8s, which aligns perfectly with our stack. If we had a different stack, however, it would be a much tougher sell, as Airflow is far more flexible in the stacks it can support.

The idea is that instead of having monolithic jobs that contain the logic for all their tasks from beginning to end, you can make modular tasks that do one specific thing. For instance, you can have a task that uploads a file from S3 to an SFTP server, or a task that pulls a query from a database into a CSV file. You can then compose these tasks into a dependency graph (DAG) of steps to capture the entire workflow. Once you have your workflows set up, you can configure triggers to set off the individual workflows, or compose the smaller workflows into larger, more holistic workflows.
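As a hypothetical sketch of what one of those modular tasks might look like in a Python stack like ours, here is a small script that dumps a query result to a CSV file. Built into a container image, it would become a single node in a workflow DAG, with Argo supplying the arguments and wiring its output into downstream steps via the YAML workflow definition; the connection details and arguments below are placeholders.

```python
# Hypothetical sketch of one modular task: run a query and write the result
# to a CSV file. Argo would run this script as a containerized step and pass
# the query and output path in as command-line arguments.
import argparse
import csv
import os

import snowflake.connector


def main() -> None:
    parser = argparse.ArgumentParser(description="Dump a query result to CSV")
    parser.add_argument("--query", required=True, help="SQL to execute")
    parser.add_argument("--output", required=True, help="Path of the CSV to write")
    args = parser.parse_args()

    # Credentials come from the environment (e.g. a k8s secret); placeholders here.
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
    )
    try:
        cur = conn.cursor()
        cur.execute(args.query)
        with open(args.output, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cur.description])  # header row
            writer.writerows(cur)  # one row per result tuple
    finally:
        conn.close()


if __name__ == "__main__":
    main()
```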

Within our stack, Argo would simplify how we represent our system and, as a bonus, cut down on both extraneous code in the codebase and resource use in production. Our initial push to get a working system resulted in a lot of code that we now have an opportunity to modularize and reuse. Having a first-class way of triggering, managing, and monitoring our batch-processing workflows is a greatly simplifying shift in the way we process data in production.

Dask

Purpose: Distributed computing framework for arbitrary distributed and out-of-core computing needs

Main alternative considered: Apache Spark

Overview:

Dask is a NumFOCUS project for native distributed computation within Python. It provides distributed libraries with interface compatibility for a number of popular scientific computing libraries in Python. Of most immediate interest to us are Dask DataFrame (Pandas) and Dask Distributed (arbitrary distributed computing), alongside other APIs like Dask ML (Scikit-learn/Xgboost) that we will likely need in the future.

Pros:

  • Python-first; can maintain Python stack and its advantages
  • Interface compatibility with common Python libraries (NumPy, pandas, scikit-learn)
  • Freely available and open source

Cons:

  • Python-only; lacks the rigidity of a systems language and may not match your stack
  • Not as mature and widely supported as Spark or older alternatives
  • May have to fork and tweak to get full required feature set

Summary:

Dask is very much up-and-coming when contrasted with Spark, the incumbent solution for large-scale data-processing problems. Instead of building on top of Hadoop’s MapReduce model, Dask comes at the problem from the other direction, starting from the interfaces of popular ad-hoc data manipulation tools like pandas and then parallelizing the computations behind the scenes.

Within our stack, a majority of our analytical computing needs can be expressed using SQL and handled directly by our data warehouse. But we are still on the hook for that minority of use cases that have complex data processing needs, which is where Dask comes in. With Dask, we can pull any amount of data from the data warehouse, then load it into a distributed dataframe for processing, or even run custom Python code to do the processing, all in a horizontally scalable way.
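As a rough sketch of that pattern (with a fabricated in-memory frame standing in for query results, and illustrative column names), the Dask side looks something like this:

```python
# Minimal sketch of the Dask pattern described above: assemble pandas
# partitions into a Dask DataFrame and run a pandas-style aggregation
# across workers. Column names are illustrative only.
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client

# Connect to a Dask cluster (here, a throwaway in-process one for illustration).
client = Client(processes=False)

# In practice the partitions would be query results pulled from the warehouse;
# here we fabricate a small pandas frame to keep the sketch self-contained.
trades = pd.DataFrame(
    {
        "account_id": ["A1", "A1", "B2", "B2"],
        "symbol": ["AAPL", "MSFT", "AAPL", "TSLA"],
        "notional": [1000.0, 2500.0, 750.0, 3200.0],
    }
)

# Split the frame into partitions that Dask can schedule across workers.
ddf = dd.from_pandas(trades, npartitions=2)

# Same API as pandas, but the computation is parallelized behind the scenes.
exposure = ddf.groupby("account_id")["notional"].sum().compute()
print(exposure)

client.close()
```

Because the DataFrame API mirrors pandas, code prototyped in a notebook against a small sample can later run against much larger data with little change.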

On top of this, Dask has an additional advantage over Spark: it fits our Python-based prototyping model and lets us maintain our tight cycle from prototype to production.

Bonus Feature — Python

Purpose: Core data application development language

Main alternative considered: Scala/Java

Underneath all of this was the question of whether we should pivot away from Python to another core language for our data stack. We decided not to: Python has become the lingua franca of data science and data analysis, and it can be extended as much as needed by combining it with other technologies.

Even though we use some Java-based technologies at our core, such as Kafka, we are not a JVM shop. Python has aggressively evolved into a viable language that can be built both on top of and from beneath, and a long-term bet on Python as a core language for expressing things simply and powerfully is one we’re comfortable making.
