ETL Pipeline Improvements
One of my primary responsibilities at my current job is ownership of the ETL pipeline that brings in the data upon which we run our business. Every day it processes hundreds of gigabytes of data, cleaning and normalizing it, and outputting it in several different forms.
For a few years I have been using a Hadoop-based mrjob pipeline on top of AWS EMR. It replaced a (I kid you not) hugely expensive Redshift-based pipeline that I didn't write. It's been very reliable: for the last year or two the only failures have been when AWS's systems have had issues, something I can't do anything about. Despite this reliability, it hasn't been perfect. The biggest problem with it is that it's expensive to run. There are a few reasons why it's been so expensive:
- EMR adds a 25% upcharge on all resources used. It is reasonable for AWS to charge something because EMR is a useful service with added value. However, in my opinion, 25% is too high for what it does. AWS already gets paid plenty because you're using their other services underneath it (mainly EC2 virtual computers and EBS block storage), so the extra mark-up feels like a big cash-grab. It is, of course, entirely possible to run Hadoop on AWS without using EMR. But it is a hassle, and it's likely that AWS has figured out that 25% is the inflection point between too expensive and not worth the hassle.
- Hadoop isn't the most efficient way of doing things. Newer tools, notably Spark, have surpassed it in both speed and features. I originally chose Hadoop because I wasn't happy with how much more memory Spark needed than Hadoop for a similar operation, but over time I became less and less satisfied with my inability to speed up certain parts of the process.
- AWS offers spot instances, virtual computers that are significantly cheaper than on-demand instances. The difference is that on-demand instances are yours as long as you want them, while spot instances can be taken away at any time with only two minutes' warning. One of Hadoop's killer features is that if one or more worker nodes goes away during a MapReduce cycle (for whatever reason, including spot reclamation), in most cases it can recover and redo any lost results. However, the EMR pipeline used a multi-step MapReduce process, and Hadoop cannot recover lost results from earlier, fully completed MapReduce cycles. This meant the pipeline had to have enough on-demand instances to hold the data that needed to be preserved across cycles, which raised the cost considerably compared to an all-spot run.
Earlier this year I decided it was time to start looking at how to rewrite the pipeline to dramatically lower costs. I saw two possible ways forward:
- Choose a modern, high-performance tool like Spark/PySpark or Dask that would still run on EMR but would hopefully be much faster.
- Abandon EMR entirely (and its 25% surcharge) and write something that could run completely on spot instances and use S3 for storage, which is four to five times cheaper than EBS.
After some thought, I decided that the second option was the better choice. If I could figure it out, it offered the best possible outcome. The pipeline runs once per day, meaning that it has 24 hours to finish before the next run needs to start. Ultimately, high performance was less important than lowering costs.
I have been using Polars quite a bit and have been (mostly) impressed with its speed and functionality. It is written in Rust, which is one of the fastest programming languages, and it has a Python-facing API as a first-class member of the project. Rust has an ever-growing library of packages that I've found are high-quality and well documented (in contrast to my experience with cough R packages). The crucial difference between Polars and Hadoop/Spark/Dask is that Polars runs on only one node at a time (though it can and does use all the CPU cores), while the others can run on multiple nodes. If I could figure out how to slice the pipeline into chunks that would each fit on a separate instance, I believed I could use Polars in place of Hadoop.
Jumping to the end of the story, I was able to convert the pipeline to use Polars, to great success. I use a simple pattern for each step. An orchestration process builds a list of work, which is submitted to SQS. An EC2 spot fleet is created, which launches workers that consume the work. The input and output of the work is stored on S3. The workers send success or failure messages to a callback queue on SQS, which is monitored by the orchestration process. If a spot node goes away, interrupting work, that work will be picked up by a different node once its message becomes visible again on the queue. Once the work for a step is done (i.e. every work item has generated a callback), the orchestration process kills the spot fleet and continues to the next step (or sends an error message for a human to figure out).
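The shape of that pattern can be sketched in a few functions. This is a simplified illustration, not the production code: the queue URLs, chunk size, and message format are all hypothetical, and error handling, fleet launching, and the actual Polars processing are omitted.

```python
import json


def build_work_items(input_keys, chunk_size):
    """Split the list of S3 input keys into chunks, one SQS message per chunk."""
    return [input_keys[i:i + chunk_size]
            for i in range(0, len(input_keys), chunk_size)]


def submit_work(queue_url, work_items):
    # Requires boto3 and AWS credentials to actually run.
    import boto3
    sqs = boto3.client("sqs")
    for item in work_items:
        sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(item))


def worker_loop(work_queue_url, callback_queue_url):
    # Runs on each spot instance; long-polls the work queue.
    import boto3
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(QueueUrl=work_queue_url,
                                   MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            keys = json.loads(msg["Body"])
            # ... process the chunk: read inputs from S3, write outputs to S3 ...
            # Deleting the message marks it done. If a spot node dies before
            # this point, the message's visibility timeout expires and another
            # worker receives the same message, so no work is lost.
            sqs.delete_message(QueueUrl=work_queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
            sqs.send_message(QueueUrl=callback_queue_url,
                             MessageBody=json.dumps({"status": "success",
                                                     "keys": keys}))
```

SQS's visibility timeout is what makes spot interruption safe here: a received-but-undeleted message automatically returns to the queue.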
The bottom line is that the cost has dropped by roughly 85%, primarily due to the following reasons:
- Polars has the concept of LazyFrames: data objects that are not realized in memory, nor computed, until Polars is told to do so. Operations and filtering can be applied to them, and Polars executes the work in parallel with optimizations that increase overall speed without loading the whole dataset into memory at once. The combination of sink_parquet with PartitionByKey is effectively a MapReduce operation that is much faster than Hadoop on similar hardware.
- AWS has "regions" and "availability zones" (AZs), which are the physical locations where cloud compute happens. Each AZ is a distinct data center (or a few nearby data centers) within a region. When running an EMR job, you are restricted to a single AZ, largely because AWS charges for cross-AZ data transfer and EMR jobs are very loquacious across the network; there is also increased network latency between AZs. Running an EMR job across multiple AZs would hugely impact performance and cost. Because the new pipeline reads from and saves data to S3, and there are no cross-AZ charges for accessing S3 within a region, it doesn't matter which AZ the workers run in. This means the spot fleets can target every AZ in the region, unlike EMR.
- When launching an EC2 fleet, you must specify one or more launch templates, which describe how to launch each instance in terms of OS and installed software. AWS EC2 offers instances using x86 processors from Intel and AMD, and ARM instances using AWS Graviton processors. Conveniently, the pipeline doesn't require any processor-specific features, so I created two AMIs, one each for x86 and ARM, which allows the spot fleet to target any and all of Intel, AMD, and Graviton instances.
- The pipeline's requirements for each step basically come down to the number of CPU cores and the amount of RAM, more or less of each depending on what the step is doing. The upshot of all of the above is that for a given step, all the pipeline cares about is the resources of the node, not what kind of node it is. Of course, not all instances are the same speed, but the cost of an instance is roughly proportional to its speed, so it all works out. This means that for a given step, across all AZs and EC2 instance types, there can be over 100 distinct resource combinations to pick from, which basically guarantees spot availability at all times.
- The pipeline uses a fair number of user-defined functions (UDFs). Polars supports UDFs written in Python, Numba, and Rust (via PyO3). By using the latter two, basically all of the inner loops and heavy computation in the pipeline happen in compiled C or Rust. This, in my opinion, is a really nice way of doing things: let Python handle moving data around and the high-level logic, and run all the heavy computation in compiled code.
Overall, I'm very pleased with the results of this work. The goal was to save money, and it has done that. I wasn't expecting 85% savings (I'm not sure what I was hoping for), but I feel quite good about that.
Claude Code
A few weeks ago I started playing with Claude Code, which is an AI tool that can help build software projects for/with you. Like ChatGPT, you interact with the AI conversationally, using whole sentences. You can tell it what programming language and which software packages to use, and what you want the new program to do.
I started by asking it to build a mortgage calculator using Python and Dash. Python is one of the most popular programming languages and Dash is an open source Python package that combines Flask, a tool to build websites using Python, and Plotly, a tool that builds high-quality interactive web plots. The killer feature of Dash is that it handles all the nasty and tedious parts of an interactive webpage (meaning Javascript, eeeek!). It deals with webpage button clicks and form submissions for you, and you, the coder, can write things in lovely Python.
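To show the division of labor Dash gives you, here is a minimal sketch of a mortgage-payment app. This is my own toy illustration, not Claude's output: the element ids are hypothetical, and the payment formula is the standard fixed-rate annuity formula.

```python
# Core calculation: standard fixed-rate mortgage annuity formula.
def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    r = annual_rate / 12          # monthly interest rate
    n = years * 12                # number of payments
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)


def build_app():
    # Dash wiring (requires `pip install dash`; assumes the Dash 2.x API).
    from dash import Dash, Input, Output, dcc, html

    app = Dash(__name__)
    app.layout = html.Div([
        dcc.Input(id="principal", type="number", value=300_000),
        dcc.Input(id="rate", type="number", value=6.0),   # percent per year
        dcc.Input(id="years", type="number", value=30),
        html.Div(id="payment"),
    ])

    # Dash generates all the JavaScript: whenever any input changes, this
    # plain Python function runs and its return value updates the page.
    @app.callback(Output("payment", "children"),
                  Input("principal", "value"),
                  Input("rate", "value"),
                  Input("years", "value"))
    def update(principal, rate, years):
        return f"${monthly_payment(principal, rate / 100, years):,.2f}/month"

    return app


if __name__ == "__main__":
    build_app().run(debug=True)
```

All the browser-side event handling is generated for you; the coder only writes the layout and the callback.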
With a fair amount of back and forth, Claude Code built this advanced mortgage calculator, which is only sorta kinda functional. It does a fair amount of what I asked it to do, but it also fails to do a fair amount of what I asked, and it has a decent number of bugs. The good things it did were handling some of the tedious boilerplate, like creating the necessary directory hierarchy and files, including a README.md detailing how to run the software. Creating a Dash webpage requires writing Python function(s) and a template detailing how to insert the output of those function(s) into a webpage, and Claude Code handled that with aplomb.
What it didn't handle well were the more complicated things, like writing an optimizer for funding/paying off a mortgage considering various funding sources and economic factors. It also didn't write the code using standard Python practices: the first time I looked at the code in VS Code with Ruff, Ruff reported over 1,000 style violations. If I found a bug in the code, some of the time I could tell Claude to fix it, and it would, but other times it would simply fail. Altogether there are about 6,000 lines of code, and as the project got bigger it was clear that Claude was struggling. Simply put, there's a limit to the size and complexity of a codebase that Claude can handle. Humans are not going to be replaced, yet.
At this point I'm not sure what I'll do with the financial calculator. I don't think Claude can help me any more, so I'd have to manually dive in to fix bugs and improve it, and I haven't decided if I will. In summary, my impression of Claude is that it's decent at creating the base of a project or application, but then anything truly creative and complicated is beyond what it can do.
Claude Code isn't free (they give you $5 to start), and I had to deposit money to use it. I still have a bit of money to spend, so let's try a few tasks and see how Claude does. I've uploaded all of the code generated below to this repository.
Simple Calculator
I asked it to create a simple web-based calculator, and at a cost of $0.17, here is what it created, which works well (that's not a picture below; try clicking on the buttons!).
Simple C Hello World!
Create a template for a program in C. Include a makefile and a main.c file with a main function that prints a simple "hello world!". $0.08 later, it performed this simple task flawlessly.
Excel Telescope Angular Resolution
Create an Excel file that can calculate the angular resolution of a reflector telescope as a function of A) the diameter of the primary mirror and B) the altitude of the telescope. It went off and thought for a bit, and spent another $0.17 of my money. It output a file with a ".xlsx" extension, but the file can't be opened. Looking at Claude's output, I suspect this is a file-formatting issue, because Claude is designed to produce text output rather than binary.
Text to Speech
Seeing that Claude struggled with creating binary output, I next asked it to create a Jupyter notebook (Jupyter notebooks have the extension .ipynb, but they're actually text files) that uses Python and any required packages to take a block of text and convert it to a sound file via text-to-speech. This succeeded ($0.12), and in particular it used gTTS (Google Text-to-Speech) to do the heavy lifting.
Rust Parallel Pi Calculator
Write a program in Rust that uses Monte Carlo techniques to calculate pi. Use multithreading. The input to the program should be the desired number of digits of pi, and the output is pi to that many digits. $0.11 later, I got a Rust program that crashes with an "attempt to multiply with overflow" error. Not great! I could interact with Claude and ask it to try to fix the error, but I haven't.
Baseball Scoresheet Web App
Create a web app for keeping score of a baseball game. Make the page resemble a baseball scoresheet. Use a SQLite database file to store the data. Make the page responsive such that each scoring action is saved immediately to the database. At a cost of $0.93, it output almost 2,000 lines of code, but the resulting npm web app doesn't work. Upon initial page load it asks the user to enter player names, numbers, and positions, but no matter what you do, you cannot get past that screen. Bummer!

Random Number Generator
It looks like the more complicated the ask, the worse Claude does. Here's one more thing I'll try, something I have no idea how to do myself. We'll see how well Claude does at it. Create a macOS program that generates a truly random floating-point number. It should not use a pseudo-random number generator. It should capture random input from the user as the source of randomness, giving the user the option of typing random keyboard keys, moving the mouse, or making noises captured by the microphone. Please use the Swift programming language. Create an Xcode-compatible development stack. Create a stylish GUI that looks like a first-class macOS program. $0.81 later, and after asking it to fix one bug, I got a second try that was also broken, with 10 bugs. Clearly, I've pushed Claude past its breaking point.
I'm guessing that all of the broken code can be fixed, maybe by Claude itself, but it might ultimately require human intervention in some cases. I'm optimistic that it could fix the Rust/pi bug, but not that it can fix the baseball or random number generator apps. AI code generation might be coming for us eventually, but not today.

