Credit Karma leverages data for over 60 million members to deliver a personalized user experience. When you visit our site or use our mobile apps, we crunch your credit report and other factors in real time to understand how we can help you improve your credit score, refinance your credit card, or otherwise make financial progress.
To do this, we rely largely on Scala and Akka to do the heavy lifting: Scala gives us type safety and the blazing performance of the JVM (to name a few strengths), and Akka abstracts away the complexity of multi threaded, distributed systems.
Powerful tools, however, demand some mastery on how to use them. As Zack Loebel-Begelman highlights in his blog post, Solving for High Throughput with Akka Streams, it’s important to identify the problem you need to solve when starting with Scala and Akka. It definitely will dictate the success of your project.
This is the story of what happens when you get that wrong.
Sometime back in 2015, we needed to move data from a third party analytics provider that helped us gauge how members interacted with our product features. Scala and Akka were gaining traction internally as engineers gravitated away from dynamic types and the lack of first class concurrency in PHP, so we began to build out an Akka Actor System that could help us parallelize the work of importing data into our pipeline.
Initial Actor System to move data from third party
Each component above represents an Actor or a Pool of Actors, with arrows between them denoting message communication between Actors. (See more info on Actor Systems.)
Just getting started with Scala and Akka, we didn’t immediately see the problem with this approach.
1) Coordinator actor would request work from our Data Provider, and...
2) Send that to a group of Transformer actors who would transform the data into the format we required, and...
3) Then finally send that data to a group of Spray/HTTP actors that would leverage a thread pool and Futures to move data into our Ingest server.
We shipped it, and it looked like this:
Red line is heap size, blue line is used heap space. Purple line is max heap space.
Okay, that’s a lot of memory usage! Not ideal, but we could live with it, albeit as a bad citizen to other applications running on this machine. But a few days later, we started noticing other errors surface.
<span style="color: red;">Futures time out after [ 30 seconds ]</span> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) ~[org.scala-lang.scala-library-2.11.7.jar:na] at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) ~[org.scala-lang.scala-library-2.11.7.jar:na] at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:19) ~[org.scala-lang.scala-library-2.11.7.jar:na]
We realized we were missing something fundamental with our Actor System. After some research, we stumbled into a very important concept that had been ignored: back pressure.
Back pressure, included as part of the Reactive Manifesto, is a feedback mechanism using async signals that allows a system to control the rate at which messages are received by downstream actors. This prevents slower components in your system from getting overwhelmed by faster components.
Thankfully, our friends at Lightbend provide a tool to solve this: Akka Streams. Akka Streams provides a streaming API that introduces the concept of demand between publishers and subscribers. Instead of pushing messages through the system, we move to asking for and pulling work down the stream. In this way, slower actors can request work as resources become free.
We were seeing Futures timing out in production for this very reason. We had fast actors quickly parsing JSON and pushing events to slow actors sending data out over HTTP and waiting on network IO. These slow actors used a Future to wait on the response from the HTTP server, but with a fixed thread pool. Work exceeded the available threads, and the Futures in queue timed out waiting for a thread to become available. As a result, work was lost and data we needed ultimately didn’t make it into our pipelines.
After rewriting the core logic to use Akka Streams, these errors went away. As threads became available, the slower actors moved to ask for work from the faster actors, instead of having it shoved in their face.
As a result of moving to a pull-based system, instead of front loading and pushing a lot of work to queue up in memory, we were able to reduce heap usage by 3x.
Red line is max heap space, blue line is used heap space.
In conclusion, think about back pressure before getting started with Akka. Understand which actors will be fast, which will be slow, and if you need an API like Akka Streams to help manage the throughput of work between them. If you'd like to learn more about why and how we used Akka Streams at Credit Karma, see the talk that my colleague Zack Loebel-Begelman and I gave at Data by the Bay: Akka Streams for Large Scale Data Processing. If you'd like to join our team, you can see our open roles.
Thanks to Zack, Vishnu Ram, and Craig Giles.
For more information on the motivation of Akka Streams, check out the official docs.