How to Make Your Finagle Services Resilient

Finagle and the move to microservices

Serving 60 million members (and counting) comes with some interesting scaling challenges. At Credit Karma, we are constantly facing and addressing those challenges to provide our ever-growing member base with a great user experience. As part of this, we are currently on a path to evolve our monolithic architecture into a microservices-based one.

My team, Core Systems Framework, has taken a deep dive into the Finagle RPC ecosystem to better understand its features and what it can offer us. In this post, I’ll share what we learned about how to use and configure Finagle clients to create fault-tolerant, performant, and resilient microservices.

What is Finagle?

Finagle is a powerful RPC system, based on Java’s async NIO APIs, that allows construction of highly performant concurrent services. Finagle is designed on the principle of “Your Server as a Function.” Servers and clients in Finagle are both built on top of Netty, an asynchronous NIO client/server framework. They share the same thread pool, with a default size of twice the number of processors on the machine.

Servers and clients in Finagle also share a configuration API that is simple to use, yet powerful. Unfortunately, the API’s simplicity can sometimes mislead developers into overlooking the configuration of some parameters that are crucial to achieve a highly performant fault-tolerant system.
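
To give a sense of that shared API, here is a minimal sketch of building a bare Thrift client; the label and destination address are placeholders, and the “with” methods discussed throughout this post are chained onto Thrift.client in the same way before the client is built:

val client = Thrift.client
  .withLabel("example-client")  // label used for metrics and logging
  .newService("localhost:8080") // placeholder destination address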

So let’s talk about some of those key parameters and how to configure them, specifically on the client side, which requires a more extensive configuration.

Before you push your microservice into production

So you just finished implementing your Finagle microservice after spending a decent amount of time carefully working on its architecture and design. You are ready to integrate the microservice with your ecosystem and push it to production, so you create a simple client to call your server, expecting Finagle to handle all your non-functional requirements (e.g. SLAs, fault tolerance, timeout handling) under the hood. Right? Not exactly. By design, Finagle won’t set sane defaults for your server/client configurations. The philosophy behind this is that Finagle does not (and should not) have any awareness of the application or microservice being implemented on it. Therefore, there are no logical defaults that can be defined.

The good news is we found that configuring clients to have these functionalities is a relatively easy process. In Finagle, the clients are much more complex than servers and, accordingly, most of the configurations are defined on the client side.

A quick note on making the most of Finagle

Before diving into the sections below, I highly recommend you go through the Finagle Client Modules architecture to gain an understanding of how client stacks are structured, if you haven’t already. Most Finagle client configurations come in two flavors that can be roughly classified as simple and advanced. The motivation behind this is that minimal knowledge is needed to configure clients with basic, sane settings, while a deeper understanding is needed for the more advanced ones.

Note: This is not intended to cover all of the configurations that can be applied to Finagle clients as there are simply too many of them. Instead, I’ll focus on some of the most essential ones to guarantee a stable configuration. I encourage you to experiment with these configurations to reach the optimal settings for your microservice.

With that said, let’s talk about six useful configurations. The examples below use a client built on the Thrift protocol, but the same configurations can be applied to any other supported protocol (e.g. Http, ThriftMux), as the short sketch below illustrates.
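
For example, here is a minimal sketch of the same “with”-style configuration applied to an HTTP client (the values are illustrative placeholders, not recommendations):

Http.client
  .withRequestTimeout(200.milliseconds)
  .withSessionQualifier.noFailFast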

Retry Behavior

Finagle handles retries by using two filters: “RequeueFilter” and “RetryFilter.” Failures that are known to be safe to retry (for example, exceptions that occurred before the bytes were written to the wire) will be automatically retried by Finagle using the “RequeueFilter.” A timeout, for example, will not be retried automatically since Finagle would not know the state of the request and would not be able to determine if it is safe to retry (this behavior can be overridden, as will be shown below). See RetryableWriteException for more details on which failures are considered retryable by the “RequeueFilter”. If a request is considered retriable by both filters, it will first be retried by the “RequeueFilter” since its order comes before the “RetryFilter” in the client stack.

By default, a “RequeueFilter” is configured in each client stack and its default behavior is to have no delays, which is not recommended. Backoff policies should be used to configure the behavior of the filter, and Finagle provides several built-in policies for that purpose. A Decorrelated Jitter policy is generally the one I’d recommend, but this could vary based on your service requirements. See here for a great explanation of how jittered backoffs work. In addition to backoff policies, retry budgets, which essentially manage the number of retries, should be configured when using a “RequeueFilter”.

This is how you can configure a “RequeueFilter” with a Decorrelated Jitter policy and a “RetryBudget”:

Thrift.client
  // Space requeues out with decorrelated jitter between 2 and 32 seconds
  .withRetryBackoff(Backoff.decorrelatedJittered(2.seconds, 32.seconds))
  // Budget: deposits live for 5 seconds, a minimum of 5 retries/sec is reserved,
  // and roughly 10% of requests may be retried on top of that reserve
  .withRetryBudget(RetryBudget(ttl = 5.seconds, minRetriesPerSec = 5, percentCanRetry = 0.1))

In addition to “RequeueFilter”s, “RetryFilter”s can be configured to retry specific requests based on predefined rules. Since both filters share some common functionalities, as a best practice, I’d suggest having “RequeueFilter”s handle Finagle-specific exceptions and “RetryFilter”s handle domain-specific and custom exceptions.

A “RetryFilter” is configured using a “RetryPolicy”, which is created by providing the number of attempts to retry and a partial function that determines if the request should be retried. A “RetryFilter” also requires a “RetryBudget”. The example below shows how to create a “RetryFilter” that is configured to retry timeout exceptions:

// Retry up to 3 times when the response is a timeout exception
private def retryPolicy[Req, Rep]: RetryPolicy[(Req, Try[Rep])] = RetryPolicy.tries(3, {
  case (_, Throw(Failure(Some(_: TimeoutException)))) | (_, Throw(_: TimeoutException)) =>
    logger.debug("Request timed out. Retrying")
    true
  case _ => false
})

val retryBudget = RetryBudget(ttl = 5.seconds, minRetriesPerSec = 5, percentCanRetry = 0.1)

Thrift.client
  .filtered(new RetryFilter(
    retryPolicy,
    HighResTimer.Default,
    CustomStatsReceiver(),
    retryBudget))

You can use the same “RetryBudget” for both the “RequeueFilter” and the “RetryFilter” in order to prevent retry storms, as sketched below.
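
Here is a minimal sketch of sharing a single budget, reusing the retryPolicy and CustomStatsReceiver from the example above:

val sharedBudget = RetryBudget(ttl = 5.seconds, minRetriesPerSec = 5, percentCanRetry = 0.1)

Thrift.client
  // The requeue module and the RetryFilter now draw from the same budget,
  // keeping their combined retries under a single cap
  .withRetryBudget(sharedBudget)
  .filtered(new RetryFilter(retryPolicy, HighResTimer.Default, CustomStatsReceiver(), sharedBudget))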

Timeouts

Timeouts define reasonable bounds for a service request. Finagle contains two modules that are responsible for handling timeouts: the Session Timeout module and the Request Timeout module. A session timeout refers to the time allowed for a request to obtain a session, while a request timeout refers to the time allowed for a request to wait for a response once a connection is established. By default, there are no upper bounds defined for either module, so it’s important to define reasonable values here.

The code below demonstrates how to configure session timeouts by defining a maximum lifetime and a maximum idle time (during which no requests are sent from the client) for the session. By default, a client session never expires. A connection acquisition timeout can also be set (unbounded by default). Once any of these thresholds is met, the session is expired.

Thrift.client
  .withSession.maxLifeTime(20.seconds)
  .withSession.maxIdleTime(10.seconds)
  .withSession.acquisitionTimeout(10.seconds)

The most basic way to define a request timeout on the client side is to use the “withRequestTimeout” method on the client API. The code snippet below, for example, defines a timeout for a request.

Thrift.client.withRequestTimeout(200.milliseconds)

Finagle also allows for a more advanced request timeout configuration through the use of a “TimeoutFilter.” This should be used when configurations cannot be provided through any of the “with” API methods. For example, the following code creates and adds a “TimeoutFilter” to the Client stack.

Thrift.client.filtered(new TimeoutFilter(
  200.milliseconds,
  new IndividualRequestTimeoutException(200.milliseconds), 
  HighResTimer.Default,
  CustomStatsReceiver()))

Load Balancing

Configuring load balancing is one of the most important tasks in the journey to making Finagle microservices resilient. Vladimir Kostyukov gives an excellent explanation of how the Load Balancing module evolved at Finagle and the different algorithms available in his Finagle 101 post.

Finagle comes equipped with five different stock algorithms for Load Balancing:

  1. Round Robin
  2. Heap + Least Loaded
  3. Power of Two Choices (P2C) + Least Loaded
  4. Power of Two Choices (P2C) + Peak Exponentially Weighted Moving Average (EWMA)
  5. Aperture + Least Loaded

Unlike most modules, the Load Balancing module ships with a sensible default: Finagle clients use the P2C + Least Loaded algorithm out of the box. However, some experiments have shown that using P2C + EWMA can provide better performance (note that the Aperture + Least Loaded algorithm was not part of those experiments). The example below illustrates how to override the default Load Balancing algorithm. P2C + EWMA is used in the example, but the same method can be used to specify any of the algorithms above.

Thrift.client
  // decayTime: the window over which latency observations decay;
  // maxEffort: how many times the balancer retries a pick when the chosen nodes are unavailable
  .withLoadBalancer(Balancers.p2cPeakEwma(maxEffort = 100, decayTime = 100.seconds))

Circuit breakers

Failure Accrual is one of the three types of circuit breakers provided by Finagle, along with Fail Fast and Threshold Failure Detection. This section focuses on Failure Accrual since it is, arguably, the most powerful of the three mechanisms. I’ll touch briefly on Fail Fast and Threshold Failure Detection in the next sections.

Failure Accrual

Similar to Retry Filters, Failure Accrual is configured using a policy, which can be either failure-biased or success-biased. The Failure Accrual module is enabled by default for all clients, with a policy based on five consecutive failures and an equal jittered backoff policy. The examples below show a failure-biased and a success-biased configuration, respectively.

// Failure-biased: mark a node dead after 10 consecutive failures
Thrift.client
  .configured(FailureAccrualFactory.Param(() =>
    FailureAccrualPolicy.consecutiveFailures(
      numFailures = 10,
      markDeadFor = Backoff.decorrelatedJittered(5.seconds, 20.seconds))))

// Success-biased: mark a node dead when its success rate drops below 99%
Thrift.client
  .configured(FailureAccrualFactory.Param(() =>
    FailureAccrualPolicy.successRate(
      requiredSuccessRate = 0.99,
      markDeadFor = Backoff.decorrelatedJittered(5.seconds, 20.seconds))))

Fail Fast

The Fail Fast module is enabled by default for all clients except Memcached. Its function is to reduce the number of requests dispatched to endpoints that are likely to fail. When a connection fails, the host is marked down and a background process is launched to attempt reestablishing the connection (using a backoff policy). Until the connection is reestablished, the load balancer will avoid using this host.

In the case that only one host exists, it’s important to disable the module since no other paths will exist. This is how you can do that:

Thrift.client
  .withSessionQualifier.noFailFast

Threshold Failure Detection

The Threshold Failure Detector module is a ping-based failure detector. It essentially works by periodically sending pings to nodes and recording their round-trip times. If a node does not respond within the predefined time, it is marked as busy and possibly closed based on the configuration parameters.

Note: At the time of writing, Threshold Failure Detection is only implemented for the Mux protocol so I will not cover it in detail here.

Response Classification

Using a response classifier can be very useful as it allows Finagle to gain some understanding of your domain in order to properly classify failures, which leads to better failure accrual handling and better metrics collection.

By default, Finagle will classify any “Return” as a success and any “Throw” as a failure. The code below changes this default behavior; it instructs Finagle to classify any deserialized Thrift exception as a failure.

Thrift.client
  .withResponseClassifier(ThriftResponseClassifier.ThriftExceptionsAsFailures)

Response classification can be tuned more finely than simply classifying all Thrift exceptions as failures by creating a custom “ResponseClassifier”. In the example below, the “domainExceptionClassifier” classifies “CustomServiceException”s as non-retryable failures.

val domainExceptionClassifier: ResponseClassifier = {
  case ReqRep(_, Throw(_: CustomServiceException)) => 
    ResponseClass.NonRetryableFailure
}
Thrift.client
  .withResponseClassifier(domainExceptionClassifier)

Note: Right now, Finagle does not distinguish between “RetryableFailure”s and “NonRetryableFailure”s. They are added as groundwork for future enhancements. You can see more details on Finagle response classification here.

Connection Pooling

The Finagle architecture puts the responsibility of connection pool management on the client side. A balance needs to be maintained between the cost of keeping persistent connections open to be reused across requests (including the number of those connections) and the cost of acquiring new connections. This balance can be configured through the Pooling module.

There are three types of connection pools provided by Finagle: Buffering Pool, Watermark Pool, and Caching Pool. These pools are configured to be stacked on top of each other. By default, the client stack is configured to have an overlay of caching and watermark pools that have a low watermark of 0, an unbounded high watermark, and an unbounded TTL. The Buffering Pool is disabled by default.

As with most Finagle client configurations, there are generally two methods to override the default behavior of connection pools. Here is a simple method:

Thrift.client
  //Minimum number of persistent connections to keep open
  .withSessionPool.minSize(5)
  //Maximum number of temporary and persistent connections to keep open
  .withSessionPool.maxSize(10)
  //Max number of connection requests to queue when exceeding high watermark 
  .withSessionPool.maxWaiters(20)

A more fine-grained configuration can be made by creating a “DefaultPool” and adding it to the client stack. In addition to overriding the parameters above, configuring pooling using the method below also allows for overriding the “bufferSize” and “idleTime” for the connections. Like this:

Thrift.client
  .configured(DefaultPool.Param(
     low = 5,                    // minimum number of persistent connections
     high = 20,                  // maximum number of connections to keep open
     bufferSize = 0,             // size of the buffering pool (0 disables it)
     idleTime = 60.milliseconds, // how long an idle connection may linger before being closed
     maxWaiters = 20))           // max connection requests queued above the high watermark

Note: Finagle’s “Mux” protocol does not require connection pooling since, by design, it maintains a single connection for each endpoint. The connection is shared by all Finagle clients.
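
For instance, here is a minimal sketch of a ThriftMux client created without any pooling configuration (the address is a placeholder):

// Mux multiplexes requests over a single connection per host,
// so the session pool settings above are not needed
ThriftMux.client
  .newService("localhost:8080")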

Wrap Up

Finagle is a powerful system that enables rapid development of resilient, performant, and fault-tolerant microservices. We have been successfully using it for this purpose at Credit Karma. However, it’s critical to spend time setting up and optimizing its configurations in order to realize its full potential.

Metrics are the best indicator of potential configuration problems. Finagle provides a very rich set of metrics that you can make use of. Fire up the admin module and analyze the numbers; it will go a long way towards finding the optimal configuration for your service. Let us know how this works for you @CreditKarmaEng. If you’d like to work on microservices with our Platform Engineering team, check out our open roles.

Note: Full client configuration samples that were discussed are available here.

About the Author

Mina Mishriky

Mina is an experienced engineer focused on building core system frameworks and microservices at Credit Karma.