Solving Flight Monitoring with CRDTs

Freebird empowers travelers to skip the line and instantly book a new ticket after a flight cancellation, significant delay, or missed connection — on any airline, for free, with only three taps on their phone. Because keeping travelers and customer experience agents well-informed is critical to providing a five-star hospitality experience, we constantly monitor the status of every flight that our travelers are scheduled to take.

Below, I'll describe how Freebird built a robust flight status monitoring system using CRDTs, a recent development in distributed systems technology.

The Problem

When we were designing our flight status monitoring system, our analysis suggested that integrating feeds from multiple flight status providers could result in a product with better reliability, coverage, and accuracy. The main tension of this approach is that different providers may provide conflicting information about the same flight, so we need to resolve these discrepancies in our own software. Imagine waking to a pager at 3 AM: "Flight AA 100 from JFK to LHR has encountered a status merge failure. Please log in to resolve." We needed a way to integrate conflicting information into a cohesive and informative whole.

We wanted our flight status feed to be:

  • Fast
    Fast:
    when a flight's status changes, travelers and customer experience agents should know about it immediately.
  • Resilient
    Resilient:
    even if some of our flight status providers go down, our system should continue to function well.
  • Reliable
    Reliable:
    it should be impossible to encounter a "merge conflict" that blocks further updates to a flight.
  • Comprehensible
    Comprehensible:
    we want to be able to reason about how a flight's status ended up in a particular state.

A few technical knots:

  • Flight status updates frequently arrive slightly out of order
  • If one provider goes down for a while, updates may arrive severely out of order
  • Different providers provide updates at different frequencies

The Solution

After reflecting on these business needs, we determined that our use case would be addressed well by a relatively new distributed computing technique: convergent replicated data types (CRDTs). Below we'll introduce CRDTs and their mathematical properties, and then show how CRDTs allowed us to build a first-class flight monitoring system. Although this blog post applies CRDTs to flight status monitoring, hopefully it will inspire you to think about how you could use CRDTs to build your own applications.

CRDTs

A convergent replicated data type is a data structure that can be replicated in multiple computers across a network. Separate copies of a CRDT in different parts of the network can be updated independently then merged back together.

Mathematically, a convergent data type consists of a set and a merge function that merges two elements of that set to produce another element of the set. The merge function needs to have three properties.

Idempotent: merge(A, A) = A. Min and max are good examples of idempotent functions on numbers. As a counterexample, addition is not idempotent because A + A does not usually equal A.

Commutative: merge(A, B) = merge(B, A). Addition and multiplication are commutative, but division and subtraction are not.

Associative: merge(merge(A, B), C) = merge(A, merge(B, C)). Addition and multiplication are both commutative and associative. However, taking the mean of two numbers is an operation that is commutative but not associative:

mean(mean(4, 8), 12) = 9
mean(4, mean(8, 12)) = 7

Idempotence Commutativity Associativity

Since their definition in 2011, CRDTs have been used to power real-time chat, gaming, and document editing systems. Below, I'll explain how Freebird used them to monitor flight statuses.

Modeling cancellations

When a traveler's flight is canceled, we want to alert our customer experience team so that they can immediately jump into action. Under the hood a cancellation is more complex than it sounds, because our flight status providers will find out about the cancellation at different times. Although we could pass one cancellation boolean for each provider to downstream consumers, that's just kicking the can down the road. We want to meaningfully merge our cancellation status updates together, rather than making customer experience agents or travelers do that work in their heads.

Non-CRDT Solution

First, I want to convince you that a CRDT is actually useful. Here is a straw man non-CRDT solution:

def merge_canceled_b(a, b):
  return b

Let's assume that we always call this merge function such that a is the data that we previously had in our database and b is new data coming in from one of our flight status providers. The function above is saying that we'll always use an incoming update from a flight status provider rather than storing state in a database.

Consider the behavior of this function on a straightforward canceled flight:

Incoming status Merged status
{ "time": 1, "isCanceled": false } { "time": 1, "isCanceled": false }
{ "time": 2, "isCanceled": true } { "time": 2, "isCanceled": true }

The behavior above is what we want. However, if the updates arrive out of order, it does poorly:

Incoming status Merged status
{ "time": 2, "isCanceled": true } { "time": 2, "isCanceled": true }
{ "time": 1, "isCanceled": false } { "time": 1, "isCanceled": false }

Since this function doesn't look at the timestamp, it will take the older update instead of the newer one, forgetting that the flight has been canceled. Not good!

One way you could describe this problem would be to say that the merge function is not commutative: merge(A, B) != merge(B, A). Since we frequently receive out-of-order updates, a noncommutative merge function would produce merged flight status updates that felt arbitrary.

CRDT: Last writer wins

A simple way to resolve disparate values in a CRDT is "last writer wins." This is where we take the value with the newest timestamp.

def merge_canceled_last_writer_wins(a, b):
  if a['time'] > b['time']:
    return a
  else:
    return b

This will do a good job on the out-of-order sequence of updates that we mentioned above, because it is smart enough to take the newer update:

Incoming status Merged status
{ "time": 2, "isCanceled": true } { "time": 2, "isCanceled": true }
{ "time": 1, "isCanceled": false } { "time": 2, "isCanceled": true }

Let's consider a different sequence of updates, in which we receive an update that a flight is canceled and receive a newer one that says it is not canceled. This could be due to a few different reasons:

  • Most commonly, one provider hasn't yet found out about the cancellation
  • Sometimes, a provider has incorrect data
  • Rarely, the flight was actually reinstated (i.e. "un-canceled")

To be on the safe side, any reported cancellation should immediately alert our customer experience agents that the flight may have been canceled. Here's the behavior of our “last writer wins” algorithm:

Incoming status Merged status
{ "time": 1, "isCanceled": true } { "time": 1, "isCanceled": true }
{ "time": 2, "isCanceled": false } { "time": 2, "isCanceled": false }

We've lost track of the fact that this flight was reported to be canceled, which could lead to a bad customer experience. Although this is a valid state-based CRDT, it doesn't solve the right problem.

CRDT: Boolean OR

A better way to solve our problem is to introduce a "hasBeenCanceled" field into the mix. Since our incoming updates from providers don't directly provide the concept of "has been canceled," we'll just set "hasBeenCanceled" to "isCanceled" for every incoming update.

Now we need a way to merge two hasBeenCanceled fields. This can be done with boolean OR. Here's our new merging pseudocode:

def merge_canceled_or(a, b):
  return {
    'time': max(a['time'], b['time']),
    'hasBeenCanceled': a['hasBeenCanceled'] or b['hasBeenCanceled']
  }

And the result:

Incoming status Merged status
{ "time": 2, "isCanceled": true } { "time": 2, "isCanceled": true }
{ "time": 1, "isCanceled": false } { "time": 2, "isCanceled": true }

You can verify for your that this example will also end up in the same state if these two updates arrive in order.

This is a nice example for reviewing the necessary conditions for a CRDT. Since there are only two possible values for each variable, you can prove these with a truth table. For me, associativity was not completely intuitive, but the truth table doesn't lie.

  • Idempotence: A or A == A
  • Commutativity: A or B == B or A
  • Associativity: (A or B) or C == A or (B or C)

Stare at this truth table until you're convinced that OR is associative

A B C A or B B or C (A or B) or C A or (B or C)
false false false false false false false
false false true false true true true
false true false true true true true
false true true true true true true
true false false true false true true
true false true true true true true
true true false true true true true
true true true true true true true

So, even merging a boolean can be tricky! The strategy above will alert our Customer Experience team if a flight has been reported canceled by any provider at any time. Now let's model some more complicated stuff.

Modeling delays

If a flight is more than K hours delayed, we want to alert our customer experience team. Specifically, we do this when the estimated departure time is after the scheduled departure time plus K hours. For the purposes of this blog post, scheduled departure time can be thought of as a constant.

As before, last-writer-wins could cause us to lose track of valuable information. Instead, we'll store the maximum estimated departure time. The merge function looks like:

def merge_delayed(a, b):
  time = max(a['time'], b['time'])
  max_estimated_departure = max(
    a['maxEstimatedDeparture'],
    b['maxEstimatedDeparture']
  )
  return {
    'time': time,
    'maxEstimatedDeparture': max_estimated_departure
  }

As a result, we end up with a system that can reliably detect if a flight was disrupted, while being immune to the problems of flip-flopping and out-of-order updates.

We can verify that max follows all of the necessary CRDT rules:

  • Idempotence: max(a, a) == a
  • Commutativity: max(a, b) == max(b, a)
  • Associativity: max(max(a, b), c) == max(a, max(b, c))

Proving associativity is a bit more difficult than for boolean OR, but can still be proven with case analysis.

Composing CRDTs to merge multiple fields

Above, I independently presented strategies for merging cancellation and delay status. But we actually care about merging many fields at the same time. Consider the following two flight status updates:

{ "hasBeenCanceled": true, "maxEstimatedDeparture": "2018-01-01T12:00:00Z" }
{ "hasBeenCanceled": false, "maxEstimatedDeparture": "2018-01-01T18:18:18Z" }

Fortunately, we can achieve a desirable result by merging each of the fields using the strategies we discussed above. Here's the merged status:

{ "hasBeenCanceled": true, "maxEstimatedDeparture": "2018-01-01T18:18:18Z" }

Pleasantly, we can prove that this strategy satisfies the necessary conditions to be a CRDT: idempotence, commutativity, and associativity. The upshot of this result is that it gives us a nice way to compose CRDTs. This is good news for writing application logic, because we can now elegantly combine many different types of merge logic for different purposes.

Extending application logic

CRDTs are a great starting point for customizing our application logic. Here are just a few directions that we can take this:

  • Flight status providers can "vote" on value of a field such as the current state of a flight. We can assign different weights to different status providers based on trustworthiness.

  • Instead of tracking whether or not a flight is canceled, we can produce a probability that the flight is canceled by feeding flight status information into a machine learning model.

  • We can aggregate one piece of information in multiple different ways to meet our needs. For example, we can track both whether a flight has ever been canceled and whether a flight is currently canceled.

  • These CRDT operations can be often be elegantly represented with standard SQL. This gives us an efficient way to perform computations on both real-time data and to do post-hoc analysis of flight status updates. These computations can be efficiently parallelized over years of flight status updates using a tool like Presto.

Conclusion

At Freebird, we take flight status monitoring seriously because it can be the difference between snagging a last-minute flight home and sitting in the airport for hours. To make our product work, we need a system that can reason about the world given multiple potentially conflicting views of reality.

To solve this, we used CRDTs, a relatively new concept from distributed systems inspired by modern abstract algebra. CRDTs let us use our air travel domain knowledge to elegantly combine conflicting information from multiple status providers into a single cohesive view. Not only do CRDTs provide a straightforward framework for developing application logic, but simple CRDTs can be effortlessly composed into more powerful CRDTs. Beyond satisfying our basic requirements, CRDTs lend themselves well to extensions such as SQL analytics.

Over its lifetime, our flight status monitoring system has processed billions of flight status updates with excellent uptime and accuracy. CRDTs helped us meet our goal of building a system that is fast, resilient, reliable, and comprehensible. After reading this post, I hope that you will have similar success in applying CRDTs to build your own solutions.