Amazon DynamoDB vs Apache Cassandra


Cassandra and DynamoDB both originate from the same paper: Dynamo: Amazon’s Highly Available Key-value Store. (By the way, it has been a very influential paper and set the foundations for several NoSQL databases.)

Of course this means that DynamoDB and Cassandra have a lot in common! (They share the same DNA.) However, both AWS DynamoDB and Apache Cassandra have evolved quite a lot since this paper was written back in 2007, and there are now some key differences to be aware of when choosing between the two.

This post aims at comparing these two distributed databases so that you can choose the one that best matches your requirements.

The concept

First of all, let’s talk about what they have in common, starting with their implementation concept: both of them map the partition key onto a token ring using consistent hashing to determine where to store the data.

The idea is that the partition key is hashed into a 128-bit value. All the possible hash values form a ring, and each node in the cluster is responsible for one (or more) ranges of the ring.

By hashing the partition key, every node can determine the range the key belongs to, and from there the node in charge of that range. Of course we also need availability, so the data has to be replicated across multiple nodes. Where to place the replicas (a.k.a. the replication strategy) depends on the database implementation (more on that later).
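The token-ring idea above can be sketched in a few lines of Python. This is a toy model, not either database's real partitioner: node names and tokens are made up, and MD5 simply stands in for "some 128-bit hash".

```python
import hashlib
from bisect import bisect_left

def token(key: str) -> int:
    # Stand-in for the real partitioner: MD5 yields a 128-bit value on the ring.
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

class Ring:
    def __init__(self, node_tokens: dict):
        # node_tokens maps a node name to the token it owns on the ring.
        self._nodes = sorted(node_tokens, key=node_tokens.get)
        self._tokens = sorted(node_tokens.values())

    def node_for(self, key: str) -> str:
        # A node owns the range (previous token, its own token].
        # Keys hashing past the last token wrap around to the first node.
        i = bisect_left(self._tokens, token(key)) % len(self._tokens)
        return self._nodes[i]

ring = Ring({"node-a": 2**126, "node-b": 2**127, "node-c": 3 * 2**126})
owner = ring.node_for("user-42")  # always the same node for the same key
```

Because the mapping is deterministic, any node (or any smart client driver) can compute which node owns a key without asking a coordinator first.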

This means that both DynamoDB and Cassandra are true peer-to-peer systems, with no master nodes (and no single point of failure). It also means that you can send your queries to any node in the cluster (or, even better, have your driver send the request to the most appropriate node).

The data structure

Primary key

Both databases belong to the column-family category. Each item is identified by a primary key composed of two parts:

  • The partition key (mandatory): determines the partition where the item is stored
  • The sort (or clustering) key (optional): determines how items are sorted inside a partition

Cassandra supports any number of fields for both the partition key and the clustering key. DynamoDB, on the other hand, supports only two fields: one for the partition key and one for the sort key. This means that if you need several attributes to compose your key, you have to manually concatenate them into a single field.
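Manually composing a DynamoDB key usually boils down to joining the attributes with a separator that cannot appear in the values. A minimal sketch (the separator choice and attribute values below are illustrative, not a DynamoDB convention mandated anywhere):

```python
SEPARATOR = "#"  # any character guaranteed not to appear in the key parts

def composite_key(*parts: str) -> str:
    # Concatenate several attributes into a single DynamoDB key attribute.
    for p in parts:
        if SEPARATOR in p:
            raise ValueError(f"separator {SEPARATOR!r} must not appear in {p!r}")
    return SEPARATOR.join(parts)

def split_key(key: str) -> list:
    # Recover the original attributes from the stored key.
    return key.split(SEPARATOR)

sort_key = composite_key("2018-01-15", "order-123")  # "2018-01-15#order-123"
```

One caveat: since DynamoDB sorts the concatenated string lexicographically, numeric parts need fixed-width padding (e.g. `0042` rather than `42`) if you want range queries to behave as expected.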

Schema

Cassandra supports structured data by means of a schema definition that you define in the table creation request in CQL (Cassandra Query Language).

DynamoDB is schema-less: apart from the partition and sort keys, items in the same table can have completely different attributes.

Indexes

Both Cassandra and DynamoDB support secondary indexes, which allow querying items by an attribute that is not part of the primary key.

However, secondary indexes are usually slower and should be used wisely (they are best suited to attributes with low cardinality).

Materialised views

Again, both databases support materialised views (although DynamoDB calls them “global secondary indexes”).

Time-To-Live

DynamoDB supports TTL at the item level: when the TTL expires, the whole item is deleted.

Cassandra offers finer control, as it supports TTL on individual columns, which makes it possible to expire only certain fields of an item.
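The difference is easy to picture with a toy model of a row whose cells each carry their own expiry, Cassandra-style (this is an illustration of the semantics, not how either engine stores data):

```python
import time

class Row:
    """Toy model of per-column TTL: each cell expires independently."""

    def __init__(self):
        self._cells = {}  # field -> (value, expiry timestamp or None)

    def set(self, field, value, ttl=None, now=None):
        now = time.time() if now is None else now
        self._cells[field] = (value, now + ttl if ttl else None)

    def get(self, field, now=None):
        now = time.time() if now is None else now
        value, expiry = self._cells.get(field, (None, None))
        if expiry is not None and now >= expiry:
            return None  # only this field expired; the rest of the row survives
        return value

row = Row()
row.set("session_token", "abc", ttl=10, now=0)  # expires at t=10
row.set("username", "alice", now=0)             # no TTL
```

At `t=11`, `session_token` is gone but `username` is still readable. With DynamoDB's item-level TTL, the expiry timestamp is a single attribute of the item, and the whole item disappears at once.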

Consistency

One of the advantages of the Dynamo family of NoSQL databases is that you can control the level of consistency that you need. For example, if eventual consistency is enough for your application, you can read one copy from any node in the cluster; it might not be the most up-to-date version of the data, but it’s the fastest way to query the database.

On the other hand, you can require multiple nodes to answer your query, making sure that you retrieve the latest version of the data.

These are basically the two consistency levels that DynamoDB offers. You control them with a “strongly consistent read” flag on the query. If false, eventual consistency applies and you might not retrieve the latest version of the data. If true, DynamoDB makes sure that you get the most up-to-date version. Of course, strongly consistent reads are slower and cost more as well.

Cassandra offers the same sort of consistency levels but with much finer control. You can choose between ANY (any node may answer — available for writes only), ONE (one node among the replicas of the given key), QUORUM (a majority of the replicas), LOCAL_QUORUM (a majority of the replicas in the local datacenter), and ALL (all replicas).
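The rule of thumb behind these levels: with a replication factor N, a read level touching R replicas and a write level touching W replicas, reads are guaranteed to see the latest acknowledged write whenever the read and write sets must overlap, i.e. R + W > N. A short sketch of the arithmetic:

```python
def quorum(n: int) -> int:
    # A majority of n replicas.
    return n // 2 + 1

def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    # Reads see the latest acknowledged write when every read set
    # must intersect every write set.
    return r + w > n

# With replication factor 3, QUORUM reads + QUORUM writes overlap: 2 + 2 > 3.
# ONE reads + ONE writes do not: 1 + 1 <= 3, hence eventual consistency only.
```

This is why QUORUM/QUORUM is the usual recipe for strong consistency in Cassandra, and why ONE/ONE trades consistency for latency.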

Conflicts resolution

Although there is a preferred node in charge of each partition, it may happen that we end up with a conflict for a given key (e.g. because the preferred node wasn’t available). In this case DynamoDB and Cassandra take different approaches.

Cassandra’s strategy is simple: every node adds a timestamp when it writes data. When there is a conflict, the data with the most recent timestamp wins. This is the last-write-wins (LWW) strategy.

Note that each field has its own timestamp, so it’s always possible to merge changes that concern different fields of the same object.
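Per-field LWW merging can be sketched as follows. Each version of a row maps a field to a `(value, timestamp)` pair; on conflict, the higher timestamp wins field by field (real Cassandra additionally breaks exact timestamp ties by comparing values, which this sketch omits):

```python
def lww_merge(a: dict, b: dict) -> dict:
    """Merge two versions of the same row, field by field.

    Each version maps field -> (value, write timestamp);
    the most recent timestamp wins for each field independently.
    """
    merged = dict(a)
    for field, (value, ts) in b.items():
        if field not in merged or ts > merged[field][1]:
            merged[field] = (value, ts)
    return merged

# Two replicas updated different fields of the same row:
replica_1 = {"name": ("alice", 100), "email": ("a@old.example", 100)}
replica_2 = {"email": ("a@new.example", 200)}
merged = lww_merge(replica_1, replica_2)
# merged keeps the newer email AND the untouched name
```

This is why concurrent updates to different fields of the same object don’t clobber each other: the merge happens per cell, not per row.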

The Dynamo paper relies on vector clocks to detect conflicts. Every node maintains a counter (or version number) of the changes it makes to an object. When it detects a conflict, it tries to merge the results (which makes sense for Amazon, because in the worst case you end up with an extra item in your cart).

Although it’s not clear from the AWS documentation, it seems that DynamoDB now relies on an LWW strategy as well (or a combination of vector clocks and timestamps).

In the end there is not much difference between the two, because both of them always return a single version of an object and the application never has to resolve conflicting versions itself.

The only control that you have is the consistency level that you set in the request.

Language

Cassandra comes with its own language: the Cassandra Query Language (CQL). It’s pretty close to SQL, with some adaptations to support Cassandra features not present in SQL (collections and user-defined types, TTL, …).

INSERT INTO cycling.cyclist_name (id, lastname, firstname)
VALUES (6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47, 'KRUIKSWIJK','Steven')
USING TTL 86400 AND TIMESTAMP 123456789;

DynamoDB relies on a specific JSON-based interface with variable substitution. The AWS CLI (Command Line Interface) makes it slightly easier to use, but I still find CQL much more natural.

{
    "ForumName": {"S": "Amazon DynamoDB"},
    "Subject": {"S": "New discussion thread"},
    "Message": {"S": "First post in this thread"},
    "LastPostedBy": {"S": "fred@example.com"},
    "LastPostDateTime": {"S": "201603190422"}
}

The "S" in the snippet above indicates that the attributes are of type String. It’s also possible to use variable substitution by preceding the variable name with a colon (:):

{
  "TableName": "Thread",
  "Key": {
    "ForumName": {
      "S": "Amazon DynamoDB"
    },
    "Subject": {
      "S": "A question about updates"
    }
  },
  "UpdateExpression": "set Replies = Replies + :num",
  "ExpressionAttributeValues": { 
    ":num": {"N": "1"}
  },
  "ReturnValues" : "NONE"
}
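The same kind of request can of course be built programmatically. A sketch in Python that assembles the parameters of the UpdateItem call above (the helper name is mine; the resulting dict could then be passed to the low-level `update_item` call of an AWS SDK client):

```python
def build_increment_update(table: str, key: dict, attribute: str, amount: int) -> dict:
    """Build the parameters of a DynamoDB UpdateItem request that
    increments a numeric attribute.

    Values use DynamoDB's typed encoding: {"S": ...} for strings,
    {"N": ...} for numbers (note: numbers are sent as strings).
    """
    return {
        "TableName": table,
        "Key": {name: {"S": value} for name, value in key.items()},
        "UpdateExpression": f"set {attribute} = {attribute} + :num",
        "ExpressionAttributeValues": {":num": {"N": str(amount)}},
        "ReturnValues": "NONE",
    }

params = build_increment_update(
    "Thread",
    {"ForumName": "Amazon DynamoDB", "Subject": "A question about updates"},
    "Replies",
    1,
)
```

Higher-level SDK abstractions (like boto3's resource layer in Python) hide the `{"S": ...}` / `{"N": ...}` envelope, but at the wire level this typed encoding is always what gets sent.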

Multiple Updates

One note on updating all records stored under the same partition key: Cassandra fully supports this operation by specifying only the partition key:

UPDATE my_table SET some_field = 'some value' WHERE my_partition_key = 'my_key';

With DynamoDB you have to specify the entire primary key (i.e. both the partition and sort keys).

Protocol

Like all Amazon services, DynamoDB offers a JSON-over-HTTP interface. The JSON syntax is specific to DynamoDB, as you need to indicate the type of each field inside the JSON structure.

To exchange data with DynamoDB you have to marshal to/from JSON, which impacts performance (sometimes substantially).

The Cassandra client relies on a binary protocol to communicate with the database, which makes serialisation more efficient (both in terms of CPU and network bandwidth). It is even more efficient than Thrift, which was deprecated in favour of the CQL native protocol. Part of its efficiency comes from the ability to have several ongoing requests on the same connection at the same time.

Drivers

Both Cassandra and DynamoDB provide drivers for most mainstream languages. In this section I discuss in more detail the drivers available on the JVM, as it is the platform I use the most.

The Java driver for Cassandra is pretty smart, as it tries to optimise your queries as best it can (e.g. sending a query to the node managing the hash of its partition key). It is also fully asynchronous and non-blocking, and can be used to send multiple requests simultaneously. The main caveat is the lack of type safety, as CQL queries are just written in plain text.

There are a number of alternatives for the Scala language (Phantom, Quill and more recently Troy).

The Java driver for DynamoDB is clearly not as good. The implementation is based on Java futures, which only provide blocking calls to access the value. This really is a bummer for building reactive applications. There is a so-called async version of the client, but all it does is delegate the blocking calls to a dedicated thread pool.

The good news is that Amazon is working on a new version of the Java driver that will be truly async, but at the time of writing it is still in developer preview (not yet stable).

There are a few Scala drivers developed by the community, like Scanamo, which provides nice marshallers and a Free-monad-based implementation. However, it relies on the AWS Java client under the hood. (Using the Free monad makes it easy to plug in another client, though.)

There is also an akka-stream connector available as part of the Alpakka project. To my knowledge this is the only non-blocking client available at the time of writing.

Data replication

In my opinion this is the key differentiating factor! Cassandra is fully tuneable and lets you configure every aspect of data replication. You can configure the number of replicas inside each datacenter and even enable replication across datacenters (making cross-region replication seamless, in case you’re hosting your Cassandra cluster on EC2).

Note that the number of replicas doesn’t change automatically. You have to manually add nodes or adjust the replication settings when your data or traffic outgrows the available resources.

DynamoDB takes the opposite approach! You have (almost) no control over the number of replicas involved. Everything is handled automatically by AWS. You only provision the required throughput, and DynamoDB makes sure you have enough partitions to handle the load.

As your data grows, AWS automatically adds more partitions. However, it doesn’t change the provisioned throughput, which means less throughput per partition. If your data is uniformly distributed that’s not an issue; if not, you can quickly run into problems! This can become tricky when you store time series in DynamoDB. (There is even a whole section dedicated to this topic in the DynamoDB documentation.)

Basically, with DynamoDB you need to make sure your data is uniformly distributed! The DynamoDB documentation does a pretty good job of explaining when partitions are created and how that impacts the provisioned throughput per partition.
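The dilution effect can be sketched numerically. The per-partition limits below (~10 GB of data, 3,000 RCU and 1,000 WCU per partition) are the figures DynamoDB documented at one point; the real allocator is opaque, so treat this purely as an illustration of the mechanism:

```python
import math

def partitions_needed(size_gb: float, rcu: int, wcu: int) -> int:
    # Illustrative partition count based on once-documented limits:
    # ~10 GB, 3000 RCU and 1000 WCU per partition.
    by_size = math.ceil(size_gb / 10)
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)
    return max(by_size, by_throughput, 1)

def throughput_per_partition(rcu: int, wcu: int, partitions: int):
    # Provisioned throughput is divided evenly across partitions.
    return rcu / partitions, wcu / partitions

# A table growing from 8 GB to 80 GB with an unchanged 1000 RCU / 200 WCU:
before = partitions_needed(8, 1000, 200)    # 1 partition
after = partitions_needed(80, 1000, 200)    # 8 partitions
r, w = throughput_per_partition(1000, 200, after)  # 125 RCU / 25 WCU each
```

Same provisioned throughput, ten times the data: each partition now serves an eighth of the capacity, which is exactly why a hot partition (e.g. today's slice of a time series) starts getting throttled.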

AWS has recently added support for DynamoDB global tables: multi-region, multi-master data replication. It uses DynamoDB streams to replicate the data and resolves conflicts using a last-write-wins strategy.

There is also a new table backup/restore functionality allowing you to back up a whole table at once without consuming any provisioned throughput. A restore can only be made into a new table and doesn’t consume any throughput either. However, it doesn’t play well with the multi-region support: backups can’t be exported into another region, and restoring a local replica inside a global table doesn’t generate the corresponding events on the stream (which makes it impossible to sync the other regions with this table). Basically, global tables are a great step forward, but it is currently difficult to import existing data into a global table. More information here: https://aws.amazon.com/dynamodb/global-tables

Costs

DynamoDB offers a “plug-and-play” approach. You set your expected throughput and you’re ready to go; Amazon manages everything for you. The costs are based on the provisioned throughput ($0.47/WCU and $0.09/RCU) and the storage used ($0.25/GB). Of course, you’d better check the official pricing documentation for up-to-date information.

On the other hand, Cassandra is free, open-source software, so you don’t have to pay for it. However, you have to run it somewhere, and you have to pay for the machines you’re going to use (most likely EC2 instances, or something equivalent on Google Cloud or Microsoft Azure). Cassandra is designed to run on cheap commodity hardware, but you will need several instances (3, 5, 7, … per datacenter). You should also factor in the time needed to set up and operate the cluster.

Note that DataStax (the main company behind Apache Cassandra’s development) now offers DataStax Managed Cloud, where they deploy and administer a Cassandra cluster for you on AWS.

AWS now also offers a caching layer (most likely based on ElastiCache) on top of DynamoDB, called DAX. DAX improves performance (reduced latency thanks to in-memory caching) and saves provisioned throughput, because requests hit the cache instead of your DynamoDB tables. DAX supports caching both items and queries.

Conclusion

Although Cassandra and DynamoDB seem very close at first sight, there are a number of key differences, such as:

  • The replication strategy
  • The service ownership (database managed by you or by AWS)
  • The pricing model

DynamoDB is easy to get started with, but you should keep an eye on the costs involved. It is also not a good idea to accumulate rarely-accessed data in DynamoDB, as it increases your storage costs and dilutes your provisioned throughput. You’d better move such data into a dedicated table with a different throughput, or even outside of DynamoDB.

I tried to cover the main aspects worth considering when making a decision. However, it’s quite difficult to cover everything, so don’t hesitate to leave a comment if you find something missing.
