Understanding Cassandra tombstones

Tweet about this on TwitterShare on LinkedInShare on FacebookShare on Google+Share on Reddit

We recently deployed in production a distributed system that uses Cassandra as its persistent storage.

Not long after we noticed that there were many warnings about tombstones in Cassandra logs.

WARN  [SharedPool-Worker-2] 2017-01-20 16:14:45,153 ReadCommand.java:508 - 
Read 5000 live rows and 4771 tombstone cells for query 
SELECT * FROM warehouse.locations WHERE token(address) >= token(D3-DJ-21-B-02) LIMIT 5000 
(see tombstone_warn_threshold)

We found it quite surprising at first because we’ve only inserted data so far and didn’t expect to see that many tombstones in our database. After asking some people around no one seemed to have a clear explanation on what was going on in Cassandra.

In fact, the main misconception about tombstones is that people associate it with delete operations. While it’s true that tombstones are generated when data is deleted it is not the only case as we shall see.

Looking into sstables

Cassandra provides a tool to look at what is stored inside an sstable: sstabledump. This tool comes with the ‘casssandra-tools’ package which is not automatically installed with Cassandra. It’s quite straight-forward to install on a delian-like (e.g. unbuntu) distribution:

sudo apt-get update
sudo apt-get install cassandra-tools

In this blog post I used ssltabledump to understand how Cassandra stores data and when tombstones are generated.
The syntax is pretty straightforward:

sstabledump /var/lib/cassandra/data/warehouse/locations-660dbcb0e4a211e6814a9116fc548b6b/mc-1-big-Data.db

sstabledump just takes the sstable file and displays its content as json. Before being able to dump an sstable we need to flush the in-memory data into an sstable file using nodetool:

nodetool flush warehouse locations

This command flushes the table ‘locations’ in the ‘warehouse’ keyspace.

Now that we’re all setup, let’s have a look at some cases that generate tombstones.

Null values creates tombstones

An upsert operation can generate a tombstone as well. Why? Because Cassandra doesn’t store ‘null’ values. Null means the absence of data. Cassandra returns a ‘null’ value when there is no value for a field. Therefore when a field is set to null Cassandra needs to delete the existing data.

INSERT INTO movements (
  id,
  address, 
  item_id, 
  quantity,
  username
) VALUES (
  103,
  'D3-DJ-21-B-02', 
  '3600029145',
  2,
  null
);

This statements removes any existing username value for the movement identified by the id 103. And how does Cassandra remove data ? Yes, by inserting a tombstone.

This is the corresponding ssltabledump output:

[
  {
    "partition" : {
      "key" : [ "103" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 18,
        "liveness_info" : { "tstamp" : "2017-01-27T15:09:50.065224Z" },
        "cells" : [
          { "name" : "address", "value" : "D3-DJ-21-B-02" },
          { "name" : "item_d", "value" : "3600029145" },
          { "name" : "quantity", "value" : "2" },
          { "name" : "username", "deletion_info" : { "local_delete_time" : "2017-01-27T15:09:50Z" }
        ]
      }
    ]
  }
]

Cassandra is designed for optimised performance and every operation is written to an append-only log. When a data is removed we can’t removed the existing value from the log, instead a “tombstone” value is inserted in the log.

Moreover Cassandra doesn’t perform read before write (except for light-weight transactions) as it would be too expensive.

Therefore when the above insert is executed Cassandra insert a tombstone value for the username field (even if there was no existing data for this key before).

Now let’s consider the following statement that looks very similar to the previous one:

INSERT INTO movements (
  id,
  address, 
  item_id, 
  quantity
) VALUES (
  103,
  'D3-DJ-21-B-02', 
  '3600029145',
  2
);

But there is one difference. The first statement creates a tombstone for the username whereas the second statement doesn’t insert anything in that column.

[
  {
    "partition" : {
      "key" : [ "103" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 18,
        "liveness_info" : { "tstamp" : "2017-01-27T15:09:50.065224Z" },
        "cells" : [
          { "name" : "address", "value" : "D3-DJ-21-B-02" },
          { "name" : "item_d", "value" : "3600029145" },
          { "name" : "quantity", "value" : "2" }
        ]
      }
    ]
  }
]

If this is the first insert for this key (no previously existing data) then both statements yield to the same state, except that the second one doesn’t insert an unnecessary tombstone.

If there is existing data then what ends up in the username column might be different. With statement 1 whatever data was there it is deleted with the tombstone and no longer returned. With statement 2 the username remains unchanged so whatever value was there before (if any) will get returned.

Therefore you should strive to only update the fields that you need to.

For instance let’s say that I need to update the status of a location. Then I should only update the status field rather than the whole object. That would avoid tombstones for every missing value in the object.

The ‘properties’ field is not set in the query so no value is stored in Cassandra.

The following statement is exactly what we need as it only sets the status field:

UPDATE locations SET status = 'damaged' WHERE location_address = 'D3-DJ-21-B-02';

as the sstable dump shows

[
  {
    "partition" : {
      "key" : [ "D3-DJ-21-B-02" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 28,
        "cells" : [
          { "name" : "status", "value" : "damaged", "tstamp" : "2017-01-28T11:55:18.146255Z" }
        ]
      }
    ]
  }
]

Compare it with the following one which saves the whole location object (which happens to not have any properties – and insert an unnecessary tombstone in the ‘properties’ column).

INSERT INTO locations (
  address,
  status,
  properties
) VALUES (
  'D3-DJ-21-B-02',
  'damaged',
  null
);
An empty collection is stored as a tombstone cell
[
  {
    "partition" : {
      "key" : [ "D3-DJ-21-B-02" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 18,
        "liveness_info" : { "tstamp" : "2017-01-28T11:58:59.160898Z" },
        "cells" : [
          { "name" : "status", "value" : "damaged" },
          { "name" : "properties", "deletion_info" : { "marked_deleted" : "2017-01-28T11:58:59.160897Z", "local_delete_time" : "2017-01-28T11:58:59Z" } }
        ]
      }
    ]
  }
]

Be aware of the collection types

In the previous example the ‘properties’ field is a collection type (most likely a set), so let’s talk about collections as they are trickier than it looks.

Let’s create a new location with the following statement

INSERT INTO locations (
  address,
  status,
  properties
) VALUES (
  'C3-BE-52-C-01',
  'normal',
  {'pickable'}
);

Everything looks good, doesn’t it? Every field has a value, so no tombstone expected. And yet, this statement does create a tombstone for the ‘properties’ field.

[
  {
    "partition" : {
      "key" : [ "C3-BE-52-C-01" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 18,
        "liveness_info" : { "tstamp" : "2017-01-28T12:01:00.256789Z" },
        "cells" : [
          { "name" : "status", "value" : "normal" },
          { "name" : "properties", "deletion_info" : { "marked_deleted" : "2017-01-28T12:01:00.256788Z", "local_delete_time" : "2017-01-28T12:01:00Z" } },
          { "name" : "properties", "path" : [ "pickable" ], "value" : "" }
        ]
      }
    ]
  }
]

To understand why, we need to look at how Cassandra store a collection in the underlying storage.

The collection field includes a tombstone cell to empty the collection before adding a value.

Cassandra appends new values to the set, so when we want the collection to contain only the values passed in the query, we have to remove everything that might have been there before. That’s why Cassandra inserts a tombstone and then our value. This makes sure the set now contains only the ‘pickable’ value whatever was there before.

That’s one more reason to just set the values you need to update and nothing more.

Be careful with materialised views

A materialised view is a table that is maintained by Cassandra. One of its main feature is that we can define a different primary key than the one in the base table. You can re-order the fields of the primary key from the base table, but you can also add one extra field into the primary key of the view.

This is great as it allows to define a different partitioning or clustering but it also generates more tombstones in the view. Let’s consider an example to understand what’s happening.

Imagine that we need to query the locations by status. For instance we want to retrieve all ‘damaged’ locations. We can create a materialised view to support this use case.

CREATE MATERIALIZED VIEW locations_by_status AS
  SELECT
    status,
    address,
    properties
  FROM locations
  WHERE status IS NOT NULL
  AND address IS NOT NULL
  PRIMARY KEY (status, address);

Good, now we can use this view to find out all the locations with a given status.

But let’s consider what happens in the view when we change the status of a location with an update query

UPDATE locations
SET status = 'damaged'
WHERE address = 'C3-BE-52-C-01';

As we’ve seen this query just updates one field in the base table (locations) and doesn’t generate any tombstone in this table. However in the materialised view the ‘status’ field is part of the primary key. When the status changes the partition key changes as well.

[
  {
    "partition" : {
      "key" : [ "normal" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 18,
        "clustering" : [ "C3-BE-52-C-01" ],
        "deletion_info" : { "marked_deleted" : "2017-01-20T10:34:27.707604Z", "local_delete_time" : "2017-01-20T10:46:14Z" },
        "cells" : [ ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "damaged" ],
      "position" : 31
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 49,
        "clustering" : [ "C3-BE-52-C-01" ],
        "liveness_info" : { "tstamp" : "2017-01-20T10:46:14.285730Z" },
        "cells" : [
          { "name" : "properties", "deletion_info" : { "marked_deleted" : "2017-01-20T10:46:14.285729Z", "local_delete_time" : "2017-01-20T10:46:14Z" } }
        ]
      }
    ]
  }
]

To maintain the view in sync with the base table Cassandra needs to delete the row from the existing partition and insert a new one into the new partition. And a delete means a tombstone.

The update in the base table triggers a partition change in the materialised view which creates a tombstone to remove the row from the old partition.

The key thing here is to be thoughtful when designing the primary key of a materialised view (especially when the key contains more fields than the key of the base table). That being said it might be the only solution and completely worth it.

Also consider the rate of the changes of the fields of the primary key. In our case we should evaluate the rate at which the status changes for a given location (the location address doesn’t change). The less often the better off we are with respect to tombstones.

Finally tombstones will disappear over time, when compaction occurs. The typical delay is 10 days (which corresponds to the ‘gc_grace_seconds’ configuration parameter). You may want to adjust it if there are too many tombstones generated during this period.

Conclusion

As we’ve seen tombstones can be tricky and there not only associated to delete operations. There are many other cases that may generate tombstones.

Tombstones are not necessarily a bad thing that we should avoid at all cost. It’s just a way to delete data in an append-only structure.

However it can affect performances so you’d better be aware when they are generated when designing your data model and queries.

If you use the java driver the query fails if more than 100,000 tombstones are seen.

With this knowledge you should be able to limit their generation only when necessary. In this case you should evaluate how many are going to be generated and evaluate if it’s going to be an issue or not. If so you may want to tune the ‘gc_grace_seconds’ parameter to trigger compaction more often.