Categories
Databases

DSEGraph under the hood

Today we’ll dig deeper inside DSEGraph and see how it uses Cassandra to distribute the graph over the cluster.

We’ll illustrate this post with a classic toy example: a movie graph.

Our model is very simple. I consists in “persons” who “acted in”  “movies” which “belongs to” “genres”.

The graph schema used in the post
The graph schema used in the post

The Schema

It can be created with the following commands:

// Create the schema (you should specify the replication strategy here)
system.graph("MovieGraph").create()
// Property keys
schema.propertyKey("genreId").Text().create()
schema.propertyKey("personId").Text().create()
schema.propertyKey("movieId").Text().create()
schema.propertyKey("name").Text().create()
schema.propertyKey("title").Text().create()
schema.propertyKey("year").Int().create()
schema.propertyKey("amount").Int().create()
// Vertex labels
schema.vertexLabel("genre").properties("genreId","name").create()
schema.vertexLabel("movie").properties("movieId","title","year").create()
// Edge labels
schema.edgeLabel("BELONGS_TO").connection("movie","genre").create()
schema.edgeLabel("ACTED_IN").connection("person","movie").properties("amount").create()
// Vertex indexes
schema.vertexLabel("genre").index("genresById").materialized().by("genreId").add()
schema.vertexLabel("genre").index("genresByName").materialized().by("name").add()
schema.vertexLabel("person").index("personsById").materialized().by("personId").add()
schema.vertexLabel("person").index("personsByName").materialized().by("name").add()
schema.vertexLabel("movie").index("moviesById").materialized().by("movieId").add()
schema.vertexLabel("movie").index("moviesByTitle").materialized().by("title").add()
schema.vertexLabel("movie").index("moviesByYear").secondary().by("year").add()
schema.vertexLabel("movie").index("moviesBySearchTitle").search().by("title").add()

If you look at the type of entities declared in this schema, you’ll find 4 different types: Properties, Vertices, Edges and Indices.

The order is important. You first describe the properties you’re going to use. Notice that we declared only one name property although both the person and the genre have a name. It means you can reuse the same property on different vertices.

The Keyspaces

We’re going to focus on each of this types in more details. But first let’s have a look at what’s been created into Cassandra.

cqlsh> describe keyspaces;

"MovieGraph"    "MovieGraph_system"    "MovieGraph_pvt"       ...

We have defined one graph but under the hood, 3 keyspaces have been created.

  • “MovieGraph” is the main key space and will contain the graph data (vertices, edges, indices)
  • “MovieGraph_system” contains the metadata like the graph’s schema.
  • “MovieGraph_pvt” deals with vertices having a big number of connections (edges).

All our schema description lies inside “MovieGraph_system”.shared_data. If you query this table using cqlsh you’ll find metadata such as timestamp validity, types (text, int, timestamp, …) but not all data is in a human readable format as it’s not intended for direct access.

The tables

Now let’s move on to the main keyspace “MovieGraph”. As I said this keyspace holds the graph data and it contains all the tables to support our entities.

Let’s have a look at the created tables:

cqlsh:MovieGraph> describe tables;

person_e    person_p    genre_e    movie_p
genre_p     movie_e     id_allocation ...
The tables used to store MovieGraph data into Cassandra
The tables used to store MovieGraph data into Cassandra

As you can see there are 2 tables created for each vertex type.

  • <vertex>_p: contains the vertex itself with all the properties declared in the schema.
  • <vertex>_e: stores the edges connected to that vertex (both incoming and outgoing edges).

There is one additional support table “id_allocation” that is used to generate new ids as entities are added to the graph.

Let’s have a look at one of the vertex table:

cqlsh:MovieGraph> describe table movie_p;

CREATE TABLE "MovieGraph".movie_p (
  community_id int,
  member_id bigint,
  "~~property_key_id" int,
  "~~property_id" uuid,
  title text,
  year int,
  "~~vertex_exists" boolean,
  PRIMARY KEY (community_id, member_id, "~~property_key_id", "~~property_id")
) WITH CLUSTERING ORDER BY (member_id ASC, "~~property_key_id" ASC, "~~property_id" ASC)
...

As you can see this table stores the properties associated to the movie entity. Nothing too surprising but the interesting thing is the primary key.

The partitioning

In Cassandra the primary key is composed of a partition key and a clustering key. The partition key “controls” which partition will store the data. Here the partition key is the “community_id” key.

This key is going to drive how the graph is partitioned. Every vertex with the same community_id will fall inside the same partition. If you can control the community_id you can control how your graph is going to be partitioned but you don’t have to.

Let’s see first how default vertex ids are managed in DSEGraph. The community and member_id are generated automatically and as far as I know it doesn’t rely on any kind of graph properties like community detection, …

If we want some control over how our graph is partitioned we have to craft our own vertex ids (in the same fashion that we’re used to craft our Cassandra tables and partition keys).

Of course this is done when we declare our schema. Let’s use the year property that already exists to partition our movies by year.

schema.vertexLabel("movie")
      .partitionKey("year")
      .clusteringKey("title")
      .create();

Now all the movies that were released on the same year will occupy the same partition.

g.V().hasId("{~label: movie, year: 2010}");

This snippet gives us all the movies releases in 2010, reading only a single partition. Note that the ~label always needs to be specify as different labels map to different table as we’ve seen previously.

If the partition key is chosen correctly it can have big improvements performance wise.

Sample of Actors, Movies and ACTED_IN relations
Sample of Actors, Movies and ACTED_IN relations
Cassandra partitions of the movie_p table with primary key ((year), title)
Cassandra partitions of the movie_p table with primary key ((year), title)

Now let’s have a quick look at the edge ids (which we have no control over).

cqlsh:MovieGraph> describe table movie_e;

CREATE TABLE "KillrVideo".movie_e (
    community_id int,
    member_id bigint,
    "~~edge_label_id" int,
    "~~adjacent_vertex_id" blob,
    "~~adjacent_label_id" smallint,
    "~~edge_id" uuid,
    "~amount" int,
    "~~edge_exists" boolean,
    "~~simple_edge_id" uuid,
    PRIMARY KEY (community_id, member_id, "~~edge_label_id", "~~adjacent_vertex_id", "~~adjacent_label_id", "~~edge_id")
) WITH CLUSTERING ORDER BY (member_id ASC, "~~edge_label_id" ASC, "~~adjacent_vertex_id" ASC, "~~adjacent_label_id" ASC, "~~edge_id" ASC) ...

Here we can see that the edge properties are stored in this table (e.g. ‘~amount’ property – The ‘ACTED_IN’ edge has an amount property which represents how much an actor was payed to play in a given movie).

It’s worth noting that this tables stores all the edges linked to movies. In our case it contains both ACTED_IN and BELONGS_TO relations.

The partition key is obviously the same as the corresponding vertex. In our case it’s the community_id (but it can be any custom partition key you’ve defined for the movie vertices).

It looks ok but it might actually be an issue for high-degree vertices (i.e. vertices with a huge number of edges). Such vertices – with millions of connections – are too big to fit on a regular partition. This will slow down the system as the partition will take too long to read. Moreover such vertices are often the most frequently accessed.

For such high-degree nodes we need to split the connections into several partitions. This is where the third keyspace: “MovieGraph_pvt” comes into play. It holds vertex partition triple (PVT) information. That’s basically metadata needed to manage the partitions for such high-degree vertices.

Ids are not always the best way to access data, often we need to query our entities (vertex or edge) using a given set of properties. This use case is supported by indexes.

The Indices

In DSEGraph indices can be created for every entity type: Vertex, Edge and Property.

Property indices are available to index meta-properties with a ‘multiple’ cardinality. You need to specify the vertex (meta-properties are only available on vertices), the property and the meta-property you want to index. We don’t have any in our schema here (so go and check the API if you need to).

Edge indices are straight-forward as well. You specify the vertex, the edge and the property to index.

Vertex indices are more interesting as 3 index types are available:

  • materialized: These indices are backed up by cassandra materialized views which means they’re as fast as querying a table. That’s the type you want to use if the partitions won’t be too big (high cardinality values)
  • secondary: Secondary indices – as the name says – rely on Cassandra secondary indices. You want to use them for low cardinality values (e.g. The movies release year in our example is a good candidate (probably about 100 entries)
  • search: This is the less efficient index as it is backed up by a Solr index but it addresses very specific needs like approximate (fuzzy) search (e.g. it allows us to search by approximate movie titles in our example) and geographical search.

In this post we have explored how DSEGraph data are stored into Cassandra and how the underlying concepts (partition and clustering keys, indices) are exposed into the DSEGraph API. I especially like the fact that you can ‘control’ the primary key and thus the partitioning of your graph.