DSE Graph vs Neo4J


Last week I took the DSEGraph course available on Datastax academy, so logically this post is about DSEGraph. As it happens, I also used Neo4J a couple of months ago, so this post is going to be a comparison between these 2 graph databases.

DSEGraph is a layer that sits on top of Cassandra and provides a directed, attributed, binary graph database.
(It just means that vertices and edges can hold properties, edges are directed and connect 2 vertices together).
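To make that definition concrete, here is a minimal sketch in plain Python. It is not tied to either database, and the class names and sample data (Vertex, Edge, "Alice", "Some Movie") are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Vertex:
    """A vertex with a label and arbitrary properties (the "attributed" part)."""
    id: int
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    """An edge connecting exactly two vertices (binary), from out_v to in_v (directed)."""
    label: str
    out_v: Vertex  # source vertex
    in_v: Vertex   # target vertex
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical sample data
alice = Vertex(1, "Person", {"name": "Alice"})
movie = Vertex(2, "Movie", {"title": "Some Movie"})
acted_in = Edge("ACTED_IN", alice, movie, {"role": "Lead"})
```

Both vertices and edges carry a property map, and an edge always has exactly one source and one target.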

Of course Neo4J falls into the exact same category: the directed, attributed, binary graph.

DSEGraph was inspired by the Titan project (an open-source project providing a graph DB on top of other DB backends like Cassandra or HBase). However DSEGraph is not open-source. On the other hand being part of the DSE platform means a better and more efficient integration with the Cassandra backend.

It’s amazing to see how the underlying Cassandra database is abstracted away (although it still shows through in the schema definition when optimising performance) and how the integration with DSE features (Solr, Spark) makes DSEGraph a whole new product of its own.

Neo4J was developed specifically to handle highly-connected data and overcome all the limitations related to this kind of data on regular SQL databases (mainly costly multi-joins).

The engines

The Neo4J engine has been designed mostly to run on a single node and provides ACID transactions by means of locking, in a similar fashion to traditional SQL databases. However, Neo4J doesn’t lock the whole database but only the entities (edges and vertices) being updated.

It performs well when running on a single machine, but scaling is not something it excels at. Neo4J scaling is based on a master-slave architecture where one node, the master, is responsible for the writes, which are then replicated to the slaves. This lets you scale out the querying part. Fortunately it’s often the case that reads are more frequent and/or more complicated than writes, but if you need to scale out the writes too, Neo4J might not be the best bet.
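The read/write split described above can be sketched in a few lines of Python. This is only a toy model of the idea, not Neo4J’s actual clustering code: the Node and MasterSlaveCluster classes are hypothetical, and replication is done synchronously to keep things short.

```python
import itertools


class Node:
    """A toy database node holding a full copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class MasterSlaveCluster:
    """Toy model: every write goes to the master, reads rotate over the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves
        self._next_slave = itertools.cycle(slaves)

    def write(self, key, value):
        # The single master handles every write, so write throughput
        # is bounded by one machine.
        self.master.data[key] = value
        # Simplification: replicate to all slaves immediately.
        for slave in self.slaves:
            slave.data[key] = value
        return self.master.name

    def read(self, key):
        # Reads are spread over the slaves (round-robin), so adding
        # slaves adds read capacity.
        slave = next(self._next_slave)
        return slave.name, slave.data.get(key)
```

Adding slaves to this model increases read capacity only; each slave still has to hold the whole dataset and every write still lands on the master.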

DSEGraph is distributed by nature, as Cassandra itself is a distributed database, and of course DSEGraph benefits from it directly when it comes to scaling. You can easily replicate data inside and across datacenters for free, as the whole replication mechanism is provided by Cassandra. When you understand how the graph data maps to the Cassandra data model (keyspaces and tables) you can even “control” how to split your graph across different partitions.
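As a rough illustration of what “controlling” the split means, here is a small Python sketch of hash-based placement. It is a simplification under stated assumptions: Cassandra actually uses Murmur3 tokens on a ring rather than an MD5 hash and a modulo, and the node names, the owner function and the sample people are all made up.

```python
import hashlib

# Hypothetical three-node cluster
NODES = ["node-a", "node-b", "node-c"]


def owner(partition_key, nodes=NODES):
    """Map a partition key to a node by hashing it (stand-in for Cassandra's token ring)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return nodes[token % len(nodes)]


# Hypothetical vertices
people = [
    {"name": "Alice", "country": "FR"},
    {"name": "Bob", "country": "FR"},
    {"name": "Carol", "country": "UK"},
]

# Partitioning by name scatters vertices according to the hash of each name.
placement_by_name = {p["name"]: owner(p["name"]) for p in people}

# Partitioning by country keeps all vertices of a country on the same node.
placement_by_country = {p["name"]: owner(p["country"]) for p in people}
```

Whichever property is chosen as the partition key decides which vertices end up together, which is why the choice matters for traversals that stay within one group.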

The performance depends on the kind of index you need. DSEGraph provides 3 types of indexes:

  • Materialised views: The fastest type of index, where the index is stored as a Cassandra table
  • Secondary index: Simply a Cassandra secondary index – performs worse than a regular table but still efficient for low cardinality values (e.g. country-codes, etc)
  • Solr index: The least efficient, but it provides powerful search features such as full-text or geo searches

Moreover you can also benefit from the Spark compute engine to run intensive analytics queries.

Unless your data is small enough to fit on a single machine (and don’t get me wrong, this is still often the case today), DSEGraph is a clear winner here.

The data model

There is not much to say here as both models are very similar. The main difference is that DSEGraph relies on a data schema to model the properties, vertices and edges and map them back to Cassandra tables. In dev mode the schema is optional and is inferred automatically.

Neo4J is more “schemaless”: nothing needs to be declared and you can add arbitrary properties on the fly to any entity.

Neo4J allows you to assign multiple labels per vertex (an edge, however, has exactly one type), whereas DSEGraph requires exactly one label per vertex and per edge (a sort of entity type: Person, Actor, …).

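DSEGraph provides meta-properties (properties on properties), whereas Neo4J property values are limited to primitive types and arrays of them (nested JSON-like structures have to be flattened or stored as strings).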

For me there is no clear winner here as both of them are very similar when it comes to modelling. Neo4J provides slightly more flexibility but on the other hand the DSEGraph schema leads to better performance.

The Query language

Neo4J provides its own language: the “Cypher” query language (not to be confused with the Cassandra Query Language, even though both are sometimes abbreviated to CQL). Cypher is a rather easy language to pick up. It relies mostly on pattern matching and “ASCII art”: a vertex is represented by parentheses () and a directed edge by an arrow: -->. An example query might look like this:

MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie)
WHERE movie.title STARTS WITH "T"
RETURN movie.title AS title, collect(actor.name) AS casting

The DSEGraph query language is Gremlin – the graph query language of the Apache TinkerPop project. Gremlin is available in several programming languages. DSEGraph comes with the Groovy flavour. A similar query looks like this:

g.V().hasLabel('Person').as('actor').out('ACTED_IN').as('movie').
  select('actor', 'movie').by('name').by('title')
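For readers new to graph queries, here is roughly what the Cypher query above computes, written as plain Python over a tiny in-memory graph. It is only a sketch to show the logic (match Person -ACTED_IN-> Movie, filter on the title prefix, group actors per movie). The sample data and the casting_by_title function are made up and have nothing to do with how either engine executes queries.

```python
from collections import defaultdict

# Hypothetical sample graph: vertices keyed by id, edges as (out_id, label, in_id)
vertices = {
    1: {"label": "Person", "name": "Keanu Reeves"},
    2: {"label": "Person", "name": "Carrie-Anne Moss"},
    3: {"label": "Movie", "title": "The Matrix"},
    4: {"label": "Movie", "title": "Speed"},
}
edges = [
    (1, "ACTED_IN", 3),
    (2, "ACTED_IN", 3),
    (1, "ACTED_IN", 4),
]


def casting_by_title(prefix):
    """Group actor names by movie title for movies whose title starts with prefix."""
    casting = defaultdict(list)
    for out_id, label, in_id in edges:
        actor, movie = vertices[out_id], vertices[in_id]
        if (
            label == "ACTED_IN"
            and actor["label"] == "Person"
            and movie["label"] == "Movie"
            and movie["title"].startswith(prefix)
        ):
            casting[movie["title"]].append(actor["name"])
    return dict(casting)


print(casting_by_title("T"))
# {'The Matrix': ['Keanu Reeves', 'Carrie-Anne Moss']}
```

The Gremlin version shown above does the same traversal but skips the title filter and the grouping, returning one actor/movie pair per edge.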

Here Cypher seems much closer to SQL and is a good entry point to get started with graph databases. Gremlin, in my opinion, is much richer when it comes to complex graph traversals, and the icing on the cake: it’s possible to define your own DSL (domain specific language) on top of it.

The user interface

Not a key point for me but again it’s a draw. They both provide a similar web console where you can run queries and explore the graph using neat d3 visualisations (but don’t use it to display the whole graph at once!).

Neo4j web console

DSEGraph notebook

The community

Neo4J is the market leader and the community behind it is probably much bigger than the DSEGraph/Titan community. I am aware of many Neo4J or Cassandra user groups, so there are probably some near you too.
However both databases provide extensive documentation, examples, courses and tutorials.
Commercial support is available as well.

The price

I am not an expert in terms of licences and I am not aware of the pricing of any of these solutions, so you’d better check for yourself.

I think the Neo4J licence is somewhat similar to MySQL’s – not really open-source but still accessible – at least for open-source projects.

On the other hand DSEGraph is definitely not open-source and you need a DSE licence to use it – although you can download a VM with DSEGraph installed from the Datastax website and try it for yourself.

Both Datastax (DSEGraph) and Neo Technology (Neo4J) provide commercial support for their databases so go and check the price with them.

Conclusion

Neo4J is the database to get started with graphs. It is simple, accessible and performs very well on not-too-big amounts of data. It is widely used and owns the majority of the market. That’d probably be my choice to get started with unless I had a very specific need.

DSEGraph addresses the weaknesses of Neo4J: scalability and analytics (using the Spark backend) at the cost of additional complexity (schema is mandatory and you have to understand the mapping to the underlying Cassandra architecture to make the most of it).

My advice would be to go for Neo4J if you’re just getting started or have a reasonable amount of data. Once you reach the limits of Neo4J it might be worth considering DSEGraph. The good thing is that both data models are really similar so migrating from one to the other should not be too much of a problem.

  • EmbraceTheIrrational

    I always hear this about scalability limit for Neo4J but I’m curious what that really means ? It seems there are graphs out there with at least 10^9 vertices in Neo4J. I understand the master-slave setup, or rather now oligarchical scheme they have in some sort of futuristic feudal landscape of storage, which really means that each machine needs to hold the entire copy of the graph. This is what allows for index-free-adjacency, and what limits it from a multi-machine setup. That and there is no data sharding like you have with DSEGraph. DSEGraph is just a bunch of optimised tables in Cassandra so you get the performance and scalability of a distributed NoSQL backend but at the cost of traversals being lookups in indexed tables. So somehow this is meant to scale linearly with the size of the graph, which I don’t understand either.

    But given that you can cheaply acquire terabytes of storage space, I’d be curious at what point is Neo4J really limited by this paradigm of index-free-adjacency.

    • It’s not only the number of vertices but also the way you access the database. With Cassandra you can read/write on different nodes simultaneously. However DSEGraph uses Gremlin as its query language which is known to be particularly slow. Basically if your writes (and data) fit on a single node you’re definitely better off with Neo4j. If you only need to traverse a couple of edges in your queries you might also consider a traditional SQL store as demonstrated here: https://blog.acolyer.org/2017/07/07/do-we-need-specialized-graph-databases-benchmarking-real-time-social-networking-applications/

      • EmbraceTheIrrational

        yeah I’ve seen that post before. I do think it’s a bit unfair as it’s Neo4J pre-bolt (serialization protocol which sped things up quite a bit) as well as Titan rather than DSEGraph with its optimizations or Janus. But my question was about how much data can fit on a single node, and it seems there are really massive Neo4J graphs out there so the answer must be “a lot” !