World Wide Webber


Graph Processing versus Graph Databases
Posted: 24 August 2011 @ 11:20 UT from Seattle, US
Last updated: 25 August 2011 @ 07:26 UT

There's recently been a great deal of discussion on the subject of graph processing. For those of us in the graph database space, this is an exciting development since it reinforces the utility of graphs as both a storage and a computational model. Confusingly, however, graph processing is often conflated with graph databases because the two share the same data model, yet each tool addresses a fundamentally different problem.

For example, graph processing platforms like Google's Pregel achieve high aggregate computational throughput by adopting the Bulk Synchronous Parallel (BSP) model from the parallel computing community. Pregel supports large-scale graph processing by partitioning a graph across many machines and allowing those machines to compute efficiently at vertices using localised data. Localised information is exchanged only during synchronisation phases (cf. the BSP model). This gives Google the ability to process huge volumes of interconnected data, albeit at relatively high latencies, to gain greater business insight than with traditional (non-graph optimised) map-reduce approaches.
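To make the superstep idea concrete, here is a minimal single-process sketch of BSP-style vertex computation in Python. It is not Pregel's API: the function name `bsp_max_value`, the dictionary-based graph and the choice of "propagate the largest value" as the vertex program are all assumptions made for illustration. Each vertex computes using only its own value and the messages delivered to it, and messages sent in one superstep only become visible after the barrier.

```python
from collections import defaultdict


def bsp_max_value(graph, values):
    """Spread the largest vertex value to every connected vertex, BSP-style.

    graph:  dict mapping each vertex to a list of its neighbours
            (assumed: every vertex appears as a key)
    values: dict mapping each vertex to its initial numeric value
    Returns (final values, number of supersteps executed).
    """
    values = dict(values)  # work on a copy, leave the caller's data alone

    # Superstep 0: every vertex is active and sends its value to its neighbours.
    inbox = defaultdict(list)
    for vertex, neighbours in graph.items():
        for neighbour in neighbours:
            inbox[neighbour].append(values[vertex])
    supersteps = 1

    # Keep running supersteps until no messages are in flight.
    while inbox:
        outbox = defaultdict(list)
        # Compute phase: a vertex sees only its own value and its inbox.
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                for neighbour in graph[vertex]:
                    outbox[neighbour].append(best)
        # Synchronisation barrier: messages are delivered for the next superstep.
        inbox = outbox
        supersteps += 1

    return values, supersteps


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
final, steps = bsp_max_value(graph, {"a": 3, "b": 6, "c": 2})
print(final, steps)
```

In a real deployment the loop over vertices would be spread across many machines and the barrier would involve network exchange, which is where the high latency per superstep comes from.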

Sadly, few of us have Google-scale resources at our disposal to invent novel platforms on demand. In enterprise-scale scenarios, Hadoop (incidentally an open-source implementation of Google's earlier map-reduce framework) has become a popular platform for batch processing large volumes of data. Like Pregel, Hadoop is a high-latency, high-throughput processing tool that optimises computational throughput by processing large volumes of data in parallel outside the database.

Unlike Pregel, Hadoop is a general-purpose framework. While it can be used for graph processing, it's not optimised for that purpose, nor are its underlying storage mechanisms graph-oriented in nature: HDFS is a distributed file system and HBase is a distributed tabular database designed for large numbers of rows and columns. (Interestingly, though, the Ravel Golden Orb platform claims to add a Pregel-like programming model above Hadoop.)

What Pregel and Hadoop have in common is their tendency towards the data analytics (OLAP) end of the spectrum, rather than being focussed on transaction processing. This is in stark contrast to graph databases like Neo4j which optimise storage and querying of connected data for online transaction processing (OLTP) scenarios - much like a regular RDBMS, only with a more expressive and powerful data model. We can visualise these differing capabilities easily as in the figure below:

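[Figure: Pregel, Hadoop, relational databases and Neo4j positioned by OLAP versus OLTP orientation and by graph affinity versus general purpose]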

In this breakdown, Pregel is positioned firmly in the OLAP graph processing space, much as Hadoop is positioned in the general-purpose OLAP space (though closer to the OLTP axis because of recent advances in so-called real-time Hadoop). Relational databases are positioned as general-purpose OLTP engines that can be adapted somewhat to OLAP needs. Neo4j has strong graph affinity and is designed primarily for OLTP scenarios, though as a native graph database with strong read-scalability, it can also be suited to OLAP work.

However, the Hadoop community continues to foster innovation in the area of graph processing, and there are regular announcements about how Hadoop can be adapted towards solving graph problems. Recently Daniel Abadi publicised work from his team at Yale University on solving graph problems more efficiently with Hadoop.

This work is novel empirical science and presents an important observation: by skillfully partitioning data in HBase to exploit locality, (graph) computational throughput in Hadoop can be substantially increased. And yet casual observers of the NOSQL community could easily misread this as signalling the demise of graph databases, which appear to have much more modest throughput. I don't believe this is a valid comparison, however:

  • Hadoop is a batch processing framework, and operates at high latencies compared to graph databases (even real-time Hadoop involves latencies of seconds, compared to the millisecond scale at which Neo4j operates). The work done to improve graph processing through data locality means that batches will be executed more efficiently, and so throughput will be higher (or similar throughput will be achievable with fewer computational resources). Yet latency will remain comparatively high, and so this approach is unlikely to be well-suited to the on-demand processing (OLTP) that is the mainstay of most applications, where data latency is more helpfully measured in milliseconds. Instead it is likely to remain firmly in the OLAP domain for the foreseeable future.
  • For generating regular reports from a data warehouse or pre-computing results, batch processing can be a sensible strategy, especially if it can be made efficient through laying out data carefully. Making this efficient comes at a cost, namely that data has to be denormalised within HBase, expanding the cognitive gap between your data and how it is represented for processing. Conversely Neo4j works in OLAP scenarios consistently with how it works in OLTP scenarios - your OLTP database is your OLAP database (usually a read slave, with the same data model). This means Neo4j doesn't need denormalisation or special processing infrastructure, and for large read-queries like reporting jobs scales very well even under heavy and unpredictable online loads.
  • Batch-oriented approaches are best suited where data can be read and processed outside the database rather than manipulated in place. That is, efficiently processing static graph-like data (or triples) not only requires careful placement of data in HBase, but practically rules out mutating the graph during processing. In contrast, Neo4j supports in-place graph mutation, which is a more powerful tool for Web real-time analytics than (even efficiently processed) batches.
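The last point can be illustrated with a toy sketch in plain Python. It uses neither Neo4j's nor Hadoop's API: the `Graph` class, `friends_of_friends` and `batch_friends_of_friends` are hypothetical names standing in for an online graph store and a batch job respectively. The batch job precomputes an answer for every vertex from the graph as it stood when the job ran, so a later mutation is invisible until the next run, whereas a query against the live graph sees it immediately.

```python
class Graph:
    """A tiny in-memory undirected graph, standing in for an online graph store."""

    def __init__(self):
        self.adj = {}  # vertex -> set of neighbouring vertices

    def add_edge(self, a, b):
        # Mutate the graph in place; later queries see the change straight away.
        self.adj.setdefault(a, set()).add(b)
        self.adj.setdefault(b, set()).add(a)

    def friends_of_friends(self, start):
        """Online query: traverse the live graph outwards from a single vertex."""
        direct = self.adj.get(start, set())
        reachable = set()
        for friend in direct:
            reachable |= self.adj.get(friend, set())
        # Exclude the start vertex and the people it already knows directly.
        return reachable - direct - {start}


def batch_friends_of_friends(graph):
    """Batch job: precompute the answer for every vertex in one pass."""
    return {vertex: graph.friends_of_friends(vertex) for vertex in graph.adj}


g = Graph()
g.add_edge("alice", "bob")
g.add_edge("bob", "carol")

report = batch_friends_of_friends(g)   # batch result reflects the graph so far
g.add_edge("carol", "dave")            # the graph changes after the batch ran

print(report["bob"])                   # stale: computed before dave was added
print(g.friends_of_friends("bob"))     # live: includes dave
```

The batch version touches every vertex and so does far more total work per run, which is the throughput-oriented trade-off described above; the online version touches only the neighbourhood of one vertex per request.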

Bringing all of these sentiments together, it's clear that we're looking at two different tools for two different sets of problems. The Hadoop-based solution is batch-oriented processing at high throughput, with correspondingly high latency and substantial denormalisation. The Neo4j approach emphasises OLTP native graph processing with real-time OLAP and more modest throughput at very low latency (ms), and since work happens in the database it's always consistent.

So if you need OLTP and deep insight (OLAP-style) in near real-time at enterprise scale then Neo4j is a sensible choice. For niche problems where you can afford high latency in exchange for higher throughput, graph processing platforms like Pregel or Hadoop could be beneficial. But it’s important to understand that they are not the same.

Comments:
#

That's Apache Hadoop, to get its name right.

#

Thanks for this post. 

Where do you place Microsoft Trinity in the above figure?

#

@Pierre 

Aside from the fact that Trinity seems to be vapourware, I'd place it in the same quadrant as Pregel. 

Jim

#

Great post. 

Trinity is far from vaporware. It's actively used. It's as vaporware as Pregel is. They are both internal infrastructure technologies.

#

Where would you place InfiniteGraph from Objectivity in your graph?

#

Hi Bill, 

As I understand your product, I'd place it in the OLAP/Graph affinity quadrant.  

Jim

#

Hi Jim, 

Great post. 2 comments 

1. Could you give a real life example that uses neo4j as Olap in production like capacity. Would be great if you could give insight on read / write percentage.  

2. Also can any pregel work as Olap in front of neo4j that is proven as oltp.  

3. Also do you work at neo? 

2.

#

Hi Pragnesh, 

1. Yes, I've worked on some. But they're customer confidential. FWIW I came to Neo4j through requirements in the OLAP space. 

2. Yes. In this case Neo4j provides a good storage platform for OLAP graph processing in Pregel-like environments. 

3. Yes. It says so on my bio page. I've been Chief Scientist at Neo Technology for just over a year now.

#

1. Which Pregel like OLAP Solution will work here? Do you know any that have plugin to Neo4J? Also that OLAP must have bindings to nice reporting tools. 

2. Also how can I archive the information from Neo4J (like >1 year of data) that could be stored in some cold storage and activated only in case I need to run some analysis on the same. Keeping it in OLTP can potentially kill it with growth. I guess Neo4J has some limitation of 100B nodes from performance perspective.

#

Hi Pragnesh, 

1a. I don't know. I hope the GoldenOrb folks might do something, but I don't know of anything today. 

1b. JasperSoft has a Neo4j plugin (amongst others). 

2. That's not something Neo4j helps with today. You'd have to build your own import/export. 

Jim

#

Are there any videos / blogs that shows what is the integration with JasperSoft about? If there are other BI tools also it is fine too.

#

Hi Pragnesh, 

I don't have any links on that. I'd guess Google is your friend there. 

Jim

#

Hi Jim, why would you place InfiniteGraph in the OLAP/Graph quadrant vs the OLTP/Graph quadrant like neo4j?

#

Hi Luanne, 

From reading about InfiniteGraph, it's oriented towards very large graph processing. The Objectivity folks talk about its scalability at length. But they don't talk about its transactionality, or latency. 

That's what leads me to conclude that InfiniteGraph is OLAP oriented - very large datasets where latency and transactionality are less critical than dataset size. 

HTH, 

Jim
