World Wide Webber


My Books
REST in Practice: Hypermedia and Systems Architecture
Amazon:
US, UK
Developing Enterprise Web Services by Sandeep Chatterjee and Jim Webber
Amazon:
US, UK,
Also available: Korean Edition

My Bookshelf
RESTful Web Services Cookbook by Subbu Allamaraju
Programming Clojure by Stuart Halloway
RESTful Web Services by Leonard Richardson and Sam Ruby
Square Pegs and Round Holes in the NOSQL World
Posted: 21 April 2011 @ 17:10 UT from Seattle, US

The graph database space is a peculiar corner of the NOSQL universe. In general, the NOSQL movement has pushed towards simpler data models with more sophisticated computation infrastructure compared to traditional RDBMS. In contrast graph databases like Neo4j actually provide a far richer data model than a traditional RDBMS and a search-centric rather than compute-intensive method for data processing.

Strangely the expressive data model supported by graphs can be difficult to understand amid the general clamour of the simpler-is-better NOSQL movement. But what makes this doubly strange is that other NOSQL database types can support limited graph processing too.

This strange duality where non-graphs stores can be used for limited graph applications was the subject of a thread on the Neo4j mailing list, which was the inspiration for this post. In that thread, community members discussed the value of using non-graph stores for graph data particularly since prominent Web services are known to use this approach (like Twitter's FlockDB). But as it happens the use-case for those graphs tends to be relatively shallow - "friend" and "follow" relationships and suchlike. In those situations, it can be a reasonable solution to have information in your values (or document properties, columns, or even rows in a relational database) to indicate a shallow relation as we can see in this diagram:

1

At runtime, the application using the datastore  (remember: that’s code you typically have to write) follows the logical links between stored documents and creates a logical graph representation. This means the application code needs to understand how to create a graph representation from those loosely linked documents.

2

If the graphs are shallow, this approach can work reasonably well. Twitter's FlockDB is an existential proof of that. But as relationships between data become more interesting and deeper, this is an approach that rapidly runs out of steam. This approach requires graphs to be structured early on in the system lifecycle (design time), meaning a specific topology is baked into the datastore and into the application layer. This implies tight coupling between the code that reifies the graphs and the mechanism through which they're flattened in the datastore. Any structural changes to the graph now require changes to the stored data and the logic that reifies the data.

Neo4j takes a different approach: it stores graphs natively and so separates application and storage concerns. That is, where your documents have relationships between them, that's they way they're stored, searched, and processed in Neo4j even if those relationships are very deep. In this case, the logical graph that we reified from the document store can be natively (and efficiently) persisted in Neo4j.

3

What's often deceptive is that in some use-cases, projecting a graph from a document or KV store and using Neo4j might begin with seemingly similar levels of complexity. For example, we might create an e-commerce application with customers and items they have bought. In a KV or document case we might store the identifiers of products our customers had bought inside the customer entity. In Neo4j, we'd simply add relationships like PURCHASED between customer nodes and the product nodes they'd bought. Since Neo4j is schema-less, adding these relationships doesn’t require migrations, nor should it affect any existing code using the data. The next diagram shows this contrast: the graph structure is explicit in the graph database, but implicit in a document store.

4

Even at this stage, the graph shows its flexibility. Imagine that a number of customers bought a product that had to be recalled. In the document case we'd run a query (typically using a map/reduce framework) that grabs the document for each customer and checks whether a customer has the identifier for the defective product in their purchase history. This is a big undertaking if each customer has to be checked, though thankfully because it's an embarrassingly parallel operation we can throw hardware at the problem. We could also design a clever indexing scheme, provided we can tolerate the write latency and space costs that indexing implies.

With Neo4j, all we need to do is locate the product (by graph traversal or index lookup) and look for incoming PURCHASED relations to determine immediately which customers need to be informed about the product recall. Easy peasy!

As the e-commerce solution grows, we want to evolve a social aspect to shopping so that customers can receive buying recommendations based on what their social group has purchased. In the non-native graph store, we now have to encode the notion of friends and even friends of friends into the store and into the logic that reifies the graph. This is where things start to get tricky since now we have a deeper traversal from a customer to customers (friends) to customers (friends of friends) and then into purchases. What initially seemed simple, is now starting to look dauntingly like a fully fledged graph store, albeit one we have to build.

6

Conversely in the Neo4j case, we simply use the FRIEND relationships between customers, and for recommendations we simply traverse the graph across all outgoing FRIEND relationships (limited to depth 1 for immediate friends, or depth 2 for friends-of-friends), and for outgoing PURCHASED relationships to see what they've bought. What's important here is that it's Neo4j that handles the hard work of traversing the graph, not the application code as we can see in the diagram above.

But there's much more value the e-commerce site can drive from this data. Not only can social recommendations be implemented by close friends, but the e-commerce site can also start to look for trends and base recommendations on them. This is precisely the kind of thing that supermarket loyalty schemes do with big iron and long-running SQL queries - but we can do it on commodity hardware very rapidly using Neo4j.

For example, one set of customers that we might want to incentivise are those people who we think are young performers. These are customers that perhaps have told us something about their age, and we've noticed a particular buying pattern surrounding them - they buy DJ-quality headphones. Often those same customers buy DJ-quality decks too, but there's a potentially profitable set of those customers that - shockingly - don't yet own decks (much to the gratitude of their flatmates and neighbours I suspect).

With a document or KV store, looking for this pattern by trawling through all the customer documents and projecting a graph is laborious. But matching these patterns in a graph is quite straightforward and efficient – simply by specifying a prototype to match against and then by efficiently traversing the graph structure looking for matches.

6

This shows a wonderful emergent property of graphs - simply store all the data you like as nodes and relationships in Neo4j and later you'll be able to extract useful business information that perhaps you can't imagine today, without the performance penalties associated with joins on large datasets.

In these kind of situations, choosing a non-graph store for storing graphs is a gamble. You may find that you've designed your graph topology far too early in the system lifecycle and lose the ability to evolve the structure and perform business intelligence on your data. That's why Neo4j is cool - it keeps graph and application concerns separate, and allows you to defer data modelling decisions to more responsible points throughout the lifetime of your application.

So if you're fighting with graph data imprisoned in Key-Value, Document or relational datastores, then try Neo4j.

Comments:
#

Great explanation and illustration ... A question, in practice have you seen problems where there are advantages to storing data in parallel to both a quasi relational store (rdbms or document oriented nosql) and a graphDB? I'm thinking there are modes of thinking / views of the problem that could warrant separate specialization (e.g. Data warehousing / operational reporting). GraphDBs feel like the right tool for a lot, but not all data problems, so I'm wondering about system architectures for making them fit with existing applications. @kindageeky

#

Hi Steve, 

The polyglot persistence folks would certainly agree that there are good use-cases for mixed stores.  

From a systems perspective I agree with them. The systems we interact with are often composites under the covers and different parts of those systems will have radically different needs. For example, Amazon's Dynamo (and the databases it inspired like the wonderful Riak) are brilliant for being highly dependable for simple data, like shopping carts. But if we tried to use that same data model for introspecting customer behaviour, we end up writing map-reduce jobs and such introspection becomes batch-oriented. Instead, for the part of the composite system that deals with customer behaviour, I'd choose a graph database. A classic example of this is a recommendation engine, which needs to provide hints to customers while they're in the workflow (rather than emailing them offline). Graph databases support this because they're quick at mining information from connected data.  

So I think there are good systems architectural reasons to mix databases in a composite system. Where I feel mainstream NOSQL-ers are going a bit off the rails is when the write their own graph stores atop document databases. I think that's hard work, and the results are clunky compared to using a real graph store (unless you're doing big data stuff and latency is not an important consideration). 

Thanks for your comment. 

Jim

#

Wonderful article. Clearly explained and illustrated and adds a much needed and missing analysis to the NOSQL solutions people are throwing into place.

#

Jim, Thanks for the inspiring article. You've just justified my intuitive design decision for our Agent Modeling Platform. I knew it was the right thing to do, but I couldn't explain it to other people that well! All relations in that model (AMF) are specified through network/graph relations in contrast to the donor meta-model (EMF) which specifies relations in the more traditional way. Of course there are benefits to each approach and the beauty of a meta-model based approach is that you don't need to choose one or the other. 

One question.. you draw this out as a comparison between Search and Compute intensive. Perhaps these ways of looking at problem spaces are wrapped in some history but to me it looks like a distinction without much deep difference?

#

I'm using orient db as a graph database in my project as well. It's a nice (a little rougher around the edges) open source option for graph databases.

#

@reid 

Neo4j is open source too - it's freely available under the GPL. 

Jim

#

I knew nothing about computers back when the relational model started gaining traction, but I think the argument was basically the same. Instead of storing data in a way that matches the current needs of the app, decompose things into relations so you can later perform queries that you hadn't thought of before. 

This recent popularity of no-SQL is because the generality of SQL limits application-specific optimizations. No-SQL databases give you lower-level control on how your data is laid out and how your queries are executed. This is more work, more error-prone, and harder to modify, but it lets you push the performance envelope. 

For example, if the application never needs to know how many customers bought a specific product, then Neo4j is doing extra work (at some cost) for no benefit. So I guess I'm wondering how/if Neo4j breaks this cycle. 

For example, is there a declarative way to say "I don't need the incoming edge count for these nodes"? Or maybe the kinds of extra things Neo4j automatically keeps track of are known to be cheap?

#

@Kannan 

If the application never needs to know how many customers bought a specific product then the only cost is a few bytes to store those relationships on disk. No computation overhead in Neo4j. 

Saying "I don't care about those relationships" happens at query time - you program your queries (traversals in graph speak) to only traverse relationships that you care about. All others are ignored. 

Jim

#

So Neo4j doesn't maintain a count of incoming edges? I guess I was assuming that the "how many incoming edges" query was fast because that information was kept up to date, but I guess it isn't? 

What if we wanted that query to be fast? Would we keep a count ourselves? And maybe periodically do a consistency check with the "incoming edge count" query to make sure things stay in sync?

#

@Kannan 

Adding the "count" is strange for two reasons: 

1. What to count? Incoming relationships? Incoming relationships that are named FOO? Outgoing relationships named BAR with property FOOBAR > 99? 

2. This count (or more likely these counts) actually slow things down since they'd imply an extra write into the node. 

My advice here is leave it to the database. If you really want to count the relationships, then count them - if they're cached you'll be able to count them really quickly. 

Jim

#

Evaluating neo4j for storing a social connection graph and have been very impressed with it till now.

#

You used "reify" again. Awesome!

#

Set 1: You are the complement of me. 

Set 2: Shutup. You had me at 'reify'. 

Bloody good mate. 

However, how do you feel about this? Recalling my "ORMs (or any mappers from application model to persistence layer) are in the wrong place in the architecture" and mapping penalty should be paid externally, if the relationships are rarely used, particularly in the online system, the graph is excessive and recording all relationships we conceive serves no value. 

In other words, reify from a graph in an online store only if you want that. Otherwise ETL and do it in an analytic / reporting context.

#

The cost of storing a relationship is basically zero in the general case. So storing them and perhaps availing yourself of future serendipitous uses is most likely a very reasonable strategy.

#

Been trying and failing to explain the benefits of graphing to the powers that be here at HX - How all i have to do is point them to this article.  

You're my hero and when I grow up I want to be just like you (but with more hair)

#

can we have multiple graphs at design time it self? 

for better management and solving the sharding problem in advance? 

eg: 

a dedicated graph for each user as in your example. this graph tracks the user profile, behaviour etc and another graph which stores global information , all the graphs evolve simultaneously and can be linked to each other?

#

Does document-oriented NoSQL means that Database management that uses a JSON model and gives you reasonably robust access to individual field values inside a JSON (JavaScript Object Notation) object?

#

@Jim, 

What is reify?  

What is the limitation on the number of edges a node can have in neo4j? What should be the ideal approach?  

You mentioned that traversal between nodes based on edges is managed by neo4j automatically. Does it maintain this traversal information in a separate meta-data section?  

Sorry, lots of questions; trying to get hang of neo4j & graph db at the same time. 

Thanks, 

Nayak

#

@Nayak, 

Reify == to project, to make real 

Neo4j currently can hold 32 billion nodes, 32 billion relationships, and 64 billion properties. 

Traversal between nodes is not "automatic" but Neo4j has a traversal framework to help you construct graph queries (aka traversals). It also supports the query languages Gremlin and Cypher. 

HTH, 

Jim

Author Name:
Email:
Author URL:
Comment:
Antispam:
Please type the following string (note that if the strings don't match, your comment will be lost... sorry!): 'EHTJ'.
 
Recent entries

Recent comments

Feeds:
RSS 2.0 Atom