Strategies for Scaling Neo4j

As I’ve discussed before, graph databases like Neo4j can be lack the same predictability in terms of scaling when compared to other kinds of NOSQL stores (that’s the cost of a rich data model). But with a little thought, we’ve seen how both cache-sharding and the application of domain-specific strategies can help to improve throughput and increase the dataset size that Neo4j can store and process.

Those blog posts triggered some very useful discussion on the Neo4j mailing list, with several community members adding their own thoughts and experiences. In particular Mark Harwood suggested a simple heuristic for deciding a scaling strategy, that I thought was so useful I’d share it (with his permission) here.

  1. Dataset size: Many tens of Gigabytes
    Strategy: Fill a single machine with RAM
    Reasoning: Modern racks contain a lot of RAM, of the order of 128GB typically for a typical machine. Since Neo4j likes to use RAM to cache data, where O(dataset) ≈ O(memory) we can keep all the data cached in-memory and operate on it at extremely high speed.
    Weaknesses: Still need to cluster for availability; write scalability limited by disk performance.
  2. Dataset size: Many hundreds of Gigabytes
    Strategy:Cache sharding
    Reasoning: The dataset is too big to hold all in RAM on a single server but small enough to allow replicating it on disk across machines. Using cache sharding improves the likelihood of reaching a warm cache which maintains high performance. The cost of a cache miss is not catastrophic (a local disk read), and can be mitigated by replacing spinning disks with SSDs.
    Weaknesses: Consistent routing requires a router on top of the Neo4j infrastructure; write-master in the cluster is a limiting factor in write-scalability.
  3. Dataset size: Terabytes and above
    Strategy: Domain-specific sharding
    Reasoning: At this scale, the dataset is too big for a single memory space and it’s too big to practically replicate across machines so sharding is the only viable alternative. However given there is no perfect algorithm (yet) for arbitrary sharding of graphs, we rely on domain-specific knowledge to be able to predict which nodes should be allocated to which machines.
    Weaknesses: Not all domains may be amenable to domain-specific sharding

As an architect, I really like this heuristic. It’s easy to find where I am on the scale and plan a Neo4j deployment accordingly. It also provides quite an exciting pointer towards the future – while I think most enterprise-scale deployments are currently in the tens to hundreds of gigabytes range, there are clearly applications out there for connected data that do – or will – require more horsepower.

Neo4j 2.0 will address these challenges, it’s going to be a fun ride. But until then, I hope you’ll find this as helpful as heuristic as I did.

Posted in neo4j, NOSQL
2 comments on “Strategies for Scaling Neo4j
  1. Vishal Makadia says:

    I will want to use neo4j with approximetly 80 billions of modes and that much relationships.Can you give me some ideas,so that I fulfill my requirements.
    i)How can I shard my neo4j database?(How can I implement domain specific sharding?)
    ii)How can I implement neo4j High availability.Is node is replicated across different database instances?
    iii)If I will make 80 billions of nodes,what is the memory and disk requirements.Can you give me some ideas where all 80 billions of nodes resides(In memory or in disk)?
    Thnaks in advance..

  2. jim says:

    @Vishal

    To shard the database, you first observe some properties from your domain which are good for sharding. Good candidates for sharding often emerge from considering the queries you want to run (hence ultimately from the domain).

    Then you must create a convention to describe “proxy” nodes whose real implementation is on another instance of Neo4j (this could be as simple as a well-known property in Neo4j 1.9 or a label in Neo4j 2.0). Your application must be written to obey this convention. On receiving a proxy node, it fires off a request at the machine that hosts that node. Your choice of sharding key determines how many such proxies will be used at runtime, so it’s worth giving it due consideration – queries that cross machine boundaries will, of course, be slower than queries that run in a single instance.

    In terms of operating a cluster or multi-cluster (when you shard), Neo4j provides High Availability as a master/slave cluster. You can read up on that here: http://docs.neo4j.org/chunked/stable/ha-how.html. And finally there is a hardware calculator that helps with sizing of machines here: http://info.neotechnology.com/CalculatorLandingPageV2.html.

    Hope that helps.

    Jim

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>