Using Neo4j in professional web-based scientific software

I have been software architect and lead developer at Quantis since 2009, and I discovered Neo4j one year ago. This article is a non-exhaustive review of how we use Neo4j in the software we have been developing for the last 4 years. Its objective is to contribute to knowledge building within the Neo4j community.

Context

The core business of Quantis is to provide services and software to companies that want to assess the environmental footprint of their company, products and technologies. This could be for reporting (e.g. GHG protocol), improving products (eco-design) or simply communication. Footprinting is based on LCA, which, according to Wikipedia, is:

A life-cycle assessment (LCA, also known as life-cycle analysis, ecobalance, and cradle-to-grave analysis) is a technique to assess environmental impacts associated with all the stages of a product's life from-cradle-to-grave (i.e., from raw material extraction through materials processing, manufacture, distribution, use, repair and maintenance, and disposal or recycling). LCAs can help avoid a narrow outlook on environmental concerns by:

Compiling an inventory of relevant energy and material inputs and environmental releases;

Evaluating the potential impacts associated with identified inputs and releases;

Interpreting the results to help make a more informed decision.

Quantis SUITE 2.0 (QS2) is the proprietary software from Quantis whose objective is to let non-experts easily produce robust footprints. A full description of the software features is out of the scope of this article, but the main features related to our need for a graph database like Neo4j are:

Design hierarchical systems, e.g. to represent a product and its components as well as sub-components, using trees
Link the designed systems to internationally recognised databases enabling the conversion of technical information into environmental impacts
Dynamically explore computed results using OLAP-like features

After two years of development, it became clear that a classical SQL/JPA combo would never fit some of our specific long-term needs, like massive usage of trees, fast customization capabilities or easy plug-in development. The advanced scientific functions required by the software also seemed too specific to be handled by an existing SQL/OLAP solution. After exploring several other solutions, we finally decided last year to introduce Neo4j as our main database, while keeping a MySQL database for some parts.

Today, Neo4j is in our pre-production version, and will be used in production in the coming weeks.

Stack

Client:

Flex

Client/Server communication:

BlazeDS, using AMF protocol over HTTPS.

We use a double-validation concept to cope with two requirements: a rich client and security. Every client-side modification is validated by the client to give quick feedback and avoid useless communication with the server. Modifications are then sent to the server, where another validation is made, mainly to protect against attacks using hacked clients and against other sources of database corruption (client bugs or concurrent updates). Communications are made using the command pattern.
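To make the double-validation idea concrete, here is a minimal server-side sketch in Java. It is not the QS2 code: RenameCommand, CommandServer and the validation rules are hypothetical names and rules invented for illustration. The point is only that the server re-runs the validation on every received command, whatever the client claims to have checked.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical command carrying one client-side modification (not a QS2 class). */
final class RenameCommand {
    final String newName;

    RenameCommand(String newName) {
        this.newName = newName;
    }

    /** Illustrative rules; the client would run equivalent checks for quick feedback. */
    List<String> validate() {
        List<String> errors = new ArrayList<>();
        if (newName == null || newName.trim().isEmpty()) {
            errors.add("name must not be empty");
        } else if (newName.length() > 50) {
            errors.add("name must be at most 50 characters");
        }
        return errors;
    }
}

/** Hypothetical server endpoint: it never trusts the client and validates again. */
final class CommandServer {
    private String storedName = "untitled";

    /** Applies the command only if the server-side validation passes. */
    List<String> execute(RenameCommand command) {
        List<String> errors = command.validate();
        if (errors.isEmpty()) {
            storedName = command.newName.trim();
        }
        return errors;
    }

    String storedName() {
        return storedName;
    }
}
```

A command rejected by the server leaves the stored state untouched, so a hacked or buggy client cannot push invalid data into the database.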

App server:

Glassfish.

Less essential than it used to be, it will be replaced by a lighter layer during the next consolidation phase.

Server code:

Java, using libraries like Guava, Guice, Oval, MyBatis, POI, JAXB (oh, and Neo4j libs :P)

Databases:

MySQL and Embedded Neo4j

A specificity of the database implementation is the rule of one database per customer. While first implemented for mandatory security purposes (a difficult implementation: try doing multi-database with Glassfish and without Hibernate's magical multi-tenancy...), this finally turned out to be a really good choice for additional reasons (more about this below).

Why a graph database?

QS2 uses tree-based structures for more than half of the data. The other half can fit in a relational database, but we still need a complex schema for some parts, which is more easily handled with graphs.

Also, the model is highly hierarchical with a clear notion of ownership. Each atomic piece (a row in a table or a node in Neo4j) is related to only one parent, which contains at most a few hundred children. Graph databases are better suited to holding such data than SQL queries with dozens of joins or a denormalized schema.
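The ownership model described above can be pictured with a small in-memory tree. This is only an illustrative sketch in plain Java, not the QS2 model and not Neo4j code: TreeNode and its methods are hypothetical. Each node has exactly one parent, and walking a subtree follows direct references instead of joins.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tree node: one parent (null for the root), any number of children. */
final class TreeNode {
    final String name;
    private TreeNode parent;
    private final List<TreeNode> children = new ArrayList<>();

    TreeNode(String name) {
        this.name = name;
    }

    /** Creates a child owned by this node and returns it. */
    TreeNode addChild(String childName) {
        TreeNode child = new TreeNode(childName);
        child.parent = this;
        children.add(child);
        return child;
    }

    /** Path from the root down to this node, following parent links. */
    String path() {
        return parent == null ? name : parent.path() + "/" + name;
    }

    /** Counts this node plus all its descendants, depth-first. */
    int size() {
        int total = 1;
        for (TreeNode child : children) {
            total += child.size();
        }
        return total;
    }
}
```

With a product "Bottle" holding a "Cap" and a "Body", and the "Body" holding a "Label", the whole product is reached from the root in one traversal, which is the access pattern a graph database handles naturally.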

Why a schemaless database?

Even if we are doing SaaS, some of our customers are big enough to request customizations. Customizations include adding custom validation procedures with specific comment fields, or linking QS2 to other systems with connectors to retrieve existing data. Much easier if you only have to update the code, no?

We are also dealing with extension modules complementing the core software with specific complex business needs. With a schemaless database, modules can simply be activated on demand for a specific client, without thinking about deploying a complex SQL alter-script.

I agree that schemaless databases are not a miracle solution. They can even be worse than schema-based databases, because you have to think explicitly about how constraints are implemented. A bad implementation will easily lead to data corruption. So, warning here. Remember, NoSQL stands for Not Only SQL. Never use a NoSQL database because it's cool. (But because it's fun, you can :P).

Why Neo4j?

Easy to understand, easy to use, professional, and it works!

During our preliminary tests, another Java database seemed pretty cool and I gave it a try. This database can be used as a graph database or as a document database, with 6 different APIs to do whatever you want. Sadly, during some simple tests, we found 3 critical bugs, one of them pretty frightening for scientists like us (a missing parenthesis in a three-line 'if'). Neo4j offers far fewer features, but they are done well!

After one year of use, we have been annoyed by only one bug, which was already fixed in the next version at the time we discovered it.

Also, I have to thank the great Neo4j community, and offer them this blog post :)

Fear, Uncertainty and Doubt

Neo4j is definitely a great database, and we love it. Sure, we invested some time in building things around it, but the great strength of Neo4j is that we could do that easily, without complex constraints. But we also have some big question marks about both the database and our own code.

Fragmentation

When computing LCA results for a project, we need to traverse the entire project. But we don't need other projects. Now imagine that the first created project is updated over and over for a long time. During this period, more projects are created and completed. The first project will have nodes scattered here and there across the whole database, so a large part of the database may have to be loaded to retrieve only a few nodes.

It can be even worse if the whole database doesn't fit in RAM. We will have to be careful about this point.

Potential manual indexing issues

After some discussions on the mailing list, it seems that we have to be careful about indexes. When a node is deleted without deleting the corresponding entries in the index, the index still contains the entries pointing to this node. If you run a query on the index that should return the deleted node, the corresponding entries are automatically removed. So everything appears to be OK.

But here comes node ID reuse. If the database is restarted and new nodes are created between the deletion and the query, a node with the same node ID as the deleted one can be created. As indexes work with IDs, your index will be corrupted and the query will return the newly created node...

So, remember: when deleting a node, always remove the entries in all the indexes related to the node.
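The failure mode can be reproduced with a toy model in plain Java. This is a simulation under stated assumptions, not Neo4j code: ToyGraph, its maps and its methods are hypothetical, and IDs are reused immediately here, whereas the behaviour described above involves a database restart. It only shows why a stale index entry combined with ID reuse returns the wrong node.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Toy store with reusable IDs and a manual name index (not the Neo4j API). */
final class ToyGraph {
    private final Map<Long, String> nodes = new HashMap<>();     // node ID -> name
    private final Map<String, Long> nameIndex = new HashMap<>(); // name -> node ID
    private final Deque<Long> freedIds = new ArrayDeque<>();     // IDs available for reuse
    private long nextId = 0;

    /** Creates a node, reusing a freed ID when one is available, and indexes it. */
    long createNode(String name) {
        long id = freedIds.isEmpty() ? nextId++ : freedIds.pop();
        nodes.put(id, name);
        nameIndex.put(name, id);
        return id;
    }

    /** Unsafe deletion: the node goes away but its index entry stays behind. */
    void deleteNodeOnly(long id) {
        if (nodes.remove(id) != null) {
            freedIds.push(id);
        }
    }

    /** Safe deletion: the index entry is removed together with the node. */
    void deleteNodeAndIndexEntries(long id) {
        String name = nodes.remove(id);
        if (name != null) {
            nameIndex.remove(name);
            freedIds.push(id);
        }
    }

    /** Returns the name of the node the index points to, or null if none. */
    String lookup(String name) {
        Long id = nameIndex.get(name);
        return id == null ? null : nodes.get(id);
    }
}
```

After creating "A", calling deleteNodeOnly on it and creating "B", lookup("A") returns "B", because the stale entry now points at the reused ID. With deleteNodeAndIndexEntries the same lookup returns null.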

Database opening time & Warm-up

For now, we load the database when a user logs in. We don't really have time to warm it up. We will have to be careful about first-connection metrics, and see if we need to do something about that.

No ordered tree support

A wonderful thing would be to have a native relationship ordering feature :)
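Without native ordering, one common workaround is to store a position on each parent-child link and sort when reading. The sketch below illustrates that idea in plain Java. It is an assumption for illustration, not the QS2 approach and not a Neo4j feature: ChildLink, its position field and OrderedChildren are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical parent-child link carrying an explicit ordering value. */
final class ChildLink {
    final String childName;
    final int position; // would be stored as a property on the relationship

    ChildLink(String childName, int position) {
        this.childName = childName;
        this.position = position;
    }
}

final class OrderedChildren {
    /** Returns child names sorted by position, leaving the input list untouched. */
    static List<String> inOrder(List<ChildLink> links) {
        List<ChildLink> sorted = new ArrayList<>(links);
        sorted.sort(Comparator.comparingInt((ChildLink link) -> link.position));
        List<String> names = new ArrayList<>();
        for (ChildLink link : sorted) {
            names.add(link.childName);
        }
        return names;
    }
}
```

The cost of this design is that inserting a child in the middle means renumbering the following siblings, which is exactly the kind of bookkeeping a native ordering feature would remove.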

No blob support

Even if storing blobs is not the purpose of a graph database, it would be a great feature. Maybe it could be done with the same mechanism as the one used for indexes, relying on an existing key-value store and only implementing a kind of connector?

Final word

I strongly recommend Neo4j for highly relational data, and I thank Neo4j creators for all the work, as this product fits many of our needs. It would have been harder to do such a project without it! Keep going guys!
