ModeShape

An open-source, federated content repository

Improving performance with large numbers of child nodes

A JCR repository is by definition a hierarchical database, and it’s important that the hierarchical node structure be properly designed to maintain good functionality and performance. If the hierarchy is too deep, you’re applications will spend a lot of time navigating lots of nodes just to find the one they’re interested in. Or, if the hierarchy is too wide, then there will be lots of children under a single parent and this large parent might become a performance bottleneck.

Unfortunately, it’s difficult to come up with hard and fast rules about what it means for a repository structure to be “too deep” or “too wide”. In this post we talk in detail about the performance of accessing a single node with lots of children, but applications rarely access just one node at a time. Instead, most applications access multiple nodes when performing most application-specific operations, and these patterns will greatly affect the total performance of the application and repository. So no matter what, measure the performance of your application using a variety of repository designs.

How does the number of child nodes affects performance?

ModeShape stores each node separately in the persistent store, and each node representation stores a reference to its parent and to all children. The parent reference makes it easy to walk up the tree, while the list of children makes it fast to walk children and to maintain the order of the children even as nodes are reordered and nodes with same-name-sibilngs are used.

A parent node that has 10s of thousands of children will thus have a pretty large representation in the persistent store, and this adds to the cost of reading and writing that representation. This is why we recommend not having large numbers of children under a single parent.

ModeShape does have the ability to break up the list of children into segments, and to store these segments separately from the parent node. This behavior is not enabled by default, but it can be enabled as a background optimization process.

Avoiding large numbers of child nodes

Sometimes it’s quite easy to design a node structure that doesn’t have parent nodes with large numbers of children. A blog application might organize the posts by date (e.g., “/posts/{year}/{month}/{date}/{title}”), and this works quite well simply because at every level the number of children is limited. For example, there will never be more than 12 nodes under a given year, and never more than 31 nodes under a given month. Also, it is unlikely to create many posts on a single day, so the number of titles under a given day will even be quite small.

While there are many data structures that can naturally organize your hierarchy of nodes, there are situations where there is no obvious natural hierarchy. Consider an application that maintains customers, where each customer is identified by a unique identifier. Your application may be able to organize the customer by region, by date, or by some other characteristic. But that’s not always possible or ideal. In that case, it may be useful to base the hierarchy on an artificial characteristic.

Consider the common case where the identifiers are UUIDs, which are unique and very easily generated. UUIDs are also very nicely and uniformly distributed, meaning that the characters of the hexadecimal form (e.g., “eb751690-23cb-11e4-8c21-0800200c9a66″) of two consecutively generated UUIDs will differ in most of the characters. We can exploit the hexadecimal representation and the uniform distribution of UUIDs to create a hierarchical structure that can store a lot of nodes with just a few levels in the hierarchy.

For example, if we use the first two hexadecimal characters as the name of our first level of nodes, and the next two characters for the second level of our node structure, then we can easily store 1 million nodes in a structure that never has more than 256 children under a single parent. The customer with ID “eb751690-23cb-11e4-8c21-0800200c9a66″ can be found by turning the ID into a path:

/customers/eb/75/1690-23cb-11e4-8c21-0800200c9a66

We could vary the design to use 3 characters for the first level and no second intermediate level. That means we can store our 1M nodes with fewer intermediate nodes, while still ensuring that the first level contains no more than 4096 children, while each of those intermediate nodes contains around 256 children. That same customer would be found at:

/customers/eb7/51690-23cb-11e4-8c21-0800200c9a66

Or, we might try 4 levels each with a single character, resulting in a lot more intermediate nodes but each with a very small number of children. Then, that same customer would be found at:

/customers/e/b/7/5/1/690-23cb-11e4-8c21-0800200c9a66

The point is that you can often create a hierarchy that does not require parent nodes with large numbers of children. Of course, if you’re whole hierarchy is designed around these artificial traits and no natural traits, then you may be misusing a hierarchical database and might consider other technologies.

Designing with large numbers of child nodes

Sometimes almost all of your hierarchy design will use the natural traits of the data to create a nice hierarchy, but you have one area or level at which you’d like to store parents with relatively large numbers of child nodes under. If you’re careful and follow these guidelines, you may be able to design it so that ModeShape still performs well for your application without having to use artificial traits.

One of the more expensive operations is adding a child to a parent that already has lots of children. This is because the JCR specification requires that ModeShape validate a number of things before allowing the new child. But with proper design, you can minimize or even eliminate much of that expensive validation.

  • The parent’s primary type and mixins should have a total of one one child node definition, and it should allow same-name-siblings. When this is the case, the single child node definition means that ModeShape can use optimal algorithm that is much faster than 2 or more child node definitions. Also, because the child node definition allows SNS, ModeShape does not have to determine if there is already an existing child with the same name (can be very expensive) before it can pick the child node definition. It also means that when saving changes to the parent, ModeShape doesn’t have to re-validate that there are no children with duplicate names. This saves a tremendous amount of time.
  • Large parent should not be versionable. When a parent contains lots of children, make sure that the parent’s node types and mixins are not mix:versionable, and that all child node definitions have an on parent versioning of ignore. This allows ModeShape to speed up quite a few potentially-expensive operations, including addNode(…).
  • Do not use same name siblings. Even though the node types would allow it, we recommend not using same-name-siblings and having your node structure design or your application ensure that you don’t add duplicates. For example, if your node structure uses UUIDs or SHA-1s as subpaths, the nature of those values ensures that there will not be clashes.
  • Add children in batches. ModeShape can very quickly add lots of nodes using a single save operation. For example, it only takes a few seconds to add 10k child nodes under one parent using a single session and a single save. Use as large of batches as possible. Even when repeating that many times (e.g., adding 200k child nodes under one parent using batches), the performance is pretty quick. On the other hand, it is far more expensive and time consuming to add 200k nodes one at a time.
  • When possible, add multiple children under the same parent before creating other nodes. When ModeShape adds a child node with a given name and primary type under a parent, it has to look at the parent’s primary type and mixins to determine if a child node definition allows adding that child. We’ve added some improvements in ModeShape 3.8.1 and later so that ModeShape caches in each thread the last primary type and mixins that were used previously, and this saves a lot of time to add lots of children under the same parent using one session (even across multiple saves).
  • Do not use versioning. JCR’s versioning actually makes a lot of operations quite expensive. For example, before a child can be added or even before a property can be modified, ModeShape has to make sure that the node nor any of its ancestors are checked out. If any of the ancestors have large numbers of children, materializing that node could be very expensive. In ModeShape 3.8.1 and later, we’ve added an optimization to completely skip these checks when there are no version histories.

Other operations, like getting the path of a node, can also be expensive if any of the ancestors is large or expensive to read. ModeShape normally caches nodes, and if they’re frequently used they’ll stay in the cache. But these cached representations are discarded as soon as the node is modified. This is why adding or modifying nodes can impact read performance.

Use the latest version of ModeShape

As mentioned above, ModeShape 3.8.1 and later will include a number of changes that will improve ModeShape’s overall performance and, especially, performance when working with parents that have lots of children. Look for ModeShape 3.8.1 in the next month or so, and more 4.0 pre- and final releases over the next few months.

 

 

About these ads

Filed under: features, jcr, performance, techniques

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

ModeShape is

a lightweight, fast, pluggable, open-source JCR repository that federates and unifies content from multiple systems, including files systems, databases, data grids, other repositories, etc.

Use the JCR API to access the information you already have, or use it like a conventional JCR system (just with more ways to persist your content).

ModeShape used to be 'JBoss DNA'. It's the same project, same community, same license, and same software.

ModeShape

Topics

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: