An open-source, federated content repository

ModeShape 4.0.0.Beta1 is available

The ModeShape community is proud to announce the immediate availability of our latest 4.0 pre-release, ModeShape 4.0.0.Beta1. The JARs and other artifacts are available in the JBoss Maven repository and in our downloads area, and will soon be in Maven Central. See our Getting Started guide for details.

Thanks to our whole community for the work that’s gone into this release!

What’s new?

This beta release contains 46 fixes and a few new features. In the previous alpha releases we previewed a new query engine that no longer indexes everything in the repository as 3.x did, but instead always executes a query even if no indexes are available or can be used. With this beta release, you can now explicitly define indexes that cover only the information necessary to make your queries faster. This means lower overhead, a smaller footprint, and more efficient query processing.

Here’s a rundown of the most important features and changes in 4.0 so far:

  • JDK 7 – ModeShape 4.0 requires JDK 7. We’ve not yet begun testing with Java 8, but we’d be happy to hear about it if you do.
  • Queries – The new query engine is more capable than in 3.x, and it buffers results off-heap to prevent large queries from exhausting your application’s memory. The engine still supports a variety of query languages, though JCR-SQL2 is still the most powerful and with 4.0 has a few more extensions. Explicitly define indexes to make your queries faster. All indexes are stored locally on the file system, and in clustered repositories each process in the cluster maintains its own copy of the indexes. In 4.1 we’ll start offering the ability to mix these with indexes stored in Solr, ElasticSearch, and/or Lucene.
  • Clustering – Configuring a cluster of ModeShape repositories is even easier. All configuration is done within Infinispan’s clustering setup – if Infinispan is clustered, then ModeShape is part of the same cluster. We’ve also upgraded to a newer version of JGroups.
  • Journaling – ModeShape has a new event journal mechanism that helps new (or returning) processes in a cluster come up to speed with the history of events. You get all this with no work on your part, but your applications can also use the new feature via the JCR 2.0 event journal API. This is a great alternative to JCR event listeners, which in some situations might be very expensive or time-consuming.
  • Event bus – We’ve completely rewritten the way ModeShape repositories internally handle events. We now use a ring buffer that is substantially faster than what we had in 3.x. There’s no change in the event APIs so your listener implementations will continue to work unchanged – they’ll just be faster. The speed improvement is important, because we’re internally using listeners in more areas.
  • Infinispan – We’ve moved to Infinispan 6.0.x.Final, which is faster and has new cache stores. Some older and poorly performing cache stores are no longer valid, so check out the new file-based cache stores. Also, the LevelDB cache store is reportedly very fast.
  • Wildfly 8 – You can embed ModeShape within your applications, or you can install ModeShape as a subsystem within Wildfly so that your web apps and services can store and access content via the JCR API.
  • Repository Explorer – We’ve redesigned this web application to be much more usable.

The set of 4.0 alpha and beta releases also includes 118 bug fixes and other improvements.

What’s next?

Over the next few weeks we’ll keep fixing bugs and trying to stabilize the release. But as of today, there are just a few outstanding issues. Not only is the codebase already pretty stable, but we may be able to get to the final release pretty quickly.

What can you do?

Simple: test this release. Download it, use it, try the new features, and put it through its paces. See how Infinispan 6 works and how much faster it is, and try one of the new high-performance cache stores. Try out ModeShape in Wildfly 8. Give queries a whirl, and let us know if there are any queries that worked in 3.x but no longer work in 4.x; remember they’ll probably be slower than in 3.x until you explicitly add indexes that cover your query constraints.

Filed under: features, jcr, news, releases

Improving performance with large numbers of child nodes

A JCR repository is by definition a hierarchical database, and it’s important that the hierarchical node structure be properly designed to maintain good functionality and performance. If the hierarchy is too deep, your applications will spend a lot of time navigating lots of nodes just to find the one they’re interested in. Or, if the hierarchy is too wide, then there will be lots of children under a single parent and this large parent might become a performance bottleneck.

Unfortunately, it’s difficult to come up with hard and fast rules about what it means for a repository structure to be “too deep” or “too wide”. In this post we talk in detail about the performance of accessing a single node with lots of children, but applications rarely access just one node at a time. Instead, most applications access multiple nodes when performing most application-specific operations, and these patterns will greatly affect the total performance of the application and repository. So no matter what, measure the performance of your application using a variety of repository designs.

How does the number of child nodes affect performance?

ModeShape stores each node separately in the persistent store, and each node representation stores a reference to its parent and to all children. The parent reference makes it easy to walk up the tree, while the list of children makes it fast to walk children and to maintain the order of the children even as nodes are reordered and nodes with same-name-siblings are used.

A parent node that has tens of thousands of children will thus have a pretty large representation in the persistent store, and this adds to the cost of reading and writing that representation. This is why we recommend not having large numbers of children under a single parent.

ModeShape does have the ability to break up the list of children into segments, and to store these segments separately from the parent node. This behavior is not enabled by default, but it can be enabled as a background optimization process.

Avoiding large numbers of child nodes

Sometimes it’s quite easy to design a node structure that doesn’t have parent nodes with large numbers of children. A blog application might organize the posts by date (e.g., “/posts/{year}/{month}/{date}/{title}”), and this works quite well simply because at every level the number of children is limited. For example, there will never be more than 12 nodes under a given year, and never more than 31 nodes under a given month. Also, you’re unlikely to create many posts on a single day, so the number of titles under a given day will also be quite small.
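
As a rough illustration of the date-based layout, here is a minimal Java sketch that turns a publication date and a title into such a path. The class name, the slug helper, and the zero-padding of month and day are assumptions made for the example (as is the use of java.time, which needs Java 8 or later); nothing here is part of ModeShape or the JCR API.

```java
import java.time.LocalDate;
import java.util.Locale;

/** Illustrative only: builds "/posts/{year}/{month}/{date}/{title}" style paths. */
final class PostPaths {
    private PostPaths() {}

    /** Returns the path for a post, zero-padding month and day (an assumption, so names sort naturally). */
    static String pathFor(LocalDate published, String title) {
        return String.format(Locale.ROOT, "/posts/%04d/%02d/%02d/%s",
                published.getYear(),
                published.getMonthValue(),
                published.getDayOfMonth(),
                slug(title));
    }

    /** Hypothetical helper: lower-cases the title and collapses non-alphanumeric runs into '-'. */
    static String slug(String title) {
        String dashed = title.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "-");
        return dashed.replaceAll("^-+|-+$", ""); // trim leading/trailing dashes
    }
}
```

An application would pass the resulting string to whatever code creates or looks up the node; no year node can then have more than 12 children and no month node more than 31.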

While there are many data structures that can naturally organize your hierarchy of nodes, there are situations where there is no obvious natural hierarchy. Consider an application that maintains customers, where each customer is identified by a unique identifier. Your application may be able to organize the customer by region, by date, or by some other characteristic. But that’s not always possible or ideal. In that case, it may be useful to base the hierarchy on an artificial characteristic.

Consider the common case where the identifiers are UUIDs, which are unique and very easily generated. UUIDs are also very nicely and uniformly distributed, meaning that the characters of the hexadecimal form (e.g., “eb751690-23cb-11e4-8c21-0800200c9a66”) of two consecutively generated UUIDs will differ in most of the characters. We can exploit the hexadecimal representation and the uniform distribution of UUIDs to create a hierarchical structure that can store a lot of nodes with just a few levels in the hierarchy.

For example, if we use the first two hexadecimal characters as the name of our first level of nodes, and the next two characters for the second level of our node structure, then we can easily store 1 million nodes in a structure that never has more than 256 children under a single parent. The customer with ID “eb751690-23cb-11e4-8c21-0800200c9a66” can be found by turning the ID into a path:


We could vary the design to use 3 characters for the first level and no second intermediate level. That means we can store our 1M nodes with fewer intermediate nodes, while still ensuring that the first level contains no more than 4096 children, and each of those intermediate nodes contains about 244 children on average (1,000,000 / 4,096). That same customer would be found at:


Or, we might try 4 levels each with a single character, resulting in a lot more intermediate nodes but each with a very small number of children. Then, that same customer would be found at:
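
All three layouts can be derived mechanically from the identifier. The following is a minimal Java sketch under a few assumptions: the root path "/customers" used in the usage note is hypothetical, the leaf node is named with the full UUID, and the class and method names are invented for the example.

```java
import java.util.UUID;

/** Illustrative only: shards a UUID into intermediate path levels by its leading hex characters. */
final class ShardedPaths {
    private ShardedPaths() {}

    /**
     * @param root   parent path under which the shards live (hypothetical, e.g. "/customers")
     * @param id     the identifier to place in the hierarchy
     * @param widths number of hex characters consumed by each intermediate level, in order
     */
    static String pathFor(String root, UUID id, int... widths) {
        String text = id.toString();        // canonical lower-case form with hyphens
        String hex = text.replace("-", ""); // 32 hex characters
        StringBuilder path = new StringBuilder(root);
        int pos = 0;
        for (int width : widths) {
            path.append('/').append(hex, pos, pos + width);
            pos += width;
        }
        // Assumption: the leaf keeps the full identifier as its name.
        return path.append('/').append(text).toString();
    }

    /** Upper bound on children of a level whose names are `width` hex characters long: 16^width. */
    static long maxChildren(int width) {
        return 1L << (4 * width);
    }
}
```

With the example customer ID, `pathFor("/customers", id, 2, 2)` gives the two-level layout, `pathFor("/customers", id, 3)` the single 3-character level, and `pathFor("/customers", id, 1, 1, 1, 1)` the four single-character levels. `maxChildren` returns 256, 4096, and 16 for widths of 2, 3, and 1 respectively.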


The point is that you can often create a hierarchy that does not require parent nodes with large numbers of children. Of course, if your whole hierarchy is designed around these artificial traits and no natural traits, then you may be misusing a hierarchical database and might consider other technologies.

Designing with large numbers of child nodes

Sometimes almost all of your hierarchy design will use the natural traits of the data to create a nice hierarchy, but you have one area or level at which you’d like to store parents with relatively large numbers of child nodes. If you’re careful and follow these guidelines, you may be able to design it so that ModeShape still performs well for your application without having to use artificial traits.

One of the more expensive operations is adding a child to a parent that already has lots of children. This is because the JCR specification requires that ModeShape validate a number of things before allowing the new child. But with proper design, you can minimize or even eliminate much of that expensive validation.

  • The parent’s primary type and mixins should have a total of exactly one child node definition, and it should allow same-name-siblings. With a single child node definition, ModeShape can use an optimal algorithm that is much faster than the one needed for 2 or more child node definitions. Also, because the child node definition allows SNS, ModeShape does not have to determine whether there is already an existing child with the same name (which can be very expensive) before it can pick the child node definition. It also means that when saving changes to the parent, ModeShape doesn’t have to re-validate that there are no children with duplicate names. This saves a tremendous amount of time.
  • Large parents should not be versionable. When a parent contains lots of children, make sure that the parent’s primary type and mixins do not include mix:versionable, and that all child node definitions have an on-parent-version setting of IGNORE. This allows ModeShape to speed up quite a few potentially expensive operations, including addNode(…).
  • Do not use same name siblings. Even though the node types would allow it, we recommend not using same-name-siblings and having your node structure design or your application ensure that you don’t add duplicates. For example, if your node structure uses UUIDs or SHA-1s as subpaths, the nature of those values ensures that there will not be clashes.
  • Add children in batches. ModeShape can very quickly add lots of nodes using a single save operation. For example, it only takes a few seconds to add 10k child nodes under one parent using a single session and a single save. Use batches that are as large as possible. Even when repeating that many times (e.g., adding 200k child nodes under one parent using batches), the performance is pretty quick. On the other hand, it is far more expensive and time-consuming to add 200k nodes one at a time.
  • When possible, add multiple children under the same parent before creating other nodes. When ModeShape adds a child node with a given name and primary type under a parent, it has to look at the parent’s primary type and mixins to determine if a child node definition allows adding that child. We’ve added some improvements in ModeShape 3.8.1 and later so that ModeShape caches in each thread the primary type and mixins that were used most recently, and this saves a lot of time when adding lots of children under the same parent using one session (even across multiple saves).
  • Do not use versioning. JCR’s versioning actually makes a lot of operations quite expensive. For example, before a child can be added or even before a property can be modified, ModeShape has to verify the checked-out state of the node and of each of its ancestors. If any of the ancestors have large numbers of children, materializing that node could be very expensive. In ModeShape 3.8.1 and later, we’ve added an optimization to completely skip these checks when there are no version histories.
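
The batching guideline can be sketched in plain Java as follows. This is only an outline of the pattern, not ModeShape code: the BatchWriter interface is a hypothetical stand-in for the part of your application that, for each batch, would call the standard JCR methods Node.addNode(…) for every child and then Session.save() once.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: groups child names into batches so that each batch needs a single save. */
final class BatchedAdds {
    private BatchedAdds() {}

    /** Hypothetical callback: add all of these children under one parent, then save once. */
    interface BatchWriter {
        void addAndSave(List<String> childNames);
    }

    /**
     * Splits the names into consecutive batches of at most batchSize entries and hands each
     * batch to the writer. Returns the number of batches (and therefore saves) performed.
     */
    static int addInBatches(List<String> childNames, int batchSize, BatchWriter writer) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int saves = 0;
        for (int start = 0; start < childNames.size(); start += batchSize) {
            int end = Math.min(start + batchSize, childNames.size());
            // Copy the slice so the writer never sees a view of the caller's list.
            writer.addAndSave(new ArrayList<>(childNames.subList(start, end)));
            saves++;
        }
        return saves;
    }
}
```

With 200k names and a batch size of 10k, the writer is invoked 20 times instead of 200k times. The right batch size depends on available memory and on the content being written, so measure before settling on one.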

Other operations, like getting the path of a node, can also be expensive if any of the ancestors is large or expensive to read. ModeShape normally caches nodes, and if they’re frequently used they’ll stay in the cache. But these cached representations are discarded as soon as the node is modified. This is why adding or modifying nodes can impact read performance.

Use the latest version of ModeShape

As mentioned above, ModeShape 3.8.1 and later will include a number of changes that will improve ModeShape’s overall performance and, especially, performance when working with parents that have lots of children. Look for ModeShape 3.8.1 in the next month or so, and more 4.0 pre- and final releases over the next few months.



Filed under: features, jcr, performance, techniques

ModeShape is

a lightweight, fast, pluggable, open-source JCR repository that federates and unifies content from multiple systems, including file systems, databases, data grids, other repositories, etc.

Use the JCR API to access the information you already have, or use it like a conventional JCR system (just with more ways to persist your content).

ModeShape used to be 'JBoss DNA'. It's the same project, same community, same license, and same software.