ModeShape

An open-source, federated content repository

ModeShape 5’s persistence changes

Starting with ModeShape 3 in early 2012, all repository nodes were internally represented using JSON documents and stored as BSON values in Infinispan. Although we relied upon some Infinispan features, for the most part ModeShape was merely storing its data using a very basic key-value API.

As ModeShape evolved through the 3.x and 4.x versions, we started having some data persistence issues that were largely outside of our control. ModeShape could be deployed inside JBoss AS (eventually known as Wildfly), so we chose our version of Infinispan based upon the version that was shipped with JBoss AS. Unfortunately, when we found bug in Infinispan, those bugs would be fixed in releases that were not yet included in JBoss AS, meaning we couldn’t get the fixes for quite some time. Using Infinispan also made the repository configuration and internals quite complex. Plus, changes in Infinispan’s persistence stores sometimes meant that persisted data could not be read by newer versions of Infinispan.

But most importantly, in certain situations we saw data corruption render a repository’s content largely unusable. This is a complicated issue that we previously outlined in detail this forum post and this issue.

Therefore, our primary goal with ModeShape 5 was to make sure that the repository data is stored in a more durable and strongly consistent manner that avoided the aforementioned corruption issues. This meant that we had to take a more conservative approach to persistence and give up claims of high scalability and performance (which are fine with eventual consistency, but not with strong consistency which is a must-have for ModeShape).

Already having the design of storing BSON documents in a key-value store helped us a lot, since it meant we only had to come up with transactional, strongly consistent, key-value store alternatives.

ModeShape 5’s initial release  comes with three such stores out of the box.

1. RDBMS store

This was the obvious choice, since relational databases provide strong consistency guarantees with good transactional support, at least with READ_COMMITTED isolation level that ModeShape requires. Enterprise users still trust and use relational databases a great deal. Using a relational database store meant users can still cluster multiple repositories together, as long as all those repositories use the same shared database.

ModeShape 5 comes out-of-the-box with support for H2, Oracle, MySql and PostgreSQL. Repository data is persisted in the form of BLOBs using the same internal BSON format we’ve used since ModeShape 3. We’ve also designed the store in such a way so that in the future, we can add specialized storage types that take advantage of the capabilities of different databases (for example PostgresSQL’s JSONB in 9.5 and above).

Configuration is again much simpler than ModeShape 3 and 4 with an equivalent store, as can be seen from the documentation.

2. File system store

This store uses an embedded H2 database to persist information on the local disk. Internally we use its very nice MVStore API, the lower-level key-value engine used within H2’s normal relational and SQL engine. It provides good transactional support and stores/streams binary objects (like our BSON documents) with optional features like compression and encryption.

For users which don’t want to store the repository data in a RDBMS and who also aren’t interested in clustering, this should be the default go-to store in ModeShape 5.

Configuring such a store is trivial and doesn’t require any additional configuration files (see our documentation for examples)

3. Transient in-memory key-value store

This is the default store when nothing is explicitly configured, and data is only persisted in-memory and is lost as soon as the process stops. Therefore, this is not suitable for production but is a very simple and natural option for testing and exploration. Internally, it uses H2’s MVStore API without persistence.

Other key-value stores

We can add support for other key-values stores in the future, provided they:

  • are strongly consistent;
  • support ACID transactions; and
  • run on Java 8 or above

We’re also happy to hear any suggestions or to evaluate any contributions from our community members.

Performance

Our preliminary tests indicate that all the above stores perform at least as well as their previous Infinispan counterparts in local, non-clustered modes. In fact, they should perform better in write-intensive cases while probably performing slightly slower for read-intensive cases, since Infinispan always had an in-memory cache layer on top of every store.

Which should you use?

We recommend using the file store for non-clustered cases. It’s simple, fast, and doesn’t require an external process. A second option to consider is the JDBC store with an embedded H2 database.

When clustering however, the only suitable option is the relational store with a shared JDBC store. As outlined above and as mentioned in the documentation, strict serializability required by the JCR API comes at a cost: all cluster members must coordinate their operations and use a shared persistent store. To help provide this coordination and to avoid write-contention on the same nodes, ModeShape employs global cluster locking (via JGroups) to ensure nodes can only be modified by one cluster member at a time. We believe that this is only way in which we can ensure the JCR consistency  requirements when running in a cluster.

 

 

 

 

 

Advertisements

Filed under: features, performance, releases, repository, uncategorized,

Improving performance with large numbers of child nodes

A JCR repository is by definition a hierarchical database, and it’s important that the hierarchical node structure be properly designed to maintain good functionality and performance. If the hierarchy is too deep, you’re applications will spend a lot of time navigating lots of nodes just to find the one they’re interested in. Or, if the hierarchy is too wide, then there will be lots of children under a single parent and this large parent might become a performance bottleneck.

Unfortunately, it’s difficult to come up with hard and fast rules about what it means for a repository structure to be “too deep” or “too wide”. In this post we talk in detail about the performance of accessing a single node with lots of children, but applications rarely access just one node at a time. Instead, most applications access multiple nodes when performing most application-specific operations, and these patterns will greatly affect the total performance of the application and repository. So no matter what, measure the performance of your application using a variety of repository designs.

How does the number of child nodes affects performance?

ModeShape stores each node separately in the persistent store, and each node representation stores a reference to its parent and to all children. The parent reference makes it easy to walk up the tree, while the list of children makes it fast to walk children and to maintain the order of the children even as nodes are reordered and nodes with same-name-sibilngs are used.

A parent node that has 10s of thousands of children will thus have a pretty large representation in the persistent store, and this adds to the cost of reading and writing that representation. This is why we recommend not having large numbers of children under a single parent.

ModeShape does have the ability to break up the list of children into segments, and to store these segments separately from the parent node. This behavior is not enabled by default, but it can be enabled as a background optimization process.

Avoiding large numbers of child nodes

Sometimes it’s quite easy to design a node structure that doesn’t have parent nodes with large numbers of children. A blog application might organize the posts by date (e.g., “/posts/{year}/{month}/{date}/{title}”), and this works quite well simply because at every level the number of children is limited. For example, there will never be more than 12 nodes under a given year, and never more than 31 nodes under a given month. Also, it is unlikely to create many posts on a single day, so the number of titles under a given day will even be quite small.

While there are many data structures that can naturally organize your hierarchy of nodes, there are situations where there is no obvious natural hierarchy. Consider an application that maintains customers, where each customer is identified by a unique identifier. Your application may be able to organize the customer by region, by date, or by some other characteristic. But that’s not always possible or ideal. In that case, it may be useful to base the hierarchy on an artificial characteristic.

Consider the common case where the identifiers are UUIDs, which are unique and very easily generated. UUIDs are also very nicely and uniformly distributed, meaning that the characters of the hexadecimal form (e.g., “eb751690-23cb-11e4-8c21-0800200c9a66”) of two consecutively generated UUIDs will differ in most of the characters. We can exploit the hexadecimal representation and the uniform distribution of UUIDs to create a hierarchical structure that can store a lot of nodes with just a few levels in the hierarchy.

For example, if we use the first two hexadecimal characters as the name of our first level of nodes, and the next two characters for the second level of our node structure, then we can easily store 1 million nodes in a structure that never has more than 256 children under a single parent. The customer with ID “eb751690-23cb-11e4-8c21-0800200c9a66” can be found by turning the ID into a path:

/customers/eb/75/1690-23cb-11e4-8c21-0800200c9a66

We could vary the design to use 3 characters for the first level and no second intermediate level. That means we can store our 1M nodes with fewer intermediate nodes, while still ensuring that the first level contains no more than 4096 children, while each of those intermediate nodes contains around 256 children. That same customer would be found at:

/customers/eb7/51690-23cb-11e4-8c21-0800200c9a66

Or, we might try 4 levels each with a single character, resulting in a lot more intermediate nodes but each with a very small number of children. Then, that same customer would be found at:

/customers/e/b/7/5/1/690-23cb-11e4-8c21-0800200c9a66

The point is that you can often create a hierarchy that does not require parent nodes with large numbers of children. Of course, if you’re whole hierarchy is designed around these artificial traits and no natural traits, then you may be misusing a hierarchical database and might consider other technologies.

Designing with large numbers of child nodes

Sometimes almost all of your hierarchy design will use the natural traits of the data to create a nice hierarchy, but you have one area or level at which you’d like to store parents with relatively large numbers of child nodes under. If you’re careful and follow these guidelines, you may be able to design it so that ModeShape still performs well for your application without having to use artificial traits.

One of the more expensive operations is adding a child to a parent that already has lots of children. This is because the JCR specification requires that ModeShape validate a number of things before allowing the new child. But with proper design, you can minimize or even eliminate much of that expensive validation.

  • The parent’s primary type and mixins should have a total of one one child node definition, and it should allow same-name-siblings. When this is the case, the single child node definition means that ModeShape can use optimal algorithm that is much faster than 2 or more child node definitions. Also, because the child node definition allows SNS, ModeShape does not have to determine if there is already an existing child with the same name (can be very expensive) before it can pick the child node definition. It also means that when saving changes to the parent, ModeShape doesn’t have to re-validate that there are no children with duplicate names. This saves a tremendous amount of time.
  • Large parent should not be versionable. When a parent contains lots of children, make sure that the parent’s node types and mixins are not mix:versionable, and that all child node definitions have an on parent versioning of ignore. This allows ModeShape to speed up quite a few potentially-expensive operations, including addNode(…).
  • Do not use same name siblings. Even though the node types would allow it, we recommend not using same-name-siblings and having your node structure design or your application ensure that you don’t add duplicates. For example, if your node structure uses UUIDs or SHA-1s as subpaths, the nature of those values ensures that there will not be clashes.
  • Add children in batches. ModeShape can very quickly add lots of nodes using a single save operation. For example, it only takes a few seconds to add 10k child nodes under one parent using a single session and a single save. Use as large of batches as possible. Even when repeating that many times (e.g., adding 200k child nodes under one parent using batches), the performance is pretty quick. On the other hand, it is far more expensive and time consuming to add 200k nodes one at a time.
  • When possible, add multiple children under the same parent before creating other nodes. When ModeShape adds a child node with a given name and primary type under a parent, it has to look at the parent’s primary type and mixins to determine if a child node definition allows adding that child. We’ve added some improvements in ModeShape 3.8.1 and later so that ModeShape caches in each thread the last primary type and mixins that were used previously, and this saves a lot of time to add lots of children under the same parent using one session (even across multiple saves).
  • Do not use versioning. JCR’s versioning actually makes a lot of operations quite expensive. For example, before a child can be added or even before a property can be modified, ModeShape has to make sure that the node nor any of its ancestors are checked out. If any of the ancestors have large numbers of children, materializing that node could be very expensive. In ModeShape 3.8.1 and later, we’ve added an optimization to completely skip these checks when there are no version histories.

Other operations, like getting the path of a node, can also be expensive if any of the ancestors is large or expensive to read. ModeShape normally caches nodes, and if they’re frequently used they’ll stay in the cache. But these cached representations are discarded as soon as the node is modified. This is why adding or modifying nodes can impact read performance.

Use the latest version of ModeShape

As mentioned above, ModeShape 3.8.1 and later will include a number of changes that will improve ModeShape’s overall performance and, especially, performance when working with parents that have lots of children. Look for ModeShape 3.8.1 in the next month or so, and more 4.0 pre- and final releases over the next few months.

 

 

Filed under: features, jcr, performance, techniques

Using a ring buffer for events in 4.0

Events are essential to ModeShape. When your application saves changes to content, ModeShape generates events that describe those changes and sends those events to all of your applications’ listeners registered. The bottom line is that every listener is able to see events for all of the changes made, regardless of which part of the cluster those changes were made or in which part of the cluster your listeners are in.

But your applications aren’t the only components that respond to events: ModeShape itself has quite a few listeners that allow it to monitor and react to those same changes. Some of ModeShape’s listeners respond to changes in your content, while other internal listeners respond to changes made by ModeShape. How? ModeShape stores all kinds of system metadata in the repository (namespaces, node type definitions, locks, versions, index definitions, federated projections, etc.). When any of this metadata is changed and persisted on one process in the cluster, it is only via events that all of the other processes in the cluster notice these changes.

For example, when your application registers a new namespace prefix/URI pair, ModeShape reflects this in the local NamespaceRegistry instance’s in-memory cache and immediately persists the information. But what about the NamespaceRegistry instances elsewhere in the cluster? They’re using listeners to watch for changes in the namespace area of the system metadata, and as soon as they see an event describing the new namespace, the (remote) NamespaceRegistry instances can immediately update their in-memory cache so that all sessions throughout the cluster see a consistent set of namespace registrations.

ModeShape has quite a few components that use events in a similar way: indexes, locks, versions, workspace additions/removals, repository-wide settings, etc.

The ChangeSet and ChangeBus

To register a listener, an application must implement the javax.jcr.observation.EventListener interface and then register an instance with the workspace’s ObservationManager. Standard JCR events can describe the basics of when nodes are created, moved or deleted, and when properties are added, changed or removed. But that’s about it.

Internally, ModeShape uses a much richer and finer-grained kind of events. Every time that a transaction commits (whether that includes a single session save or multiple saves), descriptions of all of the changes made by that commit are bundled into a single ChangeSet. It is these ChangeSets that ModeShape actually ships around the cluster, and all of ModeShape’s internal components are written to respond to them by implementing and regsitering an internal ChangeSetListener interface. Interestingly, every time your applications register a new EventListener instance, ModeShape actually registers an internal ChangeSetListener implementation that merely adapts each ChangeSet (and the changes described by it) into a standard set of JCR Event objects.

Each ModeShape Repository instance has a ChangeBus component that is responsible for keeping track of all of the ChangeSetListeners and forwarding all of the ChangeSets to all those listeners. Multiple internal components send ChangeSet objects to it, and the bus forwards them to each listener. It is very important that this be done quickly and correctly. For example, one listener should never interfere with or block any other listeners. And, a listener should see all of the events in the same order in which they occurred.

If ModeShape is clustered, the ChangeBus satisfies the same requirements, but it works a little differently: when a component sends a ChangeSet, that ChangeSet is immediately sent via JGroups to all members in the cluster, and then in each process JGroups sends the ChangeSet object back to the ChangeBus, which in turn forwards it to all local listeners. By doing it this way, JGroups can ensure that all processes see the same order of ChangeSet objects.

Needless to say, the ChangeBus is critical and is also relatively complicated. The original design in 2.x evolved very little in 3.x, but as we’ll show, we’ve overhauled it completely for 4.0.

The ChangeBus in 2.x and 3.x

ModeShape 2.x and 3.x ChangeBus implementation used a fairly simple design: each listener had a “consumer” thread that ran continuously, popping ChangeSet objects from a listener-specific blocking FIFO queue and calling the actual listener. When a new  ChangeSet is added to the bus, the ChangeBus adds that ChangeSet to the front of the queue for every listener.

Each listener thread consumes ChangeSet objects from its own blocking queue

Each listener thread consumes ChangeSet objects from its own blocking queue

This design had some nice benefits:

  1. The design is fairly simple.
  2. Every listener saw the same order of ChangeSet objects.
  3. Each listener ran in a separate thread, so for the most part each was completely isolated from all other listeners (see below).
  4. Because of the blocking queues, if a listener were really slow and its queue was full, the ChangeBus would block when trying to add the change set to the queue. This provided some backpressure to slow down the system (specifically the sessions making the changes) while the listener could catch up.

It also had a few disadvantages:

  1. When a ChangeSet arrived, the bus had to iteratively add the ChangeSet to all of the listeners’ queues, and it did this before returning from the method. Of course, this takes longer when the bus has more listeners.
  2. A blocking queue has internal locks that must be obtained before a ChangeSet can be added to it, and the consumer is also competing for this lock. This slows down the ChangeBus‘s add operation.
  3. The new ChangeSet is added to the last listener’s queue only after the change set is added to all other queues. This introduces a time lag between the arrival of a ChangeSet in the ChangeBus and the delivery to the last listener, and this lag is more pronounced for those listeners that were added last (since they’re later in the list of listeners).
  4. If any of the blocking queues is full (because its listener is not processing the ChangeSets fast enough), then the ChangeBus‘s add operation will block. This is good because it adds back pressure to the producer (specifically the sessions making the changes), but notice that the add operation is blocked before adding the change set into subsequent queues. So even if those listeners are caught up, they won’t see the change set until the listener with the blocked queue is able to catch up. This makes one listener dependent upon all other listeners that were added to the ChangeBus before it.
  5. Each listener’s queue maintains its own ordered copy of the list of ChangeSet objects. More listeners, more queues.

Notice how having a larger number of listeners has a pretty big impact on the performance. We’ve already noticed a fair amount of lag with 3.x. And in the early pre-releases of 4.0 we’ve already added more internal listeners than we had in 3.x, and we plan to add even more for the index providers.

The new ChangeBus in 4.0

Back in the fall of last year, we knew that the old ChangeBus could be improved and talked about several possible approaches. One of the ideas discussed had a lot of potential: use a ring buffer.

A ring buffer is pretty straightforward. Conceptually it consists of a single circular buffer, one or more producers can add entries (in a thread-safe manner) into the buffer at a single cursor, and consumers trail behind the cursor and process (each in their own thread) each of the entries that are already in the buffer.

ChangeSets are added at the cursor, and consumer threads follow behind reading them

ChangeSets are added at the cursor, and consumer threads follow behind reading them

In the diagram above, the numbers represent the positions of entries in the buffer, starting at 1 and monotonically increasing. The cursor is at position 7, and there are consumer threads that are each reading a ChangeSet at a slightly different position: 6, 4, 3 and 2. Notice that there is a garbage collection thread that follows all other consumers, simply nulling out the ChangeSet reference after it has been consumed by all consumers. (We need this because the ring buffer typically has 1024 or 2048 slots, and this would consume lots of memory if every one had a ChangeSet with lots of changes. The ring buffer’s garbage collector enables all the already-processed ChangeSet objects to be garbage collected by the JVM.)

Here is another image of the ring buffer, after an additional 7 ChangeSet objects have been added and after enough time that the listeners’ consumer threads have advanced.

The cursor has advanced, as have all of the consumers and the buffer's garbage collector

The cursor has advanced, as have all of the consumers and the buffer’s garbage collector

The position of each consumer is completely independent of all other consumers’ positions, though they are obviously dependent upon the cursor position where new entries are being added at the cursor. Typically the listeners are fast enough that the consumers trail very closely behind the cursor. But of course there will be variation, especially if the number of changes in each ChangeSet varies dramatically (and it usually does).

As more ChangeSet objects are added, the cursor advances and will get to the “lap” point, where it starts to reuse the entries in the buffer that were previously used. (Really, the buffer is a simple fixed-size Object[] that is allocated up front, and the positions in the buffer are easily converted into array indexes. We just visualize it as a ring.)

The cursor will eventually reuse buffer entries that are no longer needed

The cursor will eventually reuse buffer entries that are no longer needed

What happens if the cursor catches up to the garbage collector thread? First of all, the ring buffer is usually sized large enough and the listeners fast enough that this doesn’t happen. But if it does, the ring buffer prevents the cursor from advancing onto or beyond the garbage collector (which always stays behind the slowest consumer). Thus, the method adding a ChangeSet object blocks until the cursor can be moved.

The cursor never "laps" the garbage collector or consumers, and this provides natural back pressure

The cursor never “laps” the garbage collector or consumers, and this provides natural back pressure

In a real repository, this back pressure will mean a save operation takes a bit longer. And should this happen more frequently than you’d like, you always have the option of increasing the size of the buffer and restarting the repository. But really what this means is that your system doesn’t have enough cores to support the number of listeners, or that one or more of the listeners are simply taking too long and that perhaps you should consider using the JCR Event Journal instead of the listener framework. (With the event journal, your code can ask for changes that occurred during some period of time.)

At this level of detail it may look like the ring buffer has a lot of potential conflicts. But really, a good ring buffer implementation will maintain this coordination without the use of locks or synchronization techniques. Our implementation does exactly this: it uses volatile longs and compare-and-swap (CAS) operations to keep track of the various positions of the cursor, consumers and garbage collector, and the logic ensures that the consumers never get past the cursor’s position. In fact, we use the exact same technique and code to also ensure that the cursor never laps the garbage collector thread; after all, the buffer is a finite ring.

When all of the consumers are caught up to the cursor and no additional ChangeSet object has been added, then our implementation does currently make each consumer thread block until another ChangeSet object is added. This is done with a simple Java lock condition that is used only in this case; the condition never prevents the addition of a ChangeSet object.

In other words, a ring buffer should be fast. So we looked at various ring buffer implementations, including the LMAX Disruptor (which is very nice). While most of the features were great, there were a few characteristics of the Disruptor that weren’t a great match, so we quickly prototyped our own implementation.

ChangeBus implementation that used the LMAX Disruptor was roughly an order of magnitude faster than our old one, and one that used our prototype ring buffer was even a bit faster.  Given our implementation was small and focused on exactly what we needed, and that we didn’t need another third party dependency, we decided to turn our prototype into something that was more robust and integrated it into the 4.0 codebase. This new ChangeBus implementation will first appear in ModeShape 4.0.0.Alpha3.

This post was quite long, but hopefully you found it interesting and helpful. And for ModeShape users, maybe you’ll even have a bit more insight into how ModeShape handles events, and one of the many ways in which ModeShape 4 is improved.

Filed under: features, performance, techniques

ModeShape is

a lightweight, fast, pluggable, open-source JCR repository that federates and unifies content from multiple systems, including files systems, databases, data grids, other repositories, etc.

Use the JCR API to access the information you already have, or use it like a conventional JCR system (just with more ways to persist your content).

ModeShape used to be 'JBoss DNA'. It's the same project, same community, same license, and same software.

ModeShape

Topics