An open-source, federated content repository

Introduction to ModeShape 4 on WildFly

Francesco Marchioni has written two articles that show just how easy it is to get started with ModeShape 4 on WildFly and to use Java EE to quickly build a simple web application that creates and accesses content.

The first article, “NoSQL Data storage with ModeShape 4”, shows exactly how to install ModeShape 4 into an existing (or new) WildFly 8 server.

The second article, “ModeShape 4 in action”, walks through deploying a simple web application that inserts new nodes into a ModeShape JCR repository and displays them with a nice PrimeFaces tree view.

Thanks, Francesco, for providing these excellent and very useful articles!


Filed under: features, open source, techniques

Improving performance with large numbers of child nodes

A JCR repository is by definition a hierarchical database, and it’s important that the hierarchical node structure be properly designed to maintain good functionality and performance. If the hierarchy is too deep, your applications will spend a lot of time navigating lots of nodes just to find the one they’re interested in. Or, if the hierarchy is too wide, then there will be lots of children under a single parent, and this large parent might become a performance bottleneck.

Unfortunately, it’s difficult to come up with hard and fast rules about what it means for a repository structure to be “too deep” or “too wide”. In this post we talk in detail about the performance of accessing a single node with lots of children, but applications rarely access just one node at a time. Instead, most applications access multiple nodes when performing most application-specific operations, and these patterns will greatly affect the total performance of the application and repository. So no matter what, measure the performance of your application using a variety of repository designs.

How does the number of child nodes affect performance?

ModeShape stores each node separately in the persistent store, and each node representation stores a reference to its parent and to all children. The parent reference makes it easy to walk up the tree, while the list of children makes it fast to walk the children and to maintain their order even as nodes are reordered and nodes with same-name-siblings are used.

A parent node that has tens of thousands of children will thus have a pretty large representation in the persistent store, and this adds to the cost of reading and writing that representation. This is why we recommend not having large numbers of children under a single parent.

ModeShape does have the ability to break up the list of children into segments, and to store these segments separately from the parent node. This behavior is not enabled by default, but it can be enabled as a background optimization process.
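Enabling that background optimization is done in the repository’s JSON configuration. The key names below are an assumption from memory (check the configuration reference for your release); the idea is simply that you schedule the job and give a target number of children per block plus a tolerance:

```json
{
    "name" : "my-repository",
    "documentOptimization" : {
        "intervalInHours" : 24,
        "childCountTarget" : 500,
        "childCountTolerance" : 50
    }
}
```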

Avoiding large numbers of child nodes

Sometimes it’s quite easy to design a node structure that doesn’t have parent nodes with large numbers of children. A blog application might organize the posts by date (e.g., “/posts/{year}/{month}/{date}/{title}”), and this works quite well simply because at every level the number of children is limited. For example, there will never be more than 12 nodes under a given year, and never more than 31 nodes under a given month. Also, it is unlikely that many posts will be created on a single day, so the number of titles under a given day will be quite small.

While there are many data structures that can naturally organize your hierarchy of nodes, there are situations where there is no obvious natural hierarchy. Consider an application that maintains customers, where each customer is identified by a unique identifier. Your application may be able to organize the customers by region, by date, or by some other characteristic. But that’s not always possible or ideal. In that case, it may be useful to base the hierarchy on an artificial characteristic.

Consider the common case where the identifiers are UUIDs, which are unique and very easily generated. UUIDs are also very nicely and uniformly distributed, meaning that the characters of the hexadecimal form (e.g., “eb751690-23cb-11e4-8c21-0800200c9a66”) of two consecutively generated UUIDs will differ in most of the characters. We can exploit the hexadecimal representation and the uniform distribution of UUIDs to create a hierarchical structure that can store a lot of nodes with just a few levels in the hierarchy.

For example, if we use the first two hexadecimal characters as the name of our first level of nodes, and the next two characters for the second level of our node structure, then we can easily store 1 million nodes in a structure that never has more than 256 children under a single parent. The customer with ID “eb751690-23cb-11e4-8c21-0800200c9a66” can be found by turning the ID into a path:

/customers/eb/75/eb751690-23cb-11e4-8c21-0800200c9a66

We could vary the design to use 3 characters for the first level and no second level. That means we can store our 1M nodes with fewer intermediate nodes, while still ensuring that the first level contains no more than 4096 children and each of those intermediate nodes contains around 256 children. That same customer would be found at:

/customers/eb7/eb751690-23cb-11e4-8c21-0800200c9a66

Or, we might try 4 levels, each with a single character, resulting in a lot more intermediate nodes but each with a very small number of children. Then, that same customer would be found at:

/customers/e/b/7/5/eb751690-23cb-11e4-8c21-0800200c9a66

The point is that you can often create a hierarchy that does not require parent nodes with large numbers of children. Of course, if your whole hierarchy is designed around these artificial traits and no natural traits, then you may be misusing a hierarchical database and might consider other technologies.
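As a sketch, the ID-to-path bucketing described above can be implemented in a few lines of Java. The `pathFor` helper and the `/customers` root used here are hypothetical, purely to illustrate the scheme:

```java
public class CustomerPaths {

    // Hypothetical helper: build a bucketed path for a customer ID using
    // 'levels' intermediate levels of 'width' hexadecimal characters each.
    public static String pathFor(String uuid, int levels, int width) {
        String hex = uuid.replace("-", "");
        StringBuilder path = new StringBuilder("/customers");
        for (int i = 0; i < levels; i++) {
            // take the next 'width' characters of the hex form for this level
            path.append('/').append(hex, i * width, (i + 1) * width);
        }
        return path.append('/').append(uuid).toString();
    }

    public static void main(String[] args) {
        String id = "eb751690-23cb-11e4-8c21-0800200c9a66";
        System.out.println(pathFor(id, 2, 2)); // two levels of two characters
        System.out.println(pathFor(id, 1, 3)); // one level of three characters
        System.out.println(pathFor(id, 4, 1)); // four levels of one character
    }
}
```

Because UUIDs are uniformly distributed, such buckets fill evenly no matter which `levels`/`width` combination you pick.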

Designing with large numbers of child nodes

Sometimes almost all of your hierarchy design will use the natural traits of the data to create a nice hierarchy, but you have one area or level at which you’d like to store parents with relatively large numbers of child nodes. If you’re careful and follow these guidelines, you may be able to design it so that ModeShape still performs well for your application without having to use artificial traits.

One of the more expensive operations is adding a child to a parent that already has lots of children. This is because the JCR specification requires that ModeShape validate a number of things before allowing the new child. But with proper design, you can minimize or even eliminate much of that expensive validation.

  • The parent’s primary type and mixins should have a total of one child node definition, and it should allow same-name-siblings. When this is the case, the single child node definition means that ModeShape can use an optimal algorithm that is much faster than when there are two or more child node definitions. Also, because the child node definition allows SNS, ModeShape does not have to determine whether there is already an existing child with the same name (which can be very expensive) before it can pick the child node definition. It also means that when saving changes to the parent, ModeShape doesn’t have to re-validate that there are no children with duplicate names. This saves a tremendous amount of time.
  • Large parents should not be versionable. When a parent contains lots of children, make sure that the parent’s node types and mixins are not mix:versionable, and that all child node definitions have an on-parent-version setting of IGNORE. This allows ModeShape to speed up quite a few potentially-expensive operations, including addNode(…).
  • Do not use same name siblings. Even though the node types would allow it, we recommend not using same-name-siblings and having your node structure design or your application ensure that you don’t add duplicates. For example, if your node structure uses UUIDs or SHA-1s as subpaths, the nature of those values ensures that there will not be clashes.
  • Add children in batches. ModeShape can very quickly add lots of nodes using a single save operation. For example, it only takes a few seconds to add 10k child nodes under one parent using a single session and a single save. Use batches as large as possible. Even when repeating that many times (e.g., adding 200k child nodes under one parent using batches), the performance is pretty quick. On the other hand, it is far more expensive and time consuming to add 200k nodes one at a time.
  • When possible, add multiple children under the same parent before creating other nodes. When ModeShape adds a child node with a given name and primary type under a parent, it has to look at the parent’s primary type and mixins to determine whether a child node definition allows adding that child. We’ve added some improvements in ModeShape 3.8.1 and later so that ModeShape caches in each thread the last primary type and mixins that were used, and this saves a lot of time when adding lots of children under the same parent using one session (even across multiple saves).
  • Do not use versioning. JCR’s versioning actually makes a lot of operations quite expensive. For example, before a child can be added or even before a property can be modified, ModeShape has to make sure that neither the node nor any of its ancestors is checked in. If any of the ancestors have large numbers of children, materializing that node could be very expensive. In ModeShape 3.8.1 and later, we’ve added an optimization to completely skip these checks when there are no version histories.

Other operations, like getting the path of a node, can also be expensive if any of the ancestors is large or expensive to read. ModeShape normally caches nodes, and if they’re frequently used they’ll stay in the cache. But these cached representations are discarded as soon as the node is modified. This is why adding or modifying nodes can impact read performance.

Use the latest version of ModeShape

As mentioned above, ModeShape 3.8.1 and later will include a number of changes that will improve ModeShape’s overall performance and, especially, performance when working with parents that have lots of children. Look for ModeShape 3.8.1 in the next month or so, and more 4.0 pre- and final releases over the next few months.



Filed under: features, jcr, performance, techniques

Using a ring buffer for events in 4.0

Events are essential to ModeShape. When your application saves changes to content, ModeShape generates events that describe those changes and sends them to all of the listeners your applications have registered. The bottom line is that every listener is able to see events for all of the changes made, regardless of where in the cluster those changes were made or where in the cluster your listeners are.

But your applications aren’t the only components that respond to events: ModeShape itself has quite a few listeners that allow it to monitor and react to those same changes. Some of ModeShape’s listeners respond to changes in your content, while other internal listeners respond to changes made by ModeShape. How? ModeShape stores all kinds of system metadata in the repository (namespaces, node type definitions, locks, versions, index definitions, federated projections, etc.). When any of this metadata is changed and persisted on one process in the cluster, it is only via events that all of the other processes in the cluster notice these changes.

For example, when your application registers a new namespace prefix/URI pair, ModeShape reflects this in the local NamespaceRegistry instance’s in-memory cache and immediately persists the information. But what about the NamespaceRegistry instances elsewhere in the cluster? They’re using listeners to watch for changes in the namespace area of the system metadata, and as soon as they see an event describing the new namespace, the (remote) NamespaceRegistry instances can immediately update their in-memory cache so that all sessions throughout the cluster see a consistent set of namespace registrations.

ModeShape has quite a few components that use events in a similar way: indexes, locks, versions, workspace additions/removals, repository-wide settings, etc.

The ChangeSet and ChangeBus

To register a listener, an application must implement the javax.jcr.observation.EventListener interface and then register an instance with the workspace’s ObservationManager. Standard JCR events can describe the basics of when nodes are created, moved or deleted, and when properties are added, changed or removed. But that’s about it.

Internally, ModeShape uses a much richer and finer-grained kind of event. Every time a transaction commits (whether that includes a single session save or multiple saves), descriptions of all of the changes made by that commit are bundled into a single ChangeSet. It is these ChangeSets that ModeShape actually ships around the cluster, and all of ModeShape’s internal components are written to respond to them by implementing and registering an internal ChangeSetListener interface. Interestingly, every time your applications register a new EventListener instance, ModeShape actually registers an internal ChangeSetListener implementation that merely adapts each ChangeSet (and the changes described by it) into a standard set of JCR Event objects.

Each ModeShape Repository instance has a ChangeBus component that is responsible for keeping track of all of the ChangeSetListeners and forwarding all of the ChangeSets to all those listeners. Multiple internal components send ChangeSet objects to it, and the bus forwards them to each listener. It is very important that this be done quickly and correctly. For example, one listener should never interfere with or block any other listeners. And, a listener should see all of the events in the same order in which they occurred.

If ModeShape is clustered, the ChangeBus satisfies the same requirements, but it works a little differently: when a component sends a ChangeSet, that ChangeSet is immediately sent via JGroups to all members in the cluster, and then in each process JGroups sends the ChangeSet object back to the ChangeBus, which in turn forwards it to all local listeners. By doing it this way, JGroups can ensure that all processes see the same order of ChangeSet objects.

Needless to say, the ChangeBus is critical and is also relatively complicated. The original design in 2.x evolved very little in 3.x, but as we’ll show, we’ve overhauled it completely for 4.0.

The ChangeBus in 2.x and 3.x

The ModeShape 2.x and 3.x ChangeBus implementation used a fairly simple design: each listener had a “consumer” thread that ran continuously, popping ChangeSet objects from a listener-specific blocking FIFO queue and calling the actual listener. When a new ChangeSet arrived, the ChangeBus added it to every listener’s queue.

Each listener thread consumes ChangeSet objects from its own blocking queue

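The old design can be sketched with standard blocking queues. The class and method names here are illustrative, not ModeShape’s actual code, and a String stands in for the real ChangeSet type:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch of the 2.x/3.x approach: one blocking FIFO queue and one
// consumer thread per registered listener.
class OldChangeBus {

    interface ChangeSetListener {
        void notify(String changeSet); // stand-in for the real ChangeSet type
    }

    private final List<BlockingQueue<String>> queues = new CopyOnWriteArrayList<>();

    void register(ChangeSetListener listener) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        queues.add(queue);
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    listener.notify(queue.take()); // pop and deliver, in order
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shut down the consumer
            }
        });
        consumer.setDaemon(true);
        consumer.start();
    }

    void add(String changeSet) throws InterruptedException {
        // Iterate over every listener's queue; put() blocks when a queue is
        // full, which is where the back pressure (and the lag) comes from.
        for (BlockingQueue<String> queue : queues) {
            queue.put(changeSet);
        }
    }
}
```

Notice how `add` must visit every queue in turn, and how one full queue stalls delivery to all listeners registered after it. Those are exactly the disadvantages discussed below.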

This design had some nice benefits:

  1. The design is fairly simple.
  2. Every listener saw the same order of ChangeSet objects.
  3. Each listener ran in a separate thread, so for the most part each was completely isolated from all other listeners (see below).
  4. Because of the blocking queues, if a listener were really slow and its queue was full, the ChangeBus would block when trying to add the change set to the queue. This provided some backpressure to slow down the system (specifically the sessions making the changes) while the listener could catch up.

It also had a few disadvantages:

  1. When a ChangeSet arrived, the bus had to iteratively add the ChangeSet to all of the listeners’ queues, and it did this before returning from the method. Of course, this takes longer when the bus has more listeners.
  2. A blocking queue has internal locks that must be obtained before a ChangeSet can be added to it, and the consumer is also competing for this lock. This slows down the ChangeBus’s add operation.
  3. The new ChangeSet is added to the last listener’s queue only after the change set is added to all other queues. This introduces a time lag between the arrival of a ChangeSet in the ChangeBus and the delivery to the last listener, and this lag is more pronounced for those listeners that were added last (since they’re later in the list of listeners).
  4. If any of the blocking queues is full (because its listener is not processing the ChangeSets fast enough), then the ChangeBus’s add operation will block. This is good because it adds back pressure to the producer (specifically the sessions making the changes), but notice that the add operation is blocked before adding the change set into subsequent queues. So even if those listeners are caught up, they won’t see the change set until the listener with the blocked queue is able to catch up. This makes one listener dependent upon all other listeners that were added to the ChangeBus before it.
  5. Each listener’s queue maintains its own ordered copy of the list of ChangeSet objects. More listeners, more queues.

Notice how having a larger number of listeners has a pretty big impact on the performance. We’ve already noticed a fair amount of lag with 3.x. And in the early pre-releases of 4.0 we’ve already added more internal listeners than we had in 3.x, and we plan to add even more for the index providers.

The new ChangeBus in 4.0

Back in the fall of last year, we knew that the old ChangeBus could be improved and talked about several possible approaches. One of the ideas discussed had a lot of potential: use a ring buffer.

A ring buffer is pretty straightforward. Conceptually it consists of a single circular buffer: one or more producers add entries (in a thread-safe manner) at a single cursor, while consumers trail behind the cursor and process (each in their own thread) the entries that are already in the buffer.

ChangeSets are added at the cursor, and consumer threads follow behind reading them


In the diagram above, the numbers represent the positions of entries in the buffer, starting at 1 and monotonically increasing. The cursor is at position 7, and there are consumer threads that are each reading a ChangeSet at a slightly different position: 6, 4, 3 and 2. Notice that there is a garbage collection thread that follows all other consumers, simply nulling out each ChangeSet reference after it has been consumed by all consumers. (We need this because the ring buffer typically has 1024 or 2048 slots, and it would consume lots of memory if every slot held a ChangeSet with lots of changes. The ring buffer’s garbage collector enables all the already-processed ChangeSet objects to be garbage collected by the JVM.)

Here is another image of the ring buffer, after an additional 7 ChangeSet objects have been added and after enough time that the listeners’ consumer threads have advanced.

The cursor has advanced, as have all of the consumers and the buffer's garbage collector


The position of each consumer is completely independent of all other consumers’ positions, though all consumers are obviously bounded by the cursor position where new entries are added. Typically the listeners are fast enough that the consumers trail very closely behind the cursor. But of course there will be variation, especially if the number of changes in each ChangeSet varies dramatically (and it usually does).

As more ChangeSet objects are added, the cursor advances and will get to the “lap” point, where it starts to reuse the entries in the buffer that were previously used. (Really, the buffer is a simple fixed-size Object[] that is allocated up front, and the positions in the buffer are easily converted into array indexes. We just visualize it as a ring.)

The cursor will eventually reuse buffer entries that are no longer needed


What happens if the cursor catches up to the garbage collector thread? First of all, the ring buffer is usually sized large enough and the listeners fast enough that this doesn’t happen. But if it does, the ring buffer prevents the cursor from advancing onto or beyond the garbage collector (which always stays behind the slowest consumer). Thus, the method adding a ChangeSet object blocks until the cursor can be moved.

The cursor never "laps" the garbage collector or consumers, and this provides natural back pressure


In a real repository, this back pressure will mean a save operation takes a bit longer. And should this happen more frequently than you’d like, you always have the option of increasing the size of the buffer and restarting the repository. But really what this means is that your system doesn’t have enough cores to support the number of listeners, or that one or more of the listeners are simply taking too long and that perhaps you should consider using the JCR Event Journal instead of the listener framework. (With the event journal, your code can ask for changes that occurred during some period of time.)

At this level of detail it may look like the ring buffer has a lot of potential conflicts. But really, a good ring buffer implementation will maintain this coordination without the use of locks or synchronization techniques. Our implementation does exactly this: it uses volatile longs and compare-and-swap (CAS) operations to keep track of the various positions of the cursor, consumers and garbage collector, and the logic ensures that the consumers never get past the cursor’s position. In fact, we use the exact same technique and code to also ensure that the cursor never laps the garbage collector thread; after all, the buffer is a finite ring.
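As a rough illustration of that position bookkeeping, here is a drastically simplified ring buffer, reduced to a single producer and a single consumer. ModeShape’s actual buffer supports many consumers plus the garbage-collection thread and uses CAS loops; the class and member names here are invented for the sketch:

```java
import java.util.concurrent.atomic.AtomicLong;

// Single-producer/single-consumer ring buffer sketch. Positions increase
// monotonically; a bitmask turns a position into an array index.
class RingBuffer<T> {

    private final Object[] slots;
    private final int mask;
    private final AtomicLong cursor = new AtomicLong(-1);   // last published position
    private final AtomicLong consumed = new AtomicLong(-1); // last consumed position

    RingBuffer(int sizePowerOfTwo) {
        slots = new Object[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
    }

    void publish(T entry) {
        long next = cursor.get() + 1;
        while (next - consumed.get() > slots.length) {
            Thread.onSpinWait(); // buffer full: producer must wait (back pressure)
        }
        slots[(int) (next & mask)] = entry;
        cursor.lazySet(next); // publish: make the entry visible to the consumer
    }

    @SuppressWarnings("unchecked")
    T take() {
        long next = consumed.get() + 1;
        while (cursor.get() < next) {
            Thread.onSpinWait(); // nothing published yet: consumer must wait
        }
        T entry = (T) slots[(int) (next & mask)];
        slots[(int) (next & mask)] = null; // "garbage collect" the consumed slot
        consumed.lazySet(next);
        return entry;
    }
}
```

The `next - consumed.get() > slots.length` check in `publish` is the “never lap the slowest consumer” rule from the diagrams, collapsed to one consumer.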

When all of the consumers are caught up to the cursor and no additional ChangeSet object has been added, then our implementation does currently make each consumer thread block until another ChangeSet object is added. This is done with a simple Java lock condition that is used only in this case; the condition never prevents the addition of a ChangeSet object.

In other words, a ring buffer should be fast. So we looked at various ring buffer implementations, including the LMAX Disruptor (which is very nice). While most of the features were great, there were a few characteristics of the Disruptor that weren’t a great match, so we quickly prototyped our own implementation.

A ChangeBus implementation that used the LMAX Disruptor was roughly an order of magnitude faster than our old one, and one that used our prototype ring buffer was even a bit faster. Given that our implementation was small and focused on exactly what we needed, and that we didn’t need another third-party dependency, we decided to turn our prototype into something more robust and integrated it into the 4.0 codebase. This new ChangeBus implementation will first appear in ModeShape 4.0.0.Alpha3.

This post was quite long, but hopefully you found it interesting and helpful. And for ModeShape users, maybe you’ll even have a bit more insight into how ModeShape handles events, and one of the many ways in which ModeShape 4 is improved.

Filed under: features, performance, techniques

Structured, unstructured, and everything between

Shane gives a good breakdown of the various ways to classify data as structured or unstructured. He points out that very often data is a mixture of both structured and unstructured data, and he gives several examples.

What I find so interesting about this, however, is how well ModeShape can handle these varieties of data.

ModeShape handles structured data really well. Most data structures are very easily mapped to the nodes and properties that ModeShape uses. And when those nodes also say which node types apply to them, ModeShape can enforce the node structure by validating it against those built-in and/or custom node types and prevent invalid data from being stored.

The other end of the spectrum is unstructured data, and ModeShape handles that beautifully, too. You can store unstructured data in a property using a string value or a binary value. Typically you would use a string value when the data is some form of text, and a binary value in any other cases (or when you don’t want to treat it as text).

But the best part is that ModeShape naturally handles combinations of structured and unstructured data. Recall that ModeShape is a hierarchical database, which means that each database consists of a single tree of nodes, and each node has one or more properties. That hierarchy is by definition structured, though it’s up to you whether ModeShape validates and enforces that structure using node types. But the leaves of that tree, that is the properties and their values, are typically unstructured (though property values like dates and even some string values could be considered structured).

ModeShape’s query languages can also deal with both structured and unstructured data. Relationships between nodes, specific properties defined by node types, and the definitions of those properties all are addressable within the query language. But ModeShape queries can include full-text search constraints on both string and binary property values!

ModeShape can search those binary values when it can extract text using the Tika library, which supports many formats, including PDF, Microsoft Office™, RTF, HTML, and many others.

There’s one more way that ModeShape can deal with unstructured data: it can sequence unstructured data (string and binary property values) using built-in or custom sequencers to extract structure and save it as more nodes and properties in the repository. This is ideal for getting at that unstructured data that has the implicit structure defined by the format. For example, if an image is loaded into the repository, ModeShape’s image sequencer can extract the EXIF data in the image (e.g., ISO setting, focal length, aperture, shutter, geo-location, etc.) and save it as properties in the repository. ModeShape has a number of built-in sequencers that can extract this implicit structure from a variety of file formats:

  • DDL files
  • images (JPEG, GIF, BMP, PCX, PNG, IFF, RAS, PBM, PGM, PPM and PSD)
  • audio (MP3)
  • comma-separated and delimited text files
  • Java source and class files
  • Microsoft Office™
  • ZIP archives
  • XML
  • XML Schema
  • WSDL

In summary, ModeShape deals very naturally and easily with data that is part unstructured and part structured. What else could you want?

Filed under: features, repository, techniques

Creating and using tags in your content

UPDATE 2: Changed option 3 to use string identifiers, as WEAKREFERENCE and REFERENCE properties both maintain back-references.

UPDATE 1: Added a 5th option, as suggested by Bertrand Delacretaz.

(This post was inspired by a response I recently wrote to a Stack Overflow question. That answer was a bit long, but I thought it would also be suitable as a blog post.)

Many applications offer a way to tag “things” with either user-defined or system-defined tags. Assuming those “things” are nodes, what’s the best way to add tags to a ModeShape repository? I know of five possible approaches, each with its own benefits and disadvantages.

Option 1: Use Mixins

This approach uses a separate mixin node type definition for each tag. The mixin is a marker mixin (i.e., it has no property definitions or child node definitions). One example, a “known-issue” tag, looks like the following (in CND format):

[tag:known-issue] mixin

Create this tag by registering the node type definition using the NodeTypeManager, either by programmatically creating the node type template or by uploading a CND file.

To “tag” a particular node, simply add the tag’s mixin to the node:

node.addMixin("tag:known-issue");

Note that any node can have multiple tags, since any node can have multiple mixins.

To find all nodes that have a particular tag, simply issue a query:

SELECT * FROM [tag:known-issue]

To find all nodes that have either of two tags, simply perform a UNION:

SELECT * FROM [tag:known-issue]
UNION
SELECT * FROM [tag:critical-issue]

This approach is pretty straightforward and makes good use of the mixin feature. However, it is fairly cumbersome to create new tags, since doing so requires registering new node types. Plus, you cannot easily rename tags, but instead would have to:

  1. create the mixin for the tag with the new name;
  2. find all nodes that have the mixin representing the old tag, and for each remove the old mixin and add the new one;
  3. finally remove the node type definition for the old tag (after it is no longer used anywhere).

Removing old tags is done in a similar manner. Finally, it’s not really possible to associate additional metadata (like a display name) with a tag, since extra properties aren’t allowed on node type definitions.

This approach should perform quite well, however.

Option 2: Use a taxonomy and references

This approach involves using one or more “taxonomies”, each of which consists of a parent node for the taxonomy and child nodes for each tag in that taxonomy. The exact node types used are entirely up to you, but the taxonomy structure can be as rich as you’d like it to be. For example, you can create inheritance between tags in much the same way that classes can inherit from other classes in an ontology. Obviously adding, renaming, and removing tags is straightforward.

To “tag” a node, this approach uses a REFERENCE property. One way to do this is to define a single node type for the tag nodes and a single mixin that we’ll use to add this REFERENCE property to “taggable” nodes:

[tags:tag] > mix:title, mix:referenceable

[tags:taggable] mixin
- tags:tags (REFERENCE) multiple < 'tags:tag'

To “apply” the tag to a node, simply add the “tags:taggable” mixin to the node (if not already there) and add the REFERENCE to the desired tag node. Here’s some code that does this (although it is too simple and assumes the node hasn’t already been tagged):

Node tag = ... // find in taxonomy
Node n = ... // the node that we're going to tag
if ( !n.isNodeType("tags:taggable") ) {
    n.addMixin("tags:taggable");
}
Value[] values = new Value[1];
values[0] = session.getValueFactory().createValue(tag);
n.setProperty("tags:tags", values);

To find all nodes with a particular tag, simply find the tag node and call “getReferences()” on it to get all of the REFERENCE properties that point to it (and, from each property, the tagged node):

Node tag = ... // find the tag node in the taxonomy
PropertyIterator iter = tag.getReferences("tags:tags");
while ( iter.hasNext() ) {
    Node tagged = iter.nextProperty().getParent();
}

Alternatively, you could use a query to find all of the nodes for a particular tag. Here’s one that finds all the nodes that are tagged with the ‘known-issue’ or ‘critical-issue’ tag (note how easy it is to search for nodes tagged with any of 1, 2, or n tags just by changing the set criteria):

SELECT * FROM [tags:taggable] AS taggable
JOIN [tags:tag] AS tag ON taggable.[tags:tags] = tag.[jcr:uuid]
WHERE LOCALNAME(tag) IN ('known-issue','critical-issue')

This approach has the benefit that all tags have to be controlled/managed within one or more taxonomies (including perhaps user-specific taxonomies).

However, there is one potentially substantial disadvantage: this option may not scale very well to large numbers of tagged nodes. ModeShape’s performance might start to degrade when adding and removing REFERENCE values once there are hundreds of nodes pointing to the same tag node. Another disadvantage is that a tag cannot be removed from a taxonomy unless it is no longer used.

You can also use WEAKREFERENCE rather than REFERENCE. The only distinction is that with WEAKREFERENCE you can remove a tag from the taxonomy without having to remove it from the tagged nodes.

Option 3: Use a taxonomy and identifier references

This option is similar to Option 2 above in that it involves formally managing one or more taxonomies, in exactly the same way as described above. The difference, however, is that rather than use a REFERENCE (or WEAKREFERENCE), the node that is to be tagged points to the tag node using a STRING property containing the identifier of the tag node:

[tags:tag] > mix:title, mix:referenceable

[tags:taggable] mixin
- tags:tags (STRING) multiple

Note that the tag has a “jcr:title” property, which you can use to hold the display name for the tag.

Tagging a node is done similarly to Option 2, except the value of the “tags:tag” property is a string:

Node tag = ... // find in taxonomy
String tagId = tag.getIdentifier();
Node n = ... // the node that we're going to tag
if ( !n.isNodeType("tags:taggable") ) {
    n.addMixin("tags:taggable");
}
Value[] values = new Value[1];
values[0] = session.getValueFactory().createValue(tagId);
n.setProperty("tags:tags", values);

To find all nodes of a particular tag, simply use a query to find all of the nodes that have the identifier of a particular tag. Here’s one that finds all the nodes that are tagged with the ‘known-issues’ or ‘critical-issue’ tag (note how easy it is to search for nodes tagged with any of 1, 2, or n tags just by changing the set criteria):

SELECT * FROM [tags:taggable] AS taggable
JOIN [tags:tag] AS tag ON taggable.[tags:tags] = tag.[jcr:uuid]
AND LOCALNAME(tag) IN ('known-issue','critical-issue')

You’ll note that this is very similar to the query in Option 2. That’s because REFERENCE and WEAKREFERENCE properties are physically stored in a property value as an identifier.

Like Option 2, this approach enforces using one or more taxonomies, and makes it a bit easier to control the tags, since they must exist in a taxonomy before they can be used. Renaming a tag is also pretty easy; in fact, it isn’t even necessary when using the “jcr:title” property for the display name, since a rename involves simply changing the title property value. Performance-wise, this is far better than the REFERENCE and WEAKREFERENCE approach, since non-reference properties will scale and perform much better with large numbers of references, regardless of whether they all point to one node or many. Looking up the tag(s) from the “tags:tags” property is also very fast (and faster than navigating a path).

This approach is similar to Option 2 with WEAKREFERENCE properties in that you can remove a tag even if it is still used, although nodes’ “tags:tags” property values that point to that removed tag will not be usable anymore. This can be remedied with some conventions in your application, or by simply keeping tags around and using metadata on the taxonomy to say that a particular tag is “deprecated” and shouldn’t be used. (IMO, the latter is actually a benefit of this approach.)

This option will generally perform and scale much better than Option 2.

Option 4: Use string properties

The final approach is to simply use a STRING property to tag each node with the name of the tag(s) that are to be applied. This works great for ad hoc tags, where there is no formal taxonomy and any tag can be used at any time.

Here’s a mixin that defines a multi-valued STRING property:

[tags:taggable] mixin
- tags:tags (STRING) multiple

To tag a node, simply add the mixin (if not already present) and add the name of the tag as a value on the “tags:tags” STRING property (again, if it’s not already present as a value). Here’s some simplified code that does none of the checking, but which gives the basic idea:

Node n = ... // the node that we're going to tag
if ( !n.isNodeType("tags:taggable") ) {
    n.addMixin("tags:taggable");
}
String[] tags = new String[]{"known-issue"};
n.setProperty("tags:tags", tags);

The primary advantage of this approach is that it is very simple: you’re simply using string values on the node that is to be tagged. To find all nodes that are tagged with a particular tag (e.g., “tag1”), simply issue a query:

SELECT * FROM [tags:taggable] AS taggable
WHERE taggable.[tags:tags] = 'known-issue'

Also, there is no taxonomy to manage. If a tag is to be renamed, you can simply rewrite the matching “tags:tags” values. If a tag is to be deleted (and removed from the nodes that are tagged with it), that can be done by removing the tag name from the “tags:tags” properties (perhaps in a background job).
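As a sketch of that rename step (the class and method names here are hypothetical, and the JCR query and session.save() plumbing are omitted), the per-node value rewrite might look like:

```java
import java.util.Arrays;

public class TagRename {
    // Rewrite one node's "tags:tags" values, replacing one tag name with another.
    // A real application would first query for [tags:taggable] nodes whose
    // "tags:tags" property contains 'from', then apply this rewrite to each
    // node's values and call session.save().
    static String[] rename(String[] tags, String from, String to) {
        String[] updated = tags.clone();
        for (int i = 0; i < updated.length; i++) {
            if (updated[i].equals(from)) updated[i] = to;
        }
        return updated;
    }

    public static void main(String[] args) {
        String[] tags = {"known-issue", "performance"};
        System.out.println(Arrays.toString(rename(tags, "known-issue", "legacy-issue")));
    }
}
```

Running this prints `[legacy-issue, performance]`.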

Note that this allows any tag name to be used, and thus works best for cases where the tag names are not controlled at all. If you want to control the list of strings used as tag values, you could create a taxonomy in the repository (as described in Options 2 and 3 above) and have your application limit the values to those in the taxonomy. You can even have multiple taxonomies, some of which are perhaps user-specific. But this approach doesn’t have quite the same control as Options 2 or 3.
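For instance, a small application-side helper (hypothetical; nothing like this exists in ModeShape itself) could filter the requested tag strings against a controlled taxonomy before they are set on the node:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TagValidator {
    // Keep only the requested tags that exist in the controlled taxonomy,
    // preserving order and dropping duplicates; the result would then be
    // used as the values of the "tags:tags" property.
    static List<String> allowedTags(List<String> requested, Set<String> taxonomy) {
        List<String> result = new ArrayList<>();
        for (String tag : requested) {
            if (taxonomy.contains(tag) && !result.contains(tag)) {
                result.add(tag);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> taxonomy = new HashSet<>(Arrays.asList("known-issue", "critical-issue"));
        System.out.println(allowedTags(Arrays.asList("known-issue", "typo", "known-issue"), taxonomy));
    }
}
```

Running this prints `[known-issue]`, since “typo” is not in the taxonomy and the duplicate is dropped.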

This option will perform just a bit better than Option 3 (since the queries are a tad simpler), but will scale just as well.

Option 5: Use taxonomy and paths

A fifth option is very similar to Option 3, except that you use a PATH property (rather than a STRING property) whose values are the paths of the tag nodes. Here are some node types:

[tags:tag] > mix:title

[tags:taggable] mixin
- tags:tags (PATH) multiple

(You could also use a STRING property instead of PATH; really the only advantage of using PATH is that it enforces that each value is a legal path value. But using PATH does not enforce that it is an existing path.)

To tag a node, simply add the mixin (if not already present) and add the path of the tag as a value on the “tags:tags” PATH property (again, if it’s not already present as a value). Here’s some simplified code that does none of the checking, but which gives the basic idea:

Node tag = ... // the tag node
Node n = ... // the node that we're going to tag
if ( !n.isNodeType("tags:taggable") ) {
    n.addMixin("tags:taggable");
}
String[] tags = new String[]{tag.getPath()};
n.setProperty("tags:tags", tags, PropertyType.PATH);

Unlike Options 2 or 3, this approach does not even use references, so the tag nodes need not be referenceable. In fact, you’ll notice that the “tags:tags” property definition has no constraint requiring its values to be the paths of existing tag nodes; this reduces the constraints and requires your application to rely on convention, which can be an advantage. Using a title on the tag for the displayable name obviates having to rename tags. Performance-wise, this is far better than the REFERENCE or WEAKREFERENCE approach, and (for ModeShape) just a bit worse than using a STRING property with an identifier (ModeShape can resolve a node by identifier faster than it can find it by path). But it will scale far better than Option 2 and similarly to Option 3.

One advantage of this approach (and of Option 3) over Option 2 is that you can remove a tag even if it is still used, although nodes’ PATH properties that point to that removed tag will be readable but not resolvable. (If you’re using the tag’s title for the display name, this might not be useful since the path might not contain meaningful and usable information.) This can be remedied with some conventions in your application, or by simply keeping tags around and using metadata on the taxonomy to say that a particular tag is “deprecated” and shouldn’t be used. (IMO, the latter is actually a benefit of this approach.)


We looked at five different ways of incorporating tags into your application. Which one works best for you will, of course, depend on the needs of your particular application. Use these as a starting point, and feel free to customize them, combine them, or come up with other alternatives entirely.

If you just need a way to associate informal tags with content, perhaps Option 4 is a good fit. For very small and limited tagging needs, Option 1 might work. And for smallish repositories that need a formal taxonomy, you should seriously look at Option 2.

But for most applications, your repository will be large enough that you will probably want to look at Options 3, 4 or 5, with the deciding factor being whether you need formal or informal taxonomies. Personally, of these three I think I’d tend to lean toward Option 3.

Happy tagging!

Filed under: jcr, techniques

Concurrent writes

It’s almost a certainty that you will have multiple applications and multiple threads within those applications simultaneously update data in your database. The speed of your application will depend significantly on how fast your database can perform these simultaneous updates.

If you’re using ModeShape, the first thing to know is that reading content does not require any locks. In other words, applications or threads that are reading content can always do so with no contention. (ModeShape doesn’t need read locks because Infinispan, the underlying store, uses MVCC to isolate readers from writers.)

The second thing to know is that, because ModeShape is a hierarchical database, all data is stored in a tree-like structure of nodes and properties, and any transaction updating content must obtain locks for all nodes being updated. Much of the time, applications and threads that change content do tend to update different parts (subtrees) of the database, which means completely different write locks are acquired by the different transactions. In other words, updates to different parts of the database never block each other.

There are times, however, when multiple applications and/or threads do attempt to update the same node at the same time. In this case, the transactions do compete for the node’s lock, and these transactions complete in essentially a serialized fashion. (Again, they still do not block any reading operations or any transactions updating other areas of the repository.) Occasionally two transactions may deadlock, because they each obtain a lock on separate nodes and then try to obtain a lock on the node currently locked by the other. If you run into this situation, you can enable deadlock detection to automatically detect such cases and roll back one of the deadlocked transactions, which your application can simply re-try by performing the save again.
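One simple application-side pattern for that re-try is to wrap the save in a small retry loop. The sketch below is illustrative only: the Callable simulates a session.save() that is rolled back twice (e.g., as the deadlock victim) before succeeding, and all names are hypothetical.

```java
import java.util.concurrent.Callable;

public class RetryOnConflict {
    // Retry an operation (standing in for session.save()) a few times when it
    // fails, e.g., because its transaction was chosen as the deadlock victim
    // and rolled back. Backs off briefly between attempts.
    static <T> T withRetry(Callable<T> op, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                Thread.sleep(20L * attempt);  // simple linear backoff
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new IllegalStateException("transaction rolled back");
            return "saved";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Running this prints `saved after 3 attempts`.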

It’s nice to know that most of the time, applications will not have any contention. And when there is contention for concurrent writes to the same areas, ModeShape does the logical thing by serializing the transactions. (Isn’t ACID behavior nice?!)

But even after all this, you may find that your applications are still highly contentious while trying to concurrently update the same nodes. In these cases, you have several options:

  1. Can you initialize the highly-contentious area when the database is created? If so, then the different transactions will update different areas of the database.
  2. Can you alter the hierarchical design of your database to eliminate the contention? Consider if your hierarchy would improve by adding one or more time-based levels. Or consider inserting a level for different contexts (e.g., users, groups, customers, etc.).
  3. Can you centralize where/how your application is updating these areas? For example, a hierarchy that includes a level for users might have contention when adding users. Try centralizing the process of adding users. (Queues often work great for these kinds of patterns.)
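The queue idea in item 3 can be sketched with plain java.util.concurrent types (the println stands in for creating the user node and calling session.save()); because a single writer thread performs every update, only one transaction ever touches the contended parent node:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class CentralizedWriter {
    private static final String STOP = "STOP";  // poison pill to end the loop

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Thread writer = new Thread(() -> {
            try {
                for (;;) {
                    String user = queue.take();
                    if (user.equals(STOP)) break;
                    // a real app would add the user node here and session.save()
                    System.out.println("added " + user);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();
        queue.put("alice");   // any thread can enqueue a request ...
        queue.put("bob");
        queue.put(STOP);      // ... but only the writer thread applies them
        writer.join();
    }
}
```

Running this prints `added alice` and then `added bob`, in submission order.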

By the way, how does ModeShape compare to other hierarchical data stores? Really well, actually. One of the more popular JCR implementations uses a single, cluster-wide, global write lock that guarantees that only one write will proceed at a time. Yikes.

Filed under: features, jcr, techniques

New repository backup and restore in ModeShape 3

We recently added a new feature to ModeShape 3.0.0.Beta3 that enables repository administrators to create backups of an entire repository (even when the repository is in use), and to then restore a repository to the state reflected by a particular backup. This works regardless of where the repository content is persisted.

There are several reasons why you might want to restore a repository to a previous state, and many are quite obvious. For example, the application or the process it’s running in might stop unexpectedly. Or perhaps the hardware on which the process is running might fail. Or perhaps the persistent store might have a catastrophic failure (although surely you’re also using the persistent store’s backup system, too).

But there are also non-failure related reasons. Backups of a running repository can be used to transfer the content to a new repository that is perhaps hosted in a different location. It might be possible to manually transfer the persisted content (e.g., in a database or on the file system), but the process of doing so varies across the different kinds of persistence options. Also, ModeShape can be configured to use a distributed in-memory data grid that already maintains its own copies for ensuring high availability, and therefore the data grid might not persist anything to disk. In such cases, the content is stored on the data grid’s virtual heap, and getting access to it without ModeShape may be quite difficult. Or, you may initially configure your repository to use a particular persistence approach that is suitable for the current needs, but over time the repository grows and you want to move to a different, more scalable (but perhaps more complex) persistence approach. Finally, the backup and restore feature can be used to migrate to a new major version of ModeShape.

In short, you may very well have the need to set the contents of a repository back to an earlier state. ModeShape’s backup and restore feature makes this easy to do.

Getting started

Let’s walk through the basic process of creating a backup of an existing repository and then restoring the repository. Both of these steps require an authenticated Session that has administrative privileges. It actually doesn’t matter which workspace the session uses:

javax.jcr.Repository repository = ...
javax.jcr.Credentials credentials = ...
String workspaceName = ...
javax.jcr.Session session = repository.login(credentials, workspaceName);

So far, this is basic and standard stuff for any JCR client.

Introducing the RepositoryManager

Each JCR Session instance has its own Workspace object that provides workspace-level functionality and access to a set of “manager” interfaces: the VersionManager, NodeTypeManager, ObservationManager, LockManager, etc. The JSR-333 (aka, “JCR 2.1”) effort is still incomplete, but has plans to introduce a RepositoryManager that offers some repository-level functionality. The ModeShape public API has already defined such an interface, and accessing it from a standard JCR Session instance is pretty simple:

org.modeshape.jcr.api.Session msSession = (org.modeshape.jcr.api.Session) session;
org.modeshape.jcr.api.RepositoryManager repoMgr = msSession.getWorkspace().getRepositoryManager();

The interface is pretty self-explanatory, and defines several methods including two that are related to the backup and restore feature:

public interface RepositoryManager {

    /**
     * Begin a backup operation of the entire repository, writing the files
     * associated with the backup to the specified directory on the local
     * file system.
     * <p>
     * The repository must be active when this operation is invoked, and
     * it can continue to be used during backup (e.g., this can be a
     * "live" backup operation), but this is not recommended if the backup
     * will be used as part of a migration to a different version of
     * ModeShape or to a different installation.
     * <p>
     * Multiple backup operations can operate at the same time, so it is
     * the responsibility of the caller to not overload the repository
     * with backup operations.
     *
     * @param backupDirectory the directory on the local file system into
     *        which all backup files will be written; this directory
     *        need not exist, but the process must have write privilege
     *        for this directory
     * @return the problems that occurred during the backup operation
     * @throws AccessDeniedException if the current session does not
     *         have sufficient privileges to perform the backup
     * @throws RepositoryException if the backup cannot be run
     */
    Problems backupRepository( File backupDirectory ) throws RepositoryException;

    /**
     * Begin a restore operation of the entire repository, reading the
     * backup files in the specified directory on the local file system.
     * Upon completion of the restore operation, the repository will be
     * restarted automatically.
     * <p>
     * The repository must be active when this operation is invoked.
     * However, the repository <em>may not</em> be used by any other
     * activities during the restore operation; doing so will likely
     * result in a corrupt repository.
     * <p>
     * It is the responsibility of the caller to ensure that this method
     * is only invoked once; calling it multiple times will lead to
     * a corrupt repository.
     *
     * @param backupDirectory the directory on the local file system
     *        in which all backup files exist and were written by a
     *        previous {@link #backupRepository(File) backup operation};
     *        this directory must exist, and the process must have read
     *        privilege for all contents in this directory
     * @return the problems that occurred during the restore operation
     * @throws AccessDeniedException if the current session does not
     *         have sufficient privileges to perform the restore
     * @throws RepositoryException if the restoration cannot be run
     */
    Problems restoreRepository( File backupDirectory ) throws RepositoryException;
}

Next, we’ll take a look at each of these two methods.

Creating a backup

The backupRepository(...) method on ModeShape’s RepositoryManager interface is used to create a backup of the entire repository, including all workspaces that existed when the backup was initiated. This method blocks until the backup is completed, so it is the caller’s responsibility to invoke the method asynchronously if that is desired. When this method is called on a repository that is being actively used, all of the changes made while the backup process is underway will be included; at some point near the end of the backup process, however, additional changes will be excluded from the backup. This means that each backup contains a fully-consistent snapshot of the entire repository as it existed near the time at which the backup completed.
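For example, since the method blocks, one way to make the backup asynchronous is to submit it to an executor. This is only a sketch: the Callable below stands in for the actual repoMgr.backupRepository(backupDirectory) call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncBackup {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        // Submit the blocking backup call; in a real application this would be:
        //   executor.submit(() -> repoMgr.backupRepository(backupDirectory));
        Future<String> backup = executor.submit(() -> {
            // ... long-running backup work happens here ...
            return "backup complete";
        });
        // The application can keep using the repository while the backup runs ...
        System.out.println(backup.get());  // block only when the result is needed
        executor.shutdown();
    }
}
```

Running this prints `backup complete` once the submitted task finishes.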

Here’s a code example showing how easy it is to call this method:

org.modeshape.jcr.api.RepositoryManager repoMgr = ...
java.io.File backupDirectory = ...
Problems problems = repoMgr.backupRepository(backupDirectory);
if ( problems.hasProblems() ) {
    System.out.println("Problems backing up the repository:");
    // Report the problems (we'll just print them out) ...
    for ( Problem problem : problems ) {
        System.out.println(problem);
    }
} else {
    System.out.println("The backup was successful");
}

Each ModeShape backup is stored on the file system in a directory that contains a series of GZIP-ed files (each containing representations of approximately 100K nodes) and a subdirectory in which all the large BINARY values are stored.

It is also the application’s responsibility to initiate each backup operation. In other words, there currently is no way to configure ModeShape to perform backups on a schedule. Doing so would add significant complexity to ModeShape and the configuration, whereas leaving it to the application lets the application fully control how and when such backups occur.
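If you do want periodic backups, that scheduling can live entirely in your application, for instance with a ScheduledExecutorService. In this sketch the Runnable is a stand-in for the actual repoMgr.backupRepository(...) call, and the latch exists only so the demo can observe the first run:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class BackupScheduler {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch firstRun = new CountDownLatch(1);
        Runnable backupTask = () -> {
            // a real task would call repoMgr.backupRepository(someDirectory)
            System.out.println("backup started");
            firstRun.countDown();
        };
        // Run the first backup immediately, then once every 24 hours.
        ScheduledFuture<?> handle =
                scheduler.scheduleAtFixedRate(backupTask, 0, 24, TimeUnit.HOURS);
        firstRun.await();      // demo only: wait until the first backup fires
        handle.cancel(false);  // on shutdown: stop scheduling further backups
        scheduler.shutdown();
    }
}
```

Running this prints `backup started` when the first scheduled run fires.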

Restoring a repository

Once you have a complete backup on disk, you can then restore a repository back to the state captured within the backup. To do that, simply start a repository (or perhaps a new instance of a repository with a different configuration) and, before it’s used by any applications, load into the new repository all of the content in the backup. Here’s a simple code example that shows how this is done:


org.modeshape.jcr.api.RepositoryManager repoMgr = ...
java.io.File backupDirectory = ...
Problems problems = repoMgr.restoreRepository(backupDirectory);
if ( problems.hasProblems() ) {
    System.out.println("Problems restoring the repository:");
    // Report the problems (we'll just print them out) ...
    for ( Problem problem : problems ) {
        System.out.println(problem);
    }
} else {
    System.out.println("The restoration was successful");
}

Once a restore succeeds, the newly-restored repository will be restarted and will be ready to be used.

Migrating from ModeShape 2.8 to 3.0

Earlier I mentioned that backup and restore can be used to migrate from one version of ModeShape to the next major version of ModeShape. This is how we plan to support migrating from a ModeShape 2.8 repository instance to a new ModeShape 3.0 instance. We plan to cut one more release of ModeShape 2, which we’ll christen 2.8.4.Final, and that will include a utility that will create a 3.0-compatible backup of the ModeShape 2.8 instance. Then, simply use the “restoreRepository” method on the new (and empty) ModeShape 3.0 repository to load all the backed-up content.

Questions or feedback

This feature is still relatively new and was introduced in ModeShape 3.0.0.Beta3, and we’d love to get your feedback on our forums before we freeze the public API and cut the 3.0.0.Final release.

Filed under: features, jcr, repository, techniques, tools

New disk storage option for ModeShape

We’re introducing a new feature that allows ModeShape to store content directly on disk using the native file system. It’s called the Disk Connector, and is capable of storing any content that applications can put into a repository. It’s already in the ‘master’ branch and will be in the upcoming 2.6.0.Beta1 release of ModeShape. (If you want to give it a try before the release, grab the latest from our repository, run a local build to install it into your local Maven repository, and use the ‘2.6-SNAPSHOT’ version in your application’s POM file.)

So now ModeShape offers five connectors that can store all valid JCR content (including ‘mix:referenceable’ and ‘mix:versionable’ nodes, REFERENCE properties, version histories, etc.) and can also find nodes by identifier. We’ve designed all these connectors to own their data, meaning other applications should not directly access the underlying storage system. But any one of these is a great fit for most applications:

  • JPA Connector – stores all content in one of the 17 relational DBMS systems supported by Hibernate, including DB2, Oracle, MySQL, PostgreSQL, and SQL Server (to name a few)
  • Infinispan Connector – stores all content in a fast, scalable, distributed, and fault-tolerant Infinispan data grid
  • JBoss Cache Connector – stores all content in a JBoss Cache instance, and useful for small-to-medium sized repositories when Infinispan is not available
  • In-memory Connector – stores all content in-memory, and is fast and useful for small transient repositories or when importing XML and using JCR to read and search the content
  • Disk Connector – stores all content on disk in a binary format defined by ModeShape

ModeShape also offers other connectors that enable accessing the information in external systems, even when other applications use those same systems:

  • File System Connector – reads and writes ‘nt:file’, ‘nt:folder’ and ‘nt:resource’ nodes on the native file system using regular files and directories, mapping the properties defined by these node types to the actual file and directory attributes, and storing extra properties added to nodes via mixins in UTF-8 files (BINARY properties stored encoded in hexadecimal) that your applications can even read
  • JCR Connector – reads and writes content into an external JCR repository, and is useful when migrating from other JCR implementations or when federating existing JCR repositories into a single repository
  • Subversion Connector – reads and writes ‘nt:file’, ‘nt:folder’ and ‘nt:resource’ nodes as files and directories in a SVN repository; unlike the File System Connector, this only supports the standard properties defined on the ‘nt:file’, ‘nt:folder’, and ‘nt:resource’ node types
  • JDBC Metadata Connector – a read-only connector that maps the JDBC metadata into nodes representing the databases, catalogs, schemas, tables, columns, procedures, and other metadata information, and is very useful if you want to have a JCR repository that contains an accurate schema representation of one or more databases

Filed under: features, jcr, news, techniques

Finding a JCR repository

Updated 6/21/2011: Added section describing the Seam JCR module
Updated 6/23/2011: Added more detail about the JNDI location when ModeShape is deployed to JBoss AS

Okay, you’re using JCR in your application, and you’re writing all of your code to the JCR API. That’s great, because your application doesn’t have any implementation-specific calls, and you can rely only upon the “javax.jcr” packages.

“But,” you ask, “how do I get a reference to the javax.jcr.Repository instance without using implementation-specific code in my app?”

If you’re using JCR 1.0, you’re basically out of luck. The spec didn’t specify how to do that, and so the implementations all do it differently.

But thankfully JCR 2.0 introduced the javax.jcr.RepositoryFactory interface and described how to use the Java SE Service Locator pattern to get that initial reference to your repository instance without any implementation-specific code. Here’s how that works.

Using the JCR 2.0 RepositoryFactory

Your application will have one (or more) JCR implementations on the classpath, and per JCR 2.0 they will each provide their own RepositoryFactory implementations and manifest entries so that the JVM can find them. Your application can find them by using the Service Locator pattern:

Map parameters = ...
Repository repository = null;
for (RepositoryFactory factory : ServiceLoader.load(RepositoryFactory.class)) {
  repository = factory.getRepository(parameters);
  if (repository != null) break;
}

This basically iterates over all of the RepositoryFactory implementations, and for each one asks that factory to return the JCR Repository instance given the map of parameters. Per JCR 2.0, if the RepositoryFactory understands the parameters, it will return a Repository instance; otherwise, it will return null. Now, each JCR implementation is allowed to define its own parameters, so these definitely are still implementation-specific. But since they’re just properties, your application can remain independent of the JCR implementation by simply loading them from a file:

Properties parameters = new Properties();
// Read from a file or from other input streams or readers ...
parameters.load(new FileInputStream(file));
// Find the Repository instance ...
Repository repository = null;
for (RepositoryFactory factory : ServiceLoader.load(RepositoryFactory.class)) {
  repository = factory.getRepository(parameters);
  if (repository != null) break;
}

Look, Ma! No implementation-specific code!

ModeShape parameters for RepositoryFactory

So what parameters does ModeShape expect? Just one, named “org.modeshape.jcr.URL”.

If the value of this parameter is a URL that resolves to a ModeShape configuration file, the factory will actually start up a new ModeShape engine using that configuration file, and will look for the repository named in the URL. For example, a URL such as:

file:config/configRepository.xml?repositoryName=MyRepository

will look for a ModeShape configuration file named “configRepository.xml” that is in the “config” directory relative to where the JVM was started, and will return the repository defined in the configuration file with the name “MyRepository”. (Remember that a single ModeShape engine can host multiple JCR repositories.) Other URLs are possible, as long as they can be resolved to the configuration file.

If the value of the “org.modeshape.jcr.URL” parameter is a URL that begins with “jndi:”, then the ModeShape factory will attempt to look for a ModeShape engine instance registered in JNDI, and will ask that engine for the named repository. For example, the URL:

jndi:name/in/jndi?repositoryName=MyRepository

will look in JNDI for a ModeShape engine at “name/in/jndi”, and will ask it for the repository named “MyRepository”.

The JNDI form is what you’ll use if you’ve deployed ModeShape to JBoss AS and your applications need to access the repositories. ModeShape runs as a service within JBoss AS, so when the app server is started ModeShape will automatically register the engine in JNDI at “jcr/local”. If you’ve not changed the configuration, there will be a repository called “repository” (with a default workspace called “default”, though you can create other workspaces using the JCR API), and you can use the following URL for the “org.modeshape.jcr.URL” parameter:

jndi:jcr/local?repositoryName=repository
Of course, you probably want to change the configuration to add other repositories or to control where and how the repositories store the content (by default it is stored in-memory). If you add repositories or change the name of the repository, you’ll need to change the URL accordingly.

Injecting JCR Repositories

If you’re building an application that uses CDI, there’s another option for getting a hold of your Repository instance. The Seam JCR project is a portable extension to CDI that provides annotations for automatically injecting a javax.jcr.Repository object into your application, and Seam JCR works with both ModeShape and Jackrabbit. Ensure that Seam JCR and your JCR implementation are on the classpath, and then use annotations to provide the same parameters normally supplied to the RepositoryFactory. Here’s an example of injecting ModeShape with the same “file:” URL used above:

  @Inject
  @JcrConfiguration(name="org.modeshape.jcr.URL",
                    value="file:config/configRepository.xml?repositoryName=MyRepository")
  Repository repository;

Seam JCR also makes it easy to inject a JCR Session into your application:

  @Inject
  @JcrConfiguration(name="org.modeshape.jcr.URL",
                    value="file:config/configRepository.xml?repositoryName=MyRepository")
  Session session;

This code will obtain a Session using the default workspace and no credentials, but the Seam JCR team is working on supporting Credentials and workspace names.

Of course, Seam JCR also works with Jackrabbit, but uses Jackrabbit-specific parameters. For more details, see the Seam JCR site.

Filed under: features, jcr, repository, techniques

ModeShape moves to Git

The ModeShape project’s official source code repository is now at GitHub:

We’re adopting the Fork+Pull method of development. The basic idea is that you first fork the “official” ModeShape repository on GitHub. Then, you do all your development locally, push your proposed changes into your fork, and generate a pull-request describing your proposed changes. The ModeShape committers will review and discuss your changes and pull them into the “official” repository (using this process).

For details on this process, see our ModeShape Development Workflow article. We’ve started a discussion thread for any questions or lessons learned. We’ll hopefully improve the documentation in the coming days and weeks. And to learn more about Git, we recommend the following resources:

Filed under: news, open source, techniques, tools

ModeShape is

a lightweight, fast, pluggable, open-source JCR repository that federates and unifies content from multiple systems, including file systems, databases, data grids, other repositories, etc.

Use the JCR API to access the information you already have, or use it like a conventional JCR system (just with more ways to persist your content).

ModeShape used to be 'JBoss DNA'. It's the same project, same community, same license, and same software.