Update: changed the Scalability section to clarify the scope of the term.
When it comes down to it, ModeShape is a database. But there are lots of kinds of databases, and it’s always very important to choose a database that fits your application’s needs. Here are some of the characteristics that distinguish ModeShape from other kinds of databases, which should help you decide whether ModeShape is a good fit for your use cases.
ModeShape is strongly consistent and adheres to the ACID principles, meaning that all operations are atomic, consistent, isolated, and durable. Your applications create Sessions to interact with the information stored in a repository workspace. Each session sees the latest persisted information, even as other applications (or parts of your application) are persisting changes through their own sessions. Your session can make changes, which are overlaid upon the latest persisted information, but only your session sees these changes until you save your session and the changes are persisted. Internally, ModeShape uses a transaction to make sure that all the session’s changes are made (or none of them are), that the changes are consistent, that they are seen by other sessions only once they are completed, and that they are durable.
What this means for you is that it’s very easy to develop and write applications, and doing so is in many ways very similar to how you’ve worked with other ACID systems (like relational databases) in the past. You can even use JTA transactions so that the changes are persisted only upon transaction commit. And your application is written the same way whether ModeShape is clustered or non-clustered.
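To make the session behavior described above a bit more concrete, here is a minimal sketch in plain Java. It is a toy model and not ModeShape or JCR code: `ToyWorkspace`, `ToySession` and the path `/cars/prius` are made-up names, and the sketch ignores removals, conflict detection and durability. It only shows the overlay idea, where a session sees its own unsaved changes on top of the latest persisted state and other sessions see those changes only after `save()`.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of session isolation; none of these names are ModeShape or JCR APIs. */
class SessionOverlaySketch {

    /** The shared, persisted state of a hypothetical workspace: path -> value. */
    static class ToyWorkspace {
        private final Map<String, String> persisted = new HashMap<>();

        synchronized String read(String path) {
            return persisted.get(path);
        }

        /** Applies a whole batch of changes in one step, so readers never see half of it. */
        synchronized void applyAll(Map<String, String> changes) {
            persisted.putAll(changes);
        }
    }

    /** A session overlays its own unsaved changes on top of the persisted state. */
    static class ToySession {
        private final ToyWorkspace workspace;
        private final Map<String, String> pending = new HashMap<>();

        ToySession(ToyWorkspace workspace) {
            this.workspace = workspace;
        }

        void set(String path, String value) {
            pending.put(path, value);
        }

        String get(String path) {
            // Unsaved changes win; otherwise fall through to the latest persisted value.
            return pending.containsKey(path) ? pending.get(path) : workspace.read(path);
        }

        void save() {
            workspace.applyAll(pending);
            pending.clear();
        }
    }

    public static void main(String[] args) {
        ToyWorkspace workspace = new ToyWorkspace();
        ToySession mine = new ToySession(workspace);
        ToySession theirs = new ToySession(workspace);

        mine.set("/cars/prius", "Toyota");
        System.out.println(mine.get("/cars/prius"));   // Toyota
        System.out.println(theirs.get("/cars/prius")); // null
        mine.save();
        System.out.println(theirs.get("/cars/prius")); // Toyota
    }
}
```

A real repository does far more when saving (validation, transactions, persistence), but the visibility rules your application sees follow this shape.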
In the last few years, eventually consistent databases have become very popular, due in part to the increasingly popular goal of creating very large (distributed) databases. When a change is made to an eventually consistent database, that change is not immediately propagated to other processes, but eventually (after a period of time when no changes are made) the database will become consistent. This means that right after one client makes a change, there is no guarantee when other clients will see that change, and yet those other clients can still change the data that they see. The result is that there can be multiple “versions” of the data, and although the database may attempt to resolve these conflicts, it often can only do this for relatively simple conflicts. Ultimately, your application will likely have to deal with the conflict. Additionally, many eventually consistent databases suggest specific usage patterns to make such conflicts less likely, but those usage patterns are often more complicated than what you’re used to. There absolutely are use cases where eventually consistent databases are perfect fits, but there are also lots of use cases and applications for which they are entirely unsuitable.
(Note that the next generation of Apache Jackrabbit, codenamed “Oak”, will be eventually consistent. To do that, they are not going to support all the JCR features. When your application saves its session, any conflicts that arise and that can’t be automatically handled will result in exceptions. Their expectation is that your application should then try again to recreate the changes, and that in the worst case your application may have to explicitly resolve the conflicts.)
ModeShape stores data in a tree of nodes and properties, where you have full control over the design of that tree structure. At the top of the tree is a single root node, and every node can contain multiple child nodes. Every node has a name and a unique identifier, and can also be identified by a path containing the names of all its ancestors, from the root down to the node itself. Names consist of a namespace and a local part, and there is a namespace registry to centralize short prefixes for each namespace.
You can see that this looks very similar to how a file system is laid out. You already know how to organize a file system, and organizing a ModeShape repository is very similar. In fact, lots of data already has implicit hierarchical structure. Consider that URLs are essentially addresses into a website’s hierarchy. And hierarchical data is easy to use: simply navigate the nodes. It’s also often more efficient to navigate, since related data is very close by.
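If it helps to see the file-system analogy in code, here is a small sketch of a tree of named nodes in plain Java. It is an illustration only: `ToyNode` is a hypothetical class rather than the JCR `Node` interface, the node names `veh:vehicles` and `veh:prius` are invented, and details such as unique identifiers, properties and same-name siblings are left out. It shows how a path is just the chain of names down to a node, and how navigating means stepping from child to child.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy tree of named nodes; hypothetical classes, not the JCR Node interface. */
class NodeTreeSketch {

    static class ToyNode {
        final String name;
        final ToyNode parent;            // null only for the root
        final List<ToyNode> children = new ArrayList<>();

        ToyNode(String name, ToyNode parent) {
            this.name = name;
            this.parent = parent;
        }

        ToyNode addChild(String childName) {
            ToyNode child = new ToyNode(childName, this);
            children.add(child);
            return child;
        }

        /** Builds the path by walking up to the root, e.g. "/veh:vehicles/veh:prius". */
        String path() {
            if (parent == null) return "/";
            String parentPath = parent.path();
            return parentPath.equals("/") ? "/" + name : parentPath + "/" + name;
        }

        /** Resolves a relative path such as "a/b" by navigating child by child. */
        ToyNode find(String relativePath) {
            ToyNode current = this;
            for (String segment : relativePath.split("/")) {
                ToyNode next = null;
                for (ToyNode child : current.children) {
                    if (child.name.equals(segment)) {
                        next = child;
                        break;
                    }
                }
                if (next == null) return null; // no child with that name
                current = next;
            }
            return current;
        }
    }

    public static void main(String[] args) {
        ToyNode root = new ToyNode("", null);
        ToyNode prius = root.addChild("veh:vehicles").addChild("veh:prius");
        System.out.println(prius.path());                                 // /veh:vehicles/veh:prius
        System.out.println(root.find("veh:vehicles/veh:prius") == prius); // true
    }
}
```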
Scalable and highly available
ModeShape repositories can be small and embedded into Java applications, or they can be very large and distributed across a cluster of machines. You can even decide how (and if) ModeShape should persist your data, ranging from keeping data only in-memory, to storing data on the local file system, to storing data in a relational database, to leveraging the performance, scalability, and durability of an in-memory and elastic data grid. It may seem counter-intuitive, but storing your data in RAM is extremely fast and can still be safe, as long as multiple copies of all your data are stored across multiple machines, and as long as machines can be added and removed while the data is automatically and elastically redistributed. This is exactly what a data grid can achieve, and this is how ModeShape can scale to very (very) large databases.
All of ModeShape’s functionality and features are built on top of Infinispan, which is a flexible, fast and highly scalable data grid. ModeShape stores each node in one or more entries in Infinispan (a small node will be stored as one entry, but larger ones are broken into multiple entries), and Infinispan is configured to replicate, distribute and persist the data.
Note that when we talk about scalability and large databases, we’re not talking about the kinds of scales that “big data” often refers to. ModeShape is not a “big data” database and doesn’t scale that big. We’re transactional, after all.
ModeShape supports a very powerful and flexible schema system, but interestingly you get to decide where and how much schema enforcement to use. At one extreme, you allow every node to contain any property and any children – this is essentially using ModeShape as a schema-less database, and it’s a perfectly valid way to use ModeShape. Your application becomes fully in control of the database structure, making it easy to evolve the structure to suit new or changing requirements.
At the other extreme, you fully define every node to fit a particular node type that constrains the properties and child nodes to fit pre-defined patterns. ModeShape ensures that all the data always adheres to the schema, and your application doesn’t have to do any validation or enforcement.
But between these two extremes is where ModeShape really becomes interesting and advantageous. You can choose which subset of nodes in your tree you want to adhere to a schema, allowing parts of the database to be more schema-less and the rest to be more constrained. But more importantly, you can dynamically expand the schema for any individual node by mixing in additional node types with more property and child patterns. For example, you can define a node type that requires a “title” property, and you can add this node type to any node that is to have a title.
ModeShape’s schema system is very powerful and flexible, and makes it far easier to constrain your data while simultaneously enabling future changes to and evolution of your database’s schema.
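Here is one way to picture the mixin idea in plain Java. Treat it as a sketch under heavy assumptions: `ToyNodeType` and `ToyNode` are hypothetical classes, the type name `ex:titled` is invented, and a node type is reduced to nothing more than a list of required properties (real node types also describe property types, child nodes, inheritance and more). The point is only that a node starts out unconstrained and becomes subject to a rule once a type is mixed in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of mixing node types into a node; not the JCR node type API. */
class MixinSketch {

    /** A node type reduced to a name plus the properties it requires. */
    static class ToyNodeType {
        final String name;
        final Set<String> requiredProperties;

        ToyNodeType(String name, String... requiredProperties) {
            this.name = name;
            this.requiredProperties = new LinkedHashSet<>(Arrays.asList(requiredProperties));
        }
    }

    /** A node with free-form properties and any number of mixed-in types. */
    static class ToyNode {
        final Map<String, String> properties = new HashMap<>();
        final List<ToyNodeType> mixins = new ArrayList<>();

        void addMixin(ToyNodeType type) {
            mixins.add(type);
        }

        /** Lists every required property that is still missing; empty means valid. */
        List<String> violations() {
            List<String> missing = new ArrayList<>();
            for (ToyNodeType type : mixins) {
                for (String property : type.requiredProperties) {
                    if (!properties.containsKey(property)) {
                        missing.add(type.name + " requires " + property);
                    }
                }
            }
            return missing;
        }
    }

    public static void main(String[] args) {
        ToyNodeType titled = new ToyNodeType("ex:titled", "title");
        ToyNode node = new ToyNode();

        node.properties.put("color", "red");
        System.out.println(node.violations()); // [] (no types mixed in, so anything goes)

        node.addMixin(titled);
        System.out.println(node.violations()); // [ex:titled requires title]

        node.properties.put("title", "My first node");
        System.out.println(node.violations()); // []
    }
}
```

In this toy the check is something you call yourself; the appeal of a repository-enforced schema is that your application doesn’t have to.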
Query and search
Navigation isn’t the only way to access your data. Your applications can also query a ModeShape repository to find the subset of content that meets application-specified criteria regardless of where in the hierarchy that data exists. ModeShape offers several query languages, including a subset of XPath, a full-text search language (much like internet search engines), and an extremely powerful SQL-like language called “JCR-SQL2”. Here’s a fairly simple example of a JCR-SQL2 query:
SELECT * FROM [veh:vehicle] AS vehicle
WHERE vehicle.[veh:make] IN ('Chevrolet', 'Toyota', 'Ford')
The queries can be much more complex and can include joins, rich criteria, subqueries, and limits/offsets. The result sets are tabular, but still allow you to access the corresponding node(s) in each row.
Of course, ModeShape evaluates each query across all of the data, even when the repository is distributed in a cluster. That means your application is written the same way, regardless of how ModeShape is configured.
ModeShape provides an event API so that your application can be notified when content changes. Your application can register listeners using a variety of criteria (e.g., “only notify me of the addition or removal of nodes in this subgraph”, or “only notify me when nodes of this type are changed”, or even “notify me of all node and property changes”, etc.), and can then respond to the events with application-specific behavior.
Again, this behavior works the same way regardless of whether ModeShape is clustered – applications see the changes made by sessions in all processes in the cluster.
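The “only notify me about this subgraph” criterion is easy to picture with a small sketch. The following plain Java is a toy, not the JCR observation API: `ToyEvent`, `ToyObservation`, the event type strings and the paths are all assumptions made for illustration. It registers listeners against a subgraph root and delivers an event only to listeners whose root is at or above the event’s path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy model of path-filtered change notifications; hypothetical names throughout. */
class EventFilterSketch {

    static class ToyEvent {
        final String type; // e.g. "NODE_ADDED"; the strings are made up for this sketch
        final String path;

        ToyEvent(String type, String path) {
            this.type = type;
            this.path = path;
        }
    }

    static class ToyObservation {
        private static class Registration {
            final String subgraphRoot;
            final Consumer<ToyEvent> listener;

            Registration(String subgraphRoot, Consumer<ToyEvent> listener) {
                this.subgraphRoot = subgraphRoot;
                this.listener = listener;
            }
        }

        private final List<Registration> registrations = new ArrayList<>();

        void register(String subgraphRoot, Consumer<ToyEvent> listener) {
            registrations.add(new Registration(subgraphRoot, listener));
        }

        /** Delivers the event to every listener whose subgraph contains the event's path. */
        void publish(ToyEvent event) {
            for (Registration registration : registrations) {
                if (isAtOrBelow(event.path, registration.subgraphRoot)) {
                    registration.listener.accept(event);
                }
            }
        }

        static boolean isAtOrBelow(String path, String root) {
            if (root.equals("/")) return true;
            // Compare whole segments so that "/carsharing" is not treated as below "/cars".
            return path.equals(root) || path.startsWith(root + "/");
        }
    }

    public static void main(String[] args) {
        ToyObservation observation = new ToyObservation();
        observation.register("/cars", e -> System.out.println(e.type + " " + e.path));

        observation.publish(new ToyEvent("NODE_ADDED", "/cars/prius"));   // prints: NODE_ADDED /cars/prius
        observation.publish(new ToyEvent("NODE_ADDED", "/carsharing/x")); // prints nothing
    }
}
```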
ModeShape includes a number of other features, too. ModeShape can automatically manage the history of a subtree of content – all that’s required is adding the “mix:versionable” mixin to the node, and then calling “checkin()”, “checkout()” and “restore()”.
Individual nodes can be locked to prevent other applications from modifying that area of the repository. Locks are intended to be short-term (e.g., scoped to a single session), though it’s possible to lock nodes for a longer duration.
Take the next step
We’ve covered a lot of topics in this post, but hopefully now you have a clear understanding of what kind of database ModeShape is and whether it is a fit for your use cases. Give it a try. ModeShape 3.0.0.Final is due out next week, but in the meantime you can get the latest candidate release.