repository | ModeShape

April 8, 2016 • 7:27 am 0

ModeShape 5’s persistence changes

Starting with ModeShape 3 in early 2012, all repository nodes were internally represented using JSON documents and stored as BSON values in Infinispan. Although we relied upon some Infinispan features, for the most part ModeShape was merely storing its data using a very basic key-value API.

As ModeShape evolved through the 3.x and 4.x versions, we started having some data persistence issues that were largely outside of our control. ModeShape could be deployed inside JBoss AS (eventually known as WildFly), so we chose our version of Infinispan based upon the version that was shipped with JBoss AS. Unfortunately, when we found bugs in Infinispan, they would be fixed in releases that were not yet included in JBoss AS, meaning we couldn’t get the fixes for quite some time. Using Infinispan also made the repository configuration and internals quite complex. Plus, changes in Infinispan’s persistence stores sometimes meant that persisted data could not be read by newer versions of Infinispan.

But most importantly, in certain situations we saw data corruption render a repository’s content largely unusable. This is a complicated issue that we previously outlined in detail in this forum post and this issue.

Therefore, our primary goal with ModeShape 5 was to make sure that the repository data is stored in a more durable and strongly consistent manner that avoided the aforementioned corruption issues. This meant that we had to take a more conservative approach to persistence and give up claims of high scalability and performance (which are fine with eventual consistency, but not with strong consistency which is a must-have for ModeShape).

Already having the design of storing BSON documents in a key-value store helped us a lot, since it meant we only had to come up with transactional, strongly consistent, key-value store alternatives.
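To make that requirement concrete, here is a minimal sketch of the kind of contract such a store has to satisfy: documents are addressed by key, and a group of changes is applied completely or not at all. The class and method names (`TransactionalKeyValueSketch`, `runInTransaction`, `Tx`) are hypothetical and are not ModeShape’s internal API; plain byte arrays stand in for BSON documents, and a single lock stands in for real transaction isolation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical sketch, not ModeShape's SPI: an all-or-nothing key-value store. */
final class TransactionalKeyValueSketch {
    private final Map<String, byte[]> committed = new HashMap<>();

    /** Write view handed to a transaction; changes are buffered until commit. */
    final class Tx {
        private final Map<String, byte[]> writes = new HashMap<>();
        private final Set<String> removals = new HashSet<>();

        void put(String key, byte[] document) {
            removals.remove(key);
            writes.put(key, document.clone());
        }

        void remove(String key) {
            writes.remove(key);
            removals.add(key);
        }

        /** Reads see this transaction's own pending changes first. */
        byte[] get(String key) {
            if (removals.contains(key)) return null;
            byte[] pending = writes.get(key);
            return pending != null ? pending : committed.get(key);
        }
    }

    /** Applies every change made by the work, or none of them if it throws. */
    synchronized void runInTransaction(Consumer<Tx> work) {
        Tx tx = new Tx();
        work.accept(tx); // an exception here discards the buffered changes
        for (String key : tx.removals) {
            committed.remove(key);
        }
        committed.putAll(tx.writes);
    }

    synchronized byte[] get(String key) {
        return committed.get(key);
    }
}
```

Because every transaction runs under the same lock, readers never observe a half-applied change. A real store would obtain the same guarantee from the database or storage engine underneath rather than from a JVM lock.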

ModeShape 5’s initial release comes with three such stores out of the box.

1. RDBMS store

This was the obvious choice, since relational databases provide strong consistency guarantees with good transactional support, at least with the READ_COMMITTED isolation level that ModeShape requires. Enterprise users still trust and use relational databases a great deal. Using a relational database store means users can still cluster multiple repositories together, as long as all those repositories use the same shared database.

ModeShape 5 comes out-of-the-box with support for H2, Oracle, MySQL and PostgreSQL. Repository data is persisted in the form of BLOBs using the same internal BSON format we’ve used since ModeShape 3. We’ve also designed the store so that, in the future, we can add specialized storage types that take advantage of the capabilities of different databases (for example PostgreSQL’s JSONB in 9.5 and above).

Configuration is again much simpler than ModeShape 3 and 4 with an equivalent store, as can be seen from the documentation.

2. File system store

This store uses an embedded H2 database to persist information on the local disk. Internally we use its very nice MVStore API, the lower-level key-value engine used within H2’s normal relational and SQL engine. It provides good transactional support and stores/streams binary objects (like our BSON documents) with optional features like compression and encryption.

For users who don’t want to store the repository data in an RDBMS and who also aren’t interested in clustering, this should be the default go-to store in ModeShape 5.

Configuring such a store is trivial and doesn’t require any additional configuration files (see our documentation for examples).

3. Transient in-memory key-value store

This is the default store when nothing is explicitly configured, and data is only persisted in-memory and is lost as soon as the process stops. Therefore, this is not suitable for production but is a very simple and natural option for testing and exploration. Internally, it uses H2’s MVStore API without persistence.

Other key-value stores

We can add support for other key-value stores in the future, provided they:

are strongly consistent;
support ACID transactions; and
run on Java 8 or above.

We’re also happy to hear any suggestions or to evaluate any contributions from our community members.

Performance

Our preliminary tests indicate that all the above stores perform at least as well as their previous Infinispan counterparts in local, non-clustered modes. In fact, they should perform better in write-intensive cases while probably performing slightly slower for read-intensive cases, since Infinispan always had an in-memory cache layer on top of every store.

Which should you use?

We recommend using the file store for non-clustered cases. It’s simple, fast, and doesn’t require an external process. A second option to consider is the JDBC store with an embedded H2 database.

When clustering, however, the only suitable option is the relational store with a shared JDBC store. As outlined above and as mentioned in the documentation, strict serializability required by the JCR API comes at a cost: all cluster members must coordinate their operations and use a shared persistent store. To help provide this coordination and to avoid write-contention on the same nodes, ModeShape employs global cluster locking (via JGroups) to ensure nodes can only be modified by one cluster member at a time. We believe that this is the only way in which we can ensure the JCR consistency requirements when running in a cluster.

Filed under: features, performance, releases, repository, uncategorized, easre

January 31, 2013 • 5:12 pm 0

Structured, unstructured, and everything between

Shane gives a good breakdown of the various ways to classify data as structured or unstructured. He points out that very often data is a mixture of both structured and unstructured data, and he gives several examples.

What I find so interesting about this, however, is how well ModeShape can handle these varieties of data.

ModeShape handles structured data really well. Most data structures are very easily mapped to the nodes and properties that ModeShape uses. And when those nodes also say which node types apply to them, ModeShape can enforce the node structure by validating it against those built-in and/or custom node types and prevent invalid data from being stored.

The other end of the spectrum is unstructured data, and ModeShape handles that beautifully, too. You can store unstructured data in a property using a string value or a binary value. Typically you would use a string value when the data is some form of text, and a binary value in any other cases (or when you don’t want to treat it as text).

But the best part is that ModeShape naturally handles combinations of structured and unstructured data. Recall that ModeShape is a hierarchical database, which means that each database consists of a single tree of nodes, and each node has one or more properties. That hierarchy is by definition structured, though it’s up to you whether ModeShape validates and enforces that structure using node types. But the leaves of that tree, that is, the properties and their values, are typically unstructured (though property values like dates and even some string values could be considered structured).

ModeShape’s query languages can also deal with both structured and unstructured data. Relationships between nodes, specific properties defined by node types, and the definitions of those properties all are addressable within the query language. But ModeShape queries can include full-text search constraints on both string and binary property values!

ModeShape can search those binary values when it can extract text using the Tika library, which supports many formats, including PDF, Microsoft Office™, RTF, HTML, and many others.

There’s one more way that ModeShape can deal with unstructured data: it can sequence unstructured data (string and binary property values) using built-in or custom sequencers to extract structure and save it as more nodes and properties in the repository. This is ideal for getting at that unstructured data that has the implicit structure defined by the format. For example, if an image is loaded into the repository, ModeShape’s image sequencer can extract the EXIF data in the image (e.g., ISO setting, focal length, aperture, shutter, geo-location, etc.) and save it as properties in the repository. ModeShape has a number of built-in sequencers that can extract this implicit structure from a variety of file formats:

DDL files
images (JPEG, GIF, BMP, PCX, PNG, IFF, RAS, PBM, PGM, PPM and PSD)
audio (MP3)
comma-separated and delimited text files
Java source and class files
Microsoft Office™
ZIP archives
XML
XML Schema
WSDL

In summary, ModeShape deals very naturally and easily with data that is part unstructured and part structured. What else could you want?

Filed under: features, repository, techniques

October 18, 2012 • 12:24 pm 0

When is ModeShape a good fit?

Update: changed the Scalability section to make more clear the scope of the term.

When it comes down to it, ModeShape is a database. But there are lots of kinds of databases, and it’s always very important to choose a database that fits your application’s needs. Here are some of the characteristics that distinguish ModeShape from other kinds of databases, which should help you decide whether ModeShape is a good fit for your use cases.

Strongly consistent

ModeShape is strongly-consistent and adheres to the ACID principles, meaning that all operations are atomic, consistent, isolated, and durable. Your applications create Sessions to interact with the information stored in a repository workspace. Each session sees the latest persisted information, even as other applications (or parts of your application) are persisting changes through their own sessions. Your session can make changes, which are overlaid upon the latest persisted information, but only your session sees these changes until you save your session and the changes are persisted. Internally, ModeShape uses a transaction to make sure that all the session’s changes are made (or none of them are), that the changes are consistent, are seen by other sessions only when the changes are completed, and that the changes are durable.

What this means for you is that it’s very easy to develop and write applications, and in many ways is very similar to how you’ve worked with other ACID systems (like relational databases) in the past. You can even use JTA transactions so that the changes are persisted only upon transaction commit. And your application is written the same way whether ModeShape is clustered or non-clustered.
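The visibility rules described above can be sketched in a few lines of plain Java. This is only an illustration under simplifying assumptions: `SessionOverlaySketch`, its nested `Workspace` and `Session` classes, and the string-valued properties are all hypothetical stand-ins, not the JCR or ModeShape classes of the same names, and a lock takes the place of a real transaction.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of session visibility; not ModeShape's implementation. */
final class SessionOverlaySketch {

    /** Persisted state shared by every session of a workspace. */
    static final class Workspace {
        private final Map<String, String> persisted = new HashMap<>();

        Session login() {
            return new Session(this);
        }
    }

    static final class Session {
        private final Workspace workspace;
        private final Map<String, String> transientChanges = new HashMap<>();

        Session(Workspace workspace) {
            this.workspace = workspace;
        }

        /** Records a change that only this session can see until save(). */
        void setProperty(String path, String value) {
            transientChanges.put(path, value);
        }

        /** Transient changes win; otherwise the latest persisted value is returned. */
        String getProperty(String path) {
            String pending = transientChanges.get(path);
            if (pending != null) return pending;
            synchronized (workspace) {
                return workspace.persisted.get(path);
            }
        }

        /** Publishes all pending changes under one lock, then clears them. */
        void save() {
            synchronized (workspace) {
                workspace.persisted.putAll(transientChanges);
            }
            transientChanges.clear();
        }
    }
}
```

Two sessions opened on the same workspace behave as the text describes: a value set through one session is invisible to the other until `save()` is called, after which the other session reads it immediately.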

In the last few years, eventually-consistent databases have become very popular, due in part to the increasingly popular goal of creating very large (distributed) databases. When a change is made to an eventually consistent database, that change is not immediately propagated to other processes, but eventually (after a period of time when no changes are made) the database will become consistent. This means that right after one client makes a change, there is no guarantee when other clients will see those changes, and yet those other clients can change the data that they see. The result is that there can be multiple “versions” of the data, and although the database may attempt to resolve these conflicts, it often can only do this for relatively simple conflicts. Ultimately, your application will likely have to deal with the conflict. Additionally, many eventually consistent databases suggest specific usage patterns to make such conflicts less likely, but those usage patterns are often more complicated than you’re used to using. There absolutely are use cases where eventually consistent databases are perfect fits, but there are also lots of use cases and applications that are perfectly unsuitable for eventually consistent databases.

(Note that the next generation of Apache Jackrabbit, codenamed “Oak”, will be eventually consistent. To do that, they are not going to support all the JCR features. When your application saves its session, any conflicts that arise and that can’t be automatically handled will result in exceptions. Their expectation is that your application should then try again to recreate the changes, and that in the worst case your application may have to explicitly resolve the conflicts.)

Hierarchical data

ModeShape stores data in a tree of nodes and properties, where you have full control over the design of that tree structure. At the top of the tree is a single root node, and every node can contain multiple child nodes. Every node has a name and a unique identifier, and can also be identified by a path containing the names of all ancestors, from the parent to the node itself. Names are comprised of a namespace and local part, and there is a namespace registry to centralize short prefixes for each namespace.

You can see that this looks very similar to how a file system is laid out. You already know how to organize a file system, and organizing a ModeShape repository is very similar. In fact, lots of data already has implicit hierarchical structure. Consider that URLs are essentially addresses into a website’s hierarchy. And hierarchical data is easy to use: simply navigate the nodes. It’s also often more efficient to navigate, since related data is very close by.
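As a rough illustration of how a path identifies a node, the sketch below builds a tiny tree in memory and derives each node’s path by walking up to the root. The `NodePathSketch` class and the `veh:` names are made up for this example; a real application would use the JCR `Node` API rather than anything like this.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical in-memory tree used only to illustrate node paths. */
final class NodePathSketch {
    final String name;
    final NodePathSketch parent;
    final List<NodePathSketch> children = new ArrayList<>();

    NodePathSketch(String name, NodePathSketch parent) {
        this.name = name;
        this.parent = parent;
        if (parent != null) parent.children.add(this);
    }

    /** Walks up to the root, then joins the names from the top down. */
    String path() {
        if (parent == null) return "/";
        Deque<String> segments = new ArrayDeque<>();
        for (NodePathSketch n = this; n.parent != null; n = n.parent) {
            segments.push(n.name); // push adds to the front, so the root's child ends up first
        }
        return "/" + String.join("/", segments);
    }
}
```

With a root node, a child named `veh:cars` and a grandchild named `veh:prius`, the grandchild’s path comes out as `/veh:cars/veh:prius`, which is exactly the file-system-like addressing described above.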

Scalable and highly available

ModeShape repositories can be small and embedded into Java applications, or they can be very large and distributed across a cluster of machines. You can even decide how (and if) ModeShape should persist your data, ranging from keeping data only in-memory, to storing data on the local file system, to storing data in a relational database, to leveraging the performance, scalability, and durability of an in-memory and elastic data grid. It may seem counter-intuitive, but storing your data in RAM is extremely fast as long as multiple copies of all your data are stored across multiple machines while ensuring that machines can be added and removed and the data is automatically and elastically distributed. This is exactly what a data grid can achieve, and this is how ModeShape can scale to very (very) large databases.

All of ModeShape’s functionality and features are built on top of Infinispan, which is a flexible, fast and highly scalable data grid. ModeShape stores each node in one or more entries in Infinispan (a small node will be stored as one entry, but larger ones are broken into multiple entries), and Infinispan is configured to replicate, distribute and persist the data.

Note that when we talk about scalability and large databases, we’re not talking about the kinds of scales that “big data” often refers to. ModeShape is not a “big data” database and doesn’t scale that big. We’re transactional, after all.

Schema validation

ModeShape supports a very powerful and flexible schema system, but interestingly you get to decide where and how much schema enforcement to use. At one extreme, you allow every node to contain any property and any children – this is essentially using ModeShape as a schema-less database, and it’s a perfectly valid way to use ModeShape. Your application becomes fully in-control of the database structure, making it easy to evolve the structure to suit new or changing requirements.

At the other extreme, you fully define every node to fit a particular node type that constrains the properties and child nodes to fit pre-defined patterns. ModeShape ensures that all the data always adheres to the schema, and your application doesn’t have to do any validation or enforcement.

But between these two extremes is where ModeShape really becomes interesting and advantageous. You can choose which subset of nodes in your tree you want to adhere to a schema, allowing parts of the database to be more schema-less and the rest to be more constrained. But more importantly, you can dynamically expand the schema for any individual node by mixing in additional node types with more property and child patterns. For example, you can define a node type that requires a “title” property, and you can add this node type to any node that is to have a title.
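The “title” example can be sketched as follows. This is a deliberately simplified model, assuming a node type is nothing more than a name plus a set of mandatory property names; `MixinValidationSketch`, `NodeType`, `Node` and the `ex:titled` mixin are hypothetical and do not reflect how ModeShape represents or enforces node types internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of mixin-style validation; not ModeShape's node type system. */
final class MixinValidationSketch {

    /** A node type reduced to its name and the properties it makes mandatory. */
    static final class NodeType {
        final String name;
        final Set<String> requiredProperties;

        NodeType(String name, String... requiredProperties) {
            this.name = name;
            this.requiredProperties = new LinkedHashSet<>(Arrays.asList(requiredProperties));
        }
    }

    /** A node with free-form properties and any number of mixed-in types. */
    static final class Node {
        final Map<String, String> properties = new HashMap<>();
        final List<NodeType> mixins = new ArrayList<>();

        /** Returns one message per required property that is missing. */
        List<String> validate() {
            List<String> violations = new ArrayList<>();
            for (NodeType type : mixins) {
                for (String required : type.requiredProperties) {
                    if (!properties.containsKey(required)) {
                        violations.add(type.name + " requires property '" + required + "'");
                    }
                }
            }
            return violations;
        }
    }
}
```

A node with no mixins accepts any properties, which corresponds to the schema-less extreme. Adding the `ex:titled` mixin to one node makes `title` mandatory for that node only, while its siblings stay unconstrained.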

ModeShape’s schema system is very powerful and flexible, and makes it far easier to constrain your data while simultaneously enabling future changes to and evolution of your database’s schema.

Query and search

Navigation isn’t the only way to access your data. Your applications can also query a ModeShape repository to find the subset of content that meets application-specified criteria regardless of where in the hierarchy that data exists. ModeShape offers several query languages, including a subset of XPath, a full-text search language (much like internet search engines), and an extremely powerful SQL-like language called “JCR-SQL2”. Here’s a fairly simple example of a JCR-SQL2 query:

SELECT * FROM [veh:vehicle] AS vehicle
WHERE vehicle.[veh:make] IN ('Chevrolet', 'Toyota', 'Ford')

The queries can be much more complex and can include joins, rich criteria, subqueries, and limits/offsets. The result sets are tabular, but still allow you to access the corresponding node(s) in each row.

Of course, ModeShape evaluates each query across all of the data, even when the repository is distributed in a cluster. That means your application is written the same way, regardless of how ModeShape is configured.

Events

ModeShape provides an event API so that your application can be notified when content changes. Your application can register listeners using a variety of criteria (e.g., “only notify me of the addition or removal of nodes in this subgraph”, or “only notify me when nodes of this type are changed”, or even “notify me of all node and property changes”, etc.), and can then respond to the events with application-specific behavior.

Again, this behavior works the same way regardless of whether ModeShape is clustered – applications see the changes made by sessions in all processes in the cluster.

Other features

ModeShape includes a number of other features, too. ModeShape can automatically manage the history of a subtree of content – all that’s required is adding the “mix:versionable” mixin to the node, and then calling “checkin()”, “checkout()” and “restore()”.

Individual nodes can be locked to prevent other applications from modifying that area of the repository. Locks are intended to be short-term (e.g., scoped to a single session), though it’s possible to lock nodes for a longer duration.

Take the next step

We’ve covered a lot of topics in this post, but hopefully now you have a clear understanding of what kind of database ModeShape is and whether it is a fit for your use cases. Give it a try. ModeShape 3.0.0.Final is due out next week, but get the latest candidate release.

Filed under: features, jcr, repository

September 5, 2012 • 10:53 am 0

New repository backup and restore in ModeShape 3

We recently added a new feature to ModeShape 3.0.0.Beta3 that enables repository administrators to create backups of an entire repository (even when the repository is in use), and to then restore a repository to the state reflected by a particular backup. This works regardless of where the repository content is persisted.

There are several reasons why you might want to restore a repository to a previous state, and many are quite obvious. For example, the application or the process it’s running in might stop unexpectedly. Or perhaps the hardware on which the process is running might fail. Or perhaps the persistent store might have a catastrophic failure (although surely you’re also using the persistent store’s backup system, too).

But there are also non-failure related reasons. Backups of a running repository can be used to transfer the content to a new repository that is perhaps hosted in a different location. It might be possible to manually transfer the persisted content (e.g., in a database or on the file system), but the process of doing so varies with different kinds of persistence options. Also, ModeShape can be configured to use a distributed in-memory data grid that already maintains its own copies for ensuring high availability, and therefore the data grid might not persist anything to disk. In such cases, the content is stored on the data grid’s virtual heap, and getting access to it without ModeShape may be quite difficult. Or, you may initially configure your repository to use a particular persistence approach that is suitable given the current needs, but over time the repository grows and you want to move to a different, more scalable (but perhaps more complex) persistence approach. Finally, the backup and restore feature can be used to migrate to a new major version of ModeShape.

In short, you may very well have the need to set the contents of a repository back to an earlier state. ModeShape’s backup and restore feature makes this easy to do.

Getting started

Let’s walk through the basic process of creating a backup of an existing repository and then restoring the repository. Both of these steps require an authenticated Session that has administrative privileges. It actually doesn’t matter which workspace the session uses:

javax.jcr.Repository repository = ...
javax.jcr.Credentials credentials = ...
String workspaceName = ...
javax.jcr.Session session = repository.login(credentials,workspaceName);

So far, this is basic and standard stuff for any JCR client.

Introducing the RepositoryManager

Each JCR Session instance has its own Workspace object that provides workspace-level functionality and access to a set of “manager” interfaces: the VersionManager, NodeTypeManager, ObservationManager, LockManager, etc. The JSR-333 (aka, “JCR 2.1”) effort is still incomplete, but has plans to introduce a RepositoryManager that offers some repository-level functionality. The ModeShape public API has created such an interface, and accessing it from a standard JCR Session instance is pretty simple:

org.modeshape.jcr.api.Session msSession = (org.modeshape.jcr.api.Session)session;
org.modeshape.jcr.api.RepositoryManager repoMgr = msSession.getWorkspace().getRepositoryManager();

The interface is pretty self-explanatory, and defines several methods including two that are related to the backup and restore feature:

public interface RepositoryManager {

    ...

    /**
     * Begin a backup operation of the entire repository, writing the files
     * associated with the backup to the specified directory on the local
     * file system.
     *
     * The repository must be active when this operation is invoked, and
     * it can continue to be used during backup (e.g., this can be a
     * "live" backup operation), but this is not recommended if the backup
     * will be used as part of a migration to a different version of
     * ModeShape or to a different installation.
     *
     * Multiple backup operations can operate at the same time, so it is
     * the responsibility of the caller to not overload the repository
     * with backup operations.
     *
     * @param backupDirectory the directory on the local file system into
     *        which all backup files will be written; this directory
     *        need not exist, but the process must have write privilege
     *        for this directory
     * @return the problems that occurred during the backup operation
     * @throws AccessDeniedException if the current session does not
     *         have sufficient privileges to perform the backup
     * @throws RepositoryException if the backup cannot be run
     */
    Problems backupRepository( File backupDirectory ) throws RepositoryException;

    /**
     * Begin a restore operation of the entire repository, reading the
     * backup files in the specified directory on the local file system.
     * Upon completion of the restore operation, the repository will be
     * restarted automatically.
     *
     * The repository must be active when this operation is invoked.
     * However, the repository <em>may not</em> be used by any other
     * activities during the restore operation; doing so will likely
     * result in a corrupt repository.
     *
     * It is the responsibility of the caller to ensure that this method
     * is only invoked once; calling multiple times will lead to
     * a corrupt repository.
     *
     * @param backupDirectory the directory on the local file system
     *        in which all backup files exist and were written by a
     *        previous {@link #backupRepository(File) backup operation};
     *        this directory must exist, and the process must have read
     *        privilege for all contents in this directory
     * @return the problems that occurred during the restore operation
     * @throws AccessDeniedException if the current session does not
     *         have sufficient privileges to perform the restore
     * @throws RepositoryException if the restoration cannot be run
     */
    Problems restoreRepository( File backupDirectory ) throws RepositoryException;
}

Next, we’ll take a look at each of these two methods.

Creating a backup

The backupRepository(...) method on ModeShape’s RepositoryManager interface is used to create a backup of the entire repository, including all workspaces that existed when the backup was initiated. This method blocks until the backup is completed, so it is the caller’s responsibility to invoke the method asynchronously if that is desired. When this method is called on a repository that is being actively used, all of the changes made while the backup process is underway will be included; at some point near the end of the backup process, however, additional changes will be excluded from the backup. This means that each backup contains a fully-consistent snapshot of the entire repository as it existed near the time at which the backup completed.

Here’s a code example showing how easy it is to call this method:

org.modeshape.jcr.api.RepositoryManager repoMgr = ...
java.io.File backupDirectory = ...
Problems problems = repoMgr.backupRepository(backupDirectory);
if ( problems.hasProblems() ) {
    System.out.println("Problems backing up the repository:");
    // Report the problems (we'll just print them out) ...
    for ( Problem problem : problems ) {
        System.out.println(problem);
    }
} else {
    System.out.println("The backup was successful");
}

Each ModeShape backup is stored on the file system in a directory that contains a series of GZIP-ed files (each containing representations of approximately 100K nodes) and a subdirectory in which all the large BINARY values are stored.

It is also the application’s responsibility to initiate each backup operation. In other words, there currently is no way to configure ModeShape to perform backups on a schedule. Doing so would add significant complexity to ModeShape and the configuration, whereas leaving it to the application lets the application fully control how and when such backups occur.
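If your application does want periodic backups, the JDK’s scheduler is usually enough. The sketch below is one possible approach, not a ModeShape feature: `ScheduledBackupSketch`, `nextBackupDirectory` and the `backup-` directory naming are hypothetical, and the `backupAction` callback stands in for whatever code calls `backupRepository(...)` and reports the returned problems.

```java
import java.io.File;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper: lets the application drive backups on its own schedule. */
final class ScheduledBackupSketch {
    private static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");

    /** Picks a timestamped directory so that each run writes to its own location. */
    static File nextBackupDirectory(File parent, LocalDateTime now) {
        return new File(parent, "backup-" + STAMP.format(now));
    }

    /**
     * Runs backupAction repeatedly with a fixed delay between the end of one
     * run and the start of the next, so runs of this task never overlap.
     * The action is where a call such as repoMgr.backupRepository(dir) would go.
     */
    static ScheduledFuture<?> schedule(ScheduledExecutorService executor, File parent,
                                       long delay, TimeUnit unit, Consumer<File> backupAction) {
        return executor.scheduleWithFixedDelay(() -> {
            try {
                backupAction.accept(nextBackupDirectory(parent, LocalDateTime.now()));
            } catch (RuntimeException e) {
                // Swallow the failure so that one bad run does not cancel later runs.
                System.err.println("Backup failed: " + e);
            }
        }, 0, delay, unit);
    }
}
```

Directory names have one-second resolution, so this sketch assumes the delay between runs is longer than a second. Cancelling the returned `ScheduledFuture` or shutting down the executor stops further backups.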

Restoring a repository

Once you have a complete backup on disk, you can then restore a repository back to the state captured within the backup. To do that, simply start a repository (or perhaps a new instance of a repository with a different configuration) and, before it’s used by any applications, load into the new repository all of the content in the backup. Here’s a simple code example that shows how this is done:

org.modeshape.jcr.api.RepositoryManager repoMgr = ...
java.io.File backupDirectory = ...
Problems problems = repoMgr.restoreRepository(backupDirectory);
if ( problems.hasProblems() ) {
    System.out.println("Problems restoring the repository:");
    // Report the problems (we'll just print them out) ...
    for ( Problem problem : problems ) {
        System.out.println(problem);
    }
} else {
    System.out.println("The restoration was successful");
}

Once a restore succeeds, the newly-restored repository will be restarted and will be ready to be used.

Migrating from ModeShape 2.8 to 3.0

Earlier I mentioned that backup and restore can be used to migrate from one version of ModeShape to the next major version of ModeShape. This is how we plan to support migrating from a ModeShape 2.8 repository instance to a new ModeShape 3.0 instance. We plan to cut one more release of ModeShape 2, which we’ll christen 2.8.4.Final, and that will include a utility that will create a 3.0-compatible backup of the ModeShape 2.8 instance. Then, simply use the “restoreRepository” method on the new (and empty) ModeShape 3.0 repository to load all the backed-up content.

Questions or feedback

This feature is still relatively new and was introduced in ModeShape 3.0.0.Beta3, and we’d love to get your feedback on our forums before we freeze the public API and cut the 3.0.0.Final release.

Filed under: features, jcr, repository, techniques, tools

January 28, 2012 • 10:56 am 14

ModeShape 3.0 Alpha1 is here, and it rocks!

The ModeShape team is happy to announce that we’ve issued the first alpha release of ModeShape 3. This is the first alpha release we’ve ever made, and it’s still rough around the edges. But we’re so excited about ModeShape 3 that we had to share. (And, yes, this post is really long, but it’s a good read.)

Our goal for ModeShape 3 is for it to be the seriously fast, very scalable, and highly available JCR implementation. To do that, we’ve made some pretty significant architectural changes. Some of these are:

We’re using Infinispan for all caching and storage. This gives the foundation we need to meet our goals while giving us the flexibility for how to store the content (via cache stores). ModeShape can still be embedded into applications, but Infinispan will help us scale out to create truly distributed, multi-site, content grids. This completely replaces our old connector framework.
So far our tests show ModeShape 3 is ridiculously fast. It’s all around faster than 2.7 – in fact, most operations are at least one (if not several!) orders of magnitude faster. We’ll publish proper performance and benchmarking results closer to the final release.
Scalability not only includes clustering (and “scaling out”), but it also means handling a wider range of node structures. We’ve tested our new approach with 100s of thousands of child nodes under a single parent, even when those nodes have ordered children with same-name-siblings. Yet it’s still almost just as fast as nodes with just a few child nodes!
Configuring repositories is hopefully much easier. There is no more global configuration of the engine; instead, each repository is configured with a separate JSON file that conforms to a JSON Schema and that your application can validate with one method call. Check out this entirely valid sample configuration file. You can deploy new repositories at runtime, and can even change a repository’s configuration while it is running (some restrictions apply). For example, you can add/change/remove sequencers, authorization providers, and many other configuration options while the repository is being actively used.
ModeShape continues to have great options for storing your content. ModeShape 2 had its own connector framework, but with ModeShape 3 we’re simply using Infinispan’s cache stores, with a number of great options out-of-the-box:
- In-memory (no cache store)
- BerkeleyDB, which is quite fast but has license restrictions
- JDBM, a free alternative to BerkeleyDB
- Relational databases (via JDBC), including in-memory, disk-based, or remote
- File system
- Cassandra
- Cloud storage (e.g., Amazon’s S3, Rackspace’s Cloudfiles, or any other provider supported by JClouds)
- Remote Infinispan grid
Every session now immediately sees all changes persisted/committed by other sessions, although transient changes of the session still take precedence. This behavior is different from in 2.x, and when combined with the new way node content is being stored will hopefully reduce the potential for conflicts during session save operations. This means that all the Sessions using a given workspace can share the cache of persisted content, resulting in faster performance and smaller memory footprint. That means that ModeShape can handle more sessions at the same time in a single process.
Our Session, Workspace, NodeTypeManager and other components are thread safe. The JCR specification only requires that the Repository and RepositoryFactory interfaces are thread-safe. But making our implementations thread-safe means that it’s possible for multiple threads to share one Session for reading. Of course, Session is inherently stateful, so sharing a Session for writes is still a bad thing to do.
We have a new public API for monitoring the history, activity and health of ModeShape.
We’ve changed our sequencing API to use the JCR API. This should make it much easier to create your own sequencers, plus sequencers can also dynamically register namespaces and node types. We’ve already migrated most of our 2.x sequencers to this new API, and will be migrating the rest over the next few weeks.
Handling of binary values is greatly improved with a new facility that can store binary values of all sizes, including those that are (much) larger than available memory. In fact, only small binary values are stored in memory (this is configurable), while all other binary values are only streamed. We’ve started out with a file system store that will work even in clustered environments, but we also plan to add stores that use Infinispan and DBMSes.
We’re still using Lucene for our indexes, but we’re now using Hibernate Search to give us durable and fast ways to update the indexes, even in a cluster. Note that Hibernate Search is part of the Hibernate family, but it’s a small library that does not use, depend on, or require JPA or the Hibernate ORM.
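To make the binary-handling item above a little more concrete, here is a minimal sketch of the general idea: values at or below a configurable size stay in memory, and anything larger is streamed to disk without ever being fully loaded. This is not ModeShape's implementation; the class name, the threshold parameter, and the use of a temporary file are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical sketch (not ModeShape code): small values are kept in memory,
// larger ones are streamed to a temporary file in fixed-size chunks.
final class ThresholdBinaryStore {

    // Exactly one of inMemory / onDisk is non-null.
    static final class StoredBinary {
        final byte[] inMemory;
        final Path onDisk;
        final long size;

        StoredBinary(byte[] inMemory, Path onDisk, long size) {
            this.inMemory = inMemory;
            this.onDisk = onDisk;
            this.size = size;
        }
    }

    private final int threshold; // largest size (in bytes) that is kept in memory

    ThresholdBinaryStore(int threshold) {
        if (threshold < 0) throw new IllegalArgumentException("threshold must be >= 0");
        this.threshold = threshold;
    }

    StoredBinary store(InputStream in) throws IOException {
        // Read at most threshold + 1 bytes: enough to tell "small" from "large".
        byte[] head = new byte[threshold + 1];
        int filled = 0;
        while (filled < head.length) {
            int n = in.read(head, filled, head.length - filled);
            if (n == -1) break;
            filled += n;
        }
        if (filled <= threshold) {
            return new StoredBinary(Arrays.copyOf(head, filled), null, filled);
        }
        // Large value: write what was read so far, then stream the remainder.
        Path file = Files.createTempFile("binary-", ".bin");
        long size = filled;
        try (OutputStream out = Files.newOutputStream(file)) {
            out.write(head, 0, filled);
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                size += n;
            }
        }
        return new StoredBinary(null, file, size);
    }
}
```

With a threshold of 4, for example, a 4-byte stream comes back as an in-memory array, while a 10-byte stream comes back as a file on disk. Memory use for large values is bounded by the threshold plus the copy buffer, regardless of how big the value is.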

As if that’s not enough, we still have a lot to do:

Kits for deploying ModeShape 3 as a service in JBoss AS7, allowing you to use the AS7 tooling to configure, deploy, manage, monitor, and undeploy your JCR repositories. Infinispan and JGroups are also built-in services in AS7 and can be managed the same way. Plus, ModeShape clustering will work out of the box using AS7’s built-in clustering (domain management) mechanism. ModeShape and JBoss AS7 will be the easiest way to deploy, manage and operate enterprise-grade repositories.
JTA support will allow JCR Sessions to participate in XA and container-managed transactions. We’re already using JTA transactions internally with Infinispan, so we’re already a good way toward this feature.
Map-Reduce is a great way to process large amounts of information in parallel. ModeShape will let you validate the entire repository content against the current set of node types or even a proposed set of node types, making it far easier to safely and confidently change the node types in a large repository. And we’ll provide a way for you to write your own mappers, reducers, and collectors to implement any kind of (read-only) analysis you want.

Hopefully you’re just as excited as we are. We love how far we’ve been able to come with ModeShape 3, and we’re only part way there.

The good news is that you can start kicking the tires and seeing for yourself just how fast ModeShape 3 is. Most of the JCR features are working and are ready for trial and testing. In fact, please file bug reports if you find anything that doesn’t work. But unfortunately a few things still aren’t complete or working well enough:

Queries will parse but can’t be executed. Most of the query functionality works, but a few key pieces don’t. Consequently, the JDBC drivers don’t work.
Clustering and shareable nodes don’t work.
AS7 kits are incomplete and not yet usable.
The RESTful and WebDAV services aren’t working as we’d like, so we excluded them from the alpha.
Federation is not yet working; see this discussion for how we want to expand federation capabilities.

We’re also overhauling our documentation to make it even more useful, but it’s a little sparse at the moment because we’re focusing on the code. Our What’s New and Getting Started pages are pretty useful, though, and should help you get your testing going. We also have some stand-alone example Maven projects on GitHub that you can clone and hack to start putting ModeShape 3 through its paces.

What’s next? Well, we’re continuing to implement the missing and incomplete features, and we plan to release a second alpha in the next few weeks. We’ll follow that up over the following month with a couple of feature-complete beta releases and the final 3.0 release. Stay tuned!

Now, wasn’t that worth a few minutes of your time? We’re really excited about ModeShape 3, and think you’ll really like it, too.

Filed under: features, jcr, news, releases, repository, testing

December 22, 2011 • 12:52 pm 0

ModeShape 2.7.0.Final is available

The ModeShape team is once again happy to announce the immediate availability of ModeShape 2.7.0.Final. The release artifacts are available in the JBoss Maven repository (see our Maven instructions) and on our downloads page. And as we said earlier this week, we’ve moved our Getting Started and Reference Guides to a new home.

Version 2.7 contains mostly bug fixes and minor improvements:

improved memory usage during export and indexing
fixed JPA connector’s use of 2nd level cache for Hibernate 3.3 and later
JPA connector’s background garbage collection can be disabled
JPA connector no longer caches large value entities
fixed race condition in RepositoryConnectionPool
added public API methods to register node types in CND files, eliminating need for depending upon implementation classes
a few public API interfaces/methods that were redundant with JCR 2.0 have been deprecated
added support for setting values with custom javax.jcr.Binary implementations
added public API methods to get the SHA-1 hash of binary values
fixes to query processing
fixes to enable building on Windows
corrected Teiid sequencer’s generation of transformation queries
upgraded to Tika 1.0
upgraded versions of several Maven plugins

Thanks to the entire ModeShape community for testing our previous betas and helping us improve the stability and performance. And thanks to all the contributors that took part in this release. Great job, everyone!

Filed under: features, jcr, news, open source, repository

August 4, 2011 • 4:07 pm 0

ModeShape 2.6.0.Beta2 is available!

The ModeShape team is happy to announce the second beta release for ModeShape 2.6. The release artifacts are available in the JBoss Maven repository (see our Maven instructions) and on our downloads page. We’ve updated our Reference Guide, Getting Started Guide, and JavaDoc.

Combined with the first beta, 2.6.0.Beta2 includes a number of new features, improvements, and bug fixes compared with 2.5.0.Final:

kits for JBoss Application Server 5.x and 6.x
improved overall performance
new disk-based storage connector
added cache support in several connectors
pluggable authentication and authorization
the JPA connector now supports configuring/using Hibernate 2nd-level cache
improved BINARY property support for large files
automatically use the JDK logger if SLF4J binding is not available
upgraded to Infinispan 4.2.1.Final
faster startup of the ModeShape engine
full support in the file system connector for ‘mix:referenceable’ and REFERENCE properties
over two dozen bug fixes

Give ModeShape 2.6.0.Beta2 a try, and let us know if you have any problems. But don’t wait too long, because we hope to wrap up the outstanding issues and release the Final version within a few weeks.

Once again, the ModeShape community has done a great job. Thanks to you all!

Filed under: features, jcr, news, repository

July 11, 2011 • 2:41 pm 1

ModeShape 2.6.0.Beta1 is available

The ModeShape team is happy to announce the first beta release for ModeShape 2.6. The release artifacts are available in the JBoss Maven repository (see our Maven instructions) and on our downloads page. We’ve updated our Reference Guide, Getting Started Guide, and JavaDoc.

This release includes a number of new features, improvements, and bug fixes:

kits for JBoss Application Server 5.x and 6.x
improved overall performance
new disk-based storage connector
added cache support in several connectors
pluggable authentication and authorization
the JPA connector now supports configuring/using Hibernate 2nd-level cache
improved BINARY property support for large files
automatically use the JDK logger if SLF4J binding is not available
upgraded to Infinispan 4.2.1.Final
faster startup of the ModeShape engine
over a dozen bug fixes

Give ModeShape 2.6.0.Beta1 a try, and let us know if you have any problems. But don’t wait too long, because we’ve already started work on Beta2 and hope to release that in a few weeks.

Once again, the ModeShape community has done a great job. Thanks to you all!

Filed under: features, news, repository

June 20, 2011 • 10:26 am 2

Finding a JCR repository

Updated 6/21/2011: Added section describing the Seam JCR module
Updated 6/23/2011: Added more detail about the JNDI location when ModeShape is deployed to JBoss AS

Okay, you’re using JCR in your application, and you’re writing all of your code to the JCR API. That’s great, because your application doesn’t have any implementation-specific calls, and you can rely only upon the “javax.jcr” packages.

“But,” you ask, “how do I get a reference to the javax.jcr.Repository instance without using implementation-specific code in my app?”

If you’re using JCR 1.0, you’re basically out of luck. The spec didn’t specify how to do that, and so the implementations all do it differently.

But thankfully JCR 2.0 introduced the javax.jcr.RepositoryFactory interface and described how to use the Java SE Service Locator pattern to get that initial reference to your repository instance without any implementation-specific code. Here’s how that works.

Using the JCR 2.0 RepositoryFactory

Your application will have one (or more) JCR implementations on the classpath, and per JCR 2.0 they will each provide their own RepositoryFactory implementations and manifest entries so that the JVM can find them. Your application can find them by using the Service Locator pattern:

Map parameters = ...
Repository repository = null;
for (RepositoryFactory factory : ServiceLoader.load(RepositoryFactory.class)) {
  repository = factory.getRepository(parameters);
  if (repository != null) break;
}

This basically iterates over all of the RepositoryFactory implementations, and for each one asks that factory to return the JCR Repository instance given the map of parameters. Per JCR 2.0, if the RepositoryFactory understands the parameters, it will return a Repository instance; otherwise, it will return null. Now, each JCR implementation is allowed to define its own parameters, so these definitely are still implementation-specific. But since they’re just properties, your application can remain independent of the JCR implementation by simply loading them from a file:

Properties parameters = new Properties();
// Read from a file or from other input streams or readers,
// closing the stream once the properties have been loaded ...
try (FileInputStream stream = new FileInputStream(file)) {
  parameters.load(stream);
}
// Find the Repository instance by asking each factory on the classpath ...
Repository repository = null;
for (RepositoryFactory factory : ServiceLoader.load(RepositoryFactory.class)) {
  repository = factory.getRepository(parameters);
  if (repository != null) break;
}

Look, Ma! No implementation-specific code!

ModeShape parameters for RepositoryFactory

So what parameters does ModeShape expect? Just one:

org.modeshape.jcr.URL

If the value of this parameter is a URL that resolves to a ModeShape configuration file, the factory will actually start up a new ModeShape engine using that configuration file, and will look for the repository named in the URL. For example:

file:config/configRepository.xml?repositoryName=MyRepository

will look for a ModeShape configuration file named “configRepository.xml” that is in the “config” directory relative to where the JVM was started, and will return the repository defined in the configuration file with the name “MyRepository”. (Remember that a single ModeShape engine can host multiple JCR repositories.) Other URLs are possible, as long as they can be resolved to the configuration file.

If the value of the “org.modeshape.jcr.URL” parameter is a URL that begins with “jndi:”, then the ModeShape factory will attempt to look for a ModeShape engine instance registered in JNDI, and will ask that engine for the named repository. For example:

jndi:name/in/jndi?repositoryName=MyRepository

will look in JNDI for a ModeShape engine at “name/in/jndi”, and will ask it for the repository named “MyRepository”.

The JNDI form is what you’ll use if you’ve deployed ModeShape to JBoss AS and your applications need to access the repositories. ModeShape runs as a service within JBoss AS, so when the app server is started ModeShape will automatically register the engine in JNDI at “jcr/local”. If you’ve not changed the configuration, there will be a repository called “repository” (with a default workspace called “default”, though you can create other workspaces using the JCR API), and you can use the following URL for the “org.modeshape.jcr.URL” parameter:

jndi:jcr/local?repositoryName=repository

Of course, you probably want to change the configuration to add other repositories or to control where and how the repositories store the content (by default it is stored in-memory). If you add repositories or change the name of the repository, you’ll need to change the URL accordingly.
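As a rough illustration of the two URL forms described above, the sketch below loads the “org.modeshape.jcr.URL” parameter from properties-file text and splits the value into a location and a repository name. It is not ModeShape's own parsing code: the RepositoryUrl class and its fields are hypothetical names introduced here, and only the parameter name, the “jndi:” prefix, and the “repositoryName” query parameter come from the text.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper (not part of ModeShape): shows how the two URL forms
// break down into "where the engine or configuration is" and "which repository".
final class RepositoryUrl {
    static final String PARAMETER = "org.modeshape.jcr.URL";
    private static final String JNDI_PREFIX = "jndi:";
    private static final String NAME_KEY = "repositoryName=";

    final boolean jndi;          // true for the "jndi:" form
    final String location;       // JNDI name, or the URL of the configuration file
    final String repositoryName; // null if the URL does not name a repository

    private RepositoryUrl(boolean jndi, String location, String repositoryName) {
        this.jndi = jndi;
        this.location = location;
        this.repositoryName = repositoryName;
    }

    static RepositoryUrl parse(String url) {
        int q = url.indexOf('?');
        String base = q < 0 ? url : url.substring(0, q);
        String query = q < 0 ? "" : url.substring(q + 1);
        String name = null;
        for (String pair : query.split("&")) {
            if (pair.startsWith(NAME_KEY)) name = pair.substring(NAME_KEY.length());
        }
        boolean isJndi = base.startsWith(JNDI_PREFIX);
        String location = isJndi ? base.substring(JNDI_PREFIX.length()) : base;
        return new RepositoryUrl(isJndi, location, name);
    }

    // Reads the parameter from properties-file text, as an application might
    // when it keeps the implementation-specific settings outside its code.
    static RepositoryUrl fromProperties(String propertiesText) throws IOException {
        Properties parameters = new Properties();
        parameters.load(new StringReader(propertiesText));
        String url = parameters.getProperty(PARAMETER);
        if (url == null) throw new IllegalArgumentException("Missing " + PARAMETER);
        return parse(url);
    }
}
```

A properties file for the JBoss AS case would then hold a single line, `org.modeshape.jcr.URL=jndi:jcr/local?repositoryName=repository`, which this sketch splits into the JNDI name “jcr/local” and the repository name “repository”. Switching to the “file:” form only requires editing that one line, not the application code.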

Injecting JCR Repositories

If you’re building an application that uses CDI, there’s another option for getting hold of your Repository instance. The Seam JCR project is a portable extension to CDI that provides annotations for automatically injecting a javax.jcr.Repository object into your application, and Seam JCR works with ModeShape and Jackrabbit. Simply ensure that Seam JCR and your JCR implementation are on your classpath, and then use annotations to provide the same parameters normally supplied to the RepositoryFactory. Here’s an example of injecting ModeShape with the same “file:” URL used above:

  @Inject @JcrConfiguration(name="org.modeshape.jcr.URL",
                            value="file:config/configRepository.xml?repositoryName=MyRepository")
  Repository repository;

Seam JCR also makes it easy to inject a JCR Session into your application:

  @Inject @JcrConfiguration(name="org.modeshape.jcr.URL",
                            value="file:config/configRepository.xml?repositoryName=MyRepository")
  Session session;

This code will obtain a Session using the default workspace and no credentials, but the Seam JCR team is working on supporting Credentials and workspace names.

Of course, Seam JCR also works with Jackrabbit, but uses Jackrabbit-specific parameters. For more details, see the Seam JCR site.

Filed under: features, jcr, repository, techniques

June 20, 2011 • 9:42 am 3

What distinguishes ModeShape?

One question we often get about ModeShape is what makes ModeShape different than other JCR implementations, including the reference implementation. We’ve answered it in a previous blog post, but it’s important enough to give a more recent and succinct answer.

Here’s a really brief, very high-level summary of what ModeShape is and where our emphases lie:

ModeShape is a lightweight, embeddable, extensible open source JCR repository implementation that federates and unifies content from multiple systems, including file systems, databases, data grids, other repositories, etc. You can use the JCR API to access the information you already have, or use it like a conventional JCR system. It’s useful for portals, for knowledge bases, for storing/versioning artifacts, for managing configuration, for managing metadata, and more. ModeShape is easy to configure, easy to cluster, and easy to extend.

Of course, we can look at some of the ModeShape features to get an even better understanding of what it does and why it rocks:

Supports all the JCR 2.0 required features: repository acquisition; authentication; reading/navigating; query; export; node type discovery; permissions and capability checking
Supports most of the JCR 2.0 optional features: writing; import; observation; workspace management; versioning; locking; node type management; same-name siblings; orderable child nodes; shareable nodes; and mix:etag, mix:created and mix:lastModified mixins with autocreated properties.
Supports the JCR 1.0 and JCR 2.0 languages (e.g., XPath, JCR-SQL, JCR-SQL2, and JCR-QOM) plus a full-text search language based upon the JCR-SQL2 full-text search expression grammar. Additionally, ModeShape supports some very useful extensions to JCR-SQL2:

subqueries in criteria
set operations (e.g., “UNION”, “INTERSECT”, “EXCEPT”, each with optional “ALL” clause)
limits and offsets
duplicate removal (e.g., “SELECT DISTINCT”)
depth, reference and path criteria
set and range criteria (e.g., “IN”, “NOT IN”, and “BETWEEN”)
arithmetic criteria (e.g., “SCORE(t1) + SCORE(t2)”)
full outer join and cross joins
and more

Choose from multiple storage options, including RDBMSes (via Hibernate), data grids (e.g., Infinispan), file systems, or write your own storage connectors as needed.
Use the JCR API to access information in existing services, file systems, and repositories. ModeShape connectors project the external information into a JCR repository, potentially federating the information from multiple systems into a single workspace. Write custom connectors to access other systems, too.
Upload files and have ModeShape automatically parse them, derive structured information representative of what’s in those files, and store this derived information in the repository so you can query and access it just like any other content. ModeShape supports a number of file types out-of-the-box, including: CND, XML, XSD, WSDL, DDL, CSV, ZIP/JAR/EAR/WAR, Java source, Java class files, Microsoft Office, image metadata, and Teiid models and VDBs. Writing sequencers for other file types is also very easy.
Automated and extensible MIME type detection, with out-of-the-box detection using file extensions and content-based detection using Aperture.
Extensible text extraction framework, with out-of-the-box support for Microsoft Office, PDF, HTML, plain text, and XML files using Tika.
Simple clustering using JGroups.
Embed ModeShape into your own application, or deploy on JBoss Application Server, or use in any other application server.
RESTful API (requires deployment into an application server).
WebDAV support

These are just some of the highlights. For details on these and other ModeShape features, please see the ModeShape documentation.