January | 2010 | ModeShape

January 29, 2010 • 3:31 pm

The shape of your information

This is the first in a series of posts about specific features in the JCR API. We start with node types because they’re such an important and empowering feature of JCR, and critical to so many other features.

We are inundated with data stores: relational databases, file systems, repositories, document stores, or proprietary systems. Newer stores, like data grids and distributed (e.g., “NoSQL”) databases, are likely on the horizon (if not already in use). How do we write applications that use all this valuable information without copying it (horrors!) and without resorting to lots of different APIs? And how much work will it take to update our applications as new information becomes available or the existing information evolves?

Fortunately there are different approaches to federation, which means we get to pick the one that best suits our needs. One such approach is warehouses and ETL, but technically that’s copying and not federation. Another approach is using relational-based technologies like the Teiid project, which uses a relational engine to do the heavy lifting, and provides a virtual database complete with SQL, JDBC and ODBC support to give applications a way of interacting with the data using tables that mirror what the application wants/needs. It is a database – it just gets the data from other sources. This is perfect for some use cases, but for others the relational nature of the interaction is less than ideal.

ModeShape uses a graph-oriented approach that works well in cases where the information is hierarchical and/or has a structure that evolves over time. ModeShape is a JCR implementation that looks and behaves like a regular JCR repository, except that it federates in real-time some (or all) of the content from other systems. In fact, ModeShape doesn’t even have to federate any information, and in such cases it works just like any other JCR implementation (albeit with a wider selection of persistent storage options via its connectors).

ModeShape can do this because of the power and flexibility of the JCR API, which uses a graph of nodes and properties. These nodes form a simple tree, but use properties to create relationships or references to other nodes [1]. Here’s a simple representation:

This approach makes the JCR API very good at exposing information with varying and evolving structure, whether that information exists within the repository itself or is defined by and housed in other external systems.

Of course, very generic data structures can have their own challenges. Flexible and abstract stores place few constraints on how you organize your data, but that means you need another way of describing or constraining the structure and shape of your information.

Node types

The JCR API solves this issue very nicely by using a simple but very powerful system of node types. Every node in a JCR repository has a primary type that defines the names, types, and characteristics of the properties and children that the node may (or must) have. Additionally, each node may have one or more mixin types that define further properties and children beyond those defined by the primary type. You can add mixin types at any time, and can even remove them at any time as long as doing so doesn’t violate the primary type or remaining mixin types.

JCR node types dictate the kinds of properties that can (or must) exist on a node, and can constrain the property’s name, the number and type of values allowed, default values, constraints on the values, whether the property must exist, whether and how the property can be changed, whether the values are queryable and searchable, and the kinds of query operators that apply to the values. All of these are optional, so it’s actually possible to define a node type that allows any number of properties with any name and any values. JCR calls property definitions without a name constraint “residual” definitions, because they apply only if there isn’t a more specific “non-residual” property definition.

Node types are also capable of dictating the names, types and order of child nodes. Node types can also define residual child node definitions for cases where a node can contain children with any name and type. Like residual property definitions, these are only used if there isn’t a child definition that is more specific.

And node types support inheritance, so it’s possible to reuse and extend other node types. It is even possible to override and further constrain property and child definitions inherited from supertypes.
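To make the “residual” idea a bit more concrete, here is a minimal sketch in plain Java. It is a toy model, not the javax.jcr API: the `NodeTypeSketch` and `PropertyDef` classes and their methods are names we made up for illustration. It only shows the lookup rule described above, where a named definition always wins and the residual definition (written “*” here) is used as a fallback.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of property-definition lookup. This is NOT the javax.jcr API. */
final class NodeTypeSketch {

    /** Hypothetical, simplified property definition; "*" marks a residual definition. */
    static final class PropertyDef {
        final String name;
        final String valueType;
        final boolean mandatory;

        PropertyDef(String name, String valueType, boolean mandatory) {
            this.name = name;
            this.valueType = valueType;
            this.mandatory = mandatory;
        }
    }

    private final Map<String, PropertyDef> named = new LinkedHashMap<>();
    private PropertyDef residual; // stays null when the type has no residual definition

    /** Registers a definition; the name "*" registers the residual definition. */
    NodeTypeSketch define(String name, String valueType, boolean mandatory) {
        PropertyDef def = new PropertyDef(name, valueType, mandatory);
        if ("*".equals(name)) {
            residual = def;
        } else {
            named.put(name, def);
        }
        return this;
    }

    /** A named definition always wins; the residual one applies only as a fallback. */
    Optional<PropertyDef> definitionFor(String propertyName) {
        PropertyDef exact = named.get(propertyName);
        return exact != null ? Optional.of(exact) : Optional.ofNullable(residual);
    }

    public static void main(String[] args) {
        NodeTypeSketch type = new NodeTypeSketch()
                .define("jcr:title", "STRING", true)
                .define("*", "UNDEFINED", false);
        System.out.println(type.definitionFor("jcr:title").get().valueType);
        System.out.println(type.definitionFor("anything:else").get().name);
    }
}
```

A type defined without the “*” entry returns an empty result for unknown names, which corresponds to a node type that rejects properties it doesn’t explicitly define.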

Know your types – and how to use them

Carefully selecting, defining, extending, and using node types in a JCR repository provides an incredible amount of control over your information and can let your information take on its natural shape. In cases where you do want to constrain the structure, using a primary type with no residual properties or children ensures that the nodes always fit the desired shape. In cases where flexibility is more important, use a primary type that allows any properties and any children (e.g., the “nt:unstructured” built-in node type).

Of course, most situations are probably somewhere in the middle, and this is where mixins shine. Create these nodes using a liberal primary type (like “nt:unstructured”), and then on a node-by-node basis add mixins defining various “facets” or “characteristics” needed to capture the desired information. For example, the “mix:created” mixin node type defines a “jcr:created” date property and a “jcr:createdBy” string property, and can be mixed into any node where there’s a need to store the “creation” information. This mixin can even be removed from a node without having to remove the properties!
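Here is a rough sketch of why removing a mixin is sometimes allowed and sometimes not. Again this is a toy model in plain Java rather than the javax.jcr API: `MixinSketch`, `Type`, and `Node` are hypothetical classes, “acme:strict” is an invented primary type, and a real repository applies many more rules (protected properties, child definitions, value types). The sketch keeps just one rule: a mixin can be removed only if every existing property is still permitted by the primary type or a remaining mixin.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of primary and mixin types on a node. This is NOT the javax.jcr API. */
final class MixinSketch {

    /** Hypothetical simplified type: a set of allowed property names, or any name if residual. */
    static final class Type {
        final String name;
        final boolean residual;
        final Set<String> allowed;

        Type(String name, boolean residual, String... allowed) {
            this.name = name;
            this.residual = residual;
            this.allowed = new HashSet<>(Arrays.asList(allowed));
        }

        boolean permits(String property) {
            return residual || allowed.contains(property);
        }
    }

    static final class Node {
        final Type primary;
        final Set<Type> mixins = new LinkedHashSet<>();
        final Map<String, String> properties = new LinkedHashMap<>();

        Node(Type primary) {
            this.primary = primary;
        }

        void addMixin(Type mixin) {
            mixins.add(mixin);
        }

        void setProperty(String name, String value) {
            if (!permittedBy(name, null)) {
                throw new IllegalArgumentException("No definition allows " + name);
            }
            properties.put(name, value);
        }

        /** Removal succeeds only if every existing property is still permitted afterwards. */
        boolean removeMixin(Type mixin) {
            for (String property : properties.keySet()) {
                if (!permittedBy(property, mixin)) {
                    return false;
                }
            }
            return mixins.remove(mixin);
        }

        /** Checks the primary type and all mixins except the excluded one (may be null). */
        private boolean permittedBy(String property, Type excluded) {
            if (primary.permits(property)) {
                return true;
            }
            for (Type mixin : mixins) {
                if (mixin != excluded && mixin.permits(property)) {
                    return true;
                }
            }
            return false;
        }
    }
}
```

In this model, a node whose primary type is residual (like “nt:unstructured”) keeps its “jcr:createdBy” value after the mixin is removed, while a node with the strict primary type refuses the removal until the property is gone.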

Node types also play a critical role in JCR queries, because they allow forming sets of nodes that have similar properties. These sets are naturally similar to the fundamental concepts in various query languages: relational tables, XML element types, and Java classes. For example, you could query all the “mix:created” nodes to find all nodes created within a certain period and order the results by the name of the creator.
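As an illustration of that last example, the query could be written in the JCR-SQL2 language defined by JCR 2.0. The sketch below only builds the statement as a string so that it runs without a repository; the `CreatedQuerySketch` class and its method are hypothetical helpers, and the date values are made up. In a JCR 2.0 repository, a statement like this would typically be handed to `QueryManager.createQuery(statement, Query.JCR_SQL2)`.

```java
/** Builds a JCR-SQL2 statement as a string; executing it would require a JCR repository. */
final class CreatedQuerySketch {

    /**
     * Selects nodes of type mix:created whose jcr:created value lies in [fromIso, toIso),
     * ordered by creator. The arguments are ISO 8601 timestamps from a trusted caller;
     * this sketch does no escaping (JCR-SQL2 bind variables would be the safer route).
     */
    static String createdBetween(String fromIso, String toIso) {
        return "SELECT * FROM [mix:created] AS n"
                + " WHERE n.[jcr:created] >= CAST('" + fromIso + "' AS DATE)"
                + " AND n.[jcr:created] < CAST('" + toIso + "' AS DATE)"
                + " ORDER BY n.[jcr:createdBy]";
    }

    public static void main(String[] args) {
        System.out.println(createdBetween("2010-01-01T00:00:00.000Z", "2010-02-01T00:00:00.000Z"));
    }
}
```

Because the node type acts as the “table” in the FROM clause, the same statement finds every node that has the mixin, whatever its primary type happens to be.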

Node types are also critically important in ModeShape because they describe the structure and semantics of the graph that ModeShape creates from the information in the underlying sources. And as the underlying information changes shape or meaning, the graph can adapt by altering the structure and node types.

Summary

We really just touched the surface of JCR node types, but hopefully we’ve given you a glimpse of how extremely powerful they make the JCR API. Node types make it possible to work with a very flexible graph system while controlling, describing, and understanding the shape of the information content in a JCR repository – even when this information lives in external systems.

[1] JSR-283 (aka “JCR 2.0”) takes this a step further by introducing “shared nodes” that share properties and children with other nodes. For example, if a node at the path “/a/b/c/d” is shared with the node at “/x/y/z”, then a property on the “d” node is also a property of the “z” node and a child of “d” also appears as a child of “z”. Thus, shared nodes make it possible for a single node to appear in multiple places in the repository and have multiple paths.

Filed under: features, federation, jcr, repository

January 26, 2010 • 10:11 pm

ModeShape isn’t your father’s JCR

It’s true: ModeShape is a JCR implementation that is pretty new. Why on earth would we create another JCR implementation when other implementations have been around for so long?

For many years, the assumption in the persistent storage world has been that each store should own all the information. Database vendors tried to sell their databases and claimed how easy it would be to migrate all of your data into their system. ETL vendors talked about how to load up a data warehouse with all the useful information you need, so it’s all in one place. Document storage systems and other content management systems worked wonders, as long as everything was in their repository. And the JCR implementations followed suit by implementing the JCR API on top of silo repositories (that often used a relational database under the covers).

We see the world differently. We understand that you already have too many information stores, be they databases, file systems, repositories, document stores, or proprietary systems. We believe you shouldn’t need separate APIs to access all of it, and that you shouldn’t have to move all this information into one big silo (and rewrite the applications you already have). Instead, your databases and repositories should federate all this existing information and provide the view of your information that your applications need [1]. And you should be able to write applications that can take advantage of the information you already have where it is today. And those applications should have to change as little as possible when you have new or different information tomorrow.

It all boils down to using the JCR API to access a variety of information in all kinds of places. The JCR API is an excellent abstraction with powerful features that make it very easy to work with the information in the shape it wants to be today while easily adapting to the shape it will take tomorrow. This is what the ModeShape project is doing, and here’s how we’re doing it:

Killer Feature #1: Connectors

Implementing the full JCR API on top of multiple kinds of systems would be expensive, time consuming, and painful. So instead we’ve created a connector framework that is simple enough that it’s easy to write new connectors, yet efficient enough to do many, many operations with one (potentially remote) call. Your applications can use ModeShape to make these existing systems look and behave like a real JCR repository.

ModeShape JCR and connectors

We’ve begun building a library of connectors that allow us to store content on a data grid (Infinispan), on a distributed cache (JBoss Cache), in relational databases (via Hibernate), and in-memory within the Java process (for small transient use cases). Our library also includes connectors that access existing file systems, SVN repositories, and even the schemas from existing JDBC databases.

Of course, we’re already working on more connectors, including a connector to other JCR repositories. And we envision lots of connectors, including connectors to other CMIS repositories, version control systems (like Git and CVS), document databases (like CouchDB and Cassandra), distributed file systems, customer management systems, Maven repositories, LDAP directories, and existing databases. Just to name a few. And we designed the connector framework so that you can write your own.

Killer Feature #2: Federation

Remember all those different silos of information? Using JCR to access each of these is pretty interesting, but what’s really killer is federating the information from multiple sources into a single virtual repository. To your applications, ModeShape looks and behaves like a regular JCR repository. They use the standard JCR API to navigate, search, create, change, and listen for changes in the content. But under the covers, ModeShape is able to federate content from multiple back-end systems using our connectors, ensuring that the repository content stays up-to-date and in-sync with those systems. And those external systems can continue “owning” the information, and existing applications can continue using them, but new applications using ModeShape can easily access the unified and integrated information.

ModeShape federation
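To give a feel for the idea (and only the idea), here is a small plain-Java sketch of one way to present several sources as a single tree: each source is projected under a path prefix, and reads are routed to whichever source owns that part of the tree. This is not how ModeShape’s federated connector is implemented. The `FederationSketch` class, its `project` and `read` methods, and the use of a simple function as a “source” are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Toy federation: routes repository paths to back-end sources by path prefix. */
final class FederationSketch {

    // Hypothetical source abstraction: given a source-relative path, return the content if present.
    private final Map<String, Function<String, Optional<String>>> projections = new LinkedHashMap<>();

    /** Projects a source into the unified tree under the given mount path (for example "/files"). */
    FederationSketch project(String mountPath, Function<String, Optional<String>> source) {
        projections.put(mountPath, source);
        return this;
    }

    /** Finds the first projection that owns the path and asks that source for the content. */
    Optional<String> read(String path) {
        for (Map.Entry<String, Function<String, Optional<String>>> entry : projections.entrySet()) {
            String mount = entry.getKey();
            if (path.equals(mount) || path.startsWith(mount + "/")) {
                String relative = path.substring(mount.length());
                return entry.getValue().apply(relative.isEmpty() ? "/" : relative);
            }
        }
        return Optional.empty();
    }
}
```

The application only ever sees the unified paths, while each source keeps “owning” its content and is consulted at read time rather than copied.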

Killer Feature #3: Sequencing

A lot of repositories exist to store files and other important artifacts, and contained in all those files is a ton of very valuable information. Sure, the repository might process them for searching, but that just extracts the words and phrases. Or, your applications can read the files and process them one at a time. ModeShape sequencers are able to unlock this valuable structured information and put it back into the repository, where it’s accessible via navigation, queries, and searches.

Sequencing is fully automated and done in the background. Simply configure the sequencer and start uploading content. ModeShape has a library of sequencers, including support for CND, DDL, XML, ZIP, MP3, images, Java source, Java class, text files (character-separated and fixed-width), and Microsoft Office® documents. Of course, we designed it so that you can write your own sequencers, too.

Killer Feature #4: JCR-SQL2

The JCR API provides a single mechanism for querying the repository content, using a variety of query languages. JSR-170 (aka “JCR 1.0”) requires that repositories support the JCR XPath language (a subset of XPath 2.0), and defines the optional language called “JCR-SQL” that is a simple subset of SQL SELECT statements. JSR-283 (aka “JCR 2.0”) deprecates both XPath and JCR-SQL, and instead mandates support for an improved “JCR-SQL2” language that is a better and more powerful adaptation of SQL.

ModeShape currently supports JCR 1.0, and thus it does support the XPath query language defined by the spec. However, ModeShape also supports the newer JCR-SQL2 query language, along with several major enhancements [2]. In fact, our enhanced JCR-SQL2 is so powerful that ModeShape implements the XPath support by translating XPath expressions into JCR-SQL2 queries.

Not your father’s JCR

Traditionally, applications that use JCR are working with content repositories and content management systems. But chances are you have a lot of valuable information that your JCR repository can’t get to. And you’ve probably come to really like the JCR API, and can imagine how nice it would be to use it to access all that existing information.

So chances are, you need ModeShape. Or at least you need to give it a try. After all, ModeShape is not your father’s JCR. It’s better. Much better.

[1] Federation is in our DNA. The ModeShape project actually came out of the team that built the MetaMatrix commercial data integration and federation engine. MetaMatrix was the first true EII product that allowed applications to access unified and integrated data housed in multiple disparate back-end systems through a single, scalable, virtual database using SQL via JDBC and ODBC. MetaMatrix was acquired by Red Hat in 2007, and seeded the Teiid and Teiid Designer open source projects.

[2] Though not included in JSR-283’s JCR-SQL2, ModeShape adds support for: all the JOIN operators; UNION, INTERSECT and EXCEPT [ALL] set operations; removal of duplicates via SELECT DISTINCT; LIMIT and OFFSET clauses; new DEPTH and PATH dynamic operands for use in constraint clauses; constraints using IN, NOT IN, and BETWEEN clauses; and arithmetic operations on dynamic operands. For details, see our Reference Guide.

Filed under: features, federation, jcr, repository

January 21, 2010 • 11:53 am

ModeShape 1.0 Beta

Hot on the heels of rebranding our project, the ModeShape project is pleased to announce that the first beta release of ModeShape 1.0 is now available. It’s in the JBoss Maven repository and in our project’s downloads area. Of course, our Getting Started guide and Reference Guide are great places to start. And we always have JavaDocs and release notes. Thanks to our fantastic community of users and contributors!

This release is basically just a rebranded form of the JBoss DNA 0.7 release published last week. The goal was to make it as easy as possible to migrate an application from JBoss DNA to ModeShape. So the Maven group and artifact IDs have changed, the package names have changed, a few classes with “DNA” in the name have changed, and lots of documentation has changed. If you’re using the JCR API, only a few areas of your applications will be affected. For details, see the migration section of our Reference Guide.

And as before, ModeShape implements all of the JCR Level 1 and most of the Level 2 features, along with the optional locking and observation features. ModeShape supports three query languages (XPath, JCR-SQL2, and a full-text search), a variety of persistent stores (including RDBMS, Infinispan, and JBoss Cache, to name a few), accessing content in non-JCR systems (including SVN, file systems, JDBC database schemas, etc.), and federating multiple stores and systems into a single, virtual repository. As you upload files and other data into the repository, ModeShape sequencers automatically extract structured information and store it in the repository, making that extracted content available for navigation, search, and query. ModeShape can easily be embedded in your application, deployed into your own web applications, or deployed as a REST service on your favorite application server or servlet container. And at this point, ModeShape is passing roughly 97% of the JCR TCK, and our goal is to get that to 100% for a 1.0 release in a few weeks.

So switch to ModeShape, give this latest release a shot, and let us know what you think.

Filed under: features, jcr, news, open source, repository

January 21, 2010 • 11:51 am

JBoss DNA is now ModeShape

I’m very pleased to announce that JBoss DNA has a new name: “ModeShape”. Yes, it’s the same project, with the same software (albeit rebranded), and certainly the same fantastic community. Just a new name and a new home.

Why are we rebranding? After all, isn’t “JBoss DNA” a good name? We thought so. But while having “JBoss” in the name comes with a lot of benefits, there are some disadvantages. For a lot of people, “JBoss” means “Application Server”, and though we hope to play a role with AS in the future, right now our JCR implementation is completely independent of JBoss AS. For other people, “JBoss” means products (a la subscriptions and support), and ours is an open-source project that, at this point in time, is not included in any of the current JBoss platforms. So, if we lose the “JBoss” part of “JBoss DNA”, we’re left with “DNA”, and unfortunately that’s just not sufficient for a project name from a trademark or legal perspective.

So, we’ve chosen to rebrand our project as “ModeShape”. We have a great new logo:

and some other great new resources:

new project site
new Twitter account (follow us!)
new blog
new mailing lists and chat room
new SVN repository (with all the history from JBoss DNA)
new JIRA project (with all the old DNA issues)
new forums (with all the old DNA threads)
new swag and desktop wallpapers

We’re also releasing ModeShape 1.0.0.Beta1, but more on that in the next post.

We’re excited to officially have our new brand. Props to James Cobb and Cheyenne Weaver for our new logo, graphics and other branding help. And, as is so often the case, thanks to our fantastic community for their hard and quick work completing the rebranding.

BTW, where did the name originate? It’s actually a slight modification of a term used in structural dynamics to help describe and understand how a structure responds dynamically to some force or input, where the response is mathematically a combination of each of its natural mode shapes. And it seemed to fit a project with goals of making it possible to understand the shape and structure of information and content. (Okay, maybe that’s a stretch. But you get the idea.)

Filed under: jcr, news, open source, repository

January 11, 2010 • 10:42 am

Announcing JBoss DNA 0.7

We’ve just released JBoss DNA 0.7. It’s in the JBoss Maven repository and in our project’s downloads area. Of course, our Getting Started guide and Reference Guide are great places to start. And we always have JavaDocs.

With this release, JBoss DNA introduces support for JCR query and search with several languages, including the JCR XPath language (required by the 1.0 specification), the JCR-SQL2 dialect defined by the JCR 2.0 specification, and a full-text search language. It also adds support for observation.

This means that JBoss DNA now implements all of the JCR Level 1 features, almost all Level 2 features (everything except referential integrity), and the optional locking and observation features. This version passes more than 97% of the JCR TCK tests that cover Level 1, Level 2, locking and observation. (All of the failures are because of referential integrity and a handful of known issues.) Fortunately, most of these are either less-frequently-used features of JCR or issues that can be worked around.

This release also introduces a number of new and improved connectors. Both the file system connector and SVN connector were reworked to improve performance and to support updates, and they both offer a preview of an optional caching system. The JPA storage connector was dramatically improved and is now significantly faster, more capable, and more efficient. There is also a new JDBC metadata connector that provides read-only access to the schema information of relational databases through JDBC. The federated connector was also improved, and is now used in several key places within our JCR implementation. Plus, we still have connectors to Infinispan, JBoss Cache, and a simple transient in-memory store.

There are also a number of new and improved sequencers. A new text sequencer is able to extract structured data from comma-separated or fixed-width text files. A new DDL sequencer is capable of parsing a number of DDL dialects to extract the more important DDL statements. The CND sequencer was rewritten to be much simpler, perform better, fix a number of known issues, and eliminate third-party dependencies. There is also a new Java class file sequencer that operates on Java class files and produces output that is comparable to the Java source file sequencer, and that can be used in conjunction with the ZIP file sequencer to extract the Java metadata from JAR, WAR, and EAR files. And don’t forget the XML sequencer or our other sequencers for extracting metadata from images, MP3s, and Microsoft Office documents.

We’ve fixed quite a few bugs, added numerous improvements, and upgraded all third-party dependencies to the latest versions available at this time. The build system now supports running all of the tests against a variety of databases, making it very easy to test against DBMSes that JBoss DNA doesn’t directly test against. And we’ve added a new DDL generation utility that produces the DDL for the database schema created and used by the JPA connector. To top it all off, JCR repositories now support the use of anonymous users, though this can easily be changed for production purposes.

Thanks to the whole JBoss DNA community for all their hard work!

Filed under: features, jcr, news, open source, repository
