ModeShape

An open-source, federated content repository

ModeShape 3.2.0.Final is available

I’m happy to announce that ModeShape 3.2.0.Final is now available. This release took us a longer than we’d hoped, but we’ve fixed an incredible 120 issues, including using JBoss EAP 6.1 (instead of JBoss AS 7.1.1) and upgrading the Infinispan, JGroups, Hibernate Search, Lucene, Tika, and other third party libraries. See our  release notes for more specifics. We recommend everyone using 3.1.3 or earlier upgrade as soon as feasible.

Just to reiterate: this version of ModeShape can be installed on top of JBoss EAP 6.1. See this discussion post for background on what this means for community users.

As usual, the artifacts are in the JBoss Maven repository and on our downloads page. Our getting started guide has instructions, ModeShape 3 documentation and JavaDoc are available, and our code is on GitHub. Join us on our forums or IRC channel to ask questions, and please log any issues in our JIRA.

We had a lot more help this release from our community members that reported and fixed issues. Thanks to the whole community for a job well-done!

Filed under: features, jcr, releases, rest

ModeShape 3.1.3.Final is available

I’m happy to announce that ModeShape 3.1.3.Final is available immediately. It contains almost dozen bug fixes, including several clustering-related fixes; see our  release notes for specifics. We recommend everyone using 3.1.2 or earlier upgrade as soon as feasible.

As usual, the artifacts are in the JBoss Maven repository and on our downloads page. Our getting started guide has instructions, ModeShape 3 documentation and JavaDoc are available, and our code is on GitHub. Join us on our forums or IRC channel to ask questions, and please log any issues in our JIRA.

Thanks to the whole community for a job well-done!

Filed under: features, jcr, releases, rest

ModeShape 3.1.2.Final is available

I’m happy to announce that ModeShape 3.1.2.Final is available immediately. It contains two dozen bug fixes; see our  release notes for specifics. We recommend everyone using 3.1.1 or earlier upgrade as soon as feasible.

As usual, the artifacts are in the JBoss Maven repository and on our downloads page. Our getting started guide has instructions, ModeShape 3 documentation and JavaDoc are available, and our code is on GitHub. Join us on our forums or IRC channel to ask questions, and please log any issues in our JIRA.

Thanks to the whole community for a job well-done!

Filed under: features, jcr, releases, rest

ModeShape 3.1.1.Final is available

It’s been just two weeks since our release of 3.1, and several important issues were identified that we wanted to get fixed as soon as possible. As of tonight, ModeShape 3.1.1.Final is available and contains only bug fixes and some minor build corrections. See our  release notes for specifics.

We recommend everyone using 3.1 upgrade as soon as feasible. If you’re using earlier versions, please consider upgrading.

As usual, the artifacts are in the JBoss Maven repository and on our downloads page. See our getting started guide for instructions, our documentationJavaDoc, and our code on GitHub; use our forums or IRC channel to ask questions, and log any issues in our JIRA. Please consider upgrading to 3.1 if you’re still using ModeShape 2.x or 3.0.

Thanks to the whole community for a job well-done!

Filed under: features, jcr, releases, rest

ModeShape 3.1.0.Final is available

I’m very happy to announce that ModeShape 3.1.0.Final is available immediately in the JBoss Maven repository and on our downloads page. See our getting started guide for instructions. As always, check out our documentationrelease notesJavaDoc, and our code on GitHub; use our forums or IRC channel to ask questions, and log any issues in our JIRA. Please consider upgrading to 3.1 if you’re still using ModeShape 2.x or 3.0.

So what does this release include? Lots of goodness:

  • Federation is back! I as mentioned in my last post, ModeShape can now federate content that exists in external systems and project it into the repository as regular content. Several connectors are provided out of the box: a file system connector (very similar to what was in 2.x) that accesses files and folders on the file system and projects them as ‘nt:file’ and ‘nt:folder’ nodes; and a Git connector that accesses a local Git repository (can be a clone of one or more remotes) and projects the branches, tags, commits and trees within the Git repository as a node structure in the repository. You can even write your own connector, too.
  • Access content through CMIS. This is still a ‘technology preview’, and we’re seeking users that can try it out and give us feedback.
  • Installing ModeShape into an existing JBoss AS7.1.1 installation now also installs CMIS API for all repositories. Federation connectors are configured just like all the other repository components.
  • Deploy ModeShape into application servers and containers other than JBoss AS and EAP with our new JCA adapter.
  • Configuration improvements for large strings, JGroups, and variables.
  • Clustering bug fixes and improvements.
  • We’ve tested ModeShape 3.1 with Infinispan 5.1.2.FINAL and 5.1.8.Final. We’ve also tested with 5.2.0.CR1, but we think ModeShape 3.1 will work with Infinispan 5.2.0.Final when it is available.
  • Improved support for very large numbers of child nodes under a single parent, including with federated nodes.
  • Over 50 issues (bugs, tasks, features, etc.) resolved in this release.

Lots of people contributed to this release. Thank you all very much! Keep up the excellent work.

We’re going to focus our next release on improving performance even more and upgrading the libraries we depend on. We’re also hoping that EAP 6.1 becomes available so that we can support installing ModeShape into it.

Now, drop what you’re doing and start using ModeShape 3.1!

Filed under: features, jcr, releases, rest

Federate external data into ModeShape

I’m really excited to announce that ModeShape 3.1 (due out in a few days) will re-introduce the ability to federate data from external systems into ModeShape repositories. That might sound kind of esoteric, but let’s look at some simple scenarios that show how powerful this is.

federation

(Long-time readers might recall that ModeShape 2.x had federation, but due to time constraints we didn’t bring the feature along when we moved to the new architecture in 3.0. We’re now fixing that, except that federation in ModeShape 3 is massively improved compared to what it was in 2.x! In fact, federation in 3.0 is so different, you should probably forget everything you know about federation in ModeShape 2.x.)

Scenario 1: Federating files

Imagine that you have a ModeShape repository (aka database), and it contains all of the data that your application needs. Your application can upload files, but they get stored in ModeShape along with the rest of your content. But you also have a separate file system that is directly accessed and exposed by your web servers. You now want your application to allow users to browse those files and, perhaps, simply pick one so that your application can automatically create a link. Your application already can work with “nt:file” and “nt:folder” nodes, but since that other file system isn’t managed by ModeShape, you have to write new logic to access regular files and folders.

Federation changes this dramatically. You can have ModeShape connect to that separate file system and project the files and folders as “nt:file” and “nt:folder” nodes inside the existing ModeShape repository. In other words, ModeShape will act as though there is an “nt:folder” node inside the repository, but it actually is dynamically created because there is a folder on the separate file system. As your application accesses the name and children of that “nt:folder” node, ModeShape transparently and dynamically maps those requests onto the corresponding file system operations. So your application can continue to work with “nt:file” and “nt:folder” nodes, but ModeShape does all the work of really accessing the files and folders on the separate file system.

Scenario 2: Federating Git repositories

Consider another similar scenario in which the external file system is actually a Git repository, and you want to be able to navigate and access the files and folders in any commit, branch or tag. Again, you could change your application to directly access Git, but that’s quite a bit of work. After all, your application is already accessing most of its content directly from ModeShape.

With federation, ModeShape could access the Git repository and expose not only the files and folders as “nt:file” and “nt:folder” nodes, but it can also expose the Git-specific information on those files and folders, like what was the last commit that changed them. And, you’d also like to be able to navigate (as nodes) the commits (e.g., history), branches, and tags in the Git repository.

How does federation work?

The first thing to understand is that ModeShape does not copy the data from the external system into the repository. Instead, ModeShape (with the help of connectors) dynamically creates nodes upon demand to represent the external data. If the external data doesn’t change too often or is okay to be slightly out of date, then you can optionally have ModeShape cache the nodes in-memory. But either way, the external system remains the owner of its data.

Secondly, federation is transparent to clients. Once federation is configured, the repository’s regular content and federated content all looks to client applications like regular content.

Thirdly, a repository does not use federation by default; you have to configure it for each repository. To do that, a repository configuration must specify:

  1. how ModeShape is to communicate with the external system (e.g., which connector implementation is to be used)
  2. the properties that the connector needs to talk with a particular external system (e.g., an external source)
  3. where and how the data in the external system is to be projected into the repository

All of this is defined inside the “externalSources” area of a repository’s configuration file. Here’s a simple JSON repository configuration that defines one external source called “downloads” and another called “sourceCode”:

{
  "name" : "MyRepository",
  "workspaces" : {
    "predefined" : [ "ws1", "ws2" ],
    "initialContent" : {
      "default" : "resources/initialContent.xml"
    }
  },
  "externalSources" : {
    "downloads" : {
      "classname" : "org.modeshape.connector.filesystem.FileSystemConnector",
      "directoryPath" : "/opt/downloads",
      "readonly" : true,
      "cacheTtlSeconds" : 5,
      "projections" : [ "default:/files/downloads => /" ]
    },
    "sourceCode" : {
      "classname" : "org.modeshape.connector.git.GitConnector",
      "directoryPath" : "data/repo",
      "remoteName" : "origin,upstream",
      "queryableBranches" : "master",
      "cacheTtlSeconds" : 5,
      "projections" : [ "default:/sources/ => /" ]
    }
  }
}

Each external source is identified by a unique name that you assign, and specifies the name of the connector implementation class and other connector-specific properties. A connector is simply a subclass of the “org.modeshape.jcr.federation.spi.Connector” class that contains that logic of how to create nodes that represent external data and, optionally, how to update the external data based upon changes to the nodes. We’ve designed the SPI so that you can easily create your own subclasses of Connector (or ReadOnlyConnector if the connector should never update data in the external system).

Let’s look at this configuration file a bit more. The “downloads” external source (line 10) defines several other properties:

  • the “directoryPath” is the location on the local file system of the top-level directory that is to be accessed by the connector; we use an absolute path here, though relative paths also work.
  • the “readonly” property specifies that ModeShape should never update any of the files or folders on the file system (yes, the FileSystemConnector is capable of creating, updating, and deleting files and folders on the file system in response to applications creating, updating, or deleting the corresponding nodes in the repository).
  • the “cacheTtlSeconds” is the time in seconds (5 in our case) that the nodes created by the connector to represent external files/folders should be cached.
  • the “projections” field is an array of string values that define the paths of the “federated” nodes that will represent the objects in Git. Our value of “default:/files/downloads => /” means that the top-level directory of the external source (that is, the “/opt/downloads” folder) should be projected as a node at “/files/downloads” in the “default” workspace. (Note that we’re also specifying that the “default” workspace should be populated with nodes described by the “resources/initialContent.xml” file. It’s here that we’d define the node type and any properties for the “/files” node.)

The file system connector will create a structure that mirrors the files and folders on the file system. So if the “/opt/downloads” directory contained the following:

aircraft
aircraft/Boeing
aircraft/Boeing/747.jpg
aircraft/Boeing/787.jpg
aircraft/Airbus
aircraft/Airbus/A380.jpg
aircraft/Airbus/A380.jpg
aircraft/Airbus/A320.jpg

then we would then have the following nodes inside the “default” workspace:

/files   (primary type “nt:unstructured”)
/files/downloads   (primary type “nt:folder”)
/files/downloads/aircraft   (primary type “nt:folder”, represents “/opt/downloads/aircraft”)
/files/downloads/aircraft/Boeing   (primary type “nt:folder”, represents “/opt/downloads/aircraft/Boeing”)
/files/downloads/aircraft/Boeing/747.jpg   (primary type “nt:file”, represents “/opt/downloads/aircraft/Boeing/747.png”)
/files/downloads/aircraft/Boeing/787.jpg   (primary type “nt:file”, represents “/opt/downloads/aircraft/Boeing/787.png”)
/files/downloads/aircraft/Airbus   (primary type “nt:folder”, represents “/opt/downloads/aircraft/Airbus”)
/files/downloads/aircraft/Airbus/A380.jpg   (primary type “nt:file”, represents “/opt/downloads/aircraft/Airbus/A380.png”)
/files/downloads/aircraft/Airbus/A320.jpg   (primary type “nt:file”, represents “/opt/downloads/aircraft/Airbus/A320.png”)

Note how the “/files” and “/files/downloads” nodes exist in the workspace, but the “/files/downloads/aircraft” node is dynamically projected to mirror the “/opt/downloads/aircraft” folder. If the “/opt/downloads/aircraft” folder were to be removed, then the “/files/downloads/aircraft” node would automatically be removed as well.

The “sourceCode” external source is pretty similar, but it access a local Git repository:

  • the “classname” specifies that the “org.modeshape.connector.git.GitConnector” connector implementation be used. This class will be included in ModeShape 3.1.0.Final. This connector is read-only.
  • the “directoryPath” is the location on the local file system of the top-level directory that contains a valid Git repository (e.g., it contains a “.git” directory); we use a relative path here.
  • the “remotename” is the name (or comma-separated list of names) of the remote(s).
  • the “cacheTtlSeconds” is the time in seconds (5 in our case) that the nodes created by the connector to represent Git’s files, folders, commits, tags, and branches should be cached.
  • the “projections” field specifies where in the repository the Git nodes should appear. Our value of “default:/sources => /” means that the top-level directory of the external source (that is, the Git repository folder) should be projected as a node at “/sources” in the “default” workspace.

The Git connector doesn’t really work with a local working directory of the Git repository. Instead, it basically exposes all of the commits, branches, and tags (plus the files and folders in each). It does this by mapping Git functionality onto some special nodes:

  • The “branches” node is a container under which all branches can be found.
  • The “tags” node is a container under which all tags can be found.
  • The “commits” node is a container under which all commits appear, with the most recent commits appearing first in the children.
  • The “branches/{branchName}” nodes represent the information about each branch, including the commit ID and references to the node representing the commit.
  • The “tags/{tagName}” nodes represent the information about each tag, including the commit ID and references to the node representing the commit.
  • The “commits/{branchOrTagOrCommit}” nodes show the history of commits for the given branch/tag/commit.
  • The “commits/{branchOrTagOrCommit}/{commitId}” shows the details of a specific commit in the history of a particular branch/tag/commit.
  • The “commit/{branchOrTagOrCommit}” shows the details of the specified commit
  • The “tree/{branchOrTagOrCommit}” is a container for the files and folders within the specified branch, tag or commit.

Thus in our repository, all the Git information is projected under “/sources”, so a number of nodes would appear in our repository:

  • A node representing each branch appears as children under “/sources/branches” (e.g., “/sources/branches/master”)
  • A node representing each tag appears as children under “/sources/tags” (e.g., “/sources/tags/release-1.0″)
  • A node representing each commit appears as children under “/sources/commits” (e.g., “/sources/commits/bbfa3f3d76b0…”)
  • A node structure (with “nt:file” and “nt:folder” descendant nodes) representing the workspace (with its files/folders) for each commit, branch and tag appears as children under “/sources/tree” (e.g., “/sources/tree/master/pom.xml”)

By the way, you can control with the projections which subset of the nodes exposed by the connector should be projected into the repository. For example, if you wanted to expose only a specific branch (e.g., “master”) in the Git repository under the “/sources” node, you could change the projection rule to be “default:/sources => /tree/master”.

Programmatically creating projections

ModeShape’s public API now contains a new FederationManager class that can be used to programmatically create and remove projections. However, external sources still must be configured via the JSON configuration file.

Custom connectors

As we mentioned earlier, we provide two connectors out-of-the-box in 3.1. We do plan to add a few more, but we’ve always expected the some developers would want to create their own custom connectors. Hopefully our SPI is simple enough that doing so is very straightforward. So if you’re interested in this, please take a look and give us feedback on our forums. The SPI should be pretty stable, but if we find some glaring problems, we may need to change the SPI slightly before the 3.2 release; after that, however, we’ll lock down the SPI.

Summary

We described a few ways that federating content from external sources into your ModeShape repository can be quite useful, and we also took a very detailed look into how federation is configured. In reality, however, we just skimmed the surface of what is actually possible with ModeShape federation.

Stay tuned for more on the 3.1 release later this week.

Filed under: features, federation, jcr

Creating and using tags in your content

UPDATE 2: Changed option 3 to use string identifiers, as WEAKREFERENCE and REFERENCE properties both maintain back-references.

UPDATE 1: Added a 5th option, as suggested by Bertrand Delacretaz.

(This post was inspired by a response I recently wrote to a Stack Overflow question. That answer was a bit long, but I thought it would also be suitable as a blog post.)

Many applications offer a way to tag “things” with either user-defined or system-defined tags. Assuming those “things” are nodes, what’s the best way to add tags to a ModeShape repository? I know of four five possible approaches, each with their own benefits and disadvantages.

Option 1: Use Mixins

This approach will use a separate mixin node type definition for each tag. The mixin is a marker mixin (e.g., it has no property definitions or child node definitions). One example of “known-issue” tag is the following (in CND format):

tag="http://www.example.com/tags"
[tag:known-issue] mixin

Create this tag by registering the node type definition using the NodeTypeManager, either by programmatically creating the node type template or by uploading a CND file.

To “tag” a particular node, simply add the tag’s mixin to the node:

node.addMixin("tag:knownIssue");

Note that any node can have multiple tags, since any node can have multiple mixins.

To find all nodes that have a particular tag, simply issue a query:

SELECT * FROM [tag:known-issue]

To find all nodes that have two tags, simply perform a UNION:

SELECT * FROM [tag:known-issue]
UNION
SELECT * FROM [tag:critical-issue]

This approach is pretty straightforward and really uses ModeShape’s mixin feature. However, it is fairly cumbersome to create new tags, since that requires registering new node types. Plus, you cannot easily rename tags, but instead would have to:

  1. create the mixin for the tag with the new name;
  2. find all nodes that have the mixin representing the old tag, and for each remove the old mixin and add the new one;
  3. finally remove the node type definition for the old tag (after it is no longer used anywhere).

Removing old tags is done in a similar manner. Finally, it’s not really possible to associate additional metadata (like a display name) with a tag, since extra properties aren’t allowed on node type definitions.

This approach should perform quite well, however.

Option 2: Use a taxonomy and references

This approach involves using one or more “taxonomies“, each of which consist of a parent node for the taxonomy and child nodes for each tag in that taxonomy. The exact node types used are entirely up to you, but the taxonomy structure can be as rich as you’d like it to be. For example, you can create inheritance between tags in much the same way that classes can inherit from other classes in an ontology. Obviously adding, renaming, and removing tags is straightforward.

To “tag” a node, this approach uses a REFERENCE property. One way to do this is to define a single node type for the tag nodes and a single mixin that we’ll use to add this REFERENCE property to “taggable” nodes:

tags="http://www.example.com/tags"
[tags:tag] > mix:title, mix:referenceable

[tags:taggable] mixin
- tags:tags (REFERENCE) multiple < 'tags:tag'

To “apply” the tag to a node, simply add the “tags:taggable” mixin to the node (if not already there) and add the REFERENCE to the desired tag node. Here’s some code that does this (although it is too simple and assumes the node hasn’t already been tagged):

Node tag = ... // find in taxonomy
Node n = ... // the node that we're going to tag
if ( !n.isNodeType("tags:taggable") ) {
    n.addMixin("tags.taggable");
}
Value[] values = new Value[1];
values[0] = session.getValueFactory().createValue(tag);
n.setProperty("tags:tags",values);

To find all nodes of a particular tag, simply get the tag and call “getReferences()” on a tag node to find all of the nodes that contain a reference to the tag node:

Node tag = ...
NodeIterator iter = tag.getReferences("tags:tags");
while ( iter.hasNext() ) {
    Node tagged = iter.next();
}

Alternatively, you could use a query to find all of the nodes for a particular tag. Here’s one that finds all the nodes that are tagged with the ‘known-issues’ or ‘critical-issue’ tag (note how easy it is to search for nodes tagged with any of 1, 2, or n tags just by changing the set criteria):

SELECT * FROM [tags:taggable] AS taggable
JOIN [tags:tag] AS tag ON taggable.[tags:tags] = tag.[jcr:uuid]
AND LOCALNAME(tag) IN ('known-issue','critical-issue')

This approach has the benefit that all tags have to be controlled/managed within one or more taxonomies (including perhaps user-specific taxonomies).

However, there is one potentially substantial disadvantage: this option may not scale very well to large numbers of tagged nodes. ModeShape might start to degrade adding and removing REFERENCE values when there are hundreds of nodes pointing to the same tag node. Another disadvantage is that a tag cannot be removed from a taxonomy unless it is no longer used.

You can also use WEAKREFERENCE rather than REFERENCE. The only distinction is that with WEAKREFERENCE you can remove a tag from the taxonomy without having to remove it from the tagged nodes.

Option 3: Use taxonomy and identifier references

This option is similar to Option 2 above in that it involves formally managing one or more taxonomies, in exactly the same was as described above. The difference, however, is that rather than use a REFERENCE (or WEAKREFERENCE) the node that is to be tagged points to the tag node using a STRING property with the identifier of the tag node:

tags="http://www.example.com/tags"
[tags:tag] > mix:title, mix:referenceable

[tags:taggable] mixin
- tags:tags (STRING) multiple

Note that the tag has a “jcr:title” property, which you can use to hold the display name for the tag.

Tagging a node is done similarly to Option 2, except the value of the “tags:tag” property is a string:

Node tag = ... // find in taxonomy
String tagId = tag.getIdentifier();
Node n = ... // the node that we're going to tag
if ( !n.isNodeType("tags:taggable") ) {
    n.addMixin("tags.taggable");
}
Value[] values = new Value[1];
values[0] = session.getValueFactory().createValue(tagId);
n.setProperty("tags:tags",values);

To find all nodes of a particular tag, simply use a query to find all of the nodes that have the identifier of a particular tag. Here’s one that finds all the nodes that are tagged with the ‘known-issues’ or ‘critical-issue’ tag (note how easy it is to search for nodes tagged with any of 1, 2, or n tags just by changing the set criteria):

SELECT * FROM [tags:taggable] AS taggable
JOIN [tags:tag] AS tag ON taggable.[tags:tags] = tag.[jcr:uuid]
AND LOCALNAME(tag) IN ('known-issue','critical-issue')

You’ll note that this is very similar to the query in Options 2 and 3. That’s because REFERENCE and WEAKREFERENCE properties are physically stored in a property value as an identifier.

Like option 2, this approach does enforce using one or more taxonomies, makes it a bit easier to control the tags, since they must exist in a taxonomy before they can be used. Renaming nodes is also pretty easy, although this is not necessary if using the “jcr:title” property for the display name , since renaming involves simply changing the title property value. Performance-wise, this is far better than the REFERENCE and WEAKREFERENCE approach, since non-reference properties will scale much better and perform better with large numbers of references, regardless of whether they all point to one node or many. Looking up the tag(s) from the “tags:tags” property is also very fast (and faster than navigating a path).

This approach is similar to Option 2 with WEAKREFERENCE properties in that you can remove a tag even if it is still used, although nodes’ “tags:tags” property values that point to that removed tag will not be usable anymore. This can be remedied with some conventions in your application, or by simply keeping tags around and using metadata on the taxonomy to say that a particular tag is “deprecated” and shouldn’t be used. (IMO, the latter is actually a benefit of this approach.)

This option will generally perform and scale much better than Option 2.

Option 4: Use string properties

The final approach is to simply use a STRING property to tag each node with the name of the tag(s) that are to be applied. This works great for ad hoc tags, which is when there is no formal taxonomy and any tag can be used at any time.

Here’s a mixin that defines a multi-valued STRING property:

tags="http://www.example.com/tags"
[tags:taggable] mixin
- tags:tags (STRING) multiple

To tag a node, simply add the mixin (if not already present) and add the name of the tag as a value on the “tags:tags” STRING property (again, if it’s not already present as a value). Here’s some simplified code that does none of the checking, but which gives the basic idea:

Node n = ... // the node that we're going to tag
if ( !n.isNodeType("tags:taggable") ) {
    n.addMixin("tags.taggable");
}
String[] tags = new String[1]{"known-issue"};
n.setProperty("tags:tags",tags);

The primary advantage of this approach is that it is very simple: you’re simply using string values on the node that is to be tagged. To find all nodes that are tagged with a particular tag (e.g., “tag1″), simply issue a query:

SELECT * FROM [acme:taggable] AS taggable
WHERE taggable.[tags:tags] = 'known-issue'

Also, there is no taxonomy to manage. But if a tag is to be renamed, then you could simply process the “tags:tags” values. If a tag is to be deleted (and removed from the nodes that are tagged with it), then that can be done by removing the tag name from the “tags:tags” properties (perhaps in a background job).

Note that this allows any tag name to be used, and thus works best for cases where the tag names are not controlled at all. If you want to control the list of strings used as tag values, you could create a taxonomy in the repository (as described in Options 2 and 3 above) and have your application limit the values to those in the taxonomy. You can even have multiple taxonomies, some of which are perhaps user-specific. But this approach doesn’t have quite the same control as Options 2 or 3.

This option will perform just a bit better than Option 3 (since the queries are tad simpler), but will scale just as well.

Option 5: Use taxonomy and paths

A fifth option is very similar to Option 3, except that you use a PATH property (rather than a STRING property) that points to the tag, where the PATH values are paths to the tag. Here are some node types:

tags="http://www.example.com/tags"
[tags:tag] > mix:title

[tags:taggable] mixin
- tags:tags (PATH) multiple

(You could also use a STRING property instead of PATH; really the only advantage of using PATH is that it enforces that each value is a legal path value. But using PATH does not enforce that it is an existing path.)

To tag a node, simply add the mixin (if not already present) and add the path of the tag as a value on the “tags:tags” STRING property (again, if it’s not already present as a value). Here’s some simplified code that does none of the checking, but which gives the basic idea:

Node tag = ... // the tag node
Node n = ... // the node that we're going to tag
if ( !n.isNodeType("tags:taggable") ) {
    n.addMixin("tags.taggable");
}
String[] tags = new String[1]{tag.getPath()};
n.setProperty("tags:tags",tags);

Unlike Options 2 or 3, this approach does not even use taxonomies. In fact, you’ll notice that the “tags:tags” property node type has no constraints that require it to contain a path; this reduces the constraints and requires your application to use convention, which can be an advantage. Using a title on the tag for the displayable name obviates having to rename tags. Performance-wise, this is far better than the REFERENCE or WEAKREFERENCE approach, and (for ModeShape) just a bit worse than using the STRING property with an identifier (ModeShape can resolve an identifier faster than it can finding it by path). But it will scale far better than Option 2 and similarly to Option 3.

One advantage of this approach (and of Option 3) over Option 2 is that you can remove a tag even if it is still used, although nodes’ PATH properties that point to that removed tag will be readable but not resolvable. (If you’re using the tag’s title for the display name, this might not be useful since the path might not contain meaningful and usable information.) This can be remedied with some conventions in your application, or by simply keeping tags around and using metadata on the taxonomy to say that a particular tag is “deprecated” and shouldn’t be used. (IMO, the latter is actually a benefit of this approach.)

Summary

We looked at five different ways of incorporating tags into your application. Of course, which one works best for you will depend on the needs of your particular application. And use these as a starting point — feel free to customize them, combine them, or even come up with even other alternatives.

If you just need a way to associate informal tags with content, perhaps Option 4 is a good fit. For very small and limited tagging needs, Option 1 might work. Whereas you should seriously look at option 2 for smallish repositories that needs a formal taxonomy.

But for most applications, your repository will be large enough that you will probably want to look at Options 3, 4 or 5, with the deciding factor being whether you need formal or informal taxonomies. Personally, of these three I think I’d tend to lean toward Option 3.

Happy tagging!

Filed under: jcr, techniques

Concurrent writes

It’s almost a certainty that you will have multiple applications and multiple threads within those applications simultaneously update data in your database. The speed of your application will depend significantly on how fast your database can perform these simultaneous updates.

If you’re using ModeShape, the first thing to know is that reading content does not require any locks. In other words, applications or threads that are reading content can always do so with no contention. (ModeShape doesn’t need read locks because it via Infinispan uses MVCC to isolate readers from writers. See the details for more.)

The second thing to know is that, because ModeShape is a hierarchical database, all data is stored in a tree-like structure of nodes and properties, and any transaction updating content must obtain locks for all nodes being updated. Much of the time, applications and threads that change content do tend to update different parts (subtrees) of the database, which means completely different write locks are acquired by the different transactions. In other words, updates to different parts of the database never block each other.

There are times, however, when multiple applications and/or threads do attempt to update the same node at the same time. In this case, the transactions do compete for the node’s lock, and these transactions complete in essentially a serialized fashion. (Again, they still do not block any reading operations or any transactions updating other areas of the repository.) Occasionally two transactions may deadlock, because they each obtain a lock on separate nodes and then try to obtain a lock on the node currently locked by the other. If you run into this situation, you can enable deadlock detection to automatically detect such cases and roll back one of the deadlocked transactions, which your application can simply re-try by performing the save again.

It’s nice to know that most of the time, application will not have any contention. And when there is contention for concurrent writes to the same areas, ModeShape does the logical thing by serializing the transactions. (Isn’t ACID behavior nice?!)

But even after all this, you may find that your applications are still highly contentious while trying to concurrently update the same nodes. In these cases, you have several options:

  1. Can you initialize the highly-contentious area when the database is created? If so, then the different transactions will update different areas of the database.
  2. Can you alter the hierarchical design of your database to eliminate the contention? Consider if your hierarchy would improve by adding one or more time-based levels. Or consider inserting a level for different contexts (e.g., users, groups, customers, etc.).
  3. Can you centralize where/how your application is updating these areas? For example, a hierarchy that includes a level for users might have contention when adding users. Try centralizing the process of adding users. (Queues often work great for these kinds of patterns.)

By the way, how does ModeShape compare to other hierarchical data stores? Really well, actually. One of the more popular JCR implementations uses a single, cluster-wide, global write lock that guarantees that only one write will proceed at a time. Yikes.

Filed under: features, jcr, techniques

ModeShape 3.0.0.CR3 is available

We released our second candidate release (CR) of ModeShape 3.0 last week, and the community discovered a few more issues with our AS7 integration, backup/restore, and query processing. Although the changes were relatively isolated, we still wanted to give the community a chance to test these fixes before we cut the final release.

So, ModeShape 3.0.0.CR3 is available immediate in the JBoss Maven repository (you can follow our instructions for using ModeShape in your Maven application) and on our downloads page (which include a kit that install ModeShape as a service within JBoss AS7.1.1.Final). Check out our documentationrelease notesJavaDoc, and our code on GitHub; use our forums or IRC channel to ask questions, and log any issues in our JIRA.

We’re pleased with this candidate release, and we believe that we’ll be able to release the final release next week. So please give this latest release a spin and let us know what you think. As with earlier beta releases, this release passes 100% of the JSR-283 TCK tests with all JCR features that were available in ModeShape 2.x. (Note that we won’t officially certify until 3.0.0.Final, but you can run the TCK tests yourself by downloading our source and running “mvn -s settings.xml clean install -Pjcr-tck”.)

As always, we couldn’t do this without all the help from our community members. Well done, and keep up the great work!

Filed under: jcr, news, releases

When is ModeShape a good fit?

Update: changed the Scalability section to make more clear the scope of the term.

When it comes down to it, ModeShape is a database. But there are lots of kinds of databases, and it’s always very important to choose a database that fits your application’s needs. Here are some of the characteristics that distinguish ModeShape from other kinds of databases, which should help you decide whether ModeShape is a good fit for your use cases.

Strongly consistent

ModeShape is strongly-consistent and adheres to the ACID principles, meaning that all operations are atomic, consistent, isolated, and durable. Your applications create Sessions to interact with the information stored in a repository workspace. Each session sees the latest persisted information, even as other applications (or parts of your application) are persisting changes through their own sessions. Your session can make changes, which are overlaid upon the latest persisted information, but only your session sees these changes until you save your session and the changes are persisted. Internally, ModeShape uses a transaction to make sure that all the session’s changes are made (or none of them are), that the changes are consistent, are seen by other sessions only when the changes are completed, and that the changes are durable.

What this means for you is that it’s very easy to develop and write applications, and in many ways is very similar to how you’ve worked with other ACID systems (like relational databases) in the past. You can even use JTA transactions so that the changes are persisted only upon transaction commit. And your application is written the same way whether ModeShape is clustered or non-clustered.

In the last few years, eventually-consistent databases have become very popular, due in part to the increasingly popular goal of creating very large (distributed) databases. When a change is made to an eventually consistent database, that change is not immediately propagated to other processes, but eventually (after a period of time when no changes are made) the database will become consistent. This means that right after one client makes a change, there is no guarantee when other clients will see those changes, and yet those other clients can change the data that they see. The result is that there can be multiple “versions” of the data, and although the database may attempt to resolve these conflicts, it often can only do this for relatively simple conflicts. Ultimately, your application will likely have to deal with the conflict. Additionally, many eventually consistent databases suggest specific usage patterns to make such conflicts less likely, but those usage patterns are often more complicated than you’re used to using. There absolutely are use cases where eventually consistent databases are perfect fits, but there are also lots of use cases and applications that are perfectly unsuitable for eventually consistent databases.

(Note that the next generation of Apache Jackrabbit, codenamed “Oak”, will be eventually consistent. To do that, they are not going to support all the JCR features. When your application saves its session, any conflicts that arise and that can’t be automatically handled will result in exceptions. Their expectation is that your application should then try again to recreate the changes, and that in the worst case your application may have to explicitly resolve the conflicts.)

Hierarchical data

ModeShape stores data in a tree of nodes and properties, where you have full control over the design of that tree structure. At the top of the tree is a single root node, and every node can contain multiple child nodes. Every node has a name and a unique identifier, and can also be identified by a path containing the names of all ancestors, from the parent to the node itself. Names are comprised of a namespace and local part, and there is a namespace registry to centralize short prefixes for each namespace.

You can see that this looks very similar to how a file system is laid out. You already know how to organize a file system, and organizing a ModeShape repository is very similar. In fact, lots of data already has implicit hierarchical structure. Consider that URLs are essentially addresses into a website’s hierarchy. And hierarchical data is easy to use: simply navigate the nodes. It’s also often more efficient to navigate, since related data is very close by.

Scalable and highly available

ModeShape repositories can be small and embedded into Java applications, or they can be very large and distributed across a cluster of machines. You can even decide how (and if) ModeShape should persist your data, ranging from keeping data only in-memory, to storing data on the local file system, to storing data in a relational database, to leveraging the performance, scalability, and durability of an in-memory and elastic data grid. In may seem counter-intuitive, but storing your data in RAM is extremely fast as long as multiple copies of all your data are stored across multiple machines while ensuring that machines can be added and removed and the data is automatically and elastically distributed. This is exactly what a data grid can achieve, and this is how ModeShape can scale to very (very) large databases.

All of ModeShape’s functionality and features are built on top of Infinispan, which is a flexible, fast and highly scalable data grid. ModeShape stores each node in one or more entries in Infinispan (a small node will be stored as one entry, but larger ones are broken into multiple entries), and Infinispan is configured to replicate, distribute and persist the data.

Note that when we talk about scalability and large databases, we’re not talking about the kinds of scales that “big data” often refers to. ModeShape is not a “big data” database and doesn’t scale that big. We’re transactional, after all.

Schema validation

ModeShape supports a very powerful and flexible schema system, but interestingly you get to decide where and how much schema enforcement to use. At one extremely, you allow every node to contain any property and any children – this is essentially using ModeShape as a schema-less database, and it’s a perfectly valid way to use ModeShape. Your application becomes fully in-control of the database structure, making it easy to evolve the structure to suit new or changing requirements.

At the other extreme, you fully define every node to fit a particular node type that constrains the properties and child nodes to fit pre-defined patterns. ModeShape ensures that all the data always adheres to the schema, and your application doesn’t have to do any validation or enforcement.

But between these two extremes is where ModeShape really becomes interesting and advantageous. You can choose which subset of nodes in your tree you want to adhere to a schema, allowing parts of the database to be more schema-less and the rest to be more constrained. But more importantly, you can dynamically expand the schema for any individual node by mixing in additional node types with more property and child patterns. For example, you can define a node type that requires a “title” property, and you can add this node type to any node that is to have a title.

ModeShape’s schema system is very powerful and flexible, and makes it far easier to constrain your data while simultaneously enabling future changes to and evolution of your database’s schema.

Query and search

Navigation isn’t the only way to access your data. Your applications can also query a ModeShape repository to find the subset of content that meets application-specified criteria regardless of where in the hierarchy that data exists. ModeShape offers several query languages, including a subset of XPath, a full-text search language (much like internet search engines), and an extremely powerful SQL-like language called “JCR-SQL2″. Here’s a fairly simple example of a JCR-SQL2 query:

SELECT * FROM [veh:vehicle] AS vehicle
WHERE vehicle.[veh:make] IN ('Chevrolet', 'Toyota', 'Ford')

The queries can be much more complex and can include joins, rich criteria, subqueries, and limits/offsets. The results sets are tabular, but still allow you to access the corresponding node(s) in each row.

Of course, ModeShape evaluates each query across all of the data, even when the repository is distributed in a cluster. That means your application is written the same way, regardless of how ModeShape is configured.

Events

ModeShape provides an event API so that your application can be notified when content changes. Your application can register listeners using a variety of criteria (e.g., “only notify me of the addition or removal of nodes in this subgraph”, or “only notify me when nodes of this type are changed”, or  even “notify me of all node and property changes”, etc.), and can then respond to the events with application-specific behavior.

Again, this behavior works the same way regardless of whether ModeShape is clustered – applications see the changes made by sessions in all processes in the cluster.

Other features

ModeShape includes a number of other features, too. ModeShape can automatically manage the history of a subtree of content – all that’s required is adding the “mix:versionable” mixin to the node, and then calling “checkin()”, “checkout()” and “restore()”.

Individual nodes can be locked to prevent other applications from modifying that area of the repository. Locks are intended to be short-term (e.g., scoped to a single session), though it’s possible to lock nodes for a longer duration.

Take the next step

We’ve covered a lot of topics in this post, but hopefully now you have a clear understanding of what kind of database ModeShape is and whether it is a fit for your use cases. Give it a try. ModeShape 3.0.0.Final is due out next week, but get the latest candidate release.

Filed under: features, jcr, repository

ModeShape is

a lightweight, fast, pluggable, open-source JCR repository that federates and unifies content from multiple systems, including files systems, databases, data grids, other repositories, etc.

Use the JCR API to access the information you already have, or use it like a conventional JCR system (just with more ways to persist your content).

ModeShape used to be 'JBoss DNA'. It's the same project, same community, same license, and same software.

ModeShape

Topics

Follow

Get every new post delivered to your Inbox.