ModeShape

An open-source, federated content repository

Structured, unstructured, and everything between

Shane gives a good breakdown of the various ways to classify data as structured or unstructured. He points out that very often data is a mixture of both structured and unstructured data, and he gives several examples.

What I find so interesting about this, however, is how well ModeShape can handle these varieties of data.

ModeShape handles structured data really well. Most data structures are very easily mapped to the nodes and properties that ModeShape uses. And when those nodes also say which node types apply to them, ModeShape can enforce the node structure by validating it against those built-in and/or custom node types and prevent invalid data from being stored.

The other end of the spectrum is unstructured data, and ModeShape handles that beautifully, too. You can store unstructured data in a property using a string value or a binary value. Typically you would use a string value when the data is some form of text, and a binary value in any other cases (or when you don’t want to treat it as text).

But the best part is that ModeShape naturally handles combinations of structured and unstructured data. Recall that ModeShape is a hierarchical database, which means that each database consists of a single tree of nodes, and each node has one or more properties. That hierarchy is by definition structured, though it’s up to you whether ModeShape validates and enforces that structure using node types. But the leaves of that tree — that is the properties and their values — typically unstructured (though property value like dates and even some string values could be considered structured).

ModeShape’s query languages can also deal with both structured and unstructured data. Relationships between nodes, specific properties defined by node types, and the definitions of those properties all are addressable within the query language. But ModeShape queries can include full-text search constraints on both string and binary property values!

ModeShape can search those binary values when it can extract text using the Tika library, which supports many formats, including PDF, Microsoft Office™, RTF, HTML, and many others.

There’s one more way that ModeShape can deal with unstructured data: it can sequence unstructured data (string and binary property values) using built-in or custom sequencers to extract structure and save it as more nodes and properties in the repository. This is ideal for getting at that unstructured data that has the implicit structure defined by the format. For example, if an image is loaded into the repository, ModeShape’s image sequencer can extract the EXIF data in the image (e.g., ISO setting, focal length, aperture, shutter, geo-location, etc.) and save it as properties in the repository. ModeShape has a number of built-in sequencers that can extract this implicit structure from a variety of file formats:

  • DDL files
  • images (JPEG, GIF, BMP, PCX, PNG, IFF, RAS, PBM, PGM, PPM and PSD)
  • audio (MP3)
  • comma-separated and delimited text files
  • Java source and class files
  • Microsoft Office™
  • ZIP archives
  • XML
  • XML Schema
  • WSDL

In summary, ModeShape deals very naturally and easily with data that is part unstructured and part structured. What else could you want?

Advertisements

Filed under: features, repository, techniques

Infinispan 5.2 and ModeShape 3.2

Congratulations to the Infinispan community on today’s release of Infinispan 5.2.0.Final! It looks like it’s full of really great new features and fixes, including:

We had been testing some of the recent 5.2.0.CRs, but ran into a couple of issues that were fixed in 5.2.0.Final. So if you’re feeling adventurous you can give ModeShape 3.1.1.Final + Infinispan 5.2.0.Final a whirl.

Our next minor release of ModeShape, 3.2, will use Infinispan 5.2 and newer versions of several other dependencies, including Hibernate Search, Lucene, and Tika. We hope to to release ModeShape 3.2.0.Final toward the end of February.

Filed under: news, releases

ModeShape 3.1.1.Final is available

It’s been just two weeks since our release of 3.1, and several important issues were identified that we wanted to get fixed as soon as possible. As of tonight, ModeShape 3.1.1.Final is available and contains only bug fixes and some minor build corrections. See our  release notes for specifics.

We recommend everyone using 3.1 upgrade as soon as feasible. If you’re using earlier versions, please consider upgrading.

As usual, the artifacts are in the JBoss Maven repository and on our downloads page. See our getting started guide for instructions, our documentationJavaDoc, and our code on GitHub; use our forums or IRC channel to ask questions, and log any issues in our JIRA. Please consider upgrading to 3.1 if you’re still using ModeShape 2.x or 3.0.

Thanks to the whole community for a job well-done!

Filed under: features, jcr, releases, rest

One CND file or many?

One question people have when starting out with ModeShape and JCR is this:

Should I create one CND file with all of my node types, or should I create multiple CND files?

Naturally, the answer is “It depends”, since there really is no one way that works for everyone. So to help you figure out your own answer, here are several guidelines that we consider. (If you know of others, please let us know in the comments.)

(BTW, do you know about our CND Editor for Eclipse? If not, check it out because it really makes it easy to edit CND files. It doesn’t even require you to have or use ModeShape.)

Simplicity

A single CND file is easy, and can have node type names that use different namespaces. Just look at a CND file with the standard built-in typesWhere possible, keep things simple and use just a single CND file.

Organization

Perhaps you have enough node types that it’s becomes harder to keep them all in one CND file. In that case, consider putting semantically-related node type definitions in separate files.

Reusability

Some node types might be more reusable than others. In this case, consider putting common or more reusable node types in separate files. For example, you might reuse a subset of your node type definitions in multiple repositories. Or multiple applications might need a common set of node type definitions, while also having their own application-specific types.

When you have reusable node types, then you might also want to consider how those common node types are governed. Are they defined by a central group of people? Are they put into their own Maven artifacts that can be easily shared, and if so are they released separately from the applications? These may all affect how you break up those common node types into one or more files.

Deployment

You may also want to create separate CND files when they’re used/registered at different times. Deployment mechanisms will often dictate the ability or desire to do this.

A ModeShape repository configuration can specify a set of CND files that are to be registered immediately upon startup of the repository. But applications using that repository can also use the JCR API to register new (or updated) node types, too. So you might want some node types to be installed upon initialization, while others might be registered only when the application needs them. You’ll likely want different CND files for each purpose.

Here’s an example. Consider that you have ModeShape installed into an AS7 instance, so the ModeShape repositories are configured as part of the AS7 configuration and started when they’re needed/used. Then web application (or web service) can simply get a hold of the (running) repository via the JNDI, @Resource injection, or RepositoryFactory techniques. In fact, multiple applications can use the same repository. Perhaps some basic node types should installed as part of the repository configuration, while each application (if deployed) can register their own application-specific (or even common) node types.

Filed under: uncategorized

ModeShape 3.1.0.Final is available

I’m very happy to announce that ModeShape 3.1.0.Final is available immediately in the JBoss Maven repository and on our downloads page. See our getting started guide for instructions. As always, check out our documentationrelease notesJavaDoc, and our code on GitHub; use our forums or IRC channel to ask questions, and log any issues in our JIRA. Please consider upgrading to 3.1 if you’re still using ModeShape 2.x or 3.0.

So what does this release include? Lots of goodness:

  • Federation is back! I as mentioned in my last post, ModeShape can now federate content that exists in external systems and project it into the repository as regular content. Several connectors are provided out of the box: a file system connector (very similar to what was in 2.x) that accesses files and folders on the file system and projects them as ‘nt:file’ and ‘nt:folder’ nodes; and a Git connector that accesses a local Git repository (can be a clone of one or more remotes) and projects the branches, tags, commits and trees within the Git repository as a node structure in the repository. You can even write your own connector, too.
  • Access content through CMIS. This is still a ‘technology preview’, and we’re seeking users that can try it out and give us feedback.
  • Installing ModeShape into an existing JBoss AS7.1.1 installation now also installs CMIS API for all repositories. Federation connectors are configured just like all the other repository components.
  • Deploy ModeShape into application servers and containers other than JBoss AS and EAP with our new JCA adapter.
  • Configuration improvements for large strings, JGroups, and variables.
  • Clustering bug fixes and improvements.
  • We’ve tested ModeShape 3.1 with Infinispan 5.1.2.FINAL and 5.1.8.Final. We’ve also tested with 5.2.0.CR1, but we think ModeShape 3.1 will work with Infinispan 5.2.0.Final when it is available.
  • Improved support for very large numbers of child nodes under a single parent, including with federated nodes.
  • Over 50 issues (bugs, tasks, features, etc.) resolved in this release.

Lots of people contributed to this release. Thank you all very much! Keep up the excellent work.

We’re going to focus our next release on improving performance even more and upgrading the libraries we depend on. We’re also hoping that EAP 6.1 becomes available so that we can support installing ModeShape into it.

Now, drop what you’re doing and start using ModeShape 3.1!

Filed under: features, jcr, releases, rest

Federate external data into ModeShape

I’m really excited to announce that ModeShape 3.1 (due out in a few days) will re-introduce the ability to federate data from external systems into ModeShape repositories. That might sound kind of esoteric, but let’s look at some simple scenarios that show how powerful this is.

federation

(Long-time readers might recall that ModeShape 2.x had federation, but due to time constraints we didn’t bring the feature along when we moved to the new architecture in 3.0. We’re now fixing that, except that federation in ModeShape 3 is massively improved compared to what it was in 2.x! In fact, federation in 3.0 is so different, you should probably forget everything you know about federation in ModeShape 2.x.)

Scenario 1: Federating files

Imagine that you have a ModeShape repository (aka database), and it contains all of the data that your application needs. Your application can upload files, but they get stored in ModeShape along with the rest of your content. But you also have a separate file system that is directly accessed and exposed by your web servers. You now want your application to allow users to browse those files and, perhaps, simply pick one so that your application can automatically create a link. Your application already can work with “nt:file” and “nt:folder” nodes, but since that other file system isn’t managed by ModeShape, you have to write new logic to access regular files and folders.

Federation changes this dramatically. You can have ModeShape connect to that separate file system and project the files and folders as “nt:file” and “nt:folder” nodes inside the existing ModeShape repository. In other words, ModeShape will act as though there is an “nt:folder” node inside the repository, but it actually is dynamically created because there is a folder on the separate file system. As your application accesses the name and children of that “nt:folder” node, ModeShape transparently and dynamically maps those requests onto the corresponding file system operations. So your application can continue to work with “nt:file” and “nt:folder” nodes, but ModeShape does all the work of really accessing the files and folders on the separate file system.

Scenario 2: Federating Git repositories

Consider another similar scenario in which the external file system is actually a Git repository, and you want to be able to navigate and access the files and folders in any commit, branch or tag. Again, you could change your application to directly access Git, but that’s quite a bit of work. After all, your application is already accessing most of its content directly from ModeShape.

With federation, ModeShape could access the Git repository and expose not only the files and folders as “nt:file” and “nt:folder” nodes, but it can also expose the Git-specific information on those files and folders, like what was the last commit that changed them. And, you’d also like to be able to navigate (as nodes) the commits (e.g., history), branches, and tags in the Git repository.

How does federation work?

The first thing to understand is that ModeShape does not copy the data from the external system into the repository. Instead, ModeShape (with the help of connectors) dynamically creates nodes upon demand to represent the external data. If the external data doesn’t change too often or is okay to be slightly out of date, then you can optionally have ModeShape cache the nodes in-memory. But either way, the external system remains the owner of its data.

Secondly, federation is transparent to clients. Once federation is configured, the repository’s regular content and federated content all looks to client applications like regular content.

Thirdly, a repository does not use federation by default; you have to configure it for each repository. To do that, a repository configuration must specify:

  1. how ModeShape is to communicate with the external system (e.g., which connector implementation is to be used)
  2. the properties that the connector needs to talk with a particular external system (e.g., an external source)
  3. where and how the data in the external system is to be projected into the repository

All of this is defined inside the “externalSources” area of a repository’s configuration file. Here’s a simple JSON repository configuration that defines one external source called “downloads” and another called “sourceCode”:

{
  "name" : "MyRepository",
  "workspaces" : {
    "predefined" : [ "ws1", "ws2" ],
    "initialContent" : {
      "default" : "resources/initialContent.xml"
    }
  },
  "externalSources" : {
    "downloads" : {
      "classname" : "org.modeshape.connector.filesystem.FileSystemConnector",
      "directoryPath" : "/opt/downloads",
      "readonly" : true,
      "cacheTtlSeconds" : 5,
      "projections" : [ "default:/files/downloads => /" ]
    },
    "sourceCode" : {
      "classname" : "org.modeshape.connector.git.GitConnector",
      "directoryPath" : "data/repo",
      "remoteName" : "origin,upstream",
      "queryableBranches" : "master",
      "cacheTtlSeconds" : 5,
      "projections" : [ "default:/sources/ => /" ]
    }
  }
}

Each external source is identified by a unique name that you assign, and specifies the name of the connector implementation class and other connector-specific properties. A connector is simply a subclass of the “org.modeshape.jcr.federation.spi.Connector” class that contains that logic of how to create nodes that represent external data and, optionally, how to update the external data based upon changes to the nodes. We’ve designed the SPI so that you can easily create your own subclasses of Connector (or ReadOnlyConnector if the connector should never update data in the external system).

Let’s look at this configuration file a bit more. The “downloads” external source (line 10) defines several other properties:

  • the “directoryPath” is the location on the local file system of the top-level directory that is to be accessed by the connector; we use an absolute path here, though relative paths also work.
  • the “readonly” property specifies that ModeShape should never update any of the files or folders on the file system (yes, the FileSystemConnector is capable of creating, updating, and deleting files and folders on the file system in response to applications creating, updating, or deleting the corresponding nodes in the repository).
  • the “cacheTtlSeconds” is the time in seconds (5 in our case) that the nodes created by the connector to represent external files/folders should be cached.
  • the “projections” field is an array of string values that define the paths of the “federated” nodes that will represent the objects in Git. Our value of “default:/files/downloads => /” means that the top-level directory of the external source (that is, the “/opt/downloads” folder) should be projected as a node at “/files/downloads” in the “default” workspace. (Note that we’re also specifying that the “default” workspace should be populated with nodes described by the “resources/initialContent.xml” file. It’s here that we’d define the node type and any properties for the “/files” node.)

The file system connector will create a structure that mirrors the files and folders on the file system. So if the “/opt/downloads” directory contained the following:

aircraft
aircraft/Boeing
aircraft/Boeing/747.jpg
aircraft/Boeing/787.jpg
aircraft/Airbus
aircraft/Airbus/A380.jpg
aircraft/Airbus/A380.jpg
aircraft/Airbus/A320.jpg

then we would then have the following nodes inside the “default” workspace:

/files   (primary type “nt:unstructured”)
/files/downloads   (primary type “nt:folder”)
/files/downloads/aircraft   (primary type “nt:folder”, represents “/opt/downloads/aircraft”)
/files/downloads/aircraft/Boeing   (primary type “nt:folder”, represents “/opt/downloads/aircraft/Boeing”)
/files/downloads/aircraft/Boeing/747.jpg   (primary type “nt:file”, represents “/opt/downloads/aircraft/Boeing/747.png”)
/files/downloads/aircraft/Boeing/787.jpg   (primary type “nt:file”, represents “/opt/downloads/aircraft/Boeing/787.png”)
/files/downloads/aircraft/Airbus   (primary type “nt:folder”, represents “/opt/downloads/aircraft/Airbus”)
/files/downloads/aircraft/Airbus/A380.jpg   (primary type “nt:file”, represents “/opt/downloads/aircraft/Airbus/A380.png”)
/files/downloads/aircraft/Airbus/A320.jpg   (primary type “nt:file”, represents “/opt/downloads/aircraft/Airbus/A320.png”)

Note how the “/files” and “/files/downloads” nodes exist in the workspace, but the “/files/downloads/aircraft” node is dynamically projected to mirror the “/opt/downloads/aircraft” folder. If the “/opt/downloads/aircraft” folder were to be removed, then the “/files/downloads/aircraft” node would automatically be removed as well.

The “sourceCode” external source is pretty similar, but it access a local Git repository:

  • the “classname” specifies that the “org.modeshape.connector.git.GitConnector” connector implementation be used. This class will be included in ModeShape 3.1.0.Final. This connector is read-only.
  • the “directoryPath” is the location on the local file system of the top-level directory that contains a valid Git repository (e.g., it contains a “.git” directory); we use a relative path here.
  • the “remotename” is the name (or comma-separated list of names) of the remote(s).
  • the “cacheTtlSeconds” is the time in seconds (5 in our case) that the nodes created by the connector to represent Git’s files, folders, commits, tags, and branches should be cached.
  • the “projections” field specifies where in the repository the Git nodes should appear. Our value of “default:/sources => /” means that the top-level directory of the external source (that is, the Git repository folder) should be projected as a node at “/sources” in the “default” workspace.

The Git connector doesn’t really work with a local working directory of the Git repository. Instead, it basically exposes all of the commits, branches, and tags (plus the files and folders in each). It does this by mapping Git functionality onto some special nodes:

  • The “branches” node is a container under which all branches can be found.
  • The “tags” node is a container under which all tags can be found.
  • The “commits” node is a container under which all commits appear, with the most recent commits appearing first in the children.
  • The “branches/{branchName}” nodes represent the information about each branch, including the commit ID and references to the node representing the commit.
  • The “tags/{tagName}” nodes represent the information about each tag, including the commit ID and references to the node representing the commit.
  • The “commits/{branchOrTagOrCommit}” nodes show the history of commits for the given branch/tag/commit.
  • The “commits/{branchOrTagOrCommit}/{commitId}” shows the details of a specific commit in the history of a particular branch/tag/commit.
  • The “commit/{branchOrTagOrCommit}” shows the details of the specified commit
  • The “tree/{branchOrTagOrCommit}” is a container for the files and folders within the specified branch, tag or commit.

Thus in our repository, all the Git information is projected under “/sources”, so a number of nodes would appear in our repository:

  • A node representing each branch appears as children under “/sources/branches” (e.g., “/sources/branches/master”)
  • A node representing each tag appears as children under “/sources/tags” (e.g., “/sources/tags/release-1.0”)
  • A node representing each commit appears as children under “/sources/commits” (e.g., “/sources/commits/bbfa3f3d76b0…”)
  • A node structure (with “nt:file” and “nt:folder” descendant nodes) representing the workspace (with its files/folders) for each commit, branch and tag appears as children under “/sources/tree” (e.g., “/sources/tree/master/pom.xml”)

By the way, you can control with the projections which subset of the nodes exposed by the connector should be projected into the repository. For example, if you wanted to expose only a specific branch (e.g., “master”) in the Git repository under the “/sources” node, you could change the projection rule to be “default:/sources => /tree/master”.

Programmatically creating projections

ModeShape’s public API now contains a new FederationManager class that can be used to programmatically create and remove projections. However, external sources still must be configured via the JSON configuration file.

Custom connectors

As we mentioned earlier, we provide two connectors out-of-the-box in 3.1. We do plan to add a few more, but we’ve always expected the some developers would want to create their own custom connectors. Hopefully our SPI is simple enough that doing so is very straightforward. So if you’re interested in this, please take a look and give us feedback on our forums. The SPI should be pretty stable, but if we find some glaring problems, we may need to change the SPI slightly before the 3.2 release; after that, however, we’ll lock down the SPI.

Summary

We described a few ways that federating content from external sources into your ModeShape repository can be quite useful, and we also took a very detailed look into how federation is configured. In reality, however, we just skimmed the surface of what is actually possible with ModeShape federation.

Stay tuned for more on the 3.1 release later this week.

Filed under: features, federation, jcr

Creating and using tags in your content

UPDATE 2: Changed option 3 to use string identifiers, as WEAKREFERENCE and REFERENCE properties both maintain back-references.

UPDATE 1: Added a 5th option, as suggested by Bertrand Delacretaz.

(This post was inspired by a response I recently wrote to a Stack Overflow question. That answer was a bit long, but I thought it would also be suitable as a blog post.)

Many applications offer a way to tag “things” with either user-defined or system-defined tags. Assuming those “things” are nodes, what’s the best way to add tags to a ModeShape repository? I know of four five possible approaches, each with their own benefits and disadvantages.

Option 1: Use Mixins

This approach will use a separate mixin node type definition for each tag. The mixin is a marker mixin (e.g., it has no property definitions or child node definitions). One example of “known-issue” tag is the following (in CND format):

tag="http://www.example.com/tags"
[tag:known-issue] mixin

Create this tag by registering the node type definition using the NodeTypeManager, either by programmatically creating the node type template or by uploading a CND file.

To “tag” a particular node, simply add the tag’s mixin to the node:

node.addMixin("tag:knownIssue");

Note that any node can have multiple tags, since any node can have multiple mixins.

To find all nodes that have a particular tag, simply issue a query:

SELECT * FROM [tag:known-issue]

To find all nodes that have two tags, simply perform a UNION:

SELECT * FROM [tag:known-issue]
UNION
SELECT * FROM [tag:critical-issue]

This approach is pretty straightforward and really uses ModeShape’s mixin feature. However, it is fairly cumbersome to create new tags, since that requires registering new node types. Plus, you cannot easily rename tags, but instead would have to:

  1. create the mixin for the tag with the new name;
  2. find all nodes that have the mixin representing the old tag, and for each remove the old mixin and add the new one;
  3. finally remove the node type definition for the old tag (after it is no longer used anywhere).

Removing old tags is done in a similar manner. Finally, it’s not really possible to associate additional metadata (like a display name) with a tag, since extra properties aren’t allowed on node type definitions.

This approach should perform quite well, however.

Option 2: Use a taxonomy and references

This approach involves using one or more “taxonomies“, each of which consist of a parent node for the taxonomy and child nodes for each tag in that taxonomy. The exact node types used are entirely up to you, but the taxonomy structure can be as rich as you’d like it to be. For example, you can create inheritance between tags in much the same way that classes can inherit from other classes in an ontology. Obviously adding, renaming, and removing tags is straightforward.

To “tag” a node, this approach uses a REFERENCE property. One way to do this is to define a single node type for the tag nodes and a single mixin that we’ll use to add this REFERENCE property to “taggable” nodes:

tags="http://www.example.com/tags"
[tags:tag] > mix:title, mix:referenceable

[tags:taggable] mixin
- tags:tags (REFERENCE) multiple < 'tags:tag'

To “apply” the tag to a node, simply add the “tags:taggable” mixin to the node (if not already there) and add the REFERENCE to the desired tag node. Here’s some code that does this (although it is too simple and assumes the node hasn’t already been tagged):

Node tag = ... // find in taxonomy
Node n = ... // the node that we're going to tag
if ( !n.isNodeType("tags:taggable") ) {
    n.addMixin("tags.taggable");
}
Value[] values = new Value[1];
values[0] = session.getValueFactory().createValue(tag);
n.setProperty("tags:tags",values);

To find all nodes of a particular tag, simply get the tag and call “getReferences()” on a tag node to find all of the nodes that contain a reference to the tag node:

Node tag = ...
NodeIterator iter = tag.getReferences("tags:tags");
while ( iter.hasNext() ) {
    Node tagged = iter.next();
}

Alternatively, you could use a query to find all of the nodes for a particular tag. Here’s one that finds all the nodes that are tagged with the ‘known-issues’ or ‘critical-issue’ tag (note how easy it is to search for nodes tagged with any of 1, 2, or n tags just by changing the set criteria):

SELECT * FROM [tags:taggable] AS taggable
JOIN [tags:tag] AS tag ON taggable.[tags:tags] = tag.[jcr:uuid]
AND LOCALNAME(tag) IN ('known-issue','critical-issue')

This approach has the benefit that all tags have to be controlled/managed within one or more taxonomies (including perhaps user-specific taxonomies).

However, there is one potentially substantial disadvantage: this option may not scale very well to large numbers of tagged nodes. ModeShape might start to degrade adding and removing REFERENCE values when there are hundreds of nodes pointing to the same tag node. Another disadvantage is that a tag cannot be removed from a taxonomy unless it is no longer used.

You can also use WEAKREFERENCE rather than REFERENCE. The only distinction is that with WEAKREFERENCE you can remove a tag from the taxonomy without having to remove it from the tagged nodes.

Option 3: Use taxonomy and identifier references

This option is similar to Option 2 above in that it involves formally managing one or more taxonomies, in exactly the same was as described above. The difference, however, is that rather than use a REFERENCE (or WEAKREFERENCE) the node that is to be tagged points to the tag node using a STRING property with the identifier of the tag node:

tags="http://www.example.com/tags"
[tags:tag] > mix:title, mix:referenceable

[tags:taggable] mixin
- tags:tags (STRING) multiple

Note that the tag has a “jcr:title” property, which you can use to hold the display name for the tag.

Tagging a node is done similarly to Option 2, except the value of the “tags:tag” property is a string:

Node tag = ... // find in taxonomy
String tagId = tag.getIdentifier();
Node n = ... // the node that we're going to tag
if ( !n.isNodeType("tags:taggable") ) {
    n.addMixin("tags.taggable");
}
Value[] values = new Value[1];
values[0] = session.getValueFactory().createValue(tagId);
n.setProperty("tags:tags",values);

To find all nodes of a particular tag, simply use a query to find all of the nodes that have the identifier of a particular tag. Here’s one that finds all the nodes that are tagged with the ‘known-issues’ or ‘critical-issue’ tag (note how easy it is to search for nodes tagged with any of 1, 2, or n tags just by changing the set criteria):

SELECT * FROM [tags:taggable] AS taggable
JOIN [tags:tag] AS tag ON taggable.[tags:tags] = tag.[jcr:uuid]
AND LOCALNAME(tag) IN ('known-issue','critical-issue')

You’ll note that this is very similar to the query in Options 2 and 3. That’s because REFERENCE and WEAKREFERENCE properties are physically stored in a property value as an identifier.

Like option 2, this approach does enforce using one or more taxonomies, makes it a bit easier to control the tags, since they must exist in a taxonomy before they can be used. Renaming nodes is also pretty easy, although this is not necessary if using the “jcr:title” property for the display name , since renaming involves simply changing the title property value. Performance-wise, this is far better than the REFERENCE and WEAKREFERENCE approach, since non-reference properties will scale much better and perform better with large numbers of references, regardless of whether they all point to one node or many. Looking up the tag(s) from the “tags:tags” property is also very fast (and faster than navigating a path).

This approach is similar to Option 2 with WEAKREFERENCE properties in that you can remove a tag even if it is still used, although nodes’ “tags:tags” property values that point to that removed tag will not be usable anymore. This can be remedied with some conventions in your application, or by simply keeping tags around and using metadata on the taxonomy to say that a particular tag is “deprecated” and shouldn’t be used. (IMO, the latter is actually a benefit of this approach.)

This option will generally perform and scale much better than Option 2.

Option 4: Use string properties

The final approach is to simply use a STRING property to tag each node with the name of the tag(s) that are to be applied. This works great for ad hoc tags, which is when there is no formal taxonomy and any tag can be used at any time.

Here’s a mixin that defines a multi-valued STRING property:

tags="http://www.example.com/tags"
[tags:taggable] mixin
- tags:tags (STRING) multiple

To tag a node, simply add the mixin (if not already present) and add the name of the tag as a value on the “tags:tags” STRING property (again, if it’s not already present as a value). Here’s some simplified code that does none of the checking, but which gives the basic idea:

Node n = ... // the node that we're going to tag
if ( !n.isNodeType("tags:taggable") ) {
    n.addMixin("tags.taggable");
}
String[] tags = new String[1]{"known-issue"};
n.setProperty("tags:tags",tags);

The primary advantage of this approach is that it is very simple: you’re simply using string values on the node that is to be tagged. To find all nodes that are tagged with a particular tag (e.g., “tag1”), simply issue a query:

SELECT * FROM [acme:taggable] AS taggable
WHERE taggable.[tags:tags] = 'known-issue'

Also, there is no taxonomy to manage. But if a tag is to be renamed, then you could simply process the “tags:tags” values. If a tag is to be deleted (and removed from the nodes that are tagged with it), then that can be done by removing the tag name from the “tags:tags” properties (perhaps in a background job).

Note that this allows any tag name to be used, and thus works best for cases where the tag names are not controlled at all. If you want to control the list of strings used as tag values, you could create a taxonomy in the repository (as described in Options 2 and 3 above) and have your application limit the values to those in the taxonomy. You can even have multiple taxonomies, some of which are perhaps user-specific. But this approach doesn’t have quite the same control as Options 2 or 3.

This option will perform just a bit better than Option 3 (since the queries are tad simpler), but will scale just as well.

Option 5: Use taxonomy and paths

A fifth option is very similar to Option 3, except that you use a PATH property (rather than a STRING property) that points to the tag, where the PATH values are paths to the tag. Here are some node types:

tags="http://www.example.com/tags"
[tags:tag] > mix:title

[tags:taggable] mixin
- tags:tags (PATH) multiple

(You could also use a STRING property instead of PATH; really the only advantage of using PATH is that it enforces that each value is a legal path value. But using PATH does not enforce that it is an existing path.)

To tag a node, simply add the mixin (if not already present) and add the path of the tag as a value on the “tags:tags” STRING property (again, if it’s not already present as a value). Here’s some simplified code that does none of the checking, but which gives the basic idea:

Node tag = ... // the tag node
Node n = ... // the node that we're going to tag
if ( !n.isNodeType("tags:taggable") ) {
    n.addMixin("tags.taggable");
}
String[] tags = new String[1]{tag.getPath()};
n.setProperty("tags:tags",tags);

Unlike Options 2 or 3, this approach does not even use taxonomies. In fact, you’ll notice that the “tags:tags” property node type has no constraints that require it to contain a path; this reduces the constraints and requires your application to use convention, which can be an advantage. Using a title on the tag for the displayable name obviates having to rename tags. Performance-wise, this is far better than the REFERENCE or WEAKREFERENCE approach, and (for ModeShape) just a bit worse than using the STRING property with an identifier (ModeShape can resolve an identifier faster than it can finding it by path). But it will scale far better than Option 2 and similarly to Option 3.

One advantage of this approach (and of Option 3) over Option 2 is that you can remove a tag even if it is still used, although nodes’ PATH properties that point to that removed tag will be readable but not resolvable. (If you’re using the tag’s title for the display name, this might not be useful since the path might not contain meaningful and usable information.) This can be remedied with some conventions in your application, or by simply keeping tags around and using metadata on the taxonomy to say that a particular tag is “deprecated” and shouldn’t be used. (IMO, the latter is actually a benefit of this approach.)

Summary

We looked at five different ways of incorporating tags into your application. Of course, which one works best for you will depend on the needs of your particular application. And use these as a starting point — feel free to customize them, combine them, or even come up with even other alternatives.

If you just need a way to associate informal tags with content, perhaps Option 4 is a good fit. For very small and limited tagging needs, Option 1 might work. Whereas you should seriously look at option 2 for smallish repositories that needs a formal taxonomy.

But for most applications, your repository will be large enough that you will probably want to look at Options 3, 4 or 5, with the deciding factor being whether you need formal or informal taxonomies. Personally, of these three I think I’d tend to lean toward Option 3.

Happy tagging!

Filed under: jcr, techniques

ModeShape is

a lightweight, fast, pluggable, open-source JCR repository that federates and unifies content from multiple systems, including files systems, databases, data grids, other repositories, etc.

Use the JCR API to access the information you already have, or use it like a conventional JCR system (just with more ways to persist your content).

ModeShape used to be 'JBoss DNA'. It's the same project, same community, same license, and same software.

ModeShape

Topics