UPDATE 2: Changed option 3 to use string identifiers, as WEAKREFERENCE and REFERENCE properties both maintain back-references.
UPDATE 1: Added a 5th option, as suggested by Bertrand Delacretaz.
(This post was inspired by a response I recently wrote to a Stack Overflow question. That answer was a bit long, but I thought it would also be suitable as a blog post.)
Many applications offer a way to tag “things” with either user-defined or system-defined tags. Assuming those “things” are nodes, what’s the best way to add tags to a ModeShape repository? I know of five possible approaches, each with its own benefits and disadvantages.
Option 1: Use Mixins
This approach uses a separate mixin node type definition for each tag. The mixin is a marker mixin (i.e., it has no property definitions or child node definitions). An example of a “known-issue” tag is the following (in CND format):
<tag="http://www.example.com/tags">
[tag:known-issue] mixin
Create this tag by registering the node type definition using the NodeTypeManager, either by programmatically creating the node type template or by uploading a CND file.
To “tag” a particular node, simply add the tag’s mixin to the node:
node.addMixin("tag:known-issue");
Note that any node can have multiple tags, since any node can have multiple mixins.
To find all nodes that have a particular tag, simply issue a query:
SELECT * FROM [tag:known-issue]
To find all nodes that have either of two tags, simply perform a UNION:
SELECT * FROM [tag:known-issue]
UNION
SELECT * FROM [tag:critical-issue]
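The UNION query above generalizes to any number of tags. Here is a minimal sketch in plain Java that builds such a query string; the class and method names are hypothetical and are not part of JCR or ModeShape:

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper, not part of JCR or ModeShape. */
final class MixinTagQueries {

    private MixinTagQueries() {}

    /** Builds a JCR-SQL2 query selecting nodes that carry ANY of the given tag mixins. */
    static String anyOf(List<String> tagMixins) {
        if (tagMixins.isEmpty()) {
            throw new IllegalArgumentException("at least one tag mixin is required");
        }
        // One SELECT per mixin, combined with UNION
        return tagMixins.stream()
                .map(mixin -> "SELECT * FROM [" + mixin + "]")
                .collect(Collectors.joining(" UNION "));
    }
}
```

The resulting string can then be handed to the standard JCR QueryManager.createQuery(statement, Query.JCR_SQL2) call.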
This approach is pretty straightforward and makes direct use of ModeShape’s mixin feature. However, it is fairly cumbersome to create new tags, since that requires registering new node types. Plus, you cannot easily rename tags, but instead would have to:
- create the mixin for the tag with the new name;
- find all nodes that have the mixin representing the old tag, and for each remove the old mixin and add the new one;
- finally remove the node type definition for the old tag (after it is no longer used anywhere).
Removing old tags is done in a similar manner. Finally, it’s not really possible to associate additional metadata (like a display name) with a tag, since extra properties aren’t allowed on node type definitions.
This approach should perform quite well, however.
Option 2: Use a taxonomy and references
This approach involves using one or more “taxonomies”, each of which consists of a parent node for the taxonomy and child nodes for each tag in that taxonomy. The exact node types used are entirely up to you, but the taxonomy structure can be as rich as you’d like it to be. For example, you can create inheritance between tags in much the same way that classes can inherit from other classes in an ontology. Obviously adding, renaming, and removing tags is straightforward.
To “tag” a node, this approach uses a REFERENCE property. One way to do this is to define a single node type for the tag nodes and a single mixin that we’ll use to add this REFERENCE property to “taggable” nodes:
<tags="http://www.example.com/tags">
[tags:tag] > mix:title, mix:referenceable
[tags:taggable] mixin
- tags:tags (REFERENCE) multiple < 'tags:tag'
To “apply” the tag to a node, simply add the “tags:taggable” mixin to the node (if not already there) and add the REFERENCE to the desired tag node. Here’s some code that does this (although it is too simple and assumes the node hasn’t already been tagged):
Node tag = ... // find in taxonomy
Node n = ... // the node that we're going to tag
if ( !n.isNodeType("tags:taggable") ) {
n.addMixin("tags:taggable");
}
Value[] values = new Value[1];
values[0] = session.getValueFactory().createValue(tag);
n.setProperty("tags:tags",values);
To find all nodes of a particular tag, simply get the tag node and call “getReferences()” on it. This returns the REFERENCE properties that point to the tag node, and the parent of each such property is a tagged node:
Node tag = ...
PropertyIterator iter = tag.getReferences("tags:tags");
while ( iter.hasNext() ) {
Property ref = iter.nextProperty();
Node tagged = ref.getParent();
}
Alternatively, you could use a query to find all of the nodes for a particular tag. Here’s one that finds all the nodes that are tagged with the ‘known-issue’ or ‘critical-issue’ tag (note how easy it is to search for nodes tagged with any of 1, 2, or n tags just by changing the set criteria):
SELECT * FROM [tags:taggable] AS taggable
JOIN [tags:tag] AS tag ON taggable.[tags:tags] = tag.[jcr:uuid]
WHERE LOCALNAME(tag) IN ('known-issue','critical-issue')
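The set criteria in this query can be generated for any number of tag names. Here is a minimal sketch in plain Java; the class and method names are hypothetical and are not part of JCR or ModeShape. It puts the name constraint in a WHERE clause and doubles any single quote inside a tag name, which is how JCR-SQL2 escapes quotes in string literals:

```java
import java.util.Collection;
import java.util.stream.Collectors;

/** Hypothetical helper, not part of JCR or ModeShape. */
final class TaxonomyTagQueries {

    private TaxonomyTagQueries() {}

    /** Builds a query for nodes tagged with at least one of the named tags. */
    static String taggedWithAny(Collection<String> tagNames) {
        if (tagNames.isEmpty()) {
            throw new IllegalArgumentException("at least one tag name is required");
        }
        String names = tagNames.stream()
                .map(name -> "'" + name.replace("'", "''") + "'") // escape embedded quotes
                .collect(Collectors.joining(","));
        return "SELECT * FROM [tags:taggable] AS taggable"
                + " JOIN [tags:tag] AS tag ON taggable.[tags:tags] = tag.[jcr:uuid]"
                + " WHERE LOCALNAME(tag) IN (" + names + ")";
    }
}
```

Building the literal list in one place also keeps the quoting consistent if tag names ever come from user input.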
This approach has the benefit that all tags have to be controlled/managed within one or more taxonomies (including perhaps user-specific taxonomies).
However, there is one potentially substantial disadvantage: this option may not scale very well to large numbers of tagged nodes. ModeShape’s performance might start to degrade when adding and removing REFERENCE values once there are hundreds of nodes pointing to the same tag node. Another disadvantage is that a tag cannot be removed from a taxonomy unless it is no longer used.
You can also use WEAKREFERENCE rather than REFERENCE. The only distinction is that with WEAKREFERENCE you can remove a tag from the taxonomy without having to remove it from the tagged nodes.
Option 3: Use taxonomy and identifier references
This option is similar to Option 2 above in that it involves formally managing one or more taxonomies, in exactly the same way as described above. The difference, however, is that rather than using a REFERENCE (or WEAKREFERENCE), the node that is to be tagged points to the tag node using a STRING property that holds the identifier of the tag node:
<tags="http://www.example.com/tags">
[tags:tag] > mix:title, mix:referenceable
[tags:taggable] mixin
- tags:tags (STRING) multiple
Note that the tag has a “jcr:title” property, which you can use to hold the display name for the tag.
Tagging a node is done similarly to Option 2, except the value of the “tags:tags” property is a string:
Node tag = ... // find in taxonomy
String tagId = tag.getIdentifier();
Node n = ... // the node that we're going to tag
if ( !n.isNodeType("tags:taggable") ) {
n.addMixin("tags:taggable");
}
Value[] values = new Value[1];
values[0] = session.getValueFactory().createValue(tagId);
n.setProperty("tags:tags",values);
To find all nodes of a particular tag, simply use a query to find all of the nodes that have the identifier of a particular tag. Here’s one that finds all the nodes that are tagged with the ‘known-issue’ or ‘critical-issue’ tag (note how easy it is to search for nodes tagged with any of 1, 2, or n tags just by changing the set criteria):
SELECT * FROM [tags:taggable] AS taggable
JOIN [tags:tag] AS tag ON taggable.[tags:tags] = tag.[jcr:uuid]
WHERE LOCALNAME(tag) IN ('known-issue','critical-issue')
You’ll note that this is the same query as in Option 2. That’s because REFERENCE and WEAKREFERENCE properties are physically stored in a property value as an identifier.
Like Option 2, this approach does enforce using one or more taxonomies, which makes it a bit easier to control the tags, since they must exist in a taxonomy before they can be used. Renaming tag nodes is also pretty easy, although this is not necessary if you use the “jcr:title” property for the display name, since renaming then involves simply changing the title property value. Performance-wise, this is far better than the REFERENCE and WEAKREFERENCE approach, since non-reference properties maintain no back-references and so scale and perform much better with large numbers of tagged nodes, regardless of whether they all point to one tag or many. Looking up the tag(s) from the “tags:tags” property is also very fast (and faster than navigating a path).
This approach is similar to Option 2 with WEAKREFERENCE properties in that you can remove a tag even if it is still used, although nodes’ “tags:tags” property values that point to that removed tag will not be usable anymore. This can be remedied with some conventions in your application, or by simply keeping tags around and using metadata on the taxonomy to say that a particular tag is “deprecated” and shouldn’t be used. (IMO, the latter is actually a benefit of this approach.)
This option will generally perform and scale much better than Option 2.
Option 4: Use string properties
The fourth approach is to simply use a STRING property to tag each node with the name of the tag(s) that are to be applied. This works great for ad hoc tags, which is when there is no formal taxonomy and any tag can be used at any time.
Here’s a mixin that defines a multi-valued STRING property:
<tags="http://www.example.com/tags">
[tags:taggable] mixin
- tags:tags (STRING) multiple
To tag a node, simply add the mixin (if not already present) and add the name of the tag as a value on the “tags:tags” STRING property (again, if it’s not already present as a value). Here’s some simplified code that does none of the checking, but which gives the basic idea:
Node n = ... // the node that we're going to tag
if ( !n.isNodeType("tags:taggable") ) {
n.addMixin("tags:taggable");
}
String[] tags = new String[] {"known-issue"};
n.setProperty("tags:tags",tags);
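The checking that the snippet above skips mostly comes down to merging the new tag into whatever values the property already holds. Here is a minimal sketch in plain Java, assuming the current values have already been read into a String[] (for example from Property.getValues() and Value.getString()); the class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper, not part of JCR or ModeShape. */
final class TagValues {

    private TagValues() {}

    /** Returns the existing tag values plus newTag, without duplicates, in the original order. */
    static String[] withTag(String[] existing, String newTag) {
        // LinkedHashSet keeps insertion order and silently ignores repeated values
        Set<String> merged = new LinkedHashSet<>(Arrays.asList(existing));
        merged.add(newTag); // no effect if the tag is already present
        return merged.toArray(new String[0]);
    }
}
```

The returned array can be written back with n.setProperty("tags:tags", ...), so tagging the same node twice with the same name leaves the property unchanged.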
The primary advantage of this approach is that it is very simple: you’re simply using string values on the node that is to be tagged. To find all nodes that are tagged with a particular tag (e.g., “known-issue”), simply issue a query:
SELECT * FROM [tags:taggable] AS taggable
WHERE taggable.[tags:tags] = 'known-issue'
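Finding nodes that carry several tags at once can be done by combining one such comparison per tag with AND. This relies on the JCR-SQL2 rule that a comparison against a multi-valued property is satisfied when any one of its values matches. Here is a minimal sketch in plain Java that builds the query; the class and method names are hypothetical and are not part of JCR or ModeShape:

```java
import java.util.Collection;
import java.util.stream.Collectors;

/** Hypothetical helper, not part of JCR or ModeShape. */
final class StringTagQueries {

    private StringTagQueries() {}

    /** Builds a query for nodes whose "tags:tags" property contains every one of the given tags. */
    static String taggedWithAll(Collection<String> tags) {
        if (tags.isEmpty()) {
            throw new IllegalArgumentException("at least one tag is required");
        }
        // One equality test per tag; single quotes inside a tag name are doubled
        String criteria = tags.stream()
                .map(tag -> "taggable.[tags:tags] = '" + tag.replace("'", "''") + "'")
                .collect(Collectors.joining(" AND "));
        return "SELECT * FROM [tags:taggable] AS taggable WHERE " + criteria;
    }
}
```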
Also, there is no taxonomy to manage. If a tag is to be renamed, you can simply rewrite the affected “tags:tags” values. If a tag is to be deleted (and removed from the nodes that are tagged with it), then that can be done by removing the tag name from the “tags:tags” properties (perhaps in a background job).
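The value processing for renaming and deleting a tag might look like the following sketch in plain Java. It works on the String[] values of one node’s “tags:tags” property; finding the affected nodes and writing the values back are left to the surrounding JCR code. The class and method names are hypothetical:

```java
import java.util.Arrays;

/** Hypothetical helper, not part of JCR or ModeShape. */
final class TagMaintenance {

    private TagMaintenance() {}

    /** Returns a copy of values with oldTag replaced by newTag, dropping any resulting duplicate. */
    static String[] renamed(String[] values, String oldTag, String newTag) {
        return Arrays.stream(values)
                .map(v -> v.equals(oldTag) ? newTag : v)
                .distinct() // avoids a duplicate if newTag was already present
                .toArray(String[]::new);
    }

    /** Returns a copy of values without any occurrence of tag. */
    static String[] without(String[] values, String tag) {
        return Arrays.stream(values)
                .filter(v -> !v.equals(tag))
                .toArray(String[]::new);
    }
}
```

Both methods return new arrays and leave the input untouched, which makes them easy to apply node by node in a background job.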
Note that this allows any tag name to be used, and thus works best for cases where the tag names are not controlled at all. If you want to control the list of strings used as tag values, you could create a taxonomy in the repository (as described in Options 2 and 3 above) and have your application limit the values to those in the taxonomy. You can even have multiple taxonomies, some of which are perhaps user-specific. But this approach doesn’t have quite the same control as Options 2 or 3.
This option will perform just a bit better than Option 3 (since the queries are a tad simpler), but will scale just as well.
Option 5: Use taxonomy and paths
A fifth option is very similar to Option 3, except that the “tags:tags” property is a PATH property (rather than a STRING property) whose values are the paths of the tag nodes. Here are some node types:
<tags="http://www.example.com/tags">
[tags:tag] > mix:title
[tags:taggable] mixin
- tags:tags (PATH) multiple
(You could also use a STRING property instead of PATH; really the only advantage of using PATH is that it enforces that each value is a legal path value. But using PATH does not enforce that it is an existing path.)
To tag a node, simply add the mixin (if not already present) and add the path of the tag as a value on the “tags:tags” PATH property (again, if it’s not already present as a value). Here’s some simplified code that does none of the checking, but which gives the basic idea:
Node tag = ... // the tag node
Node n = ... // the node that we're going to tag
if ( !n.isNodeType("tags:taggable") ) {
n.addMixin("tags:taggable");
}
String[] tags = new String[] {tag.getPath()};
n.setProperty("tags:tags",tags);
Unlike Options 2 or 3, this approach does not strictly enforce the use of a taxonomy. In fact, you’ll notice that the “tags:tags” property definition has no constraint that requires each value to be the path of an existing tag node; this reduces the constraints and requires your application to rely on convention, which can be an advantage. Using a title on the tag for the displayable name obviates having to rename tags. Performance-wise, this is far better than the REFERENCE or WEAKREFERENCE approach, and (for ModeShape) just a bit worse than using the STRING property with an identifier (ModeShape can resolve a node by identifier faster than it can find one by path). But it will scale far better than Option 2 and similarly to Option 3.
One advantage of this approach (and of Option 3) over Option 2 is that you can remove a tag even if it is still used, although nodes’ PATH properties that point to that removed tag will be readable but not resolvable. (If you’re using the tag’s title for the display name, this might not be useful since the path might not contain meaningful and usable information.) This can be remedied with some conventions in your application, or by simply keeping tags around and using metadata on the taxonomy to say that a particular tag is “deprecated” and shouldn’t be used. (IMO, the latter is actually a benefit of this approach.)
Summary
We looked at five different ways of incorporating tags into your application. Of course, which one works best for you will depend on the needs of your particular application. Use these as a starting point: feel free to customize them, combine them, or even come up with other alternatives.
If you just need a way to associate informal tags with content, perhaps Option 4 is a good fit. For very small and limited tagging needs, Option 1 might work. And you should seriously look at Option 2 for smallish repositories that need a formal taxonomy.
But for most applications, your repository will be large enough that you will probably want to look at Options 3, 4 or 5, with the deciding factor being whether you need formal or informal taxonomies. Personally, of these three I think I’d tend to lean toward Option 3.
Happy tagging!