
A common use for repositories is to manage (among other things) files, so ModeShape can handle even extremely large binary values, including values larger than available memory. This is possible because ModeShape never loads the whole value onto the heap, but instead streams the value to and from the persistent store. You can also configure where ModeShape stores the binary values independently of where the rest of the content is stored.

How it works

The key to understanding how ModeShape manages the Binary values is to remember how the JCR API exposes them. To set a property to a Binary value, the JCR client creates the javax.jcr.Binary instance from the binary stream:

Then, to access the binary content, the JCR client gets the property, gets the binary value(s), and then obtains the binary value's InputStream:

When ModeShape creates the actual javax.jcr.Binary value, it reads the supplied java.io.InputStream and immediately stores the content in the repository's binary storage area, which then returns a Binary instance that contains a pointer to the persisted binary content. When needed, that Binary instance (or another one obtained at a later time) obtains from the binary storage area the InputStream for the content and simply returns it.

Note that the same Binary value can be read from one property and set on any other properties:

This works because the Binary value contains only the pointer to the binary content, so copying or reusing the Binary objects is very efficient and lightweight. It also works because of what ModeShape uses for the pointers.

ModeShape stores all binary content by its SHA-1 hash. The SHA-1 cryptographic hash function is not used for security purposes, but is instead used because the SHA-1 can reliably be determined entirely from the content itself, and because, for all practical purposes, two binary contents will have the same SHA-1 only if they are identical. Thus, the SHA-1 hash of some binary content serves as an excellent key for storing and referencing that content. The pointer we mentioned in the previous paragraph is merely the SHA-1 of the binary content. The following diagram represents how this works:

Using the SHA-1 hash as the identifier for the binary content also means that ModeShape never needs to store a given binary content more than once, no matter how many nodes or properties refer to it. It also means that if your JCR client already knows (or can compute) the SHA-1 of a large value, the JCR client can use ModeShape-specific APIs to easily determine if that value has already been stored in the repository. (We'll see an example of this later on.)
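
To make the idea of content-addressed storage concrete, here is a minimal sketch that uses only the JDK. It is not ModeShape's implementation: the class Sha1ContentStore and all of its methods are hypothetical names invented for this illustration. It streams content to disk in chunks while hashing it, uses the hexadecimal SHA-1 as the file name, and keeps only one copy of identical content.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.stream.Stream;

// Hypothetical illustration only; this is NOT ModeShape's BinaryStore API.
class Sha1ContentStore {
    private final Path directory;

    Sha1ContentStore(Path directory) {
        this.directory = directory;
    }

    // Convenience factory (an assumption of this sketch): store content in a new temp directory.
    static Sha1ContentStore inTempDirectory() {
        try {
            return new Sha1ContentStore(Files.createTempDirectory("sha1-store"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Streams the content to disk in 8 KB chunks while hashing it; returns the hex SHA-1 key.
    String store(InputStream content) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            Path temp = Files.createTempFile(directory, "incoming", ".tmp");
            try (InputStream in = content; OutputStream out = Files.newOutputStream(temp)) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    sha1.update(buffer, 0, read);
                    out.write(buffer, 0, read);
                }
            }
            String key = toHex(sha1.digest());
            Path target = directory.resolve(key);
            if (Files.exists(target)) {
                Files.delete(temp);        // identical content is already stored: keep one copy
            } else {
                Files.move(temp, target);  // first time this content has been seen
            }
            return key;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    // A client that already knows the SHA-1 can check for the content without uploading it.
    boolean contains(String sha1Hex) {
        return Files.exists(directory.resolve(sha1Hex));
    }

    // Number of distinct contents currently stored.
    long storedCount() {
        try (Stream<Path> files = Files.list(directory)) {
            return files.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}
```

Storing the same bytes twice in this sketch returns the same key both times and leaves a single file on disk, which is the deduplication property described above.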

Extended Binary interface

The ModeShape public API defines the org.modeshape.jcr.api.Binary interface as a simple extension to the standard javax.jcr.Binary interface. ModeShape's extension adds useful methods to get the SHA-1 hash (as a byte array and as a hexadecimal string) and the MIME type for the content:

All javax.jcr.Binary values returned by ModeShape will implement this public interface, so feel free to cast the values to gain access to the additional methods.

Importing and Exporting

When exporting content from a workspace with large Binary values, be sure to export using JCR's System View format. Only the System View treats properties as child elements. This allows each large value to be streamed (using buffered streams) into the XML element's content as a Base64-encoded string. Importing can also take advantage of streaming.
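
The streaming Base64 technique mentioned above can be sketched with the JDK alone. The class Base64StreamingSketch and its methods are hypothetical names for this illustration, and the code does not show how ModeShape's exporter is written. It only demonstrates that a value of any size can be encoded chunk by chunk without ever holding the whole value in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration of chunked Base64 encoding; not ModeShape's exporter.
class Base64StreamingSketch {

    // Copies 'binary' into 'target' as Base64 text, 8 KB at a time.
    // Note: closing the wrapping encoder flushes the final padding and also closes 'target'.
    static void encode(InputStream binary, OutputStream target) {
        try (InputStream in = binary;
             OutputStream encoder = Base64.getEncoder().wrap(target)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                encoder.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Small helper for demonstration; a real exporter would write to the XML output instead.
    static String encodeToString(InputStream binary) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        encode(binary, out);
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }
}
```

Because closing the encoder also closes the underlying stream, code that writes into a larger XML document would need to guard the document's stream against that close; the sketch leaves this out for brevity.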

Exporting content using JCR's Document View results in all properties being treated as XML attributes, and various XML processing libraries treat large attributes poorly (e.g., using values that are in-memory String objects). Another critical disadvantage of the Document View is that it is unable to represent multi-valued properties, since attributes can have only one value.

Implementation design

This section describes the internal design of how ModeShape stores binary values, and is typically useful to either understand the nuances of the various configuration choices or to implement custom binary stores.

None of the interfaces described in this section are part of the public API, and should never be directly used by JCR client applications.

BinaryValue

In addition to the ModeShape-specific org.modeshape.jcr.api.Binary extension, ModeShape also defines an org.modeshape.jcr.value.BinaryValue interface that adds several other features required to properly persist and manage Binary values. These other features are part of ModeShape's internal design and are therefore not appropriate for inclusion in the public API. Specifically, BinaryValue instances are themselves immutable, they have an immutable BinaryKey that is a comparable representation of the SHA-1 hash, they are comparable with each other (based upon their keys), they can be serialized, and the getSize() method does not throw an exception like the standard method:

BinaryStore

The ModeShape-specific BinaryStore interface is thus defined to use the internal BinaryValue interface:

Each BinaryStore implementation must provide a no-arg constructor, and its member fields can be configured via the repository configuration. Note that the BinaryStore implementation must also implement several setter methods, which the repository calls when the BinaryStore is initialized and may call again at any time after that (for example, when the repository configuration changes).

Minimum binary size

When the BinaryStore is initialized, the repository will use the setMinimumBinarySizeInBytes(...) method to specify the minimum size of the BinaryValue instances that must be persisted within the BinaryStore. Any binary content smaller than this can either be represented with InMemoryBinaryValue instances (meaning it will be persisted with the property where it's used) or be persisted in the BinaryStore. Note that if the repository's configuration changes, the repository may set a different minimum size threshold.
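
One way to apply such a threshold without knowing the size of a stream in advance is to read at most the threshold number of bytes first and decide based on whether the stream ended. The following sketch shows that idea under stated assumptions: the class MinimumSizeRouting and its route method are hypothetical, the OutputStream stands in for a real binary store, and nothing here is taken from ModeShape's code. It requires Java 11 or later for InputStream.readNBytes(int).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Optional;

// Hypothetical illustration of a minimum-size threshold; not ModeShape's implementation.
class MinimumSizeRouting {

    // Returns the bytes when the content is smaller than the threshold (so the caller can
    // keep it inline with the property). Otherwise streams everything to 'binaryStore'
    // and returns an empty Optional.
    static Optional<byte[]> route(InputStream content, int minimumSizeInBytes, OutputStream binaryStore) {
        try (InputStream in = content) {
            // Reads until 'minimumSizeInBytes' bytes are available or the stream ends.
            byte[] head = in.readNBytes(minimumSizeInBytes);
            if (head.length < minimumSizeInBytes) {
                return Optional.of(head);       // small value: never touches the store
            }
            binaryStore.write(head);            // large value: write what was already read...
            in.transferTo(binaryStore);         // ...then stream the rest without buffering it all
            return Optional.empty();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch, content whose size exactly equals the threshold goes to the store, matching the rule that only content smaller than the threshold may be kept in memory.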

Minimum string size

The repository can also use the BinaryStore to store large string values. Any strings larger than the threshold set in the repository configuration will be stored in the BinaryStore and referenced in the node. Note that there is nothing to configure in the BinaryStore itself.

MIME type detection

When the BinaryStore is initialized, the repository will use the setMimeTypeDetector(...) method to give the BinaryStore a MimeTypeDetector instance it can use to determine the MIME type for any binary content. The BinaryStore is free to determine the MIME type at any time, including when the binary content is stored or only when the MIME type is needed (via the getMimeType(...) method). The BinaryStore is also free to persist this information, since binary content for a given SHA-1 never changes. Note that if the repository's configuration changes, the repository may set a different MIME type detector.

Text extraction

When the BinaryStore is initialized, the repository will use the setTextExtractor(...) method to give the BinaryStore a TextExtractor instance it can use to extract the content's full-text search terms. The BinaryStore is free to extract these terms at any time, including when the binary content is stored or only when the terms are requested (via the getText(...) method). The BinaryStore is also free to persist this information, since binary content for a given SHA-1 never changes. Note that if the repository's configuration changes, the repository may set a different text extractor.

To extract content from a binary value, ModeShape relies on third-party libraries for the extraction functionality. ModeShape comes with one built-in extractor, which uses Apache Tika. See Built-in text extractors for details on how to configure this extractor.

Garbage collection

There are a number of ways in which the BinaryStore may contain binary content (keyed by the SHA-1) that is no longer used or referenced. The first is when a JCR client or the repository removes the last Property containing the Binary. A second case is when a JCR client uses a Session to create a javax.jcr.Binary value and then clears the transient state (before the Session's transient state is saved). Neither of these poses a problem, since the minimum requirement is that the BinaryStore contain at least the content that is referenced in the repository content. However, all unused binary content in the BinaryStore takes up storage space, so ModeShape defines a way for the repository and the BinaryStore to recover that unused storage.

The repository checks with each successful transaction whether any binaries involved in that transaction are still referenced by repository content. If they are not used, the BinaryStore quarantines the binaries; if any quarantined binaries are used again, the BinaryStore can remove them from quarantine. The repository then periodically calls, from a garbage-collection thread, the BinaryStore's removeValuesUnusedLongerThan(...) method to purge all binaries that have been quarantined for at least the specified period of time.

The quarantine approach means that when BinaryValue instances are removed, there is a period of time during which they can be reused without the expensive removal and re-adding of the binary content.
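
The quarantine mechanism described above can be sketched as a small in-memory model. Everything in it is an assumption made for illustration: the class QuarantineSketch, its method names, and the explicit clock parameters are hypothetical and are not ModeShape's BinaryStore API. The sketch only shows the bookkeeping: unused keys are timestamped, reused keys leave quarantine, and a purge removes keys that have been quarantined for at least a given time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of quarantine-based garbage collection.
class QuarantineSketch {
    private final Map<String, byte[]> content = new HashMap<>();   // key -> stored bytes
    private final Map<String, Long> unusedSince = new HashMap<>(); // key -> quarantine start (ms)

    void store(String key, byte[] bytes) {
        content.put(key, bytes);
        unusedSince.remove(key);           // freshly stored content counts as used
    }

    // Called when no repository content references the key any longer.
    void markAsUnused(String key, long nowMillis) {
        if (content.containsKey(key)) {
            unusedSince.putIfAbsent(key, nowMillis);  // keep the earliest quarantine time
        }
    }

    // Called when a quarantined value is referenced again; no bytes need to be re-added.
    void markAsUsed(String key) {
        unusedSince.remove(key);
    }

    // Purges every value quarantined for at least 'minimumAgeMillis'; returns the removed keys.
    List<String> removeUnusedLongerThan(long minimumAgeMillis, long nowMillis) {
        List<String> removed = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> entries = unusedSince.entrySet().iterator();
        while (entries.hasNext()) {
            Map.Entry<String, Long> entry = entries.next();
            if (nowMillis - entry.getValue() >= minimumAgeMillis) {
                content.remove(entry.getKey());
                removed.add(entry.getKey());
                entries.remove();
            }
        }
        return removed;
    }

    boolean contains(String key) {
        return content.containsKey(key);
    }
}
```

Passing the current time as a parameter is a choice made here so that the behavior is easy to reason about and test; a real store would more likely read a clock itself.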

BinaryStore implementations

See this page for the list of available binary stores.

We would like to have other options, including storage in S3 and Hadoop. But it's also possible for developers using ModeShape to write their own implementations.

Configuring Binary Stores

If no explicit binary store configuration is present, the TransientBinaryStore implementation will be used by default. As explained above, this is not really suitable outside a testing context, as any binaries will be lost between restarts.

For information about each particular binary store type, see this page.

Files and Folders

A very simple way of adding binary content to a repository is to upload files and folders.
The JCR specification defines the following node types:

which means that a natural folder/file hierarchy would use a nt:folder/nt:file/jcr:content->jcr:data hierarchy.
ModeShape provides, via the modeshape-jcr-api artifact, a utility for creating such a hierarchy: org.modeshape.jcr.api.JcrTools#uploadFile. See the implementation for more information.

Mime Type detection

When working with binary values on nodes which also have the mix:mimeType mixin (like the default files and folders), ModeShape automatically detects, via Apache Tika, the mime type of the binary value and sets it on the jcr:mimeType property whenever new nodes are created and the JCR session is saved. By default, the mime type detection process reads at least the header of each binary value and determines the mime type from a "magic byte pattern" in that header.

This is a very powerful feature which works well, but there may be cases (for example working with lots of binary values) when reading the stream of each binary value is counter-productive and memory-intensive. Therefore, starting with version 4.4.0.Final, ModeShape offers the possibility of configuring the mime type detection mechanism:

or in the AS server kit:

The possible values for the mime type detection are:

content (default): determines the mime type based on the actual content of each binary. Requires reading at least the header of each binary into memory.
name: uses only the name (which ideally contains an extension) of the parent of the node which owns the binary value. This means that when using the default nt:file type, this will be the name of the file. This type of detection does not read anything from the binary stream.
none: disables mime type detection altogether.
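
The difference between the three modes can be illustrated with the JDK's own, much simpler, content type guessing in java.net.URLConnection. This is a stand-in chosen for the sketch and not what ModeShape does (ModeShape uses Apache Tika, as noted above); the class MimeDetectionSketch and its Mode enum are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URLConnection;

// Hypothetical illustration of content-based, name-based, and disabled detection.
// Uses the JDK's URLConnection helpers as a stand-in for Apache Tika.
class MimeDetectionSketch {
    enum Mode { CONTENT, NAME, NONE }

    // Returns the guessed mime type, or null when it cannot be (or must not be) determined.
    static String detect(Mode mode, String fileName, InputStream content) {
        switch (mode) {
            case CONTENT:
                try {
                    // The JDK helper peeks at the first bytes, so the stream must support mark/reset.
                    InputStream markable = content.markSupported()
                            ? content
                            : new BufferedInputStream(content);
                    return URLConnection.guessContentTypeFromStream(markable);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            case NAME:
                // Looks only at the file extension; the stream is never touched.
                return URLConnection.guessContentTypeFromName(fileName);
            default:
                return null;   // NONE: detection disabled
        }
    }
}
```

In NAME mode the content argument is never read, which is why that mode avoids the cost of opening each binary stream.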