
A common use for repositories is to manage files (among other things), so ModeShape 3 can handle even very large binary values, including values larger than available memory. This is possible because ModeShape never loads the whole value onto the heap, but instead streams the value to and from the persistent store. You can also configure where ModeShape stores the binary values independently of where the rest of the content is stored.

How it works

The key to understanding how ModeShape manages Binary values is to remember how the JCR API exposes them. To set a property to a Binary value, the JCR client creates the javax.jcr.Binary instance from the binary stream (using the ValueFactory.createBinary(InputStream) method) and sets it on the property.

Then, to access the binary content, the JCR client gets the property, gets the binary value(s), and then obtains the binary value's InputStream (using the Binary.getStream() method).

When ModeShape creates the actual javax.jcr.Binary value, it reads the supplied java.io.InputStream and immediately stores the content in the repository's binary storage area, which then returns a Binary instance that contains a pointer to the persisted binary content. When needed, that Binary instance (or another one obtained at a later time) obtains from the binary storage area the InputStream for the content and simply returns it.

Note that the same Binary value can be read from one property and set on any number of other properties.

This works because the Binary value contains only a pointer to the binary content, so copying or reusing Binary objects is very efficient and lightweight. It also works because of what ModeShape uses for the pointers.

ModeShape stores all binary content by its SHA-1 hash. The SHA-1 cryptographic hash function is not used for security purposes, but is instead used because the SHA-1 can reliably be determined entirely from the content itself, and because two binary contents will, for all practical purposes, have the same SHA-1 only if they are identical. Thus, the SHA-1 hash of some binary content serves as an excellent key for storing and referencing that content. The pointer we mentioned in the previous paragraph is merely the SHA-1 of the binary content. The following diagram represents how this works:

Using the SHA-1 hash as the identifier for the binary content also means that ModeShape never needs to store a given binary content more than once, no matter how many nodes or properties refer to it. It also means that if your JCR client already knows (or can compute) the SHA-1 of a large value, the JCR client can use ModeShape-specific APIs to easily determine if that value has already been stored in the repository. (We'll see an example of this later on.)
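The idea of using the SHA-1 hash as the storage key can be illustrated with a minimal sketch in plain Java. This is not ModeShape's implementation: the class name Sha1StoreSketch, its methods, and the flat one-file-per-hash layout are assumptions made purely for illustration. The content is streamed to a temporary file while the digest is computed, so the whole value is never held on the heap.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical content-addressed store; illustrates the idea only. */
final class Sha1StoreSketch {
    private final Path root;

    Sha1StoreSketch(Path root) {
        this.root = root;
    }

    /** Streams the content to disk and returns its SHA-1 hex key (the "pointer"). */
    String store(InputStream content) throws IOException, NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        Path temp = Files.createTempFile(root, "upload", ".tmp");
        // The digest is updated as the bytes flow through to the temporary file.
        try (InputStream in = new DigestInputStream(content, sha1)) {
            Files.copy(in, temp, StandardCopyOption.REPLACE_EXISTING);
        }
        String key = toHex(sha1.digest());
        Path target = root.resolve(key);
        if (Files.exists(target)) {
            Files.delete(temp);        // identical content is already stored
        } else {
            Files.move(temp, target);  // first time this content is seen
        }
        return key;
    }

    /** Lets a client that already knows the SHA-1 check for stored content. */
    boolean hasContent(String key) {
        return Files.exists(root.resolve(key));
    }

    /** Opens a stream on previously stored content. */
    InputStream open(String key) throws IOException {
        return Files.newInputStream(root.resolve(key));
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

Storing the same bytes twice returns the same key and leaves a single file on disk, which mirrors the de-duplication described above.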

Extended Binary interface

The ModeShape public API defines the org.modeshape.jcr.api.Binary interface as a simple extension to the standard javax.jcr.Binary interface. ModeShape's extension adds useful methods to get the SHA-1 hash (as a byte array and as a hexadecimal string) and the MIME type for the content.

All javax.jcr.Binary values returned by ModeShape will implement this public interface, so feel free to cast the values to gain access to the additional methods.

Importing and Exporting

When exporting content from a workspace with large Binary values, be sure to export using JCR's System View format. Only the System View treats properties as child elements. This allows each large value to be streamed (using buffered streams) into the XML element's content as a Base64-encoded string. Importing can also take advantage of streaming.
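As a rough illustration of why streaming matters here, the following sketch encodes an arbitrarily large stream as Base64 using a fixed-size buffer. It uses only the standard java.util.Base64 API and is not ModeShape's exporter; the class and method names are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Base64;

/** Hypothetical helper showing buffered, streaming Base64 encoding. */
final class Base64StreamSketch {

    /** Copies {@code in} to {@code out} as Base64; closes both streams when done. */
    static void encode(InputStream in, OutputStream out) throws IOException {
        try (InputStream source = new BufferedInputStream(in);
             OutputStream encoder = Base64.getEncoder().wrap(out)) {
            byte[] buffer = new byte[8192];
            int n;
            // Only one buffer's worth of content is in memory at any time.
            while ((n = source.read(buffer)) != -1) {
                encoder.write(buffer, 0, n);
            }
        } // closing the encoder writes any final padding
    }
}
```

Memory use stays constant regardless of the size of the value, which is not the case when the value must first be materialized as a single String attribute.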

Exporting content using JCR's Document View results in all properties being treated as XML attributes, and various XML processing libraries treat large attributes poorly (e.g., using values that are in-memory String objects). Another critical disadvantage of the Document View is that it is unable to represent multi-valued properties, since attributes can have only one value.

Implementation design

This section describes the internal design of how ModeShape stores binary values, and is typically useful to either understand the nuances of the various configuration choices or to implement custom binary stores.

None of the interfaces described in this section are part of the public API, and should never be directly used by JCR client applications.

BinaryValue

In addition to the ModeShape-specific org.modeshape.jcr.api.Binary extension, ModeShape also defines an org.modeshape.jcr.value.BinaryValue interface that adds several other features required to properly persist and manage Binary values. These other features are part of ModeShape's internal design and are therefore not appropriate for inclusion in the public API. Specifically, BinaryValue instances are themselves immutable, they have an immutable BinaryKey that is a comparable representation of the SHA-1 hash, they are comparable with each other (based upon their keys), they can be serialized, and the getSize() method does not throw an exception like the standard method.

BinaryStore

The ModeShape-specific BinaryStore interface is thus defined to use the internal BinaryValue interface.

Each BinaryStore implementation must provide a no-arg constructor, and its member fields can be configured via the repository configuration. Note that the BinaryStore implementation must also implement several setter methods, which the repository calls when the BinaryStore is initialized and may call again at any time after that (when the repository configuration changes).

Minimum binary size

When the BinaryStore is initialized, the repository will use the setMinimumBinarySizeInBytes(...) method to specify the minimum size of the BinaryValue instances that must be persisted within the BinaryStore. Any binary content smaller than this can either be represented with InMemoryBinaryValue instances (meaning it will be persisted with the property where it is used) or be persisted in the BinaryStore. Note that if the repository's configuration changes, the repository may set a different minimum size threshold.
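One way such a threshold could be applied without knowing the size of a stream in advance is sketched below. This is only an illustration under assumed names (SizeThresholdSketch, Result, classify), not ModeShape's code: it reads at most the threshold number of bytes, keeps small content in memory, and otherwise hands back a stream that replays the bytes already read followed by the remainder.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Arrays;

/** Hypothetical helper; the names here are not part of ModeShape. */
final class SizeThresholdSketch {

    /** Exactly one of the two fields is non-null. */
    static final class Result {
        final byte[] inline;        // content below the threshold, kept with the property
        final InputStream toStore;  // content at or above the threshold, to be streamed to the store

        Result(byte[] inline, InputStream toStore) {
            this.inline = inline;
            this.toStore = toStore;
        }
    }

    static Result classify(InputStream in, int minimumSizeInBytes) throws IOException {
        byte[] head = new byte[minimumSizeInBytes];
        int filled = 0;
        // Read at most minimumSizeInBytes bytes; stop early if the stream ends.
        while (filled < minimumSizeInBytes) {
            int n = in.read(head, filled, minimumSizeInBytes - filled);
            if (n == -1) {
                break;
            }
            filled += n;
        }
        if (filled < minimumSizeInBytes) {
            // The whole value is smaller than the threshold.
            return new Result(Arrays.copyOf(head, filled), null);
        }
        // Threshold reached: replay the bytes already read, then the rest of the stream.
        return new Result(null, new SequenceInputStream(new ByteArrayInputStream(head), in));
    }
}
```

At most the threshold number of bytes is ever buffered, so large values are still streamed rather than loaded onto the heap.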

Minimum string size

The repository can also use the BinaryStore to store large string values. Any strings larger than the threshold set in the repository configuration will be stored in the BinaryStore and referenced in the node. Note that there is nothing to configure in the BinaryStore itself.

MIME type detection

When the BinaryStore is initialized, the repository will use the setMimeTypeDetector(...) method to give the BinaryStore a MimeTypeDetector instance it can use to determine the MIME type for any binary content. The BinaryStore is free to determine the MIME type at any time, including when the binary content is stored or only when the MIME type is needed (via the getMimeType(...) method). The BinaryStore is also free to persist this information, since binary content for a given SHA-1 never changes. Note that if the repository's configuration changes, the repository may set a different MIME type detector.

Text extraction

When the BinaryStore is initialized, the repository will use the setTextExtractor(...) method to give the BinaryStore a TextExtractor instance it can use to extract the content's full-text search terms. The BinaryStore is free to extract these terms at any time, including when the binary content is stored or only when the terms are requested (via the getText(...) method). The BinaryStore is also free to persist this information, since binary content for a given SHA-1 never changes. Note that if the repository's configuration changes, the repository may set a different text extractor.

To extract content from a binary value, ModeShape relies on third-party libraries for the extraction functionality. ModeShape comes with one built-in extractor, which uses Apache Tika. See Built-in text extractors for how to configure this extractor.

Garbage collection

There are a number of ways in which the BinaryStore may contain binary content (keyed by the SHA-1) that is no longer used or referenced. The first is when a JCR client or the repository removes the last Property containing the Binary. A second case is when a JCR client uses a Session to create a javax.jcr.Binary value and then clears the transient state (before the Session's transient state is saved). Neither of these poses a problem, since the minimum requirement is that the BinaryStore contain at least the content that is referenced in the repository content. However, all unused binary content in the BinaryStore takes up storage space, so ModeShape defines a way for the repository and the BinaryStore to recover that unused storage.

The repository periodically runs a multi-phase garbage collection process to identify those binaries that are no longer referenced by repository content. When such binaries are discovered, the repository calls the BinaryStore's markAsUnused(...) method. The BinaryStore then quarantines the binaries; if any quarantined binaries are used again, the BinaryStore can remove them from quarantine. The repository then periodically calls the BinaryStore's removeValuesUnusedLongerThan(...) method to purge all binaries that have been quarantined for at least the specified period of time.

The quarantine approach means that when BinaryValue instances are removed, there is a period of time during which they can be reused without the expensive removal and re-adding of the binary content.
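The quarantine life cycle can be modelled with a small in-memory sketch. The method names markAsUnused and removeValuesUnusedLongerThan follow the text above, but their signatures here (including the explicit clock argument, added to keep the example deterministic) and the class itself are assumptions, not ModeShape's BinaryStore.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/** Hypothetical in-memory model of the quarantine; illustrates the idea only. */
final class QuarantineSketch {
    private final Map<String, byte[]> contentByKey = new HashMap<>();
    private final Map<String, Long> unusedSinceMillis = new HashMap<>();

    /** Stores content under its key; storing it again takes the key out of quarantine. */
    synchronized void store(String key, byte[] content) {
        contentByKey.putIfAbsent(key, content);
        unusedSinceMillis.remove(key);
    }

    synchronized boolean contains(String key) {
        return contentByKey.containsKey(key);
    }

    /** Quarantines the given keys, remembering when each was first marked. */
    synchronized void markAsUnused(Iterable<String> keys, long nowMillis) {
        for (String key : keys) {
            if (contentByKey.containsKey(key)) {
                unusedSinceMillis.putIfAbsent(key, nowMillis);
            }
        }
    }

    /** Purges content that has been quarantined for at least the given period. */
    synchronized void removeValuesUnusedLongerThan(long minimumAge, TimeUnit unit, long nowMillis) {
        long cutoff = nowMillis - unit.toMillis(minimumAge);
        Iterator<Map.Entry<String, Long>> it = unusedSinceMillis.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() <= cutoff) {
                contentByKey.remove(entry.getKey());
                it.remove();
            }
        }
    }
}
```

A value that is stored again while quarantined simply leaves quarantine, so its content never has to be deleted and rewritten.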

BinaryStore implementations

There are currently several implementations of BinaryStore:

  • org.modeshape.jcr.value.binary.FileSystemBinaryStore - Stores each binary in a file on the file system, in a hierarchy of directories based upon the SHA-1 hash. The store uses Java's native OS file locks to prevent other processes from concurrently writing the files, and it also uses an internal set of locks to prevent multiple threads from simultaneously writing to the persisted files. This store exposes buffered FileInputStream instances that directly access the underlying files.
  • org.modeshape.jcr.value.binary.InfinispanBinaryStore - Stores binary values within Infinispan, allowing the binary values to be chunked and distributed across the data grid (while the binary metadata is replicated across the grid). This option works really well for clustered topologies, since all processes in the cluster can access the same store. Two different caches are used: one for binary value metadata and one for the chunked values. Because the metadata for each value is very small (roughly 120 bytes), the metadata cache can be replicated, whereas the value cache can be replicated or distributed. Added in 3.1.0.Final
  • org.modeshape.jcr.value.binary.MongodbBinaryStore - Stores binary values within a MongoDB instance, where the binary values are chunked and stored inside the database. It uses a local cache of binary values (backed by the file system store). Added in 3.1.0.Final
  • org.modeshape.jcr.value.binary.DatabaseBinaryStore - Stores binary values within a JDBC database, where the binary values are stored as BLOBs in the underlying database. Added in 3.1.0.Final
  • org.modeshape.jcr.value.binary.CassandraBinaryStore - Stores binary values within a Cassandra database, where the binary values are stored as BLOBs in the underlying database. Added in 3.4.0.Final
  • org.modeshape.jcr.value.binary.TransientBinaryStore - A customization of the FileSystemBinaryStore that uses the system's temporary directory (as defined by java.io.tmpdir). Useful for testing or transient repositories only.
  • org.modeshape.jcr.value.binary.CompositeBinaryStore - A binary store that aggregates several binary stores of the types file, infinispan, database or custom. Each nested binary store must have a unique name, under which it is aggregated by the composite store. When creating binary values, this name acts as a hint to the binary value factory, which uses it to decide which store a new value goes into. To create binary values for this type of store, you must use the org.modeshape.jcr.api.ValueFactory interface and the public Binary createBinary( InputStream value, String hint ) method. Added in 3.3.0.Final
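To make the idea of a directory hierarchy based upon the SHA-1 hash (as used by the file system store) more concrete, here is a minimal sketch that maps a hexadecimal hash to a file path. The three-level, two-characters-per-level layout and the class name HashedPathSketch are assumptions for illustration; the actual layout used by FileSystemBinaryStore may differ.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical mapping from a SHA-1 hex string to a nested file path. */
final class HashedPathSketch {

    /** Returns root/ab/cd/ef/abcdef... for a 40-character SHA-1 hex string. */
    static Path pathFor(Path root, String sha1Hex) {
        if (sha1Hex == null || sha1Hex.length() != 40) {
            throw new IllegalArgumentException("expected a 40-character SHA-1 hex string");
        }
        // Spreading files over nested directories avoids one huge flat directory.
        return root.resolve(sha1Hex.substring(0, 2))
                   .resolve(sha1Hex.substring(2, 4))
                   .resolve(sha1Hex.substring(4, 6))
                   .resolve(sha1Hex);
    }
}
```

Because the path is derived purely from the hash, any process that knows the SHA-1 can locate the file without consulting an index.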

We would like to have other options, including storage in S3 and Hadoop. But it's also possible for developers using ModeShape to write their own implementations.

Configuring Binary Stores

If no explicit binary store configuration is present, the TransientBinaryStore implementation will be used by default. As explained above, this is not really suitable outside a testing context, as any binaries will be lost between restarts.

To explicitly configure a Binary Store in the repository JSON configuration file, add a binaryStorage section to the main storage section.

For example, a binaryStorage section whose type is file will configure a FileSystemBinaryStore, while one whose type is database will configure a DatabaseBinaryStore.
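As a hedged sketch, a file-based configuration might look like the following fragment. The type, minimumBinarySizeInBytes and minimumStringSize attributes are the ones described on this page; the repository name, the directory attribute and all of the values are assumptions made for illustration and should be checked against the repository schema.

```json
{
    "name" : "sample-repository",
    "storage" : {
        "binaryStorage" : {
            "type" : "file",
            "directory" : "target/binaries",
            "minimumBinarySizeInBytes" : 4096,
            "minimumStringSize" : 4096
        }
    }
}
```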

The valid values for the type attribute are: file, database, transient, cache, composite and custom.

Regardless of the type, all binary stores support the following attributes:

  • minimumBinarySizeInBytes - the minimum size (in bytes) above which binary values will be stored in the binary store. Any smaller binary value will be stored together with the other node information.
  • minimumStringSize - the minimum length of a string above which strings are stored in the binary store (as an optimization).

Besides these, each binary store type supports its own list of custom attributes. For more information about each possible value, see the repository schema.

Files and Folders

A very simple way of adding binary content into a repository is uploading files & folders.
The JCR specification defines the nt:folder, nt:file and nt:resource node types, which means that a natural folder/file hierarchy would use a nt:folder/nt:file/jcr:content->jcr:data hierarchy.
The modeshape-jcr-api artifact provides a utility for creating such a hierarchy: org.modeshape.jcr.api.JcrTools#uploadFile. See the implementation for more information.
