JBoss Community Archive (Read Only)

ModeShape 5

Binary values

A common use for repositories is to manage (among other things) files, so ModeShape is now capable of handling even extremely large binary values that are larger than available memory. This is because ModeShape never loads the whole value onto the heap, but instead streams the value to and from the persistent store. And you can configure where ModeShape stores the binary values independently of where the rest of the content is stored.

How it works

The key to understanding how ModeShape manages the Binary values is to remember how the JCR API exposes them. To set a property to a Binary value, the JCR client creates the javax.jcr.Binary instance from the binary stream:

javax.jcr.Session session = ...
javax.jcr.ValueFactory factory = session.getValueFactory();

// Create the binary value ...
java.io.InputStream stream = ...
javax.jcr.Binary binary = factory.createBinary(stream);

// Use the binary value ...
javax.jcr.Property property = ...
property.setValue(binary);

// Save the changes ...
session.save();

Then, to access the binary content, the JCR client gets the property, gets the binary value(s), and then obtains the binary value's InputStream:

javax.jcr.Property property = ...
javax.jcr.Binary binary = property.getValue().getBinary();
java.io.InputStream stream = binary.getStream();
// Use the stream ...

When ModeShape creates the actual javax.jcr.Binary value, it reads the supplied java.io.InputStream and immediately stores the content in the repository's binary storage area, which then returns a Binary instance that contains a pointer to the persisted binary content. When needed, that Binary instance (or another one obtained at a later time) obtains from the binary storage area the InputStream for the content and simply returns it.

Note that the same Binary value can be read from one property and set on any other properties:

javax.jcr.Session session = ...

// Get the Binary value from one property ...
javax.jcr.Property source = ...
javax.jcr.Binary binary = source.getValue().getBinary();

// And set it as the value for other properties ...
javax.jcr.Property target = ...
target.setValue(binary);

// Save the changes ...
session.save();

This works because the Binary value contains only the pointer to the binary content, so copying or reusing Binary objects is very efficient and lightweight. It also works because of what ModeShape uses for the pointers.

ModeShape stores all binary content by its SHA-1 hash. The SHA-1 cryptographic hash function is not used for security purposes, but is instead used because the SHA-1 can reliably be determined entirely from the content itself, and because, for all practical purposes, two binary contents will have the same SHA-1 only if they are identical. Thus, the SHA-1 hash of some binary content serves as an excellent key for storing and referencing that content. The pointer we mentioned in the previous paragraph is merely the SHA-1 of the binary content. The following diagram represents how this works:

[Diagram: binary-storage.png]

Using the SHA-1 hash as the identifier for the binary content also means that ModeShape never needs to store a given binary content more than once, no matter how many nodes or properties refer to it. It also means that if your JCR client already knows (or can compute) the SHA-1 of a large value, the JCR client can use ModeShape-specific APIs to easily determine if that value has already been stored in the repository. (We'll see an example of this later on.)
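
The idea of keying content by its SHA-1 can be illustrated with a minimal sketch that uses only the JDK. This is not ModeShape's implementation: the class name ContentAddressedStore, its methods, and the one-file-per-key layout are assumptions made for illustration. The content is hashed while it is streamed to disk, so it is never held fully in memory, and storing the same content twice leaves a single stored copy.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.stream.Stream;

/** Hypothetical content-addressed store; NOT ModeShape's BinaryStore implementation. */
final class ContentAddressedStore {
    private final Path directory;

    ContentAddressedStore(Path directory) {
        this.directory = directory;
    }

    /** Convenience factory (an assumption of this sketch) that uses a fresh temp directory. */
    static ContentAddressedStore inTempDirectory() {
        try {
            return new ContentAddressedStore(Files.createTempDirectory("binaries"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Streams the content to disk while hashing it, and returns the hex SHA-1 key. */
    String store(InputStream content) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            Path temp = Files.createTempFile(directory, "incoming", ".tmp");
            try (DigestInputStream in = new DigestInputStream(content, sha1);
                 OutputStream out = Files.newOutputStream(temp)) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read); // the digest is updated as bytes are read
                }
            }
            String key = toHex(sha1.digest());
            Path target = directory.resolve(key);
            // Not thread-safe; a real store would need to guard this check-then-act step.
            if (Files.exists(target)) {
                Files.delete(temp); // identical content is already stored under this key
            } else {
                Files.move(temp, target);
            }
            return key;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 should be available on every JDK", e);
        }
    }

    /** Opens a stream on previously stored content. */
    InputStream open(String key) {
        try {
            return Files.newInputStream(directory.resolve(key));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the number of distinct contents currently stored. */
    long storedCount() {
        try (Stream<Path> files = Files.list(directory)) {
            return files.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Lower-case hexadecimal form of a hash. */
    static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(Character.forDigit((b >> 4) & 0xF, 16));
            hex.append(Character.forDigit(b & 0xF, 16));
        }
        return hex.toString();
    }
}
```

In this sketch, a caller that already knows the hex SHA-1 of some content could check for the corresponding file before uploading anything, which mirrors the lookup-by-hash idea described above.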

Extended Binary interface

The ModeShape public API defines the org.modeshape.jcr.api.Binary interface as a simple extension to the standard javax.jcr.Binary interface. ModeShape's extension adds useful methods to get the SHA-1 hash (as a byte array and as a hexadecimal string) and the MIME type for the content:

@Immutable
public interface Binary extends javax.jcr.Binary {

    /**
     * Get the SHA-1 hash of the contents. This hash can be used to determine whether two
     * Binary instances contain the same content.
     *
     * Repeatedly calling this method should generally be efficient, as most implementations
     * will compute the hash only once.
     *
     * @return the hash of the contents as a byte array, or an empty array if the hash could
     * not be computed.
     * @see #getHexHash()
     */
    byte[] getHash();

    /**
     * Get the hexadecimal form of the SHA-1 hash of the contents. This hash can be used to
     * determine whether two Binary instances contain the same content.
     *
     * Repeatedly calling this method should generally be efficient, as most implementations
     * will compute the hash only once.
     *
     * @return the hexadecimal form of the getHash(), or a null string if the hash could
     * not be computed or is not known
     * @see #getHash()
     */
    String getHexHash();

    /**
     * Get the MIME type for this binary value.
     *
     * @return the MIME type, or null if it cannot be determined (e.g., the Binary is empty)
     * @throws IOException if there is a problem reading the binary content
     * @throws RepositoryException if an error occurs.
     */
    String getMimeType() throws IOException, RepositoryException;

    /**
     * Get the MIME type for this binary value.
     *
     * @param name the name of the binary value, useful in helping to determine the MIME type
     * @return the MIME type, or null if it cannot be determined (e.g., the Binary is empty)
     * @throws IOException if there is a problem reading the binary content
     * @throws RepositoryException if an error occurs.
     */
    String getMimeType( String name ) throws IOException, RepositoryException;
}

All javax.jcr.Binary values returned by ModeShape will implement this public interface, so feel free to cast the values to gain access to the additional methods.

Importing and Exporting

When exporting content from a workspace with large Binary values, be sure to export using JCR's System View format. Only the System View treats properties as child elements. This allows each large value to be streamed (using buffered streams) into the XML element's content as a Base64-encoded string. Importing can also take advantage of streaming.
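
To illustrate why streaming into Base64 keeps memory use bounded, here is a minimal JDK-only sketch. It does not show ModeShape's exporter; the class name Base64Streaming and its methods are hypothetical. The input is copied through a fixed-size buffer into an encoding stream obtained from java.util.Base64, so only one buffer of content is held at a time.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper; illustrates buffered Base64 streaming, not ModeShape's export code. */
final class Base64Streaming {

    /**
     * Copies the input to the output as Base64 text using a small fixed buffer.
     * Closing the encoder flushes the final padding and also closes the output stream.
     */
    static void encode(InputStream in, OutputStream out) {
        try (OutputStream encoder = Base64.getEncoder().wrap(out)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                encoder.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Small in-memory convenience wrapper, used here only to demonstrate the result. */
    static String encodeToString(byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        encode(new ByteArrayInputStream(content), out);
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(encodeToString("hello".getBytes(StandardCharsets.US_ASCII))); // prints aGVsbG8=
    }
}
```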

Exporting content using JCR's Document View results in all properties being treated as XML attributes, and various XML processing libraries treat large attributes poorly (e.g., using values that are in-memory String objects). Another critical disadvantage of the Document View is that it is unable to represent multi-valued properties, since attributes can have only one value.

Implementation design

This section describes the internal design of how ModeShape stores binary values, and is typically useful to either understand the nuances of the various configuration choices or to implement custom binary stores.

None of the interfaces described in this section are part of the public API, and none of them should ever be used directly by JCR client applications.

BinaryValue

In addition to the ModeShape-specific org.modeshape.jcr.api.Binary extension, ModeShape also defines an org.modeshape.jcr.value.BinaryValue interface that adds several other features required to properly persist and manage Binary values. These other features are part of ModeShape's internal design and are therefore not appropriate for inclusion in the public API. Specifically, BinaryValue instances are themselves immutable, they have an immutable BinaryKey that is a comparable representation of the SHA-1 hash, they are comparable with each other (based upon their keys), they can be serialized, and the getSize() method does not throw an exception like the standard method:

@Immutable
public interface BinaryValue extends Comparable<BinaryValue>, Serializable, org.modeshape.jcr.api.Binary {

    /**
     * Get the length of this binary data.
     *
     * Note that this method, unlike the standard {@link javax.jcr.Binary#getSize()} method,
     * does not throw an exception.
     *
     * @return the number of bytes in this binary data
     */
    @Override
    public long getSize();

    /**
     * Get the key for the binary value.
     *
     * @return the key; never null
     */
    public BinaryKey getKey();
}
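
As a rough illustration of what "an immutable, comparable representation of the SHA-1 hash" can look like, the following JDK-only sketch defines a small key type. It is not ModeShape's BinaryKey class; the name Sha1Key and all of its methods are hypothetical. Equality, hashing, and ordering are all derived from the hexadecimal form of the hash, so two keys are equal exactly when they were computed from the same hash.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical stand-in for a hash-based key; not ModeShape's BinaryKey class. */
final class Sha1Key implements Comparable<Sha1Key> {
    private final String hex; // immutable hexadecimal form of the SHA-1 hash

    private Sha1Key(String hex) {
        this.hex = hex;
    }

    /** Computes the key for content that is small enough to hash in one call. */
    static Sha1Key forContent(byte[] content) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-1").digest(content);
            StringBuilder sb = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                sb.append(Character.forDigit((b >> 4) & 0xF, 16));
                sb.append(Character.forDigit(b & 0xF, 16));
            }
            return new Sha1Key(sb.toString());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 should be available on every JDK", e);
        }
    }

    String hex() {
        return hex;
    }

    /** Orders keys lexicographically by their hexadecimal form. */
    @Override
    public int compareTo(Sha1Key other) {
        return hex.compareTo(other.hex);
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof Sha1Key && hex.equals(((Sha1Key) obj).hex);
    }

    @Override
    public int hashCode() {
        return hex.hashCode();
    }

    @Override
    public String toString() {
        return hex;
    }
}
```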

BinaryStore

The ModeShape-specific BinaryStore interface is thus defined to use the internal BinaryValue interface:

@ThreadSafe
public interface BinaryStore {
    /**
     * Initialize the store and get ready for use.
     */
    public void start();

    /**
     * Shuts down the store.
     */
    public void shutdown();

    /**
     * Get the minimum number of bytes that a binary value must contain before it can
     * be stored in the binary store.
     * @return the minimum number of bytes for a stored binary value; never negative
     */
    long getMinimumBinarySizeInBytes();

    /**
     * Set the minimum number of bytes that a binary value must contain before it can
     * be stored in the binary store.
     * @param minSizeInBytes the minimum number of bytes for a stored binary value; never negative
     */
    void setMinimumBinarySizeInBytes( long minSizeInBytes );

    /**
     * Set the text extractor that can be used for extracting text from binary content.
     * @param textExtractor the text extractor
     */
    void setTextExtractor( TextExtractor textExtractor );

    /**
     * Set the MIME type detector that can be used for determining the MIME type for binary content.
     * @param mimeTypeDetector the detector; never null
     */
    void setMimeTypeDetector( MimeTypeDetector mimeTypeDetector );

    /**
     * Store the binary value and return the JCR representation. Note that if the binary
     * content in the supplied stream is already persisted in the store, the store may
     * simply return the binary value referencing the existing content.
     *
     * @param stream the stream containing the binary content to be stored; may not be null
     * @return the binary value representing the stored binary value; never null
     * @throws BinaryStoreException if there is a problem storing the content
     */
    BinaryValue storeValue( InputStream stream ) throws BinaryStoreException;

    /**
     * Get an InputStream to the binary content with the supplied key.
     *
     * @param key the key to the binary content; never null
     * @return the input stream through which the content can be read
     * @throws BinaryStoreException if there is a problem reading the content from the store
     */
    InputStream getInputStream( BinaryKey key ) throws BinaryStoreException;

    /**
     * Mark the supplied binary keys as unused, but keep them in quarantine until needed again
     * (at which point they're removed from quarantine) or until
     * removeValuesUnusedLongerThan(long, TimeUnit) is called. This method ignores any keys for
     * values not stored within this store.
     *
     * Note that the implementation must *never* block.
     *
     * @param keys the keys for the binary values that are no longer needed
     * @throws BinaryStoreException if there is a problem marking any of the supplied
     * binary values as unused
     */
    void markAsUnused( Iterable<BinaryKey> keys ) throws BinaryStoreException;

    /**
     * Remove binary values that have been unused for at least the specified amount of time.
     *
     * Note that the implementation must *never* block.
     *
     * @param minimumAge the minimum time that a binary value has been unused before it can be
     *        removed; must be non-negative
     * @param unit the time unit for the minimum age; may not be null
     * @throws BinaryStoreException if there is a problem removing the unused values
     */
    void removeValuesUnusedLongerThan( long minimumAge,
                                       TimeUnit unit ) throws BinaryStoreException;

    /**
     * Get the text that can be extracted from this binary content.
     *
     * @param binary the binary content; may not be null
     * @return the extracted text, or null if none could be extracted
     * @throws BinaryStoreException if the binary content could not be accessed
     */
    String getText( BinaryValue binary ) throws BinaryStoreException;

    /**
     * Get the MIME type for this binary value.
     *
     * @param binary the binary content; may not be null
     * @param name the name of the content, useful for determining the MIME type;
     * may be null if not known
     * @return the MIME type, or null if it cannot be determined (e.g., the Binary is empty)
     * @throws IOException if there is a problem reading the binary content
     * @throws RepositoryException if an error occurs.
     */
    String getMimeType( BinaryValue binary,
                        String name ) throws IOException, RepositoryException;

    /**
     * Obtain an iterable implementation containing all of the store's binary keys. The resulting iterator may be lazy, in the
     * sense that it may determine additional {@link BinaryKey}s only as the iterator is used.
     *
     * @return the iterable set of binary keys; never null
     * @throws BinaryStoreException if anything unexpected happens.
     */
    Iterable<BinaryKey> getAllBinaryKeys() throws BinaryStoreException;
}

Each BinaryStore implementation must provide a no-arg constructor, and its member fields can be configured via the repository configuration. Note that the BinaryStore implementation must also implement several setter methods, which the repository calls when the BinaryStore is initialized and may call again at any later time (when the repository configuration changes).

Minimum binary size

When the BinaryStore is initialized, the repository will use the setMinimumBinarySizeInBytes(...) method to specify the minimum size of the BinaryValue instances that must be persisted within the BinaryStore. Any binary content smaller than this can be represented with InMemoryBinaryValue instances (meaning it will be persisted with the property where it is used) or persisted in the BinaryStore. Note that if the repository's configuration changes, the repository may set a different minimum size threshold.
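
The threshold rule can be sketched with a small JDK-only class. Everything here is hypothetical (the class SizeThresholdPolicy, the Placement enum, and the choice to always keep smaller values inline); the text above only requires that values at or above the threshold go to the store, and leaves the handling of smaller values open.

```java
/** Hypothetical sketch of the minimum-size rule; not part of ModeShape. */
final class SizeThresholdPolicy {

    /** Where a value ends up in this sketch. */
    enum Placement { INLINE_WITH_PROPERTY, BINARY_STORE }

    // volatile because the threshold may be changed after initialization
    private volatile long minimumBinarySizeInBytes;

    SizeThresholdPolicy(long minimumBinarySizeInBytes) {
        setMinimumBinarySizeInBytes(minimumBinarySizeInBytes);
    }

    /** Mirrors the shape of the setter described above; rejects negative thresholds. */
    void setMinimumBinarySizeInBytes(long minSizeInBytes) {
        if (minSizeInBytes < 0) {
            throw new IllegalArgumentException("The minimum size may not be negative");
        }
        this.minimumBinarySizeInBytes = minSizeInBytes;
    }

    long getMinimumBinarySizeInBytes() {
        return minimumBinarySizeInBytes;
    }

    /**
     * Values at or above the threshold must go to the binary store. This sketch keeps
     * smaller values inline, which is only one of the two options the text allows.
     */
    Placement placementFor(long sizeInBytes) {
        return sizeInBytes >= minimumBinarySizeInBytes
                ? Placement.BINARY_STORE
                : Placement.INLINE_WITH_PROPERTY;
    }

    public static void main(String[] args) {
        SizeThresholdPolicy policy = new SizeThresholdPolicy(4096); // arbitrary example threshold
        System.out.println(policy.placementFor(100));   // prints INLINE_WITH_PROPERTY
        System.out.println(policy.placementFor(4096));  // prints BINARY_STORE
    }
}
```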

Minimum string size

The repository can also use the BinaryStore to store large string values. Any strings larger than the threshold set in the repository configuration will be stored in the BinaryStore and referenced in the node. Note that there is nothing to configure in the BinaryStore itself.

MIME type detection

When the BinaryStore is initialized, the repository will use the setMimeTypeDetector(...) method to give the BinaryStore a MimeTypeDetector instance it can use to determine the MIME type for any binary content. The BinaryStore is free to determine the MIME type at any time, including when the binary content is stored or only when the MIME type is needed (via the getMimeType(...) method). The BinaryStore is also free to persist this information, since binary content for a given SHA-1 never changes. Note that if the repository's configuration changes, the repository may set a different MIME type detector.

Text extraction

When the BinaryStore is initialized, the repository will use the setTextExtractor(...) method to give the BinaryStore a TextExtractor instance it can use to extract the content's full-text search terms. The BinaryStore is free to extract these terms at any time, including when the binary content is stored or only when the terms are requested (via the getText(...) method). The BinaryStore is also free to persist this information, since binary content for a given SHA-1 never changes. Note that if the repository's configuration changes, the repository may set a different text extractor.
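
Because the content behind a given SHA-1 never changes, derived data such as extracted text can be computed lazily and then reused. The following JDK-only sketch shows one way to do that. The class ExtractedTextCache is hypothetical, the extractor is modeled as a plain java.util.function.Function rather than ModeShape's TextExtractor, content is passed as a byte array purely for brevity, and clearing the cache when the extractor is replaced is a design choice of this sketch rather than documented ModeShape behavior.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical cache of extracted text keyed by content hash; not part of ModeShape. */
final class ExtractedTextCache {
    private final Map<String, String> textByKey = new ConcurrentHashMap<>();
    private final AtomicInteger extractions = new AtomicInteger();
    private volatile Function<byte[], String> extractor;

    ExtractedTextCache(Function<byte[], String> extractor) {
        this.extractor = extractor;
    }

    /** Replaces the extractor; cached text is dropped because it came from the old one. */
    void setTextExtractor(Function<byte[], String> extractor) {
        this.extractor = extractor;
        textByKey.clear();
    }

    /** Extracts text on first request for a key and reuses the result afterwards. */
    String getText(String sha1Key, byte[] content) {
        return textByKey.computeIfAbsent(sha1Key, key -> {
            extractions.incrementAndGet();
            return extractor.apply(content);
        });
    }

    /** Number of times the extractor actually ran. */
    int extractionCount() {
        return extractions.get();
    }
}
```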

To extract content from a binary value, ModeShape relies on third-party libraries for the extraction functionality. ModeShape comes with one built-in extractor, which uses Apache Tika. See Built-in text extractors for how to configure this extractor.

Garbage collection

There are a number of ways in which the BinaryStore may contain binary content (keyed by the SHA-1) that is no longer used or referenced. The first is when a JCR client or the repository removes the last Property containing the Binary. A second case is when a JCR client uses a Session to create a javax.jcr.Binary value and clears the transient state (before the Session's transient state is saved). Neither of these poses a problem, since the minimum requirement is that the BinaryStore contain at least the content that is referenced in the repository content. However, all unused binary content in the BinaryStore takes up storage space, so ModeShape defines a way for the repository and the BinaryStore to recover that unused storage.

The repository checks with each successful transaction whether any binaries involved in that transaction are or are not referenced by repository content. If they are not used, the BinaryStore quarantines the binaries; if any quarantined binaries are used again, the BinaryStore can remove them from quarantine. The repository then periodically calls, from a garbage-collection thread, the BinaryStore's removeValuesUnusedLongerThan(...) method to purge all binaries that have been quarantined for at least the specified period of time.

The quarantine approach means that when BinaryValue instances are removed, there is a period of time during which they can be reused without the expensive removal and re-adding of the binary content.
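
The quarantine bookkeeping described above can be sketched with JDK collections alone. This is not how any particular ModeShape store is implemented: the class QuarantineTracker, the use of plain strings as keys, and the injectable millisecond clock (added so the behavior can be exercised without waiting) are all assumptions of the sketch. Only the method names markAsUnused and removeValuesUnusedLongerThan are borrowed from the interface shown earlier.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical quarantine bookkeeping; not ModeShape's implementation. */
final class QuarantineTracker {
    private final Set<String> storedKeys = ConcurrentHashMap.newKeySet();
    private final Map<String, Long> unusedSinceMillis = new ConcurrentHashMap<>();
    private final LongSupplier clockMillis;

    QuarantineTracker(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
    }

    /** Records that a key is stored and in use; this also releases it from quarantine. */
    void markAsUsed(String key) {
        storedKeys.add(key);
        unusedSinceMillis.remove(key);
    }

    /** Quarantines the given keys, ignoring keys that are not stored here. */
    void markAsUnused(Iterable<String> keys) {
        long now = clockMillis.getAsLong();
        for (String key : keys) {
            if (storedKeys.contains(key)) {
                unusedSinceMillis.putIfAbsent(key, now); // keep the earliest quarantine time
            }
        }
    }

    /** Purges keys quarantined for at least the given age; returns how many were removed. */
    int removeValuesUnusedLongerThan(long minimumAge, TimeUnit unit) {
        long cutoff = clockMillis.getAsLong() - unit.toMillis(minimumAge);
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = unusedSinceMillis.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() <= cutoff) {
                String key = entry.getKey();
                it.remove();
                storedKeys.remove(key); // a real store would delete the content here
                removed++;
            }
        }
        return removed;
    }

    boolean isStored(String key) {
        return storedKeys.contains(key);
    }

    boolean isQuarantined(String key) {
        return unusedSinceMillis.containsKey(key);
    }
}
```

A key that is marked as used again before the purge runs simply leaves quarantine, so its content never has to be removed and stored a second time.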

BinaryStore implementations

See this page for the list of available binary stores.

We would like to have other options, including storage in S3 and Hadoop. But it's also possible for developers using ModeShape to write their own implementations.

Configuring Binary Stores

If no explicit binary store configuration is present, the TransientBinaryStore implementation will be used by default. This store is not really suitable outside a testing context, as any binaries will be lost between restarts.

For information about each particular binary store type, see this page.

Files and Folders

A very simple way of adding binary content to a repository is to upload files and folders.
The JCR specification defines the following node types:

[nt:folder] > nt:hierarchyNode
  + * (nt:hierarchyNode) version

[nt:file] > nt:hierarchyNode
  + jcr:content (nt:base) primary mandatory

[nt:resource] > mix:mimeType, mix:lastModified
  - jcr:data (binary) primary mandatory

which means that a natural folder/file hierarchy would use an nt:folder/nt:file/jcr:content->jcr:data hierarchy.
ModeShape provides, via the modeshape-jcr-api artifact, a utility for creating such a hierarchy: org.modeshape.jcr.api.JcrTools#uploadFile. See the implementation for more information.

Mime Type detection

When working with binary values on nodes which also have the mix:mimeType mixin (like the default files and folders), ModeShape automatically detects, via Apache Tika, the mime type of the binary value and sets it on the jcr:mimeType property whenever new nodes are created and the JCR session is saved. By default, the mime type detection process involves reading at least the header of each binary value and determining the mime type from a "magic byte pattern" in that header.

This is a very powerful feature which works well, but there may be cases (for example working with lots of binary values) when reading the stream of each binary value is counter-productive and memory-intensive. Therefore, starting with version 4.4.0.Final, ModeShape offers the possibility of configuring the mime type detection mechanism:

"storage" : {
        .........
        "binaryStorage" : {
            "mimeTypeDetection" : "content" | "name" | "none"
        }
    }

or in the AS server kit:

 <file-binary-storage path="modeshape/sample3/binaries"  mime-type-detection="none | name | content"/>

The possible values for the mime type detection are:

content (default)

determine the mime type based on the actual content of each binary. Requires at least reading the header of each binary into memory

name

use only the name (which ideally contains an extension) of the parent of the node which owns the binary value. This means that when using the default nt:file type, this will be the name of the file. This type of detection does not read anything from the binary stream

none

disables mime type detection altogether
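
The difference between the three modes can be made concrete with a small JDK-only sketch. It does not use Apache Tika and is not ModeShape's detector: the class MimeTypeSketch, its tiny extension table, and its two magic-byte patterns (PNG and PDF) are assumptions chosen for illustration. The point is that the name mode never touches the content, the content mode needs at least the leading bytes, and the none mode returns nothing.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration of the three detection modes; not ModeShape's detector. */
final class MimeTypeSketch {

    enum Mode { CONTENT, NAME, NONE }

    private static final Map<String, String> BY_EXTENSION = new HashMap<>();
    static {
        BY_EXTENSION.put("png", "image/png");
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("txt", "text/plain");
    }

    private static final byte[] PNG_MAGIC = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns the detected type, or null if it cannot be determined in the given mode. */
    static String detect(Mode mode, String name, byte[] header) {
        switch (mode) {
            case CONTENT:
                return fromContent(header);
            case NAME:
                return fromName(name);
            default:
                return null; // NONE: detection is disabled
        }
    }

    /** Uses only the file extension; nothing is read from the binary content. */
    static String fromName(String name) {
        int dot = name == null ? -1 : name.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        return BY_EXTENSION.get(name.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    /** Compares the leading bytes of the content against known magic-byte patterns. */
    static String fromContent(byte[] header) {
        if (startsWith(header, PNG_MAGIC)) {
            return "image/png";
        }
        if (startsWith(header, PDF_MAGIC)) {
            return "application/pdf";
        }
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Note that the two modes can disagree: in this sketch, a file named report.pdf whose bytes start with the PNG signature is reported as application/pdf by name and as image/png by content.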

JBoss.org Content Archive (Read Only), exported from JBoss Community Documentation Editor at 2020-03-11 12:12:27 UTC, last content change 2017-03-07 13:43:04 UTC.