Before we can explain why you might want to use a repository, we first need to state the obvious: applications have different data requirements, and no one data storage solution will fit all applications. So while most of the applications you've worked on probably used relational databases, the truth is that a relational database is often not the best fit. And once you understand the sweet spot of repositories, you can make an informed decision about whether repositories make sense for your application.
Application developers have a lot of options for persisting their application's data. Choices over the years have included a variety of databases, including relational, network, graph, object, document, XML, and hybrids. While relational databases have certainly been the norm for many years, more recently there has been a lot of interest in alternative databases that have different characteristics and features than traditional relational databases. Much of this interest has its origins in several trends, including:
- installing databases on larger numbers of smaller machines (scale out)
- storing very large amounts of data
- storing simple data structures, such as simple JSON documents
- using the (sometimes multiple) natural hierarchies in data
- looking up data by keys rather than using queries
- searching for data based upon relevance rather than criteria
- map-reduce patterns for distributing data operations in parallel across all the data
- evolving schemas and/or data structures
- services that access/store only one kind of data
- caching data in-memory for performance
- giving up consistency guarantees for increased availability
Relational databases are less suited for applications with these needs. For example, large relational databases are usually installed on large, very capable hardware, and are difficult to configure and use with large clusters. Also, in many of the situations described above, relational databases are either overkill, because few of their features are used, or insufficient, because they lack features that are required.
And while early adopters of "NoSQL" databases advocated avoiding relational databases, it is now much more widely understood that each type of database system has advantages, features, and sweet spots that have to be understood and matched to the application and operational requirements. Even relational databases have their sweet spot, including:
- finding data based on ad hoc queries with various (even user-defined) criteria
- aggregate operations (e.g., sum, average, min/max, etc.)
- transactional guarantees
- simplified updates of (normalized) data
- constrained data structure (schema) independent of application
- well-understood operational behaviors
- used by many installed applications
In short, don't pick a database system just because you used it on your last application or because it's new and you want to use it on your new application. Figure out what your application needs, and pick the data storage solution that best satisfies those needs.
No single data storage technology is good at all the data usage patterns described above. So let's examine what kind of data and access patterns repositories are very good at handling.
- Hierarchical data - Some kinds of data are naturally hierarchical, and repositories enable you to use this hierarchy as an asset. A few examples include: ZIP codes partition the US geographically; the Dewey Decimal classification scheme breaks knowledge into 10 main classes, which are then subdivided into 10 divisions, which are then divided into 10 sections; merchandise is often classified using various categories or taxonomies; temporal and historical data is often naturally stored in a hierarchy based on time; physical products and assemblies are composed of parts and other assemblies, forming natural hierarchies; metadata is often structured assemblies of components that describe other components. Other kinds of data are not inherently hierarchical, but have characteristics that enable you to easily treat them as such. For example, cryptographic hash functions like SHA-1 produce values that are well distributed even in the first few bits, so a hierarchy can easily be created by using the first few pairs of characters of each hash's hexadecimal form as successive levels. Repositories can very naturally manage this kind of data, whereas other data storage technologies require you to jump through complex and non-trivial hoops. There are entire books that describe the multiple ways (none of which are ideal in all cases) to store trees and hierarchies in relational databases.
- File storage - JCR repositories are very good at storing files right alongside your other data. In fact, you can store quite a bit of metadata about the files you're storing, and some JCR repositories (like ModeShape) can automatically determine the SHA-1 hash and the MIME type for your files.
- Navigation-based access - Any data stored in a hierarchy can be identified by the path in that hierarchy. Often applications that deal with hierarchical data need to work with subsets of the hierarchy, and thus navigate to a particular location and deal with the subgraph of data below that node. This form of navigation-based access is a natural advantage of repositories.
- Programmatic API - Above all else, JCR is a programming interface for Java (and other JVM languages) to interact with content repositories independently from how those repositories are implemented. Thus, the API defines all the useful components necessary to create, read, update, delete, observe, version, relate, and query the persisted content. And while you can write your application to use this API, under the covers the repository may support various data storage options, clustering and scaling options, federation capabilities, and other features designed to add value to your application's data management needs.
- Flexible data structure - Repositories do not require your data structure to be designed a priori. Instead, you can start storing your data, yet allow additional kinds of information to be stored as the need arises over time. Plus, not all similar data need be treated identically. Consider how an application might use user-defined tags. Tags might apply to many different kinds of data, yet with JCR's mixins you can enable a particular piece of data to become a holder of tags only when users want to apply tags to it. JCR repositories give tremendous flexibility to your data needs, allowing your data to vary while evolving over time as needs change.
- Rigid data structure - Flexible data structures are often very useful, but sometimes your applications need more control. JCR repositories let you choose how constrained your data should be. You can ensure that only specific properties are used, with values that fit specific constraints. If you want, you can start out with these restrictions or you can add them over time. With JCR, you're in control.
- Searching for data - JCR supports multiple query languages, ranging from hierarchically oriented languages like XPath, to highly structured set-oriented languages like JCR-SQL2, to very unstructured languages that enable full-text search.
- Observing data - Receive notifications when content is changed. Use filters to simplify identifying the particular cases you're interested in.
- Transactions - Ensure that changes made by a session are either all persisted, or that none of the changes are persisted. Plus, integrate with the Java Transaction API (JTA) and J2EE to make changes to your repositories within container-managed transactions in your applications.
- Versioning - JCR defines a built-in mechanism for versioning the changes made to single nodes or entire subgraphs. An application simply marks a node as being versionable, and from that point forward the repository automatically tracks the changes by recording the snapshots of the versionable subgraph. The version history of the versionable nodes is easily accessible through the standard API, allowing the application to access this history and roll back the versionable content to a prior state.
- Locking - The JCR API defines a mechanism for temporarily locking a single node or an entire subgraph to prevent others from modifying or removing that content. This of course works across all of the clients using the repository, making it easy for applications to take advantage of this built-in capability.
- References - JCR references allow one node to refer to one (or more) other nodes, allowing easy navigation from the referring node to the referenced node.
- Referential integrity - JCR defines two styles of references: strong references require that referenced nodes not be removed, whereas weak references do not prevent removal of the referenced nodes.
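The hash-based hierarchy mentioned under "Hierarchical data" above can be sketched in a few lines of plain Java. This is only an illustration that uses the standard library rather than the JCR API: the class name `HashPaths`, the helper names `sha1Hex` and `toPath`, and the choice of two levels are assumptions made for the example, not something prescribed by JCR or ModeShape.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not part of JCR or ModeShape): derives a
// hierarchical path from the hexadecimal form of a SHA-1 hash.
final class HashPaths {

    /** Returns the lowercase hexadecimal SHA-1 digest of the given bytes. */
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                // Mask to an unsigned value so each byte prints as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    /**
     * Uses the first `levels` pairs of hex characters as intermediate path
     * segments, followed by the full hash as the final segment.
     */
    static String toPath(String hexHash, int levels) {
        if (levels < 0 || levels * 2 > hexHash.length()) {
            throw new IllegalArgumentException("levels out of range: " + levels);
        }
        StringBuilder path = new StringBuilder();
        for (int i = 0; i < levels; i++) {
            path.append('/').append(hexHash, i * 2, i * 2 + 2);
        }
        return path.append('/').append(hexHash).toString();
    }

    public static void main(String[] args) {
        String hash = sha1Hex("abc".getBytes(StandardCharsets.UTF_8));
        // prints "/a9/99/a9993e364706816aba3e25717850c26c9cd0d89d"
        System.out.println(toPath(hash, 2));
    }
}
```

With two levels of two hexadecimal characters each, content spreads across up to 256 x 256 intermediate locations, which keeps the number of children under any one parent small. An application could use a path like this as the location of a node when storing content keyed by its hash.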
In the previous section we talked about what repositories are good at, and in this section we'll cover some of the kinds of data and access patterns that JCR repositories are less capable of doing well. If your application uses data in these ways, you might consider using something other than a repository.
- Flat, unorganized data - Content repositories work very well with hierarchically structured data. And while we think most data does have a hierarchical nature, not all data does, and content repositories are far less capable of handling very large amounts of flat, non-hierarchically organized data.
- Bulk declarative update - Relational databases support inserting, updating and deleting data via declarative statements. JCR provides no such facility, and all inserts and updates must be done via import or direct manipulation of nodes via stateful sessions. However, removing a node does remove all descendants.
- Massively large file storage - As mentioned above, JCR repositories are very good at storing files, and can even store massively large files (e.g., many GB or more). However, you may have very critical performance requirements or highly-optimized infrastructure for delivering such gigantic files to clients, and the Java streaming mechanism used by the JCR API may not be ideal. In that case, you may want to consider storing such massive files outside of the repository while storing the metadata and location for each file inside the repository. So in short, JCR repositories may work for large files, but be sure to test and evaluate a content repository before committing to it.
- Access by non-JVM languages - The JCR API is a standard Java programming interface, so by definition it can only be used by languages running on the JVM. However, many content repositories (including ModeShape) do offer non-Java APIs, such as REST, WebDAV and/or CMIS.
- Complex merging of versions - Although JCR can version content, the structure of the version history is relatively limited, and the built-in functionality for merging is not terribly sophisticated. You may find it more effective to create your own versioning mechanism on top of JCR. Or, if you have very complex versioning requirements and are dealing with mostly files, then perhaps other file versioning systems (e.g., Git) may be a better fit.
Below are some categories of applications for which content repositories are a good fit.
- Content management systems - Content management systems (CMS) and web content management systems (WCM) allow users to easily create and change the information used on web sites and other information systems. Storing information in a hierarchical manner that mirrors the structure of the web site and its XHTML files allows such systems to easily manage and serve the information.
- Document repositories - Storing and versioning documents and associated metadata is something repositories do very well, and so they're often used within document management systems.
- Artifact repositories - Systems that store artifacts (files that are the output of some process and used by other systems) often use repositories. For example, several Maven repository systems use JCR for storage of the artifacts and metadata, and rely upon not only the direct navigational access but also query and search.
- Governance systems - Repositories are a great fit for applications or systems that govern the lifecycle of artifacts, services, and files. Such applications typically need to store a wide variety of metadata with each governed artifact, and often this metadata changes as the lifecycle process is changed.
- Configuration management - Configuration information usually consists of structured files (e.g., XML, YAML, JSON, etc.). A JCR repository can provide a more formal way to manage and version multiple configurations. Plus, JCR's event system allows easy notification of configuration changes, while versioning can help guarantee the ability to revert to a previous (valid) configuration.
- Knowledge systems - Repositories provide an excellent way to store the varied and changing information managed by knowledge management systems. Such systems also require search, references, referential integrity, and versioning capabilities.