JBoss Community Archive (Read Only)


Design - Content Synchronization


The initial implementation of RHQ's content subsystem was pretty simplistic. It largely consisted of the following ideas:

  • The user would define one or more content sources which would connect to some external source and import packages into the system. They functioned as an adapter between the protocol/package representation of an external system and the RHQ domain model. Our two biggest use cases at the time were the JBoss CSP feed and a yum repository.

  • The user would define one or more channels which functioned as a collection of packages. Channels were only created by users; there was no way for RHQ to do so.

  • The user would associate one or more content sources with one or more channels. The rationale is that when the source provided packages, it would provide them directly into those channels.

The important takeaway is that the subsystem is largely content source-centric. Specifically, the syncing schedules and logic are tied to the content source. Channels are a fairly thin view of the packages introduced into the system by a content source (or manually uploaded by the user, but that's irrelevant to this discussion).

However, recent enhancements to the content subsystem, such as distribution and errata syncing, along with feedback from the UX team, have driven us toward a channel-centric (now called repo) model. This model changes the responsibility of the content sources (now called content providers). Both content providers and repos have separate sync jobs that perform different functions.


Below is a high-level explanation of the responsibilities of content provider sync vs. repo sync.

Content Provider Sync

  • Candidate Repos - Ask the content provider for a list of repos, if any, it would like to introduce into the system.
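The candidate-repo handshake can be pictured as a simple provider callback. This is a minimal sketch; RepoSource, candidateRepoNames(), and newCandidates() are illustrative names I am assuming here, not the actual RHQ plugin API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the candidate-repo handshake. RepoSource and
// candidateRepoNames() are illustrative names, not the real RHQ plugin API.
interface RepoSource {
    // Ask the provider which repos, if any, it wants to introduce.
    List<String> candidateRepoNames();
}

public class CandidateRepoSync {
    // Keep only the candidates RHQ does not already know about.
    static List<String> newCandidates(RepoSource source, List<String> known) {
        List<String> result = new ArrayList<>();
        for (String name : source.candidateRepoNames()) {
            if (!known.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        RepoSource provider = () -> List.of("rhel-5-server", "jboss-eap-5");
        System.out.println(newCandidates(provider, List.of("rhel-5-server")));
        // prints [jboss-eap-5]
    }
}
```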

Repo Sync

  • Package Metadata - Request a report containing package additions/updates/removals from each content provider associated with the repo. At this point, only the metadata about the package is downloaded; the bits are downloaded through a separate call to the plugin. This report is scoped to just the repo being synced, which represents a large change from how the initial implementation was designed.

  • Package Bits - For all packages that do not have their bits downloaded to the RHQ server, request the bits from the content provider which introduced the package (assuming it's not a user added package). This request will only occur for content providers assigned to the repo that are not configured for lazy load.

  • Distribution Tree Metadata - Very similar to syncing package metadata, the distribution tree metadata for this repo is requested from each associated content provider.

  • Distribution Tree Bits - Also very similar to the package model, the bits for each distribution tree found for the repo are downloaded.

  • Errata - This is currently being developed in Sprint 3.
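The ordering of the repo sync tasks above can be sketched as follows. The task names and the syncOrder() method are stand-ins of my own, not the actual RHQ API; the lazy-load check reflects the Package Bits bullet above:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative ordering of the repo sync tasks; names are stand-ins,
// not the actual RHQ API.
public class RepoSyncTasks {
    static List<String> syncOrder(boolean lazyLoadProvider) {
        List<String> tasks = new ArrayList<>();
        tasks.add("package-metadata");        // metadata first, bits later
        if (!lazyLoadProvider) {
            tasks.add("package-bits");        // skipped for lazy-load providers
        }
        tasks.add("distribution-metadata");
        tasks.add("distribution-bits");
        tasks.add("errata");                  // still under development (Sprint 3)
        return tasks;
    }

    public static void main(String[] args) {
        System.out.println(syncOrder(true));
        // prints [package-metadata, distribution-metadata, distribution-bits, errata]
    }
}
```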



Quartz is used to schedule the two synchronization tasks (content provider and repo). The jobs call EJB methods to begin the synchronization, which forward to the ContentProviderManager class; that class determines what tasks are done as part of the sync and makes the appropriate calls to the plugins. The results of those calls are then passed to EJB methods that process the incoming data and make the necessary changes to the RHQ database.


ContentProviderSyncJob - Quartz job implementation that handles triggering a content provider sync.
RepoSyncJob - Quartz job implementation that handles triggering a specific repo sync.
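The hand-off from the jobs to the EJB layer can be sketched without the Quartz dependency. The real classes implement org.quartz.Job and reach the EJBs via JNDI; SyncManager and the execute() signatures below are illustrative stand-ins of my own:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for the EJB layer the jobs call into.
interface SyncManager {
    void synchronizeContentProvider(String providerName);
    void synchronizeRepo(String repoName);
}

class ContentProviderSyncJob {
    // Quartz would invoke this from execute(JobExecutionContext), with the
    // provider name carried in the job's data map.
    static void execute(SyncManager manager, String providerName) {
        manager.synchronizeContentProvider(providerName);
    }
}

class RepoSyncJob {
    static void execute(SyncManager manager, String repoName) {
        manager.synchronizeRepo(repoName);
    }
}

public class SyncJobsDemo {
    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        // Recording implementation, to show the delegation order.
        SyncManager recorder = new SyncManager() {
            public void synchronizeContentProvider(String p) { calls.add("provider:" + p); }
            public void synchronizeRepo(String r) { calls.add("repo:" + r); }
        };
        ContentProviderSyncJob.execute(recorder, "csp-feed");
        RepoSyncJob.execute(recorder, "rhel-5-server");
        System.out.println(calls);
        // prints [provider:csp-feed, repo:rhel-5-server]
    }
}
```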

ContentProviderManager - Ultimately handles talking back and forth with the server-side plugins. This class contains high-level methods for the sorts of tasks we want to do with content plugins, such as requesting that bits be loaded or that a repo be synced. Since this class already contains a lot of code for general tasks, the logic for each type of sync was split out into its own class:

  • DistributionSourceSynchronizer - Handles the logic for synchronizing distribution trees and their bits.

  • PackageSourceSynchronizer - Handles the logic for synchronizing package metadata and bits.

  • RepoSourceSynchronizer - Handles the logic for synchronizing candidate repos into the system.

Various EJBs - A number of EJBs come into play, including ContentSourceManagerBean, ContentManagerBean, DistributionManagerBean, and RepoManagerBean.
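How ContentProviderManager might delegate to the three synchronizers can be sketched as below. The synchronizer class names match the document, but the method names and return values are simplified guesses, not the real signatures:

```java
import java.util.ArrayList;
import java.util.List;

// Synchronizer class names match the document; method shapes are guesses.
class RepoSourceSynchronizer {
    String synchronizeCandidateRepos() { return "candidate-repos"; }
}
class PackageSourceSynchronizer {
    String synchronizePackages() { return "packages"; }
}
class DistributionSourceSynchronizer {
    String synchronizeDistributions() { return "distributions"; }
}

public class ContentProviderManagerSketch {
    // Content provider sync: only candidate repos are requested.
    static List<String> synchronizeContentProvider() {
        return List.of(new RepoSourceSynchronizer().synchronizeCandidateRepos());
    }

    // Repo sync: packages and distribution trees are synchronized.
    static List<String> synchronizeRepo() {
        List<String> done = new ArrayList<>();
        done.add(new PackageSourceSynchronizer().synchronizePackages());
        done.add(new DistributionSourceSynchronizer().synchronizeDistributions());
        return done;
    }

    public static void main(String[] args) {
        System.out.println(synchronizeContentProvider());
        System.out.println(synchronizeRepo());
    }
}
```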

Package Sync


The following describes the flow and logic associated with syncing the packages for a particular repo. The steps below are executed for each content provider associated with the repo. Again, the majority of this flow can be found in the PackageSourceSynchronizer class.

  • If the content provider does not support package syncing (i.e. does not implement PackageSource), nothing happens.

  • The existing list of packages in the repo (for that content provider) is loaded. This list is sent to the plugin as part of the package sync request; the plugin uses it to determine which packages it should mark as new, deleted, or updated.

  • The plugin (for that content provider) is called to synchronize packages. It is sent the following data:

    • Repo name

    • Existing package metadata (translated to a plugin API specific DTO)

    • Report object to populate with the results of the sync

  • The plugin does whatever is necessary to populate the report, indicating packages that were added, deleted, or updated. If there are no changes from the existing packages sent in the sync call, the report is empty.

  • The ContentSourceManager EJB is called to process the report (see below).
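The per-provider flow above can be sketched as follows. PackageSource is the document's terminology, but the signature here is a simplified guess of mine: the real code passes plugin API DTOs and a report object rather than plain strings:

```java
import java.util.List;

// Simplified guess at the PackageSource contract; real RHQ passes DTOs
// and a report object, not plain strings.
interface PackageSource {
    // Diff against the existing packages and report adds/updates/removals.
    List<String> synchronizePackages(String repoName, List<String> existingPackages);
}

public class PackageSyncFlow {
    static List<String> syncProvider(Object provider, String repo, List<String> existing) {
        // Step 1: providers that do not implement PackageSource are skipped.
        if (!(provider instanceof PackageSource)) {
            return List.of();
        }
        // Steps 2-4: send the existing package list; the plugin fills a report.
        return ((PackageSource) provider).synchronizePackages(repo, existing);
    }

    public static void main(String[] args) {
        PackageSource yum = (repo, existing) ->
                existing.contains("kernel-2.6") ? List.of() : List.of("add:kernel-2.6");
        System.out.println(syncProvider(yum, "rhel-5", List.of()));
        // prints [add:kernel-2.6]
        System.out.println(syncProvider(new Object(), "rhel-5", List.of()));
        // prints [] -- provider is not a PackageSource
    }
}
```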

Once the provider has populated the package sync report (PackageSyncReport), its task is done. The report is then processed by the RHQ server before the next content provider is asked to sync packages for the given repo. I suspect this has the potential to cause issues.
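The shape of the report can be pictured as below. This is a simplified stand-in for PackageSyncReport; the real class carries rich package DTOs rather than plain strings, and hasChanges() is my own illustrative helper:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for PackageSyncReport; the real class carries
// package DTOs, not strings.
public class PackageSyncReportSketch {
    final List<String> added = new ArrayList<>();
    final List<String> deleted = new ArrayList<>();
    final List<String> updated = new ArrayList<>();

    // An empty report means the provider saw no changes against the
    // existing package list it was given.
    boolean hasChanges() {
        return !(added.isEmpty() && deleted.isEmpty() && updated.isEmpty());
    }

    public static void main(String[] args) {
        PackageSyncReportSketch report = new PackageSyncReportSketch();
        System.out.println(report.hasChanges());  // false
        report.added.add("kernel-2.6");
        System.out.println(report.hasChanges());  // true
    }
}
```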

The following describes the flow and logic associated with processing a package sync report (the code can be found in ContentSourceManagerBean.mergeDistributionSyncRepo). It can be thought of as having three primary steps: handling the packages marked for removal, addition, and update, in that order.

  • Package Removal - Handles any packages that are indicated as being removed. In other words, they were returned from a previous package sync against this repo and content provider but were not in the sync being processed. The following takes place for each package to be removed:

    • Delete the mapping between the content provider and the package.

    • If no other content provider associated with the repo provides this package - Delete the mapping between the repo and the package (although I listed it as a logical step here, in the code it is only mentioned in a comment; the cascade setting on PackageVersion takes care of this mapping).

    • If the above two conditions were true, the package is now orphaned, belonging to no repos and having no provider backing it:

      • If an orphan has no associated resources (isn't installed on any known resources) - Delete it from the RHQ database.

      • If it is still associated to a resource (InstalledPackage) - Do not delete the package from the database. See note 1 below.

  • Package Addition - Handles any packages that were flagged in the report as new packages. While this list is determined by the content plugin, the idea is that packages in this list were not present in the list of existing packages passed to the plugin in the sync call.

    • The package additions are done in chunks, where the chunk size is hardcoded in the bean.

    • The resource type that owns the package type of the package being added is loaded from the database. This is a result of the way the agent plugins introduce metadata into the system; package types are considered part of a particular resource type, allowing multiple package types with the same name (for different resource types).

      • If the resource type cannot be found, the addition of that particular package is aborted. This likely represents a bug in the content plugin; it is attempting to define a package with invalid metadata.

    • The type of the package being added is loaded from the database.

      • Again, if the package type cannot be loaded, the package addition is aborted. The type should be in the database from the agent plugin metadata loading.

    • The package entity is created if it does not already exist. This entity has one or more package versions associated with it, which explains why it may already exist in the system (i.e. the new package being added is a new version of an existing package).

    • The architecture of the package is loaded from the database, being created if it does not already exist.

    • Create the actual package version entity that represents the new package being added.

      • If it already exists, we assume another content provider has created it.

        • If the file metadata (size, name, MD5, SHA256) is different from the existing package version, log a message warning about this and update the package version entity with the new package's metadata.

    • Associate resource versions with the package version (i.e. "This package only applies to emacs 2.1 and 2.2").

    • Create mapping between package version and content provider, indicating where the package version came from.

    • Create a mapping between the repo being synced and the newly added package version. This represents a change from the handling prior to repo-specific syncs: previously, all repos associated with the provider being synced would be updated with the new package.

    • Cleanup: flush if necessary and increment the number of packages added.

  • Package Update - Handles packages the content provider has introduced into the system previously but whose metadata has changed since the last sync. This actually represents a pretty small probability of occurring in our current use cases for package syncing.

    • If the package does not already exist in the database, add it. Something odd occurred between requesting the package sync and processing the result, but since the provider felt the need to provide updated package details we want that package in the database.

    • Update the package details on the existing package version entity and merge it into the database.

    • Cleanup: flush if necessary and increment the number of packages updated.
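The three-phase merge described above can be sketched as follows. The real logic lives in the EJB layer and works on JPA entities with cascades; plain string sets stand in for package versions here, and merge() is an illustrative name of my own:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of the three-phase merge (removal, addition, update, in that
// order); string sets stand in for the JPA entities the real code uses.
public class PackageMergeSketch {
    static Set<String> merge(Set<String> repoPackages,
                             Set<String> removed,
                             Set<String> added,
                             Set<String> updated) {
        Set<String> result = new LinkedHashSet<>(repoPackages);
        // 1. Removal: drop provider/repo mappings; truly orphaned packages
        //    with no installed resources would be deleted outright.
        result.removeAll(removed);
        // 2. Addition: create the package, version, architecture and the
        //    mapping to the repo being synced.
        result.addAll(added);
        // 3. Update: merge changed metadata, adding the package if it has
        //    somehow gone missing since the sync request.
        result.addAll(updated);
        return result;
    }

    public static void main(String[] args) {
        Set<String> before = new LinkedHashSet<>(Set.of("emacs-2.1"));
        System.out.println(merge(before, Set.of("emacs-2.1"),
                Set.of("emacs-2.2"), Set.of()));
        // prints [emacs-2.2]
    }
}
```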


1. When a resource's package discovery is run, any packages that it indicates it found that are not in RHQ's content inventory will be created. By our terminology, they are orphaned packages. Whether or not that is desired is a separate discussion. For now, in order to maintain that model we cannot delete the package from the database if it is still referenced by a resource.

JBoss.org Content Archive (Read Only), exported from JBoss Community Documentation Editor at 2020-03-11 12:46:20 UTC, last content change 2009-11-30 13:22:44 UTC.