Tracker Bug: BZ 549852
|There is a community plugin I want to try out, do some development with, and be able to remove with confidence that when I delete it, the plugin will be completely removed from the system|
|I upgrade a plugin, and the upgrade introduces some instability into the system. I can effectively roll back to the previous version by deleting the newly installed plugin and then reinstalling the prior version of the plugin|
|I want to slim the footprint of my agent|
|Support developers that want to just get rid of a plugin without the need to wipe the whole database|
RHQ already supports disabling plugins. The work for this was done under BZ 535894. Disabling a plugin immediately marks the plugin as disabled in the database, but nothing changes in the agent's plugin container until the plugin container is restarted. Once the plugin container is restarted, discovery of types in the disabled plugin stops. Disabling a plugin is not an adequate solution because the metadata from the plugin remains in the system. This would prevent rolling back to a previous version of the plugin and could still result in upgrade problems. Disabling also falls short because resources already in inventory whose types have been disabled remain in inventory. When a plugin is deleted, all types and instances of those types must be removed from the system.
We cannot delete some plugin P1 that has other plugins, P2 and P3, that depend on P1. We have to also delete P2 and P3. Let's consider a couple of examples. First, we attempt to delete the apache plugin. This should proceed without error because there are no other plugins that depend on the apache plugin. Now let's say we have the platform, jmx, hibernate, and jboss-as plugins in the system, and we try to delete the jmx plugin. This should not be allowed without also deleting the hibernate and jboss-as plugins since they both depend on the jmx plugin.
Consider another example. Suppose we want to delete the jboss-as plugin. The rhq-server plugin depends on it. This means we would have to delete both.
As a last example, consider the hibernate plugin which has an optional dependency on the jboss-as plugin. When both plugins are deployed, we create types from the hibernate plugin whose parent types come out of the jboss-as plugin. If we want to delete the jboss-as plugin, we will need to either delete the hibernate types or modify their hierarchy.
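The required-dependency rule in the examples above can be sketched as a transitive walk over the plugin dependency graph. This is only an illustration: the class `PluginDeletionPlanner`, the method `pluginsToDelete`, and the map-based representation of dependencies are hypothetical and are not part of RHQ. The sketch covers required dependencies only; the optional hibernate to jboss-as case still needs the type-level handling described above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not an RHQ class. */
class PluginDeletionPlanner {

    /**
     * Returns the target plugin plus every plugin that directly or
     * transitively requires it.
     *
     * @param target    name of the plugin the user asked to delete
     * @param dependsOn assumed shape: plugin name -> names of plugins it requires
     */
    static Set<String> pluginsToDelete(String target, Map<String, List<String>> dependsOn) {
        Set<String> result = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(target);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            if (!result.add(current)) {
                continue; // already handled
            }
            // Any plugin that requires 'current' must be deleted as well.
            for (Map.Entry<String, List<String>> entry : dependsOn.entrySet()) {
                if (entry.getValue().contains(current)) {
                    pending.push(entry.getKey());
                }
            }
        }
        return result;
    }
}
```

With the platform, jmx, hibernate, and jboss-as plugins deployed, asking for jmx yields jmx, hibernate, and jboss-as, while asking for apache yields only apache.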
- The plugin in the database (RHQ_PLUGIN)
- The plugin jar file on the file system (on the server)
- All resource types declared in the plugin
- Compatible resource groups
- Alert notifications that execute operations on types that are disabled
- Alert history records that audit invoked operations on types that are disabled
- All resources, and their associated instance data, of types declared in the plugin
Not only do we want to delete the plugin jar file on the agent, but we also want to free up resources associated with the plugin. We want to stop running discovery components, purge local inventory of resources and resource types, and we want to discard classloaders that are no longer needed. Fortunately, all of this already happens when the PC is started or restarted.
We need a way for the PC to tell the agent that it needs to reboot the PC. But because the PC cannot directly use any agent APIs (since the PC does not always run inside the agent), we need to add some indirection. We can create a new PC listener type, RebootRequestListener, that the agent can implement. When the PC needs to be rebooted it notifies the listener, which turns control over to the agent. The agent can then reboot the PC.
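A minimal sketch of that indirection follows. Only the name `RebootRequestListener` comes from this page; the method `rebootRequested`, the setter on the plugin container, and the `PluginContainerSketch` and `AgentSketch` stand-ins are assumptions made for illustration and do not reflect existing RHQ APIs.

```java
import java.util.ArrayList;
import java.util.List;

/** Proposed listener type; the method name and signature are assumptions. */
interface RebootRequestListener {
    void rebootRequested(String reason);
}

/** Stand-in for the plugin container. It knows only the listener interface. */
class PluginContainerSketch {
    private RebootRequestListener listener;

    void setRebootRequestListener(RebootRequestListener listener) {
        this.listener = listener;
    }

    /** Called when the PC learns that it holds stale types or plugins. */
    void onStaleTypesDetected() {
        if (listener != null) { // no listener when the PC runs outside the agent
            listener.rebootRequested("stale resource types");
        }
    }
}

/** Stand-in for the agent, which owns the PC life cycle. */
class AgentSketch implements RebootRequestListener {
    final List<String> actions = new ArrayList<>();

    @Override
    public void rebootRequested(String reason) {
        // A real agent would stop and start the PC here; the sketch records the steps.
        actions.add("shutdown PC");
        actions.add("start PC");
    }
}
```

Because the PC depends only on the listener interface, it can still run embedded elsewhere with no listener registered, in which case the request is simply dropped.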
There is a race condition of sorts that could occur when we start the deletion process. While the deletion is underway, an agent could send the server an inventory report containing resources from the plugin being deleted. Resources could get committed into inventory while the resource types are being deleted, which could put the inventory into an inconsistent state. We need to effectively turn off discovery of all resource types from the plugins being deleted. This is discussed and covered to a large degree in BZ 535289. The work done there resulted in turning off discovery components on a per-agent basis when a plugin component is misbehaving in some way.
Ultimately, we do not want to merge inventory reports that contain resources that have already been deleted or that are of types in the process of being deleted. The server needs to check reports and throw an exception if a report contains stale data. The exception is propagated back to the PC, which tells the PC that it has stale types/plugins and needs to be restarted.
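The server-side check could look roughly like the sketch below. The names `StaleReportException` and `InventoryReportChecker`, and the idea of passing in a map from type name to its marked-for-deletion flag, are hypothetical simplifications; a real implementation would work against the inventory report and ResourceType entities.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical exception signalling that the PC must be restarted. */
class StaleReportException extends RuntimeException {
    StaleReportException(String message) {
        super(message);
    }
}

/** Hypothetical helper, not an RHQ class. */
class InventoryReportChecker {

    /**
     * @param reportedTypeNames       type names of the resources in the report
     * @param markedForDeletionByType assumed lookup: type name -> deleted flag;
     *                                a missing key means the type is already gone
     */
    static void rejectIfStale(List<String> reportedTypeNames,
                              Map<String, Boolean> markedForDeletionByType) {
        for (String typeName : reportedTypeNames) {
            Boolean marked = markedForDeletionByType.get(typeName);
            if (marked == null) {
                throw new StaleReportException("Type already deleted: " + typeName);
            }
            if (marked) {
                throw new StaleReportException("Type marked for deletion: " + typeName);
            }
        }
    }
}
```

Rejecting the whole report, rather than filtering out the stale resources, keeps the server simple and gives the PC a clear signal that its type metadata is out of date.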
Resources will be uninventoried. Resource types will be marked for deletion.
We already have the async resource delete scheduled job (see the class AsyncResourceDeleteJob) that handles removing resources and their associated data from inventory.
We will delete resource types out of band. Deleting resource types could very well be an expensive, time-consuming operation. As such, it should be done asynchronously as a scheduled job. There are a couple of important preconditions we must have in place. First, discovery and importing of resources of the types to be deleted must be turned off. Second, the async resource delete job must remove all of the resources along with all of their associated instance data. With these preconditions in place, our job can run independently of the resource deletion job and do the resource type deletion. When the job runs, it will first check that all resources of the type to be deleted have been deleted. If there are still resources of that type in the system, the job skips that resource type until a later run. This work needs to take into account the resource type hierarchy. We cannot remove some type A if some type B depends on A. Each time this job is executed, we have to construct the dependency graph in memory of all resource types to be deleted, and delete them in the appropriate order, respecting dependencies.
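One way to compute that order is a depth-first walk that emits a type only after all of its dependent types have been emitted, and that skips any type that still has resources or whose dependents cannot yet be deleted. The sketch below is an assumption-laden illustration: `ResourceTypeDeletionPlanner`, its parameters, and the use of plain type names instead of ResourceType entities are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not an RHQ class. */
class ResourceTypeDeletionPlanner {

    /**
     * @param marked         types marked for deletion
     * @param dependents     assumed shape: type -> types that depend on it (e.g. child types)
     * @param resourceCounts resources still in inventory per type (missing key = 0)
     * @return types that can be deleted in this run, dependents before the types they depend on
     */
    static List<String> deletionOrder(Set<String> marked,
                                      Map<String, List<String>> dependents,
                                      Map<String, Integer> resourceCounts) {
        List<String> order = new ArrayList<>();
        Map<String, Boolean> deletable = new HashMap<>();
        for (String type : marked) {
            visit(type, marked, dependents, resourceCounts, deletable, order);
        }
        return order;
    }

    private static boolean visit(String type, Set<String> marked,
                                 Map<String, List<String>> dependents,
                                 Map<String, Integer> resourceCounts,
                                 Map<String, Boolean> deletable, List<String> order) {
        Boolean known = deletable.get(type);
        if (known != null) {
            return known;
        }
        deletable.put(type, false); // provisional value, guards against cycles
        boolean ok = marked.contains(type) && resourceCounts.getOrDefault(type, 0) == 0;
        for (String dependent : dependents.getOrDefault(type, Collections.<String>emptyList())) {
            // Visit every dependent so deletable ones are still emitted this run.
            boolean dependentOk = visit(dependent, marked, dependents, resourceCounts, deletable, order);
            ok = ok && dependentOk;
        }
        if (ok) {
            order.add(type); // all dependents were emitted earlier
        }
        deletable.put(type, ok);
        return ok;
    }
}
```

Types that are skipped in one run are simply picked up again the next time the job executes, once the async resource delete job has caught up.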
Note: estimates are in hours.
| Add deleted field to ResourceType
|| This includes updating the dbsetup and dbupgrade scripts in addition to updating Java code.
| Refactor code base so that the system is unaware of types marked for deletion
|| In UI code and places where ResourceType objects are returned from method calls (primarily from the SLSB layer), we essentially need to put filters in place so the code is only dealing with resource types that have not been marked for deletion. Only those parts of the system that are dealing with the actual deletion need to be concerned with those types that are marked for deletion. Queries need to be updated as well. The only queries that should include types marked for deletion in the results are those queries that are used in the code that deals with the actual deletion.
| In Band Deletion
|| We need to write code that marks resource types deleted and uninventories resources. Something like ResourceMetadataManagerBean.deletePlugins might be a logical place for this since ResourceMetadataManagerBean already contains methods for enabling and disabling plugins.
| Determine how to delete the resource types
|| This boils down to a performance analysis of whether EntityManager.remove is a viable solution or if we ought to use JPQL delete statements. EntityManager.remove requires the entity to first be loaded in the persistence context. Having to load all of the resource types and potentially loading some/all of their associations could turn out to be very expensive. With JPQL statements, we can avoid loading object graphs into the persistence context.
| Resource Type Deletion Job
|| This is the scheduled job that carries out the resource type deletion.
| Reject Inventory Reports Containing Stale Data
||We consider an inventory report stale if it contains one or more resources of a type that is marked for deletion or that has already been deleted. When a stale report is detected, the server throws an exception.|| 8
| Purge Plugin Container of Stale Types
|| This should occur when the server throws an exception after the PC sends up a report, such as an inventory report, that contains stale data. Restarting the PC should effectively handle the removal of stale types. The PC needs to tell the agent that it needs to be restarted. Since the PC cannot call agent APIs, we need to put a level of indirection in place. We can create a new PC listener type, RebootRequestListener, that the agent can implement. When the PC needs to be rebooted it notifies the listener, which turns control over to the agent. The agent can then reboot the PC. Before restarting the PC, we need to delete the inventory.dat file to account for any resources that have already been persisted. This will avoid reloading types for which class loaders no longer exist. Note that it will be the agent's reboot listener that has to delete inventory.dat. The PC writes to inventory.dat during shutdown, so the agent must delete the file after the PC has been shut down.
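The ordering constraint on inventory.dat can be sketched as follows. The class `PluginContainerRestarter` and the use of `Runnable` callbacks for stopping and starting the PC are hypothetical; the only point illustrated is that the file is deleted after shutdown completes and before the PC starts again.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper, not an RHQ class. */
class PluginContainerRestarter {

    /**
     * @param shutdownPc    stops the PC; the PC writes inventory.dat while shutting down
     * @param inventoryFile location of inventory.dat (assumed to be known to the agent)
     * @param startPc       starts the PC again with a clean local inventory
     */
    static void restart(Runnable shutdownPc, Path inventoryFile, Runnable startPc) {
        shutdownPc.run();
        try {
            // Must happen after shutdown, otherwise the PC would rewrite the file.
            Files.deleteIfExists(inventoryFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        startPc.run();
    }
}
```

Deleting the file before shutdown would have no lasting effect, because the PC would write it again on its way down.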