Chapter 9. Participant Crash Recovery

A key requirement of a transaction service is to be resilient to a system crash by a host running a participant, as well as the host running the transaction coordination services. Crashes which happen before a transaction terminates or before a business activity completes are relatively easy to accommodate. The transaction service and participants can adopt a presumed abort policy.

Procedure 9.1. Presumed Abort Policy

If the coordinator crashes, it can assume that any transaction it does not know about is invalid, and reject a participant request which refers to such a transaction.
If the participant crashes, it can forget any provisional changes it has made, and reject any request from the coordinator service to prepare a transaction or complete a business activity.

Crash recovery is more complex if the crash happens during a transaction commit operation, or between completing and closing a business activity. The transaction service must ensure as far as possible that participants arrive at a consistent outcome for the transaction.

WS-AT Transaction: The transaction needs to commit all provisional changes or roll them all back to the state before the transaction started.
WS-Business Activity Transaction: All participants need to close the activity or cancel the activity, and run any required compensating actions.

On the rare occasions where such a consensus cannot be reached, the transaction service must log and report transaction failures.

XTS includes support for automatic recovery of WS-AT and WS-BA transactions, if either or both of the coordinator and participant hosts crashes. The XTS recovery manager begins execution on coordinator and participant hosts when the XTS service restarts. On a coordinator host, the recovery manager detects any WS-AT transactions which have prepared but not committed, as well as any WS-BA transactions which have completed but not yet closed. It ensures that all their participants are rolled forward in the first case, or closed in the second.

On a participant host, the recovery manager detects any prepared WS-AT participants which have not responded to a transaction rollback, and any completed WS-BA participants which have not yet responded to an activity cancel request, and ensures that the former are rolled back and the latter are compensated. The recovery service also allows for recovery of subordinate WS-AT transactions and their participants if a crash occurs on a host where an interposed WS-AT coordinator has been employed.

9.1. WS-AT Recovery

9.1.1. WS-AT Coordinator Crash Recovery

The WS-AT coordination service tracks the status of each participant in a transaction as the transaction progresses through its two-phase commit. When all participants have been sent a prepare message and have responded with a prepared message, the coordinator writes a log record storing each participant's details, indicating that the transaction is ready to complete. If the coordinator service crashes after this point has been reached, completion of the two-phase commit protocol is still guaranteed, by reading the log file after reboot and sending a commit message to each participant. Once all participants have responded to the commit with a committed message, the coordinator can safely delete the log entry.

Since the prepared messages returned by the participants imply that they are ready to commit their provisional changes and make them permanent, this type of recovery is safe. Additionally, the coordinator does not need to account for any commit messages which may have been sent before the crash, or resend messages if it crashes several times. The XTS participant implementation is resilient to redelivery of the commit messages. If the participant has implemented the recovery functions described in Section 9.1.2.1, “WS-AT Participant Crash Recovery APIs”, the coordinator can guarantee delivery of commit messages if both it crashes, and one or more of the participant service hosts also crash, at the same time.

If the coordination service crashes before the prepare phase completes, the presumed abort protocol ensures that participants are rolled back. After system restart, the coordination service has the information about about all the transactions which could have entered the commit phase before the reboot, since they have entries in the log. It also knows about any active transactions started after the reboot. If a participant is waiting for a response, after sending its prepared message, it automatically re sends the prepared message at regular intervals. When the coordinator detects a transaction which is not active and has no entry in the log file after the reboot, it instructs the participant to abort, ensuring that the web service gets a chance to roll back any provisional state changes it made on behalf of the transaction.

A web service may decide to unilaterally commit or roll back provisional changes associated with a given participant, if configured to time out after a specified length of time without a response. In this situation, the the web service should record this action and log a message to persistent storage. When the participant receives a request to commit or roll back, it should throw an exception if its unilateral decision action does not match the requested action. The coordinator detects the exception and logs a message marking the outcome as heuristic. It also saves the state of the transaction permanently in the transaction log, to be inspected and reconciled by an administrator.

9.1.2. WS-AT Participant Crash Recovery

WS-AT participants associated with a transactional web service do not need to be involved in crash recovery if the Web service's host machine crashes before the participant is told to prepare. The coordinator will assume that the transaction has aborted, and the Web service can discard any information associated with unprepared transactions when it reboots.

When a participant is told to prepare, the Web service is expected to save to persistent storage the transactional state it needs to commit or roll back the transaction. The specific information it needs to save is dependent on the implementation and business logic of the Web Service. However, the participant must save this state before returning a Prepared vote from the prepare call. If the participant cannot save the required state, or there is some other problem servicing the request made by the client, it must return an Aborted vote.

The XTS participant services running on a Web Service's host machine cooperate with the Web service implementation to facilitate participant crash recovery. These participant services are responsible for calling the participant's prepare, commit, and rollback methods. The XTS implementation tracks the local state of every enlisted participant. If the prepare call returns a Prepared vote, the XTS implementation ensures that the participant state is logged to the local transaction log before forwarding a prepared message to the coordinator.

A participant log record contains information identifying the participant, its transaction, and its coordinator. This is enough information to allow the rebooted XTS implementation to reinstate the participant as active and to continue communication with the coordinator, as though the participant had been enlisted and driven to the prepared state. However, a participant instance is still necessary for the commit or rollback process to continue.

Full recovery requires the log record to contain information needed by the Web service which enlisted the participant. This information must allow it to recreate an equivalent participant instance, which can continue the commit process to completion, or roll it back if some other Web Service fails to prepare. This information might be as simple as a String key which the participant can use to locate the data it made persistent before returning its Prepared vote. It may be as complex as a serialized object tree containing the original participant instance and other objects created by the Web service.

If a participant instance implements the relevant interface, the XTS implementation will append this participant recovery state to its log record before writing it to persistent storage. In the event of a crash, the participant recovery state is retrieved from the log and passed to the Web Service which created it. The Web Service uses this state to create a new participant, which the XTS implementation uses to drive the transaction to completion. Log records are only deleted after the participant's commit or rollback method is called.

Warning

If a crash happens just before or just after a commit method is called, a commit or rollback method may be called twice.

9.1.2.1. WS-AT Participant Crash Recovery APIs

9.1.2.1.1. Saving Participant Recovery State

When a Business Activity participant web service completes its work, it may want to save the information which will be required later to close or compensate actions performed during the activity. The XTS implementation automatically acquires this information from the participant as part of the completion process and writes it to a participant log record. This ensures that the information can be restored and used to recreate a copy of the participant even if the web service container crashes between the complete and close or compensate operations.

For a Participant Completion participant, this information is acquired when the web service invokes the completed method of the BAParticipantManager instance returned from the call which enlisted the participant. For a Coordinator Completion participant this occurs immediately after the call to it's completed method returns. This assumes that the completed method does not throw an exception or call the participant manager's cannotComplete or fail method.

A participant may signal that it is capable of performing recovery processing, by implementing the java.lang.Serializable interface. An alternative is to implement the Example 9.1, “PersistableATParticipant Interface”.

Example 9.1. PersistableATParticipant Interface

public interface PersistableATParticipant

{

    byte[] getRecoveryState() throws Exception;

}

If a participant implements the Serializable interface, the XTS participant services implementation uses the serialization API to create a version of the participant which can be appended to the participant log entry. If it implements the PersistableATParticipant interface, the XTS participant services implementation call the getRecoveryState method to obtain the state to be appended to the participant log entry.

If neither of these APIs is implemented, the XTS implementation logs a warning message and proceeds without saving any recovery state. In the event of a crash on the host machine for the Web service during commit, the transaction cannot be recovered and a heuristic outcome may occur. This outcome is logged on the host running the coordinator services.

9.1.2.1.2. Recovering Participants at Reboot

A Web service must register with the XTS implementation when it is deployed, and unregister when it is undeployed, in order to participate in recovery processing. Registration is performed using class XTSATRecoveryManager defined in package org.jboss.jbossts.xts.recovery.participant.at.

Example 9.2. Registering for Recovery

public abstract class XTSATRecoveryManager {

    . . .

        public static XTSATRecoveryManager getRecoveryManager() ;

    public void registerRecoveryModule(XTSATRecoveryModule module);

    public abstract void unregisterRecoveryModule(XTSATRecoveryModule module)

        throws NoSuchElementException;

    . . .

}

The Web service must provide an implementation of interface XTSBARecoveryModule in package org.jboss.jbossts.xts.recovery.participant.ba, as an argument to the register and unregister calls. This instance identifies saved participant recovery records and recreates new, recovered participant instances:

Example 9.3. XTSBARecoveryModule Interface

public interface XTSATRecoveryModule

{

    public Durable2PCParticipant

        deserialize(String id, ObjectInputStream stream) 

        throws Exception;

    public Durable2PCParticipant

        recreate(String id, byte[] recoveryState) 

        throws Exception;

    public void endScan();

}

If a participant's recovery state was saved using serialization, the recovery module's deserialize method is called to recreate the participant. Normally, the recovery module is required to read, cast, and return an object from the supplied input stream. If a participant's recovery state was saved using the PersistableATParticipant interface, the recovery module's recreate method is called to recreate the participant from the byte array it provided when the state was saved.

The XTS implementation cannot identify which participants belong to which recovery modules. A module only needs to return a participant instance if the recovery state belongs to the module's Web service. If the participant was created by another Web service, the module should return null. The participant identifier, which is supplied as argument to the deserialize or recreate method, is the identifier used by the Web service when the original participant was enlisted in the transaction. Web Services participating in recovery processing should ensure that participant identifiers are unique per service. If a module recognizes that a participant identifier belongs to its Web service, but cannot recreate the participant, it should throw an exception. This situation might arise if the service cannot associate the participant with any transactional information which is specific to the business logic.

Even if a module relies on serialization to create the participant recovery state saved by the XTS implementation, it still must be registered by the application. The deserialization operation must employ a class loader capable of loading classes specific to the Web service. XTS fulfills this requirement by devolving responsibility for the deserialize operation to the recovery module.

9.2. WS-BA Recovery

9.2.1. WS-BA Coordinator Crash Recovery

The WS-BA coordination service implementation tracks the status of each participant in an activity as the activity progresses through completion and closure. A transition point occurs during closure, once all CoordinatorCompletion participants receive a complete message and respond with a completed message. At this point, all ParticipantCompletion participants should have sent a completed message. The coordinator writes a log record storing the details of each participant, and indicating that the transaction is ready to close. If the coordinator service crashes after the log record is written, the close operation is still guaranteed to be successful. The coordinator checks the log after the system reboots and re sends a close message to all participants. After all participants respond to the close with a closed message, the coordinator can safely delete the log entry.

The coordinator does not need to account for any close messages sent before the crash, nor resend messages if it crashes several times. The XTS participant implementation is resilient to redelivery of close messages. Assuming that the participant has implemented the recovery functions described below, the coordinator can even guarantee delivery of close messages if both it, and one or more of the participant service hosts, crash simultaneously.

If the coordination service crashes before it has written the log record, it does not need to explicitly compensate any completed participants. The presumed abort protocol ensures that all completed participants are eventually sent a compensate message. Recovery must be initiated from the participant side.

A log record does not need to be written when an activity is being canceled. If a participant does not respond to a cancel or compensate request, the coordinator logs a warning and continues. The combination of the presumed abort protocol and participant-led recovery ensures that all participants eventually get canceled or compensated, as appropriate, even if the participant host crashes.

If a completed participant does not detect a response from its coordinator after resending its completed response a suitable number of times, it switches to sending getstatus messages, to determine whether the coordinator still knows about it. If a crash occurs before writing the log record, the coordinator has no record of the participant when the coordinator restarts, and the getstatus request returns a fault. The participant recovery manager automatically compensates the participant in this situation, just as if the activity had been canceled by the client.

After a participant crash, the participant recovery manager detects the log entries for each completed participant. It sends getstatus messages to each participant's coordinator host, to determine whether the activity still exists. If the coordinator has not crashed and the activity is still running, the participant switches back to resending completed messages, and waits for a close or compensate response. If the coordinator has also crashed or the activity has been canceled, the participant is automatically canceled.

9.2.2. WS-BA Participant Crash Recovery APIs

9.2.2.1. Saving Participant Recovery State

Example 9.4. PersistableBAParticipant Interface

public interface PersistableBAParticipant

{

    byte[] getRecoveryState() throws Exception;

}

If a participant implements the Serializable interface, the XTS participant services implementation uses the serialization API to create a version of the participant which can be appended to the participant log entry. If the participant implements the PersistableBAParticipant, the XTS participant services implementation call the getRecoveryState method to obtain the state, which is appended to the participant log entry.

If neither of these APIs is implemented, the XTS implementation logs a warning message and proceeds without saving any recovery state. If the Web service's host machine crashes while the activity is being closed, the activity cannot be recovered and a heuristic outcome will probably be logged on the coordinator's host machine. If the activity is canceled, the participant is not compensated and the coordinator host machine may log a heuristic outcome for the activity.

9.2.2.2. Recovering Participants at Reboot

A Web service must register with the XTS implementation when it is deployed, and unregister when it is undeployed, so it can take part in recovery processing.

Registration is performed using the XTSBARecoveryManager, defined in the org.jboss.jbossts.xts.recovery.participant.ba package.

Example 9.5. XTSBARecoveryManager Class

public abstract class XTSBARecoveryManager {

    . . .

        public static XTSBARecoveryManager getRecoveryManager() ;

    public void registerRecoveryModule(XTSBARecoveryModule module);

    public abstract void unregisterRecoveryModule(XTSBARecoveryModule module)

        throws NoSuchElementException;

    . . .

}

The Web service must provide an implementation of the XTSBARecoveryModule in the org.jboss.jbossts.xts.recovery.participant.ba, as an argument to the register and unregister calls. This instance identifies saved participant recovery records and recreates new, recovered participant instances:

Example 9.6. XTSBARecoveryModule Interface

public interface XTSBARecoveryModule

{

    public BusinessAgreementWithParticipantCompletionParticipant

        deserializeParticipantCompletionParticipant(String id,

                                                    ObjectInputStream stream)

        throws Exception;

    public BusinessAgreementWithParticipantCompletionParticipant

        recreateParticipantCompletionParticipant(String id,

                                                 byte[] recoveryState)

        throws Exception;

    public BusinessAgreementWithCoordinatorCompletionParticipant

        deserializeCoordinatorCompletionParticipant(String id,

                                                    ObjectInputStream stream)

        throws Exception;

    public BusinessAgreementWithCoordinatorCompletionParticipant

        recreateCoordinatorCompletionParticipant(String id,

                                                 byte[] recoveryState)

        throws Exception;

    public void endScan();

}

If a participant's recovery state was saved using serialization, one of the recovery module's deserialize methods is called, so that it can recreate the participant. Which method to use depends on whether the saved participant implemented the ParticipantCompletion protocol or the CoordinatorCompletion protocol. Normally, the recovery module reads, casts and returns an object from the supplied input stream. If a participant's recovery state was saved using the PersistableBAParticipant interface, one of the recovery module's recreate methods is called, so that it can recreate the participant from the byte array provided when the state was saved. The method to use depends on which protocol the saved participant implemented.

The XTS implementation does not track which participants belong to which recovery modules. A module is only expected to return a participant instance if it can identify that the recovery state belongs to its Web service. If the participant was created by some other Web service, the module should return null. The participant identifier supplied as an argument to the deserialize or recreate calls is the identifier used by the Web service when the original participant was enlisted in the transaction. Web Services which participate in recovery processing should ensure that the participant identifiers they employ are unique per service. If a module recognizes a participant identifier as belonging to its Web service, but cannot recreate the participant, it throws an exception. This situation might arise if the service cannot associate the participant with any transactional information specific to business logic.

A module must be registered by the application, even when it relies upon serialization to create the participant recovery state saved by the XTS implementation. The deserialization operation must employ a class loader capable of loading Web service-specific classes. The XTS implementation achieves this by delegating responsibility for the deserialize operation to the recovery module.

9.2.2.3. Securing Web Service State Changes

When a BA participant completes, it is expected to commit changes to the web service state made during the activity. The web service usually also needs to persist these changes to a local storage device. This leaves open a window where the persisted changes may not be guarded with the necessary compensation information. The web service container may crash after the changes to the service state have been written but before the XTS implementation is able to acquire the recovery state and write a recovery log record for the participant. Participants may close this window by employing a two phase update to the local store used to persist the web service state.

A participant which needs to persist changes to local web service state should implement interface ConfirmCompletedParticipant in package com.arjuna.wst11. This signals to the XTS implementation that it expects confirmation after a successful write of the participant recovery record, allowing it to roll forward provisionally persisted changes to the web service state. Delivery of this confirmation can be guaranteed even if the web service container crashes after writing the participant log record. Conversely, if a recovery record cannot be written because of a fault or a crash prior to writing, the provisional changes can be guaranteed to be rolled back.

Example 9.7. ConfirmCompletedParticipant Interface

public interface ConfirmCompletedParticipant

{

    public void confirmCompleted(boolean confirmed);

}

When the participant is ready to complete, it should prepare its persistent changes by temporarily locking access to the relevant state in the local store and writing the changed data to disk, retaining both the old and new versions of the service state. For a Participant Completion participant, this prepare operation should be done just before calling the participant manager's completed method. For a Coordinator Completion participant, it should be done just before returning from the call to the participant's completed method. After writing the participant log record, the XTS implementation calls the participant's confirmCompleted method, providing value true as the argument. The participant should respond by installing the provisional state changes and releasing any locks. If the log record cannot be written, the XTS implementation calls the participant's confirmCompleted method, providing value false as the argument. The participant should respond by restoring the original state values and releasing any locks.

If a crash occurs before the call to confirmCompleted, the application's recovery module can make sure that the provisional changes to the web service state are rolled forward or rolled back as appropriate. The web service must identify all provisional writes to persistent state before it starts serving new requests or processing recovered participants. It must reobtain any locks required to ensure that the state is not changed by new transactions. When the recovery module recovers a participant from the log, its compensation information is available. If the participant still has prepared changes, the recovery code must call confirmCompleted, passing value true. This allows the participant to finish the complete operation. The XTS implementation then forwards a completed message to the coordinator, ensuring that the participant is subsequently notified either to close or to compensate. At the end of the first recovery scan, the recovery module may find some prepared changes on disk which are still unaccounted for. This means that the participant recovery record is not available. The recovery module should restore the original state values and release any locks. The XTS implementation responds to coordinator requests regarding the participant with an unknown participant fault, forcing the activity as a whole to be rolled back.