JBoss.org Community Documentation

ArjunaCore Failure Recovery Guide

Failure Recovery for TxCore and TXOJ

by Mark Little

Abstract

The ArjunaCore Failure Recovery Guide contains information on how to use Narayana to develop applications that use transaction technology to manage business processes.



In this chapter we shall cover information on failure recovery that is specific to TxCore, TXOJ or using Narayana outside the scope of a supported application server.

The failure recovery subsystem of Narayana will ensure that results of a transaction are applied consistently to all resources affected by the transaction, even if any of the application processes or the machine hosting them crash or lose network connectivity. In the case of machine (system) crash or network failure, the recovery will not take place until the system or network is restored, but the original application does not need to be restarted: recovery responsibility is delegated to the Recovery Manager process (see below). Recovery after failure requires that information about the transaction and the resources involved survives the failure and is accessible afterward. This information is held in the ActionStore, which is part of the ObjectStore.

Until the recovery procedures are complete, resources affected by a transaction that was in progress at the time of the failure may be inaccessible. For database resources, this may be reported as tables or rows held by “in-doubt transactions”. For Transactional Objects for Java resources, an attempt to activate the Transactional Object (as when trying to get a lock) will fail.

The failure recovery subsystem of Narayana requires that the stand-alone Recovery Manager process be running for each ObjectStore (typically one for each node on the network that is running Narayana applications). The RecoveryManager class is located in the package com.arjuna.ats.arjuna.recovery. To start the Recovery Manager, issue the following command:

java com.arjuna.ats.arjuna.recovery.RecoveryManager

If the -test flag is used with the Recovery Manager then it will display a “Ready” message when initialised, i.e.,

java com.arjuna.ats.arjuna.recovery.RecoveryManager -test

The RecoveryManager scans the ObjectStore and other locations of information, looking for transactions and resources that require, or may require, recovery. The scans and recovery processing are performed by recovery modules (instances of classes that implement the com.arjuna.ats.arjuna.recovery.RecoveryModule interface), each with responsibility for a particular category of transaction or resource. The set of recovery modules used is dynamically loaded, using properties found in the RecoveryManager property file.

The interface has two methods: periodicWorkFirstPass and periodicWorkSecondPass. At an interval (defined by property com.arjuna.ats.arjuna.recovery.periodicRecoveryPeriod), the RecoveryManager will call the first pass method on each module, then wait for a brief period (defined by property com.arjuna.ats.arjuna.recovery.recoveryBackoffPeriod), then call the second pass of each module. Typically, in the first pass, the module scans (e.g. the relevant part of the ObjectStore) to find transactions or resources that are in-doubt (i.e. are part way through the commitment process). On the second pass, if any of the same items are still in-doubt, it is possible the original application process has crashed and the item is a candidate for recovery.
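The two-pass behavior described above can be sketched as follows. This is a minimal, self-contained model and not Narayana code: the class TwoPassScanSketch, its method names and the string identifiers are assumptions made for illustration, and the sets passed in stand in for whatever a real module would read from the ObjectStore.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of the two-pass scan; it does not use the Narayana
// RecoveryModule interface or the real ObjectStore.
class TwoPassScanSketch {
    // Identifiers that looked in-doubt during the first pass.
    private Set<String> firstPassInDoubt = new LinkedHashSet<>();

    // First pass: remember everything that currently looks in-doubt.
    void firstPass(Set<String> inDoubtNow) {
        firstPassInDoubt = new LinkedHashSet<>(inDoubtNow);
    }

    // Second pass: only items in-doubt in BOTH passes are recovery candidates.
    Set<String> secondPass(Set<String> inDoubtNow) {
        Set<String> candidates = new LinkedHashSet<>(firstPassInDoubt);
        candidates.retainAll(inDoubtNow);
        return candidates;
    }

    public static void main(String[] args) {
        TwoPassScanSketch scan = new TwoPassScanSketch();
        scan.firstPass(Set.of("tx-1", "tx-2"));
        // tx-2 completed during the backoff period; tx-3 is new and would
        // only be considered on the next iteration.
        System.out.println(scan.secondPass(Set.of("tx-1", "tx-3")));
    }
}
```

Items that complete between the two passes drop out of the candidate set, which is why the backoff period reduces the number of transactions that have to be checked against the original process.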

An attempt by the RecoveryManager to recover a transaction that is still progressing in the original process(es) is likely to break the consistency. Accordingly, the recovery modules use a mechanism (implemented in com.arjuna.ats.arjuna.recovery.TransactionStatusManager) to check whether the original process is still alive, and whether the transaction is still in progress. The RecoveryManager only proceeds with recovery if the original process has gone, or, if still alive, the transaction is completed. (If a server process or machine crashes, but the transaction-initiating process survives, the transaction will complete, usually generating a warning. Recovery of such a transaction is the RecoveryManager’s responsibility.)

It is clearly important to set the interval periods appropriately. The total iteration time will be the sum of the periodicRecoveryPeriod, recoveryBackoffPeriod and the length of time it takes to scan the stores and to attempt recovery of any in-doubt transactions found, for all the recovery modules. The recovery attempt time may include connection timeouts while trying to communicate with processes or machines that have crashed or are inaccessible (which is why there are mechanisms in the recovery system to avoid trying to recover the same transaction forever). The total iteration time will affect how long a resource will remain inaccessible after a failure, so periodicRecoveryPeriod should be set accordingly (default is 120 seconds). The recoveryBackoffPeriod can be comparatively short (default is 10 seconds). Its purpose is mainly to reduce the number of transactions that are candidates for recovery and which thus require a “call to the original process” to see if they are still in progress.
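As a worked example of the sum described above, the following sketch adds the two default periods to an assumed scan and recovery time. The class RecoveryTimingSketch and the 5 second scan time are illustrative assumptions, not values or APIs taken from Narayana.

```java
import java.time.Duration;

// Hypothetical helper, not part of Narayana: adds up the three
// contributions to one recovery iteration named in the text.
class RecoveryTimingSketch {
    static Duration iterationTime(Duration periodicRecoveryPeriod,
                                  Duration recoveryBackoffPeriod,
                                  Duration scanAndRecoveryTime) {
        return periodicRecoveryPeriod
                .plus(recoveryBackoffPeriod)
                .plus(scanAndRecoveryTime);
    }

    public static void main(String[] args) {
        Duration total = iterationTime(
                Duration.ofSeconds(120), // default periodicRecoveryPeriod
                Duration.ofSeconds(10),  // default recoveryBackoffPeriod
                Duration.ofSeconds(5));  // assumed scan and recovery time
        System.out.println(total.getSeconds() + " seconds"); // prints "135 seconds"
    }
}
```

With the defaults, a resource left in-doubt by a crash could therefore stay inaccessible for a little over two minutes, and longer if recovery attempts run into connection timeouts.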

Two recovery modules (implementations of the com.arjuna.ats.arjuna.recovery.RecoveryModule interface) are supplied with Narayana, supporting various aspects of transaction recovery including JDBC recovery. It is possible for advanced users to create their own recovery modules and register them with the Recovery Manager. The recovery modules are registered with the RecoveryManager using RecoveryEnvironmentBean.recoveryExtensions. These will be invoked on each pass of the periodic recovery in the sort-order of the property names – it is thus possible to predict the ordering (but note that a failure in an application process might occur while a periodic recovery pass is in progress). The default Recovery Extension settings are:


The operation of the recovery subsystem will cause some entries to be made in the ObjectStore that will not be removed in normal progress. The RecoveryManager has a facility for scanning for these and removing items that are very old. Scans and removals are performed by implementations of the com.arjuna.ats.arjuna.recovery.ExpiryScanner interface. Implementations of this interface are loaded by giving the class names as the value of the property RecoveryEnvironmentBean.expiryScanners. The RecoveryManager calls the scan() method on each loaded Expiry Scanner implementation at an interval determined by the property RecoveryEnvironmentBean.expiryScanInterval. This value is given in hours (default is 12). An expiryScanInterval value of zero will suppress any expiry scanning. If the value as supplied is positive, the first scan is performed when RecoveryManager starts; if the value is negative, the first scan is delayed until after the first interval (using the absolute value).
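The sign convention for expiryScanInterval can be summarised in a few lines of code. This is only a sketch of the rule as stated above; the class ExpiryScanScheduleSketch and its methods are hypothetical and do not reflect how the RecoveryManager schedules scans internally.

```java
import java.util.OptionalLong;

// Hypothetical illustration of the expiryScanInterval rule (value in hours).
class ExpiryScanScheduleSketch {
    // Delay before the first scan, or empty when scanning is suppressed.
    static OptionalLong firstScanDelayHours(int expiryScanInterval) {
        if (expiryScanInterval == 0) {
            return OptionalLong.empty();          // zero: no expiry scanning
        }
        if (expiryScanInterval > 0) {
            return OptionalLong.of(0);            // positive: scan at start-up
        }
        // negative: wait one interval (absolute value) before the first scan
        return OptionalLong.of(Math.abs((long) expiryScanInterval));
    }

    // Interval between subsequent scans, using the absolute value.
    static long repeatIntervalHours(int expiryScanInterval) {
        return Math.abs((long) expiryScanInterval);
    }

    public static void main(String[] args) {
        for (int value : new int[] {12, 0, -12}) {
            System.out.println(value + " -> " + firstScanDelayHours(value));
        }
    }
}
```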

The kinds of item that are scanned for expiry are:

TransactionStatusManager items: one of these is created by every application process that uses Narayana – they contain the information that allows the RecoveryManager to determine if the process that initiated the transaction is still alive, and what the transaction status is. The expiry time for these is set by the property com.arjuna.ats.arjuna.recovery.transactionStatusManagerExpiryTime (in hours – default is 12, zero means never expire). The expiry time should be greater than the lifetime of any single Narayana-using process.

The Expiry Scanner properties for these are:


To illustrate the behavior of a recovery module, the following pseudocode describes the basic algorithm used for Atomic Action transactions and Transactional Objects for Java.



In order to recover from failure, we have seen that the Recovery Manager contacts recovery modules by periodically invoking the methods periodicWorkFirstPass and periodicWorkSecondPass. Each Recovery Module is then able to manage recovery according to the type of resources that need to be recovered. The Narayana product is shipped with a set of recovery modules (TORecoveryModule, XARecoveryModule…), but users can define their own recovery module to fit their application. The following basic example illustrates the steps needed to build such a recovery module.

This basic example does not aim to present a complete process to recover from failure, but mainly to illustrate the way to implement a recovery module.

The application used here creates an atomic transaction, registers a participant with it and finally terminates it, either by commit or abort. The following arguments are accepted by the code below: -commit (the default), -rollback and -crash.

The code of the main class that controls the application is given below.

Example 1.5. TestRecoveryModule.java

package com.arjuna.demo.recoverymodule;


import com.arjuna.ats.arjuna.AtomicAction;
import com.arjuna.ats.arjuna.coordinator.*;
public class TestRecoveryModule {
    // Outcome requested on the command line; commit is the default.
    protected static boolean _commit = true;
    // When true, SimpleRecord exits the JVM during the commit phase.
    protected static boolean _crash = false;

    public static void main(String[] args) {
        // Read the command-line flags before driving the transaction.
        for (String arg : args) {
            if ("-commit".equals(arg))
                _commit = true;
            if ("-rollback".equals(arg))
                _commit = false;
            if ("-crash".equals(arg))
                _crash = true;
        }
        try {
            AtomicAction tx = new AtomicAction();
            tx.begin(); // Top level begin
            // enlist the participant
            tx.add(SimpleRecord.create());
            System.out.println("About to complete the transaction ");
            if (_commit)
                tx.commit(); // Top level commit
            else
                tx.abort(); // Top level rollback
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
        

The registered participant has the following behavior:

  • During the prepare phase, it writes a simple message, “I’m prepared”, to a well-known file on disk.

  • During the commit phase, it writes another message, “I’m committed”, to the same file used during prepare.

  • If it receives an abort message, it removes from the disk the file used for prepare if any.

  • If a crash has been decided for the test, then it crashes during the commit phase – the file remains with the message “I’m prepared”.

The main portion of the code illustrating this behavior is shown below.

Warning

Note that the location of the file, given in the variable filename, can be changed.

Example 1.6. SimpleRecord.java

package com.arjuna.demo.recoverymodule;


import com.arjuna.ats.arjuna.coordinator.*;
import java.io.File;
public class SimpleRecord extends AbstractRecord {
    public String filename = "c:/tmp/RecordState";
    public SimpleRecord() {
        System.out.println("Creating new resource");
    }
    public static AbstractRecord create() {
        return new SimpleRecord();
    }
    public int topLevelAbort() {
        try {
            File fd = new File(filename);
            if (fd.exists()) {
                if (fd.delete())
                    System.out.println("File Deleted");
            }
        } catch (Exception ex) {
            // ...
        }
        return TwoPhaseOutcome.FINISH_OK;
    }
    public int topLevelCommit() {
        if (TestRecoveryModule._crash)
            System.exit(0);
        try {
            java.io.FileOutputStream file = new java.io.FileOutputStream(
                    filename);
            java.io.PrintStream pfile = new java.io.PrintStream(
                    file);
            pfile.println("I'm Committed");
            file.close();
        } catch (java.io.IOException ex) {
            // ...
        }
        return TwoPhaseOutcome.FINISH_OK;
    }
    public int topLevelPrepare() {
        try {
            java.io.FileOutputStream file = new java.io.FileOutputStream(
                    filename);
            java.io.PrintStream pfile = new java.io.PrintStream(
                    file);
            pfile.println("I'm prepared");
            file.close();
        } catch (java.io.IOException ex) {
            // ...
        }
        return TwoPhaseOutcome.PREPARE_OK;
    }
    // ...
}
        

The role of the Recovery Module in this application is to read the content of the file used to store the status of the participant, to determine that status, and to print a message indicating whether a recovery action is needed.

Example 1.7. SimpleRecoveryModule.java

package com.arjuna.demo.recoverymodule;


import com.arjuna.ats.arjuna.recovery.RecoveryModule;
public class SimpleRecoveryModule implements RecoveryModule {
    public String filename = "c:/tmp/RecordState";
    public SimpleRecoveryModule() {
        System.out
                .println("The SimpleRecoveryModule is loaded");
    }
    public void periodicWorkFirstPass() {
        try {
            java.io.FileInputStream file = new java.io.FileInputStream(
                    filename);
            java.io.InputStreamReader input = new java.io.InputStreamReader(
                    file);
            java.io.BufferedReader reader = new java.io.BufferedReader(
                    input);
            String stringState = reader.readLine();
            // Null-safe comparison: readLine() returns null for an empty file
            if ("I'm prepared".equals(stringState))
                System.out
                        .println("The transaction is in the prepared state");
            file.close();
        } catch (java.io.IOException ex) {
            System.out.println("Nothing found on the Disk");
        }
    }
    public void periodicWorkSecondPass() {
        try {
            java.io.FileInputStream file = new java.io.FileInputStream(
                    filename);
            java.io.InputStreamReader input = new java.io.InputStreamReader(
                    file);
            java.io.BufferedReader reader = new java.io.BufferedReader(
                    input);
            String stringState = reader.readLine();
            // Null-safe comparison: readLine() returns null for an empty file
            if ("I'm prepared".equals(stringState)) {
                System.out
                        .println("The record is still in the prepared state");
                System.out.println(" Recovery is needed");
            } else if ("I'm Committed".equals(stringState)) {
                System.out
                        .println("The transaction has completed and committed");
            }
            file.close();
        } catch (java.io.IOException ex) {
            System.out.println("Nothing found on the Disk");
            System.out
                    .println("Either there was no transaction");
            System.out.println("or it has been rolled back");
        }
    }
}
        

The recovery module should now be deployed in order to be called by the Recovery Manager. To do so, we just need to add an entry in the config file for the extension:


Once started, the Recovery Manager will automatically load the listed Recovery modules.

Note

The source of the code can be retrieved under the trailmap directory of the Narayana installation.

As mentioned, the basic application presented above does not show the complete process to recover from failure; it was presented only to describe how to build a recovery module. For the OTS protocol, let’s consider how a recovery module that manages recovery of OTS resources can be configured.

To manage recovery in case of failure, the OTS specification defines a recovery protocol. A transaction’s participants that are in doubt can use the RecoveryCoordinator to determine the status of the transaction. According to that status, the participants can take the appropriate decision, either rolling back or committing. Asking the RecoveryCoordinator for the status consists of invoking its replay_completion operation.

For each in-doubt OTS Resource, it is well known which RecoveryCoordinator to invoke to determine the status of the transaction in which the Resource is involved: it is the RecoveryCoordinator returned during the Resource registration process. Retrieving this RecoveryCoordinator for each resource requires that it has been stored along with the other information describing the resource.

A recovery module dedicated to recovering OTS Resources could have the following behavior. When requested by the Recovery Manager on the first pass, it retrieves from the disk the list of resources that are in doubt. During the second pass, if the resources that were retrieved in the first pass still remain on the disk, they are considered candidates for recovery. The Recovery Module therefore retrieves, for each candidate, its associated RecoveryCoordinator and invokes the replay_completion operation, which returns the status of the transaction. According to the returned status, an appropriate action is taken (for instance, rolling back the resource if the status is aborted or inactive).
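The second-pass decision described above might be sketched as follows. Everything here is a simplified stand-in and an assumption for illustration: the Status and Action enums, the Coordinator interface (standing in for a stored RecoveryCoordinator and its replay_completion operation) and the mapping from status to action are hypothetical and do not use the CORBA OTS types or any Narayana class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the second pass of an OTS recovery module.
class OtsSecondPassSketch {
    // Simplified statuses; the real OTS Status has more values.
    enum Status { COMMITTED, ROLLED_BACK, NO_TRANSACTION, IN_PROGRESS }

    enum Action { COMMIT, ROLLBACK, RETRY_LATER }

    // Stand-in for the RecoveryCoordinator stored with each resource.
    interface Coordinator {
        Status replayCompletion();
    }

    // Assumed mapping: aborted or inactive means roll back, committed means
    // commit, anything else is left for a later recovery iteration.
    static Action decide(Status status) {
        switch (status) {
            case COMMITTED:
                return Action.COMMIT;
            case ROLLED_BACK:
            case NO_TRANSACTION:
                return Action.ROLLBACK;
            default:
                return Action.RETRY_LATER;
        }
    }

    // For each candidate resource, ask its coordinator and record the action.
    static Map<String, Action> secondPass(Map<String, Coordinator> candidates) {
        Map<String, Action> actions = new LinkedHashMap<>();
        for (Map.Entry<String, Coordinator> entry : candidates.entrySet()) {
            actions.put(entry.getKey(), decide(entry.getValue().replayCompletion()));
        }
        return actions;
    }

    public static void main(String[] args) {
        Map<String, Coordinator> candidates = new LinkedHashMap<>();
        candidates.put("resource-1", () -> Status.ROLLED_BACK);
        candidates.put("resource-2", () -> Status.IN_PROGRESS);
        System.out.println(secondPass(candidates));
    }
}
```

Keeping the decision in a separate method makes the status-to-action policy easy to inspect and change without touching the scan itself.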

Revision History
Revision 1, Tue Apr 13 2010, Tom Jenkinson
Initial creation of book by publican