
Chapter 7. Failure Recovery

7.1. Configuring the failure recovery subsystem for your ORB
7.2. JTS specific recovery
7.2.1. XA resource recovery
7.2.2. Recovery behavior
7.2.3. Expired entry removal
7.2.4. Recovery domains
7.3. Transaction status and replay_completion

The failure recovery subsystem of JBossTS ensures that the results of a transaction are applied consistently to all resources affected by the transaction, even if any of the application processes or the hardware hosting them crash or lose network connectivity. In the case of hardware crashes or network failures, recovery does not take place until the system or network is restored, but the original application does not need to be restarted. Recovery is handled by the Recovery Manager process. For recovery to take place, information about the transaction and the resources involved needs to survive the failure and be accessible afterward. This information is held in the ActionStore, which is part of the ObjectStore. If the ObjectStore is destroyed or modified, recovery may not be possible.

Until the recovery procedures are complete, resources affected by a transaction which was in progress at the time of the failure may be inaccessible. Database resources may report this as tables or rows held by in-doubt transactions. For TXOJ resources, an attempt to activate the Transactional Object, such as when trying to get a lock, fails.

Although some ORB-specific configuration is necessary to configure the ORB sub-system, the basic settings are ORB-independent. The configuration which applies to JBossTS is in the RecoveryManager-properties.xml file and the orportability-properties.xml file. Contents of each file are below.



These entries cause instances of the named classes to be loaded. The named classes then load the ORB-specific classes needed and perform other initialization. This enables failure recovery for transactions initiated by or involving applications using this property file. The default RecoveryManager-properties.xml and orportability-properties.xml files shipped with the distribution include these entries.

Important

Failure recovery is NOT supported with the JavaIDL ORB that is part of the JDK. Failure recovery is supported for JacORB only.

To disable recovery, remove or comment out the RecoveryEnablement line in the property file.

Recovery of XA resources accessed via JDBC is handled by the XARecoveryModule. This module includes both transaction-initiated and resource-initiated recovery.

Transaction-initiated recovery is automatic. The XARecoveryModule finds the JTA_ResourceRecord which needs recovery, using the two-pass mechanism described above. It then uses the normal recovery mechanisms to find the status of the transaction the resource was involved in, by running replay_completion on the RecoveryCoordinator for the transaction branch. Next, it creates or recreates the appropriate XAResource and issues commit or rollback on it as appropriate. The XAResource creation uses the same database name, username, password, and other information as the application.

Resource-initiated recovery must be specifically configured, by supplying the RecoveryManager with the appropriate information for it to interrogate all the XADataSources accessed by any JBossTS application. The access to each XADataSource is handled by a class that implements the com.arjuna.ats.jta.recovery.XAResourceRecovery interface. Instances of this class are dynamically loaded, as controlled by the property JTAEnvironmentBean.xaResourceRecoveryInstances.

The XARecoveryModule uses the XAResourceRecovery implementation to get an XAResource to the target datasource. On each invocation of periodicWorkSecondPass, the recovery module issues an XAResource.recover request. This request returns a list of the transaction identifiers that are known to the datasource and are in an in-doubt state. The list of in-doubt Xids is compared across successive invocations of periodicWorkSecondPass. Any Xid that appears in both lists, and for which no JTA_ResourceRecord is found by the intervening transaction-initiated recovery, is assumed to belong to a transaction involved in a crash before any JTA_ResourceRecord was written, and a rollback is issued for that transaction on the XAResource.

This double-scan mechanism is used because it is possible the Xid was obtained from the datasource just as the original application process was about to create the corresponding JTA_ResourceRecord. The interval between the scans should allow time for the record to be written unless the application crashes (and if it does, rollback is the right answer).
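The double-scan rule can be read as a small set operation. The sketch below is only an illustration of that rule, not the XARecoveryModule code: the class `DoubleScanSketch`, the method `rollbackCandidates`, and the use of plain strings in place of `Xid` objects are assumptions made for brevity.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of JBossTS: illustrates the double-scan rule only.
final class DoubleScanSketch {

    /**
     * Returns the Xids that were in doubt on both scans and have no matching
     * JTA_ResourceRecord. These are the ones resource-initiated recovery
     * would roll back.
     */
    static Set<String> rollbackCandidates(Set<String> previousScan,
                                          Set<String> currentScan,
                                          Set<String> xidsWithRecord) {
        Set<String> candidates = new TreeSet<>(previousScan);
        candidates.retainAll(currentScan);     // keep Xids seen in both passes
        candidates.removeAll(xidsWithRecord);  // drop Xids owned by transaction-initiated recovery
        return candidates;
    }
}
```

An Xid that appears only in the current scan is not rolled back yet. It is carried forward and becomes a candidate only if it is still in doubt, and still has no record, on the next pass.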

An XAResourceRecovery implementation class can contain all the information needed to perform recovery to a specific datasource. Alternatively, a single class can handle multiple datasources which have some similar features. The constructor of the implementation class must have an empty parameter list, because it is loaded dynamically. The interface includes an initialise method, which passes in further information as a string. The content of the string is taken from the property value that provides the class name. Everything after the first semicolon is passed as the value of the string. The XAResourceRecovery implementation class determines how to use the string.
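The split of the property value into a class name and an initialise string might look like the following. This is a sketch under stated assumptions, not the JBossTS loader: the class `RecoveryPropertySketch` and both method names are hypothetical, and returning `null` when no semicolon is present is a choice made here for illustration.

```java
// Hypothetical helper, not the JBossTS loader: shows how a property value splits.
final class RecoveryPropertySketch {

    /** Class name: everything before the first semicolon, or the whole value if there is none. */
    static String className(String propertyValue) {
        int separator = propertyValue.indexOf(';');
        return separator < 0 ? propertyValue : propertyValue.substring(0, separator);
    }

    /**
     * String handed to initialise: everything after the first semicolon.
     * Later semicolons are left in place for the implementation to interpret.
     * Returns null when there is no semicolon (an assumption of this sketch).
     */
    static String initialiseParameter(String propertyValue) {
        int separator = propertyValue.indexOf(';');
        return separator < 0 ? null : propertyValue.substring(separator + 1);
    }
}
```

For a made-up value such as `com.example.MyXARecovery;orders;2`, `className` yields `com.example.MyXARecovery` and `initialiseParameter` yields `orders;2`.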

An XAResourceRecovery implementation class, com.arjuna.ats.internal.jdbc.recovery.BasicXARecovery, supports resource-initiated recovery for any XADataSource. For this class, the string received in method initialise is assumed to contain the number of connections to recover, and the name of the properties file containing the dynamic class name, the database username, the database password and the database connection URL. The following example is for an Oracle 8.1.6 database accessed via the Sequelink 5.1 driver:

XAConnectionRecoveryEmpay=com.arjuna.ats.internal.jdbc.recovery.BasicXARecovery;2;OraRecoveryInfo
      

This implementation is only meant as an example, because it relies upon usernames and passwords appearing in plain text properties files. You can create your own implementations of XAConnectionRecovery. See the javadocs and the example com.arjuna.ats.internal.jdbc.recovery.BasicXARecovery.

Example 7.3. XAConnectionRecovery implementation



/*
 * Copyright (C) 2000, 2001,
 *
 * Hewlett-Packard,
 * Arjuna Labs,
 * Newcastle upon Tyne,
 * Tyne and Wear,
 * UK.
 *
 */
package com.arjuna.ats.internal.jdbc.recovery;
import com.arjuna.ats.jdbc.TransactionalDriver;
import com.arjuna.ats.jdbc.common.jdbcPropertyManager;
import com.arjuna.ats.jdbc.logging.jdbcLogger;
import com.arjuna.ats.internal.jdbc.*;
import com.arjuna.ats.jta.recovery.XAConnectionRecovery;
import com.arjuna.ats.arjuna.common.*;
import com.arjuna.common.util.logging.*;
import java.sql.*;
import javax.sql.*;
import javax.transaction.*;
import javax.transaction.xa.*;
import java.util.*;
import java.lang.NumberFormatException;
/**
 * This class implements the XAConnectionRecovery interface for XAResources.
 * The parameter supplied in setParameters can contain arbitrary information
 * necessary to initialise the class once created. In this instance it contains
 * the name of the property file in which the db connection information is
 * specified, as well as the number of connections that this file contains
 * information on (separated by ;).
 *
 * IMPORTANT: this is only an *example* of the sorts of things an
 * XAConnectionRecovery implementor could do. This implementation uses
 * a property file which is assumed to contain sufficient information to
 * recreate connections used during the normal run of an application so that
 * we can perform recovery on them. It is not recommended that information such
 * as user name and password appear in such a raw text format as it opens up
 * a potential security hole.
 *
 * The db parameters specified in the property file are assumed to be
 * in the format:
 *
 * DB_x_DatabaseURL=
 * DB_x_DatabaseUser=
 * DB_x_DatabasePassword=
 * DB_x_DatabaseDynamicClass=
 *
 * DB_JNDI_x_DatabaseURL= 
 * DB_JNDI_x_DatabaseUser= 
 * DB_JNDI_x_DatabasePassword= 
 *
 * where x is the number of the connection information.
 *
 * @since JTS 2.1.
 */
public class BasicXARecovery implements XAConnectionRecovery
{    
    /*
     * Some XAConnectionRecovery implementations will do their startup work
     * here, and then do little or nothing in setDetails. Since this one needs
     * to know dynamic class name, the constructor does nothing.
     */
    public BasicXARecovery () throws SQLException
    {
        numberOfConnections = 1;
        connectionIndex = 0;
        props = null;
    }
    /**
     * The recovery module will have chopped off this class name already.
     * The parameter should specify a property file from which the url,
     * user name, password, etc. can be read.
     */
    public boolean initialise (String parameter) throws SQLException
    {
        int breakPosition = parameter.indexOf(BREAKCHARACTER);
        String fileName = parameter;
        if (breakPosition != -1)
            {
                // end index is exclusive, so this keeps the whole file name
                fileName = parameter.substring(0, breakPosition);
                try
                    {
                        numberOfConnections = Integer.parseInt(parameter.substring(breakPosition +1));
                    }
                catch (NumberFormatException e)
                    {
                        //Produce a Warning Message
                        return false;
                    }
            }
        PropertyManager.addPropertiesFile(fileName);
        try
            {
                PropertyManager.loadProperties(true);
                props = PropertyManager.getProperties();
            }
        catch (Exception e)
            {
                //Produce a Warning Message 
                return false;
            }  
        return true;
    }    
    public synchronized XAConnection getConnection () throws SQLException
    {
        JDBC2RecoveryConnection conn = null;
        if (hasMoreConnections())
            {
                connectionIndex++;
                conn = getStandardConnection();
                if (conn == null)
                    conn = getJNDIConnection();
                if (conn == null)
                    {
                        //Produce a Warning message
                    }
            }
        return conn;
    }
    public synchronized boolean hasMoreConnections ()
    {
        if (connectionIndex == numberOfConnections)
            return false;
        else
            return true;
    }
    private final JDBC2RecoveryConnection getStandardConnection () throws SQLException
    {
        String number = new String(""+connectionIndex);
        String url = new String(dbTag+number+urlTag);
        String password = new String(dbTag+number+passwordTag);
        String user = new String(dbTag+number+userTag);
        String dynamicClass = new String(dbTag+number+dynamicClassTag);
        Properties dbProperties = new Properties();
        String theUser = props.getProperty(user);
        String thePassword = props.getProperty(password);
        if (theUser != null)
            {
                dbProperties.put(ArjunaJDBC2Driver.userName, theUser);
                dbProperties.put(ArjunaJDBC2Driver.password, thePassword);
                String dc = props.getProperty(dynamicClass);
                if (dc != null)
                    dbProperties.put(ArjunaJDBC2Driver.dynamicClass, dc);
                return new JDBC2RecoveryConnection(url, dbProperties);
            }
        else
            return null;
    }
    private final JDBC2RecoveryConnection getJNDIConnection () throws SQLException
    {
        String number = new String(""+connectionIndex);
        String url = new String(dbTag+jndiTag+number+urlTag);
        String password = new String(dbTag+jndiTag+number+passwordTag);
        String user = new String(dbTag+jndiTag+number+userTag);
        Properties dbProperties = new Properties();
        String theUser = props.getProperty(user);
        String thePassword = props.getProperty(password);
        if (theUser != null)
            {
                dbProperties.put(ArjunaJDBC2Driver.userName, theUser);
                dbProperties.put(ArjunaJDBC2Driver.password, thePassword);    
                return new JDBC2RecoveryConnection(url, dbProperties);
            }
        else
            return null;
    }
    private int        numberOfConnections;
    private int        connectionIndex;
    private Properties props;   
    private static final String dbTag = "DB_";
    private static final String urlTag = "_DatabaseURL";
    private static final String passwordTag = "_DatabasePassword";
    private static final String userTag = "_DatabaseUser";
    private static final String dynamicClassTag = "_DatabaseDynamicClass";
    private static final String jndiTag = "JNDI_";
    /*
     * Example:
     *
     * DB2_DatabaseURL=jdbc\:arjuna\:sequelink\://qa02\:20001
     * DB2_DatabaseUser=tester2
     * DB2_DatabasePassword=tester
     * DB2_DatabaseDynamicClass=
     *      com.arjuna.ats.internal.jdbc.drivers.sequelink_5_1 
     *
     * DB_JNDI_DatabaseURL=jdbc\:arjuna\:jndi
     * DB_JNDI_DatabaseUser=tester1
     * DB_JNDI_DatabasePassword=tester
     * DB_JNDI_DatabaseName=empay
     * DB_JNDI_Host=qa02
     * DB_JNDI_Port=20000
     */
    private static final char BREAKCHARACTER = ';';  // delimiter for parameters
}

Multiple recovery domains and resource-initiated recovery

XAResource.recover returns the list of all transactions that are in-doubt within the datasource. If multiple recovery domains are used with a single datasource, resource-initiated recovery sees transactions from other domains. Since it does not have a JTA_ResourceRecord available, it rolls back the transaction in the database if the Xid appears in successive recover calls. To suppress resource-initiated recovery, do not supply an XAConnectionRecovery property, or confine it to one recovery domain.

Property OTS_ISSUE_RECOVERY_ROLLBACK controls whether the RecoveryManager explicitly issues a rollback request when replay_completion asks for the status of a transaction that is unknown. According to the presume-abort mechanism used by OTS and JTS, the transaction can be assumed to have rolled back, and this is the response that is returned to the Resource, including a subordinate coordinator, in this case. The Resource should then apply that result to the underlying resources. However, it is also legitimate for the superior to issue a rollback, if OTS_ISSUE_RECOVERY_ROLLBACK is set to YES.

The OTS transaction identification mechanism makes it possible for a transaction coordinator to hold a Resource reference that will never be usable. This can occur in two cases:

In the first case, the RecoveryManager for the Resource ObjectStore eventually reconstructs a new Resource (with a different CORBA object reference, or IOR), and issues a replay_completion request containing the new Resource IOR. The RecoveryManager for the coordinator substitutes this in place of the original, useless one, and issues commit to the newly reconstructed Resource. The Resource has to have been in a commit state, or there would be no transaction intention list. Until the replay_completion is received, the RecoveryManager tries to send commit to its Resource reference. This fails with a CORBA System Exception. Which exception is raised depends on the ORB and other details.

In the second case, the Resource no longer exists. The RecoveryManager at the coordinator will never get through, and will receive System Exceptions forever.

The RecoveryManager cannot distinguish these two cases by any protocol mechanism. There is a perceptible cost in repeatedly attempting to send the commit to an inaccessible Resource. In particular, the timeouts involved will extend the recovery iteration time, and thus potentially leave resources inaccessible for longer.

To avoid this, the RecoveryManager only attempts to send commit to a Resource a limited number of times. After that, it considers the transaction assumed complete. It retains the information about the transaction, by changing the object type in the ActionStore, and if the Resource eventually does wake up and a replay_completion request is received, the RecoveryManager activates the transaction and issues the commit request to the new Resource IOR. The number of times the RecoveryManager attempts to issue commit as part of the periodic recovery is controlled by the property variable COMMITTED_TRANSACTION_RETRY_LIMIT, and defaults to 3.
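The retry rule can be modelled as a two-state machine. The sketch below is a simplified model, not RecoveryManager code: the class `CommitRetrySketch`, its `State` enum, and its method names are hypothetical, and it assumes the transaction becomes assumed complete on the attempt that reaches the limit.

```java
// Hypothetical model of the retry rule; class and method names are illustrative only.
final class CommitRetrySketch {

    enum State { RETRYING, ASSUMED_COMPLETE }

    private final int retryLimit;      // plays the role of COMMITTED_TRANSACTION_RETRY_LIMIT
    private int failedAttempts = 0;
    private State state = State.RETRYING;

    CommitRetrySketch(int retryLimit) {
        this.retryLimit = retryLimit;
    }

    /** Records one failed commit attempt made during a periodic recovery pass. */
    State recordFailedCommit() {
        if (state == State.RETRYING && ++failedAttempts >= retryLimit) {
            // The record is kept (under a different type), but commit is no longer sent.
            state = State.ASSUMED_COMPLETE;
        }
        return state;
    }

    /** A replay_completion from the rebuilt Resource reactivates the transaction. */
    State onReplayCompletion() {
        failedAttempts = 0;
        state = State.RETRYING;
        return state;
    }
}
```

With the default limit of 3, the model stays in `RETRYING` for the first two failures, moves to `ASSUMED_COMPLETE` on the third, and returns to `RETRYING` only when a replay_completion arrives.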

The operation of the recovery subsystem causes some entries to be made in the ObjectStore that are not removed in normal progress. The RecoveryManager has a facility for scanning for these and removing items that are very old. Scans and removals are performed by implementations of the com.arjuna.ats.arjuna.recovery.ExpiryScanner interface. Implementations of this interface are loaded by giving the class names as the value of the property RecoveryEnvironmentBean.expiryScannerClassNames. The RecoveryManager calls the scan method on each loaded ExpiryScanner implementation at an interval determined by the property RecoveryEnvironmentBean.expiryScanInterval. This value is given in hours, and defaults to 12. A property value of 0 disables any expiry scanning. If the value as supplied is positive, the first scan is performed when RecoveryManager starts. If the value is negative, the first scan is delayed until after the first interval, using the absolute value.
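The sign convention for the scan interval can be summarised in a few lines. This is a sketch of the rule as described in the text, not the RecoveryManager scheduler: the class `ExpiryScanScheduleSketch` and its methods are hypothetical.

```java
import java.util.OptionalLong;

// Hypothetical helper that interprets an expiryScanInterval value as described in the text.
final class ExpiryScanScheduleSketch {

    /** Hours to wait before the first scan, or empty when scanning is disabled. */
    static OptionalLong initialDelayHours(int expiryScanInterval) {
        if (expiryScanInterval == 0) {
            return OptionalLong.empty();                 // 0 disables expiry scanning
        }
        if (expiryScanInterval > 0) {
            return OptionalLong.of(0);                   // positive: scan at startup
        }
        return OptionalLong.of(Math.abs((long) expiryScanInterval)); // negative: wait one interval
    }

    /** Hours between scans; the sign only affects the first scan. */
    static long periodHours(int expiryScanInterval) {
        return Math.abs((long) expiryScanInterval);
    }
}
```

Under this reading, values of 12 and -12 both scan every 12 hours, and differ only in whether the first scan runs at startup or 12 hours later.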

There are two kinds of item that are scanned for expiry:

Contact items

One contact item is created by every application process that uses JBossTS. They contain the information that the RecoveryManager uses to determine if the process that initiated the transaction is still alive, and what the transaction status is. The expiry time for these is set by the property RecoveryEnvironmentBean.transactionStatusManagerExpiryTime, which is expressed in hours. The default is 12, and 0 suppresses the expiration. This is the interval after which a process that cannot be contacted is considered to be dead. It should be long enough to avoid accidentally removing valid entries due to short-lived transient errors such as network downtime.

Assumed complete transactions

The expiry time is counted from when the transactions were assumed to be complete. A replay_completion request resets the clock. The risk with removing assumed complete transactions is that a prolonged communication outage means that a remote Resource cannot connect to the RecoveryManager for the parent transaction. If the assumed complete transaction entry is expired before the communications are recovered, the eventual replay_completion will find no information and the Resource will be rolled back, although the transaction committed. Consequently, the expiry time for assumed complete transactions should be set to a value that exceeds any anticipated network outage. The parameter is ASSUMED_COMPLETE_EXPIRY_TIME. It is expressed in hours, with 240 being the default, and 0 meaning never to expire.
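The expiry rule for an assumed complete entry, including the clock reset, can be sketched as follows. The class `AssumedCompleteExpirySketch` and its methods are hypothetical and model only the timing rule described above, not the ExpiryScanner implementations.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of one assumed complete entry and its expiry clock.
final class AssumedCompleteExpirySketch {

    private final Duration expiry;   // plays the role of ASSUMED_COMPLETE_EXPIRY_TIME; zero means never
    private Instant clockStart;      // when the entry was assumed complete, or last reset

    AssumedCompleteExpirySketch(int expiryHours, Instant assumedCompleteAt) {
        this.expiry = Duration.ofHours(expiryHours);
        this.clockStart = assumedCompleteAt;
    }

    /** A replay_completion request restarts the expiry clock. */
    void onReplayCompletion(Instant now) {
        clockStart = now;
    }

    /** True once the full expiry period has elapsed since the clock was last started. */
    boolean isExpired(Instant now) {
        if (expiry.isZero()) {
            return false;            // 0 means the entry never expires
        }
        return !now.isBefore(clockStart.plus(expiry));
    }
}
```

The trade-off in the text shows up directly here: a short expiry period frees ObjectStore entries sooner, but an outage longer than that period leaves a late replay_completion with nothing to find.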


There are two ExpiryScanners for the assumed complete transactions, because there are different types in the ActionStore.

When a transaction successfully commits, the transaction log is removed from the system. The log is no longer required, since all registered Resources have responded successfully to the two-phase commit sequence. However, if a Resource calls replay_completion on the RecoveryCoordinator after the transaction it represents commits, the status returned is StatusRolledback. The transaction system does not keep a record of committed transactions, and assumes that in the absence of a transaction log, the transaction must have rolled back. This is in line with the presumed abort protocol used by the OTS.
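The presumed abort behavior can be illustrated with a lookup that defaults to rolled back. This is a minimal sketch, not the RecoveryCoordinator implementation: the class `PresumedAbortSketch`, its `Status` enum (a stand-in for the OTS status values), and the in-memory map used in place of the transaction log are all assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model: an in-memory map stands in for the transaction log.
final class PresumedAbortSketch {

    enum Status { PREPARED, COMMITTING, ROLLED_BACK }   // stand-in for OTS status values

    private final Map<String, Status> logs = new ConcurrentHashMap<>();

    void writeLog(String transactionId, Status status) {
        logs.put(transactionId, status);
    }

    /** Called once every Resource has acknowledged the commit. */
    void removeLog(String transactionId) {
        logs.remove(transactionId);
    }

    /** With no log present, the transaction is presumed to have rolled back. */
    Status replayCompletion(String transactionId) {
        return logs.getOrDefault(transactionId, Status.ROLLED_BACK);
    }
}
```

In this model, a transaction whose log was removed after a successful commit is indistinguishable from one that never existed, which is why a late replay_completion receives the rolled back status.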