JBoss Community Archive (Read Only)

RHQ

HA Testing

The following testing was done during roughly the first two weeks of October 2008.

Geography #1 - Atlanta
Geography #2 - Boston

Testing was done in two parts. The first part used a small cloud of 2, and then 3, servers, all in the same subnet in geography #1 along with the Oracle database. About 100 agents were started via the agentspawn performance tool. No affinity groups were defined. I put one server in maintenance mode (MM), leaving the other in Normal mode, and watched all agents switch over such that the agent count for the server in MM dropped to 0 within 1 minute. THIS IS VERY IMPORTANT TO TEST! If that number does not drop to 0 (and instead remains at some number N greater than 0), it means N agents failed to send their connectAgent message to their new server. This assumes all agents were running at the time the server went into MM. I then switched the servers' modes: the MM server went to Normal and the Normal server went to MM. Again, all agents switched over.
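The "count must reach 0 within 1 minute" check can be expressed as a small polling loop. This is only a sketch: `wait_for_drain` and the `get_agent_count` callable are hypothetical names, and how the count is actually obtained (for example, by reading the HA Servers list page) is left to the caller. The 5-second poll interval is likewise an assumption.

```python
import time
from typing import Callable


def wait_for_drain(get_agent_count: Callable[[], int],
                   timeout_s: float = 60.0,
                   poll_s: float = 5.0,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll until the MM server's agent count reaches 0 or the timeout expires.

    get_agent_count is a hypothetical callable supplied by the caller.
    Returns the last observed count: 0 means every agent failed over;
    N > 0 means N agents are still listed against the MM server.
    """
    deadline = clock() + timeout_s
    count = get_agent_count()
    while count > 0 and clock() < deadline:
        sleep(poll_s)
        count = get_agent_count()
    return count
```

A non-zero return value corresponds to the N agents described above that did not complete their connectAgent call to a new server. The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.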

The second part used two subnets in different geographies. There were ~230 agents, about two thirds in geo #2 and the remaining third in geo #1 (where the Oracle DB was). I did the same MM/Normal switch and saw the agents fail over properly. There were three servers initially. I shut down all three servers and kept them down for several hours while the agents were running. I then restarted the servers, and the agents reconnected properly and spread out to their respective primary servers. Next, I added a fourth server in geo #2 to the cloud, and put two servers in geo #2 in affinity group A and two servers (one in geo #2 and one in geo #1) in affinity group B. All agents running in geo #1 went into affinity group B. The rest of the agents, running in geo #2, were spread out so that both affinity groups had roughly the same number of agents.

In the order shown on the HA Servers list page, the agent counts were 58/57/55/58; the first two servers are in affinity group A and the last two are in affinity group B. I put server 2 and server 3 in MM. After a few minutes, the counts went to 115/0/0/113, as expected: each MM server's agents moved to the remaining server in the same affinity group (58 + 57 = 115 and 55 + 58 = 113).
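The expected counts can be worked out mechanically. The sketch below is not RHQ code; `expected_counts` is a hypothetical helper that assumes agents from an MM server stay within their affinity group and are split evenly across the group's remaining Normal servers. The test above only exercised the case of one remaining server per group, so the even split is an assumption. Servers are indexed from 0, so "server 2 and server 3" are indexes 1 and 2.

```python
def expected_counts(counts, groups, in_mm):
    """Return expected per-server agent counts once the MM servers have drained.

    counts: current agent count per server, in list-page order
    groups: affinity group label per server, same order
    in_mm:  set of 0-based indexes of servers placed in maintenance mode
    """
    # MM servers end at 0; Normal servers keep their own agents.
    result = [0 if i in in_mm else c for i, c in enumerate(counts)]
    for group in set(groups):
        members = [i for i, g in enumerate(groups) if g == group]
        survivors = [i for i in members if i not in in_mm]
        displaced = sum(counts[i] for i in members if i in in_mm)
        if not survivors:
            # No Normal server left in the group; failover outside the
            # group is not modelled in this sketch.
            continue
        share, extra = divmod(displaced, len(survivors))
        for k, i in enumerate(survivors):
            result[i] += share + (1 if k < extra else 0)
    return result


print(expected_counts([58, 57, 55, 58], ["A", "A", "B", "B"], {1, 2}))
# prints [115, 0, 0, 113]
```

The total (228 agents) is unchanged before and after, which is a quick sanity check that no agent was left without a server.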

Alert Cache

Joe added about 11K alert conditions: 4K are disabled after firing, 1K can re-enable those 4K, all but a handful use complex dampening rules (which affect out-of-band processing performance), and ALL of them are measurement-based and/or event-based (the two subsystems with the highest traffic).

JBoss.org Content Archive (Read Only), exported from JBoss Community Documentation Editor at 2020-03-11 12:40:18 UTC, last content change 2008-10-27 16:06:53 UTC.