Design-HA Design Detail

The basic premise is that we don't want the JON infrastructure to grind to a halt if the JON Server fails. To achieve this, we will run multiple servers, and come up with a mechanism for the agents to automatically fail over to other servers if one of them goes down.

server cloud

N number of servers running simultaneously (to support Y number of agents)
- servers need to be the same version
how does the server expose itself? - the "public route" to the server (single ip/port) that we expect the agents to try the server on will be written to the HA tables
all servers in the cloud are live, not hot/cold standbys
what if database fails? - do we detect this, and automatically put the servers into maintenance mode?
- if not, the fear here is that the agents will thrash because they attempt to fail over to other servers that can take their load...but no servers can take that load because they are all dead in the water with no available database to accept that data
  - handle db connection failures, trap them at the comm layer and turn them into "server is no good" for the agent, so that the agents can initiated fail over (but only after they back-off, try again, and reach their fail over threshold)
- but...maybe only this one server's connection to the db has been severed, and there exists a different route that the agent can find through one of its backups
server-side control
- use clustered quartz, which ensures only one server is doing some work at some point in time
  - so only one guy in the cloud periodically repartitions the agents according to heuristic
- SuspectAgent job in HA environment: might need modification because agents aren't down if some server can't contact it, an agent should only be suspect if NO server can contact it

linear scaling of agents

agent knows about one primary server, and (N-1) ordered backups (i.e., the agent knows about every server)
- info is persisted in db, in new tables that track HA metadata
fail over support
- initiated when server goes down (crash, shutdown, loss of connectivity [after exponential back-off threshold is met], entering a "hard" maintenance mode)
- agent attaches to next in its list, if available, and so on...
  - if agent gets to the end of the list, it waits for a period, and then starts again with the first server in the list, i.e. agents never just give up
- agent will not attach to servers that are not ready (next server in the list might also be sick)
maybe not fail over so quickly? - concern is thrashing
- - exponentially back-off the sending interval when the server appears to be having problems accepting load (as opposed to being down completely [which would immediately trigger a fail over to begin]) before attempting to fail over (use threshold to determine when fail over should really occur)
  - linear sliding back to a quick sending interval (no faster than basal period of 1 min) – if we starting sending again too quickly, then we might bombard a server with backlogged data from a temporary network blip.

server-agent comm

data flow
- each server is responsible for full bidirectional communication with currently attached agents
- however, any server can talk to any agent because the clustered quartz job request can trigger on a server that isn't some agent's current primary - but that's OK because requests should be small
  - the agent will send any results back up through that agent's primary, not through the server it got the request from - we want the possibly large response to be spread across servers not centralized back through the server getting the quartz job triggers

reliable messaging considerations

agent->server

agent will queue data up until it connects to some primary, and then send to that new primary - make sure that reliable data is only guaranteed to be sent, not sent to a particular endpoint

server->agent

some server-side operation is performed which is set up to use reliable messaging but the agent is down to the agent - if the agent comes back up and connects to a different server than the one that initiated this reliable message, how does the agent ever get it? will the message be stale by the time it gets it?

option 1: RHQ-292, move the persistent fifo into the database instead of server-local filesystem
option 2: analyze whether the message really needs to be sent reliably at all
- for example, now that we have a more sophisticated, bi-directional inventory synchronization process, the agent should be able to tell that a resource was uninventoried on the server-side, so those resources would be removed from the agent-side resource tree altogether and (unless the resource was physically deleted) be rediscovered soon in the NEW state

Design Decision

The decision is to remove use of server->agent reliable messaging and to minimally add necessary comments to indicate in the code that the @Asynchronous( guaranteedDelivery="true") annotation should not be used for AgentService services (i.e. Services defined for Server->Agent communication, org.rhq.core.clientapi.agent.*.*AgentService.java). This annotation will be removed in:

DiscoveryAgentService.synchronizeInventory() : notification of newly committed resources in AD portlet
DiscoveryAgentService.removeResource() : notification of uninventoried resources
In both of these cases the new, more robust synchronization algorithms for 1.1 will ensure proper synch on agent startup regardless of the delivery of these messages.

A third scenario discussed was notification of updated (measurement) schedules. It turns out that reliable messaging was not in place for these updates and the agent must be up for the server-side update to succeed. So, it was a non-issue. If this behavior needed to change it could be handled by adding a getLatestSchedulesForResourceId for modified resources (see InventoryManager.synchInventory), or if that is too coarse, a new update time specific to schedule update.

Or, in the case above, or in general, if we need to re-introduce reliable server->agent messaging we can revisit the options listed above, particularly RHQ-292.

distribution protocol

use clustered quartz job so that only one server in the cloud executes the repartition job at a time, on a periodic basis, and according to heuristic
if a new server joins the cloud, the distribution protocol runs immediately to repartition the network to take advantage of the new resource
- by "join" we mean either a previously unknown server or a known server coming (back) online
if an agent joins / register, we examine the current topology and make the best choice for how to generate this agent's server-list, in other words we do not need to repartition the entire network
- also, it's possible that multiple agents are trying to register simultaneously - this is ok. the window for this to happen is small to begin with, but even if a non-optimal choice is made at this point (because multiple servers are generating the server-lists based off the same information, as opposed to only letting one of them do it at a time), the agents will rebalance when the quartz job run and re-examines the topology as a whole

distribution algorithm

algorithm needs to be computationally efficient (approaching O(1) time) to sustain anticipated 10K agent / 100 server setup
- algorithm can be a weighted sum of: # agents, # metrics / min, time it takes to process X metrics (relative processing #s), # cores
- # cores will be exposed in the HA admin console as relative weights for each server, by default all servers will have 1 weight / core
need to implement one (or more) thresholds so that minor changes to measurement schedules / # agents do not cause unnecessary thrashing (i.e., don't repartition unless the average difference in load is greater than some threshold value)

distribution algorithm - datacenter considerations

given multiple datacenters, is it better to have agents talk to a local server remoted to the db, or to go to a remote server that is locally connected to the db
agents will need affinity to particular servers to provide best performance given wide-scale infrastructure
- need AffinityGroups so that agents prefer certain servers over others, i.e. for within a data center, to deal with geographic and multi-DC issues
  - however, there shouldn't be complete isolation - agent should tend to fail over to servers in their affinity group, but there will need to be a distribution of those that fail over to servers outside of that group (think of 100 agents and 2 servers in one DC, 1 of those servers fail, we don't want 100:1 ratio after the fail over completes)
- use HA admin console to setup AffinityGroups
- agents are put in NO affinity group by default, must be specified during setup (or overridden via HA admin cosole)
- expose AffinityGroup as an agent resource configuration, so we can use group-wide resource configuration updates to push these changes down

clustered server data

alerts cache
- doesn't need to be clustered, because all in-progress data (such as dampening events) are persisted
- so we can blindly have the agent data follow the distribution protocol, i.e. when a new agent connects to the server, alerts cache pulls in definitions relevant to that agent
web session repl - not in 1.1.0
- can be clustered, but not much benefit - the only things that use the session today are multi-page UI flows, nothing critical

segmented / partitioned data

alerts cache
- server loads data at startup by inspecting HA table and noticing which agents this server is marked as a primary for (as opposed to today where it loads all alert conditions for all alert definitions as well as creates OOBs for all measurement baselines for all measurement schedules)
- also, server loads data at agent connection time (if this is a new agent in the infrastructure or if the agent is failing over from a different server), and blocks the agent from sending data until it is finished warming up this cache data
  - loads alert conditions for alert definitions on resources that are managed by this agent
  - creates OOB conditions for measurement baselines for measurement schedules on resources that are managed by this agent
- implementation changes
  - today, we write-lock the entire cache when it loads, and write lock certain portion of the cache when it is updated
  - tomorrow, do we even need locking anymore?
    - yes, because we have a dual-loading model: some data at startup, and some at agent connection time
    - no need to change the locking model until performance becomes an issue
      - we could lock on a per-resource level, but this is probably too granular and could be overheard if we tried to keep thousands and thousands of locks for the thousands of resources that would be amnaged by our handful of agents
      - we could lock on a per-agent level, seems a healthy balance of # locks and improved performance...but, again, unnecessary until we find that locking the entire cache for the infrequent updates caused by changing the inventory hierarchy or alert definition conditions is really necessary
- cleanup
  - if an agent thinks a server has crashed when it really hasn't, the agent will fail over to another server
  - the fail over process, when the agent connects to its secondary, will update the HA tracking tables in the db
  - need a way to tell the former server that it no longer has to manage that agent (because we want specific servers to handle certain quartz jobs) / to clear the stale cache data out of a server that no longer manages that agent
    - have each server periodically check the HA table, determine whether it is really still managing this agent, and expunge stale data for agents whose primary is not itself

agent registration

even though there are many servers in the cloud, an agent still chooses to register with a single server - it doesn't matter which one
server receives registration request, and computes fail-over list for that single agent only
- no need to compute all lists for all agents here, because if each server does this work each time a new agent is imported, the load can be incrementally balanced by the distribution algorithm - and if imbalanced are introduced, they will be eliminated the next time the quartz job runs to do a full network analyze and possible large-scale repartition
agent receives this fail over list and begins the "connection phase" to the 1st items in its server-list
- yes, this means that an agent might be immediately redirected to communicate with some other server in the cloud, not the one is just registered with - everything is driven now from the server-list that each agent has
- this makes setting up agents en masse much simpler because a single agent-configuration.xml file can be used across ALL agents - i.e., they all initially connect to the same server for registration - but then they get dispersed across the network to other servers for all other communication according to distribution algorithm

agent connection

use server list to find proper server
agent will send a blocking request to the server to connect, and server must return a positive connection response
- this synchronous / blocking call is a natural way of allowing the agent to wait until the server has performed activities required before it is ready for this agent's data load
  - server will warning up caches
  - server will look at HA db tables, and attempt to contact the agent's previous server (under the theory that the agent-previousServer connection was lost, but that the currentServer-previousServer connect might still work
    - if the previousServer is still up, this server will send it an async message telling it to discard cached data for the agent it has now lost; otherwise, the currentServer is ready to continue other processing for this newly connecting agent
- if the server crashes during this time, that's OK - the agent will detect this (because the comm layer will report an exception on the remote endpoint), and the agent can use its backup list to fail over to its secondary, tertiary, etc
agent restart - when it starts, it connects to the primary server in its list (again, server-list drives the agent->server comm now)

datacenter support

identification
- servers need to be identified as being in a particular, named datacenter
- agents need to be identified as being in a particular, named datacenter
installation / registration
- when a server is installed, it needs to identify what datacenter it belongs to, otherwise it gets put into the named, unknown datacenter
- when an agent registers with some server in the cloud, it initially gets put into the datacenter that the server belongs to (this is only for ease of installation, but will be configurable via the HA admin console)
administration
- datacenter membership information needs to be represented in the HA administrative console, and RHQ admins need to be able to change this group membership information easily
distribution algorithm
- prefer to connect to servers in your own datacenter first, secondary based on load; if necessary, agents can still fail over to another datacenter – should "affinity to a particular datacenter" versus "agent load distribution" be configurable parameters so that admins with different networks can have different affinities accordingly?

ease of installation

if no db exist yet, install it cause i'm the first guy
if old db, upgrade it cause i'm the first guy (servers started up later will already see that the db is upgraded)
all servers need to inject themselves into the db now, so we need to extend the schema to track these
datacenter considerations
- don't require user to fill in AffinityGroup info for simple, single-server installations
  - just assume the default, unnamed DC so that all server are put into the same DC
- require user to fill in AffinityGroup information for the agent
  - if not, assume NO affinity group

ease of upgrade

if agents use their server list to reconfigure themselves, then they will attempt to do so during an upgrade procedure as well (since the servers go down for the upgrade), so we need to support a "maintenance mode", wherein agents will temporarily suspend their HA logic so as to prevent thrashing
- use our concurrency framework and set things to "zero"
- perhaps "flip the bit" on the underlying invoker framework?
develop "maintenance mode" mechanism?
- hard maintenance mode: used primarily for testing purposes, it simulates a server going down without the server actually going down (something like shutting down the low-level communication mechanism to make agents think this server has failed); this way, a toggle button within a centralized UI can be leveraged to more rapidly and more conveniently test the failover mechanism is working as intended
- soft maintenance mode: all about putting the agents not servers into a maintenance mode, so that proactive (scheduled ops, etc) and reactive (alerting, etc) elements are disabled for that agent; this is useful for when you know you will be performing maintenance on some managed server and you want to suppress non-monitoring effects you have registered against resources managed by that agent
add proto vers identifier in the message to connect
- version string must match on the server and agent side, otherwise comm is disallowed
- if any side is missing the version id (pre 1.1 instance), communication is disallowed

ease of upgrade - agent auto-upgrades?

do agent continue collecting data if put into maintenance mode for server upgrade?
- if so, need better mechanism for upgrading agent (maybe agent auto-upgrade, or at the least laying down new agent on top of the old one)
can we lay a new agent on top of the old one?
- well, only if the domain hasn't changed because of the data spooled until the new server is up, assuming we don't want to support an agent-side migration process for data
methods
- agent auto-grade - agent knows how to update itself, perhaps getting the new package from the new server?
- backwards compatibility - old agents can talk to new server (but model changes make this difficult without a migration process / protocol translation)