RHQ uses the Sigar native library to gather information about systems and processes. This page lists important things to know about Sigar and the RHQ objects which work with it.
Sigar
Known Issues
getProcState may return wrong value on consecutive calls
Once a process has died, calling the getProcState method twice on the same Sigar instance within two seconds returns the last ProcState value known to Sigar (from when the process was still alive). See http://communities.vmware.com/message/2187972
As a workaround, RHQ ProcessInfo#refresh internally calls Sigar only if the previous refresh reported the process as running.
getProcCredName may fail to resolve process owner name
So far, the problem has only been found on a 64-bit RHEL system, for a user defined in an external database (LDAP). Hyperic has also detected it (system type and version not provided). Solaris platforms may also be affected. See https://jira.hyperic.com/browse/SIGAR-231
Possible consequences
1. Failure to discover/connect to JMX servers
Workaround: enable JMX remoting to avoid using Sun's attach API
Many external plugins depend on the JMX plugin discovery feature (Infinispan, for instance)
2. Failure to discover AS7 instances
Workaround: run AS7 instances with same user and group as RHQ agent
3. Failure when checking Hadoop Server availability
Workaround: disable events on Hadoop Server resource
Problem details
Here is an excerpt of Sigar source:
sigar_format.c excerpt
    /* sysconf(_SC_GET{PW,GR}_R_SIZE_MAX) */
    #define R_SIZE_MAX 1024

    int sigar_user_name_get(sigar_t *sigar, int uid, char *buf, int buflen)
    {
        struct passwd *pw = NULL;
        /* XXX cache lookup */

    # ifdef HAVE_GETPWUID_R
        struct passwd pwbuf;
        char buffer[R_SIZE_MAX];

        if (getpwuid_r(uid, &pwbuf, buffer, sizeof(buffer), &pw) != 0) {
            return errno;
        }
        if (!pw) {
            return ENOENT;
        }
    # else
        if ((pw = getpwuid(uid)) == NULL) {
            return errno;
        }
    # endif

        strncpy(buf, pw->pw_name, buflen);
        buf[buflen-1] = '\0';

        return SIGAR_OK;
    }
getpwuid_r returns ERANGE (numerical result out of range), indicating that the supplied buffer was not large enough.
The Hyperic fix consists of increasing the value of R_SIZE_MAX to 2048. The RHQ team has tested the fix and it solves our particular case. Still, it is not understood why the unpatched code, extracted into a simple C test case, works:
Extracted C test case
    #include <stdio.h>
    #include <stdlib.h>
    #include <errno.h>
    #include <pwd.h>
    #include <grp.h>

    #define R_SIZE_MAX 1024

    int main(void) {
        struct passwd *pw = NULL;
        struct passwd pwbuf;
        char buffer[R_SIZE_MAX];

        // 600 is a userid
        if (getpwuid_r(600, &pwbuf, buffer, sizeof(buffer), &pw) != 0) {
            return errno;
        }
        if (!pw) {
            return ENOENT;
        }
        puts (pw->pw_name);
        return EXIT_SUCCESS;
    }
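To check whether the buffer size alone explains the difference, a variant of the test case can report the value returned by getpwuid_r itself (POSIX specifies that the function returns the error number) for several buffer sizes. The following is only a minimal sketch: the helper name probe_getpwuid_r is hypothetical and is not part of Sigar or RHQ.

```c
#include <errno.h>
#include <pwd.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of Sigar): call getpwuid_r once with a heap
 * buffer of exactly buflen bytes and return the result code of that call.
 * Returns ENOMEM if the buffer cannot be allocated, and ENOENT if the call
 * succeeds but no passwd entry matches uid.
 */
int probe_getpwuid_r(uid_t uid, size_t buflen)
{
    struct passwd pwbuf;
    struct passwd *pw = NULL;
    char *buffer = malloc(buflen);
    int rc;

    if (buffer == NULL) {
        return ENOMEM;
    }

    /* Use the return value, not errno, as the error indicator. */
    rc = getpwuid_r(uid, &pwbuf, buffer, buflen, &pw);
    if (rc == 0 && pw == NULL) {
        rc = ENOENT;
    }

    free(buffer);
    return rc;
}
```

Calling probe_getpwuid_r for the affected user id (600 in the test case above) with sizes such as 1024 and 2048, and printing strerror of the result, shows which sizes are rejected with ERANGE on a given system.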
A bug has been opened against glibc to find out why getpwuid_r may not consistently return ERANGE errors.
See http://sourceware.org/bugzilla/show_bug.cgi?id=15139
The bug report also discusses the appropriate way to call getpwuid_r. According to the glibc manual, this type of lookup function should be called in a loop, where an ERANGE error leads to reallocating the buffer with a larger size:
Example from the glibc manual
    struct hostent *
    gethostname (char *host)
    {
      struct hostent hostbuf, *hp;
      size_t hstbuflen;
      char *tmphstbuf;
      int res;
      int herr;

      hstbuflen = 1024;
      /* Allocate buffer, remember to free it to avoid memory leakage. */
      tmphstbuf = malloc (hstbuflen);

      while ((res = gethostbyname_r (host, &hostbuf, tmphstbuf, hstbuflen,
                                     &hp, &herr)) == ERANGE)
        {
          /* Enlarge the buffer. */
          hstbuflen *= 2;
          tmphstbuf = realloc (tmphstbuf, hstbuflen);
        }
      /* Check for errors. */
      if (res || hp == NULL)
        return NULL;
      return hp;
    }
// Copyright 2013 Free Software Foundation, Inc.
// Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.
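Applied to the user name lookup, the same retry pattern could look like the sketch below. This is not the Hyperic patch (which only raises R_SIZE_MAX) and not Sigar code: the function name user_name_lookup and the LOOKUP_BUF_MAX cap are assumptions made for illustration. The initial size comes from sysconf(_SC_GETPW_R_SIZE_MAX), with 1024 as a fallback when no hint is available.

```c
#include <errno.h>
#include <pwd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Upper bound on the scratch buffer; an arbitrary value for this sketch. */
#define LOOKUP_BUF_MAX (1024 * 1024)

/*
 * Hypothetical function (not part of Sigar): resolve uid to a user name,
 * doubling the scratch buffer while getpwuid_r reports ERANGE.
 * Returns 0 on success, ENOENT if no entry matches, or another error number.
 */
int user_name_lookup(uid_t uid, char *buf, size_t buflen)
{
    struct passwd pwbuf;
    struct passwd *pw = NULL;
    long hint = sysconf(_SC_GETPW_R_SIZE_MAX);
    size_t size = (hint > 0) ? (size_t) hint : 1024;
    char *scratch = NULL;
    int rc;

    if (buf == NULL || buflen == 0) {
        return EINVAL;
    }

    for (;;) {
        /* Grow (or first allocate) the scratch buffer, checking for failure. */
        char *grown = realloc(scratch, size);
        if (grown == NULL) {
            free(scratch);
            return ENOMEM;
        }
        scratch = grown;

        rc = getpwuid_r(uid, &pwbuf, scratch, size, &pw);
        if (rc != ERANGE || size >= LOOKUP_BUF_MAX) {
            break; /* success, a different error, or the cap was reached */
        }
        size *= 2;
    }

    if (rc == 0 && pw == NULL) {
        rc = ENOENT;
    }
    if (rc == 0) {
        /* pw->pw_name points into scratch, so copy it before freeing. */
        strncpy(buf, pw->pw_name, buflen);
        buf[buflen - 1] = '\0';
    }

    free(scratch);
    return rc;
}
```

Unlike a fixed-size stack buffer, this approach does not depend on guessing a size that is large enough for every directory backend, at the cost of a heap allocation per lookup.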
SigarAccess
SigarAccess is an RHQ utility class. It creates:
- a unique instance of Sigar
- a proxy to this instance, implementing the SigarProxy interface
- an invocation handler which serializes calls to the shared Sigar instance
Any RHQ plugin class which needs system or process information should get the SigarProxy from SigarAccess (SigarAccess.getSigar method). This guarantees that the RHQ agent will not waste or leak resources and that two threads will not call the same Sigar instance concurrently.
The invocation handler uses a lock to serialize calls. If a thread waits more than sharedSigarLockMaxWait seconds, it is given a new Sigar instance, which is destroyed at the end of the call. Every 5 minutes, a background task checks whether localSigarInstancesWarningThreshold has been reached. If it has, a warning message is logged, optionally with a thread dump.
The invocation handler behavior is configurable with system properties:
- sharedSigarLockMaxWait: maximum time in seconds a thread will wait for the shared Sigar lock acquisition; defaults to 2 seconds
- localSigarInstancesWarningThreshold: threshold of currently living Sigar instances at which the background task will print warning messages; defaults to 50
- maxLocalSigarInstances: maximum number of local Sigar instances which can be created, zero and negative values being interpreted as 'no limit'; defaults to 50
- threadDumpOnlocalSigarInstancesWarningThreshold: if set to true (case insensitive), the background task will also log a thread dump when localSigarInstancesWarningThreshold is met
ProcessInfo
ProcessInfo encapsulates information about a known process and behaves like a cache which can be refreshed.
A few process properties (e.g. PID, command line) never change during the lifetime of the process and can be read directly with ProcessInfo accessors. Other process properties (e.g. state, CPU usage) vary, and their values are grouped in ProcessInfoSnapshot class instances.
New snapshots of the changing data are taken when calling the ProcessInfo.freshSnapshot or ProcessInfo.refresh methods.