Diagnose and clean an Event overloaded HP-SIM database

From Wiki-UX.info
Jump to: navigation, search

Abstract

HP System Insight Manager (SIM) database can be overloaded by a large number of managed host events. Signs of the problem are:

  1. Very slow response of the SIM server system pages that shows the Unclear Events.
  2. The mxdomainmgr java virtual machine may halt after experiencing "java.lang.OutOfMemoryError: Java heap space".

Increasing the allocated memory by the java process may temporary remedy the problem, but the problem will return in the near future.

Symptoms

1. The mxdomainmgr log file halts with Java heap space errors when accesing a host event web form on the web GUI. Be careful not to false identifed with other well know Java heap space errors, like problems running the Capacity Advisor capcollect and capreport process.

# cat /var/opt/mx/logs/mxdomainmgr.0.log 
...
11:50:50,046 ERROR [JIoEndpoint] Socket accept failed
java.lang.OutOfMemoryError: Java heap space
11:50:50,057 ERROR [Http11Protocol] Error reading request, ignored
java.lang.reflect.UndeclaredThrowableException
        at org.apache.commons.logging.impl.Log4jProxy.log(Log4jProxy.java:309)
        at org.apache.commons.logging.impl.Log4jProxy.error(Log4jProxy.java:283)
        at org.apache.commons.logging.impl.Log4JLogger.error(Log4JLogger.java:208)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:858)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:580)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
        at java.lang.Thread.run(Thread.java:595)
Caused by: java.lang.OutOfMemoryError: Java heap space
11:50:50,070 ERROR [[jsp]] Servlet.service() for servlet jsp threw exception
java.lang.OutOfMemoryError: Java heap space
11:50:50,071 ERROR [Message] java.io.IOException:
ClientAbortException:  java.net.SocketException: Broken pipe (errno:32)
        at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:358)
        at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:434)
        at org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:309)
        at org.apache.catalina.connector.OutputBuffer.flush(OutputBuffer.java:288)
        at org.apache.catalina.connector.CoyoteOutputStream.flush(CoyoteOutputStream.java:98)
        at sun.nio.cs.StreamEncoder$CharsetSE.implFlush(StreamEncoder.java:410)
        at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:152)
        at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:213)
        at java.io.BufferedWriter.flush(BufferedWriter.java:236)
        at org.apache.axis.Message.writeTo(Message.java:441)
        at org.apache.axis.transport.http.AxisServlet.sendResponse(AxisServlet.java:1018)
        at org.apache.axis.transport.http.AxisServlet.doPost(AxisServlet.java:895)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
        at org.apache.axis.transport.http.AxisServletBase.service(AxisServletBase.java:339)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
        at sun.reflect.GeneratedMethodAccessor487.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:244)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
        at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:276)
        at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:162)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:283)
        at org.apache.catalina.core.ApplicationFilterChain.access$000(ApplicationFilterChain.java:56)
        at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:189)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:185)
        at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
        at sun.reflect.GeneratedMethodAccessor486.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:244)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
        at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:276)
        at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:218)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:230)
        at org.apache.catalina.core.ApplicationFilterChain.access$000(ApplicationFilterChain.java:56)
        at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:189)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:185)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
        at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:179)
        at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
        at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:156)
        at org.apache.catalina.authenticator.SingleSignOn.invoke(SingleSignOn.java:393)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:241)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:580)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
        at java.lang.Thread.run(Thread.java:595)
Caused by: java.net.SocketException: Broken pipe (errno:32)
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:97)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:141)
        at com.sun.net.ssl.internal.ssl.OutputRecord.writeBuffer(OutputRecord.java:283)
        at com.sun.net.ssl.internal.ssl.OutputRecord.write(OutputRecord.java:272)
        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:666)
        at com.sun.net.ssl.internal.ssl.AppOutputStream.write(AppOutputStream.java:59)
        at org.apache.coyote.http11.InternalOutputBuffer$OutputStreamOutputBuffer.doWrite(InternalOutputBuffer.java:764)
        at org.apache.coyote.http11.filters.IdentityOutputFilter.doWrite(IdentityOutputFilter.java:127)
        at org.apache.coyote.http11.InternalOutputBuffer.doWrite(InternalOutputBuffer.java:570)
        at org.apache.coyote.Response.doWrite(Response.java:560)
        at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:353)
        ... 55 more
11:53:19,374 INFO  [Http11Protocol] Pausing Coyote HTTP/1.1 on http-280
11:53:19,375 INFO  [Http11Protocol] Pausing Coyote HTTP/1.1 on http-50000
11:53:19,375 INFO  [Http11Protocol] Pausing Coyote HTTP/1.1 on http-50001
11:53:19,375 INFO  [Http11Protocol] Pausing Coyote HTTP/1.1 on http-50002
11:53:26,197 INFO  [StandardService] Stopping service jboss.web
11:53:26,223 INFO  [Http11Protocol] Stopping Coyote HTTP/1.1 on http-280
11:53:26,224 INFO  [Http11Protocol] Stopping Coyote HTTP/1.1 on http-50000
11:53:26,226 INFO  [Http11Protocol] Stopping Coyote HTTP/1.1 on http-50001
11:53:26,227 INFO  [Http11Protocol] Stopping Coyote HTTP/1.1 on http-50002
11:53:48,151 INFO  [WebappClassLoader] Illegal access: this web application instance has been stopped already.  Could not load org/apache/axis/configuration/
...
11:55:19,595 INFO  [TransactionManagerService] Stopping recovery manager
11:55:32,713 WARN  [LocalJBossServerDomain] Destroying failed jboss.management.local:j2eeType=J2EEDomain,name=Manager
java.lang.OutOfMemoryError: Java heap space
Shutdown complete
Halting VM
  • Note that the Coyote HTTP/1.1 Connect grafully closes the SIM server.

2. Restarting the SIM server takes much longer than normally. This can be seen upon restaring the product:

# /sbin/init.d/hpsim start

# tail -f /var/opt/mx/logs/mxdomainmgr.0.log
10:55:45,544 INFO [Server] JBoss (MX MicroKernel) [4.2.0.GA (build: SVNTag=JBoss_4_2_0_GA date=200705111440)] Started in 15m:41s:195ms

Diagnose

Enable mxdomainmgr debugging. Use the following procedure:

1. Edit the /etc/opt/mx/config/mx.properties file and add:

MX_DEBUG_TRACE=ON 

2. Stop the mx daemons (if running):

# /sbin/init.d/hpsim stop

3. Verify that they have stopped:

# ps -ef | grep [m]x

If there are any of mxdomainmgr, mxdtf and mxagent process, use:

# mxstop 
# ps -ef | grep mx

  • Should not be processes for mxdomainmgr, mxdtf, or mxagent if there are, kill them.

4. Remove any previous debug log files:

# cd /var/opt/mx/logs
# rm *.dbg
# rm *.out 

5. Start the mx daemons:

# /sbin/init.d/hpsim start &

6. Monitor the /var/opt/mx/logs/mxdomainmgr.0.log file.

# tail -f /var/opt/mx/logs/mxdomainmgr.0.log

7. Check the log for the HPSIM_DEBUG entries related with the database check:

13:41:38,467 INFO  [HPSIM_DEBUG] Checking for possible database corruption...
13:41:38,579 INFO  [HPSIM_DEBUG]   Number of records in table 'devices': 139
13:41:43,186 INFO  [HPSIM_DEBUG]   Number of records in table 'notices': 2940772
13:41:43,212 INFO  [HPSIM_DEBUG]   Number of records in table 'tasks': 16
13:41:43,221 INFO  [HPSIM_DEBUG]   Number of records in table 'hpmxJob': 9
13:41:43,224 INFO  [HPSIM_DEBUG]   Number of records in table 'hpmxAutomationTaskResults': 0
13:41:43,243 INFO  [HPSIM_DEBUG]   Number of records in table 'hpmxQuery': 87
13:41:43,275 INFO  [HPSIM_DEBUG]   Number of records in table 'collections': 48
13:41:43,285 INFO  [HPSIM_DEBUG]   Number of records in table 'hpmxUser': 6
13:41:43,288 INFO  [HPSIM_DEBUG]   Number of records in table 'hpmxAuthorization': 21
13:41:43,301 INFO  [HPSIM_DEBUG]   Number of records in table 'consolidatedNodeAuths': 843
13:42:12,770 INFO  [HPSIM_DEBUG] EXCEPTION: org.postgresql.util.PSQLException: ERROR: could not open relation "noticexmldata": No such file or directory

13:42:12,771 INFO  [HPSIM_DEBUG] select count(*) from noticeXMLData where  not exists ( select noticeId from notices where notices.noticeId = noticeXMLData.noticeId)
13:42:12,771 INFO  [HPSIM_DEBUG] 
13:42:32,952 INFO  [HPSIM_DEBUG] Completed database checking with no errors.
  • Note that number of devices and notices (Event notifications). Put attention to a high number of notifications. Divide the notices between the number of devices to have a gross idea of the average number of notices per device. In this example, there are almost 3 million of notices in the database, so there is not doubt there is a problem, but in general terms, more than 50 events per device is a clear sign of a problem.

Solution

1. Use the mxexec command to manually delete all all the events in the SIM database.

# /opt/mx/bin/mxexec -t "Delete Events" -q "All Events"
Running tool Delete Events with job id 161098.
Task Name       :defRunNowTaskId_1241395905778_2
Job ID          :161098
Tool Name       :Delete Events
Job State       :Complete
User Name       :root
Execute As User :root
Start Time      :Sunday, May 3, 2009 5:11:49 PM CDT
End Time        :Sunday, May 3, 2009 7:23:32 PM CDT
Elapsed Time    :2 hours 11 minutes 43 seconds 54 milliseconds
  • Note that the clear event may take several hours depending on CPU and I/O Subsystem speed and the total number of event to remove. Several hours runs are not surprise. Use top or sar commands to monitor the server resources to have an idea if the task is been executed.

2. Disable debug: remove or comment out MX_DEBUG_TRACE=ON from the /etc/opt/mx/config/mx.properties file.

3. Stop and start the SIM server.

# /sbin/init.d/hpsim stop

# ps -ef | grep [m]x

# /sbin/init.d/hpsim start

Reference

Authors