From Wiki-UX.info

How to recover System Insight Manager database after file system corruption

Abstract

After the /var file system ran out of space (file system full), the System Insight Manager Postgres database did not start correctly.

The complete database can be recovered from a backup made by following the instructions in the whitepaper Backing up and restoring HP SIM 5.2 or greater data files in an HP-UX and Linux environment.

If this database-level backup is not available, the process described in this article can be used to try to restore database integrity. If the database cannot be cleanly recovered, the whole /var/opt/hpsmdb directory can be restored from a file system backup and the database cleaned again following the same procedure.

Vacuum System Insight Manager Postgres Database

1. Check for any running HPSIM process

# ps -ef | grep -e [m]x -e [h]psmdb
    root 11362     1  0 13:10:43 ?         4:38 /opt/mx/lbin/mxdomainmgr
  hpsmdb 11335     1  0 13:10:09 pts/ta    0:00 /opt/hpsmdb/pgsql/bin/postmaster -i -D /var/opt/hpsmdb/pgsql
  hpsmdb 11337 11335  0 13:10:09 pts/ta    0:00 postgres: stats buffer process
  hpsmdb 11338 11337  0 13:10:09 pts/ta    0:00 postgres: stats collector process	
  hpsmdb 11452 11335 95 13:12:35 pts/ta    2:23 postgres: mxadmin insight_v1_0 127.0.0.1 idle
  hpsmdb 11483 11335 99 13:13:44 pts/ta    3:12 postgres: mxadmin insight_v1_0 127.0.0.1 idle
  hpsmdb 11466 11335 72 13:12:37 pts/ta    3:08 postgres: mxadmin insight_v1_0 127.0.0.1 SELECT
    root 11361     1  0 13:10:43 ?         0:19 /opt/mx/lbin/mxdtf

2. Stop HP SIM. After mxstop returns the prompt, wait a couple of minutes and check again for running HP SIM processes.

# mxstop
# ps -ef | grep -e [m]x -e [h]psmdb
  hpsmdb 11335     1  0 13:10:09 pts/ta    0:00 /opt/hpsmdb/pgsql/bin/postmaster -i -D /var/opt/hpsmdb/pgsql
  hpsmdb 11337 11335  0 13:10:09 pts/ta    0:00 postgres: stats buffer process
  hpsmdb 11338 11337  0 13:10:09 pts/ta    0:00 postgres: stats collector process
NOTE: If hpsmdb is still running, continue with step 3; otherwise skip to step 4.

3. Manually stop the hpsmdb Postgres database.

# /sbin/init.d/hpsmdb stop
Stopping hpsmdb postgresql service:
 
# ps -ef | grep [h]psmdb
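
The "wait and check again" part of steps 2 and 3 can be automated with a small polling loop. The following is only a sketch: wait_for_exit is a hypothetical helper, not part of HP SIM, and the default timeout and interval are arbitrary assumptions.

```shell
# Sketch only: poll the process table until no process matches a pattern.
# wait_for_exit is a hypothetical helper, not an HP SIM command.

# wait_for_exit PATTERN [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Returns 0 once "ps -ef" shows no match, 1 if the timeout expires first.
# Write PATTERN with the bracket trick (for example '[h]psmdb') so that the
# grep process does not match its own command line.
wait_for_exit() {
    pattern=$1
    timeout=${2:-120}
    interval=${3:-5}
    waited=0
    while ps -ef | grep -e "$pattern" >/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

For example, wait_for_exit '[h]psmdb' 180 10 would check every 10 seconds for up to 3 minutes and return non-zero if hpsmdb processes are still present, which is the case where the database should not be touched yet.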

4. Make a raw backup of the SIM Postgres data directory, in case you need to return to the previous state. Write the backup file to a file system with enough free space.

# du -ks /var/opt/hpsmdb
82488   /var/opt/hpsmdb
# ( cd /var/opt; tar -cvf - ./hpsmdb | /usr/contrib/bin/gzip > /tmp/hpsmdb.tgz )
# ls -l /tmp/hpsmdb.tgz
-rw-r--r--   1 root       sys        13472802 Jan  5 13:54 /tmp/hpsmdb.tgz
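
Because this whole procedure started with a full file system, it may be worth checking the free space on the target before writing the backup, and verifying the archive afterwards. The helpers below are a minimal sketch with hypothetical names; they assume that df accepts the -P (POSIX output format) option and that gzip is in the PATH.

```shell
# Sketch only: check free space before the raw backup, then verify the archive.
# All function names are hypothetical; "df -P" support is an assumption.

# dir_size_kb DIR: size of DIR in KB, as reported by du -ks.
dir_size_kb() {
    du -ks "$1" | awk '{ print $1 }'
}

# free_kb DIR: available KB on the file system that holds DIR.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# fits NEEDED_KB AVAILABLE_KB: succeeds when an uncompressed copy would fit,
# a deliberately pessimistic bound for a gzip-compressed archive.
fits() {
    [ "$1" -lt "$2" ]
}

# verify_archive FILE: succeeds when FILE passes the gzip integrity test
# and tar can list its contents.
verify_archive() {
    gzip -t "$1" && gzip -dc "$1" | tar -tf - >/dev/null
}
```

A possible use is fits "$(dir_size_kb /var/opt/hpsmdb)" "$(free_kb /tmp)" before the tar command, and verify_archive /tmp/hpsmdb.tgz after it.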

5. Start the SIM Postgres database. Check for any error messages.

# /sbin/init.d/hpsmdb start
Starting hpsmdb postgresql service!
 
# ps -ef | grep [h]psmdb
  hpsmdb 12742 12740  0 13:40:13 pts/ta    0:00 postgres: stats buffer process
  hpsmdb 12740     1  0 13:40:12 pts/ta    0:00 /opt/hpsmdb/pgsql/bin/postmaster -i -D /var/opt/hpsmdb/pgsql
  hpsmdb 12743 12742  0 13:40:13 pts/ta    0:00 postgres: stats collector process

6. The password automatically generated by HP SIM for mxadmin must be changed before cleaning the database. To change the password, at the HP SIM command line enter the following:

# /opt/mx/bin/mxpassword -m -x MxDBUserPassword=<newpassword>

Where <newpassword> is the new password.

Example:

# /opt/mx/bin/mxpassword -m -x MxDBUserPassword=hpsmbd1
Password Key "MxDBUserPassword" modified
 
Please close this command window immediately to prevent any clear text passwords from being exposed to unintended eyes.


7. Use the vacuumdb command in full mode to check the database integrity. This step can take a long time on large databases. Provide the new password when requested. For example:

# cd /opt/hpsmdb/pgsql/bin
 
# ./vacuumdb -f -h 127.0.0.1 -U mxadmin -p 50006 insight_v1_0
Password: [hpsmbd1]
VACUUM
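
On a large database the vacuumdb output can scroll past unnoticed, so it can help to save it to a file and check it afterwards. The sketch below assumes the output was captured to a log file (the helper name and any log path are hypothetical) and that a failure is reported on a line containing "ERROR:", as in the integrity error quoted later in this article.

```shell
# Sketch only: decide whether a captured vacuumdb run finished cleanly.
# vacuum_ok is a hypothetical helper, not part of PostgreSQL or HP SIM.

# vacuum_ok LOGFILE: succeeds when the captured output contains the
# "VACUUM" completion line and no line containing "ERROR:".
vacuum_ok() {
    if grep "ERROR:" "$1" >/dev/null; then
        return 1
    fi
    grep -x "VACUUM" "$1" >/dev/null
}
```

If vacuum_ok fails, the matching lines can be shown with grep "ERROR:" on the same log file before deciding whether the Database Integrity errors procedure is needed.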

8. Start HP SIM Server.

# mxstart &

9. Check the startup log.

# tail -f /var/opt/mx/logs/mxdomainmgr.0.log
*** classPath=/opt/mx/j2re/lib/tools.jar:/opt/mx/jboss/bin/run.jar
17:16:53,273 INFO  [Server] Starting JBoss (MX MicroKernel)...
17:16:53,281 INFO  [Server] Release ID: JBoss [Trinity] 4.2.0.GA (build: SVNTag=JBoss_4_2_0_GA date=200705111440)
17:16:53,283 INFO  [Server] Home Dir: /opt/mx/jboss
17:16:53,283 INFO  [Server] Home URL: file:/opt/mx/jboss/
17:16:53,285 INFO  [Server] Patch URL: null
17:16:53,285 INFO  [Server] Server Name: hpsim
17:16:53,286 INFO  [Server] Server Home Dir: /opt/mx/jboss/server/hpsim
17:16:53,286 INFO  [Server] Server Home URL: file:/opt/mx/jboss/server/hpsim/
17:16:53,286 INFO  [Server] Server Log Dir: /opt/mx/jboss/server/hpsim/log
17:16:53,287 INFO  [Server] Server Temp Dir: /opt/mx/jboss/server/hpsim/tmp
17:16:53,288 INFO  [Server] Root Deployment Filename: jboss-service.xml
17:16:54,925 INFO  [ServerInfo] Java version: 1.5.0.12,Hewlett-Packard Co.
17:16:54,926 INFO  [ServerInfo] Java VM: Java HotSpot(TM) Server VM 1.5.0.12 jinteg:03.21.08-11:00 PA2.0 (aCC_AP),Hewlett-Packard Company
17:16:54,926 INFO  [ServerInfo] OS-System: HP-UX B.11.31,PA_RISC2.0
17:16:56,679 INFO  [Server] Core system initialized
17:17:06,844 INFO  [WebService] Using RMI server codebase: http://127.0.0.1:50013/
17:17:06,848 INFO  [Log4jService$URLWatchTimerTask] Configuring from URL: resource:jboss-log4j.xml
17:17:07,909 INFO  [TransactionManagerService] JBossTS Transaction Service (JTA version) - JBoss Inc.
17:17:07,909 INFO  [TransactionManagerService] Setting up property manager MBean and JMX layer
17:17:08,365 INFO  [TransactionManagerService] Starting recovery manager
17:17:09,107 INFO  [TransactionManagerService] Recovery manager started
17:17:09,108 INFO  [TransactionManagerService] Binding TransactionManager JNDI Reference
17:17:19,095 INFO  [EJB3Deployer] Starting java:comp multiplexer
17:18:37,783 INFO  [ServiceEndpointManager] jbossws-1.2.1.GA (build=200704151756)
17:18:40,667 INFO  [nodeDiscoveryComplete] Bound to JNDI name: topic/nodeDiscoveryComplete
17:18:40,669 INFO  [hpsim/nodeDatacollectionComplete] Bound to JNDI name: topic/hpsim/nodeDatacollectionComplete
17:18:40,672 INFO  [hpsim/core/users] Bound to JNDI name: topic/hpsim/core/users
17:18:40,674 INFO  [hpsim/core/node] Bound to JNDI name: topic/hpsim/core/node
17:18:40,675 INFO  [nodeCredentialChange] Bound to JNDI name: topic/nodeCredentialChange
17:18:40,677 INFO  [lifecycleindicationhealth] Bound to JNDI name: topic/lifecycleindicationhealth
17:18:40,679 INFO  [lifecycleindicationconfig] Bound to JNDI name: topic/lifecycleindicationconfig
17:18:40,681 INFO  [jobUpdate] Bound to JNDI name: topic/HPSIM.jobUpdate
17:18:45,234 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main) Checking for possible database corruption...
17:18:45,235 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'devices': 2
17:18:45,240 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'notices': 3
17:18:45,240 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'tasks': 25
17:18:45,241 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'hpmxJob': 5
17:18:45,246 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'hpmxAutomationTaskResults': 0
17:18:45,254 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'hpmxQuery': 76
17:18:45,258 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'collections': 29
17:18:45,259 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'hpmxUser': 6
17:18:45,263 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'hpmxAuthorization': 21
17:18:45,264 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'consolidatedNodeAuths': 21
17:18:45,662 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main) ERROR: TABLE SecurityNoticesData KEY noticeId may contain 1 entrie(s) not in table notices KEY noticeId
17:18:51,162 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main) Completed database checking with errors, CMS may not be able to run.
17:18:51,162 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main) Fixing possible database corruption...
17:18:51,396 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main) 1 entries fixed in TABLE SecurityNoticesData
17:18:56,808 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main) Checking for possible database corruption...
17:18:56,809 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'devices': 2
17:18:56,810 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'notices': 3
17:18:56,810 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'tasks': 25
17:18:56,811 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'hpmxJob': 5
17:18:56,811 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'hpmxAutomationTaskResults': 0
17:18:56,812 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'hpmxQuery': 76
17:18:56,817 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'collections': 29
17:18:56,818 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'hpmxUser': 6
17:18:56,818 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'hpmxAuthorization': 21
17:18:56,819 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main)   Number of records in table 'consolidatedNodeAuths': 21
17:19:02,478 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main) Completed database checking with no errors.
...
17:20:48,365 INFO  [Http11Protocol] Starting Coyote HTTP/1.1 on http-280
17:20:48,467 INFO  [Http11Protocol] Starting Coyote HTTP/1.1 on http-50000
17:20:48,696 INFO  [Http11Protocol] Starting Coyote HTTP/1.1 on http-50001
17:20:48,740 INFO  [Http11Protocol] Starting Coyote HTTP/1.1 on http-50002
17:20:48,780 INFO  [HPSIM_DEBUG] [Panic Logger-0] (main) DataOperations setup complete in: 1 ms
17:20:48,781 INFO  [Server] JBoss (MX MicroKernel) [4.2.0.GA (build: SVNTag=JBoss_4_2_0_GA date=200705111440)] Started in 3m:55s:488ms
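
In the log above, HP SIM runs its own integrity check twice: the first pass ends "with errors", an entry is fixed, and the second pass ends "with no errors". What matters is the last summary line. A minimal sketch for pulling it out of the log follows; the function names are hypothetical and the match strings are taken from the sample log shown here, so other HP SIM versions may word them differently.

```shell
# Sketch only: report the result of the most recent HP SIM database check.
# Function names are hypothetical; match strings come from the sample log.

# last_db_check LOGFILE: print the most recent integrity-check summary line.
last_db_check() {
    grep "Completed database checking" "$1" | tail -n 1
}

# db_check_clean LOGFILE: succeeds only when the most recent summary line
# reports no errors.
db_check_clean() {
    last_db_check "$1" | grep "with no errors" >/dev/null
}
```

For example, db_check_clean /var/opt/mx/logs/mxdomainmgr.0.log returns zero for the startup shown above, because the last summary line is the one logged at 17:19:02.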

Database Integrity errors

During this procedure, the following error may appear in the vacuumdb output and also in the mxdomainmgr log file at HP SIM startup:

06 Jan 04:30:21,388 INFO  [HPSIM_DEBUG] [Panic Logger-0] (ConstructionWrapper:TrapHandler) ALERTDB ADDINGALERT 779 Error:org.postgresql.util.PSQLException: ERROR: failed to re-find parent key in "pk_notices_1__13"

This error indicates serious damage to the files that hold the database. Searching for the message "failed to re-find parent key" suggests that it may be possible to use the pg_resetxlog command, but that "it's difficult to predict how much corruption or data loss would result".

1. Verify that the database is inactive:

# mxstop
# /sbin/init.d/hpsmdb stop
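
Since the reset in the next step is a last resort, a quick pre-flight check can confirm that the server is really down and that the raw backup from the vacuum procedure exists. The sketch below uses a hypothetical helper; it relies on the fact that a running postmaster keeps a postmaster.pid lock file in its data directory, and the backup path is whatever was chosen earlier.

```shell
# Sketch only: pre-flight check before resetting the transaction log.
# safe_to_reset is a hypothetical helper, not part of PostgreSQL or HP SIM.

# safe_to_reset DATADIR BACKUP: succeeds when DATADIR holds no postmaster.pid
# lock file and BACKUP exists and is not empty.
safe_to_reset() {
    datadir=$1
    backup=$2
    if [ -f "$datadir/postmaster.pid" ]; then
        echo "postmaster.pid found in $datadir: server may still be running" >&2
        return 1
    fi
    if [ ! -s "$backup" ]; then
        echo "backup $backup is missing or empty" >&2
        return 1
    fi
    return 0
}
```

With the paths used in this article, the call would be safe_to_reset /var/opt/hpsmdb/pgsql /tmp/hpsmdb.tgz. A leftover postmaster.pid after a crash also makes the check fail, which is the cautious outcome.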

2. Purge the files and logs using:

# cd /opt/hpsmdb/pgsql/bin
# ./pg_resetxlog -f /var/opt/hpsmdb/pgsql
Transaction log reset

pg_resetxlog clears the write-ahead log (WAL) and optionally resets some other control information stored in the pg_control file. This function is sometimes needed if these files have become corrupted. It should be used only as a last resort, when the server will not start due to such corruption.

After running this command, it should be possible to start the server, but bear in mind that the database may contain inconsistent data due to partially-committed transactions. You should immediately dump your data, run initdb, and reload. After reload, check for inconsistencies and repair as needed.

This utility can only be run by the user who installed the server, because it requires read/write access to the data directory. For safety reasons, you must specify the data directory on the command line. pg_resetxlog does not use the environment variable PGDATA.

If this succeeds, stop the database again and follow the complete procedure in the backup whitepaper referenced in the Abstract to manually dump the tables and rebuild the database.

3. Start the database

# /sbin/init.d/hpsmdb start

4. Repeat the vacuum process, beginning at step 3 of that procedure.

5. Enable HP SIM.

# mxstart

If this fails, the database backing files are damaged beyond repair. Try the whole process again after restoring the /var/opt/hpsmdb directory from backup media. Keep in mind that without a database-level backup the data may be unrecoverable, and the configuration may have to be rebuilt from scratch.

This page was last modified on 11 January 2010, at 12:01.