Monday, December 14, 2009

Solaris Fault Management

The Solaris Fault Management Facility is designed to work with the Service Management Facility (SMF) to give Solaris 10 systems a self-healing capability.

The fmd daemon is responsible for monitoring several aspects of system health.

The fmadm config command shows the current configuration for fmd.

The Fault Manager logs can be viewed with fmdump -v and fmdump -e -v.

fmadm faulty will list any devices flagged as faulty.

fmstat shows statistics gathered by fmd.
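
The commands above can be collected into a small health-check script. What follows is a minimal Python sketch, not something shipped with Solaris: the CHECKS table, its labels, and the run_checks helper are hypothetical names introduced here. It assumes fmadm, fmdump, and fmstat are on the PATH and skips any it cannot find. These commands generally need root privileges.

```python
import shutil
import subprocess

# Commands taken from the list above; the labels are illustrative only.
CHECKS = {
    "fmd configuration": ["fmadm", "config"],
    "fault log": ["fmdump", "-v"],
    "error log": ["fmdump", "-e", "-v"],
    "faulty resources": ["fmadm", "faulty"],
    "fmd statistics": ["fmstat"],
}


def run_checks(checks=CHECKS):
    """Run each command; return {label: output text, or None if skipped}."""
    results = {}
    for label, argv in checks.items():
        if shutil.which(argv[0]) is None:
            # Not a Solaris host, or the tool is not on the PATH.
            results[label] = None
            continue
        proc = subprocess.run(argv, capture_output=True, text=True)
        # Keep stderr when the command fails so the reason is visible.
        results[label] = proc.stdout if proc.returncode == 0 else proc.stderr
    return results


if __name__ == "__main__":
    for label, output in run_checks().items():
        print(f"== {label} ==")
        print(output if output is not None else "(command not found, skipped)")
```

Keeping the commands in a table makes it easy to add or drop a check without touching the loop.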

Fault Management
With Solaris 10, Sun has implemented a daemon, fmd, to track and react to faults. In addition to sending traditional syslog messages, the system sends binary telemetry events to fmd for correlation and analysis. Solaris 10 implements default fault management operations for several pieces of hardware in SPARC systems, including CPU, memory, and I/O bus events. Similar capabilities are being implemented for x64 systems.

Once the problem has been diagnosed, failing components may be offlined automatically without a system crash, or fmd may take other corrective action. If a service dies as a result of the fault, the Service Management Facility (SMF) will attempt to restart it and any dependent processes.

The Fault Management Facility reports error messages in a well-defined and explicit format. Each diagnosed event is identified by a Universally Unique Identifier (UUID), and each message carries a message ID that maps to a document on the Sun web site at http://www.sun.com/msg/ .

Resources are uniquely identified by a Fault Managed Resource Identifier (FMRI). Each Field Replaceable Unit (FRU) has its own FMRI. FMRIs are associated with one of the following conditions:

ok: Present and available for use.
unknown: Not present or not usable, perhaps because it has been offlined or unconfigured.
degraded: Present and usable, but one or more problems have been identified.
faulted: Present but not usable; unrecoverable problems have been diagnosed and the resource has been disabled to prevent damage to the system.
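
The four conditions can be modeled as a small enumeration, which is handy when a script has to decide what to do with a resource. This is a minimal Python sketch: ResourceState and its usable and needs_attention properties are hypothetical names introduced here, not part of any Solaris API. They only restate the definitions in the list above.

```python
from enum import Enum


class ResourceState(Enum):
    """The four resource conditions listed above."""

    OK = "ok"
    UNKNOWN = "unknown"
    DEGRADED = "degraded"
    FAULTED = "faulted"

    @property
    def usable(self):
        # ok and degraded resources are both present and usable.
        return self in (ResourceState.OK, ResourceState.DEGRADED)

    @property
    def needs_attention(self):
        # degraded and faulted both mean problems have been identified.
        return self in (ResourceState.DEGRADED, ResourceState.FAULTED)


if __name__ == "__main__":
    for state in ResourceState:
        print(state.value, state.usable, state.needs_attention)
```

A degraded resource is the interesting case: it is still usable, but it should be looked at before it becomes faulted.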

The fmdump -V -u eventid command displays detailed information on the type and location of an event. (The eventid is included in the text of the error message sent to syslog.) Adding the -e option reads from the error log rather than the fault log.
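
When the event ID is copied out of a syslog message by a script, it helps to validate it before handing it to fmdump. The sketch below is only an illustration: fmdump_command is a hypothetical helper name, the event ID shown is a made-up placeholder, and the check assumes event IDs are standard UUIDs. The function only builds the argument list; it does not run anything.

```python
import uuid


def fmdump_command(event_id, error_log=False):
    """Build the argv for `fmdump -V -u <event_id>`, adding -e for the error log."""
    # uuid.UUID raises ValueError if the string is not a well-formed UUID.
    checked = str(uuid.UUID(event_id))
    argv = ["fmdump", "-V"]
    if error_log:
        argv.append("-e")
    argv += ["-u", checked]
    return argv


if __name__ == "__main__":
    # Placeholder event ID, not taken from a real system.
    example_id = "12345678-1234-5678-1234-567812345678"
    print(" ".join(fmdump_command(example_id)))
    print(" ".join(fmdump_command(example_id, error_log=True)))
```

On a Solaris host the returned list can be passed straight to subprocess.run.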

Statistical information on the performance of fmd can be viewed via the fmstat command. In particular, fmstat -m modulename provides information for a given module.
