1. Introduction:
The aim of this document is to list some troubleshooting procedures associated with the monitoring of a cluster database. In our 7-node RAC environment (11gR2), we have configured SNMP traps to be sent from Enterprise Manager 11gR1 to the Zenoss system.
The challenge is to demonstrate to the client when the alerts fire and how they appear in Zenoss. The demonstrations include crashing the cluster to generate database or cluster alerts.
Some alert tests are straightforward, while others require a deep understanding of how the monitoring system works. In this document, I am going to show you how I tested some cluster alerts.
2. Notification Rule creation:
To set up the notification rules, go to Preferences in the top corner of the database console. Click “Rules”, then click “Create” to create a new notification rule.
3. Identify the targets:
This step is useful when forcing a metric collection and upload. We need to determine:
1. The Target Name
2. The Target Type
3. The Collection Name
Log in as the Oracle software owner and run the two commands shown below (emctl config agent listtargets and emctl status agent scheduler). In our case, we need to identify the collection names for the target type cluster. From the output, we pick out the following entry:
cluster:tstcluster:CRSAlert+CRSStatus
From now on, we will use the collection names CRSAlert and CRSStatus for cluster metrics collection.
-bash-3.2$ emctl config agent listtargets
Oracle Enterprise Manager 11g Database Control Release 11.2.0.2.0
Copyright (c) 1996, 2010 Oracle Corporation. All rights reserved.
[dbt01.example.com:3938, oracle_emd]
[dbt01.example.com, host]
[tstcluster, cluster]
[tstdb.example.com_tstdb1, oracle_database]
[tstdb.example.com, rac_database]
[+ASM1_dbt01.example.com, osm_instance]
[LISTENER_dbt01.example.com, oracle_listener]
[LISTENER_2_dbt01.example.com, oracle_listener]
[LISTENER_SCAN3_tstcluster, oracle_listener]
-bash-3.2$ emctl status agent scheduler
Oracle Enterprise Manager 11g Database Control Release 11.2.0.2.0
Copyright (c) 1996, 2010 Oracle Corporation. All rights reserved.
---------------------------------------------------------------
Scheduler status at 2012-02-01 18:20:49
Running entries::
Ready entries::
Scheduled entries::
2012-02-01 18:20:53 : host:dbt01.example.com:Load
2012-02-01 18:20:53 : osm_instance:+ASM1_dbt01.example.com:diskgroup_space_usage
2012-02-01 18:20:56 : oracle_database:tstdb.example.com_tstdb1:health_check
2012-02-01 18:21:04 : oracle_emd:dbt01.example.com:3938:EMDUploadStats
2012-02-01 18:21:09 : oracle_database:tstdb.example.com_tstdb1:Response
2012-02-01 18:21:23 : host:dbt01.example.com:Network+ProgramResourceUtilization+CRSAlert+CRSStatus
2012-02-01 18:21:34 : Upload Files Recount
2012-02-01 18:21:38 : osm_instance:+ASM1_dbt01.example.com:ofs_collections+incident_meter
2012-02-01 18:21:40 : Ping Manager
2012-02-01 18:21:41 : rac_database:tstdb.example.com:streams_processes_count_item
2012-02-01 18:21:58 : rac_database:tstdb.example.com:activity_pending
2012-02-01 18:22:02 : oracle_listener:LISTENER_2_dbt01.example.com:Load+General Status
2012-02-01 18:22:11 : osm_instance:+ASM1_dbt01.example.com:adr_alert_log_rollup
2012-02-01 18:22:21 : cluster:tstcluster:CRSAlert+CRSStatus
2012-02-01 18:22:24 : rac_database:tstdb.example.com:dbjob_status+UserBlock+cardinality+service_performance+qos_psm
2012-02-01 18:22:43 : oracle_listener:LISTENER_SCAN3_tstcluster:Response
2012-02-01 18:22:54 : oracle_database:tstdb.example.com_tstdb1:haconfig2_collection+ha_rac_intrconn_traffic+sga_start+incident_meter
2012-02-01 18:23:08 : oracle_database:tstdb.example.com_tstdb1:sql_response
2012-02-01 18:23:20 : osm_instance:+ASM1_dbt01.example.com:performance_metrics
2012-02-01 18:24:32 : oracle_database:tstdb.example.com_tstdb1:DatabaseVaultRealmViolation_collection+DatabaseVaultCommandRuleViolation_collection+DatabaseVaultRealmConfigurationIssue_collection+DatabaseVaultCommandRuleConfigurationIssue_collection+DatabaseVaultPolicyChanges_collection
2012-02-01 18:24:33 : oracle_database:tstdb.example.com_tstdb1:adr_alert_log_rollup
2012-02-01 18:25:34 : rac_database:tstdb.example.com:streams_statistics
2012-02-01 18:25:38 : oracle_listener:LISTENER_2_dbt01.example.com:Response
2012-02-01 18:25:41 : osm_instance:+ASM1_dbt01.example.com:Response
2012-02-01 18:25:43 : oracle_listener:LISTENER_dbt01.example.com:Response
2012-02-01 18:26:23 : rac_database:tstdb.example.com:Recovery_Area+haconfig3_collection
2012-02-01 18:26:25 : oracle_listener:LISTENER_SCAN3_tstcluster:Load+General Status
2012-02-01 18:26:47 : osm_instance:+ASM1_dbt01.example.com:disk_status
2012-02-01 18:27:13 : oracle_database:tstdb.example.com_tstdb1:latest_hdm_findings_coll_item
2012-02-01 18:27:30 : oracle_database:tstdb.example.com_tstdb1:log_full
2012-02-01 18:28:06 : rac_database:tstdb.example.com:haconfig1_collection+segment_advisor_count+DatabaseVaultRealmViolation_collection+DatabaseVaultCommandRuleViolation_collection+DatabaseVaultRealmConfigurationIssue_collection
2012-02-01 18:28:21 : osm_instance:+ASM1_dbt01.example.com:diskgroup_failgroup_checks
2012-02-01 18:28:42 : oracle_database:tstdb.example.com_tstdb1:baseline_metadata
2012-02-01 18:30:57 : host:dbt01.example.com:Filesystems+DiskActivity+PagingActivity+CPUUsage+proc_zombie
2012-02-01 18:32:03 : oracle_database:tstdb.example.com_tstdb1:UserAudit
2012-02-01 18:32:11 : host:dbt01.example.com:LogFileMonitoring+FileMonitoring
2012-02-01 18:32:33 : Upload Manager
2012-02-01 18:33:25 : rac_database:tstdb.example.com:latest_db_hdm_findings_coll_item
2012-02-01 18:34:48 : oracle_emd:dbt01.example.com:3938:ProcessInfo
2012-02-01 18:35:15 : oracle_listener:LISTENER_dbt01.example.com:Load+General Status
2012-02-01 18:45:00 : oracle_database:tstdb.example.com_tstdb1:aq_monitoring_alerts
2012-02-01 18:45:39 : rac_database:tstdb.example.com:problemTbsp_10i_Dct+audit_failed_logins
2012-02-01 18:50:17 : rac_database:tstdb.example.com:aq_monitoring_alerts
2012-02-01 19:05:32 : Reap Connection Pools
2012-02-01 19:05:55 : osm_instance:+ASM1_dbt01.example.com:cluster_performance_metrics
2012-02-01 19:11:28 : rac_database:tstdb.example.com:DatabaseVaultCommandRuleConfigurationIssue_collection+DatabaseVaultPolicyChanges_collection+key_profiles_collection
2012-02-01 19:17:36 : osm_instance:+ASM1_dbt01.example.com:ofs_performance_metrics
2012-02-01 22:05:39 : oracle_database:tstdb.example.com_tstdb1:oracle_security
2012-02-01 22:05:43 : host:dbt01.example.com:Inventory
2012-02-01 22:05:48 : oracle_listener:LISTENER_2_dbt01.example.com:oracle_security
2012-02-01 22:05:49 : oracle_database:tstdb.example.com_tstdb1:cluster_resource_name
2012-02-01 22:05:51 : osm_instance:+ASM1_dbt01.example.com:cluster_resource_name
2012-02-01 22:05:53 : oracle_listener:LISTENER_dbt01.example.com:oracle_security
2012-02-01 22:05:58 : oracle_listener:LISTENER_2_dbt01.example.com:cluster_resource_name
2012-02-01 22:05:59 : oracle_database:tstdb.example.com_tstdb1:isHasManaged
2012-02-01 22:06:03 : oracle_listener:LISTENER_dbt01.example.com:cluster_resource_name
2012-02-01 22:06:03 : host:dbt01.example.com:oracle_security
2012-02-01 22:06:08 : oracle_listener:LISTENER_2_dbt01.example.com:isHasManaged
2012-02-01 22:06:13 : host:dbt01.example.com:host_storage
2012-02-01 22:06:13 : oracle_listener:LISTENER_dbt01.example.com:isHasManaged
2012-02-01 22:06:19 : oracle_database:tstdb.example.com_tstdb1:oracle_security_inst
2012-02-01 22:07:11 : cluster:tstcluster:ha_cls_intrconn+crs_event+resource_status
2012-02-01 22:07:37 : rac_database:tstdb.example.com:oracle_security
2012-02-01 22:07:47 : rac_database:tstdb.example.com:cluster_resource_name
2012-02-01 22:07:53 : oracle_listener:LISTENER_SCAN3_tstcluster:oracle_security
2012-02-01 22:07:57 : rac_database:tstdb.example.com:isHasManaged
2012-02-01 22:08:03 : oracle_listener:LISTENER_SCAN3_tstcluster:cluster_resource_name
2012-02-01 22:08:13 : oracle_listener:LISTENER_SCAN3_tstcluster:isHasManaged
2012-02-01 22:08:19 : rac_database:tstdb.example.com:tbspAllocation
2012-02-01 22:08:31 : rac_database:tstdb.example.com:oracle_storage
2012-02-01 22:09:28 : oracle_database:tstdb.example.com_tstdb1:oracle_dbconfig
2012-02-01 22:09:49 : rac_database:tstdb.example.com:problemSegTbsp+feature_usage_collection_item
2012-02-01 22:10:13 : rac_database:tstdb.example.com:invalid_objects_rollup
2012-02-01 22:11:40 : rac_database:tstdb.example.com:oracle_racconfig
2012-02-01 22:13:21 : rac_database:tstdb.example.com:audit_failed_logins_historical
2012-02-01 22:13:47 : osm_instance:+ASM1_dbt01.example.com:oracle_osm
2012-02-01 22:20:00 : cluster:tstcluster:mgmt_rac_services
2012-02-01 22:25:55 : oracle_database:tstdb.example.com_tstdb1:pwd_expiry
2012-02-01 22:27:14 : oracle_database:tstdb.example.com_tstdb1:ha_rac_intrconn+ha_rac_intrconn_type
2012-02-01 22:30:08 : oracle_emd:dbt01.example.com:3938:EMDIdentity+EMDUserLimits
2012-02-01 22:30:23 : osm_instance:+ASM1_dbt01.example.com:Disk_Path
2012-02-01 22:32:22 : host:dbt01.example.com:Swap_Area_Status+HostStorageSupport
2012-02-01 22:35:19 : rac_database:tstdb.example.com:StgPerf
---------------------------------------------------------------
Agent is Running and Ready
-bash-3.2$
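Because the scheduler listing is long, it can help to filter it by target type instead of reading it by eye. The following is only a sketch: list_collections is a hypothetical helper name, and it assumes each scheduler entry has the form "date time : type:name:collections", as in the output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of emctl): list the collection names that the
# agent scheduler reports for one target type.
# Reads `emctl status agent scheduler` output on standard input.
list_collections() {
    awk -v want="$1" '
        # Only scheduler entries start with "<date> <time> : ".
        /^[0-9]+-[0-9]+-[0-9]+ [0-9:]+ : / {
            sub(/^[0-9]+-[0-9]+-[0-9]+ [0-9:]+ : /, "")
            n = split($0, part, ":")
            # Internal jobs such as "Ping Manager" have no type:name prefix.
            if (n < 3 || part[1] != want) next
            # The last field holds the collection names, joined with "+".
            # (Target names may themselves contain ":", e.g. host:port.)
            m = split(part[n], coll, "+")
            for (i = 1; i <= m; i++) print coll[i]
        }
    ' | LC_ALL=C sort -u
}
```

A possible invocation on the agent host would be: emctl status agent scheduler | list_collections cluster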
4. OCR Alert Log Error:
This metric belongs to the cluster target. It collects CRS-1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1010 and 1011 messages from the CRS alert log at the cluster level and issues alerts based on the error code.
Simplest case: Perform the following steps to generate the alert:
1. Identify the CRS alert log (typically in $GRID_HOME/log/
2. Add the following line to the end of the file (the timestamp must be current):
2012-02-01 10:13:27.508 [cssd(16636)]CRS-1006: test OCR Alert log error.
3. On the host where the DB Console agent runs, use the following command to force an immediate reevaluation of a metric collection:
$AGENT_HOME/bin/emctl control agent runCollection target_name:target_type collection_name
emctl control agent runCollection tstcluster:cluster CRSAlert
Then use the following command to force an immediate upload of the current management data from the managed host to the Management Service, instead of waiting for the next scheduled upload:
emctl upload agent
4. Wait for 5 minutes and check for a new alert.
5. Open the EM console, click on the Cluster tab, and go to All Metrics (at the bottom of the page).
6. Click on OCR Alert Log Error.
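Since the timestamp in step 2 must be current, typing the line by hand is error-prone. As a rough sketch, the line could be generated instead. append_crs_test_line is a hypothetical helper, the process tag [cssd(16636)] is simply copied from the sample line, and the milliseconds are fixed at .000 because sub-second output from date is not portable. The log file path is passed in as an argument.

```shell
#!/bin/sh
# Hypothetical helper: append a test CRS message stamped with the current time.
# Usage: append_crs_test_line <alert_log_file> <crs_code> <message text>
append_crs_test_line() {
    log_file=$1
    code=$2
    text=$3
    # Same layout as the sample line: "YYYY-MM-DD HH:MM:SS.mmm".
    # Assumption: a fixed ".000" millisecond part is acceptable for the test.
    stamp="$(date '+%Y-%m-%d %H:%M:%S').000"
    printf '%s [cssd(16636)]CRS-%s: %s\n' "$stamp" "$code" "$text" >> "$log_file"
}
```

For example, append_crs_test_line /path/to/alert.log 1006 "test OCR Alert log error." appends one line, where /path/to/alert.log is a placeholder for the CRS alert log identified in step 1.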
5. Node Configuration Alert Log Error:
This metric belongs to the cluster target. It collects CRS-1607, 1802, 1803, 1804 and 1805 messages from the CRS alert log at the cluster level, and issues alerts based on the error code.
Simplest case: Perform the following steps to generate the alert:
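When choosing a test message, it is easy to mix up which error code feeds which metric. The sketch below encodes only the code lists given in sections 4 and 5 of this document. crs_metric_for_code is a hypothetical helper, not an Enterprise Manager command.

```shell
#!/bin/sh
# Hypothetical lookup: map a CRS error number to the cluster metric that,
# according to the lists in this document, raises an alert for it.
crs_metric_for_code() {
    case "$1" in
        1001|1002|1003|1004|1005|1006|1007|1008|1010|1011)
            echo "OCR Alert Log Error" ;;
        1607|1802|1803|1804|1805)
            echo "Node Configuration Alert Log Error" ;;
        *)
            # Any other code is not covered by these two metrics.
            echo "unknown"
            return 1 ;;
    esac
}
```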
1. Identify the CRS alert log (typically in $GRID_HOME/log/
2. Add the following line to the end of the file (the timestamp must be current):
2012-02-01 10:13:27.508 [cssd(16636)]CRS-1607: test Node configuration error.
3. On the host where the DB Console agent runs, use the following command to force an immediate reevaluation of a metric collection:
$AGENT_HOME/bin/emctl control agent runCollection target_name:target_type collection_name
emctl control agent runCollection tstcluster:cluster CRSAlert
Then use the following command to force an immediate upload of the current management data from the managed host to the Management Service, instead of waiting for the next scheduled upload:
emctl upload agent
4. Wait for 5 minutes and check for a new alert.
5. Open the EM console, click on the Cluster tab, and go to All Metrics (at the bottom of the page).
6. Click on Node Configuration Alert Log Error.
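The collection and upload commands from step 3 are always run as a pair, so they can be wrapped in one function. This is a minimal sketch: force_collection and the EMCTL override variable are hypothetical names, while the two emctl invocations are the ones shown above.

```shell
#!/bin/sh
# Hypothetical wrapper around the two emctl commands used in step 3.
# Usage: force_collection <target_name> <target_type> <collection_name>
# EMCTL may be set to another program (for example "echo") to preview the
# commands without running them. It defaults to $AGENT_HOME/bin/emctl.
force_collection() {
    target_name=$1
    target_type=$2
    collection=$3
    emctl_bin=${EMCTL:-$AGENT_HOME/bin/emctl}

    # Re-evaluate the metric collection now and, only if that succeeds,
    # upload the result instead of waiting for the next scheduled upload.
    "$emctl_bin" control agent runCollection "$target_name:$target_type" "$collection" &&
        "$emctl_bin" upload agent
}
```

With the names used in this document, the call would be force_collection tstcluster cluster CRSAlert.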
6. Node(s) with Clusterware Problem:
This metric belongs to the cluster target. It shows how many nodes have clusterware problems, and it uses the cluster verify utility to check the cluster nodes:
cluvfy comp crs -n node1, node2 …
Here node1, node2 are the nodes in the node list for the cluster.
Simplest case: Perform the following steps to generate the alert:
1. Back up and edit $GRID_HOME/bin/cluvfy
2. Directly below the first line of the file (#!/bin/sh), add the echo and exit lines shown here so that the cluster verify utility exits immediately with an error:
#!/bin/sh
echo ERROR
exit 1;
3. On the host where the DB Console agent runs, use the following command to force an immediate reevaluation of a metric collection:
$AGENT_HOME/bin/emctl control agent runCollection target_name:target_type collection_name
emctl control agent runCollection tstcluster:cluster CRSStatus
4. Wait for 15 minutes and check for a new alert.
5. Open the EM console, click on the Cluster tab, and go to All Metrics (at the bottom of the page).
6. Click on Node(s) with Clusterware Problem.
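Steps 1 and 2 modify a production utility, so the original must be restored after the test. The following sketch shows one way to script both directions. stub_cluvfy, restore_cluvfy and the .bak suffix are assumptions made for illustration. The inserted lines are the ones given in step 2.

```shell
#!/bin/sh
# Hypothetical helpers: make a script fail on purpose, then restore it.
# Usage: stub_cluvfy <path to cluvfy>; restore_cluvfy <path to cluvfy>

stub_cluvfy() {
    target=$1
    # Keep a copy with the original permissions and timestamps.
    cp -p "$target" "$target.bak" || return 1
    {
        head -n 1 "$target.bak"     # the original first line (#!/bin/sh)
        echo 'echo ERROR'           # inserted: report an error
        echo 'exit 1;'              # inserted: stop before the real checks
        tail -n +2 "$target.bak"    # the rest of the original script
    } > "$target"
}

restore_cluvfy() {
    target=$1
    # Put the untouched copy back in place.
    mv "$target.bak" "$target"
}
```

In this environment the calls would be stub_cluvfy "$GRID_HOME/bin/cluvfy" before the test and restore_cluvfy "$GRID_HOME/bin/cluvfy" once the alert has been verified.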
7. References:
http://docs.oracle.com/cd/B14099_19/manage.1012/b16242/emctl.htm
http://docs.oracle.com/cd/E11857_01/em.111/e16790/emctl.htm#BABHAFAA