Yann Neuhaus Blog

All around High Availability (Open Source and Oracle)

Oktober 2007 - Einträge

Dataguard configuration modification to MAXIMUM AVAILABILITY

In order to change the Dataguard configuration from MAXIMUM PERFORMANCE to MAXIMUM AVAILABILITY, Oracle doesn't require to bounce the databases anymore since 10.2.0.3.

In fact this is not always the case. The only way to verify if it will occur is:

1. change log transport mode

edit database 'MYDB_SITE1' set property LogXptMode = 'SYNC';
edit database 'MYDB_SITE2' set property LogXptMode = 'SYNC';

2. look for the following entry in the alert.log file of the PRIMARY database :

Destination LOG_ARCHIVE_DEST_2 is SYNCHRONIZED

If this string has been found, the Dataguard configuration can be changed smoothly (without database reboot) to MAXIMUM AVAILABILITY as described below :

edit configuration set protection mode as 'MaxAvailability';

Of course the standby redo log files must exist, see previous posts.

If the entry in the alert.log file has not been found a workaround could be to to set the LogXptMode property to 'ASYNC' and then back to 'SYNC'.

Many thanks to Hervé who was the first one to discover this by Trivadis :-)

Data Guard in SYNC mode – does my DB suffer from network weakness ?

My Grid Control console is often polluted by warnings about wait events of the network event class :



Is my Data Guard setup really suffering from poor performance?

I operate with the Maximum Availability mode.

We will investigate why the Grid Control warnings occur. The database is up since about 3 days (72 hours). First of all try to identify the wait classes which generated the most important wait times:

set lines 130
select * from V$SYSTEM_WAIT_CLASS
order by 5
/

WAIT_CLASS_ID WAIT_CLASS# WAIT_CLASS          TOTAL_WAITS TIME_WAITED
------------- ----------- ------------------- ----------- -----------
   4217450380           1 Application                1472         434
   3875070507           4 Concurrency                1178        1338
   3290255840           2 Configuration              1097        5250
   3386400367           5 Commit                     3512        6981
   4166625743           3 Administrative             4906      111527
   2000153315           7 Network                 2549790      325016
   1740759767           8 User I/O                1384276      427440
   4108307767           9 System I/O              3151272      646415
   1893977003           0 Other                    251373     2207566
   2723168908           6 Idle                    3720999   639262672

In the database, after a performance alert occurred in Grid Control, we can identify the problem in V$WAITCLASSMETRIC_HISTORY :

select BEGIN_TIME, END_TIME, DBTIME_IN_WAIT, TIME_WAITED WAIT_COUNT
  from V$WAITCLASSMETRIC_HISTORY
 where WAIT_CLASS# = 7; # 7 for network

BEGIN_TIME           END_TIME               DBTIME_IN_WAIT TIME_WAITED WAIT_COUNT
-------------------- --------------------   -------------- ----------- ----------
25-SEP-2007 11:00:54 25-SEP-2007 11:01:54   21.0387822     2143.1923        19262
25-SEP-2007 10:59:54 25-SEP-2007 11:00:54   51.2713653     2147.2517        19047
25-SEP-2007 10:58:54 25-SEP-2007 10:59:54   55.8887214     2480.2726        18405
25-SEP-2007 10:57:53 25-SEP-2007 10:58:54   44.2017456     2374.9075        21223
25-SEP-2007 10:56:53 25-SEP-2007 10:57:53   33.1204214     2676.6979        28336
25-SEP-2007 10:55:53 25-SEP-2007 10:56:53   37.3184263     3498.7629        30965
25-SEP-2007 10:54:54 25-SEP-2007 10:55:53   27.779576      3130.5472        29403
25-SEP-2007 10:53:54 25-SEP-2007 10:54:54   26.117611      2752.5801        23430

Now we will try to analyze in depth the events mentioned in Grid control :

"Database Time Spent Waiting (%)" is at 88.26568 for event class "Network"

Let’s have a look at the primary site. According to the white paper presented in the Metalink note 387174.1, the event “LNS wait on SENDREQ” covers the sequence of the following actions (when a commit transaction is propagated to the standby database).

(Extract of the Metalink note:)
The RFS process receives the redo being sent by LNS and completes the I/O to the standby redo log (important to understand the analysis below).

V$SYSTEM_EVENT is the view which will allow us to identify the current waits events :

select EVENT, TOTAL_WAITS, TOTAL_TIMEOUTS, TIME_WAITED, AVERAGE_WAIT
from v$system_event
where event like '%LNS%'
   or event like '%LGWR%'
/

EVENT                     TOTAL_WAITS TOTAL_TIMEOUTS TIME_WAITED AVERAGE_WAIT
------------------------- ----------- -------------- ----------- ------------
. . .
LNS wait on SENDREQ            125643              0      156673         1.25
. . .

On the standby side we perform almost the same select (a few seconds later) , to identify the LNS/RFS related wait times :


select EVENT, TOTAL_WAITS, TOTAL_TIMEOUTS, TIME_WAITED, AVERAGE_WAIT

  from v$system_event
 where event like '%RFS%'
 order by 4
/

EVENT                     TOTAL_WAITS TOTAL_TIMEOUTS TIME_WAITED AVERAGE_WAIT
------------------------- ----------- -------------- ----------- ------------
. . .
RFS write                      125270              0      121825          .97
. . .


We clearly observe that out of the 156673 ms waited because of "LNS wait on SENDREQ" (on the primary, see above for detailed description of this event), 121825 ms are due to the waits on writing to the standby redo log files on the standby side ("RFS write")

In the whole propagation/synchronization process between primary and standby the most important part is simply the writing on the standby redo log files (normal I/Os to keep the transactions on the standby side) and not the network as expected.

It is worth to mention that the "redo write time" on the primary is 128333 ms which means that the 121825 ms of waits for the RFS writes are fully acceptable (normal I/O operations/performance)

During these 3 days while the database was up and running having waits of about 120000 to 130000 ms (120 to 130 s) is absolutely negligible!

As a conclusion: although the most important waits on the primary database are related to the network, there are negligible in absolute values. 90% of 120 seconds of waits in 3 days is still only 1,66 second of wait per hour in average

More details on www.trivadis.com -> publications :

http://www.trivadis.com/Images/Dataguard_network_tuning_tvd_layout_tcm16-16758.pdf