Are Virtual IPs required for Data Guard?
Until 10.2.0.3, the Data Guard (and broker) technology suffered from several weaknesses. One of the "highlights" was that no Virtual IP management was provided by Oracle. This has quite important consequences for the clients as soon as the primary server is down (no IP responding to the TCP requests e.g. of a connection build up). In this case, the failover from the first to the second address (in the address list of the TNS entry used by the client) could take a long time, depending on network/TCP timeouts. The RAC/Clusterware architecture solved this issue by introducing the Virtual IPs (VIP) which where always available (maybe moved to a surviving node).
Of course several solutions are available to overcome that problem, for instance managing your own VIP with self-made scripts (and all the subsequent risks, you try to fake a failover cluster!), combine Data Guard with Failover Cluster technology (e.g. Heartbeat under Linux, Veritas Cluster Services), and so on ...
All that solutions have as almost as many drawbacks as advantages, the most important are:
- How can we garantee that the VIP is always started on the Primary database side?
- How is it possible to prevent two Virtual IP running in parallel in the cluster?
However, one of the mentioned solutions is almost "mandatory" for Oracle clients before 10.2.0.3.
Starting with 10.2.0.3 on client and server side, Oracle proposes the OUTBOUNT_CONNECT_TIMEOUT parameter (not very well documented though). This option must be set in the sqlnet.ora file and defines the maximum number of seconds to be waited before trying to access the second, third etc. address in the ADDRESS_LIST, independent of TCP/IP timeouts. Examples:
In sqlnet.ora:
OUTBOUND_CONNECT_TIMEOUT=5
In tnsnames.ora on the client side or in the LDAP server:
PCMDB =
(DESCRIPTION =
(ADDRESS_LIST=
(ADDRESS = (PROTOCOL = TCP)(HOST = server1 )(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = server2 )(PORT = 1521))
)
(CONNECT_DATA =
(SERVICE_NAME = PCMDB_RW )
)
)
PCMDB_RW is a service created beforehand and activated through a "startup database" trigger (see end of the post). This makes sure, that the client can only connect to the primary side. If after 5 seconds (in our case), server1 is still unreachable, the client will try to connect to server2 (ignoring existing TCP/IP parameters).
Unfortunately, during our tests we observed that this parameter requires the remote_listener parameter to be set on the server side, at least for Linux/Sun(/... ?) clients. This for the non-error situation that a client first tries the standby database (and therefore should failover to the primary). For windows clients, patch 6038241 (10.2.0.3.7P) can be installed (only available for 32 Bit), see metalink note 342443.1. With this patch, the remote_listener parameter does not have to be.
So the Windows clients have a valid/stable solution using the patch.
The Linux clients however suffer from the side effects of the remote_listener parameter: The remote_listener parameter is used to register to other (remote) listeners in order to inform them about the current instance load. This means that in case of crash of the primary server and activation of the standby database, the service PCMDB_RW is known by the listener of the standby side as being offered by the new primary (ok) and the old primary (not ok)!
This situation may result in "ORA-12560: TNS:protocol adapter error" errors.
The solution is quite simple: the listener should be "reloaded" on the new primary side (old standby) when the standby database becomes the primary. This could be easily automated with a script and a DB_CHANGE_ROLE trigger. See the code below :
Script to reload the listener:
#!/bin/ksh
# Set the ORACLE environment variables (ORACLE_HOME, PATH, aso ...)
lsnrctl reload LISTENER_PCMDB
Below the trigger to reload the listener:
CREATE OR REPLACE TRIGGER RELOAD_LISTENER AFTER DB_ROLE_CHANGE ON DATABASE
BEGIN
dbms_scheduler.create_job(job_name=>'reload_lsn',
job_type=>'executable',job_action=>
'/u00/app/oracle/local/custom/bin/reload_lsn_PCMDB.ksh',
enabled=>TRUE);
END;
/
With this solution, both Linux and Windows clients do not require any VIP or Failover Cluster anymore to fully benefit from a Data Guard architecture. The only 'restriction' is to use 10.2.0.3 Oracle client releases. Even 10.2.0.2 does not work properly. Of course, wait for 10.2.0.4 and hope that the "problem" is fixed could be another solution :-)
Appendix : startup trigger for "_RW" service :
create or replace trigger service_trigger after
startup on database
declare
db_name varchar(512);
db_domain varchar(512);
database_role varchar(16);
service_names varchar(512);
instance_name varchar2(16);
begin
select value into service_names from v$parameter where name = 'service_names';
select value into instance_name from v$parameter where name = 'instance_name';
select value into db_name from v$parameter where name = 'db_name';
select value into db_domain from v$parameter where name = 'db_domain';
select database_role into database_role from v$database;
if database_role = 'PRIMARY' then
service_names := rtrim(service_names || ', ' || db_name || '_RW.' || db_domain, '.');
else
service_names := rtrim(service_names || ', ' || db_name || '_RO.' || db_domain, '.');
end if;
execute immediate 'alter system set service_names = ''' || service_names || ''' scope=memory sid=''' || instance_name || '''';
end;
/