ORACLE AND MAA (Maximum Availability Architecture) – Part VI – Broker Failover
In this article we will see how to perform a failover and reinstate in an environment where the Broker is configured. If you have been following the series so far, you will have seen that we already covered manual failover, reinstate, and switchover (all manual and without the Broker), and in the last article we configured the Broker.
Even so, we have not yet reached an ideal Oracle MAA (Maximum Availability Architecture) environment with Data Guard and Oracle RAC. Fast-Start Failover has not been configured yet, so even with the Broker a failure in the environment still requires manual intervention.
SIXTH ARTICLE
In this article we will see how to proceed when the environment fails while the Broker is in place: how to perform the failover and the reinstate through the Broker. As I said above, the MAA environment is not finished yet, but that does not prevent a failure from happening and forcing you to step in.
I believe it has already become clear throughout the MAA series that many steps are not applied in a real environment in the same order presented here; you will not perform a failover just to test the Broker, since in practice you configure everything up front without having to "restore" the environment. Across the articles we have covered the various points that make up an MAA environment, and you have to be prepared to act even if your environment is not completely configured. Are you going to tell your manager that you cannot perform a failover because the Broker is not configured?
ENVIRONMENT
In this article we have an environment with two Oracle RAC clusters synchronized through Data Guard. The primary database is "maa" and the standby is "maastb", both synchronized with DG operating in MAXIMUM AVAILABILITY and with real-time apply. Finally, the Broker is configured and "taking care" of the environment.
Before continuing, let us briefly return to the point highlighted in the manual failover article: synchronization and the importance of the environment. Regarding synchronization, in a DG environment it must be one of the key points: you operate in MAA and do not keep your environment in sync? As for importance, if you are running MAA with Oracle RAC on both sides, your primary and standby environments are both important, are they not?
So, you have a database that stores important data and you do not keep your standby synchronized with the primary? Have you already warned your manager that you are going to lose data?
Not that every MAA environment with DG must never lose data. Perhaps, for design reasons, there is an agreement about how much information may be lost, and some of it can be recovered by other methods. In any case, know your environment well and be aware of all its details.
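If you want a quick, objective way to measure how far the standby is from the primary, a simple query on the standby does the job (a minimal sketch; v$dataguard_stats reports the transport and apply lag seen by the standby):
SQL> SELECT name, value FROM v$dataguard_stats WHERE name IN ('transport lag', 'apply lag');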
CONTROL
In previous articles the same trick was used to illustrate DG synchronization, and we bring it back here. Using the same table created before, we clean its data and insert rows from each primary instance. With this table we show that, in the current environment, the transactions are protected even without generating archivelogs.
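If you are reproducing the test, the table can be as simple as the definition below (a hypothetical definition consistent with the queries used in this article; the original table was created in an earlier part of the series):
SQL> CREATE TABLE testedg (c1 NUMBER, c2 DATE);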
On the primary instance maa1:
SQL> SELECT database_role FROM v$database;
DATABASE_ROLE
----------------
PRIMARY
SQL> SELECT instance_name FROM v$instance;
INSTANCE_NAME
----------------
maa1
SQL> DELETE FROM testedg;
1 row deleted.
SQL> INSERT INTO testedg(c1, c2) VALUES(1, sysdate);
1 row created.
SQL> COMMIT;
Commit complete.
SQL>
On the primary instance maa2:
SQL> SELECT instance_name FROM v$instance;
INSTANCE_NAME
----------------
maa2
SQL> INSERT INTO testedg(c1, c2) VALUES(2, sysdate);
1 row created.
SQL> COMMIT;
Commit complete.
SQL> SELECT c1, TO_CHAR(c2, 'DD/MM/RRRR HH24:MI:SS') AS momento FROM testedg;
C1 MOMENTO
---------- -------------------
1 11/05/2014 12:01:07
2 11/05/2014 12:02:17
SQL>
FAILOVER
Creating the failure
Here the failure will be a bit more catastrophic than in the last failover: the primary Oracle RAC servers were physically powered off. Suppose something similar happens in your environment: the storage controller for the disks behind your database fails and your RAID layout did not save you. Imagine your Exadata lost more than two storage cells during the night shift.
So you arrive in the morning (that is, if they did not already call you during the night) and there is already a committee "welcoming" you with frightened faces, ready to hand over the bomb and say: good luck! Well, we will see that the scenario is not that dark and you can get out of it quite smoothly.
FAILOVER TO
I already said above that the first concern is synchronization, but since the whole environment is configured correctly you know that sync will not be a problem. If you want to check, connect to the standby and verify how it stands:
SQL> SELECT database_role FROM v$database;
DATABASE_ROLE
----------------
PHYSICAL STANDBY
SQL> SELECT protection_mode, protection_level FROM v$database;
PROTECTION_MODE PROTECTION_LEVEL
-------------------- --------------------
MAXIMUM AVAILABILITY MAXIMUM AVAILABILITY
SQL>
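If you want one more check before deciding anything, you can also look at the recovery and transport processes on the standby (a sketch; the exact processes, statuses and sequences will vary):
SQL> SELECT process, status, thread#, sequence# FROM v$managed_standby;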
Since the Broker is configured, let us use it. One detail here: if you try to connect to the primary you will receive an error, so you have to fall back to the standby connection in the Broker:
[oracle@rac11stb01 ~]$ dgmgrl
DGMGRL for Linux: Version 11.2.0.3.0 - 64bit Production
Copyright (c) 2000, 2009, Oracle. All rights reserved.
Welcome to DGMGRL, type "help" for information.
DGMGRL> CONNECT sys@maa
Password:
Unable to connect to database
ORA-12543: TNS:destination host unreachable
Failed.
DGMGRL> CONNECT sys@maastb
Password:
Connected.
DGMGRL>
As we saw above, the Broker cannot connect to the primary environment. We know it is down and we need the failover. Of course, before performing the failover you talked to the infrastructure/system team and confirmed that the environment cannot be brought back quickly. So the only option is the failover.
Once connected to the Broker you can check how your environment stands:
DGMGRL> SHOW CONFIGURATION VERBOSE;
Configuration - dgrac
Protection Mode: MaxAvailability
Databases:
maa - Primary database
maastb - Physical standby database
Properties:
FastStartFailoverThreshold = '30'
OperationTimeout = '30'
FastStartFailoverLagLimit = '30'
CommunicationTimeout = '180'
FastStartFailoverAutoReinstate = 'TRUE'
FastStartFailoverPmyShutdown = 'TRUE'
BystandersFollowRoleChange = 'ALL'
Fast-Start Failover: DISABLED
Configuration Status:
ORA-12543: TNS:destination host unreachable
ORA-16625: cannot reach database "maa"
DGM-17017: unable to determine configuration status
DGMGRL> SHOW DATABASE maa
Database - maa
Role: PRIMARY
Intended State: TRANSPORT-ON
Instance(s):
maa1
maa2
Database Status:
DGM-17016: failed to retrieve status for database "maa"
ORA-12543: TNS:destination host unreachable
ORA-16625: cannot reach database "maa"
DGMGRL> SHOW DATABASE maastb;
Database - maastb
Role: PHYSICAL STANDBY
Intended State: APPLY-ON
Transport Lag: 0 seconds
Apply Lag: 0 seconds
Real Time Query: OFF
Instance(s):
maastb1 (apply instance)
maastb2
Database Status:
SUCCESS
DGMGRL>
As you saw above, the primary database (maa) is down and the Broker is waiting for you to decide. To perform the failover through the Broker just type the command "FAILOVER TO <standby name>", as I will show below. To perform the failover, execute:
DGMGRL> FAILOVER TO maastb;
Performing failover NOW, please wait...
Operation requires a connection to instance "maastb1" on database "maastb"
Connecting to instance "maastb1"...
Connected.
Failover succeeded, new primary is "maastb"
DGMGRL>
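A note on the command used above: DGMGRL also accepts FAILOVER TO <standby> IMMEDIATE, which skips applying the remaining redo and can therefore lose data. It was not needed here because the environment was synchronized, but it is worth knowing the option exists (shown only as an illustration, not something run in this scenario):
DGMGRL> FAILOVER TO maastb IMMEDIATE;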
In the Broker log of instance maastb1:
05/11/2014 12:52:28
FAILOVER TO maastb
Beginning failover to database maastb
Notifying Oracle Clusterware to teardown database for FAILOVER
05/11/2014 12:53:18
Protection mode set to MAXIMUM AVAILABILITY
05/11/2014 12:53:29
Deferring associated archivelog destinations of sites permanently disabled due to Failover
Notifying Oracle Clusterware to buildup primary database after FAILOVER
Data Guard notifying Oracle Clusterware to start services and other instances change
05/11/2014 12:53:33
Posting DB_DOWN alert ...
... with reason Data Guard Failover
05/11/2014 12:53:37
Command FAILOVER TO maastb completed
In the Broker log of instance maastb2:
05/11/2014 18:52:26
FAILOVER TO maastb
Command FAILOVER TO maastb completed with warning ORA-16523
05/11/2014 18:53:18
Protection mode set to MAXIMUM AVAILABILITY
05/11/2014 18:53:33
Command FAILOVER TO maastb completed
Look closely at the outputs shown above. In the first one we have the execution of the command; note that, to proceed, the Broker connected specifically to a single instance. In the Broker logs we have the information that the instance was torn down because of the failover and that the archive destinations were "disabled".
Let us see what happened in the alert log of instance "maastb1" (correlate it with the timestamps of the previous logs). My apologies in advance, but the log below is long precisely to illustrate everything that happened:
In the alert log of the standby maastb (new primary):
Sun May 11 12:52:28 2014
Data Guard Broker: Beginning failover
Sun May 11 12:52:29 2014
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL
Sun May 11 12:52:29 2014
MRP0: Background Media Recovery cancelled with status 16037
Errors in file /u01/app/oracle/diag/rdbms/maastb/maastb1/trace/maastb1_pr00_10962.trc:
ORA-16037: user requested cancel of managed recovery operation
Sun May 11 12:52:29 2014
Managed Standby Recovery not using Real Time Apply
Recovery interrupted!
Recovered data files to a consistent state at change 8878906
Sun May 11 12:52:33 2014
MRP0: Background Media Recovery process shutdown (maastb1)
Managed Standby Recovery Canceled (maastb1)
Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
Attempt to do a Terminal Recovery (maastb1)
Media Recovery Start: Managed Standby Recovery (maastb1)
started logmerger process
Managed Standby Recovery not using Real Time Apply
Parallel Media Recovery started with 2 slaves
Sun May 11 12:52:36 2014
Media Recovery Waiting for thread 2 sequence 99 (in transit)
Sun May 11 12:52:37 2014
krsv_proc_kill: Killing 6 processes (all RFS, wait for I/O)
Begin: Standby Redo Logfile archival
End: Standby Redo Logfile archival
Terminal Recovery timestamp is '05/11/2014 12:52:40'
Terminal Recovery: applying standby redo logs.
Terminal Recovery: thread 2 seq# 99 redo required
Terminal Recovery:
Recovery of Online Redo Log: Thread 2 Group 9 Seq 99 Reading mem 0
Mem# 0: +DG01/maastb/onlinelog/group_9.260.844716095
Mem# 1: +FRA/maastb/onlinelog/group_9.591.844716097
Identified End-Of-Redo (failover) for thread 2 sequence 99 at SCN 0xffff.ffffffff
Terminal Recovery: thread 1 seq# 140 redo required
Terminal Recovery:
Recovery of Online Redo Log: Thread 1 Group 6 Seq 140 Reading mem 0
Mem# 0: +DG01/maastb/onlinelog/group_6.267.844716079
Mem# 1: +FRA/maastb/onlinelog/group_6.604.844716081
Identified End-Of-Redo (failover) for thread 1 sequence 140 at SCN 0xffff.ffffffff
Incomplete Recovery applied until change 8878917 time 05/11/2014 18:08:24
Sun May 11 12:52:41 2014
Media Recovery Complete (maastb1)
Terminal Recovery: successful completion
Sun May 11 12:52:41 2014
ARC0: Archiving not possible: error count exceeded
Sun May 11 12:52:41 2014
ARC3: Archiving not possible: error count exceeded
Forcing ARSCN to IRSCN for TR 0:8878917
Attempt to set limbo arscn 0:8878917 irscn 0:8878917
Resetting standby activation ID 723321957 (0x2b1d0465)
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance maastb1 - Archival Error
ORA-16014: log 6 sequence# 140 not archived, no available destinations
ORA-00312: online log 6 thread 1: '+DG01/maastb/onlinelog/group_6.267.844716079'
ORA-00312: online log 6 thread 1: '+FRA/maastb/onlinelog/group_6.604.844716081'
Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
ALTER DATABASE SWITCHOVER TO PRIMARY (maastb1)
Maximum wait for role transition is 15 minutes.
Backup controlfile written to trace file /u01/app/oracle/diag/rdbms/maastb/maastb1/trace/maastb1_rsm0_18329.trc
Standby terminal recovery start SCN: 8878906
RESETLOGS after incomplete recovery UNTIL CHANGE 8878917
Online log +DG01/maastb/onlinelog/group_1.257.844716051: Thread 1 Group 1 was previously cleared
Online log +FRA/maastb/onlinelog/group_1.568.844716053: Thread 1 Group 1 was previously cleared
Online log +DG01/maastb/onlinelog/group_2.268.844716057: Thread 1 Group 2 was previously cleared
Online log +FRA/maastb/onlinelog/group_2.564.844716059: Thread 1 Group 2 was previously cleared
Online log +DG01/maastb/onlinelog/group_3.269.844716063: Thread 2 Group 3 was previously cleared
Online log +FRA/maastb/onlinelog/group_3.562.844716065: Thread 2 Group 3 was previously cleared
Online log +DG01/maastb/onlinelog/group_4.270.844716067: Thread 2 Group 4 was previously cleared
Online log +FRA/maastb/onlinelog/group_4.559.844716071: Thread 2 Group 4 was previously cleared
Standby became primary SCN: 8878905
Sun May 11 12:52:45 2014
Setting recovery target incarnation to 3
Switchover: Complete - Database mounted as primary
Sun May 11 12:52:51 2014
Completed: ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY
Completed: ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY
ALTER DATABASE OPEN
Data Guard Broker initializing...
This instance was first to open
Picked broadcast on commit scheme to generate SCNs
Sun May 11 12:52:53 2014
Assigning activation ID 723379342 (0x2b1de48e)
LGWR: Primary database is in MAXIMUM AVAILABILITY mode
Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
Thread 1 advanced to log sequence 2 (thread open)
Sun May 11 12:52:54 2014
ARC3: Becoming the 'no SRL' ARCH
Thread 1 opened at log sequence 2
Sun May 11 12:52:54 2014
Current log# 2 seq# 2 mem# 0: +DG01/maastb/onlinelog/group_2.268.844716057
ARC0: Becoming the 'no SRL' ARCH
Current log# 2 seq# 2 mem# 1: +FRA/maastb/onlinelog/group_2.564.844716059
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Sun May 11 12:52:55 2014
SMON: enabling cache recovery
ARC0: Becoming the 'no SRL' ARCH
Archived Log entry 323 added for thread 1 sequence 1 ID 0x2b1de48e dest 1:
Archiver process freed from errors. No longer stopped
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Sun May 11 12:53:04 2014
***********************************************************************
Fatal NI connect error 12543, connecting to:
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac11pri-scan.tjsc.jus.br)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=maa)(CID=(PROGRAM=oracle)(HOST=rac11stb01.tjsc.jus.br)(USER=oracle))))
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
Time: 11-MAY-2014 12:53:04
Tracing not turned on.
Tns error struct:
ns main err code: 12543
TNS-12543: TNS:destination host unreachable
ns secondary err code: 12560
nt main err code: 513
TNS-00513: Destination host unreachable
nt secondary err code: 113
nt OS err code: 0
Sun May 11 12:53:06 2014
[18329] Successfully onlined Undo Tablespace 2.
Undo initialization finished serial:0 start:3723812044 end:3723813584 diff:1540 (15 seconds)
Verifying file header compatibility for 11g tablespace encryption..
Verifying 11g file header compatibility for tablespace encryption completed
Sun May 11 12:53:06 2014
SMON: enabling tx recovery
Database Characterset is WE8MSWIN1252
Sun May 11 12:53:07 2014
minact-scn: Inst 1 is a slave inc#:4 mmon proc-id:10424 status:0x2
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000
No Resource Manager plan active
***********************************************************************
Fatal NI connect error 12543, connecting to:
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac11pri-scan.tjsc.jus.br)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=maa)(CID=(PROGRAM=oracle)(HOST=rac11stb01.tjsc.jus.br)(USER=oracle))))
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
Time: 11-MAY-2014 12:53:13
Tracing not turned on.
Tns error struct:
ns main err code: 12543
TNS-12543: TNS:destination host unreachable
ns secondary err code: 12560
nt main err code: 513
TNS-00513: Destination host unreachable
nt secondary err code: 113
nt OS err code: 0
Starting background process GTX0
Sun May 11 12:53:14 2014
GTX0 started with pid=40, OS id=10940
Starting background process RCBG
Sun May 11 12:53:14 2014
RCBG started with pid=41, OS id=10942
replication_dependency_tracking turned off (no async multimaster replication found)
Sun May 11 12:53:16 2014
Starting background process QMNC
Sun May 11 12:53:16 2014
QMNC started with pid=42, OS id=10944
Completed: ALTER DATABASE OPEN
ALTER SYSTEM SET log_archive_trace=0 SCOPE=BOTH SID='maastb1';
ALTER SYSTEM SET log_archive_format='arch_%t_%s_%r.arc' SCOPE=SPFILE SID='maastb1';
Sun May 11 12:53:22 2014
Starting background process SMCO
Sun May 11 12:53:22 2014
SMCO started with pid=44, OS id=10957
Sun May 11 12:53:22 2014
***********************************************************************
Fatal NI connect error 12543, connecting to:
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac11pri-scan.tjsc.jus.br)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=maa)(CID=(PROGRAM=oracle)(HOST=rac11stb01.tjsc.jus.br)(USER=oracle))))
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
Time: 11-MAY-2014 12:53:22
Tracing not turned on.
Tns error struct:
ns main err code: 12543
TNS-12543: TNS:destination host unreachable
ns secondary err code: 12560
nt main err code: 513
TNS-00513: Destination host unreachable
nt secondary err code: 113
nt OS err code: 0
Error 12543 received logging on to the standby
PING[ARC2]: Heartbeat failed to connect to standby 'maa'. Error is 12543.
Sun May 11 12:53:22 2014
ARC1: STARTING ARCH PROCESSES
Sun May 11 12:53:22 2014
ARC0: Becoming the 'no SRL' ARCH
Sun May 11 12:53:22 2014
ARC4 started with pid=48, OS id=10961
ARC4: Archival started
ARC1: STARTING ARCH PROCESSES COMPLETE
ARC1: Becoming the 'no SRL' ARCH
Sun May 11 12:53:26 2014
ALTER SYSTEM SET standby_file_management='AUTO' SCOPE=BOTH SID='*';
ALTER SYSTEM SET archive_lag_target=0 SCOPE=BOTH SID='*';
ALTER SYSTEM SET log_archive_max_processes=4 SCOPE=BOTH SID='*';
ALTER SYSTEM SET log_archive_min_succeed_dest=1 SCOPE=BOTH SID='*';
ALTER SYSTEM SET db_file_name_convert='+DATA/maa','+DG01/maastb','+FRA/maa','+FRA/maastb' SCOPE=SPFILE;
ALTER SYSTEM SET log_file_name_convert='+DATA/maa','+DG01/maastb','+FRA/maa','+FRA/maastb' SCOPE=SPFILE;
Sun May 11 12:53:29 2014
ALTER SYSTEM SET log_archive_dest_state_2='RESET' SCOPE=BOTH;
Sun May 11 12:53:31 2014
Starting background process CJQ0
Sun May 11 12:53:31 2014
CJQ0 started with pid=52, OS id=10995
Failover succeeded. Primary database is now maastb.
Sun May 11 12:53:33 2014
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Sun May 11 12:53:33 2014
Starting background process VKRM
Sun May 11 12:53:33 2014
VKRM started with pid=60, OS id=11000
Setting Resource Manager plan SCHEDULER[0x318F]:DEFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Sun May 11 12:53:38 2014
Active Session History (ASH) performed an emergency flush. This may mean that ASH is undersized. If emergency flushes are a recurring issue, you may consider increasing ASH size by setting the value of _ASH_SIZE to a sufficiently large value. Currently, ASH size is 4194304 bytes. Both ASH size and the total number of emergency flushes since instance startup can be monitored by running the following query:
select total_size,awr_flush_emergency_count from v$ash_info;
Analyzing calmly everything we have above, we see that the first thing the Broker did was cancel the apply of redo coming from the old primary. We can also see that the standby became primary at SCN# 8878905 and marked an End-Of-Redo on both threads. Then everything was committed so the database could take over as the primary. In addition, the database opened in MAXIMUM AVAILABILITY and immediately detected that log_archive_dest_2 is unsynchronized (hence the TNS failures that appear). I did not show the alert log of the second instance of the "maastb" database above, but the information in it is simple (since everything was done by "maastb1").
If you did not have the Broker you would have to do all of this manually. Every command, from the recovery cancel up to the failover, would be manual, including the CRS adjustments. To verify how the environment ended up we can run in the Broker:
DGMGRL> SHOW CONFIGURATION VERBOSE;
Configuration - dgrac
Protection Mode: MaxAvailability
Databases:
maastb - Primary database
Warning: ORA-16629: database reports a different protection level from the protection mode
maa - Physical standby database (disabled)
ORA-16661: the standby database needs to be reinstated
Properties:
FastStartFailoverThreshold = '30'
OperationTimeout = '30'
FastStartFailoverLagLimit = '30'
CommunicationTimeout = '180'
FastStartFailoverAutoReinstate = 'TRUE'
FastStartFailoverPmyShutdown = 'TRUE'
BystandersFollowRoleChange = 'ALL'
Fast-Start Failover: DISABLED
Configuration Status:
WARNING
DGMGRL> SHOW DATABASE VERBOSE maastb;
Database - maastb
Role: PRIMARY
Intended State: TRANSPORT-ON
Instance(s):
maastb1
maastb2
Database Warning(s):
ORA-16629: database reports a different protection level from the protection mode
Properties:
DGConnectIdentifier = 'maastb'
ObserverConnectIdentifier = ''
LogXptMode = 'SYNC'
DelayMins = '0'
Binding = 'OPTIONAL'
MaxFailure = '0'
MaxConnections = '1'
ReopenSecs = '300'
NetTimeout = '30'
RedoCompression = 'DISABLE'
LogShipping = 'ON'
PreferredApplyInstance = ''
ApplyInstanceTimeout = '0'
ApplyParallel = 'AUTO'
StandbyFileManagement = 'AUTO'
ArchiveLagTarget = '0'
LogArchiveMaxProcesses = '4'
LogArchiveMinSucceedDest = '1'
DbFileNameConvert = '+DATA/maa, +DG01/maastb, +FRA/maa, +FRA/maastb'
LogFileNameConvert = '+DATA/maa, +DG01/maastb, +FRA/maa, +FRA/maastb'
FastStartFailoverTarget = ''
InconsistentProperties = '(monitor)'
InconsistentLogXptProps = '(monitor)'
SendQEntries = '(monitor)'
LogXptStatus = '(monitor)'
RecvQEntries = '(monitor)'
SidName(*)
StaticConnectIdentifier(*)
StandbyArchiveLocation(*)
AlternateLocation(*)
LogArchiveTrace(*)
LogArchiveFormat(*)
TopWaitEvents(*)
(*) - Please check specific instance for the property value
Database Status:
WARNING
DGMGRL>
Notice above that we have a few details to investigate. We have a nice "ORA-16629: database reports a different protection level from the protection mode", which occurs because the new standby is not available for archive synchronization (I will demonstrate this below). We can also see that "maastb" is the new primary.
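You can see the same mismatch straight from SQL on the new primary (a quick sketch; while the new standby is unreachable, protection_level typically drops to RESYNCHRONIZATION even though protection_mode remains MAXIMUM AVAILABILITY):
SQL> SELECT protection_mode, protection_level FROM v$database;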
If we check the old primary we will see important information:
DGMGRL> SHOW DATABASE VERBOSE maa;
Database - maa
Role: PHYSICAL STANDBY
Intended State: APPLY-ON
Transport Lag: (unknown)
Apply Lag: (unknown)
Real Time Query: OFF
Instance(s):
maa1
maa2
Properties:
DGConnectIdentifier = 'maa'
ObserverConnectIdentifier = ''
LogXptMode = 'SYNC'
DelayMins = '0'
Binding = 'optional'
MaxFailure = '0'
MaxConnections = '1'
ReopenSecs = '300'
NetTimeout = '30'
RedoCompression = 'DISABLE'
LogShipping = 'ON'
PreferredApplyInstance = ''
ApplyInstanceTimeout = '0'
ApplyParallel = 'AUTO'
StandbyFileManagement = 'AUTO'
ArchiveLagTarget = '0'
LogArchiveMaxProcesses = '4'
LogArchiveMinSucceedDest = '1'
DbFileNameConvert = '+DG01/maastb, +DATA/maa, +FRA/maastb, +FRA/maa'
LogFileNameConvert = '+DG01/maastb, +DATA/maa, +FRA/maastb, +FRA/maa'
FastStartFailoverTarget = ''
InconsistentProperties = '(monitor)'
InconsistentLogXptProps = '(monitor)'
SendQEntries = '(monitor)'
LogXptStatus = '(monitor)'
RecvQEntries = '(monitor)'
SidName(*)
StaticConnectIdentifier(*)
StandbyArchiveLocation(*)
AlternateLocation(*)
LogArchiveTrace(*)
LogArchiveFormat(*)
TopWaitEvents(*)
(*) - Please check specific instance for the property value
Database Status:
ORA-16661: the standby database needs to be reinstated
DGMGRL>
The main detail here is the information that it needs a reinstate: "ORA-16661: the standby database needs to be reinstated". Once you bring your old primary back, you will have to reinstate it.
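Keep in mind that the Broker can only reinstate the old primary if Flashback Database was enabled on it before the failure; otherwise the old primary has to be re-created from a backup. A quick check, once the old primary is mounted again, is:
SQL> SELECT flashback_on FROM v$database;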
CONTROL 2
So far everything has been nice and tidy: we performed the failover through the Broker and it took care of everything for us. Of course, all of this happened because we kept the environment healthy, the synchronization between primary and standby was correct and no information was lost. Let us look at the test table we used at the beginning of the article:
SQL> SELECT instance_name, status FROM gv$instance;
INSTANCE_NAME STATUS
---------------- ------------
maastb2 OPEN
maastb1 OPEN
SQL> SELECT c1, TO_CHAR(c2, 'DD/MM/RRRR HH24:MI:SS') AS momento FROM testedg;
C1 MOMENTO
---------- -------------------
1 11/05/2014 12:01:07
2 11/05/2014 12:02:17
SQL>
REINSTATE
I warn you in advance that the next logs are long. In any case they are packed with technical details that deserve attention and a more careful look.
If you leave everything the way it is now, you have the "maastb" database acting as primary and waiting for the "maa" database to take its place as standby. In the Broker log of the primary instances you will see the following information (as well as in the dgmgrl outputs I showed above):
05/11/2014 19:40:53
Error: The actual protection level 'RESYNCHRONIZATION' is different from the configured protection mode 'MAXIMUM AVAILABILITY'.
Data Guard Broker Status Summary:
Type Name Severity Status
Configuration dgrac Warning ORA-16608
Primary Database maastb Warning ORA-16629
Physical Standby Database maa Error ORA-16661
Besides that, you must be asking yourself: what happens if my old primary comes back and becomes available? Will it cause a problem? Hmm… no.
The simple fact that we are using the Broker already helps us tremendously. When the old primary comes back it will need a reinstate as soon as it returns. Look below at the alert log of the old primary as soon as it was powered back on (this log was "trimmed", but its interesting parts were kept):
Mon May 12 05:14:46 2014
Starting ORACLE instance (normal)
****************** Large Pages Information *****************
Total Shared Global Region in Large Pages = 0 KB (0%)
Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 0 (0 KB) (alloc incr 4096 KB)
Large Pages configured system wide = 0 (0 KB)
Large Page size = 2048 KB
...
...
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning and Real Application Clusters options.
ORACLE_HOME = /u01/app/oracle/product/11.2.0.3/db_1
System name: Linux
Node name: rac11pri01.tjsc.jus.br
Release: 2.6.39-400.17.1.el6uek.x86_64
Version: #1 SMP Fri Feb 22 18:16:18 PST 2013
Machine: x86_64
VM name: VMWare Version: 6
...
...
fal_server = "MAASTB"
log_archive_trace = 0
log_archive_config = "DG_CONFIG=(maa,maastb)"
...
...
DBW0 started with pid=17, OS id=3509
Mon May 12 05:14:51 2014
LGWR started with pid=18, OS id=3516
Mon May 12 05:14:51 2014
CKPT started with pid=19, OS id=3519
Mon May 12 05:14:51 2014
SMON started with pid=20, OS id=3521
Mon May 12 05:14:51 2014
RECO started with pid=21, OS id=3523
...
...
Reconfiguration complete
Mon May 12 05:14:52 2014
LCK0 started with pid=28, OS id=3558
Starting background process RSMN
Mon May 12 05:14:52 2014
RSMN started with pid=29, OS id=3562
ORACLE_BASE not set in environment. It is recommended
that ORACLE_BASE be set in the environment
Reusing ORACLE_BASE from an earlier startup = /u01/app/oracle
Mon May 12 05:14:53 2014
DMON started with pid=30, OS id=3565
Mon May 12 05:14:55 2014
ALTER SYSTEM SET local_listener='(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.17.42.14)(PORT=1521))))' SCOPE=MEMORY SID='maa1';
ALTER DATABASE MOUNT /* db agent *//* {1:51382:2} */
This instance was first to mount
NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so
NOTE: Loaded library: System
...
...
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Lost write protection disabled
Completed: ALTER DATABASE MOUNT /* db agent *//* {1:51382:2} */
ALTER DATABASE OPEN /* db agent *//* {1:51382:2} */
Data Guard Broker initializing...
Data Guard Broker initialization complete
Data Guard: verifying database primary role...
Mon May 12 05:15:06 2014
Starting Data Guard Broker (DMON)
Mon May 12 05:15:06 2014
INSV started with pid=35, OS id=3657
Mon May 12 05:15:10 2014
NSV1 started with pid=36, OS id=3669
Mon May 12 05:15:16 2014
Data Guard: version check completed
Data Guard determines a failover has occurred - this is no longer a primary database
ORA-16649 signalled during: ALTER DATABASE OPEN /* db agent *//* {1:51382:2} */...
Notice above that during startup it tried to open the database, but right after the Broker process started it detected the ORA-16649 while "verifying database primary role". It detected that "this is no longer a primary database" and left the database in mount mode. In the Broker log of the old primary we have:
05/12/2014 05:15:06
>> Starting Data Guard Broker bootstrap <<
Broker Configuration File Locations:
dg_broker_config_file1 = "+DATA/maa/dr1maa.dat"
dg_broker_config_file2 = "+FRA/maa/dr2maa.dat"
05/12/2014 05:15:10
DMON Registering service maa_DGB with listener(s)
Broker Configuration: "dgrac"
Protection Mode: Maximum Availability
Fast-Start Failover (FSFO): Disabled, flags=0x0, version=0
Primary Database: maa (0x01010000)
Standby Database: maastb, Enabled Physical Standby (0x02010000)
05/12/2014 05:15:16
Site maastb returned ORA-16623.
Version Check Results:
Database maastb returned ORA-16623
Disable redo transport to all standby databases
RSM0 failed to remove all destinations, error ORA-16718
Notice that we also have an "ORA-16623" and an "ORA-16718". This simply tells us that the current database cannot open and needs a reinstate to change roles. Another interesting detail I ask you to observe is who signaled the ORA-16623: if you look closely, it was the new primary that did it, and the error tells us that a role change took place.
To fix everything we need to reinstate the old primary so it becomes the new standby. To do that, run the following command in dgmgrl (I am connecting through an instance of the standby):
DGMGRL> REINSTATE DATABASE maa;
Reinstating database "maa", please wait...
Operation requires shutdown of instance "maa1" on database "maa"
Shutting down instance "maa1"...
ORA-01109: database not open
Database dismounted.
ORACLE instance shut down.
Operation requires startup of instance "maa1" on database "maa"
Starting instance "maa1"...
ORACLE instance started.
Database mounted.
Continuing to reinstate database "maa" ...
Reinstatement of database "maa" succeeded
DGMGRL>
Notice above that, when the database reinstate was invoked through the Broker, it connected to the old primary (maa), performed the shutdown and brought the instance back up in mount mode. While the command above was being executed, the following information appeared in the Broker log of the primary instance:
Broker log of instance maastb1:
05/11/2014 20:11:54
REINSTATE DATABASE maa
Database maa can be reinstated
05/11/2014 20:13:02
Command REINSTATE DATABASE maa completed with warning ORA-16570
05/11/2014 20:13:46
REINSTATE DATABASE maa
Database maa can be reinstated
05/11/2014 20:14:06
Successfully completed reinstatement of database 0x01001000; removing ReinstateContextArray property
05/11/2014 20:14:47
Command REINSTATE DATABASE maa completed
EDIT DATABASE maa SET PROPERTY ActualApplyInstance = maa1
Apply Instance for database maa is maa1
Command EDIT DATABASE maa SET PROPERTY ActualApplyInstance = maa1 completed
The information is practically the same as what appeared in dgmgrl. Things get more interesting in the alert logs of the primary instances; see below:
Alert log of instance maastb1:
Sun May 11 20:12:28 2014
ALTER SYSTEM SET log_archive_dest_2='service="maa"','LGWR SYNC AFFIRM delay=0 optional compression=disable max_failure=0 max_connections=1 reopen=1 db_unique_name="maa" net_timeout=30','valid_for=(all_logfiles,primary_role)' SCOPE=BOTH;
ALTER SYSTEM SET log_archive_dest_state_2='ENABLE' SCOPE=BOTH;
Sun May 11 20:12:30 2014
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2
******************************************************************
LGWR: Error 16009 disconnecting from destination LOG_ARCHIVE_DEST_2 standby host 'maa'
Error 16009 for archive log file 1 to 'maa'
LGWR: Failed to archive log 1 thread 1 sequence 3 (16009)
Thread 1 advanced to log sequence 3 (LGWR switch)
Current log# 1 seq# 3 mem# 0: +DG01/maastb/onlinelog/group_1.257.844716051
Current log# 1 seq# 3 mem# 1: +FRA/maastb/onlinelog/group_1.568.844716053
Sun May 11 20:12:42 2014
Archived Log entry 328 added for thread 1 sequence 2 ID 0x2b1de48e dest 1:
Sun May 11 20:12:51 2014
ALTER SYSTEM SET log_archive_dest_state_2='RESET' SCOPE=BOTH;
Sun May 11 20:13:48 2014
***********************************************************************
Fatal NI connect error 12514, connecting to:
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac11pri-scan.tjsc.jus.br)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=maa_DGB)(CID=(PROGRAM=oracle)(HOST=rac11stb01.tjsc.jus.br)(USER=oracle))))
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
Time: 11-MAY-2014 20:13:48
Tracing not turned on.
Tns error struct:
ns main err code: 12564
TNS-12564: TNS:connection refused
ns secondary err code: 0
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
***********************************************************************
Fatal NI connect error 12514, connecting to:
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac11pri-scan.tjsc.jus.br)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=maa_DGB)(CID=(PROGRAM=oracle)(HOST=rac11stb01.tjsc.jus.br)(USER=oracle))))
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
Time: 11-MAY-2014 20:13:48
Tracing not turned on.
Tns error struct:
ns main err code: 12564
TNS-12564: TNS:connection refused
ns secondary err code: 0
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
***********************************************************************
Fatal NI connect error 12514, connecting to:
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac11pri-scan.tjsc.jus.br)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=maa_DGB)(CID=(PROGRAM=oracle)(HOST=rac11stb01.tjsc.jus.br)(USER=oracle))))
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
Time: 11-MAY-2014 20:13:48
Tracing not turned on.
Tns error struct:
ns main err code: 12564
TNS-12564: TNS:connection refused
ns secondary err code: 0
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Sun May 11 20:13:53 2014
ALTER SYSTEM SET log_archive_dest_state_2='ENABLE' SCOPE=BOTH;
Sun May 11 20:13:54 2014
Thread 1 cannot allocate new log, sequence 4
Checkpoint not complete
Current log# 1 seq# 3 mem# 0: +DG01/maastb/onlinelog/group_1.257.844716051
Current log# 1 seq# 3 mem# 1: +FRA/maastb/onlinelog/group_1.568.844716053
LGWR: Standby redo logfile selected to archive thread 1 sequence 4
LGWR: Standby redo logfile selected for thread 1 sequence 4 for destination LOG_ARCHIVE_DEST_2
Thread 1 advanced to log sequence 4 (LGWR switch)
Current log# 2 seq# 4 mem# 0: +DG01/maastb/onlinelog/group_2.268.844716057
Current log# 2 seq# 4 mem# 1: +FRA/maastb/onlinelog/group_2.564.844716059
Sun May 11 20:13:59 2014
Archived Log entry 330 added for thread 1 sequence 3 ID 0x2b1de48e dest 1:
ARC0: Standby redo logfile selected for thread 1 sequence 3 for destination LOG_ARCHIVE_DEST_2
Sun May 11 20:14:07 2014
ALTER SYSTEM SET log_archive_dest_2='service="maa"','LGWR SYNC AFFIRM delay=0 optional compression=disable max_failure=0 max_connections=1 reopen=300 db_unique_name="maa" net_timeout=30','valid_for=(all_logfiles,primary_role)' SCOPE=BOTH;
ALTER SYSTEM ARCHIVE LOG
Sun May 11 20:14:07 2014
Thread 1 cannot allocate new log, sequence 5
Checkpoint not complete
Current log# 2 seq# 4 mem# 0: +DG01/maastb/onlinelog/group_2.268.844716057
Current log# 2 seq# 4 mem# 1: +FRA/maastb/onlinelog/group_2.564.844716059
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2
******************************************************************
LGWR: Standby redo logfile selected to archive thread 1 sequence 5
LGWR: Standby redo logfile selected for thread 1 sequence 5 for destination LOG_ARCHIVE_DEST_2
Thread 1 advanced to log sequence 5 (LGWR switch)
Current log# 1 seq# 5 mem# 0: +DG01/maastb/onlinelog/group_1.257.844716051
Current log# 1 seq# 5 mem# 1: +FRA/maastb/onlinelog/group_1.568.844716053
Archived Log entry 339 added for thread 1 sequence 4 ID 0x2b1de48e dest 1:
Archived Log entry 342 added for thread 2 sequence 4 ID 0x2b1de48e dest 1:
Sun May 11 20:15:14 2014
Destination LOG_ARCHIVE_DEST_2 is SYNCHRONIZED
LGWR: Standby redo logfile selected to archive thread 1 sequence 6
LGWR: Standby redo logfile selected for thread 1 sequence 6 for destination LOG_ARCHIVE_DEST_2
Thread 1 advanced to log sequence 6 (LGWR switch)
Current log# 2 seq# 6 mem# 0: +DG01/maastb/onlinelog/group_2.268.844716057
Current log# 2 seq# 6 mem# 1: +FRA/maastb/onlinelog/group_2.564.844716059
Sun May 11 20:15:15 2014
Archived Log entry 344 added for thread 1 sequence 5 ID 0x2b1de48e dest 1:
Alert log of instance maastb2:
Mon May 12 02:12:33 2014
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2
******************************************************************
LGWR: Error 16009 disconnecting from destination LOG_ARCHIVE_DEST_2 standby host 'maa'
Error 16009 for archive log file 3 to 'maa'
LGWR: Failed to archive log 3 thread 2 sequence 3 (16009)
Thread 2 advanced to log sequence 3 (LGWR switch)
Current log# 3 seq# 3 mem# 0: +DG01/maastb/onlinelog/group_3.269.844716063
Current log# 3 seq# 3 mem# 1: +FRA/maastb/onlinelog/group_3.562.844716065
Mon May 12 02:12:36 2014
Deleted Oracle managed file +FRA/maastb/flashback/log_10.540.846646525
Archived Log entry 327 added for thread 2 sequence 2 ID 0x2b1de48e dest 1:
Mon May 12 02:13:56 2014
LGWR: Standby redo logfile selected to archive thread 2 sequence 4
LGWR: Standby redo logfile selected for thread 2 sequence 4 for destination LOG_ARCHIVE_DEST_2
Thread 2 advanced to log sequence 4 (LGWR switch)
Current log# 4 seq# 4 mem# 0: +DG01/maastb/onlinelog/group_4.270.844716067
Current log# 4 seq# 4 mem# 1: +FRA/maastb/onlinelog/group_4.559.844716071
Mon May 12 02:13:59 2014
Archived Log entry 329 added for thread 2 sequence 3 ID 0x2b1de48e dest 1:
ARC3: Standby redo logfile selected for thread 2 sequence 3 for destination LOG_ARCHIVE_DEST_2
Mon May 12 02:14:09 2014
Thread 2 cannot allocate new log, sequence 5
Checkpoint not complete
Current log# 4 seq# 4 mem# 0: +DG01/maastb/onlinelog/group_4.270.844716067
Current log# 4 seq# 4 mem# 1: +FRA/maastb/onlinelog/group_4.559.844716071
Mon May 12 02:14:10 2014
FAL[server, ARC0]: FAL archive failed, see trace file.
Mon May 12 02:14:12 2014
ARC3: Archive log rejected (thread 1 sequence 1) at host 'maa'
FAL[server, ARC3]: FAL archive failed, see trace file.
ARCH: FAL archive failed. Archiver continuing
ORACLE Instance maastb2 - Archival Error. Archiver continuing.
ARC0: Archive log rejected (thread 2 sequence 2) at host 'maa'
FAL[server, ARC0]: FAL archive failed, see trace file.
ARCH: FAL archive failed. Archiver continuing
ORACLE Instance maastb2 - Archival Error. Archiver continuing.
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2
******************************************************************
LGWR: Standby redo logfile selected to archive thread 2 sequence 5
LGWR: Standby redo logfile selected for thread 2 sequence 5 for destination LOG_ARCHIVE_DEST_2
Thread 2 advanced to log sequence 5 (LGWR switch)
Current log# 3 seq# 5 mem# 0: +DG01/maastb/onlinelog/group_3.269.844716063
Current log# 3 seq# 5 mem# 1: +FRA/maastb/onlinelog/group_3.562.844716065
Mon May 12 02:15:14 2014
Thread 2 cannot allocate new log, sequence 6
Checkpoint not complete
Current log# 3 seq# 5 mem# 0: +DG01/maastb/onlinelog/group_3.269.844716063
Current log# 3 seq# 5 mem# 1: +FRA/maastb/onlinelog/group_3.562.844716065
Destination LOG_ARCHIVE_DEST_2 is SYNCHRONIZED
LGWR: Standby redo logfile selected to archive thread 2 sequence 6
LGWR: Standby redo logfile selected for thread 2 sequence 6 for destination LOG_ARCHIVE_DEST_2
Thread 2 advanced to log sequence 6 (LGWR switch)
Current log# 4 seq# 6 mem# 0: +DG01/maastb/onlinelog/group_4.270.844716067
Current log# 4 seq# 6 mem# 1: +FRA/maastb/onlinelog/group_4.559.844716071
Mon May 12 02:15:19 2014
Archived Log entry 346 added for thread 2 sequence 5 ID 0x2b1de48e dest 1:
If you look above you will see that log_archive_dest_2 was "re-enabled" and started synchronizing with its destination. Notice that it kept the same settings we had defined since the beginning of the DG configuration.
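If you prefer to confirm this from SQL on the new primary, a query along these lines shows whether the destination is synchronized again (a sketch; dest_id 2 is the remote destination used throughout this series):
SQL> SELECT dest_id, status, synchronization_status, synchronized FROM v$archive_dest_status WHERE dest_id = 2;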
The most interesting logs are on the old primary (and new standby); first look at the Broker log below:
Broker log on maa1:
05/12/2014 05:30:07
Data Guard notifying Oracle Clusterware of database role change
05/12/2014 05:30:29
Creating process RSM0
05/12/2014 05:30:47
Physical RSM: Reinstatement... Converting old primary control file to physical standby control file
05/12/2014 05:30:52
Purging diverged redos on resetlogs branch 844762805, starting SCN 8878906
Purged 0 archived logs
05/12/2014 05:31:11
Data Guard Broker shutting down
RSM0 successfully terminated
05/12/2014 05:31:14
>> DMON Process Shutdown <<
05/12/2014 05:31:43
>> Starting Data Guard Broker bootstrap <<
Broker Configuration File Locations:
dg_broker_config_file1 = "+DATA/maa/dr1maa.dat"
dg_broker_config_file2 = "+FRA/maa/dr2maa.dat"
05/12/2014 05:31:47
Database needs to be reinstated or re-created, Data Guard broker ready
05/12/2014 05:31:49
DMON Registering service maa_DGB with listener(s)
05/12/2014 05:31:54
Apply Instance for Database maa set to maa1
Data Guard notifying Oracle Clusterware of database role change
Creating process RSM0
05/12/2014 05:32:23
Notifying Oracle Clusterware to buildup after REINSTATEMENT
Data Guard notifying Oracle Clusterware to start services and other instances change
05/12/2014 05:32:26
Command REINSTATE DATABASE maa completed
05/12/2014 05:32:50
Command EDIT DATABASE maa SET PROPERTY ActualApplyInstance = maa1 completed
If we calmly analyze the log above we will see that it detected which SCN was needed for the reinstate, performed the reinstate and also fixed the CRS. In the alert log of the old primary instance we have:
Alert log of maa1:
Mon May 12 05:30:29 2014
RSM0 started with pid=39, OS id=3988
Mon May 12 05:30:32 2014
Using STANDBY_ARCHIVE_DEST parameter default value as USE_DB_RECOVERY_FILE_DEST
RFS[1]: Assigned to RFS process 3990
RFS[1]: Database mount ID mismatch [0x2b1de48e:0x2b44d48f] (723379342:725931151)
FLASHBACK DATABASE TO SCN 8878905
Mon May 12 05:30:35 2014
RFS[2]: Assigned to RFS process 3993
RFS[2]: Database mount ID mismatch [0x2b1de48e:0x2b44d48f] (723379342:725931151)
Flashback Restore Start
Flashback Restore Complete
Flashback Media Recovery Start
started logmerger process
Parallel Media Recovery started with 2 slaves
Mon May 12 05:30:38 2014
Recovery of Online Redo Log: Thread 2 Group 3 Seq 99 Reading mem 0
Mem# 0: +DATA/maa/onlinelog/group_3.257.843489101
Mem# 1: +FRA/maa/onlinelog/group_3.284.843489101
Recovery of Online Redo Log: Thread 1 Group 1 Seq 139 Reading mem 0
Mem# 0: +DATA/maa/onlinelog/group_1.272.843488553
Mem# 1: +FRA/maa/onlinelog/group_1.286.843488555
Recovery of Online Redo Log: Thread 1 Group 2 Seq 140 Reading mem 0
Mem# 0: +DATA/maa/onlinelog/group_2.271.843488555
Mem# 1: +FRA/maa/onlinelog/group_2.285.843488555
Incomplete Recovery applied until change 8878906 time 05/11/2014 18:08:14
Mon May 12 05:30:47 2014
Flashback Media Recovery Complete
Completed: FLASHBACK DATABASE TO SCN 8878905
alter database convert to physical standby
ALTER DATABASE CONVERT TO PHYSICAL STANDBY (maa1)
Flush standby redo logfile failed:1649
Clearing standby activation ID 723321957 (0x2b1d0465)
The primary database controlfile was created using the
'MAXLOGFILES 192' clause.
There is space for up to 188 standby redo logfiles
Use the following SQL commands on the standby database to create
standby redo logfiles that match the primary database:
ALTER DATABASE ADD STANDBY LOGFILE 'srl1.f' SIZE 52428800;
ALTER DATABASE ADD STANDBY LOGFILE 'srl2.f' SIZE 52428800;
ALTER DATABASE ADD STANDBY LOGFILE 'srl3.f' SIZE 52428800;
ALTER DATABASE ADD STANDBY LOGFILE 'srl4.f' SIZE 52428800;
ALTER DATABASE ADD STANDBY LOGFILE 'srl5.f' SIZE 52428800;
Waiting for all non-current ORLs to be archived...
All non-current ORLs have been archived.
Clearing online redo logfile 1 +DATA/maa/onlinelog/group_1.272.843488553
Clearing online log 1 of thread 1 sequence number 139
Clearing online redo logfile 1 complete
Clearing online redo logfile 2 +DATA/maa/onlinelog/group_2.271.843488555
Clearing online log 2 of thread 1 sequence number 140
Clearing online redo logfile 2 complete
Clearing online redo logfile 3 +DATA/maa/onlinelog/group_3.257.843489101
Clearing online log 3 of thread 2 sequence number 99
Clearing online redo logfile 3 complete
Clearing online redo logfile 4 +DATA/maa/onlinelog/group_4.262.843489103
Clearing online log 4 of thread 2 sequence number 98
Clearing online redo logfile 4 complete
Completed: alter database convert to physical standby
Mon May 12 05:31:05 2014
Shutting down instance (immediate)
Shutting down instance: further logons disabled
Stopping background process MMNL
Stopping background process MMON
License high water mark = 6
alter database CLOSE NORMAL
ORA-1109 signalled during: alter database CLOSE NORMAL...
alter database DISMOUNT
Shutting down archive processes
Archiving is disabled
...
...
Shutting down Data Guard Broker processes
Mon May 12 05:31:14 2014
Completed: Data Guard Broker shutdown
NOTE: force a map free for map id 2
Mon May 12 05:31:15 2014
ARCH: Archival disabled due to shutdown: 1089
Shutting down archive processes
Archiving is disabled
Mon May 12 05:31:15 2014
Stopping background process VKTM
...
...
Instance shutdown complete
Mon May 12 05:31:23 2014
Starting ORACLE instance (normal)
****************** Large Pages Information *****************
Total Shared Global Region in Large Pages = 0 KB (0%)
...
...
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning and Real Application Clusters options.
...
...
DMON started with pid=30, OS id=4114
Mon May 12 05:31:31 2014
alter database mount
This instance was first to mount
Mon May 12 05:31:31 2014
ALTER SYSTEM SET local_listener='(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.17.42.14)(PORT=1521))))' SCOPE=MEMORY SID='maa1';
NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so
NOTE: Loaded library: System
SUCCESS: diskgroup DATA was mounted
SUCCESS: diskgroup FRA was mounted
NOTE: dependency between database maa and diskgroup resource ora.DATA.dg is established
NOTE: dependency between database maa and diskgroup resource ora.FRA.dg is established
Mon May 12 05:31:37 2014
ARCH: STARTING ARCH PROCESSES
NSS2 started with pid=33, OS id=4147
Mon May 12 05:31:37 2014
ARC0 started with pid=34, OS id=4149
ARC0: Archival started
...
...
ARC2: Becoming the 'no FAL' ARCH
ARC2: Becoming the 'no SRL' ARCH
ARC1: Becoming the heartbeat ARCH
Physical Standby Database mounted.
Lost write protection disabled
ARC1: Becoming the active heartbeat ARCH
ARC1: Becoming the active heartbeat ARCH
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
Completed: alter database mount
Mon May 12 05:31:43 2014
Starting Data Guard Broker (DMON)
Mon May 12 05:31:43 2014
...
...
Using STANDBY_ARCHIVE_DEST parameter default value as USE_DB_RECOVERY_FILE_DEST
Primary database is in MAXIMUM AVAILABILITY mode
Changing standby controlfile to RESYNCHRONIZATION level
Standby controlfile consistent with primary
RFS[1]: Assigned to RFS process 4190
RFS[1]: Selected log 8 for thread 2 sequence 4 dbid 722024964 branch 847284763
Mon May 12 05:31:59 2014
Primary database is in MAXIMUM AVAILABILITY mode
Standby controlfile consistent with primary
Standby controlfile consistent with primary
RFS[2]: Assigned to RFS process 4192
RFS[2]: Selected log 5 for thread 1 sequence 4 dbid 722024964 branch 847284763
ALTER SYSTEM SET log_archive_trace=0 SCOPE=BOTH SID='maa1';
ALTER SYSTEM SET log_archive_format='arch_%t_%s_%r.arc' SCOPE=SPFILE SID='maa1';
ALTER SYSTEM SET standby_file_management='AUTO' SCOPE=BOTH SID='*';
ALTER SYSTEM SET archive_lag_target=0 SCOPE=BOTH SID='*';
ALTER SYSTEM SET log_archive_max_processes=4 SCOPE=BOTH SID='*';
ALTER SYSTEM SET log_archive_min_succeed_dest=1 SCOPE=BOTH SID='*';
ALTER SYSTEM SET db_file_name_convert='+DG01/maastb','+DATA/maa','+FRA/maastb','+FRA/maa' SCOPE=SPFILE;
ALTER SYSTEM SET log_file_name_convert='+DG01/maastb','+DATA/maa','+FRA/maastb','+FRA/maa' SCOPE=SPFILE;
ALTER SYSTEM SET fal_server='maastb' SCOPE=BOTH;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE THROUGH ALL SWITCHOVER DISCONNECT USING CURRENT LOGFILE
Attempt to start background Managed Standby Recovery process (maa1)
Mon May 12 05:32:01 2014
MRP0 started with pid=45, OS id=4198
MRP0: Background Managed Standby Recovery process started (maa1)
Mon May 12 05:32:01 2014
RFS[3]: Assigned to RFS process 4194
RFS[3]: Selected log 9 for thread 2 sequence 3 dbid 722024964 branch 847284763
Mon May 12 05:32:01 2014
RFS[4]: Assigned to RFS process 4196
RFS[4]: Selected log 6 for thread 1 sequence 3 dbid 722024964 branch 847284763
RFS[4]: New Archival REDO Branch(resetlogs_id): 847284763 Prior: 844762805
RFS[4]: Archival Activation ID: 0x2b1de48e Current: 0x0
RFS[4]: Effect of primary database OPEN RESETLOGS
RFS[4]: Managed Standby Recovery process is active
RFS[4]: Incarnation entry added for Branch(resetlogs_id): 847284763 (maa1)
Mon May 12 05:32:01 2014
Setting recovery target incarnation to 3
Mon May 12 05:32:02 2014
Deleted Oracle managed file +FRA/maa/flashback/log_2.339.847184419
Mon May 12 05:32:02 2014
Archived Log entry 632 added for thread 1 sequence 3 ID 0x2b1de48e dest 1:
Archived Log entry 633 added for thread 2 sequence 3 ID 0x2b1de48e dest 1:
Mon May 12 05:32:04 2014
RFS[5]: Assigned to RFS process 4212
RFS[5]: Opened log for thread 2 sequence 99 dbid 722024964 branch 844762805
RFS[4]: Opened log for thread 1 sequence 140 dbid 722024964 branch 844762805
Archived Log entry 634 added for thread 2 sequence 99 rlc 844762805 ID 0x2b1d0465 dest 2:
started logmerger process
Managed Standby Recovery starting Real Time Apply
RFS[3]: Opened log for thread 2 sequence 1 dbid 722024964 branch 847284763
Parallel Media Recovery started with 2 slaves
Mon May 12 05:32:07 2014
Media Recovery start incarnation depth : 1, target inc# : 3, irscn : 8878917
Archived Log entry 635 added for thread 2 sequence 1 rlc 847284763 ID 0x2b1de48e dest 2:
RFS[5]: Opened log for thread 2 sequence 2 dbid 722024964 branch 847284763
Waiting for all non-current ORLs to be archived...
All non-current ORLs have been archived.
Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_2_seq_99.603.847344725
Identified End-Of-Redo (failover) for thread 2 sequence 99 at SCN 0x0.877b45
Archived Log entry 636 added for thread 1 sequence 140 rlc 844762805 ID 0x2b1d0465 dest 2:
Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_1_seq_140.605.847344725
Identified End-Of-Redo (failover) for thread 1 sequence 140 at SCN 0x0.877b45
Mon May 12 05:32:08 2014
Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE THROUGH ALL SWITCHOVER DISCONNECT USING CURRENT LOGFILE
Resetting standby activation ID 723321957 (0x2b1d0465)
Media Recovery End-Of-Redo indicator encountered
Media Recovery Continuing
Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_2_seq_1.592.847344727
Archived Log entry 637 added for thread 1 sequence 4 ID 0x2b1de48e dest 1:
Media Recovery Waiting for thread 1 sequence 1
Fetching gap sequence in thread 1, gap sequence 1-2
Archived Log entry 638 added for thread 2 sequence 2 rlc 847284763 ID 0x2b1de48e dest 2:
Mon May 12 05:32:10 2014
Primary database is in MAXIMUM AVAILABILITY mode
Standby controlfile consistent with primary
Standby controlfile consistent with primary
RFS[6]: Assigned to RFS process 4242
RFS[6]: Selected log 5 for thread 1 sequence 5 dbid 722024964 branch 847284763
Mon May 12 05:32:11 2014
RFS[7]: Assigned to RFS process 4234
RFS[7]: Opened log for thread 1 sequence 1 dbid 722024964 branch 847284763
Mon May 12 05:32:11 2014
RFS[8]: Assigned to RFS process 4238
RFS[8]: Opened log for thread 1 sequence 2 dbid 722024964 branch 847284763
Archived Log entry 639 added for thread 1 sequence 1 rlc 847284763 ID 0x2b1de48e dest 2:
Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_1_seq_1.595.847344731
Media Recovery Waiting for thread 1 sequence 2 (in transit)
Mon May 12 05:32:14 2014
Archived Log entry 640 added for thread 2 sequence 4 ID 0x2b1de48e dest 1:
Archived Log entry 641 added for thread 1 sequence 2 rlc 847284763 ID 0x2b1de48e dest 2:
Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_1_seq_2.593.847344731
Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_2_seq_2.584.847344727
Mon May 12 05:32:15 2014
Primary database is in MAXIMUM AVAILABILITY mode
Standby controlfile consistent with primary
Standby controlfile consistent with primary
RFS[9]: Assigned to RFS process 4250
RFS[9]: Selected log 8 for thread 2 sequence 5 dbid 722024964 branch 847284763
Mon May 12 05:32:25 2014
Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_1_seq_3.604.847344723
Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_2_seq_3.339.847344723
Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_2_seq_4.581.847344735
Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_1_seq_4.594.847344731
Media Recovery Waiting for thread 1 sequence 5 (in transit)
Recovery of Online Redo Log: Thread 1 Group 5 Seq 5 Reading mem 0
Mem# 0: +DATA/maa/onlinelog/group_5.263.843615365
Mem# 1: +FRA/maa/onlinelog/group_5.289.843615367
Media Recovery Waiting for thread 2 sequence 5 (in transit)
Recovery of Online Redo Log: Thread 2 Group 8 Seq 5 Reading mem 0
Mem# 0: +DATA/maa/onlinelog/group_8.259.843615383
Mem# 1: +FRA/maa/onlinelog/group_8.703.843615385
Mon May 12 05:33:16 2014
Changing standby controlfile to MAXIMUM AVAILABILITY level
RFS[6]: Selected log 6 for thread 1 sequence 6 dbid 722024964 branch 847284763
Mon May 12 05:33:16 2014
Deleted Oracle managed file +FRA/maa/flashback/log_3.259.843502613
Mon May 12 05:33:16 2014
Media Recovery Waiting for thread 1 sequence 6 (in transit)
Recovery of Online Redo Log: Thread 1 Group 6 Seq 6 Reading mem 0
Mem# 0: +DATA/maa/onlinelog/group_6.261.843615373
Mem# 1: +FRA/maa/onlinelog/group_6.670.843615373
Archived Log entry 642 added for thread 1 sequence 5 ID 0x2b1de48e dest 1:
Mon May 12 05:33:19 2014
Standby controlfile consistent with primary
RFS[9]: Selected log 9 for thread 2 sequence 6 dbid 722024964 branch 847284763
Mon May 12 05:33:19 2014
Archived Log entry 643 added for thread 2 sequence 5 ID 0x2b1de48e dest 1:
Media Recovery Waiting for thread 2 sequence 6 (in transit)
Recovery of Online Redo Log: Thread 2 Group 9 Seq 6 Reading mem 0
Mem# 0: +DATA/maa/onlinelog/group_9.256.843615389
Mem# 1: +FRA/maa/onlinelog/group_9.504.843615389
Doing a detailed analysis we find that the "maa" database underwent a FLASHBACK to SCN 8878905, which completed successfully. Afterwards, the database was converted to PHYSICAL STANDBY and, because of the steps in the procedure, it was shut down. It was then restarted (automatically) and detected that it needed to synchronize archives with its primary (maastb). It detected that the End-Of-Redo had occurred and then kept applying the archives generated after that event (via RFS, Media Recovery Continuing). In this case, depending on how long your old primary was offline, the amount of archivelog to be shipped can be large. You may even need to restore from backup (of the new primary) archives that were already deleted and are required to close the gap between the environments.
Afterwards, the database was configured to act as a standby and use real-time apply (USING CURRENT LOGFILE). One detail here: the Broker configures real-time apply automatically if it detects that the standby database has standby redo logs configured. Note that it detected the presence of a new incarnation due to the RESETLOGS resulting from the flashback and from the End-Of-Redo. Finally, upon identifying that the primary operates in MAXIMUM AVAILABILITY, it synchronizes the archivelogs and leaves everything in sync.
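If you want to see this new incarnation from SQL, the query below lists the resetlogs branches the database knows about (a sketch; the incarnation numbers and SCNs will match what appeared in the logs above):
SQL> SELECT incarnation#, resetlogs_change#, resetlogs_time, status FROM v$database_incarnation;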
DETAILS
If you look at all the logs up to this point (especially the last ones) you will see how they fit together; all of this information, presented here serially, happens in parallel. The flow is basically a reinstate command issued through dgmgrl that is communicated to the Broker of the primary and of the standby. During the reinstate process the old primary undergoes a flashback, is restarted, synchronized and configured as the new standby. Since the primary database is operating in MAXIMUM AVAILABILITY, everything ends up synchronized, primary and standby.
All of these commands that were previously manual (see the manual failover and reinstate articles) were now performed automatically by the Broker. We did not need to find the SCN at which the role change occurred, much less run the flashback and fix the CRS ourselves.
Of course, all of this required some configuration. Do you think the Broker would manage to connect to and restart the instances without them being manually registered in the listener? Remember the Broker StaticConnectIdentifier we configured at the end of the previous article? See how some earlier steps (which may have been tedious) save you time?
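If you want to double-check what the Broker will use to reach and restart each instance, you can inspect the instance properties in dgmgrl (a sketch; the connect descriptor shown will be whatever was registered for StaticConnectIdentifier in the previous article):
DGMGRL> SHOW INSTANCE VERBOSE maa1 ON DATABASE maa;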
One detail when bringing up the other instances of the old primary. While starting, they will hit an "ORA-01153: an incompatible media recovery is active" and an "ORA-10458: standby database requires recovery"; this is "normal" and will force a shutdown abort of the instance. You just need to start the instance again in mount mode for everything to return to normal. This is tied to a CRS detail I explain in the next paragraph.
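To bring a second instance back directly in mount mode you can use srvctl (a sketch, assuming the instance name maa2 from this environment):
[oracle@rac11pri02 ~]$ srvctl start instance -d maa -i maa2 -o mount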
Since not everything is rosy in a DBA's life, you will need to adjust the CRS of the instances. The Broker adjusts the CRS of the environment, but only partially. It "forgets" to change the startup mode of the databases, so the old primary and new standby will still be set to start in open mode and the new primary will be set to mount mode (because it was the old standby). See this below, along with its correction:
On the old primary and new standby:
[oracle@rac11pri02 ~]$ srvctl config database -d maa -v
Database unique name: maa
Database name: maa
Oracle home: /u01/app/oracle/product/11.2.0.3/db_1
Oracle user: oracle
Spfile: +DATA/maa/spfilemaa.ora
Domain:
Start options: open
Stop options: immediate
Database role: PHYSICAL_STANDBY
Management policy: AUTOMATIC
Server pools: maa
Database instances: maa1,maa2
Disk Groups: DATA,FRA
Mount point paths:
Services:
Type: RAC
Database is administrator managed
[oracle@rac11pri02 ~]$
[oracle@rac11pri02 ~]$
[oracle@rac11pri02 ~]$
[oracle@rac11pri02 ~]$ srvctl modify database -d maa -s MOUNT
[oracle@rac11pri02 ~]$
[oracle@rac11pri02 ~]$
[oracle@rac11pri02 ~]$
[oracle@rac11pri02 ~]$ srvctl config database -d maa -v
Database unique name: maa
Database name: maa
Oracle home: /u01/app/oracle/product/11.2.0.3/db_1
Oracle user: oracle
Spfile: +DATA/maa/spfilemaa.ora
Domain:
Start options: mount
Stop options: immediate
Database role: PHYSICAL_STANDBY
Management policy: AUTOMATIC
Server pools: maa
Database instances: maa1,maa2
Disk Groups: DATA,FRA
Mount point paths:
Services:
Type: RAC
Database is administrator managed
[oracle@rac11pri02 ~]$
On the new primary (old standby):
[oracle@rac11stb02 ~]$ srvctl config database -d maastb -v
Database unique name: maastb
Database name:
Oracle home: /u01/app/oracle/product/11.2.0.3/db_1
Oracle user: oracle
Spfile:
Domain:
Start options: mount
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: maastb
Database instances: maastb1,maastb2
Disk Groups: DG01,FRA
Mount point paths:
Services:
Type: RAC
Database is administrator managed
[oracle@rac11stb02 ~]$
[oracle@rac11stb02 ~]$
[oracle@rac11stb02 ~]$ srvctl modify database -d maastb -s OPEN
[oracle@rac11stb02 ~]$
[oracle@rac11stb02 ~]$
[oracle@rac11stb02 ~]$ srvctl config database -d maastb -v
Database unique name: maastb
Database name:
Oracle home: /u01/app/oracle/product/11.2.0.3/db_1
Oracle user: oracle
Spfile:
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: maastb
Database instances: maastb1,maastb2
Disk Groups: DG01,FRA
Mount point paths:
Services:
Type: RAC
Database is administrator managed
[oracle@rac11stb02 ~]$
If you configure the CRS before starting the other standby instances, you will not run into the ORA-XXXX errors listed earlier.
FINAL ENVIRONMENT
We have reached an environment with a DG setup in which the primary database is called maastb, which took on that role after the failover of its old primary. The old primary, database maa, was failed over and later reinstated so it could rejoin the environment as a physical standby. Look at the final environment:
DGMGRL> SHOW CONFIGURATION;
Configuration - dgrac
Protection Mode: MaxAvailability
Databases:
maastb - Primary database
maa - Physical standby database
Fast-Start Failover: DISABLED
Configuration Status:
SUCCESS
DGMGRL>
All the steps we followed so far allowed us to do this through the Broker. Using it, compared with the "manual" mode, spared us work and gave us fewer worries. We are still not in an ideal environment; we need to configure fast-start failover to have a comfortable, self-managing environment.
Finally, I recommend starting the remaining instances of the old primary and taking a backup of the environment.
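As a closing sketch, a simple backup of the new primary could be taken with RMAN (an illustration only; adjust channels, destination and retention to your own backup policy):
[oracle@rac11stb01 ~]$ rman target /
RMAN> BACKUP DATABASE PLUS ARCHIVELOG;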