
Automatic block media recovery to the rescue
As pointed out earlier, Oracle Active Data Guard (ADG) is a lot more than just the read-only standby database. One of the technologies no other product beats us on in regards to disaster recovery is the automatic block media recovery, which is part of the Active Data Guard license. This gets automatically enabled when you open the standby database read-only with redo apply enabled. You do not need to do something specific for it, it is just there.
In spite of how critical it is to prevent data corruption from the perspective of most DBAs, this is ironically a pretty underestimated feature. It is important to keep in mind that although it is rare, database block corruption still happens. No matter what anyone tells you, it still happens in today’s world. The good thing is that you can avoid painful downtime and manual labor with Active Data Guard. You obviously don’t want to try this in your 24/7 OLTP mission critical database. Therefore, I will ruin (and have it automatically repaired) my Sandbox test database.
We have a few little prerequisites:
1) Have ADG running in read only with apply
2) (for the test, not the feature) have standby_file_management on AUTO.
Intro – Prerequisites
We will use my vbox demo environment here, as that uses the file system and will simplify introducing a corruption (thanks Kamil for the help with the DD command). You can do it with ASM too. However, for demo purposes, this is just easier. So, there you see that ASM adds a layer of additional protection to your data files.
My primary database
1 2 3 4 5 6 |
SQL> show parameter standby_file_management NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ standby_file_management string AUTO SQL> |
my standby database
1 2 3 4 5 6 |
SQL> show parameter standby_file_management NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ standby_file_management string AUTO SQL> |
Then we verify the database roles and check if ADG is active.
My primary database:
1 2 3 4 5 6 7 8 9 |
SQL> select database_role,open_mode from v$database; DATABASE_ROLE OPEN_MODE ---------------- -------------------- PRIMARY READ WRITE 1 row selected. SQL> |
And of course my standby too:
1 2 3 4 5 6 7 |
SQL> select database_role,open_mode from v$database; DATABASE_ROLE OPEN_MODE ---------------- -------------------- PHYSICAL STANDBY READ ONLY WITH APPLY SQL> |
So, we are good to go and our database is fully protected against corrupt database blocks.
I have created a few little scripts to avoid the typing. The first one creates a new tablespace. If you use / copy / paste it, it creates it in /home/oracle with the size of 1M.
It then creates a table in the tablespace, adds a record and commits it. We make sure it gets into the data files and then we retrieve the block_id from the row. This we will need in the next step, corruption of this row.
The script 01_abmr.sql
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
set pages 999 lines 200; set echo on; set feed on; Col owner format a20; var rid varchar2(25); col segment_name format a20; drop tablespace corruptiontest including contents and datafiles; create tablespace corruptiontest datafile '/home/oracle/corruptiontest01.dbf' size 1m; create table will_be_corrupted(myfield varchar2(50)) tablespace corruptiontest; insert into will_be_corrupted(myfield) values ('This will have a problem') returning rowid into :rid; print Commit; Alter system checkpoint; select * from will_be_corrupted;select dbms_rowid.ROWID_BLOCK_NUMBER(ROWID, 'SMALLFILE') FROM will_be_corrupted where myfield='This will have a problem'; |
When I run it on my primary database inside the PDB it gives me following output:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 |
SQL> @01_abmr.sql SQL> set pages 999 lines 200; SQL> set echo on; SQL> set feed on; SQL> Col owner format a20; SQL> var rid varchar2(25); SQL> col segment_name format a20; SQL> SQL> drop tablespace corruptiontest including contents and datafiles; Tablespace dropped. SQL> create tablespace corruptiontest datafile '/home/oracle/corruptiontest01.dbf' size 1m; Tablespace created. SQL> create table will_be_corrupted(myfield varchar2(50)) tablespace corruptiontest; Table created. SQL> insert into will_be_corrupted(myfield) values ('This will have a problem') returning rowid into :rid; 1 row created. SQL> print RID -------------------------------------------------------------------------------------------------------------------------------- AAAR1dAAfAAAAAPAAA SQL> Commit; Commit complete. SQL> Alter system checkpoint; System altered. SQL> select * from will_be_corrupted; MYFIELD -------------------------------------------------- This will have a problem 1 row selected. SQL> select dbms_rowid.ROWID_BLOCK_NUMBER(ROWID, 'SMALLFILE') FROM will_be_corrupted where myfield='This will have a problem'; DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID,'SMALLFILE') ------------------------------------------------ 15 1 row selected. SQL> |
Remember the 15 (in my case) as the block_id. This is where in the data file our record is stored.
In the meanwhile, you see the following in the alert log from primary:
1 2 3 4 5 6 7 |
2020-06-25T11:16:44.013994-04:00 MYPDB(3):drop tablespace corruptiontest including contents and datafiles 2020-06-25T11:16:45.830967-04:00 MYPDB(3):Deleted file /home/oracle/corruptiontest01.dbf MYPDB(3):Completed: drop tablespace corruptiontest including contents and datafiles MYPDB(3):create tablespace corruptiontest datafile '/home/oracle/corruptiontest01.dbf' size 1m MYPDB(3):Completed: create tablespace corruptiontest datafile '/home/oracle/corruptiontest01.dbf' size 1m |
That the new data file has been added to the PDB and in the standby we see:
1 2 3 4 5 6 7 8 9 10 |
2020-06-25T11:16:45.908009-04:00 (3):Recovery deleting file #30:'/home/oracle/corruptiontest01.dbf' from controlfile. (3):Deleted file /home/oracle/corruptiontest01.dbf (3):Recovery dropped tablespace 'CORRUPTIONTEST' (3):WARNING: File being created with same name as in Primary (3):Existing file may be overwritten (3):Recovery created file /home/oracle/corruptiontest01.dbf (3):Datafile 31 added to flashback set (3):Successfully added datafile 31 to media recovery (3):Datafile #31: '/home/oracle/corruptiontest01.dbf' |
The new data file has been added to the media recovery, so when we query the standby, we see that the inserted row is there as well:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
SQL> select database_role,open_mode from v$database; DATABASE_ROLE OPEN_MODE ---------------- -------------------- PHYSICAL STANDBY READ ONLY WITH APPLY SQL> select * from will_be_corrupted; MYFIELD -------------------------------------------------- This will have a problem SQL> |
Time for corruption
We will use 02_abmr.sql for this.
1 |
host dd conv=notrunc bs=1 count=2 if=/dev/zero of=/home/oracle/corruptiontest01.dbf seek=$((&block_id*8192+16)) |
All it does, is just clean out our record in the datafile. So when we run it, it asks the block_id and that is the. number from the last script
1 2 3 4 5 6 7 8 |
SQL> @02_abmr.sql SQL> host dd conv=notrunc bs=1 count=2 if=/dev/zero of=/home/oracle/corruptiontest01.dbf seek=$((&block_id*8192+16)) Enter value for block_id: 15 2+0 records in 2+0 records out 2 bytes (2 B) copied, 0.000129355 s, 15.5 kB/s SQL> |
The recovery
Or should we say Active Data Guard to the rescue. As this is a test system, not a lot activity and so on, we need to flush the buffer cache and then we access the row. ADG will see that we have block corruption and we will fetch the correct block from the standby and repair the primary, not causing outages.
You could guess, this is 03_abmr.sql to show this
1 2 |
alter system flush buffer_cache; select * from will_be_corrupted; |
So when we run it
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
SQL> @03_abmr.sql SQL> alter system flush buffer_cache; System altered. SQL> select * from will_be_corrupted; MYFIELD -------------------------------------------------- This will have a problem 1 row selected. SQL> |
Apparently nothing happened. You think! However, when you look in alert log, you will see what happened behind the scenes:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
2020-06-25T11:24:50.224758-04:00 MYPDB(3):ALTER SYSTEM: Flushing buffer cache inst=0 container=3 global MYPDB(3):Hex dump of (file 31, block 15) in trace file /u01/app/oracle/diag/rdbms/dgdemovm1/dgdemovm1/trace/dgdemovm1_ora_6947.trc MYPDB(3): MYPDB(3):Corrupt block relative dba: 0x07c0000f (file 31, block 15) MYPDB(3):Bad check value found during multiblock buffer read MYPDB(3):Data in bad block: MYPDB(3): type: 6 format: 2 rdba: 0x07c0000f MYPDB(3): last change scn: 0x0000.0000.0046be75 seq: 0x1 flg: 0x06 MYPDB(3): spare3: 0x0 MYPDB(3): consistency value in tail: 0xbe750601 MYPDB(3): check value in block header: 0x0 MYPDB(3): computed block checksum: 0x898b MYPDB(3): MYPDB(3):Reading datafile '/home/oracle/corruptiontest01.dbf' for corrupt data at rdba: 0x07c0000f (file 31, block 15) MYPDB(3):Reread (file 31, block 15) found same corrupt data (no logical check) MYPDB(3):Automatic block media recovery requested for (file# 31, block# 15) 2020-06-25T11:24:50.252028-04:00 Corrupt Block Found TIME STAMP (GMT) = 06/25/2020 11:24:49 CONT = 3, TSN = 6, TSNAME = CORRUPTIONTEST RFN = 31, BLK = 15, RDBA = 130023439 OBJN = 73053, OBJD = 73053, OBJECT = WILL_BE_CORRUPTED, SUBOBJECT = SEGMENT OWNER = SYS, SEGMENT TYPE = Table Segment 2020-06-25T11:24:50.582288-04:00 Automatic block media recovery successful for (file# 31, block# 15) 2020-06-25T11:24:50.583066-04:00 MYPDB(3):Automatic block media recovery successful for (file# 31, block# 15) |
So, you see that Oracle fixes the problem before the user even sees the problem. This works vice versa as well of course, when the standby faces a corrupt block, it will be fetched and repaired again from the primary.
Conclusion
This feature which is automatically enabled as part of Active Data Guard offers a critical extra layer of protection against disasters. The more layers you add in the storage system, the higher the chance that something can go wrong. You can protect yourself against this by having Active Data Guard enabled from the beginning for a full protection of your primary and standby system.