Thursday, April 21, 2011

How to measure the throughput of RMAN.

One day I had to (was granted the privilege to!) rehost a production database onto a new machine.

The Oracle 11g Backup and Recovery User’s Guide explains how to DUPLICATE a database.     The duplicate command is convenient because we can put all the instructions in a single command file.  That would be fine if wanted to copy a production database into development.
This database is over 850GB.   I’d like to rehost this database with less than 1 hour of down time.   If I do everything all at once in a single command file, it will take too long to restore the datafiles onto the new target, much less recover them.
Oracle 10g has a Backup and Recovery Advanced User’s Guide.  (If 11g has an advanced guide, I sure don’t see it.)   Chapter 7 has a section called “Restore and Recovery of the Database on New Host”, which explains a different procedure using the restore-database and recover-database command.   This should be the ideal procedure.
See, I can “restore” the database onto the new host any day of the week.   Then I can “recover” the latest transaction logs, in batches, until my maintenance window starts.
When the maintenance window starts, the following steps can be completed in less than an hour.
-          Stop the listener
-          Make sure the users are disconnected.
-          Alter system switch logfile N times, and let the archiver saves the online logs.
-          Shutdown the old instance
-          Apply the latest archived logs to the new instance
-          Alter system open resetlogs.
So.  I can test this right?  Fine.   So I setup the test, and start doing the “restore database” step.   I’m using 8 channels.  It seems to move along.   Every once in a while one of the channels finishes with one of the pieces.   Is there a way to measure how fast it is going?
The following query will serve this purpose:
set linesize 126
column Pct_Complete format 99.99
column client_info format a25
column sid format 999
column MB_PER_S format 999.99
select s.client_info,
l.sid,
l.serial#,
l.sofar,
l.totalwork,
round (l.sofar / l.totalwork*100,2) "Pct_Complete",
aio.MB_PER_S,
aio.LONG_WAIT_PCT
from v$session_longops l,
v$session s,
(select sid,
serial,
100* sum (long_waits) / sum (io_count) as "LONG_WAIT_PCT",
sum (effective_bytes_per_second)/1024/1024 as "MB_PER_S"
from v$backup_async_io
group by sid, serial) aio
where aio.sid = s.sid
and aio.serial = s.serial#
and l.opname like 'RMAN%'
and l.opname not like '%aggregate%'
and l.totalwork != 0
and l.sofar <> l.totalwork
and s.sid = l.sid
and s.serial# = l.serial#
order by 1;
This will produce a result-listing such as the following (spaces removed to fit).
CLIENT_INFO             SID SERIAL#   SOFAR TOTALWORK   Pct MB_PER_S LONG_WAIT
                                                                                        Complete                                     PCT
----------------------- --- ------- ------- --------- ----- --------     -----
rman channel=ORA_DISK_1 336       7 3080004   6875904 44.79    16.76        79
rman channel=ORA_DISK_2 335       5   12293  12533760   .10    37.22        82
rman channel=ORA_DISK_3 334       5 5672665   6666112 85.10    37.47        82
rman channel=ORA_DISK_4 333       5 4183281   8384894 49.89    35.32        81
rman channel=ORA_DISK_5 339      36 2383878  12521472 19.04    14.11        78
rman channel=ORA_DISK_6 343      22   44293  10289152   .43    10.88        78
rman channel=ORA_DISK_7 341      19 6881776   7932672 86.75    35.76        80
rman channel=ORA_DISK_8 340      40 1476057   8941568 16.51    26.97        86

8 rows selected. 

Well.  This is ghastly.
The fastest channel of any of them is pushing the blazing-fast speed of 37.47 MB per second.  
I had to talk with my storage adminstrators about that. 
This machine is supposed to have a 1GB Ethernet connection.   That ought to be able to push over 110MB per second, and I’m barely getting one-third of that.   They eventually found that one of their interfaces was generating a lot of errors.  No wonder each channel is spending 80% of the time in a long-wait state
I ran the query multiple times, and it seemed like the average was 35MB/s.   I had to move 850GB of data.
850GB * 1024MB/GB / 35MB/s  * 1hour/3600s = 7.1 hours.
The actual time turned out to be 6.9 hours – close enough.
That showed the throughput is not cumulative.   Just because 8 channels push 35MB/s each does not mean they push (8*35=) 280MB/s.     Then I could have finished in about 1 hour.   Trust me, I would have noticed the difference
Hopefully you can get better throughput than that.   If not, talk to your coworkers, and see if they can find any blockage in the plumbing.
Best wishes.
~the hard-working DBA.