Disaster Recovery with DB2 UDB for z/OS

(A new Redbook from IBM) 

By Ron Steele

L. B. Software Consultants, Inc.

Ron-steele@nc.rr.com

919-523-0305

December 2004

 

Here are some critical questions you should be able to answer regarding your company’s disaster recovery:

 

IBM has made available a new Redbook, Disaster Recovery with DB2 UDB for z/OS. This new manual provides an outstanding description of the latest information about DB2 V8 and disaster recovery on the zSeries mainframe. This edition applies to DB2 UDB for z/OS V8, program number 5625-DB2.

 

Covered in the Redbook are:

 

The primary objective of the Redbook is to document realistic scenarios related to DB2 disaster recovery. The focus is on IBM solutions that use mirroring and the ESS (Enterprise Storage Server), and on System Point In Time (PIT) Recovery. Non-DB2 data, such as VSAM files, that is logically or physically related to the DB2 applications should be treated with equivalent and congruent solutions.

 

This article provides an overview of the new Redbook. The Redbook runs to 530 pages, so the purpose of this article is to give the reader a sort of “Book Report” and “Highlights” in far fewer pages (approximately 20), so that points of interest can be identified for further reading and study in the Redbook itself.

 

The Redbook tries not to repeat what is already in the standard DB2 manuals. Instead, it provides just enough detail about the choices and functions that the reader does not have to consult the manuals too often. It consolidates a lot of key information and choices into one document, with a focus on the “User” rather than on being a “Reference” manual.

 

Research on the web for additional information on DB2 and disaster recovery turned up some points, definitions, and useful websites, which are covered under the heading Additional Definitions, Points, and Web Sites. Also covered under that heading is a brief description of IMS V8’s new IMS-DB2 Coordinated Disaster Recovery.

Credits

 

Due credit is given to each of the following organizations, from which information was extracted for this article:

 


The Redbook is in 5 parts:

 

The following is a description of each of the 5 Parts of the redbook.

 

Part 1: The Whole Picture: Business Continuity and Disaster Recovery

 

Business Continuity (BC) and Disaster Recovery (DR) act at different levels in the organization.

 

BC is the strategy at the enterprise level, while DR is the solution at the IT level. BS 7799 (British Standard: Information technology - Code of practice for information security management) generically defines the objective of business continuity management as: “To counteract interruptions of business activities and to protect critical business processes from the effects of major failures or disasters.” This definition has been adopted by ISO as ISO/IEC 17799:2000. Here are some references for these topics:

 

DR is the process that must be activated following a disaster for the purpose of restoring information services in the anticipated manner and time. Such a process manages and resolves a contingent situation (a disaster recovery plan is a particular “contingency plan”). It includes the procedures necessary for the restoration of the data and the network, and has, as its ultimate purpose, the reactivation of the operability of the users of information services. It must describe how, where, and when the user will resume working activities.

 

The following chart positions Business Continuity and Disaster Recovery.

RPO, RTO, and NRO

 

Two terms now commonly used to qualify recovery solutions are Recovery Point Objective (RPO) and Recovery Time Objective (RTO). They have replaced the former, possibly more intuitive, terms “lost data” (RPO) and “downtime” (RTO). The two are complementary, and both must be considered when evaluating a disaster recovery solution. The enterprise's requirements for minimizing downtime and lost data vary by application and are determined by the cost to the organization of these two factors.
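As a rough illustration of how the two objectives are applied, the sketch below compares the data loss and downtime observed in one recovery exercise against an application's RPO and RTO. The class name, function name, and all of the figures are hypothetical and chosen only for this example; they do not come from the Redbook.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable data loss, expressed as a time window
    rto: timedelta  # maximum tolerable downtime


def evaluate_recovery(last_consistent_copy, disaster_time, service_restored, objectives):
    """Return (data_loss, downtime, rpo_met, rto_met) for one recovery exercise."""
    # Data loss: everything written after the last consistent copy is gone.
    data_loss = disaster_time - last_consistent_copy
    # Downtime: how long users were without the service.
    downtime = service_restored - disaster_time
    return data_loss, downtime, data_loss <= objectives.rpo, downtime <= objectives.rto


# Hypothetical figures for a single application.
objectives = RecoveryObjectives(rpo=timedelta(minutes=15), rto=timedelta(hours=4))
data_loss, downtime, rpo_met, rto_met = evaluate_recovery(
    last_consistent_copy=datetime(2004, 12, 1, 9, 50),
    disaster_time=datetime(2004, 12, 1, 10, 0),
    service_restored=datetime(2004, 12, 1, 15, 30),
    objectives=objectives,
)
print(f"data loss {data_loss}, RPO met: {rpo_met}")
print(f"downtime {downtime}, RTO met: {rto_met}")
```

With these made-up figures the RPO is met (10 minutes of data lost against a 15-minute objective) but the RTO is not (5.5 hours of downtime against a 4-hour objective), which shows why the two measures have to be evaluated separately.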

 

Another term encountered now is Network Recovery Objective (NRO). This can be studied in the SHARE presentation: “Cost Effective Disaster Recovery: How Much Can You Afford”: http://ew.share.org/proceedingmod/abstract.cfm?abstract_id=3011&conference_id=1

 

The Tiers of Disaster Recovery

 

SHARE has defined 7 tiers of Disaster Recovery to help companies decide where they stand and where they want to be. The higher the tier, the higher the cost and investment, the shorter the RTO, and usually the more current the RPO (that is, the less data is lost).

 

 

Tiers can also be studied in the SHARE presentation: “Cost Effective Disaster Recovery: How Much Can You Afford”: http://ew.share.org/proceedingmod/abstract.cfm?abstract_id=3011&conference_id=1

 

Return On Investment

 

In order to assess the cost of your DR solution, there are questions that need to be answered.

·        What level of data currency is required?

·        Can the data at the disaster site be a few seconds, minutes, or hours old?

·        How consistent is the data expected to be?

·        Multiple table consistency?

·        Transaction consistency?

·        Subsystem-wide consistency? (A very high degree of data consistency will have a substantial cost in recovery time and resources.)

·        When would this solution be rolled out?

·        When would all of the hardware and software need to be available?

·        When can the needed hardware and software levels be installed?

 

Lessons Learned From 9/11

 

The events of September 11, 2001 in the United States of America have underlined how critical it is for businesses to be ready for disasters. The Federal Reserve, the Office of the Comptroller of the Currency, the Securities and Exchange Commission, and the New York State Banking Department (the agencies) have met with industry participants to analyze the lessons learned from the events of September 11. The agencies have released an interagency white paper on sound practices to strengthen the resilience of the US financial system. For more information on this, refer to: http://sec.gov/news/studies/34-47638.htm .

 

The following list is a summary of lessons learned about IT service continuity:

 


 

DB2 Disaster Recovery

 

This chapter introduces the options and features available for a DB2 subsystem DR solution. Topics covered are:

·        Introduction to DB2 disaster recovery solutions

·        DR solutions in terms of RTO and RPO

·        Data consistency

·        DB2’s disaster recovery functions

·        Determining the RBA for conditional restarts

·        Actions to take when there are active utilities

 

The following chart shows some of the solutions, and positions them in terms of data loss against recovery time:

These solutions are described in detail in Chapter 2 of the redbook.

 

Data consistency and the “Rolling Disaster”

 

ESS Copy Services provide a data mirroring capability that automatically replicates current data from your primary site to a secondary site. The secondary site allows you to recover your data after a disaster without the need to restore DB2 image copies or apply DB2 logs to bring DB2 data to the current point-in-time.

 

Notice that the scenarios and procedures for data mirroring are intended for environments that mirror an entire DB2 subsystem or data sharing group, including DB2 catalog, directory, user data, BSDS, and active logs. You must mirror all volumes in such a way that they terminate at exactly the same point.

 

A rolling disaster is the typical real disaster, in which your local site fails gradually and intermittently over a number of seconds, with the various components failing in sequence. For example, if a data volume failed to update its secondary, yet the corresponding log update was copied to the secondary, the result would be a secondary copy of the data that is inconsistent with the primary copy. The database would then need to be recovered from image copies and log data. In all cases, the secondary site must be notified of the missed update. When this happens for hundreds of volumes, without a clear notification of the status of the affected secondary volumes, recovery can be extremely complex and long.

 

When using data mirroring for disaster recovery, you must mirror data from your local site with a method that does not reproduce a rolling disaster at your recovery site. To recover DB2 with data integrity, you must use volumes that end at a consistent point-in-time for each DB2 subsystem or data sharing group. Mirroring a rolling disaster causes volumes at your recovery site to end over a span of time rather than at one single point.
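The requirement that mirrored volumes end at one single point, rather than over a span of time, can be pictured with a minimal sketch. It assumes you somehow know the timestamp of the last write mirrored for each volume; the volume names and timestamps are hypothetical, and this is not how ESS Copy Services report their status.

```python
from datetime import datetime, timedelta


def consistency_span(last_mirrored_write):
    """Return the time between the earliest and latest volume end points."""
    end_points = list(last_mirrored_write.values())
    return max(end_points) - min(end_points)


def ends_at_single_point(last_mirrored_write):
    """True only if mirroring stopped at the same instant on every volume."""
    return consistency_span(last_mirrored_write) == timedelta(0)


# Hypothetical volumes: all mirrors frozen at the same instant.
frozen = {
    "DB2LOG": datetime(2004, 12, 1, 10, 0, 0),
    "DB2CAT": datetime(2004, 12, 1, 10, 0, 0),
    "DB2DAT": datetime(2004, 12, 1, 10, 0, 0),
}

# Hypothetical volumes: a rolling disaster stops the mirrors one after another.
rolling = {
    "DB2LOG": datetime(2004, 12, 1, 10, 0, 7),
    "DB2CAT": datetime(2004, 12, 1, 10, 0, 3),
    "DB2DAT": datetime(2004, 12, 1, 10, 0, 0),
}

print(ends_at_single_point(frozen), consistency_span(frozen))
print(ends_at_single_point(rolling), consistency_span(rolling))
```

In the second case the recovery-site volumes end over a span of seven seconds, which is exactly the situation the text warns against.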

 

In a disaster (think of flood, fire, or explosion), it is very likely that different logical storage subsystems (LSSs) fail at different times. It is also true that for each SQL UPDATE, INSERT, and DELETE issued by an application, DB2 will issue, at different times, several dependent writes to log data sets, table spaces, and index spaces allocated on DASD volumes spread across several LSSs. While the write to the log is externalized at commit time, the writes to table spaces and index spaces are externalized when the buffer pool thresholds are reached for each page set in each buffer pool.
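The effect of dependent writes landing on volumes that fail at different times can be shown with a small simulation. This is a deliberately simplified model, not DB2's or the ESS's actual behavior: it assumes every write depends on all the writes issued before it, and it treats the secondary as consistent only if it holds an unbroken prefix of the primary's write order. The volume names and write descriptions are hypothetical.

```python
from typing import NamedTuple


class Write(NamedTuple):
    seq: int      # order in which the primary issued the write
    volume: str   # hypothetical volume name
    what: str     # description of the write


def replicate(writes, mirror_stops_after):
    """Writes that reach the secondary: each volume stops mirroring after a given seq."""
    return [w for w in writes if w.seq <= mirror_stops_after[w.volume]]


def is_consistent(primary, secondary):
    """Consistent if the secondary holds an unbroken prefix of the primary's writes."""
    primary_order = [w.seq for w in primary]
    return [w.seq for w in secondary] == primary_order[:len(secondary)]


primary = [
    Write(1, "LOG01", "log record for update A"),
    Write(2, "DATA01", "data page for update A"),
    Write(3, "LOG01", "log record for update B"),
]

# Both volumes stop mirroring at the same point: a consistent cut.
frozen = replicate(primary, {"LOG01": 2, "DATA01": 2})

# Rolling disaster: the data volume's mirror fails first, the log keeps going.
rolling = replicate(primary, {"LOG01": 3, "DATA01": 1})

print(is_consistent(primary, frozen))   # secondary holds writes 1 and 2
print(is_consistent(primary, rolling))  # secondary holds writes 1 and 3, missing 2
```

In the rolling case the secondary has a later log write but is missing the data write that preceded it, which is the kind of inconsistency that forces a recovery from image copies and logs.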

 

The Redbook explains in detail, with examples, how a rolling disaster occurs and how to design for it using Consistency Groups. A consistency group is a collection of volumes that contain consistent, related data. This data can span logical storage subsystems and disk subsystems. For DB2 specifically, a consistency group contains an entire DB2 subsystem or a DB2 data sharing group. The following DB2 elements comprise a consistency group:

·        BSDS

·        Active Logs

·        DB2 Catalog

·