By Mark Kozak-Holland; translated by Yang Lei (楊磊)
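To recap Titanic’s situation: after the ship restarted (see Part 10), the flooding turned catastrophic. At about 12:45 a.m., 65 minutes after the hull first grounded on the ice shelf, the captain ordered the officers to uncover the lifeboats and muster all passengers and crew on deck. The crew, confused by unclear communication (see Part 11), acted hesitantly and could not believe that anything had gone wrong. After all, there were still few visible signs of disaster.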
Today, disaster recovery means switching the online operation to an alternate service environment. It takes many forms, however, from the simple recovery of a single application’s data and files within days to the relatively complex recovery of an entire business operation within minutes or hours. A disaster can take one of three forms: total (absolute and immediate), rapid and imminent, or slow and innocuous. Once a disaster is recognized, contingency plans are invoked and the disaster is declared.
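On board Titanic, the disaster was of the slow and innocuous kind. Although a full recovery was no longer feasible, the captain and officers could still carry out a partial one. Without a formal evacuation or disaster recovery plan, however, the most they could do was impose some order to keep panic and chaos from spreading once the signs of disaster became obvious. The recovery scenario envisioned at design time (see Part 3) was to use the lifeboats to transfer passengers to another ship, which would carry them to port. The lifeboats would shuttle back and forth, so far fewer of them were required. That scenario rested on the assumption that Titanic could not sink, or would at least stay afloat on its own while awaiting help.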
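Today, when developing a disaster recovery plan, we must consider every type of failure in the whole IT solution that could lead to a disaster. For example:
● Physical faults or tangible defects in the technology
● Design errors, including system or application software design failures and code bugs
● Operational failures caused by operations staff through accidents, inexperience, insufficient training, failure to follow procedures, or even deliberate malice
Environmental failures (in power systems, cooling systems and networks, for example) can damage the operations center as badly as natural disasters or terrorist acts.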
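Over the preceding 400 years, most of the environmental factors involved in crossing the Atlantic had been observed, charted and documented. The record covered everything from year-round natural conditions (such as changing ocean currents) and weather (such as storms and hurricanes) to natural hazards (such as fogbanks, ice fields, iceberg zones, dangerous coastlines and reefs). Yet a belief pervaded the Titanic project that this enormous, unsinkable iron ship could cope with anything nature produced.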
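When designing a disaster recovery plan, the scale of the disaster must also be considered. For example, when a relatively minor storm, fire or flood strikes, your customers will expect some form of contingency service fairly quickly. Today you need contingency measures for all of these, and for larger disasters as well.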
The costs of disaster recovery vary with the time required, the cause of the disaster and the degree of recovery needed. As part of the plan, these costs should be determined carefully for each specific IT solution.
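For Titanic, maritime convention called for a disaster recovery plan covering all of the above situations: bring everyone to the lifeboat deck, load them into lifeboats with seats to spare, lower the boats safely and let trained crew take them away. The lifeboat drill at Queenstown should have tested this latter part of the plan (see Part 5).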
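Many serious problems in a production environment begin innocuously: when a problem first appears, your organization may not even notice it or its consequences. For example, a non-critical part of the IT solution goes down unnoticed, but because of the interdependencies between components and applications, a knock-on effect soon spreads to other parts of the solution, and a major failure follows within a very short time.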
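On board Titanic, the lifeboats were launched conspicuously late, which suggests that the officers hesitated until they had no choice. Their slow response probably stemmed from the belief that the ship could not sink, from the fact that the gravity of the situation was not apparent, and from how normal everything still seemed. In addition, only 83 of the 900 crew were real seamen (see Part 5), and only they knew the complex procedure for lowering a 30-foot lifeboat (capacity 65) to the sea 60 feet below. There were 16 such lifeboats, plus four smaller collapsible boats known as Engelhardts (capacity 45).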
Conclusions
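Today, many IT projects ignore disaster recovery entirely, on the grounds that it is out of scope and covered by a separate annual planning process. Yet the IT project itself, besides establishing the business case and designing the IT solution, also develops an in-depth understanding of the recovery required. Serious thought about the consequences of a disaster affecting the IT solution must come early enough in the project to allow adjustments to the overall disaster recovery plan. The next installment continues to look at disaster recovery.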
Original text:
To recap Titanic’s situation: following the restart of the ship (Part 10), the flooding became catastrophic. Around 12:45 a.m., 65 minutes after the initial grounding on the ice shelf, the captain gave orders to the officers to uncover the lifeboats and get the passengers and crew ready on deck. The crew, confused by unclear communication (Part 11), operated in a state of disbelief, refusing to believe that anything was wrong. After all, there were still few signs of the disaster.
In today’s world, disaster recovery is the concept of switching the online operation to an alternate service-delivery environment. However, it takes many shapes and forms, from the relatively simple recovery of data and files from a single application in a timeframe measured in days, to the relatively complex recovery of a complete business operation in a timeframe measured in minutes or hours. A disaster can take three forms, namely: total (absolute and immediate), rapid and imminent, slow and innocuous. When a disaster is recognized, contingency plans are invoked and a disaster is declared.
On board Titanic, the disaster was slow and innocuous. Although a full recovery was no longer feasible, the captain and officers could enact a partial recovery. But without a formalized evacuation or disaster recovery plan, the best they could do was to bring some order to prevent widespread panic and chaos once the disaster signs became more obvious. The envisioned scenario for disaster recovery, at the time of the design (Part 3), was to transfer passengers by lifeboat to another ship, which would then deliver them to port. The lifeboats would ferry passengers back and forth to the rescue ship, requiring a much smaller total lifeboat capacity. This scenario was based on the perception that Titanic could not possibly sink, but would float in an incapacitated state waiting for help.
Today, defining a disaster recovery plan means giving thought to every type of failure that could possibly happen to an IT solution and lead to a disaster. For example:
· Physical faults or failures in the technology
· Design errors, which include system or application software design failures and bugs
· Operations errors caused by operations services staff because of accidents, inexperience, lack of due diligence or training, not following procedures or even malice
Environmental failures, such as those in power supplies, cooling systems and network connections, can be equally devastating, as can natural disasters and terrorist activities against the operation center itself.
Over the preceding 400 years, most environmental factors related to crossing the Atlantic had been observed, charted and documented. These ranged from year-round natural conditions like changing ocean currents, and weather patterns like storms and hurricanes, to natural hazards like fogbanks, ice fields, iceberg areas, dangerous shorelines and rocky outcrops. However, a belief had evolved during Titanic’s project (Part 4) that this enormous, practically unsinkable iron ship could handle anything nature could hand out.
In defining a disaster recovery plan, the scale of disaster is important to consider as well. For example, if a relatively minor storm, fire or flood knocks out your online operation, your customers are going to expect some contingency of service relatively quickly. In today’s world, you need contingency for all of these, even the most catastrophic disasters.
The associated costs of disaster recovery vary, based on the window of recovery (time), the elements of the disaster and the degree of recovery required. As part of a plan, these costs need to be carefully determined specifically for the IT solution created.
For Titanic, maritime convention called for a disaster recovery plan covering all the above situations: one that brought everyone on board to the lifeboat deck, loaded them into the lifeboats with places to spare, lowered the lifeboats safely, and set them adrift with experienced crews to handle them. The lifeboat drill in Queenstown should have tested the latter part of the plan (Part 5).
Many serious problems with a production environment can start so innocuously that, in the first hour, your organization might not even be aware of it or its implications. For example, a less-critical part of the IT solution might be "down," so it goes unnoticed. However, because of interdependencies between components and applications, there tends to be a "knock on" effect and very quickly other parts of the IT solution can become affected. This leads to a catastrophic failure in a very short time.
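The knock-on effect described above can be sketched as a walk over a dependency map. This is a minimal illustration, not a method from the article: the component names and the `DEPENDS_ON` map are hypothetical, and a real IT solution would need its own inventory of interdependencies.

```python
from collections import deque

# Hypothetical dependency map (an assumption for illustration only):
# each component lists the components it needs in order to work.
DEPENDS_ON = {
    "web_store": ["order_service", "session_cache"],
    "order_service": ["inventory_db", "audit_log"],
    "reporting": ["audit_log"],
    "session_cache": [],
    "inventory_db": [],
    "audit_log": [],
}


def affected_components(depends_on, failed):
    """Return every component impacted, directly or indirectly, when `failed` goes down."""
    # Invert the map: for each component, record which components rely on it.
    dependents = {}
    for component, requirements in depends_on.items():
        for requirement in requirements:
            dependents.setdefault(requirement, set()).add(component)

    # Breadth-first walk outward from the failed component.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent != failed and dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted


if __name__ == "__main__":
    # A "less-critical" component fails; list what it drags down with it.
    print(sorted(affected_components(DEPENDS_ON, "audit_log")))
```

In this made-up map, a failure in the seemingly minor `audit_log` reaches `order_service` and `reporting` directly and `web_store` indirectly, which is the kind of spread a recovery plan has to anticipate.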
On board Titanic there was a major delay in getting the lifeboats down, indicating a hesitation to launch the boats until as late as possible. It is likely the officers reacted slowly for several reasons: the ship was believed to be unsinkable, the gravity of the situation was not apparent and everything appeared so normal at the time. Also, only 83 of the crew of 900 were actual mariners (Part 5) and therefore familiar with the somewhat complex drill of lowering a 30-foot (65-person) lifeboat 60 feet to the water. There were 16 of these lifeboats in total, plus four smaller collapsible lifeboats (45-person) or "Engelhardts."
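The figures in the paragraph above lend themselves to a quick capacity check. The sketch below uses only the boat counts, seat counts and mariner count quoted in this article. The `people_aboard` argument is a hypothetical input, since the article does not give a total, and the function names are invented for illustration.

```python
# Figures quoted in the article.
STANDARD_BOATS = 16
STANDARD_SEATS = 65
COLLAPSIBLE_BOATS = 4
COLLAPSIBLE_SEATS = 45
MARINERS = 83


def total_seats():
    """Total lifeboat seats implied by the article's figures."""
    return STANDARD_BOATS * STANDARD_SEATS + COLLAPSIBLE_BOATS * COLLAPSIBLE_SEATS


def seat_shortfall(people_aboard):
    """Seats missing for a given headcount (0 if everyone fits).

    `people_aboard` is an assumption supplied by the caller, not a figure from the article.
    """
    return max(0, people_aboard - total_seats())


def mariners_per_boat():
    """Average number of trained mariners available per lifeboat."""
    return MARINERS / (STANDARD_BOATS + COLLAPSIBLE_BOATS)


if __name__ == "__main__":
    print("Total seats:", total_seats())
    print("Mariners per boat:", mariners_per_boat())
    # 2200 is an illustrative headcount, not taken from the article.
    print("Shortfall for 2200 aboard:", seat_shortfall(2200))
```

On these numbers the boats offer 1,220 seats and roughly four trained mariners each, so any headcount above 1,220 leaves people without a place. The ferrying scenario took that gap for granted.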
Conclusions
Today, many IT projects completely ignore disaster recovery, treating it as something beyond their scope that is covered by a yearly IT planning process. Yet it is the IT project that determines the business justification and design around the IT solution, and develops an in-depth understanding of the kind of recovery that is required. Serious thought needs to be given to the consequences of a disaster impacting the IT solution, and this needs to be done early enough in the project so that adjustments to the overall disaster recovery plan can be made. The next installment will continue to look at disaster recovery.