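To recap Titanic’s situation: after the collision with the iceberg (see Part 8) the hull still looked sound and no one had been injured. From the bridge, the ship’s integrity appeared intact. White Star director Bruce Ismay was determined to save face and protect his company’s reputation (see Part 9). At 11:50 p.m., 10 minutes after the collision, Ismay pushed to get under way again, and Titanic limped off the ice shelf. Passengers, unaware of any danger, were relieved that the ship was moving again and gave little thought to the collision, its potential damage, or its consequences.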
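In today’s IT projects it is vital that Mean Time To Recovery (MTTR) procedures for the IT solution (see Part 9) have been set up, prepared, planned, and tested within the project itself (see Part 4), and "institutionalized" with dedicated staff (operations groups/technical support). Data collected in the second quadrant of a problem (the four quadrants are detection, determination, resolution, and recovery) must stand up to rigorous review.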
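Before a fix is applied to the faulty production solution, the team needs to assess the overall risk of the fix itself. Executive intervention should be treated like any other input and examined carefully so that it does not make the problem worse. Importantly, such input should be challenged as soon as it looks doubtful.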
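Whether Captain Smith shared in the decision to restart the voyage no longer mattered, because Ismay was already steering events according to his own wishes. Smith was still optimistic when he went to the wireless room to report to the company’s head office in Boston; after all, there was great confidence in the design of a ship with 73 watertight compartments. His wireless message said that the ship had struck ice but suffered little damage, that everyone was safe, and that as a precaution it was heading for Halifax, Canada. The message should have given White Star enough time to arrange trains and carriages to take the passengers on to New York. It was not encrypted, so the press picked it up, which is why early reports of the collision in the European press were generally optimistic.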
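In today’s IT projects, MTTR procedures should be controlled entirely by the groups responsible for the IT solution and its services. Any communication or announcement about an outage must be made in close coordination with them, and follow-on support decisions should be made only after communicating externally with the solution’s service recipients. Inaccurate information quickly erodes the service provider’s credibility.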
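The second search party, which included the architect Thomas Andrews and the carpenter John Hutchinson, brought back a more accurate assessment of the damage and better data. The first search party had not inspected enough of the ship to grasp the wider extent of the damage. In fact, within seconds of the collision the coal bunkers and Boiler Room 5 were already taking on water. A fireman later testified to seeing a gash 2 feet deep in the floor of the coal bunker. The pumps were started immediately and seemed able to cope with the flooding and keep the ship afloat. Andrews knew that once the mail room flooded, the ship was finished.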
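The lesson for today’s IT projects is that, to pinpoint a fault, the support team must know the integrated working solution in detail and must be able to break it down into logical layers and decompose it into a series of products and components. The key is to document the work at every stage of the project and to hand that documentation on as knowledge to the support team for the operations stage.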
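After the ship restarted, Boiler Room 6 also began to flood. Only 20 minutes later it was plain how inaccurate the original determination had been. The remedial measures were no longer working, and the mail room was finally lost to flooding. Smith conferred with Andrews and the officers and decided to bring the ship, then making 8 knots, to a gradual stop. Pressing on had taken its toll: the catastrophic rise in water made the ship sit lower, and other sections that had not been affected by the collision began to leak under the water pressure.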
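When an IT solution is unstable and in an MTTR situation, it is important to keep assessing and reassessing the data (evidence) from the operating environment and to monitor that environment for changes. The first fix is usually temporary (see Part 9) and serves only to get the solution back into service. The permanent fix that replaces it may take hours or days to put in place, and the solution may need to be patched in the background. For example, code may have to be reworked, or a new component integrated into the solution. In that case the change must go through rigorous planning and testing (see Parts 4 and 5) before it is put into production under the project’s procedures. This calls for a robust change management process and a test/staging environment.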
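Andrews correctly predicted to Smith that the ship had about two hours before it sank. It was a death sentence. Smith finally recognized that the situation was beyond saving, unlike in the moments right after the collision, when something could still have been done.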
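The lesson for today’s IT projects is that MTTR procedures are cyclical and allow for several attempts within a limited time. Ismay, however, forced the situation beyond the limits of the MTTR procedures, that is, beyond recovery.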
Conclusions
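Many IT projects today severely compromise a critical situation by not following the established operating and recovery procedures. Institutionalized MTTR procedures should help reduce the kind of desperate decision making seen on Titanic and prevent an emergency from turning into a catastrophe. For the same reason, support staff should know the solution in detail. The next installment looks at the disaster recovery stage of the IT project.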
Original text:
In recapping Titanic’s situation, following the collision (Part 8) the ship appeared to be in remarkably good shape. No one had been injured, and from the bridge the integrity of the ship appeared to be sound. White Star Director Bruce Ismay was hell-bent on saving face--and his company’s reputation (Part 9). At 11:50 p.m., 10 minutes after the collision, Ismay pushed to restart the ship and limp Titanic off the ice shelf. Passengers, unaware of any dangers, later testified to their initial relief that the ship was restarting the journey, with little concern about the collision, the potential damage and consequences.
In today’s IT projects it is vital that Mean Time To Recovery (MTTR) procedures for the IT solution (see Part 9) are set up, prepared, planned and tested--in the project itself (Part 4) and "institutionalized" with the staff (operations groups/technical support). Data collected in the second "problem" quadrant (the four quadrants are: detection, determination, resolution and recovery) has to stand up to rigorous review.
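As a rough illustration of the four quadrants, the time spent in each one can be recorded per incident and averaged into an MTTR figure. The sketch below is a minimal example under stated assumptions: the `Incident` record, its field names, and the sample timestamps are all hypothetical and are not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The four quadrants named in the text, in the order they occur.
PHASES = ("detection", "determination", "resolution", "recovery")


@dataclass
class Incident:
    """Hypothetical incident record: fault start plus the end time of each quadrant."""
    started: datetime
    phase_ends: dict  # quadrant name -> datetime at which that quadrant completed

    def phase_durations(self) -> dict:
        """Time spent in each quadrant, measured from the end of the previous one."""
        durations = {}
        previous = self.started
        for phase in PHASES:
            end = self.phase_ends[phase]
            durations[phase] = end - previous
            previous = end
        return durations

    def time_to_recovery(self) -> timedelta:
        """Elapsed time from the start of the fault to the end of recovery."""
        return self.phase_ends["recovery"] - self.started


def mean_time_to_recovery(incidents) -> timedelta:
    """Average time to recovery over a non-empty list of incidents."""
    total = sum((i.time_to_recovery() for i in incidents), timedelta())
    return total / len(incidents)


# Illustrative data only: two made-up incidents of 60 and 120 minutes.
t0 = datetime(2024, 1, 1, 23, 40)
incident_a = Incident(t0, {
    "detection": t0 + timedelta(minutes=5),
    "determination": t0 + timedelta(minutes=25),
    "resolution": t0 + timedelta(minutes=55),
    "recovery": t0 + timedelta(minutes=60),
})
incident_b = Incident(t0, {
    "detection": t0 + timedelta(minutes=10),
    "determination": t0 + timedelta(minutes=50),
    "resolution": t0 + timedelta(minutes=100),
    "recovery": t0 + timedelta(minutes=120),
})
print(mean_time_to_recovery([incident_a, incident_b]))  # 1:30:00
```

Breaking the total down by quadrant shows where the time goes. A long determination quadrant, for example, suggests that the data gathered there deserves the closest review.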
Before a resolution or fix is applied in production, the team needs to assess the overall risk of proceeding with it. Executive intervention is handled like any other input and needs to stand up to careful examination so as not to make the situation worse. Importantly, if it does not make sense, the team must be able to challenge it without any repercussions.
Whether Captain Smith was part of the decision to restart Titanic was not really relevant, as Ismay was in control of the situation and driving forward his own agenda. Smith proceeded to the wireless room to inform the White Star Line in Boston of the situation. Smith was still optimistic; after all, there was great confidence in the design of the ship with its 73 watertight compartments. Smith sent a wireless message outlining that Titanic had struck ice but with little damage. Everyone was safe aboard, and as a precaution the ship was proceeding to Halifax. The message would give White Star time to organize trains and carriages to transport the passengers to New York. Wireless messages were not encrypted and this one was intercepted by the world media. It was the reason why early reports of the collision that appeared in the European press were overwhelmingly optimistic.
In today’s IT projects, MTTR procedures need to be completely controlled by the groups responsible for the IT solution and the services it provides. Communications or announcements related to an outage need to be made in close conjunction with these groups, as do support decisions made when communicating externally to the service recipients of a solution. Inaccurate information can quickly erode confidence in the service provider.
The second search party, with the architect Thomas Andrews and the carpenter John Hutchinson, returned with a more accurate assessment of the situation and better data. The first search party had not descended enough decks to see the full extent of the damage. Within seconds of the collision, flooding had occurred in the coal bunkers and Boiler Room 5. One of the firemen later testified to seeing a gaping hole 2 feet into the floor of the coal bunker. Suction lines were set up right away and the pumping seemed to be coping with the rate of flooding to keep the ship afloat. Andrews knew that if the mail room was lost to flooding, the ship was doomed.
The lesson from this for IT projects today is that, in order to pinpoint faults, the support team needs a detailed knowledge of the integrated working solution, and the ability to break it down into logical layers and decompose it into a sequence of products and components. Creating documentation at each stage of the project, and then transferring it as knowledge to support staff for later use in operation, is key.
After the ship restarted, Boiler Room 6 started to flood. Around 20 minutes later it was apparent that the initial determination was grossly inaccurate, and the fix was not resolving the situation. The mail room was lost to flooding. Smith conferred with Andrews and the officers, determining that the ship, now sailing at 8 knots, should come to a gradual stop. The forward motion had taken its toll. The ship had taken on more water, and the increased flooding was becoming catastrophic. Other parts of the ship, which were initially unaffected, had started to spring leaks under the strain of the water.
In today’s world, in an MTTR situation where an IT solution falters, it is important to keep assessing and reassessing the environmental data (evidence) and monitoring the environment for any changes. The first fix applied is usually temporary (Part 9), so as to get the solution online and back into service. It may take hours or days to get a permanent fix in place. The solution may have to be patched up in the background. For example, code may have to be reworked or a new component integrated into the solution. This then needs to go through rigorous planning and testing (Part 4 and Part 5) before it is implemented in production using the procedures from the project, hence the requirement for a robust change management process and a test/staging environment.
Andrews rightly predicted to Smith that the ship had approximately two hours before foundering. This was a death sentence, and Smith finally recognized the situation was hopeless and not recoverable like it had been right after the collision.
The lesson from this for IT projects today is that MTTR procedures are cyclical and allow for several attempts at recovery, in a limited time frame. However, Ismay forced a situation where the ship went beyond MTTR or recovery.
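The cyclical, time-boxed nature of recovery can be sketched as a loop that tries candidate fixes in order and reassesses after each one. This is only an illustration under stated assumptions: `recover_within`, the fix names, and the health check are hypothetical and do not come from any real tool.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def recover_within(
    fixes: Iterable[Tuple[str, Callable[[], None]]],
    is_healthy: Callable[[], bool],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Apply candidate fixes in order until the solution is healthy or time runs out.

    Returns the name of the fix that restored service, or None if the budget
    expired or every fix failed, in which case the situation has moved beyond
    the recovery procedure and needs escalating.
    """
    deadline = clock() + budget_seconds
    for name, apply_fix in fixes:
        if clock() >= deadline:
            break  # the recovery window has closed; stop trying fixes
        apply_fix()
        if is_healthy():  # reassess the evidence after every attempt
            return name
    return None


# Illustrative use with a made-up service state.
state = {"up": False}


def restart_service() -> None:
    pass  # hypothetical first fix that does not help


def fail_over() -> None:
    state["up"] = True  # hypothetical second fix that restores service


result = recover_within(
    [("restart", restart_service), ("failover", fail_over)],
    is_healthy=lambda: state["up"],
    budget_seconds=60,
)
print(result)  # failover
```

A return value of `None` is the signal that matters. It marks the point at which further attempts inside the recovery procedure are no longer an option and the situation has to be escalated.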
Conclusions
Today, many IT projects severely compromise a critical situation by not following an established process in operation and recovery of a solution. Institutionalized MTTR procedures should help minimize disparate decision making as carried out on Titanic and prevent a critical situation from becoming catastrophic. So should the support staff’s detailed knowledge of the solution. The next installment will look at the disaster recovery stage of the IT project.