By Mark Kozak-Holland. Translated by Yang Lei (楊磊)
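To recap the situation briefly: because problems with the key feedback mechanisms [see Part 7] went unreported, possibly for fear of reprisal, Titanic’s early warning system had effectively failed. On top of this came general overconfidence in the ship’s safety [see Part 4], indifference to the fate of the French liner Niagara [see Part 6], and inaccurate information about the extent of the giant ice field [see Part 7]. Together these led to a state of gross indifference. Finally, Ismay’s pressure and the new SLO ("service level objective") [see Part 5] pushed Titanic to her top speed, beyond her operational limits.
Titanic was thus heading for a collision, and it was in fact almost unavoidable. The ship ran at full speed through still, icy waters littered with small bergs and pieces of ice. The lookouts, lacking binoculars and with a biting wind beating on their eyes, were trying to pick out the horizon through the haze common in such conditions. As they strained to make out a huge dark mass looming like a mirage ahead of them, their report to the bridge was delayed.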
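The lesson for today’s IT projects is that operations staff can keep a newly operational solution under control only once they are thoroughly familiar with it. Their stance toward a new solution should be to prevent problems before they occur and to ensure that it meets its service levels. They also need good insight into the solution itself and the environment around it. They should be able to quickly analyze and assess the data collected from the feedback mechanisms set up during the project’s planning and testing stages [see Part 4]. When signal and noise in those mechanisms become hard to tell apart, they must diagnose the situation, determine the deviation from the norm, and determine the potential impact and its overall extent. They also need to decide correctly whether to escalate, and at what priority.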
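Because the sea was calm and no waves broke against the base of the growler, it was almost impossible to spot it early from a distance through the haze. By the time Titanic’s lookouts determined that the dark mass was in fact a "growler," or so-called "black iceberg" (an iceberg that has flipped over and is dark in color), the situation had turned critical. Once sure of their sighting, they sent the bridge the famous report "Iceberg dead ahead!" Officer Murdoch, the officer on duty, calmly took the report and with his binoculars judged the iceberg to be 900 yards away. From all the evidence available today, Murdoch responded with the following actions:
· First, he cut power to the engines. This was sensible, because putting the engines straight into reverse would only churn up the water under the ship and hamper the steering, making the ship hard to control.
· Next, since there was not enough distance to stop the ship and no way around the iceberg, he tried a turn to port, or an S-turn (hard a-port, immediately followed by hard a-starboard), in an attempt to slow the ship sharply. In the mere 40 seconds of reaction time available, this maneuver might bring the ship parallel to the iceberg rather than into a head-on collision.
· Third, as a precaution, he threw the electric switch to close the watertight bulkhead doors.
In hindsight, these were probably the best emergency measures available at the time.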
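The lesson for today’s IT projects is that, in an emergency, any anomaly spotted should be escalated smoothly, level by level, between operations staff (the lookouts) and the tiers of technical support staff (the bridge officers). This escalation path, which exists for safety’s sake, must be established during the project’s testing stage, through operability testing and operational testing. Only once operations staff are familiar with both the solution and the working environment can more streamlined escalation procedures be set up.
On this point, the Titanic project itself was clearly deficient. For example, too little time was set aside for testing. During sea trials the officers never tried steering the ship through an S-turn, nor did they simulate handling the ship under difficult, dire, or sudden emergency conditions as part of accident prevention.
The lesson for today’s IT projects is that operations and technical support staff need to set aside time to think through the critical scenarios affecting the solution’s operability, and to work out failure-prevention strategies and preset, proven courses of action. All of this must be completed and validated before the project is implemented. It also includes considering how to override automated operators, which could otherwise make matters worse in a critical situation. In short, the ultimate goal is to prevent an outage, or the loss of the whole service, in the first place.
As Titanic swung back to starboard, Murdoch could no longer clear the iceberg, and he and his colleagues on the bridge could only brace themselves for a collision.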
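Conclusions
Many of today’s IT projects are badly compromised because they pay too little attention to the operations stage. Setting up operations becomes an afterthought. Operations staff join the project team only at implementation, instead of joining and playing a prominent role during the planning and testing stages. Yet from the business’s point of view, operations bears direct and fundamental responsibility for upholding service levels. If an adequate operation (people, processes, tools) is not set up around a solution in the first place, the inevitable result is operational problems and potential failures that keep surfacing for days, weeks, or even months, and possibly an outage of the whole service.
Titanic’s levels of support had no time to get to know their ship. They failed to clarify the scope of the anomalies and failed to pool their intelligence. Murdoch’s final orders and maneuver were well executed, but had the maneuver been tested beforehand, it might have saved the ship. The friction between front-line operations and technical support over the missing binoculars did not help either, and the lookouts’ hesitation wasted the most precious final seconds.
The next installment will look at how a manageable situation turned into a disaster.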
Original text:
To recap the situation: Titanic’s early warning system had failed because problems with key feedback mechanisms went unreported (see Part 7), possibly for fear of reprisal. This, coupled with general overconfidence in the safety of the ship (see Part 4), apathy toward the fate of the French liner Niagara (see Part 6), and inaccurate information on the extent of the giant ice field (see Part 7), led to a state of gross indifference. Finally, Ismay’s pressure and the new SLO (see Part 5) pushed Titanic to her highest speed and past her operational limits.
Titanic was heading for a collision. In fact, it was almost inevitable. The ship, at its maximum speed, raced through still, icy waters littered with small bergs and pieces of ice. The lookouts, without binoculars and with a freezing wind hitting their eyes, were trying to make out the horizon through the haze common in these conditions. As they struggled to discern the shape of a dark mass looming in front of them, they delayed reporting it to the bridge.
The lesson for today’s IT projects is that, to monitor a newly operational solution, operations staff need to be very familiar with it. They need to be in a position to proactively prevent failures from happening in the first place and to ensure the solution meets its service levels. They need good visibility into the solution and its surrounding environment. They need to be able to quickly assess and analyze the data in front of them, collected from feedback mechanisms set up during the planned testing stage of the project (see Part 4). As the mechanisms become noisy, they need to diagnose situations and determine deviations from set norms, any potential impacts, and their overall extent. They need to clarify whether something is actually wrong or merely problematic. They need to make the right decision as to whether to escalate, and at what priority.
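The assessment step described here (compare a reading with the set norm, judge how far it deviates, then decide whether and how urgently to escalate) can be sketched in Python. This is only an illustration under stated assumptions, not something taken from the article: the function name `assess`, the use of standard deviations as the yardstick, the thresholds `warn_at` and `critical_at`, and the priority labels are all hypothetical choices.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Assessment:
    deviation: float  # distance from the norm, in standard deviations
    escalate: bool    # whether to pass the issue up to technical support
    priority: str     # "none", "low", or "high" (hypothetical labels)


def assess(baseline: list[float], reading: float,
           warn_at: float = 2.0, critical_at: float = 4.0) -> Assessment:
    """Compare one reading against the norm observed during testing."""
    norm = mean(baseline)
    spread = stdev(baseline)
    if spread > 0:
        deviation = abs(reading - norm) / spread
    else:
        # A perfectly flat baseline: any difference at all is off the scale.
        deviation = 0.0 if reading == norm else float("inf")

    if deviation >= critical_at:
        return Assessment(deviation, True, "high")   # actually wrong
    if deviation >= warn_at:
        return Assessment(deviation, True, "low")    # merely problematic
    return Assessment(deviation, False, "none")      # within the norm
```

The baseline stands in for data gathered from the feedback mechanisms during testing. Without such a baseline there is no norm to deviate from, which is the position the lookouts were in.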
Titanic’s lookouts determined that the dark mass was in fact a "growler," or "black iceberg": an iceberg that has flipped over and is dark in color. With a calm sea and no breakers against the base of the growler, it had been practically invisible in the haze. The situation had now turned critical. Once sure of their sighting, they notified the bridge with the infamous "Iceberg dead ahead!" Officer Murdoch, the chief duty officer, calmly took the call and with his binoculars confirmed the sighting about 900 yards ahead. From all the evidence available today, Murdoch took the following actions:
· First, he cut power to the engines. This made sense, as putting the engines into reverse would just churn up the water and limit the steering and handling capability of the ship.
· Second, there was not enough distance to stop the ship, and he could not get around the iceberg. So he attempted a port-around, or S-turn, steering first hard a-port and then hard a-starboard in an effort to sharply decelerate the ship. With only 40 seconds of reaction time, this would bring him parallel to the iceberg rather than into a head-on collision.
· Third, he threw the electric switch to close the watertight bulkhead doors as a precaution.
In hindsight, this was probably the best possible course of action.
The lesson for today’s IT projects is that, in a critical situation, any anomalies spotted must be acted on through a smooth escalation between operations (the lookouts) and the levels of technical support staff (the bridge officers). This trouble-free escalation needs to be established in the project testing stage (see Part 4), through operability and operational testing. As operations staff become familiar with the solution and its environment, they can set up more effective procedures.
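A level-by-level escalation of this kind can be pictured as an ordered chain of tiers, each of which either resolves the anomaly or hands it on. The sketch below is a hypothetical illustration only: the `escalate` function, the tier names, and the example anomalies are assumptions made for the example, not part of the article.

```python
from typing import Callable, Optional

# A tier is a (name, handler) pair; the handler returns True if it
# resolved the anomaly and False if it must be passed further up.
Tier = tuple[str, Callable[[str], bool]]


def escalate(anomaly: str, tiers: list[Tier]) -> tuple[Optional[str], list[str]]:
    """Pass the anomaly up the tiers in order until one resolves it.

    Returns the name of the resolving tier (None if nobody could
    resolve it) and the list of tiers the anomaly passed through.
    """
    path: list[str] = []
    for name, handler in tiers:
        path.append(name)
        if handler(anomaly):
            return name, path
    return None, path


# Hypothetical two-tier setup: operations first, then technical support.
tiers: list[Tier] = [
    ("operations", lambda a: a == "disk nearly full"),
    ("technical support", lambda a: a == "database failover"),
]
```

Rehearsing this chain during testing is what shows whether each hand-off works, including the case where no tier can resolve the anomaly.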
At this point it is evident that there were serious deficiencies in the Titanic project itself. For example, the time set aside for testing was too short, and during sea trials the officers neither went through any S-turn maneuvers nor simulated handling the ship under rough or dire conditions, or in an emergency, as part of accident prevention.
The lesson for today’s IT projects is that operations and technical support staff need time to map out critical scenarios for the operability of the solution, work out strategies for failure prevention, and determine preset and proven courses of action. These need to be carefully carried out and tested prior to implementation. This includes identifying automated operators that may need to be overridden, since they could otherwise cause more problems in a critical situation. After all, the ultimate goal is to prevent an outage, or loss of service, from occurring in the first place.
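One way to picture "preset and proven courses of action" is a runbook per critical scenario that records its steps, whether it has been rehearsed, and which automation to pause before anyone acts. The Python sketch below is a hypothetical illustration: the `Runbook` and `AutomatedOperator` classes, the `respond` function, and the example scenario are assumptions for this example, not a real tool or the author’s method.

```python
from dataclasses import dataclass, field


@dataclass
class AutomatedOperator:
    """Stand-in for any automation that acts on the system (hypothetical)."""
    name: str
    enabled: bool = True


@dataclass
class Runbook:
    scenario: str
    steps: list[str]
    tested: bool = False  # set to True only after a rehearsal
    override: list[str] = field(default_factory=list)  # automation to pause first


def respond(runbook: Runbook, automation: dict[str, AutomatedOperator]) -> list[str]:
    """Return the actions taken, pausing the listed automation before the steps."""
    if not runbook.tested:
        # An unrehearsed plan is not a proven course of action.
        raise ValueError(f"runbook for {runbook.scenario!r} was never rehearsed")
    actions: list[str] = []
    for name in runbook.override:
        automation[name].enabled = False
        actions.append(f"paused {name}")
    actions.extend(runbook.steps)
    return actions
```

Refusing to run an unrehearsed runbook is a deliberate design choice in this sketch: it mirrors the point that an S-turn never tried in sea trials was not a proven maneuver.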
As Titanic swung back to starboard, Murdoch just failed to clear the iceberg, and he and the bridge staff braced themselves for a collision.
Conclusions
Today, many IT projects severely compromise the operations stage by not paying enough attention to it. Setting up operations is an afterthought, and operations staff are not brought into the project until implementation, rather than taking a prominent role in the planning and testing stages. After all, operations is ultimately responsible for upholding the service levels of the solution to the business. Failing to set up an adequate operation (people, processes, tools) around a solution in the first place will inevitably lead to operational problems that manifest themselves days, weeks, or even months after going live, and to a potential failure or, in the worst case, an outage.
Titanic’s levels of support had little time to familiarize themselves with the ship. They had failed to clarify the scope of the anomalies and to piece together the intelligence. Murdoch’s maneuver was well executed; with some prior testing, perhaps he could have pulled it off. The friction between operations and technical support over the missing binoculars did not help the situation, and the lookouts’ hesitation cost vital seconds.
The next installment will look at how a manageable situation was turned into a disastrous one.