RAS(一)介紹
寫在開篇之前
近期收到了公司大禮包,想著在找工作期間把Linux RAS整理一下,寫成系列文章。畢竟作為OS RAS負(fù)責(zé)人兼開發(fā),為阿里云X86和倚天710 RAS落地了很多RAS增強和解決方案,對阿里云服務(wù)器穩(wěn)定性做出些許貢獻。期間也有不少其他團隊過來請教過RAS事項,所以想著記錄下來,對以后計劃了解和學(xué)習(xí)RAS的Linux愛好者有所幫助。另外個人視角主要從Linux內(nèi)核出發(fā),梳理Linux RAS涉及的組件、功能、特性都有哪些,也會介紹內(nèi)核RAS涉及的硬件。
RAS背景
隨著云時代的到來,各個公司都在將產(chǎn)品、服務(wù)等遷移上云,云為數(shù)字化建設(shè)提供了極大的便利。國內(nèi)外云服務(wù)器廠商如雨后春筍般出現(xiàn),比如谷歌云、阿里云、騰訊云等等。近些年,全球范圍內(nèi)云服務(wù)出現(xiàn)多次宕機事件,云的穩(wěn)定性越來越受到大家關(guān)注。據(jù)《中國數(shù)據(jù)災(zāi)備產(chǎn)業(yè)白皮書暨數(shù)據(jù)災(zāi)備建設(shè)調(diào)研報告》中描述,業(yè)務(wù)宕機1分鐘,平均會使運輸業(yè)損失15萬美元,銀行業(yè)損失27萬美元,通信業(yè)損失35萬美元,制造業(yè)損失42萬美元,證券業(yè)損失45萬美元,同時公司聲譽等無形資產(chǎn)損失更是無法估量。
在嵌入式領(lǐng)域,越來越多的國產(chǎn)自研硬件商用發(fā)布,包括內(nèi)存、硬盤、GPU、CPU等,應(yīng)用于PC主機、汽車電子、工業(yè)控制等市場。隨著這些硬件使用數(shù)量的飛速上升,穩(wěn)定性問題也逐步開始暴露出來。比如筆者在華為內(nèi)核團隊負(fù)責(zé)對接某ARM嵌入式產(chǎn)品業(yè)務(wù),在使用國產(chǎn)中發(fā)現(xiàn)使用國產(chǎn)內(nèi)存條出現(xiàn)硬件問題的概率大大超過國外某大廠品牌。
總體來說,隨著軟件技術(shù)的成熟和完善,因為軟件導(dǎo)致的問題占比逐年減少。此消彼長下,硬件問題占比逐年突顯,比如硬件故障導(dǎo)致服務(wù)器異常或宕機問題已逐步成為云服務(wù)器Top1問題。
RAS定義
服務(wù)器硬件穩(wěn)定性,主要體現(xiàn)在RAS上。RAS指機器的可靠性(Reliability)、可用性(Availability)和可服務(wù)性(Serviceability)。Linux Kernel對Reliability,Availability,Serviceability定義如下
Reliability
is the probability that a system will produce correct outputs.
?Generally measured as Mean Time Between Failures (MTBF)
?Enhanced by features that help to avoid, detect and repair hardware faults
Availability
is the probability that a system is operational at a given time
?Generally measured as a percentage of downtime per a period of time
?Often uses mechanisms to detect and correct hardware faults in runtime;
Serviceability (or maintainability)
is the simplicity and speed with which a system can be repaired or maintained
?Generally measured on Mean Time Between Repair (MTBR)
RAS目標(biāo)是使系統(tǒng)盡可能長期可靠的運行而不停機,減少系統(tǒng)downtime;提供硬件檢測上報機制,以便在硬件錯誤引起數(shù)據(jù)丟失或宕機之前能夠通知管理員及時更換硬件;提供硬件錯誤恢復(fù)機制,并盡可能糾正錯誤,使系統(tǒng)可持續(xù)可靠的運行。
RAS涉及的硬件包括且不限于:CPU、Memory、IO、PCIe、硬盤和其他外設(shè)
?CPU – detect errors at instruction execution and at L1/L2/L3 caches;
?Memory – add error correction logic (ECC) to detect and correct errors;
?I/O – add CRC checksums for tranfered data;
?Storage – RAID, journal file systems, checksums, Self-Monitoring, Analysis and Reporting Technology (SMART).
通常來說,硬件錯誤分為CE、UE、Fatal Error、Non-fatal Error,定義如下
?Correctable Error (CE)?- the error detection mechanism detected and corrected the error. Such errors are usually not fatal, although some Kernel mechanisms allow the system administrator to consider them as fatal.
?Uncorrected Error (UE)?- the amount of errors happened above the error correction threshold, and the system was unable to auto-correct.
?Fatal Error?- when an UE error happens on a critical component of the system (for example, a piece of the Kernel got corrupted by an UE), the only reliable way to avoid data corruption is to hang or reboot the machine.
?Non-fatal Error?- when an UE error happens on an unused component, like a CPU in power down state or an unused memory bank, the system may still run, eventually replacing the affected hardware by a hot spare, if available.
但是實際這個定義比較寬泛且簡陋,比如還有Defferred Error(DE)
Deferred error
The error was detected, was not corrected, and was deferred. The error has not been silently propagated. The error might be latent in the system. It is IMPLEMENTATION DEFINED whether the error continues to infect the state of the node or whether it has been deferred to the consumer. The node continues to operate. If the error might have been silently propagated, it must be reported as an Uncorrected error.
又比如Intel將軟件可恢復(fù)的UC Error定義為UCR(Uncorrected Recoverable) Error,下面又分為SRAR、SRAO、UCNA等。
RAS基本框圖
RAS基本流程框圖如上,硬件發(fā)生故障后,通過硬件RAS能力觸發(fā)中斷或異常,通知到Firmware/OS,軟件收到通知后采取相應(yīng)的策略,比如Panic、執(zhí)行Recover actions或者通知到用戶。
隨著RAS功能不斷更新迭代以及架構(gòu)不同,RAS體系開始呈現(xiàn)多樣性,因不同使用場景所有不同,體現(xiàn)在:
1.通知方式多樣
通知方式細(xì)分下來包括IRQ、Exception、Poll、SEA、SDEI、GPIO等方式。
2.Mode多樣
硬件故障先通知到Firmware,然后Firmware帶外處理或再通知到OS的方式,稱為Firmware First Mode;
硬件故障通知到OS,OS處理硬件故障的方式,稱為Kernel First Mode;
這兩種方式還可以支持混合使用,各有優(yōu)劣,要學(xué)會因地制宜。比如對于CE來說,服務(wù)器經(jīng)常發(fā)生大量CE事件,就會產(chǎn)生CE Irq風(fēng)暴,CPU長時間在處理這些Irq,就會導(dǎo)致其他任務(wù)得不到調(diào)度,影響整體性能。
3.芯片架構(gòu)、硬件多樣性
隨著近些年芯片行業(yè)發(fā)展,芯片架構(gòu)越來越多樣性,包括Intel、AMD、ARM、RISC等,不同芯片架構(gòu)下硬件組成也有些許差異。
4.軟件多樣性
對于Linux驅(qū)動來說,包括mce驅(qū)動、apei驅(qū)動、edac驅(qū)動等;
對于用戶態(tài)RAS服務(wù)來說,包括mcelog、rasdaemon、perf event通知等;
總體來說,RAS是一個復(fù)雜的體系,不同芯片架構(gòu)、不同硬件RAS功能各不相同,作為RAS開發(fā)要根據(jù)不同業(yè)務(wù)場景采取對應(yīng)的RAS方案。
RAS故障處理流程
以Intel服務(wù)器為例,
1.Intel服務(wù)器內(nèi)存發(fā)生CE故障后,硬件觸發(fā)CMCI中斷,執(zhí)行OS注冊的中斷處理函數(shù);
2.該函數(shù)調(diào)用EDAC驅(qū)動代碼,讀取MCA狀態(tài)寄存器來獲取硬件故障信息,比如故障級別、故障硬件位置、故障地址等等。EDAC驅(qū)動會將信息保存在/dev/mcelog;
3.Mcelog是一個用戶態(tài)的服務(wù)程序,通過解析/dev/mcelog信息,將其保存在/var/log/mcelog。用戶可以通過查看該文件了解此服務(wù)器是否發(fā)生過硬件故障以及故障發(fā)生的時間、硬件信息、是否恢復(fù)等關(guān)鍵信息;
RAS硬件故障舉例
如下是x86服務(wù)器注入內(nèi)存CE故障的日志,EDAC驅(qū)動會打印故障發(fā)生所在的硬件(Memory)、Addr、Processor、類型(CE)、memory channel/dimm等信息。
?
C++ [22715.830801] EDAC sbridge MC3: HANDLING MCE MEMORY ERROR [22715.834759] EDAC sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090 [22715.834759] EDAC sbridge MC3: TSC 0 [22715.834759] EDAC sbridge MC3: ADDR 12345000 EDAC sbridge MC3: MISC 144780c86 [22715.834759] EDAC sbridge MC3: PROCESSOR 0:306e7 TIME 1422553404 SOCKET 0 APIC 0 [22716.616173] EDAC MC3: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x12345 offset:0x0 grain:32 syndrome:0x0 - ?area:DRAM err_code0090 socket:0 channel_mask:1 rank:0) |
?
圖片來源:v2-8e4986144a6a70301ee1a30c60c5ffad_720w.webp (720×378) (zhimg.com)
編輯:黃飛
?
評論
查看更多