White paper download link: Synchronization Overview and Case Study on Arm Architecture
https://developer.arm.com/documentation/107630/latest/

1. Introduction

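As Arm servers have become more widely used in recent years, more cloud vendors have begun offering Arm-based cloud instances, and more developers are writing software for Arm platforms.

Synchronization is a hot topic in software migration and optimization. Arm-based servers usually have more CPU cores than servers built on other architectures, which makes a deep understanding of synchronization even more important.

One of the most notable differences between Arm and x86 CPUs is the memory model: the Arm architecture has a weak memory model, unlike the TSO (Total Store Order) model of the x86 architecture. Because of this difference, a program that runs well on one architecture may hit performance problems or bugs on the other. The more relaxed memory model of Arm servers allows more compiler and hardware optimizations to improve system performance, but it is harder to understand and makes it easier to write incorrect code.

We wrote this document to share expertise on synchronization on the Arm architecture and to help developers coming from other architectures develop on Arm systems.
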
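2. Synchronization methods on the Armv8-A architecture

This document first introduces synchronization on the Armv8-A architecture, including atomic operations, Arm memory ordering, and data access barrier instructions.
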
2.1 Atomic operations

Implementing a lock requires atomic access. The Arm architecture defines two types of atomic access:

- Load exclusive and store exclusive
- Atomic operations, introduced in the Armv8.1-A Large System Extensions (LSE)

2.1.1 Exclusive load and store

LDREX/LDXR: The load exclusive instruction performs a load from an addressed memory location, and the PE (for example, the CPU) also marks the physical address being accessed as an exclusive access. The exclusive access mark is checked by store exclusive instructions.

STREX/STXR: The store exclusive instruction tries to store a value from a register to memory if the PE has exclusive access to the memory address. It returns a status value of 0 if the store was successful, or 1 if no store was performed.

2.1.2 LSE Atomic operation

LDXR/STXR use a try-and-test mechanism. LSE is different: it forces the atomic access directly. The main instructions are:

- Compare and Swap instructions, CAS and CASP. These instructions perform a read from memory and compare it against the value held in the first register. If the comparison is equal, the value in the second register is written to memory. If the write is performed, the read and write occur atomically, so that no other modification of the memory location can take place between the read and the write.
- Atomic memory operation instructions, LD<OP> and ST<OP>, where <OP> is one of ADD, CLR, EOR, SET, SMAX, SMIN, UMAX, and UMIN. Each instruction atomically loads a value from memory, performs an operation on the values, and stores the result back to memory. The LD<OP> instructions save the originally read value in the destination register of the instruction.
- Swap instruction, SWP. This instruction atomically reads a location from memory into a register and writes a different supplied value back to the same memory location.

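As a rough illustration of the two kinds of atomic access described above, the following C++ sketch increments a shared counter once with fetch_add and once with a compare-and-swap retry loop. The names counter, add_with_fetch_add, add_with_cas, and run_counter_demo are made up for this example. How the operations are lowered is an assumption about typical AArch64 toolchains: without LSE they usually become LDXR/STXR retry loops, and with LSE enabled (for example with -march=armv8.1-a) they usually become LDADD and CAS instructions.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical shared counter used only for this illustration.
std::atomic<std::uint64_t> counter{0};

// Increment with a single atomic read-modify-write operation.
void add_with_fetch_add(int n) {
    for (int i = 0; i < n; ++i) {
        counter.fetch_add(1, std::memory_order_relaxed);
    }
}

// Increment with an explicit compare-and-swap retry loop.
void add_with_cas(int n) {
    for (int i = 0; i < n; ++i) {
        std::uint64_t seen = counter.load(std::memory_order_relaxed);
        // On failure, compare_exchange_weak refreshes `seen` with the
        // current value, so the loop simply retries with seen + 1.
        while (!counter.compare_exchange_weak(seen, seen + 1,
                                              std::memory_order_relaxed)) {
        }
    }
}

// Runs both variants on several threads and returns the final count.
std::uint64_t run_counter_demo(int threads_per_kind, int n) {
    assert(threads_per_kind > 0 && n >= 0);
    counter.store(0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads_per_kind; ++t) {
        pool.emplace_back(add_with_fetch_add, n);
        pool.emplace_back(add_with_cas, n);
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter.load();
}
```

Calling run_counter_demo(2, 10000) returns 40000, because every increment is atomic whichever form is used. Relaxed ordering is enough here only because the counter carries no other data with it.
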
2.2 Arm memory ordering

The Arm architecture defines a weak memory model, in which memory accesses may not be performed in program order.

2.3 Arm data access barrier instructions

The Arm architecture defines barrier instructions to guarantee the order of memory accesses.

DMB (Data Memory Barrier)

- Explicit memory accesses before the DMB are observed before any explicit access after the DMB
- Does not guarantee when the operations happen, only their order

LDR X0, [X1]    ; Must be seen by memory system before STR
DMB SY
ADD X2, X2, #1  ; May be executed before or after memory system sees LDR
STR X3, [X4]    ; Must be seen by memory system after LDR

DSB (Data Synchronization Barrier)

A DSB is more restrictive than a DMB.

- Use a DSB when necessary, but do not overuse it
- All explicit memory accesses before the DSB in program order must have completed before the DSB can complete
- Any outstanding cache, TLB, and branch predictor operations must also have completed

DC ISW          ; Operation must have completed before DSB can complete
STR X0, [X1]    ; Access must have completed before DSB can complete
DSB SY
ADD X2, X2, #3  ; Cannot be executed until DSB completes
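
From C++, the kind of ordering a DMB provides is usually reached through std::atomic_thread_fence rather than inline assembly. The sketch below passes a value between two threads using only relaxed accesses plus fences. The names payload, ready, producer, consumer, and run_fence_demo are invented for the example, and the statement that AArch64 compilers typically emit a DMB for these fences is an assumption about common toolchains, not something the code depends on.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // plain, non-atomic data (hypothetical)
std::atomic<bool> ready{false};   // flag, accessed with relaxed ordering only

void producer() {
    payload = 42;
    // Fence: the payload write is ordered before the flag write below.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

int consumer() {
    while (!ready.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    // Fence: pairs with the release fence, so the read below sees 42.
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}

// Runs one producer and one consumer; returns what the consumer observed.
int run_fence_demo() {
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    assert(seen == 42);
    return seen;
}
```

Without the two fences, the C++ memory model would treat the access to payload as a data race, and on a weakly ordered machine the consumer could observe the flag before the data.
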
DMB and DSB are two-way barriers that restrict reordering in both directions. Armv8-A also provides one-way barriers, the load-acquire and store-release mechanism, which restrict reordering in one direction only.

Load-Acquire (LDAR)

- All accesses after the LDAR are observed by the memory system after the LDAR.
- Accesses before the LDAR are not affected.

Store-Release (STLR)

- All accesses before the STLR are observed by the memory system before the STLR.
- Accesses after the STLR are not affected.

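The one-way behavior of load-acquire and store-release is what a lock needs: accesses inside the critical section must not move above the lock operation or below the unlock operation. The following is a minimal test-and-set spinlock sketch under that assumption. SpinLock, demo_lock, shared_total, and run_spinlock_demo are illustrative names, and the expectation that the acquire exchange and the release store are compiled to acquire and release forms of the Arm instructions (such as STLR for the unlock) describes typical compiler output rather than a guarantee.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Minimal test-and-set spinlock; the class name is illustrative only.
class SpinLock {
public:
    void lock() {
        // Acquire: accesses in the critical section cannot move above this.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // Wait on a plain load so waiting threads do not keep writing.
            while (locked_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
        }
    }

    void unlock() {
        // Release: accesses in the critical section cannot move below this.
        locked_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> locked_{false};
};

SpinLock demo_lock;
long shared_total = 0;  // protected by demo_lock

// Each thread adds `iterations` to shared_total under the lock.
long run_spinlock_demo(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    shared_total = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([iterations] {
            for (int i = 0; i < iterations; ++i) {
                demo_lock.lock();
                ++shared_total;
                demo_lock.unlock();
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return shared_total;
}
```

With four threads and 10000 iterations each, run_spinlock_demo(4, 10000) returns 40000. A full two-way barrier would also be correct here, but it would restrict more reordering than the lock requires.
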
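3. C++ memory model

With a language-level memory model, developers in most cases do not need to write architecture-specific assembly code. They can rely on a well-designed language-level memory model to write high-quality code without worrying about architectural differences.

C++ memory model: https://en.cppreference.com/w/cpp/header/atomic

We produced a mapping between the C++ memory model and its Armv8-A implementation.
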
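To show what writing against the language-level model looks like, here is a sketch of the classic store-buffering test in portable C++. With memory_order_seq_cst on every access, the C++ memory model forbids the outcome in which both threads read 0. With weaker orderings the model permits that outcome. The names x, y, SbResult, run_store_buffering_once, and count_forbidden_outcomes are hypothetical. The remark that AArch64 compilers typically implement these seq_cst stores and loads with STLR and LDAR is an assumption about common toolchains.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0};
std::atomic<int> y{0};

struct SbResult {
    int r1;  // value of y seen by thread 1
    int r2;  // value of x seen by thread 2
};

// One run of the store-buffering test with sequentially consistent accesses.
SbResult run_store_buffering_once() {
    x.store(0);
    y.store(0);
    int r1 = -1;
    int r2 = -1;
    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();
    return {r1, r2};
}

// Repeats the test and counts runs that saw r1 == 0 and r2 == 0.
int count_forbidden_outcomes(int runs) {
    assert(runs > 0);
    int forbidden = 0;
    for (int i = 0; i < runs; ++i) {
        SbResult r = run_store_buffering_once();
        if (r.r1 == 0 && r.r2 == 0) {
            ++forbidden;
        }
    }
    return forbidden;
}
```

The same source compiles for x86 and Arm, and the compiler chooses the instructions needed to uphold the requested ordering on each architecture, so count_forbidden_outcomes returns 0 on both.
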
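4. Summary

In the white paper, we selected three typical cases for in-depth analysis to help readers understand the topic better. Because synchronization-related programming is very complex, its correctness and performance must be weighed carefully. We recommend first using heavier barrier instructions to guarantee correct logic, and then improving performance by removing redundant barriers or switching to lighter barriers where necessary. A deep understanding of the Arm memory model and the related instructions is essential for accurate, high-performance synchronization programming.

In the appendix, we also introduce memory model tools (the litmus test suite), which help in understanding the memory model and in verifying programs on various architectures.

For a more complete treatment of the topics above, see the white paper “Synchronization Overview and Case Study on Arm Architecture”.

References
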
- Arm, “Arm Architecture Reference Manual Armv8, for Armv8-A architecture profile Documentation”, https://developer.arm.com/docs/ddi0487/latest
- “The software suite diy7”, http://diy.inria.fr/
- “A working example of how to use the herd7 Memory Model Tool”, https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/how-to-use-the-memory-model-tool
- “How to generate litmus tests automatically with the diy7 tool”, https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/generate-litmus-tests-automatically-diy7-tool
- “Running litmus tests on hardware using litmus7”, https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/running-litmus-tests-on-hardware-litmus7