C++ Boost

Boost Statechart 庫

Performance 性能


Speed versus scalability tradeoffs 速度與擴展性的折衷
Memory management customization 內存管理定制化
RTTI customization RTTI定制化
Resource usage 資源使用

Speed versus scalability tradeoffs 速度與擴展性的折衷

Quite a bit of effort has gone into making the library fast for small simple machines and scaleable at the same time (this applies only to state_machine<>, there still is some room for optimizing fifo_scheduler<>, especially for multi-threaded builds). While I believe it should perform reasonably in most applications, the scalability does not come for free. Small, carefully handcrafted state machines will thus easily outperform equivalent Boost.Statechart machines. To get a picture of how big the gap is, I implemented a simple benchmark in the BitMachine example. The Handcrafted example is a handcrafted variant of the 1-bit-BitMachine implementing the same benchmark.
我們已投入了相當多的努力,以使得本庫對於小的簡單狀態機更快速,同時也具有可擴展性(這些努力只用於 state_machine<>, 對於 fifo_scheduler<> 的優化還有很多空間,尤其是多線程的構建)。而我認為要在大多數應用中合理執行的話,可擴展性並不是免費的。因此,小的、精心地手工構造的狀態機很容易勝 過用 Boost.Statechart 構造的同樣的狀態機。為了全面瞭解它們之間到底有多大差距,我在 BitMachine 例子中實現了一個簡單的基準測試。而 Handcrafted 例子則是一個實現相同基準測試的 1-bit-BitMachine 的手工變體。

I tried to create a fair but somewhat unrealistic worst-case scenario:
我試圖創建一個公平的,但有點不切實際的最壞情 況:

The Benchmarks - running on a 3.2GHz Intel Pentium 4 - produced the following dispatch and transition times per event:
基準測試 - 運行在一個 3.2GHz Intel Pentium 4 平台上 - 得到的每個事件的分派和轉換時間如下:

Although this is a big difference I still think it will not be noticeable in most real-world applications. No matter whether an application uses handcrafted or Boost.Statechart machines it will...
雖然有很大差距,但是我仍然認為在真實的應用中是不會感覺到的。無論應用程序使用手工的或是 Boost.Statechart 的狀態機,它都會...

Therefore, in real-world applications event dispatch and transition not normally constitutes a bottleneck and the relative gap between handcrafted and Boost.Statechart machines also becomes much smaller than in the worst-case scenario.
因此,在真實應用中,事件調度和轉換通常不會構成瓶頸,手工構造物與 Boost.Statechart 之間的相對差距也會比最壞情形下的差距小得多

Detailed performance data 具體的性能數據

In an effort to identify the main performance bottleneck, the example program "Performance" has been written. It measures the time that is spent to process one event in different BitMachine variants. In contrast to the BitMachine example, which uses only transitions, Performance uses a varying number of in-state reactions together with transitions. The only difference between in-state-reactions and transitions is that the former neither enter nor exit any states. Apart from that, the same amount of code needs to be run to dispatch an event and execute the resulting action.
BitMachine為了找到主要的性能瓶頸,我們寫了一個程序"Performance"。它測量在不同的 BitMachine 中處理一個事件所花費的時間。和 例子中只使用了轉換相比,Performance 使用了轉換以及許多不同的狀態內反應。狀態內反應與轉換之間的唯一區別在於,前者無需進入和退出任何狀態。此外,分派一個事件和執行相應的動作所需的代碼 量是相同的。

The following diagrams show the average time the library spends to process one event, depending on the percentage of in-state reactions employed. 0% means that all triggered reactions are transitions. 100% means that all triggered reactions are in-state reactions. I draw the following conclusions from these measurements:
以下圖表顯示的是本庫在處理一個事件中花費的平均時間,其依賴於狀態機反應使用率。0%表示所有觸發的反應都是轉換。100%則表示所有觸發的反應都是狀 態機反應。我從這些測量結果中得到以下結論:

Out of the box 缺省配置

The library is used as is, without any optimizations/modifications.
使用缺省配置的庫,不帶任何優化/修改。

PerformanceNormal1PerformanceNormal2PerformanceNormal3PerformanceNormal4

Native RTTI 原生RTTI

The library is used with BOOST_STATECHART_USE_NATIVE_RTTI defined.
使用帶 BOOST_STATECHART_USE_NATIVE_RTTI 定義的庫。

PerformanceNative1PerformanceNative2PerformanceNative3PerformanceNative4

Customized memory-management 定制內存管理

The library is used with customized memory management (boost::fast_pool_allocator).
使用帶定制的內存管理(boost::fast_pool_allocator) 的庫。

PerformanceCustom1PerformanceCustom2PerformanceCustom3PerformanceCustom4

Double dispatch 雙重分派

At the heart of every state machine lies an implementation of double dispatch. This is due to the fact that the incoming event and the active state define exactly which reaction the state machine will produce. For each event dispatch, one virtual call is followed by a linear search for the appropriate reaction, using one RTTI comparison per reaction. The following alternatives were considered but rejected:
在每個狀態機的核心都有一個雙重分派的實現。這是因為以下事實,輸入的事件活 動狀態一起精確地定義了狀態機要產生的 反應。 對於每一個事件分派,將有一次虛擬函數調用加一個對合適反應的線性查找,每個反應使用一個RTTI比較。以下替代方法曾經被考慮但否決:

Memory management customization 內存管理定制化

Out of the box, everything (event objects, state objects, internal data, etc.) is allocated through std::allocator< void > (the default for the Allocator template parameter). This should be satisfactory for applications meeting the following prerequisites:
缺省情況下,所有東西(事件對像、狀態對像、內部數據等等)都是通過 std::allocator< void > (Allocator 模板參數的缺省值)來分配的。這對於符合以下條件的應用來說是可以滿足的:

Should an application not meet these prerequisites, Boost.Statechart's memory management can be customized as follows:
如果一個應用程序不符合以上條件,那麼 Boost.Statechart 的內存管理可以按以下方法進行定制:

RTTI customization RTTI定制化

RTTI is used for event dispatch and state_downcast<>(). Currently, there are exactly two options:
RTTI用於事件的分派和 state_downcast<>()。 目前有兩種選擇:

  1. By default, a speed-optimized internal implementation is employed缺省,使用一個速度優化的內部實現
  2. The library can be instructed to use native C++ RTTI instead by defining BOOST_STATECHART_USE_NATIVE_RTTI本 庫可以定義 BOOST_STATECHART_USE_NATIVE_RTTI 改為指定使用原生的 C++ RTTI

There are 2 reasons to favor 2:
對於選擇2有2個原因:

Resource usage 資源使用

Memory 內存

On a 32-bit box, one empty active state typically needs less than 50 bytes of memory. Even very complex machines will usually have less than 20 simultaneously active states so just about every machine should run with less than one kilobyte of memory (not counting event queues). Obviously, the per-machine memory footprint is offset by whatever state-local members the user adds.
在32位環境下,一個空的活動狀態典型需要小於50個字節的內存。即使是非 常複雜的狀態機,通常都低於20個同時活動的狀態,所以每個狀態機運行時只需要不到1K字節的內存(不算事件隊列)。顯然,每 個狀態機的內存佔用要加上用戶增加的狀態局部成員。

Processor cycles 處理器週期

The following ranking should give a rough picture of what feature will consume how many cycles:
以下排名可以粗略地瞭解到哪些特性要消耗多少週期:

  1. state_cast<>(): By far the most cycle-consuming feature. Searches linearly for a suitable state, using one dynamic_cast per visited state
    state_cast<>(): 目前消耗最多週期的特性。線性查找一個合適的狀態,對每個被訪問的狀態使用一次 dynamic_cast 
  2. State entry and exit: Profiling of the fully optimized 1-bit-BitMachine suggested that roughly 3 quarters of the total event processing time is spent destructing the exited state and constructing the entered state. Obviously, transitions where the innermost common context is "far" from the leaf states and/or with lots of orthogonal states can easily cause the destruction and construction of quite a few states leading to significant amounts of time spent for a transition
    狀態進入和退出:對全優化的 1-bit-BitMachine 的性能測試表明,總的事件處理時間中大約3/4花費在析構被退出的狀態和構造被進入的狀態。顯然,對於 最內層公共上下文 "遠"離葉子狀態的,和/或帶有大量正交狀態的轉換,很容易會引起相當多狀態的析構和構造,這將導致一次轉換花費大量的時間
  3. state_downcast<>(): Searches linearly for the requested state, using one virtual call and one RTTI comparison per visited state
    state_downcast<>(): 線性查找被請求的狀態,對每個被訪問的狀態使用一次虛擬調用和一次 RTTI 比較
  4. Deep history: For all innermost states inside a state passing either has_deep_history or has_full_history to its state base class, a binary search through the (usually small) history map must be performed on each exit. History slot allocation is performed exactly once, at first exit
    深歷史:對一個其狀態基類帶有 has_deep_historyhas_full_history 的狀態,在每次退出時,要對其內部的所有最內層狀態通過(通常是小的)歷史圖執行二分查找。在第一次退出時,執行一次歷史插槽分配
  5. Shallow history: For all direct inner states of a state passing either has_shallow_history or has_full_history to its state base class, a binary search through the (usually small) history map must be performed on each exit. History slot allocation is performed exactly once, at first exit
    淺歷史:對於一個其狀態基類帶有 has_shallow_historyhas_full_history 的狀態,在每次退出時,要對其所有直接內層狀態通過(通常 是小的)歷史圖執行二分查找。在第一次退出時,執行一次歷史插槽分配
  6. Event dispatch: One virtual call followed by a linear search for a suitable reaction, using one RTTI comparison per visited reaction
    事件分派:一次虛擬調用,加一次對適合的 反應 線性查找,對每個被訪問的反應使用一次 RTTI 比較
  7. Orthogonal states: One additional virtual call for each exited state if there is more than one active leaf state before a transition. It should also be noted that the worst-case event dispatch time is multiplied in the presence of orthogonal states. For example, if two orthogonal leaf states are added to a given state configuration, the worst-case time is tripled
    正交狀態:如果在一次轉換之 前有一個以上的活動葉子狀 態,則每個被退出狀態要額外增加一次虛擬調用。還要注意,最壞情況下事件分派的時間要倍乘以正交狀態的數量。例如,如果兩個正交葉子狀態被增加到一個給定 的狀態配置,最壞情況下,消耗時間為3倍

Valid HTML 4.01 Transitional

Revised 03 December, 2006

Copyright © 2003-2006 Andreas Huber Dönni

Distributed under the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)