（1）（キャッシュ）Memory (1) (Cache)

メモリに関する話題（1） - Cache
Memory (1) - Cache
•
•
•
•
Memory-CPUボトルネック/Bottleneck
メモリ階層性/Hierarchy of Memory
メモリ構造/Memory Structure : SRAM/DRAM
Cache（キャッシュ）メモリ
•
•
•
•
Cacheの発明/Innovation of Cache
Cache構造/ Cache Structure
Mapping 方式/Methods
Cache組み込みのキーポイント/Optimization of Cache
• Mapping parametersによる効率（ミス率）
Efficiency (Miss rate) with Mapping methods/parameters
• 更新方式/書き込み制御 Replacement/Restoring
• cacheが有効でない場合（実行プログラム特性）
Cases that the cache works inefficiently (Characteristics of programs)
福永力; Chikara Fukunaga
1
フォンノイマンボトルネック（隘路）
Von Neumann Bottleneck
• 「プログラム内蔵方式のコンピュータ（フォンノイマン型）」はプロセッサ
のプログラム実行時のメモリ参照に弱点がある．
Computer architecture with the “stored program method” needs
frequent access of memory for program execution.
• 通常メモリのアクセススピードはプロセッサの計算スピードに較べて
遅い
Memory access rates are lower than ones of Arithmetic-Logical
processes
• メモリアクセススピードによってプログラムの実行性能が規定される
The memory access rate becomes one
of the important parameters for the
overall performance of the processor .
⇒フォンノイマン型（Von Neumann）
ボトルネック（Bottleneck）
福永力; Chikara Fukunaga
2
メモリの階層性とCache
hierarchy of the storage and Cache
•
•
•
•
Register : FF circuits
Cache : Bipolar, CMOS SRAM
Main Storage : SRAM, DRAM
Disk Cache : DRAM
• 早いメモリは消費電力
発熱量が高い．
Memories with high access rate
have high power dissipation and
consequently high thermal emission.
福永力; Chikara Fukunaga
3
SRAM構造 structure
Static Read Access Memory
Cell Structure (1 bit)
福永力; Chikara Fukunaga
4
DRAM構造
Dynamic RAM structure
• 構造が簡単、記憶原理が簡単
Simple structure and storage method
• コンデンサの充電（1）・放電（0）による記憶
Charge-up (1) and empty (0) of a capacitors
make the 1 bit information in the memory
• 自然放電に対してRefresh動作が必要
Refresh cycle is required to refrain from the
natural discharge of capacitors
• 読み出しは充電されていた電荷の放出→
破壊的読み出し→W線全体の再書き込み
Readout makes the discharge of capacitors,
which means a destructive readout, we
need re-reading for a W line to keep info.
福永力; Chikara Fukunaga
5
プログラムのもつ局所性
Locality of a program
• プログラムの特徴としての局所性（Locality）
• ある短い時間に限ってみるとメモリのある特定の領域の情報（命
令＋データ）のみで動作
A program is operated with the instructions and data stored in a
limited region of the MS in small time interval
• 情報の再利用あるいは近辺の情報の参照による情報更新
Information is updated with the re-use of it and the reference of
neighbourhood info.
• メモリ参照の局所性(locality of the memory reference)
• temporal locality（時間的）
• spatial locality（空間）
福永力; Chikara Fukunaga
6
Cacheの登場
Introduction of Cache
• よく参照されるデータの特殊領域への待避（ボトルネックの解消）
We needed a temporally small storage to keep a copy of frequently accessed
parts in the Main Storage (MS) in order to
eliminate the bottle-neck between CPU and MS.
• それは効率的に局所性を再現するもの
でなくてはならない
It must simulate the parts of MS efficiently.
• 時間にともなう内容変化に対応
The contents must be changed with time
⇒ ⇒
• Cacheの登場 / Development of Cache
• Cacheの組み込み
Installation of Cache between MS and
the main processor
福永力; Chikara Fukunaga
7
Cacheの内部構成
Cache - conceptual structure
• 抽象化されたCacheの構造
Abstracted structure of Cache
• コントロール部（アドレス
デコーダとtag，コンパ
レータ）とCacheメモリ部
Control block (the address decoder, tags,
the comparator) and the Cache memory
(SRAM)
• Main Storage (MS) のアドレス
とtagの素早い参照が重要（Mapping）
Efficient mapping between the actual
address in the Main storage (MS) with
tag is issue for cache design
• データの格納法にもさまざまな工夫
Pick up of hot blocks in MS and Storage
of them in cache are also very important
items for cache construction
福永力; Chikara Fukunaga
8
いくつかの基本データ
Several Basic Data for Cache
•
•
•
•
•
Block size : 1024~4096Byte
Hit cycle : 1 clock or more
Miss Penalty : 8~32 clock
Miss rate : 1%~20%
Cache size : 1k~1M Byte
• Memory (MS) accessing: handled as
2 dim. Array of row and column
for Physically Linear Array of
Memory
福永力; Chikara Fukunaga
9
ブロック（バースト）データ転送
Block (Burst) Data Transfer
• 一般のデータ読込みではアドレス
ラインをバスに出してしばらくして
データがバスにメモリからロードさ
れる．
In general data is loaded on bus a
little after the address in on the bus.
• ブロック転送では開始アドレスと転
送バイトサイズを指定すれば最初
のアドレスロード以降データは間髪
をおかずバスにロードされる．
If the start address and size of data
are specified in a instruction, data
is transferred to the target without
any intervention
福永力; Chikara Fukunaga
10
MSからCacheへのデータ選択法
Mapping method from MS to Cache
Mapping Method
セクター
Sector
ダイレクトマップ
Direct Mapping
フルアソシエーティブ
Full Associative
n-way セットアソシエーティブ
n-way set associative
Mechanism
Characteristics
きめ細やかのMSの領域（ブロッ
ク）の写像
The most sophisticated
algorithm for mapping
アドレス変換に1ステップ必要
Address conversion from
tagging needs an another
hardware step
比較的簡単なマッピング
Simple mapping
ブロックの構成によっては効率悪
い
Inefficiency will be observed
もっとも直感的なマッピング
Simplest idea for mapping
Tag→アドレス変換が複雑
The conversion between
address and tag is complicated
ダイレクトマップとフルアソシエー
ティブ方法の折衷
Mixed idea of Direct Mapping &
Full Associative
福永力; Chikara Fukunaga
今一般的に使われている
11
セクタ方式
Sector Mapping Method
• Memory Segment：例 Example
Row（10bit addr），Block（1024 Byte,16 Block/Row）
• Max. 16 blocks for 1 Row can be registered in cache at once
• Dataのvalidity bit (validity bit for data in a block)
• invalid bitであれば同じブロックのメインストレージからの再読み込み
If valid bit is off (invalid), reread the columns from the Main Storage
• Tag = 10bit address（Row addr.） & 16bit tagged column pattern/block
• Tag match & valid bit → any block in the Row can be used (cache hit)
• If Tag match & invalid bit → blocks must be retransferred
• If Tag mismatch, exchange an old Row with new (now required)
福永力; Chikara Fukunaga
12
ダイレクトマップ方式
Direct Map method
• 各Rowにつき、ただ1つのblockが選ばれcache登録されうる
Only a block in a Row can be registered in cache.
• Tagは単純にRowの順番．例えばBlockのcolumn番号をtagに記録する．
Tag number is simply the number of the row (Tag 0 for Row 0, Tag 1 for Row 1, etc.)
Tag contains the column number in the row in which the cached block is assigned.
• Tagとメインアドレスの比較は10bit以下のcolumn番号のみでよいので制御は容易
Matching algorithm of tag and the address in the MS is therefore simple
• 1 rowにつき1個のブロックしかcacheできない．特定のrowに所属するブロックへの
アクセスが集中すると効率は悪くなる．
Since only a block can be cached for one row, the efficiency will be worsen if
accesses are concentrated into several blocks in the same row
福永力; Chikara Fukunaga
13
フルアソシエーティブ方式
Full Associative mapping
• MSとcache間の対応はBlockで行うがその他の対応はなにもしない．
A block is used still as a unit for mapping between cache and MS, but nothing
others.
• 任意のブロックがcacheの任意の位置に記録される．
No sequential ordering. First block comes in the first free block in Cache
• メインストレージの実アドレスとtagはフルにブロック番号で比較させ，一致をと
るので制御機構が複雑
Matching of the tag and the address in MS is complicated since it needs full bit
comparison.
• 使用頻度の高い複数ブロックの無制限有効利用
Effective use of highly frequently referred blocks without restriction
福永力; Chikara Fukunaga
14
(n-way) セットアソシエーティブ方式
(n-way) Set Associative mapping
• ダイレクトマップとフーリーアソシエーティブ方式の折衷案
Combined method of the direct and fully associative mapping
• cache内で記録block数が1 rowで1個だったものを複数個に拡大．この複数
のblock数をwayと呼ぶ．
Extension of number of blocks in a row from one to n (unit is called way) of
the direct mapping method
• 以下の例は2-way,6 bit row addr.、 8 bit column addr.
The example here is 2 way 6 bit row address,8 bit column address
福永力; Chikara Fukunaga
15
16-way セットアソシエーティブ構成例
16-way Set Associative Mapping Cache-Example
福永力; Chikara Fukunaga
16
Cache構成のキーポイント（1）
Optimization of Cache (1)
• Cacheメモリの容量
Capacity of Cache Memory
• ブロックサイズ
Size of a block
• マッピング方式の選択
Selection of Mapping methods
• セットアソシエーティブwayの数とrow数（set数）
Number of ways and Rows for the Set Associative Mapping
福永力; Chikara Fukunaga
17
Cache構成のキーポイント（2）
Optimization of Cache (2)
• Replacement方式
Cache memory replacement method
• Restore動作- 更新アルゴリズム
Restoring of Cache data – rewrite algorithm
• Compulsory（初動遅延; Initial delay; cache initial state）
• Conflict：ブロック集中アクセスによるミス
（ダイレクトマップ，セットアソシエーティブ）
Conflict: Miss with concentrated access of particular blocks
(Direct, Set-associative Mapping)
• 実行プログラムの特性（データアクセスの効率）
Data access characteristics of programs run on the
machine
福永力; Chikara Fukunaga
18
Cacheパラメータ変更によるMiss rate実測値例
Measurement of Miss rate with Cache parameters
Cache size
From Hennessy & Patterson, CA A Quantum Approach 4th ed.
8,4,2, 1
•
•
•
•
•
•
Cache capacity → increase,
capacity miss，conflict miss → decrease (Fig. C9)
Number of ways → increase,
conflict miss rate → decrease (Fig. C9)
Block size → decrease, inefficient (Fig. C10)
Block size → increase, miss rate → increase
(Fig. C10)
1 ways miss rate ~ cache size 50% of 2 ways
(Fig. C9)
Cache size double → miss rate sqrt(2) (Fig. C9)
福永力; Chikara Fukunaga
19
リプレースメント（旧データの追出し）方式
Replacement (Kick out) of old unused contents
• ミスヒットによるcache更新時のデータ追出しアルゴリズム
Algorithm for the kick-out of old data with miss-hit
• Invalid bitタグのブロックの追放
Invalid tag bit to indicate the useless block
• LRU（Least Recently Used）ブロックの認定・追放
Recognition of Least Recently Used Block, and kick it out
• 履歴用（タイムスタンプ）のtag bitを必要とする
need a time stamp information for each block
• FIFO（First-In First-Out）
• 最古参ブロックから追い出し
The block entered first must be purged out first
• LRUより制御機構が簡単 (Simpler control logic than LRU, just sequence)
• ブロックのランダム抽出（Random Selection）
• 時間にかかわらずどのブロックのアクセス頻度は同じであろうという考え方
The frequency of access to any blocks will be more-or-less constant regardless of the time
• Cache Access Counter for each block
• cacheアクセスごと，あるいはクロックごとにカウンターを増やす．カウンター値と同じライ
ンのブロックを追放，そしてカウンターリセット
Count up with clock or access, and if it exceeds some limit, it will be candidate to be
purged
20
福永力; Chikara Fukunaga
データCache 書き込み制御
Restoring Control
• キャッシュ一貫性制御ともいう
It is also called “Cache Coherence Control”.
• Data cacheのみが対象
The operation is applied to the Data cache.
• データ更新にともなうメインストレージへの再書き
込み
• ストアスルーあるいはライトスルー
（Store-Through or Write-Through）
• ストアイン，ライトバックあるいはコピーバック
（Store-In,Write-Back or Copy-Back）
福永力; Chikara Fukunaga
21
ストア（ライト）スルー
Store（Write）-Through
• 書き込みデータのブロックがcacheでヒット（Write Cache hit）した場合，
cacheデータとMSともに更新（Write）
If the block to be modified by write is found in Cache (Write Cache hit),
Both contents in cache and MS are modified.
• Write Cache missした場合，MSの対応ブロックのデータ更新するとともに
If Write Cache miss, the corresponding block in the MS is updated but
also
• cacheに当該ブロックを読込みMSとcacheをともに更新（Write Allocate），ある
いは
The block is also refilled into the cache, and it is updated, or
• cacheは何もせずMSのみ変更
No touch with the cache (only the MS will be affected).
• ライトバッファ（Write Buffer）
• メインストレージへのアクセス中CPU処理がストールされてしまう．この期間
を最小限にするために中間バッファ（ライトバッファ）を持つ必要がある．
We need a special buffer called “Write Buffer” in order to keep minimize the
access time of the MS to avoid stalls in the CPU pipeline.
福永力; Chikara Fukunaga
22
ストアインあるいはライト（コピー）バック
（Store In or Write（Copy）-Back）
• Write Cache hitした場合，cacheのみ変更（Dirty）
If write Cache hit, only Cache is modified (Dirty)
• Write Cache missした場合，必ずMSからcacheへ当該ブロックを読込
み，cacheのみを変更しcacheとMSとで当該ブロックのある値が異なる
という旨をマークしておく．
（clean bitをOFF → dirty）
If Write Cache miss, always the corresponding block is load into the
Cache from the MS and is modified only in the cache. The block is
marked up with “dirty” from “clean”.
• このブロックがcacheから追出されるときにメインストレージを更新す
る（ライトバック）．
（clean bitをON → clean）
If this block is once kicked out, the MS is updated, and the block is
backed to “clean”.
• Readではcacheのhit/missにかかわらずclean
The block is kept “clean” for Read case regardless of the hit/miss.
福永力; Chikara Fukunaga
23
ストアイン（ライトバック）キャッシュの状態遷移
State Transition Diagram for Store In (Write back)
Cache mechanism
福永力; Chikara Fukunaga
24
cacheの多重（マルチ）レベル化
Multi-Level Cache
• L1，L2
• Victim（n-wayセットアソシエーティブ法
cacheの追い出されたデータ保持用）
Victim buffer for temporal storage buffer
for blocks kicked out in the n-way set
associative mapping.
福永力; Chikara Fukunaga
25
cacheが有効でない場合（1）
Some cases that the cache is not effective (1)
• 長い（要素の多い）ベクトル計算
Calculation of Long one dimensional
arrays
• データ量がcacheサイズを越えるとパ
フォーマンスの低下が見える
If the size exceeds the cache block size,
the degradation of the performance
• サイズのみでなくcacheの構成をも考
慮したプログラミングの配慮が必要
If the cache is structured in multi-level,
the programming must be optimized
with this structure.
Performance (rate) MFLOPS
C=C + A[i] * B[i]
福永力; Chikara Fukunaga
配列長 Array size
QCD (Physics)
配列長 Array size
26
cacheが有効でない場合（2）
Some cases that the cache is not effective (2)
• 行列計算の繰り返し順序
Loop order for Matrix
multiplication of Zij=ΣXik・Ykj
• CとFortranでは異なることも．
Optimization of C and Fortran is
different.
福永力; Chikara Fukunaga
27

Download Report