Meet the Walkers! Accelerating Index Traversals for In-Memory Databases" Onur Kocberber" Boris Grot, Javier Picorel, Babak Falsafi," Kevin Lim, Parthasarathy Ranganathan " Our World is Data-Driven!" Data resides in huge databases" Data! – Most frequent task: find data" " Indexes used for fast data lookup" " – Rely on pointer-intensive data structures" " Index! " Indexing efficiency is critical " – Many requests, abundant parallelism " – Power-limited hardware" " " " Need high-throughput and energy-efficient index lookups"2" Index Lookups on General-Purpose Cores " Index Lookups! • Data in memory" • Inherent parallelism" OoO Cores! • Pointer-chasing ! Low MLP" • Limited OoO inst. window" – One lookup at a time" Index Lookups" " OoO Core" Inst. Window" Memory" OoO cores are ill-matched to indexing" 3" Widx: an Indexing Widget" Specialized: Custom HW for index lookups" – Switch fewer transistors per lookup" " Parallel: Multiple lookups at once" – Extract parallelism beyond the OoO exec. window" " Programmable: Simple RISC cores" – Target a wide range of DBMSs" 3x higher throughput, 5.5x energy reduction vs. OoO" 4" Outline" " • • • • • • Introduction" Indexing in database systems" Indexing inefficiencies in modern processors" Widx" Evaluation highlights" Summary" " 5" Modern Databases & Indexing" Indexes are essential for all database operators" – Data structures for fast data lookup" Hash index: fundamental index structure" " " " " " " " Hash Index" K! 0" R" T" 1" C" Q" 2" E" K" Walk" Dominant operation: join via hash index" " " 6" Join via Hash Index" Join: findon theindex matching values in Ainand B" Lookup for every entry A "" Hash B" Index on B" A" Join" Walk" 7" How Much Time is Spent Indexing?! Measurement on Xeon 5670 CPU with 100GB Dataset" % of Execution Time" Index " Scan " Sort & Join" Other" 100" 75" 50" 25" 0" 2" 3" 5" 7" 8" 9" 11"13"14"15"17"18"19"20"21"22" 5" 37"40"43"46"52"64"81"82" TPC-H" TPC-DS" Indexing is the biggest contributor to execution time" 8" Dissecting Index Lookups" Hash: Avg. 30% time of each lookup " – Computationally intensive, high cache locality" Walk: Avg. 70% time of each lookup" – Trivial computation, low cache locality" Next lookup: Inherently parallel" – Beyond the inst. window capacity" OoO Core" Inst. Window" 9" Outline" • • • • • • " Introduction" Indexing in Database Systems" Indexing inefficiencies in modern processors" Widx" Evaluation highlights" Summary" 10" Roadmap for ! Efficient and High-Throughput Indexing" " 1. Specialize" – Customize hardware for hashing and walking" " 2. Parallelize" – Perform multiple index lookups at a time" 3. Generalize" – Use a programmable building block" " " 11" Step 1: Specialize" Design a dedicated unit for hash and walk" – Hash: compute hash values from a key list" – Walk: access the hash index and follow pointers" Hash" General-purpose" OoO" Walk" Specialized" hash and walk hardware" 12" Decoupled" Decoupled & Parallel" Time" Serial" Step 2: Parallelize" H" W H" W H"H" W W Step 3: Generalize" Widx unit" Widx unit: common building block for hash and walk" – Two-stage RISC core" – Custom ISA" " Widx units are programmable" – Execute functions written in Widx ISA" – Support limitless number of data structure layouts " hash( )" H" " " " walk( )" " " " " W 14" Putting it all together: Widx" When Widx runs, core goes idle" " " Widx" " Hash" Walkers" W W H" OoO" Core" Result " Producer" P" Widx" W W Simple, parallel hardware" MMU" L1" Programming Model" " Development" Write code for each unit " and compile for Widx ISA" Execution" Communicate " Load the code" query-specific inputs" " Hash" hash (arg1, arg2, …)" {……}" " DBMS" Index" Code" Walk" " H" W P" walk (arg1, arg2, …)" {……}" " Res. Produce" " emit (arg1, arg2, …)" {……}" " 16" Outline" • • • • • • " Introduction" Indexing in Database Systems" Indexing inefficiencies in modern processors" Widx" Evaluation highlights" Summary" 17" Methodology" Flexus simulation infrastructure [Wenisch '06]" Benchmarks" – TPC-H on MonetDB" – TPC-DS on MonetDB" – Dataset: 100GB" Area and Power" – – – – – " uArch Parameters" – Core Types" • OoO: 4-wide, 128-entry ROB" • In-order: 2-wide" – Frequency: 2GHz" – L1 (I & D): 32KB" – LLC: 4MB" Synopsys Design Compiler" Technology node: TSMC 40 nm, std. cell" Frequency: 2GHz" Widx Area: 0.24mm2" Widx Power: 0.3W " 18" 0" TPC-H" 3x higher indexing throughput" qry82" qry64" qry52" qry40" OoO" qry37" qry5" qry22" qry20" 6" qry19" qry17" qry11" qry2" Indexing Speedup" Widx Performance" Widx" 5" 4" 3" 2" 1" TPC-DS" 19" 2.5" 2.5" 2" 2" 2" 1.5" 1" 1.5" 1" 0.5" 0.5" 0" 0" Energy-Delay" 2.5" Energy" Runtime" Widx Efficiency" 1.5" 1" 0.5" 0" 5.5x reduction in indexing energy vs. OoO" 20" Conclusions" Indexing is essential in modern DBMSs" " Modern CPUs spend significant time in index lookups" – Not efficient & fall short of extracting parallelism" " Widx: Specialized widget for index lookups" – Efficient, parallel & programmable " 3x higher throughput, 5.5x energy reduction vs. OoO" 21" Thanks!" 22"
© Copyright 2024 ExpyDoc