DI-MMAP: Data-Intensive Memory-map Runtime

Motivation
Our goal is to enable scalable out-of-core computations for data-intensive computing by effectively integrating non-volatile random access memory into the HPC node's memory architecture. We are making NVRAM effective for supporting large persistent data structures and for using DRAM-cached NVRAM as an extension to main memory. Overall, we are enabling latency-tolerant applications to smoothly transition a larger percentage of their working set out-of-core and into persistent memory with minimal performance loss.
[Figure: DI-MMAP runtime & PerMA simulator system diagram. Application processes and threads (e.g. a simulation + analysis workload) access DRAM-resident pages directly with CPU loads and stores; a page fault on memory-mapped NVRAM data is serviced by the DI-MMAP page cache in DRAM, which manages pages with a Primary FIFO, a Hotpage FIFO for pages that re-fault as "hot," and an Eviction Queue with writeback. The backing NVRAM may be PCIe flash, SSD flash, or simulated next-generation NVRAM (e.g. RRAM) provided by the PerMA simulator.]
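The buffering scheme sketched in the diagram can be pictured with a simplified user-space model. The code below only illustrates the flow the diagram labels describe and is not the kernel module's implementation: faulted pages enter a primary FIFO; when a page reaches the head, a page that re-faulted while resident ("is a hot page?") is recycled into the hotpage FIFO, otherwise it moves to the eviction queue, where dirty pages are written back to NVRAM. The structure names, queue capacities, and hot-page threshold are all illustrative assumptions.

/* Simplified, user-space sketch of a DI-MMAP-style FIFO buffering policy.
 * All names, capacities, and thresholds are illustrative assumptions. */
#include <stdbool.h>
#include <stddef.h>

enum { QUEUE_CAP = 4096, HOT_THRESHOLD = 2 };

struct page_desc {
    unsigned long pg_offset;  /* page offset within the mapped file        */
    unsigned int  ref_count;  /* re-references (minor faults) while queued */
    bool          dirty;      /* must be written back to NVRAM on eviction */
};

struct fifo {
    struct page_desc *slots[QUEUE_CAP];
    size_t head, tail, count;
};

static bool fifo_push(struct fifo *q, struct page_desc *p)
{
    if (q->count == QUEUE_CAP)
        return false;
    q->slots[q->tail] = p;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    return true;
}

static struct page_desc *fifo_pop(struct fifo *q)
{
    if (q->count == 0)
        return NULL;
    struct page_desc *p = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return p;
}

/* Hypothetical writeback hook standing in for real I/O to the NVRAM device. */
static void writeback_page(struct page_desc *p) { p->dirty = false; }

/* Age one page out of the primary FIFO to make room for a newly faulted page. */
static void age_one_page(struct fifo *primary, struct fifo *hotpages,
                         struct fifo *evictq)
{
    struct page_desc *victim = fifo_pop(primary);
    if (!victim)
        return;

    if (victim->ref_count >= HOT_THRESHOLD && fifo_push(hotpages, victim)) {
        /* Hot page: keep it resident by recycling it into the hotpage FIFO. */
        victim->ref_count = 0;
        return;
    }
    /* Cold page: queue for eviction; write back if it was modified. */
    if (victim->dirty)
        writeback_page(victim);
    fifo_push(evictq, victim);
}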
Data-Intensive HPC Applications
Data-intensive high-performance computing applications have large data sets and large working sets. Furthermore, they tend to be memory bound due to irregular memory access patterns, a poor computation-to-communication ratio, or latency-sensitive algorithms. Examples of data-intensive applications are the processing of massive real-world graphs, bioinformatics / computational biology, and in-situ VDA algorithms such as streamline tracing. To make applications work well with NVRAM, it is important to tune their data structures for page-level locality and to provide sufficient I/O concurrency. Additionally, adapting algorithms to be more latency tolerant allows computation and communication to overlap the longer memory latencies.
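As a hedged illustration of the page-level-locality tuning described above, the sketch below packs fixed-size vertex records so that no record straddles a 4 KiB page and a single page fault always brings in whole records. The record layout, field names, and sizes are assumptions made for the example, not code from LMAT, HavoqGT, or any other application mentioned here.

/* Illustrative only: size and align records so none crosses a 4 KiB page,
 * which improves page-level locality for memory-mapped NVRAM data. */
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096u

struct vertex_rec {           /* hypothetical 64-byte vertex record */
    uint64_t id;
    uint64_t edge_offset;
    uint64_t edge_count;
    uint64_t pad[5];          /* pad to a size that evenly divides PAGE_SIZE */
};

/* 64 records of 64 bytes fill one page exactly, so a record never spans two
 * pages and one fault services accesses to 64 consecutive records. */
#define RECS_PER_PAGE (PAGE_SIZE / sizeof(struct vertex_rec))

static struct vertex_rec *alloc_vertex_table(size_t nrecs)
{
    size_t pages = (nrecs + RECS_PER_PAGE - 1) / RECS_PER_PAGE;
    void *buf = NULL;
    /* Page-aligned allocation keeps record boundaries aligned with the page
     * boundaries of the memory-mapped backing file as well. */
    if (posix_memalign(&buf, PAGE_SIZE, pages * PAGE_SIZE) != 0)
        return NULL;
    return (struct vertex_rec *)buf;
}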
DI-MMAP runtime
[Figure: K-mer query throughput (K-mers per second) vs. number of threads for standard-fs-mmap and DI-MMAP. Linux RHEL6 2.6.32 mmap vs. DI-MMAP using the Livermore Metagenomic Analysis Toolkit (LMAT) with a 16 GiB DRAM buffer and a 375 GiB persistent database. Linux mmap performance peaks at 16 threads, while DI-MMAP scales past 200 threads; at peak performance, DI-MMAP is 4x faster than Linux mmap.]
[Figure: Performance comparison of Linux RHEL6 2.6.32 mmap and DI-MMAP using the HavoqGT library for large-scale Breadth-First Search (BFS) graph analysis. DI-MMAP provides high-performance, scalable out-of-core execution: 7.44x faster than mmap, and only 23% slower with 50% less DRAM. The graph was R-MAT Scale 31 (146 GiB of vertex and edge data), and the system was provisioned with 16 GiB of DRAM for buffer cache plus 24 GiB of DRAM for algorithmic data.]
Our data-intensive memory-map runtime provides a high-performance alternative to the standard Linux memory-map system. The performance of DI-MMAP scales up with increased application I/O concurrency and does not degrade under memory pressure. It is a loadable kernel module that provides a custom buffering scheme for NVRAM data. Additionally, it is open source and integrated into the Simple Linux Utility for Resource Management (SLURM) and the Tri-Labs Open Source Software (TOSS) environment.
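From the application side, access is plain POSIX memory mapping followed by load/store I/O; the sketch below shows that pattern. The file path is hypothetical and no DI-MMAP-specific calls are shown: the assumption here is simply that the NVRAM-resident data is reachable as a file or device node whose mapping the runtime buffers.

/* Minimal sketch: memory-map a large NVRAM-resident data set and touch it
 * with ordinary loads.  The path is hypothetical; the application-side
 * pattern is standard POSIX mmap regardless of which runtime buffers it. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mnt/nvram/kmer.db";       /* hypothetical data file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole database read-only; pages fault in from NVRAM on demand. */
    const uint8_t *db = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (db == MAP_FAILED) { perror("mmap"); return 1; }

    /* Load/store access: no explicit read() calls, just pointer dereferences. */
    uint64_t checksum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        checksum += db[i];
    printf("touched %lld pages, checksum %llu\n",
           (long long)((st.st_size + 4095) / 4096), (unsigned long long)checksum);

    munmap((void *)db, st.st_size);
    close(fd);
    return 0;
}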
PerMA simulator
Our persistent memory simulator is a companion project to DI-MMAP that allows our runtime to simulate the performance of future generations of NVRAM technologies. With it, we can execute applications at scale and test how they will perform on future platforms.
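One way to picture the mechanism: each access to the backing device is padded up to the latency of the device being simulated, so an existing flash device can stand in for a hypothetical future NVRAM part. The sketch below is a user-space caricature of that idea with made-up latency numbers; it is not the PerMA simulator's actual design.

/* Caricature of latency simulation: perform the real read, then pad the
 * observed latency up to a target latency.  The target value and wrapper
 * are illustrative assumptions, not the PerMA implementation. */
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical read latency of the simulated next-generation NVRAM (25 us).
 * Padding only works when the backing device is faster than the target. */
static const long long TARGET_READ_NS = 25000;

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static ssize_t simulated_pread(int fd, void *buf, size_t len, off_t off)
{
    long long start = now_ns();
    ssize_t n = pread(fd, buf, len, off);   /* real I/O to the backing flash */
    long long elapsed = now_ns() - start;

    if (n >= 0 && elapsed < TARGET_READ_NS) {
        /* Sleep off the difference so the caller observes the target latency. */
        struct timespec pad = { 0, (long)(TARGET_READ_NS - elapsed) };
        nanosleep(&pad, NULL);
    }
    return n;
}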
Runtime Analysis
DI-MMAP provides introspection interfaces that allow applications to log the sequence of page faults and the frequency of page (re-)faults for both online and offline analysis. Additionally, applications can track page-specific statistics such as residency, the number of major and minor faults, and dynamic fault rates.
# Monitoring di-mmap fault sequence window every 1 ms
# Fault Statistics: fault_seq  V  T  D  Q  FC  DevID  pgOffset
# Evicted Page Statistics: DevID  pgOffset
[Sample trace records for fault sequence numbers 1-3 and 42938497-42938503, pairing each fault with statistics for the page evicted to make room for it.]
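A simple offline consumer of such a trace could look like the sketch below, which skips the '#' header lines, pulls the first hex pgOffset token from each record, and counts the records. The trace file name and the token-based parsing are assumptions inferred from the sample above, not a documented format.

/* Offline sketch: count fault records in a DI-MMAP fault trace and extract
 * each record's first hex page-offset token.  File name and parsing rules
 * are assumptions based only on the sample output above. */
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "di-mmap-fault-trace.log";
    FILE *fp = fopen(path, "r");
    if (!fp) { perror("fopen"); return 1; }

    char line[512];
    uint64_t records = 0;
    while (fgets(line, sizeof line, fp)) {
        if (line[0] == '#')                 /* skip header/comment lines */
            continue;
        char *hex = strstr(line, "0x");     /* first page-offset token */
        if (!hex)
            continue;
        uint64_t pg_offset = strtoull(hex, NULL, 16);
        records++;
        /* A fuller analysis would bucket pg_offset in a hash table to get
         * per-page re-fault counts; here we only count records. */
        (void)pg_offset;
    }
    printf("parsed %" PRIu64 " fault records from %s\n", records, path);
    fclose(fp);
    return 0;
}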
FUNDING AGENCY: LLNL/LDRD, ASCR
FUNDING ACKNOWLEDGEMENT: This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
AUTHORS: Brian Van Essen [email protected], Roger Pearce, Sasha Ames,
Maya Gokhale
RESOURCES: Center for Applied Scientific Computing (LLNL), Livermore
Computing (LLNL), https://computation-rnd.llnl.gov/perma/activities.php
LLNL-POST-662211