
Hadoop Beyond Batch: Real-Time Workloads, SQL-on-Hadoop, and the Virtual EDW
Marcel Kornacker | [email protected]
February 2014
Copyright © 2013 Cloudera Inc. All rights reserved.
Analytic Workloads on Hadoop: Where Do We Stand?
“DeWitt Clause” prohibits using DBMS vendor name
Hadoop for Analytic Workloads
• Hadoop has traditionally been used for offline batch processing: ETL and ELT
• Next step: Hadoop for traditional business intelligence (BI)/enterprise data warehouse (EDW) workloads:
  • interactive
  • concurrent users
• Topic of this talk: a Hadoop-based open-source stack for EDW workloads:
  • HDFS: a high-performance storage system
  • Parquet: a state-of-the-art columnar storage format
  • Impala: a modern, open-source SQL engine for Hadoop
Hadoop for Analytic Workloads
• Thesis of this talk:
  • techniques and functionality of established commercial solutions are either already available or are rapidly being implemented in the Hadoop stack
  • the Hadoop stack is an effective solution for certain EDW workloads
  • a Hadoop-based EDW solution maintains Hadoop’s strengths: flexibility, ease of scaling, cost effectiveness
HDFS: A Storage System for Analytic Workloads
• Available in HDFS today:
  • high-efficiency data scans at or near hardware speed, both from disk and memory
• On the immediate roadmap:
  • co-partitioned tables for even faster distributed joins
  • temp-FS: write temp table data straight to memory, bypassing disk
HDFS: The Details
• High-efficiency data transfers:
  • short-circuit reads: bypass the DataNode protocol when reading from local disk
    -> read at 100+ MB/s per disk
  • HDFS caching: access explicitly cached data without copying or checksumming
    -> access memory-resident data at memory bus speed
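As a concrete illustration, short-circuit reads are enabled through client and DataNode configuration; a minimal hdfs-site.xml sketch (the socket path is an example value, not prescribed by the slides):

```xml
<!-- hdfs-site.xml: enable short-circuit local reads (illustrative values) -->
<configuration>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <!-- UNIX domain socket shared by the DataNode and local clients -->
    <name>dfs.domain.socket.path</name>
    <value>/var/run/hdfs-sockets/dn</value>
  </property>
</configuration>
```

HDFS caching is managed separately at runtime, e.g. with `hdfs cacheadmin -addDirective -path <dir> -pool <pool>` to pin a table's files in memory.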
HDFS: The Details
• Coming attractions:
  • affinity groups: collocate blocks from different files
    -> create co-partitioned tables for improved join performance
  • temp-fs: write temp table data straight to memory, bypassing disk
    -> ideal for iterative interactive data analysis
Parquet: Columnar Storage for Hadoop
• What it is:
  • a state-of-the-art, open-source columnar file format available for (most) Hadoop processing frameworks: Impala, Hive, Pig, MapReduce, Cascading, …
  • offers both high compression and high scan efficiency
  • co-developed by Twitter and Cloudera; hosted on GitHub and soon to be an Apache Incubator project
    • with contributors from Criteo, Stripe, Berkeley AMPLab, LinkedIn
  • used in production at Twitter and Criteo
Parquet: The Details
• columnar storage: column-major instead of the traditional row-major layout; used by all high-end analytic DBMSs
• optimized storage of nested data structures: patterned after Dremel’s ColumnIO format
• extensible set of column encodings:
  • run-length and dictionary encodings in the current version (1.2)
  • delta and optimized string encodings in 2.0
• embedded statistics: version 2.0 stores inlined column statistics for further optimization of scan efficiency
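To make the encodings concrete, here is a minimal Python sketch of run-length and dictionary encoding applied to one column; it illustrates the ideas only, not Parquet's actual on-disk byte layout:

```python
# Illustrative run-length and dictionary encodings for a single column.
# These mirror the concepts Parquet uses, not its real bitstream format.

def rle_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def dict_encode(values):
    """Replace each value with a small integer id into a dictionary."""
    dictionary = {}
    ids = []
    for v in values:
        ids.append(dictionary.setdefault(v, len(dictionary)))
    return list(dictionary), ids

column = ["US", "US", "US", "DE", "DE", "US"]
print(rle_encode(column))   # [('US', 3), ('DE', 2), ('US', 1)]
print(dict_encode(column))  # (['US', 'DE'], [0, 0, 0, 1, 1, 0])
```

Low-cardinality columns (country codes, status flags) compress dramatically under either scheme, which is where much of Parquet's storage and scan efficiency comes from.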
Parquet: Storage Efficiency
Parquet: Scan Efficiency
Impala: A Modern, Open-Source SQL Engine
• implementation of an MPP SQL query engine for the Hadoop environment
• highest-performance SQL engine for the Hadoop ecosystem; already outperforms some of its commercial competitors
• effective for EDW-style workloads
• maintains Hadoop flexibility by using standard Hadoop components (HDFS, HBase, Metastore, YARN)
• plays well with traditional BI tools: exposes/interacts with industry-standard interfaces (ODBC/JDBC, Kerberos and LDAP, ANSI SQL)
Impala: A Modern, Open-Source SQL Engine
• history:
  • developed by Cloudera and fully open source; hosted on GitHub
  • released as beta in 10/2012
  • 1.0 version available in 05/2013
  • current version is 1.2.3, available for CDH4 and CDH5 beta
Impala from the User’s Perspective
• create tables as virtual views over data stored in HDFS or HBase; schema metadata is stored in the Metastore (shared with Hive, Pig, etc.; basis of HCatalog)
• connect via ODBC/JDBC; authenticate via Kerberos or LDAP
• run standard SQL:
  • current version: ANSI SQL-92 (limited to SELECT and bulk insert) minus correlated subqueries; has UDFs and UDAs
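A sketch of what this looks like in practice; the table name, columns, and HDFS path below are made up for illustration (syntax per the Impala 1.2 era, where ORDER BY requires LIMIT):

```sql
-- Expose existing Parquet files in HDFS as a table (virtual view over the data).
CREATE EXTERNAL TABLE clicks (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
STORED AS PARQUETFILE
LOCATION '/data/clicks';

-- Standard SQL-92 SELECT; metadata is shared with Hive et al. via the Metastore.
SELECT url, COUNT(*) AS hits
FROM clicks
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```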
Impala from the User’s Perspective
• 2014 SQL roadmap:
  • 1.3: analytic window functions, Order By without Limit, Decimal(<precision>, <scale>)
  • 2.0: support for nested types (structs, arrays, maps), UDTFs
  • >2.0: disk-based joins and aggregation, subqueries and Exists, set operators (Intersect, Minus)
Impala Architecture
• distributed service:
  • a daemon process (impalad) runs on every node with data
  • easily deployed with Cloudera Manager
  • each node can handle user requests; a load-balancer configuration is recommended for multi-user environments
• query execution phases:
  • client request arrives via ODBC/JDBC
  • planner turns the request into a collection of plan fragments
  • coordinator initiates execution on remote impalad nodes
Impala Query Execution
• Request arrives via ODBC/JDBC
Impala Query Execution
• Planner turns the request into a collection of plan fragments
• Coordinator initiates execution on remote impalad nodes
Impala Query Execution
• Intermediate results are streamed between impalad nodes
• Query results are streamed back to the client
Impala Architecture: Query Planning
• 2-phase process:
  • single-node plan: left-deep tree of query operators
  • partitioning into plan fragments for distributed parallel execution: maximize scan locality/minimize data movement, parallelize all query operators
• cost-based join order optimization
• cost-based join distribution optimization
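The join-distribution decision can be sketched as a simple cost comparison between broadcasting the right-hand input to every node versus hash-partitioning both inputs across the network. The cost model below is a deliberate simplification, not Impala's exact formula:

```python
# Simplified cost-based choice between the two common distributed join
# strategies. A broadcast join ships the whole right-hand input to every
# node; a partitioned join shuffles both inputs across the network once.
# The model is illustrative; a real planner uses richer statistics.

def choose_join_distribution(lhs_bytes, rhs_bytes, num_nodes):
    broadcast_cost = rhs_bytes * num_nodes     # rhs copied to each node
    partitioned_cost = lhs_bytes + rhs_bytes   # both sides shuffled once
    return "broadcast" if broadcast_cost <= partitioned_cost else "partitioned"

# Small dimension table joined to a large fact table: broadcast wins.
print(choose_join_distribution(lhs_bytes=10**12, rhs_bytes=10**8, num_nodes=20))
# prints "broadcast"

# Two large tables: partitioning both sides is cheaper.
print(choose_join_distribution(lhs_bytes=10**12, rhs_bytes=10**11, num_nodes=20))
# prints "partitioned"
```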
Impala Architecture: Query Execution
• execution engine designed for efficiency, written from scratch in C++; no reuse of decades-old open-source code
• circumvents MapReduce completely
• in-memory execution:
  • aggregation results and right-hand side inputs of joins are cached in memory
  • example: join with a 1TB table, referencing 2 of 200 columns and 10% of rows
    -> need to cache 1GB across all nodes in the cluster
    -> not a limitation for most workloads
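The memory estimate in the join example above is simple arithmetic; a quick sketch, assuming columns of roughly equal width (as the slide implicitly does):

```python
# Back-of-the-envelope memory estimate for the join example above,
# assuming all 200 columns are roughly the same width.
table_bytes = 10**12          # 1 TB table
column_fraction = 2 / 200     # only 2 of 200 columns are referenced
row_fraction = 0.10           # 10% of rows survive the filters

cached_bytes = table_bytes * column_fraction * row_fraction
print(cached_bytes / 10**9)   # 1.0 -> about 1 GB across the whole cluster
```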
Impala Architecture: Query Execution
• runtime code generation:
  • uses LLVM to JIT-compile the runtime-intensive parts of a query
  • effect is the same as custom-coding a query:
    • remove branches
    • propagate constants, offsets, pointers, etc.
    • inline function calls
  • optimized execution for modern CPUs (instruction pipelines)
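The effect of code generation can be illustrated even in Python: instead of re-interpreting a generic expression per row, resolve the query's constants and branches once and run a specialized function per row. A toy sketch of the idea only; Impala actually emits LLVM IR, a closure stands in here:

```python
# Toy illustration of runtime code generation: branch on the operator and
# bind the constant once per query, not once per row.

def interpret(pred, row):
    """Generic interpreter: re-examines the predicate for every row."""
    op, col, const = pred
    if op == ">":
        return row[col] > const
    if op == "=":
        return row[col] == const
    raise ValueError(op)

def compile_predicate(pred):
    """'Code generation': resolve branches/constants up front, return
    a specialized function with no per-row dispatch."""
    op, col, const = pred
    if op == ">":
        return lambda row: row[col] > const
    if op == "=":
        return lambda row: row[col] == const
    raise ValueError(op)

rows = [(1, 50), (2, 200), (3, 150)]
pred = (">", 1, 100)                     # e.g. WHERE col1 > 100

specialized = compile_predicate(pred)    # one-time setup per query
matches = [r for r in rows if specialized(r)]
print(matches)                           # [(2, 200), (3, 150)]
```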
Impala Architecture: Query Execution
Impala vs MR for Analytic Workloads
• Impala vs. SQL-on-MR:
  • Impala 1.1.1 / Hive 0.12 (“Stinger Phases 1 and 2”)
  • file formats: Parquet / ORCFile
  • TPC-DS, 3TB data set running on a 5-node cluster
Impala vs MR for Analytic Workloads
• Impala speedup:
  • interactive: 8-69x
  • report: 6-68x
  • deep analytics: 10-58x
Impala vs Presto for Analytic Workloads
• Impala 1.2.3 / Presto 0.54
• file formats: RCFile (+ Parquet)
• TPC-DS, 15TB data set running on a 21-node cluster
Impala vs Presto for Analytic Workloads
• Impala speedup:
  • interactive: 2-52x
  • report: 4-8x
  • deep analytics: 4-59x
  • total: 12x
Impala vs Presto for Analytic Workloads
• Multi-user benchmark:
  • 10 users concurrently
  • same dataset, same hardware
  • workload: queries from the “interactive” group
Impala vs Presto for Analytic Workloads
Scalability in Hadoop
• Hadoop’s promise of linear scalability: add more nodes to the cluster, gain a proportional increase in capabilities
  -> adapt to any kind of workload change simply by adding more nodes to the cluster
• scaling dimensions for EDW workloads:
  • response time
  • concurrency/query throughput
  • data size
Scalability in Hadoop
• Scalability results for Impala:
  • tests show linear scaling along all 3 dimensions
  • setup:
    • 2 clusters: 18 and 36 nodes
    • 15TB TPC-DS data set
    • 6 “interactive” TPC-DS queries
Impala Scalability: Latency
Impala Scalability: Concurrency
• Comparison: 10 vs. 20 concurrent users
Impala Scalability: Data Size
• Comparison: 15TB vs. 30TB data set
Summary: Hadoop for Analytic Workloads
• Thesis of this talk:
  • techniques and functionality of established commercial solutions are either already available or are rapidly being implemented in the Hadoop stack
  • Impala/Parquet/HDFS is an effective solution for certain EDW workloads
  • a Hadoop-based EDW solution maintains Hadoop’s strengths: flexibility, ease of scaling, cost effectiveness
Summary: Hadoop for Analytic Workloads
• latest technological innovations add capabilities that originated in high-end proprietary systems:
  • high-performance disk scans and memory caching in HDFS
  • Parquet: columnar storage for analytic workloads
  • Impala: high-performance parallel SQL execution
Summary: Hadoop for Analytic Workloads
• Impala/Parquet/HDFS for EDW workloads:
  • integrates into BI environments via standard connectivity and security
  • comparable or better performance than commercial competitors
  • currently still has SQL limitations
    • but those are rapidly diminishing
Summary: Hadoop for Analytic Workloads
• Impala/Parquet/HDFS maintains traditional Hadoop strengths:
  • flexibility: Parquet is understood across the platform, natively processed by most popular frameworks
  • demonstrated scalability and cost effectiveness
The End
Summary: Hadoop for Analytic Workloads
• what the future holds:
  • further performance gains
  • more complete SQL capabilities
  • improved resource management and the ability to handle multiple concurrent workloads in a single cluster