Hadoop Beyond Batch: Real-Time Workloads, SQL-on-Hadoop, and the Virtual EDW
Marcel Kornacker | [email protected]
February 2014
Copyright © 2013 Cloudera Inc. All rights reserved.

Analytic Workloads on Hadoop: Where Do We Stand?
• [benchmark chart; the “DeWitt clause” prohibits naming the DBMS vendor]

Hadoop for Analytic Workloads
• Hadoop has traditionally been used for offline batch processing: ETL and ELT
• Next step: Hadoop for traditional business intelligence (BI) / enterprise data warehouse (EDW) workloads:
  • interactive
  • concurrent users
• Topic of this talk: a Hadoop-based open-source stack for EDW workloads:
  • HDFS: a high-performance storage system
  • Parquet: a state-of-the-art columnar storage format
  • Impala: a modern, open-source SQL engine for Hadoop

Hadoop for Analytic Workloads
• Thesis of this talk:
  • techniques and functionality of established commercial solutions are either already available in the Hadoop stack or are rapidly being implemented
  • the Hadoop stack is an effective solution for certain EDW workloads
  • a Hadoop-based EDW solution maintains Hadoop’s strengths: flexibility, ease of scaling, cost effectiveness

HDFS: A Storage System for Analytic Workloads
• Available in HDFS today:
  • high-efficiency data scans at or near hardware speed, both from disk and from memory
• On the immediate roadmap:
  • co-partitioned tables for even faster distributed joins
  • temp-fs: write temp-table data straight to memory, bypassing disk

HDFS: The Details
• High-efficiency data transfers:
  • short-circuit reads: bypass the DataNode protocol when reading from local disk -> read at 100+ MB/s per disk
  • HDFS caching: access explicitly cached data without copying or checksumming -> access memory-resident data at memory-bus speed
  • (see the configuration/API sketch below)

HDFS: The Details
• Coming attractions:
  • affinity groups: collocate blocks from different files -> create co-partitioned tables for improved join performance
  • temp-fs: write temp-table data straight to memory, bypassing disk -> ideal for iterative, interactive data analysis

Parquet: Columnar Storage for Hadoop
• What it is:
  • a state-of-the-art, open-source columnar file format, available to (most) Hadoop processing frameworks: Impala, Hive, Pig, MapReduce, Cascading, …
  • offers both high compression and high scan efficiency
  • co-developed by Twitter and Cloudera; hosted on GitHub and soon to be an Apache incubator project
  • with contributors from Criteo, Stripe, Berkeley AMPLab, LinkedIn
  • used in production at Twitter and Criteo

Parquet: The Details
• columnar storage: column-major instead of the traditional row-major layout; used by all high-end analytic DBMSs
• optimized storage of nested data structures: patterned after Dremel’s ColumnIO format
• extensible set of column encodings:
  • run-length and dictionary encodings in the current version (1.2)
  • delta and optimized string encodings in 2.0
• embedded statistics: version 2.0 stores inlined column statistics for further optimization of scan efficiency
• (see the writer sketch below)

Parquet: Storage Efficiency
• [chart]
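The short-circuit-read and caching features above are exposed through client configuration and a small admin API. Below is a minimal sketch, assuming a Hadoop 2.3+ client whose fs.defaultFS points at the cluster; the cache-pool name and table path are hypothetical, and short-circuit reads would normally be enabled cluster-wide in hdfs-site.xml rather than in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

public class HdfsCachingSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Short-circuit reads: a client co-located with a DataNode reads
        // local blocks directly from disk, bypassing the DataNode protocol.
        conf.setBoolean("dfs.client.read.shortcircuit", true);
        conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn");

        // Assumes fs.defaultFS is an hdfs:// URI.
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // HDFS caching: pin hot table data in DataNode memory so readers
        // skip the copy and checksum steps on every access.
        dfs.addCachePool(new CachePoolInfo("analytics_pool")); // hypothetical pool
        dfs.addCacheDirective(new CacheDirectiveInfo.Builder()
            .setPath(new Path("/warehouse/sales"))             // hypothetical path
            .setPool("analytics_pool")
            .build());
    }
}
```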
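To make Parquet's framework neutrality concrete, here is a minimal writer sketch using the parquet-avro binding (the pre-Apache `parquet.avro` package, matching the 1.x line discussed above); the schema, output path, and values are purely illustrative.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import parquet.avro.AvroParquetWriter;

public class ParquetWriteSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative two-column schema; Parquet stores each column
        // contiguously (column-major), enabling per-column encodings.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"category\",\"type\":\"string\"}]}");

        AvroParquetWriter<GenericRecord> writer =
            new AvroParquetWriter<GenericRecord>(
                new Path("/tmp/events.parquet"), schema);

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("id", 1L);
        rec.put("category", "clicks"); // low-cardinality: dictionary-encodes well
        writer.write(rec);
        writer.close();
    }
}
```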
Parquet: Scan Efficiency
• [chart]

Impala: A Modern, Open-Source SQL Engine
• implementation of an MPP SQL query engine for the Hadoop environment
• highest-performance SQL engine for the Hadoop ecosystem; already outperforms some of its commercial competitors
• effective for EDW-style workloads
• maintains Hadoop flexibility by utilizing standard Hadoop components (HDFS, HBase, Metastore, YARN)
• plays well with traditional BI tools: exposes/interacts with industry-standard interfaces (ODBC/JDBC, Kerberos and LDAP, ANSI SQL)

Impala: A Modern, Open-Source SQL Engine
• history:
  • developed by Cloudera and fully open source; hosted on GitHub
  • released as beta in 10/2012
  • 1.0 version available in 05/2013
  • current version is 1.2.3, available for CDH4 and the CDH5 beta

Impala from the User’s Perspective
• create tables as virtual views over data stored in HDFS or HBase; schema metadata is stored in the Metastore (shared with Hive, Pig, etc.; the basis of HCatalog)
• connect via ODBC/JDBC; authenticate via Kerberos or LDAP (see the JDBC sketch below)
• run standard SQL:
  • current version: ANSI SQL-92 (limited to SELECT and bulk insert) minus correlated subqueries; has UDFs and UDAs

Impala from the User’s Perspective
• 2014 SQL roadmap:
  • 1.3: analytic window functions, ORDER BY without LIMIT, DECIMAL(<precision>, <scale>)
  • 2.0: support for nested types (structs, arrays, maps), UDTFs
  • >2.0: disk-based joins and aggregation, subqueries and EXISTS, set operators (INTERSECT, MINUS)

Impala Architecture
• distributed service:
  • a daemon process (impalad) runs on every node with data
  • easily deployed with Cloudera Manager
  • each node can handle user requests; a load-balancer configuration is recommended for multi-user environments
• query execution phases:
  • client request arrives via ODBC/JDBC
  • planner turns the request into a collection of plan fragments
  • coordinator initiates execution on remote impalad instances

Impala Query Execution
• request arrives via ODBC/JDBC
• [diagram]

Impala Query Execution
• planner turns the request into a collection of plan fragments
• coordinator initiates execution on remote impalad nodes
• [diagram]

Impala Query Execution
• intermediate results are streamed between impalad instances
• query results are streamed back to the client
• [diagram]

Impala Architecture: Query Planning
• 2-phase process:
  • single-node plan: a left-deep tree of query operators
  • partitioning into plan fragments for distributed parallel execution: maximize scan locality / minimize data movement, parallelize all query operators
• cost-based join-order optimization
• cost-based join-distribution optimization
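As an illustration of the user-facing workflow above, the sketch below connects to an impalad over JDBC (Impala reuses the HiveServer2 JDBC driver; 21050 is the default impalad HiveServer2 port), creates a Parquet-backed table, bulk-inserts into it, and runs a query. The host and table names are illustrative, and the unauthenticated URL is for a non-secured test cluster only; production setups would use Kerberos or LDAP.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 driver; impalad speaks the same protocol on port 21050.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
            "jdbc:hive2://impalad-host:21050/;auth=noSasl"); // illustrative host

        Statement stmt = conn.createStatement();
        // A table is a virtual view over files in HDFS; its schema
        // metadata lands in the shared Metastore.
        stmt.execute("CREATE TABLE IF NOT EXISTS sales_parquet "
                   + "(id BIGINT, category STRING) STORED AS PARQUETFILE");
        // Bulk insert (the write path supported in the 1.x versions above).
        stmt.execute("INSERT INTO sales_parquet "
                   + "SELECT id, category FROM sales_text"); // illustrative source
        ResultSet rs = stmt.executeQuery(
            "SELECT category, COUNT(*) FROM sales_parquet GROUP BY category");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        conn.close();
    }
}
```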
Impala Architecture: Query Execution
• execution engine designed for efficiency, written from scratch in C++; no reuse of decades-old open-source code
• circumvents MapReduce completely
• in-memory execution:
  • aggregation results and right-hand-side inputs of joins are cached in memory
  • example: join with a 1TB table, referencing 2 of 200 columns and 10% of rows -> 1TB x 2/200 x 10% = 1GB to cache across all nodes in the cluster -> not a limitation for most workloads

Impala Architecture: Query Execution
• runtime code generation:
  • uses LLVM to JIT-compile the runtime-intensive parts of a query
  • the effect is the same as custom-coding a query:
    • remove branches
    • propagate constants, offsets, pointers, etc.
    • inline function calls
  • optimized execution for modern CPUs (instruction pipelines)
  • (see the illustrative sketch below)

Impala Architecture: Query Execution
• [chart]

Impala vs MR for Analytic Workloads
• Impala vs. SQL-on-MapReduce:
  • Impala 1.1.1 / Hive 0.12 (“Stinger Phases 1 and 2”)
  • file formats: Parquet / ORCFile
  • TPC-DS, 3TB data set, running on a 5-node cluster

Impala vs MR for Analytic Workloads
• Impala speedup:
  • interactive: 8-69x
  • report: 6-68x
  • deep analytics: 10-58x

Impala vs Presto for Analytic Workloads
• Impala 1.2.3 / Presto 0.54
• file formats: RCFile (+ Parquet)
• TPC-DS, 15TB data set, running on a 21-node cluster

Impala vs Presto for Analytic Workloads
• Impala speedup:
  • interactive: 2-52x
  • report: 4-8x
  • deep analytics: 4-59x
  • total: 12x

Impala vs Presto for Analytic Workloads
• multi-user benchmark:
  • 10 concurrent users
  • same dataset, same hardware
  • workload: queries from the “interactive” group

Impala vs Presto for Analytic Workloads
• [chart]

Scalability in Hadoop
• Hadoop’s promise of linear scalability: add more nodes to the cluster, gain a proportional increase in capability -> adapt to any kind of workload change simply by adding more nodes
• scaling dimensions for EDW workloads:
  • response time
  • concurrency / query throughput
  • data size

Scalability in Hadoop
• scalability results for Impala:
  • tests show linear scaling along all 3 dimensions
  • setup:
    • 2 clusters: 18 and 36 nodes
    • 15TB TPC-DS data set
    • 6 “interactive” TPC-DS queries

Impala Scalability: Latency
• [chart]

Impala Scalability: Concurrency
• comparison: 10 vs. 20 concurrent users
• [chart]

Impala Scalability: Data Size
• comparison: 15TB vs. 30TB data set
• [chart]

Summary: Hadoop for Analytic Workloads
• Thesis of this talk:
  • techniques and functionality of established commercial solutions are either already available in the Hadoop stack or are rapidly being implemented
  • Impala/Parquet/HDFS is an effective solution for certain EDW workloads
  • a Hadoop-based EDW solution maintains Hadoop’s strengths: flexibility, ease of scaling, cost effectiveness
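Impala's code generation happens in C++ via LLVM, but the effect described above can be illustrated in any language. In the sketch below, the first method shows the generic, interpreted evaluation an engine performs without codegen (virtual dispatch and branching on every row); the second shows the tight loop that JIT specialization effectively produces once a predicate's types, offsets, and constants are known (here, "col2 > 10"). This is a conceptual sketch, not Impala's actual code.

```java
import java.util.List;

public class CodegenSketch {
    // Generic interpreted form: every row pays for virtual dispatch,
    // boxing, and type checks inside the expression tree.
    interface Expr { Object eval(Object[] row); }

    static long countInterpreted(List<Object[]> rows, Expr predicate) {
        long n = 0;
        for (Object[] row : rows) {
            if (Boolean.TRUE.equals(predicate.eval(row))) n++;
        }
        return n;
    }

    // What specialization effectively produces for "WHERE col2 > 10":
    // constants propagated, offsets hard-coded, calls inlined --
    // a tight, pipeline-friendly loop over the column's values.
    static long countSpecialized(long[] col2) {
        long n = 0;
        for (long v : col2) {
            if (v > 10) n++;
        }
        return n;
    }
}
```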
Summary: Hadoop for Analytic Workloads
• the latest technological innovations add capabilities that originated in high-end proprietary systems:
  • high-performance disk scans and memory caching in HDFS
  • Parquet: columnar storage for analytic workloads
  • Impala: high-performance parallel SQL execution

Summary: Hadoop for Analytic Workloads
• Impala/Parquet/HDFS for EDW workloads:
  • integrates into BI environments via standard connectivity and security
  • comparable or better performance than commercial competitors
  • currently still has SQL limitations, but those are rapidly diminishing

Summary: Hadoop for Analytic Workloads
• Impala/Parquet/HDFS maintains traditional Hadoop strengths:
  • flexibility: Parquet is understood across the platform and natively processed by the most popular frameworks
  • demonstrated scalability and cost effectiveness

Summary: Hadoop for Analytic Workloads
• what the future holds:
  • further performance gains
  • more complete SQL capabilities
  • improved resource management and the ability to handle multiple concurrent workloads in a single cluster

The End