DDF - Spark Summit

DATA INTELLIGENCE FOR ALL
Distributed DataFrame on Spark:
Simplifying Big Data
For The Rest Of Us
Christopher Nguyen, PhD
Co-Founder & CEO
Agenda
1. Challenges & Motivation
2. DDF Overview
3. DDF Design & Architecture
4. Demo
Christopher Nguyen, PhD
Adatao Inc.
Co-Founder & CEO
• Former Engineering Director of Google Apps (Google Founders’ Award)
• Former Professor and Co-Founder of the Computer Engineering program at HKUST
• PhD Stanford; BS U.C. Berkeley, Summa cum Laude
• Extensive experience building technology companies that solve enterprise challenges
How Have We Defined “Big Data”?
Old Definition: Big Data has Huge Volume, High Velocity, Great Variety
New Definition: BIG DATA + BIG COMPUTE, i.e. (Machine) Learn from Data
Problems = Opportunitie$
Current Big Data World
[Diagram: today's fragmented stack, spanning Batch to Low-Latency + Real-Time.
Presentation layer: web-based iPython / R-Studio; R, Python, Scala + Java; Java APIs; DataFrame.
Query & Compute layer: MapReduce, Cascading, Scalding, Pig, Hive SQL, Giraph, SummingBird, Trident, Storm; on Spark: RDD, Shark SQL, SparkR, PySpark, DStream, GraphX.
Data layer: HDFS / YARN.]
What that world looks like in practice, untangled from the slide:

Hive DDL & query:

CREATE EXTERNAL TABLE page_view(viewTime INT,
    userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User',
    country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<hdfs_location>';

INSERT OVERWRITE TABLE page_view
    PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid,
    pvs.page_url, pvs.referrer_url, null, null,
    pvs.ip
FROM page_view_stg pvs
WHERE pvs.country = 'US';

Raw MapReduce in Java:

public static DistributedRowMatrix runJob(Path markovPath, Vector diag,
    Path outputPath, Path tmpPath)
    throws IOException, ClassNotFoundException, InterruptedException {
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(markovPath.toUri(), conf);
  markovPath = fs.makeQualified(markovPath);
  outputPath = fs.makeQualified(outputPath);
  Path vectorOutputPath = new Path(outputPath.getParent(), "vector");
  VectorCache.save(new IntWritable(Keys.DIAGONAL_CACHE_INDEX), diag,
      vectorOutputPath, conf);

  // set up the job itself
  Job job = new Job(conf, "VectorMatrixMultiplication");
  job.setInputFormatClass(SequenceFileInputFormat.class);
  job.setOutputKeyClass(IntWritable.class);
  // ... (remainder of job setup elided on the slide)

Modeling in R, deployment by hand:

- Sample and copy data to local
- lrfit <- glm(using ~ age + education)
- Export model to XML
- Deploy model using Java
Maintenance is too painful!
I have a dream…
In the Ideal World…
data = load housePrice
data dropNA
data transform (duration = now - begin)
data train glm(price, bedrooms)
It just works!!!
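In DDF terms, the pseudocode above maps almost line-for-line onto the API shown in the following slides. A minimal Java sketch, reusing only calls that appear in this deck; the housePrice table, its columns, and the choice of linearRegressionWithSGD as a stand-in for glm are illustrative assumptions:

// Sketch only: package/import names vary by DDF release and are assumed here.
DDFManager manager = DDFManager.get("spark");
DDF data = manager.sql2ddf("select * from housePrice"); // data = load housePrice
data.setMutable(true);                                  // update in place
data.dropNA();                                          // data dropNA
data.transform("duration = now - begin");               // data transform (...)
data.ML.train("linearRegressionWithSGD", 10, 0.1, 0.1); // data train glm(...)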
Make Big-Data API simple & accessible
[Word cloud of today's fragmented tooling: Java, Python, PySpark, SparkR, MapReduce, HBase, Hive, RHadoop, monads, monoids...]
OR
Design Principles
DDF is Simple
DDFManager manager = DDFManager.get("spark");
DDF ddf = manager.sql2ddf("select * from airline");
It's like a table
ddf.Views.project("origin, arrdelay");
ddf.groupBy("dayofweek", "avg(arrdelay)");
ddf.join(otherddf);
It's just nice like this
ddf.dropNA();
ddf.getNumRows();
ddf.getSummary();
Focus on analytics, not MapReduce
Simple, high-level, data-science-oriented APIs, powered by Spark

RDD:

val fileRDD = spark.textFile("hdfs://...")
val arrdelayByDay = fileRDD
  .map(line => line.split(","))
  .map(fields => (fields(0), fields(1).toDouble))  // (dayofweek, arrdelay)
  .reduceByKey(_ + _)
arrdelayByDay.saveAsTextFile("hdfs://...")

DDF:

ddf = manager.load("spark://ddf-runtime/ddf-db/airline")
ddf.aggregate("dayofweek, sum(arrdelay)")
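Both versions compute the same aggregate: total arrival delay grouped by day of week. The RDD version spells out the parsing, pairing, and reduction by hand; ddf.aggregate expresses it in a single table-level call.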
Quick Access to a Rich Set of Familiar ML Idioms
Data Wrangling
ddf.setMutable(true)
ddf.dropNA()
ddf.transform("speed = distance / duration")

Model Building & Validation
ddf.lm(0.1, 10)
ddf.roc(testddf)

Model Deployment
manager.loadModel(lm)
lm.predict(point)
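Stitched together, the three stages above form one short program. A hedged Java sketch reusing the calls on this slide; the IModel type, the meaning of the lm(0.1, 10) arguments, and the testddf/point inputs are assumptions, not verified signatures:

// End-to-end sketch: wrangle, fit, validate, deploy.
DDF ddf = manager.sql2ddf("select * from airline");
ddf.setMutable(true);
ddf.dropNA();
ddf.transform("speed = distance / duration");

IModel lm = ddf.lm(0.1, 10);   // assumed: (learning rate, iterations)
ddf.roc(testddf);              // validate against a held-out test DDF

manager.loadModel(lm);         // register the model for serving
lm.predict(point);             // score a single data point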
Seamless Integration with MLlib
// plug in the algorithm
linearRegressionWithSGD =
    org.apache.spark.mllib.regression.LinearRegressionWithSGD

// run the algorithm
ddf.ML.train("linearRegressionWithSGD", 10, 0.1, 0.1)
Easily Collaborate with Others
“Can I see your data?”
DDF://com.adatao/airline
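A hedged sketch of what sharing looks like in code: one user publishes a DDF under a well-known URI, a colleague loads it. manager.load(uri) appears earlier in this deck; the setName call used for publishing is an assumption about the naming API:

DDF airline = manager.sql2ddf("select * from airline");
airline.setName("airline");    // ASSUMPTION: how a DDF gets its shared name
// ...later, from a colleague's session (any supported language binding):
DDF shared = manager.load("ddf://com.adatao/airline");
shared.getSummary();           // both sides operate on the same data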
DDF in Multiple Languages
DDF Architecture
[Diagram: the Data Scientist/Engineer owns a DDF Manager and has access to DDFs. The DDF Manager carries a Config Handler; each DDF is backed by pluggable ETL Handlers, Statistics Handlers, Representation Handlers, and ML Handlers.]
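To illustrate the pluggable-handler design, here is a hypothetical Java sketch of the pattern: the DDF facade delegates each concern to a swappable handler. All names here (ToyDDF, StatisticsHandler, SparkStatisticsHandler) are illustrative, not the real DDF interfaces:

interface StatisticsHandler {
  double[] getSummary(ToyDDF ddf);              // e.g. per-column means
}

class SparkStatisticsHandler implements StatisticsHandler {
  public double[] getSummary(ToyDDF ddf) {
    return new double[0];                       // a real handler would run a Spark job
  }
}

class ToyDDF {
  private final StatisticsHandler stats;        // swapped in at wiring time
  ToyDDF(StatisticsHandler stats) { this.stats = stats; }
  double[] getSummary() { return stats.getSummary(this); }  // pure delegation
}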
Demo
Cluster configuration: 8 nodes x 8 cores x 30 GB RAM
Data size: 12 GB / 120 million rows
DDF offers:
Native R data.frame Experience
Table-like Abstraction on Top of Big Data
Focus on Analytics, not MapReduce
Easily Test & Deploy New Components
Simple, Data-Science-Oriented APIs, Powered by Spark
Pluggable Components by Design
Collaborate Seamlessly & Efficiently
Work with APIs in Your Preferred Language
Mutable & Sharable
Multi-Language Support (Java, Scala, R, Python)
Example:
[Diagram: multiple users sharing one DDF backend over HDFS. A Data Engineer works from R-Studio or Python through the DDF Client API; a Data Scientist from a web browser through the PA Client; a Business Analyst through the PI Client.]
To learn more about Adatao & DDF, contact us: www.adatao.com