DATA INTELLIGENCE FOR ALL

Distributed DataFrame on Spark: Simplifying Big Data for the Rest of Us
Christopher Nguyen, PhD, Co-Founder & CEO

Agenda
1. Challenges & Motivation
2. DDF Overview
3. DDF Design & Architecture
4. Demo

About the Speaker
Christopher Nguyen, PhD, Adatao Inc. Co-Founder & CEO
• Former Engineering Director of Google Apps (Google Founders' Award)
• Former Professor and Co-Founder of the Computer Engineering program at HKUST
• PhD Stanford; BS U.C. Berkeley, Summa Cum Laude
• Extensive experience building technology companies that solve enterprise challenges

How Have We Defined "Big Data"?
Old definition: Big Data has huge Volume, high Velocity, and great Variety.
New definition: BIG DATA + BIG COMPUTE: (machine) learn from data. Problems = Opportunitie$

Current Big Data World
[Ecosystem diagram: presentation layers (web-based UIs, iPython / R-Studio, Java APIs) over query & compute engines (MapReduce, Cascading, Scalding, Trident, Shark SQL, Pig, SummingBird, SparkR, Giraph, Hive SQL, Storm, Spark with RDD, DStream, GraphX, PySpark) on HDFS / YARN, spanning batch, low-latency, and real-time workloads in R, Python, Scala, and Java.]

Today, even a simple pipeline means stitching together very different kinds of code. A Hive table definition and load:

    CREATE EXTERNAL TABLE page_view(
        viewTime INT, userid BIGINT,
        page_url STRING, referrer_url STRING,
        ip STRING COMMENT 'IP Address of the User',
        country STRING COMMENT 'country of origination')
    COMMENT 'This is the staging page view table'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
    STORED AS TEXTFILE
    LOCATION '<hdfs_location>';

    FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view
        PARTITION(dt='2008-06-08', country='US')
    SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
           null, null, pvs.ip
    WHERE pvs.country = 'US';

Hand-written MapReduce plumbing in Java:

    public static DistributedRowMatrix runJob(Path markovPath, Vector diag,
            Path outputPath, Path tmpPath)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(markovPath.toUri(), conf);
        markovPath = fs.makeQualified(markovPath);
        outputPath = fs.makeQualified(outputPath);
        Path vectorOutputPath = new Path(outputPath.getParent(), "vector");
        VectorCache.save(new IntWritable(Keys.DIAGONAL_CACHE_INDEX),
            diag, vectorOutputPath, conf);

        // set up the job itself
        Job job = new Job(conf, "VectorMatrixMultiplication");
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputKeyClass(IntWritable.class);
        ...
    }

And an R modeling workflow on the side:
• Sample and copy data to the local machine
• lrfit <- glm(using ~ age + education)
• Export the model to XML
• Deploy the model using Java

Maintenance is too painful!

I have a dream… In the Ideal World…

    data = load housePrice
    data dropNA
    data transform (duration = now - begin)
    data train glm(price, bedrooms)

It just works!!!

Make the Big Data API Simple & Accessible
[Word cloud of today's fragmented toolchain: Java, Python, PySpark, SparkR, RHadoop, MapReduce, HBase, Hive, monads, monoids… OR one simple API.]

Design Principles

DDF is Simple

    DDFManager manager = DDFManager.get("spark");
    DDF ddf = manager.sql2ddf("select * from airline");

It's like a table:

    ddf.Views.project("origin, arrdelay");
    ddf.groupBy("dayofweek", "avg(arrdelay)");
    ddf.join(otherddf);

It's just nice like this:

    ddf.dropNA();
    ddf.getNumRows();
    ddf.getSummary();

Focus on Analytics, not MapReduce
Simple, high-level, data-science-oriented APIs, powered by Spark. With raw RDDs, even summing arrival delay by day of week takes boilerplate:

    val fileRDD = spark.textFile("hdfs://...")
    val counts = fileRDD
      .map(arrdelay_byday => (arrdelay_byday.split(",")(0),
                              arrdelay_byday.split(",")(1).toDouble))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")

With DDF, it is two lines:

    ddf = manager.load("spark://ddf-runtime/ddf-db/airline")
    ddf.aggregate("dayofweek, sum(arrdelay)")
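Putting the pieces together, here is a minimal sketch of what a complete DDF session could look like in Java, composed only from the calls shown on the slides above. The com.adatao.ddf import paths and the surrounding boilerplate are assumptions for illustration, not verbatim from the DDF codebase.

    // Hypothetical end-to-end session; package paths are assumed,
    // the DDF calls themselves are the ones shown on the slides.
    import com.adatao.ddf.DDF;
    import com.adatao.ddf.DDFManager;

    public class AirlineDelays {
        public static void main(String[] args) throws Exception {
            // Obtain a Spark-backed DDF engine.
            DDFManager manager = DDFManager.get("spark");

            // Expose the airline table as a Distributed DataFrame.
            DDF ddf = manager.sql2ddf("select * from airline");

            // Clean in place, then inspect.
            ddf.setMutable(true);
            ddf.dropNA();
            System.out.println("rows after dropNA: " + ddf.getNumRows());

            // Aggregate average arrival delay by day of week.
            ddf.aggregate("dayofweek, avg(arrdelay)");
        }
    }

Note what is absent: no mappers, reducers, or serialization code anywhere, and per the multi-language support described below, the same handful of calls would look essentially identical from R or Python.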
Quick Access to a Rich Set of Familiar ML Idioms

Data wrangling:

    ddf.setMutable(true)
    ddf.dropNA()
    ddf.transform("speed = distance / duration")

Model building & validation:

    ddf.lm(0.1, 10)
    ddf.roc(testddf)

Model deployment:

    manager.loadModel(lm)
    lm.predict(point)

Seamless Integration with MLlib

    // plug in an algorithm
    linearRegressionWithSGD =
        org.apache.spark.mllib.regression.LinearRegressionWithSGD

    // run the algorithm
    ddf.ML.train("linearRegressionWithSGD", 10, 0.1, 0.1)
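That train call dispatches by name to whatever algorithm has been plugged in. As a rough sketch of the pattern (an illustration of name-based dispatch in general, not DDF's actual internals; all names and types here are hypothetical), an ML handler can keep a registry from algorithm names to trainer implementations:

    // Illustrative registry-based dispatch; names and types are hypothetical.
    import java.util.HashMap;
    import java.util.Map;

    interface Trainer {
        Object train(Object data, double... hyperParams);
    }

    class MLHandler {
        private final Map<String, Trainer> registry = new HashMap<>();

        // Plug in an algorithm under a name.
        void register(String name, Trainer trainer) {
            registry.put(name, trainer);
        }

        // Dispatch train("name", params...) to the registered implementation.
        Object train(String name, Object data, double... hyperParams) {
            Trainer t = registry.get(name);
            if (t == null) {
                throw new IllegalArgumentException("unknown algorithm: " + name);
            }
            return t.train(data, hyperParams);
        }
    }

    public class PluggableTrainDemo {
        public static void main(String[] args) {
            MLHandler ml = new MLHandler();
            // A stand-in trainer; a real adapter would call into Spark MLlib.
            ml.register("linearRegressionWithSGD", (data, hp) ->
                "model(iterations=" + (int) hp[0] + ", stepSize=" + hp[1] + ")");
            System.out.println(
                ml.train("linearRegressionWithSGD", "airline-ddf", 10, 0.1));
        }
    }

Registering an adapter once and selecting it by name at train time is what lets new algorithms be tested and deployed without touching caller code, which is the "pluggable components" design principle in the summary below.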
Easily Collaborate with Others
Can I see your data? Share a DDF by URI:

    DDF://com.adatao/airline

DDF on Multiple Languages
The same DDF API is available from Java, Scala, R, and Python.

DDF Architecture
[Architecture diagram: a data scientist/engineer owns a DDF Manager and has access to DDFs; each DDF is backed by pluggable handlers: Config Handler, ETL Handlers, Statistics Handlers, Representation Handlers, ML Handlers.]

DATA INTELLIGENCE FOR ALL

Demo
Cluster configuration: 8 nodes x 8 cores x 30 GB RAM
Data size: 12 GB / 120 million rows

DDF Offers
• Native R data.frame experience: a table-like abstraction on top of big data, mutable and sharable
• Focus on analytics, not MapReduce: simple, data-science-oriented APIs, powered by Spark
• Easily test & deploy new components: pluggable components by design
• Collaborate seamlessly & efficiently: work with the APIs in your preferred language, with multi-language support (Java, Scala, R, Python)

[Deployment diagram: data engineers (R-Studio), data scientists (Python / DDF Client), and business analysts (web browser, PA Client, PI Client) all work against the same DDFs on HDFS through language-specific APIs.]

DATA INTELLIGENCE FOR ALL

To learn more about Adatao & DDF, contact us: www.adatao.com