TC-03 Technology Track
AWS Big Data Solutions: Introducing Amazon Redshift, Amazon EMR, and Amazon DynamoDB
Yuta Imai, Solutions Architect, Amazon Data Services Japan
©2014, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Thank You!

Please send us your feedback
•  When tweeting about this event, please use the hashtag #AWSRoadshow.
•  Before you leave, please fill in the questionnaire. A commemorative gift is available in exchange.

About the speaker
•  Name: Yuta Imai
•  Affiliation: Amazon Data Services Japan, Solutions Architect
•  Work: Covers customers in the advertising and gaming industries; general technology for "50 ms or die" web services

Agenda
1.  Overview of AWS services and the big data related services
2.  Anatomy of big data
3.  Getting Started with Big Data Services
    –  Amazon Redshift
    –  Amazon Elastic MapReduce
    –  Amazon DynamoDB
4.  Practical Deep Dive

Overview of AWS services
[Diagram: map of AWS services by layer]
•  Ecosystem: Technology Partner / Consulting Partner
•  Management & Administration: CloudWatch, CloudTrail, IAM, Management Console, CLI, SDK
•  Automation and deployment: Elastic Beanstalk, CloudFormation, OpsWorks
•  Data analytics: Kinesis, EMR, Data Pipeline
•  Application services: SNS, SQS, SWF, SES, Elastic Transcoder, CloudSearch
•  Content delivery: CloudFront
•  Compute: EC2, Auto Scaling, Elastic Load Balancing, WorkSpaces
•  Storage: S3, Glacier, EBS, Storage Gateway
•  Database: RDS
•  Networking: Virtual Private Cloud, Direct Connect
•  AWS global infrastructure: Regions / Availability Zones / Content Delivery POPs
[Service map, continued] Database: DynamoDB, ElastiCache, Redshift. Networking: Route 53.

Big Data services on AWS
[Diagram: the big data services grouped by role]
•  NoSQL: DynamoDB
•  Hadoop: Elastic MapReduce
•  DWH: Redshift
•  RDB: RDS
•  Interface: Kinesis
•  Workflow management: Data Pipeline
•  Storage: S3, Glacier

AWS big data services

Amazon Simple Storage Service (S3)
•  Storage with no capacity limit, paid for only as used
•  Data durability of 99.999999999%
•  Can also serve as a static HTTP server

Amazon Glacier (Glacier)
•  Same data durability as S3
•  Storage available at one third the price of S3
•  Takes about 4 hours from a retrieval request until the data becomes accessible

Amazon DynamoDB (DynamoDB)
•  NoSQL as a Service
•  No storage capacity limit
•  You provision the throughput you need

Amazon Redshift (Redshift)
•  Managed data warehouse
•  Pay only for what you use, as with EC2 and RDS
•  Easy to scale out and in

Amazon RDS (RDS)
•  Managed relational database
•  Engine can be chosen from PostgreSQL, MySQL, Oracle, and SQL Server

Amazon Elastic MapReduce (EMR)
•  Managed Hadoop
•  Handles S3 as seamlessly as HDFS

Amazon Kinesis (Kinesis)
•  A service for stream computing
•  Makes things like streaming MapReduce easy

AWS Data Pipeline (Data Pipeline)
•  An orchestration service for moving data and for ETL batch and pipeline processing
Today's session focuses on the roles of these services, when to use which one, and how to combine them.

Agenda item 2: Anatomy of big data

The four steps of putting data to use
1.  Collect
    –  Gather data from many application servers, clients, and devices
2.  Store
    –  Keep data safely, cost-efficiently, and in an easy-to-use form
3.  Process
    –  Extraction, filtering, and formatting, i.e. preprocessing
    –  Primary aggregation also belongs here
4.  Use
    –  Use from BI tools
    –  Serve data through an API

The services seen through the data utilization steps
[Diagram: Web app, Kinesis, S3, Glacier, DynamoDB, RDS, EMR (ETL, Sum), Redshift, Data Pipeline, Analytics, Dashboard, and Data, arranged under the steps Collect, Store, Process, and Use]

Example: aggregating access logs with batch processing
Flow: Web servers → log aggregation server → S3 (original logs) → EMR (ETL) → S3 (processed data) → Redshift (ETL'd data) → BI tools etc.

Original logs:
    1.1.1.1, /login, 20140226000101, …
    192.168…, /home, 20140226011226, …
    1.1.1.2, /home, 20140226011331, …

Data after ETL:
    USER    PATH     TIMESTAMP
    ------------------------------------
    USER1   /login   2014-02-26 00:00:01
    USER2   /home    2014-02-26 01:13:31

Example: using the collected data (DMP)
Flow: Web servers → log aggregation server → S3 (original logs), then two branches:
•  EMR → S3 (processed data) → Redshift (ETL'd data) → reporting tool
•  EMR → S3 (processed data) → DynamoDB (ETL'd data) → data-serving API

Logs:
    USER1, 20140226000101, …
    USER2, 20140226011226, …
    USER1, 20140226011331, …

Aggregate the logs per user and analyze each user's areas of interest:
    USER1: { Interest: [ 'Car', 'Home' ], ... }
    USER2: { Interest: [ 'Dog', 'Cat' ], … }
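The per-user aggregation in the DMP example can be sketched in a few lines of Python. This is only an illustration of the idea, not the job shown in the talk: the `category` field and the sample records are assumptions, since the slide elides the remaining log columns.

```python
from collections import defaultdict

# Hypothetical processed-log records: (user, timestamp, category).
# The category column is an assumption; the slide elides it with "…".
RECORDS = [
    ("USER1", "20140226000101", "Car"),
    ("USER2", "20140226011226", "Dog"),
    ("USER1", "20140226011331", "Home"),
    ("USER2", "20140226020000", "Cat"),
    ("USER1", "20140226030000", "Car"),
]


def aggregate_interests(records):
    """Group records by user and collect each user's distinct
    categories in first-seen order."""
    interests = defaultdict(list)
    for user, _timestamp, category in records:
        if category not in interests[user]:
            interests[user].append(category)
    return {user: {"Interest": cats} for user, cats in interests.items()}


profiles = aggregate_interests(RECORDS)
print(profiles["USER1"])
```

In the architecture on the slide, a result shaped like `profiles` is what would be written to DynamoDB and served through the data API.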
Agenda
1.  Overview of AWS services and the big data related services
2.  Anatomy of big data: collect, store, use
3.  Getting Started with Big Data Services
    –  Amazon DynamoDB, Amazon Elastic MapReduce, Amazon Redshift
4.  Practical Deep Dive
    –  Architectures seen in the field

Amazon Redshift

Redshift architecture
•  MPP (massively parallel processing)
    –  Parallelizes CPU, disk, and network I/O
    –  "Node slices" as logical groupings of resources
•  Data storage
    –  Column-oriented (columnar)
    –  Compression
•  Data communication
    –  Communication between compute nodes
    –  Communication from each compute node to the leader node
    –  Communication with other AWS services

Amazon Redshift overview
•  Queries are executed through the leader node
•  The interface is PostgreSQL-compatible (it can be used from psql)
•  Computation runs in parallel on each compute node
•  Each compute node has its own local storage
[Diagram: BI tools connect over JDBC/ODBC to the leader node (SQL endpoint; parallelizes queries and generates results). The leader node connects over a 10GigE mesh to the compute nodes (query execution nodes; scale out to "N" nodes; local disks). Integrated with S3, DynamoDB, and EMR.]

Architecture: column-oriented
•  Row-oriented (RDBMS): each row is stored together
•  Column-oriented (Redshift): each column is stored together

    orderid   name     price
    1         Book     100
    2         Pen      50
    …         …        …
    n         Eraser   70
Amazon Redshift node types

    Node type     CPU (virtual cores)   ECU    Memory    Storage          Network
    dw1.xlarge    2                     4.4    15 GiB    2 TB (HDD)       0.3 GB/s
    dw1.8xlarge   16                    35     120 GiB   16 TB (HDD)      2.4 GB/s
    dw2.large     2                     7      15 GiB    160 GB (SSD)     0.2 GB/s
    dw2.8xlarge   32                    104    244 GiB   2.56 TB (SSD)    3.7 GB/s

Amazon Redshift scalability
•  dw1.8xlarge & dw2.8xlarge: cluster of 2 to 100 nodes
•  dw1.xlarge & dw2.large: single node, or cluster of 2 to 32 nodes

Loading a CSV file from Amazon S3
•  First, log in to Redshift

    psql -d mydb -h YOUR_REDSHIFT_ENDPOINT -p 5439 -U awsuser -W

•  Then run the COPY command in the Redshift shell

    COPY customer FROM 's3://data/customer.tbl.'
    CREDENTIALS 'aws_access_key_id=KEY;aws_secret_access_key=SEC'
    DELIMITER ',' GZIP TIMEFORMAT 'auto';

Logging in to Amazon Redshift, loading data, and running queries
•  Define a table in the Redshift shell

    CREATE TABLE nginx (
      remote_addr char(15),
      time timestamp,
      request varchar(255),
      status integer,
      bytes bigint,
      ua varchar
    );
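COPY with DELIMITER ',' and GZIP expects gzip-compressed, comma-delimited files in S3. The following is a minimal sketch of how such a file body could be produced for the nginx table; the sample rows are hypothetical, and it assumes no field contains a comma or a quote (otherwise a quoted CSV format and the matching COPY option would be needed). Uploading the bytes to S3 is left out.

```python
import csv
import gzip
import io

# Column order follows the nginx table defined above.
COLUMNS = ["remote_addr", "time", "request", "status", "bytes", "ua"]


def to_gzip_csv(rows):
    """Serialize dict rows to gzip-compressed, comma-delimited bytes."""
    text = io.StringIO()
    writer = csv.writer(text, delimiter=",", lineterminator="\n")
    for row in rows:
        writer.writerow([row[column] for column in COLUMNS])
    return gzip.compress(text.getvalue().encode("utf-8"))


# Hypothetical sample rows, for illustration only.
rows = [
    {"remote_addr": "1.1.1.1", "time": "2014-02-26 00:01:01",
     "request": "GET /login", "status": 200, "bytes": 512, "ua": "curl/7.30"},
    {"remote_addr": "1.1.1.2", "time": "2014-02-26 01:13:31",
     "request": "GET /home", "status": 200, "bytes": 2048, "ua": "curl/7.30"},
]

payload = to_gzip_csv(rows)
lines = gzip.decompress(payload).decode("utf-8").splitlines()
print(lines[0])
```

Keeping the column order in one place (`COLUMNS`) makes it harder for the file layout and the table definition to drift apart.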
After that, it is just the usual SQL

    SELECT ua, request, COUNT(*)
    FROM nginx
    GROUP BY ua, request;

Writing data out to S3 is easy too

    UNLOAD ('SELECT * FROM nginx')
    TO 's3://YOUR_BUCKET/PATH/'
    CREDENTIALS 'aws_access_key_id=KEY;aws_secret_access_key=SEC';

Amazon Redshift in short
•  Processes large volumes of data quickly with SQL
•  Unlike an RDB, its performance degrades little as the data volume grows
•  However, it is strictly an OLAP database
•  The data needs to be ETL'd and normalized beforehand
•  Effective for visualizing numbers such as aggregates and statistics
•  Well suited to reporting tools and other data visualization

If you want to learn more about Redshift
•  Amazon Redshift Performance Tuning
    –  Slides: http://media.amazonwebservices.com/jp/summit2014/TA-08.pdf
    –  Video: https://www.youtube.com/watch?v=_x4o1vNWbAA

Amazon Elastic MapReduce

Elastic MapReduce
•  Managed Hadoop provided by AWS; users can use it as ordinary Hadoop
•  Hadoop 1.x, Hadoop 2.x, and MapR are available
•  What does "managed" mean?
    –  Cluster construction, monitoring, and recovery
    –  Monitoring with CloudWatch
    –  Can work with data in S3
•  Two major features
    –  Workflow management
    –  Can work with data in S3 and DynamoDB

Workflow management: launching Hadoop from a script

    aws emr create-cluster \
      --name bigdata-handson \
      --ami-version 3.2.1 \
      --applications Name=Hive \
      --instance-groups \
        InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.large \
        InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.large \
      --log-uri s3://PATH/TO/LOG/ \
      --ec2-attributes SubnetId=subnet-a06474e6,KeyName=YOUR_KEY
Workflow management: what each option of the create-cluster command does
•  --name bigdata-handson: gives the cluster a name
•  --ami-version 3.2.1: specifies the AMI (Hadoop) version
•  --applications Name=Hive: specifies the applications to install
•  --instance-groups InstanceGroupType=MASTER,… InstanceGroupType=CORE,…: specifies the instance types
•  --log-uri s3://PATH/TO/LOG/: specifies where the logs are written
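When the same launch has to be repeated from a scheduler, it can help to assemble the command in code. Below is a minimal sketch that only builds the argument list from the options walked through above. The function name and its parameters are hypothetical, and nothing is executed here; actually running it would require the AWS CLI and valid credentials.

```python
import shlex


def build_create_cluster_command(name, ami_version, log_uri, subnet_id,
                                 key_name, instance_type="m1.large",
                                 core_count=2):
    """Return the argv list for the create-cluster call shown on the slides."""
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--ami-version", ami_version,
        "--applications", "Name=Hive",
        "--instance-groups",
        f"InstanceGroupType=MASTER,InstanceCount=1,InstanceType={instance_type}",
        f"InstanceGroupType=CORE,InstanceCount={core_count},InstanceType={instance_type}",
        "--log-uri", log_uri,
        "--ec2-attributes", f"SubnetId={subnet_id},KeyName={key_name}",
    ]


argv = build_create_cluster_command(
    name="bigdata-handson",
    ami_version="3.2.1",
    log_uri="s3://PATH/TO/LOG/",
    subnet_id="subnet-a06474e6",
    key_name="YOUR_KEY",
)
print(shlex.join(argv))
```

Passing a list such as `argv` to `subprocess.run(argv, check=True)` avoids shell quoting problems compared with building one long command string.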
Workflow management: launching Hadoop from a script (continued)
•  --ec2-attributes SubnetId=subnet-a06474e6,KeyName=YOUR_KEY: specifies the VPC and related settings

Going further: define the work in advance, launch the cluster, and have it deleted automatically when the work is done

    aws emr create-cluster \
      --name bigdata-handson \
      --ami-version 3.2.1 \
      --applications Name=Hive \
      --instance-groups \
        InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.large \
        InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.large \
      --log-uri s3://PATH/TO/LOG/ \
      --ec2-attributes SubnetId=subnet-a06474e6,KeyName=YOUR_KEY \
      --steps Type=HIVE,Name='Hive program',Args=[-f,s3://PATH/TO/QUERY.q] \
      --auto-terminate

If the workflow is controlled with cron or Data Pipeline, Hadoop jobs can be automated!

Integration with S3 and DynamoDB

Working with data in S3
•  Data on S3 can be handled as seamlessly as HDFS
•  Specify s3://... as INPUT or OUTPUT
•  Read data from S3 and write the results back to S3

    hadoop jar YOUR_JAR.jar \
      --src s3://YOUR_BUCKET/logs/ \
      --dest s3://YOUR_BUCKET/output/

•  Read data from S3 and write the results to the cluster's local HDFS

    hadoop jar YOUR_JAR.jar \
      --src s3://YOUR_BUCKET/logs/ \
      --dest hdfs:///output/

And of course with Hive too

    CREATE EXTERNAL TABLE s3_as_external_table(
      user_id INT,
      movie_id INT,
      rating INT,
      unixtime STRING )
    ROW FORMAT DELIMITED FIELDS
    TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION 's3://mybucket/tables/';

ETL with Hive

    INSERT INTO TABLE table2
    SELECT
      column1,
      column2,   -- for example, to drop column3 and column4 from table1
      column5
    FROM table1;

DynamoDB tables can be mounted in the same way, so...

    CREATE EXTERNAL TABLE dynamodb_as_external_table(
      user_id INT,
      movie_id INT,
      rating INT,
      unixtime STRING )
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    TBLPROPERTIES (
      "dynamodb.table.name" = "your_table",
      "dynamodb.column.mapping" =
        "user_id:UserID,movie_id:MovieId,rating:Rating,unixtime:UnixTime"
    );

•  You can back up DynamoDB data to S3

    INSERT OVERWRITE TABLE
      s3_as_external_table
    SELECT *
    FROM dynamodb_as_external_table;

•  You can run MapReduce over DynamoDB data with Hive

    SELECT COUNT(*)
    FROM dynamodb_as_external_table;

EMR, a slightly deeper dive: Redshift or EMR, which should you use?

Distribution of EMR jobs
[Figure: distribution of EMR jobs]

Amazon Redshift
•  Basic usability is that of an RDB
•  Analyze with SQL
•  Serves as the back end for BI tools
•  Reasonably normalized data is a prerequisite
•  Also good at complex joins
•  Clusters are basically left running

Amazon Elastic MapReduce
•  Hadoop
•  The Hadoop ecosystem, such as MapReduce, Hive, Pig, and streaming, is available
•  Hive can be used in an SQL-like way, but Redshift is faster and simpler (Hive does have its own advantages, such as TRANSFORM and UDF/UDAF)
•  Good at handling data that is hard to normalize
•  Instead of being left running, a cluster can be launched per job and discarded once the job finishes

EMR or Redshift?
Roughly speaking:
•  For analysis with SQL, Redshift is overwhelmingly faster
•  For anything else, EMR

Amazon Elastic MapReduce in short
•  A service that makes Hadoop convenient to use; very good at reshaping large volumes of text data
•  Making good use of workflows makes it easy to automate processing
•  Handles data in S3 freely; strong at ETL and primary aggregation of raw logs
Amazon DynamoDB

What is DynamoDB?
•  NoSQL as a Service
•  Performance does not degrade as the data volume grows
•  Shows its true value when large-scale data must be handled at high speed

DynamoDB features
•  No administration required, and highly reliable
•  Provisioned throughput
•  No storage capacity limit

Feature 1: no administration required, and highly reliable
•  A configuration with no SPOF
•  Data is stored in three AZs, so reliability is high
•  Storage is partitioned automatically as needed

Feature 2: provisioned throughput
•  For each table, you can allocate (provision) exactly the throughput capacity you need, separately for reads and writes
•  For example, provision as follows
    –  Read : 1,000
    –  Write : 100
•  When the write workload rises
    –  Read : 500
    –  Write : 1,000
•  These values can be changed online while the database is in operation

Feature 3: no storage capacity limit
•  Pay-as-you-go storage, charged only for what you use
•  No work at all is needed to add disks or nodes as the data volume grows

DynamoDB components
[Diagram: client side (client application, SDK) and service side (API, database)]
•  Operations are provided as an HTTP-based API
•  Users only need to write code to use it

DynamoDB tables can have one of two kinds of primary key
•  Hash key
•  Hash key & Range key

Sample 1, primary key is a hash key: user information database
A KVS-like table with the user ID as the primary key
    –  A unique item is identified by UserId and then read, updated, or deleted

Users Table
    UserId (Hash)   Name    Nicknames           Mail               Address        Interests
    aed9d           Bob     [ Rob, Bobby ]      [email protected]   some address   [ Car, Motor Cycle ]
    edfg12          Alice   [ Allie ]
    a8eesd          Carol   [ Caroline ]
    f42aed          Dan     [ Daniel, Danny ]

Note: DynamoDB has no feature such as auto_increment for issuing unique IDs. Use UUIDs or something similar.
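The note about generating IDs on the client side can be illustrated with a small in-memory model. This is a sketch only: the `UsersTable` class is a hypothetical stand-in for a hash-key table and does not call DynamoDB or any AWS SDK.

```python
import uuid


class UsersTable:
    """In-memory stand-in for a table whose primary key is a hash key (UserId)."""

    def __init__(self):
        self._items = {}

    def put_user(self, name, **attributes):
        # No auto_increment is available, so the client generates a UUID.
        user_id = uuid.uuid4().hex
        self._items[user_id] = {"UserId": user_id, "Name": name, **attributes}
        return user_id

    def get_user(self, user_id):
        # Lookup by the hash key identifies at most one item.
        return self._items.get(user_id)

    def delete_user(self, user_id):
        self._items.pop(user_id, None)


table = UsersTable()
bob_id = table.put_user("Bob", Nicknames=["Rob", "Bobby"])
alice_id = table.put_user("Alice", Nicknames=["Allie"])
print(table.get_user(bob_id)["Name"])
```

Because the key is generated before the write, the client already knows the item's ID and does not need a round trip to learn it.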
Sample, primary key is hash & range: game action history database
Assume the case of a player checking their own battle history
    –  For example, specify yourself (Alice) as User and query the items whose Timestamp falls within the last 7 days

Your Battle History
    Charlie   02-25 16:21   Won!
    Dan       02-24 09:48   Won!
    Alice     02-21 12:42   Won!

Battle History
    User (Hash)   Timestamp (Range)     Opponent   Result
    Alice         2014-02-21 12:21:20   Bob        Lost
    Alice         2014-02-21 12:42:01   Bob        Won
    Alice         2014-02-24 09:48:00   Dan        Won
    Alice         2014-02-25 16:21:11   Charlie    Won

Basic knowledge for table design, +1
•  Local Secondary Index
    –  Lets you have a key for narrowing searches in addition to the range key
    –  Used for searching among the items that share the same hash key
    –  The index also consumes the throughput provisioned for the table
•  Global Secondary Index
    –  An index for searching across hash keys
    –  Throughput is provisioned for the index independently of the table

Amazon DynamoDB in short
•  Distributed NoSQL
•  Performance does not degrade even when large volumes of data are loaded
•  Unlike Redshift, this is an OLTP database
•  It cannot do JOINs the way SQL can
•  Strong at storing large volumes of data and exchanging the small number of items you need at high speed

Agenda item 4: Practical Deep Dive

When processing data on AWS, using S3 is the key

Once the data is in S3, you can launch processing clusters as needed and use them
[Diagram: S3 connected to services such as Elastic MapReduce and DynamoDB]
[Diagram, continued: Redshift]

S3 and EMR
[Chart: load over time]
Hadoop by itself cannot share data
↓
Processing of a given data set ends up tightly coupled to a single cluster
↓
Capacity planning is difficult

S3 and EMR
[Chart: load and capacity over time]
If the data is stored in S3 rather than HDFS...
Data on S3 can be shared by multiple clusters
↓
Clusters can be launched on demand to match the amount of work!

S3 and Redshift

    COPY table_name FROM 's3://hoge' CREDENTIALS 'access_key_id:hoge…' DELIMITER ','

•  Loading data into Redshift is efficient when done via S3

    UNLOAD ('SELECT * FROM…') TO 's3://fuga/….' CREDENTIALS 'access_key_id:hoge…';

•  Exporting from Redshift to S3 is also easy
•  Move old data to Glacier, which is cheaper storage
•  If something goes wrong with the cluster, rebuild it from S3

S3 and DynamoDB
•  Offload or back up old data to S3
•  Move even older data to Glacier

Pipeline processing with S3
[Diagram: S3 → EMR (step 1) → S3 → EMR (step 2) → S3]
Using S3 as a checkpoint lets the processing steps be loosely coupled.
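The checkpoint idea can be sketched with an ordinary dictionary standing in for S3 prefixes. This is an illustrative model under stated assumptions, not EMR code: `run_pipeline`, the stage functions, and the key names are all hypothetical.

```python
def run_pipeline(store, stages):
    """Run stages in order. Each stage reads one key from `store` and writes
    another. A stage whose output key already exists is skipped, so a rerun
    resumes from the last completed checkpoint."""
    executed = []
    for name, src, dest, func in stages:
        if dest in store:
            continue  # checkpoint already written by an earlier run
        store[dest] = func(store[src])
        executed.append(name)
    return executed


def extract_paths(lines):
    """Step 1: keep only the request path from each raw log line."""
    return [line.split(",")[1] for line in lines]


def count_paths(paths):
    """Step 2: count how many times each path was requested."""
    counts = {}
    for path in paths:
        counts[path] = counts.get(path, 0) + 1
    return counts


# `store` plays the role of S3; each key plays the role of a prefix.
store = {"raw": ["1.1.1.1,/login", "192.168.0.1,/home", "1.1.1.2,/home"]}
stages = [
    ("step1", "raw", "step1-output", extract_paths),
    ("step2", "step1-output", "step2-output", count_paths),
]

first_run = run_pipeline(store, stages[:1])  # pretend step 2 never ran
second_run = run_pipeline(store, stages)     # only step 2 is executed
print(first_run, second_run, store["step2-output"])
```

Since every step communicates only through the store, step 2 can run on a different cluster, or at a different time, than step 1.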
Pipeline processing with S3
[Diagram: S3 → EMR (step 1) → S3 → EMR (step 2) → S3]
•  If processing fails partway through...
•  ...it can be redone from the checkpoint

It matters, so once more: S3 is always at the center
[Diagram: Amazon S3 at the center, surrounded by its uses (data and content delivery, analytics, content transcoding, data exchange, data backup, data access gateway, content archive) and the related services CloudFront, Elastic MapReduce, Redshift, Elastic Transcoder, EC2, RDS, EBS, Storage Gateway, Data Pipeline, and Glacier]

One more thing...

Real time

Introducing real-time behavior
•  Data flows in continuously
•  The data is refreshed once every 15 minutes or once an hour
[Diagram: Web servers → log aggregation server → S3 (original logs); then EMR → S3 (processed data) → Redshift (ETL'd data) → BI tools etc., and EMR → S3 (processed data) → DynamoDB (ETL'd data) → data-serving API. A second version of the slide shows the same flow without the log aggregation server.]

Introducing real-time behavior: the lambda architecture
•  Data flows in continuously
•  The batch path refreshes the data once every 15 minutes or once an hour
•  In addition, preliminary values are streamed in in real time
[Diagram: the same flow as above, with a real-time path feeding preliminary values into Redshift and DynamoDB]

Summary

Understand the role of each service and combine them well
•  Collect, store, process, use: the services that work well differ for each phase.
•  Work out which phase you need a solution for, and choose services accordingly.
[Diagram: the questions "From where?", "How much?", "What kind?", and "How?" placed over the steps Collect, Store, Process, and Use, with servers, mobile, S3, EMR, EC2, Redshift, DynamoDB, and RDS]
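To make the lambda-architecture slide above a little more concrete: a reader of the data combines the periodically rebuilt batch result with the preliminary values that arrived since the last batch run. The following is a minimal sketch of that merge. The function name and the sample counts are hypothetical, and it assumes the real-time counts cover only events newer than the batch result.

```python
from collections import Counter


def merged_view(batch_counts, realtime_counts):
    """Combine the batch result with the real-time increments
    received since the last batch run."""
    return dict(Counter(batch_counts) + Counter(realtime_counts))


# Hypothetical page-view counts, for illustration only.
batch_counts = {"/home": 100, "/login": 40}    # rebuilt every 15 minutes or hour
realtime_counts = {"/home": 3, "/signup": 1}   # streamed in since the last batch

view = merged_view(batch_counts, realtime_counts)
print(view)
```

When the next batch run completes, the real-time counts it now covers would be discarded, so the preliminary values never need to be exact.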