Replicating from RDBMS into Hadoop

From Dolphins to Elephants:
Real-Time MySQL to Hadoop
Replication with Tungsten
MC Brown, Director of Documentation
Linas Virbalas, Senior Software Engineer
©Continuent 2014.
About Tungsten Replicator
•
Open source drop-in replacement for MySQL
replication, providing:
•
•
•
•
•
•
Global transaction ID
©Continuent 2014
Multiple masters
Multiple sources
Flexible topologies
Heterogeneous replication
Parallel replication
2
Tungsten Replicator
Master
Download
transactions via network
DBMS
Logs
Slave
Replicator
(Transactions + Metadata)
Replicator
Apply using JDBC
©Continuent 2014
THL
THL
(Transactions + Metadata)
3
How Tungsten Replicator Works
Pipeline
Stage
Extract Filter
Master
DBMS
©Continuent 2014
Stage
Apply
Extract Filter
Transaction
History Log
Stage
Apply
Extract Filter
In-Memory
Queue
4
Apply
Slave
DBMS
Where we replicate
master-slave
fan-in slave
©Continuent 2014
Heterogene
MySQL
Oracle
Oracle
MySQL
all-masters
5
Direct slave
Regular
MySQL
star-schema
Why Hadoop
•
•
•
•
•
•
©Continuent 2014
Customer driven
Change in the air
Environments moving to heterogenous
•
•
NoSQL was the first
We already support MongoDB
Hadoop used for big analytics
More frequently a live resource
Big datasets require Map/Reduce
6
Tungsten Replicator and Hadoop
•
•
Extract from MySQL or Oracle
•
Automatic replication of incremental changes
Customizable formatting
•
•
•
Hive Schema generation
©Continuent 2014
Base Hadoop and Commercial distributions; Cloudera,
HortonWorks, Amazon Elastic MapReduce and IBM
InfoSphere BigInsights compatible
Materialized views in Hive for carbon-copy tables
Sqoop and parallel extractor compatibility for
provisioning
7
Applying Data into Hadoop
Replicator
Replicator
Extract
transactions
from log
THL
CSV
Hadoop
DBMS
Logs
©Continuent 2014
8
Applying Data into Hadoop
Replicator
Replicator
Extract
transactions
from log
THL
CSV
Hadoop
DBMS
Logs
©Continuent 2014
9
Applying Data into Hadoop
Replicator
Replicator
Extract
transactions
from log
THL
CSV
Hadoop
DBMS
Logs
©Continuent 2014
10
CSV (Staging)
Materialized Views
Hadoop
ID
Message
Hive Table
©Continuent 2014
11
CSV (Staging)
Materialised Views
Hadoop
ID
Message
Hive Table
©Continuent 2014
12
CSV (Staging)
Materialised Views
Hadoop
ID
Message
Hive Table
©Continuent 2014
13
CSV (Staging)
Materialized Views
Hadoop
ID
Message
Hive Table
©Continuent 2014
14
MySQL Configuration
•
•
•
©Continuent 2014
Use Row-based replication
Every table must have primary keys
Replicator configured with:
•
•
Filters for metadata and primary key optimisation
Extracts to standard THL
15
Configure Hadoop
•
•
Data is stored in CSV format on HDFS
•
•
•
Compatible with Hive, HBase, and others
©Continuent 2014
Cloudera, HortonWorks, Amazon Elastic Map
Reduce (EMR) and IBM Infosphere BigInsights
compatible
Staging DDL can be automatically generated
Live Table DDL can be automatically
generated
16
DDL Generation
•
•
Built-in Tool, part of Tungsten Replicator
•
Default mode is for default migrations to
Hive types
•
Customizable for your needs
•
©Continuent 2014
Handles staging and live table DDL
generation
•
BigInts as Strings
Data transformations possible through filters
17
Replicator Hadoop Configuration
•
•
•
©Continuent 2014
Batch Commit interval
•
•
By rows count
By time interval
CSV Format
•
•
Predefined formats
Customizable by field and row characters
Parallelization Supported
18
Materialized Views
•
Merges Data from Staging CSV into Hive
Tables
•
Processing separate from Replicator
•
©Continuent 2014
•
Allows individual table views to be generated
independently
•
Allows for custom materialization intervals
Views based on 'live' data, or by point-in-time
from CSV staging
19
Demo
©Continuent 2014
20
Provisioning Data
•
•
©Continuent 2014
Sqoop
•
•
•
•
Start the replicator
Sqoop the data
Materialized views are idempotent
DDL generation is Hive compatible
Parallel Extractor
•
•
Currently Oracle only
Will extract data in parallel and insert into THL
21
Replication Management
•
Replication can be stopped, started,
restarted at any time
•
Enables MySQL or Hadoop maintenance
windows
•
•
•
DDL customizable
©Continuent 2014
Views regenerated at any time
Schema changes can be handled by reSqooping and dematerialising views
22
560 S. Winchester Blvd., Suite 500
San Jose, CA 95128
Tel +1 (866) 998-3642
Fax +1 (408) 668-1009
e-mail: [email protected]
Our Blogs:
http://scale-out-blog.blogspot.com
http://mcslp.wordpress.com
http://flyingclusters.blogspot.com
http://www.continuent.com/news/blogs
Continuent Web Page:
http://www.continuent.com
!
Tungsten
2.2Standby
and 3.0 Failed
Preview:
Master Replicator
Slave Hot
http://code.google.com/p/tungsten-replicator
©Continuent 2014