HBase2

Seven Databases in Seven Weeks
HBase
What is HBase?
                      Google's internal systems     Hadoop project
                      (from published papers)       (Google clone)
Batch processing      MapReduce                     MapReduce
Real-time response    BigTable                      HBase
File system           Google File System (GFS)      Hadoop Distributed File System (HDFS)
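BigTable (a sorted, column-oriented database)
RowKey: required; rows are kept sorted by RowKey
ColumnFamily: defined in the schema
Column: schemaless (columns can be added freely)
[Figure: a table with a RowKey column and ColumnFamily1, ColumnFamily2, and ColumnFamily3; rows 1 and 2 each carry a different subset of Column1, Column2, and Column3 in each family]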
[Figure: a single column holding the values #F00, #0F0, #000, #FFF, and #00F, stored under timestamps 5 down to 1]
Each column value is versioned by timestamp.
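As a rough illustration of timestamp versioning, the following Ruby sketch models a single cell that keeps several values keyed by timestamp. It is a toy model, not HBase's implementation: the class name VersionedCell and the max_versions parameter are assumptions made for this example, and the pairing of colors to timestamps is arbitrary.

```ruby
# Toy model of one versioned cell: values are stored per timestamp and
# a read returns the newest value at or before the requested timestamp.
class VersionedCell
  def initialize(max_versions = 3) # hypothetical retention limit
    @max_versions = max_versions
    @versions = {}                 # timestamp => value
  end

  def put(timestamp, value)
    @versions[timestamp] = value
    # Drop the oldest versions once the retention limit is exceeded.
    excess = @versions.size - @max_versions
    @versions.keys.sort.first(excess).each { |ts| @versions.delete(ts) } if excess > 0
    self
  end

  # With no argument, return the newest value; otherwise the newest
  # value whose timestamp is not later than at_or_before.
  def get(at_or_before = nil)
    candidates = @versions.keys
    candidates = candidates.select { |ts| ts <= at_or_before } if at_or_before
    newest = candidates.max
    newest && @versions[newest]
  end

  def timestamps
    @versions.keys.sort.reverse
  end
end

cell = VersionedCell.new
cell.put(1, "#FFF").put(2, "#0F0").put(3, "#F00")
puts cell.get     # newest version
puts cell.get(2)  # version as of timestamp 2
```

Reading without a timestamp returns the latest value, which is the common case; older versions stay addressable until they age out.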
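Region splitting and automatic sharding
[Figure: a table with RowKeys 1 to 9 and ColumnFamily1, ColumnFamily2, and ColumnFamily3, divided into six regions]
• Tables are physically split (sharded) into regions
• Each region is served by a region server in the cluster
• Regions are created per ColumnFamily (in HBase terms, each region keeps a separate store for each ColumnFamily)
• Regions divide the sorted RowKeys into ranges of a suitable size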
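Because RowKeys are sorted and each region covers a contiguous key range, finding the region for a key amounts to finding the last region whose start key is not greater than that key. The Ruby sketch below shows the idea only; the Region struct, the start keys, and the server names are hypothetical, and real HBase clients resolve regions through catalog metadata rather than a hard-coded list.

```ruby
# Hypothetical region table: each region covers the keys from its
# start_key up to (but not including) the next region's start_key.
Region = Struct.new(:start_key, :server)

REGIONS = [
  Region.new("",  "regionserver-a"),
  Region.new("g", "regionserver-b"),
  Region.new("p", "regionserver-c")
].freeze

# Regions are sorted by start_key, so the owner of row_key is the last
# region whose start_key sorts at or before row_key.
def region_for(row_key, regions = REGIONS)
  index = regions.rindex { |region| region.start_key <= row_key }
  regions[index]
end

%w[apple hbase zebra].each do |row_key|
  puts "#{row_key} -> #{region_for(row_key).server}"
end
```

A linear search keeps the sketch short; with many regions a binary search over the sorted start keys would be the natural choice.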
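HBase features
Automatic sharding and automatic failover
• When a table grows large, it is split automatically
• The resulting shards fail over automatically when a node fails
Data consistency (CAP: Consistency)
• An update is readable from the moment it is applied
• HBase does not relax this to eventual consistency, where reads only eventually return the same value
Hadoop/HDFS integration
• HBase can be deployed on top of Hadoop's HDFS
• Hadoop MapReduce jobs can use HBase directly as input and output, without an intermediate API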
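How Seven Databases in Seven Weeks organizes the HBase chapter
Day 1: CRUD and table administration
• Run HBase in standalone mode
• Create a table
• Put and get data
Day 2: Working with big data
• Load a Wikipedia dump
• Get comfortable operating HBase from scripts (not the shell)
Day 3: Taking it to the cloud
• Operate HBase through Thrift (not covered here)
• Deploy to EC2 with Whirr (not covered here)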
Handling the Wikipedia dump file in HBase
Sample data
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd"
version="0.8" xml:lang="ja">
<siteinfo>
<sitename>Wikipedia</sitename>
<base>http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8</base>
<generator>MediaWiki 1.22wmf2</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2" case="first-letter">メディア</namespace>
<namespace key="-1" case="first-letter">特別</namespace>
<namespace key="0" case="first-letter" />
</namespaces>
</siteinfo>
<page>
<title>アンパサンド</title>
<ns>0</ns>
<id>5</id>
<revision>
<id>46524710</id>
<parentid>44911376</parentid>
<timestamp>2013-03-06T22:31:33Z</timestamp>
<contributor>
<username>Addbot</username>
<id>712937</id>
</contributor>
<minor />
<comment>ボット: 言語間リンク 31 件を[[d:|ウィキデータ]]上の [[d:q11213]] に転記</comment>
<text xml:space="preserve">{{記号文字|&amp;}} [[Image:Trebuchet MS ampersand.svg|right|thumb|100px|[[Trebuchet
<sha1>4duebxtzaadjddpy3036cey6451d992</sha1>
<model>wikitext</model>
<format>text/x-wiki</format>
</revision>
</page>
<page>
…
Sample data (the first page of the dump, mapped to an HBase row)
RowKey: title     id     text            revision
アンパサンド        5      {{記号文字…      parentid:44911376
…
Creating the schema
The page title is the RowKey; id, text, and revision become ColumnFamilies, with parentid stored as a column inside revision.
Schema definition
hbase(main):004:0> create 'wiki', 'id', 'text', 'revision'
0 row(s) in 2.4180 seconds
hbase(main):005:0> disable 'wiki'
0 row(s) in 2.3650 seconds
hbase(main):006:0> alter 'wiki', {NAME=>'text', COMPRESSION=>'GZ', BLOOMFILTER=>'ROW'}
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 1.2860 seconds
(The alter command enables GZ compression and a ROW Bloom filter on the text ColumnFamily.)
hbase(main):007:0> enable 'wiki'
0 row(s) in 2.7430 seconds
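BloomFilter
[Figure: a query being checked against the RowKey/ColumnFamily entries in a region]
Quickly detects that the specified RowKey/ColumnFamily is not present in a region
• 'ROW': RowKey only
• 'ROWCOL': RowKey plus column qualifier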
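To make the "quickly detects absence" property concrete, here is a minimal Bloom filter sketch in Ruby. It is a generic textbook construction, not HBase's implementation: the class name ToyBloomFilter, the bit-array size, the number of hash functions, and the use of MD5 are all assumptions chosen for brevity.

```ruby
require "digest"

# Toy Bloom filter: each key sets a few positions in a bit array.
# A lookup that finds any of those positions unset proves the key was
# never added; if all are set, the key is only possibly present.
class ToyBloomFilter
  def initialize(bits = 1024, hashes = 3) # hypothetical sizing
    @bits = bits
    @hashes = hashes
    @array = Array.new(bits, false)
  end

  def add(key)
    positions(key).each { |i| @array[i] = true }
  end

  # false => definitely absent; true => possibly present
  def might_contain?(key)
    positions(key).all? { |i| @array[i] }
  end

  private

  # Derive several positions by salting one hash function with a seed.
  def positions(key)
    (0...@hashes).map do |seed|
      Digest::MD5.hexdigest("#{seed}:#{key}").to_i(16) % @bits
    end
  end
end

filter = ToyBloomFilter.new
%w[Ampersand Hadoop HBase].each { |row_key| filter.add(row_key) }
puts filter.might_contain?("HBase")
```

A negative answer lets a read skip the underlying file entirely, while a positive answer still requires the real lookup because false positives are possible.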
Data loading code
include Java
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.HBaseConfiguration
import javax.xml.stream.XMLStreamConstants
require "time"

# Convert each argument to a Java byte array, which the HBase client API expects.
def jbytes(*args)
  args.map { |arg| arg.to_s.to_java_bytes }
end

# Stream the XML dump from standard input.
factory = javax.xml.stream.XMLInputFactory.newInstance
reader = factory.createXMLStreamReader(java.lang.System.in)
table = HTable.new( HBaseConfiguration.new, "wiki" )

document = nil   # fields of the page currently being read
buffer = nil     # text fragments of the element currently being read
count = 0

while reader.has_next
  type = reader.next
  if type == XMLStreamConstants::START_ELEMENT
    tag = reader.local_name
    case tag
    when 'page' then document = {}
    when /title|id|parentid|timestamp|text/ then buffer = []
    end
  elsif type == XMLStreamConstants::CHARACTERS
    text = reader.text
    buffer << text unless buffer.nil?
  elsif type == XMLStreamConstants::END_ELEMENT
    tag = reader.local_name
    case tag
    when /title|id|parentid|timestamp|text/
      # Keep the first value seen: <id> also appears under <revision> and
      # <contributor>, and those would otherwise overwrite the page id.
      document[tag] ||= buffer.join
    when 'revision'
      key = document['title'].to_java_bytes
      ts = (Time.parse document['timestamp']).to_i
      p = Put.new(key , ts)
      p.add( *jbytes("text", "", document['text']) )
      p.add( *jbytes("id", "", document['id']) )
      # 'parendid' (sic): the scans in this deck query the qualifier under this spelling.
      p.add( *jbytes("revision", "parendid", document['parentid']) )
      table.put(p)
      count += 1
      table.flushCommits() if count % 50 == 0
      puts "#{count} records inserted (#{document['title']})" if count % 1000 == 0
    end
  end
end

table.flushCommits()
puts "#{count}"
exit
Example run
[root@HBase01 opt]# cat jawiki-latest-pages-meta-current.xml | time hbase-0.94.7/bin/hbase org.jruby.Main hoge.rb
2362613
860.30user 172.80system 28:29.00elapsed 60%CPU (0avgtext+0avgdata 696352maxresident)k 304inputs+4072outputs
(0major+91296minor)pagefaults 0swaps
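The run output gives a record count (2362613) and an elapsed time (28:29.00), so the average load rate can be worked out directly. The short Ruby calculation below does only that arithmetic; the variable names are chosen for this example, and the batch size of 50 is the flush interval used in the loading script.

```ruby
# Figures taken from the example run output.
records = 2_362_613
elapsed_seconds = 28 * 60 + 29   # 28:29.00 elapsed
flush_interval = 50              # flushCommits() is called every 50 puts

records_per_second = records.to_f / elapsed_seconds
periodic_flushes = records / flush_interval

puts format("%.0f records/s on average", records_per_second)
puts "#{periodic_flushes} periodic flushes plus one final flush"
```

This is an average over the whole run on a single standalone node, so it says nothing about how the rate varied during the load.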
Data loading code (walkthrough of the parsing loop)
document = nil
buffer = nil
count = 0

while reader.has_next
  type = reader.next
  if type == XMLStreamConstants::START_ELEMENT       # start tag
    tag = reader.local_name
    case tag
    when 'page' then document = {}
    when /title|id|parentid|timestamp|text/ then buffer = []
    end
  elsif type == XMLStreamConstants::CHARACTERS       # element content
    text = reader.text
    buffer << text unless buffer.nil?
  elsif type == XMLStreamConstants::END_ELEMENT      # end tag
    tag = reader.local_name
    case tag
    when /title|id|parentid|timestamp|text/
      # Keep the first value seen: <id> also appears under <revision> and
      # <contributor>, and those would otherwise overwrite the page id.
      document[tag] ||= buffer.join
    when 'revision'
      # Insert: one Put per revision, keyed by title and stamped with the revision time.
      key = document['title'].to_java_bytes
      ts = (Time.parse document['timestamp']).to_i
      p = Put.new(key , ts)
      p.add( *jbytes("text", "", document['text']) )
      p.add( *jbytes("id", "", document['id']) )
      p.add( *jbytes("revision", "parendid", document['parentid']) )
      table.put(p)
      count += 1
      table.flushCommits() if count % 50 == 0
      if count % 1000 == 0
        puts "#{count} records inserted (#{document['title']})"
      end
    end
  end
end

table.flushCommits()
puts "#{count}"
exit

Input being parsed:
<page>
<title>アンパサンド</title>
<ns>0</ns>
<id>5</id>
<revision>
<id>46524710</id>
<parentid>44911376</parentid>
<timestamp>2013-03-06T22:31:33Z</timestamp>
<contributor>
<username>Addbot</username>
<id>712937</id>
</contributor>
<minor />
<comment>…</comment>
<text xml:space="preserve">…</text>
<sha1>4duebxtzaadjddpy3036cey6451d992</sha1>
<model>wikitext</model>
<format>text/x-wiki</format>
</revision>
</page>
<page>
…
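Experiments
On load
Q: Does GZ compression of the text family speed up loading?
A: Yes. 46 min -> 26 min (roughly 70% faster by throughput: 46/26 ≈ 1.77x)
Q: Does GZ compression of the text family save storage?
A: Yes. 8.1GB -> 2.6GB (compressed to 32% of the original size)
On read
Q: Does GZ compression of the text family speed up get?
A: No significant difference
Q: How much do a full scan and a partial scan differ in speed?
A: 0.136[s] (start RowKey specified, 10 rows) vs 119.738[s] (full scan with a condition on a column value)
Q: Does GZ compression of the text family speed up scan?
A: It depends on the conditions (see the next slide)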
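The ratios behind these answers can be checked with a few lines of Ruby. The sketch below only redoes the arithmetic on the quoted figures; the variable names are made up for this example, and no new measurements are implied.

```ruby
# Figures quoted in the experiment notes.
load_minutes_plain = 46.0
load_minutes_gz    = 26.0
size_gb_plain      = 8.1
size_gb_gz         = 2.6
partial_scan_s     = 0.136
full_scan_s        = 119.738

load_speedup    = load_minutes_plain / load_minutes_gz      # throughput ratio
load_time_saved = 1 - load_minutes_gz / load_minutes_plain  # fraction of time saved
size_ratio      = size_gb_gz / size_gb_plain                # compressed / original
scan_ratio      = full_scan_s / partial_scan_s              # full scan / partial scan

puts format("load: %.2fx throughput, %.0f%% less time", load_speedup, load_time_saved * 100)
puts format("size: %.0f%% of the original", size_ratio * 100)
puts format("scan: full scan takes about %.0fx as long as the partial scan", scan_ratio)
```

Stating the load result both as a throughput ratio and as time saved avoids ambiguity about what "faster" means.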
GZ compression, text (a large ColumnFamily), and scan
[Figure: the wiki table with title as the RowKey and the id, text, and revision ColumnFamilies, divided into regions]
hbase(main):001:0> scan 'wiki' , {COLUMN=>['id','revision'], FILTER => "SingleColumnValueFilter('revision','parendid',=,'binary:46628036')"}
hbase(main):001:0> scan 'wiki' , {FILTER => "SingleColumnValueFilter('revision','parendid',=,'binary:46628036')"}
hbase(main):001:0> scan 'wiki' , {FILTER => "SingleColumnValueFilter('text','',=,'substring:ぱんだねこ')"}
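Scan time by case (chart values; the unit is presumably seconds)
Case                                          With compression    Without compression
Text not fetched                              10.424              11.961
Text fetched, scan condition outside text     117.751             193.735
Text fetched, scan condition on text          203.014             262.239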
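To compare the three scan cases, it helps to express each as a ratio of the uncompressed time to the compressed time. The Ruby sketch below does that with the chart's values; the case labels are paraphrased for this example and the unit is assumed to be seconds.

```ruby
# Scan times from the chart, as [with GZ compression, without compression].
SCAN_TIMES = {
  "text not fetched"                     => [10.424, 11.961],
  "text fetched, condition outside text" => [117.751, 193.735],
  "text fetched, condition on text"      => [203.014, 262.239]
}.freeze

# How many times longer the scan takes without compression.
slowdown_without_gz = SCAN_TIMES.transform_values do |with_gz, without_gz|
  (without_gz / with_gz).round(2)
end

slowdown_without_gz.each { |label, ratio| puts "#{label}: #{ratio}x" }
```

The gap is smallest when the text family is not read at all, which fits the idea that compression mainly pays off when the large family has to be fetched.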
