Python Hadoop Pig

December 30, 2014
This notebook shows how to submit a PIG job to a remote Hadoop cluster (the distribution tested here is Cloudera). It works better if you already know Hadoop; otherwise I recommend reading Map/Reduce avec PIG (in French) first. We start by downloading the data we are going to upload to the remote cluster.
In [1]: import pyensae
pyensae.download_data("ConfLongDemo_JSI.txt",
    website="https://archive.ics.uci.edu/ml/machine-learning-databases/00196/")
downloading of
https://archive.ics.uci.edu/ml/machine-learning-databases/00196/ConfLongDemo_JSI.txt
Out[1]: 'ConfLongDemo_JSI.txt'
We open an SSH connection to the bridge, which can communicate with the cluster.
In [1]: import pyquickhelper
params={"server":"", "username":"", "password":""}
pyquickhelper.open_html_form(params=params,title="credentials",key_save="ssh_remote_hadoop")
Out[1]: <IPython.core.display.HTML at 0x742c9f0>
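The values entered in the form are saved in a workspace variable named after the key_save parameter: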
In [2]: password = ssh_remote_hadoop["password"]
server = ssh_remote_hadoop["server"]
username = ssh_remote_hadoop["username"]
We open the SSH connection:
In [3]: %remote_open
Out[3]: <pyensae.remote.remote_connection.ASSHClient at 0xa9b16f0>
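For reference, the %remote_open magic wraps an SSH client; here is a minimal sketch of an equivalent manual connection with paramiko (an assumption about the underlying library, which the magic normally hides). We keep client around because a later sketch reuses it:

import paramiko

client = paramiko.SSHClient()
# accept the bridge's host key on the first connection
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(server, username=username, password=password)
# run a command on the bridge and print its output
stdin, stdout, stderr = client.exec_command("ls -l")
print(stdout.read().decode("utf8"))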
We check the content of the remote machine:
In [5]: %remote_cmd ls -l
Out[5]: <IPython.core.display.HTML at 0xb68a8b0>
We check the content on the cluster:
In [6]: %remote_cmd hdfs dfs -ls
Out[6]: <IPython.core.display.HTML at 0x7bb37d0>
We upload the file to the bridge (we should zip it first to reduce the upload time; a sketch of that step follows the next cell).
In [8]: %remote_up ConfLongDemo_JSI.txt ConfLongDemo_JSI.txt
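For completeness, a minimal sketch of the compression step with the standard library (not applied here; the plain text file is uploaded as-is):

import gzip
import shutil

# compress the file before transfer; the gzipped copy is typically
# much smaller than the plain text file
with open("ConfLongDemo_JSI.txt", "rb") as src:
    with gzip.open("ConfLongDemo_JSI.txt.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)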
We check it got there:
In [12]: %remote_cmd ls Conf*JSI.txt
Out[12]: <IPython.core.display.HTML at 0xb693750>
We put it on the cluster:
In [23]: %remote_cmd hdfs dfs -put ConfLongDemo_JSI.txt ConfLongDemo_JSI.txt
Out[23]: <IPython.core.display.HTML at 0xbb7a110>
We check it was put on the cluster:
In [24]: %remote_cmd hdfs dfs -ls Conf*JSI.txt
Out[24]: <IPython.core.display.HTML at 0xbb7a890>
We create a simple PIG program:
In [25]: %%PIG filter_example.pig
myinput = LOAD 'ConfLongDemo_JSI.txt' USING PigStorage(',') AS
    (index:long, sequence, tag, timestamp:long, dateformat,
     x:double, y:double, z:double, activity);
filt = FILTER myinput BY activity == 'walking' ;
STORE filt INTO 'ConfLongDemo_JSI.walking.txt' USING PigStorage() ;
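To make the logic concrete, here is a local Python equivalent of that PIG script, a sketch one can run on a small sample (the activity field is the last column of the schema declared above; PigStorage(',') reads comma-separated values):

import csv

# keep only the rows whose activity column equals 'walking',
# mirroring the FILTER ... BY activity == 'walking' statement
with open("ConfLongDemo_JSI.txt", newline="") as fin, \
     open("ConfLongDemo_JSI.walking.txt", "w", newline="") as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for row in reader:
        if row and row[-1] == "walking":
            writer.writerow(row)

We then submit the job to the cluster; its output streams are redirected to files on the bridge: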
In [26]: %jobsubmit filter_example.pig filter_example.redirect
Out[26]: <IPython.core.display.HTML at 0xbb7a930>
We check that the redirect files were created:
In [27]: %remote_cmd ls f*redirect*
Out[27]: <IPython.core.display.HTML at 0xbb7a790>
We check the tail on a regular basis to watch the job run (other commands can be used to monitor jobs; see %remote_cmd mapred --help).
In [32]: %remote_cmd tail filter_example.redirect.err
Out[32]: <IPython.core.display.HTML at 0xbb7ac50>
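The polling can also be scripted; a sketch reusing the paramiko client from the earlier sketch (PIG logs its progress to the redirected error stream):

import time

# print the tail of the redirected stderr every 10 seconds;
# PIG's summary usually ends with 'Success!' or 'Failed!'
for _ in range(30):
    _, stdout, _ = client.exec_command("tail -n 5 filter_example.redirect.err")
    log = stdout.read().decode("utf8")
    print(log)
    if "Success!" in log or "Failed!" in log:
        break
    time.sleep(10)

Once the job ends, its output shows up on the cluster: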
In [34]: %remote_cmd hdfs dfs -ls Conf*JSI.walking.txt
Out[34]: <IPython.core.display.HTML at 0xbb7afd0>
After that, the output stream has to be downloaded to the bridge and then to the local machine with %remote_down.
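In this case that means something like the following (a hypothetical cell, assuming %remote_down takes the remote path then the local one, mirroring %remote_up; hdfs dfs -getmerge concatenates the part files produced by the job into a single file on the bridge):

%remote_cmd hdfs dfs -getmerge ConfLongDemo_JSI.walking.txt ConfLongDemo_JSI.walking.txt
%remote_down ConfLongDemo_JSI.walking.txt ConfLongDemo_JSI.walking.txt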
We finally close the connection.
In [35]: %remote_close