python hadoop pig December 30, 2014 1 Python Hadoop Pig Contents This notebook aims at showing how to submit a PIG job to remote hadoop cluster (tested with Cloudera). It works better if you know Hadoop otherwise I recommend reading Map/Reduce avec PIG (French). First, we download data. We are going to upload that data to the remote cluster. The Hadoop distribution tested here is Cloudera. In [1]: import pyensae pyensae.download_data("ConfLongDemo_JSI.txt", website="https://archive.ics.uci.edu/ml/machine-le downloading of https://archive.ics.uci.edu/ml/machine-learning-databases/00196/ConfLongDemo JSI.txt Out[1]: ’ConfLongDemo JSI.txt’ We open a SSH connection to the bridge which can communicate to the cluster. In [1]: import pyquickhelper params={"server":"", "username":"", "password":""} pyquickhelper.open_html_form(params=params,title="credentials",key_save="ssh_remote_hadoop") Out[1]: <IPython.core.display.HTML at 0x742c9f0> In [2]: password = ssh_remote_hadoop["password"] server = ssh_remote_hadoop["server"] username = ssh_remote_hadoop["username"] We open the SSH connection: In [3]: %remote_open Out[3]: <pyensae.remote.remote connection.ASSHClient at 0xa9b16f0> We check the content of the remote machine: In [5]: %remote_cmd ls -l Out[5]: <IPython.core.display.HTML at 0xb68a8b0> We check the content on the cluster: In [6]: %remote_cmd hdfs dfs -ls Out[6]: <IPython.core.display.HTML at 0x7bb37d0> 1 to We upload the file on the bridge (we should zip it first, it would reduce the uploading time). In [8]: %remote_up ConfLongDemo_JSI.txt ConfLongDemo_JSI.txt We check it got there: In [12]: %remote_cmd ls Conf*JSI.txt Out[12]: <IPython.core.display.HTML at 0xb693750> We put it on the cluster: In [23]: %remote_cmd hdfs dfs -put ConfLongDemo_JSI.txt ConfLongDemo_JSI.txt Out[23]: <IPython.core.display.HTML at 0xbb7a110> We check it was put on the cluster: In [24]: %remote_cmd hdfs dfs -ls Conf*JSI.txt Out[24]: <IPython.core.display.HTML at 0xbb7a890> We create a simple PIG program: In [25]: %%PIG filter_example.pig myinput = LOAD ’ConfLongDemo_JSI.txt’ USING PigStorage(’,’) AS (index:long, sequence, tag, timestamp:long, dateformat, x:double,y:double, z:double, activi filt = FILTER myinput BY activity == ’walking’ ; STORE filt INTO ’ConfLongDemo_JSI.walking.txt’ USING PigStorage() ; In [26]: %jobsubmit filter_example.pig filter_example.redirect Out[26]: <IPython.core.display.HTML at 0xbb7a930> We check the redirected files were created: In [27]: %remote_cmd ls f*redirect* Out[27]: <IPython.core.display.HTML at 0xbb7a790> We check the tail on a regular basis to see the job running (some other commands can be used to monitor jobs, %remote cmd mapred --help). In [32]: %remote_cmd tail filter_example.redirect.err Out[32]: <IPython.core.display.HTML at 0xbb7ac50> In [34]: %remote_cmd hdfs dfs -ls Conf*JSI.walking.txt Out[34]: <IPython.core.display.HTML at 0xbb7afd0> After that, the stream has to downloaded to the bridge and then to the local machine with %remote down. We finally close the connection. In [35]: %remote_close END In []: 2
© Copyright 2025 ExpyDoc