Apache Pig for batch data analysis over Hadoop

These days I'm playing with Apache Pig to run batch data analysis over Apache Hadoop. Below is a sample wordcloud generated from the most frequent words of the Italian translation of the Bible.

[Wordcloud image: la-sacra-bibbia-frequenza-parole]

Copy the file book.txt to the Hadoop Distributed File System (HDFS) with

hadoop-2.4.0/bin/hdfs dfs -copyFromLocal -f book.txt
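
With no destination path, copyFromLocal puts the file into your HDFS home directory (/user/<username>). You can verify that it landed there with a quick listing, assuming the same layout as above:

hadoop-2.4.0/bin/hdfs dfs -ls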

Test the Pig job locally with

pig-0.13.0/bin/pig -x local wordcount.pig
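
Note that in local mode Pig reads and writes the local filesystem rather than HDFS, so the absolute paths in the script below must also exist locally. One way to keep a single script working in both modes is Pig's parameter substitution; a sketch, where the input parameter and a matching $input placeholder in the load statement are my own additions, not part of the script below:

pig-0.13.0/bin/pig -x local -param input=book.txt wordcount.pig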

Run the Pig job on Hadoop with

pig-0.13.0/bin/pig -x mapreduce wordcount.pig

Look at the results with

hadoop-2.4.0/bin/hdfs dfs -cat book-wordcount/part*|more

Copy the results to a local file with

hadoop-2.4.0/bin/hdfs dfs -cat book-wordcount/part* > frequenza-parole-bibbia.txt
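
Alternatively, hdfs dfs -getmerge concatenates all the part files into a single local file in one step; this should be equivalent, assuming the same output directory:

hadoop-2.4.0/bin/hdfs dfs -getmerge book-wordcount frequenza-parole-bibbia.txt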


Below are the two scripts I used for this short tutorial:

Wordcount (Pig script):

a = load '/user/matteo/book.txt';
b = foreach a {
    -- lowercase each line, strip punctuation, then emit one record per word
    line = LOWER(REPLACE((chararray)$0, '[!?\\.»«:;,\']', ' '));
    generate flatten(TOKENIZE(line)) as word;
}
c = group b by word;
d = foreach c generate group, COUNT(b) as cnt;
d_ordered = ORDER d BY cnt DESC;
store d_ordered into '/user/matteo/book-wordcount';
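
If you want to keep only the more meaningful words, one option is to drop short tokens before grouping; a minimal sketch, where the b_filtered alias and the 3-character threshold are my own assumptions, not part of the original job:

b_filtered = filter b by SIZE(word) > 3;
c = group b_filtered by word;
d = foreach c generate group, COUNT(b_filtered) as cnt;

SIZE on a chararray returns its length, so this discards articles and other very short words before counting.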


Wordcloud (R script):

library(wordcloud)
# column V1 holds the words, V2 their counts
p = read.table(file="frequenza-parole-bibbia.txt")
png("/home/matteo/la-sacra-bibbia-frequenza-parole.png", width=900, height=900)
wordcloud(p$V1, p$V2, scale=c(8,.3), min.freq=2, max.words=200, random.order=T, rot.per=.15)
dev.off()
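
The script assumes the wordcloud package is already available; if it is not, install it from CRAN first:

install.packages("wordcloud")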