Monthly Archives: August 2014

Apache Pig for batch data analysis over Hadoop

These days I’m playing with Apache Pig to run data analysis on Apache Hadoop. Below is a sample word cloud generated from the counts of the most frequent nouns in the Italian translation of the Bible.

[Image: la-sacra-bibbia-frequenza-parole]

Copy the file book.txt to the Hadoop Distributed File System (HDFS) with

hadoop-2.4.0/bin/hdfs dfs -copyFromLocal -f book.txt

Test the Pig job locally with

pig-0.13.0/bin/pig -x local wordcount.pig

Run the Pig job on Hadoop with

pig-0.13.0/bin/pig -x mapreduce wordcount.pig

Look at the results with

hadoop-2.4.0/bin/hdfs dfs -cat book-wordcount/part*|more

Copy the results to a local file with

hadoop-2.4.0/bin/hdfs dfs -cat book-wordcount/part* > frequenza-parole-bibbia.txt

 

Below are the two scripts I used for this short tutorial:

Wordcount (Pig script):

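-- Load the book as raw lines; the single unnamed field is referenced as $0
a = load '/user/matteo/book.txt';
-- Replace punctuation with spaces, lowercase each line, and emit one record per word
b = foreach a {
    line = LOWER(REPLACE((chararray)$0, '[!?\\.»«:;,\']', ' '));
    generate flatten(TOKENIZE(line)) as word;
}
-- Group identical words and count the occurrences in each group
c = group b by word;
d = foreach c generate group, COUNT(b) as cnt;
-- Sort by descending frequency and store the result in HDFS
d_ordered = ORDER d BY cnt DESC;
store d_ordered into '/user/matteo/book-wordcount';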
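
To sanity-check the logic of the Pig script on a few lines without Hadoop, the same steps can be sketched in plain Python. This is only a rough equivalent and not part of the original job: the function name word_count and the sample lines are made up for illustration, and plain whitespace splitting is assumed to be close enough to Pig's TOKENIZE for this purpose.

```python
import re
from collections import Counter

# Same character class the Pig script passes to REPLACE: each listed
# punctuation mark is turned into a space before tokenizing.
PUNCTUATION = re.compile(r"[!?\.»«:;,']")


def word_count(lines):
    """Return (word, count) pairs sorted by descending count."""
    counts = Counter()
    for line in lines:
        cleaned = PUNCTUATION.sub(" ", line).lower()
        # Assumption: splitting on whitespace approximates TOKENIZE, which
        # also treats a few other characters as delimiters.
        counts.update(cleaned.split())
    return counts.most_common()


# Made-up sample lines standing in for book.txt.
sample = [
    "In principio Dio creò il cielo e la terra.",
    "La terra era informe; e Dio disse: «Sia la luce!»",
]
for word, cnt in word_count(sample)[:3]:
    # Tab-separated, like the records in the Pig output files.
    print(f"{word}\t{cnt}")
```

Comparing this output with the first lines of the Pig result for the same small input is a quick way to spot a wrong regular expression before running the job on the cluster.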

 

Wordcloud (R script):

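library(wordcloud)
# The Pig output is tab-separated "word<TAB>count"; disable quote and comment
# parsing so that characters such as # inside words are read as plain text
p = read.table(file="frequenza-parole-bibbia.txt", sep="\t", quote="", comment.char="", stringsAsFactors=FALSE)
png("/home/matteo/la-sacra-bibbia-frequenza-parole.png", width=900, height=900)
# V1 holds the words and V2 their counts
wordcloud(p$V1, p$V2, scale=c(8,.3), min.freq=2, max.words=200, random.order=TRUE, rot.per=.15)
dev.off()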
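
As far as I understand the wordcloud options used above, min.freq drops words that occur fewer times than the threshold and max.words keeps only the most frequent ones. The Python sketch below applies the same selection to a tab-separated word count file, which can help to preview which words will end up in the picture. The function name select_words and the sample counts are invented for illustration.

```python
import csv
import io


def select_words(rows, min_freq=2, max_words=200):
    """Keep words with count >= min_freq, then the max_words most frequent."""
    kept = [(word, int(cnt)) for word, cnt in rows if int(cnt) >= min_freq]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_words]


# Stand-in for frequenza-parole-bibbia.txt: one tab-separated
# "word<TAB>count" record per line. The counts here are invented.
sample = io.StringIO("e\t40\ndio\t12\nterra\t7\nluce\t1\n")
rows = csv.reader(sample, delimiter="\t", quoting=csv.QUOTE_NONE)
for word, cnt in select_words(rows, min_freq=2, max_words=2):
    print(word, cnt)
```

To run it on the real output, the io.StringIO stand-in would be replaced by the opened result file.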

Developing applications with a microservice architecture

“The microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery.” Read the full article
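
To make the "HTTP resource API" part of the quote more concrete, here is a minimal sketch of one tiny service and a client calling it, using only the Python standard library. Everything in it is an assumption made for illustration: the /words/<word> URL layout, the WORD_COUNTS data, and the handler and function names. In a real microservice setup the service would run in its own process; a background thread is used here only to keep the example self-contained.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical data owned by this one service.
WORD_COUNTS = {"dio": 12, "terra": 7}


class WordCountHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Resource-style URL: GET /words/<word> returns that word's count.
        parts = self.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] == "words" and parts[1] in WORD_COUNTS:
            status = 200
            payload = {"word": parts[1], "count": WORD_COUNTS[parts[1]]}
        else:
            status, payload = 404, {"error": "not found"}
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_count(port, word):
    """Act as another service calling the word count service over HTTP."""
    url = f"http://127.0.0.1:{port}/words/{word}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), WordCountHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
print(fetch_count(server.server_port, "dio"))
server.shutdown()
server.server_close()
```

The point of the sketch is the boundary: the client knows only the URL and the JSON shape, so the service behind it could be redeployed or rewritten in another language without touching the caller.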

 

This is my first sample microservice written in Clojure