Big data projects can be very expensive and can easily fail: I suggest starting with a small, useful but non-critical project, ideally one about unstructured data collection and batch processing. That way you have time to get practice with the new technologies, and downtime of the Apache Hadoop system is not critical.

At home I have the following system running on a small Raspberry Pi: for sure it is not fast ;-)

At work I introduced Hadoop just a few months ago to collect web data and generate daily reports.


Posted in Me.


The trend in recent years has been a switch from SQL (RDBMS) databases to NoSQL systems such as Hadoop, MongoDB, Cassandra, Riak, …

SQL is an old but easy and fast way to query data, and people STILL look to it for querying Hadoop and big data:

Read the details in "10 ways to query Hadoop with SQL".

Posted in Me.

Instead of the old Hadoop way (map/reduce), I suggest the newer and faster way (Apache Spark on top of Hadoop YARN): in a few lines you can open all tweets (gzipped JSON files saved in several subdirectories, hdfs://path/to/YEAR/MONTH/DAY/*gz) and query them in a SQL-like language:

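from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="")
sqlContext = SQLContext(sc)
tweets = sqlContext.jsonFile("/tmp/twitter/opensource/2014/*/*.gz")
# register the data so that the SQL query can refer to it as "tweets"
tweets.registerTempTable("tweets")
t = sqlContext.sql("SELECT distinct createdAt,user.screenName,hashtagEntities FROM tweets")
# count_items and javaTimestampToString are helper functions not shown here
tweets_by_days = count_items(t.map(lambda row: javaTimestampToString(row[0])))
stats_hashtags = count_items(t.flatMap(lambda row: row[2]).map(lambda tag: tag[2].lower()))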

My source code is free and available (Python and Scala repositories).
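
The count_items helper used above is not shown in this post. As a rough illustration of the kind of tally it stands for, here is a minimal plain-Python sketch. The name count_items_sketch, the sort order and the sample hashtags are assumptions made for illustration, not the code from the repositories:

```python
from collections import Counter


def count_items_sketch(items):
    """Count occurrences and return (item, count) pairs, most frequent first.

    Hypothetical stand-in for the count_items helper; works on any iterable.
    """
    counts = Counter(items)
    # sort by descending count, then alphabetically to get a stable order
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# made-up sample data, lower-cased as in the hashtag statistics above
hashtags = ["Spark", "hadoop", "spark", "Hadoop", "spark"]
stats = count_items_sketch(tag.lower() for tag in hashtags)
print(stats)
```

On a Spark RDD, the built-in countByValue() method returns comparable tallies to the driver, so a helper of this kind could probably be built on top of it.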


Posted in Me.

Apache Spark has just passed Hadoop in popularity on the web (Google Trends).

My first use of Apache Spark was extracting the text of the tweets I have been collecting in Hadoop HDFS. My Python script was:

import json

from pyspark import SparkContext

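def parse(line):
    # return the tweet as a dict, or an empty dict for blank or malformed lines
    try:
        return json.loads(line)
    except ValueError:
        return {}

def valid(tweet):
    # keep only parsed tweets that carry a 'text' field
    return 'text' in tweet

sc = SparkContext(appName="Tweets")
data = sc.textFile("hdfs://*/*/*.gz")

# parse each line first, so that valid() checks the keys of the tweet
# instead of searching the raw line for the substring 'text'
result = data.map(parse)\
    .filter(valid)\
    .map(lambda tweet: tweet['text'])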

output = result.collect()
for text in output:
    print text.encode('utf-8')

And launched with

spark-1.1.0> bin/spark-submit --master local[4]
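
Before submitting the job, the parsing and filtering logic can be tried on a few lines locally, without Spark. This is only a minimal sketch: the helper name extract_texts and the sample lines are made up for illustration and are not part of the original script.

```python
import json


def extract_texts(lines):
    """Yield the 'text' field of every line that is a JSON object with one."""
    for line in lines:
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip blank or malformed lines
        if isinstance(tweet, dict) and 'text' in tweet:
            yield tweet['text']


# made-up sample input: two tweets, one delete notice, one blank line
sample = [
    '{"text": "hello spark", "user": {"screenName": "a"}}',
    '{"delete": {"status": {"id": 1}}}',
    '',
    '{"text": "second tweet"}',
]
texts = list(extract_texts(sample))
print(texts)
```

Lines without a top-level 'text' key, such as the delete notice, are skipped rather than raising a KeyError.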


Posted in Me.