Instead of the old Hadoop way (map/reduce), I suggest the newer and faster one (Apache Spark on top of Hadoop YARN): in a few lines you can open all the tweets (gzipped JSON files saved in several subdirectories, hdfs://path/to/YEAR/MONTH/DAY/*gz) and query them in a SQL-like language:

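from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="")
sqlContext = SQLContext(sc)
tweets = sqlContext.jsonFile("/tmp/twitter/opensource/2014/*/*.gz")
# Register the data under the table name used in the query below
tweets.registerTempTable("tweets")
t = sqlContext.sql("SELECT distinct createdAt,user.screenName,hashtagEntities FROM tweets")
# count_items and javaTimestampToString are helper functions not shown here
tweets_by_days = count_items(t.map(lambda t: javaTimestampToString(t[0])))
stats_hashtags = count_items(t.flatMap(lambda t: t[2]).map(lambda t: t[2].lower()))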
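
The hashtag statistics step (flatten the hashtag lists, lower-case each hashtag, count) can be mimicked in plain Python without a cluster. This is only a sketch: the sample rows are made up, the position of the hashtag text inside each entity is an assumption, and I assume that count_items simply tallies how often each distinct item occurs.

```python
from collections import Counter
from itertools import chain

# Hypothetical rows shaped like the query result:
# (createdAt, screenName, hashtagEntities); each entity is (end, start, text).
rows = [
    (1400000000000, "alice", [(10, 0, "Spark"), (20, 12, "Hadoop")]),
    (1400000060000, "bob", [(6, 0, "spark")]),
    (1400000120000, "carol", []),
]

# flatMap step: one flat sequence of hashtag entities
entities = chain.from_iterable(row[2] for row in rows)
# map step: keep the lower-cased hashtag text (index 2 in this made-up layout)
hashtags = (entity[2].lower() for entity in entities)
# counting step: the assumed behaviour of count_items
stats_hashtags = Counter(hashtags)
```

Lower-casing before counting is what makes "Spark" and "spark" end up in the same bucket.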

My source code is free and available (Python and Scala repositories).


Posted in Me.

Apache Spark has just passed Hadoop in popularity on the web (Google Trends).

My first Apache Spark usage was extracting texts from tweets I’ve been collecting in Hadoop HDFS. My Python script was:

import json

from pyspark import SparkContext

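def valid(tweet):
    # Keep only records that carry a 'text' field
    return 'text' in tweet

def gettext(tweet):
    return tweet['text']

sc = SparkContext(appName="Tweets")
# Spark decompresses .gz files transparently
data = sc.textFile("hdfs://*/*/*.gz")

# Parse each line first, so that valid() tests the keys of the parsed tweet
# instead of doing a substring match on the raw line
result = data.map(json.loads)\
    .filter(valid)\
    .map(gettext)

output = result.collect()
for text in output:
    # Python 2 print statement (Spark 1.1.0 runs on Python 2)
    print text.encode('utf-8')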
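
The filter and extraction logic can be tried without Spark on a few hand-written lines. This is a minimal sketch, assuming one JSON document per line; the sample lines are made up, and each line is parsed before it is tested for the 'text' key.

```python
import json

def valid(tweet):
    # A parsed record is usable only if it has a top-level 'text' key
    return 'text' in tweet

def gettext(tweet):
    return tweet['text']

# Made-up sample lines: a tweet, a record without text, another tweet
lines = [
    '{"id": 1, "text": "Hello Spark"}',
    '{"delete": {"status": {"id": 2}}}',
    '{"id": 3, "text": "Ciao Hadoop"}',
]

# Same order of steps as an RDD pipeline: parse, filter, extract
tweets = [json.loads(line) for line in lines]
texts = [gettext(tweet) for tweet in tweets if valid(tweet)]
```

Records without a 'text' key are dropped instead of raising a KeyError.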

And launched with

spark-1.1.0> bin/spark-submit --master local[4]


The two top Hadoop distributions, Cloudera and Hortonworks, include Apache Solr as their Hadoop search tool. (Remember, though, that Hadoop is Free Software and many companies pay nothing to use it!)

See the apache-solr-hadoop-search article and the following two presentations from the two vendors



See also the Natural Language Processing and Sentiment Analysis for Retailers using HDP and ITC Infotech Radar article







I opened a service request with Oracle, but they did not provide an official way to add the Google Analytics JavaScript code to Oracle OBIEE (release I wanted to add it in only one place and see it on every OBIEE page.

The solution I found and tested is to add the JavaScript code (without the <script> and </script> tags) to the file


Note that the file may be overwritten by any software upgrade.


In this blog post Google confirms its adoption of the open-source statistical environment R (see my R introduction) by releasing a new R package.

“How can we measure the number of additional clicks or sales that an AdWords campaign generated? How can we estimate the impact of a new feature on app downloads? How do we compare the effectiveness of publicity across countries? In principle, all of these questions can be answered through causal inference […]

How the package works
The CausalImpact R package implements a Bayesian approach to estimating the causal effect of a designed intervention on a time series. Given a response time series (e.g., clicks) and a set of control time series (e.g., clicks in non-affected markets, clicks on other sites, or Google Trends data), the package constructs a Bayesian structural time-series model with a built-in spike-and-slab prior for automatic variable selection. This model is then used to predict the counterfactual, i.e., how the response metric would have evolved after the intervention if the intervention had not occurred.” Read the full Google blog post
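
The counterfactual idea can be illustrated with a much simpler stand-in. The sketch below is not the CausalImpact model: an ordinary least-squares line on a single control series replaces the Bayesian structural time-series model, there is no variable selection and no uncertainty estimate, and the data and function names are made up for illustration.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def estimated_effect(control, response, start):
    """Sum of (observed - predicted) response from index `start` onwards."""
    # Learn the control/response relation on the pre-intervention period only
    a, b = fit_line(control[:start], response[:start])
    # Predict what the response would have been without the intervention
    counterfactual = [a + b * c for c in control[start:]]
    return sum(obs - pred for obs, pred in zip(response[start:], counterfactual))

# Made-up data: response = 2 * control before the intervention at index 4,
# and 2 * control + 5 afterwards.
control = [10, 12, 11, 13, 12, 14]
response = [20, 24, 22, 26, 29, 33]
effect = estimated_effect(control, response, 4)
```

The relation is learned only on the period before the intervention and then projected forward; the gap between the observed and the projected values is the estimated effect.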


Some time ago I read that some components of IBM Watson were implemented in Prolog. So I decided to look at it again after many years… I like Prolog: I studied it while reading Computer Science at the University of Milan, and for my thesis I wrote code in Prolog (and Lisp).

proloGraph is a simple example of how to expose a Prolog graph database to other applications by building a REST web service. I used SWI-Prolog and its HTTP library.

Install the Prolog language (I used the fantastic Debian Linux distribution) with

apt-get install swi-prolog

Clone my Git repository

git clone
cd proloGraph

Run it with

swipl -s -g 'server(8765).'

Open the following URL with your browser


and you will get:

{
  "prev": [ {"from":"user(gabriele)", "to":"user(matteo)", "rel":"follow"} ],
  "next": [
    {"from":"user(matteo)", "to":"user(ele)", "rel":"follow"},
    {"from":"user(matteo)", "to":"user(gabriele)", "rel":"follow"},
    {"from":"user(matteo)", "to":"user(4)", "rel":"follow"},
    {"from":"user(matteo)", "to":"country(italy)", "rel":"lives"},
    {"from":"user(matteo)", "to":"hobby(running)", "rel":"likes"}
  ]
}
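
A client application could consume such a response roughly as follows. This is a minimal sketch in Python: the body is copied from the output shown above and is assumed to be a single JSON object with "prev" and "next" edge lists; the helper name targets_by_relation is made up for illustration, and no HTTP call is made.

```python
import json

# Sample body copied from the output above, assumed to be one JSON object
body = '''{
  "prev": [ {"from":"user(gabriele)", "to":"user(matteo)", "rel":"follow"} ],
  "next": [
    {"from":"user(matteo)", "to":"user(ele)", "rel":"follow"},
    {"from":"user(matteo)", "to":"user(gabriele)", "rel":"follow"},
    {"from":"user(matteo)", "to":"user(4)", "rel":"follow"},
    {"from":"user(matteo)", "to":"country(italy)", "rel":"lives"},
    {"from":"user(matteo)", "to":"hobby(running)", "rel":"likes"}
  ]
}'''

def targets_by_relation(edges):
    """Group the 'to' node of each edge under its 'rel' label."""
    grouped = {}
    for edge in edges:
        grouped.setdefault(edge["rel"], []).append(edge["to"])
    return grouped

graph = json.loads(body)
# Edges leaving the queried node, grouped by relation
outgoing = targets_by_relation(graph["next"])
# Nodes that point to the queried node
incoming = [edge["from"] for edge in graph["prev"]]
```

Grouping by "rel" separates the follow edges from the lives and likes edges, which is usually the first thing a consumer of a graph neighbourhood needs.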