
The trend in recent years has been a switch from SQL (RDBMS) databases to NoSQL databases like Hadoop, MongoDB, Cassandra, Riak, …

SQL is an old but easy and fast way to query data. And people STILL look to it for querying Hadoop and big data:

Read the details in 10 ways to query hadoop with sql.
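
For instance, with Hive (one of the tools covered there) you can run plain SQL over files in HDFS. A minimal sketch from Python using the pyhive client; the hostname, port and table name here are placeholders of mine, not from the article:

from pyhive import hive

# connect to a HiveServer2 instance (hostname, port and table are placeholders)
cursor = hive.connect(host='hadoop.example.org', port=10000).cursor()
cursor.execute('SELECT screen_name, count(*) FROM tweets GROUP BY screen_name LIMIT 10')
for row in cursor.fetchall():
    print row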

Posted in Me.

Instead of using the old Hadoop way (MapReduce), I suggest using the newer and faster way: Apache Spark on top of Hadoop YARN. In a few lines you can open all tweets (zipped JSON files saved in several subdirectories, hdfs://path/to/YEAR/MONTH/DAY/*.gz) and query them in a SQL-like language:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="extractStatsFromTweets.py")
sqlContext = SQLContext(sc)
tweets = sqlContext.jsonFile("/tmp/twitter/opensource/2014/*/*.gz")
tweets.registerTempTable("tweets")
t = sqlContext.sql("SELECT distinct createdAt,user.screenName,hashtagEntities FROM tweets")
# count_items and javaTimestampToString are helpers from my repositories (see below)
tweets_by_days = count_items(t.map(lambda t: javaTimestampToString(t[0])))
stats_hashtags = count_items(t.flatMap(lambda t: t[2])
                             .map(lambda h: h[2].lower()))  # field 2 of a hashtag entity is its text

My source code is free and available (Python and Scala repositories).
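
count_items simply counts the occurrences of each item in an RDD; a minimal sketch of such a helper (my illustration here, the repositories hold the real version):

from operator import add

def count_items(rdd):
    # map each item to (item, 1) and sum the counts per item
    return rdd.map(lambda x: (x, 1)).reduceByKey(add)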

Posted in Me.

Apache Spark has just passed Hadoop in popularity on the web (Google Trends).

My first Apache Spark use case was extracting the texts of tweets I've been collecting in Hadoop HDFS. My Python script tweet-texts.py was:

import json

from pyspark import SparkContext

def valid(tweet):
  # keep only records that have a 'text' field (skips e.g. delete notices)
  return 'text' in tweet

def gettext(tweet):
  return tweet['text']

sc = SparkContext(appName="Tweets")
data = sc.textFile("hdfs://hadoop.redaelli.org:9000/user/matteo/staging/twitter/searches/TheCalExperience.json/*/*/*.gz")

# parse each JSON line once, keep only valid tweets, extract their text
result = data.map(json.loads) \
    .filter(valid) \
    .map(gettext)

output = result.collect()
for text in output:
    print text.encode('utf-8')

And launched it with:

spark-1.1.0> bin/spark-submit --master local[4] tweet-texts.py
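
Note that collect() brings all the results back to the driver, which is fine for a small extract like this; for a bigger dataset you would rather write the output back to HDFS, e.g. (the output path is just an example):

result.saveAsTextFile("hdfs://hadoop.redaelli.org:9000/user/matteo/tweet-texts")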

Posted in Me.

The two top Hadoop distributions, Cloudera and Hortonworks (but remember that Hadoop is free software, and many companies pay nothing to use it!), include Apache Solr as their Hadoop search tool.

See the apache-solr-hadoop-search article and the following two presentations from the two vendors.

See also the article Natural Language Processing and Sentiment Analysis for Retailers using HDP and ITC Infotech Radar.
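
Once documents are indexed, querying Solr from Python is straightforward; a minimal sketch with the pysolr client (the URL, core name and query below are placeholders of mine):

import pysolr

# connect to a Solr core (placeholder URL and core name)
solr = pysolr.Solr('http://localhost:8983/solr/collection1', timeout=10)

# full-text search; print the id of each matching document
for doc in solr.search('hadoop', rows=10):
    print doc['id']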

Posted in Me.

I opened a service request with Oracle, but they did not provide an official way to add the Google Analytics JavaScript code to Oracle OBIEE (release 11.1.1.7): I wanted to add it in only one place and have it appear on every OBIEE page.

The solution I found and tested is to add the JavaScript code (without the <script> and </script> tags) to the file

bi_server1/tmp/_WL_user/analytics_11.1.1/7dezjl/war/res/b_mozilla/common.js
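
For example, the standard analytics.js tracking snippet, stripped of its <script> tags and appended at the end of that file (replace UA-XXXXX-Y with your own tracking ID; the exact snippet Google gives you may differ):

(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');

ga('create', 'UA-XXXXX-Y', 'auto');
ga('send', 'pageview');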

Be aware that the file could be overwritten by any software upgrade.

Posted in Me.