Monthly Archives: January 2015

How to manage tweets saved in #Hadoop using #Apache #Spark SQL

Instead of the old Hadoop way (MapReduce), I suggest the newer and faster way (Apache Spark on top of Hadoop YARN): in a few lines you can open all the tweets (gzipped JSON files saved in several subdirectories, hdfs://path/to/YEAR/MONTH/DAY/*gz) and query them in a SQL-like language.

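from pyspark import SparkContext
from pyspark.sql import SQLContext

# count_items and javaTimestampToString are helper functions not shown in this excerpt
sc = SparkContext(appName="extraxtStatsFromTweets.py")
sqlContext = SQLContext(sc)
# Load every gzipped JSON file matching the glob; Spark infers the schema
tweets = sqlContext.jsonFile("/tmp/twitter/opensource/2014/*/*.gz")
tweets.registerTempTable("tweets")
t = sqlContext.sql("SELECT distinct createdAt,user.screenName,hashtagEntities FROM tweets")
# Tweets per day: column 0 of each row is createdAt
tweets_by_days = count_items(t.map(lambda t: javaTimestampToString(t[0])))
# Hashtag frequencies: flatten column 2 (hashtagEntities), then lowercase field 2 of each entity
stats_hashtags = count_items(t.flatMap(lambda t: t[2]).map(lambda t: t[2].lower()))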
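
The snippet above calls two helpers, count_items and javaTimestampToString, without showing them. The sketch below is only a guess at what they might look like, not the code from my repositories. It assumes that createdAt holds a Java-style timestamp (milliseconds since the Unix epoch, interpreted as UTC) and that count_items should return (item, count) pairs sorted by descending count. For an RDD it relies on the standard countByValue() action; for a plain Python iterable it falls back to collections.Counter, so it can be tried without a cluster.

```python
from collections import Counter
from datetime import datetime, timezone


def javaTimestampToString(timestamp_ms):
    """Hypothetical helper: epoch milliseconds -> 'YYYY-MM-DD' (UTC).

    Assumption: createdAt is stored as milliseconds since the Unix epoch.
    """
    moment = datetime.fromtimestamp(timestamp_ms / 1000.0, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


def count_items(items):
    """Hypothetical helper: count occurrences, most frequent first.

    Accepts a Spark RDD (uses its countByValue() action) or any plain
    Python iterable (uses collections.Counter).
    """
    if hasattr(items, "countByValue"):
        counts = items.countByValue()  # dict-like: value -> count
    else:
        counts = Counter(items)
    # Sort by descending count, then by item, so the order is deterministic
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    print(javaTimestampToString(1420070400000))
    print(count_items(["spark", "hadoop", "spark"]))
```

Note that countByValue() brings all counts back to the driver, which is fine for per-day totals but may be too much for a very large set of distinct hashtags; in that case a reduceByKey followed by a bounded take would be a more cautious choice.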

My source code is free and available (Python and Scala repositories).