How to manage tweets saved in #Hadoop using #Apache #Spark SQL

Instead of using the old Hadoop way (MapReduce), I suggest the newer and faster way: Apache Spark on top of Hadoop YARN. In a few lines you can open all tweets (gzipped JSON files saved in several subdirectories, hdfs://path/to/YEAR/MONTH/DAY/*gz) and query them in an SQL-like language.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="")
sqlContext = SQLContext(sc)
# load every gzipped JSON tweet file found under the subdirectories
tweets = sqlContext.jsonFile("/tmp/twitter/opensource/2014/*/*.gz")
tweets.registerTempTable("tweets")  # expose the data to SQL queries
t = sqlContext.sql("SELECT DISTINCT createdAt, user.screenName, hashtagEntities FROM tweets")
# count_items and javaTimestampToString are helpers from the repositories linked below
tweets_by_days = count_items(t.map(lambda row: javaTimestampToString(row[0])))
stats_hashtags = count_items(t.flatMap(lambda row: row[2]).map(lambda h: h.text.lower()))
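The `count_items` helper comes from my repositories; as a rough illustration of what it computes, here is a plain-Python local equivalent (the name `count_items_local` is mine, not from the repositories — the real helper runs distributed on an RDD):

```python
from collections import Counter

def count_items_local(items):
    # Count occurrences of each item and return (item, count) pairs,
    # sorted by descending count -- the same shape of result that the
    # distributed count_items helper produces over an RDD.
    return Counter(items).most_common()

# example: hashtag frequencies from a small sample
sample = ["spark", "hadoop", "spark", "sql"]
top = count_items_local(sample)
```

On the cluster the same idea is typically expressed with `rdd.map(lambda x: (x, 1)).reduceByKey(...)` or `rdd.countByValue()`.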

My source code is free and available (Python and Scala repositories).