
How to manage tweets saved in #Hadoop using #Apache #Spark

Apache Spark has just passed Hadoop in popularity on the web (according to Google Trends).

My first use of Apache Spark was extracting the text of the tweets I have been collecting in Hadoop HDFS. My Python script tweet-texts.py was:

import json

from pyspark import SparkContext

def valid(tweet):
    # real tweets carry a 'text' field; delete and rate-limit notices do not
    return 'text' in tweet

def gettext(tweet):
    # return only the tweet text
    return tweet['text']

sc = SparkContext(appName="Tweets")
data = sc.textFile("hdfs://hadoop.redaelli.org:9000/user/matteo/staging/twitter/searches/TheCalExperience.json/*/*/*.gz")

# parse each JSON line once, keep only real tweets, then extract their text
result = data.map(lambda line: json.loads(line)) \
    .filter(valid) \
    .map(gettext)

output = result.collect()
for text in output:
    print text.encode('utf-8')

And launched it with:

spark-1.1.0> bin/spark-submit --master local[4] tweet-texts.py
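
This runs locally with four worker threads. To run the same script on the Hadoop cluster itself, the launch would look roughly like the sketch below, assuming this Spark 1.1.0 build includes YARN support and that HADOOP_CONF_DIR points at the cluster configuration (the path is just an example):

spark-1.1.0> export HADOOP_CONF_DIR=/etc/hadoop/conf   # assumption: where the cluster config lives
spark-1.1.0> bin/spark-submit --master yarn-client tweet-texts.py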

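Collecting everything to the driver is fine for a small search, but for a larger archive the texts could instead be written back to HDFS. A minimal variation, replacing the collect() loop at the end of the script (the output path below is hypothetical and must not already exist):

# write one tweet text per line back to HDFS instead of collecting to the driver
# (the output directory is just an example and must not already exist)
result.saveAsTextFile("hdfs://hadoop.redaelli.org:9000/user/matteo/output/tweet-texts")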