Monthly Archives: October 2015

TwitterPopularTags.scala example of Apache Spark Streaming in a standalone project

This is a short tutorial on using Apache Spark Streaming with Scala: it takes the official TwitterPopularTags.scala example and packages it as a standalone sbt project.

 

In a few minutes you will be able to receive streams of tweets and manipulate them in real time with Apache Spark Streaming:

  • Install Apache Spark (I used 1.5.1)
  • Install sbt
  • git clone https://github.com/matteoredaelli/TwitterPopularTags
  • cd TwitterPopularTags
  • cp twitter4j.properties.sample twitter4j.properties
  • edit twitter4j.properties
  • sbt package
  • spark-submit --master local --packages "org.apache.spark:spark-streaming-twitter_2.10:1.5.1" ./target/scala-2.10/twitterpopulartags_2.10-1.0.jar italy
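
The "edit twitter4j.properties" step is where the Twitter credentials go. As a rough sketch, the file usually needs four OAuth values. The key names below are the standard Twitter4J ones; that the sample file in the repository uses exactly these keys is an assumption, and the function name write_twitter4j_properties and the XXX values are placeholders for illustration only.

```shell
# Write a twitter4j.properties skeleton with the four standard Twitter4J
# OAuth keys. The XXX values are placeholders: replace them with the
# credentials of your own Twitter application.
# Note: this overwrites the target file if it already exists.
write_twitter4j_properties() {
  target="${1:-twitter4j.properties}"
  cat > "$target" <<'EOF'
oauth.consumerKey=XXX
oauth.consumerSecret=XXX
oauth.accessToken=XXX
oauth.accessTokenSecret=XXX
EOF
}
```

Calling write_twitter4j_properties with no argument writes twitter4j.properties in the current directory; pass a path to write it elsewhere.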

How to collect Twitter data in 15 minutes

For this tutorial I assume you are using a Debian/Ubuntu Linux system, but it can easily be adapted to other operating systems.

Install the software

apt-get install openjdk-7-jdk  
wget http://apache.panu.it/karaf/4.0.2/apache-karaf-4.0.2.tar.gz
tar xvfz apache-karaf-4.0.2.tar.gz

Start the server

cd apache-karaf-4.0.2/
./bin/start

Install additional connectors

ssh -p 8101 karaf@localhost
feature:repo-add camel 2.16.0
feature:install camel camel-blueprint camel-twitter camel-jackson camel-dropbox
exit

Configure our routes

Create two new files:

twitter-to-file.xml

<?xml version="1.0" encoding="UTF-8"?>
<blueprint xmlns="http://www.osgi.org/xmlns/blueprint/v1.0.0"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:camel="http://camel.apache.org/schema/blueprint"
       xsi:schemaLocation="
       http://www.osgi.org/xmlns/blueprint/v1.0.0 http://www.osgi.org/xmlns/blueprint/v1.0.0/blueprint.xsd
       http://camel.apache.org/schema/blueprint http://camel.apache.org/schema/blueprint/camel-blueprint.xsd">

  <camelContext id="twitter-to-file" streamCache="true" xmlns="http://camel.apache.org/schema/blueprint">

    <dataFormats>
      <json id="jack" library="Jackson" />
      <jaxb id="myJaxb" prettyPrint="true" contextPath="org.apache.camel.example"/>
    </dataFormats>

    <route id="twitter-tweets-to-file">
      <from uri="vm:twitter-tweets-to-file" />
      <setHeader headerName="CamelFileName">
         <simple>${in.header.twitter-id}</simple>
      </setHeader>
      <split>
        <simple>${body}</simple>
        <to uri="vm:twitter-tweet-to-file" />
      </split>
    </route>

    <route id="twitter-tweet-to-file">
      <from uri="vm:twitter-tweet-to-file" />
      <log message="Saving tweet id= ${body.id}" />
      <!-- transforming the body (a single tweet) to a json doc -->
      <marshal ref="jack" />
      <convertBodyTo type="java.lang.String" charset="UTF8" />
      <transform>
        <simple>${body}\n</simple>
      </transform>
      <setHeader headerName="CamelFileName">
        <simple>${in.header.CamelFileName}/${date:now:yyyy}/${date:now:MM}/${date:now:dd}</simple>
      </setHeader>
      <to uri="file:twitter-data?autoCreate=true&amp;fileExist=Append" />
    </route>
  </camelContext>
</blueprint>

twitter-streaming-sample.xml

<blueprint xmlns="http://www.osgi.org/xmlns/blueprint/v1.0.0">
  <camelContext id="twitter-search-sample" xmlns="http://camel.apache.org/schema/blueprint">
    <route id="twitter-search-sample">
      <from uri="twitter://streaming/sample?count=100&amp;type=polling&amp;consumerKey=XXX&amp;consumerSecret=XXX&amp;accessToken=XXX&amp;accessTokenSecret=XXX" />
      <setHeader headerName="twitter-id">
        <simple>sample</simple>
      </setHeader>
      <to uri="vm:twitter-tweets-to-file" />
    </route>

  </camelContext>
</blueprint>

and copy them into the “deploy” directory. Check the logs in data/log/karaf.log and see the results in the files twitter-data/sample/yyyy/MM/dd (one file per day, with each tweet appended as a JSON document).
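
To check that tweets are really arriving, you can count how many have been saved in each daily file. This is only a minimal sketch: it assumes the layout produced by the routes above (twitter-id/yyyy/MM/dd under twitter-data) and that every tweet occupies exactly one line, which should hold as long as the JSON is not pretty-printed. The function name count_tweets is made up for this example.

```shell
# Print "<file> <number of lines>" for every file under the given root
# (default: twitter-data). With one JSON document per line, the line
# count of a file equals the number of tweets saved for that day.
count_tweets() {
  root="${1:-twitter-data}"
  # Regular files only, sorted so the days appear in order
  find "$root" -type f | sort | while IFS= read -r f; do
    # Reading from stdin makes wc print only the number;
    # tr strips the padding that some wc versions add
    n=$(wc -l < "$f" | tr -d ' ')
    printf '%s %s\n' "$f" "$n"
  done
}
```

Run count_tweets from the directory that contains twitter-data, or pass the path as the first argument, for example count_tweets twitter-data/sample.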

 

Good luck

Matteo