Managing Spark dataframes in Python

Below is a quick example of using Apache Spark (2.0) DataFrames to manipulate data. The sample data is a file of JSON Lines records like:

{"description": "255/40 ZR17 94W", "ean": "EAN: 4981910401193", "season": "tires_season summer", "price": "203,98", "model": "Michelin Pilot Sport PS2 255/40 R17", "id": "MPN: 2351610"}
{"description": "225/55 R17 101V XL", "ean": "EAN: 5452000438744", "season": "tires_season summer", "price": "120,98", "model": "Pirelli P Zero 205/45 R17", "id": "MPN: 530155"}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract, regexp_replace, trim

# warehouse_location points to the default location for managed databases and tables
warehouse_location = 'spark-warehouse'

spark = SparkSession \
    .builder \
    .appName("Python Spark") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

records_orig ="scraped_tyres_data.json")

## removing bad records
# (the column of the first filter was lost in the original post: filtering on
#  an empty ean is an assumption; the second filter drops retreaded tyres)
records = records_orig \
  .filter(col("ean") != '') \
  .filter(regexp_extract('description', '(rinnovati)', 1) == '')

## saving bad records  
records_orig.subtract(records).coalesce(1).write.csv("bad-records.csv", sep=";")

# extract new features
regexp_size = r"(\d+)/(\d+) R(\d+) (\d+)(\w+)\s*"

records = records \
  .withColumn("width",       regexp_extract("description", regexp_size, 1)) \
  .withColumn("ratio",       regexp_extract("description", regexp_size, 2)) \
  .withColumn("diameter",    regexp_extract("description", regexp_size, 3)) \
  .withColumn("load_index",  regexp_extract("description", regexp_size, 4)) \
  .withColumn("speed_index", regexp_extract("description", regexp_size, 5)) \
  .withColumn("brand",       regexp_extract("model", r"^(\w+) ", 1)) \
  .withColumn("season",      trim(regexp_replace("season", "tires_season",""))) \
  .withColumn("id",          trim(regexp_replace("id", "MPN: ",""))) \
  .withColumn("ean",         trim(regexp_replace("ean", "EAN: ",""))) \
  .withColumn("runflat",     regexp_extract("description", "(runflat)", 1)) \
  .withColumn("mfs",         regexp_extract("description", "(MFS|FSL|bordo di protezione|bordino di protezione)", 1)) \
  .withColumn("xl",          regexp_extract("description", " (XL|RF)\s*", 1)) \
  .withColumn("chiodabile",  regexp_extract("description", "(chiodabile)\s*", 1))

## extracting and saving all season values"season").distinct().coalesce(1).write.csv("season_values", sep=";")

# misc
# records.columns   # show columns
# records.groupBy("brand").count().show()
# records.groupBy("brand").count().filter("count > 100").show(20,False)
# renaming all columns before joining dataframes with the same column names
# records_renamed =*(col(x).alias(x + '_renamed') for x in records.columns))
# join two dataframes
# records.join(records_renamed, records.ean == records_renamed.ean_renamed)
# saving the result as csv (the writer also supports json, parquet, orc)
records.coalesce(1).write.csv("result.csv", sep=";")


How to batch-install OBIEE 12c (silent mode)

If you want to automatically install/deploy OBIEE systems in a datacenter or cloud, you can simply run a few commands like:

export TEMP=/home/oracle/tmp
export TEMPDIR=/home/oracle/tmp
export JAVA_HOME=/home/oracle/apps/jdk1.8.0

java -jar fmw_12. \
     -silent -responseFile /home/oracle/KIT/response_01_fmw_infrastructure.rsp \
     -invPtrLoc /home/oracle/oraInst.loc

./bi_platform- -silent \
      -responseFile /home/oracle/KIT/response_02_bi_platform.rsp \
      -invPtrLoc /home/oracle/oraInst.loc

export ORACLE_HOME=/home/oracle/Oracle/Middleware/Oracle_Home
$BI_PRODUCT_HOME/bin/ -silent \
    -responseFile /home/oracle/KIT/response_03_bi_platform_config.rsp \
    -invPtrLoc /home/oracle/oraInst.loc
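A silent install that dies halfway through is painful to clean up, so in an unattended pipeline it helps to verify the kit before launching the installers. A minimal sketch (the `check_kit` helper is mine; the paths are the ones from the commands above):

```shell
#!/usr/bin/env bash
# Preflight check before an unattended install: verify every kit file exists.
# Adjust the paths to your environment.

check_kit() {
  local missing=0
  for f in "$@"; do
    [ -e "$f" ] || { echo "missing: $f"; missing=1; }
  done
  return "$missing"
}

check_kit /home/oracle/oraInst.loc \
          /home/oracle/KIT/response_01_fmw_infrastructure.rsp \
          /home/oracle/KIT/response_02_bi_platform.rsp \
          /home/oracle/KIT/response_03_bi_platform_config.rsp \
  && echo "kit complete" \
  || echo "kit incomplete"
```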

Any faster alternative to #Hadoop HDFS?

I’d like an alternative to Hadoop HDFS: a faster distributed filesystem not written in Java.

Which is better? Any suggestions?



Apache Spark: how to import data from a JDBC database using Python

Using Apache Spark 2.0 and Python, I’ll show how to import a table from a relational database (via its JDBC driver) into a DataFrame and save it to a Parquet file. In this demo the database is an Oracle 12.x instance.


from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

# the jdbc url is truncated in the original post; dbtable and driver below
# are indicative values added to make the example complete
df = ("jdbc")
      .options(url="jdbc:oracle:thin:ro/",   # user "ro"; password/host elided
               dbtable="MYSCHEMA.MYTABLE",   # placeholder table name
               driver="oracle.jdbc.OracleDriver")
      .load())

# save the imported table as a parquet file
df.write.parquet("mytable.parquet")


And then run it with (here `` stands for whatever file the code above is saved as; the Oracle JDBC jar comes from the instant client):

spark-2.0.1-bin-hadoop2.7/bin/spark-submit --jars instantclient_12_1/ojdbc7.jar

Building websites for high traffic with REST APIs, AngularJs and Jekyll

If you have limited hardware resources and/or expect high traffic on your website, here are some quick suggestions (partly taken from the article “Meet the Obama campaign’s $250 million fundraising platform”):

  • Expose your business logic with REST services
  • Use a JavaScript framework like AngularJS for calling your REST APIs and building a dynamic site
  • Build a (not so) static website using Jekyll or similar tools and put your static files on S3 (if you are using Amazon AWS)
  • Use a CDN

A sample website (not hosted at AWS but home-hosted on a modest Raspberry Pi 2) is

Adding an application (angularjs+rest api) inside a WordPress site

If you need to integrate an application written with AngularJS and REST API services into your WordPress website, just create an empty page and edit it in “text” mode with something like:

<!-- the following two lines can be put in the header template --> 
<script src=""></script>
<script src=""></script>

<div ng-app="myApp" ng-controller="planetController">
    <div>
        <input ng-model="query" placeholder="inserisci una parola" type="text">
        <p><button ng-click="searchV(query)">Dividi in sillabe</button></p>
    </div>
</div>

A running example is available (for now; in the near future I’ll switch to a generated static website) at