Managing Spark dataframes in Python

Below is a quick sample of using Apache Spark (2.0) dataframes to manipulate data. The sample data is a file of JSON lines like:

{"description": "255/40 ZR17 94W", "ean": "EAN: 4981910401193", "season": "tires_season summer", "price": "203,98", "model": "Michelin Pilot Sport PS2 255/40 R17", "id": "MPN: 2351610"}
{"description": "225/55 R17 101V XL", "ean": "EAN: 5452000438744", "season": "tires_season summer", "price": "120,98", "model": "Pirelli P Zero 205/45 R17", "id": "MPN: 530155"}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract, regexp_replace, trim


# warehouse_location points to the default location for managed databases and tables
warehouse_location = 'spark-warehouse'

spark = SparkSession \
    .builder \
    .appName("Python Spark  ") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

records_orig = spark.read.json("scraped_tyres_data.json")

## removing bad records: missing id or retreaded ("rinnovati") tyres
records = records_orig \
  .filter(col("id") != '') \
  .filter(regexp_extract('description', '(rinnovati)', 1) == '')

## saving bad records  
records_orig.subtract(records).coalesce(1).write.csv("bad-records.csv", sep=";")

# extract new features
# e.g. "225/55 R17 101V XL" -> width 225, ratio 55, diameter 17, load index 101, speed index V
regexp_size = r"(\d+)/(\d+) R(\d+) (\d+)(\w+)\s*"

records = records \
  .withColumn("width",       regexp_extract("description", regexp_size, 1)) \
  .withColumn("ratio",       regexp_extract("description", regexp_size, 2)) \
  .withColumn("diameter",    regexp_extract("description", regexp_size, 3)) \
  .withColumn("load_index",  regexp_extract("description", regexp_size, 4)) \
  .withColumn("speed_index", regexp_extract("description", regexp_size, 5)) \
  .withColumn("brand",       regexp_extract("model", "^(\w+) ", 1)) \
  .withColumn("season",      trim(regexp_replace("season", "tires_season",""))) \
  .withColumn("id",          trim(regexp_replace("id", "MPN: ",""))) \
  .withColumn("ean",         trim(regexp_replace("ean", "EAN: ",""))) \
  .withColumn("runflat",     regexp_extract("description", "(runflat)", 1)) \
  .withColumn("mfs",         regexp_extract("description", "(MFS|FSL|bordo di protezione|bordino di protezione)", 1)) \
  .withColumn("xl",          regexp_extract("description", " (XL|RF)\s*", 1)) \
  .withColumn("chiodabile",  regexp_extract("description", "(chiodabile)\s*", 1))

## extracting and saving all season values
records.select("season").distinct().coalesce(1).write.csv("season_values", sep=";")

# misc
# records.columns   # show columns
# records.groupBy("brand").count().show()
# records.groupBy("brand").count().filter("count > 100").show(20,False)
#
# renaming all columns before joining dataframes with same column names
# records_renamed = records.select(*(col(x).alias(x + '_renamed') for x in records.columns))
# join two dataframes
# records.join(records_renamed, records.ean == records_renamed.ean_renamed)
#
#
# saving data to several formats
records.coalesce(1).write.csv("result.csv", sep=";")
records.write.json("result.json")
records.write.parquet("result.parquet")
records.write.format("com.databricks.spark.avro").save("result.avro")
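
The CSV, JSON and Parquet writers are built into Spark 2.0, while the Avro one is an external package that has to be pulled in at submit time, e.g. (the version shown is an assumption, pick the one matching your Spark/Scala build):

spark-submit --packages com.databricks:spark-avro_2.11:3.2.0 your-script.py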

 

How to batch-install OBIEE 12c (silent mode)

If you want to automatically install/deploy OBIEE systems in a datacenter or cloud, you can simply run a few commands like:

export TEMP=/home/oracle/tmp
export TEMPDIR=/home/oracle/tmp
export JAVA_HOME=/home/oracle/apps/jdk1.8.0

java -Djava.io.tmpdir=/home/oracle/tmp -jar fmw_12.2.1.2.0_infrastructure.jar \
     -silent -responseFile /home/oracle/KIT/response_01_fmw_infrastructure.rsp \
     -invPtrLoc /home/oracle/oraInst.loc
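
The .rsp files are the usual key=value silent-install response files. As a rough sketch, response_01_fmw_infrastructure.rsp may look like the following (the exact keys vary per installer version, so double-check against a response file saved from your own installer):

[ENGINE]
Response File Version=1.0.0.0.0
[GENERIC]
ORACLE_HOME=/home/oracle/Oracle/Middleware/Oracle_Home
INSTALL_TYPE=Fusion Middleware Infrastructure
DECLINE_SECURITY_UPDATES=true

and oraInst.loc simply points at the Oracle inventory:

inventory_loc=/home/oracle/oraInventory
inst_group=oinstall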

./bi_platform-12.2.1.2.0_linux64.bin -silent \
      -responseFile /home/oracle/KIT/response_02_bi_platform.rsp \
      -invPtrLoc /home/oracle/oraInst.loc \
      -ignoreSysPrereqs

export ORACLE_HOME=/home/oracle/Oracle/Middleware/Oracle_Home
export BI_PRODUCT_HOME=$ORACLE_HOME/bi
$BI_PRODUCT_HOME/bin/config.sh -silent \
    -responseFile /home/oracle/KIT/response_03_bi_platform_config.rsp \
    -invPtrLoc /home/oracle/oraInst.loc \
    -ignoreSysPrereqs

Any faster alternative to #Hadoop HDFS?

I’d like to find an alternative to Hadoop HDFS: a faster distributed filesystem, not written in Java (see the comparison in [1]).

Which is better? Any suggestions?

References:

  • [1] https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems

Apache Spark: how to import data from a JDBC database using Python

Using Apache Spark 2.0 and Python, I’ll show how to import a table from a relational database (using its JDBC driver) into a Spark dataframe and save it in a Parquet file. In this demo the database is an Oracle 12.x.

file jdbc-to-parquet.py:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()


df = spark.read.format("jdbc").options(url="jdbc:oracle:thin:ro/ro@mydboracle.redaelli.org:1521:MYSID", 
      dbtable="myuser.dim_country", 
      driver="oracle.jdbc.OracleDriver").load()

df.write.parquet("country.parquet")

And then run it with

spark-2.0.1-bin-hadoop2.7/bin/spark-submit --jars instantclient_12_1/ojdbc7.jar jdbc-to-parquet.py
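
For bigger tables it is worth parallelizing the import with the JDBC partitioning options. A sketch, assuming the table has a numeric key column country_id whose values fall roughly between the bounds below:

df = spark.read.format("jdbc").options(
      url="jdbc:oracle:thin:ro/ro@mydboracle.redaelli.org:1521:MYSID",
      dbtable="myuser.dim_country",
      driver="oracle.jdbc.OracleDriver",
      partitionColumn="country_id",   # assumed numeric column
      lowerBound="1",
      upperBound="300",
      numPartitions="4").load()

Spark will then issue four parallel queries, one per partition range.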

Building websites for high traffic with REST APIs, AngularJS and Jekyll

If you have few hardware resources and/or you expect high traffic on your website, here are some quick suggestions (also taken from the article Meet the Obama campaign’s $250 million fundraising platform):

  • Expose your business logic with REST services
  • Use a JavaScript framework like AngularJS to call your REST APIs and build a dynamic site
  • Build a (not so) static website using Jekyll or similar tools and put your static files on S3 if you are using Amazon AWS (see the deploy sketch after this list)
  • Use a CDN
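
For the Jekyll-plus-S3 step, the whole deploy can be as small as two commands (the bucket name is a placeholder):

jekyll build
aws s3 sync _site s3://your-bucket --acl public-read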

A sample website (not hosted at AWS but home-based, on a humble Raspberry Pi 2) is http://paroleonline.it

Adding an application (AngularJS + REST API) inside a WordPress site

If you need to integrate an application written with AngularJS and REST API services into your WordPress website, just create an empty page and edit it in “text” mode with something like

<!-- the following two lines can be put in the header template --> 
<script src="https://ajax.googleapis.com/ajax/libs/angularjs/1.5.3/angular.min.js"></script>
<script src="http://your.site.com/app.js"></script>

<div ng-app="myApp" ng-controller="planetController">
       <div >
           <input ng-model="query" placeholder="inserisci una parola" type="text">
            <p><button ng-click="searchV(query)" >Dividi in sillabe</button></p>
       </div>
</div
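
The page assumes an app.js that defines the myApp module and planetController. A minimal sketch of what it could look like (the REST endpoint and the result binding are assumptions, not the real app):

// app.js: module + controller calling a REST API
angular.module('myApp', [])
  .controller('planetController', function($scope, $http) {
    $scope.searchV = function(query) {
      // hypothetical endpoint returning the syllabified word
      $http.get('http://your.site.com/api/syllables/' + encodeURIComponent(query))
        .then(function(response) {
          $scope.result = response.data; // display with {{result}} in the page
        });
    };
  });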

A running example is at http://rapid.tips/site/sillabazione-parole-italiane/ (for now: in the near future I’ll switch to a generated static website).

Installing the Node.js oracledb module on SUSE SLES 11

For a quick tutorial about installing the Oracle module for Node.js (oracledb) on SUSE SLES, follow the instructions at

Node-OracleDB Installation

but remember to use gcc release 5:

export ORACLE_HOME=/home/oracle/instantclient_12_1
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ORACLE_HOME
export TNS_ADMIN=$ORACLE_HOME
export OCI_LIBRARY_PATH=$ORACLE_HOME
export OCI_LIB_DIR=$ORACLE_HOME
export OCI_INC_DIR=$ORACLE_HOME/sdk/include
CC=gcc-5 CXX=g++-5 npm install oracledb
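
Once the build succeeds, a quick smoke test (user, password and connect string are placeholders):

// test.js: open a connection and run a trivial query
var oracledb = require('oracledb');

oracledb.getConnection(
  { user: 'scott', password: 'tiger', connectString: 'mydb.redaelli.org:1521/MYSID' },
  function (err, connection) {
    if (err) { console.error(err.message); return; }
    connection.execute('select 1 from dual', function (err, result) {
      if (err) { console.error(err.message); return; }
      console.log(result.rows); // [ [ 1 ] ]
    });
  });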

ldiff2sql: How to import LDAP data into a database

Export your data:

ldapsearch -o ldif-wrap=no -E pr=1000/noprompt -x -h ldapserver.redaelli.org -D "CN=admin,OU=users,DC=redaelli,DC=org" -w mypwd -s sub -b "DC=redaelli,DC=org" "(objectclass=computer)" dNSHostName description operatingSystem operatingSystemVersion  -LLL > ad-computer-sa.ldiff
rm hosts.csv

Convert to SQL
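
The conversion itself can be a few lines of Python that turn the unwrapped LDIF export into INSERT statements. A minimal sketch (the target table and column names are made up, adapt them to your schema):

# ldiff2sql.py: parse the LDIF export and print SQL INSERTs
ATTRS = ["dNSHostName", "description", "operatingSystem", "operatingSystemVersion"]

def entries(path):
    entry = {}
    for line in open(path):
        line = line.rstrip("\n")
        if not line:                    # a blank line ends an entry
            if entry:
                yield entry
            entry = {}
        elif ": " in line:              # naive split: base64 ("attr:: ...") values end up under "attr:" and are ignored
            key, value = line.split(": ", 1)
            entry[key] = value
    if entry:
        yield entry

for e in entries("ad-computer-sa.ldiff"):
    values = ", ".join("'%s'" % e.get(a, "").replace("'", "''") for a in ATTRS)
    print("INSERT INTO ad_computers (hostname, description, os, os_version) VALUES (%s);" % values)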