Monthly Archives: October 2016

Apache Spark: how to import data from a JDBC database using Python

Using Apache Spark 2.0 and Python, I'll show how to import a table from a relational database (through its JDBC driver) into a Spark DataFrame and save it as a Parquet file. In this demo the database is Oracle 12.x.

file jdbc-to-parquet.py:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()


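# Read one table through the Oracle thin JDBC driver.
# URL format: jdbc:oracle:thin:<user>/<password>@<host>:<port>:<SID>
df = spark.read.format("jdbc").options(
    url="jdbc:oracle:thin:ro/ro@mydboracle.redaelli.org:1521:MYSID",
    dbtable="myuser.dim_country",
    driver="oracle.jdbc.OracleDriver").load()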

df.write.parquet("country.parquet")
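
The script above embeds the user name and password in the JDBC URL. Spark's JDBC source also accepts them as separate `user` and `password` options, which keeps the URL free of credentials. Here is a minimal sketch of that variant; the helper names `oracle_jdbc_options` and `load_table` are hypothetical, and the host, SID, and table are the demo values from the script:

```python
def oracle_jdbc_options(host, port, sid, user, password, table):
    """Build the option dict for Spark's JDBC source (hypothetical helper)."""
    return {
        # Oracle thin URL in host:port:SID form, without embedded credentials
        "url": "jdbc:oracle:thin:@{}:{}:{}".format(host, port, sid),
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "oracle.jdbc.OracleDriver",
    }


def load_table(spark, options):
    """Read one table through JDBC into a Spark DataFrame."""
    return spark.read.format("jdbc").options(**options).load()


# Demo values taken from the script above
opts = oracle_jdbc_options("mydboracle.redaelli.org", 1521, "MYSID",
                           "ro", "ro", "myuser.dim_country")
print(opts["url"])
```

With a running session, `load_table(spark, opts)` should return the same DataFrame as the `spark.read` call in the script.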

Then run jdbc-to-parquet.py with:

spark-2.0.1-bin-hadoop2.7/bin/spark-submit --jars instantclient_12_1/ojdbc7.jar jdbc-to-parquet.py
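
To check the result, the Parquet output can be read back with `spark.read.parquet`. This is a minimal sketch rather than part of the original script; the helper name `summarize_parquet` is hypothetical, and it expects an existing `SparkSession` to be passed in:

```python
def summarize_parquet(spark, path):
    """Return (row count, column names) for a Parquet dataset."""
    df = spark.read.parquet(path)  # schema is read from the Parquet metadata
    return df.count(), df.columns
```

Inside a `pyspark` shell, or in a script submitted with `spark-submit`, calling `summarize_parquet(spark, "country.parquet")` should return the number of rows and the column names of the imported table, assuming the path is resolved against the same filesystem the job wrote to.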