PostgreSQL

Create a PostgreSQL Data Source

  1. Navigate to: Resources > Data Sources

  2. Select the PostgreSQL icon

Data Source Inputs

  • Name (required)

  • Host (required):

    • Enter the hostname of your PostgreSQL database.

    • Example (using PostgreSQL in AWS RDS): postgresql-zepl-docs.si89zol2ksop.us-west-1.rds.amazonaws.com

  • Port (required): 5432 is the default port for PostgreSQL

  • Database (optional):

    • If this is filled in, Zepl will append the database name to your connection string (see the sketch after this list).

    • If this is blank, the connection may use the default database set by your administrator.

    • Example: postgresql-zepl-docs.si89zol2ksop.us-west-1.rds.amazonaws.com/<database name>

  • Credential Type: See Data Source Security

  • Username (required - can be entered later): Username used for connecting to PostgreSQL

  • Password (required - can be entered later): Password used for connecting to PostgreSQL
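
Taken together, these inputs form a standard PostgreSQL connection string. A minimal sketch of how the pieces fit (illustration only, not Zepl's exact internals; the values are the hypothetical RDS example from the list above):

%python
# Hypothetical values matching the example inputs above
host = "postgresql-zepl-docs.si89zol2ksop.us-west-1.rds.amazonaws.com"
port = 5432
database = "<database name>"  # optional; omit to use the server's default database

# Conceptually, the resulting connection string looks like this
print(f"postgresql://{host}:{port}/{database}")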

Use PostgreSQL

Read

Python
%python
import pandas as pd

# Open a connection and create a cursor object to perform database operations
conn = z.getDatasource("zepl_docs_postgresql")
cur = conn.cursor()

### Option 1: Read into a pandas DataFrame ###
df = pd.read_sql_query("SELECT * FROM zepl_doc;", conn)

### Option 2: Read into a Python list ###
cur.execute("SELECT * FROM zepl_doc;")
res = cur.fetchall()

# Display results from the DataFrame (df) or the list (res)
z.show(df)

# Commit to end the transaction cleanly (a read makes no changes)
conn.commit()

# Close communication with the database
cur.close()
conn.close()

Use z.getDatasource() to return a PostgreSQL connection object: psycopg2.extensions.connection
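
Because z.getDatasource() returns a standard psycopg2 connection, you can also bind query parameters instead of formatting values into the SQL string yourself. A minimal sketch, assuming the same zepl_doc table and data source name as above:

%python
# Open a connection and cursor from the data source
conn = z.getDatasource("zepl_docs_postgresql")
cur = conn.cursor()

# psycopg2 escapes and binds the value, avoiding SQL injection
cur.execute("SELECT * FROM zepl_doc WHERE num > %s;", (50,))
z.show(cur.fetchall())

cur.close()
conn.close()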

Scala
%spark
// Fill in your variables, replace <variable name>
var url = "jdbc:postgresql://<host>:<port>/<database>"
var table_name = "<table name>"
var driver = "org.postgresql.Driver" // Do not change
var user = "<username>"
var pass = "<password>"

// Create a JDBC connection to PostgreSQL using spark.read
val df = spark.read
  .format("jdbc")
  .option("driver", driver)
  .option("url", url)
  .option("dbtable", table_name)
  .option("user", user)
  .option("password", pass)
  .load()

// Display DataFrame
z.show(df)

This method does not use the Zepl data source. Please remove all PostgreSQL data sources attached to the notebook.

PySpark

Option 1: Use Zepl's Data Source

Please follow the guidance in the Python tab.

Option 2: Using SparkSQL

%pyspark
# Fill in your variables, replace <variable name>
url = "jdbc:postgresql://<host>:<port>/<database>"
table_name = "<table name>"
driver = "org.postgresql.Driver" # Do not change
user = "<username>"
password = "<password>"

# Create a JDBC connection to PostgreSQL and read the table into a DataFrame
df = spark.read.format('jdbc').options(
    url=url,
    driver=driver,
    dbtable=table_name,
    user=user,
    password=password).load()

# Display DataFrame
z.show(df)
R
TODO:
SQL
%datasource.<data source name here>

-- Query the database table zepl_doc
SELECT * FROM zepl_doc

Write

Python
%python
# Open a connection and create a cursor object to perform database operations
conn = z.getDatasource("zepl_docs_postgresql")
cur = conn.cursor()

# Create the table zepl_doc if it does not already exist
cur.execute("CREATE TABLE IF NOT EXISTS zepl_doc (id serial PRIMARY KEY, num integer, data varchar);")

# Insert the row (100, 'xyz') into the table zepl_doc
cur.execute("INSERT INTO zepl_doc (num, data) VALUES (%s, %s)", (100, "xyz"))

# Persist changes to the database
conn.commit()

# Close communication with the database
cur.close()
conn.close()
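
To load many rows at once, psycopg2's executemany can batch the inserts. A minimal sketch, assuming the same zepl_doc table and data source as above:

%python
# Open a connection and cursor from the data source
conn = z.getDatasource("zepl_docs_postgresql")
cur = conn.cursor()

# Each tuple maps to the (num, data) columns
rows = [(1, "a"), (2, "b"), (3, "c")]
cur.executemany("INSERT INTO zepl_doc (num, data) VALUES (%s, %s)", rows)

# Persist changes and close
conn.commit()
cur.close()
conn.close()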
Scala
%spark
import java.util.Properties

// Create a Properties object for the write function
val properties = new Properties();

// Fill in your variables, replace <variable name>
var url = "jdbc:postgresql://<host>:<port>/<database>"
var table_name = "<table name>"
var driver = "org.postgresql.Driver" // Do not change
var user = "<username>"
var pass = "<password>"

// Set properties
properties.setProperty("driver", driver);
properties.setProperty("user", user);
properties.setProperty("password", pass);

// If the DataFrame contains the primary key field (here, 'id'), drop it before calling 'df.write' to avoid a primary key collision
var data = df.drop("id")

// Write the Spark DataFrame to the PostgreSQL table
data.write.mode("append").jdbc(url, table_name, properties)

This method does not use the Zepl data source. Please remove all PostgreSQL data sources from the notebook.

PySpark

Option 1: Use Zepl's Data Source

Please follow the guidance in the Python tab.

Option 2: Using SparkSQL

%pyspark
# Fill in your variables, replace <variable name>
url = "jdbc:postgresql://<host>:<port>/<database>"
table_name = "<table name>"
driver = "org.postgresql.Driver" # Do not change
user = "<username>"
password = "<password>"

# If the DataFrame contains the primary key field (here, 'id'), drop it before calling 'df.write' to avoid a primary key collision
data = df.drop("id")

# Write the Spark DataFrame to the PostgreSQL table
data.write.format('jdbc').options(
    url=url,
    driver=driver,
    dbtable=table_name,
    user=user,
    password=password) \
    .mode('append') \
    .save()
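
The example above assumes a DataFrame named df already exists in the notebook. A minimal sketch creating one with columns matching the zepl_doc table (hypothetical sample values):

%pyspark
# Hypothetical in-memory DataFrame with (num, data) columns matching zepl_doc
df = spark.createDataFrame([(101, "abc"), (102, "def")], ["num", "data"])
z.show(df)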
R
TODO:

Configure Authentication in PostgreSQL

In your PostgreSQL database, make sure the following items are configured properly:

  • PostgreSQL should be accessible from Zepl's whitelisted IP addresses

  • The user associated with your username and password has access to the desired databases and tables

  • The user should have read and write permissions on the desired databases and tables (see the sketch after this list)
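
For reference, granting these permissions typically looks like the following sketch, run over an administrative connection. Here 'zepl_user' is a hypothetical role name, and the sequence name follows PostgreSQL's default naming for the serial 'id' column of zepl_doc:

%python
# Run these statements as a database administrator; 'zepl_user' is hypothetical
conn = z.getDatasource("zepl_docs_postgresql")  # assumes this data source holds admin credentials
cur = conn.cursor()

# Read and write permissions on the table used in the examples above
cur.execute("GRANT SELECT, INSERT, UPDATE, DELETE ON zepl_doc TO zepl_user;")

# Inserts into a serial primary key also need rights on its backing sequence
cur.execute("GRANT USAGE, SELECT ON SEQUENCE zepl_doc_id_seq TO zepl_user;")

conn.commit()
cur.close()
conn.close()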