Amazon Redshift

Create an Amazon Redshift Data Source

  1. Navigate to: Resources > Data Sources
  2. Select the Amazon Redshift icon

Data Source Inputs

  • Name (required)
  • Description (optional)
  • Host (required)
    • Redshift Cluster Endpoint
    • Do not append the port or database name to the end of the host URL
    • Example: redshift-cluster.sda23451.us-west-1.redshift.amazonaws.com (the sketch after this list shows how these inputs map to a standard connection)
  • Port (required)
  • Database (optional)
  • Credential Type: See Data Source Security
  • Username (required - can be entered later)
    • The master username and password are set when the Redshift cluster is created
  • Password (required - can be entered later)
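
For reference, these inputs map onto a standard PostgreSQL-style connection. Below is a minimal sketch of an equivalent direct connection made with the psycopg2 package (assuming it is available in your environment); the port, database name, and credentials are placeholders, not values from this guide:

%python
import psycopg2

# Placeholder values: substitute your own cluster endpoint, port, database, and credentials
conn = psycopg2.connect(
    host="redshift-cluster.sda23451.us-west-1.redshift.amazonaws.com",  # Host, with no port or database appended
    port=5439,                                                          # Port
    dbname="dev",                                                       # Database
    user="<username>",
    password="<password>"
)
conn.close()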

Use Amazon Redshift

Read

Code examples are provided for Python, Scala, PySpark, R, and SQL; each block begins with the corresponding interpreter directive (%python, %spark, %pyspark, %r, or %datasource).

%python
import pandas as pd

# Open a connection and create a cursor object to perform database operations
conn = z.getDatasource("zepl_docs_redshift")
cur = conn.cursor()

### Option 1: Read into Python Pandas DataFrame object ###
df = pd.read_sql_query("SELECT * FROM zepl_doc;", conn)

### Option 2: Read into Python list object ###
cur.execute("SELECT * FROM zepl_doc;")
res = cur.fetchall()

# Display results from the DataFrame (df) or list object (res)
z.show(df)

# Make the changes to the database persistent
conn.commit()

# Close communication with the database
cur.close()
conn.close()

Use z.getDatasource() to return the Redshift connection object, of type psycopg2.extensions.connection.
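
Because the returned connection is a standard psycopg2 connection, parameterized queries also work. This is a minimal sketch, assuming the same zepl_docs_redshift data source and the salesid column used in the write example further below:

%python
import pandas as pd

conn = z.getDatasource("zepl_docs_redshift")

# Pass query parameters separately instead of formatting them into the SQL string
df = pd.read_sql_query("SELECT * FROM zepl_doc WHERE salesid = %s;", conn, params=(100,))
z.show(df)

conn.close()
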
%spark
val df = z.getDatasource("zepl_docs_redshift")
    .asInstanceOf[org.apache.spark.sql.DataFrameReader]
    .option("dbtable", "<table name>")
    .load()

z.show(df)

Option 1: Connect to Redshift directly from Python:

%pyspark
import pandas as pd

# Open a connection and create a cursor object to perform database operations
conn = z.getDatasource("zepl_docs_redshift")
cur = conn.cursor()

### Option 1: Read into Python Pandas DataFrame object ###
df = pd.read_sql_query("SELECT * FROM zepl_doc;", conn)

### Option 2: Read into Python list object ###
cur.execute("SELECT * FROM zepl_doc;")
res = cur.fetchall()

# Display results from the DataFrame (df) or list object (res)
z.show(df)

# Make the changes to the database persistent
conn.commit()

# Close communication with the database
cur.close()
conn.close()

Option 2: Connect to Redshift from Scala and read the temp table in PySpark:

%spark
val df = z.getDatasource("zepl_docs_redshift")
    .asInstanceOf[org.apache.spark.sql.DataFrameReader]
    .option("dbtable", "zepl_doc")
    .load()

// Register a temp table (createOrReplaceTempView in newer Spark versions) so PySpark can read this DataFrame
df.registerTempTable("pass_to_pyspark")

%pyspark
# Read the DataFrame registered from Scala
df = sqlContext.sql("select * from pass_to_pyspark")

z.show(df)

First, install the required packages and connect to Redshift:

%r
# Reference Doc: https://aws.amazon.com/blogs/big-data/connecting-r-with-amazon-redshift/
install.packages("RJDBC")
library(RJDBC)

# Download the Amazon Redshift JDBC driver
download.file('http://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC41-1.1.9.1009.jar','RedshiftJDBC41-1.1.9.1009.jar')

# Connect to Amazon Redshift
driver <- JDBC("com.amazon.redshift.jdbc41.Driver", "RedshiftJDBC41-1.1.9.1009.jar", identifier.quote="`")

# Fill in the values below: "jdbc:redshift://<JDBCURL>:<PORT>/<DBNAME>?user=<USER>&password=<PW>"
url <- "jdbc:redshift://redshift-cluster.sdcedksis89.us-west-1.redshift.amazonaws.com:5439/dev?user=<username>&password=<password>"
conn <- dbConnect(driver, url)

Then, execute queries:

%r
res <- dbGetQuery(conn, "select * from zepl_doc")
z.show(res)

For more information, see the AWS Reference Documentation.

%datasource.zepl_docs_redshift

SELECT * FROM zepl_doc

Write

Code examples are provided for Python and R; each block begins with the corresponding interpreter directive (%python or %r).

%python
# Open a connection and create a cursor object to perform database operations
conn = z.getDatasource("zepl_docs_redshift")
cur = conn.cursor()

# Create the table zepl_doc if it does not exist
cur.execute(""" create table IF NOT exists zepl_doc(
    salesid integer not null,
    listid varchar not null distkey) """)

# Insert the row (100, "xyz") into the table zepl_doc
cur.execute("INSERT INTO zepl_doc (salesid, listid) VALUES (%s, %s)", (100, "xyz"))

# Persist changes to the database
conn.commit()

# Close communication with the database
cur.close()
conn.close()

z.getDatasource(string) returns a connection object of type psycopg2.extensions.connection.
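
To insert multiple rows at once, the same cursor's executemany() can be used. This is a minimal sketch, assuming the zepl_doc table created in the example above already exists; the rows below are made-up values:

%python
# Open a connection and create a cursor object
conn = z.getDatasource("zepl_docs_redshift")
cur = conn.cursor()

# Hypothetical rows to append to the zepl_doc table
rows = [(101, "abc"), (102, "def")]

# executemany runs the INSERT once per tuple in rows
cur.executemany("INSERT INTO zepl_doc (salesid, listid) VALUES (%s, %s)", rows)

conn.commit()
cur.close()
conn.close()

For large bulk loads, staging data in Amazon S3 and using the Redshift COPY command is generally faster, but that is outside the scope of this example.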

First, install the required packages and connect to Redshift:

%r
# Reference Doc: https://aws.amazon.com/blogs/big-data/connecting-r-with-amazon-redshift/
install.packages("RJDBC")
library(RJDBC)

# Download the Amazon Redshift JDBC driver
download.file('http://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC41-1.1.9.1009.jar','RedshiftJDBC41-1.1.9.1009.jar')

# Connect to Amazon Redshift
driver <- JDBC("com.amazon.redshift.jdbc41.Driver", "RedshiftJDBC41-1.1.9.1009.jar", identifier.quote="`")

# Fill in the values below: "jdbc:redshift://<JDBCURL>:<PORT>/<DBNAME>?user=<USER>&password=<PW>"
url <- "jdbc:redshift://redshift-cluster.sdcedksis89.us-west-1.redshift.amazonaws.com:5439/dev?user=<username>&password=<password>"
conn <- dbConnect(driver, url)

Then, write to Redshift:

%r
# Append the values from the res object into the zepl_doc table in Redshift
dbWriteTable(conn, "zepl_doc", res, append=TRUE)

Configure Authentication

  • Redshift must be accessible from Zepl's whitelisted IP addresses. This is typically configured with a Security Group in the AWS Console.
  • The user associated with your username and password must have access to the desired databases and tables.
  • That user should have read and write permissions on the desired databases and tables (see the sketch below).
AWS is constantly evolving; please review their documentation to ensure best practices.
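
As a sketch of what those table permissions might look like, an administrator can issue grants through the same connection; the my_zepl_user name below is a hypothetical example, not something defined in this guide:

%python
# Run with a connection that has permission to grant privileges
conn = z.getDatasource("zepl_docs_redshift")
cur = conn.cursor()

# Grant read and write access on the zepl_doc table to a hypothetical user
cur.execute("GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE zepl_doc TO my_zepl_user")

conn.commit()
cur.close()
conn.close()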