Amazon S3

Create an Amazon S3 Datasource

  1. Navigate to: Resources > Data Sources

  2. Select the Amazon S3 icon.

Data Source Inputs

  • Name (required)

  • Bucket name (required): Name of the S3 bucket to connect to

  • Credential Type: See Data Source Security

  • Access Key ID (required, can be entered later)

  • Secret access key (required, can be entered later)

  • Session token (optional)
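For orientation, the credential inputs above line up with keyword arguments that boto3 itself accepts. The following is a minimal sketch of that mapping for use outside of Zepl; it is an assumption for illustration, not a description of how Zepl handles the inputs internally, and `build_s3_kwargs` is a hypothetical helper name.

```python
from typing import Dict, Optional


def build_s3_kwargs(access_key_id: str,
                    secret_access_key: str,
                    session_token: Optional[str] = None) -> Dict[str, str]:
    """Map the data source inputs onto boto3's credential keyword arguments."""
    if not access_key_id or not secret_access_key:
        raise ValueError("Access Key ID and Secret access key are required")
    kwargs = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }
    # The session token only applies to temporary credentials, so it is optional.
    if session_token:
        kwargs["aws_session_token"] = session_token
    return kwargs


# Outside of Zepl, with boto3 installed, the result could be used like this
# ("my-bucket" is a placeholder bucket name):
#   import boto3
#   bucket = boto3.resource("s3", **build_s3_kwargs(key_id, secret)).Bucket("my-bucket")
```

Inside a Zepl notebook none of this is needed, because z.getDatasource() returns the Bucket object directly.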

Use Amazon S3

Read

Python
%python
# Connect to S3 Bucket and return Boto3 Bucket object
bucket = z.getDatasource("zepl_documentation")

# Display files and folders in the zepl_documentation bucket
for s3_file in bucket.objects.all():
    print(s3_file.key)

# Download a .csv or .pkl file
bucket.download_file("finalized_model.pkl", "./finalized_model.pkl")
bucket.download_file("titanic3.csv", "./titanic3.csv")

# Validate the csv file is loaded on the container file system
print("List local container:")
!ls -al

z.getDatasource() returns a boto3 Bucket object (boto3.resources.factory.s3.Bucket).
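Because the returned object exposes each S3 object's key, listings can be narrowed down in plain Python. The sketch below filters keys by file extension; `keys_with_suffix` is a hypothetical helper, and the sample objects are stand-ins so the code runs without AWS access.

```python
from types import SimpleNamespace
from typing import Iterable, List


def keys_with_suffix(objects: Iterable, suffix: str) -> List[str]:
    """Return the keys of all objects whose key ends with the given suffix."""
    return [obj.key for obj in objects if obj.key.endswith(suffix)]


# Stand-in objects so the sketch runs without AWS access; in a notebook,
# pass bucket.objects.all() instead.
sample_objects = [
    SimpleNamespace(key="titanic3.csv"),
    SimpleNamespace(key="finalized_model.pkl"),
    SimpleNamespace(key="archive/old.csv"),
]
csv_keys = keys_with_suffix(sample_objects, ".csv")
print(csv_keys)
```

For large buckets, boto3 can also filter on the server side by key prefix with bucket.objects.filter(Prefix="archive/").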

Scala
%spark

// User must set AWS Access Key and Secret here
val myAccessKey = "<Your AWS Access Key>"
val mySecretKey = "<Your AWS Secret Key>"

// Set Spark Context Hadoop Configurations
// Note: these property names apply to NativeS3FileSystem. The S3A connector
// (org.apache.hadoop.fs.s3a.S3AFileSystem) uses fs.s3a.access.key and fs.s3a.secret.key instead.
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", myAccessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", mySecretKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

// Read CSV file to a DataFrame
val df = spark.read.option("header",true).csv("s3a://zepl-documentation/titanic3.csv")

// Display Results
z.show(df)
PySpark
%pyspark
# Connect to S3 Bucket and return Boto3 Bucket object
bucket = z.getDatasource("zepl_documentation")

# Display files and folders in the zepl_documentation bucket
for s3_file in bucket.objects.all():
    print(s3_file.key)

# Download a .csv or .pkl file
bucket.download_file("finalized_model.pkl", "./finalized_model.pkl")
bucket.download_file("titanic3.csv", "./titanic3.csv")

# Validate the csv file is loaded on the container file system
print("List local container:")
!ls -al
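As an alternative to reading the directory listing by eye, the download can be checked programmatically. This is a small sketch using only the standard library; `is_downloaded` is a hypothetical helper, and the path is the one used in the example above.

```python
import os


def is_downloaded(path: str) -> bool:
    """Return True if path is an existing, non-empty regular file."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


# In a notebook, call this right after bucket.download_file(...)
print(is_downloaded("./titanic3.csv"))
```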
R

First, set the Spark context variables in Scala before connecting to S3 buckets from R:

%spark
// User must set AWS Access Key and Secret here
val myAccessKey = "<Your AWS Access Key>"
val mySecretKey = "<Your AWS Secret Key>"

// Set Spark Context Hadoop Configurations
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", myAccessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", mySecretKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

Read from S3 using R

%r
# ATTENTION: Must set hadoopConfigurations using Scala before connecting to your S3 bucket
# Read DataFrame from S3
df1 <- SparkR::read.df("s3a://zepl-documentation/titanic3.csv", delimiter = ",", source = "csv", inferSchema = "true", na.strings = "", header="true")

The S3 data source is not supported for R. This method uses the Spark API to read and write instead.

Write

Python
%python
import pandas as pd
import boto3

# Connect to S3 Bucket and return Boto3 Bucket object
bucket = z.getDatasource("zepl_documentation")

# Source data: each key becomes one row of the DataFrame
data = {0: [1, 2, 3, 4],
        1: [5, 6, 7, 8],
        2: [9, 10, 11, 12],
        3: [13, 14, 15, 16],
        4: [17, 18, 19, 20],
        5: [21, 22, 23, 24]}

# Create Pandas DataFrame
df = pd.DataFrame.from_dict(data, orient='index')

# Write the DataFrame to a tab-separated file on the container file system
df.to_csv("local_data.csv", sep='\t', encoding='utf-8')

# Upload file to S3
bucket.upload_file("local_data.csv", "s3_data.csv")
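When the intermediate local file is not needed, the data can also be serialized in memory and sent with the Bucket object's put_object call. The sketch below assumes the object passed in is a boto3 Bucket; `rows_to_csv_bytes` and `upload_rows` are hypothetical helper names, and the standard csv module is used in place of pandas to keep the example self-contained.

```python
import csv
import io
from typing import Iterable, Sequence


def rows_to_csv_bytes(rows: Iterable[Sequence]) -> bytes:
    """Serialize rows to UTF-8 encoded CSV bytes without touching the file system."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue().encode("utf-8")


def upload_rows(bucket, key: str, rows: Iterable[Sequence]) -> None:
    """Upload rows as a CSV object; bucket is expected to be a boto3 Bucket."""
    bucket.put_object(Key=key, Body=rows_to_csv_bytes(rows))


# In a notebook (assumes the same data source as above):
#   upload_rows(z.getDatasource("zepl_documentation"), "s3_data.csv", [[1, 2], [3, 4]])
```

This avoids leaving temporary files on the container, at the cost of holding the whole payload in memory.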
Scala
%spark

// User must set AWS Access Key and Secret here
val myAccessKey = "<Your AWS Access Key>"
val mySecretKey = "<Your AWS Secret Key>"

// Set Spark Context Hadoop Configurations
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", myAccessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", mySecretKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

// Write DataFrame to the S3 bucket called 'zepl-documentation'.
// df must already be instantiated as an org.apache.spark.sql.DataFrame object
df.write.option("header",true).mode("overwrite").csv("s3a://zepl-documentation/write_from_scala")
PySpark
%pyspark
import pandas as pd
import boto3

# Connect to S3 Bucket and return Boto3 Bucket object
bucket = z.getDatasource("zepl_documentation")

# Source data: each key becomes one row of the DataFrame
data = {0: [1, 2, 3, 4],
        1: [5, 6, 7, 8],
        2: [9, 10, 11, 12],
        3: [13, 14, 15, 16],
        4: [17, 18, 19, 20],
        5: [21, 22, 23, 24]}

# Create Pandas DataFrame
df = pd.DataFrame.from_dict(data, orient='index')

# Write the DataFrame to a tab-separated file on the container file system
df.to_csv("local_data.csv", sep='\t', encoding='utf-8')

# Upload file to S3
bucket.upload_file("local_data.csv", "s3_data.csv")
R

First, set the Spark context variables in Scala:

%spark
// User must set AWS Access Key and Secret here
val myAccessKey = "<Your AWS Access Key>"
val mySecretKey = "<Your AWS Secret Key>"

// Set Spark Context Hadoop Configurations (the fs.s3a prefix matches the s3a:// URL used in the R cell)
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", myAccessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", mySecretKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

Then, write to S3 using R:

%r
# ATTENTION: Must set hadoopConfigurations using Scala before connecting to your S3 bucket
# Write DataFrame to the S3 bucket called 'zepl-documentation'
SparkR::write.df(df1, "s3a://zepl-documentation/write_from_r", "csv", "overwrite")

The S3 data source is not supported for R. This method uses the Spark API to read and write instead.

Configure Authentication

Generate User Access Key and Secret

  1. Log in to AWS

  2. Navigate to IAM > Users

  3. Select your user name

  4. Select Add permissions and make sure your user has the permissions needed for the level of access you require:

       • AmazonS3FullAccess: grants access to all actions and all resources.

       • AmazonS3ReadOnlyAccess: grants read-only access (Get and List actions) to all resources. With this policy, Zepl cannot write files to S3.

  5. Select Security credentials > Create access key

  6. Download the .csv file. It contains the key that will be entered into Zepl's S3 Data Source. DO NOT LOSE THIS FILE.
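The key pair can be read out of the downloaded file with a few lines of Python. This sketch assumes the file has a header row with the columns "Access key ID" and "Secret access key"; those names are an assumption and may differ in your file. `read_access_key` is a hypothetical helper, and the sample content is made up.

```python
import csv
import io
from typing import Dict


def read_access_key(csv_text: str) -> Dict[str, str]:
    """Extract the key pair from the text of a downloaded credentials file."""
    # Assumed header names; adjust them if your file uses different columns.
    # A leading byte order mark, if present, is stripped before parsing.
    row = next(csv.DictReader(io.StringIO(csv_text.lstrip("\ufeff"))))
    return {
        "access_key_id": row["Access key ID"].strip(),
        "secret_access_key": row["Secret access key"].strip(),
    }


# Made-up sample content, not real credentials.
sample = "Access key ID,Secret access key\nAKIAEXAMPLEKEY,exampleSecretValue\n"
creds = read_access_key(sample)
print(creds["access_key_id"])
```

The two values correspond to the Access Key ID and Secret access key inputs of the data source.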

AWS is constantly evolving. Please review their documentation at (some link) to make sure you follow current best practices.