Amazon S3

Create an Amazon S3 Datasource

  1. Navigate to: Resources > Data Sources
  2. Select the Amazon S3 icon

Data Source Inputs

  • Name (required): the name used to reference this data source from a notebook (see the example below)
  • Bucket name (required): name of the S3 bucket to connect to
  • Credential Type: see Data Source Security
  • Access Key ID (required; can be entered later)
  • Secret access key (required; can be entered later)
  • Session token (optional)
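Once the data source is saved, it is referenced by its Name from a notebook. A minimal sketch, assuming the data source was named zepl_documentation (the name used throughout the examples below):

%python
# "zepl_documentation" is the data source Name configured above
bucket = z.getDatasource("zepl_documentation")

# The returned object is a boto3 Bucket; print the underlying S3 bucket name
print(bucket.name)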

Use Amazon S3

Read

Examples are provided for Python, Scala, PySpark, and R.
%python
# Connect to S3 Bucket and return Boto3 Bucket object
bucket = z.getDatasource("zepl_documentation")

# Display files and folders in the zepl_documentation bucket
for s3_file in bucket.objects.all():
    print(s3_file.key)

# Download a .csv or .pkl file
bucket.download_file("finalized_model.pkl", "./finalized_model.pkl")
bucket.download_file("titanic3.csv", "./titanic3.csv")

# Validate the csv file is loaded on the container file system
print("List local container:")
!ls -al
Use z.getDatasource() to return a boto3 Bucket object (boto3.resources.factory.s3.Bucket).
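Because the returned object is a standard boto3 Bucket, other boto3 Bucket methods can be used as well. For example, a minimal sketch that lists only the objects under a prefix (the "models/" prefix is hypothetical; adjust it to your bucket layout):

%python
bucket = z.getDatasource("zepl_documentation")

# "models/" is a hypothetical prefix used for illustration
for s3_file in bucket.objects.filter(Prefix="models/"):
    print(s3_file.key)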
%spark
// User must set AWS Access Key and Secret here
val myAccessKey = "<Your AWS Access Key>"
val mySecretKey = "<Your AWS Secret Key>"

// Set Spark Context Hadoop Configurations
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", myAccessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", mySecretKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

// Read CSV file to a DataFrame
val df = spark.read.option("header", true).csv("s3a://zepl-documentation/titanic3.csv")

// Display Results
z.show(df)
%pyspark
# Connect to S3 Bucket and return Boto3 Bucket object
bucket = z.getDatasource("zepl_documentation")

# Display files and folders in the zepl_documentation bucket
for s3_file in bucket.objects.all():
    print(s3_file.key)

# Download a .csv or .pkl file
bucket.download_file("finalized_model.pkl", "./finalized_model.pkl")
bucket.download_file("titanic3.csv", "./titanic3.csv")

# Validate the csv file is loaded on the container file system
print("List local container:")
!ls -al

First, set the Spark context variables before connecting to S3 buckets using R:

%spark
// User must set AWS Access Key and Secret here
val myAccessKey = "<Your AWS Access Key>"
val mySecretKey = "<Your AWS Secret Key>"

// Set Spark Context Hadoop Configurations
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", myAccessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", mySecretKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

Then, read from S3 using R:

%r
# ATTENTION: Must set hadoopConfigurations using Scala before connecting to your S3 bucket
# Read DataFrame from S3
df1 <- SparkR::read.df("s3a://zepl-documentation/titanic3.csv", delimiter = ",", source = "csv", inferSchema = "true", na.strings = "", header = "true")
The S3 data source is not supported for R; this method uses the Spark API to read and write.

Write

Examples are provided for Python, Scala, PySpark, and R.
%python
import pandas as pd
import boto3

# Connect to S3 Bucket and return Boto3 Bucket object
bucket = z.getDatasource("zepl_documentation")

# Create data frame
data = {0: [1, 2, 3, 4],
        1: [5, 6, 7, 8],
        2: [9, 10, 11, 12],
        3: [13, 14, 15, 16],
        4: [17, 18, 19, 20],
        5: [21, 22, 23, 24]}

# Create Pandas DataFrame
df = pd.DataFrame.from_dict(data, orient='index')

# Create a csv file on the container file system (Write DataFrame to a CSV)
df.to_csv("local_data.csv", sep='\t', encoding='utf-8')

# Upload file to S3
bucket.upload_file("local_data.csv", "s3_data.csv")
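To confirm the upload succeeded, the bucket contents can be listed again with the same Bucket object (a minimal sketch; the key s3_data.csv matches the upload above):

%python
# List objects to verify that s3_data.csv now exists in the bucket
for s3_file in bucket.objects.all():
    print(s3_file.key)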
%spark
// User must set AWS Access Key and Secret here
val myAccessKey = "<Your AWS Access Key>"
val mySecretKey = "<Your AWS Secret Key>"

// Set Spark Context Hadoop Configurations
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", myAccessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", mySecretKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

// Write DataFrame to the S3 bucket called 'zepl-documentation'.
// df must already be instantiated as an org.apache.spark.sql.DataFrame object
df.write.option("header", true).mode("overwrite").csv("s3a://zepl-documentation/write_from_scala")
%pyspark
import pandas as pd
import boto3

# Connect to S3 Bucket and return Boto3 Bucket object
bucket = z.getDatasource("zepl_documentation")

# Create data frame
data = {0: [1, 2, 3, 4],
        1: [5, 6, 7, 8],
        2: [9, 10, 11, 12],
        3: [13, 14, 15, 16],
        4: [17, 18, 19, 20],
        5: [21, 22, 23, 24]}

# Create Pandas DataFrame
df = pd.DataFrame.from_dict(data, orient='index')

# Write DataFrame to a CSV
df.to_csv("local_data.csv", sep='\t', encoding='utf-8')

# Upload file to S3
bucket.upload_file("local_data.csv", "s3_data.csv")

First, set the Spark context variables in Scala:

%spark
// User must set AWS Access Key and Secret here
val myAccessKey = "<Your AWS Access Key>"
val mySecretKey = "<Your AWS Secret Key>"

// Set Spark Context Hadoop Configurations
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", myAccessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", mySecretKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

Then, write to S3 using R:

%r
# ATTENTION: Must set hadoopConfigurations using Scala before connecting to your S3 bucket
# Write DataFrame to the S3 bucket called 'zepl-documentation'
SparkR::write.df(df1, "s3a://zepl-documentation/write_from_r", "csv", "overwrite")
The S3 data source is not supported for Scala; this method uses the Spark API to read and write.

Configure Authentication

Generate User Access Key and Secret

  1. Log in to AWS
  2. Navigate to IAM > Users
  3. Select your user name
  4. Select Add permissions to make sure your user has the permissions required for the level of access you need:
    • AmazonS3FullAccess - grants access to all actions and all resources
    • AmazonS3ReadOnlyAccess - grants read-only Get and List actions on all resources; this will not allow Zepl to write files to S3
  5. Select Security credentials > Create access key
  6. Download the .csv file - this file contains the access key ID and secret access key that will be entered into Zepl's S3 Data Source. DO NOT LOSE THIS file.
AWS is constantly evolving - please review their documentation at (some link) to ensure best practices.
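If you prefer to script this step instead of using the AWS console, a minimal sketch using boto3 is shown below. The IAM user name is hypothetical, and your local AWS credentials must already have IAM permissions:

%python
import boto3

# Assumes credentials with IAM permissions are already configured for boto3
iam = boto3.client("iam")

# "my-zepl-user" is a hypothetical IAM user name
response = iam.create_access_key(UserName="my-zepl-user")

# These two values are what you enter into the Zepl S3 data source form
print(response["AccessKey"]["AccessKeyId"])
print(response["AccessKey"]["SecretAccessKey"])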