Amazon S3
- 1. Navigate to: Resources > Data Sources
- 2. Select the Amazon S3 icon and complete the following fields:

- Name (required)
- Bucket name (required): Name of the S3 bucket to connect to
- Access Key ID (required - can be entered later):
  - Your user's Access Key ID can be found in the IAM section for your specific user.
- Secret access key (required - can be entered later):
  - Your user's Secret access key can be found in the IAM section for your specific user.
- Session token (optional): Only needed for temporary credentials; the sketch below shows how these fields map to a boto3 connection.
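For context, these fields correspond to the same credentials you would otherwise pass to boto3 yourself. A minimal sketch of the equivalent connection outside of a Zepl data source, using placeholder values:

```python
%python
import boto3

# Placeholder credentials - with a configured data source, Zepl stores these
# for you and z.getDatasource() returns a ready-to-use Boto3 Bucket object.
session = boto3.Session(
    aws_access_key_id="<Your AWS Access Key>",
    aws_secret_access_key="<Your AWS Secret Key>",
    aws_session_token=None,  # only required for temporary credentials
)
bucket = session.resource("s3").Bucket("<your-bucket-name>")
```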
Read from your S3 bucket in Python, Scala, PySpark, or R:
```python
%python
# Connect to S3 Bucket and return Boto3 Bucket object
bucket = z.getDatasource("zepl_documentation")

# Display files and folders in the zepl_documentation bucket
for s3_file in bucket.objects.all():
    print(s3_file.key)

# Download a .csv or .pkl file
bucket.download_file("finalized_model.pkl", "./finalized_model.pkl")
bucket.download_file("titanic3.csv", "./titanic3.csv")

# Validate the csv file is loaded on the container file system
print("List local container:")
!ls -al
```
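Once downloaded, the files behave like any other local file in the container. A short follow-up sketch, assuming the titanic3.csv download above succeeded:

```python
%python
import pandas as pd

# Load the downloaded CSV from the container file system
titanic_df = pd.read_csv("./titanic3.csv")
print(titanic_df.head())
```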
```scala
%spark
// User must set AWS Access Key and Secret here
val myAccessKey = "<Your AWS Access Key>"
val mySecretKey = "<Your AWS Secret Key>"

// Set Spark Context Hadoop configurations for the S3A connector
sc.hadoopConfiguration.set("fs.s3a.access.key", myAccessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", mySecretKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

// Read CSV file to a DataFrame
val df = spark.read.option("header", true).csv("s3a://zepl-documentation/titanic3.csv")

// Display results
z.show(df)
```
```python
%pyspark
# Connect to S3 Bucket and return Boto3 Bucket object
bucket = z.getDatasource("zepl_documentation")

# Display files and folders in the zepl_documentation bucket
for s3_file in bucket.objects.all():
    print(s3_file.key)

# Download a .csv or .pkl file
bucket.download_file("finalized_model.pkl", "./finalized_model.pkl")
bucket.download_file("titanic3.csv", "./titanic3.csv")

# Validate the csv file is loaded on the container file system
print("List local container:")
!ls -al
```
```scala
%spark
// User must set AWS Access Key and Secret here
val myAccessKey = "<Your AWS Access Key>"
val mySecretKey = "<Your AWS Secret Key>"

// Set Spark Context Hadoop configurations for the S3A connector
sc.hadoopConfiguration.set("fs.s3a.access.key", myAccessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", mySecretKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
```
```r
%r
# ATTENTION: Must set hadoopConfigurations using Scala before connecting to your S3 bucket
# Read DataFrame from S3
df1 <- SparkR::read.df("s3a://zepl-documentation/titanic3.csv", delimiter = ",", source = "csv", inferSchema = "true", na.strings = "", header = "true")
```
The S3 data source is not supported for Scala or R; the Scala and R examples above use the Spark API to read from S3 instead.
Write to your S3 bucket in Python, Scala, PySpark, or R:
```python
%python
import pandas as pd
import boto3

# Connect to S3 Bucket and return Boto3 Bucket object
bucket = z.getDatasource("zepl_documentation")

# Create data for the DataFrame
data = {0: [1, 2, 3, 4],
        1: [5, 6, 7, 8],
        2: [9, 10, 11, 12],
        3: [13, 14, 15, 16],
        4: [17, 18, 19, 20],
        5: [21, 22, 23, 24]}

# Create Pandas DataFrame
df = pd.DataFrame.from_dict(data, orient='index')

# Create a csv file on the container file system (write DataFrame to a CSV)
df.to_csv("local_data.csv", sep='\t', encoding='utf-8')

# Upload file to S3
bucket.upload_file("local_data.csv", "s3_data.csv")
```
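To confirm the upload, you can list the matching keys in the bucket; a short check, assuming the paragraph above ran successfully:

```python
%python
# List objects whose key starts with "s3_data" to confirm the upload
for obj in bucket.objects.filter(Prefix="s3_data"):
    print(obj.key, obj.size)
```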
```scala
%spark
// User must set AWS Access Key and Secret here
val myAccessKey = "<Your AWS Access Key>"
val mySecretKey = "<Your AWS Secret Key>"

// Set Spark Context Hadoop configurations for the S3A connector
sc.hadoopConfiguration.set("fs.s3a.access.key", myAccessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", mySecretKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

// Write DataFrame to the S3 bucket called 'zepl-documentation'.
// df must be instantiated as an org.apache.spark.sql.DataFrame object.
df.write.option("header", true).mode("overwrite").csv("s3a://zepl-documentation/write_from_scala")
```
```python
%pyspark
import pandas as pd
import boto3

# Connect to S3 Bucket and return Boto3 Bucket object
bucket = z.getDatasource("zepl_documentation")

# Create data for the DataFrame
data = {0: [1, 2, 3, 4],
        1: [5, 6, 7, 8],
        2: [9, 10, 11, 12],
        3: [13, 14, 15, 16],
        4: [17, 18, 19, 20],
        5: [21, 22, 23, 24]}

# Create Pandas DataFrame
df = pd.DataFrame.from_dict(data, orient='index')

# Write DataFrame to a CSV
df.to_csv("local_data.csv", sep='\t', encoding='utf-8')

# Upload file to S3
bucket.upload_file("local_data.csv", "s3_data.csv")
```
```scala
%spark
// User must set AWS Access Key and Secret here
val myAccessKey = "<Your AWS Access Key>"
val mySecretKey = "<Your AWS Secret Key>"

// Set Spark Context Hadoop configurations for the S3A connector
sc.hadoopConfiguration.set("fs.s3a.access.key", myAccessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", mySecretKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
```
```r
%r
# ATTENTION: Must set hadoopConfigurations using Scala before connecting to your S3 bucket
# Write DataFrame to the S3 bucket called 'zepl-documentation'
SparkR::write.df(df1, "s3a://zepl-documentation/write_from_r", "csv", "overwrite")
```
The S3 data source is not supported for Scala or R; the Scala and R examples above use the Spark API to write to S3 instead.
- 1. Log in to AWS
- 2. Navigate to IAM > Users
- 3. Select your user name
- 4. Select Add permissions to make sure your user has the permissions required for the level of access you need:
  - AmazonS3FullAccess - Grants access to all actions and all resources
  - AmazonS3ReadOnlyAccess - Grants read (Get and List) actions on all resources; this will not allow Zepl to write files to S3
- 5. Select Security credentials > Create access key
- 6. Download the .csv file - This file contains the keys that will be entered into Zepl's S3 data source. Do not lose this file. A quick way to verify the new keys is shown below.
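Before entering the keys into Zepl, you can sanity-check them with boto3. A minimal sketch, assuming the access key pair from the downloaded .csv and the name of a bucket your user can read:

```python
%python
import boto3

# Placeholder values - substitute the Access key ID and Secret access key
# from the downloaded .csv and a bucket your user can access.
s3 = boto3.client(
    "s3",
    aws_access_key_id="<Your AWS Access Key>",
    aws_secret_access_key="<Your AWS Secret Key>",
)

# List a few objects; an AccessDenied error here means the keys or IAM
# permissions are not set up correctly.
response = s3.list_objects_v2(Bucket="<your-bucket-name>", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"])
```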
AWS is constantly evolving, so please review their documentation at (some link) to ensure you are following current best practices.