Navigate to: Resources > Data Sources
Select the Amazon S3 Icon:
Name (required)
Bucket name (required): Name of the S3 bucket to connect to
Credential Type: See Data Source Security
Access Key ID (required - can be entered later):
The Access Key ID can be found in the IAM console under your specific user.
See Configure Authentication below.
Secret access key (required - can be entered later):
The Secret access key can be found in the IAM console under your specific user.
See Configure Authentication below.
Session token (optional)
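Zepl builds the S3 connection for you from the fields above. As a point of reference only, the sketch below shows how those same fields would map onto a plain boto3 session; all values are placeholders, not real credentials, and you do not need to run this to use the data source.
%python
# Illustrative only: Zepl creates this connection when you call z.getDatasource().
# All values below are placeholders, not real credentials.
import boto3

session = boto3.session.Session(
    aws_access_key_id="<Access Key ID>",          # from IAM > Users > Security credentials
    aws_secret_access_key="<Secret access key>",  # from the downloaded credentials .csv
    aws_session_token=None                        # optional, only for temporary credentials
)

# The data source's "Bucket name" field corresponds to this Bucket resource
bucket = session.resource("s3").Bucket("<Bucket name>")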
%python
# Connect to S3 Bucket and return Boto3 Bucket object
bucket = z.getDatasource("zepl_documentation")

# Display files and folders in the zepl_documentation bucket
for s3_file in bucket.objects.all():
    print(s3_file.key)

# Download a .csv or .pkl file
bucket.download_file("finalized_model.pkl", "./finalized_model.pkl")
bucket.download_file("titanic3.csv", "./titanic3.csv")

# Validate the csv file is loaded on the container file system
print("List local container:")
!ls -al
Use z.getDatasource() to return a boto3 Bucket object: boto3.resources.factory.s3.Bucket
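Because the returned object is a standard boto3 Bucket, any boto3 Bucket method can be used on it. The sketch below shows one common pattern, listing only the objects under a prefix; the "data/" prefix is hypothetical and not part of the example bucket above.
%python
# The object returned by z.getDatasource() is a regular boto3 Bucket,
# so standard boto3 calls work on it.
bucket = z.getDatasource("zepl_documentation")

# List only objects under a prefix (the "data/" prefix here is hypothetical)
for s3_file in bucket.objects.filter(Prefix="data/"):
    print(s3_file.key, s3_file.size)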
%spark
// User must set AWS Access Key and Secret here
val myAccessKey = "<Your AWS Access Key>"
val mySecretKey = "<Your AWS Secret Key>"

// Set Spark Context Hadoop Configurations
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", myAccessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", mySecretKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

// Read CSV file to a DataFrame
val df = spark.read.option("header", true).csv("s3a://zepl-documentation/titanic3.csv")

// Display Results
z.show(df)
%pyspark
# Connect to S3 Bucket and return Boto3 Bucket object
bucket = z.getDatasource("zepl_documentation")

# Display files and folders in the zepl_documentation bucket
for s3_file in bucket.objects.all():
    print(s3_file.key)

# Download a .csv or .pkl file
bucket.download_file("finalized_model.pkl", "./finalized_model.pkl")
bucket.download_file("titanic3.csv", "./titanic3.csv")

# Validate the csv file is loaded on the container file system
print("List local container:")
!ls -al
%scala
// User must set AWS Access Key and Secret here
val myAccessKey = "<Your AWS Access Key>"
val mySecretKey = "<Your AWS Secret Key>"

// Set Spark Context Hadoop Configurations
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", myAccessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", mySecretKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
%r
# ATTENTION: Must set hadoopConfigurations using Scala before connecting to your S3 bucket

# Read DataFrame from S3
df1 <- SparkR::read.df("s3a://zepl-documentation/titanic3.csv", delimiter = ",", source = "csv", inferSchema = "true", na.strings = "", header = "true")
The S3 data source is not supported for R. This method uses the Spark API to read and write instead.
%python
import pandas as pd
import boto3

# Connect to S3 Bucket and return Boto3 Bucket object
bucket = z.getDatasource("zepl_documentation")

# Create data frame
data = {0: [1, 2, 3, 4],
        1: [5, 6, 7, 8],
        2: [9, 10, 11, 12],
        3: [13, 14, 15, 16],
        4: [17, 18, 19, 20],
        5: [21, 22, 23, 24]}

# Create Pandas DataFrame
df = pd.DataFrame.from_dict(data, orient='index')

# Create a csv file on the container file system (Write DataFrame to a CSV)
df.to_csv("local_data.csv", sep='\t', encoding='utf-8')

# Upload file to S3
bucket.upload_file("local_data.csv", "s3_data.csv")
%spark
// User must set AWS Access Key and Secret here
val myAccessKey = "<Your AWS Access Key>"
val mySecretKey = "<Your AWS Secret Key>"

// Set Spark Context Hadoop Configurations
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", myAccessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", mySecretKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

// Write DataFrame to the S3 bucket called 'zepl-documentation'.
// df must be instantiated as an org.apache.spark.sql.DataFrame object
df.write.option("header", true).mode("overwrite").csv("s3a://zepl-documentation/write_from_scala")
%pyspark
import pandas as pd
import boto3

# Connect to S3 Bucket and return Boto3 Bucket object
bucket = z.getDatasource("zepl_documentation")

# Create data frame
data = {0: [1, 2, 3, 4],
        1: [5, 6, 7, 8],
        2: [9, 10, 11, 12],
        3: [13, 14, 15, 16],
        4: [17, 18, 19, 20],
        5: [21, 22, 23, 24]}

# Create Pandas DataFrame
df = pd.DataFrame.from_dict(data, orient='index')

# Write DataFrame to a CSV
df.to_csv("local_data.csv", sep='\t', encoding='utf-8')

# Upload file to S3
bucket.upload_file("local_data.csv", "s3_data.csv")
%scala
// User must set AWS Access Key and Secret here
val myAccessKey = "<Your AWS Access Key>"
val mySecretKey = "<Your AWS Secret Key>"

// Set Spark Context Hadoop Configurations
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", myAccessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", mySecretKey)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
%r
# ATTENTION: Must set hadoopConfigurations using Scala before connecting to your S3 bucket

# Write DataFrame to the S3 bucket called 'zepl-documentation'
SparkR::write.df(df1, "s3a://zepl-documentation/write_from_r", "csv", "overwrite")
The S3 data source is not supported for Scala. This method uses the Spark API to read and write instead.
Log in to AWS
Navigate to IAM > Users
Select your user name
Select Add permissions to make sure your user has the level of access you need:
AmazonS3FullAccess - Grants access to all actions and all resources.
AmazonS3ReadOnlyAccess - Grants read (Get and List) actions on all resources. With this policy, Zepl will not be able to write files to S3.
Select Security credentials > Create access key
Download the .csv file - this file contains the keys that will be entered into Zepl's S3 data source. Do not lose this file.
AWS is constantly evolving - please review their documentation (some link) to ensure you follow current best practices.
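Before saving the new key pair in the data source, you may want to confirm it grants the access you expect. This is a minimal sketch using boto3 with placeholder values; the bucket name and credentials are assumptions you should replace with your own.
%python
# Minimal credential check (placeholder values, not real credentials)
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    aws_access_key_id="<Access Key ID>",
    aws_secret_access_key="<Secret access key>",
)

try:
    # Succeeds with AmazonS3ReadOnlyAccess or AmazonS3FullAccess
    response = s3.list_objects_v2(Bucket="<Bucket name>", MaxKeys=5)
    print("Read access OK:", [obj["Key"] for obj in response.get("Contents", [])])
except ClientError as e:
    print("Access check failed:", e.response["Error"]["Code"])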