Zepl
Search…
⌃K

Google BigQuery

Create an BigQuery Datasource

  1. 1.
    Navigate to: Resources > Data Sources
  2. 2.
    Select the BigQuery Icon:

Data Source Inputs

  • Name (required)
  • Google Cloud Project ID (required): Retrieve this from your GCP console
  • Credential Type: Select Shared This is the recommended type by Zepl, to reduce the number of users that require access to the service account key See Data Source Security
  • Query Priority: For more information, see Google's Documentation
    • Interactive: This is the default value and will count towards your concurrent rate limit and your daily limit
    • Batch: This method queues each batch query on your behalf, and starts the query as soon as idle resources are available in the BigQuery shared resource pool.
  • Service account key (JSON):
    • Copy the entire JSON key located in the service account credential file downloaded from BigQuery.
    • See Configure Authentication for more details

Use BigQuery

Read

Python
SQL
%python
from google.cloud import bigquery
import pandas as pd
# Construct a BigQuery client object.
client = z.getDatasource("BigQuery_Sample")
# Create Query
query = """
SELECT name, SUM(number) as total_people
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = 'TX'
GROUP BY name, state
ORDER BY total_people DESC
LIMIT 20
"""
# Execute Query and return Results
query_job = client.query(query).result()
# Convert query result object into a pandas dataframe
df = query_job.to_dataframe()
z.show(df)
Use z.getDatasource() to return BigQuery client object: google.cloud.bigquery.client.Client
// TODO: After Robert completes this

Write

Python
SQL

Create Sample Pandas DataFrame for Writing

%python
import pandas as pd
# Create data frame
data = {"names": ["Edward Norton", "Nicholas Cage", "Sean Connery", "Ed Harris"],
"Row1": [5, 6, 7, 8],
"Row2": [9, 10, 11, 12],
"Row3": [13, 14, 15, 16],
"Row4": [17, 18, 19, 20],
"Row5": [21, 22, 23, 24]}
# Create Pandas DataFrame
df = pd.DataFrame.from_dict(data)

Create Client, Dataset, Table, and LoadJobConfig

%python
from google.cloud import bigquery
from google.cloud.exceptions import NotFound
# Create BigQuery Client Object
client = z.getDatasource("BigQuery_Sample")
# Get Project ID
project_id = client.project
# Create BigQuery Dataset Reference Name (ID)
dataset_id = "write_dataset4"
dataset_id_full = "{}.{}".format(project_id, dataset_id)
# Create BigQuery Dataset Object - Check if it already exists
dataset = bigquery.Dataset(dataset_id_full)
try:
client.get_dataset(dataset_id) # Make an API request.
print("Dataset {} already exists".format(dataset_id))
except NotFound:
# Create the new BigQuery dataset.
dataset = client.create_dataset(dataset)
# Create BigQuery Table Reference Name (ID)
table_id = "write_df_to_table"
table_id_full = "{}.{}.{}".format(project_id, dataset_id, table_id)
# Create BigQuery Table Object
table_ref = dataset.table(table_id)
# Configure the query job.
job_config = bigquery.QueryJobConfig()
# Set the destination to write to the table object created above
job_config.destination = table_ref
# Create LoadJobConfig to specify the table schema and write characteristics
job_config = bigquery.LoadJobConfig(
# Specify a (partial) schema. All columns are always written to the table. The schema is used to assist in data type definitions
schema=[
# Specify the type of columns whose type cannot be auto-detected. For example the "title" column uses pandas dtype "object", so its data type is ambiguous
bigquery.SchemaField("names", bigquery.enums.SqlTypeNames.STRING)
],
# Optionally, set the write disposition. BigQuery appends loaded rows to an existing table by default, but with WRITE_TRUNCATE write disposition it replaces the table with the loaded data
write_disposition="WRITE_TRUNCATE",
)

Write DataFrame to Table

%python
# Execute write
job = client.load_table_from_dataframe(
df, table_id_full, job_config=job_config
)
job.result()
# Display results jf loaded succesfully
table = client.get_table(table_id_full)
print( "Loaded {} rows and {} columns to {}".format(table.num_rows, len(table.schema), table_id_full))
//TODO: After Robert completes
Documentation:
If support is needed for Scala, PySpark, and R, please contact [email protected]

Configure Authentication

Enable BigQuery API

  1. 1.
    Login to Google Cloud Platform
  2. 2.
    Select or Create a Project
  3. 3.
    If not already enabled, select "Enable APIs and Services". Enable the BigQuery APIs for this project

Create a Service Account Credential for Authentication from Zepl

  1. 1.
    Navigate to "APIs & Services" > Credentials
  2. 2.
    Create Credentials > Service Account
  3. 3.
    Service account details > Name (can be changed later): "zepl-bigquery-service-account"
  4. 4.
    Grant this service account access to project (optional) > Role: Scroll down to BigQuery and select the desired Roles to grant access to
  5. 5.
    Grant User (optional) ...
  6. 6.
    Edit the service account just created > Select Add Key > Create new Key > JSON > Create This is the key that will be entered into Zepl's BigQuery Data Source. DO NOT LOSE THIS KEY.
  7. 7.
    Save