Google BigQuery

Create a BigQuery Data Source

  1. Navigate to: Resources > Data Sources

  2. Select the BigQuery Icon:

Data Source Inputs

  • Name (required)

  • Google Cloud Project ID (required): Retrieve this from your GCP console

  • Credential Type: Select Shared. This is the type recommended by Zepl because it reduces the number of users that require access to the service account key. See Data Source Security

  • Query Priority: For more information, see Google's Documentation

    • Interactive: This is the default value and will count towards your concurrent rate limit and your daily limit

    • Batch: This method queues each batch query on your behalf, and starts the query as soon as idle resources are available in the BigQuery shared resource pool. Batch queries do not count toward your concurrent rate limit.

  • Service account key (JSON):

    • Copy the entire JSON key located in the service account credential file downloaded from BigQuery (a quick local check of the key file is sketched after this list).

    • See Configure Authentication for more details
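
The Google Cloud Project ID and the service account key are linked: the downloaded key file records the project it belongs to. The following is a minimal local sketch (run outside Zepl) that confirms the downloaded key is complete JSON and prints the Project ID to enter above; the filename zepl-bigquery-service-account.json is hypothetical.

import json

# Hypothetical filename of the key downloaded in Configure Authentication (below)
with open("zepl-bigquery-service-account.json") as f:
    key = json.load(f)  # raises an error here if the file is not valid JSON

# The entire JSON document, not just the private_key field, is what gets pasted
# into the Service account key (JSON) input
print("Project ID to enter:", key["project_id"])
print("Key type:", key["type"])            # should be "service_account"
print("Service account:", key["client_email"])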

Use BigQuery

Read

Python
%python
from google.cloud import bigquery
import pandas as pd

# Construct a BigQuery client object.
client = z.getDatasource("BigQuery_Sample")

# Create Query
query = """
SELECT name, SUM(number) as total_people
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = 'TX'
GROUP BY name, state
ORDER BY total_people DESC
LIMIT 20
"""

# Execute Query and return Results
query_job = client.query(query).result()

# Convert query result object into a pandas dataframe
df = query_job.to_dataframe()

z.show(df)

Use z.getDatasource() to return a BigQuery client object: google.cloud.bigquery.client.Client
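
Because the returned object is a standard BigQuery client, other client-library features can be used from the notebook as well. For example, the Query Priority option described above corresponds to BigQuery's job priority. Below is a minimal sketch of submitting a single query with batch priority through the client library; it assumes the same sample data source name used above and illustrates the client-library setting rather than Zepl-specific behavior.

%python
from google.cloud import bigquery

client = z.getDatasource("BigQuery_Sample")

# Request batch priority for this query only; it starts when idle resources
# are available in the BigQuery shared resource pool
job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)

query = "SELECT COUNT(*) AS total_rows FROM `bigquery-public-data.usa_names.usa_1910_2013`"
query_job = client.query(query, job_config=job_config)

# result() blocks until the batch job has run
z.show(query_job.result().to_dataframe())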


Write

Python

Create Sample Pandas DataFrame for Writing

%python
import pandas as pd

# Create a dictionary of sample data
data = {"names": ["Edward Norton", "Nicholas Cage", "Sean Connery", "Ed Harris"],
        "Row1": [5, 6, 7, 8],
        "Row2": [9, 10, 11, 12],
        "Row3": [13, 14, 15, 16],
        "Row4": [17, 18, 19, 20],
        "Row5": [21, 22, 23, 24]}

# Create Pandas DataFrame
df = pd.DataFrame.from_dict(data)
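
A quick look at the DataFrame's dtypes explains the explicit schema in the LoadJobConfig below: pandas stores the names column as object, which has no unambiguous BigQuery type.

%python
# "object" columns (here: names) have no unambiguous BigQuery type,
# which is why the next cell declares names as STRING in the schema
print(df.dtypes)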

Create Client, Dataset, Table ID, and LoadJobConfig

%python
from google.cloud import bigquery
from google.cloud.exceptions import NotFound

# Create BigQuery Client Object
client = z.getDatasource("BigQuery_Sample")

# Get Project ID
project_id = client.project

# Create BigQuery Dataset Reference Name (ID)
dataset_id = "write_dataset4"
dataset_id_full = "{}.{}".format(project_id, dataset_id)

# Create BigQuery Dataset Object - Check if it already exists
dataset = bigquery.Dataset(dataset_id_full)
try:
    client.get_dataset(dataset_id)  # Make an API request.
    print("Dataset {} already exists".format(dataset_id))
except NotFound:
    # Create the new BigQuery dataset.
    dataset = client.create_dataset(dataset)

# Create BigQuery Table Reference Name (ID) - the load job below creates the table
table_id = "write_df_to_table"
table_id_full = "{}.{}.{}".format(project_id, dataset_id, table_id)

# Create LoadJobConfig to specify the table schema and write characteristics
job_config = bigquery.LoadJobConfig(
    # Specify a (partial) schema. All columns are still written to the table;
    # the schema only helps resolve ambiguous data types
    schema=[
        # Declare the type of columns whose type cannot be auto-detected. The
        # "names" column uses pandas dtype "object", so its data type is ambiguous
        bigquery.SchemaField("names", bigquery.enums.SqlTypeNames.STRING)
    ],
    # Optionally, set the write disposition. BigQuery appends loaded rows to an
    # existing table by default, but with WRITE_TRUNCATE it replaces the table
    # with the loaded data
    write_disposition="WRITE_TRUNCATE",
)
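
If repeated runs should add rows to the destination table instead of replacing it, only the write disposition needs to change. A minimal sketch, assuming the same schema as above; append_job_config is an illustrative name and the next cell continues to use job_config.

%python
from google.cloud import bigquery

# Append loaded rows to the existing table instead of truncating it
append_job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("names", bigquery.enums.SqlTypeNames.STRING)],
    write_disposition="WRITE_APPEND",
)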

Write DataFrame to Table

%python
# Execute write
job = client.load_table_from_dataframe(
    df, table_id_full, job_config=job_config
)
job.result()

# Display results if loaded successfully
table = client.get_table(table_id_full)
print("Loaded {} rows and {} columns to {}".format(table.num_rows, len(table.schema), table_id_full))
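
To confirm the written contents from the notebook, the same client can query the new table back into a DataFrame. A minimal sketch, reusing client and table_id_full from the cells above:

%python
# Read the newly written table back and display it
verify_query = "SELECT * FROM `{}` ORDER BY names".format(table_id_full)
df_check = client.query(verify_query).result().to_dataframe()
z.show(df_check)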

Documentation:

If support is needed for Scala, PySpark, and R, please contact [email protected]

Configure Authentication

Enable BigQuery API

  1. Login to Google Cloud Platform

  2. Select or Create a Project

  3. If not already enabled, select "Enable APIs and Services". Enable the BigQuery APIs for this project

Create a Service Account Credential for Authentication from Zepl

  1. Navigate to "APIs & Services" > Credentials

  2. Create Credentials > Service Account

  3. Service account details > Name (can be changed later): "zepl-bigquery-service-account"

  4. Grant this service account access to project (optional) > Role: Scroll down to BigQuery and select the desired roles to grant to this service account

  5. Grant User (optional) ...

  6. Edit the service account just created > Select Add Key > Create new Key > JSON > Create. This is the key that will be entered into Zepl's BigQuery Data Source. DO NOT LOSE THIS KEY.

  7. Save
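
Before pasting the key into Zepl, it can be worth confirming that the key and its granted roles actually work against BigQuery. The following is a minimal sketch that runs outside Zepl; it assumes the key file was saved as zepl-bigquery-service-account.json (hypothetical filename), the google-cloud-bigquery package is installed, and a role that permits running query jobs (for example BigQuery User) was granted in step 4.

from google.cloud import bigquery
from google.oauth2 import service_account

# Load the downloaded service account key (hypothetical filename)
credentials = service_account.Credentials.from_service_account_file(
    "zepl-bigquery-service-account.json"
)

# Build a client directly from the key, outside Zepl
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

# A trivial query confirms the key is valid and the BigQuery API is enabled
rows = client.query("SELECT 1 AS ok").result()
print(list(rows))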