Thursday 13 February 2020

Manage BigQuery Data Using Python


Hello everyone! Today we are going to discuss how we can manage datasets and tables in BigQuery using Python. Before we start, you might ask: Google Cloud Platform already provides an excellent console for the BigQuery service, so why do the same things with Python? I was wondering exactly that while I was dealing with small volumes of data, but once I had to handle a large amount of data, I really felt the difficulty.
I will try to cover all the basic operations and compare the schema and data types of BigQuery with those of other databases.
So let’s start…
1)      Convert a relational database schema (DDL) into a BigQuery schema:
Let’s understand this with the help of an example. In this illustration I am converting an Oracle DDL into a BigQuery schema. If you are familiar with BigQuery then you already know which data types it supports; I have listed a few of them below (with a minimal type-mapping sketch after the list), and you can find the full details on the official GCP website.

·         Numeric types:
(a)    Integer (INT64)
(b)    Numeric (NUMERIC)
(c)     Floating point (FLOAT64)
·         Boolean type (BOOL)
·         String type (STRING)
·         Bytes type (BYTES)
·         Date types (DATE, DATETIME, TIME, TIMESTAMP)
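
To make this concrete, here is a minimal sketch of how a few common Oracle types could be mapped to BigQuery types. The ORACLE_TO_BIGQUERY dictionary and the map_oracle_type helper are just illustrative names, not part of any library; the ddlparse module used later in this post does this mapping for us.

# A minimal sketch of an Oracle -> BigQuery type mapping.
# ORACLE_TO_BIGQUERY and map_oracle_type are illustrative names,
# not from any library; ddlparse (used below) does this properly.
ORACLE_TO_BIGQUERY = {
    "NUMBER": "NUMERIC",   # NUMBER without a scale could also map to INTEGER
    "VARCHAR2": "STRING",
    "CLOB": "STRING",      # CLOB is not supported, so fall back to STRING
    "DATE": "DATETIME",    # Oracle DATE also carries a time component
    "BLOB": "BYTES",
}

def map_oracle_type(oracle_type):
    """Return a BigQuery type for an Oracle type such as 'VARCHAR2(64)'."""
    base = oracle_type.split("(")[0].strip().upper()
    return ORACLE_TO_BIGQUERY.get(base, "STRING")

print(map_oracle_type("VARCHAR2(64)"))  # STRING
print(map_oracle_type("NUMBER(15)"))    # NUMERIC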

Below is a schema as supported in an Oracle database. You can see that it contains some data types (CLOB, VARCHAR2, etc.) that do not exist in BigQuery; that is why we need to convert the schema before we can create the table in a dataset.

CREATE TABLE Test.OracleDB
(
RECORDID NUMBER(15) NOT NULL,
ACTLINKID NUMBER(15) NOT NULL,
ACTIONINDEX NUMBER(15) NOT NULL,
SERVERNAME VARCHAR2(64) NOT NULL,
GUIDENAME VARCHAR2(254) NOT NULL,
GUIDEMODE NUMBER(15) NOT NULL,
GUIDETABLEID NUMBER(15) NOT NULL,
ASSIGNSHORT VARCHAR2(255),
ASSIGNLONG CLOB,
SAMPLESERVER VARCHAR2(64),
SAMPLEGUIDE VARCHAR2(254),
OVERLAYGROUP VARCHAR2(254) NOT NULL 
); 


We can convert the Oracle schema into a BigQuery-supported schema by using the Python script below. First install the ddlparse Python module (pip install ddlparse), then run the code, replacing the sample_ddl value with your desired Oracle schema.

from ddlparse.ddlparse import DdlParse

# Change the schema According your requirement :
sample_ddl = """
CREATE TABLE ARADMIN.Actlink_call
(
RECORDID NUMBER(15) NOT NULL,
ACTLINKID NUMBER(15) NOT NULL,
ACTIONINDEX NUMBER(15) NOT NULL,
SERVERNAME VARCHAR2(64) NOT NULL,
GUIDENAME VARCHAR2(254) NOT NULL,
GUIDEMODE NUMBER(15) NOT NULL,
GUIDETABLEID NUMBER(15) NOT NULL,
ASSIGNSHORT VARCHAR2(255),
ASSIGNLONG CLOB,
SAMPLESERVER VARCHAR2(64),
SAMPLEGUIDE VARCHAR2(254),
OVERLAYGROUP VARCHAR2(254) NOT NULL 
); 
"""

table = DdlParse().parse(sample_ddl, source_database=DdlParse.DATABASE.oracle)
print("* BigQuery Schema  *")
print(table.to_bigquery_fields())  # Print the schema in BigQuery's JSON field format
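
to_bigquery_fields() returns the schema as a JSON string in BigQuery’s schema-field format. As a small follow-up to the script above, you could save it to a file and reuse it when creating the table later; the filename schema.json is just an example:

# Save the converted schema so it can be reused when creating the table.
# "schema.json" is an example filename, not required by any tool.
with open("schema.json", "w") as f:
    f.write(table.to_bigquery_fields())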

2)      Create a dataset using Python:
First install the “google-cloud-bigquery” library (pip install google-cloud-bigquery), then use the Python code below.
from google.cloud import bigquery

client = bigquery.Client()

# Set dataset_id to the ID of the dataset to create; here I have named it "demodata".
dataset_id = "{}.demodata".format(client.project)  

dataset = bigquery.Dataset(dataset_id)

dataset.location = "US"

dataset = client.create_dataset(dataset)  # Make an API request.
print("Created dataset {}.{}".format(client.project, dataset.dataset_id))

3)      Create a table in the dataset:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# Set table_id to the ID of the table to create; the table name
# "employee" below is just an example, in the "demodata" dataset created above.
table_id = "{}.demodata.employee".format(client.project)

schema = [
    bigquery.SchemaField("full_name", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("age", "INTEGER", mode="REQUIRED"),
]

table = bigquery.Table(table_id, schema=schema)
table = client.create_table(table)  # Make an API request.
print(
    "Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
)
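
If you saved the ddlparse output to schema.json in step 1, you could also load it with schema_from_json instead of writing the fields by hand. This is just a sketch reusing that example file:

# Load the schema that was converted from the Oracle DDL earlier.
# "schema.json" is the example file written in step 1.
schema = client.schema_from_json("schema.json")
table = client.create_table(bigquery.Table(table_id, schema=schema))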

4)      List the tables belonging to a dataset with the list_tables() method:

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# Set dataset_id to the ID of the dataset that contains
# the tables you are listing; here I reuse the "demodata" dataset.
dataset_id = "{}.demodata".format(client.project)

tables = client.list_tables(dataset_id)  # Make an API request.

print("Tables contained in '{}':".format(dataset_id))
for table in tables:
    print("{}.{}.{}".format(table.project, table.dataset_id, table.table_id))
