Hello
data lover! today’s data-driven world, organizations generate vast
amounts of data from various sources—websites, IoT devices, customer
interactions, and more. To make sense of this data and use it for better
decisions, businesses rely on data warehousing. In this article, we will
discuss one of the data warehousing services provided by Google Cloud Platform:
BigQuery.
BigQuery is a fully managed, serverless, cloud-based data warehouse provided by Google Cloud. It is designed for fast SQL-based analytics on massive datasets. BigQuery is known for its scalability, integration with Google Cloud services, and ability to handle large-scale analytics efficiently without requiring significant infrastructure management.
Use BigQuery when:
- You
need to run fast SQL queries on large datasets (petabytes).
- It's
for analytics, reporting, or business intelligence.
- You
prefer a serverless, fully managed service with automatic scaling.
- You
need real-time analytics with streaming data.
Avoid BigQuery when:
- You
need frequent data updates or inserts (use Cloud SQL or Firestore).
- For small-scale, transactional, or unstructured data (use Cloud Storage or Cloud SQL).
Let’s explore the internal architecture of BigQuery and
understand how it works. Refer to the diagram below as we discuss each
component step by step.
Brog:
Google's Borg system is a cluster manager that runs hundreds
of thousands of jobs, from many thousands of different applications, across a
number of clusters each with up to tens of thousands of machines.
Dremel:
BigQuery takes advantage of Borg for data processing. Borg
simultaneously runs thousands of Dremel jobs across one or more clusters made
up of tens of thousands of machines. In addition to assigning compute capacity
for Dremel jobs, Borg handles fault-tolerance.
Jupiter Network:
Big data workloads often face bottlenecks due to network
speed, not just disk I/O. Since BigQuery separates its compute and storage
layers, it depends on a lightning-fast network to move massive amounts of
data—terabytes within seconds—from storage to compute for executing Dremel
queries. This is made possible by Google’s Jupiter network, which provides
BigQuery with an impressive 1 Petabit per second of total bisection bandwidth,
ensuring seamless data transfer and high performance.
Colossus File-System:
BigQuery leverages Capacitor to store data in Colossus.
Colossus is Google’s latest generation distributed file system and successor to
GFS (Google File Systems). Colossus handles cluster-wide replication, recovery
and distributed management. It provides client-driven replication and encoding.
When writing data to Colossus, BigQuery makes some decision about initial
sharding strategy which evolves based on the query and access patterns. Once
data is written, to enable the highest availability BigQuery initiates
geo-replication of data across different data centers.
Client for BQ:
There are multiple ways to retrieve/anaysis which are stored in BQ.
BQ Console/ Web UI:
From BQ console, we can directly execute the query
and we can see the result based on query statement. BigQuery supports almost
all SQL syntax, but some data types and functions differ from other relational
query languages. Please refer to the UI console snapshot and the table creation
example below.
Note:
Web UI has many features like data preview, schema preview, table details, data Lineage etc.
Example:
Executing from Command-line:
Below I have described all options in detail:
·
bq mk --table: Command to create a table.
·
description: Optional. Adds a description
to the table.
·
label: Optional. Adds labels (key-value
pairs) for easier resource management.
·
project_id:dataset_id.table_name:
Specifies the project ID, dataset ID, and table name. Replace these with your
actual values.
· schema.json: Points to a JSON file defining the table schema.
Interact with BQ using Python:
Note:
BigQuery can Integration with GCP services:
- Google
Cloud Storage (GCS): Load large datasets from GCS into BigQuery.
- Google
Cloud Pub/Sub: Stream real-time data into BigQuery.
- Google
Cloud Dataflow: Process data before loading it into BigQuery.
- Google
Cloud Dataproc: Use Spark/Hadoop for large-scale processing and load
results into BigQuery.
- Google
Cloud AI/ML: Integrate machine learning models with BigQuery for
predictions and analysis.
- Google
Cloud Functions: Trigger actions in BigQuery based on events.
- Google
Cloud Spanner: Analyze transactional data from Spanner in BigQuery.
- Google
Cloud Bigtable: Query time-series data from Bigtable in BigQuery.
- Google
Cloud Firestore: Export and analyze Firestore data in BigQuery.
- Google
Cloud Logging: Export logs to BigQuery for analysis.
- Google
Cloud Data Catalog: Organize and discover BigQuery datasets.
- Google
Cloud Composer: Automate ETL workflows involving BigQuery.
- Google Cloud Pub/Sub (Streaming): Stream IoT data into BigQuery for real-time analytics.
Q. How many
types of tables you can create in BigQuery?
1.
Native Tables: Standard tables for
storing structured data directly in BigQuery.
2.
External Tables: Reference data stored
outside of BigQuery, like in Google Cloud Storage.
3.
Partitioned Tables: Split by a field
(e.g., date) for efficient querying of large datasets.
Notes:
·
Only DATE, TIMESTAMP, or INTEGER columns can be
used for partitioning.
· Partition
Size: Each partition in a partitioned table can have up to 1 TB of data.
· Partitioned
Tables by Date/Integer: The maximum number of partitions allowed per table
is 4,000.
· Null
Values: Null values are not allowed in partitioning columns.
4.
Clustered Tables: Sorted by one or more
columns to optimize query performance.
5. Temporary Tables: Short-lived tables used for intermediate query results.
Q.
How many types of views you can create in BigQuery?
1.
Standard View: A virtual table based on a
SQL query, with no data stored.
Example: CREATE VIEW
project_id.dataset_id.view_name AS
SELECT column1, column2 FROM dataset.table
WHERE conditions;
2.
Materialized View: A stored version of a
query result, automatically updated for faster performance.
Example: CREATE MATERIALIZED VIEW
project_id.dataset_id.materialized_view_name AS
SELECT column1, column2 FROM dataset.table
WHERE conditions;
3.
Authorized Views:
Authorized Views in BigQuery
allow you to provide users access to a view that references sensitive or
restricted data, without directly granting access to the underlying tables or
datasets.
Example:
# Granting the authorized view
access to the underlying dataset (dataset1)
bq add-iam-policy-binding
project1:dataset1 \
--member="user:example@domain.com"
--role="roles/bigquery.dataViewer"
-- Querying the authorized view
in dataset2
SELECT * FROM
`project1.dataset2.view_name`;