Monday, 18 November 2024

All About BigQuery

 

Hello data lover!  today’s data-driven world, organizations generate vast amounts of data from various sources—websites, IoT devices, customer interactions, and more. To make sense of this data and use it for better decisions, businesses rely on data warehousing. In this article, we will discuss one of the data warehousing services provided by Google Cloud Platform: BigQuery.

 

BigQuery is a fully managed, serverless, cloud-based data warehouse provided by Google Cloud. It is designed for fast SQL-based analytics on massive datasets. BigQuery is known for its scalability, integration with Google Cloud services, and ability to handle large-scale analytics efficiently without requiring significant infrastructure management.


Use BigQuery when:

  • You need to run fast SQL queries on large datasets (petabytes).
  • It's for analytics, reporting, or business intelligence.
  • You prefer a serverless, fully managed service with automatic scaling.
  • You need real-time analytics with streaming data.

Avoid BigQuery when:

  • You need frequent data updates or inserts (use Cloud SQL or Firestore).
  • For small-scale, transactional, or unstructured data (use Cloud Storage or Cloud SQL).




Let’s explore the internal architecture of BigQuery and understand how it works. Refer to the diagram below as we discuss each component step by step.



Brog:

Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines.

Dremel:

BigQuery takes advantage of Borg for data processing. Borg simultaneously runs thousands of Dremel jobs across one or more clusters made up of tens of thousands of machines. In addition to assigning compute capacity for Dremel jobs, Borg handles fault-tolerance.

Jupiter Network:

Big data workloads often face bottlenecks due to network speed, not just disk I/O. Since BigQuery separates its compute and storage layers, it depends on a lightning-fast network to move massive amounts of data—terabytes within seconds—from storage to compute for executing Dremel queries. This is made possible by Google’s Jupiter network, which provides BigQuery with an impressive 1 Petabit per second of total bisection bandwidth, ensuring seamless data transfer and high performance.

Colossus File-System:

BigQuery leverages Capacitor to store data in Colossus. Colossus is Google’s latest generation distributed file system and successor to GFS (Google File Systems). Colossus handles cluster-wide replication, recovery and distributed management. It provides client-driven replication and encoding. When writing data to Colossus, BigQuery makes some decision about initial sharding strategy which evolves based on the query and access patterns. Once data is written, to enable the highest availability BigQuery initiates geo-replication of data across different data centers.

Client for BQ:

There are multiple ways to retrieve/anaysis which are stored in BQ.

BQ Console/ Web UI:

From BQ console, we can directly execute the query and we can see the result based on query statement. BigQuery supports almost all SQL syntax, but some data types and functions differ from other relational query languages. Please refer to the UI console snapshot and the table creation example below.


Note: 

Web UI has many features like data preview, schema preview, table details, data Lineage etc.

Example:


Executing from Command-line:


Below I have described all options in detail:

·       bq mk --table: Command to create a table.

·       description: Optional. Adds a description to the table.

·       label: Optional. Adds labels (key-value pairs) for easier resource management.

·       project_id:dataset_id.table_name: Specifies the project ID, dataset ID, and table name. Replace these with your actual values.

·       schema.json: Points to a JSON file defining the table schema.


Interact with BQ using Python:



Note:

BigQuery can Integration with GCP services:

  1. Google Cloud Storage (GCS): Load large datasets from GCS into BigQuery.
  2. Google Cloud Pub/Sub: Stream real-time data into BigQuery.
  3. Google Cloud Dataflow: Process data before loading it into BigQuery.
  4. Google Cloud Dataproc: Use Spark/Hadoop for large-scale processing and load results into BigQuery.
  5. Google Cloud AI/ML: Integrate machine learning models with BigQuery for predictions and analysis.
  6. Google Cloud Functions: Trigger actions in BigQuery based on events.
  7. Google Cloud Spanner: Analyze transactional data from Spanner in BigQuery.
  8. Google Cloud Bigtable: Query time-series data from Bigtable in BigQuery.
  9. Google Cloud Firestore: Export and analyze Firestore data in BigQuery.
  10. Google Cloud Logging: Export logs to BigQuery for analysis.
  11. Google Cloud Data Catalog: Organize and discover BigQuery datasets.
  12. Google Cloud Composer: Automate ETL workflows involving BigQuery.
  13. Google Cloud Pub/Sub (Streaming): Stream IoT data into BigQuery for real-time analytics.

Q. How many types of tables you can create in BigQuery?

1.      Native Tables: Standard tables for storing structured data directly in BigQuery.

2.      External Tables: Reference data stored outside of BigQuery, like in Google Cloud Storage.

3.      Partitioned Tables: Split by a field (e.g., date) for efficient querying of large datasets.

 

Notes:

·       Only DATE, TIMESTAMP, or INTEGER columns can be used for partitioning.

·       Partition Size: Each partition in a partitioned table can have up to 1 TB of data.

·       Partitioned Tables by Date/Integer: The maximum number of partitions allowed per table is 4,000.

·       Null Values: Null values are not allowed in partitioning columns.

 

 

4.      Clustered Tables: Sorted by one or more columns to optimize query performance.

5.      Temporary Tables: Short-lived tables used for intermediate query results.

Q. How many types of views you can create in BigQuery?

1.      Standard View: A virtual table based on a SQL query, with no data stored.

 

Example: CREATE VIEW project_id.dataset_id.view_name AS

SELECT column1, column2 FROM dataset.table WHERE conditions;

2.      Materialized View: A stored version of a query result, automatically updated for faster performance.

Example: CREATE MATERIALIZED VIEW project_id.dataset_id.materialized_view_name AS

SELECT column1, column2 FROM dataset.table WHERE conditions;

3.      Authorized Views:

Authorized Views in BigQuery allow you to provide users access to a view that references sensitive or restricted data, without directly granting access to the underlying tables or datasets.

Example:

# Granting the authorized view access to the underlying dataset (dataset1)

bq add-iam-policy-binding project1:dataset1 \

  --member="user:example@domain.com" --role="roles/bigquery.dataViewer"

-- Querying the authorized view in dataset2

SELECT * FROM `project1.dataset2.view_name`;


0 comments:

Post a Comment