In this scenario, we are going to transfer the on-premise
MySQL database to BigQuery. As I am gcp certified and Data Engineer cum python
developer so I will will use both gcp and python concept for this scenario. I
do not prefer to use any third-party tools. I believe in KISS(Keep it Simple
and Stupid )technique to solve the issues. We will discuss each step one by
one.
steps:
1. We have created a database on on-premise
MySQL, for the reference see the below screenshot.
2. Take the dump of desire database by
using below command. Here I have flipkart database.
mysqldump
--databases flipkart --quick --lock-tables=false > full-backup-$(date
+%F).sql -u root -pIndia@123
Note:
Don’t give the
space between password parameter which is -p and your root account password.
If you want to
take backup of table into json format then run the below command
mysql-to-json -d mysql -u root -p -e 'SELECT * FROM
information_schema.tables' > tables.json
3. Now we need to push this file into
Cloud Storage. But here is a problem BigQuery only supports :
I.
Avro
II.
CSV
III.
JSON (newline
delimited only)
IV.
ORC
V. Parquet
V. Parquet
So we need
to convert the mysql dump file into .sql to .json or .csv.
4. For the the transform the .sql file into .csv or .json we have a python scripts which is given below
4. For the the transform the .sql file into .csv or .json we have a python scripts which is given below
Note: Here
we are converting into csv.
command : python mysql2csv.py /path/mysql_dump.sql
5.
Now we need to
connect push the .csv file into GCP bucket. For that we need to install GCP CLI
on on-premise instance.
A). Update YUM with Cloud SDK repo information: just copy below command and Run it.
A). Update YUM with Cloud SDK repo information: just copy below command and Run it.
sudo tee -a /etc/yum.repos.d/google-cloud-sdk.repo <<
EOM
[google-cloud-sdk]
name=Google Cloud SDK
baseurl=https://packages.cloud.google.com/yum/repos/cloud-sdk-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpghttps://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOM
B). Install the Cloud SDK: Run the below command, it will ask to pass some parameters so put it accordingly.
B). Install the Cloud SDK: Run the below command, it will ask to pass some parameters so put it accordingly.
yum
install google-cloud-sdk
screenshots:
C). Run gcloud init to get started:
Command: gcloud init
It will ask for verification, copy the link from current
terminal and paste it into browser and authenticate with you gcp account
credentials. it looks like below.
D).
Put the gcp credential and you will get verification
link, for reference find below the screenshot.
Now
Our on-premise environment successfully connected with gcp environment.
E). Now push the data into Cloud Storage by using
below command:
Commant : gsutil cp dump.csv gs://sales-007
So we have successfully imported the data into GCP Storage and we easily import the data into BigQuery by Selecting the import from Cloud Storage option.
great work !
ReplyDeleteThanks for sharing this Informative content. Well explained. Got to learn new things from your article
ReplyDeleteGCP Training Online
Online GCP Training