Library of Congress Prod-ifier
Create a modular job to run in parallel to prod-ify a db
2 Minutes, 8 Seconds
2024-06-21 03:36 +0000
Loc Prodifier
Overview
Loc Prodifier is a Python script designed to merge data from staging tables into production tables in Google BigQuery without inserting duplicate records. It uses the Google Cloud BigQuery Python client and can be run both locally and in Google Cloud Run. The script is designed to be flexible and scalable, allowing for parallel execution across multiple tables using Google Cloud Workflows.
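At its core the job boils down to a single MERGE statement issued through the BigQuery Python client. The sketch below shows the general shape of that operation; it assumes both tables live in the same dataset and share an `id` key column, which the actual script may handle differently (for example by deriving the match condition from the table schema).

# Minimal sketch of the merge step, assuming a shared `id` key column.
# The real script's match condition and schema handling may differ.
from google.cloud import bigquery

def merge_staging_into_prod(client: bigquery.Client, dataset_id: str,
                            staging_table_id: str, prod_table_id: str) -> None:
    query = f"""
        MERGE `{dataset_id}.{prod_table_id}` AS prod
        USING `{dataset_id}.{staging_table_id}` AS staging
        ON prod.id = staging.id
        WHEN NOT MATCHED THEN
          INSERT ROW
    """
    # result() blocks until the MERGE finishes and raises on failure.
    client.query(query).result()

Because a MERGE written this way only inserts rows whose key is absent from the production table, re-running the job against the same staging data is effectively idempotent.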
Features
- Merges data from staging tables into production tables without duplicates.
- Supports parallel execution for multiple tables.
- Can be run locally with custom credentials or deployed in Google Cloud Run.
- Easily configurable through command-line arguments.
Prerequisites
- Python 3.7 or higher
- Google Cloud SDK
- Docker
- A Google Cloud project with BigQuery and Cloud Run enabled
- Google Artifact Registry enabled in your Google Cloud project
Installation
- Clone the repository:
git clone https://github.com/justin-napolitano/loc_prodifier.git
cd loc_prodifier
- Install the required Python packages:
pip install -r requirements.txt
Usage
Running Locally
To run the script locally, ensure you have your Google Cloud credentials JSON file and pass the required arguments:
python your_script.py --dataset_id your_dataset_id --staging_table_id your_staging_table_id --prod_table_id your_prod_table_id --local
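The exact flag handling lives in the script itself, but a plausible wiring looks like the following. The argument names match the command above, while the `credentials.json` path and the behaviour of `--local` are assumptions for illustration only.

# Hypothetical CLI wiring matching the flags shown above; the real script may differ.
import argparse
from google.cloud import bigquery
from google.oauth2 import service_account

def build_client(local: bool) -> bigquery.Client:
    if local:
        # Placeholder path: point this at your downloaded service-account key.
        creds = service_account.Credentials.from_service_account_file("credentials.json")
        return bigquery.Client(credentials=creds, project=creds.project_id)
    # Inside Cloud Run, Application Default Credentials are picked up automatically.
    return bigquery.Client()

parser = argparse.ArgumentParser()
parser.add_argument("--dataset_id", required=True)
parser.add_argument("--staging_table_id", required=True)
parser.add_argument("--prod_table_id", required=True)
parser.add_argument("--local", action="store_true")
args = parser.parse_args()

client = build_client(args.local)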
Running in Docker
- Build the Docker image:
docker build -t my-bigquery-script .
- Run the Docker container:
docker run --rm my-bigquery-script --dataset_id your_dataset_id --staging_table_id your_staging_table_id --prod_table_id your_prod_table_id --local
Deploying to Google Cloud Run
- Create the `cloudrun.yaml` Cloud Build configuration file:
steps:
  # Step 1: Create the Artifact Registry repository if it doesn't exist
  - name: 'gcr.io/cloud-builders/gcloud'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        if ! gcloud artifacts repositories describe python-loc-prodifier --location=us-west2 > /dev/null 2>&1; then
          gcloud artifacts repositories create python-loc-prodifier --repository-format=docker --location=us-west2
        else
          echo "Repository python-loc-prodifier already exists"
        fi
  # Step 2: Build the Docker image
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'us-west2-docker.pkg.dev/smart-axis-421517/python-loc-prodifier/python-loc-prodifier:dev', '.']
  # Step 3: Push the Docker image to Artifact Registry
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'us-west2-docker.pkg.dev/smart-axis-421517/python-loc-prodifier/python-loc-prodifier:dev']
  # Step 4: Create the Cloud Run job
  - name: 'gcr.io/cloud-builders/gcloud'
    args: [
      'run', 'jobs', 'create', 'python-loc-prodifier-job',
      '--image', 'us-west2-docker.pkg.dev/smart-axis-421517/python-loc-prodifier/python-loc-prodifier:dev',
      '--region', 'us-west2'
    ]
images:
  - 'us-west2-docker.pkg.dev/smart-axis-421517/python-loc-prodifier/python-loc-prodifier:dev'
- Submit the build:
gcloud builds submit --config cloudrun.yaml .
- Execute the Cloud Run job:
gcloud run jobs execute python-loc-prodifier-job --region us-west2
- Check logs for job execution:
gcloud logging read "resource.type=cloud_run_job AND resource.labels.job_name=python-loc-prodifier-job" --limit 50
Running with Google Cloud Workflows
To execute the script for multiple tables in parallel using Google Cloud Workflows:
- Create the `workflow.yaml` file:
main:
  params: [dataset_id, tables, staging_table_suffix, prod_table_suffix]
  steps:
    - parallel:
        steps:
          - parallel_task:
              call: http.post
              args:
                url: "https://your-region-run.googleapis.com/v1/projects/your_project_id/locations/your-region/services/my-bigquery-script:run"
                headers:
                  Authorization: "Bearer $(ref(auth.access_token))"
                body:
                  dataset_id: ${dataset_id}
                  staging_table_id: ${table}.staging
                  prod_table_id: ${table}.prod
              each:
                - table: ${tables}
    - return: "Workflow executed"
auth:
  steps:
    - get_token:
        call: google.auth.access_token
- Deploy the workflow:
gcloud workflows deploy my-bigquery-workflow --source=workflow.yaml --location=your-region
- Execute the workflow:
gcloud workflows execute my-bigquery-workflow --location=your-region --data='{
"dataset_id": "your_dataset_id",
"tables": ["table1", "table2", "table3"],
"staging_table_suffix": "staging",
"prod_table_suffix": "prod"
}'
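The suffix parameters are just a naming convention: for each entry in `tables`, one parallel task operates on a staging/production pair derived from the table name. A hypothetical illustration of that mapping, following the dot-suffix convention used in the workflow body above (the real script may join the names differently):

# Hypothetical mapping from workflow parameters to per-table arguments.
tables = ["table1", "table2", "table3"]
staging_table_suffix = "staging"
prod_table_suffix = "prod"

for table in tables:
    staging_table_id = f"{table}.{staging_table_suffix}"  # e.g. "table1.staging"
    prod_table_id = f"{table}.{prod_table_suffix}"        # e.g. "table1.prod"
    print(staging_table_id, prod_table_id)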
License
This project is licensed under the MIT License. See the LICENSE file for details.