COVID-Data-Pipeline

End-To-End ETL Solution with AWS Glue, Athena and Redshift

Project Overview

This project is an end-to-end ETL (Extract, Transform, Load) pipeline designed to process COVID-19 data from multiple sources, transform it into structured formats, and load it into a cloud data warehouse for analysis. The pipeline leverages AWS Glue, Amazon Athena, and Amazon Redshift for efficient data processing, querying, and storage.

Architecture

  1. Data Extraction:

    • 10 raw datasets stored in Amazon S3.

    • AWS Glue crawlers extract metadata and create tables.

  2. Data Transformation:

    • Data queried in Amazon Athena to create:

      • 3 dimension tables (dimRegion, dimHospital, dimDate)

      • 1 fact table (factCovid)

    • Query results are transformed into pandas DataFrames in Python.

  3. Data Loading:

    • Transformed data is loaded back to S3 and subsequently into Amazon Redshift using the COPY command.

  4. Data Analysis:

    • The Redshift query editor is used to perform further analysis and generate insights.
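As a sketch of steps 2–3, the snippet below reshapes a small sample of raw data into the factCovid and dimRegion tables with pandas and composes the Redshift COPY statement used for loading. The bucket name, file paths, and IAM role ARN are placeholders, not values from this project.

```python
import pandas as pd

# Sample rows standing in for the raw COVID-19 data crawled into Athena.
raw = pd.DataFrame({
    "fips": [36061, 6037],
    "province_state": ["New York", "California"],
    "country_region": ["US", "US"],
    "confirmed": [1000, 2000],
    "deaths": [10, 20],
    "recovered": [900, 1800],
})

# Fact table: one row per region with the case metrics.
factCovid = raw[["fips", "province_state", "country_region",
                 "confirmed", "deaths", "recovered"]]

# Dimension table: region attributes only, deduplicated.
dimRegion = raw[["fips", "province_state", "country_region"]].drop_duplicates()

# In the pipeline these frames are written to S3 as CSV, e.g.:
# factCovid.to_csv("s3://my-covid-bucket/output/factCovid.csv", index=False)

def copy_statement(table: str, s3_path: str, iam_role: str) -> str:
    """Compose the Redshift COPY command that loads a CSV from S3."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )

print(copy_statement(
    "factCovid",
    "s3://my-covid-bucket/output/factCovid.csv",   # placeholder path
    "arn:aws:iam::123456789012:role/redshift-s3",  # placeholder role
))
```

The COPY command runs inside Redshift; pandas only prepares the CSVs it reads.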

Tech Stack

  • AWS Glue: Data extraction and transformation.

  • Amazon S3: Data storage.

  • Amazon Athena: Querying and creating tables.

  • Amazon Redshift: Cloud data warehouse.

  • Python: Data transformation and DataFrame operations.

  • VS Code: Development environment.

Fact and Dimension Tables

Fact Table: factCovid

Column         | Type   | Description
---------------|--------|------------------
fips           | bigint | FIPS code
province_state | string | State or province
country_region | string | Country
confirmed      | bigint | Confirmed cases
deaths         | bigint | Deaths
recovered      | bigint | Recovered cases

Dimension Table: dimRegion

Column         | Type   | Description
---------------|--------|------------------
fips           | bigint | FIPS code
province_state | string | State or province
country_region | string | Country

Dimension Table: dimHospital

Column        | Type   | Description
--------------|--------|---------------------
fips          | bigint | FIPS code
hospital_name | string | Name of the hospital

Dimension Table: dimDate

Column | Type   | Description
-------|--------|----------------
date   | string | Date of record
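Given the schemas above, the matching Redshift DDL can be generated mechanically by mapping Athena's types to Redshift column types (string to VARCHAR, bigint to BIGINT). The snippet below is an illustrative sketch: the column lists come from the tables above, while the VARCHAR width and helper function are assumptions, not part of the project.

```python
# Map each Athena/Glue type to a Redshift column type.
# The VARCHAR width is an assumed default, not taken from the project.
TYPE_MAP = {"bigint": "BIGINT", "string": "VARCHAR(256)"}

# Star-schema column lists, copied from the tables above.
SCHEMAS = {
    "factCovid": [("fips", "bigint"), ("province_state", "string"),
                  ("country_region", "string"), ("confirmed", "bigint"),
                  ("deaths", "bigint"), ("recovered", "bigint")],
    "dimRegion": [("fips", "bigint"), ("province_state", "string"),
                  ("country_region", "string")],
    "dimHospital": [("fips", "bigint"), ("hospital_name", "string")],
    "dimDate": [("date", "string")],
}

def create_table_sql(table: str) -> str:
    """Render a CREATE TABLE statement for one star-schema table."""
    cols = ",\n  ".join(f"{name} {TYPE_MAP[typ]}"
                        for name, typ in SCHEMAS[table])
    return f"CREATE TABLE {table} (\n  {cols}\n);"

for table in SCHEMAS:
    print(create_table_sql(table))
```

These statements would be run once in the Redshift query editor before the COPY loads described in the architecture.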