Motorsport Data Pipeline (ETL + Dashboard)

An end-to-end data engineering project that collects structured data on motorsport championships, teams, circuits, and events, going from raw HTML scraping through a normalized SQLite database to an interactive Streamlit dashboard.

The project follows common data engineering practice: modular scrapers, a cleaning pipeline, schema-driven storage, and a visualization layer on top.

Goals

Automate the collection of motorsport data from public web sources (primarily Wikipedia).
Build a clean, normalized relational database from heterogeneous HTML tables.
Deliver an interactive dashboard for exploring championships, circuits, and teams.
Practice good scraping hygiene: respect robots.txt, add delays between requests, and send an identifiable User-Agent header.
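
The three hygiene rules in the last goal can be sketched with the standard library alone. This is a minimal illustration, not the project's actual scraper code: the User-Agent string, the delay value, and the function names build_robot_parser and polite_get are all assumptions made for the example.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical values: the project's real User-Agent and delay are not documented here.
USER_AGENT = "motorsport-data-pipeline/0.1 (contact: you@example.com)"
REQUEST_DELAY_SECONDS = 1.0


def build_robot_parser(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_get(url, parser, delay=REQUEST_DELAY_SECONDS):
    """Fetch url only if robots.txt allows it, then pause before returning."""
    if not parser.can_fetch(USER_AGENT, url):
        raise PermissionError(f"robots.txt disallows {url}")
    # Identify the scraper so site operators can reach its maintainer.
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        body = response.read()
    # Sleep after every request so consecutive calls are spaced out.
    time.sleep(delay)
    return body
```

Checking robots.txt before the request means a disallowed URL never generates traffic at all, and putting the delay inside the fetch helper makes it hard for an individual scraper to forget it.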

Architecture

scrapers/        → HTTP extraction (Wikipedia wikitables)
pipelines/       → Cleaning, normalization, orchestration
database/        → SQL schema + SQLite file (motorsport.db)
data/raw/        → Unversioned raw CSVs from scrapers
data/processed/  → Normalized CSVs ready for DB load
notebooks/       → Exploratory analysis
dashboard/       → Streamlit app
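
To make "normalized" concrete, here is a rough sketch of what a schema in database/ might look like. The table and column names are hypothetical, chosen only to match the entities named above (championships, circuits, events); the project's real schema may differ.

```python
import sqlite3

# Hypothetical schema: each entity gets its own table, and events
# reference championships and circuits by id instead of repeating names.
SCHEMA = """
CREATE TABLE IF NOT EXISTS championships (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS circuits (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE,
    country TEXT
);
CREATE TABLE IF NOT EXISTS events (
    id              INTEGER PRIMARY KEY,
    championship_id INTEGER NOT NULL REFERENCES championships(id),
    circuit_id      INTEGER NOT NULL REFERENCES circuits(id),
    season          INTEGER NOT NULL,
    name            TEXT NOT NULL
);
"""


def create_database(path=":memory:"):
    """Open a SQLite database at path and make sure the tables exist."""
    conn = sqlite3.connect(path)
    # SQLite leaves foreign key checks off unless each connection enables them.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

Calling create_database("database/motorsport.db") would create the file on first run; the default ":memory:" path is convenient for tests and notebooks.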

ETL Pipeline

The pipeline is orchestrated by a single entry point (pipelines/build_dataset.py) that accepts CLI flags for partial runs:

# Full pipeline
python pipelines/build_dataset.py

# Only Wikipedia championship data
python pipelines/build_dataset.py --only wiki

# Skip teams (faster iteration)
python pipelines/build_dataset.py --skip-teams
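
One plausible way to wire up these flags is with argparse. The sketch below is an assumption about how build_dataset.py could be organized, not its actual contents: the stage list and the select_stages helper are hypothetical, and only the "wiki" value for --only appears in the commands above.

```python
import argparse

# Hypothetical stage names; only "wiki" is confirmed by the documented commands.
STAGES = ("wiki", "teams")


def parse_args(argv=None):
    """Parse the command line flags used for partial pipeline runs."""
    parser = argparse.ArgumentParser(description="Build the motorsport dataset.")
    parser.add_argument("--only", choices=STAGES, default=None,
                        help="run a single stage instead of the full pipeline")
    parser.add_argument("--skip-teams", action="store_true",
                        help="skip the teams stage for faster iteration")
    return parser.parse_args(argv)


def select_stages(args):
    """Return the ordered list of stages that should run for these flags."""
    stages = [args.only] if args.only is not None else list(STAGES)
    if args.skip_teams:
        stages = [stage for stage in stages if stage != "teams"]
    return stages


if __name__ == "__main__":
    for stage in select_stages(parse_args()):
        print(f"running stage: {stage}")
```

Keeping stage selection in a pure function separate from argument parsing makes the flag logic easy to unit test without spawning a subprocess.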

Each scraper targets a specific Wikipedia article structure (e.g., "List of … series" pages exposing wikitable HTML). The cleaning layer normalizes column names, deduplicates rows, fixes encoding issues, and writes consistently formatted CSVs.
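
The cleaning steps can be illustrated with a small standard-library sketch. It is an approximation under stated assumptions: the function names are hypothetical, "encoding issues" is interpreted narrowly as Unicode normalization (non-breaking spaces and similar), and the real pipeline may well use pandas instead.

```python
import re
import unicodedata


def normalize_column(name):
    """Turn a raw header such as 'Circuit Name[1]' into 'circuit_name'."""
    name = re.sub(r"\[[^\]]*\]", "", name)       # drop footnote markers like [1]
    name = unicodedata.normalize("NFKC", name)   # fold non-breaking spaces etc.
    name = re.sub(r"\W+", "_", name.strip().lower())
    return name.strip("_")


def clean_value(value):
    """Normalize Unicode forms and collapse runs of whitespace in a cell."""
    value = unicodedata.normalize("NFKC", value)
    return re.sub(r"\s+", " ", value).strip()


def clean_rows(rows):
    """Normalize headers and values, then drop rows that become identical."""
    cleaned, seen = [], set()
    for row in rows:
        record = {normalize_column(k): clean_value(v) for k, v in row.items()}
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            cleaned.append(record)
    return cleaned
```

Deduplicating after normalization matters: two rows that differ only by a stray non-breaking space or a footnote marker in the header collapse into one record.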