An end-to-end data engineering project collecting structured data on motorsport championships, teams, circuits, and events — from raw HTML scraping to a normalized SQLite database and an interactive Streamlit dashboard.
The project is structured around real data engineering principles: modular scrapers, a cleaning pipeline, schema-driven storage, and a visualization layer on top.
Scraping follows polite-crawling practices: respecting robots.txt, adding request delays, and using identifiable User-Agent headers.
scrapers/ → HTTP extraction (Wikipedia wikitables)
pipelines/ → Cleaning, normalization, orchestration
database/ → SQL schema + SQLite file (motorsport.db)
data/raw/ → Unversioned raw CSVs from scrapers
data/processed/ → Normalized CSVs ready for DB load
notebooks/ → Exploratory analysis
dashboard/ → Streamlit app
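The scraping practices named above (robots.txt, request delays, an identifiable User-Agent) could be wired together roughly as follows. This is a minimal sketch, not the project's actual scraper code: it uses the standard library's `urllib` in place of `requests` to stay dependency-free, and `USER_AGENT`, `build_robots`, and `PoliteFetcher` are hypothetical names introduced only for illustration.

```python
import time
import urllib.request
from urllib.robotparser import RobotFileParser

# Hypothetical identity string; replace with a real project name and contact.
USER_AGENT = "motorsport-data-bot/0.1 (+mailto:maintainer@example.org)"


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PoliteFetcher:
    """Checks robots.txt, spaces out requests, and sends a User-Agent."""

    def __init__(self, robots, delay_seconds=1.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.robots = robots
        self.delay_seconds = delay_seconds
        self._sleep = sleep          # injectable so the delay can be tested
        self._clock = clock
        self._last_request = None

    def allowed(self, url: str) -> bool:
        return self.robots.can_fetch(USER_AGENT, url)

    def wait_turn(self) -> None:
        """Sleep until at least delay_seconds have passed since the last call."""
        if self._last_request is not None:
            remaining = self.delay_seconds - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()

    def fetch(self, url: str) -> str:
        if not self.allowed(url):
            raise PermissionError(f"robots.txt disallows {url}")
        self.wait_turn()
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")
```

Injecting `sleep` and `clock` is a design choice made here so the delay logic can be exercised without real waiting or network access.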
The pipeline is orchestrated by a single entry point (build_dataset.py) that accepts CLI flags for partial runs:
# Full pipeline
python pipelines/build_dataset.py
# Only Wikipedia championship data
python pipelines/build_dataset.py --only wiki
# Skip teams (faster iteration)
python pipelines/build_dataset.py --skip-teams
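One way the two flags shown above might be parsed is sketched below with `argparse`. Only `--only` and `--skip-teams` come from the commands above; the step names in `ALL_STEPS` and the helper `select_steps` are assumptions made for illustration, and the real build_dataset.py may organise its steps differently.

```python
import argparse

# Hypothetical step names; only "wiki" and "teams" are implied by the flags.
ALL_STEPS = ["wiki", "circuits", "teams"]


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Build the motorsport dataset.")
    parser.add_argument("--only", default=None,
                        help="run a single step, e.g. 'wiki'")
    parser.add_argument("--skip-teams", action="store_true",
                        help="skip the teams step for faster iteration")
    return parser.parse_args(argv)


def select_steps(args):
    """Return the ordered list of steps a given set of flags would run."""
    steps = list(ALL_STEPS)
    if args.only is not None:
        steps = [step for step in steps if step == args.only]
    if args.skip_teams:
        steps = [step for step in steps if step != "teams"]
    return steps


if __name__ == "__main__":
    print(select_steps(parse_args()))
```

Keeping step selection in a pure function separate from argument parsing makes partial runs easy to check without touching the network or the database.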
Each scraper targets a specific Wikipedia article structure (e.g., "List of … series" pages exposing wikitable HTML). The cleaning layer normalizes column names, deduplicates rows, fixes encoding issues, and outputs consistent CSVs.
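The extraction and cleaning steps described above can be illustrated with a small dependency-free sketch. It is an assumption-laden stand-in, not the project's code: it uses the standard library's `html.parser` instead of BeautifulSoup and plain dictionaries instead of pandas, it ignores nested tables and rowspan/colspan, and it treats "encoding issues" narrowly as Unicode normalization (for example non-breaking spaces). All function and class names are hypothetical.

```python
import re
import unicodedata
from html.parser import HTMLParser


class WikitableParser(HTMLParser):
    """Collect rows of the first <table class="wikitable"> as lists of text."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_table = False
        self._done = False
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "wikitable" in classes and not self._done:
            self._in_table = True
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._in_table and tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if not self._in_table:
            return
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row, self._cell = None, None
        elif tag == "table":
            self._in_table, self._done = False, True


def parse_wikitable(html: str):
    parser = WikitableParser()
    parser.feed(html)
    return parser.rows


def normalize_column(name: str) -> str:
    """'Founding Year' -> 'founding_year'."""
    name = unicodedata.normalize("NFKC", name).strip().lower()
    return re.sub(r"[^a-z0-9]+", "_", name).strip("_")


def clean_rows(rows):
    """Normalize headers and values, then drop malformed and duplicate rows."""
    header = [normalize_column(cell) for cell in rows[0]]
    seen, records = set(), []
    for raw in rows[1:]:
        values = tuple(unicodedata.normalize("NFKC", v).strip() for v in raw)
        if len(values) != len(header) or values in seen:
            continue
        seen.add(values)
        records.append(dict(zip(header, values)))
    return records
```

Normalizing values before deduplicating matters here: two rows that differ only by a non-breaking space would otherwise both survive.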
The SQLite database covers four main entities:
championships: name, type, founding year, governing body
circuits: name, location, country, length
teams: constructor, nationality, active seasons
events: calendar placeholder (extended in future versions)
All tables are replaced on each pipeline run to ensure reproducibility.
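As a rough illustration of the replace-on-every-run behaviour, the sketch below drops and recreates a table before loading it, using the standard `sqlite3` module. The column names follow the entity list above, but the column types, the `SCHEMA` dictionary, and the `replace_table` helper are assumptions rather than the project's actual schema file; the events placeholder table is left out.

```python
import sqlite3

# Assumed column types; the real SQL schema may differ.
SCHEMA = {
    "championships": "name TEXT, type TEXT, founding_year INTEGER, governing_body TEXT",
    "circuits": "name TEXT, location TEXT, country TEXT, length REAL",
    "teams": "constructor TEXT, nationality TEXT, active_seasons TEXT",
}


def replace_table(conn: sqlite3.Connection, table: str, rows: list) -> None:
    """Drop, recreate, and reload one table so reruns give the same result."""
    if table not in SCHEMA:
        # Table names are interpolated into SQL, so only known names are allowed.
        raise ValueError(f"unknown table: {table}")
    columns = [definition.split()[0] for definition in SCHEMA[table].split(",")]
    placeholders = ", ".join("?" for _ in columns)
    with conn:  # commit on success, roll back on error
        conn.execute(f"DROP TABLE IF EXISTS {table}")
        conn.execute(f"CREATE TABLE {table} ({SCHEMA[table]})")
        conn.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})",
            [tuple(row[column] for column in columns) for row in rows],
        )


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    sample = [{"constructor": "Team A", "nationality": "X", "active_seasons": "2001-2005"}]
    replace_table(connection, "teams", sample)
    replace_table(connection, "teams", sample)  # second run replaces, not appends
    print(connection.execute("SELECT COUNT(*) FROM teams").fetchone()[0])
```

Because each run starts from an empty table, loading the same processed CSVs twice leaves the same rows rather than duplicates.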
The dashboard provides an interactive interface to explore the loaded data.
Libraries: requests, BeautifulSoup, pandas, sqlite3
Copyright © 2026