This project demonstrates a lightweight, scalable framework for evaluating and monitoring third-party data vendors using standardized data quality metrics, SLA monitoring, and side-by-side vendor bake-offs.
The goal is to enable data-driven vendor decisions, improve downstream reliability, and maintain a clear audit trail for compliance and internal transparency.
- Establish a clear, measurable definition of data quality
- Compare multiple vendors using quantitative bake-off metrics
- Monitor SLA performance and identify breaches early
- Translate business requirements into technical evaluation logic
- Provide transparent documentation for audits and stakeholder alignment
To run the SQL notebook, open `models/Exploratory_data.ipynb`.
To view the management report, open `output/management_report.md`.
To view the output files from the data analysis, see the `output/` folder.
vendor-data-quality/
├── README.md
├── data/
│   ├── vendor_a.csv
│   ├── vendor_b.csv
│   └── united_states.csv
├── models/
│   └── Exploratory_data.ipynb
├── docs/
│   ├── data_quality_definition.md
│   ├── vendor_bakeoff_methodology.md
│   └── data_lineage.md
└── output/
    ├── SLA_breach_report.csv
    ├── vendor_missing_value_flag.csv
    └── management_report.md
Each vendor provides a dataset resembling criminal records, with the following schema:
| Column | Description |
|---|---|
| record_id | Unique record identifier |
| vendor | Data provider name |
| county | Jurisdiction |
| dob | Date of birth (PII) |
| ssn | Social Security Number (PII) |
| disposition | Case outcome |
| record_date | Source record timestamp |
| ingest_time | Time data was ingested |
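
As a quick sanity check before any scoring, a vendor file's header can be compared against the schema above. The following is a minimal sketch using only the standard library; the helper name `missing_columns` and the in-memory sample are illustrative assumptions, not part of the project's notebook.

```python
import csv
import io

# Column names taken from the schema table above.
EXPECTED_COLUMNS = [
    "record_id", "vendor", "county", "dob", "ssn",
    "disposition", "record_date", "ingest_time",
]

def missing_columns(csv_file):
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [col for col in EXPECTED_COLUMNS if col not in header]

# Hypothetical in-memory sample standing in for a vendor CSV file.
sample = io.StringIO(
    "record_id,vendor,county,dob,ssn,disposition,record_date\n"
    "1,vendor_a,Cook,1980-01-01,000-00-0000,Dismissed,2024-01-01T00:00:00\n"
)
print(missing_columns(sample))  # ['ingest_time']
```

With a real file, the same function could be called as `missing_columns(open("data/vendor_a.csv", newline=""))`.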
Mock data intentionally includes:
- Missing PII
- Delayed ingestion
- Inconsistent dispositions between vendors

This simulates real-world vendor data variability.
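
Two of the planted issues, missing PII and inconsistent dispositions, could be surfaced with checks along these lines. This is a sketch under stated assumptions: an empty string means "missing", both vendors share the same `record_id` for the same case, and dispositions are compared case-insensitively. The function names and sample rows are hypothetical.

```python
PII_FIELDS = ("dob", "ssn")

def missing_pii(record):
    """Return the PII fields that are empty or absent in one record."""
    return [f for f in PII_FIELDS if not (record.get(f) or "").strip()]

def disposition_conflicts(vendor_a_rows, vendor_b_rows):
    """Return record_ids whose disposition differs between the two vendors.

    Assumes both vendors use the same record_id for the same underlying case.
    """
    b_by_id = {r["record_id"]: r["disposition"] for r in vendor_b_rows}
    conflicts = []
    for row in vendor_a_rows:
        other = b_by_id.get(row["record_id"])
        if other is None:
            continue  # record not supplied by vendor B, nothing to compare
        if row["disposition"].strip().lower() != other.strip().lower():
            conflicts.append(row["record_id"])
    return conflicts

# Hypothetical rows with obviously fake PII values.
vendor_a_rows = [
    {"record_id": "1", "dob": "1980-01-01", "ssn": "", "disposition": "Dismissed"},
    {"record_id": "2", "dob": "1975-05-05", "ssn": "000-00-0000", "disposition": "Convicted"},
]
vendor_b_rows = [
    {"record_id": "1", "dob": "1980-01-01", "ssn": "000-00-0001", "disposition": "dismissed"},
    {"record_id": "2", "dob": "1975-05-05", "ssn": "000-00-0000", "disposition": "Dismissed"},
]

print([(r["record_id"], missing_pii(r)) for r in vendor_a_rows if missing_pii(r)])
# [('1', ['ssn'])]
print(disposition_conflicts(vendor_a_rows, vendor_b_rows))
# ['2']
```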

Data Quality Dimensions

Data quality is defined across four weighted dimensions:
- PII Completeness
  - Presence of DOB and SSN
- Disposition Accuracy
  - Valid and interpretable case outcomes
- Freshness (Latency)
  - Time difference between record date and ingestion
- Coverage
  - Jurisdictional availability
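
To make the freshness and coverage dimensions concrete, here is one way they might be computed. The 24-hour SLA threshold, the ISO 8601 timestamp format, and the reference county list are assumptions chosen for illustration; the project's actual thresholds and reference data may differ.

```python
from datetime import datetime, timedelta

# Hypothetical SLA: records must be ingested within 24 hours of the record date.
SLA_THRESHOLD = timedelta(hours=24)

def latency(record):
    """Time between the source record timestamp and ingestion."""
    record_date = datetime.fromisoformat(record["record_date"])
    ingest_time = datetime.fromisoformat(record["ingest_time"])
    return ingest_time - record_date

def freshness_rate(records):
    """Share of records ingested within the SLA threshold (0.0 to 1.0)."""
    if not records:
        return 0.0
    on_time = sum(1 for r in records if latency(r) <= SLA_THRESHOLD)
    return on_time / len(records)

def coverage_rate(records, reference_counties):
    """Share of reference counties that appear at least once in the records."""
    reference = set(reference_counties)
    if not reference:
        return 0.0
    seen = {r["county"] for r in records}
    return len(seen & reference) / len(reference)

# Hypothetical sample: one on-time record and one ingested 48 hours late.
rows = [
    {"county": "Cook", "record_date": "2024-01-01T00:00:00", "ingest_time": "2024-01-01T06:00:00"},
    {"county": "Cook", "record_date": "2024-01-01T00:00:00", "ingest_time": "2024-01-03T00:00:00"},
]
print(freshness_rate(rows))                    # 0.5
print(coverage_rate(rows, ["Cook", "Harris"]))  # 0.5
```

Records where `latency(r) > SLA_THRESHOLD` are the natural candidates for an SLA breach report.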
Data Quality Score =
35% PII Completeness + 30% Disposition Accuracy + 20% Freshness + 15% Coverage
Weights can be adjusted based on jurisdiction risk, compliance requirements, or downstream product sensitivity.
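
The weighted score can be sketched as a small function that takes per-dimension metrics on a 0.0 to 1.0 scale. The weights come from the formula above; the dictionary keys, the function name, and the sample metric values are illustrative assumptions.

```python
# Weights from the formula above; keys are hypothetical metric names.
DEFAULT_WEIGHTS = {
    "pii_completeness": 0.35,
    "disposition_accuracy": 0.30,
    "freshness": 0.20,
    "coverage": 0.15,
}

def data_quality_score(metrics, weights=DEFAULT_WEIGHTS):
    """Weighted sum of per-dimension metrics, each expected in [0.0, 1.0]."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(weights[name] * metrics[name] for name in weights)

# Hypothetical metric values for a single vendor.
vendor_a_metrics = {
    "pii_completeness": 0.9,
    "disposition_accuracy": 0.8,
    "freshness": 0.5,
    "coverage": 1.0,
}
print(round(data_quality_score(vendor_a_metrics), 3))  # 0.805
```

Passing a different `weights` dictionary is how a jurisdiction-specific or compliance-driven weighting could be applied; the sum check guards against weights that no longer total 100%.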