Skip to content

Emumba-Abdullah/app-devhub-data-validation-script

Repository files navigation

Excel to JSON Converter (Dev Hub)

This repository contains a converter that reads a structured Excel workbook and produces a merged JSON file of resource cards used by the Dev Hub.

This README explains how to set up, run, and customize the converter step-by-step for a new contributor.

Requirements

  • Python 3.8+ installed
  • Create a virtual environment (recommended)
  • Install dependencies:
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

Files and Structure

  • excel_to_json_converter.py: Main converter script.
  • config.py: Configuration for sheet mappings, validation, and tag inference.
  • app-dev-hub-v5.xlsx: Input Excel workbook (default path in config.py).
  • output/: Directory where generated files are written:
    • templates.json - generated JSON output
    • validation_report.txt - validation summary
    • statistics.json - conversion stats
    • converter.log - detailed run log
    • keep the source excel file in root

Quick Start

  1. Activate virtual environment (see above).
  2. Place your Excel workbook at the path specified by config.INPUT_FILE or pass it with --input.
  3. Run the converter:
python excel_to_json_converter.py
  1. Optional: disable tag inference for a single run:
python excel_to_json_converter.py --no-inference

CLI Options

  • --input: Path to the input Excel file (default from config.py).
  • --output: Path to output JSON file.
  • --validate-only: Validate an existing JSON output instead of converting.
  • --verbose: Enable debug logging.
  • --no-inference: Disable automatic tag inference for this run.

How the Converter Works (High Level)

  • First pass: Learning Path sheets are processed to collect learning-path metadata keyed by Link.
  • Second pass: Content sheets (App Dev, AI, Analytics, Blogs, Samples) are processed. Each row is keyed by Link and merged into a single resource card.
  • Tags are built from explicit tag columns, featured flag, and learning-path title (normalized). Tag inference can be enabled/disabled via config.py or --no-inference.
  • When a Link appears in both LP and a content sheet, the content sheet's Title and Description take precedence for the main title / description fields; LP metadata is attached as learningPathTitle, learningPathDescription, and optional tileNumber.

Configuration Notes (config.py)

  • SHEETS_CONFIG: Maps Excel sheet names to behavior. Edit this if your sheet names differ.
  • TAG_INFERENCE['enabled']: Turn on/off inference. The project default can be changed here.
  • VALIDATION: Adjust required fields, URL schemes, and minimum tags.

Expected Excel Columns

  • Learning Path subsheet:

    • Learning Path Title: section title used for LP tagging
    • Card title: becomes learningPathTitle
    • Description: becomes learningPathDescription
    • Link: canonical URL (used as unique key)
    • No.: optional tile number for LP entries
    • Featured: optional Yes to mark as featured
  • Content subsheets (App Dev, AI, Blogs, ...):

    • Title, Link, Description, optional Tags, Featured, Priority/Importance, Last Updated Date

Tag Rules

  • Explicit tags: from Tags column and any Unnamed columns — multiple tags can be separated by , or ;.
  • Featured: any row with Featured value yes (case-insensitive) will include the featured tag.
  • Learning Path tag: normalized value of Learning Path Title is added for LP entries.
  • Inference: when enabled, tags may also be inferred from URL/title/description patterns configured in config.TAG_INFERENCE.

Troubleshooting

  • If the converter complains about missing required fields, open config.py and inspect VALIDATION['required_fields'].
  • If templates.json looks wrong, check output/converter.log for warnings and output/validation_report.txt for errors.

Example Runs

Generate JSON and save outputs:

python excel_to_json_converter.py --input app-dev-hub-v5.xlsx --output output/templates.json

Validate an existing output:

python excel_to_json_converter.py --validate-only --output output/templates.json

Contributing

  • Make small, focused changes and run the converter locally to verify output.
  • Update README.md when you add new sheets or change expected columns.

Next steps I can help with

  • Add unit tests for tag extraction and LP merging.
  • Add an example Excel file with sample rows.
  • Automate a pre-commit check that runs the converter in validation-only mode.

Created/updated by automation. Ask me to adjust wording or add examples you want included.

Excel to JSON Converter for Dev Hub Content

A robust Python application that converts Excel files containing developer hub content into structured JSON format with comprehensive tag processing, validation, and error handling.

📋 Table of Contents

🎯 Overview

This converter transforms Excel content (learning paths, blogs, app development resources, etc.) into a unified JSON array with enriched metadata and intelligent tag processing. It's designed for developer hub websites where content is managed in Excel for easy editing by non-technical users, and JSON is consumed by frontend applications.

✨ Features

  • Multi-sheet Processing: Handles different content types (Learning Paths, Blogs, App Dev, AI, Analytics, Samples)
  • Intelligent Tag Processing:
    • Automatic tag normalization
    • Tag inference from URLs and content
    • Multi-source tag merging
  • Comprehensive Validation: Entry-level and output-level validation
  • Detailed Reporting: Statistics, validation reports, and conversion logs
  • Error Recovery: Continues processing even if some rows fail
  • Production-Ready: Robust error handling and logging

🚀 Installation

Prerequisites

  • Python 3.8 or higher
  • pip (Python package installer)

Install Dependencies

pip install -r requirements.txt

This will install:

  • pandas - Excel file reading and data manipulation
  • openpyxl - Excel file support (.xlsx)
  • jsonschema - JSON validation

🏃 Quick Start

  1. Place your Excel file in the project directory (or specify path with --input)

  2. Run the converter:

    python excel_to_json_converter.py
  3. Check the output in the output/ directory:

    • templates.json - Converted JSON data
    • statistics.json - Conversion statistics
    • validation_report.txt - Validation results
    • converter.log - Detailed logs

📊 Excel Structure

Supported Sheets

1. Learning Path

  • Purpose: Sequential learning content with tile numbers
  • Columns:
    • Learning Path Title - Groups cards
    • No. - Sequential number (not used)
    • Card title - Entry title (REQUIRED)
    • Link - URL (REQUIRED)
    • Description - Detailed description
    • Tags + Unnamed: 9-19 - Tag columns

Special Behavior:

  • Auto-adds learning-path tag
  • Auto-adds normalized learning path title as tag
  • Auto-adds flexibleserver tag
  • Assigns sequential tileNumber within each group

2-4. App Dev, AI, Analytics

  • Purpose: Content resources
  • Columns:
    • Topic Area - Category (for context)
    • Title - Entry title (REQUIRED)
    • Link - URL (REQUIRED)
    • Description - Detailed description
    • Priority/Importance - P0, P1, P2, etc.
    • Last Updated Date - Date field
    • Tags + Unnamed: 9-19 - Tag columns

Auto-added Tags:

  • App Dev → app-dev
  • AI → genai
  • Analytics → analytics

5. Blogs

  • Purpose: Blog content
  • Columns: Same as content sheets, plus:
    • Featured - If "Yes", adds "featured" tag
  • Auto-added Tags: blog

6. Samples

  • Purpose: Code samples
  • Columns: Same as content sheets
  • Auto-added Tags: sample

⚙️ Configuration

Edit config.py to customize:

File Paths

INPUT_FILE = "app-dev-hub-v5.xlsx"
OUTPUT_DIR = "output"
OUTPUT_FILE = f"{OUTPUT_DIR}/templates.json"

Sheet Configuration

SHEETS_CONFIG = {
    'Learning Path': {
        'type': 'learning_path',
        'base_tags': ['learning-path'],
        'default_tags': ['flexibleserver'],
        'has_tile_numbers': True,
    },
    # ... more sheets
}

Tag Inference

TAG_INFERENCE = {
    'enabled': True,
    'url_patterns': {
        'postgresql/flexible-server': 'flexibleserver',
        '/python': 'python',
        # ... more patterns
    },
    'title_patterns': {
        'python': 'python',
        'tutorial': 'tutorial',
        # ... more patterns
    }
}

Validation Rules

VALIDATION = {
    'required_fields': ['title', 'website', 'source', 'tags', 'image', 'description'],
    'url_schemes': ['http://', 'https://'],
    'min_tags': 1,
    'max_title_length': 200,
    'max_description_length': 1000,
}

🏷️ Tag Processing

Tag Sources (Merged in Order)

  1. Base Tags: From sheet type (e.g., app-dev, genai, blog)
  2. Explicit Tags: From Excel tag columns
  3. Inferred Tags: From URL and content analysis

Tag Normalization

Tags are automatically normalized:

  • Convert to lowercase
  • Replace spaces with hyphens
  • Remove special characters
  • Trim leading/trailing hyphens

Examples:

  • "App Dev" → "app-dev"
  • "How To" → "how-to"
  • "Building GenAI Apps" → "building-genai-apps"

Tag Inference

The converter automatically infers tags from:

URL Patterns:

  • postgresql/flexible-serverflexibleserver
  • github.comgithub
  • /pythonpython

Content Keywords:

  • "tutorial" → tutorial
  • "how to" → how-to
  • "quickstart" → quickstart
  • "python", "java", etc. → language tags

💻 Usage

Basic Usage

python excel_to_json_converter.py

Custom Input/Output

python excel_to_json_converter.py --input myfile.xlsx --output result.json

Verbose Mode (for Debugging)

python excel_to_json_converter.py --verbose

Disable Tag Inference

python excel_to_json_converter.py --no-inference

Validate Existing Output

python excel_to_json_converter.py --validate-only --output templates.json

Command Line Arguments

Argument Description Default
--input Input Excel file path app-dev-hub-v5.xlsx
--output Output JSON file path output/templates.json
--validate-only Only validate without converting False
--verbose Enable verbose logging False
--no-inference Disable tag inference False

📤 Output Structure

JSON Format

Each entry in the output JSON array has this structure:

{
  "title": "Build a CRUD app with Azure Database for PostgreSQL",
  "website": "https://learn.microsoft.com/azure/postgresql/flexible-server/connect-python",
  "source": "https://learn.microsoft.com/azure/postgresql/flexible-server/connect-python",
  "tags": [
    "learning-path",
    "developing-core-applications",
    "flexibleserver",
    "python",
    "how-to"
  ],
  "image": "./img/build-a-crud-app-with-azure-database-for-postgresq.png",
  "description": "Learn how to build a simple Python app...",
  "tileNumber": 2,
  "priority": "P0",
  "date": "2025-11-25"
}

Required Fields

  • title - Entry title
  • website - URL to resource
  • source - Same as website
  • tags - Array of tags (minimum 1)
  • image - Generated image path
  • description - Description (can be empty string)

Optional Fields

  • tileNumber - Sequential number (Learning Paths only)
  • priority - P0, P1, P2, etc.
  • date - Last updated date (YYYY-MM-DD format)

Image Path Generation

Image paths are auto-generated from titles:

"Build a CRUD App" → "./img/build-a-crud-app.png"
"AI & ML Guide!" → "./img/ai-ml-guide.png"

✅ Validation

Entry-Level Validation

Each entry is validated for:

  • All required fields present
  • Valid URL format (http:// or https://)
  • At least 1 tag
  • Title ≤ 200 characters
  • Description ≤ 1000 characters

Output-Level Validation

The entire output is validated for:

  • Duplicate titles (logged as warnings)
  • All entries valid
  • Proper JSON structure

Validation Report

The validation_report.txt file contains:

  • Total entries count
  • Valid vs. invalid entries
  • List of errors
  • List of warnings

🧪 Testing

Run All Tests

python tests/test_converter.py

Test Coverage

The test suite includes:

  • Tag normalization tests
  • Tag merging tests
  • Image path generation tests
  • Date parsing tests
  • Text cleaning tests
  • Entry validation tests
  • Tag inference tests
  • Tag extraction tests

🔧 Troubleshooting

Common Issues

"Input file not found"

  • Ensure the Excel file exists in the specified location
  • Check the filename in config.py matches your file

"Sheet not found"

  • Verify your Excel file has the expected sheet names
  • Update SHEETS_CONFIG in config.py to match your sheets

"Missing required field"

  • Check that Title and Link columns have values
  • Empty rows are automatically skipped

"Invalid URL scheme"

  • URLs must start with http:// or https://
  • Check for typos in the Link column

"No tags found"

  • Ensure tag columns contain data
  • Check if tag inference is enabled
  • Verify base tags are configured for the sheet

Debug Mode

Enable verbose logging to see detailed processing information:

python excel_to_json_converter.py --verbose

Check the log file for details:

output/converter.log

📈 Statistics

The converter generates comprehensive statistics including:

  • Total entries processed and output
  • Entries by sheet
  • Skip reasons breakdown
  • Tag statistics (total, unique, top tags)
  • Field coverage (priority, date, tileNumber)
  • Validation results

Example:

{
  "conversion_date": "2025-11-25T10:30:00",
  "total_entries_output": 106,
  "entries_by_sheet": {
    "Learning Path": 23,
    "App Dev": 34,
    "AI": 41
  },
  "tag_statistics": {
    "unique_tags": 48,
    "top_tags": [
      { "tag": "flexibleserver", "count": 104 },
      { "tag": "how-to", "count": 60 }
    ]
  }
}

🤝 Contributing

Code Style

  • Follow PEP 8 style guide
  • Add docstrings to all functions
  • Write unit tests for new features

Adding New Features

  1. Update config.py with new settings
  2. Implement feature in excel_to_json_converter.py
  3. Add tests in tests/test_converter.py
  4. Update this README

Reporting Issues

When reporting issues, include:

  • Python version
  • Error messages
  • Sample Excel data (if possible)
  • Steps to reproduce

📝 License

This project is designed for internal use in developer hub content management.

🙏 Acknowledgments

Built for converting Excel-based content management to JSON format for modern web applications.


Need Help? Check the troubleshooting section or review the conversion logs in output/converter.log.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages