Croissant Dataset Crawler Implementation Instructions

📋 Overview

This package implements a Croissant dataset crawler that discovers datasets from AI Institute portals and integrates them with your existing MCP server and web interface.

📁 Files Included

croissant_crawler.py - Main crawler module
mcp_server_updates.py - Updates needed for mcp_core_server.py
web_interface_updates.py - Updates needed for web_interface.py
croissant_datasets.html - HTML template for dataset display
IMPLEMENTATION_INSTRUCTIONS.md - This file

🚀 Implementation Steps

Step 1: Add the Crawler Module

Copy croissant_crawler.py to your server: /opt/mcp-data-server/
Install required dependencies:
```
pip install requests aiohttp
```

Step 2: Update MCP Server

Open mcp_core_server.py
Add the import: from croissant_crawler import CroissantCrawler
Add the _crawl_croissant_datasets_handler method to the MCPServer class
Add the tool registration in the __init__ method
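
The real handler and registration code ship in mcp_server_updates.py. The sketch below only illustrates the shape of the change. It assumes that MCPServer keeps its tools in a dict called `tools` and that the crawler exposes an async method named `crawl_all_portals()`; both names are hypothetical, and the `CroissantCrawler` class here is a stub standing in for the import from croissant_crawler.py.

```python
import asyncio


class CroissantCrawler:
    """Stub standing in for the real class in croissant_crawler.py."""

    async def crawl_all_portals(self):  # hypothetical method name
        return [{"name": "example-dataset", "source": "aifarms"}]


class MCPServer:
    def __init__(self):
        # Assumption: tools are kept in a dict keyed by tool name.
        self.tools = {}
        self.tools["crawl_croissant_datasets"] = self._crawl_croissant_datasets_handler

    async def _crawl_croissant_datasets_handler(self, params):
        """Run the crawler and return a JSON-serializable result."""
        try:
            datasets = await CroissantCrawler().crawl_all_portals()
        except Exception as exc:
            # Report the failure to the caller instead of crashing the server.
            return {"status": "error", "message": str(exc)}
        return {"status": "ok", "count": len(datasets), "datasets": datasets}
```

If your server registers tools through a different mechanism, keep the handler body and adapt only the registration line.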

Step 3: Update Web Interface

Open web_interface.py
Add the /croissant_datasets endpoint
Add the _get_croissant_datasets_template method
Copy croissant_datasets.html to templates/ directory

Step 4: Test the Implementation

Stop any running instances, then start the MCP server in the background: python3 mcp_core_server.py &
Start the web interface the same way: python3 web_interface.py &
Visit http://localhost:8187/croissant_datasets and confirm that the dataset page loads.
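
If you prefer to check the page from a script rather than a browser, a small standard-library probe is enough. This is a minimal sketch; the helper name `http_status` is made up for this example and is not part of the package.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return None
```

Calling `http_status("http://localhost:8187/croissant_datasets")` should give 200 once both processes are up. A result of None suggests the web interface is not running, and 404 suggests the endpoint was not added.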

🔧 Configuration

Target Portals

The crawler is configured to search:

AIFARMS Data Portal: https://data.aifarms.org
CyVerse Sierra: https://sierra.cyverse.org/datasets
AgAID GitHub: https://github.com/TrevorBuchanan/AgAIDResearch

Adding New Portals

To add new portals, update the portals dictionary in CroissantCrawler.__init__():

```
self.portals = {
    'aifarms': 'https://data.aifarms.org',
    'cyverse': 'https://sierra.cyverse.org/datasets',
    'agaid_github': 'https://github.com/TrevorBuchanan/AgAIDResearch',
    'new_portal': 'https://new-portal-url.com'
}
```
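
A typo in a key or URL only shows up later as a crawl failure, so it can help to validate entries when they are added. The helper below is a hedged sketch: `add_portal` is a hypothetical function, not something croissant_crawler.py is known to provide, and it works on a plain dict shaped like the one above.

```python
from urllib.parse import urlparse

portals = {
    'aifarms': 'https://data.aifarms.org',
    'cyverse': 'https://sierra.cyverse.org/datasets',
    'agaid_github': 'https://github.com/TrevorBuchanan/AgAIDResearch',
}


def add_portal(portals, key, url):
    """Add a portal after checking the key is unused and the URL is absolute http(s)."""
    if key in portals:
        raise ValueError(f"portal key already in use: {key!r}")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")
    # Store without a trailing slash so all entries have the same form.
    portals[key] = url.rstrip("/")
    return portals
```

For example, `add_portal(portals, 'new_portal', 'https://new-portal-url.com')` adds the entry, while a duplicate key or a URL without a scheme raises ValueError.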

🎯 Features

Dataset Discovery

Automatically crawls AI Institute portals
Parses Croissant metadata files
Extracts dataset information, fields, and keywords
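
To make the extraction step concrete, here is a minimal sketch of pulling display fields out of a Croissant JSON-LD document. It relies on the property names defined by the Croissant format (`name`, `description`, `license`, `keywords`, `recordSet`, `field`). The function name `summarize_croissant` and the sample record are invented for illustration, and the crawler in croissant_crawler.py may organize this differently.

```python
import json


def summarize_croissant(metadata):
    """Pull display fields out of a parsed Croissant JSON-LD document."""
    keywords = metadata.get("keywords", [])
    if isinstance(keywords, str):
        # Some publishers give keywords as one comma-separated string.
        keywords = [k.strip() for k in keywords.split(",") if k.strip()]
    fields = []
    for record_set in metadata.get("recordSet", []):
        for field in record_set.get("field", []):
            fields.append({"name": field.get("name"), "dataType": field.get("dataType")})
    return {
        "name": metadata.get("name"),
        "description": metadata.get("description", ""),
        "license": metadata.get("license"),
        "url": metadata.get("url"),
        "keywords": keywords,
        "fields": fields,
    }


# Invented sample record, not taken from any of the portals.
sample = json.loads("""
{
  "@type": "sc:Dataset",
  "name": "example_yield_trials",
  "description": "Invented sample record for illustration.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "url": "https://example.org/datasets/yield-trials",
  "keywords": ["agriculture", "yield"],
  "recordSet": [
    {"@type": "cr:RecordSet", "name": "trials",
     "field": [
       {"@type": "cr:Field", "name": "plot_id", "dataType": "sc:Text"},
       {"@type": "cr:Field", "name": "yield_kg", "dataType": "sc:Float"}
     ]}
  ]
}
""")
summary = summarize_croissant(sample)
```

Using `.get()` throughout keeps the parser tolerant of metadata files that omit optional properties.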

Web Interface

Dataset browsing page at /croissant_datasets
Rich metadata display
Source portal identification
Direct links to original datasets

MCP Integration

New crawl_croissant_datasets tool
Integrates with existing confidence-scoring search
Asynchronous crawling for performance
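
The benefit of asynchronous crawling is that slow portals are waited on concurrently rather than one after another. The sketch below shows one common way to do that with `asyncio.gather`. It is not the crawler's actual code: `crawl_portals` is a hypothetical helper, and `fetch` stands for any coroutine that downloads one URL (for example, a thin wrapper around an aiohttp session).

```python
import asyncio


async def crawl_portals(portals, fetch, timeout=30.0):
    """Fetch every portal concurrently; one failing portal does not abort the rest."""

    async def fetch_one(url):
        # Bound each request so a hung portal cannot stall the whole crawl.
        return await asyncio.wait_for(fetch(url), timeout)

    names = list(portals)
    outcomes = await asyncio.gather(
        *(fetch_one(portals[name]) for name in names),
        return_exceptions=True,
    )
    results, errors = {}, {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            errors[name] = repr(outcome)
        else:
            results[name] = outcome
    return results, errors
```

Because `return_exceptions=True` turns failures into return values, the caller gets partial results plus a per-portal error map, which suits a crawl where some portals may be temporarily down.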

🔍 Usage

Via MCP Server

```
curl -X POST http://localhost:8188/mcp/tools/crawl_croissant_datasets \
  -H "Content-Type: application/json" \
  -d '{}'
```
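
The same call can be made from Python with the standard library. This is a sketch that mirrors the curl command above. The helper names are made up for this example, and it assumes the server replies with a JSON body, which these instructions do not specify.

```python
import json
import urllib.request


def build_crawl_request(base_url="http://localhost:8188", arguments=None):
    """Build the POST request that the curl command sends."""
    body = json.dumps(arguments or {}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/mcp/tools/crawl_croissant_datasets",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_crawl_tool(base_url="http://localhost:8188", timeout=120):
    """Send the request and decode the reply (assumes a JSON response body)."""
    with urllib.request.urlopen(build_crawl_request(base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A generous timeout is used because the crawl contacts several external portals before the server can respond.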

Via Web Interface

Visit http://localhost:8187/croissant_datasets to browse discovered datasets.

🚨 Troubleshooting

Common Issues

Import errors: Ensure croissant_crawler.py is in the same directory as mcp_core_server.py
Template not found: Ensure croissant_datasets.html is in the templates/ directory
Crawling errors: Check network connectivity and portal availability
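
The first two issues are file-placement problems and can be checked up front. The sketch below lists which expected files are missing. It assumes web_interface.py and the templates/ directory sit in the same directory as mcp_core_server.py; adjust `REQUIRED_FILES` if your layout differs. The function name `missing_files` is made up for this example.

```python
from pathlib import Path

# Paths relative to the server directory (layout is an assumption, see above).
REQUIRED_FILES = [
    "mcp_core_server.py",
    "web_interface.py",
    "croissant_crawler.py",
    "templates/croissant_datasets.html",
]


def missing_files(base_dir="/opt/mcp-data-server"):
    """Return the required files that are not present under base_dir."""
    base = Path(base_dir)
    return [rel for rel in REQUIRED_FILES if not (base / rel).is_file()]
```

An empty list from `missing_files()` means the files are where the steps above put them, so any remaining import or template error has another cause.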

Debug Mode

Enable debug logging by adding the following before the crawler is created, for example at the top of mcp_core_server.py:

```
import logging
logging.basicConfig(level=logging.DEBUG)
```

🎉 Expected Results

After implementation, you should have:

Automatic dataset discovery from AI Institute portals
Rich metadata display showing fields, keywords, and licensing
Web interface for browsing the discovered datasets
Integration with your existing confidence-scoring search system
Extensible architecture for adding new portals

📞 Support

If you encounter issues:

Check the server logs for error messages
Verify all files are in the correct locations
Ensure all dependencies are installed
Test the crawler independently before integration

🚀 Future Enhancements

Scheduled crawling for automatic updates
Dataset search integration with confidence scoring
Metadata filtering by source, license, or keywords
Download management for discovered datasets
API endpoints for programmatic access

Happy crawling! 🚀🌾✨
