This package implements a Croissant dataset crawler that discovers datasets from AI Institute portals and integrates them with your existing MCP server and web interface.
- `croissant_crawler.py` - Main crawler module
- `mcp_server_updates.py` - Updates needed for `mcp_core_server.py`
- `web_interface_updates.py` - Updates needed for `web_interface.py`
- `croissant_datasets.html` - HTML template for dataset display
- `IMPLEMENTATION_INSTRUCTIONS.md` - This file
- Copy `croissant_crawler.py` to your server: `/opt/mcp-data-server/`
- Install required dependencies:

  ```bash
  pip install requests aiohttp
  ```
- Open `mcp_core_server.py`
- Add the import:

  ```python
  from croissant_crawler import CroissantCrawler
  ```

- Add the `_crawl_croissant_datasets_handler` method to the `MCPServer` class
- Add the tool registration in the `__init__` method
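The handler and registration steps above might look like the following sketch. The real `MCPServer` internals, the crawler's `crawl_all_portals` method name, and the result shape are assumptions; a stub stands in for the crawler so the sketch runs on its own:

```python
import asyncio

class CroissantCrawlerStub:
    """Stand-in for croissant_crawler.CroissantCrawler; the real class
    fetches portals over the network."""
    async def crawl_all_portals(self):  # assumed method name
        return [{"name": "example-dataset", "source": "aifarms"}]

async def _crawl_croissant_datasets_handler(params: dict) -> dict:
    """Tool handler: run the crawler and return discovered datasets."""
    crawler = CroissantCrawlerStub()
    datasets = await crawler.crawl_all_portals()
    return {"count": len(datasets), "datasets": datasets}

# In __init__, the registration would map the tool name to the handler,
# roughly like this (the real registration mechanism may differ):
TOOLS = {"crawl_croissant_datasets": _crawl_croissant_datasets_handler}

result = asyncio.run(TOOLS["crawl_croissant_datasets"]({}))
```

Adapt the stub import and registration to however your `MCPServer` class wires up its other tools.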
- Open `web_interface.py`
- Add the `/croissant_datasets` endpoint
- Add the `_get_croissant_datasets_template` method
- Copy `croissant_datasets.html` to the `templates/` directory
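As a rough, framework-agnostic illustration of the rendering path behind the `/croissant_datasets` endpoint: the real page uses `templates/croissant_datasets.html`, so the inline template and the dataset fields below are assumptions, not the actual template:

```python
from string import Template

# Illustrative stand-in for templates/croissant_datasets.html.
PAGE = Template(
    "<html><body><h1>Croissant Datasets</h1><ul>$items</ul></body></html>"
)

def render_croissant_datasets(datasets):
    """Build the HTML the /croissant_datasets endpoint would return."""
    items = "".join(
        f"<li>{d['name']} ({d['source']})</li>" for d in datasets
    )
    return PAGE.substitute(items=items)

html = render_croissant_datasets(
    [{"name": "example-dataset", "source": "aifarms"}]
)
```

In the real `web_interface.py`, `_get_croissant_datasets_template` would load the copied HTML file and the endpoint would return the rendered result.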
- Restart your MCP server:

  ```bash
  python3 mcp_core_server.py &
  ```

- Restart your web interface:

  ```bash
  python3 web_interface.py &
  ```

- Visit `http://localhost:8187/croissant_datasets`
The crawler is configured to search:
- AIFARMS Data Portal: `https://data.aifarms.org`
- CyVerse Sierra: `https://sierra.cyverse.org/datasets`
- AgAID GitHub: `https://github.com/TrevorBuchanan/AgAIDResearch`
To add new portals, update the `portals` dictionary in `CroissantCrawler.__init__()`:
```python
self.portals = {
    'aifarms': 'https://data.aifarms.org',
    'cyverse': 'https://sierra.cyverse.org/datasets',
    'agaid_github': 'https://github.com/TrevorBuchanan/AgAIDResearch',
    'new_portal': 'https://new-portal-url.com'
}
```

- Automatically crawls AI Institute portals
- Parses Croissant metadata files
- Extracts dataset information, fields, and keywords
- Beautiful dataset browsing interface
- Rich metadata display
- Source portal identification
- Direct links to original datasets
- New `crawl_croissant_datasets` tool
- Integrates with existing confidence-scoring search
- Asynchronous crawling for performance
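The metadata parsing and field/keyword extraction listed above can be sketched as follows. Field locations follow the Croissant JSON-LD vocabulary (`name` and `keywords` at the top level, fields under `recordSet`); treat the exact paths, and the sample dataset, as illustrative assumptions rather than the crawler's actual extraction logic:

```python
import json

def parse_croissant(doc: dict) -> dict:
    """Extract name, keywords, and field names from a Croissant JSON-LD dict."""
    fields = [
        f.get("name")
        for rs in doc.get("recordSet", [])
        for f in rs.get("field", [])
    ]
    return {
        "name": doc.get("name"),
        "keywords": doc.get("keywords", []),
        "fields": fields,
    }

# Hypothetical sample document for illustration.
sample = json.loads("""{
  "@type": "sc:Dataset",
  "name": "wheat-yield-trials",
  "keywords": ["agriculture", "wheat"],
  "recordSet": [{"field": [{"name": "plot_id"}, {"name": "yield"}]}]
}""")
info = parse_croissant(sample)
```

The extracted dict is what would feed the rich metadata display and keyword search described above.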
To trigger a crawl via the MCP tool endpoint:

```bash
curl -X POST http://localhost:8188/mcp/tools/crawl_croissant_datasets \
  -H "Content-Type: application/json" \
  -d '{}'
```

Visit `http://localhost:8187/croissant_datasets` to browse discovered datasets.
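The same call can be made from Python. This sketch builds the request without sending it; `urllib.request.urlopen(req)` would send it once the MCP server is running:

```python
import json
import urllib.request

# Programmatic equivalent of the curl command above (request built, not sent).
req = urllib.request.Request(
    "http://localhost:8188/mcp/tools/crawl_croissant_datasets",
    data=json.dumps({}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
```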
- Import errors: Ensure `croissant_crawler.py` is in the same directory as `mcp_core_server.py`
- Template not found: Ensure `croissant_datasets.html` is in the `templates/` directory
- Crawling errors: Check network connectivity and portal availability
Enable debug logging by adding:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

After implementation, you should have:
- Automatic dataset discovery from AI Institute portals
- Rich metadata display showing fields, keywords, and licensing
- Beautiful web interface for browsing datasets
- Integration with your existing confidence-scoring search system
- Extensible architecture for adding new portals
If you encounter issues:
- Check the server logs for error messages
- Verify all files are in the correct locations
- Ensure all dependencies are installed
- Test the crawler independently before integration
Possible future enhancements:

- Scheduled crawling for automatic updates
- Dataset search integration with confidence scoring
- Metadata filtering by source, license, or keywords
- Download management for discovered datasets
- API endpoints for programmatic access
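Scheduled crawling could be sketched with a simple asyncio loop. The crawler interface (`crawl_all_portals`) is an assumption carried over from earlier; a stub stands in so the sketch is self-contained, and `max_runs` exists only to make the loop bounded for demonstration:

```python
import asyncio

class CrawlerStub:
    """Stand-in for the real CroissantCrawler (assumed interface)."""
    def __init__(self):
        self.runs = 0

    async def crawl_all_portals(self):  # assumed method name
        self.runs += 1
        return []

async def crawl_periodically(crawler, interval_seconds, max_runs=None):
    """Re-crawl every interval_seconds; max_runs bounds the loop (None = forever)."""
    runs = 0
    while max_runs is None or runs < max_runs:
        await crawler.crawl_all_portals()
        runs += 1
        await asyncio.sleep(interval_seconds)
    return runs

stub = CrawlerStub()
completed = asyncio.run(crawl_periodically(stub, interval_seconds=0, max_runs=3))
```

In production the loop would run with a real crawler instance, a sensible interval (e.g. hourly), and `max_runs=None`.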
Happy crawling! 🚀🌾✨