Django Crawler - Data Ingestion & Monitoring System

A Django web application with Docker support for web crawling and data ingestion. This application integrates the IUSB crawler functionality into a full-featured Django admin interface.

Features

  • Web Crawling: Integrated web crawler based on the original iusb_crawler.py
  • Admin Interface: Django admin with custom interface for data ingestion
  • Database Storage: PostgreSQL database for storing crawled data
  • Real-time Monitoring: Dashboard to monitor crawl sessions and progress
  • Docker Support: Full Docker and Docker Compose setup
  • Background Processing: Celery for asynchronous crawl tasks
  • Responsive UI: Bootstrap-based responsive interface

Quick Start

Prerequisites

  • Docker and Docker Compose
  • Git

Installation

  1. Navigate to the project directory:

    cd /path/to/3DGameDjango
  2. Start the application:

    docker-compose up --build
  3. Create a superuser (in a new terminal):

    docker-compose exec web python manage.py createsuperuser
  4. Access the application (the Django dev server listens on port 8000 by default):

    • Web interface: http://localhost:8000/
    • Admin interface: http://localhost:8000/admin/

Usage

Starting a Crawl

  1. Via Web Interface: open the dashboard and submit the Start Crawl form at /start-crawl/.

  2. Via Admin Interface: create a CrawlSession from the Django admin at /admin/.

  3. Via Management Command:

    docker-compose exec web python manage.py start_crawl --name "My Crawl" --url "https://example.com"

Monitoring Crawls

  • Dashboard: View recent sessions and statistics
  • Session Detail: Monitor individual crawl progress
  • Admin Interface: Full database management and monitoring

Project Structure

3DGameDjango/
├── django_project/         # Django project settings
├── crawler_app/            # Main crawler application
│   ├── models.py           # Database models
│   ├── admin.py            # Admin interface
│   ├── views.py            # Web views
│   ├── crawler.py          # Django-integrated crawler
│   ├── tasks.py            # Celery tasks
│   └── management/         # Management commands
├── templates/              # HTML templates
├── data_ingestion/         # Original crawler files
├── docker-compose.yml      # Docker Compose configuration
├── Dockerfile              # Docker configuration
└── requirements.txt        # Python dependencies

Configuration

Environment Variables

Copy env.example to .env and modify as needed:

cp env.example .env

Key settings:

  • DEBUG: Set to 0 for production
  • SECRET_KEY: Change for production
  • Database and Redis settings
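
In a typical Django settings.py, these values are read from the environment at startup. A minimal sketch of that pattern follows; the POSTGRES_* and REDIS_URL variable names are assumptions for illustration, not confirmed names from this project's env.example:

```python
import os

# Hedged sketch: how a settings.py might consume the .env-provided variables.
# Only DEBUG and SECRET_KEY are named in this README; the rest are assumptions.
DEBUG = os.environ.get("DEBUG", "0") == "1"
SECRET_KEY = os.environ.get("SECRET_KEY", "insecure-dev-key")

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("POSTGRES_DB", "crawler"),
        "USER": os.environ.get("POSTGRES_USER", "postgres"),
        "PASSWORD": os.environ.get("POSTGRES_PASSWORD", ""),
        "HOST": os.environ.get("POSTGRES_HOST", "db"),
        "PORT": os.environ.get("POSTGRES_PORT", "5432"),
    }
}

# Celery broker, pointing at the Redis service from docker-compose.
CELERY_BROKER_URL = os.environ.get("REDIS_URL", "redis://redis:6379/0")
```

Reading everything through os.environ.get with a safe default is what lets the same settings file work both in Docker (where the compose file injects values) and locally.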

Crawler Settings

The crawler respects the following settings:

  • Max Pages: maximum number of pages to crawl per session
  • Delay: pause between requests, in seconds
  • Robots.txt: robots.txt rules are checked before each fetch
  • Rate Limiting: built-in delays keep request rates polite
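
The settings above combine into a simple fetch loop. The sketch below is not the project's crawler.py, just a stdlib illustration of the same policies; fetch_page is an injected callable (an assumption of this sketch) so the loop can be exercised without network access:

```python
import time
from urllib import robotparser
from urllib.parse import urljoin

def polite_crawl(start_url, fetch_page, robots_txt, max_pages=10, delay=1.0):
    """Breadth-first crawl honoring robots.txt, a page cap, and a per-request delay.

    fetch_page(url) must return (html_text, list_of_links).
    """
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())

    queue, seen, results = [start_url], set(), {}
    while queue and len(results) < max_pages:      # Max Pages cap
        url = queue.pop(0)
        if url in seen or not rp.can_fetch("*", url):  # robots.txt check
            continue
        seen.add(url)
        html, links = fetch_page(url)
        results[url] = html
        queue.extend(urljoin(url, link) for link in links)
        time.sleep(delay)                          # Delay / rate limiting
    return results
```

With a fake fetch_page that returns canned links, disallowed paths are skipped and the page cap is respected, which is the behavior the settings list describes.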

Database Models

  • CrawlSession: Represents a crawling session
  • CrawledPage: Individual crawled pages with HTML content and metadata
  • CrawlLog: Detailed logging for each session
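
The real definitions live in crawler_app/models.py as Django models. As a framework-free illustration of the data each one holds (the field names here are assumptions, not the project's actual schema):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class CrawlSession:
    """One crawling run: its seed URL, status, and progress."""
    name: str
    start_url: str
    status: str = "pending"          # assumed values: pending/running/completed/failed
    started_at: Optional[datetime] = None
    pages_crawled: int = 0

@dataclass
class CrawledPage:
    """A single fetched page, tied back to its session."""
    session: CrawlSession
    url: str
    html: str = ""
    title: str = ""
    fetched_at: Optional[datetime] = None

@dataclass
class CrawlLog:
    """A timestamped log line for a session."""
    session: CrawlSession
    level: str
    message: str
    created_at: Optional[datetime] = None
```

The pattern to note is the foreign-key shape: each CrawledPage and CrawlLog references its parent CrawlSession, which is what lets the dashboard roll up per-session statistics.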

API Endpoints

  • GET /: Dashboard
  • GET /start-crawl/: Start crawl form
  • POST /start-crawl/: Submit crawl form
  • GET /session/<id>/: Session detail view
  • GET /admin/: Django admin interface

Development

Running Locally (without Docker)

  1. Install dependencies:

    pip install -r requirements.txt
  2. Set up the database:

    python manage.py migrate
    python manage.py createsuperuser
  3. Run the application:

    python manage.py runserver

Adding New Features

  1. Models: Add to crawler_app/models.py
  2. Admin: Register in crawler_app/admin.py
  3. Views: Add to crawler_app/views.py
  4. Templates: Add to templates/crawler_app/

Production Deployment

  1. Set environment variables:

    DEBUG=0
    SECRET_KEY=your-production-secret-key
  2. Use production database:

    • Update docker-compose.yml with production database settings
    • Use external PostgreSQL instance
  3. Static files:

    docker-compose exec web python manage.py collectstatic

Troubleshooting

Common Issues

  1. Database connection errors:

    • Ensure PostgreSQL container is running
    • Check database credentials
  2. Celery not processing tasks:

    • Ensure Redis container is running
    • Check Celery worker logs
  3. Permission errors:

    • Check file permissions in media directory
    • Ensure Docker volumes are properly mounted

Logs

View application logs:

docker-compose logs web
docker-compose logs celery

License

This project is for educational purposes. Please ensure you have permission to crawl any website and respect the website's terms of service.
