Django Crawler - Data Ingestion & Monitoring System

A Django web application with Docker support for web crawling and data ingestion. This application integrates the IUSB crawler functionality into a full-featured Django admin interface.

Features

  • Web Crawling: Integrated web crawler based on the original iusb_crawler.py
  • Admin Interface: Django admin with custom interface for data ingestion
  • Database Storage: PostgreSQL database for storing crawled data
  • Real-time Monitoring: Dashboard to monitor crawl sessions and progress
  • Docker Support: Full Docker and Docker Compose setup
  • Background Processing: Celery for asynchronous crawl tasks
  • Responsive UI: Bootstrap-based responsive interface

Quick Start

Prerequisites

  • Docker and Docker Compose
  • Git

Installation

  1. Navigate to the project directory:

    cd /path/to/3DGameDjango
  2. Start the application:

    docker-compose up --build
  3. Create a superuser (in a new terminal):

    docker-compose exec web python manage.py createsuperuser
  4. Access the application (the Django dev server listens on port 8000 by default):

    • Web interface: http://localhost:8000/
    • Admin interface: http://localhost:8000/admin/

Usage

Starting a Crawl

  1. Via Web Interface: open the dashboard and submit the Start Crawl form at /start-crawl/.

  2. Via Admin Interface: create a CrawlSession from the Django admin at /admin/.

  3. Via Management Command:

    docker-compose exec web python manage.py start_crawl --name "My Crawl" --url "https://example.com"

Monitoring Crawls

  • Dashboard: View recent sessions and statistics
  • Session Detail: Monitor individual crawl progress
  • Admin Interface: Full database management and monitoring

Project Structure

3DGameDjango/
├── django_project/         # Django project settings
├── crawler_app/            # Main crawler application
│   ├── models.py           # Database models
│   ├── admin.py            # Admin interface
│   ├── views.py            # Web views
│   ├── crawler.py          # Django-integrated crawler
│   ├── tasks.py            # Celery tasks
│   └── management/         # Management commands
├── templates/              # HTML templates
├── data_ingestion/         # Original crawler files
├── docker-compose.yml      # Docker Compose configuration
├── Dockerfile              # Docker configuration
└── requirements.txt        # Python dependencies

Configuration

Environment Variables

Copy env.example to .env and modify as needed:

cp env.example .env

Key settings:

  • DEBUG: Set to 0 for production
  • SECRET_KEY: Change for production
  • Database and Redis settings
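
In a typical Django settings.py, these values are read from the environment at startup. A minimal sketch of that pattern follows; the POSTGRES_* and REDIS_URL variable names are assumptions for illustration, not confirmed names from this project's env.example:

```python
import os

# Hedged sketch: how a settings.py might consume the .env-provided variables.
# Only DEBUG and SECRET_KEY are named in this README; the rest are assumptions.
DEBUG = os.environ.get("DEBUG", "0") == "1"
SECRET_KEY = os.environ.get("SECRET_KEY", "insecure-dev-key")

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("POSTGRES_DB", "crawler"),
        "USER": os.environ.get("POSTGRES_USER", "postgres"),
        "PASSWORD": os.environ.get("POSTGRES_PASSWORD", ""),
        "HOST": os.environ.get("POSTGRES_HOST", "db"),
        "PORT": os.environ.get("POSTGRES_PORT", "5432"),
    }
}

# Celery broker, pointing at the Redis service from docker-compose.
CELERY_BROKER_URL = os.environ.get("REDIS_URL", "redis://redis:6379/0")
```

Reading everything through os.environ.get with a safe default is what lets the same settings file work both in Docker (where the compose file injects values) and locally.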

Crawler Settings

The crawler respects the following settings:

  • Max Pages: maximum number of pages to crawl per session
  • Delay: pause between requests, in seconds
  • Robots.txt: robots.txt rules are checked before each fetch
  • Rate Limiting: built-in delays keep request rates polite
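
The settings above combine into a simple fetch loop. The sketch below is not the project's crawler.py, just a stdlib illustration of the same policies; fetch_page is an injected callable (an assumption of this sketch) so the loop can be exercised without network access:

```python
import time
from urllib import robotparser
from urllib.parse import urljoin

def polite_crawl(start_url, fetch_page, robots_txt, max_pages=10, delay=1.0):
    """Breadth-first crawl honoring robots.txt, a page cap, and a per-request delay.

    fetch_page(url) must return (html_text, list_of_links).
    """
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())

    queue, seen, results = [start_url], set(), {}
    while queue and len(results) < max_pages:      # Max Pages cap
        url = queue.pop(0)
        if url in seen or not rp.can_fetch("*", url):  # robots.txt check
            continue
        seen.add(url)
        html, links = fetch_page(url)
        results[url] = html
        queue.extend(urljoin(url, link) for link in links)
        time.sleep(delay)                          # Delay / rate limiting
    return results
```

With a fake fetch_page that returns canned links, disallowed paths are skipped and the page cap is respected, which is the behavior the settings list describes.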

Database Models

  • CrawlSession: Represents a crawling session
  • CrawledPage: Individual crawled pages with HTML content and metadata
  • CrawlLog: Detailed logging for each session
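
The real definitions live in crawler_app/models.py as Django models. As a framework-free illustration of the data each one holds (the field names here are assumptions, not the project's actual schema):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class CrawlSession:
    """One crawling run: its seed URL, status, and progress."""
    name: str
    start_url: str
    status: str = "pending"          # assumed values: pending/running/completed/failed
    started_at: Optional[datetime] = None
    pages_crawled: int = 0

@dataclass
class CrawledPage:
    """A single fetched page, tied back to its session."""
    session: CrawlSession
    url: str
    html: str = ""
    title: str = ""
    fetched_at: Optional[datetime] = None

@dataclass
class CrawlLog:
    """A timestamped log line for a session."""
    session: CrawlSession
    level: str
    message: str
    created_at: Optional[datetime] = None
```

The pattern to note is the foreign-key shape: each CrawledPage and CrawlLog references its parent CrawlSession, which is what lets the dashboard roll up per-session statistics.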

API Endpoints

  • GET /: Dashboard
  • GET /start-crawl/: Start crawl form
  • POST /start-crawl/: Submit crawl form
  • GET /session/<id>/: Session detail view
  • GET /admin/: Django admin interface

Development

Running Locally (without Docker)

  1. Install dependencies:

    pip install -r requirements.txt
  2. Set up the database:

    python manage.py migrate
    python manage.py createsuperuser
  3. Run the application:

    python manage.py runserver

Adding New Features

  1. Models: Add to crawler_app/models.py
  2. Admin: Register in crawler_app/admin.py
  3. Views: Add to crawler_app/views.py
  4. Templates: Add to templates/crawler_app/

Production Deployment

  1. Set environment variables:

    DEBUG=0
    SECRET_KEY=your-production-secret-key
  2. Use production database:

    • Update docker-compose.yml with production database settings
    • Use external PostgreSQL instance
  3. Static files:

    docker-compose exec web python manage.py collectstatic

Troubleshooting

Common Issues

  1. Database connection errors:

    • Ensure PostgreSQL container is running
    • Check database credentials
  2. Celery not processing tasks:

    • Ensure Redis container is running
    • Check Celery worker logs
  3. Permission errors:

    • Check file permissions in media directory
    • Ensure Docker volumes are properly mounted

Logs

View application logs:

docker-compose logs web
docker-compose logs celery

License

This project is for educational purposes. Please ensure you have permission to crawl any website and respect the website's terms of service.
