A Django web application with Docker support for web crawling and data ingestion. This application integrates the IUSB crawler functionality into a full-featured Django admin interface.
- Web Crawling: Integrated web crawler based on the original `iusb_crawler.py`
- Admin Interface: Django admin with custom interface for data ingestion
- Database Storage: PostgreSQL database for storing crawled data
- Real-time Monitoring: Dashboard to monitor crawl sessions and progress
- Docker Support: Full Docker and Docker Compose setup
- Background Processing: Celery for asynchronous crawl tasks
- Responsive UI: Bootstrap-based responsive interface
- Docker and Docker Compose
- Git
1. Navigate to the project:

   ```bash
   cd /Users/lkruczek/Documents/3DGameDjango
   ```

2. Start the application:

   ```bash
   docker-compose up --build
   ```

3. Create a superuser (in a new terminal):

   ```bash
   docker-compose exec web python manage.py createsuperuser
   ```

4. Access the application:
   - Main Dashboard: http://localhost:8000
   - Admin Interface: http://localhost:8000/admin
Via Web Interface:
1. Go to http://localhost:8000
2. Click "Start New Crawl"
3. Fill in the form and click "Start Crawling"

Via Admin Interface:
1. Go to http://localhost:8000/admin
2. Navigate to "Crawl Sessions"
3. Create a new session and use the "Start selected crawls" action

Via Management Command:

```bash
docker-compose exec web python manage.py start_crawl --name "My Crawl" --url "https://example.com"
```
- Dashboard: View recent sessions and statistics
- Session Detail: Monitor individual crawl progress
- Admin Interface: Full database management and monitoring
```
3DGameDjango/
├── django_project/        # Django project settings
├── crawler_app/           # Main crawler application
│   ├── models.py          # Database models
│   ├── admin.py           # Admin interface
│   ├── views.py           # Web views
│   ├── crawler.py         # Django-integrated crawler
│   ├── tasks.py           # Celery tasks
│   └── management/        # Management commands
├── templates/             # HTML templates
├── data_ingestion/        # Original crawler files
├── docker-compose.yml     # Docker Compose configuration
├── Dockerfile             # Docker configuration
└── requirements.txt       # Python dependencies
```
Copy `env.example` to `.env` and modify as needed:

```bash
cp env.example .env
```

Key settings:
- `DEBUG`: Set to 0 for production
- `SECRET_KEY`: Change for production
- Database and Redis settings
The crawler respects the following settings:
- Max Pages: Maximum number of pages to crawl
- Delay: Delay between requests (seconds)
- Robots.txt: Automatically respects robots.txt
- Rate Limiting: Built-in delays to be respectful
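The settings above (max pages, per-request delay, robots.txt checks) can be sketched as a minimal polite-crawl loop. This is an illustration, not the project's actual `crawler.py`; the function names and the injectable `fetch` hook are placeholders:

```python
import time
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Check a URL against robots.txt rules parsed from text (no network access)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)


def polite_crawl(urls, robots_txt, max_pages=100, delay=1.0, fetch=lambda u: u):
    """Fetch up to max_pages allowed URLs, sleeping `delay` seconds between requests."""
    results = []
    for url in urls:
        if len(results) >= max_pages:
            break
        if not allowed(robots_txt, url):  # skip disallowed paths
            continue
        results.append(fetch(url))
        time.sleep(delay)  # built-in rate limiting
    return results
```

The `fetch` callable is left injectable so the loop can be tested without touching the network.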
- CrawlSession: Represents a crawling session
- CrawledPage: Individual crawled pages with HTML content and metadata
- CrawlLog: Detailed logging for each session
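The relationships between these three models can be sketched with plain dataclasses. This is a hypothetical outline only; the real definitions live in `crawler_app/models.py` as Django models, and the field names and status values shown here are assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CrawlSession:
    """One crawling run; CrawledPage and CrawlLog rows point back to it."""
    name: str
    start_url: str
    status: str = "pending"  # assumed states: pending / running / completed / failed
    max_pages: int = 100
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class CrawledPage:
    """A single fetched page with its HTML content and metadata."""
    session: CrawlSession  # foreign key to the owning session
    url: str
    html: str
    title: str = ""


@dataclass
class CrawlLog:
    """A log entry scoped to one session."""
    session: CrawlSession
    level: str
    message: str
```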
- `GET /`: Dashboard
- `GET /start-crawl/`: Start crawl form
- `POST /start-crawl/`: Submit crawl form
- `GET /session/<id>/`: Session detail view
- `GET /admin/`: Django admin interface
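As an illustration, a crawl could be started programmatically by POSTing to the start-crawl endpoint. This is a sketch: the form field names `name` and `start_url` are assumptions, and the request is only constructed, never sent:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields for POST /start-crawl/ (names are assumptions)
data = urlencode({"name": "My Crawl", "start_url": "https://example.com"}).encode()
req = Request("http://localhost:8000/start-crawl/", data=data, method="POST")
```

Sending it would additionally require a CSRF token, since Django protects POST forms by default.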
1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Set up the database:

   ```bash
   python manage.py migrate
   python manage.py createsuperuser
   ```

3. Run the application:

   ```bash
   python manage.py runserver
   ```
- Models: Add to `crawler_app/models.py`
- Admin: Register in `crawler_app/admin.py`
- Views: Add to `crawler_app/views.py`
- Templates: Add to `templates/crawler_app/`
1. Set environment variables:

   ```bash
   DEBUG=0
   SECRET_KEY=your-production-secret-key
   ```

2. Use a production database:
   - Update `docker-compose.yml` with production database settings
   - Use an external PostgreSQL instance

3. Collect static files:

   ```bash
   docker-compose exec web python manage.py collectstatic
   ```
Database connection errors:
- Ensure the PostgreSQL container is running
- Check database credentials

Celery not processing tasks:
- Ensure the Redis container is running
- Check the Celery worker logs

Permission errors:
- Check file permissions in the media directory
- Ensure Docker volumes are properly mounted
View application logs:

```bash
docker-compose logs web
docker-compose logs celery
```

This project is for educational purposes. Please ensure you have permission to crawl any website and respect the website's terms of service.