A Flask-based web application that offers two computer vision models: Vision Transformer (ViT) for image classification and YOLO for object detection. Users upload an image and choose between a text-based prediction and an annotated image with bounding boxes drawn around detected objects.
- Dual Model Support: Choose between Vision Transformer and YOLO models
- Image Upload Interface: Secure file upload with filename sanitization
- Vision Transformer (ViT): Returns text-based object predictions
- YOLO Object Detection: Returns images with bounding boxes drawn around detected objects
- Responsive Web Interface: Clean, user-friendly interface for model selection and results
- Static File Serving: Integrated handling of uploaded images and prediction outputs
- Backend: Flask (Python web framework)
- Computer Vision Models:
  - Vision Transformer (ViT) for image classification
  - YOLO (You Only Look Once) for object detection
- File Handling: Werkzeug secure filename utilities
- Frontend: HTML templates with static file serving
flask-app/
├── app.py # Main Flask application
├── vit_model.py # Vision Transformer model implementation
├── yolo_model.py # YOLO model implementation
├── static/
│ └── uploads/ # Directory for uploaded images
├── templates/
│ ├── index.html # Home page with model selection
│ ├── vit_prediction.html # ViT prediction results
│ └── yolo_prediction.html # YOLO prediction results
└── README.md
- Clone the repository:

      git clone <repository-url>
      cd flask-computer-vision-app

- Install required dependencies:

      pip install flask werkzeug torch torchvision transformers opencv-python ultralytics

- Create the uploads directory:

      mkdir -p static/uploads

- File: vit_model.py
- Function: predict_vit(file_path)
- Output: Text-based object classification
- Use Case: Image classification with descriptive labels

- File: yolo_model.py
- Function: predict_yolo(file_path)
- Output: Image file path with bounding boxes
- Use Case: Object detection with visual localization
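
As a rough illustration of how the two functions described above might be written, here is a minimal sketch. It assumes the Hugging Face `transformers` image-classification pipeline and the `ultralytics` `YOLO` class, both of which appear in the install command. The checkpoint names (`google/vit-base-patch16-224`, `yolov8n.pt`), the `annotated_path` helper, and the `_yolo` filename suffix are illustrative assumptions, not details taken from this project.

```python
from pathlib import Path

# Assumed checkpoints; the project may load different weights.
VIT_CHECKPOINT = "google/vit-base-patch16-224"
YOLO_WEIGHTS = "yolov8n.pt"


def annotated_path(file_path):
    """Hypothetical helper: build the output path for the annotated image."""
    p = Path(file_path)
    return str(p.with_name(p.stem + "_yolo" + p.suffix))


def predict_vit(file_path):
    """Return the top predicted label for the image as text."""
    # Imported lazily so this module can be loaded without the heavy dependency.
    from transformers import pipeline

    classifier = pipeline("image-classification", model=VIT_CHECKPOINT)
    predictions = classifier(file_path)  # list of {"label", "score"} dicts
    return predictions[0]["label"]


def predict_yolo(file_path):
    """Draw bounding boxes on the image and return the annotated file's path."""
    import cv2
    from ultralytics import YOLO

    model = YOLO(YOLO_WEIGHTS)
    results = model(file_path)       # one result per input image
    annotated = results[0].plot()    # BGR array with boxes drawn
    out_path = annotated_path(file_path)
    cv2.imwrite(out_path, annotated)
    return out_path
```

This sketch loads the model on every call to keep the example short. A long-running server would more likely load each model once at import time and reuse it.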
    python app.py

The application runs on http://localhost:5000 in debug mode.
- Navigate to Home Page (/)
  - View descriptions of both available models
  - Choose between ViT and YOLO models
- Upload and Predict
  - Select your preferred model (ViT or YOLO)
  - Upload an image file
  - Click predict to process
- View Results
  - ViT Results: Text-based classification displayed on results page
  - YOLO Results: Original image with bounding boxes overlaid
- Description: Home page with model selection interface
- Returns: HTML page with model descriptions and upload form
- Description: Process uploaded image with selected model
- Parameters:
  - model: Selected model type ('vit' or 'yolo')
  - file: Uploaded image file
- Returns:
  - ViT: Prediction results page with text classification
  - YOLO: Prediction results page with annotated image
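
The request handling described above could look roughly like the following sketch. The `/predict` route path is an assumption, the two predictor functions are stand-ins for `vit_model.predict_vit` and `yolo_model.predict_yolo`, and plain strings are returned where the real application would render `vit_prediction.html` or `yolo_prediction.html`.

```python
import os

from flask import Flask, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
app.config["UPLOAD_FOLDER"] = "static/uploads"


def predict_vit(file_path):
    """Stand-in for vit_model.predict_vit; returns a text label."""
    return "cat"


def predict_yolo(file_path):
    """Stand-in for yolo_model.predict_yolo; returns an image path."""
    return file_path


@app.route("/predict", methods=["POST"])  # route path is an assumption
def predict():
    model = request.form.get("model")
    file = request.files.get("file")

    # Reject requests with an unknown model or no uploaded file.
    if model not in ("vit", "yolo") or file is None or file.filename == "":
        return "Prediction failed.", 400

    # Sanitize the client-supplied name before touching the filesystem.
    filename = secure_filename(file.filename)
    if not filename:
        return "Prediction failed.", 400

    os.makedirs(app.config["UPLOAD_FOLDER"], exist_ok=True)
    file_path = os.path.join(app.config["UPLOAD_FOLDER"], filename)
    file.save(file_path)

    if model == "vit":
        return "Predicted object: {}".format(predict_vit(file_path))
    return "Annotated image: {}".format(predict_yolo(file_path))
```

With the server running, a request such as `curl -F model=vit -F file=@cat.jpg http://localhost:5000/predict` would exercise this handler, assuming the same route path.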
"Vision Transformer Model: Predicts objects in images and returns the predicted object as text."
- Provides semantic understanding of image content
- Returns descriptive text labels
- Ideal for general image classification tasks
- Based on transformer architecture adapted for computer vision
"YOLO Model: Predicts objects in images and returns an image with bounding boxes drawn."
- Real-time object detection and localization
- Returns visual results with bounding boxes
- Identifies multiple objects in a single image
- Provides spatial information about detected objects
- Upload Security: Uses secure_filename() to sanitize uploaded filenames
- File Storage: Images saved to the static/uploads/ directory
- Static Serving: Flask configured to serve uploaded files and prediction results
- Supported Formats: Common image formats (JPG, PNG, etc.)
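
To make the upload handling above concrete, the sketch below combines Werkzeug's `secure_filename()` with an extension check. The `ALLOWED_EXTENSIONS` set and the `safe_upload_path` helper are hypothetical; the source only says that common image formats are supported.

```python
import os

from werkzeug.utils import secure_filename

UPLOAD_FOLDER = "static/uploads"
ALLOWED_EXTENSIONS = {"jpg", "jpeg", "png"}  # assumed allowlist


def safe_upload_path(raw_filename, upload_folder=UPLOAD_FOLDER):
    """Return a safe destination path, or None if the name is unusable."""
    # secure_filename drops path separators and other unsafe characters.
    filename = secure_filename(raw_filename)
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if not filename or ext not in ALLOWED_EXTENSIONS:
        return None
    return os.path.join(upload_folder, filename)
```

A name such as `../../evil.png` is reduced to `evil.png` before it is joined to the upload folder, so the saved file cannot escape `static/uploads/`.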
    from flask import Flask
    app = Flask(__name__, static_folder='static/uploads')
    app.config['UPLOAD_FOLDER'] = 'static/uploads'

- Returns "Prediction failed." message for unsuccessful predictions
- Implements secure filename handling to prevent directory traversal
- Debug mode enabled for development
- Modular design with separate files for each model
- Consistent function signatures for easy model swapping
- Flexible return types (text for ViT, image path for YOLO)
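
One way to take advantage of the shared function signature is a small dispatch table, sketched below with the "Prediction failed." fallback mentioned above. The `PREDICTORS` mapping and `run_prediction` helper are hypothetical names, and the two predictors are stand-ins that return fixed values.

```python
def predict_vit(file_path):
    """Stand-in for vit_model.predict_vit; returns a text label."""
    return "cat"


def predict_yolo(file_path):
    """Stand-in for yolo_model.predict_yolo; returns an image path."""
    return "static/uploads/annotated_street.jpg"


# Every predictor takes a file path, so adding a model is one new entry.
PREDICTORS = {"vit": predict_vit, "yolo": predict_yolo}


def run_prediction(model, file_path):
    """Run the selected model, falling back to the failure message."""
    predictor = PREDICTORS.get(model)
    if predictor is None:
        return "Prediction failed."
    try:
        result = predictor(file_path)
    except Exception:
        # Any model error is reported to the user as a failed prediction.
        return "Prediction failed."
    return result if result else "Prediction failed."
```

Because the return type differs by model (text for ViT, an image path for YOLO), the caller still needs to pick the matching results template.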
- User visits home page and sees model options
- User selects Vision Transformer model
- User uploads a photo of a cat
- System processes image through ViT model
- Results page displays: "Predicted object: Cat"
Alternatively:
- User selects YOLO model
- User uploads a street scene photo
- System processes image through YOLO model
- Results page displays original image with bounding boxes around cars, pedestrians, etc.
- Python 3.7+
- Flask
- Computer vision libraries (specific to your model implementations)
- Sufficient storage space for uploaded images
- GPU recommended for faster model inference (optional)
- Add support for batch processing multiple images
- Implement confidence scores for predictions
- Add model performance benchmarking
- Include data augmentation options
- Add user session management
- Implement prediction history
- Add support for video input
- Include model comparison features
- Add API endpoints for programmatic access
- Model Loading Issues: Ensure model files are properly downloaded and accessible
- File Upload Problems: Check upload directory permissions and disk space
- Memory Issues: Consider using GPU acceleration or reducing image sizes
- Template Errors: Verify all HTML templates are in the templates/ directory
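
For the file upload problems listed above, a quick diagnostic along these lines can confirm that the upload directory exists, is writable, and has free space. The `check_upload_dir` function and the 100 MB threshold are assumptions chosen for illustration.

```python
import os
import shutil


def check_upload_dir(path="static/uploads", min_free_mb=100):
    """Return a list of problems found with the upload directory."""
    problems = []
    if not os.path.isdir(path):
        problems.append("missing directory: " + path)
        return problems
    if not os.access(path, os.W_OK):
        problems.append("directory not writable: " + path)
    # Free space on the filesystem that holds the upload directory.
    free_mb = shutil.disk_usage(path).free // (1024 * 1024)
    if free_mb < min_free_mb:
        problems.append("low disk space: {} MB free".format(free_mb))
    return problems


if __name__ == "__main__":
    for problem in check_upload_dir():
        print(problem)
```

An empty list means none of the three checks failed.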