14 changes: 14 additions & 0 deletions image_processing/Dockerfile
@@ -0,0 +1,14 @@
FROM anyscale/ray:2.51.1-slim-py312-cu128

# C compiler for Triton’s runtime build step (vLLM V1 engine)
# https://github.com/vllm-project/vllm/issues/2997
RUN sudo apt-get update && \
sudo apt-get install -y --no-install-recommends build-essential

RUN curl -LsSf https://astral.sh/uv/install.sh | sh

RUN uv pip install --system huggingface_hub boto3

RUN uv pip install --system vllm==0.11.0

RUN uv pip install --system transformers==4.57.1
55 changes: 55 additions & 0 deletions image_processing/README.md
@@ -0,0 +1,55 @@
# Large-Scale Image Processing with Vision Language Models

This example demonstrates how to build a production-ready image processing pipeline that scales to billions of images using Ray Data and vLLM on Anyscale. We process the [ReLAION-2B dataset](https://huggingface.co/datasets/laion/relaion2B-en-research-safe), which contains over 2 billion image URLs with associated metadata.

## What This Pipeline Does

The pipeline performs three main stages on each image:

1. **Parallel Image Download**: Asynchronously downloads images from URLs using aiohttp with 1,000 concurrent connections, handling timeouts and invalid responses gracefully.

2. **Image Preprocessing**: Validates, resizes, and standardizes images to 128×128 JPEG format in RGB color space using PIL, filtering out corrupted or invalid images.

3. **Vision Model Inference**: Runs the Qwen2.5-VL-3B-Instruct vision-language model using vLLM to generate captions or analyze image content, scaling across up to 64 GPU replicas based on workload.
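Stages 1 and 2 can be sketched as below. This is a minimal illustration, not the code in `process_images.py`: the 5-second timeout, 1,000-connection cap, and 128×128 JPEG target come from the description above, while the function names and everything else are assumptions.

```python
import asyncio
import io
from typing import List, Optional

import aiohttp
from PIL import Image

TARGET_SIZE = (128, 128)   # stage 2 target resolution
TIMEOUT_S = 5.0            # per-image download timeout
MAX_CONNECTIONS = 1000     # concurrent connection cap

async def download_one(session: aiohttp.ClientSession, url: str) -> Optional[bytes]:
    """Fetch one image; return None on timeout or HTTP error instead of raising."""
    try:
        async with session.get(
            url, timeout=aiohttp.ClientTimeout(total=TIMEOUT_S)
        ) as resp:
            resp.raise_for_status()
            return await resp.read()
    except Exception:
        return None

async def download_batch(urls: List[str]) -> List[Optional[bytes]]:
    """Download a batch of URLs concurrently through one shared connection pool."""
    connector = aiohttp.TCPConnector(limit=MAX_CONNECTIONS)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(download_one(session, u) for u in urls))

def preprocess_image(raw: Optional[bytes]) -> Optional[bytes]:
    """Validate, convert to RGB, resize to 128x128, and re-encode as JPEG.

    Returns None for missing or corrupted inputs so callers can filter them out.
    """
    if raw is None:
        return None
    try:
        img = Image.open(io.BytesIO(raw))
        img.load()  # force a full decode; raises on truncated/corrupt files
    except Exception:
        return None
    buf = io.BytesIO()
    img.convert("RGB").resize(TARGET_SIZE).save(buf, format="JPEG")
    return buf.getvalue()
```

Returning `None` instead of raising keeps one bad URL from failing a whole batch; downstream stages simply drop the `None` rows.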

The entire pipeline is orchestrated by Ray Data, which handles distributed execution, fault tolerance, and resource management across your cluster.

## Key Features

- **Massive Scale**: Processes 2B+ images efficiently with automatic resource scaling
- **High Throughput**: Concurrent downloads (1,000 connections) and batched inference (8 images per batch, 16 concurrent batches per GPU)
- **Fault Tolerant**: Gracefully handles network failures, invalid images, and transient errors
- **Cost Optimized**: Automatic GPU autoscaling (up to 64 replicas) based on workload demand
- **Production Ready**: Timestamped outputs, configurable memory limits, and structured error handling

## How to Run

First, make sure you have the [Anyscale CLI](https://docs.anyscale.com/get-started/install-anyscale-cli) installed.

You'll need a HuggingFace token to access the ReLAION-2B dataset. Get one at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).

Submit the job:

```bash
anyscale job submit -f job.yaml --env HF_TOKEN=$HF_TOKEN
```

Or use the convenience script:

```bash
./run.sh
```

Results will be written to `/mnt/shared_storage/process_images_output/{timestamp}/` in Parquet format.

## Configuration

The pipeline is configured for high-throughput processing:

- **Compute**: Up to 768 CPUs and 64 GPUs (g5.12xlarge workers, 4 NVIDIA A10G GPUs each) with autoscaling
- **Vision Model**: Qwen2.5-VL-3B-Instruct served with vLLM, one engine per GPU
- **Download**: 1,000 concurrent connections, 5-second timeout per image
- **Batch Processing**: 50 images per download batch, 8 images per inference batch
- **Output**: 100,000 rows per Parquet file for efficient storage

You can adjust these settings in `process_images.py` and `job.yaml` to match your requirements.
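The knobs listed above might surface in `process_images.py` as module-level constants along these lines. The names here are hypothetical; only the values come from this README:

```python
# Hypothetical constant names; check process_images.py for the actual identifiers.
MAX_CONCURRENT_DOWNLOADS = 1000   # aiohttp connection cap
DOWNLOAD_TIMEOUT_S = 5            # per-image download timeout
DOWNLOAD_BATCH_SIZE = 50          # rows per download batch
INFERENCE_BATCH_SIZE = 8          # images per vLLM inference batch
MAX_GPU_REPLICAS = 64             # upper bound for GPU autoscaling
ROWS_PER_OUTPUT_FILE = 100_000    # Parquet file sizing
```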
42 changes: 42 additions & 0 deletions image_processing/job.yaml
@@ -0,0 +1,42 @@
# View the docs https://docs.anyscale.com/reference/job-api#jobconfig.
name: process-images

# When empty, use the default image. This can be an Anyscale-provided base image
# like anyscale/ray:2.43.0-slim-py312-cu125, a user-provided base image (provided
# that it meets certain specs), or you can build new images using the Anyscale
# image builder at https://console.anyscale.com/v2/container-images.
# image_uri: # anyscale/ray:2.43.0-slim-py312-cu125
containerfile: ./Dockerfile

# When empty, Anyscale will auto-select the instance types. You can also specify
# minimum and maximum resources.
compute_config:
  # Pin worker nodes to g5.12xlarge (4x NVIDIA A10G) so the vision workload
  # always lands on GPU instances.
  worker_nodes:
  - instance_type: g5.12xlarge
    min_nodes: 0
    max_nodes: 16
  max_resources:
    CPU: 768
    GPU: 64

# Path to a local directory or a remote URI to a .zip file (S3, GS, HTTP) that
# will be the working directory for the job. The files in the directory will be
# automatically uploaded to the job environment in Anyscale.
working_dir: .

# When empty, this uses the default Anyscale Cloud in your organization.
cloud:

env_vars:
  RAY_DEFAULT_OBJECT_STORE_MEMORY_PROPORTION: "0.5"

# The script to run in your job. You can also do "uv run main.py" if you have a
# pyproject.toml file in your working_dir.
entrypoint: python process_images.py

# If there is an error, do not retry.
max_retries: 0

# Kill the job after 2 hours to control costs.
timeout_s: 7200