265 changes: 265 additions & 0 deletions LARGE_DATASET_UPLOAD_FIX.md
@@ -0,0 +1,265 @@
# Large Dataset Upload Fix

## Problem Summary
Users attempting to upload large datasets (2.7GB+) were encountering:
1. **Server-side rejection**: PHP upload limits too restrictive (2MB max)
2. **Client-side OverflowError**: Python SSL limitation when sending >2GB as a single buffer

## Server-Side Fix (COMPLETED ✓)

### Changes Made to `/docker/config/php.ini`:

| Setting | Old Value | New Value | Purpose |
|---------|-----------|-----------|---------|
| `upload_max_filesize` | 2M | **5G** | Maximum size per uploaded file |
| `post_max_size` | 8M | **5G** | Maximum total POST request size |
| `max_execution_time` | 30 | **3600** | Maximum script runtime (1 hour) |
| `memory_limit` | 16G | 16G | Already sufficient ✓ |
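
For reference, the corresponding entries in `docker/config/php.ini` should now read roughly as follows (values taken from the table above; other settings in the file are unchanged):
```ini
upload_max_filesize = 5G
post_max_size = 5G
max_execution_time = 3600
memory_limit = 16G
```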

### Deployment Required
After making these changes, you **must restart** the OpenML Docker container:
```bash
docker-compose down
docker-compose up -d --build
```

Or, if using plain Docker, stop and start the container (note: if `php.ini` is baked into the image at build time rather than mounted as a volume, rebuild the image first; see `docker/README.md`):
```bash
docker stop <container_name>
docker start <container_name>
```

---

## Client-Side Issue (Still Needs Addressing)

### The OverflowError Explained
```
OverflowError: string longer than 2147483647 bytes
```

**Root cause**: Python's SSL layer uses a signed 32-bit integer for write buffer length. This limits a single `send()` call to 2,147,483,647 bytes (2^31-1 ≈ 2GB).
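
A quick check of where the number in the traceback comes from:
```python
# 2147483647 is exactly the largest signed 32-bit integer
limit = 2**31 - 1
print(limit)            # 2147483647 bytes
print(limit / 1024**3)  # ~2.0 GiB, so a 2.7GB buffer cannot be sent in one write
```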

**Why it happens**: The `openml-python` client or `requests` library may be:
1. Reading the entire 2.7GB file into memory as one bytes object
2. Building the entire multipart POST body in memory
3. Attempting to send it in one SSL write operation

### Solutions for Client-Side

#### Option 1: Stream the Upload (RECOMMENDED)
Modify how the file is passed to the OpenML client. Instead of:
```python
# BAD - loads entire file into memory
with open('dataset.arff', 'rb') as f:
    data = f.read()  # 2.7GB in RAM!
openml_dataset.publish()  # triggers OverflowError
```

Use streaming (requires patching openml-python or using direct requests):
```python
# GOOD - streams in chunks
import requests

with open('dataset.arff', 'rb') as f:
    files = {'dataset': ('dataset.arff', f)}  # Pass the file handle, not bytes
    response = requests.post(
        'https://openml.org/api/v1/data',
        files=files,
        data={'api_key': 'YOUR_KEY', 'description': xml_description}
    )
```

**Note**: If `openml-python` internally calls `f.read()`, you'll need to patch it or use Option 2/3.
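
Also note that some versions of `requests` assemble the whole multipart body in memory even when given a file handle, which reintroduces the same error. If the OverflowError persists with the snippet above, the `MultipartEncoder` from the separate `requests-toolbelt` package streams the body in small chunks. A minimal sketch, assuming the same endpoint and field names as above (the XML string is a placeholder):
```python
# Sketch of a streamed multipart upload using requests-toolbelt
# (pip install requests-toolbelt). Endpoint, field names, and the XML
# placeholder are assumptions carried over from the example above.
import requests
from requests_toolbelt import MultipartEncoder

xml_description = "<oml:data_set_description>...</oml:data_set_description>"  # as above

with open('dataset.arff', 'rb') as f:
    encoder = MultipartEncoder(fields={
        'api_key': 'YOUR_KEY',
        'description': xml_description,
        'dataset': ('dataset.arff', f, 'application/octet-stream'),
    })
    # The encoder is read in small chunks, so no single >2GB buffer is
    # ever handed to the SSL layer.
    response = requests.post(
        'https://openml.org/api/v1/data',
        data=encoder,
        headers={'Content-Type': encoder.content_type},
    )
print(response.status_code)
```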

#### Option 2: Compress Before Upload
Reduce file size below 2GB:
```bash
# ARFF supports gzip compression
gzip dataset.arff
# Result: dataset.arff.gz (often 10-50x smaller for sparse data)
```

Then upload the `.arff.gz` file. OpenML should accept compressed ARFF.
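
Before uploading, confirm the compressed file is actually below the ~2GB client-side limit (commands assume a Linux shell):
```bash
ls -lh dataset.arff.gz      # human-readable size
stat -c %s dataset.arff.gz  # size in bytes; must be < 2147483647
```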

#### Option 3: Host Externally and Register by URL
Upload to a service that handles large files:
- **Zenodo**: Free, DOI-based, handles 50GB+
- **AWS S3**: Pay-per-use, unlimited size
- **Institutional repository**: Check your university

Then register the dataset in OpenML by URL:
```python
import openml

dataset = openml.datasets.OpenMLDataset(
    name="My Large Dataset",
    description="...",
    url="https://zenodo.org/record/12345/files/dataset.arff.gz",
    format="arff",
    version_label="1.0"
)
dataset.publish()
```

#### Option 4: Patch openml-python
If you control the client environment, patch the library to use streaming:

**File to patch**: `<python_site_packages>/openml/_api_calls.py`

Find the section that builds `file_elements` and ensure it passes file handles, not bytes:
```python
# In _perform_api_call or _read_url_files
# BEFORE (bad):
file_data = open(filepath, 'rb').read() # Loads all into memory
file_elements = {'dataset': (filename, file_data)}

# AFTER (good):
file_handle = open(filepath, 'rb') # Keep handle open
file_elements = {'dataset': (filename, file_handle)}
```

---

## Testing Your Fix

### Server-Side Test
1. Check PHP configuration is loaded:
```bash
docker exec <container_name> php -i | grep -E 'upload_max_filesize|post_max_size|max_execution_time'
```
Should show: `upload_max_filesize => 5G`, `post_max_size => 5G`, `max_execution_time => 3600`

2. Try a test upload via curl:
```bash
curl -X POST https://your-openml-server.org/api/v1/data \
  -F "api_key=YOUR_KEY" \
  -F "description=@description.xml" \
  -F "dataset=@test_large_file.arff"
```

### Client-Side Test
1. Try uploading a ~1GB file first (below the 2GB SSL limit); a throwaway test file can be generated as shown below
2. Monitor memory usage: `htop` or Task Manager
3. If the upload succeeds and memory usage stays low, the client is streaming properly
4. For 2.7GB files, use compression or external hosting
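
A throwaway ~1GB file for step 1 can be generated like this on Linux (it is not valid ARFF, so the API may reject its content, but it still exercises the upload path and size limits):
```bash
dd if=/dev/urandom of=test_large_file.arff bs=1M count=1024
```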

---

## Recommended Workflow for 2.7GB Dataset

**Best approach combining all solutions:**

1. **Compress the dataset** (reduces transfer time and bypasses SSL limit):
```bash
gzip -9 dataset.arff # Maximum compression
```

2. **Verify server config** (already fixed in this repo):
- Restart Docker container to load new php.ini

3. **Upload via direct HTTP streaming** (bypass openml-python client):
```python
import requests

api_key = "YOUR_API_KEY"
url = "https://openml.org/api/v1/data"

# Prepare XML description
xml_desc = """<?xml version="1.0" encoding="UTF-8"?>
<oml:data_set_description xmlns:oml="http://openml.org/openml">
<oml:name>Dataset Name</oml:name>
<oml:description>Description here</oml:description>
<oml:format>arff</oml:format>
</oml:data_set_description>"""

# Stream upload
with open('dataset.arff.gz', 'rb') as f:
    response = requests.post(
        url,
        data={'api_key': api_key, 'description': xml_desc},
        files={'dataset': ('dataset.arff.gz', f)},
        timeout=3600  # 1 hour timeout for large uploads
    )

print(response.text)
```

4. **Monitor upload progress** (optional):
```python
import os

from tqdm import tqdm
import requests

# Wrapper that reports progress as the file is read during upload
class TqdmUploader:
    def __init__(self, filename):
        self.filename = filename
        self.size = os.path.getsize(filename)
        self.progress = tqdm(total=self.size, unit='B', unit_scale=True)

    def __enter__(self):
        self.f = open(self.filename, 'rb')
        return self

    def __exit__(self, *args):
        self.f.close()
        self.progress.close()

    def read(self, size=-1):
        chunk = self.f.read(size)
        self.progress.update(len(chunk))
        return chunk

with TqdmUploader('dataset.arff.gz') as uploader:
    # Reuse url, api_key, and xml_desc from step 3
    response = requests.post(
        url,
        data={'api_key': api_key, 'description': xml_desc},
        files={'dataset': uploader},
        timeout=3600
    )
```

---

## Additional Considerations

### Web Server Configuration
If you're using **nginx** as a reverse proxy (not present in current setup), also add:
```nginx
client_max_body_size 5G;
proxy_read_timeout 3600s;
```

### Network Timeouts
For very large uploads over slow connections:
- **Client timeout**: Set `timeout=7200` in requests (2 hours)
- **Server timeout**: Already set via `max_execution_time = 3600`
- **Load balancer timeout**: Check cloud provider settings (AWS ALB, GCP LB, etc.)

### Storage Space
Uploading 2.7GB datasets requires adequate disk space:
- **Temporary space**: `/tmp` needs ~2.7GB during upload
- **Final storage**: `DATA_PATH` needs ~2.7GB per dataset
- **Recommended**: 50GB+ free space on the server (a quick check is shown below)
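
A quick way to check free space on the server (the `DATA_PATH` variable below is an assumption; substitute the actual dataset storage directory):
```bash
df -h /tmp           # temporary upload staging
df -h "$DATA_PATH"   # final dataset storage location
```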

### Alternative: Split Dataset
If all else fails, consider splitting into multiple smaller datasets:
```python
# Split the source data into ~1M-row parts without loading it all at once
import pandas as pd

chunk_size = 1_000_000  # 1M rows per part

for i, chunk in enumerate(pd.read_csv('dataset.csv', chunksize=chunk_size)):
    # Each part is written as CSV here; convert to ARFF before uploading
    # if your workflow requires the ARFF format.
    chunk.to_csv(f'dataset_part{i}.csv', index=False)
# Upload each part as a separate dataset
```

---

## Summary

✅ **Server-side limits fixed** (this repo)
⚠️ **Client-side still requires one of**:
- File compression (easiest)
- Streaming upload (most robust)
- External hosting (most flexible)

**For your 2.7GB file**: compress with gzip first; for typical datasets this should reduce it to under 500MB.
79 changes: 79 additions & 0 deletions QUICK_FIX_OVERFLOW_ERROR.md
@@ -0,0 +1,79 @@
# Quick Fix: OverflowError on Large Dataset Upload

## Error You're Seeing
```
OverflowError: string longer than 2147483647 bytes
```

## Immediate Solutions (Pick One)

### Solution 1: Compress Your Dataset (EASIEST) ⭐
```bash
gzip -9 your_dataset.arff
```
This typically reduces file size by 80-95% for sparse datasets. Upload the `.arff.gz` file instead.

### Solution 2: Use Direct HTTP Upload (MOST RELIABLE)
Replace your `publish_dataset.py` with this:

```python
import requests
import os

# Configuration
API_KEY = "your_api_key_here"
DATASET_FILE = "your_dataset.arff" # or .arff.gz
DATASET_NAME = "Your Dataset Name"
DATASET_DESCRIPTION = "Description of your dataset"

# Create XML description
xml_description = f"""<?xml version="1.0" encoding="UTF-8"?>
<oml:data_set_description xmlns:oml="http://openml.org/openml">
<oml:name>{DATASET_NAME}</oml:name>
<oml:description>{DATASET_DESCRIPTION}</oml:description>
<oml:format>arff</oml:format>
</oml:data_set_description>"""

# Upload with streaming (no memory overflow)
print(f"Uploading {DATASET_FILE} ({os.path.getsize(DATASET_FILE) / 1e9:.2f} GB)...")
with open(DATASET_FILE, 'rb') as f:
    response = requests.post(
        'https://www.openml.org/api/v1/data',
        data={
            'api_key': API_KEY,
            'description': xml_description
        },
        files={'dataset': (os.path.basename(DATASET_FILE), f)},
        timeout=7200  # 2 hour timeout
    )

print(response.status_code)
print(response.text)
```

### Solution 3: Host Externally (BEST FOR VERY LARGE FILES)
1. Upload to Zenodo, Figshare, or S3
2. Get the permanent URL
3. Register in OpenML:

```python
import openml

dataset = openml.datasets.OpenMLDataset(
    name="Your Dataset Name",
    description="Your description",
    url="https://zenodo.org/record/XXXXX/files/dataset.arff.gz",
    format="arff"
)
dataset.publish()
```

## Why This Happens

1. **Python limitation**: a single SSL write cannot exceed 2GB (the signed 32-bit integer maximum)
2. **Client bug**: openml-python loads the entire file into memory instead of streaming it
3. **Server limits**: the default OpenML server upload limit was 2MB (now raised to 5GB in this repo)

## Need More Help?

See [LARGE_DATASET_UPLOAD_FIX.md](./LARGE_DATASET_UPLOAD_FIX.md) for complete details.
11 changes: 11 additions & 0 deletions docker/README.md
@@ -26,3 +26,14 @@ Note that the protocol is `http` not `https`.
```bash
docker build --tag openml/php-rest-api -f docker/Dockerfile .
```

## Upload Limits

The server is configured to support large dataset uploads:
- **Maximum upload size**: 5GB per file
- **Maximum POST size**: 5GB
- **Execution timeout**: 3600 seconds (1 hour)

These limits are set in `docker/config/php.ini`. If you need to change them, modify the file and rebuild the container.
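
To verify that a running container picked up these values, replace `<container_name>` with your container's name and run:
```bash
docker exec <container_name> php -i | grep -E 'upload_max_filesize|post_max_size|max_execution_time'
```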

For uploading very large datasets (>2GB), see [LARGE_DATASET_UPLOAD_FIX.md](../LARGE_DATASET_UPLOAD_FIX.md) for client-side considerations.
2 changes: 1 addition & 1 deletion docker/config/api.conf
@@ -15,7 +15,7 @@ HostnameLookups Off
</Directory>

<Directory /var/www/openml>
Options Indexes FollowSymLinks MultiViews
Options FollowSymLinks MultiViews
AllowOverride All
Require all granted
</Directory>