Merged

Commits (33)
29e86a5
adding arc values
xe-nvdk Oct 6, 2025
a433b23
we missed one query, now is complete
xe-nvdk Oct 6, 2025
01b692e
fixing run.sh and re run, just in case both benchmark in pro m3 max a…
xe-nvdk Oct 6, 2025
b663892
disabling query caching and re ran the benchmarks
xe-nvdk Oct 6, 2025
7a40588
updating repo to match the current for arc
xe-nvdk Oct 7, 2025
b934055
Merge branch 'main' into main
xe-nvdk Oct 7, 2025
fa56ed3
Merge branch 'ClickHouse:main' into main
xe-nvdk Oct 9, 2025
7c0ccab
Merge branch 'ClickHouse:main' into main
xe-nvdk Oct 9, 2025
1db5924
adding updated values for m3 max
xe-nvdk Oct 11, 2025
08fe758
Merge branch 'main' of github.com:Basekick-Labs/ClickBench
xe-nvdk Oct 11, 2025
bde45ce
updating results and scripts for arc
xe-nvdk Oct 12, 2025
7135fff
Merge branch 'ClickHouse:main' into main
xe-nvdk Oct 12, 2025
3a00ca3
fixing benchmark to load the data
xe-nvdk Oct 12, 2025
757d7fa
Merge branch 'main' of github.com:Basekick-Labs/ClickBench
xe-nvdk Oct 12, 2025
6e70633
fixing token creation
xe-nvdk Oct 12, 2025
32c62ba
fixing api env passing
xe-nvdk Oct 12, 2025
56702bc
fixing db specification for api creation
xe-nvdk Oct 12, 2025
82abc81
making sure that we don't have enabled query cache
xe-nvdk Oct 12, 2025
d6904f8
adding results for arc in clickbench
xe-nvdk Oct 12, 2025
48a8fc9
Merge branch 'main' into main
xe-nvdk Oct 13, 2025
8333f83
Merge branch 'ClickHouse:main' into main
xe-nvdk Oct 13, 2025
799b4a7
refining format of the results
xe-nvdk Oct 13, 2025
ecd0414
refining format of the results
xe-nvdk Oct 13, 2025
b905b50
Merge branch 'ClickHouse:main' into main
xe-nvdk Oct 13, 2025
ad86bf5
Merge branch 'main' of github.com:Basekick-Labs/ClickBench
xe-nvdk Oct 13, 2025
97da2bd
deleting comments in the results
xe-nvdk Oct 13, 2025
716b715
adding time-series tag
xe-nvdk Oct 13, 2025
705c8bf
Merge branch 'ClickHouse:main' into main
xe-nvdk Oct 14, 2025
4fef7fc
fix: improve benchmark output clarity and cache status reporting
xe-nvdk Oct 15, 2025
a49a8ef
Some fixes for results display, and print of caching status
xe-nvdk Oct 15, 2025
9a0b9b1
fixing and modifying things based on clickbench team
xe-nvdk Oct 16, 2025
229e53f
Merge branch 'main' of github.com:Basekick-Labs/ClickBench
xe-nvdk Oct 16, 2025
9f3b46d
Merge branch 'ClickHouse:main' into main
xe-nvdk Oct 16, 2025
175 changes: 175 additions & 0 deletions arc/README.md
@@ -0,0 +1,175 @@
# Arc - ClickBench Benchmark

Arc is a high-performance time-series data warehouse built on DuckDB, Parquet, and object storage.

## System Information

- **System:** Arc
- **Date:** 2025-10-15
- **Machine:** m3_max (14 cores, 36GB RAM)
- **Tags:** Python, time-series, DuckDB, Parquet, columnar, HTTP API
- **License:** AGPL-3.0
- **Repository:** https://github.com/Basekick-Labs/arc

## Performance

Arc achieves:
- **Write throughput:** 1.89M records/sec (MessagePack binary protocol)
- **ClickBench:** ~22 seconds total (43 analytical queries)
- **Storage:** DuckDB + Parquet with MinIO/S3/GCS backends

## Prerequisites

- Ubuntu/Debian Linux (or compatible)
- Python 3.11+
- 8GB+ RAM recommended
- Internet connection for dataset download
- Sudo access (only if system dependencies are missing)

> **Member:** There should be no prerequisites - the benchmark runs automatically on an empty AWS machine with an Ubuntu AMI.
>
> **Contributor Author:** Thanks for the feedback. We'll revisit the submission later this year. For now, we're happy to have the benchmark numbers internally and will use them for our own reference. Once we release official binaries, we'll try again to get included in ClickBench.
>
> **Member:** It's not a problem, let's push this PR to ClickBench. The more systems included, the better.
>
> **Contributor Author:** Hi @alexey-milovidov, we just updated: we were able to run benchmark.sh according to the ClickBench guidelines. Let me know if you have any issues running it, but there shouldn't be any. Thank you.

## Quick Start

The benchmark script handles everything automatically:

```bash
./benchmark.sh
```

This will:
1. Create a Python virtual environment (no system packages modified)
2. Clone the Arc repository
3. Install dependencies into the venv
4. Start the Arc server with the optimal worker count (2x CPU cores)
5. Download the ClickBench dataset (a ~14GB Parquet file)
6. Run 43 queries × 3 iterations each
7. Output results in ClickBench JSON format

## Manual Steps

### 1. Install Dependencies

```bash
sudo apt-get update -y
sudo apt-get install -y python3-pip python3-venv wget curl
```

### 2. Create Virtual Environment

```bash
python3 -m venv arc-venv
source arc-venv/bin/activate
```

### 3. Clone and Set Up Arc

```bash
git clone https://github.com/Basekick-Labs/arc.git
cd arc
pip install -r requirements.txt
mkdir -p data logs
```

### 4. Create API Token

```bash
python3 << 'EOF'
from api.auth import AuthManager

auth = AuthManager(db_path='./data/arc.db')
token = auth.create_token(name='clickbench', description='ClickBench benchmark')
print(f"Token: {token}")
EOF
```
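
To sanity-check the token, you can hit the server once it is up. A minimal sketch; the endpoint path and the Bearer auth scheme are assumptions here, not Arc's documented API:

```python
# Hypothetical token check (illustrative only).
# ASSUMPTION: the endpoint path and Authorization header scheme are
# guesses; consult the Arc repository for the actual API surface.
import requests

token = "your-token-from-step-4"
resp = requests.get(
    "http://localhost:8000/health",  # hypothetical endpoint
    headers={"Authorization": f"Bearer {token}"},
    timeout=10,
)
print(resp.status_code, resp.text[:200])
```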

### 5. Start Arc Server

```bash
# Auto-detect cores
CORES=$(nproc)
WORKERS=$((CORES * 2))

# Start server
gunicorn -w $WORKERS -b 0.0.0.0:8000 \
  -k uvicorn.workers.UvicornWorker \
  --timeout 300 \
  api.main:app
```
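
Before launching queries, it helps to wait until the server is actually accepting connections. A small sketch using only a TCP probe, so it makes no assumptions about Arc's HTTP routes:

```python
# Wait for the Arc server to accept TCP connections on port 8000.
# Uses a raw socket probe, so no assumptions about Arc's API routes.
import socket
import time

deadline = time.time() + 60
while time.time() < deadline:
    try:
        with socket.create_connection(("localhost", 8000), timeout=2):
            print("Arc server is up")
            break
    except OSError:
        time.sleep(1)
else:
    raise SystemExit("Arc server did not come up within 60 seconds")
```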

### 6. Download Dataset

```bash
wget https://datasets.clickhouse.com/hits_compatible/hits.parquet
```

### 7. Run Benchmark

```bash
export ARC_URL="http://localhost:8000"
export ARC_API_KEY="your-token-from-step-4"
export DATABASE="clickbench"
export TABLE="hits"

./run.sh
```

**Note:** The benchmark uses the Apache Arrow columnar format for optimal performance; `pyarrow` must be installed.
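
For reference, the client side of such a query roughly amounts to the following. A sketch only: the `/query` route, request body shape, and auth header are illustrative assumptions; see `run.sh` in this directory for the actual client:

```python
# Sketch: send a SQL query and deserialize an Arrow IPC response.
# ASSUMPTIONS: the endpoint path, JSON body shape, and auth header
# are guesses for illustration; run.sh holds the real client code.
import os

import pyarrow.ipc as ipc
import requests

resp = requests.post(
    f"{os.environ['ARC_URL']}/query",  # hypothetical route
    headers={"Authorization": f"Bearer {os.environ['ARC_API_KEY']}"},
    json={"database": "clickbench", "query": "SELECT COUNT(*) FROM hits"},
    timeout=300,
)
resp.raise_for_status()
table = ipc.open_stream(resp.content).read_all()  # Arrow IPC -> pyarrow.Table
print(table.num_rows, table.schema)
```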

## Configuration

Arc uses optimal settings for ClickBench (all automatic, no configuration needed):

- **Workers:** Auto-detected cores × 2 (optimal for analytical workloads)
- **Query cache:** Disabled (per ClickBench rules)
- **Storage:** Local filesystem (fastest for single-node)
- **Timeout:** 300 seconds per query
- **Format:** Apache Arrow (columnar, high-performance)

## Results Format

Results are output in official ClickBench format:

```
Load time: 0
Data size: 14779976446
[0.0226, 0.0233, 0.0284]
[0.0324, 0.0334, 0.0392]
...
```

- **Load time:** Arc queries Parquet files directly without a data loading phase, so the load time is 0
- **Data size:** Size of the dataset in bytes (~14.8GB)
- **Query results:** 43 lines, each containing 3 execution times (in seconds) for the same query, as illustrated below
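
The per-query lines can be produced with logic like the following. This is a sketch of the output format only, with placeholder timings, not the actual `run.sh` implementation:

```python
# Sketch: emit timings in the ClickBench format shown above.
load_time = 0  # Arc has no separate load phase
data_size = 14779976446  # dataset size in bytes
timings = [
    [0.0226, 0.0233, 0.0284],  # placeholder values; a real run has 43 rows
    [0.0324, 0.0334, 0.0392],
]

print(f"Load time: {load_time}")
print(f"Data size: {data_size}")
for runs in timings:  # one line per query, three runs each
    print("[" + ", ".join(f"{t:.4g}" for t in runs) + "]")
```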

## Notes

- **Virtual Environment:** All dependencies are installed in an isolated venv (no `--break-system-packages` needed)
- **Authentication:** Uses Arc's built-in token auth (simpler than permission-based auth)
- **Query Cache:** Disabled to ensure a fair benchmark (no cache hits)
- **Worker Count:** Auto-detected from the CPU core count, optimized for analytical workloads
- **Timeout:** A generous 300s per query, to accommodate complex queries

## Architecture

```
ClickBench Query → Arc Arrow API → DuckDB → Parquet File → Arrow Results
```

Arc queries the Parquet file directly via DuckDB's `read_parquet()` function and returns results in the Apache Arrow columnar format for maximum efficiency, as sketched below.
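
Conceptually, each benchmark query reduces to a DuckDB call like this. A simplified sketch of the idea, not Arc's internal code:

```python
# Sketch: what a single ClickBench query boils down to inside Arc.
# DuckDB scans the Parquet file directly; results come back as Arrow.
import duckdb

con = duckdb.connect()  # in-memory database; no load phase needed
table = con.sql(
    "SELECT COUNT(*) FROM read_parquet('hits.parquet')"
).arrow()  # fetch results as a pyarrow.Table
print(table)
```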

## Performance Characteristics

Arc is optimized for:
- **High-throughput writes** (1.89M RPS with MessagePack)
- **Analytical queries** (DuckDB's columnar engine)
- **Columnar data transfer** (Apache Arrow IPC for efficient results)
- **Object storage** (S3, GCS, MinIO compatibility)
- **Time-series workloads** (built-in time-based indexing)

## Support

- GitHub: https://github.com/Basekick-Labs/arc
- Issues: https://github.com/Basekick-Labs/arc/issues
- Docs: https://docs.arc.basekick.com (coming soon)

## License

Arc Core is licensed under AGPL-3.0.