Skip to content

Hottier fix#1645

Open
parmesant wants to merge 13 commits intoparseablehq:mainfrom
parmesant:hottier-fix
Open

Hottier fix#1645
parmesant wants to merge 13 commits intoparseablehq:mainfrom
parmesant:hottier-fix

Conversation

@parmesant
Copy link
Copy Markdown
Contributor

@parmesant parmesant commented May 8, 2026

Introduces multiple changes to the hottier flow

  • Split historic data sync and latest data sync into two tasks
  • Each stream gets its own hot-tier tasks
  • File downloads are parallel and chunked

New env vars-

  • P_HOT_TIER_DOWNLOAD_CHUNK_SIZE
  • P_HOT_TIER_DOWNLOAD_CONCURRENCY
  • P_HOT_TIER_FILES_PER_STREAM_CONCURRENCY
  • P_HOT_TIER_LATEST_MINUTES
  • P_HOT_TIER_HISTORIC_SYNC_MINUTES

Description


This PR has:

  • been tested to ensure log ingestion and log query works.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Summary by CodeRabbit

  • New Features

    • New CLI options to tune hot-tier downloads (chunk size, concurrency, per-stream file concurrency, latest vs historic time windows).
    • Buffered-write support added across storage backends.
    • New billing metric for partial-file scans in object-store operations.
    • Hot-tier updates now spawn background sync tasks automatically.
  • Improvements

    • Hot-tier sync redesigned to per-stream, phase-aware background loops with safer eviction and state reconciliation.
    • Faster parallel ranged downloads with partial-file staging and bounded concurrency.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 8, 2026

Review Change Stack

Walkthrough

Per-stream hot-tier sync now runs Latest/Historic background loops; ObjectStorage adds async buffered_write and partial-path helper. S3/GCS/Azure implement staged parallel ranged downloads; LocalFS supports atomic partial writes. New partial-file scan metrics and CLI tuning options were added; HTTP handler spawns stream tasks after config changes.

Changes

Hot-Tier Concurrent Download Architecture

Layer / File(s) Summary
CLI Configuration
src/cli.rs
Five new hot-tier tuning parameters: chunk size, download concurrency, per-stream file concurrency, Latest time window (minutes), and historic sync interval (minutes).
Partial Scan Metrics
src/metrics/mod.rs
New IntCounterVec PARTIAL_FILE_SCANS_IN_OBJECT_STORE_CALLS_BY_DATE; registered in custom_metrics and an increment helper added.
Storage Trait & Utilities
src/storage/object_storage.rs, src/storage/mod.rs
ObjectStorage gains buffered_write; partial_path helper computes .partial sibling filenames for atomic staged writes.
LocalFS Buffered Write
src/storage/localfs.rs
LocalFS::buffered_write copies to .partial then renames to final path; cleans up partial on error.
S3 Parallel Download
src/storage/s3.rs
Adds _parallel_download/_parallel_download_inner: HEAD sizing, pre-allocated partial, semaphore-limited ranged GETs, writes at offsets, fsync, rename; buffered_write delegates to it.
GCS Parallel Download
src/storage/gcs.rs
Implements parallel ranged download helpers and buffered_write with HEAD sizing, range GET concurrency, offset writes, sync, and rename.
Azure Blob Parallel Download
src/storage/azure_blob.rs
Wraps client in Arc, adds _parallel_download/inner for ranged concurrent downloads, writes at offsets, updates metrics, and implements buffered_write.
Hot-Tier Manager Types & Caching
src/hottier.rs
Adds per-(tenant,stream) StreamSyncState with async mutex, SyncPhase enum, state_cache and get_or_load_state/invalidate_state for cached reconciliation.
Reconcile, Boot & Phase Processing
src/hottier.rs
reconcile_stream removes .partial orphans and reconciles manifests; download_from_s3 discovers streams and spawns Latest/Historic loops; process_stream/process_manifest/process_parquet_file_concurrent implement phase-aware work lists and reservation/commit flow; eviction tracks freed bytes and skips newer candidates; delete_hot_tier aborts running stream tasks before cleanup.
HTTP Handler Integration
src/handlers/http/logstream.rs
put_stream_hot_tier calls spawn_stream_tasks after persisting hot-tier config to start background loops.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

for next release

Suggested reviewers

  • nitisht
  • nikhilsinhaparseable

Poem

🐰 I tunneled through manifests and streams,
chunked the bytes into tidy beams.
Reserve the space, fetch every part,
sync and rename — a tiny art.
Background loops hum, hot-tier dreams.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'Hottier fix' is vague and generic, using a non-descriptive term that does not convey meaningful information about the specific changes made in this substantial changeset. Replace with a more specific title that describes the main change, such as 'Split hot-tier sync into per-stream tasks with parallel chunked downloads' or 'Refactor hot-tier manager for per-stream sync phases and concurrent file downloads'.
✅ Passed checks (4 passed)
Check name Status Explanation
Description check ✅ Passed The description adequately lists key changes and new environment variables introduced, though the checklist items remain unchecked and no explicit mention of testing or documentation is provided.
Docstring Coverage ✅ Passed Docstring coverage is 94.87% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/handlers/http/logstream.rs (1)

455-474: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Start the background sync only after the hot-tier config is durably persisted.

spawn_stream_tasks() runs before put_stream_json(). If the metastore write fails, this request returns an error even though the in-memory hot-tier state and background tasks are already live, so the feature is partially enabled.

Suggested fix
     hot_tier_manager
         .put_hot_tier(&stream_name, &mut hottier, &tenant_id)
         .await?;
-    hot_tier_manager
-        .spawn_stream_tasks(stream_name.clone(), tenant_id.clone())
-        .await;

     let mut stream_metadata: ObjectStoreFormat = serde_json::from_slice(
         &PARSEABLE
             .metastore
             .get_stream_json(&stream_name, false, &tenant_id)
@@
     PARSEABLE
         .metastore
         .put_stream_json(&stream_metadata, &stream_name, &tenant_id)
         .await?;
+
+    hot_tier_manager
+        .spawn_stream_tasks(stream_name.clone(), tenant_id.clone())
+        .await;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/handlers/http/logstream.rs` around lines 455 - 474, The code starts
background tasks via hot_tier_manager.spawn_stream_tasks(...) before persisting
the updated stream metadata; reorder operations so you first update
stream_metadata.hot_tier_enabled and hot_tier, call
PARSEABLE.metastore.put_stream_json(&stream_metadata, &stream_name,
&tenant_id).await? to durably persist the change, and only after that call
hot_tier_manager.spawn_stream_tasks(stream_name.clone(),
tenant_id.clone()).await; keep the existing hot_tier_manager.put_hot_tier(...)
call but ensure the metastore write succeeds before spawning tasks to avoid
partially enabled state.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cli.rs`:
- Around line 341-355: Make these clap args enforce valid ranges at parse time:
for the field hot_tier_latest_minutes (i64) add a value_parser that requires >=
1 (e.g. value_parser = clap::value_parser!(i64).range(1..)) so negatives/zero
are rejected, and for hot_tier_historic_sync_minutes (u32) add a value_parser
that requires >= 1 (e.g. value_parser = clap::value_parser!(u32).range(1..)) so
zero is rejected; update the #[arg(...)] attributes for those two fields in the
CLI struct accordingly.

In `@src/hottier.rs`:
- Around line 620-621: The cutoff boundary is recomputed separately causing race
classification and duplicate commits; instead compute a single stable cutoff
once per tick and use that same value in both phases (replace each recomputation
of latest_minutes/cutoff such as where PARSEABLE.options.hot_tier_latest_minutes
and chrono::Utc::now() are used at lines around the listed ranges) or implement
an in-flight claim keyed by file_path that both phases consult before
reserving/committing a parquet (i.e., add a shared claim
set/claim_file(file_path) check before parquet_path.exists() reservation and
before commit to prevent double-reserve/commit and duplicate entries in
hottier.manifest.json).
- Around line 109-127: The cached StreamSyncState returned by get_or_load_state
can become stale; change get_or_load_state so that when a state exists in
state_cache you still reconcile and refresh it before returning: call
reconcile_stream(stream, tenant_id).await to obtain the latest sht, create a new
Arc<StreamSyncState> (with AsyncMutex::new(sht)) and replace the existing entry
in state_cache (cache.insert(key, state.clone())) so the returned state reflects
updated .hot_tier.json limits; keep the existing fast-path only if you add an
explicit no-refresh flag, otherwise always refresh the cached entry before
returning from get_or_load_state (references: get_or_load_state, state_cache,
StreamSyncState, reconcile_stream, spawn_stream_tasks).
- Around line 762-820: The current logic treats an exact-size file as "no space"
because both comparisons use <=; change the two comparisons that check space to
strict < so exact-fit reservations are allowed: modify the condition using
self.is_disk_available(parquet_file.file_size).await? || sht.available_size <=
parquet_file.file_size to use sht.available_size < parquet_file.file_size (and
ensure the later if sht.available_size <= parquet_file.file_size becomes if
sht.available_size < parquet_file.file_size); keep the eviction flow
(cleanup_hot_tier_old_data, SyncPhase::Latest/ Historic) and the subsequent
sht.available_size -= parquet_file.file_size and self.put_hot_tier(...) behavior
unchanged.

In `@src/storage/azure_blob.rs`:
- Around line 271-296: The current loop spawns one Tokio task per range
(creating many idle tasks) even though concurrency is limited by semaphore;
instead, drive ranges through futures::stream::iter(ranges).map(|r| { let client
= client.clone(); let src = src.clone(); let std_file = std_file.clone(); let
semaphore = semaphore.clone(); async move { /* acquire permit, call
client.get_range(&src, r.clone()).await, write with spawn_blocking +
std_file.write_all_at(...), map errors to ObjectStorageError as before */ }
}).buffer_unordered(concurrency).collect::<Result<Vec<_>, _>>().await to bound
both task creation and execution; preserve existing error mapping
(ObjectStorageError::Custom with messages like "semaphore closed", "join", "join
error") and keep using r.start for the file offset and client.get_range(&src,
r.clone()) for reads.

In `@src/storage/s3.rs`:
- Around line 342-359: In _parallel_download (and similarly in other
buffered-write backends) ensure the temporary partial file produced by
partial_path(write_path) is removed if the final tokio::fs::rename(&partial,
&write_path).await fails: after calling _parallel_download_inner and before
returning Ok(()) call rename, and if rename returns Err attempt
tokio::fs::remove_file(&partial).await (ignoring its error) and then return the
rename error; also apply the same cleanup logic where write/rename is performed
so orphaned .partial files are always removed on failure.
- Around line 385-389: The code is clamping parsed values with .max(...) which
forces minimums and prevents lowering P_HOT_TIER_DOWNLOAD_CHUNK_SIZE and
P_HOT_TIER_DOWNLOAD_CONCURRENCY; remove the .max(...) calls on
PARSEABLE.options.hot_tier_download_chunk_size and
PARSEABLE.options.hot_tier_download_concurrency so the parsed values are
honored, and if you require hard minimums enforce/validate them at parse time in
the CLI parsing/validation logic that handles the P_HOT_TIER_DOWNLOAD_* options
(rather than silently overriding here).
- Around line 404-416: The parallel download path in _parallel_download_inner()
uses std::os::unix::fs::FileExt and std_file.write_all_at unconditionally (see
the tokio::spawn block that calls client.get_range, then spawn_blocking using
FileExt), which breaks Windows builds; either gate that entire offset-write
branch with #[cfg(unix)] or replace the offset-write logic with a
platform-independent approach (seek to offset on a duplicated/mutex-protected
std::fs::File or use a cross-platform write_at helper) in the S3, GCS, and Azure
Blob implementations (s3.rs, gcs.rs, azure_blob.rs) so the semaphore +
client.get_range + spawn_blocking flow compiles on non-Unix platforms.

---

Outside diff comments:
In `@src/handlers/http/logstream.rs`:
- Around line 455-474: The code starts background tasks via
hot_tier_manager.spawn_stream_tasks(...) before persisting the updated stream
metadata; reorder operations so you first update
stream_metadata.hot_tier_enabled and hot_tier, call
PARSEABLE.metastore.put_stream_json(&stream_metadata, &stream_name,
&tenant_id).await? to durably persist the change, and only after that call
hot_tier_manager.spawn_stream_tasks(stream_name.clone(),
tenant_id.clone()).await; keep the existing hot_tier_manager.put_hot_tier(...)
call but ensure the metastore write succeeds before spawning tasks to avoid
partially enabled state.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c525c75e-27b3-4de4-a45f-270c9e53c4c5

📥 Commits

Reviewing files that changed from the base of the PR and between 090feb9 and 2b2fe61.

📒 Files selected for processing (10)
  • src/cli.rs
  • src/handlers/http/logstream.rs
  • src/hottier.rs
  • src/metrics/mod.rs
  • src/storage/azure_blob.rs
  • src/storage/gcs.rs
  • src/storage/localfs.rs
  • src/storage/mod.rs
  • src/storage/object_storage.rs
  • src/storage/s3.rs

Comment thread src/cli.rs
Comment on lines +341 to +355
#[arg(
long = "hot-tier-latest-minutes",
env = "P_HOT_TIER_LATEST_MINUTES",
default_value = "10",
help = "Files whose timestamp is within the last N minutes are 'latest'; rest are 'historic'."
)]
pub hot_tier_latest_minutes: i64,

#[arg(
long = "hot-tier-historic-sync-minutes",
env = "P_HOT_TIER_HISTORIC_SYNC_MINUTES",
default_value = "5",
help = "Interval (minutes) at which the historic hot-tier sync runs."
)]
pub hot_tier_historic_sync_minutes: u32,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate the new hot-tier time knobs at parse time.

hot_tier_latest_minutes currently accepts negatives, and hot_tier_historic_sync_minutes accepts 0. With the new perpetual per-stream sync loops, that can turn the latest/historic split invalid or create a tight historic-sync loop. Please bound these in clap instead of relying on downstream code to recover.

Suggested fix
     #[arg(
         long = "hot-tier-latest-minutes",
         env = "P_HOT_TIER_LATEST_MINUTES",
         default_value = "10",
+        value_parser = value_parser!(i64).range(1..),
         help = "Files whose timestamp is within the last N minutes are 'latest'; rest are 'historic'."
     )]
     pub hot_tier_latest_minutes: i64,

     #[arg(
         long = "hot-tier-historic-sync-minutes",
         env = "P_HOT_TIER_HISTORIC_SYNC_MINUTES",
         default_value = "5",
+        value_parser = value_parser!(u32).range(1..),
         help = "Interval (minutes) at which the historic hot-tier sync runs."
     )]
     pub hot_tier_historic_sync_minutes: u32,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cli.rs` around lines 341 - 355, Make these clap args enforce valid ranges
at parse time: for the field hot_tier_latest_minutes (i64) add a value_parser
that requires >= 1 (e.g. value_parser = clap::value_parser!(i64).range(1..)) so
negatives/zero are rejected, and for hot_tier_historic_sync_minutes (u32) add a
value_parser that requires >= 1 (e.g. value_parser =
clap::value_parser!(u32).range(1..)) so zero is rejected; update the #[arg(...)]
attributes for those two fields in the CLI struct accordingly.

Comment thread src/hottier.rs
Comment on lines +109 to +127
async fn get_or_load_state(
&self,
stream: &str,
tenant_id: &Option<String>,
) -> Result<Arc<StreamSyncState>, HotTierError> {
let key: StreamKey = (tenant_id.clone(), stream.to_owned());
if let Some(state) = self.state_cache.read().await.get(&key).cloned() {
return Ok(state);
}
let mut cache = self.state_cache.write().await;
if let Some(state) = cache.get(&key).cloned() {
return Ok(state);
}
let sht = self.reconcile_stream(stream, tenant_id).await?;
let state = Arc::new(StreamSyncState {
sht: AsyncMutex::new(sht),
});
cache.insert(key, state.clone());
Ok(state)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Refresh state_cache when a stream already has running tasks.

StreamSyncState is cached once and then reused on every tick. If a caller updates .hot_tier.json for an existing stream and calls spawn_stream_tasks again, this early return keeps the old size/available_size in memory indefinitely, so the background sync keeps reserving against stale limits. Invalidate or reload the cached state before returning here.

♻️ Minimal fix
-        {
-            let tasks = self.tasks.read().await;
-            if let Some(existing) = tasks.get(&key)
-                && !existing.latest.is_finished()
-                && !existing.historic.is_finished()
-            {
-                return;
-            }
-        }
+        let already_running = {
+            let tasks = self.tasks.read().await;
+            matches!(
+                tasks.get(&key),
+                Some(existing) if !existing.latest.is_finished() && !existing.historic.is_finished()
+            )
+        };
+        if already_running {
+            self.invalidate_state(&stream, &tenant_id).await;
+            return;
+        }

Also applies to: 487-496

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/hottier.rs` around lines 109 - 127, The cached StreamSyncState returned
by get_or_load_state can become stale; change get_or_load_state so that when a
state exists in state_cache you still reconcile and refresh it before returning:
call reconcile_stream(stream, tenant_id).await to obtain the latest sht, create
a new Arc<StreamSyncState> (with AsyncMutex::new(sht)) and replace the existing
entry in state_cache (cache.insert(key, state.clone())) so the returned state
reflects updated .hot_tier.json limits; keep the existing fast-path only if you
add an explicit no-refresh flag, otherwise always refresh the cached entry
before returning from get_or_load_state (references: get_or_load_state,
state_cache, StreamSyncState, reconcile_stream, spawn_stream_tasks).

Comment thread src/hottier.rs
Comment thread src/hottier.rs
Comment thread src/storage/azure_blob.rs Outdated
Comment thread src/storage/s3.rs
Comment thread src/storage/s3.rs
Comment on lines +385 to +389
let chunk = PARSEABLE
.options
.hot_tier_download_chunk_size
.max(8 * 1024 * 1024);
let concurrency = PARSEABLE.options.hot_tier_download_concurrency.max(16);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor the configured chunk size and download concurrency.

These .max(...) calls silently ignore any values below 8 MiB / 16, so P_HOT_TIER_DOWNLOAD_CHUNK_SIZE and P_HOT_TIER_DOWNLOAD_CONCURRENCY are not actually tunable downward. If you want hard minimums, validate them in src/cli.rs; don't override parsed values here. The same clamp exists in GCS.

Suggested fix
-        let chunk = PARSEABLE
-            .options
-            .hot_tier_download_chunk_size
-            .max(8 * 1024 * 1024);
-        let concurrency = PARSEABLE.options.hot_tier_download_concurrency.max(16);
+        let chunk = PARSEABLE.options.hot_tier_download_chunk_size;
+        let concurrency = PARSEABLE.options.hot_tier_download_concurrency;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let chunk = PARSEABLE
.options
.hot_tier_download_chunk_size
.max(8 * 1024 * 1024);
let concurrency = PARSEABLE.options.hot_tier_download_concurrency.max(16);
let chunk = PARSEABLE.options.hot_tier_download_chunk_size;
let concurrency = PARSEABLE.options.hot_tier_download_concurrency;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/storage/s3.rs` around lines 385 - 389, The code is clamping parsed values
with .max(...) which forces minimums and prevents lowering
P_HOT_TIER_DOWNLOAD_CHUNK_SIZE and P_HOT_TIER_DOWNLOAD_CONCURRENCY; remove the
.max(...) calls on PARSEABLE.options.hot_tier_download_chunk_size and
PARSEABLE.options.hot_tier_download_concurrency so the parsed values are
honored, and if you require hard minimums enforce/validate them at parse time in
the CLI parsing/validation logic that handles the P_HOT_TIER_DOWNLOAD_* options
(rather than silently overriding here).

Comment thread src/storage/s3.rs
Comment on lines +404 to +416
handles.push(tokio::spawn(async move {
let _permit = semaphore
.acquire_owned()
.await
.map_err(|e| ObjectStorageError::Custom(format!("semaphore closed: {e}")))?;
let bytes = client.get_range(&src, r.clone()).await?;
let offset = r.start;
tokio::task::spawn_blocking(move || -> std::io::Result<()> {
use std::os::unix::fs::FileExt;
std_file.write_all_at(&bytes, offset)
})
.await
.map_err(|e| ObjectStorageError::Custom(format!("join: {e}")))??;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n -C2 'std::os::unix::fs::FileExt|write_all_at\(' src/storage

Repository: parseablehq/parseable

Length of output: 1322


🏁 Script executed:

#!/bin/bash
# Check context around the Unix-specific code in s3.rs to see if it's behind a cfg gate
echo "=== Context around s3.rs code ===" 
sed -n '400,425p' src/storage/s3.rs

echo -e "\n=== Context around gcs.rs code ===" 
sed -n '245,260p' src/storage/gcs.rs

echo -e "\n=== Context around azure_blob.rs code ===" 
sed -n '280,295p' src/storage/azure_blob.rs

Repository: parseablehq/parseable

Length of output: 2739


🏁 Script executed:

#!/bin/bash
# Check if there are cfg(unix) guards at function level or module level
echo "=== Checking for cfg gates around the functions ===" 
rg -B20 'std::os::unix::fs::FileExt' src/storage/s3.rs | grep -E '(cfg\(|fn |async fn )' | tail -10

Repository: parseablehq/parseable

Length of output: 117


🏁 Script executed:

#!/bin/bash
# Check if the crate targets non-Unix platforms by looking at Cargo.toml
echo "=== Checking Cargo.toml for platform targets ===" 
cat Cargo.toml | head -50

Repository: parseablehq/parseable

Length of output: 1368


🏁 Script executed:

#!/bin/bash
# Find the function containing the Unix-specific code
echo "=== Finding function signatures for s3.rs ===" 
sed -n '380,420p' src/storage/s3.rs | grep -E '(fn |async fn |#\[cfg)'

echo -e "\n=== Finding function signatures for gcs.rs ===" 
sed -n '225,260p' src/storage/gcs.rs | grep -E '(fn |async fn |#\[cfg)'

echo -e "\n=== Finding function signatures for azure_blob.rs ===" 
sed -n '260,295p' src/storage/azure_blob.rs | grep -E '(fn |async fn |#\[cfg)'

Repository: parseablehq/parseable

Length of output: 214


🏁 Script executed:

#!/bin/bash
# Look further back to find the function definition
echo "=== Full function context for s3.rs (lines 360-420) ===" 
sed -n '360,420p' src/storage/s3.rs | head -50

Repository: parseablehq/parseable

Length of output: 2073


🏁 Script executed:

#!/bin/bash
# Check if there are any cfg guards at module or file level
echo "=== Checking module-level cfg in s3.rs ===" 
head -50 src/storage/s3.rs

echo -e "\n=== Checking if this is in a feature-gated module ===" 
rg '#\[cfg' src/storage/ | grep -E '(unix|windows|target)'

Repository: parseablehq/parseable

Length of output: 1900


🏁 Script executed:

#!/bin/bash
# Check if S3Provider or these storage modules are feature-gated
echo "=== Checking for feature gates around storage modules ===" 
rg '#\[cfg|#\!\[cfg' src/storage/s3.rs src/storage/gcs.rs src/storage/azure_blob.rs

echo -e "\n=== Checking Cargo.toml for conditional features ===" 
grep -A20 '\[features\]' Cargo.toml

Repository: parseablehq/parseable

Length of output: 518


🏁 Script executed:

#!/bin/bash
# Check calls to _parallel_download_inner to see if there are guards
echo "=== Finding callers of _parallel_download_inner ===" 
rg '_parallel_download_inner' src/storage/

Repository: parseablehq/parseable

Length of output: 576


🏁 Script executed:

#!/bin/bash
# Verify this is truly unconditional by checking if the code compiles for non-Unix
echo "=== Checking if code is designed for cross-platform ===" 
rg 'target_os|target_family' src/storage/ | head -20

Repository: parseablehq/parseable

Length of output: 121


This offset-write path is Unix-only and affects three storage providers.

std::os::unix::fs::FileExt is used unconditionally in _parallel_download_inner() in S3, GCS, and Azure Blob storage, causing compilation failure on Windows. Add #[cfg(unix)] guards around the offset-write path or switch to a platform-independent API (e.g., seeking + writing).

Affected locations:
  • src/storage/s3.rs, lines 412-413
  • src/storage/gcs.rs, lines 251-252
  • src/storage/azure_blob.rs, lines 285-286
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/storage/s3.rs` around lines 404 - 416, The parallel download path in
_parallel_download_inner() uses std::os::unix::fs::FileExt and
std_file.write_all_at unconditionally (see the tokio::spawn block that calls
client.get_range, then spawn_blocking using FileExt), which breaks Windows
builds; either gate that entire offset-write branch with #[cfg(unix)] or replace
the offset-write logic with a platform-independent approach (seek to offset on a
duplicated/mutex-protected std::fs::File or use a cross-platform write_at
helper) in the S3, GCS, and Azure Blob implementations (s3.rs, gcs.rs,
azure_blob.rs) so the semaphore + client.get_range + spawn_blocking flow
compiles on non-Unix platforms.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/storage/s3.rs (1)

401-426: 💤 Low value

All chunk tasks are spawned upfront; consider bounding task creation.

Unlike the Azure implementation which uses buffer_unordered(concurrency), this S3 implementation spawns all ranges.len() tasks immediately. For large files (e.g., 1GB with 8MB chunks = 128 tasks), this creates many idle tasks in the scheduler. The semaphore limits concurrent I/O but not task handle allocation.

Consider using futures::stream::iter(ranges).map(...).buffer_unordered(concurrency) as done in azure_blob.rs for consistent behavior and lower memory overhead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/storage/s3.rs` around lines 401 - 426, The current loop in the S3
implementation spawns one tokio task per range (creating many idle tasks) which
wastes memory; change the logic to process ranges via a bounded async stream
like futures::stream::iter(ranges).map(...).buffer_unordered(concurrency) (as in
azure_blob.rs) so only up to `concurrency` fetch/write futures are in-flight.
Move the per-range closure that clones `client`, `src`, `std_file`, acquires the
`semaphore`, calls `client.get_range(&src, r.clone()).await`, and writes with
`std_file.write_all_at(&bytes, offset)` into the stream’s map closure, drive it
to completion with .for_each_concurrent or
.buffer_unordered(concurrency).collect/for_each, and remove the Vec of `handles`
and separate join loop.
src/storage/azure_blob.rs (1)

258-262: 💤 Low value

Inconsistent concurrency defaults between storage backends.

Azure uses .max(6) for concurrency while S3 uses .max(16). This inconsistency may confuse users and lead to different performance characteristics. Consider aligning the defaults or documenting the rationale for different values.

Additionally, the .max(...) clamping silently overrides user configuration (same issue as S3).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/storage/azure_blob.rs` around lines 258 - 262, The Azure code uses
PARSEABLE.options.hot_tier_download_concurrency.max(6) which is inconsistent
with S3's .max(16) and also silently overrides user settings; update this by
aligning the concurrency ceiling with S3 (use the same max value or a shared
constant, e.g., HOT_TIER_DOWNLOAD_CONCURRENCY_MAX = 16) and replace the inline
.max(...) usage with explicit validation/clamping at config parse time (read
PARSEABLE.options.hot_tier_download_concurrency, clamp with
.min(HOT_TIER_DOWNLOAD_CONCURRENCY_MAX) or validate and log/warn/error if out of
range) so users are not silently overridden; apply the same pattern for
hot_tier_download_chunk_size if needed to avoid silent overrides.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/storage/azure_blob.rs`:
- Around line 226-228: The current success arm performs
tokio::fs::rename(&partial, &write_path).await? but doesn't remove the partial
file if the rename fails; update the Ok(()) match arm so the rename is done with
explicit error handling: call tokio::fs::rename(&partial, &write_path).await,
and on Err(e) attempt to remove the partial file via
tokio::fs::remove_file(&partial).await (ignore/remove any error from that
cleanup) and then return the original rename error; use the same identifiers
partial, write_path, tokio::fs::rename and tokio::fs::remove_file to locate and
implement the change.

---

Nitpick comments:
In `@src/storage/azure_blob.rs`:
- Around line 258-262: The Azure code uses
PARSEABLE.options.hot_tier_download_concurrency.max(6) which is inconsistent
with S3's .max(16) and also silently overrides user settings; update this by
aligning the concurrency ceiling with S3 (use the same max value or a shared
constant, e.g., HOT_TIER_DOWNLOAD_CONCURRENCY_MAX = 16) and replace the inline
.max(...) usage with explicit validation/clamping at config parse time (read
PARSEABLE.options.hot_tier_download_concurrency, clamp with
.min(HOT_TIER_DOWNLOAD_CONCURRENCY_MAX) or validate and log/warn/error if out of
range) so users are not silently overridden; apply the same pattern for
hot_tier_download_chunk_size if needed to avoid silent overrides.

In `@src/storage/s3.rs`:
- Around line 401-426: The current loop in the S3 implementation spawns one
tokio task per range (creating many idle tasks) which wastes memory; change the
logic to process ranges via a bounded async stream like
futures::stream::iter(ranges).map(...).buffer_unordered(concurrency) (as in
azure_blob.rs) so only up to `concurrency` fetch/write futures are in-flight.
Move the per-range closure that clones `client`, `src`, `std_file`, acquires the
`semaphore`, calls `client.get_range(&src, r.clone()).await`, and writes with
`std_file.write_all_at(&bytes, offset)` into the stream’s map closure, drive it
to completion with .for_each_concurrent or
.buffer_unordered(concurrency).collect/for_each, and remove the Vec of `handles`
and separate join loop.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f59df567-5986-4f32-bec1-3415e208fcff

📥 Commits

Reviewing files that changed from the base of the PR and between 2b2fe61 and 43e437c.

📒 Files selected for processing (4)
  • src/handlers/http/logstream.rs
  • src/hottier.rs
  • src/storage/azure_blob.rs
  • src/storage/s3.rs
✅ Files skipped from review due to trivial changes (1)
  • src/hottier.rs

Comment thread src/storage/azure_blob.rs
Comment on lines +226 to +228
Ok(()) => {
tokio::fs::rename(&partial, &write_path).await?;
Ok(())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clean up the .partial file if the final rename fails.

Unlike the S3 implementation which was updated to remove the partial file on rename failure, the Azure implementation still lacks this cleanup. A failed rename leaves the temp file behind, and in a retrying hot-tier sync loop, orphaned files can accumulate.

Suggested fix
         match self
             ._parallel_download_inner(path, tenant_id, partial.clone())
             .await
         {
             Ok(()) => {
-                tokio::fs::rename(&partial, &write_path).await?;
+                if let Err(e) = tokio::fs::rename(&partial, &write_path).await {
+                    let _ = tokio::fs::remove_file(&partial).await;
+                    return Err(e.into());
+                }
                 Ok(())
             }
             Err(e) => {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Ok(()) => {
tokio::fs::rename(&partial, &write_path).await?;
Ok(())
Ok(()) => {
if let Err(e) = tokio::fs::rename(&partial, &write_path).await {
let _ = tokio::fs::remove_file(&partial).await;
return Err(e.into());
}
Ok(())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/storage/azure_blob.rs` around lines 226 - 228, The current success arm
performs tokio::fs::rename(&partial, &write_path).await? but doesn't remove the
partial file if the rename fails; update the Ok(()) match arm so the rename is
done with explicit error handling: call tokio::fs::rename(&partial,
&write_path).await, and on Err(e) attempt to remove the partial file via
tokio::fs::remove_file(&partial).await (ignore/remove any error from that
cleanup) and then return the original rename error; use the same identifiers
partial, write_path, tokio::fs::rename and tokio::fs::remove_file to locate and
implement the change.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/hottier.rs`:
- Around line 860-870: The code currently mutates the cached StreamHotTier (sht)
before persisting, causing in-memory/state divergence if self.put_hot_tier
fails; instead create an updated StreamHotTier copy (clone or new struct)
reflecting the deducted/refunded/committed bytes, call self.put_hot_tier(stream,
&mut updated_sht, tenant_id).await?, and only after that succeeds swap or assign
the updated_sht back into sht; apply the same pattern for the other two spots
mentioned (the blocks around lines 909-913 and 935-937) so all mutations are
persisted before updating the in-memory cache.
- Around line 216-249: The reconciliation currently only reads the immediate
date_dir entries in drop_partials, so it misses nested hour=/minute=
subdirectories and uses file_name() which collides; change drop_partials to walk
the date_dir recursively (e.g., async recursive traversal) so you inspect files
in all subdirs, detect and remove any file ending with ".partial" anywhere under
date_dir, skip ".manifest.json" files, and insert the file's path relative to
date_dir (not file_name()) into on_disk (use a relative path string like
relative_path.to_string_lossy().into_owned()); make the same
recursive/relative-path change in the other similar block around lines 277-280
that builds the on_disk set.
- Around line 263-266: The code currently uses
serde_json::from_slice(&bytes).unwrap_or_default(), which hides JSON parse
errors and can lead to destructive cleanup; change this to propagate the
deserialization error instead (e.g. let manifest: Manifest =
serde_json::from_slice(&bytes)? or serde_json::from_slice(&bytes).map_err(|e| /*
add context */ e)?), removing unwrap_or_default so a corrupt manifest returns an
Err from the surrounding function (or, alternatively, detect the Err and
explicitly skip destructive cleanup for that date); refer to manifest_path,
Manifest, bytes, fs::read, and serde_json::from_slice in your change.
- Around line 1220-1226: After successfully writing the manifest and deleting
the minute (the block calling fs::write(...),
fs::remove_dir_all(minute_to_delete) and delete_empty_directory_hot_tier(...)),
set delete_successful = true so the function returns success and later calls
put_hot_tier; also change the eviction condition using
stream_hot_tier.available_size from <= parquet_file_size to < parquet_file_size
so exact-fit (available_size == parquet_file_size) stops evicting instead of
continuing—i.e., after the delete sequence, if stream_hot_tier.available_size <
parquet_file_size continue; else set delete_successful = true and break
'loop_dates.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: fcfd13c9-8755-4c28-9615-7dc10ccb77bf

📥 Commits

Reviewing files that changed from the base of the PR and between 43e437c and e00e8a3.

📒 Files selected for processing (1)
  • src/hottier.rs

Comment thread src/hottier.rs
Comment on lines +216 to +249
async fn drop_partials(
&self,
on_disk: &mut HashSet<String>,
date_dir: &PathBuf,
partials_removed: &mut usize,
stream: &str,
tenant_id: &Option<String>,
) -> Result<(), HotTierError> {
let mut entries = fs::read_dir(date_dir).await?;
while let Some(entry) = entries.next_entry().await? {
let p = entry.path();
let Some(name_os) = p.file_name() else {
continue;
};
let name = name_os.to_string_lossy();
if name.ends_with(".partial") {
let _ = fs::remove_file(&p).await;
*partials_removed += 1;
info!(
stream = %stream,
tenant = ?tenant_id,
path = %p.display(),
"reconcile: deleted partial orphan"
);
continue;
}
if name.ends_with(".manifest.json") {
continue;
}
if !p.is_file() {
continue;
}
on_disk.insert(name.into_owned());
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Reconcile misses the real on-disk payload tree.

fs::read_dir(date_dir) only inspects the date=... directory, but the parquet and .partial files live below hour=/minute=. In the normal layout, on_disk stays empty, stale partials survive, and pass 3 never sees orphan parquet files. Please walk the date subtree recursively here, and compare tracked files by relative path rather than just file_name() to avoid collisions across minute directories.

Also applies to: 277-280

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/hottier.rs` around lines 216 - 249, The reconciliation currently only
reads the immediate date_dir entries in drop_partials, so it misses nested
hour=/minute= subdirectories and uses file_name() which collides; change
drop_partials to walk the date_dir recursively (e.g., async recursive traversal)
so you inspect files in all subdirs, detect and remove any file ending with
".partial" anywhere under date_dir, skip ".manifest.json" files, and insert the
file's path relative to date_dir (not file_name()) into on_disk (use a relative
path string like relative_path.to_string_lossy().into_owned()); make the same
recursive/relative-path change in the other similar block around lines 277-280
that builds the on_disk set.

Comment thread src/hottier.rs
Comment thread src/hottier.rs Outdated
Comment thread src/hottier.rs
Comment on lines +1220 to +1226
fs::write(manifest_file.path(), serde_json::to_vec(&manifest)?).await?;
fs::remove_dir_all(minute_to_delete).await?;
delete_empty_directory_hot_tier(minute_to_delete.to_path_buf()).await?;
if stream_hot_tier.available_size <= parquet_file_size {
continue;
} else {
fs::write(manifest_file.path(), serde_json::to_vec(&manifest)?).await?;
break 'loop_dates;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Eviction never reports success, and exact-fit still over-evicts.

delete_successful is never set to true, so a successful eviction still returns false and the caller skips the download. Because that early return bypasses the later put_hot_tier, the freed-space accounting also never gets persisted. This branch should mark success after deleting a minute and stop when available_size == parquet_file_size instead of continuing to evict.

♻️ Minimal fix
                     fs::write(manifest_file.path(), serde_json::to_vec(&manifest)?).await?;
                     fs::remove_dir_all(minute_to_delete).await?;
                     delete_empty_directory_hot_tier(minute_to_delete.to_path_buf()).await?;
-                    if stream_hot_tier.available_size <= parquet_file_size {
+                    delete_successful = true;
+                    if stream_hot_tier.available_size < parquet_file_size {
                         continue;
                     } else {
                         break 'loop_dates;
                     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
fs::write(manifest_file.path(), serde_json::to_vec(&manifest)?).await?;
fs::remove_dir_all(minute_to_delete).await?;
delete_empty_directory_hot_tier(minute_to_delete.to_path_buf()).await?;
if stream_hot_tier.available_size <= parquet_file_size {
continue;
} else {
fs::write(manifest_file.path(), serde_json::to_vec(&manifest)?).await?;
break 'loop_dates;
fs::write(manifest_file.path(), serde_json::to_vec(&manifest)?).await?;
fs::remove_dir_all(minute_to_delete).await?;
delete_empty_directory_hot_tier(minute_to_delete.to_path_buf()).await?;
delete_successful = true;
if stream_hot_tier.available_size < parquet_file_size {
continue;
} else {
break 'loop_dates;
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/hottier.rs` around lines 1220 - 1226, After successfully writing the
manifest and deleting the minute (the block calling fs::write(...),
fs::remove_dir_all(minute_to_delete) and delete_empty_directory_hot_tier(...)),
set delete_successful = true so the function returns success and later calls
put_hot_tier; also change the eviction condition using
stream_hot_tier.available_size from <= parquet_file_size to < parquet_file_size
so exact-fit (available_size == parquet_file_size) stops evicting instead of
continuing—i.e., after the delete sequence, if stream_hot_tier.available_size <
parquet_file_size continue; else set delete_successful = true and break
'loop_dates.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/hottier.rs (1)

993-1000: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

unwrap() on date parsing can panic on malformed directory names.

If a directory exists under the stream path that doesn't match the date=YYYY-MM-DD format (e.g., created manually or by another process), this unwrap() will panic and crash the hot-tier sync loop for all streams.

🐛 Proposed fix: skip malformed entries
             let date = NaiveDate::parse_from_str(
                 date.file_name()
                     .to_string_lossy()
                     .trim_start_matches("date="),
                 "%Y-%m-%d",
-            )
-            .unwrap();
-            date_list.push(date);
+            );
+            match date {
+                Ok(d) => date_list.push(d),
+                Err(_) => {
+                    tracing::warn!(
+                        path = %date.path().display(),
+                        "skipping directory with invalid date format"
+                    );
+                    continue;
+                }
+            }
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/hottier.rs` around lines 993 - 1000, The current use of
NaiveDate::parse_from_str(...).unwrap() can panic for malformed directory names;
replace the unwrap with error-handling to skip entries that fail to parse (e.g.,
use match or if let Ok(parsed) = NaiveDate::parse_from_str(...) and only call
date_list.push(parsed) on success), and optionally log a debug/warn including
the original file_name() when parsing fails; update the code around
NaiveDate::parse_from_str, the date variable, and date_list.push to implement
this non-panicking behavior.
♻️ Duplicate comments (2)
src/hottier.rs (2)

1247-1251: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Exact-fit still triggers unnecessary eviction iteration.

The condition stream_hot_tier.available_size <= parquet_file_size at line 1247 causes one extra eviction loop when space exactly matches. The past review flagged this (claimed addressed in commit 5ee0454), but it's still <=. Change to < so exact-fit stops evicting:

🐛 Proposed fix
-                    if stream_hot_tier.available_size <= parquet_file_size {
+                    if stream_hot_tier.available_size < parquet_file_size {
                         continue;
                     } else {
                         break 'loop_dates;
                     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/hottier.rs` around lines 1247 - 1251, The eviction loop uses the
condition stream_hot_tier.available_size <= parquet_file_size which causes one
extra eviction when available_size exactly equals parquet_file_size; update that
comparison to stream_hot_tier.available_size < parquet_file_size so an exact fit
will break out of 'loop_dates instead of performing another eviction. Locate the
check inside the eviction logic where parquet_file_size is compared to
stream_hot_tier.available_size and replace the <= with <, keeping the
surrounding control flow (the continue and break 'loop_dates) unchanged.

263-268: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Corrupt manifest still treated as empty despite claimed fix.

The past review indicated this was addressed, but unwrap_or_default() is still present. A JSON parse failure will produce an empty Manifest, and the subsequent orphan-cleanup pass (lines 187-197) can delete every parquet file under that date. Propagate the error instead:

🐛 Proposed fix
         let mut manifest: Manifest = if manifest_path.exists() {
             let bytes = fs::read(&manifest_path).await?;
-            serde_json::from_slice(&bytes).unwrap_or_default()
+            serde_json::from_slice(&bytes)?
         } else {
             Manifest::default()
         };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/hottier.rs` around lines 263 - 268, The manifest JSON parse is being
swallowed by unwrap_or_default() which can convert a corrupt manifest into an
empty Manifest and cause catastrophic orphan cleanup; change the read/parse
logic for manifest_path so that serde_json::from_slice(&bytes) returns its error
instead of defaulting—i.e. replace the unwrap_or_default() on the
serde_json::from_slice call with proper error propagation (use the ? operator or
map_err to attach context) so that failures reading/parsing the manifest (in the
block that assigns to manifest) surface to the caller.
🧹 Nitpick comments (1)
src/hottier.rs (1)

720-723: 💤 Low value

.max(4) enforces a minimum, not a maximum concurrency.

The option hot_tier_files_per_stream_concurrency suggests the user can limit concurrency, but .max(4) means the actual concurrency is max(user_value, 4). If an operator sets this to 2 to reduce load, they still get 4 concurrent downloads. This is likely inverted logic—consider .min(MAX_CAP).max(1) or just trust the configured value.

♻️ Suggested fix
         let concurrency = PARSEABLE
             .options
             .hot_tier_files_per_stream_concurrency
-            .max(4);
+            .max(1); // Ensure at least 1; trust the configured upper limit
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/hottier.rs` around lines 720 - 723, The current expression sets
concurrency = PARSEABLE.options.hot_tier_files_per_stream_concurrency.max(4)
which enforces a minimum of 4 rather than an upper bound; change this logic so
the configured value is respected and bounded above (and optionally below)
instead — replace the .max(4) usage with bounding logic such as .min(4).max(1)
(or .min(MAX_CAP).max(1) if you have a named cap) so
hot_tier_files_per_stream_concurrency provides the intended limit, and ensure
the variable name concurrency remains assigned from
PARSEABLE.options.hot_tier_files_per_stream_concurrency after applying the
correct bounds.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/hottier.rs`:
- Around line 993-1000: The current use of
NaiveDate::parse_from_str(...).unwrap() can panic for malformed directory names;
replace the unwrap with error-handling to skip entries that fail to parse (e.g.,
use match or if let Ok(parsed) = NaiveDate::parse_from_str(...) and only call
date_list.push(parsed) on success), and optionally log a debug/warn including
the original file_name() when parsing fails; update the code around
NaiveDate::parse_from_str, the date variable, and date_list.push to implement
this non-panicking behavior.

---

Duplicate comments:
In `@src/hottier.rs`:
- Around line 1247-1251: The eviction loop uses the condition
stream_hot_tier.available_size <= parquet_file_size which causes one extra
eviction when available_size exactly equals parquet_file_size; update that
comparison to stream_hot_tier.available_size < parquet_file_size so an exact fit
will break out of 'loop_dates instead of performing another eviction. Locate the
check inside the eviction logic where parquet_file_size is compared to
stream_hot_tier.available_size and replace the <= with <, keeping the
surrounding control flow (the continue and break 'loop_dates) unchanged.
- Around line 263-268: The manifest JSON parse is being swallowed by
unwrap_or_default() which can convert a corrupt manifest into an empty Manifest
and cause catastrophic orphan cleanup; change the read/parse logic for
manifest_path so that serde_json::from_slice(&bytes) returns its error instead
of defaulting—i.e. replace the unwrap_or_default() on the serde_json::from_slice
call with proper error propagation (use the ? operator or map_err to attach
context) so that failures reading/parsing the manifest (in the block that
assigns to manifest) surface to the caller.

---

Nitpick comments:
In `@src/hottier.rs`:
- Around line 720-723: The current expression sets concurrency =
PARSEABLE.options.hot_tier_files_per_stream_concurrency.max(4) which enforces a
minimum of 4 rather than an upper bound; change this logic so the configured
value is respected and bounded above (and optionally below) instead — replace
the .max(4) usage with bounding logic such as .min(4).max(1) (or
.min(MAX_CAP).max(1) if you have a named cap) so
hot_tier_files_per_stream_concurrency provides the intended limit, and ensure
the variable name concurrency remains assigned from
PARSEABLE.options.hot_tier_files_per_stream_concurrency after applying the
correct bounds.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: b7f512ab-7b6d-40e2-8822-61826d3cbc26

📥 Commits

Reviewing files that changed from the base of the PR and between e00e8a3 and b185b38.

📒 Files selected for processing (1)
  • src/hottier.rs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant