Conversation
Create new platform-challenge-registry crate with:
- Challenge discovery and registration
- Version management (semver-based)
- Lifecycle state machine (registered/starting/running/stopping/stopped)
- Health monitoring with configurable checks
- State persistence and hot-reload support
- Migration planning for version upgrades

Modules:
- registry: Main registry with CRUD operations
- lifecycle: State machine for challenge states
- health: Health monitoring and status tracking
- state: State snapshots for hot-reload
- discovery: Challenge discovery from various sources
- migration: Version migration planning
- version: Semantic versioning support
- error: Registry-specific error types
- Add ShutdownHandler struct for checkpoint management
- Create periodic checkpoints every 5 minutes
- Save final checkpoint on graceful shutdown (Ctrl+C)
- Persist evaluation state for hot-reload recovery

This enables validators to update without losing evaluation progress.
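The checkpoint cadence described above can be reduced to a std-only sketch. The real `ShutdownHandler` presumably wires up tokio and a Ctrl+C handler; here the interval, tick count, shutdown flag, and the body of `create_checkpoint` are all illustrative assumptions, not the crate's actual implementation:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;

/// Hypothetical sketch of the periodic-checkpoint loop; field names and the
/// persistence step are stand-ins for the real ShutdownHandler.
pub struct ShutdownHandler {
    shutdown: Arc<AtomicBool>,
    interval: Duration,
    pub checkpoints_written: u32,
}

impl ShutdownHandler {
    pub fn new(interval: Duration) -> Self {
        Self {
            shutdown: Arc::new(AtomicBool::new(false)),
            interval,
            checkpoints_written: 0,
        }
    }

    /// Stand-in for serializing evaluation state to disk.
    fn create_checkpoint(&mut self) {
        self.checkpoints_written += 1;
    }

    /// Tick until the shutdown flag is set (bounded by `ticks` so the sketch
    /// terminates), checkpointing each tick and once more on the way out --
    /// the "final checkpoint on graceful shutdown".
    pub fn run(&mut self, ticks: u32) {
        for _ in 0..ticks {
            if self.shutdown.load(Ordering::SeqCst) {
                break;
            }
            std::thread::sleep(self.interval);
            self.create_checkpoint(); // periodic checkpoint
        }
        self.create_checkpoint(); // final checkpoint on shutdown
    }
}

fn main() {
    // The 5-minute interval is shrunk to 1 ms so the sketch runs instantly.
    let mut handler = ShutdownHandler::new(Duration::from_millis(1));
    handler.run(3);
    assert_eq!(handler.checkpoints_written, 4); // 3 periodic + 1 final
    println!("checkpoints written: {}", handler.checkpoints_written);
}
```

In the real validator the loop would be driven by a timer task and the flag by a signal handler; the invariant the sketch demonstrates is only that shutdown always produces one last checkpoint.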
📝 Walkthrough

Introduces a comprehensive challenge registry system with state persistence, checkpoint/restoration capabilities, and health monitoring infrastructure.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Validator as Validator Node
    participant Registry as Challenge Registry
    participant Checkpoint as Checkpoint Manager
    participant Disk as Disk Storage
    participant Restorable as Restoration Manager
    Validator->>Registry: Initialize challenge registry
    Registry-->>Validator: Ready
    loop During Runtime
        Validator->>Registry: Update challenge state<br/>(evaluations, health)
        Registry-->>Validator: Acknowledged
        Validator->>Checkpoint: Periodic create_checkpoint()
        Checkpoint->>Checkpoint: Serialize state + compute hash
        Checkpoint->>Disk: Atomic write (temp → final)
        Disk-->>Checkpoint: Checkpoint persisted
    end
    Note over Validator: Ctrl+C Signal
    Validator->>Checkpoint: Final create_checkpoint()
    Checkpoint->>Disk: Write final state
    Validator-->>Validator: Shutdown
    Note over Restorable: Later: Recovery
    Restorable->>Checkpoint: load_latest()
    Checkpoint->>Disk: Read checkpoint + verify hash
    Disk-->>Checkpoint: CheckpointData
    Checkpoint-->>Restorable: (CheckpointHeader, CheckpointData)
    Restorable->>Restorable: Validate & filter state
    Restorable-->>Validator: Restored state ready
```
```mermaid
sequenceDiagram
    participant Client as Client
    participant Registry as Challenge Registry
    participant Discovery as Discovery Engine
    participant Health as Health Monitor
    participant Lifecycle as Lifecycle Manager
    Client->>Discovery: discover_from_local(path)
    Discovery->>Discovery: Scan filesystem<br/>Check challenge.toml
    Discovery-->>Client: Vec<DiscoveredChallenge>
    Client->>Registry: register(ChallengeEntry)
    Registry->>Lifecycle: Track new lifecycle
    Registry->>Health: Initialize health state
    Registry-->>Client: ChallengeId
    loop Health Checks
        Health->>Health: Check endpoint<br/>Record metrics
        Health-->>Registry: Update HealthStatus
    end
    Client->>Registry: update_version(id, new_version)
    Registry->>Registry: Validate compatibility
    Registry-->>Client: Old version
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 9
🤖 Fix all issues with AI agents
In `@challenges/README.md`:
- Around line 7-12: The fenced code block in README.md that displays the
directory tree (the block starting with the triple backticks followed by the
tree: "challenges/ ├── README.md ...") is missing a language tag; update that
fenced block to include the language tag "text" (i.e., change the opening ``` to
```text) so Markdownlint MD040 is satisfied and the tree is treated as plain
text.
In `@crates/challenge-registry/src/health.rs`:
- Around line 91-100: The method record_failure currently uses a hardcoded
threshold (>= 3) to flip status to Unhealthy; update it to use the configured
threshold instead by reading HealthConfig.failure_threshold (or accept a
threshold parameter) so changes to HealthConfig take effect — either change
record_failure on the Health struct to accept a threshold argument and use that
value, or have HealthMonitor::record_failure call into the existing
record_failure and pass self.config.failure_threshold; make sure to replace the
literal 3 with the referenced failure_threshold and keep updating
consecutive_failures, last_check_at, and HealthStatus logic unchanged.
In `@crates/challenge-registry/src/migration.rs`:
- Around line 345-357: The current finalize_migration removes the plan from
active_plans before checking its status, which can drop non-terminal plans and
prevents failed plans from being archived; update finalize_migration to first
look up (read/clone or inspect) the MigrationPlan in self.active_plans without
removing it, verify that its status is a terminal state (allow Completed,
RolledBack, and Failed as terminal), and only then call remove(challenge_id) to
take it out and return the plan; reference the finalize_migration method,
active_plans map, MigrationPlan status checks (e.g., is_complete or status enum
variants) and ensure compatibility with fail_migration which sets Failed so
failed plans become removable/archivable.
In `@crates/challenge-registry/src/registry.rs`:
- Around line 177-202: The update_state function allows arbitrary lifecycle
changes but must validate transitions using
ChallengeLifecycle::is_valid_transition; modify update_state (in registry.rs) to
check ChallengeLifecycle::is_valid_transition(&old_state, &new_state) before
applying changes and if invalid return a new
RegistryError::InvalidStateTransition containing old_state and new_state (add
that variant to RegistryError), only update
registered.entry.lifecycle_state/updated_at and emit
LifecycleEvent::StateChanged when the transition is valid; ensure debug logging
remains and that the function returns Err(...) for invalid transitions instead
of proceeding.
In `@crates/challenge-registry/src/version.rs`:
- Around line 69-79: The current Ord impl for ChallengeVersion (fn cmp) ignores
the prerelease field; update cmp in the impl for ChallengeVersion to enforce
semver precedence: after comparing major, minor, patch, if equal compare
prerelease such that a missing prerelease (release) has higher precedence than
any prerelease (i.e., None > Some), and when both have prerelease strings
compare their dot-separated identifiers in order using semver rules (numeric
identifiers compare numerically and have lower precedence than non-numeric
identifiers; non-numeric compare lexicographically; if all equal the shorter
identifier list has lower precedence). Use the existing fields
ChallengeVersion.prerelease and the cmp function to implement this logic in the
Ord::cmp for ChallengeVersion.
In `@crates/core/src/checkpoint.rs`:
- Around line 231-235: Do not increment self.current_sequence before persisting;
instead compute let next_sequence = self.current_sequence + 1, use
checkpoint_filename(next_sequence) to write and rename the temp file, and only
after the atomic rename succeeds assign self.current_sequence = next_sequence so
failures don't advance the in-memory sequence. Apply the same pattern to the
other similar routine (the block at 279-283) so current_sequence is updated only
on successful persistence; keep all filename construction and rename logic using
checkpoint_filename(next_sequence) and leave load_latest behavior unchanged.
- Around line 321-357: The code reads untrusted sizes into header_len and
header.data_size and allocates vectors directly, which can OOM on corrupt files;
before allocating check bounds: define reasonable maxs (e.g. MAX_HEADER_SIZE and
MAX_DATA_SIZE), validate header_len is <= MAX_HEADER_SIZE and less than the
remaining reader/file length, then allocate header_bytes; after deserializing
CheckpointHeader, validate header.data_size <= MAX_DATA_SIZE and that
header.data_size does not exceed remaining reader/file length before allocating
data_bytes; on violations return a MiniChainError::Storage with a clear message.
Use the existing symbols header_len, CheckpointHeader, header.data_size, reader,
and CHECKPOINT_VERSION to locate insertion points.
In `@crates/rpc-server/src/health.rs`:
- Around line 196-221: get_overall_status currently ignores the challenges field
on ComponentStatus even though set_component_status and ComponentStatus track
it; update get_overall_status to include components.challenges in the unhealthy
and degraded checks (treat challenges the same way as p2p/storage/consensus),
and ensure the final special-case check (the bittensor check) still runs as
intended — modify the get_overall_status function to reference
components.challenges alongside components.p2p, components.storage, and
components.consensus when computing overall HealthStatus.
In `@docs/challenge-integration.md`:
- Around line 42-51: The Markdown code fence showing the project tree lacks a
language tag; update the block in docs/challenge-integration.md by changing the
opening ``` to ```text so the project-structure block (the my-challenge/ tree
with Cargo.toml, src/lib.rs, evaluation.rs, scoring.rs, Dockerfile, README.md)
is fenced as ```text which will silence MD040 warnings and improve
rendering/readability.
🧹 Nitpick comments (13)
challenges/.gitkeep (1)
1-1: Consider making the .gitkeep file truly empty or add a comment. By convention, `.gitkeep` files are typically either completely empty (0 bytes) or contain a brief comment explaining their purpose. The current single newline works but is slightly unconventional.

📝 Alternative approaches

Option 1: Remove the empty line entirely to make it a true 0-byte file.
Option 2: Add a descriptive comment:

```diff
-
+# Placeholder directory for challenge crate modules
```

crates/challenge-registry/src/error.rs (1)
45-49: Consider preserving more context for I/O errors. Mapping all `std::io::Error` to `Internal` loses context about whether the error originated from state persistence, file operations, or network I/O. This could make debugging harder.

💡 Optional improvement

Consider mapping I/O errors contextually where they occur, or including the error kind:

```diff
 impl From<std::io::Error> for RegistryError {
     fn from(err: std::io::Error) -> Self {
-        RegistryError::Internal(err.to_string())
+        RegistryError::Internal(format!("IO error ({}): {}", err.kind(), err))
     }
 }
```

tests/checkpoint_tests.rs (1)
139-146: Clarify the distinction between data sequence and checkpoint sequence. The test expects `latest.sequence` (from `CheckpointData`) to be 9, while `header.sequence` is 10. This works because the loop index `i` goes 0-9, but it might confuse readers. Consider adding a comment explaining the distinction.

📝 Optional clarification

```diff
 // Latest should be sequence 10
+// Note: header.sequence is the checkpoint file sequence (1-indexed)
+// data.sequence is the application sequence from the loop (0-indexed)
 let (header, latest) = manager
     .load_latest()
     .expect("Failed to load")
     .expect("No checkpoint");
 assert_eq!(latest.sequence, 9);
 assert_eq!(header.sequence, 10);
```

crates/challenge-registry/src/lifecycle.rs (1)
56-78: Auto-restart configuration defined but restart tracking not implemented. The `auto_restart` and `max_restart_attempts` fields are configured, but there's no state or method to track actual restart attempts per challenge. Consider whether restart count tracking should be managed here or delegated to the registry.

crates/challenge-registry/src/health.rs (1)
111-114: `recovery_threshold` is defined but never used. The `HealthConfig.recovery_threshold` field (line 113) is documented as "Number of successes to recover from unhealthy", but `record_success` unconditionally sets status to `Healthy`. Consider implementing recovery logic or removing the unused field.

crates/challenge-registry/src/discovery.rs (2)
158-194: Code duplication between `challenge.toml` and `Cargo.toml` handling. The logic for extracting the challenge name and creating `DiscoveredChallenge` is nearly identical between the two branches. Consider extracting a helper function.

Proposed refactor to reduce duplication

```diff
+fn create_discovered_from_path(&self, path: &PathBuf) -> DiscoveredChallenge {
+    let name = path
+        .file_name()
+        .and_then(|n| n.to_str())
+        .unwrap_or("unknown")
+        .to_string();
+
+    DiscoveredChallenge {
+        name,
+        version: ChallengeVersion::default(),
+        docker_image: None,
+        local_path: Some(path.clone()),
+        health_endpoint: None,
+        evaluation_endpoint: None,
+        metadata: ChallengeMetadata::default(),
+        source: DiscoverySource::LocalFilesystem(path.clone()),
+    }
+}
+
 pub fn discover_from_local(&self, path: &PathBuf) -> RegistryResult<Vec<DiscoveredChallenge>> {
     // ... validation ...
     if path.is_dir() {
         let challenge_toml = path.join("challenge.toml");
         let cargo_toml = path.join("Cargo.toml");
         if challenge_toml.exists() || cargo_toml.exists() {
-            // ... duplicated code ...
+            challenges.push(self.create_discovered_from_path(path));
         }
     }
     Ok(challenges)
 }
```
121-140: `discover_all` only implements local discovery; Docker and P2P are placeholders. The method iterates only over `local_paths`, while the `docker_registries` and `enable_p2p` config fields are unused. This appears intentional for the initial infrastructure, but consider adding TODO comments or returning a warning when these sources are configured but not yet implemented.

crates/challenge-registry/src/state.rs (1)
206-209: Inefficient snapshot trimming with `remove(0)` in a loop. Using `remove(0)` on a `Vec` in a loop is O(n²) because each removal shifts all subsequent elements. With the default `max_snapshots = 5` this is negligible, but consider using `drain` for cleaner O(n) removal if this could be configured higher.

Proposed fix using drain

```diff
 // Trim old snapshots
-while snapshots.len() > self.max_snapshots {
-    snapshots.remove(0);
-}
+if snapshots.len() > self.max_snapshots {
+    let excess = snapshots.len() - self.max_snapshots;
+    snapshots.drain(0..excess);
+}
```

crates/core/src/restoration.rs (4)
256-275: Stale evaluation filtering is a no-op placeholder. The `skip_stale_evaluations` option is respected, but the actual filter logic always returns `true`, keeping all evaluations. The comment acknowledges this: "For now, keep all pending (they don't have epoch info)". Consider either:

- Implementing the logic if `PendingEvaluationState` has epoch information available
- Removing the `skip_stale_evaluations` option until it's implemented
- Adding a TODO comment that's more visible
286-292: Magic number for epoch validation should be a named constant. The `1_000_000` epoch limit is arbitrary. Consider defining it as a constant with documentation explaining the rationale.

Proposed fix

```diff
+/// Maximum reasonable epoch value for validation.
+/// This is a sanity check to detect corrupted checkpoints.
+const MAX_REASONABLE_EPOCH: u64 = 1_000_000;
+
 fn validate_data(&self, data: &CheckpointData) -> Result<()> {
     // Validate epoch is reasonable
-    if data.epoch > 1_000_000 {
+    if data.epoch > MAX_REASONABLE_EPOCH {
         return Err(MiniChainError::Validation(
             "Checkpoint epoch seems unreasonably high".into(),
         ));
     }
```
336-351: `get_checkpoint_info` loads full checkpoint data for metadata extraction. This method loads the entire checkpoint (including all evaluations) just to extract summary information. If `CheckpointManager` supported header-only loading, this could be more efficient for listing many checkpoints.
372-379: Add documentation explaining expected implementors of the `Restorable` trait. The trait is exported in the public API but has no implementations in the codebase and lacks documentation about intended use. Consider adding a doc comment clarifying whether external crates are expected to implement this trait or if it's reserved for future use within this crate.
crates/challenge-registry/src/registry.rs (1)
274-279: Event listeners executed synchronously under lock. `emit_event` calls all listeners while holding the read lock on `event_listeners`. Long-running or blocking listeners could delay other registry operations. Consider:

- Cloning listeners before iterating (release lock sooner)
- Spawning listener calls asynchronously
- Documenting that listeners should be non-blocking

Option: Clone listeners to release lock faster

```diff
 fn emit_event(&self, event: LifecycleEvent) {
-    for listener in self.event_listeners.read().iter() {
+    let listeners: Vec<_> = self.event_listeners.read().iter().collect();
+    // Lock released here
+    for listener in listeners {
         listener(event.clone());
     }
 }
```

Note: This requires listeners to be `Clone`, which `Box<dyn Fn>` is not. Alternative: store `Arc<dyn Fn>` instead.
```
challenges/
├── README.md # This file
├── example-challenge/ # Example challenge template (future)
└── [your-challenge]/ # Your custom challenge crate
```
Add a language tag to the fenced block.
Markdownlint MD040 flags fenced blocks without a language; use text here.
Proposed fix

````diff
-```
+```text
 challenges/
 ├── README.md # This file
 ├── example-challenge/ # Example challenge template (future)
 └── [your-challenge]/ # Your custom challenge crate
````
<details>
<summary>🤖 Prompt for AI Agents</summary>
In @challenges/README.md around lines 7 - 12, The fenced code block in README.md
that displays the directory tree (the block starting with the triple backticks
followed by the tree: "challenges/ ├── README.md ...") is missing a language
tag; update that fenced block to include the language tag "text" (i.e., change
the opening fence to use "text") so Markdownlint MD040 is satisfied and the tree
is treated as plain text.
</details>
```rust
pub fn record_failure(&mut self, reason: String) {
    self.consecutive_failures += 1;
    self.last_check_at = chrono::Utc::now().timestamp_millis();

    if self.consecutive_failures >= 3 {
        self.status = HealthStatus::Unhealthy(reason);
    } else {
        self.status = HealthStatus::Degraded(reason);
    }
}
```
Hardcoded failure threshold ignores HealthConfig.failure_threshold.
The record_failure method hardcodes >= 3 for marking unhealthy, but HealthConfig has a configurable failure_threshold field (default 3). This creates an inconsistency where changing the config has no effect.
Proposed fix to use configurable threshold
Either pass the threshold to record_failure:
```diff
-pub fn record_failure(&mut self, reason: String) {
+pub fn record_failure(&mut self, reason: String, failure_threshold: u32) {
     self.consecutive_failures += 1;
     self.last_check_at = chrono::Utc::now().timestamp_millis();
-    if self.consecutive_failures >= 3 {
+    if self.consecutive_failures >= failure_threshold {
         self.status = HealthStatus::Unhealthy(reason);
     } else {
         self.status = HealthStatus::Degraded(reason);
     }
 }
```

Or have HealthMonitor::record_failure apply the threshold from its config.
🤖 Prompt for AI Agents
In `@crates/challenge-registry/src/health.rs` around lines 91 - 100, The method
record_failure currently uses a hardcoded threshold (>= 3) to flip status to
Unhealthy; update it to use the configured threshold instead by reading
HealthConfig.failure_threshold (or accept a threshold parameter) so changes to
HealthConfig take effect — either change record_failure on the Health struct to
accept a threshold argument and use that value, or have
HealthMonitor::record_failure call into the existing record_failure and pass
self.config.failure_threshold; make sure to replace the literal 3 with the
referenced failure_threshold and keep updating consecutive_failures,
last_check_at, and HealthStatus logic unchanged.
```rust
/// Finalize and archive a completed migration
pub fn finalize_migration(&self, challenge_id: &ChallengeId) -> RegistryResult<MigrationPlan> {
    let plan = self
        .active_plans
        .write()
        .remove(challenge_id)
        .ok_or_else(|| RegistryError::MigrationFailed("No active migration".to_string()))?;

    if !plan.is_complete() {
        return Err(RegistryError::MigrationFailed(
            "Migration not complete".to_string(),
        ));
    }
```
Finalize only after verifying a terminal status (and keep failed plans archivable).
Line 345 removes the plan before checking completion, so a premature call drops the active migration. Also, fail_migration sets Failed but finalize_migration rejects anything outside Completed/RolledBack, leaving failed plans stuck in active_plans. Validate status while the plan is still in the map and allow terminal states before removal.
🔧 Suggested fix
```diff
-        let plan = self
-            .active_plans
-            .write()
-            .remove(challenge_id)
-            .ok_or_else(|| RegistryError::MigrationFailed("No active migration".to_string()))?;
-
-        if !plan.is_complete() {
-            return Err(RegistryError::MigrationFailed(
-                "Migration not complete".to_string(),
-            ));
-        }
+        let mut plans = self.active_plans.write();
+        let is_terminal = plans
+            .get(challenge_id)
+            .map(|p| {
+                matches!(
+                    p.status,
+                    MigrationStatus::Completed
+                        | MigrationStatus::RolledBack
+                        | MigrationStatus::Failed(_)
+                )
+            })
+            .ok_or_else(|| RegistryError::MigrationFailed("No active migration".to_string()))?;
+
+        if !is_terminal {
+            return Err(RegistryError::MigrationFailed(
+                "Migration not complete".to_string(),
+            ));
+        }
+
+        let plan = plans.remove(challenge_id).expect("checked above");
```

🤖 Prompt for AI Agents
In `@crates/challenge-registry/src/migration.rs` around lines 345 - 357, The
current finalize_migration removes the plan from active_plans before checking
its status, which can drop non-terminal plans and prevents failed plans from
being archived; update finalize_migration to first look up (read/clone or
inspect) the MigrationPlan in self.active_plans without removing it, verify that
its status is a terminal state (allow Completed, RolledBack, and Failed as
terminal), and only then call remove(challenge_id) to take it out and return the
plan; reference the finalize_migration method, active_plans map, MigrationPlan
status checks (e.g., is_complete or status enum variants) and ensure
compatibility with fail_migration which sets Failed so failed plans become
removable/archivable.
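The check-before-remove pattern the prompt describes can be exercised with a plain `HashMap`, using a simplified status enum as a stand-in for `MigrationPlan` (the real types and error variants are not shown here):

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum MigrationStatus {
    InProgress,
    Completed,
    RolledBack,
    Failed(String),
}

/// Inspect first, remove only once the status is known to be terminal.
/// A premature call errors without disturbing the active plan.
fn finalize_migration(
    active_plans: &mut HashMap<String, MigrationStatus>,
    challenge_id: &str,
) -> Result<MigrationStatus, String> {
    let is_terminal = active_plans
        .get(challenge_id)
        .map(|s| {
            matches!(
                s,
                MigrationStatus::Completed
                    | MigrationStatus::RolledBack
                    | MigrationStatus::Failed(_)
            )
        })
        .ok_or_else(|| "No active migration".to_string())?;
    if !is_terminal {
        return Err("Migration not complete".to_string());
    }
    Ok(active_plans.remove(challenge_id).expect("checked above"))
}

fn main() {
    let mut plans = HashMap::new();
    plans.insert("chal-1".to_string(), MigrationStatus::InProgress);
    // A premature finalize errors and leaves the plan in place.
    assert!(finalize_migration(&mut plans, "chal-1").is_err());
    assert!(plans.contains_key("chal-1"));
    // Failed counts as terminal, so failed plans can be archived too.
    plans.insert("chal-1".to_string(), MigrationStatus::Failed("oops".into()));
    assert_eq!(
        finalize_migration(&mut plans, "chal-1"),
        Ok(MigrationStatus::Failed("oops".into()))
    );
    assert!(plans.is_empty());
}
```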
```rust
/// Update challenge lifecycle state
pub fn update_state(&self, id: &ChallengeId, new_state: LifecycleState) -> RegistryResult<()> {
    let mut challenges = self.challenges.write();
    let registered = challenges
        .get_mut(id)
        .ok_or_else(|| RegistryError::ChallengeNotFound(id.to_string()))?;

    let old_state = registered.entry.lifecycle_state.clone();
    registered.entry.lifecycle_state = new_state.clone();
    registered.entry.updated_at = chrono::Utc::now().timestamp_millis();

    debug!(
        challenge_id = %id,
        old_state = ?old_state,
        new_state = ?new_state,
        "Challenge state updated"
    );

    self.emit_event(LifecycleEvent::StateChanged {
        challenge_id: *id,
        old_state,
        new_state,
    });

    Ok(())
}
```
update_state doesn't validate transitions using ChallengeLifecycle.is_valid_transition.
The ChallengeLifecycle struct has is_valid_transition logic (defined in lifecycle.rs), but update_state doesn't use it. Invalid state transitions are allowed, which could lead to inconsistent states.
Proposed fix to validate transitions
```diff
 pub fn update_state(&self, id: &ChallengeId, new_state: LifecycleState) -> RegistryResult<()> {
     let mut challenges = self.challenges.write();
     let registered = challenges
         .get_mut(id)
         .ok_or_else(|| RegistryError::ChallengeNotFound(id.to_string()))?;
     let old_state = registered.entry.lifecycle_state.clone();
+
+    if !self.lifecycle.is_valid_transition(&old_state, &new_state) {
+        return Err(RegistryError::InvalidStateTransition(
+            format!("{:?} -> {:?}", old_state, new_state)
+        ));
+    }
+
     registered.entry.lifecycle_state = new_state.clone();
     registered.entry.updated_at = chrono::Utc::now().timestamp_millis();
```

Note: This requires adding an InvalidStateTransition variant to RegistryError.
🤖 Prompt for AI Agents
In `@crates/challenge-registry/src/registry.rs` around lines 177 - 202, The
update_state function allows arbitrary lifecycle changes but must validate
transitions using ChallengeLifecycle::is_valid_transition; modify update_state
(in registry.rs) to check ChallengeLifecycle::is_valid_transition(&old_state,
&new_state) before applying changes and if invalid return a new
RegistryError::InvalidStateTransition containing old_state and new_state (add
that variant to RegistryError), only update
registered.entry.lifecycle_state/updated_at and emit
LifecycleEvent::StateChanged when the transition is valid; ensure debug logging
remains and that the function returns Err(...) for invalid transitions instead
of proceeding.
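A reduced sketch of the transition-table idea; the allowed edges below are illustrative assumptions, not necessarily the crate's actual `is_valid_transition` rules:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum LifecycleState {
    Registered,
    Starting,
    Running,
    Stopping,
    Stopped,
}

/// Whitelist of legal edges; anything not listed is rejected.
fn is_valid_transition(from: LifecycleState, to: LifecycleState) -> bool {
    use LifecycleState::*;
    matches!(
        (from, to),
        (Registered, Starting)
            | (Starting, Running)
            | (Starting, Stopping)
            | (Running, Stopping)
            | (Stopping, Stopped)
            | (Stopped, Starting) // restart path
    )
}

/// Validate before mutating, so an invalid request leaves state untouched.
fn update_state(current: &mut LifecycleState, new_state: LifecycleState) -> Result<(), String> {
    if !is_valid_transition(*current, new_state) {
        return Err(format!("invalid transition {:?} -> {:?}", current, new_state));
    }
    *current = new_state;
    Ok(())
}

fn main() {
    let mut state = LifecycleState::Registered;
    assert!(update_state(&mut state, LifecycleState::Starting).is_ok());
    // Jumping straight to Stopped is rejected and the state is unchanged.
    assert!(update_state(&mut state, LifecycleState::Stopped).is_err());
    assert_eq!(state, LifecycleState::Starting);
}
```

The key property the registry fix needs is the second assertion: a rejected transition returns `Err` and leaves the stored state exactly as it was.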
```rust
impl Ord for ChallengeVersion {
    fn cmp(&self, other: &Self) -> Ordering {
        match self.major.cmp(&other.major) {
            Ordering::Equal => match self.minor.cmp(&other.minor) {
                Ordering::Equal => self.patch.cmp(&other.patch),
                ord => ord,
            },
            ord => ord,
        }
    }
}
```
Prerelease versions are not considered in ordering.
The Ord implementation ignores the prerelease field, meaning 1.0.0-alpha and 1.0.0 are considered equal in comparisons. Per semver specification, prerelease versions should have lower precedence than the release version (i.e., 1.0.0-alpha < 1.0.0).
This could cause issues when comparing versions during migration planning or constraint satisfaction.
💡 Suggested fix for semver-compliant ordering
```diff
 impl Ord for ChallengeVersion {
     fn cmp(&self, other: &Self) -> Ordering {
         match self.major.cmp(&other.major) {
             Ordering::Equal => match self.minor.cmp(&other.minor) {
-                Ordering::Equal => self.patch.cmp(&other.patch),
+                Ordering::Equal => match self.patch.cmp(&other.patch) {
+                    Ordering::Equal => {
+                        // Prerelease versions have lower precedence than release
+                        match (&self.prerelease, &other.prerelease) {
+                            (None, None) => Ordering::Equal,
+                            (Some(_), None) => Ordering::Less,    // prerelease < release
+                            (None, Some(_)) => Ordering::Greater, // release > prerelease
+                            (Some(a), Some(b)) => a.cmp(b),       // compare prerelease strings
+                        }
+                    }
+                    ord => ord,
+                },
                 ord => ord,
             },
             ord => ord,
         }
     }
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
impl Ord for ChallengeVersion {
    fn cmp(&self, other: &Self) -> Ordering {
        match self.major.cmp(&other.major) {
            Ordering::Equal => match self.minor.cmp(&other.minor) {
                Ordering::Equal => match self.patch.cmp(&other.patch) {
                    Ordering::Equal => {
                        // Prerelease versions have lower precedence than release
                        match (&self.prerelease, &other.prerelease) {
                            (None, None) => Ordering::Equal,
                            (Some(_), None) => Ordering::Less,    // prerelease < release
                            (None, Some(_)) => Ordering::Greater, // release > prerelease
                            (Some(a), Some(b)) => a.cmp(b),       // compare prerelease strings
                        }
                    }
                    ord => ord,
                },
                ord => ord,
            },
            ord => ord,
        }
    }
}
```
🤖 Prompt for AI Agents
In `@crates/challenge-registry/src/version.rs` around lines 69 - 79, The current
Ord impl for ChallengeVersion (fn cmp) ignores the prerelease field; update cmp
in the impl for ChallengeVersion to enforce semver precedence: after comparing
major, minor, patch, if equal compare prerelease such that a missing prerelease
(release) has higher precedence than any prerelease (i.e., None > Some), and
when both have prerelease strings compare their dot-separated identifiers in
order using semver rules (numeric identifiers compare numerically and have lower
precedence than non-numeric identifiers; non-numeric compare lexicographically;
if all equal the shorter identifier list has lower precedence). Use the existing
fields ChallengeVersion.prerelease and the cmp function to implement this logic
in the Ord::cmp for ChallengeVersion.
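The identifier-by-identifier comparison the prompt calls for (semver spec item 11) can be sketched as a free function. `compare_prerelease` is a hypothetical helper, not part of the reviewed crate; note a plain string `a.cmp(b)` would get cases like `beta.2` vs `beta.11` wrong, which is why the numeric parse is needed:

```rust
use std::cmp::Ordering;

/// Compare two prerelease strings per semver: identifiers are split on '.',
/// numeric identifiers compare numerically and rank below alphanumeric ones,
/// and when all shared identifiers are equal the shorter list loses.
fn compare_prerelease(a: &str, b: &str) -> Ordering {
    for (ia, ib) in a.split('.').zip(b.split('.')) {
        let ord = match (ia.parse::<u64>(), ib.parse::<u64>()) {
            (Ok(na), Ok(nb)) => na.cmp(&nb),      // both numeric
            (Ok(_), Err(_)) => Ordering::Less,    // numeric < alphanumeric
            (Err(_), Ok(_)) => Ordering::Greater,
            (Err(_), Err(_)) => ia.cmp(ib),       // lexicographic
        };
        if ord != Ordering::Equal {
            return ord;
        }
    }
    // All shared identifiers equal: the shorter identifier list has
    // lower precedence (e.g. "alpha" < "alpha.1").
    a.split('.').count().cmp(&b.split('.').count())
}

fn main() {
    assert_eq!(compare_prerelease("alpha", "alpha.1"), Ordering::Less);
    assert_eq!(compare_prerelease("alpha.1", "alpha.beta"), Ordering::Less);
    assert_eq!(compare_prerelease("beta.2", "beta.11"), Ordering::Less);
    assert_eq!(compare_prerelease("rc.1", "rc.1"), Ordering::Equal);
}
```

Inside `Ord::cmp`, this helper would slot into the `(Some(a), Some(b))` arm in place of the plain string comparison shown in the committable suggestion.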
```rust
pub fn create_checkpoint(&mut self, data: &CheckpointData) -> Result<PathBuf> {
    self.current_sequence += 1;
    let sequence = self.current_sequence;
    let filename = self.checkpoint_filename(sequence);
```
Defer current_sequence updates until after the checkpoint is safely persisted.
Line 232 increments before the write/rename. If I/O fails, current_sequence points to a non-existent file, and load_latest returns None even though older checkpoints exist. Compute the next sequence locally and update only after the rename succeeds.
🔧 Suggested fix
```diff
-    self.current_sequence += 1;
-    let sequence = self.current_sequence;
+    let sequence = self.current_sequence + 1;
     let filename = self.checkpoint_filename(sequence);
@@
     fs::rename(&temp_filename, &filename).map_err(|e| {
         MiniChainError::Storage(format!("Failed to finalize checkpoint: {}", e))
     })?;
+
+    self.current_sequence = sequence;
```

Also applies to: 279-283
🤖 Prompt for AI Agents
In `@crates/core/src/checkpoint.rs` around lines 231 - 235, Do not increment
self.current_sequence before persisting; instead compute let next_sequence =
self.current_sequence + 1, use checkpoint_filename(next_sequence) to write and
rename the temp file, and only after the atomic rename succeeds assign
self.current_sequence = next_sequence so failures don't advance the in-memory
sequence. Apply the same pattern to the other similar routine (the block at
279-283) so current_sequence is updated only on successful persistence; keep all
filename construction and rename logic using checkpoint_filename(next_sequence)
and leave load_latest behavior unchanged.
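The compute-then-commit pattern can be reduced to a sketch where the whole serialize + temp-write + atomic-rename step is stubbed by a boolean; `write_ok` is a stand-in for that I/O succeeding, and the struct is a simplified assumption, not the crate's `CheckpointManager`:

```rust
struct CheckpointManager {
    current_sequence: u64,
}

impl CheckpointManager {
    /// Compute the candidate sequence without mutating state; commit it
    /// only after the (stubbed) persistence step succeeds.
    fn create_checkpoint(&mut self, write_ok: bool) -> Result<u64, String> {
        let next_sequence = self.current_sequence + 1;
        // Stand-in for: serialize data, write temp file, fs::rename to final.
        if !write_ok {
            return Err("I/O failure: sequence not advanced".to_string());
        }
        // Only now does the in-memory sequence move forward.
        self.current_sequence = next_sequence;
        Ok(next_sequence)
    }
}

fn main() {
    let mut mgr = CheckpointManager { current_sequence: 7 };
    // A failed write leaves current_sequence pointing at the last good file,
    // so load_latest can still find checkpoint 7.
    assert!(mgr.create_checkpoint(false).is_err());
    assert_eq!(mgr.current_sequence, 7);
    assert_eq!(mgr.create_checkpoint(true), Ok(8));
    assert_eq!(mgr.current_sequence, 8);
}
```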
```rust
// Read header length
let mut header_len_bytes = [0u8; 4];
reader
    .read_exact(&mut header_len_bytes)
    .map_err(|e| MiniChainError::Storage(format!("Failed to read header length: {}", e)))?;
let header_len = u32::from_le_bytes(header_len_bytes) as usize;

// Read header
let mut header_bytes = vec![0u8; header_len];
reader
    .read_exact(&mut header_bytes)
    .map_err(|e| MiniChainError::Storage(format!("Failed to read header: {}", e)))?;

let header: CheckpointHeader = bincode::deserialize(&header_bytes).map_err(|e| {
    MiniChainError::Serialization(format!("Failed to deserialize header: {}", e))
})?;

// Verify magic
if !header.verify_magic() {
    return Err(MiniChainError::Storage(
        "Invalid checkpoint magic bytes".into(),
    ));
}

// Verify version compatibility
if header.version > CHECKPOINT_VERSION {
    return Err(MiniChainError::Storage(format!(
        "Checkpoint version {} is newer than supported version {}",
        header.version, CHECKPOINT_VERSION
    )));
}

// Read data
let mut data_bytes = vec![0u8; header.data_size as usize];
reader
    .read_exact(&mut data_bytes)
    .map_err(|e| MiniChainError::Storage(format!("Failed to read data: {}", e)))?;
```
Add size bounds before allocating header/data buffers.
Line 326 uses a header length sourced from disk, and Line 354 uses data_size from the header. A corrupt file can force large allocations and OOM. Add reasonable caps (or validate against file length) before allocating.
🔧 Suggested fix

```diff
 // Read header length
 let mut header_len_bytes = [0u8; 4];
 reader
     .read_exact(&mut header_len_bytes)
     .map_err(|e| MiniChainError::Storage(format!("Failed to read header length: {}", e)))?;
 let header_len = u32::from_le_bytes(header_len_bytes) as usize;
+const MAX_HEADER_SIZE: usize = 16 * 1024;
+const MAX_DATA_SIZE: u64 = 512 * 1024 * 1024;
+if header_len == 0 || header_len > MAX_HEADER_SIZE {
+    return Err(MiniChainError::Storage("Checkpoint header too large".into()));
+}
@@
 let header: CheckpointHeader = bincode::deserialize(&header_bytes).map_err(|e| {
     MiniChainError::Serialization(format!("Failed to deserialize header: {}", e))
 })?;
+
+if header.data_size > MAX_DATA_SIZE {
+    return Err(MiniChainError::Storage("Checkpoint data too large".into()));
+}
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
// Read header length
let mut header_len_bytes = [0u8; 4];
reader
    .read_exact(&mut header_len_bytes)
    .map_err(|e| MiniChainError::Storage(format!("Failed to read header length: {}", e)))?;
let header_len = u32::from_le_bytes(header_len_bytes) as usize;

const MAX_HEADER_SIZE: usize = 16 * 1024;
const MAX_DATA_SIZE: u64 = 512 * 1024 * 1024;
if header_len == 0 || header_len > MAX_HEADER_SIZE {
    return Err(MiniChainError::Storage("Checkpoint header too large".into()));
}

// Read header
let mut header_bytes = vec![0u8; header_len];
reader
    .read_exact(&mut header_bytes)
    .map_err(|e| MiniChainError::Storage(format!("Failed to read header: {}", e)))?;

let header: CheckpointHeader = bincode::deserialize(&header_bytes).map_err(|e| {
    MiniChainError::Serialization(format!("Failed to deserialize header: {}", e))
})?;

if header.data_size > MAX_DATA_SIZE {
    return Err(MiniChainError::Storage("Checkpoint data too large".into()));
}

// Verify magic
if !header.verify_magic() {
    return Err(MiniChainError::Storage(
        "Invalid checkpoint magic bytes".into(),
    ));
}

// Verify version compatibility
if header.version > CHECKPOINT_VERSION {
    return Err(MiniChainError::Storage(format!(
        "Checkpoint version {} is newer than supported version {}",
        header.version, CHECKPOINT_VERSION
    )));
}

// Read data
let mut data_bytes = vec![0u8; header.data_size as usize];
reader
    .read_exact(&mut data_bytes)
    .map_err(|e| MiniChainError::Storage(format!("Failed to read data: {}", e)))?;
```
🤖 Prompt for AI Agents
In `@crates/core/src/checkpoint.rs` around lines 321 - 357, The code reads
untrusted sizes into header_len and header.data_size and allocates vectors
directly, which can OOM on corrupt files; before allocating check bounds: define
reasonable maximums (e.g. MAX_HEADER_SIZE and MAX_DATA_SIZE), validate header_len is
<= MAX_HEADER_SIZE and less than the remaining reader/file length, then allocate
header_bytes; after deserializing CheckpointHeader, validate header.data_size <=
MAX_DATA_SIZE and that header.data_size does not exceed remaining reader/file
length before allocating data_bytes; on violations return a
MiniChainError::Storage with a clear message. Use the existing symbols
header_len, CheckpointHeader, header.data_size, reader, and CHECKPOINT_VERSION
to locate insertion points.
```rust
fn get_overall_status(&self) -> HealthStatus {
    let components = self.components.read();

    // If any component is unhealthy, overall is unhealthy
    if components.p2p == HealthStatus::Unhealthy
        || components.storage == HealthStatus::Unhealthy
        || components.consensus == HealthStatus::Unhealthy
    {
        return HealthStatus::Unhealthy;
    }

    // If any critical component is degraded, overall is degraded
    if components.p2p == HealthStatus::Degraded
        || components.storage == HealthStatus::Degraded
        || components.consensus == HealthStatus::Degraded
    {
        return HealthStatus::Degraded;
    }

    // If Bittensor is down but others are fine, degraded
    if components.bittensor == HealthStatus::Unhealthy {
        return HealthStatus::Degraded;
    }

    HealthStatus::Healthy
}
```
The challenges component status is tracked but not used in health calculation.
The set_component_status method accepts "challenges" (line 190), and ComponentStatus includes a challenges field, but get_overall_status() doesn't consider it when determining overall health. If challenge container health should affect the overall status, it should be included in this logic.
💡 Suggested fix if challenges should affect health

```diff
 fn get_overall_status(&self) -> HealthStatus {
     let components = self.components.read();

     // If any component is unhealthy, overall is unhealthy
     if components.p2p == HealthStatus::Unhealthy
         || components.storage == HealthStatus::Unhealthy
         || components.consensus == HealthStatus::Unhealthy
     {
         return HealthStatus::Unhealthy;
     }

     // If any critical component is degraded, overall is degraded
     if components.p2p == HealthStatus::Degraded
         || components.storage == HealthStatus::Degraded
         || components.consensus == HealthStatus::Degraded
     {
         return HealthStatus::Degraded;
     }

-    // If Bittensor is down but others are fine, degraded
-    if components.bittensor == HealthStatus::Unhealthy {
+    // If Bittensor or challenges are down but core is fine, degraded
+    if components.bittensor == HealthStatus::Unhealthy
+        || components.challenges == HealthStatus::Unhealthy
+    {
         return HealthStatus::Degraded;
     }

     HealthStatus::Healthy
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
fn get_overall_status(&self) -> HealthStatus {
    let components = self.components.read();

    // If any component is unhealthy, overall is unhealthy
    if components.p2p == HealthStatus::Unhealthy
        || components.storage == HealthStatus::Unhealthy
        || components.consensus == HealthStatus::Unhealthy
    {
        return HealthStatus::Unhealthy;
    }

    // If any critical component is degraded, overall is degraded
    if components.p2p == HealthStatus::Degraded
        || components.storage == HealthStatus::Degraded
        || components.consensus == HealthStatus::Degraded
    {
        return HealthStatus::Degraded;
    }

    // If Bittensor or challenges are down but core is fine, degraded
    if components.bittensor == HealthStatus::Unhealthy
        || components.challenges == HealthStatus::Unhealthy
    {
        return HealthStatus::Degraded;
    }

    HealthStatus::Healthy
}
```
🤖 Prompt for AI Agents
In `@crates/rpc-server/src/health.rs` around lines 196 - 221, get_overall_status
currently ignores the challenges field on ComponentStatus even though
set_component_status and ComponentStatus track it; update get_overall_status to
include components.challenges in the health calculation, treating it like
bittensor: an unhealthy challenges component should mark the overall status
Degraded rather than Unhealthy, while the existing p2p/storage/consensus checks
continue to run first. Modify the final special-case check so it returns
HealthStatus::Degraded when either components.bittensor or
components.challenges is HealthStatus::Unhealthy.
````
```
my-challenge/
├── Cargo.toml
├── src/
│   ├── lib.rs          # Challenge implementation
│   ├── evaluation.rs   # Evaluation logic
│   └── scoring.rs      # Scoring algorithm
├── Dockerfile          # Container build
└── README.md           # Documentation
```
````
Add a language tag to the project-structure block.
This avoids MD040 warnings and improves readability.
Proposed fix
````diff
-```
+```text
 my-challenge/
 ├── Cargo.toml
 ├── src/
 │   ├── lib.rs          # Challenge implementation
 │   ├── evaluation.rs   # Evaluation logic
 │   └── scoring.rs      # Scoring algorithm
 ├── Dockerfile          # Container build
 └── README.md           # Documentation
````
🤖 Prompt for AI Agents
In @docs/challenge-integration.md around lines 42 - 51, the Markdown code fence
showing the project tree lacks a language tag; update the block in
docs/challenge-integration.md by changing the opening ``` to ```text so the
project-structure block (the my-challenge/ tree with Cargo.toml, src/lib.rs,
evaluation.rs, scoring.rs, Dockerfile, README.md) is fenced as ```text, which
will silence MD040 warnings and improve rendering/readability.
Summary
This PR adds comprehensive infrastructure for integrating challenge crates into Platform v2, with a focus on enabling zero-downtime updates.
Changes
Challenge Infrastructure
- `challenges/` directory structure for hosting challenge crates
- `platform-challenge-registry` crate for challenge lifecycle management
Checkpoint System
- Checkpoint support in `platform-core` for state persistence
Rolling Updates
Documentation
Testing
Benefits
Testing
- `cargo check --workspace`
Summary by CodeRabbit
Release Notes
- New Features
- Documentation
- Infrastructure