@prabakaranklst prabakaranklst commented Dec 10, 2025

Parent Task-->spawns -----------> Task-A
           -->join_all
                                  Waker Data ptr ------> Task-B (External task, different type)
                                                         Waker Data ptr ----------------------------------> Parent Task

In the scenario above, with the safety worker enabled in the runtime, polling Task-A returns an error. This is detected as a safety error, so a safety waker is created from the normal waker and its wake() function is called. Inside wake(), the waker data pointer is cast to the (parent) task type and then cloned. However, the waker data pointer refers to Task-B (an external task of a different type), so there is no valid clone() pointer in the task vtable, and the call results in a segmentation fault.
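
The pointer chain in the diagram can be reproduced with standard-library wakers. This is only a sketch, not the runtime's code: `ParentTask`, `ExternalWrapper`, and `demo` are hypothetical names. It shows that the waker a child is polled with need not be the parent's own waker, so its data pointer must not be reinterpreted as the parent task.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Wake, Waker};

/// Hypothetical stand-in for the parent task; counts how often it is woken.
struct ParentTask {
    wakes: AtomicUsize,
}

impl Wake for ParentTask {
    fn wake(self: Arc<Self>) {
        self.wakes.fetch_add(1, Ordering::SeqCst);
    }
}

/// Hypothetical stand-in for Task-B: an external type wrapping the parent's waker.
struct ExternalWrapper {
    inner: Waker,
}

impl Wake for ExternalWrapper {
    fn wake(self: Arc<Self>) {
        // Forward the wake-up to the parent through the parent's own waker.
        self.inner.wake_by_ref();
    }
}

/// Returns (wakers_are_equivalent, parent_wake_count).
fn demo() -> (bool, usize) {
    let parent = Arc::new(ParentTask { wakes: AtomicUsize::new(0) });
    let parent_waker = Waker::from(parent.clone());

    // This is the waker the child task would actually be polled with.
    let child_waker = Waker::from(Arc::new(ExternalWrapper {
        inner: parent_waker.clone(),
    }));

    // The data pointers differ, so the two wakers are not interchangeable.
    let same = child_waker.will_wake(&parent_waker);

    // Calling through the waker's own vtable still reaches the parent.
    child_waker.wake();
    (same, parent.wakes.load(Ordering::SeqCst))
}
```

Here `demo()` reports that the two wakers are not equivalent while the wake-up still reaches the parent, which is why the call has to go through `wake()` rather than through a cast of the data pointer.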

As a fix, instead of creating a safety waker, a safety_error flag in TaskHeader and a thread-local schedule_safety flag are set. When the waker is invoked, the call goes through Task-B first and then reaches our wake() function. In this wake(), the schedule_safety flag is checked and the task is scheduled on either the safety worker or a normal worker.
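
A minimal sketch of this flag-based approach follows. It assumes a thread-local `SCHEDULE_SAFETY` flag and a `TaskHeader` with an atomic `safety_error` field; the `Queue` enum, the function names, and the choice to reset the thread-local flag when it is read are assumptions made for illustration, not the PR's code.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicBool, Ordering};

thread_local! {
    /// Hypothetical per-worker flag: the next wake should target the safety worker.
    static SCHEDULE_SAFETY: Cell<bool> = Cell::new(false);
}

#[derive(Debug, PartialEq)]
enum Queue {
    Normal,
    Safety,
}

/// Hypothetical task header carrying the per-task safety error flag.
struct TaskHeader {
    safety_error: AtomicBool,
}

/// Called by the worker after a polled task returned a safety error.
fn mark_safety_error(header: &TaskHeader) {
    header.safety_error.store(true, Ordering::Release);
    SCHEDULE_SAFETY.with(|flag| flag.set(true));
}

/// Called from the runtime's own wake(), however many wrapper wakers sit in between.
/// Reads and resets the thread-local flag (the reset is an assumption of this sketch).
fn wake_target() -> Queue {
    if SCHEDULE_SAFETY.with(|flag| flag.replace(false)) {
        Queue::Safety
    } else {
        Queue::Normal
    }
}

/// Returns the targets of two consecutive wakes and the header flag.
fn demo() -> (Queue, Queue, bool) {
    let header = TaskHeader { safety_error: AtomicBool::new(false) };
    mark_safety_error(&header);
    let first = wake_target();
    let second = wake_target();
    (first, second, header.safety_error.load(Ordering::Acquire))
}
```

No waker is constructed from a raw pointer here, so it does not matter which type the waker data pointer refers to.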

If the task completes before the JoinHandle waker is set, the safety_error flag is checked in the JoinHandle.poll function so that the task can be rescheduled into the safety worker.
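
The late-completion case can be sketched with a plain `Future`. Everything below is hypothetical (`Shared`, this `JoinHandle`, the `rescheduled` field, and the error string); it only illustrates the idea that a poll which sees an already-set safety error wakes itself once and yields, so that the next poll can run wherever the scheduler places the task.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical state shared between a finished child task and its join handle.
struct Shared {
    completed: AtomicBool,
    safety_error: AtomicBool,
}

struct JoinHandle {
    shared: Arc<Shared>,
    rescheduled: bool, // assumption: remembers that the reschedule already happened
}

impl Future for JoinHandle {
    type Output = Result<(), &'static str>;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if !self.shared.completed.load(Ordering::Acquire) {
            // A real handle would store cx.waker() here.
            return Poll::Pending;
        }
        let failed = self.shared.safety_error.load(Ordering::Acquire);
        if failed && !self.rescheduled {
            // Child already failed: request a reschedule and yield once.
            self.rescheduled = true;
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }
        Poll::Ready(if failed { Err("safety error") } else { Ok(()) })
    }
}

/// Test waker that only counts wake-ups.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Returns (first_poll_pending, wake_count, second_poll_is_error).
fn demo() -> (bool, usize, bool) {
    let shared = Arc::new(Shared {
        completed: AtomicBool::new(true),
        safety_error: AtomicBool::new(true),
    });
    let counter = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);
    let mut handle = JoinHandle { shared, rescheduled: false };

    let first_pending = Pin::new(&mut handle).poll(&mut cx).is_pending();
    let second = Pin::new(&mut handle).poll(&mut cx);
    let is_error = second == Poll::Ready(Err("safety error"));
    (first_pending, counter.0.load(Ordering::SeqCst), is_error)
}
```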

Notes for Reviewer

Pre-Review Checklist for the PR Author

  • PR title is short, expressive and meaningful
  • Commits are properly organized
  • Relevant issues are linked in the References section
  • Tests are conducted
  • Unit tests are added

Checklist for the PR Reviewer

  • Commits are properly organized and messages are according to the guideline
  • Unit tests have been written for new behavior
  • Public API is documented
  • PR title describes the changes

Post-review Checklist for the PR Author

  • All open points are addressed and tracked via issues

References

Closes #18, #20, #39

github-actions bot commented Dec 10, 2025

License Check Results

🚀 The license check job ran with the Bazel command:

bazel run //:license-check

Status: ⚠️ Needs Review

[License Check Output]
Extracting Bazel installation...
Starting local Bazel server (8.3.0) and connecting to it...
INFO: Invocation ID: 64305c3c-7e9b-437b-90c2-9da8d5816685
Computing main repo mapping: 
DEBUG: Rule 'rust_qnx8_toolchain+' indicated that a canonical reproducible form can be obtained by modifying arguments integrity = "sha256-eQOopREOYCL5vtTb6c1cwZrql4GVrJ1FqgxarQRe1xs="
DEBUG: Repository rust_qnx8_toolchain+ instantiated at:
  <builtin>: in <toplevel>
Repository rule http_archive defined at:
  /home/runner/.bazel/external/bazel_tools/tools/build_defs/repo/http.bzl:394:31: in <toplevel>
WARNING: For repository 'bazel_skylib', the root module requires module version [email protected], but got [email protected] in the resolved dependency graph. Please update the version in your MODULE.bazel or set --check_direct_dependencies=off
WARNING: For repository 'rules_rust', the root module requires module version [email protected], but got [email protected] in the resolved dependency graph. Please update the version in your MODULE.bazel or set --check_direct_dependencies=off
WARNING: For repository 'rules_cc', the root module requires module version [email protected], but got [email protected] in the resolved dependency graph. Please update the version in your MODULE.bazel or set --check_direct_dependencies=off
WARNING: For repository 'aspect_rules_lint', the root module requires module version [email protected], but got [email protected] in the resolved dependency graph. Please update the version in your MODULE.bazel or set --check_direct_dependencies=off
WARNING: For repository 'buildifier_prebuilt', the root module requires module version [email protected], but got [email protected] in the resolved dependency graph. Please update the version in your MODULE.bazel or set --check_direct_dependencies=off
WARNING: For repository 'score_process', the root module requires module version [email protected], but got [email protected] in the resolved dependency graph. Please update the version in your MODULE.bazel or set --check_direct_dependencies=off
WARNING: For repository 'googletest', the root module requires module version [email protected], but got [email protected] in the resolved dependency graph. Please update the version in your MODULE.bazel or set --check_direct_dependencies=off
WARNING: For repository 'score_crates', the root module requires module version [email protected], but got [email protected] in the resolved dependency graph. Please update the version in your MODULE.bazel or set --check_direct_dependencies=off
Loading: 
Loading: 0 packages loaded
Loading: 0 packages loaded
    currently loading: 
Analyzing: target //:license-check (1 packages loaded, 0 targets configured)
Analyzing: target //:license-check (1 packages loaded, 0 targets configured)

Analyzing: target //:license-check (38 packages loaded, 10 targets configured)

Analyzing: target //:license-check (118 packages loaded, 84 targets configured)

Analyzing: target //:license-check (146 packages loaded, 2764 targets configured)

Analyzing: target //:license-check (154 packages loaded, 7069 targets configured)

INFO: Analyzed target //:license-check (157 packages loaded, 9085 targets configured).
[11 / 14] [Prepa] Generating Dash formatted dependency file ...
INFO: From Generating Dash formatted dependency file ...:
INFO: Successfully converted 209 packages from Cargo.lock to bazel-out/k8-fastbuild/bin/formatted.txt
INFO: Found 1 target...
Target //:license.check.license_check up-to-date:
  bazel-bin/license.check.license_check
  bazel-bin/license.check.license_check.jar
INFO: Elapsed time: 22.302s, Critical Path: 0.34s
INFO: 14 processes: 5 disk cache hit, 9 internal.
INFO: Build completed successfully, 14 total actions
INFO: Running command line: bazel-bin/license.check.license_check ./formatted.txt <args omitted>
usage: org.eclipse.dash.licenses.cli.Main [-batch <int>] [-cd <url>]
       [-confidence <int>] [-ef <url>] [-excludeSources <sources>] [-help] [-lic
       <url>] [-project <shortname>] [-repo <url>] [-review] [-summary <file>]
       [-timeout <seconds>] [-token <token>]

@github-actions

The created documentation from the pull request is available at: docu-html

@prabakaranklst prabakaranklst force-pushed the prabakaran_fix_crash_due_to_safety_tasks branch from 383e198 to 2d1640f Compare December 10, 2025 14:09
@prabakaranklst prabakaranklst force-pushed the prabakaran_fix_crash_due_to_safety_tasks branch 2 times, most recently from 09a7db9 to 75b4719 Compare December 17, 2025 12:18
@pawelrutkaq pawelrutkaq force-pushed the prabakaran_fix_crash_due_to_safety_tasks branch from 75b4719 to 792e4e7 Compare December 18, 2025 09:50
@prabakaranklst prabakaranklst force-pushed the prabakaran_fix_crash_due_to_safety_tasks branch from 792e4e7 to 8d8d7b1 Compare December 23, 2025 08:42
@prabakaranklst prabakaranklst force-pushed the prabakaran_fix_crash_due_to_safety_tasks branch from 8d8d7b1 to d14d4e5 Compare December 23, 2025 09:28
@prabakaranklst prabakaranklst force-pushed the prabakaran_fix_crash_due_to_safety_tasks branch from d14d4e5 to c3b2561 Compare December 23, 2025 09:29
@prabakaranklst prabakaranklst force-pushed the prabakaran_fix_crash_due_to_safety_tasks branch from c3b2561 to 3e7796d Compare December 23, 2025 09:49
@prabakaranklst prabakaranklst force-pushed the prabakaran_fix_crash_due_to_safety_tasks branch from 3e7796d to dcc80cf Compare January 5, 2026 07:49
@prabakaranklst prabakaranklst force-pushed the prabakaran_fix_crash_due_to_safety_tasks branch from dcc80cf to b5c1eff Compare January 7, 2026 12:13
@prabakaranklst prabakaranklst force-pushed the prabakaran_fix_crash_due_to_safety_tasks branch from b5c1eff to 08752f3 Compare January 8, 2026 16:18
@prabakaranklst prabakaranklst force-pushed the prabakaran_fix_crash_due_to_safety_tasks branch from 08752f3 to f57a27f Compare January 12, 2026 10:07
@prabakaranklst prabakaranklst force-pushed the prabakaran_fix_crash_due_to_safety_tasks branch from f57a27f to a885727 Compare January 13, 2026 03:57
@prabakaranklst prabakaranklst force-pushed the prabakaran_fix_crash_due_to_safety_tasks branch from a885727 to 4662199 Compare January 13, 2026 03:57
})
// Safety: Unset join handle flag before setting waker, the flag would have been set previously in the first poll of join handle.
// If flag is not cleared, another worker finishing the task will see the flag set and
// read the waker to call wake() while it is written here
Contributor

Isn't this still a race condition? If execution has reached this point, someone else may still take it right before you do.

Contributor Author

Let's consider the scenario below.

  1. Worker-X executes the parent task (PT1), which spawns 5 child tasks (CT2, CT3, etc.). JoinHandle.poll is then executed and a waker is set for all child tasks.
  2. Worker-X completes execution of CT2 and wakes PT1.
  3. A worker (say W1) executes CT3 while, at the same time, another worker (say W2) executes PT1.
    Case 1 (race, without unset_join_handle()): PT1 is inside set_waker() and is writing the waker. At the same time CT3 completes, finds the join handle waker, and reads the waker to call wake(). This results in a runtime failure.
    Case 2 (fix, with unset_join_handle()): PT1, inside set_waker(), unsets the join handle flag only if CT3 is NOT completed, and then writes the waker. Otherwise the flag is not touched and the waker is not written.
    So CT3 finds the waker only if it completes before the flag is unset; otherwise it does not find a waker.
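
The protocol in case 2 can be sketched with one atomic state byte guarding the waker slot. This is an illustration under stated assumptions, not the PR's implementation: `JoinSlot`, the `COMPLETED` and `JOIN_WAKER` bits, and the method names are hypothetical, and only a single join handle is assumed to call `set_waker`.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU8, AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Wake, Waker};

const COMPLETED: u8 = 0b01; // the child task has finished
const JOIN_WAKER: u8 = 0b10; // the waker slot holds a readable waker

struct JoinSlot {
    state: AtomicU8,
    waker: UnsafeCell<Option<Waker>>,
}

// Safety: access to `waker` is serialized through the bits in `state`.
unsafe impl Sync for JoinSlot {}

impl JoinSlot {
    fn new() -> Self {
        JoinSlot { state: AtomicU8::new(0), waker: UnsafeCell::new(None) }
    }

    /// Join-handle side. Returns false if the task completed and no waker is published.
    fn set_waker(&self, waker: &Waker) -> bool {
        // Unset JOIN_WAKER, but only while the task is not COMPLETED.
        let cleared = self.state.fetch_update(Ordering::AcqRel, Ordering::Acquire, |s| {
            if s & COMPLETED != 0 { None } else { Some(s & !JOIN_WAKER) }
        });
        if cleared.is_err() {
            return false; // the completer may be reading the slot: do not write
        }
        // The flag is clear, so the completer will not read the slot now.
        unsafe { *self.waker.get() = Some(waker.clone()) };
        // Publish the waker unless the task completed in the meantime.
        self.state
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |s| {
                if s & COMPLETED != 0 { None } else { Some(s | JOIN_WAKER) }
            })
            .is_ok()
    }

    /// Worker side, called once when the child finishes. Returns true if it woke the parent.
    fn complete(&self) -> bool {
        let prev = self.state.fetch_or(COMPLETED, Ordering::AcqRel);
        if prev & JOIN_WAKER == 0 {
            return false; // no published waker: leave the slot alone
        }
        match unsafe { (*self.waker.get()).take() } {
            Some(w) => { w.wake(); true }
            None => false,
        }
    }
}

/// Test waker that only counts wake-ups.
struct Counter(AtomicUsize);

impl Wake for Counter {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Returns (stored, woke, wake_count, stored_after_completion).
fn demo() -> (bool, bool, usize, bool) {
    let counter = Arc::new(Counter(AtomicUsize::new(0)));
    let waker = Waker::from(counter.clone());
    let slot = JoinSlot::new();
    let stored = slot.set_waker(&waker);
    let woke = slot.complete();
    let stored_late = slot.set_waker(&waker);
    (stored, woke, counter.0.load(Ordering::SeqCst), stored_late)
}
```

When `set_waker` returns false, the caller is expected to read the task result directly instead of waiting for a wake-up.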


res.unwrap_or_else(|| {
if is_safety_err && self.is_with_safety {
// Set safety error flag which would be checked in JoinHandle poll() to schedule parent task into safety worker
Contributor

This is not required. If JoinHandle.poll had time to execute with the result of this task, all is good. According to the docs, it is guaranteed that the parent task gets executed if a task fails, but that can happen on a safety worker or on a normal worker depending on state.

Contributor Author

Let's consider a simple scenario:

  1. Parent task spawns a safety task
  2. Safety task is executed by a worker and results in a safety error
  3. Parent task does some work, awaits the join handle, and returns the result
  4. Parent task executes a recovery action if the safety task failed

At step 3, if JoinHandle.poll does not detect the safety error and does not wake the task (wake_by_ref) into the safety worker, then step 4 is executed by an async worker. I assume this is not the expected behaviour.

// For now stupid respawn
self.scheduler.spawn_from_runtime(task, &self.producer_consumer);
}
}
Contributor

Why exactly is this change needed? Isn't the parent task moved to the safety worker once the child task has finished?

Contributor Author

As mentioned in the comment above (#32 (comment)), wake_by_ref() is called from JoinHandle.poll to reschedule the task into the safety worker, which sets the safety notified flag. So the safety notified flag is checked here and the task is respawned into the safety worker.

let task_ref = unsafe { TaskRef::from_raw(task_header_ptr) };

task_ref.schedule();
if TaskContext::should_wake_task_into_safety() {
Contributor

That won't work as is: the waker can be called from outside the runtime (e.g. the iceoryx2 event thread), so it cannot rely on the static CTX along the whole path.

Contributor Author

Modified ctx_get_schedule_safety()
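
The concern raised in this thread, a waker invoked from a thread that has no runtime context, can be illustrated with a small sketch. It is not the actual change to ctx_get_schedule_safety(): `WorkerCtx`, `CTX`, and the helper functions below are hypothetical, and returning `false` when no context exists is an assumption about the desired fallback.

```rust
use std::cell::RefCell;
use std::thread;

/// Hypothetical per-worker context; it exists only on runtime worker threads.
struct WorkerCtx {
    schedule_safety: bool,
}

thread_local! {
    static CTX: RefCell<Option<WorkerCtx>> = RefCell::new(None);
}

/// Called once when a runtime worker thread starts.
fn enter_worker() {
    CTX.with(|c| *c.borrow_mut() = Some(WorkerCtx { schedule_safety: false }));
}

/// Request safety scheduling. Returns false when no worker context exists.
fn request_safety() -> bool {
    CTX.with(|c| {
        let mut slot = c.borrow_mut();
        match slot.as_mut() {
            Some(ctx) => {
                ctx.schedule_safety = true;
                true
            }
            None => false,
        }
    })
}

/// Read and reset the flag. Threads without a context fall back to false.
fn take_schedule_safety() -> bool {
    CTX.with(|c| {
        let mut slot = c.borrow_mut();
        match slot.as_mut() {
            Some(ctx) => std::mem::replace(&mut ctx.schedule_safety, false),
            None => false,
        }
    })
}

/// Returns the flag as seen on a worker thread and on a foreign thread.
fn demo() -> (bool, bool) {
    let on_worker = thread::spawn(|| {
        enter_worker();
        request_safety();
        take_schedule_safety()
    })
    .join()
    .unwrap();

    // A foreign thread (for example an external event thread) never entered the runtime.
    let on_foreign = thread::spawn(|| {
        request_safety();
        take_schedule_safety()
    })
    .join()
    .unwrap();

    (on_worker, on_foreign)
}
```

Because the foreign thread never sees the flag, any decision that must survive such a wake has to be stored somewhere the waker can reach without thread-local state, for example in the task header.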

Comment on lines +83 to +84
waker.wake_by_ref();
FutureInternalReturn::polled()
Contributor

You should not call your waker here. You are in a task that is already running, so everything is fine, or not?

Contributor Author

Please refer to #32 (comment).

@prabakaranklst prabakaranklst force-pushed the prabakaran_fix_crash_due_to_safety_tasks branch from 4662199 to a26bda4 Compare January 19, 2026 11:50
Fixed bugs related to handling and scheduling safety task.
eclipse-score#18
eclipse-score#20
eclipse-score#39
@prabakaranklst prabakaranklst force-pushed the prabakaran_fix_crash_due_to_safety_tasks branch from a26bda4 to 2d9c994 Compare January 19, 2026 14:31
is_safety_enabled: bool,

/// This flag is used to schedule parent task of failing safety task into safety worker
schedule_safety: Cell<bool>,
Contributor

Use RefCell<Option> here and call borrow_mut() at the point of use.

Development

Successfully merging this pull request may close these issues.

Bug: Crash after queuing 30 tasks for the safety worker

2 participants