-
Notifications
You must be signed in to change notification settings - Fork 2
scheduler: Fix bug in scheduling safety task #32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
scheduler: Fix bug in scheduling safety task #32
Conversation
License Check Results🚀 The license check job ran with the Bazel command: bazel run //:license-checkStatus: Click to expand output |
|
The created documentation from the pull request is available at: docu-html |
383e198 to
2d1640f
Compare
09a7db9 to
75b4719
Compare
75b4719 to
792e4e7
Compare
792e4e7 to
8d8d7b1
Compare
8d8d7b1 to
d14d4e5
Compare
d14d4e5 to
c3b2561
Compare
c3b2561 to
3e7796d
Compare
3e7796d to
dcc80cf
Compare
dcc80cf to
b5c1eff
Compare
b5c1eff to
08752f3
Compare
08752f3 to
f57a27f
Compare
f57a27f to
a885727
Compare
a885727 to
4662199
Compare
| }) | ||
| // Safety: Unset join handle flag before setting waker, the flag would have been set previously in the first poll of join handle. | ||
| // If flag is not cleared, another worker finishing the task will see the flag set and | ||
| // read the waker to call wake() while it is written here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is anyway race condition or ? since if it landed here, someone may actually take it anyway right before you ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets consider the below scenario.
- Worker-X execute parent task (PT1) and it spawns 5-child tasks (CT2,CT3,etc.), then JoinHandle.poll is executed and waker is set for all child tasks.
- Worker-X completes execution of CT2 and wake PT1
- A worker (say W1) executes CT3 and at the same time another worker (say W2) executes PT1
case1 (racing - without unset_join_handle()): PT1 insideset_waker()and writing the waker, at the same time CT3 is completed and it finds join handle waker, reads the waker to call wake() and this results in runtime failure.
case2 (fix - with unset_join_handle()): PT1 insideset_waker()unset_join_handle flag only if CT3 is NOT completed and write waker. Otherwise the flag is not touched, also waker is not written.
So, CT3 will either find waker if execution is completed before unsetting flag, otherwise will not find waker.
|
|
||
| res.unwrap_or_else(|| { | ||
| if is_safety_err && self.is_with_safety { | ||
| // Set saftey error flag which would be checked in JoinHandle poll() to schedule parent task into safety worker |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not required. If the Joinhandle.poll had a time to execute with result of this task it's all good. If You read doc, it's guaranteed that if task fail Parent task will get executed, but it can be in safety workers or in normal worker depending on state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets consider simple scenario,
- Parent task spawns a safety task
- Safety task is executed by a worker and results in safety error
- Parent task does something and await on join handle & returns result
- Parent task executes recovery action if safety task failed
At step 3, in JoinHandle.poll, if we are not detecting safety error and not waking (wake_by_ref) into safety worker, then step 4 will be executed by async worker. I hope this is not the expected behaviour.
| // For now stupid respawn | ||
| self.scheduler.spawn_from_runtime(task, &self.producer_consumer); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change exeactly ? Action to move parent task to safety worker is happening once child task is finished or ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in above comment (#32 (comment)), wake_by_ref() is called from JoinHandle.polll to re-schedule the task into safety worker which sets safety notified flag. So, the safety notified flag is checked and re-spawned into safety worker.
| let task_ref = unsafe { TaskRef::from_raw(task_header_ptr) }; | ||
|
|
||
| task_ref.schedule(); | ||
| if TaskContext::should_wake_task_into_safety() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That won't work as is, the waker can be called outside of runtime (ie iceoryx2 event thread), so it cannot use static CTX in the whole path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified ctx_get_schedule_safety()
| waker.wake_by_ref(); | ||
| FutureInternalReturn::polled() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you shall not call your waker here, since you are in a task that is running so, all is fine or ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please refer #32 (comment)
4662199 to
a26bda4
Compare
Fixed bugs related to handling and scheduling safety task. eclipse-score#18 eclipse-score#20 eclipse-score#39
a26bda4 to
2d9c994
Compare
| is_safety_enabled: bool, | ||
|
|
||
| /// This flag is used to schedule parent task of failing safety task into safety worker | ||
| schedule_safety: Cell<bool>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RefCell<Option> and borrow_mut on usage.
For the above scenario with safety worker enabled in runtime, after polling of Task-A which return error, it is detected as safety error and a
safety wakeris created from normal waker andwake()function is called. In the wake function, thewaker data ptris typecasted as (parent) task and further try to clone. But, there is no valid pointer in the task vtable for clone() and it results in segmentation fault.As fix, a safety_error flag in TaskHeader and thread local schedule_safety flag are set instead of creating safety waker. When the wake function is called, it goes through Task-B and then our wake() function is executed. In this wake(), the schedule_safety flag is checked and the task is scheduled on either safety or normal worker.
In case if task is completed prior to setting JoinHandle waker, then the safety_error flag is checked in JoinHandle.poll function to reschedule the task into safety worker.
Notes for Reviewer
Pre-Review Checklist for the PR Author
Checklist for the PR Reviewer
Post-review Checklist for the PR Author
References
Closes #18, #20, #39