@MelReyCG
Contributor

@MelReyCG MelReyCG commented Oct 8, 2025

This PR builds on Amandine's work on adding an error YAML file to GEOS (PR #3828). It adds detection and management, inside GEOS, of (1) error signals and (2) external errors coming from dependencies, so that both can be handled and written to the log and to the error YAML file.

Managing those external errors gives us the opportunity to:

  • output the error message & context reliably in the log, even if stderr gets lost or is used for another purpose,
  • be sure to detect any kernel / system allocator errors and attach the stack trace of these errors,
  • group (factorize) them with external tools / scripts, thus highlighting which rank(s) the issue comes from,
  • prevent the stack trace from being cut by messages from other ranks, which could previously happen on a signal.
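
As a rough illustration of the signal-detection side, here is a minimal sketch assuming a plain POSIX environment. The names `reportSignal`, `installSignalReporter` and `g_lastSignal` are hypothetical and are not the GEOS implementation; a real handler would also attach the stack trace and the rank, which this sketch leaves out.

```cpp
#include <cassert>
#include <signal.h>
#include <string.h>
#include <unistd.h>

// Last signal seen by the handler (hypothetical global for this sketch).
// sig_atomic_t is the type that is safe to write from a signal handler.
volatile sig_atomic_t g_lastSignal = 0;

// Handler: records the signal and emits a fixed message using write(),
// which is async-signal-safe (unlike printf or std::cerr).
extern "C" void reportSignal( int sig )
{
  g_lastSignal = sig;
  char const msg[] = "***** ERROR: signal received\n";
  ssize_t const written = write( STDERR_FILENO, msg, sizeof( msg ) - 1 );
  (void) written; // nothing sensible can be done on failure inside a handler
}

// Installs reportSignal for the given signal; returns true on success.
bool installSignalReporter( int sig )
{
  assert( sig > 0 ); // signal numbers are positive
  struct sigaction action;
  memset( &action, 0, sizeof( action ) );
  action.sa_handler = reportSignal;
  sigemptyset( &action.sa_mask );
  return sigaction( sig, &action, nullptr ) == 0;
}
```

Calling `installSignalReporter( SIGUSR1 )` and then `raise( SIGUSR1 )` runs the handler before `raise` returns, so `g_lastSignal` can be inspected right after.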

Without the pipe redirection proposed here, on many HPC platforms and on the CI we hit a lot of errors that come with no stack trace and only a minimal message (sometimes no message at all, just a truncated log). The branch has been tested and provided the missing information for many issues on P4.
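
To give an idea of what a pipe redirection of stderr looks like, here is a minimal sketch assuming POSIX `pipe` / `dup` / `dup2`. `StderrCapture` is a hypothetical helper written for this example, not the class added by the PR. It reads the pipe only at the end, so it only suits short messages; a real implementation would have to drain the pipe continuously to avoid blocking the writer once the pipe buffer is full.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <unistd.h>

// Hypothetical helper: routes file descriptor 2 (stderr) into a pipe so that
// anything written there, including by external libraries, can be read back.
class StderrCapture
{
public:
  StderrCapture()
  {
    int fds[2];
    int const rc = pipe( fds );
    assert( rc == 0 );
    (void) rc;
    m_readFd = fds[0];
    std::fflush( stderr );
    m_savedStderr = dup( STDERR_FILENO ); // keep the original stderr
    dup2( fds[1], STDERR_FILENO );        // fd 2 now points at the pipe
    close( fds[1] );                      // fd 2 is the only write end left
  }

  // Restores the original stderr and returns what was written meanwhile.
  // Must be called exactly once.
  std::string finish()
  {
    std::fflush( stderr );
    dup2( m_savedStderr, STDERR_FILENO ); // also closes the pipe write end
    close( m_savedStderr );
    std::string captured;
    char buffer[256];
    ssize_t n;
    while( ( n = read( m_readFd, buffer, sizeof( buffer ) ) ) > 0 )
    {
      captured.append( buffer, static_cast< std::size_t >( n ) );
    }
    close( m_readFd );
    return captured;
  }

private:
  int m_readFd;
  int m_savedStderr;
};
```

Once the text is captured this way, it can be written to the log or to a structured error file instead of depending on where stderr happens to go on a given platform.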

This work also lets us later tag each message with its originating dependency (system, LvArray, Hypre, ...) to quickly identify / filter the source of an issue.


(Replaces #3722)

… link between GEOS_THROW_CTX_IF and LVARRAY_THROW_IF_TEST( EXP, MSG, TYPE )
… in try/catch statements

Problem: Retrieves everything that was thrown, so not just the message.
…y spaces.

The previous condition checked whether an argument was present and whether the option was immediately followed by a value, as in -test"value", which excluded valid cases like -test "value" and -test     "value".
@paveltomin
Collaborator

@rrsettgast @wrtobin @paveltomin I need a review, this work will be useful for further debugging tasks

i more or less understand what is done here but don't really understand how it works, sorry
seems very low-level, can you share some examples - situation before and after ?

@MelReyCG
Contributor Author

MelReyCG commented Nov 5, 2025

@paveltomin thanks for the feedback, I'll work on the code clarity!

Collaborator

@wrtobin wrtobin left a comment

My only note is we might prefer ExternalSignalHandler just for specificity/clarity vs our existing error handling mechanisms, but not a hard requirement.

@MelReyCG
Contributor Author

MelReyCG commented Nov 20, 2025

I would like to keep the name general, as this component processes not only signals but also any message from external code (for example, VTK error / warning messages).
Some messages may appear as (non-crashing) errors for now, but we will be able to categorize them with this component (or a sub-component of it).
Thanks for the review

@MelReyCG MelReyCG merged commit 2c0879d into develop Nov 20, 2025
22 of 23 checks passed
@MelReyCG MelReyCG deleted the feature/rey/signal-and-external-error-managment-2 branch November 20, 2025 09:58

Labels

  • ci: run CUDA builds (allows triggering the costly CUDA jobs)
  • ci: run integrated tests (allows running the integrated tests in GEOS CI)
  • flag: ready for review
  • type: bug (something isn't working)

6 participants