@amissael95
Contributor

@amissael95 amissael95 commented Oct 28, 2025

Summary of changes

This pull request fixes the following issue:

When SIGSTOP is received while data is being written to tape, the position held in the LE cache can differ from the position reported by the drive. In that case, the index is not written at the position described by its self-pointer, and the start_block of subsequent files does not reflect the position where each file was actually written. Reading those files then fails with LTFS error LTFS11089E.

By making the following changes we prevent this scenario:

  1. Check the drive position before writing the index.
  2. Catch SIGCONT, set a flag, and perform the position check on writes.
  3. Display message LTFS17294I when SIGCONT is received.

Description

Motivation and context for each change.

  1. During the sync process, the ltfs code calls ltfs_write_index, so we avoid writing the index if, for any reason, the current position reported by LE is not the same as the real position reported by the drive.

  2. Some customers may not sync frequently, so we may need a quicker way to detect the issue. Because the SIGSTOP signal cannot be caught, the current method catches SIGCONT, sets a flag, and reads that flag in the tape_write function.

  3. It is helpful to explicitly log when the ltfs process receives a SIGCONT signal.
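
The mechanism in item 2 can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the names `resumed_after_stop`, `on_sigcont`, `install_sigcont_handler`, and `consume_resume_flag` are hypothetical, and it only assumes standard POSIX `sigaction`. The handler does nothing but set a flag, and the write path would poll that flag to decide whether a position check is due.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical flag name. It is written from a signal handler,
 * so it must be a volatile sig_atomic_t. */
static volatile sig_atomic_t resumed_after_stop = 0;

/* The handler only stores to the flag, which is async-signal-safe. */
static void on_sigcont(int signo)
{
    (void)signo;
    resumed_after_stop = 1;
}

/* Install the SIGCONT handler. Returns 0 on success, -1 on failure. */
int install_sigcont_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigcont;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART; /* restart interrupted system calls where possible */
    return sigaction(SIGCONT, &sa, NULL);
}

/* Returns 1 if a SIGCONT was seen since the last call, then clears the flag.
 * A write path would call this and run the position check when it returns 1. */
int consume_resume_flag(void)
{
    if (!resumed_after_stop)
        return 0;
    resumed_after_stop = 0;
    return 1;
}
```

If a second SIGCONT arrives between the test and the clear, the function still returns 1, so the pending check is not lost.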

Type of change

  • Bug fix

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have confirmed my fix is effective or that my feature works

@amissael95 amissael95 changed the title from "Prevent start_block misalignment caused by SIGSTOP" to "fix: start_block misalignment caused by SIGSTOP" Oct 28, 2025
@piste-jp
Member

piste-jp commented Oct 28, 2025

It doesn't make sense that a position mismatch happens after SIGSTOP/SIGCONT.

Please open an issue and show me a procedure to recreate it and a detailed scenario of how this problem happens.

@amissael95 amissael95 requested a review from syaoraang October 28, 2025 15:36
@madjesc

madjesc commented Oct 28, 2025

It doesn't make sense that a position mismatch happens after SIGSTOP/SIGCONT.

Please open an issue and show me a procedure to recreate it and a detailed scenario of how this problem happens.

It does happen.
I have code that can replicate this, but outside of ltfs.

The problem is more sg related. After the ioctl request is issued to the sg driver, if SIGSTOP arrives while ltfs is waiting for the response, the sg driver replicates the request, and after SIGCONT the program is not aware of this.

In a write loop like this:

int count = 0;
/* count only the writes the application believes it issued */
while (sg_write(...) > 0) {
  count++;
}
Let's say that we wrote 10 blocks. After the loop ends, the count variable is 10, but more than 10 blocks were written to the tape because of this.
After writing the extent, LTFS is not aware that the sg driver duplicated some blocks, so it's better if we check the position after the extent is written.
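
The check described above can be illustrated with a small simulation. This is only a sketch under assumptions: `struct mock_drive`, `mock_write`, and `write_extent_checked` are hypothetical names that stand in for a real drive and backend, and the duplicated write is injected by hand rather than produced by the sg driver.

```c
#include <stdint.h>

/* Hypothetical stand-in for a tape drive that tracks the real block position. */
struct mock_drive {
    uint64_t position;  /* blocks actually on the medium */
    int duplicate_next; /* when nonzero, the next write lands twice */
};

/* Simulated write of one block. Returns 1 on success, like the loop above expects. */
static int mock_write(struct mock_drive *d)
{
    d->position += d->duplicate_next ? 2 : 1;
    d->duplicate_next = 0;
    return 1;
}

/* Write n blocks, counting them in *expected, then compare the software's
 * expected position with the one the drive reports.
 * Returns 0 when they agree and -1 on a write failure or a mismatch. */
int write_extent_checked(struct mock_drive *d, uint64_t *expected, int n)
{
    for (int i = 0; i < n; i++) {
        if (mock_write(d) <= 0)
            return -1;
        (*expected)++;
    }
    return (*expected == d->position) ? 0 : -1;
}
```

With `duplicate_next` set before a 10-block extent, the counter advances by 10 while the simulated drive advances by 11, and the function reports the mismatch instead of letting a later index record a stale start_block.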

@amissael95
Contributor Author

amissael95 commented Oct 28, 2025

Hello @piste-jp, thanks for your comment.

We have opened an issue in RHEL regarding the sg driver behavior mentioned by @madjesc.
The issue is reached on LTFS after some retries of sending the SIGSTOP/SIGCONT signals while ltfs is writing data. It is more evident if you send SIGSTOP/SIGCONT while writing files whose length is not a multiple of 512 KiB, since they do not fill the last block. When those files are read, ltfs fails with LTFS11089E because the start_block in the index does not point to the real start block of the file.

LTFS11089E Cannot read: expected 524288 bytes from the medium, but received %u bytes.
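
To make the block arithmetic concrete, here is a small sketch, assuming the 512 KiB (524288-byte) block size that appears in the message above. The helper names `blocks_needed` and `last_block_bytes` are hypothetical and are not LTFS functions.

```c
#include <stdint.h>

#define BLOCK_SIZE 524288u /* 512 KiB, the size named in LTFS11089E above */

/* Number of tape blocks an extent of `length` bytes occupies. */
uint64_t blocks_needed(uint64_t length)
{
    return (length + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/* Bytes in the final block: BLOCK_SIZE when the length is a nonzero
 * exact multiple of the block size, otherwise the remainder. */
uint32_t last_block_bytes(uint64_t length)
{
    uint32_t rem = (uint32_t)(length % BLOCK_SIZE);
    return (rem == 0 && length > 0) ? BLOCK_SIZE : rem;
}
```

For example, a 1300000-byte file occupies three blocks of 524288, 524288, and 251424 bytes. If the recorded start_block is off because of an extra block on tape, a read that expects a full 524288-byte block can plausibly land on a short block like the last one, which matches the shape of the error above.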

We will open an issue in this repository showing this.

@piste-jp
Member

piste-jp commented Oct 29, 2025

We have opened an issue in RHEL regarding the sg driver behavior mentioned by @madjesc

I cannot open the link provided.

I will open an issue in this repository showing this.

Please do this soon, because
- A PR is a code change request; the problem itself should be discussed in an issue
- A PR should be linked to the issue it solves

@piste-jp
Member

piste-jp commented Oct 29, 2025

I feel this PR is a really bad idea.

  • I believe SIGSTOP is used only by developers for debugging
  • The code has a serious performance problem
  • The code affects not only this problem but also the wider environment
    • Please consider again whether this problem happens only on sg or on all other backends

So I strongly recommend that you open an issue to describe the problem first and start again from the first step.

I also strongly suggest closing this PR for now.

@madjesc

madjesc commented Oct 29, 2025

I feel this PR is a really bad idea.

  • I believe SIGSTOP is used only by developers for debugging

SystemTap also sends these signals. Users that have SystemTap can encounter this behavior.

  • The code has a serious performance problem

I don't see how a simple position check after the extent is written is gonna be a performance problem. Can you elaborate?

  • The code affects not only this problem but also the wider environment
    • Please consider again whether this problem happens only on sg or on all other backends

I also have a test with the lintape driver and it does not happen.

So I strongly recommend that you open an issue to describe the problem first and start again from the first step.
I also strongly suggest closing this PR for now.

Yes, I will open an issue with the detailed explanation and code to replicate it. But why close the PR? If we conclude in the issue that this is not needed, sure, we will close the issue.

4 participants