BrianJKoopman commented Jan 21, 2026

Description

This PR adds error handling for all database transactions within the run process. If an error occurs, the agent waits 5 seconds, then continues at the top of the process loop and retries all operations.

I believe this is a safe operation, but it'd be good to get a second opinion. It may reattempt a file transfer (if files were transferred, but couldn't be marked as such), but that should be fine.

I added errors_sqlite to the list of errors in the counters stat, stored in session data, so we can see how often this is happening.
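For readers who want a picture of the behavior described above, here is a minimal sketch of the wait-and-retry pattern. It is not the agent's actual code: `run_loop`, `step`, and the `counters` dict are hypothetical names for illustration, and catching `sqlite3.OperationalError` is an assumption based on the errors mentioned under Motivation and Context. Only the 5 second wait and the `errors_sqlite` counter name come from this PR.

```python
import sqlite3
import time

RETRY_WAIT_S = 5  # wait time stated in this PR description


def run_loop(step, counters, max_iterations, wait_s=RETRY_WAIT_S, sleep=time.sleep):
    """Call step() once per pass; on a SQLite error, count it, wait, and retry.

    `step` is a hypothetical stand-in for all database work done in one pass
    of the process loop. `counters` is a plain dict standing in for the
    counters stored in session data.
    """
    for _ in range(max_iterations):
        try:
            step()
        except sqlite3.OperationalError:
            # Record the failure so its frequency is visible later.
            counters["errors_sqlite"] = counters.get("errors_sqlite", 0) + 1
            sleep(wait_s)
            continue  # go back to the top of the loop and redo all operations
    return counters
```

Because the whole pass is retried, any step that already succeeded (such as a file transfer) may run again, which is why the operations need to be safe to repeat.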

I also added some comments, mainly to note whether a write operation happens within the called functions. (I had thought implementing #886 would help us here, but the operations are mostly writes. Implementing #886 would only help when a lock is hit during srfm.get_archive_stats(). That said, this is a relatively slow step, especially as the sqlite file grows.)

Motivation and Context

We've seen various OperationalError messages in the suprsync agent, which can occur at any point in the process that interacts with the database. This is because the Pysmurf Monitor agent also writes to the database file. This should fix the regularly occurring crashes in the suprsync agents on site.

Resolves #483.
Resolves #874.

How Has This Been Tested?

This branch was run on the E2E testing system. Timestreams were generated with the SMuRF file emulator and then manually added to the suprsync database.

(.venv) ocs@ocs3:/mnt/nfs/data/ocs3/temp_data$ suprsync add-local-files --db ./suprsync.db timestreams/17690/ timestreams
Adding 2 files to the add to ./suprsync.db from /mnt/nfs/data/ocs3/temp_data/timestreams/17690
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.47it/s]

The timestream suprsync agent then picked up and copied the files:

2026-01-21T21:35:55+0000 Creating timecode dirs for recent files.....
2026-01-21T21:35:55+0000 Finished creating tcdirs
2026-01-21T21:37:06+0000 Copying files:
2026-01-21T21:37:06+0000 - /mnt/nfs/data/ocs3/temp_data/timestreams/17690/emulator2/1769031165_000.g3
2026-01-21T21:37:06+0000 - /mnt/nfs/data/ocs3/temp_data/timestreams/17690/emulator2/1769031151_000.g3
2026-01-21T21:37:08+0000 Checksumming on remote.
2026-01-21T21:37:08+0000 Copy session complete.

I don't really have a good method for testing the database lock and the corresponding handling, but I'm satisfied that normal behavior works. Ideas for testing are certainly welcome.
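One possible way to provoke a lock in a test, offered as a sketch rather than something that was run against this branch: hold a write transaction open on one connection and attempt a write from a second connection with `timeout=0`. The `files` table, its columns, and `provoke_lock` are hypothetical and do not reflect the real suprsync schema; only standard-library `sqlite3` calls are used.

```python
import os
import sqlite3
import tempfile


def provoke_lock(db_path):
    """Hold a write lock on one connection and attempt a write on another.

    Returns the exception raised by the second connection, or None if the
    write unexpectedly succeeded.
    """
    # isolation_level=None puts the connection in autocommit mode, so the
    # transaction below is controlled explicitly with BEGIN/ROLLBACK.
    holder = sqlite3.connect(db_path, isolation_level=None)
    holder.execute("CREATE TABLE IF NOT EXISTS files (path TEXT, copied INTEGER)")
    holder.execute("BEGIN IMMEDIATE")  # take the write lock and keep it
    holder.execute("INSERT INTO files VALUES ('a.g3', 0)")

    # timeout=0 makes the second connection fail right away instead of waiting.
    other = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        other.execute("INSERT INTO files VALUES ('b.g3', 0)")
        return None
    except sqlite3.OperationalError as err:
        return err
    finally:
        other.close()
        holder.execute("ROLLBACK")
        holder.close()


with tempfile.TemporaryDirectory() as tmp:
    error = provoke_lock(os.path.join(tmp, "test.db"))
    print(type(error).__name__, error)
```

A test along these lines could hold the lock in a background thread or process while the agent's loop runs, then release it and check that the loop recovers and that `errors_sqlite` increased.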

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

Labels

agent: suprsync, bug

Development

Successfully merging this pull request may close these issues.

Suprsync: Locked database crashes main process
Synchronization issues with suprsync and E2E testing