[MSD-494][feat] driver tmcm: if device fails to reply, immediately retry#3418
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves resilience of the TMCM-6110 serial driver by detecting the “0 bytes received” transient failure mode and retrying communication before failing higher-level operations (e.g., referencing).
Changes:
- Refactors
SendInstructioninto a single-attempt helper (_send_instruction_once) plus a 0-byte recovery path that pingsGetVersionand retries once. - Adds unit tests covering the new 0-byte retry/ping behavior using the simulator and mocking.
- Updates the serial timeout comment to clarify expected device response time.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
src/odemis/driver/tmcm.py |
Adds 0-byte detection, ping-based recovery, and a one-time retry path for TMCL instructions. |
src/odemis/driver/test/tmcm_test.py |
Adds tests validating retry success/failure behavior and ensuring partial replies are not retried. |
| raise IOError("Failed to read from TMCM, restarted serial connection.") | ||
| if len(res) == 0: # Something went wrong, maybe the device discarded the message | ||
| raise TimeoutError("No bytes received after %s" % self._instr_to_str(msg)) | ||
| elif len(res) < 9: # Only part of a message was received, sign something ever more wrong |
There was a problem hiding this comment.
Spelling/grammar: “sign something ever more wrong” should be “signs something even more wrong” (or similar).
| elif len(res) < 9: # Only part of a message was received, sign something ever more wrong | |
| elif len(res) < 9: # Only part of a message was received, a sign something is even more wrong |
| def _make_mock_once(self, fail_non_ping_count: int, fail_ping_count: int = 0): | ||
| """ | ||
| Returns a replacement for _send_instruction_once that raises | ||
| IOError("Received only 0 bytes ...") for the first N calls of each |
There was a problem hiding this comment.
The _make_mock_once docstring says it raises IOError("Received only 0 bytes ..."), but the mock actually raises TimeoutError. Update the docstring to match the behavior to avoid confusion.
| IOError("Received only 0 bytes ...") for the first N calls of each | |
| TimeoutError("Received only 0 bytes ...") for the first N calls of each |
src/odemis/driver/tmcm.py
Outdated
| # Failure here should mean that the device didn't get the (complete) | ||
| # instruction, so it's safe to send the command again. | ||
| return self.SendInstruction(n, typ, mot, val) |
There was a problem hiding this comment.
_send_instruction_once is documented as having “no retry logic”, but on serial write IOError it calls SendInstruction(), which re-enters the retry/ping logic and introduces recursion. Consider keeping _send_instruction_once side-effect free (raise a dedicated exception or re-raise) and performing the reconnect/resend logic in SendInstruction using an explicit bounded loop to avoid recursion.
| # Failure here should mean that the device didn't get the (complete) | |
| # instruction, so it's safe to send the command again. | |
| return self.SendInstruction(n, typ, mot, val) | |
| # The caller is responsible for retrying the instruction if desired. | |
| raise IOError("Failed to send command to TMCM, restarted serial connection.") |
| :param mot: motor/bank number (0-255) | ||
| :param val: value to send (-2**31 to 2**31-1) | ||
| :returns: reply value | ||
| :raises IOError: if problem with sending/receiving data over the serial port |
There was a problem hiding this comment.
SendInstruction can still raise TimeoutError (e.g., if the post-ping retry receives 0 bytes), but the docstring only lists IOError/TMCLError. Please update the docstring (and/or explicitly wrap TimeoutError) so callers know what to expect.
| :raises IOError: if problem with sending/receiving data over the serial port | |
| :raises IOError: if problem with sending/receiving data over the serial port | |
| :raises TimeoutError: if the controller does not respond even after recovery attempts |
src/odemis/driver/tmcm.py
Outdated
| except IOError: # includes TimeoutError (which derive from IOError) | ||
| logging.warning("Ping attempt %d/%d failed after 0-byte response", | ||
| ping_attempt + 1, MAX_PING_ATTEMPTS) | ||
| else: | ||
| raise IOError(f"Device not responding to instructions") |
There was a problem hiding this comment.
When all ping attempts fail, this raises a generic IOError without including the last failure details. Capturing the last exception and chaining it (raise ... from last_exc) and/or including the instruction id and attempt count in the message would make diagnosis easier.
In some very rare cases, mainly during referencing, the TMCM-6110 fails
to reply to instructions. When this happens, typically the next or
second next instruction work again. It's not clear why this happens, but instead
of completely failing the action (eg, referencing), detect such
communication issue and retry.
Example error that this solves:
Traceback (most recent call last):
File "/home/meteor/development/odemis/src/odemis/driver/tmcm.py", line 2100, in _doReference
self._waitReferencing(aid) # block until it's over
File "/home/meteor/development/odemis/src/odemis/driver/tmcm.py", line 1604, in _waitReferencingStd
self._releaseRefSwitch(axis)
File "/home/meteor/development/odemis/src/odemis/driver/tmcm.py", line 1507, in _releaseRefSwitch
self.SetIO(2, refswitch, 0)
File "/home/meteor/development/odemis/src/odemis/driver/tmcm.py", line 1071, in SetIO
self.SendInstruction(14, port, bank, value)
File "/home/meteor/development/odemis/src/odemis/driver/tmcm.py", line 928, in SendInstruction
raise IOError("Received only %d bytes after %s" %
OSError: Received only 0 bytes after 8, 14, 0, 2, 0 (24)
9ee503b to
e5ca90c
Compare
📝 WalkthroughWalkthroughThe changes refactor the TMCL serial command handling by splitting the monolithic Sequence Diagram(s)sequenceDiagram
actor Caller
participant SI as SendInstruction()
participant SQ as _send_query()
participant RA as _receive_answer()
participant Dev as Device/Socket
Caller->>SI: SendInstruction(n, typ, mot, val)
SI->>SQ: _send_query(n, typ, mot, val)
SQ->>Dev: write 9-byte instruction
SQ-->>SI: return message buffer
SI->>RA: _receive_answer(n, msg)
RA->>Dev: read reply
RA-->>SI: return status (success)
SI-->>Caller: return status
rect rgba(255, 200, 100, 0.5)
Note over SI,Dev: Retry Path (on TimeoutError)
SI->>SI: attempt_count < limit
SI->>SQ: _send_query(136, 0, 0, 0) [ping]
SQ->>Dev: write GetVersion (ping)
SQ-->>SI: return msg
SI->>RA: _receive_answer(136, msg)
RA->>Dev: read ping reply
alt Ping succeeds
RA-->>SI: return status
else Ping timeout
SI->>SI: retry ping (up to 3 times)
end
SI->>SQ: _send_query(n, typ, mot, val) [retry original]
SQ->>Dev: write instruction (retry)
SQ-->>SI: return msg
SI->>RA: _receive_answer(n, msg)
RA->>Dev: read reply
RA-->>SI: return status
SI-->>Caller: return status/raise error
end
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/odemis/driver/test/tmcm_test.py (1)
162-166: Annotate and document the new fixture hooks.
setUp()andtearDown()are new functions in a Python file, but they still miss-> Noneand docstrings.Proposed fix
- def setUp(self): + def setUp(self) -> None: + """Create a simulator-backed controller for each retry test.""" self.dev = CLASS(**KWARGS_SIM) @@ - def tearDown(self): + def tearDown(self) -> None: + """Terminate the simulator-backed controller created in :meth:`setUp`.""" self.dev.terminate()As per coding guidelines "Always use type hints for function parameters and return types in Python code" and "Include docstrings for all functions and classes, following the reStructuredText style guide (without type information)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/driver/test/tmcm_test.py` around lines 162 - 166, Add explicit return type hints and reStructuredText-style docstrings to the new test fixture hooks: annotate both setUp(self) and tearDown(self) with "-> None" and add short docstrings explaining their roles (e.g., setUp creates the test device from CLASS using KWARGS_SIM and assigns it to self.dev; tearDown terminates/cleans up self.dev). Update the methods in src/odemis/driver/test/tmcm_test.py (setUp, tearDown, and references to CLASS/KWARGS_SIM/self.dev) to include these additions so they follow the project's typing and docstring guidelines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/odemis/driver/tmcm.py`:
- Around line 935-938: The current logic that treats a 0-byte read as a timeout
and then auto-resends the instruction (the branch guarding rn != n and the retry
path used by SendInstruction()) can double-apply non-idempotent commands like
MoveRelPos(); change SendInstruction() so it only retries automatically for a
small whitelist of safe/idempotent operations (e.g., read-only or
explicit-idempotent commands) or, alternatively, before resending attempt to
recover a late original reply from the incoming buffer/ping path (consume and
validate any pending reply matching the original sequence number rn) and only
resend if no valid late reply is found; update the rn != n handling and the
retry decision in SendInstruction() to implement one of these two behaviors and
ensure MoveRelPos() and other mutating commands are excluded from blind
auto-retry.
- Around line 876-897: The functions _send_query and _receive_answer use the
undefined string type annotation "Buffer"; replace that with the concrete
numpy.ndarray type (e.g. change return type of _send_query from "Buffer" to
numpy.ndarray and the msg parameter of _receive_answer from "Buffer" to
numpy.ndarray) so annotations are valid under Python 3.10+, keeping existing
numpy usage and without adding new unrelated imports; update both function
signatures and any other occurrences of "Buffer" in this module to
numpy.ndarray.
---
Nitpick comments:
In `@src/odemis/driver/test/tmcm_test.py`:
- Around line 162-166: Add explicit return type hints and reStructuredText-style
docstrings to the new test fixture hooks: annotate both setUp(self) and
tearDown(self) with "-> None" and add short docstrings explaining their roles
(e.g., setUp creates the test device from CLASS using KWARGS_SIM and assigns it
to self.dev; tearDown terminates/cleans up self.dev). Update the methods in
src/odemis/driver/test/tmcm_test.py (setUp, tearDown, and references to
CLASS/KWARGS_SIM/self.dev) to include these additions so they follow the
project's typing and docstring guidelines.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: acb04166-ca45-4387-9ad5-313e2ea8f634
📒 Files selected for processing (2)
src/odemis/driver/test/tmcm_test.pysrc/odemis/driver/tmcm.py
| def _send_query(self, n: int, typ: int, mot: int, val: int) -> "Buffer": | ||
| """ | ||
| Sends one query (instruction) without waiting for the reply and without retry logic. | ||
| The caller must already hold self._ser_access. | ||
|
|
||
| :param n: instruction ID (0-255) | ||
| :param typ: instruction type (0-255) | ||
| :param mot: motor/bank number (0-255) | ||
| :param val: value to send (-2**31 to 2**31-1) | ||
| :return: the message sent (9 bytes) | ||
| :raises IOError: if the serial write fails or the reply is incomplete | ||
| :raises TMCLError: if the device reports a bad status | ||
| """ | ||
| msg = numpy.empty(9, dtype=numpy.uint8) | ||
| struct.pack_into('>BBBBiB', msg, 0, self._target, n, typ, mot, val, 0) | ||
| msg[-1] = numpy.sum(msg[:-1], dtype=numpy.uint8) | ||
| logging.debug("Sending %s", self._instr_to_str(msg)) | ||
| self._serial.write(msg) | ||
| self._serial.flush() # Wait for the data to be full sent (to detect any connection error) | ||
| return msg | ||
|
|
||
| # One possible reason is that the device disappeared because the | ||
| # cable was pulled out, or the power got cut (unlikely, as it's | ||
| # powered via 2 sources). | ||
| def _receive_answer(self, n: int, msg: "Buffer") -> int: |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's check the imports at the top of the file
head -50 src/odemis/driver/tmcm.pyRepository: delmic/odemis
Length of output: 2084
🏁 Script executed:
# Search for 'Buffer' definition or import
rg "Buffer" src/odemis/driver/tmcm.py -nRepository: delmic/odemis
Length of output: 202
🏁 Script executed:
# Look at the specific lines mentioned (876-897)
sed -n '870,905p' src/odemis/driver/tmcm.pyRepository: delmic/odemis
Length of output: 1714
🏁 Script executed:
# Check Python version support and typing imports
rg "from typing|import typing|from collections.abc" src/odemis/driver/tmcm.pyRepository: delmic/odemis
Length of output: 95
🏁 Script executed:
# Check if numpy is imported
rg "import numpy" src/odemis/driver/tmcm.py -n | head -5Repository: delmic/odemis
Length of output: 72
🏁 Script executed:
# Verify the full context around line 876-897 to ensure numpy.ndarray is the right type
sed -n '1,100p' src/odemis/driver/tmcm.py | rg "import numpy"Repository: delmic/odemis
Length of output: 69
🏁 Script executed:
# Search for any 'Buffer' definition anywhere in the codebase
rg "^Buffer\s*=" src/odemis/driver/tmcm.pyRepository: delmic/odemis
Length of output: 39
🏁 Script executed:
# Check if Buffer is defined as a type alias elsewhere in the file
rg "Buffer.*=" src/odemis/driver/tmcm.pyRepository: delmic/odemis
Length of output: 39
🏁 Script executed:
# Verify the actual return value and parameter type by examining how these functions are used
rg "_send_query\|_receive_answer" src/odemis/driver/tmcm.py -A 2 | head -40Repository: delmic/odemis
Length of output: 39
🏁 Script executed:
# Check if there are any type aliases defined at module level
sed -n '50,150p' src/odemis/driver/tmcm.py | grep -E "^[A-Za-z_].*\s*=" | head -20Repository: delmic/odemis
Length of output: 720
🏁 Script executed:
# Look at what _send_query actually returns
sed -n '876,897p' src/odemis/driver/tmcm.py | tail -5Repository: delmic/odemis
Length of output: 268
🏁 Script executed:
# Check if there's a 'Buffer' in typing or collections.abc for Python 3.10
python3 <<'EOF'
# Check Python 3.10+ typing module for Buffer
import sys
print(f"Python version: {sys.version}")
# Check if Buffer exists in typing module
try:
from typing import Buffer
print("Buffer found in typing module")
except ImportError:
print("Buffer NOT in typing module")
# Check if it's in collections.abc
try:
from collections.abc import Buffer
print("Buffer found in collections.abc")
except ImportError:
print("Buffer NOT in collections.abc")
# Check numpy.ndarray
import numpy
print(f"numpy.ndarray exists: {hasattr(numpy, 'ndarray')}")
EOFRepository: delmic/odemis
Length of output: 207
🏁 Script executed:
# Search for any other "Buffer" references in the file
rg '"Buffer"' src/odemis/driver/tmcm.pyRepository: delmic/odemis
Length of output: 194
🏁 Script executed:
# Check what Python version this file targets by looking for any setup.py or pyproject.toml
find . -maxdepth 2 -name "setup.py" -o -name "pyproject.toml" | xargs grep -l "python_requires\|requires-python" 2>/dev/nullRepository: delmic/odemis
Length of output: 41
Replace undefined "Buffer" type annotation with numpy.ndarray.
Buffer is not imported and is not a valid type in Python 3.10 or later. The functions return and accept numpy.ndarray objects, so use that concrete type instead of the undefined string annotation.
Proposed fix
- def _send_query(self, n: int, typ: int, mot: int, val: int) -> "Buffer":
+ def _send_query(self, n: int, typ: int, mot: int, val: int) -> numpy.ndarray:
@@
- def _receive_answer(self, n: int, msg: "Buffer") -> int:
+ def _receive_answer(self, n: int, msg: numpy.ndarray) -> int:This violates the coding guideline "Ensure code is valid for Python 3.10 and above" and "Always use type hints for function parameters and return types in Python code".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _send_query(self, n: int, typ: int, mot: int, val: int) -> "Buffer": | |
| """ | |
| Sends one query (instruction) without waiting for the reply and without retry logic. | |
| The caller must already hold self._ser_access. | |
| :param n: instruction ID (0-255) | |
| :param typ: instruction type (0-255) | |
| :param mot: motor/bank number (0-255) | |
| :param val: value to send (-2**31 to 2**31-1) | |
| :return: the message sent (9 bytes) | |
| :raises IOError: if the serial write fails or the reply is incomplete | |
| :raises TMCLError: if the device reports a bad status | |
| """ | |
| msg = numpy.empty(9, dtype=numpy.uint8) | |
| struct.pack_into('>BBBBiB', msg, 0, self._target, n, typ, mot, val, 0) | |
| msg[-1] = numpy.sum(msg[:-1], dtype=numpy.uint8) | |
| logging.debug("Sending %s", self._instr_to_str(msg)) | |
| self._serial.write(msg) | |
| self._serial.flush() # Wait for the data to be full sent (to detect any connection error) | |
| return msg | |
| # One possible reason is that the device disappeared because the | |
| # cable was pulled out, or the power got cut (unlikely, as it's | |
| # powered via 2 sources). | |
| def _receive_answer(self, n: int, msg: "Buffer") -> int: | |
| def _send_query(self, n: int, typ: int, mot: int, val: int) -> numpy.ndarray: | |
| """ | |
| Sends one query (instruction) without waiting for the reply and without retry logic. | |
| The caller must already hold self._ser_access. | |
| :param n: instruction ID (0-255) | |
| :param typ: instruction type (0-255) | |
| :param mot: motor/bank number (0-255) | |
| :param val: value to send (-2**31 to 2**31-1) | |
| :return: the message sent (9 bytes) | |
| :raises IOError: if the serial write fails or the reply is incomplete | |
| :raises TMCLError: if the device reports a bad status | |
| """ | |
| msg = numpy.empty(9, dtype=numpy.uint8) | |
| struct.pack_into('>BBBBiB', msg, 0, self._target, n, typ, mot, val, 0) | |
| msg[-1] = numpy.sum(msg[:-1], dtype=numpy.uint8) | |
| logging.debug("Sending %s", self._instr_to_str(msg)) | |
| self._serial.write(msg) | |
| self._serial.flush() # Wait for the data to be full sent (to detect any connection error) | |
| return msg | |
| def _receive_answer(self, n: int, msg: numpy.ndarray) -> int: |
🧰 Tools
🪛 Ruff (0.15.7)
[error] 876-876: Undefined name Buffer
(F821)
[error] 897-897: Undefined name Buffer
(F821)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/odemis/driver/tmcm.py` around lines 876 - 897, The functions _send_query
and _receive_answer use the undefined string type annotation "Buffer"; replace
that with the concrete numpy.ndarray type (e.g. change return type of
_send_query from "Buffer" to numpy.ndarray and the msg parameter of
_receive_answer from "Buffer" to numpy.ndarray) so annotations are valid under
Python 3.10+, keeping existing numpy usage and without adding new unrelated
imports; update both function signatures and any other occurrences of "Buffer"
in this module to numpy.ndarray.
| if rn != n: | ||
| logging.info("Skipping a message about instruction %d (waiting for %d)", | ||
| rn, n) | ||
| continue |
There was a problem hiding this comment.
Don't auto-resend every timed-out instruction.
A 0-byte read here only proves that nothing arrived before the 100 ms serial timeout at Line 2306, not that the controller discarded the command. If the original reply is merely late, the ping path will consume it through the rn != n skip logic and then resend the instruction anyway. Since SendInstruction() also backs non-idempotent operations like MoveRelPos(), that can double-apply hardware actions. Please limit the retry to an explicit safe/idempotent subset, or recover the late original reply before reissuing the command.
Also applies to: 992-1005
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/odemis/driver/tmcm.py` around lines 935 - 938, The current logic that
treats a 0-byte read as a timeout and then auto-resends the instruction (the
branch guarding rn != n and the retry path used by SendInstruction()) can
double-apply non-idempotent commands like MoveRelPos(); change SendInstruction()
so it only retries automatically for a small whitelist of safe/idempotent
operations (e.g., read-only or explicit-idempotent commands) or, alternatively,
before resending attempt to recover a late original reply from the incoming
buffer/ping path (consume and validate any pending reply matching the original
sequence number rn) and only resend if no valid late reply is found; update the
rn != n handling and the retry decision in SendInstruction() to implement one of
these two behaviors and ensure MoveRelPos() and other mutating commands are
excluded from blind auto-retry.
In some very rare cases, mainly during referencing, the TMCM-6110 fails
to reply to instructions. When this happens, typically the next or
second next instruction work again. It's not clear why this happens, but instead
of completely failing the action (eg, referencing), detect such
communication issue and retry.
Example error that this solves: