Skip to content

Fix check_ssl_connection() function#866

Merged
badrogger merged 3 commits into
skalenetwork:developfrom
kucharskim:develop
Jun 24, 2025
Merged

Fix check_ssl_connection() function#866
badrogger merged 3 commits into
skalenetwork:developfrom
kucharskim:develop

Conversation

@kucharskim

@kucharskim kucharskim commented Jun 17, 2025

Copy link
Copy Markdown
Contributor

Function doesn't wait for openssl s_client ... to finish. It assumes that when the command is still running that is the successful condition. However the function should wait for exit code from the binary. We saw in production intermittent and very often skale ssl upload failures. This change should fix this problem and underlying race condition.

Related to:

@kucharskim

Copy link
Copy Markdown
Contributor Author

There sill maybe an issue in node_cli/core/ssl/check.py with other functions, which are on the client side, but I don't know how to test those code paths, so I prefer not to touch them.

Comment thread node_cli/core/ssl/check.py Outdated
Comment on lines +205 to +206
for _ in range(10):
code = dp.poll()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may use also dp.wait(timeout) instead of dp.pool() and a for loop.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please. It is a little more clear.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will do dp.wait(timeout) then.

Comment thread node_cli/core/ssl/check.py Outdated
@@ -202,7 +202,15 @@ def check_ssl_connection(host, port, silent=False):
expose_output = not silent
with detached_subprocess(ssl_check_cmd, expose_output=expose_output) as dp:
time.sleep(1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sleep is not needed anymore.

Comment thread node_cli/core/ssl/check.py Outdated
Comment on lines +205 to +206
for _ in range(10):
code = dp.poll()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please. It is a little more clear.

@kucharskim kucharskim marked this pull request as draft June 23, 2025 18:56
raise SSLHealthcheckError('OpenSSL connection verification failed')
timeout = 20
try:
dp.wait(timeout=timeout)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This consistently times out :/

Based on https://docs.python.org/3/library/subprocess.html#subprocess.Popen.wait I suspect:

Note This will deadlock when using stdout=PIPE or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use Popen.communicate() when using pipes to avoid that.

@kucharskim kucharskim Jun 23, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading now https://docs.python.org/3/library/subprocess.html#subprocess.Popen.communicate

communicate() returns a tuple (stdout_data, stderr_data).

and looking how def detached_subprocess() is defined in core/ssl/utils.py, that it is p.stdout.read() and grabs the output, I am inclined to actually lave the loop and dp.poll() as it was.

@kucharskim kucharskim Jun 23, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With dp.wait() it consistently throws subprocess.TimeoutExpired and openssl s_client -connect 127.0.0.1:4536 -verify_return_error -verify 2 is not finishing, but it is running.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm.. it times out for me now with dp.poll() as well, but it worked before 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think there were two problems and another fix was stdin=subprocess.DEVNULL in utils.py

Function doesn't wait for `openssl s_client ...` to finish.
It assumes that when the command is still running that is the
successful condition. However the function should wait for
exit code from the binary. We saw in production intermittent
and very often `skale ssl upload` failures. This change
should fix this problem and underlying race condition.
Replace for loop and dp.poll() with more straightforward
dp.wait() with a timeout, as requested during diff review.
@kucharskim kucharskim marked this pull request as ready for review June 23, 2025 20:36
@kucharskim kucharskim requested a review from badrogger June 23, 2025 20:36
Redirect the child's standard input to subprocess.DEVNULL,
so it starts with no stdin attached. This prevents the OpenSSL
health-check process from reading from, or blocking on, the
parent's terminal or execution environment stdin stream.
@kucharskim

Copy link
Copy Markdown
Contributor Author

With all 3 commits, command python main.py ssl upload -f -k <key.pem> -c <fullchain.pem> finally works reliably on our machines.

@kucharskim

Copy link
Copy Markdown
Contributor Author

Tested my pull req interactively and via systemd unit file - both work fine.

@badrogger badrogger merged commit c48113f into skalenetwork:develop Jun 24, 2025
1 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

skale ssl upload fails intermittently, often skale ssl check broken on Ubuntu 22.04

3 participants