nirs
Contributor

@nirs nirs commented Sep 19, 2025

This change makes driver version checking stricter. Previously we failed only if the driver was not found or running "driver version" failed. For all other errors we logged a warning and used the currently installed driver. With this change all errors are fatal and you must have a valid driver to use the kvm or hyperkit driver.

Checking the driver version is now stricter and more correct. We separate the driver's stdout and stderr, parse the output using a yaml parser, and provide more detailed logging and error messages.

The tests for extracting the driver version from "driver version" output were replaced with a test that runs the driver version command and parses the version.
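
The parsing and validation step could look roughly like the sketch below. It is not the code in this PR: the `Version` type and `parseVersion` name are hypothetical, the `version:`/`commit:` output format is an assumption, and a tiny `key: value` line parser stands in for a real yaml library to keep the example dependency-free.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Version holds the two fields expected in "driver version" output.
// The type and field names are illustrative, not the PR's actual code.
type Version struct {
	Version string
	Commit  string
}

// parseVersion reads "key: value" lines and requires both fields to be set.
// A real implementation would hand the output to a yaml library instead.
func parseVersion(output string) (Version, error) {
	var v Version
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		key, value, found := strings.Cut(scanner.Text(), ":")
		if !found {
			continue // ignore lines that are not key: value pairs
		}
		switch strings.TrimSpace(key) {
		case "version":
			v.Version = strings.TrimSpace(value)
		case "commit":
			v.Commit = strings.TrimSpace(value)
		}
	}
	if err := scanner.Err(); err != nil {
		return Version{}, fmt.Errorf("failed to read driver version output: %w", err)
	}
	if v.Version == "" {
		return Version{}, fmt.Errorf("driver version output %q has no version", output)
	}
	if v.Commit == "" {
		return Version{}, fmt.Errorf("driver version output %q has no commit", output)
	}
	return v, nil
}

func main() {
	v, err := parseVersion("version: v1.37.0\ncommit: 1af8bdc072232de4b1fec3b6cc0e8337e118bc83\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", v)
}
```

Treating a missing version or commit as an error, rather than a warning, is what makes every malformed output fatal.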

This change makes it easy to validate the driver commit hash for addressing #21582.

Based on #21597 for testing to avoid the vfkit test failures.

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 19, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nirs
Once this PR has been reviewed and has the lgtm label, please assign prezha for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 19, 2025
@nirs nirs force-pushed the auxdriver-version branch 4 times, most recently from 74d6c94 to 0d8a2c2 Compare September 19, 2025 20:09
Rename the test after the function it tests, and change the failure
message to match the function name.
Make struct fields and helper function names more readable.
We complained that "driver --version" failed but the real command is
"driver version".
Previously we failed only if we could not find the driver in the PATH,
or running "driver version" failed, assuming that an old driver may
work. Hoping that using an invalid driver that does not return a version
will work is not a good error handling strategy.

Now we will fail loudly, helping the user to fix the installation, or
failing tests that run with an invalid driver.
Previously we had special treatment for driver not found, and driver
version failed. Since we treat all errors as fatal errors now, we don't
need the special errors.

Previously we logged the same errors at least twice; once in
validateDriver() and then later in the callers. This creates more noise
in the log and makes it harder to debug issues.

All errors are wrapped now with more context using the modern way (%w)
instead of the legacy errors.Wrap(). The context was improved to
describe the issue better.

Handling of the special errors was also not idiomatic and not thread
safe; we modified the global Err* variables at the time of the error.
These special variables are removed now.
Rename arguments and temporary variables to make the code clearer.
Add a driverVersion helper and auxdriver.Version type for parsing driver
version yaml output. The helper reads the output of "driver version",
parses the yaml, and validates that both version and commit are set.

This will make it possible to validate the driver commit hash during the
tests to ensure we test the driver built from the current code.

We log or include in the error message both the driver version and
commit hash, to make it easier to debug issues related to using the
wrong driver.

This change fixes a possible issue if the "driver version" command logs
errors or warnings. Previously this could break the code parsing the
version since we combined stdout and stderr. Now we extract stderr from
the command ExitError on errors.
@nirs nirs force-pushed the auxdriver-version branch from 0d8a2c2 to 532dacb Compare September 19, 2025 21:49
@nirs
Contributor Author

nirs commented Sep 19, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Sep 19, 2025
@k8s-ci-robot
Contributor

@nirs: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-minikube-integration | 532dacb | link | true | /test pull-minikube-integration |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@nirs
Contributor Author

nirs commented Sep 19, 2025

Testing with functional tests

Started functional tests in one shell

env MINIKUBE_HOME=/tmp TEST_ARGS="'-minikube-start-args=--driver=kvm'" make functional

Looking at the logs in another shell:

$ MINIKUBE_HOME=/tmp minikube logs
...
I0920 00:57:34.743726  141622 start.go:304] selected driver: kvm2
I0920 00:57:34.743738  141622 start.go:918] validating driver "kvm2" against <nil>
I0920 00:57:34.743765  141622 start.go:929] status for kvm2: {Installed:true Healthy:true Running:true NeedsImprovement:false Error:<nil> Reason: Fix: Doc: Version:}
I0920 00:57:34.746915  141622 install.go:51] acquiring lock: {Name:mk900956b073697a4aa6c80a27c6bb0742a99a53 Clock:{} Delay:500ms Timeout:10m0s Cancel:<nil>}
I0920 00:57:34.747205  141622 install.go:123] Validating docker-machine-driver-kvm2, PATH=/tmp/.minikube/bin:/home/nsoffer/go/pkg/mod/golang.org/[email protected]/bin:/home/nsoffer/bin:/home/nsoffer/.krew/bin:/home/nsoffer/sdk/go1.23.2/bin:/home/nsoffer/go/bin:/home/nsoffer/.local/bin:/home/nsoffer/bin:/usr/local/bin:/usr/bin
W0920 00:57:34.747320  141622 install.go:61] docker-machine-driver-kvm2: failed to find driver "docker-machine-driver-kvm2": exec: "docker-machine-driver-kvm2": executable file not found in $PATH
I0920 00:57:34.747443  141622 out.go:179] * Downloading driver docker-machine-driver-kvm2:
I0920 00:57:34.747571  141622 download.go:108] Downloading: https://github.com/kubernetes/minikube/releases/download/v1.37.0/docker-machine-driver-kvm2-amd64?checksum=file:https://github.com/kubernetes/minikube/releases/download/v1.37.0/docker-machine-driver-kvm2-amd64.sha256 -> /tmp/.minikube/bin/docker-machine-driver-kvm2
I0920 00:57:38.135430  141622 install.go:123] Validating docker-machine-driver-kvm2, PATH=/tmp/.minikube/bin:/home/nsoffer/go/pkg/mod/golang.org/[email protected]/bin:/home/nsoffer/bin:/home/nsoffer/.krew/bin:/home/nsoffer/sdk/go1.23.2/bin:/home/nsoffer/go/bin:/home/nsoffer/.local/bin:/home/nsoffer/bin:/usr/local/bin:/usr/bin
I0920 00:57:38.166004  141622 install.go:134] /tmp/.minikube/bin/docker-machine-driver-kvm2 version is {Version:v1.37.0 Commit:1af8bdc072232de4b1fec3b6cc0e8337e118bc83}

We used:

out/minikube start -p functional-945141 --memory=4096 --apiserver-port=8441 --wait=all --driver=kvm --auto-update-drivers=false

But the driver was downloaded during the test!

$ tree /tmp/.minikube/bin/
/tmp/.minikube/bin/
└── docker-machine-driver-kvm2

I would expect that with --auto-update-drivers=false the driver will not be downloaded and the tests will fail very early.

This seems to be the issue - we install the driver if it does not exist even if autoUpdate is false:

	if !exists || (err != nil && autoUpdate) {
		klog.Warningf("%s: %v", executable, err)
		path = filepath.Join(directory, executable)
		if err := download.Driver(executable, path, v); err != nil {
			return err
		}
	}

The driver must not be installed when autoUpdate is false.

@medyagh medyagh changed the title from "Require valid driver version" to "Require valid aux driver version" Sep 19, 2025
exit.Error(reason.DrvAuxNotHealthy, "Aux driver"+driverName, err)
} //if failed to update but not a fatal error, log it and continue (old version might still work)
out.WarningT("Unable to update {{.driver}} driver: {{.error}}", out.V{"driver": driverName, "error": err})
exit.Error(reason.DrvAuxNotHealthy, fmt.Sprintf("Auxiliary driver %q", driverName), err)
Member

@medyagh medyagh Sep 19, 2025

we barely ever make a new kvm aux driver, many releases can just work with the old one,
plus many embedded users (Cloud Code users, for example) download the minikube aux drivers once, and some of them need root (for hyperkit) and they prefer to give this root permission once to that binary and not be forced to get a new aux version on every single minikube update.

it is too much to exit when we can run. a warning is fine. no need to disrupt people when not needed.

Contributor Author

The kvm driver is built from minikube source. The docker-machine-driver-kvm2 is just a small wrapper around minikube code. It does not make sense to run an old driver from a previous release with a new minikube since we don't know if it will work or not. Hoping that an old driver will work is not a valid release engineering strategy.

On a system with a package manager (dnf, apt) this driver should be installed by the package manager in the standard location in the same way minikube is installed. This will ensure that we always run with the right driver tested by the CI. There is no need to install a driver dynamically when software is managed by a package manager.

When minikube is installed by downloading the minikube executable, installing the driver on the first use is a nice feature, but we want to make sure the driver is the driver released with minikube.

If the driver version command does not produce the expected output, something is very wrong and it is better to fail early and loudly, helping the user to fix the issue with minimal debugging.

If the user has an old driver and minikube fails to download the driver, the user can download it manually in the same way they downloaded minikube itself. If we think that downloading the driver is not reliable enough, we can provide a tarball with minikube and the driver to avoid the unreliable automatic install. If we think the automatic install is reliable enough, there is no reason to continue if minikube cannot install the right driver.

Hyperkit is deprecated and should be removed in minikube 1.38 (#21601) so we can ignore it.

Collaborator

You could consider converting the kvm driver from an external driver into an internal driver, then when hyperkit is gone there wouldn't be any more "aux" drivers and you wouldn't need the extra build... The virsh command is already required by the driver, and most communication with libvirt is done through XML anyway...

That is, replace the libvirt-go calls with the matching exec.Command - like with all other minikube drivers
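
As a rough illustration of that suggestion, the sketch below builds a `virsh` invocation with `exec.Command`. The `virshCommand` helper, the connection URI, and the domain name are assumptions for the example, not minikube code; the command is only constructed and printed here, not run.

```go
package main

import (
	"fmt"
	"os/exec"
)

// virshCommand builds a virsh invocation against the given libvirt URI.
// Hypothetical helper: it only constructs the command; the caller decides
// whether to run it and how to handle its output.
func virshCommand(uri string, args ...string) *exec.Cmd {
	return exec.Command("virsh", append([]string{"--connect", uri}, args...)...)
}

func main() {
	// Query the state of a domain named "minikube" (example name).
	cmd := virshCommand("qemu:///system", "domstate", "minikube")
	fmt.Println(cmd.Args)
}
```
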

Contributor Author

Using virsh instead of the libvirt api is not great since it is not designed for machines. But given the complexity and trouble caused by auxiliary drivers it sounds like a good plan.

@minikube-pr-bot

kvm2 driver with docker runtime

┌────────────────┬──────────┬────────────────────────┐
│    COMMAND     │ MINIKUBE │ MINIKUBE  ( PR 21594 ) │
├────────────────┼──────────┼────────────────────────┤
│ minikube start │ 44.1s    │ 44.4s                  │
│ enable ingress │ 16.0s    │ 16.7s                  │
└────────────────┴──────────┴────────────────────────┘

Times for minikube start: 45.8s 41.1s 45.0s 46.1s 42.6s
Times for minikube (PR 21594) start: 44.1s 42.7s 45.1s 44.7s 45.2s

Times for minikube ingress: 16.3s 15.8s 16.3s 15.8s 15.8s
Times for minikube (PR 21594) ingress: 19.8s 16.3s 15.8s 15.8s 15.8s

docker driver with docker runtime

┌────────────────┬──────────┬────────────────────────┐
│    COMMAND     │ MINIKUBE │ MINIKUBE  ( PR 21594 ) │
├────────────────┼──────────┼────────────────────────┤
│ minikube start │ 22.6s    │ 21.9s                  │
│ enable ingress │ 12.4s    │ 12.0s                  │
└────────────────┴──────────┴────────────────────────┘

Times for minikube start: 25.9s 22.2s 21.9s 20.7s 22.6s
Times for minikube (PR 21594) start: 22.4s 21.8s 20.8s 22.4s 22.1s

Times for minikube ingress: 13.6s 13.6s 10.6s 13.6s 10.6s
Times for minikube (PR 21594) ingress: 12.6s 12.6s 10.6s 10.6s 13.6s

docker driver with containerd runtime

┌────────────────┬──────────┬────────────────────────┐
│    COMMAND     │ MINIKUBE │ MINIKUBE  ( PR 21594 ) │
├────────────────┼──────────┼────────────────────────┤
│ minikube start │ 20.7s    │ 20.6s                  │
│ enable ingress │ 26.5s    │ 29.7s                  │
└────────────────┴──────────┴────────────────────────┘

Times for minikube start: 20.1s 20.7s 22.7s 19.7s 20.3s
Times for minikube (PR 21594) start: 20.4s 20.5s 22.1s 20.4s 19.7s

Times for minikube ingress: 23.1s 39.1s 23.1s 23.1s 24.1s
Times for minikube (PR 21594) ingress: 39.1s 23.1s 24.1s 39.1s 23.1s

@minikube-pr-bot

Here are the top 10 failed tests in each environment with the lowest flake rate.

| Environment | Test Name | Flake Rate |
| --- | --- | --- |
| Docker_Linux_crio_arm64 (7 failed) | TestAddons/serial/GCPAuth/FakeCredentials (gopogh) | 0.00% (chart) |
| Docker_Linux_crio_arm64 (7 failed) | TestFunctional/parallel/ServiceCmdConnect (gopogh) | 0.00% (chart) |
| Docker_Linux_crio_arm64 (7 failed) | TestFunctional/parallel/ServiceCmd/DeployApp (gopogh) | 0.00% (chart) |
| Docker_Linux_crio_arm64 (7 failed) | TestFunctional/parallel/ServiceCmd/HTTPS (gopogh) | 0.00% (chart) |
| Docker_Linux_crio_arm64 (7 failed) | TestFunctional/parallel/ServiceCmd/Format (gopogh) | 0.00% (chart) |
| Docker_Linux_crio_arm64 (7 failed) | TestFunctional/parallel/ServiceCmd/URL (gopogh) | 0.00% (chart) |

Besides the following environments also have failed tests:

To see the flake rates of all tests by environment, click here.

@nirs
Contributor Author

nirs commented Sep 20, 2025

/cc @prezha
/cc @afbjorklund

@nirs
Contributor Author

nirs commented Sep 23, 2025

I think this is the wrong direction - fixing the aux driver is hard, and we don't have a reason to use an aux driver for libvirt. I'll work on #21618 instead.

@nirs
Contributor Author

nirs commented Sep 23, 2025

Replaced by #21625

@nirs nirs closed this Sep 23, 2025
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

5 participants