Improving the versions.json rebuild process #51

@ararslan

Currently, versions.json is built by taking the list of tags on the Julia repository, a list of known platforms, and the file types we expect to exist for each platform, and then trying to download every combination of version, platform, and file type. If a combination is downloadable, we record some information about it; otherwise we skip it. At the end of the build, we unconditionally upload the resulting file into "production."
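To make the cost of this approach concrete, here is a minimal Python sketch of the probing loop described above. It is an illustration under stated assumptions, not the repository's actual code: the function names, the shape of `file_types`, and the `probe` callback (standing in for "try to download the file and record its details") are all hypothetical.

```python
from itertools import product


def candidate_downloads(versions, platforms, file_types):
    """Yield every (version, platform, extension) combination to probe.

    file_types maps a platform to the extensions expected for it
    (hypothetical structure, chosen for illustration).
    """
    for version, platform in product(versions, platforms):
        for ext in file_types.get(platform, ()):
            yield version, platform, ext


def build_index(candidates, probe):
    """Keep only the candidates for which probe(...) returns metadata.

    probe stands in for the real download attempt; it returns a dict of
    recorded information, or None when the file does not exist.
    """
    index = {}
    for version, platform, ext in candidates:
        info = probe(version, platform, ext)
        if info is not None:
            index.setdefault(version, []).append(info)
    return index


if __name__ == "__main__":
    versions = ["1.10.0", "1.11.0"]
    platforms = ["x86_64-linux-gnu", "aarch64-apple-darwin"]
    file_types = {
        "x86_64-linux-gnu": ["tar.gz"],
        "aarch64-apple-darwin": ["tar.gz", "dmg"],
    }

    def fake_probe(version, platform, ext):
        # Pretend every file exists; a real probe would hit the network.
        return {"platform": platform, "extension": ext}

    index = build_index(candidate_downloads(versions, platforms, file_types), fake_probe)
    print({version: len(files) for version, files in index.items()})
```

The number of probes is the product of versions, platforms, and file types, so every new tag adds a full set of download attempts regardless of whether anything changed.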

There are a few limitations with this implementation:

  1. The process is very slow: my most recent runs as of this writing took 3.5 hours each, though earlier runs had usually been closer to 1.5 hours (which is still a long time). Build time will keep growing with each new version that's added.
  2. It ignores the fact that platform support varies across versions. For example, we don't need to look for aarch64-apple-darwin until 1.7, or x86_64-unknown-freebsd until 0.6. While it's possible for someone to retroactively build binaries for a platform, that's historically been quite rare in practice.
  3. There is no opportunity to review the result prior to deployment. The 1.11.7 issue (see Incorrect S3 ACL #49) would have been caught in a review of versions.json before it went live and wreaked havoc; instead, it wasn't caught until the JuliaUp versiondb was built.
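As an illustration of the second point, a per-platform "first supported version" table would let the build skip combinations that cannot exist. The sketch below is hypothetical: the table name, the helper functions, and the tag format are assumptions, and only the two cutoffs come from the examples above.

```python
# First minor release in which each platform is expected to have binaries.
# Only these two entries come from the discussion above; a real table would
# need an entry for every known platform.
FIRST_SUPPORTED = {
    "aarch64-apple-darwin": (1, 7),
    "x86_64-unknown-freebsd": (0, 6),
}


def parse_version(tag):
    """Turn a tag such as 'v1.7.0' or '1.7.0-rc1' into a (major, minor) tuple."""
    core = tag.lstrip("v").split("-")[0].split("+")[0]
    major, minor = core.split(".")[:2]
    return int(major), int(minor)


def platforms_for(tag, all_platforms):
    """Return the platforms worth probing for the given tag.

    Platforms without an entry in FIRST_SUPPORTED are always included.
    """
    major_minor = parse_version(tag)
    return [p for p in all_platforms if major_minor >= FIRST_SUPPORTED.get(p, (0, 0))]


if __name__ == "__main__":
    everything = ["x86_64-linux-gnu", "aarch64-apple-darwin", "x86_64-unknown-freebsd"]
    for tag in ("v0.5.2", "v1.6.7", "v1.7.0"):
        print(tag, platforms_for(tag, everything))
```

A table like this would not cover retroactively built binaries, which is one reason an explicit, reviewable list of expected files may be the more robust option.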

I think all three of these limitations are solvable simultaneously, but doing so would require significant changes both to the implementation and to how one interacts with this repository.

To start, note the description of the current implementation that exists in the code:

# We're going to collect the combinatorial explosion of version/os-arch possible downloads.
# We don't have a nice, neat list of what is or is not available, and so we're just going to
# try and download each file, and if it exists, yay. Otherwise, bleh.

> We don't have a nice, neat list of what is or is not available

We didn't at the time, but we do now: it's versions.json! We could use the latest versions.json as a starting point when building a new one, so we would only need to know what has changed.

The way I propose we go about this is as follows:

  • Create a human-friendly representation of the core data from versions.json (versions, platforms, files) that's checked into this repository. I know everybody hates YAML, but for the sake of example, say it's a file called expected-versions.yaml.
  • When a new version is created, part of the release process becomes submitting a PR to this repository to make the corresponding changes to expected-versions.yaml, including the version, its platforms, and the files associated with each platform for that version. Adding newly expected platform binaries for an existing version or set of versions would go through the same process.
  • A new versions.json is built in the PR, using only the changes made in that PR, and a diff against the currently live versions.json is displayed. (It could be shown in an auto-posted comment, for example; any way of getting insight into the effects of the change should be fine.)
  • When the PR is merged, the versions.json built as part of the PR is deployed. This would require enabling the repository option that requires PR branches to be up to date with the base branch before merging.
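The "using only the changes made in that PR" step could look roughly like the following Python sketch. It assumes the expected-versions file has already been parsed into a nested mapping of version to platform to file extensions; that layout, and every function name here, is a hypothetical choice for illustration rather than a proposed format.

```python
def flatten(manifest):
    """Flatten {version: {platform: [ext, ...]}} into a set of tuples."""
    return {
        (version, platform, ext)
        for version, platforms in manifest.items()
        for platform, exts in platforms.items()
        for ext in exts
    }


def manifest_changes(old, new):
    """Return (added, removed) entries between two expected-file manifests."""
    old_set, new_set = flatten(old), flatten(new)
    return sorted(new_set - old_set), sorted(old_set - new_set)


def format_diff(added, removed):
    """Render the changes as diff-style lines, e.g. for a PR comment."""
    lines = [f"+ {v} {p} {e}" for v, p, e in added]
    lines += [f"- {v} {p} {e}" for v, p, e in removed]
    return "\n".join(lines)


if __name__ == "__main__":
    base = {"1.11.6": {"x86_64-linux-gnu": ["tar.gz"]}}
    pr = {
        "1.11.6": {"x86_64-linux-gnu": ["tar.gz"]},
        "1.11.7": {
            "x86_64-linux-gnu": ["tar.gz"],
            "aarch64-apple-darwin": ["tar.gz", "dmg"],
        },
    }
    added, removed = manifest_changes(base, pr)
    print(format_diff(added, removed))
```

Only the `added` entries would need to be fetched and checksummed; everything else could be carried over from the live versions.json unchanged.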

This setup would also provide a concrete set of expectations that can be checked via CI prior to deployment to ensure correctness in the result.
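One way such a CI check might work is to compare the built file against the expectations and fail on any mismatch. The sketch below assumes a built structure of the form `{version: {"files": [{"triplet": ..., "extension": ...}]}}` and the same nested expectations mapping of version to platform to extensions; both shapes are assumptions made for the example and may not match the real versions.json schema.

```python
def check_expectations(expected, built):
    """Return a list of human-readable mismatches; an empty list means success.

    expected: {version: {triplet: [extension, ...]}}        (assumed layout)
    built:    {version: {"files": [{"triplet", "extension"}]}} (assumed layout)
    """
    problems = []
    for version, platforms in expected.items():
        files = built.get(version, {}).get("files", [])
        present = {(f["triplet"], f["extension"]) for f in files}
        wanted = {(p, e) for p, exts in platforms.items() for e in exts}
        for triplet, ext in sorted(wanted - present):
            problems.append(f"missing: {version} {triplet} {ext}")
        for triplet, ext in sorted(present - wanted):
            problems.append(f"unexpected: {version} {triplet} {ext}")
    # Versions that were built but never declared are also suspicious.
    for version in sorted(set(built) - set(expected)):
        problems.append(f"unexpected version: {version}")
    return problems


if __name__ == "__main__":
    expected = {"1.11.0": {"x86_64-linux-gnu": ["tar.gz"]}}
    built = {"1.11.0": {"files": []}, "1.12.0": {"files": []}}
    for problem in check_expectations(expected, built):
        print(problem)
```

A CI job could exit non-zero whenever the returned list is non-empty, which would block the merge and therefore the deployment.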

There is one additional implementation detail that I think we could take advantage of: our binaries are hosted on AWS S3. S3 exposes object metadata that can be queried to see whether changes were made outside of this workflow that need to be reflected in versions.json. For example, we could compare ETags (equal to the MD5 hash for files not uploaded in parts), or even just the last modified date, and only redo checksumming if something has changed. We could also use the checksum files that are built as part of the release process rather than needing to download the file.
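A rough sketch of the metadata comparison follows. It deliberately avoids any AWS client library: `recorded` and `current` are plain dicts that a real implementation would fill from a HEAD request on each object, and the key names and function names are assumptions for illustration. Note that an ETag is only a plain MD5 digest for some objects; multipart uploads produce ETags with a `-<part count>` suffix, and some server-side encryption modes also yield non-MD5 ETags, so the sketch treats the ETag purely as a change indicator unless it looks like a plain digest.

```python
import hashlib


def is_multipart_etag(etag):
    """Multipart-upload ETags carry a '-<part count>' suffix and are not MD5s."""
    return "-" in etag.strip('"')


def needs_rechecksum(recorded, current):
    """Decide whether an object must be downloaded and checksummed again.

    recorded: metadata stored during the previous build, or None if unseen.
    current:  metadata just read from the bucket.
    Both are dicts with 'etag' and 'last_modified' keys (assumed layout).
    """
    if recorded is None:
        return True
    return (
        recorded["etag"] != current["etag"]
        or recorded["last_modified"] != current["last_modified"]
    )


def etag_matches_bytes(etag, data):
    """Compare an ETag against the MD5 of data.

    Returns None when the ETag is a multipart ETag and cannot be compared.
    """
    etag = etag.strip('"')
    if is_multipart_etag(etag):
        return None
    return hashlib.md5(data).hexdigest() == etag


if __name__ == "__main__":
    payload = b"example tarball bytes"
    etag = '"' + hashlib.md5(payload).hexdigest() + '"'
    before = {"etag": etag, "last_modified": "2024-01-01T00:00:00Z"}
    after = {"etag": etag, "last_modified": "2024-01-01T00:00:00Z"}
    print(needs_rechecksum(before, after))
    print(etag_matches_bytes(etag, payload))
```

The metadata comparison is cheap because it requires no download, so it could run for every known file on each build and trigger the expensive checksum step only for objects that actually changed.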
