Dagger
data/ houses all the data, both canonical and derived. Most of this is .gitignored, except *.meta.* files, which generally contain SHA hashes.
data/core is the canonical data. It's not stored in this repo, since it's ~5-10GB (it's also not available under the same license as the code). Once you have the core data, you can verify that it matches my copy by running ./check-core-integrity.sh. If you need to update the core (e.g., to add files or swap files in/out), you can do that with ./create-core-integrity-data.sh. Then, commit the updated integrity data to this repo!
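For illustration only: assuming the *.meta.* files are sha256sum-style manifests of `<hex hash> <relative path>` lines (an assumption; the scripts define the real format), the check boils down to something like this sketch:

```typescript
import { createHash } from "node:crypto";
import { readFile } from "node:fs/promises";

// Verify data/core against a sha256 manifest. This is just the idea behind
// check-core-integrity.sh, not its implementation; the *.meta.* format here
// is assumed.
async function verifyManifest(manifestPath: string, rootDir: string): Promise<void> {
  const manifest = await readFile(manifestPath, "utf8");
  for (const line of manifest.split("\n")) {
    const m = line.match(/^([0-9a-f]{64})\s+(.+)$/);
    if (!m) continue; // skip blanks and anything that isn't a hash line
    const [, expected, relPath] = m;
    const actual = createHash("sha256")
      .update(await readFile(`${rootDir}/${relPath}`))
      .digest("hex");
    if (actual !== expected) throw new Error(`hash mismatch: ${relPath}`);
  }
  console.log("core integrity OK");
}
```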
./build.sh will build the full object storage dump that needs to go into a content-addressed bucket. Buckets should probably be treated as append-only; don't delete old files, since they might still be referenced by someone's install somewhere.
All objects are stored purely by their sha256 hash, with a two-character prefix as the folder (similar to git object folders) and the remaining 62 hex characters as the filename. The one 'exception', all-courses.json, isn't really an object and maybe won't live in the same 'cas' (content-addressed store) folder; it serves as an index and needs to be at a stable URL. Everything else (both course data JSON files and course audio files) is indexed by something higher up in the hierarchy. The browser directory contains a simple page that gives a human-readable view of the store.
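To make the layout concrete, a minimal sketch (casPath and casPathForBytes are illustrative helper names, not part of the repo):

```typescript
import { createHash } from "node:crypto";

// Map a sha256 hex digest to its CAS-relative path, git-style:
// 2-char folder + 62-char filename.
function casPath(sha256Hex: string): string {
  if (!/^[0-9a-f]{64}$/.test(sha256Hex)) {
    throw new Error(`not a sha256 hex digest: ${sha256Hex}`);
  }
  return `${sha256Hex.slice(0, 2)}/${sha256Hex.slice(2)}`;
}

// Where a blob of object bytes lives in the store.
function casPathForBytes(bytes: Buffer): string {
  return casPath(createHash("sha256").update(bytes).digest("hex"));
}
```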
Not all steps are guaranteed to be deterministic. We mostly rely on the Dagger cache to keep remuxes/transcodes/etc. around, but we also materialize the cache in case certain steps (e.g., ffmpeg) aren't deterministic. This matters because it would be bad if we lost/pruned our Dagger cache, had to re-encode files, got slightly different output, and as a result busted everyone's local downloads after generating new metadata. Ideally, of course, everything would be deterministic and we could just derive everything from the core directory, but in practice I'm not super confident in this (even though I'm getting identical runs on my local machine, so far). This caching mechanism is pretty awkward and Dagger doesn't love it, but I've gotten it to comply begrudgingly.
Before or after building, run ./build-cache.sh; this materializes the cached steps in a directory (whose size is some multiple of the rest of the data/ dir). It should run reasonably quickly if you've already built and the Dagger cache is big enough to hold the intermediate steps. If not, it's good to run ./build.sh after building the cache (at that point it should be reasonably quick), since build-cache.sh writes the cached results to disk outside the Dagger cache, and those results will be used instead of re-transcoding from scratch, even if the Dagger cache is pruned. We essentially have our own 'shadow cache' running in parallel that survives Dagger evictions.
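For intuition, the shadow cache amounts to something like the following sketch (shadowCached, cacheDir, and stepName are hypothetical names and layout; the actual scripts drive this through Dagger):

```typescript
import { createHash } from "node:crypto";
import { existsSync } from "node:fs";
import { mkdir, readFile, writeFile } from "node:fs/promises";

// Key each expensive step's output by a hash of its input, check the
// materialized directory first, and only fall back to recomputing.
async function shadowCached(
  cacheDir: string,
  stepName: string,
  input: Buffer,
  run: () => Promise<Buffer>, // e.g. an ffmpeg remux/transcode
): Promise<Buffer> {
  const key = createHash("sha256").update(stepName).update(input).digest("hex");
  const path = `${cacheDir}/${stepName}/${key}`;
  if (existsSync(path)) return readFile(path); // hit: survives Dagger pruning
  const out = await run(); // miss: do the work once...
  await mkdir(`${cacheDir}/${stepName}`, { recursive: true });
  await writeFile(path, out); // ...and persist it outside the Dagger cache
  return out;
}
```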
build-for-language.sh and build-cache-for-language.sh shouldn't be necessary, though they can help if you want to test a smaller subset of the data. I wrote these scripts when I was worried I wouldn't get Dagger to behave when building everything at once; it wasn't working great for a while there.
To upload new objects to the bucket (append-only, per the above; `--ignore-existing` leaves old files intact):

```
data/courses $ rclone copy . lt-r2:lt-app-cas/ --ignore-existing
```
- Build output is a flat CAS rooted at a base URL (defaults to https://downloads.languagetransfer.org/cas). Objects are named by SHA-256; the internal storage layout is `<prefix>/<rest>`, where `prefix` is the first 2 hex chars and `rest` is the remaining 62. Server responses are `application/octet-stream`; clients should rely on the pointer metadata for the real MIME (`mp4` or `json`). `all-courses.json` sits at the CAS root (not hashed) and is the entry point: `{ buildVersion: 2, casBaseURL, courses: [ { id, meta, lessons } ] }`. `meta` is a file pointer.
- File pointers: fields `{ _type: "file", object: string, filesize: number, mimeType: string }`, where `object` is the SHA-256 string (no slashes). Consumers fetch at `${casBaseURL}/${object}` and infer the MIME from the pointer. Requests should use the full SHA-256 hash; this is redirected to the `<prefix>/<rest>` form.
- Per-course meta files live in CAS under their hash and should be interpreted as JSON. Shape: `{ buildVersion: 2, lessons: [ { id, title, duration, variants: { hq: FilePointer, lq: FilePointer } } ] }`. Lesson `id` is `<courseId><index+1>`; titles default to `Lesson N`.
- Media pipeline: each lesson track from `data/core/courses/<id>/tracks` is remuxed to a metadata-free HQ mp4 and transcoded to an LQ AAC mono mp4; both variants are hashed, placed in CAS, and referenced from the course meta. (These shapes and pipeline steps are sketched below.)
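The shapes above, as hedged TypeScript types plus a small consumer sketch (the helper name and the type of the index's `lessons` field are my guesses, not from the build):

```typescript
type FilePointer = {
  _type: "file";
  object: string; // full sha256 hex, no slashes
  filesize: number;
  mimeType: string; // the real MIME; the server always serves octet-stream
};

type CourseMeta = {
  buildVersion: 2;
  lessons: {
    id: string; // "<courseId><index+1>"
    title: string; // defaults to "Lesson N"
    duration: number;
    variants: { hq: FilePointer; lq: FilePointer };
  }[];
};

type AllCourses = {
  buildVersion: 2;
  casBaseURL: string;
  courses: { id: string; meta: FilePointer; lessons: unknown }[]; // `lessons` shape not specified above
};

// Fetch an object by pointer. The full-hash URL is fine; the server
// redirects it to the prefix/rest layout internally.
async function fetchObject(casBaseURL: string, ptr: FilePointer): Promise<ArrayBuffer> {
  const res = await fetch(`${casBaseURL}/${ptr.object}`);
  if (!res.ok) throw new Error(`CAS fetch failed: ${res.status}`);
  return res.arrayBuffer();
}
```

And a sketch of the two per-lesson ffmpeg invocations (the exact flags and the LQ bitrate are assumptions, not read from the pipeline):

```typescript
import { execFileSync } from "node:child_process";

// HQ: remux with streams copied and container metadata stripped.
function remuxHQ(input: string, output: string): void {
  execFileSync("ffmpeg", ["-i", input, "-map_metadata", "-1", "-c", "copy", output]);
}

// LQ: re-encode to mono AAC (bitrate chosen arbitrarily for this sketch).
function transcodeLQ(input: string, output: string): void {
  execFileSync("ffmpeg", ["-i", input, "-map_metadata", "-1", "-ac", "1", "-c:a", "aac", "-b:a", "48k", output]);
}
```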