Dagger
data/ houses all the data, both canonical and derived. Most of this is .gitignored, except *.meta.* files, which generally contain SHA hashes.
data/core is the canonical data. It's not stored in this repo, since it's ~5-10GB (it's also not available under the same license as the code). Once you have the core data, you can verify that it matches my copy by running ./check-core-integrity.sh. If you need to update the core (e.g., to add files or swap files in/out), you can do that with ./create-core-integrity-data.sh. Then, commit the updated integrity data to this repo!
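For illustration only: assuming the *.meta.* files are sha256sum-style manifests of `<hex hash> <relative path>` lines (an assumption; the scripts define the real format), the check boils down to something like this sketch:

```typescript
import { createHash } from "node:crypto";
import { readFile } from "node:fs/promises";

// Verify data/core against a sha256 manifest. This is just the idea behind
// check-core-integrity.sh, not its implementation; the *.meta.* format here
// is assumed.
async function verifyManifest(manifestPath: string, rootDir: string): Promise<void> {
  const manifest = await readFile(manifestPath, "utf8");
  for (const line of manifest.split("\n")) {
    const m = line.match(/^([0-9a-f]{64})\s+(.+)$/);
    if (!m) continue; // skip blanks and anything that isn't a hash line
    const [, expected, relPath] = m;
    const actual = createHash("sha256")
      .update(await readFile(`${rootDir}/${relPath}`))
      .digest("hex");
    if (actual !== expected) throw new Error(`hash mismatch: ${relPath}`);
  }
  console.log("core integrity OK");
}
```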
./build.sh will build the full object storage dump that needs to go into a content-addressed bucket. Buckets should probably be treated as append-only; don't delete old files, since they might still be referenced by someone's install somewhere.
All objects are stored purely by their sha256 hash, with a two-character prefix as the folder (similar to git object folders) and the remaining 62 hex characters as the filename. The one 'exception', all-courses.json, isn't really an object and maybe won't live in the same 'cas' (content-addressed store) folder; it serves as an index and needs to be at a stable URL. Everything else (both course data JSON files and course audio files) is indexed by something higher up in the hierarchy. The browser directory contains a simple page that gives a human-readable view of the store.
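To make the layout concrete, a minimal sketch (casPath and casPathForBytes are illustrative helper names, not part of the repo):

```typescript
import { createHash } from "node:crypto";

// Map a sha256 hex digest to its CAS-relative path, git-style:
// 2-char folder + 62-char filename.
function casPath(sha256Hex: string): string {
  if (!/^[0-9a-f]{64}$/.test(sha256Hex)) {
    throw new Error(`not a sha256 hex digest: ${sha256Hex}`);
  }
  return `${sha256Hex.slice(0, 2)}/${sha256Hex.slice(2)}`;
}

// Where a blob of object bytes lives in the store.
function casPathForBytes(bytes: Buffer): string {
  return casPath(createHash("sha256").update(bytes).digest("hex"));
}
```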
Not all steps are guaranteed to be deterministic. We mostly rely on the Dagger cache to keep remuxes/transcodes/etc. around, but we also materialize the cache in case certain steps (e.g., ffmpeg) aren't deterministic. This matters because it would be bad if we lost/pruned our Dagger cache, had to re-encode files, got slightly different output, and as a result busted everyone's local downloads after generating new metadata. Ideally, of course, everything would be deterministic and we could just derive everything from the core directory, but in practice I'm not super confident in this (even though I'm getting identical runs on my local machine, so far). This caching mechanism is pretty awkward and Dagger doesn't love it, but I've gotten it to comply begrudgingly.
Before or after building, run ./build-cache.sh; this materializes the cached steps in a directory (whose size is some multiple of the rest of the data/ dir). It should run reasonably quickly if you've already built and the Dagger cache is big enough to hold the intermediate steps. If not, it's good to run ./build.sh after building the cache (at that point it should be reasonably quick), since build-cache.sh writes the cached results to disk outside the Dagger cache, and those results will be used instead of re-transcoding from scratch, even if the Dagger cache is pruned. We essentially have our own 'shadow cache' running in parallel that survives Dagger evictions.
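For intuition, the shadow cache amounts to something like the following sketch (shadowCached, cacheDir, and stepName are hypothetical names and layout; the actual scripts drive this through Dagger):

```typescript
import { createHash } from "node:crypto";
import { existsSync } from "node:fs";
import { mkdir, readFile, writeFile } from "node:fs/promises";

// Key each expensive step's output by a hash of its input, check the
// materialized directory first, and only fall back to recomputing.
async function shadowCached(
  cacheDir: string,
  stepName: string,
  input: Buffer,
  run: () => Promise<Buffer>, // e.g. an ffmpeg remux/transcode
): Promise<Buffer> {
  const key = createHash("sha256").update(stepName).update(input).digest("hex");
  const path = `${cacheDir}/${stepName}/${key}`;
  if (existsSync(path)) return readFile(path); // hit: survives Dagger pruning
  const out = await run(); // miss: do the work once...
  await mkdir(`${cacheDir}/${stepName}`, { recursive: true });
  await writeFile(path, out); // ...and persist it outside the Dagger cache
  return out;
}
```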
build-for-language.sh and build-cache-for-language.sh shouldn't be necessary, though they can help if you want to test a smaller subset of the data. I wrote these scripts when I was worried I wouldn't get Dagger to behave when building everything at once; it wasn't working great for a while there.
To upload new objects to the bucket (append-only, per the above; `--ignore-existing` leaves old files intact):

```
data/courses $ rclone copy . lt-r2:lt-app-cas/ --ignore-existing
```
- Build output is a flat CAS rooted at a base URL (defaults to https://downloads.languagetransfer.org/cas). Objects are named by SHA-256; the internal storage layout is `<prefix>/<rest>`, where `prefix` is the first 2 hex chars and `rest` is the remaining 62. Server responses are `application/octet-stream`; clients should rely on the pointer metadata for the real MIME (`mp4` or `json`). `all-courses.json` sits at the CAS root (not hashed) and is the entry point: `{ buildVersion: 2, casBaseURL, courses: [ { id, meta, lessons } ] }`. `meta` is a file pointer.
- File pointers: fields `{ _type: "file", object: string, filesize: number, mimeType: string }`, where `object` is the SHA-256 string (no slashes). Consumers fetch at `${casBaseURL}/${object}` and infer the MIME from the pointer. Requests should use the full SHA-256 hash; this is redirected to the `<prefix>/<rest>` form.
- Per-course meta files live in CAS under their hash and should be interpreted as JSON. Shape: `{ buildVersion: 2, lessons: [ { id, title, duration, variants: { hq: FilePointer, lq: FilePointer } } ] }`. Lesson `id` is `<courseId><index+1>`; titles default to `Lesson N`.
- Media pipeline: each lesson track from `data/core/courses/<id>/tracks` is remuxed to a metadata-free HQ mp4 and transcoded to an LQ AAC mono mp4; both variants are hashed, placed in CAS, and referenced from the course meta. (These shapes and pipeline steps are sketched below.)
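The shapes above, as hedged TypeScript types plus a small consumer sketch (the helper name and the type of the index's `lessons` field are my guesses, not from the build):

```typescript
type FilePointer = {
  _type: "file";
  object: string; // full sha256 hex, no slashes
  filesize: number;
  mimeType: string; // the real MIME; the server always serves octet-stream
};

type CourseMeta = {
  buildVersion: 2;
  lessons: {
    id: string; // "<courseId><index+1>"
    title: string; // defaults to "Lesson N"
    duration: number;
    variants: { hq: FilePointer; lq: FilePointer };
  }[];
};

type AllCourses = {
  buildVersion: 2;
  casBaseURL: string;
  courses: { id: string; meta: FilePointer; lessons: unknown }[]; // `lessons` shape not specified above
};

// Fetch an object by pointer. The full-hash URL is fine; the server
// redirects it to the prefix/rest layout internally.
async function fetchObject(casBaseURL: string, ptr: FilePointer): Promise<ArrayBuffer> {
  const res = await fetch(`${casBaseURL}/${ptr.object}`);
  if (!res.ok) throw new Error(`CAS fetch failed: ${res.status}`);
  return res.arrayBuffer();
}
```

And a sketch of the two per-lesson ffmpeg invocations (the exact flags and the LQ bitrate are assumptions, not read from the pipeline):

```typescript
import { execFileSync } from "node:child_process";

// HQ: remux with streams copied and container metadata stripped.
function remuxHQ(input: string, output: string): void {
  execFileSync("ffmpeg", ["-i", input, "-map_metadata", "-1", "-c", "copy", output]);
}

// LQ: re-encode to mono AAC (bitrate chosen arbitrarily for this sketch).
function transcodeLQ(input: string, output: string): void {
  execFileSync("ffmpeg", ["-i", input, "-map_metadata", "-1", "-ac", "1", "-c:a", "aac", "-b:a", "48k", output]);
}
```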