diff --git a/.gitignore b/.gitignore index a60f863..aff0cf1 100644 --- a/.gitignore +++ b/.gitignore @@ -1,26 +1,25 @@ -node_modules/ -dist/ -.datasets/ -.gcr/ -public/data/ + +# Site-specific generated files (produced by generate-data per SITE_ID) +*.gem +*.log +*.tgz *.tsbuildinfo .DS_Store +.datasets/ .env -.env.local .env.* - -# Site-specific generated files (produced by generate-data per SITE_ID) +.env.local +.gcr/ +.idea/ +.vscode/ +TODO* +TODO.update-browser/ +coverage/ +dist/ +node_modules/ +public/data/ public/datasets.json +public/logos/ public/routing.json public/site-config.json -public/logos/ - -TODO* site-configs.yml -TODO.update-browser/ -*.gem -coverage/ -*.log -*.tgz -.idea/ -.vscode/ diff --git a/CLAUDE.md b/CLAUDE.md index 3bb64ae..65f2f12 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -16,7 +16,7 @@ Glossarist Concept Browser (`@glossarist/concept-browser`) — a Vue 3 SPA that - Run a single test: `npx vitest run src/__tests__/graph.test.ts` - `npm run fetch-datasets` — Clone/update source repos into `.datasets/`, harmonize concepts to canonical format. Supports `DATASET_SOURCE_{ID}` env var for local path override. - `npm run generate-data` — Convert harmonized YAML concepts to JSON-LD. Reads from `.datasets/` (populated by fetch-datasets) and `datasets.yml`. 
-- `node scripts/build-edges.js` — Pre-compute cross-reference edges from generated concept JSON files (run after `generate-data`) +- `node scripts/build-edges.js` — Pre-compute cross-reference and domain edges from generated concept JSON files, writes `edges.json` + `domain-nodes.json` (run after `generate-data`) - `npm run build:full` — Full pipeline: fetch + generate + build-edges + build - `npx concept-browser ` — CLI: fetch, generate, edges, build @@ -32,7 +32,7 @@ All datasets are harmonized to ONE canonical YAML format before `generate-data.m The target architecture uses GCR (Glossarist Concept Repository) files — sealed ZIP archives with harmonized concepts + metadata, modeled after LXR from `lutaml-xsd`. See `docs/gcr-spec.md`. Currently, the browser reads from cloned repos; when the glossarist gem provides `glossarist package`, the pipeline will switch to consuming `.gcr` files. ### Data Flow -`public/datasets.json` → lists dataset IDs → each maps to `public/data/{id}/` containing `manifest.json`, `index.json`, `edges.json`, and `concepts/*.json`. The `AdapterFactory` discovers datasets at startup, loads manifests and indexes, then concepts are fetched on-demand when a user navigates to one. +`public/datasets.json` → lists dataset IDs → each maps to `public/data/{id}/` containing `manifest.json`, `index.json`, `edges.json` (cross-reference + domain edges), `domain-nodes.json` (domain classification nodes with concept counts), and `concepts/*.json`. The `AdapterFactory` discovers datasets at startup, loads manifests and indexes, then concepts are fetched on-demand when a user navigates to one. 
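As a rough illustration of the layout described in the Data Flow paragraph, the per-dataset file locations can be derived from a dataset ID. This is a minimal sketch, not the browser's actual `AdapterFactory` code: the helper name `datasetPaths`, the `base` parameter, and the assumption that each concept file is named after its concept ID are all hypothetical.

```javascript
// Hypothetical helper (not part of the repository): build the file locations
// listed in the Data Flow description for one dataset ID.
// `base` is an assumed parameter pointing at the generated data root.
function datasetPaths(id, base = 'public/data') {
  const root = `${base}/${id}`;
  return {
    manifest: `${root}/manifest.json`,        // dataset metadata
    index: `${root}/index.json`,              // concept listing
    edges: `${root}/edges.json`,              // cross-reference + domain edges
    domainNodes: `${root}/domain-nodes.json`, // domain classification nodes
    // Assumption: one JSON document per concept, named after the concept ID.
    concept: (conceptId) =>
      `${root}/concepts/${encodeURIComponent(conceptId)}.json`,
  };
}
```

Keeping the paths in one place means a new generated file such as `domain-nodes.json` only needs to be added here, under the assumptions above.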
### Key Layers diff --git a/README.md b/README.md index c2cd1d2..f49e208 100644 --- a/README.md +++ b/README.md @@ -57,9 +57,10 @@ datasets.yml └─> public/data/{id}/ ├── manifest.json Dataset metadata ├── index.json Concept listing (chunked for large sets) - ├── edges.json Pre-computed cross-references + ├── edges.json Pre-computed cross-reference + domain edges + ├── domain-nodes.json Domain classification nodes └── concepts/*.json Individual concept documents - └─> scripts/build-edges.js (extract graph edges) + └─> scripts/build-edges.js (extract graph + domain edges) ``` ### Step-by-step diff --git a/TODO.generalized/01-canonical-concept-format.md b/TODO.generalized/01-canonical-concept-format.md deleted file mode 100644 index 7d568d7..0000000 --- a/TODO.generalized/01-canonical-concept-format.md +++ /dev/null @@ -1,71 +0,0 @@ -# Status: DONE - -# 01 — Canonical Concept Format Specification - -## Context - -All glossarist datasets currently use slightly different YAML formats (IEV bare strings, Geolexica arrays, osgeo `authoritative_source`). The browser must not handle format variants — all datasets must conform to ONE canonical format before the browser sees them. - -## Task - -Create `docs/dataset-schema.md` defining the canonical concept YAML format and the harmonization rules. - -### Canonical concept YAML - -```yaml -termid: "102-01-01" # string, unique within dataset -term: equality # convenience: preferred English term -eng: # language block (ISO 639-2 code) - terms: # REQUIRED, at least 1 - - type: expression # expression | symbol | abbreviation - designation: equality - normative_status: preferred # preferred | deprecated | admitted - gender: f # optional - plurality: singular # optional - usage_info: Mathematik # optional - definition: # ALWAYS array of {content: "..."} objects - - content: "relation between two entities..." 
- notes: # optional, array of strings - - "Note 1 content" - examples: # optional, array of strings - - "Example 1" - language_code: eng - entry_status: valid # valid | superseded | withdrawn | draft - sources: # ALWAYS array (normalize singular forms) - - type: authoritative # authoritative | lineage - origin: - ref: ISO 1087-1:2000 - clause: "3.4.16" - link: https://www.iso.org/standard/20057.html - dates: # ALWAYS array of {type, date} - - type: accepted - date: "2008-08-01T00:00:00+00:00" - review_date: "2024-01-01" - review_decision_date: "2024-01-01" - review_decision_event: published -``` - -### Harmonization rules - -| Variant | Source format | Harmonized to | -|---------|--------------|---------------| -| Definition | bare string `"text"` | `[{content: "text"}]` | -| Definition | `[{content: "text"}]` | unchanged | -| Sources | `authoritative_source: {link: "..."}` | `sources: [{type: authoritative, origin: {link: "..."}}]` | -| Sources | `sources: [{type, origin}]` | unchanged | -| Sources | absent (IEV) | absent (kept absent) | -| Dates | `date_accepted: "..."` scalar | `dates: [{type: accepted, date: "..."}]` | -| Dates | `dates: [{type, date}]` array | unchanged | -| Entry status | `"Standard"` | `"valid"` | -| Notes | bare strings | bare strings (kept) | -| Terms | `abbrev: true` (osgeo) | `type: abbreviation` | -| `_revisions` | present (isotc211) | **stripped** | - -## Files - -- Create: `docs/dataset-schema.md` - -## Verification - -- Document exists, covers all fields, lists all harmonization rules -- Cross-referenced by GCR spec and adding-a-dataset doc diff --git a/TODO.generalized/02-gcr-packaging-format.md b/TODO.generalized/02-gcr-packaging-format.md deleted file mode 100644 index 81abbe7..0000000 --- a/TODO.generalized/02-gcr-packaging-format.md +++ /dev/null @@ -1,85 +0,0 @@ -# Status: DONE - -# 02 — GCR Packaging Format Specification - -## Context - -Modeled after LXR from `lutaml-xsd`. 
A sealed `.gcr` ZIP file bundles harmonized concept data + metadata so that datasets are immutable, self-describing artifacts. The browser pipeline reads GCR files instead of raw repos. - -## Task - -Create `docs/gcr-spec.md` defining the GCR format. - -### GCR ZIP structure - -``` -my-dataset.gcr (ZIP) -├── metadata.yaml # Dataset metadata + statistics -├── register.yaml # Original register metadata from source repo -├── concepts/ # Harmonized concept YAML files (canonical format) -│ ├── 102-01-01.yaml -│ ├── 102-01-02.yaml -│ └── ... -└── concepts_data/ # Pre-serialized (optional, for fast loading) - └── ... # Future: JSON or Marshal serialized concepts -``` - -### metadata.yaml schema - -```yaml -title: IEC Electropedia (IEV) # required -description: International Electrotechnical... # required -glossarist_version: 2.4.0 # required -created_at: "2026-04-28T12:00:00+09:00" # required -created_by: glossarist CLI # required - -statistics: # required - concept_count: 22228 - languages: [eng, ara, deu, fra, ...] - concepts_with_definitions: 20000 - concepts_with_sources: 18000 - -owner: IEC TC 1 # optional -homepage: https://www.electropedia.org # optional -repository: https://github.com/glossarist/... 
# optional -license: CC-BY-SA # optional -tags: [electrotechnical, iec, multilingual] # optional - -appearance: # optional - color: "#3366ff" - -links: # optional - - name: IEC Electropedia - url: https://www.electropedia.org - -schema_version: "1.0.0" # required -``` - -### Validation rules (for `glossarist validate`) - -- `metadata.yaml` exists and parses -- `concepts/` directory exists with ≥1 YAML file -- Each concept has `termid` (string) -- Each concept has ≥1 language block with ≥1 term -- No duplicate `termid` values -- `definition` is always array of `{content: "..."}` (harmonized) -- `sources` is always array (no `authoritative_source` singular) -- `entry_status` values are from allowed set: `valid`, `superseded`, `withdrawn`, `draft` -- Cross-references (if present) are valid concept IDs - -### Reference: LXR format (lutaml-xsd) - -The LXR format is a ZIP with `metadata.yaml` + `schemas/*.xsd` + `schemas_data/*.marshal`. Key files: -- `/Users/mulgogi/src/lutaml/lutaml-xsd/lib/lutaml/xsd/schema_repository_package.rb` — ZIP read/write -- `/Users/mulgogi/src/lutaml/lutaml-xsd/lib/lutaml/xsd/package_builder.rb` — serialization orchestration -- `/Users/mulgogi/src/lutaml/lutaml-xsd/lib/lutaml/xsd/schema_repository_metadata.rb` — metadata model -- `/Users/mulgogi/src/lutaml/lutaml-xsd/lib/lutaml/xsd/package_configuration.rb` — strategy configuration - -## Files - -- Create: `docs/gcr-spec.md` - -## Verification - -- Document exists, specifies ZIP structure, metadata schema, validation rules -- References canonical format from `docs/dataset-schema.md` diff --git a/TODO.generalized/03-datasets-yml.md b/TODO.generalized/03-datasets-yml.md deleted file mode 100644 index c17d52e..0000000 --- a/TODO.generalized/03-datasets-yml.md +++ /dev/null @@ -1,72 +0,0 @@ -# Status: DONE - -# 03 — Create datasets.yml + .gitignore - -## Context - -The browser needs a configuration file listing all datasets with their source repos, colors, and metadata. 
Currently the dataset list is hardcoded in `generate-data.mjs` (lines 309-346). Externalizing it to `datasets.yml` means adding a dataset requires only editing one file. - -## Task - -### Create `datasets.yml` - -```yaml -# datasets.yml — Glossarist Vocabulary Browser dataset registry -# Add a new dataset by adding an entry below. No code changes required. -# Run: npm run fetch-datasets && npm run generate-data && npm run build-edges - -datasets: - - id: iev - sourceRepo: https://github.com/glossarist/glossarist-data-iev - title: "IEC Electropedia (IEV)" - owner: IEC TC 1 - existingSiteUrl: https://www.electropedia.org - color: "#3366ff" - tags: [electrotechnical, iec, multilingual] - - - id: isotc211 - sourceRepo: https://github.com/geolexica/isotc211-glossary - owner: ISO/TC 211 - existingSiteUrl: https://isotc211.geolexica.org - color: "#0d9488" - tags: [geographic-information, gis, iso, multilingual] - - - id: isotc204 - sourceRepo: https://github.com/geolexica/isotc204-glossary - owner: ISO/TC 204 - existingSiteUrl: https://isotc204.geolexica.org - color: "#d97706" - tags: [transport, its, iso, automated-driving] - - - id: osgeo - sourceRepo: https://github.com/geolexica/osgeo-glossary - owner: OSGeo - existingSiteUrl: https://osgeo.geolexica.org - color: "#059669" - tags: [osgeo, open-source, gis] -``` - -Metadata resolution: `datasets.yml` overrides → repo's `register.yaml` → defaults. 
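The metadata resolution order stated above (`datasets.yml` entry, then `register.yaml`, then defaults) could be sketched roughly as follows. The function name `resolveMetadata` and the `defaults` argument are hypothetical; the mapping of `register.yaml` `name` to title and `subregisters` to languages follows the description elsewhere in this diff.

```javascript
// Hypothetical helper (not the actual generate-data.mjs code): resolve dataset
// metadata with the precedence datasets.yml entry -> register.yaml -> defaults.
function resolveMetadata(entry, register = {}, defaults = {}) {
  // Return the first candidate that is neither undefined nor null.
  const pick = (...candidates) =>
    candidates.find((value) => value !== undefined && value !== null);

  // register.yaml lists languages as keys of `subregisters`.
  const registerLanguages = register.subregisters
    ? Object.keys(register.subregisters)
    : undefined;

  return {
    title: pick(entry.title, register.name, defaults.title),
    description: pick(entry.description, register.description, defaults.description),
    languages: pick(entry.languages, registerLanguages, defaults.languages),
    // Color is assumed to come only from datasets.yml or a fallback palette.
    color: pick(entry.color, defaults.color),
  };
}
```

A `datasets.yml` entry that omits `title`, like the `isotc211` entry above, would therefore pick up the `name` from that repository's `register.yaml`.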
- -### Create `.gitignore` - -``` -node_modules/ -dist/ -.datasets/ -public/data/ -*.tsbuildinfo -.DS_Store -.env -.env.local -``` - -## Files - -- Create: `datasets.yml` -- Create: `.gitignore` - -## Verification - -- `datasets.yml` parses as valid YAML -- `.gitignore` excludes generated data directories diff --git a/TODO.generalized/04-fetch-datasets.md b/TODO.generalized/04-fetch-datasets.md deleted file mode 100644 index 96cdc38..0000000 --- a/TODO.generalized/04-fetch-datasets.md +++ /dev/null @@ -1,48 +0,0 @@ -# Status: DONE - -# 04 — Create scripts/fetch-datasets.mjs - -## Context - -Currently dataset source directories are hardcoded absolute paths in `generate-data.mjs` (lines 11-13). Need a script that reads `datasets.yml`, clones/updates the source repos, and makes them available for data generation. - -## Task - -Create `scripts/fetch-datasets.mjs` that: - -1. Reads `datasets.yml` (using `js-yaml`, already a devDependency) -2. For each dataset: - - Check `DATASET_SOURCE_{ID}` env var for local path override - - If no override, `git clone --depth 1` into `.datasets/{id}/` (or `git fetch` + `reset` if exists) - - Supports `GITHUB_TOKEN` for private repos -3. Reads `.datasets/{id}/register.yaml` for metadata (title, description, languages) -4. Validates source directory exists with `.yaml` concept files -5. 
Outputs resolved metadata - -### Key implementation details - -- Use `child_process.execSync` for git operations -- Clone with `--depth 1` for speed (we don't need history) -- If `.datasets/{id}/` already exists, do `git fetch origin && git reset --hard origin/HEAD` -- Read `register.yaml` for `name` (→ title), `description`, `subregisters` (→ languages) -- Exit gracefully if a repo fails (don't block other datasets) -- Support `DATASET_SOURCE_IEV=/local/path` env var override for development - -### Example usage - -```bash -npm run fetch-datasets -# or with local override: -DATASET_SOURCE_IEV=/Users/me/src/glossarist/glossarist-data-iev npm run fetch-datasets -``` - -## Files - -- Create: `scripts/fetch-datasets.mjs` -- Modify: `package.json` — add `"fetch-datasets": "node scripts/fetch-datasets.mjs"` script - -## Verification - -- `npm run fetch-datasets` creates `.datasets/` with all 4 repos -- Re-running updates existing repos without errors -- `DATASET_SOURCE_IEV=/local/path npm run fetch-datasets` uses local path diff --git a/TODO.generalized/05-update-generate-data.md b/TODO.generalized/05-update-generate-data.md deleted file mode 100644 index d3f70e8..0000000 --- a/TODO.generalized/05-update-generate-data.md +++ /dev/null @@ -1,86 +0,0 @@ -# Status: DONE - -# 05 — Update scripts/generate-data.mjs - -## Context - -`generate-data.mjs` has hardcoded paths (lines 11-13), hardcoded cross-ref maps (lines 17-19), and format-variant handling (bare strings in `defsToJsonLd`, inline text scanning in `extractInlineRefs`). Must read from `datasets.yml` + `.datasets/` and handle only the canonical format. - -## Task - -### Remove - -- Hardcoded `IEV_DIR`, `TC211_DIR`, `TC204_DIR` constants (lines 11-13) -- Hardcoded `REF_PREFIX_MAP` and `URN_STANDARD_MAP` (lines 17-19) — inline refs are pre-extracted during harmonization -- Hardcoded `DATASETS` array (lines 309-346) -- Format-variant handling in `defsToJsonLd()` (line 57: `typeof defs === 'string' ? [...] 
: defs`) -- Format-variant handling in `extractInlineRefs()` (lines 86-91: bare string normalization) -- The entire `extractInlineRefs()` function — references are pre-extracted as `gl:references` during harmonization - -### Add - -- Read `datasets.yml` for dataset list and configuration -- Read `.datasets/{id}/register.yaml` for metadata (title, description, languages) -- Resolve source dirs from `.datasets/{id}/concepts/` or `DATASET_SOURCE_{ID}` env var -- Merge metadata: `datasets.yml` overrides → `register.yaml` → defaults -- Simplify `defsToJsonLd()` to assume array-of-objects format only - -### Keep unchanged - -- All JSON-LD conversion logic (`yamlToJsonLd`, `termToDesignation`, `sourcesToJsonLd`) -- `processDataset()` flow (chunking, manifest generation) -- `DS_PALETTE` fallback (used when no color in datasets.yml) - -### Simplified `defsToJsonLd` - -```js -function defsToJsonLd(defs) { - if (!defs || !Array.isArray(defs)) return []; - return defs - .map(d => ({ - '@type': 'gl:DetailedDefinition', - 'gl:content': d.content || '', - })) - .filter(d => d['gl:content']); -} -``` - -### Main loop reads from datasets.yml - -```js -import datasetsConfig from './datasets.yml' with { type: 'yaml' }; // or parse at runtime - -for (const ds of datasetsConfig.datasets) { - const dir = process.env[`DATASET_SOURCE_${ds.id.toUpperCase()}`] - || path.join(ROOT, '.datasets', ds.id, 'concepts'); - if (!fs.existsSync(dir)) { - console.warn(`Skipping ${ds.id}: source not found (${dir})`); - continue; - } - // Read register.yaml for metadata - const registerYaml = readYaml(path.join(ROOT, '.datasets', ds.id, 'register.yaml')); - processDataset(dir, ds.id, { - title: ds.title || registerYaml.name, - description: ds.description || registerYaml.description, - owner: ds.owner, - languages: ds.languages || Object.keys(registerYaml.subregisters || {}), - color: ds.color || DS_PALETTE[idx % DS_PALETTE.length], - sourceRepo: ds.sourceRepo, - existingSiteUrl: ds.existingSiteUrl, - 
tags: ds.tags, - }); -} -``` - -## Files - -- Modify: `scripts/generate-data.mjs` - -## Verification - -- `npm run generate-data` works with datasets from `.datasets/` -- `npm run generate-data` works with `DATASET_SOURCE_IEV` env var -- No hardcoded dataset paths remain -- `defsToJsonLd` does not handle bare strings -- `extractInlineRefs` removed -- All 4 datasets generate successfully (iev, isotc211, isotc204, osgeo) diff --git a/TODO.generalized/06-harmonize-osgeo.md b/TODO.generalized/06-harmonize-osgeo.md deleted file mode 100644 index eaff0ba..0000000 --- a/TODO.generalized/06-harmonize-osgeo.md +++ /dev/null @@ -1,7 +0,0 @@ -# 06 — Harmonize osgeo-glossary Dataset - -## Status: DONE (integrated into fetch-datasets.mjs) - -The harmonization is handled by `scripts/fetch-datasets.mjs` which normalizes all concept YAML files to canonical format during the fetch step. No separate script needed. - -See `docs/dataset-schema.md` for the harmonization rules applied. diff --git a/TODO.generalized/07-harmonize-iev.md b/TODO.generalized/07-harmonize-iev.md deleted file mode 100644 index 3b4cabb..0000000 --- a/TODO.generalized/07-harmonize-iev.md +++ /dev/null @@ -1,7 +0,0 @@ -# 07 — Harmonize IEV Dataset - -## Status: DONE (integrated into fetch-datasets.mjs) - -The harmonization is handled by `scripts/fetch-datasets.mjs` which normalizes all concept YAML files to canonical format during the fetch step. No separate script needed. - -See `docs/dataset-schema.md` for the harmonization rules applied. diff --git a/TODO.generalized/08-spa-deployment-config.md b/TODO.generalized/08-spa-deployment-config.md deleted file mode 100644 index 4bc21bb..0000000 --- a/TODO.generalized/08-spa-deployment-config.md +++ /dev/null @@ -1,119 +0,0 @@ -# Status: DONE - -# 08 — SPA Deployment Configuration - -## Context - -The browser needs to deploy as an SPA to GitHub Pages at https://www.geolexica.org. 
This requires: -- Base path configuration in Vite and Vue Router -- SPA fallback (404.html) for client-side routing -- GitHub Actions CI/CD pipeline - -## Task - -### vite.config.ts - -Add `base` option: - -```typescript -export default defineConfig({ - base: process.env.BASE_PATH || '/', - // ... rest unchanged -}) -``` - -### src/router/index.ts (line 34) - -```typescript -history: createWebHistory(import.meta.env.BASE_URL), -``` - -### scripts/generate-404.js - -Copy `dist/index.html` → `dist/404.html` for GitHub Pages SPA fallback. - -```js -import { copyFileSync } from 'fs'; -import { join, dirname } from 'path'; -import { fileURLToPath } from 'url'; - -const __dirname = dirname(fileURLToPath(import.meta.url)); -const dist = join(__dirname, '..', 'dist'); -copyFileSync(join(dist, 'index.html'), join(dist, '404.html')); -console.log('Created dist/404.html for SPA fallback'); -``` - -### package.json scripts - -Add: -```json -{ - "fetch-datasets": "node scripts/fetch-datasets.mjs", - "build:full": "npm run fetch-datasets && npm run generate-data && node scripts/build-edges.js && npm run build", - "postbuild": "node scripts/generate-404.js" -} -``` - -### .github/workflows/deploy.yml - -```yaml -name: Deploy to GitHub Pages - -on: - push: - branches: [main] - workflow_dispatch: - -permissions: - contents: read - pages: write - id-token: write - -concurrency: - group: pages - cancel-in-progress: false - -jobs: - build: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - uses: actions/setup-node@v4 - with: - node-version: 20 - cache: npm - - run: npm ci - - run: npm run fetch-datasets - env: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - run: npm run generate-data - - run: node scripts/build-edges.js - - run: npm run build - - uses: actions/upload-pages-artifact@v3 - with: - path: dist - - deploy: - needs: build - runs-on: ubuntu-latest - environment: - name: github-pages - url: ${{ steps.deployment.outputs.page_url }} - steps: - - id: deployment - 
uses: actions/deploy-pages@v4 -``` - -## Files - -- Modify: `vite.config.ts` -- Modify: `src/router/index.ts` -- Modify: `package.json` -- Create: `scripts/generate-404.js` -- Create: `.github/workflows/deploy.yml` - -## Verification - -- `npm run build` creates `dist/404.html` -- SPA routes work with direct URL access (404.html fallback) -- GitHub Actions workflow runs on push to main diff --git a/TODO.generalized/09-update-docs.md b/TODO.generalized/09-update-docs.md deleted file mode 100644 index 9283093..0000000 --- a/TODO.generalized/09-update-docs.md +++ /dev/null @@ -1,54 +0,0 @@ -# Status: DONE - -# 09 — Update Documentation - -## Context - -`docs/adding-a-dataset.md` is outdated — it references the old color system (per-dataset Tailwind colors, `dsColor()` functions), old CLI flags (`--input`, `--id`), and inline cross-reference patterns that are being removed. The new pipeline uses `datasets.yml` + `fetch-datasets` + `generate-data` with no code changes. - -## Task - -### Rewrite `docs/adding-a-dataset.md` - -Reflect the new pipeline: - -1. Add entry to `datasets.yml` (id, sourceRepo, owner, color, tags) -2. Run `npm run fetch-datasets && npm run generate-data && npm run build-edges` -3. No code changes needed -4. Reference `docs/dataset-schema.md` for canonical concept format -5. 
Reference `docs/gcr-spec.md` for GCR packaging format - -Remove all references to: -- Per-dataset Tailwind color configuration -- `dsColor()`, `dsAccent()`, `REGISTER_COLORS` functions -- Inline cross-reference patterns (`{{...IEV:...}}`, `{urn:iso:...}`) -- `--input`, `--id`, `--title` CLI flags -- Manual `datasets.json` editing - -### Update `docs/architecture.md` - -Update data pipeline description to reflect: -- Source repos → `datasets.yml` + `fetch-datasets.mjs` → `.datasets/` -- `.datasets/` → `generate-data.mjs` (canonical format only) → `public/data/` -- No format-variant handling - -### Update `CLAUDE.md` - -Update to reflect: -- `datasets.yml` as the dataset registry (not `DATASETS` array in generate-data.mjs) -- `npm run fetch-datasets` command -- `npm run build:full` command -- GCR packaging format reference -- Canonical concept format - -## Files - -- Modify: `docs/adding-a-dataset.md` -- Modify: `docs/architecture.md` -- Modify: `CLAUDE.md` - -## Verification - -- No references to old color system remain -- No references to hardcoded paths remain -- Pipeline documentation matches actual scripts diff --git a/TODO.generalized/10-glossarist-gem-commands.md b/TODO.generalized/10-glossarist-gem-commands.md deleted file mode 100644 index f52827c..0000000 --- a/TODO.generalized/10-glossarist-gem-commands.md +++ /dev/null @@ -1,73 +0,0 @@ -# Status: DONE - -# 10 — Glossarist Gem: upgrade, package, validate Commands - -## Context - -The glossarist-ruby gem (`/Users/mulgogi/src/glossarist/glossarist-ruby/`) currently has only `generate_latex`. Three new commands are needed to support the GCR workflow. This is a **separate repo and separate effort** from the browser. 
- -Reference implementations from `lutaml-xsd`: -- `schema_repository_package.rb` — ZIP read/write logic -- `package_builder.rb` — serialization orchestration -- `schema_repository_metadata.rb` — metadata model with extensibility -- `package_configuration.rb` — strategy configuration -- `commands/package_command.rb` — CLI build/validate/info commands - -## Task - -### `glossarist harmonize -o ` - -Reads a source concept repository (any format variant), normalizes to canonical format. - -Harmonization rules (from `docs/dataset-schema.md`): -- Definitions: bare string → `[{content: "text"}]` -- Sources: `authoritative_source` → `sources` array -- Dates: scalar → `dates` array -- Entry status: `"Standard"` → `"valid"` -- Terms: `abbrev: true` → `type: abbreviation` -- Inline refs: `{{term, IEV:xxx}}` → structured `references` -- `_revisions`: stripped -- `termid`: cast to string - -### `glossarist package -o ` - -Creates a `.gcr` ZIP file: -1. Read harmonized YAML directory -2. Generate `metadata.yaml` (from `register.yaml` + computed statistics) -3. Compute statistics (concept count, languages, concepts with definitions/sources) -4. Assemble ZIP with `metadata.yaml`, `register.yaml`, `concepts/*.yaml` - -### `glossarist validate ` - -Validates a source directory or `.gcr` file: -- `metadata.yaml` exists and parses -- `concepts/` directory with ≥1 YAML file -- Each concept has `termid` (string) -- Each concept has ≥1 language block with ≥1 term -- No duplicate `termid` values -- Format compliance (canonical format rules) -- Cross-reference integrity (optional) - -### Implementation approach - -1. Add `Glossarist::CLI` Thor commands in `lib/glossarist/cli.rb` -2. Add `Glossarist::Package` module with `GcrPackage`, `GcrMetadata`, `GcrBuilder` classes -3. Use `rubyzip` gem for ZIP creation/extraction -4. Reuse `ManagedConceptCollection.load_from_files()` for reading concepts -5. 
Statistics computed from loaded collection - -## Files (in glossarist-ruby repo) - -- Modify: `lib/glossarist/cli.rb` -- Create: `lib/glossarist/package/` -- Create: `lib/glossarist/package/gcr_package.rb` -- Create: `lib/glossarist/package/gcr_metadata.rb` -- Create: `lib/glossarist/package/gcr_builder.rb` -- Modify: `glossarist.gemspec` — add `rubyzip` dependency - -## Verification - -- `glossarist harmonize` produces canonical YAML from any source format -- `glossarist package` creates a valid `.gcr` file -- `glossarist validate` catches format violations -- Browser pipeline can read `.gcr` output
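The harmonization rules listed for `glossarist harmonize` can be sketched as a pair of pure functions. This is an illustration in JavaScript (the language of the browser's scripts), not the gem's implementation and not the code in `scripts/fetch-datasets.mjs`. The function names are hypothetical, the placement of `_revisions` at the top level is an assumption, language blocks are detected by the presence of a `terms` array (also an assumption), and the inline-reference rule is left out.

```javascript
// Hypothetical sketch: apply the per-language-block harmonization rules.
function harmonizeBlock(block) {
  const out = { ...block };

  // Definitions: bare string -> [{ content }]
  if (typeof out.definition === 'string') {
    out.definition = [{ content: out.definition }];
  }

  // Sources: singular `authoritative_source` -> `sources` array
  if (out.authoritative_source && !out.sources) {
    out.sources = [{ type: 'authoritative', origin: out.authoritative_source }];
    delete out.authoritative_source;
  }

  // Dates: scalar `date_accepted` -> `dates` array
  if (out.date_accepted !== undefined && !out.dates) {
    out.dates = [{ type: 'accepted', date: out.date_accepted }];
    delete out.date_accepted;
  }

  // Entry status: "Standard" -> "valid"
  if (out.entry_status === 'Standard') out.entry_status = 'valid';

  // Terms: `abbrev: true` -> `type: abbreviation` (the `abbrev` key is dropped)
  if (Array.isArray(out.terms)) {
    out.terms = out.terms.map(({ abbrev, ...term }) =>
      abbrev === true ? { ...term, type: 'abbreviation' } : term);
  }
  return out;
}

// Hypothetical sketch: harmonize one parsed concept object.
function harmonizeConcept(concept) {
  const out = {};
  for (const [key, value] of Object.entries(concept)) {
    if (key === '_revisions') continue; // `_revisions`: stripped
    // Assumption: a language block is any object value with a `terms` array.
    const isLanguageBlock =
      value !== null && typeof value === 'object' && Array.isArray(value.terms);
    out[key] = isLanguageBlock ? harmonizeBlock(value) : value;
  }
  // `termid`: cast to string
  if (out.termid !== undefined) out.termid = String(out.termid);
  return out;
}
```

Already-canonical input passes through unchanged, which matches the "unchanged" rows of the harmonization table in `01-canonical-concept-format.md`.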