Glossarist

The Glossarist gem implements the Glossarist model in Ruby. Every entity in the model is available as a class, and every attribute is available as a method on that class. The gem can also read and write concept datasets, and lets you build your own collection and save it as a Glossarist model V2 dataset.

The YAML schema for concept and localized_concept is available at Concept model/yaml_schemas

Installation

Add this line to your application’s Gemfile:

gem 'glossarist'

And then execute:

bundle install

Or install it yourself as:

gem install glossarist

Usage

Reading a Glossarist model V2 from files

A Glossarist model V2 dataset is a collection of concepts and their localized concepts, stored as YAML files.

The dataset can be stored in one of two layouts:

Each concept is stored in its own YAML file, and its localized concepts are stored in separate YAML files. Concept files live in the concept folder and localized concept files live in the localized_concept folder.
Each concept and its related localized concepts are stored in a single YAML file. These concept files are stored directly in the specified path.
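
As a rough illustration of the two layouts, the sketch below guesses which one a directory uses. `dataset_layout` is a hypothetical helper and not part of the gem; it assumes the folder names concept and localized_concept described above and a .yaml file extension.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper (not part of the gem): guess the storage layout
# of a dataset directory from what it contains.
def dataset_layout(path)
  split = File.directory?(File.join(path, "concept")) &&
          File.directory?(File.join(path, "localized_concept"))
  return :split if split
  # Assumption: grouped concept files sit directly in the path as *.yaml.
  return :grouped unless Dir.glob(File.join(path, "*.yaml")).empty?

  :unknown
end

# Small demonstration on a temporary directory.
Dir.mktmpdir do |dir|
  FileUtils.mkdir_p(File.join(dir, "concept"))
  FileUtils.mkdir_p(File.join(dir, "localized_concept"))
  puts dataset_layout(dir) # prints "split"
end
```

The gem's own loader may use different detection logic; this only mirrors the description in this section.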

To load the glossarist model V2 dataset:

collection = Glossarist::ManagedConceptCollection.new
collection.load_from_files("path/to/glossarist-v2-dataset")

Writing a Glossarist model V2 to files

To write the glossarist model V2 dataset to files:

# load the collection from files
collection = Glossarist::ManagedConceptCollection.new
collection.load_from_files("path/to/glossarist-v2-dataset")

# ... Update the collection ...

collection.save_to_files("path/to/glossarist-v2-dataset")

To write the glossarist model V2 dataset with concepts and their localized concepts grouped into single files:

# load the collection from files
collection = Glossarist::ManagedConceptCollection.new
collection.load_from_files("path/to/glossarist-v2-dataset")

# ... Update the collection ...

collection.save_grouped_concepts_to_files("path/to/glossarist-v2-dataset")

ManagedConceptCollection

This is a collection of managed concepts. It includes Ruby's Enumerable module.

collection = Glossarist::ManagedConceptCollection.new
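
Including Enumerable means that a class only has to define `each` to gain `map`, `select`, `count` and the rest. The sketch below shows the mechanism with a hypothetical `TinyCollection` stand-in; it is not the gem's class and the hash items are placeholders.

```ruby
# Hypothetical stand-in for a collection class; only the Enumerable
# mechanics are the point here.
class TinyCollection
  include Enumerable

  def initialize(items)
    @items = items
  end

  # Enumerable builds map, select, count, etc. on top of #each.
  def each(&block)
    return enum_for(:each) unless block

    @items.each(&block)
    self
  end
end

concepts = TinyCollection.new([{ id: "1" }, { id: "2" }, { id: "3" }])
ids = concepts.map { |c| c[:id] }              # all identifiers
odd = concepts.select { |c| c[:id].to_i.odd? } # items with odd ids
```

The same style of iteration should apply to a loaded ManagedConceptCollection, with managed concepts in place of the hashes.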

ManagedConcept

The following fields are available for ManagedConcept:

id: String identifier for the concept
uuid: UUID for the concept
related: Array of RelatedConcept
status: Enum for the normative status of the term.
dates: Array of ConceptDate
localized_concepts: Hash of all localizations where keys are language codes and values are uuid of the localized concept.
domains: Array of ConceptReference — upper concepts (subject areas, concept schemes, organizing concepts) that this concept belongs to across all languages. Each domain is a typed reference (e.g. { concept_id: "103", ref_type: "domain" }).
localizations: Hash of all localizations for this concept where keys are language codes and values are instances of LocalizedConcept.

There are two ways to initialize and populate a managed concept

Setting the fields by using a hash while initializing

concept = Glossarist::ManagedConcept.new({
  "data" => {
    "id" => "123",
    "localized_concepts" => {
      "ara" => "<uuid>",
      "eng" => "<uuid>"
    },
    "localizations" => <Array of localized concepts or localized concept hashes>,
    "domains" => [
      { "concept_id" => "103", "ref_type" => "domain" },
    ],
  },
})

Setting the fields after creating an object

concept = Glossarist::ManagedConcept.new
concept.id = "123"
concept.data.domains = [
  Glossarist::ConceptReference.new(concept_id: "103", ref_type: "domain"),
]
concept.localizations = <Array of localized concepts or localized concept hashes>

LocalizedConcept

Localizations of the term to different languages.

Localized concept has the following fields

id: An optional identifier for the term, to be used in cross-references.
uuid: UUID for the concept
designations: Array of Designations under which the term being defined is known. This method will also accept an array of hashes for designation and will convert them to their respective classes.
domain: URI reference to the subject area or section concept. Can be a relative URI (e.g. section-103-01), a URN (e.g. urn:iec:std:iec:60050-103-01), or a URL (e.g. https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=103-01). This is the per-language upper concept reference — the subject area for this specific localization. Different languages may assign the same abstract concept to different domains.
related: Array of RelatedConcept holding per-language concept relationships. Concept hierarchies can differ across languages (e.g. Russian treats голубой (goluboy) and синий (siniy) as coordinate basic colors, while English unifies them under "blue"). Language-specific broader/narrower/equivalent relationships go here.
subject: Subject of the term.
definition: Array of Detailed Definition of the term.
non_verb_rep: Array of non-verbal representations used to help define the term.
notes: Zero or more notes about the term. A note is in Detailed Definition format.
examples: Zero or more examples of how the term is to be used in Detailed Definition format.
language_code: The language of the localization, as an ISO 639 3-letter code.
script: The script of the localization, as an ISO 15924 4-letter code (e.g. Hans for Simplified Chinese, Latn for Latin, Cyrl for Cyrillic). Optional — when omitted, the default script for the language is assumed.
system: The ISO 24229 conversion system code used to produce this localization (e.g. Var:jpn-Hrkt:Latn:Hepburn-1886 for Hepburn-romanized Japanese). Optional — only set when the localization is a romanization or transliteration.
entry_status: Entry status of the concept. Must be one of the following: notValid, valid, superseded, retired.
classification: Classification of the concept. Must be one of the following: preferred, admitted, deprecated.
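
The constraints in the list above can be sketched as a small checker over a plain hash. `localized_concept_problems` is a hypothetical helper, not the gem's validation, and the flat YAML layout used here is an assumption made for brevity; real dataset files may nest these fields differently.

```ruby
require "yaml"

ENTRY_STATUSES = %w[notValid valid superseded retired].freeze
CLASSIFICATIONS = %w[preferred admitted deprecated].freeze

# Hypothetical checker: returns a list of problem descriptions.
def localized_concept_problems(data)
  problems = []
  unless data["language_code"].to_s.match?(/\A[a-z]{3}\z/)
    problems << "language_code should be 3 lowercase letters"
  end
  # script is optional; when present it should look like Latn, Cyrl, Hans.
  if data.key?("script") && !data["script"].to_s.match?(/\A[A-Z][a-z]{3}\z/)
    problems << "script should be a 4-letter ISO 15924 code"
  end
  if data.key?("entry_status") && !ENTRY_STATUSES.include?(data["entry_status"])
    problems << "entry_status should be one of #{ENTRY_STATUSES.join(', ')}"
  end
  if data.key?("classification") && !CLASSIFICATIONS.include?(data["classification"])
    problems << "classification should be one of #{CLASSIFICATIONS.join(', ')}"
  end
  problems
end

sample = YAML.safe_load(<<~YAML)
  language_code: jpn
  script: Latn
  system: "Var:jpn-Hrkt:Latn:Hepburn-1886"
  entry_status: valid
  classification: preferred
YAML

problems = localized_concept_problems(sample)
```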

Designation

A name under which a managed term is known. Designations follow an inheritance hierarchy based on ISO 10241-1 and the Metanorma concept model.

Designation::Base (common to all types)

designation

String — the term text or symbol.

normative_status

Enum — one of preferred, admitted, deprecated, superseded.

geographical_area

String — geographic usage region (ISO 3166-1 country code).

language

String — language of this designation (ISO 639 code). Usually inherited from the LocalizedConcept’s language_code, but can differ for borrowed terms.

script

String — script of the designation text (ISO 15924 code, e.g. Hani for Kanji, Latn for Latin, Cyrl for Cyrillic).

system

String — ISO 24229 conversion system code used to produce this designation (e.g. Var:jpn-Hrkt:Latn:Hepburn-1886 for Hepburn romanization). Optional — only set when the designation is a romanization or transliteration.

international

Boolean — whether the designation is used internationally.

absent

Boolean — whether the designation is intentionally absent in this language.

pronunciation

Collection of Pronunciation entries — phonetic or romanized representations of the designation.

sources

Collection of ConceptSource entries — bibliographic sources for this designation (ISO 10241-1 §6.8).

term_type

Enum (ISO 12620) — optional classification of the designation’s term type. See ISO 12620 Term Types below.

related

Collection of RelatedConcept entries — term-level (designation-to-designation) relationships within the same concept entry. Used for linking abbreviated forms to full forms, short forms to expanded forms, etc. (TBX xref types).

Each Pronunciation entry has:

Attribute	Standard	Description
`content`	—	The pronunciation text
`language`	ISO 639	Language/dialect being pronounced (3-letter code)
`script`	ISO 15924	Script of the pronunciation text (4-letter code)
`country`	ISO 3166-1	Country variant (2-letter code, optional)
`system`	ISO 24229	Conversion system code or identifier (e.g. `IPA`, `Var:jpn-Hrkt:Latn:Hepburn-1886`)

Example:

pronunciation:
  - content: "toːkjoː"
    language: jpn
    script: Latn
    system: IPA
  - content: "Tōkyō"
    language: jpn
    script: Latn
    system: "Var:jpn-Hrkt:Latn:Hepburn-1886"
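
The YAML above can be read with Ruby's standard YAML library. The sketch below parses the same entries and indexes them by system; it uses plain hashes rather than the gem's Pronunciation class.

```ruby
require "yaml"

doc = YAML.safe_load(<<~YAML)
  pronunciation:
    - content: "toːkjoː"
      language: jpn
      script: Latn
      system: IPA
    - content: "Tōkyō"
      language: jpn
      script: Latn
      system: "Var:jpn-Hrkt:Latn:Hepburn-1886"
YAML

# Index the pronunciation text by its conversion system.
by_system = doc["pronunciation"].to_h { |p| [p["system"], p["content"]] }
ipa = by_system["IPA"]
```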

Designation::Expression (text-based, inherits Base)

prefix: String — text before the designation.
usage_info: String — disambiguation text for the designation.
field_of_application: String — IEC "specific use", appears in angle brackets after the designation (e.g. "in communication theory").
grammar_info: Array of GrammarInfo — gender, number, part of speech.

Designation::Abbreviation (inherits Expression)

acronym: Boolean — is this an acronym?
initialism: Boolean — is this an initialism?
truncation: Boolean — is this a truncation?

Designation::Symbol (inherits Base)

No additional attributes beyond Base.

Designation::LetterSymbol (inherits Symbol)

text: String — the letter symbol text.

Designation::GraphicalSymbol (inherits Symbol)

text: String — description of the symbol.
image: String — the graphical symbol (emoji, path, or data URL).

Factory Method

Designation::Base.from_h(options) creates a new designation instance based on the specified type.

Parameters

options (Hash) - The options for creating the designation.
"type" (String) - The type of designation (expression, symbol, abbreviation, graphical_symbol, letter_symbol). Note: the type key must be a string, not a symbol, so { type: "expression" } will not work.
Additional options depend on the specific designation type.

Returns

Designation::{type}: A new instance of specified type.

Example

# Expression with field of application
expr = Designation::Base.from_h({
  "type" => "expression",
  "designation" => "information",
  "normative_status" => "preferred",
  "field_of_application" => "in communication theory",
})

# International abbreviation
abbr = Designation::Base.from_h({
  "type" => "abbreviation",
  "designation" => "ISO",
  "international" => true,
  "acronym" => true,
})
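
To show why a symbol key does not work, here is a minimal sketch of a factory that dispatches on the string key "type". The `FactorySketch` module and its classes are hypothetical and much smaller than the gem's hierarchy; the gem's actual error behavior for an unknown type may differ.

```ruby
module FactorySketch
  class Base
    attr_reader :designation

    def initialize(options)
      @designation = options["designation"]
    end

    # Dispatch on the string key "type"; a symbol key is never consulted,
    # so { type: "expression" } looks like a missing type.
    def self.from_h(options)
      klass = TYPES.fetch(options["type"]) do
        raise ArgumentError, "unknown designation type: #{options['type'].inspect}"
      end
      klass.new(options)
    end
  end

  class Expression < Base; end
  class Abbreviation < Expression; end
  class LetterSymbol < Base; end

  TYPES = {
    "expression" => Expression,
    "abbreviation" => Abbreviation,
    "letter_symbol" => LetterSymbol,
  }.freeze
end

abbr = FactorySketch::Base.from_h("type" => "abbreviation", "designation" => "ISO")
```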

ISO 12620 Term Types

The term_type attribute on Designation::Base classifies designations according to ISO 12620 (also used as TBX termType). This is orthogonal to the structural designation type (expression/abbreviation/symbol): the structural type determines how the designation is serialized, while term_type provides ISO 12620 semantic classification.

Term type	Description
`abbreviation`	A shortened form of a word or phrase (general category)
`acronym`	An abbreviation pronounced as a word (e.g. NATO, laser)
`clipped_term`	A term formed by clipping part of a longer term (e.g. "phone" from "telephone")
`common_name`	A name in common use for a concept (e.g. "water" vs H₂O)
`entry_term`	The headword or main term in a terminological entry
`equation`	A mathematical equation used as a designation
`formula`	A chemical or mathematical formula (e.g. H₂O, E=mc²)
`full_form`	The complete, unabbreviated form of a designation (e.g. "World Wide Web")
`initialism`	An abbreviation pronounced letter by letter (e.g. "URL", "FBI")
`internationalism`	A term used with the same meaning across many languages (e.g. "computer", "algorithm")
`international_scientific_term`	A term established by international scientific agreement (e.g. "hydrogen")
`logical_expression`	A logical or Boolean expression used as a designation
`part_number`	A part number or catalog identifier used as a designation
`phraseological_unit`	A multi-word expression or phrase functioning as a term (e.g. "software engineering")
`transcribed_form`	A designation produced by phonetic transcription from another script
`transliterated_form`	A designation produced by transliteration from another script (e.g. "Moskva" from "Москва")
`short_form`	A shortened form of a designation that is not an abbreviation (e.g. "US" for "United States")
`shortcut`	A keyboard shortcut or command sequence (e.g. "Ctrl+V" for paste)
`sku`	A stock keeping unit identifier
`standard_text`	A standardized text passage used as a designation
`symbol`	A non-verbal symbol representing the concept (e.g. Ω for ohm)
`synonym`	A term with the same meaning in the same language, used as an alternative designation
`synonymous_phrase`	A phrase that is synonymous with the preferred designation
`variant`	A spelling, regional, or stylistic variant of another designation

Designation-Level Relationships (TBX xref)

Designations can have intra-entry relationships — links between designations of the same concept. These correspond to TBX xref elements on term information groups (<tig>).

Relationship type	Description
`abbreviated_form_for`	This designation is an abbreviated form of the target (e.g. "WWW" → "World Wide Web")
`short_form_for`	This designation is a short form of the target (e.g. "US" → "United States of America")

Example:

terms:
  - designation: WWW
    type: abbreviation
    term_type: acronym
    related:
      - type: abbreviated_form_for
        content: "World Wide Web"
  - designation: World Wide Web
    type: expression
    term_type: full_form
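
A consumer of this structure will usually want to resolve each relationship to a designation in the same entry. The sketch below does that over the parsed YAML; it works on plain hashes and assumes that content holds the target designation text, as in the example above.

```ruby
require "yaml"

entry = YAML.safe_load(<<~YAML)
  terms:
    - designation: WWW
      type: abbreviation
      term_type: acronym
      related:
        - type: abbreviated_form_for
          content: "World Wide Web"
    - designation: World Wide Web
      type: expression
      term_type: full_form
YAML

known = entry["terms"].map { |t| t["designation"] }

# Collect [source, relationship, target, resolved?] for each link.
links = entry["terms"].flat_map do |term|
  Array(term["related"]).map do |rel|
    [term["designation"], rel["type"], rel["content"], known.include?(rel["content"])]
  end
end
```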

RelatedConcept

A concept related to the current concept with a typed relationship.

type: Enum — the relationship type (see Relationship Types below).
content: String — free-text content describing the related concept.
ref: A Citation reference to the related concept.

There are two ways to initialize and populate a related concept

Setting the fields by using a hash while initializing

related_concept = Glossarist::RelatedConcept.new({
  content: "Test content",
  type: :supersedes,
  ref: <concept citation>
})

Setting the fields after creating an object

related_concept = Glossarist::RelatedConcept.new
related_concept.type = "supersedes"
related_concept.content = "designation of the related concept"
related_concept.ref = <Citation object>

Relationship Types

Relationship types are drawn from ISO 10241-1, ISO 25964/SKOS, and ISO 12620/TBX. The table below shows each type with its provenance and cross-standard equivalents.

Glossarist type	Category	ISO 10241-1	ISO 25964 / SKOS	ISO 12620 / TBX
`deprecates`	Lifecycle	deprecates	—	—
`supersedes`	Lifecycle	supersedes	—	—
`superseded_by`	Lifecycle	superseded by	—	—
`broader`	Hierarchical	broader concept	BT (broaderTerm)	broaderTerm
`narrower`	Hierarchical	narrower concept	NT (narrowerTerm)	narrowerTerm
`broader_generic`	Hierarchical (generic)	—	BTG (broaderGeneric, is-a)	broaderTermGeneric
`narrower_generic`	Hierarchical (generic)	—	NTG (narrowerGeneric)	narrowerTermGeneric
`broader_partitive`	Hierarchical (partitive)	—	BTP (broaderPartitive, part-whole)	broaderTermPartitive
`narrower_partitive`	Hierarchical (partitive)	—	NTP (narrowerPartitive)	narrowerTermPartitive
`broader_instantial`	Hierarchical (instantial)	—	BTI (broaderInstantial, instance-of)	broaderTermInstantial
`narrower_instantial`	Hierarchical (instantial)	—	NTI (narrowerInstantial)	narrowerTermInstantial
`equivalent`	Equivalence	equivalent	exactMatch	—
`close_match`	Approx. equiv.	—	closeMatch	—
`broad_match`	Cross-vocab mapping	—	broadMatch	—
`narrow_match`	Cross-vocab mapping	—	narrowMatch	—
`related_match`	Cross-vocab mapping	—	relatedMatch	—
`compare`	Comparative	compare	—	—
`contrast`	Comparative	contrast	—	—
`see`	Associative	see also	RT (relatedTerm)	crossReference
`related_concept`	Associative	—	—	relatedConcept
`related_concept_broader`	Associative (broader)	—	—	relatedConceptBroader
`related_concept_narrower`	Associative (narrower)	—	—	relatedConceptNarrower
`sequentially_related_concept`	Associative (sequential)	—	—	sequentiallyRelatedConcept
`spatially_related_concept`	Associative (spatial)	—	—	spatiallyRelatedConcept
`temporally_related_concept`	Associative (temporal)	—	—	temporallyRelatedConcept
`homograph`	Lexical	—	—	homograph
`false_friend`	Lexical	—	—	falseFriend

ConceptReference

A typed reference to another concept, either local (within the same glossary) or external (in another concept registry).

term: String — the display text for the referenced concept.
concept_id: String — the identifier of the target concept.
source: String — the registry URI prefix for external references (e.g. urn:iec:std:iec:60050).
ref_type: String — the reference type: local, designation, or urn.
urn: String — a direct URN for the target concept (e.g. urn:iec:std:iec:60050-102-01-01).

Local references use concept_id without source. External references use source + concept_id or a direct urn.

# Local reference
ref = Glossarist::ConceptReference.new(term: "latitude", concept_id: "200", ref_type: "local")

# External reference via URN
ref = Glossarist::ConceptReference.new(
  term: "equality",
  concept_id: "102-01-01",
  source: "urn:iec:std:iec:60050",
  ref_type: "urn",
)

ref.local?    # => false
ref.external? # => true
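
The local/external rule stated above can be sketched with a plain Struct. `RefSketch` is hypothetical; it derives the answer only from source and urn, whereas the gem may also take ref_type into account.

```ruby
# Hypothetical model of the rule: a reference is external when it names
# a source registry or carries a direct URN, and local otherwise.
RefSketch = Struct.new(:term, :concept_id, :source, :ref_type, :urn, keyword_init: true) do
  def external?
    !(source.nil? && urn.nil?)
  end

  def local?
    !external?
  end
end

local_ref = RefSketch.new(term: "latitude", concept_id: "200", ref_type: "local")
by_source = RefSketch.new(term: "equality", concept_id: "102-01-01",
                          source: "urn:iec:std:iec:60050", ref_type: "urn")
by_urn = RefSketch.new(term: "equality", urn: "urn:iec:std:iec:60050-102-01-01")
```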

ConceptDate

A date relevant to the lifecycle of the managed term.

The following fields are available for ConceptDate:

date: The date associated with the managed term in Iso8601Date format.
type: An enum denoting the event that occurred on the given date in the lifecycle of the managed term.

There are two ways to initialize and populate a concept date

Setting the fields by using a hash while initializing

concept_date = Glossarist::ConceptDate.new({
  date: "2010-11-01T00:00:00+00:00",
  type: :accepted,
})

Setting the fields after creating an object

concept_date = Glossarist::ConceptDate.new
concept_date.type = :accepted
concept_date.date = "2010-11-01T00:00:00+00:00"
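
Before assigning a date string it can be useful to confirm that it parses as ISO 8601. The sketch below uses Ruby's standard date library; `parseable_iso8601?` is a hypothetical helper and not how the gem itself parses dates.

```ruby
require "date"

# Hypothetical helper: true when the value parses as an ISO 8601 date
# or date-time, false otherwise.
def parseable_iso8601?(value)
  DateTime.iso8601(value.to_s)
  true
rescue ArgumentError # Date::Error is a subclass of ArgumentError
  false
end

accepted_on = DateTime.iso8601("2010-11-01T00:00:00+00:00")
```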

DetailedDefinition

A definition of the managed term.

It has the following attributes:

content: The text of the definition of the managed term.
sources: List of bibliographic references (Citation) for this particular definition of the managed term.

There are two ways to initialize and populate a detailed definition

Setting the fields by using a hash while initializing

detailed_definition = Glossarist::DetailedDefinition.new({
  content: "plain text reference",
  sources: [<list of citations>],
})

Setting the fields after creating an object

detailed_definition = Glossarist::DetailedDefinition.new
detailed_definition.content = "plain text reference"
detailed_definition.sources = [<list of citations>]

Citation

A citation can be either structured or unstructured. It is structured if its reference contains one or more of the keys id, source, and version (for example { id: "id", source: "source", version: "version" }), and unstructured if its reference is plain text. The structured? and plain? methods report which kind a citation is.

Citation has the following attributes.

ref: A hash or string based on type of citation. Hash if citation is structured or string if citation is plain.
clause: Referred clause of the document.
link: Link to document.

There are two ways to initialize and populate a Citation

Setting the fields by using a hash while initializing

# Unstructured Citation
citation = Glossarist::Citation.new({
  ref: "plain text reference",
  clause: "clause",
  link: "link",
})

# Structured Citation
citation = Glossarist::Citation.new({
  ref: { id: "123", source: "source", version: "1.1" },
  clause: "clause",
  link: "link",
})

Setting the fields after creating an object

citation = Glossarist::Citation.new
citation.ref = <plain or structured ref>
citation.clause = "some clause"
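
The structured/plain distinction can be sketched as follows. `CitationSketch` is a hypothetical Struct that mirrors the rule described in this section; it is not the gem's Citation class.

```ruby
STRUCTURED_REF_KEYS = %i[id source version].freeze

# Hypothetical model: structured when ref is a hash with at least one of
# the known keys, plain when ref is a string.
CitationSketch = Struct.new(:ref, :clause, :link, keyword_init: true) do
  def structured?
    ref.is_a?(Hash) && ref.keys.any? { |k| STRUCTURED_REF_KEYS.include?(k.to_sym) }
  end

  def plain?
    ref.is_a?(String)
  end
end

plain = CitationSketch.new(ref: "plain text reference", clause: "clause")
structured = CitationSketch.new(ref: { id: "123", source: "source", version: "1.1" })
```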

NonVerbRep

Non-verbal representations are associated resources (images, tables, formulas) used to help define a concept (ISO 10241-1 §6.5). They live outside the concept model and are referenced by URI. Resources can be shared across concepts and belong either to the dataset package (relative path) or are externally referenced (URL/URN).

type: String — the type of representation: image, table, or formula.
ref: String — URI reference to the resource (relative path within the GCR package, URN, or URL).
text: String — optional text description or alt text.
sources: Collection of ConceptSource entries — bibliographic sources for the representation.

Example:

non_verbal_rep:
  - type: image
    ref: assets/images/figure-1.svg
    text: Diagram showing the concept hierarchy
  - type: formula
    ref: urn:gcr:assets:formula-eq1
    sources:
      - type: authoritative
        status: identical

ConceptSource

Concept Source has the following fields

status: The status of the managed term in the present context, relative to the term as found in the bibliographic source.
type: The type of the managed term in the present context.
origin: The bibliographic citation for the managed term. This is also aliased as ref.
modification: A description of the modification to the cited definition of the term, if any, as it is to be applied in the present context.

Commands

generate_latex

Convert concepts to LaTeX format.

glossarist generate_latex -p PATH_TO_CONCEPTS

Options:

-p, --concepts-path	Path to the YAML concepts directory
-l, --latex-concepts	Path to a file listing the concepts that should be converted to LaTeX format
-o, --output-file	Output file path
-e, --extra-attributes	List of extra attributes that are not in the standard Glossarist Concept model

package

Create a .gcr ZIP archive from a concept dataset.

glossarist package DIR -o output.gcr --shortname mydataset --version 1.0.0 --uri-prefix urn:iso:std:iso:19111

Options:

-o, --output (required)	Output `.gcr` file path
--shortname (required)	Machine-readable dataset shortname (e.g. `iev`, `iso19111`)
--version (required)	Semantic version (e.g. `1.0.0`)
--title	Human-readable dataset title
--description	Dataset description
--owner	Dataset owner
--register-yaml	Path to register.yaml to include in package
--uri-prefix	URI namespace this dataset provides (e.g. `urn:iec:std:iec:60050`)
--tags	Tags for the dataset
--compiled-formats	Comma-separated compiled formats to bundle (tbx,jsonld,turtle,jsonl)
--concept-uri-template	URI template for concept URIs

Ruby API:

GcrPackage.create_from_directory(
  "path/to/dataset",
  output: "output.gcr",
  shortname: "mydataset",
  version: "1.0.0",
  uri_prefix: "urn:iso:std:iso:19111",
  compiled_formats: ["jsonld", "turtle"],
)

export

Export concepts in machine-readable formats.

glossarist export PATH --format json --output DIR
glossarist export PATH --format jsonld --output DIR --shortname isotc211
glossarist export PATH --format turtle --output DIR
glossarist export PATH --format tbx --output DIR --shortname isotc211
glossarist export PATH --format jsonl --output DIR
glossarist export package.gcr --format json --output DIR

The path can be either a concept dataset directory or a .gcr file. When exporting from a .gcr, the shortname and uri_prefix are automatically resolved from the package metadata.

Output Formats

Format	Output	Files
`json`	Per-concept JSON files	`{concept_id}.json`
`tbx`	Single TBX-XML document (ISO 30042:2019)	`{shortname}.tbx.xml`
`jsonld`	Single JSON-LD file with `@graph`	`{shortname}.jsonld`
`turtle`	Single Turtle file with all concept triples	`{shortname}.ttl`
`jsonl`	JSONL file with one JSON-LD object per line	`{shortname}.jsonl`
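
The file naming in the table can be expressed as a small lookup. `export_file_name` is a hypothetical helper written from the table above; it is not a function provided by the gem.

```ruby
# Hypothetical helper: expected output file name for each export format.
def export_file_name(format, shortname: nil, concept_id: nil)
  case format.to_s
  when "json" then "#{concept_id}.json" # one file per concept
  when "tbx" then "#{shortname}.tbx.xml"
  when "jsonld" then "#{shortname}.jsonld"
  when "turtle" then "#{shortname}.ttl"
  when "jsonl" then "#{shortname}.jsonl"
  else raise ArgumentError, "unsupported format: #{format.inspect}"
  end
end

tbx_name = export_file_name("tbx", shortname: "isotc211")
```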

Options:

--format (required)	Output format: `json`, `tbx`, `jsonld`, `turtle`, or `jsonl`
-o, --output (required)	Output directory
--shortname	Dataset shortname for concept ID prefixing
--uri-prefix	URI/URN prefix for the dataset
--site-url	Base URL of the glossarist site
--title	Dataset title for document header

Ruby API:

# Export to JSON-LD
cmd = Glossarist::CLI::ExportCommand.new("path/to/dataset",
  format: "jsonld", output: "/tmp/export", shortname: "isotc211")
cmd.run

# Transform a single concept to SKOS
skos = Glossarist::Transforms::ConceptToSkosTransform.transform(concept)
puts skos.to_jsonld
puts skos.to_turtle

import

Import terminology concepts from STS XML files into a new or existing dataset.

# Import one or more STS XML files into a new dataset directory
glossarist import iso-8373.xml -o output_dir

# Import into a new GCR package (--shortname and --version required)
glossarist import iso-8373.xml -o iso-8373.gcr \
  --shortname iso-8373 --version 1.0.0 --title "ISO 8373 Robotics"

# Import multiple files into a new dataset
glossarist import iso-8373.xml iso-9000.xml -o combined_dataset

# Import into an existing dataset (dedup by designation + domain)
glossarist import iso-8373.xml --into existing_dataset/

# Import into an existing GCR (re-packages automatically)
glossarist import iso-8373.xml --into existing.gcr

# Control duplicate handling
glossarist import iso-8373.xml --into existing_dataset/ --on-duplicate replace

Deduplication is based on designation + domain (case-insensitive). When duplicates are found, the --on-duplicate strategy determines the behavior:

`skip` (default)	Keep the existing concept, skip the new one
`replace`	Replace the existing concept with the new one
`merge`	Add new localizations to the existing concept (e.g. add French to an English-only concept)
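
The three strategies can be sketched over plain hashes. Everything here is hypothetical (`dedup_key`, `import_concepts`, and the hash shape); in particular, letting existing localizations win during a merge is an assumption, since the text only says that new localizations are added.

```ruby
# Hypothetical concept shape:
#   { designation: "...", domain: "...", localizations: { "eng" => "..." } }

# Case-insensitive key on designation + domain.
def dedup_key(concept)
  [concept[:designation].to_s.downcase, concept[:domain].to_s.downcase]
end

def import_concepts(existing, incoming, on_duplicate: :skip)
  index = existing.to_h { |c| [dedup_key(c), c] }
  skipped = 0

  incoming.each do |concept|
    key = dedup_key(concept)
    current = index[key]
    if current.nil?
      index[key] = concept
      next
    end

    case on_duplicate
    when :skip then skipped += 1
    when :replace then index[key] = concept
    when :merge
      # Assumption: existing localizations win; only new languages are added.
      merged = concept[:localizations].merge(current[:localizations])
      index[key] = current.merge(localizations: merged)
    else
      raise ArgumentError, "unknown strategy: #{on_duplicate.inspect}"
    end
  end

  { concepts: index.values, skipped_count: skipped }
end

existing = [{ designation: "Robot", domain: "Robotics", localizations: { "eng" => "robot" } }]
incoming = [{ designation: "robot", domain: "robotics", localizations: { "fra" => "robot (fr)" } }]
merged_result = import_concepts(existing, incoming, on_duplicate: :merge)
```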

Options:

-o, --output	Output directory or `.gcr` file path (new dataset)
--into	Path to existing dataset directory or `.gcr` file to merge into
--shortname	Dataset shortname (required for GCR output)
--version	Dataset version (required for GCR output)
--title	Dataset title
--description	Dataset description
--owner	Dataset owner
--uri-prefix	URI prefix for the dataset
--on-duplicate	How to handle duplicates: `skip`, `replace`, or `merge`

Ruby API:

require "glossarist/sts"

importer = Glossarist::Sts::Importer.new

# Import into a new dataset directory
result = importer.import_new(
  ["iso-8373.xml", "iso-9000.xml"],
  output: "output_dir",
)
puts result.concepts.length    # total concepts imported
puts result.conflicts.length   # duplicates detected
puts result.skipped_count      # skipped (strategy: skip)

# Import into a new GCR package
result = importer.import_new(
  ["iso-8373.xml"],
  output: "iso-8373.gcr",
  shortname: "iso-8373",
  version: "1.0.0",
  title: "ISO 8373 Robotics Vocabulary",
)

# Import into an existing dataset with merge strategy
importer = Glossarist::Sts::Importer.new(duplicate_strategy: :merge)
result = importer.import_into_existing(
  ["french_supplement.xml"],
  "existing_dataset/",
)
result.concepts.each do |mc|
  puts "#{mc.data.id}: #{mc.localizations.keys.join(', ')}"
end

Import result

import_new and import_into_existing return an ImportResult with:

concepts: Array<ManagedConcept> — the imported concepts
conflicts: Array<DuplicateConflict> — duplicate pairs detected by designation + domain
source_files: Array<String> — the input file paths
skipped_count: Integer — concepts skipped due to duplicates (strategy: skip)

validate

Validate a dataset directory or .gcr file for schema compliance, structural integrity, cross-reference resolution, and data quality.

glossarist validate PATH
glossarist validate PATH --reference-path path/to/gcrs/
glossarist validate PATH --strict

Options:

--strict	Treat warnings as errors
--format	Output format: `text`, `json`, or `yaml`
--reference-path	Path to directory of `.gcr` files for cross-dataset reference validation

Ruby API:

result = DatasetValidator.new.validate("path/to/dataset")
result = DatasetValidator.new.validate("path/to/dataset", reference_path: "gcrs/")
result.valid?   # => true/false
result.errors   # => [...]
result.warnings # => [...]
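
The effect of --strict can be pictured as follows. `ResultSketch` is a hypothetical Struct, and passing strict to valid? is an illustration only; the gem's result object may apply strictness elsewhere.

```ruby
# Hypothetical result object: in strict mode, any warning makes the
# result invalid, just as an error would.
ResultSketch = Struct.new(:errors, :warnings, keyword_init: true) do
  def valid?(strict: false)
    errors.empty? && (!strict || warnings.empty?)
  end
end

with_warning = ResultSketch.new(errors: [], warnings: ["definition 1 has empty content"])
lenient = with_warning.valid?
strict = with_warning.valid?(strict: true)
```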

Validation System

Glossarist provides a rule-based validation framework that checks dataset directories and GCR packages for structural, schema, reference, integrity, quality, and localization issues.

Architecture

The validation system uses the rule-registry pattern (Open/Closed Principle). Each check is a self-describing rule class that subclasses Glossarist::Validation::Rules::Base. New rules are added by subclassing and registering — no existing code is modified.

Glossarist::Validation
├── Rules
│   ├── Base                    # Abstract rule: code, category, severity, scope, check
│   ├── Registry                # Global registry: register, all, for_category, for_scope
│   ├── DatasetContext          # Lazy-loaded access to a directory dataset
│   ├── GcrContext              # Lazy-loaded access to a .gcr package
│   └── (26 rule classes)      # One file per rule
├── ValidationIssue             # Single finding: severity, code, message, location, suggestion
├── BibliographyIndex           # Index of bibliography anchors from sources + bibliography.yaml
├── AssetIndex                  # Index of asset paths from images/ directory or GCR ZIP
├── ConceptValidator            # Orchestrator: runs per-concept rules
├── GcrValidator                # Orchestrator: runs GCR-level rules
└── DatasetValidator            # Orchestrator: runs directory-level + collection rules
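
The rule-registry pattern itself is small enough to sketch in plain Ruby. The `RuleSketch` module below is hypothetical and far simpler than the gem's classes; it only shows how registering a subclass extends what the orchestrators run without touching existing code.

```ruby
module RuleSketch
  # Abstract rule: subclasses describe themselves and implement check.
  class Base
    def code; raise NotImplementedError; end
    def category; raise NotImplementedError; end
    def severity; "warning"; end
    def scope; :concept; end
    def check(_subject); []; end
  end

  # Global registry of rule classes.
  class Registry
    @rules = []

    class << self
      def register(rule_class)
        @rules << rule_class unless @rules.include?(rule_class)
      end

      def all
        @rules.map(&:new)
      end

      def for_category(category)
        all.select { |rule| rule.category == category }
      end

      def for_scope(scope)
        all.select { |rule| rule.scope == scope }
      end
    end
  end

  # A new rule is added by subclassing and registering.
  class EmptyDefinitionRule < Base
    def code; "SKETCH-300"; end
    def category; :quality; end

    def check(concept)
      concept[:definition].to_s.strip.empty? ? ["#{code}: definition is empty"] : []
    end
  end

  Registry.register(EmptyDefinitionRule)
end

issues = RuleSketch::Registry.for_scope(:concept).flat_map do |rule|
  rule.check({ definition: " " })
end
```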

Rule Categories

Rules are classified into six MECE (Mutually Exclusive, Collectively Exhaustive) categories:

Category	What it checks
`structure`	File/directory layout, ZIP contents, required parts
`schema`	Field types, enum values, required fields, YAML syntax
`references`	Cross-references between concepts, bibliography, assets
`integrity`	Metadata vs. reality, filename vs. ID, UUID cross-references
`quality`	Empty content, missing preferred terms, duplicate terms
`localization`	Language coverage, orphaned/missing localization files

Built-in Rules

The following rules are registered by default. Each rule has a unique code (e.g. GLS-001), a severity (error or warning), and a scope (:concept for per-concept checks or :collection for dataset-wide checks).

Structure Rules

Code	Rule	Severity	Scope
GLS-001	Concept ID is present	error	`:concept`
GLS-002	At least one localization per concept	error	`:concept`
GLS-005	Each localization has at least 1 term	error	`:concept`
GLS-020-YAML	bibliography.yaml is valid YAML	error	`:collection`

Schema Rules

Code	Rule	Severity	Scope
GLS-003	Entry status is a valid enum value	error	`:concept`
GLS-201	Concept status is a valid enum value	error	`:concept`
GLS-202/203	Source type and status are valid enums	error	`:concept`
GLS-200	Related concept type is valid	error	`:concept`
GLS-204	Designation normative_status is valid	error	`:concept`
GLS-205	Date type is a valid enum	warning	`:concept`
GLS-206	Language code is exactly 3 lowercase letters	error	`:concept`
GLS-207	Designation type maps to a known subclass	error	`:concept`

Reference Rules

Code	Rule	Severity	Scope
GLS-100	`{{…}}` concept mentions resolve locally	warning	`:concept`
GLS-102	`[anchor]` AsciiDoc xrefs resolve in bibliography index	warning	`:concept`
GLS-103-105	Image references resolve in asset index	warning	`:concept`
GLS-110	Related concept references resolve	warning	`:concept`
GLS-020	Orphaned bibliography entries	warning	`:collection`
GLS-021	Orphaned images	warning	`:collection`
GLS-112	Supersedes/superseded_by symmetry check	warning	`:collection`
GLS-113	No circular related-concept chains	error	`:collection`
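
As an illustration of what a symmetry check such as GLS-112 involves, the sketch below looks for supersedes/superseded_by links that lack the matching link on the other side. `asymmetric_supersessions` and its input shape are hypothetical and do not reflect the gem's implementation.

```ruby
# relations: { concept_id => [[type, target_id], ...] }
# Returns the links whose inverse is missing on the target concept.
def asymmetric_supersessions(relations)
  inverse = { "supersedes" => "superseded_by", "superseded_by" => "supersedes" }
  missing = []

  relations.each do |id, links|
    links.each do |type, target|
      expected = inverse[type]
      next unless expected # other relationship types are ignored here

      back_links = relations.fetch(target, [])
      missing << [id, type, target] unless back_links.include?([expected, id])
    end
  end

  missing
end

relations = {
  "1" => [["supersedes", "2"]],
  "2" => [["superseded_by", "1"]],
  "3" => [["supersedes", "4"], ["see", "1"]],
}
asymmetric = asymmetric_supersessions(relations)
```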

Integrity Rules

Code	Rule	Severity	Scope
GLS-001-U	Concept IDs are unique	error	`:collection`
GLS-011	Concept count matches metadata	error	`:collection`
GLS-012	Language list matches actual languages	warning	`:collection`
GLS-013	Language coverage per concept	warning	`:concept`
GLS-015	Filename matches concept ID (GCR)	error	`:concept`
GLS-016	Concept URI is set or template is applicable	warning	`:collection`
GLS-018	Localized concept UUID cross-references resolve	error	`:concept`
GLS-019	Orphaned localization files	warning	`:collection`

Quality Rules

Code	Rule	Severity	Scope
GLS-300	Definition content is non-empty	warning	`:concept`
GLS-301	At least one preferred designation per localization	warning	`:concept`
GLS-302	No duplicate preferred terms within a language	warning	`:collection`
GLS-304	Source citation is not empty	warning	`:concept`
GLS-306	At least one authoritative source	warning	`:concept`
GLS-307	Date values are parseable	warning	`:concept`

Cross-Reference Validation

The validation system checks that all references in concept content point to resources that actually exist:

Bibliographic cross-references: AsciiDoc <<anchor>> xrefs are checked against a BibliographyIndex built from all ConceptSource entries and an optional bibliography.yaml.
Image/asset references: image::path[] references and model-level asset paths (NonVerbRep, GraphicalSymbol) are checked against an AssetIndex built from the images/ directory or GCR ZIP entries.
Inter-concept references: {{…}} concept mentions are checked against the concept collection for local references, and against registered GCR packages for inter-set URN references.
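
The image/asset check can be pictured as a set lookup. The following is a minimal sketch and not the gem's implementation: `extract_image_paths` and `missing_assets` are hypothetical helper names, the sample paths are invented, and a plain Set stands in for AssetIndex.

```ruby
require "set"

# Matches AsciiDoc block image macros such as image::figures/a.png[]
IMAGE_MACRO = /image::([^\[\s]+)\[[^\]]*\]/

# Returns every image path mentioned in the text, in order of appearance.
def extract_image_paths(text)
  text.scan(IMAGE_MACRO).flatten
end

# Returns the referenced paths that are absent from the asset index.
def missing_assets(text, asset_index)
  extract_image_paths(text).reject { |path| asset_index.include?(path) }
end

asset_index = Set["images/ellipsoid.png"] # stand-in for AssetIndex (assumption)
content = "See image::images/ellipsoid.png[] and image::images/geoid.png[Geoid]."

missing = missing_assets(content, asset_index)
puts missing.inspect
```

Each path left in `missing` would correspond to one finding for the affected concept.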

Validation Result

ValidationResult holds the aggregated findings from all rules:

result = DatasetValidator.new.validate("path/to/dataset")
result.valid?    # => true if no errors
result.errors    # => Array of error strings
result.warnings  # => Array of warning strings
result.issues    # => Array of ValidationIssue objects (full detail)

Each ValidationIssue carries structured metadata:

issue = result.issues.first
issue.severity   # => "error" or "warning"
issue.code       # => "GLS-300"
issue.message    # => "definition 1 has empty content"
issue.location   # => "concepts/100.yaml/eng"
issue.suggestion # => "Add definition text or remove the empty entry"
issue.to_s       # => "[ERROR] [GLS-300] concepts/100.yaml/eng: definition 1 has empty content"
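
Because each issue exposes its severity and code as plain values, findings are easy to post-process, for example in a CI summary. The sketch below uses a Struct named `Issue` as a stand-in for ValidationIssue, limited to attributes shown above; the sample messages and locations are invented for illustration.

```ruby
# Stand-in for ValidationIssue (assumption: only the attributes shown above).
Issue = Struct.new(:severity, :code, :message, :location, keyword_init: true)

issues = [
  Issue.new(severity: "error", code: "GLS-015",
            message: "filename mismatch", location: "concepts/100.yaml"),
  Issue.new(severity: "warning", code: "GLS-300",
            message: "definition 1 has empty content", location: "concepts/100.yaml/eng"),
  Issue.new(severity: "warning", code: "GLS-300",
            message: "definition 2 has empty content", location: "concepts/200.yaml/eng"),
]

# Count findings per rule code.
counts_by_code = issues.map(&:code).tally

# Keep only the findings that should fail a build.
errors = issues.select { |issue| issue.severity == "error" }

counts_by_code.each { |code, count| puts "#{code}: #{count}" }
puts "errors: #{errors.size}"
```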

Adding Custom Rules

New validation rules are added by subclassing Glossarist::Validation::Rules::Base and registering the subclass with the global Glossarist::Validation::Rules::Registry. This extends validation without modifying existing code:

class MyCustomRule < Glossarist::Validation::Rules::Base
  def code = "CUSTOM-001"
  def category = :quality
  def severity = "warning"
  def scope = :concept

  def applicable?(context)
    context.concept&.localizations&.any?
  end

  def check(context)
    issues = []
    context.concept.localizations.each do |l10n|
      # ... your check logic ...
      if some_condition # placeholder: substitute your own predicate on l10n
        issues << issue("something is wrong",
          location: context.file_name,
          suggestion: "how to fix it")
      end
    end
    issues
  end
end

Glossarist::Validation::Rules::Registry.register(MyCustomRule)

Custom rules are automatically picked up by DatasetValidator, GcrValidator, and ConceptValidator on the next validation run.

upgrade

Upgrade a dataset to the current schema version.

glossarist upgrade SOURCE_DIR -o OUTPUT_DIR

Glossarist Concept Repository (GCR)

A GCR (Glossarist Concept Repository) is a distributable, versioned ZIP archive containing glossary concepts and metadata. GCR packages are created from v2 datasets.

GCR Package Format

A .gcr file is a ZIP archive with the following structure:

metadata.yaml          # Package metadata
register.yaml          # Optional register information
concepts/              # Concept YAML files
  102-01-01.yaml
  200.yaml

Creating a GCR Package

CLI:

glossarist package path/to/v2-dataset -o mydataset-1.0.0.gcr \
  --shortname mydataset --version 1.0.0 --uri-prefix urn:iso:std:iso:19111

Ruby API:

GcrPackage.create_from_directory(
  "path/to/v2-dataset",
  output: "mydataset-1.0.0.gcr",
  shortname: "mydataset",
  version: "1.0.0",
  uri_prefix: "urn:iso:std:iso:19111",
  title: "My Dataset",
  description: "A terminology dataset",
)

Loading a GCR Package

pkg = GcrPackage.load("mydataset-1.0.0.gcr")
pkg.metadata     # => Hash with metadata fields
pkg.concepts     # => Array of concept hashes

GCR Metadata

Metadata fields in metadata.yaml:

Field	Description
shortname	Machine-readable dataset identifier (e.g. `iev`)
version	Semantic version (e.g. `1.0.0`)
title	Human-readable title
description	Dataset description
owner	Dataset owner
tags	Array of tags
concept_count	Number of concepts in the package
languages	Array of language codes present
created_at	ISO 8601 timestamp of package creation
glossarist_version	Version of the Glossarist gem used
schema_version	Schema version of the package format
uri_prefix	URI namespace this dataset provides (e.g. `urn:iec:std:iec:60050`)
external_references	Array of `{uri: "…"}` for URI namespaces this dataset references
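
To make the field list concrete, the sketch below parses an illustrative metadata.yaml with Ruby's standard YAML library and compares concept_count with the number of concept files, in the spirit of rule GLS-011. All values and the `concept_files` list are invented for this example; the gem's own loader may behave differently.

```ruby
require "yaml"

# Illustrative metadata.yaml content (values are assumptions of this sketch).
metadata_yaml = <<~YAML
  shortname: mydataset
  version: "1.0.0"
  title: My Dataset
  concept_count: 2
  languages:
    - eng
    - fra
  uri_prefix: "urn:iso:std:iso:19111"
YAML

metadata = YAML.safe_load(metadata_yaml)

# Concept file names as they might appear under concepts/ in the archive.
concept_files = ["concepts/102-01-01.yaml", "concepts/200.yaml"]

# Compare the declared count with the number of concept files.
count_matches = metadata["concept_count"] == concept_files.size
puts "concept_count matches: #{count_matches}"
```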

GCR Statistics

stats = GcrStatistics.from_concepts(concepts)
stats.total_concepts           # => 150
stats.languages                # => ["eng", "fra", "deu"]
stats.concepts_by_status       # => { "valid" => 140, "draft" => 10 }
stats.concepts_with_definitions # => 148
stats.concepts_with_sources    # => 130

Concept Mentions

Concepts can reference other concepts within the same dataset (intra-set) or in different datasets (inter-set) using inline mention syntax. All mentions use double braces {{…}}.

Syntax

The concept mention syntax mirrors HTML <a href="id">display_text</a> — the display text is independent of the target concept’s canonical designation.

Form	Syntax	Example	Resolution
ID only	`{{ID}}`	`{{200}}`	Intra-set: concept 200, auto-display
ID + display	`{{TEXT, ID}}`	`{{geodetic latitude, 200}}`	Intra-set: concept 200, custom display
Designation	`{{TEXT}}`	`{{geodetic latitude}}`	Intra-set: find by designation
URN + display	`{{TEXT, URN}}`	`{{equality, urn:iec:std:iec:60050-102-01-01}}`	Inter-set: resolve by URN
URN only	`{{URN}}`	`{{urn:iec:std:iec:60050-102-01-01}}`	Inter-set: resolve URN, auto-display

URN Schemes

IEC URN (IEV): urn:iec:std:iec:60050-{code} — source is urn:iec:std:iec:60050, concept_id is the IEV code
ISO URN (RFC 5141): urn:iso:std:iso:{std}:…:term:{id} — source is urn:iso:std:iso:{std}, concept_id is the term ID
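
As an illustration of the two schemes, the sketch below splits a URN into source and concept_id. `split_concept_urn` is a hypothetical helper written for this README, and the regular expressions cover only the two patterns described above.

```ruby
# Splits a concept URN into [source, concept_id] following the two schemes above.
# Returns nil for URNs that match neither pattern.
def split_concept_urn(urn)
  case urn
  when /\A(urn:iec:std:iec:60050)-(.+)\z/
    [Regexp.last_match(1), Regexp.last_match(2)]
  when /\A(urn:iso:std:iso:[^:]+):(?:.*:)?term:(.+)\z/
    [Regexp.last_match(1), Regexp.last_match(2)]
  end
end

iec = split_concept_urn("urn:iec:std:iec:60050-102-01-01")
iso = split_concept_urn("urn:iso:std:iso:19111:ed-3:v1:en:term:3.1.32")
puts iec.inspect
puts iso.inspect
```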

Extracting Mentions (Ruby API)

extractor = ReferenceExtractor.new

# From a text string
refs = extractor.extract_from_text("See {{equality, urn:iec:std:iec:60050-102-01-01}} and {{lat, 200}}")
# => [ConceptReference(term: "equality", concept_id: "102-01-01",
#                      source: "urn:iec:std:iec:60050", ref_type: "urn"),
#     ConceptReference(term: "lat", concept_id: "200",
#                      source: nil, ref_type: "local")]

# From all text fields in a localized concept
refs = extractor.extract_from_localized(lc_hash)

# From all language blocks in a concept
refs = extractor.extract_from_concept_hash(concept_hash)

Resolving Mentions (Ruby API)

Resolution uses an adapter chain: route overrides → local → package → remote.

resolver = ReferenceResolver.new

# Register the current dataset for intra-set resolution
resolver.register_self(concepts)

# Register co-loaded GCRs with their URI prefixes
resolver.register_package(iev_concepts, uri_prefix: "urn:iec:std:iec:60050")
resolver.register_package(iso_concepts, uri_prefix: "urn:iso:std:iso:19111")

# Add URI route overrides (e.g. author used wrong URI)
resolver.add_route(from: "urn:iso:std:iso:19115", to: "urn:iso:std:iso:19111")

# Resolve a single reference
ref = ConceptReference.new(term: "equality", concept_id: "102-01-01",
                           source: "urn:iec:std:iec:60050", ref_type: "urn")
resolver.resolve(ref)  # => concept hash

# Validate all references in a package
result = resolver.validate_all(concepts)
result.errors    # => structural errors
result.warnings  # => unresolvable references

GCR Collection & Routing

When multiple GCRs are placed together in a directory, a collection.yaml configures resolution:

# collection.yaml
packages:
  - file: iev-2.0.0.gcr
  - file: iso19111-1.0.0.gcr

routes:
  - from: "urn:iso:std:iso:19115"
    to: "urn:iso:std:iso:19111"

remote:
  - uri_prefix: "urn:iec:std:iec:60050"
    endpoint: "https://vocabulary.example.org/api/concepts"

resolver = ReferenceResolver.new
resolver.load_collection("path/to/gcr_collection/")
# Packages auto-registered with their uri_prefix from metadata
# Route overrides applied
# Remote endpoints registered
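
The effect of the routes section can be sketched independently of the gem. The example below reads the routes from a YAML string with Ruby's standard library and remaps a source URI; `remap_source` is a hypothetical helper, and how ReferenceResolver applies routes internally is not shown here.

```ruby
require "yaml"

collection_yaml = <<~YAML
  routes:
    - from: "urn:iso:std:iso:19115"
      to: "urn:iso:std:iso:19111"
YAML

config = YAML.safe_load(collection_yaml)

# Build a lookup table from the declared routes.
routes = config.fetch("routes", []).to_h { |route| [route["from"], route["to"]] }

# Remap a source URI if a route exists, otherwise keep it unchanged.
def remap_source(source, routes)
  routes.fetch(source, source)
end

puts remap_source("urn:iso:std:iso:19115", routes)
puts remap_source("urn:iec:std:iec:60050", routes)
```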

Resolution Adapters

The resolution framework uses a chain of adapters, each implementing resolve(reference) → concept_hash | nil:

RouteAdapter: Remaps incorrect source URIs before delegation
LocalAdapter: Resolves intra-set references by concept ID or designation lookup
PackageAdapter: Resolves inter-set references by matching source URI to a GCR’s uri_prefix
RemoteAdapter: Resolves via HTTP to an online GCR endpoint
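
The chain idea itself is small: try each adapter in order and return the first non-nil result. The sketch below shows that pattern with two hash-backed adapters. `HashAdapter` and `AdapterChain` are hypothetical classes for illustration, and references are plain hashes here rather than ConceptReference objects.

```ruby
# Each adapter responds to #resolve(reference) and returns a concept hash or nil.
class HashAdapter
  def initialize(concepts_by_id)
    @concepts_by_id = concepts_by_id
  end

  def resolve(reference)
    @concepts_by_id[reference[:concept_id]]
  end
end

# Tries adapters in order and returns the first non-nil result.
class AdapterChain
  def initialize(adapters)
    @adapters = adapters
  end

  def resolve(reference)
    @adapters.each do |adapter|
      concept = adapter.resolve(reference)
      return concept if concept
    end
    nil
  end
end

local = HashAdapter.new({ "200" => { "id" => "200" } })
package = HashAdapter.new({ "102-01-01" => { "id" => "102-01-01" } })
chain = AdapterChain.new([local, package])

found = chain.resolve({ concept_id: "102-01-01" })
unresolved = chain.resolve({ concept_id: "999" })
puts found.inspect
puts unresolved.inspect
```

Returning nil instead of raising lets the next adapter in the chain take over, which matches the resolve(reference) → concept_hash | nil contract described above.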

URN-to-HTTP Resolution

Concept mentions rendered as hyperlinks need HTTP URLs. The UrnResolver converts URNs to their canonical web locations:

# Class-level convenience
url = UrnResolver.resolve("urn:iec:std:iec:60050-102-01-01")
# => "https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=102-01-01"

url = UrnResolver.resolve("urn:iso:std:iso:19111:ed-3:v1:en:term:3.1.32")
# => "https://www.iso.org/obp/ui/#iso:std:iso:19111:ed-3:v1:en:term:3.1.32"

# Also accepts ConceptReference objects
ref = ConceptReference.new(term: "equality", concept_id: "102-01-01",
                           source: "urn:iec:std:iec:60050", ref_type: "urn")
url = UrnResolver.resolve(ref)
# => "https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=102-01-01"

Built-in mappings:

URN Prefix	Target	Example URL
`urn:iec:std:iec:60050-*`	IEC Electropedia	`electropedia.org/iev/iev.nsf/display?openform&ievref=102-01-01`
`urn:iso:*`	ISO Online Browsing Platform	`iso.org/obp/ui/#iso:std:iso:19111:term:3.1.32`
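
A minimal sketch of how the two built-in mappings could be expressed, inferred only from the example URLs above. `urn_to_url` and the two constants are hypothetical names; UrnResolver's actual implementation may differ.

```ruby
ELECTROPEDIA = "https://www.electropedia.org/iev/iev.nsf/display?openform&ievref="
ISO_OBP = "https://www.iso.org/obp/ui/#"

# Maps a URN to a web URL, or returns nil when no mapping applies.
def urn_to_url(urn)
  if urn.start_with?("urn:iec:std:iec:60050-")
    # The IEV code follows the "60050-" prefix.
    ELECTROPEDIA + urn.delete_prefix("urn:iec:std:iec:60050-")
  elsif urn.start_with?("urn:iso:")
    # The fragment is the URN without its leading "urn:".
    ISO_OBP + urn.delete_prefix("urn:")
  end
end

puts urn_to_url("urn:iec:std:iec:60050-102-01-01")
puts urn_to_url("urn:iso:std:iso:19111:ed-3:v1:en:term:3.1.32")
```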

Register custom schemes:

resolver = UrnResolver.new
resolver.register_scheme("urn:example:") do |urn|
  "https://example.org/concepts/#{urn.sub('urn:example:', '')}"
end

Credits

This gem is developed, maintained and funded by Ribose Inc.

License

The gem is available as open source under the terms of the 2-Clause BSD License.
