The Glossarist gem implements the Glossarist model in Ruby. All the entities in the model are available as classes, and all their attributes are available as methods of those classes. The gem also lets you read and write a concept dataset, or build your own collection and save it as a Glossarist model V2 dataset.
The YAML schema for concept and localized_concept is available at Concept model/yaml_schemas
Add this line to your application’s Gemfile:
gem 'glossarist'
And then execute:
bundle install
Or install it yourself as:
gem install glossarist
Glossarist model V2 dataset is a collection of concepts and their localized concepts in the form of YAML files.
The storage structure of the dataset has 2 forms:
- Each concept is stored in a concept YAML file and its localized concepts are stored in separate YAML files. The concept files are stored in the concept folder and the localized concepts are stored in the localized_concept folder.
- Each concept and its related localized concepts are stored in a single YAML file. These concept files are stored directly in the specified path.
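The two storage forms can be told apart by looking at the directory layout. The sketch below is only an illustration: dataset_layout is a hypothetical helper that is not part of the gem, and it assumes the folder name concept described above and a .yaml or .yml extension for concept files.

```ruby
require "tmpdir"

# Hypothetical helper, not part of the gem: guess the storage form of a
# dataset directory from the folder layout described above.
def dataset_layout(path)
  if Dir.exist?(File.join(path, "concept"))
    :separate_files # form 1: concept/ and localized_concept/ folders
  elsif Dir.glob(File.join(path, "*.{yaml,yml}")).any?
    :grouped_files  # form 2: one YAML file per concept, directly in path
  else
    :unknown
  end
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, "200.yaml"), "--- {}\n")
  puts dataset_layout(dir) # prints "grouped_files"
end
```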
To load the glossarist model V2 dataset:
collection = Glossarist::ManagedConceptCollection.new
collection.load_from_files("path/to/glossarist-v2-dataset")
To write the glossarist model V2 dataset to files:
# load the collection from files
collection = Glossarist::ManagedConceptCollection.new
collection.load_from_files("path/to/glossarist-v2-dataset")
# ... Update the collection ...
collection.save_to_files("path/to/glossarist-v2-dataset")
To write the glossarist model V2 dataset with concepts and their localized concepts grouped into single files:
# load the collection from files
collection = Glossarist::ManagedConceptCollection.new
collection.load_from_files("path/to/glossarist-v2-dataset")
# ... Update the collection ...
collection.save_grouped_concepts_to_files("path/to/glossarist-v2-dataset")
This is a collection for managed concepts. It includes the Ruby Enumerable module.
collection = Glossarist::ManagedConceptCollection.new
The following fields are available for ManagedConcept:
- id: String identifier for the concept
- uuid: UUID for the concept
- related: Array of RelatedConcept
- status: Enum for the normative status of the term.
- dates: Array of ConceptDate
- localized_concepts: Hash of all localizations where keys are language codes and values are the UUIDs of the localized concepts.
- domains: Array of ConceptReference. These are the upper concepts (subject areas, concept schemes, organizing concepts) that this concept belongs to across all languages. Each domain is a typed reference (e.g. { concept_id: "103", ref_type: "domain" }).
- localizations: Hash of all localizations for this concept where keys are language codes and values are instances of LocalizedConcept.
There are two ways to initialize and populate a managed concept
- Setting the fields by using a hash while initializing
concept = Glossarist::ManagedConcept.new({
  "data" => {
    "id" => "123",
    "localized_concepts" => {
      "ara" => "<uuid>",
      "eng" => "<uuid>"
    },
    "localizations" => <Array of localized concepts or localized concept hashes>,
    "domains" => [
      { "concept_id" => "103", "ref_type" => "domain" },
    ],
  },
})
- Setting the fields after creating an object
concept = Glossarist::ManagedConcept.new
concept.id = "123"
concept.data.domains = [
  Glossarist::ConceptReference.new(concept_id: "103", ref_type: "domain"),
]
concept.localizations = <Array of localized concepts or localized concept hashes>
Localizations of the term to different languages.
Localized concept has the following fields
- id: An optional identifier for the term, to be used in cross-references.
- uuid: UUID for the concept
- designations: Array of Designations under which the term being defined is known. This method also accepts an array of hashes and converts them to their respective classes.
- domain: URI reference to the subject area or section concept. Can be a relative URI (e.g. section-103-01), a URN (e.g. urn:iec:std:iec:60050-103-01), or a URL (e.g. https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=103-01). This is the per-language upper concept reference, i.e. the subject area for this specific localization. Different languages may assign the same abstract concept to different domains.
- related: Array of RelatedConcept holding per-language concept relationships. Concept hierarchies can differ across languages (e.g. Russian distinguishes голубой/синий (goluboy/siniy) as coordinate basic colors, while English unifies them under "blue"). Language-specific broader/narrower/equivalent relationships go here.
- subject: Subject of the term.
- definition: Array of Detailed Definition of the term.
- non_verb_rep: Array of non-verbal representations used to help define the term.
- notes: Zero or more notes about the term. A note is in Detailed Definition format.
- examples: Zero or more examples of how the term is to be used, in Detailed Definition format.
- language_code: The language of the localization, as an ISO 639 3-letter code.
- script: The script of the localization, as an ISO 15924 4-letter code (e.g. Hans for Simplified Chinese, Latn for Latin, Cyrl for Cyrillic). Optional: when omitted, the default script for the language is assumed.
- system: The ISO 24229 conversion system code used to produce this localization (e.g. Var:jpn-Hrkt:Latn:Hepburn-1886 for Hepburn-romanized Japanese). Optional: only set when the localization is a romanization or transliteration.
- entry_status: Entry status of the concept. Must be one of the following: notValid, valid, superseded, retired.
- classification: Classification of the concept. Must be one of the following: preferred, admitted, deprecated.
A name under which a managed term is known. Designations follow an inheritance hierarchy based on ISO 10241-1 and the Metanorma concept model.
- designation (String): the term text or symbol.
- normative_status (Enum): one of preferred, admitted, deprecated, superseded.
- geographical_area (String): geographic usage region (ISO 3166-1 country code).
- language (String): language of this designation (ISO 639 code). Usually inherited from the LocalizedConcept's language_code, but can differ for borrowed terms.
- script (String): script of the designation text (ISO 15924 code, e.g. Hani for Kanji, Latn for Latin, Cyrl for Cyrillic).
- system (String): ISO 24229 conversion system code used to produce this designation (e.g. Var:jpn-Hrkt:Latn:Hepburn-1886 for Hepburn romanization). Optional: only set when the designation is a romanization or transliteration.
- international (Boolean): whether the designation is used internationally.
- absent (Boolean): whether the designation is intentionally absent in this language.
- pronunciation: Collection of Pronunciation entries, i.e. phonetic or romanized representations of the designation.
- sources: Collection of ConceptSource entries, i.e. bibliographic sources for this designation (ISO 10241-1 §6.8).
- term_type (Enum, ISO 12620): optional classification of the designation's term type. See ISO 12620 Term Types below.
- related: Collection of RelatedConcept entries, i.e. term-level (designation-to-designation) relationships within the same concept entry. Used for linking abbreviated forms to full forms, short forms to expanded forms, etc. (TBX xref types).
Each Pronunciation entry has:
| Attribute | Standard | Description |
|---|---|---|
| content | — | The pronunciation text |
| language | ISO 639 | Language/dialect being pronounced (3-letter code) |
| script | ISO 15924 | Script of the pronunciation text (4-letter code) |
| country | ISO 3166-1 | Country variant (2-letter code, optional) |
| system | ISO 24229 | Conversion system code or identifier (e.g. IPA, Var:jpn-Hrkt:Latn:Hepburn-1886) |
Example:
pronunciation:
  - content: "toːkjoː"
    language: jpn
    script: Latn
    system: IPA
  - content: "Tōkyō"
    language: jpn
    script: Latn
    system: "Var:jpn-Hrkt:Latn:Hepburn-1886"
- prefix (String): text before the designation.
- usage_info (String): disambiguation text for the designation.
- field_of_application (String): IEC "specific use", appears in angle brackets after the designation (e.g. "in communication theory").
- grammar_info: Array of GrammarInfo (gender, number, part of speech).
- acronym (Boolean): is this an acronym?
- initialism (Boolean): is this an initialism?
- truncation (Boolean): is this a truncation?
- text (String): description of the symbol.
- image (String): the graphical symbol (emoji, path, or data URL).
Designation::Base.from_h(options) creates a new designation instance based on the specified type.
Parameters
- options (Hash): The options for creating the designation.
  - "type" (String): The type of designation (expression, symbol, abbreviation, graphical_symbol, letter_symbol). Note: the type key must be a string, not a symbol, so { type: "expression" } will not work.
  - Additional options depend on the specific designation type.
Returns
- Designation::{type}: A new instance of the specified type.
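To see why a symbol key is not picked up, here is a minimal stand-in for a type-dispatching factory. It is a sketch only: SketchExpression, SketchAbbreviation, SKETCH_TYPES and designation_from_h are hypothetical names, and the gem's actual lookup may be implemented differently.

```ruby
# Hypothetical stand-ins for designation classes; not the gem's classes.
SketchExpression = Struct.new(:designation)
SketchAbbreviation = Struct.new(:designation)

SKETCH_TYPES = {
  "expression" => SketchExpression,
  "abbreviation" => SketchAbbreviation,
}.freeze

def designation_from_h(options)
  # Only the string key is read; options[:type] is a different hash key.
  type = options["type"]
  klass = SKETCH_TYPES.fetch(type) do
    raise ArgumentError, "unknown designation type: #{type.inspect}"
  end
  klass.new(options["designation"])
end

designation_from_h({ "type" => "expression", "designation" => "information" })
```

In this sketch, passing { type: "expression" } raises ArgumentError because options["type"] is nil.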
Example
# Expression with field of application
expr = Designation::Base.from_h({
  "type" => "expression",
  "designation" => "information",
  "normative_status" => "preferred",
  "field_of_application" => "in communication theory",
})
# International abbreviation
abbr = Designation::Base.from_h({
  "type" => "abbreviation",
  "designation" => "ISO",
  "international" => true,
  "acronym" => true,
})
The term_type attribute on Designation::Base classifies designations
according to ISO 12620 (also used as TBX termType). This is orthogonal to
the structural designation type (expression/abbreviation/symbol): the
structural type determines how the designation is serialized, while
term_type provides ISO 12620 semantic classification.
| Term type | Description |
|---|---|
| | A shortened form of a word or phrase (general category) |
| | An abbreviation pronounced as a word (e.g. NATO, laser) |
| | A term formed by clipping part of a longer term (e.g. "phone" from "telephone") |
| | A name in common use for a concept (e.g. "water" vs H₂O) |
| | The headword or main term in a terminological entry |
| | A mathematical equation used as a designation |
| | A chemical or mathematical formula (e.g. H₂O, E=mc²) |
| | The complete, unabbreviated form of a designation (e.g. "World Wide Web") |
| | An abbreviation pronounced letter by letter (e.g. "URL", "FBI") |
| | A term used with the same meaning across many languages (e.g. "computer", "algorithm") |
| | A term established by international scientific agreement (e.g. "hydrogen") |
| | A logical or Boolean expression used as a designation |
| | A part number or catalog identifier used as a designation |
| | A multi-word expression or phrase functioning as a term (e.g. "software engineering") |
| | A designation produced by phonetic transcription from another script |
| | A designation produced by transliteration from another script (e.g. "Moskva" from "Москва") |
| | A shortened form of a designation that is not an abbreviation (e.g. "US" for "United States") |
| | A keyboard shortcut or command sequence (e.g. "Ctrl+V" for paste) |
| | A stock keeping unit identifier |
| | A standardized text passage used as a designation |
| | A non-verbal symbol representing the concept (e.g. Ω for ohm) |
| | A term with the same meaning in the same language, used as an alternative designation |
| | A phrase that is synonymous with the preferred designation |
| | A spelling, regional, or stylistic variant of another designation |
Designations can have intra-entry relationships — links between
designations of the same concept. These correspond to TBX xref
elements on term information groups (<tig>).
| Relationship type | Description |
|---|---|
| abbreviated_form_for | This designation is an abbreviated form of the target (e.g. "WWW" → "World Wide Web") |
| | This designation is a short form of the target (e.g. "US" → "United States of America") |
Example:
terms:
  - designation: WWW
    type: abbreviation
    term_type: acronym
    related:
      - type: abbreviated_form_for
        content: "World Wide Web"
  - designation: World Wide Web
    type: expression
    term_type: full_form
A concept related to the current concept with a typed relationship.
- type (Enum): the relationship type (see Relationship Types below).
- content (String): free-text content describing the related concept.
- ref: A Citation reference to the related concept.
There are two ways to initialize and populate a related concept
- Setting the fields by using a hash while initializing
related_concept = Glossarist::RelatedConcept.new({
  content: "Test content",
  type: :supersedes,
  ref: <concept citation>
})
- Setting the fields after creating an object
related_concept = Glossarist::RelatedConcept.new
related_concept.type = "supersedes"
related_concept.content = "designation of the related concept"
related_concept.ref = <Citation object>
Relationship types are drawn from ISO 10241-1, ISO 25964/SKOS, and ISO 12620/TBX. The table below shows each type with its provenance and cross-standard equivalents.
| Glossarist type | Category | ISO 10241-1 | ISO 25964 / SKOS | ISO 12620 / TBX |
|---|---|---|---|---|
| | Lifecycle | deprecates | — | — |
| | Lifecycle | supersedes | — | — |
| | Lifecycle | superseded by | — | — |
| | Hierarchical | broader concept | BT (broaderTerm) | broaderTerm |
| | Hierarchical | narrower concept | NT (narrowerTerm) | narrowerTerm |
| | Hierarchical (generic) | — | BTG (broaderGeneric, is-a) | broaderTermGeneric |
| | Hierarchical (generic) | — | NTG (narrowerGeneric) | narrowerTermGeneric |
| | Hierarchical (partitive) | — | BTP (broaderPartitive, part-whole) | broaderTermPartitive |
| | Hierarchical (partitive) | — | NTP (narrowerPartitive) | narrowerTermPartitive |
| | Hierarchical (instantial) | — | BTI (broaderInstantial, instance-of) | broaderTermInstantial |
| | Hierarchical (instantial) | — | NTI (narrowerInstantial) | narrowerTermInstantial |
| | Equivalence | equivalent | exactMatch | — |
| | Approx. equiv. | — | closeMatch | — |
| | Cross-vocab mapping | — | broadMatch | — |
| | Cross-vocab mapping | — | narrowMatch | — |
| | Cross-vocab mapping | — | relatedMatch | — |
| | Comparative | compare | — | — |
| | Comparative | contrast | — | — |
| | Associative | see also | RT (relatedTerm) | crossReference |
| | Associative | — | — | relatedConcept |
| | Associative (broader) | — | — | relatedConceptBroader |
| | Associative (narrower) | — | — | relatedConceptNarrower |
| | Associative (sequential) | — | — | sequentiallyRelatedConcept |
| | Associative (spatial) | — | — | spatiallyRelatedConcept |
| | Associative (temporal) | — | — | temporallyRelatedConcept |
| | Lexical | — | — | homograph |
| | Lexical | — | — | falseFriend |
A typed reference to another concept, either local (within the same glossary) or external (in another concept registry).
- term (String): the display text for the referenced concept.
- concept_id (String): the identifier of the target concept.
- source (String): the registry URI prefix for external references (e.g. urn:iec:std:iec:60050).
- ref_type (String): the reference type: local, designation, or urn.
- urn (String): a direct URN for the target concept (e.g. urn:iec:std:iec:60050-102-01-01).
Local references use concept_id without source. External references use source + concept_id or a direct urn.
# Local reference
ref = Glossarist::ConceptReference.new(term: "latitude", concept_id: "200", ref_type: "local")
# External reference via URN
ref = Glossarist::ConceptReference.new(
  term: "equality",
  concept_id: "102-01-01",
  source: "urn:iec:std:iec:60050",
  ref_type: "urn",
)
ref.local? # => false
ref.external? # => true
A date relevant to the lifecycle of the managed term.
The following fields are available for the Concept Date
- date: The date associated with the managed term in Iso8601Date format.
- type: An enum denoting the event that occurred on the given date in the lifecycle of the managed term.
There are two ways to initialize and populate a concept date
- Setting the fields by using a hash while initializing
concept_date = Glossarist::ConceptDate.new({
  date: "2010-11-01T00:00:00+00:00",
  type: :accepted,
})
- Setting the fields after creating an object
concept_date = Glossarist::ConceptDate.new
concept_date.type = :accepted
concept_date.date = "2010-11-01T00:00:00+00:00"
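The date string used above is a standard ISO 8601 timestamp. As a quick sanity check outside the gem, Ruby's standard library can parse it. This sketch only uses Time.iso8601 from the stdlib and says nothing about how the gem stores the value internally.

```ruby
require "time"

# Parse the timestamp from the examples above with the standard library.
date = Time.iso8601("2010-11-01T00:00:00+00:00")

puts date.year       # prints 2010
puts date.utc_offset # prints 0
```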
A definition of the managed term.
It has the following attributes:
- content: The text of the definition of the managed term.
- sources: List of bibliographic references (Citation) for this particular definition of the managed term.
There are two ways to initialize and populate a detailed definition
- Setting the fields by using a hash while initializing
detailed_definition = Glossarist::DetailedDefinition.new({
  content: "plain text reference",
  sources: [<list of citations>],
})
- Setting the fields after creating an object
detailed_definition = Glossarist::DetailedDefinition.new
detailed_definition.content = "plain text reference"
detailed_definition.sources = [<list of citations>]
A Citation can be either structured or unstructured. It is structured if its reference contains one or more of the keys id, source and version (e.g. { id: "id", source: "source", version: "version" }), and unstructured if its reference is plain text. Two methods, structured? and plain?, tell you which kind a citation is.
Citation has the following attributes.
- ref: A hash or string based on the type of citation. Hash if the citation is structured, string if it is plain.
- clause: Referred clause of the document.
- link: Link to document.
There are two ways to initialize and populate a Citation
- Setting the fields by using a hash while initializing
# Unstructured Citation
citation = Glossarist::Citation.new({
  ref: "plain text reference",
  clause: "clause",
  link: "link",
})
# Structured Citation
citation = Glossarist::Citation.new({
  ref: { id: "123", source: "source", version: "1.1" },
  clause: "clause",
  link: "link",
})
- Setting the fields after creating an object
citation = Glossarist::Citation.new
citation.ref = <plain or structured ref>
citation.clause = "some clause"
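The structured/plain distinction for a citation ref can be expressed in a few lines. This is a sketch under assumptions: structured_ref? and plain_ref? are hypothetical helpers that restate the rule given earlier in this section (a hash carrying at least one of id, source, version), and they are not the gem's structured? and plain? methods.

```ruby
STRUCTURED_KEYS = %w[id source version].freeze

# Hypothetical helper: a ref counts as structured when it is a Hash that
# carries at least one of the id, source or version keys.
def structured_ref?(ref)
  ref.is_a?(Hash) && ref.keys.any? { |key| STRUCTURED_KEYS.include?(key.to_s) }
end

# Hypothetical helper: a ref counts as plain when it is free text.
def plain_ref?(ref)
  ref.is_a?(String)
end

structured_ref?({ id: "123", source: "source", version: "1.1" }) # => true
plain_ref?("plain text reference")                               # => true
```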
Non-verbal representations are associated resources (images, tables, formulas) used to help define a concept (ISO 10241-1 §6.5). They live outside the concept model and are referenced by URI. Resources can be shared across concepts and belong either to the dataset package (relative path) or are externally referenced (URL/URN).
- type (String): the type of representation: image, table, or formula.
- ref (String): URI reference to the resource (relative path within the GCR package, URN, or URL).
- text (String): optional text description or alt text.
- sources: Collection of ConceptSource entries, i.e. bibliographic sources for the representation.
Example:
non_verbal_rep:
  - type: image
    ref: assets/images/figure-1.svg
    text: Diagram showing the concept hierarchy
  - type: formula
    ref: urn:gcr:assets:formula-eq1
    sources:
      - type: authoritative
        status: identical
Concept Source has the following fields
- status: The status of the managed term in the present context, relative to the term as found in the bibliographic source.
- type: The type of the managed term in the present context.
- origin: The bibliographic citation for the managed term. This is also aliased as ref.
- modification: A description of the modification to the cited definition of the term, if any, as it is to be applied in the present context.
Convert Concepts to Latex format.
glossarist generate_latex -p PATH_TO_CONCEPTS
Options:
| -p, --concepts-path | Path to YAML concepts directory |
| -l, --latex-concepts | Path to a file listing the concepts that should be converted to LaTeX format |
| -o, --output-file | Output file path |
| -e, --extra-attributes | List of extra attributes that are not in the standard Glossarist Concept model |
Create a .gcr ZIP archive from a concept dataset.
glossarist package DIR -o output.gcr --shortname mydataset --version 1.0.0 --uri-prefix urn:iso:std:iso:19111
Options:
| -o, --output (required) | Output |
| --shortname (required) | Machine-readable dataset shortname (e.g. |
| --version (required) | Semantic version (e.g. |
| --title | Human-readable dataset title |
| --description | Dataset description |
| --owner | Dataset owner |
| --register-yaml | Path to register.yaml to include in package |
| --uri-prefix | URI namespace this dataset provides (e.g. |
| --tags | Tags for the dataset |
| --compiled-formats | Comma-separated compiled formats to bundle (tbx,jsonld,turtle,jsonl) |
| --concept-uri-template | URI template for concept URIs |
Ruby API:
GcrPackage.create_from_directory(
  "path/to/dataset",
  output: "output.gcr",
  shortname: "mydataset",
  version: "1.0.0",
  uri_prefix: "urn:iso:std:iso:19111",
  compiled_formats: ["jsonld", "turtle"],
)
Export concepts in machine-readable formats.
glossarist export PATH --format json --output DIR
glossarist export PATH --format jsonld --output DIR --shortname isotc211
glossarist export PATH --format turtle --output DIR
glossarist export PATH --format tbx --output DIR --shortname isotc211
glossarist export PATH --format jsonl --output DIR
glossarist export package.gcr --format json --output DIR
The path can be either a concept dataset directory or a .gcr file. When exporting from a .gcr, the shortname and uri_prefix are automatically resolved from the package metadata.
| Format | Output | Files |
|---|---|---|
| json | Per-concept JSON files | |
| tbx | Single TBX-XML document (ISO 30042:2019) | |
| jsonld | Single JSON-LD file with | |
| turtle | Single Turtle file with all concept triples | |
| jsonl | JSONL file with one JSON-LD object per line | |
Options:
| --format (required) | Output format: |
| -o, --output (required) | Output directory |
| --shortname | Dataset shortname for concept ID prefixing |
| --uri-prefix | URI/URN prefix for the dataset |
| --site-url | Base URL of the glossarist site |
| --title | Dataset title for document header |
Ruby API:
# Export to JSON-LD
cmd = Glossarist::CLI::ExportCommand.new("path/to/dataset",
  format: "jsonld", output: "/tmp/export", shortname: "isotc211")
cmd.run
# Transform a single concept to SKOS
skos = Glossarist::Transforms::ConceptToSkosTransform.transform(concept)
puts skos.to_jsonld
puts skos.to_turtle
Import terminology concepts from STS XML files into a new or existing dataset.
# Import one or more STS XML files into a new dataset directory
glossarist import iso-8373.xml -o output_dir
# Import into a new GCR package (--shortname and --version required)
glossarist import iso-8373.xml -o iso-8373.gcr \
--shortname iso-8373 --version 1.0.0 --title "ISO 8373 Robotics"
# Import multiple files into a new dataset
glossarist import iso-8373.xml iso-9000.xml -o combined_dataset
# Import into an existing dataset (dedup by designation + domain)
glossarist import iso-8373.xml --into existing_dataset/
# Import into an existing GCR (re-packages automatically)
glossarist import iso-8373.xml --into existing.gcr
# Control duplicate handling
glossarist import iso-8373.xml --into existing_dataset/ --on-duplicate replace
Deduplication is based on designation + domain (case-insensitive). When
duplicates are found, the --on-duplicate strategy determines the behavior:
| skip | Keep the existing concept, skip the new one |
| replace | Replace the existing concept with the new one |
| merge | Add new localizations to the existing concept (e.g. add French to an English-only concept) |
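The deduplication rule and the three strategies can be illustrated with plain hashes. This is a sketch under assumptions, not the importer's code: dedup_key and import_concept are hypothetical helpers, the concept hashes use made-up :designation, :domain and :localizations keys, and the :skip default is chosen only for the example.

```ruby
# Hypothetical helper: case-insensitive key built from designation + domain.
def dedup_key(concept)
  [concept[:designation], concept[:domain]].map { |value| value.to_s.downcase }
end

# Hypothetical helper: returns a new index with the incoming concept applied
# according to the chosen duplicate strategy.
def import_concept(existing, incoming, on_duplicate: :skip)
  key = dedup_key(incoming)
  current = existing[key]
  return existing.merge(key => incoming) if current.nil?

  case on_duplicate
  when :skip
    existing
  when :replace
    existing.merge(key => incoming)
  when :merge
    # Keep existing localizations and add only the languages that are new.
    localizations = incoming[:localizations].merge(current[:localizations])
    existing.merge(key => current.merge(localizations: localizations))
  else
    raise ArgumentError, "unknown strategy: #{on_duplicate.inspect}"
  end
end

index = import_concept({}, { designation: "robot", domain: "robotics",
                             localizations: { "eng" => "robot" } })
french = { designation: "Robot", domain: "Robotics",
           localizations: { "fra" => "robot" } }
merged = import_concept(index, french, on_duplicate: :merge)
```

Here "Robot"/"Robotics" collides with "robot"/"robotics" because the key is lower-cased, so the merge strategy adds the French localization to the existing entry instead of creating a second one.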
Options:
| -o, --output | Output directory or |
| --into | Path to existing dataset directory or |
| --shortname | Dataset shortname (required for GCR output) |
| --version | Dataset version (required for GCR output) |
| --title | Dataset title |
| --description | Dataset description |
| --owner | Dataset owner |
| --uri-prefix | URI prefix for the dataset |
| --on-duplicate | How to handle duplicates: |
Ruby API:
require "glossarist/sts"
importer = Glossarist::Sts::Importer.new
# Import into a new dataset directory
result = importer.import_new(
  ["iso-8373.xml", "iso-9000.xml"],
  output: "output_dir",
)
puts result.concepts.length # total concepts imported
puts result.conflicts.length # duplicates detected
puts result.skipped_count # skipped (strategy: skip)
# Import into a new GCR package
result = importer.import_new(
  ["iso-8373.xml"],
  output: "iso-8373.gcr",
  shortname: "iso-8373",
  version: "1.0.0",
  title: "ISO 8373 Robotics Vocabulary",
)
# Import into an existing dataset with merge strategy
importer = Glossarist::Sts::Importer.new(duplicate_strategy: :merge)
result = importer.import_into_existing(
  ["french_supplement.xml"],
  "existing_dataset/",
)
result.concepts.each do |mc|
  puts "#{mc.data.id}: #{mc.localizations.keys.join(', ')}"
end
import_new and import_into_existing return an ImportResult with:
- concepts: Array<ManagedConcept>, the imported concepts
- conflicts: Array<DuplicateConflict>, duplicate pairs detected by designation + domain
- source_files: Array<String>, the input file paths
- skipped_count: Integer, concepts skipped due to duplicates (strategy: skip)
Validate a dataset directory or .gcr file for schema compliance, structural
integrity, cross-reference resolution, and data quality.
glossarist validate PATH
glossarist validate PATH --reference-path path/to/gcrs/
glossarist validate PATH --strict
Options:
| --strict | Treat warnings as errors |
| --format | Output format: |
| --reference-path | Path to directory of |
Ruby API:
result = DatasetValidator.new.validate("path/to/dataset")
result = DatasetValidator.new.validate("path/to/dataset", reference_path: "gcrs/")
result.valid? # => true/false
result.errors # => [...]
result.warnings # => [...]
Glossarist provides a rule-based validation framework that checks dataset directories and GCR packages for structural, schema, reference, integrity, quality, and localization issues.
The validation system uses the rule-registry pattern (Open/Closed
Principle). Each check is a self-describing rule class that subclasses
Glossarist::Validation::Rules::Base. New rules are added by subclassing and
registering — no existing code is modified.
Glossarist::Validation
├── Rules
│ ├── Base # Abstract rule: code, category, severity, scope, check
│ ├── Registry # Global registry: register, all, for_category, for_scope
│ ├── DatasetContext # Lazy-loaded access to a directory dataset
│ ├── GcrContext # Lazy-loaded access to a .gcr package
│ └── (26 rule classes) # One file per rule
├── ValidationIssue # Single finding: severity, code, message, location, suggestion
├── BibliographyIndex # Index of bibliography anchors from sources + bibliography.yaml
├── AssetIndex # Index of asset paths from images/ directory or GCR ZIP
├── ConceptValidator # Orchestrator: runs per-concept rules
├── GcrValidator # Orchestrator: runs GCR-level rules
└── DatasetValidator # Orchestrator: runs directory-level + collection rules
Rules are classified into six MECE (Mutually Exclusive, Collectively Exhaustive) categories:
| Category | What it checks |
|---|---|
| | File/directory layout, ZIP contents, required parts |
| | Field types, enum values, required fields, YAML syntax |
| | Cross-references between concepts, bibliography, assets |
| | Metadata vs. reality, filename vs. ID, UUID cross-references |
| | Empty content, missing preferred terms, duplicate terms |
| | Language coverage, orphaned/missing localization files |
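The rule-registry pattern described above can be sketched in plain Ruby. Everything here is a stand-in: SketchRule, SketchRegistry and DefinitionPresent are hypothetical names, and the gem's Base and Registry classes may differ in detail. Only the method names register, all, for_category and for_scope are taken from the component overview.

```ruby
# Illustrative base class: each rule describes itself.
class SketchRule
  def code = raise(NotImplementedError)
  def category = raise(NotImplementedError)
  def scope = :concept
  def check(_context) = []
end

# Illustrative registry: rules are added without touching existing code.
module SketchRegistry
  @rules = []

  class << self
    def register(rule_class)
      @rules << rule_class unless @rules.include?(rule_class)
    end

    def all = @rules.map(&:new)
    def for_category(category) = all.select { |rule| rule.category == category }
    def for_scope(scope) = all.select { |rule| rule.scope == scope }
  end
end

class DefinitionPresent < SketchRule
  def code = "SKETCH-001"
  def category = :quality

  def check(context)
    context[:definition].to_s.strip.empty? ? ["#{code}: definition is empty"] : []
  end
end

SketchRegistry.register(DefinitionPresent)
issues = SketchRegistry.for_scope(:concept).flat_map { |rule| rule.check({ definition: "" }) }
```

An orchestrator in this style only asks the registry for the rules matching a scope or category and collects what they return, which is why adding a rule needs no change to the orchestrator.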
The following rules are registered by default. Each rule has a unique code
(e.g. GLS-001), a severity (error or warning), and a scope (:concept
for per-concept checks or :collection for dataset-wide checks).
| Code | Rule | Severity | Scope |
|---|---|---|---|
| GLS-001 | Concept ID is present | error | |
| GLS-002 | At least one localization per concept | error | |
| GLS-005 | Each localization has at least 1 term | error | |
| GLS-020-YAML | bibliography.yaml is valid YAML | error | |
| Code | Rule | Severity | Scope |
|---|---|---|---|
| GLS-003 | Entry status is a valid enum value | error | |
| GLS-201 | Concept status is a valid enum value | error | |
| GLS-202/203 | Source type and status are valid enums | error | |
| GLS-200 | Related concept type is valid | error | |
| GLS-204 | Designation normative_status is valid | error | |
| GLS-205 | Date type is a valid enum | warning | |
| GLS-206 | Language code is exactly 3 lowercase letters | error | |
| GLS-207 | Designation type maps to a known subclass | error | |
| Code | Rule | Severity | Scope |
|---|---|---|---|
| GLS-100 | | warning | |
| GLS-102 | | warning | |
| GLS-103-105 | Image references resolve in asset index | warning | |
| GLS-110 | Related concept references resolve | warning | |
| GLS-020 | Orphaned bibliography entries | warning | |
| GLS-021 | Orphaned images | warning | |
| GLS-112 | Supersedes/superseded_by symmetry check | warning | |
| GLS-113 | No circular related-concept chains | error | |
| Code | Rule | Severity | Scope |
|---|---|---|---|
| GLS-001-U | Concept IDs are unique | error | |
| GLS-011 | Concept count matches metadata | error | |
| GLS-012 | Language list matches actual languages | warning | |
| GLS-013 | Language coverage per concept | warning | |
| GLS-015 | Filename matches concept ID (GCR) | error | |
| GLS-016 | Concept URI is set or template is applicable | warning | |
| GLS-018 | Localized concept UUID cross-references resolve | error | |
| GLS-019 | Orphaned localization files | warning | |
| Code | Rule | Severity | Scope |
|---|---|---|---|
| GLS-300 | Definition content is non-empty | warning | |
| GLS-301 | At least one preferred designation per localization | warning | |
| GLS-302 | No duplicate preferred terms within a language | warning | |
| GLS-304 | Source citation is not empty | warning | |
| GLS-306 | At least one authoritative source | warning | |
| GLS-307 | Date values are parseable | warning | |
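To make a few of the listed rules concrete, the sketch below applies three of them (GLS-206, GLS-300, GLS-301) to a localization held as a plain hash. It is an illustration only: check_localization is a hypothetical helper, the hash keys are assumed to follow the field names used in this document, and the gem's rule classes are not reproduced here.

```ruby
# Hypothetical helper: paraphrases three of the rules above on a plain hash.
def check_localization(l10n)
  issues = []
  unless l10n["language_code"].to_s.match?(/\A[a-z]{3}\z/)
    issues << "GLS-206: language code must be exactly 3 lowercase letters"
  end
  if Array(l10n["definition"]).any? { |d| d["content"].to_s.strip.empty? }
    issues << "GLS-300: definition content is empty"
  end
  unless Array(l10n["terms"]).any? { |t| t["normative_status"] == "preferred" }
    issues << "GLS-301: no preferred designation"
  end
  issues
end

good = {
  "language_code" => "eng",
  "definition" => [{ "content" => "angular distance from the equator" }],
  "terms" => [{ "designation" => "latitude", "normative_status" => "preferred" }],
}
check_localization(good) # => []
```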
The validation system checks that all references in concept content point to resources that actually exist:
- Bibliographic cross-references: AsciiDoc [anchor] xrefs are checked against a BibliographyIndex built from all ConceptSource entries and the optional bibliography.yaml.
- Image/asset references: image::path[] references and model-level asset paths (NonVerbRep, GraphicalSymbol) are checked against an AssetIndex built from the images/ directory or GCR ZIP entries.
- Inter-concept references: {{…}} concept mentions are checked against the concept collection for local references, and against registered GCR packages for inter-set URN references.
ValidationResult holds the aggregated findings from all rules:
result = DatasetValidator.new.validate("path/to/dataset")
result.valid? # => true if no errors
result.errors # => Array of error strings
result.warnings # => Array of warning strings
result.issues # => Array of ValidationIssue objects (full detail)
Each ValidationIssue carries structured metadata:
issue = result.issues.first
issue.severity # => "error" or "warning"
issue.code # => "GLS-300"
issue.message # => "definition 1 has empty content"
issue.location # => "concepts/100.yaml/eng"
issue.suggestion # => "Add definition text or remove the empty entry"
issue.to_s # => "[ERROR] [GLS-300] concepts/100.yaml/eng: definition 1 has empty content"
New validation rules are added by subclassing Base and registering with the
global Registry. This extends validation without modifying existing code:
class MyCustomRule < Glossarist::Validation::Rules::Base
  def code = "CUSTOM-001"
  def category = :quality
  def severity = "warning"
  def scope = :concept

  def applicable?(context)
    context.concept&.localizations&.any?
  end

  def check(context)
    issues = []
    context.concept.localizations.each do |l10n|
      # ... your check logic ...
      if some_condition
        issues << issue("something is wrong",
                        location: context.file_name,
                        suggestion: "how to fix it")
      end
    end
    issues
  end
end
Glossarist::Validation::Rules::Registry.register(MyCustomRule)
Custom rules are automatically picked up by DatasetValidator, GcrValidator,
and ConceptValidator on the next validation run.
A GCR (Glossarist Concept Repository) is a distributable, versioned ZIP archive containing glossary concepts and metadata. GCR packages are created from v2 datasets.
A .gcr file is a ZIP archive with the following structure:
metadata.yaml       # Package metadata
register.yaml       # Optional register information
concepts/           # Concept YAML files
  102-01-01.yaml
  200.yaml
CLI:
glossarist package path/to/v2-dataset -o mydataset-1.0.0.gcr \
  --shortname mydataset --version 1.0.0 --uri-prefix urn:iso:std:iso:19111
Ruby API:
GcrPackage.create_from_directory(
  "path/to/v2-dataset",
  output: "mydataset-1.0.0.gcr",
  shortname: "mydataset",
  version: "1.0.0",
  uri_prefix: "urn:iso:std:iso:19111",
  title: "My Dataset",
  description: "A terminology dataset",
)
pkg = GcrPackage.load("mydataset-1.0.0.gcr")
pkg.metadata # => Hash with metadata fields
pkg.concepts # => Array of concept hashes
Metadata fields in metadata.yaml:
| Field | Description |
|---|---|
| shortname | Machine-readable dataset identifier (e.g. …) |
| version | Semantic version (e.g. …) |
| title | Human-readable title |
| description | Dataset description |
| owner | Dataset owner |
| tags | Array of tags |
| concept_count | Number of concepts in the package |
| languages | Array of language codes present |
| created_at | ISO 8601 timestamp of package creation |
| glossarist_version | Version of the Glossarist gem used |
| schema_version | Schema version of the package format |
| uri_prefix | URI namespace this dataset provides (e.g. …) |
| external_references | Array of … |
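A misspelled key in metadata.yaml is easy to miss by eye. The sketch below checks a metadata document against the field names listed above using only Ruby's standard YAML library. The helper name `unknown_metadata_fields` is hypothetical, and whether the gem itself rejects or ignores unknown keys is not stated here, so treat this as an external sanity check rather than a description of the gem's behavior.

```ruby
require "yaml"

# Field names taken from the metadata table above.
KNOWN_METADATA_FIELDS = %w[
  shortname version title description owner tags concept_count languages
  created_at glossarist_version schema_version uri_prefix external_references
].freeze

# Returns the keys of a metadata.yaml document that the table does not list.
# Hypothetical helper; not part of the Glossarist API.
def unknown_metadata_fields(yaml_text)
  metadata = YAML.safe_load(yaml_text) || {}
  metadata.keys.map(&:to_s) - KNOWN_METADATA_FIELDS
end

sample = <<~YAML
  shortname: mydataset
  version: "1.0.0"
  title: My Dataset
  uri_prefix: "urn:iso:std:iso:19111"
  lanugages: [eng, fra]
YAML

p unknown_metadata_fields(sample) # => ["lanugages"]
```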
Concepts can reference other concepts within the same dataset (intra-set) or in different datasets (inter-set) using inline mention syntax. All mentions use double braces {{…}}.
The concept mention syntax mirrors HTML <a href="id">display_text</a> — the display text is independent of the target concept’s canonical designation.
| Form | Syntax | Example | Resolution |
|---|---|---|---|
| ID only | | | Intra-set: concept 200, auto-display |
| ID + display | | | Intra-set: concept 200, custom display |
| Designation | | | Intra-set: find by designation |
| URN + display | | | Inter-set: resolve by URN |
| URN only | | | Inter-set: resolve URN, auto-display |
- IEC URN (IEV)
-
urn:iec:std:iec:60050-{code} (source is urn:iec:std:iec:60050; concept_id is the IEV code)
- ISO URN (RFC 5141)
-
urn:iso:std:iso:{std}:…:term:{id} (source is urn:iso:std:iso:{std}; concept_id is the term ID)
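The two URN formats above can be split into source and concept_id with a pair of regular expressions. The following sketch is illustrative only: `ParsedMention`, `split_urn`, and `scan_mentions` are hypothetical names, not the gem's ReferenceExtractor or ConceptReference. It also assumes that, in a two-part mention, the display text comes first and is separated from the target by the first comma.

```ruby
# Hypothetical struct and helpers for illustration; not the Glossarist API.
ParsedMention = Struct.new(:display, :target, :source, :concept_id, :ref_type,
                           keyword_init: true)

IEC_URN = /\A(urn:iec:std:iec:60050)-(.+)\z/
ISO_URN = /\A(urn:iso:std:iso:[^:]+)(?::.*)?:term:(.+)\z/

# Splits a URN into [source, concept_id]; returns nil for any other format.
def split_urn(urn)
  match = IEC_URN.match(urn) || ISO_URN.match(urn)
  match && [match[1], match[2]]
end

# Finds every {{...}} mention in a text.
# A single-part mention is kept as a local target; telling a concept ID
# from a designation is left to whatever resolves it.
def scan_mentions(text)
  text.scan(/\{\{(.+?)\}\}/).map do |(inner)|
    parts = inner.split(",", 2).map(&:strip)
    display, target = parts.size == 2 ? parts : [nil, parts.first]
    source, concept_id = split_urn(target)
    ParsedMention.new(
      display: display, target: target, source: source,
      concept_id: concept_id || target,
      ref_type: source ? "urn" : "local"
    )
  end
end

mentions = scan_mentions("See {{equality, urn:iec:std:iec:60050-102-01-01}} and {{lat, 200}}")
mentions.each { |mention| p mention.to_h }
```

Anchoring both patterns with `\A` and `\z` keeps a URN in an unrecognized format from being partially matched; such a target falls through to the local branch in this sketch.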
extractor = ReferenceExtractor.new
# From a text string
refs = extractor.extract_from_text("See {{equality, urn:iec:std:iec:60050-102-01-01}} and {{lat, 200}}")
# => [ConceptReference(term: "equality", concept_id: "102-01-01",
# source: "urn:iec:std:iec:60050", ref_type: "urn"),
# ConceptReference(term: "lat", concept_id: "200",
# source: nil, ref_type: "local")]
# From all text fields in a localized concept
refs = extractor.extract_from_localized(lc_hash)
# From all language blocks in a concept
refs = extractor.extract_from_concept_hash(concept_hash)
Resolution uses an adapter chain: route overrides → local → package → remote.
resolver = ReferenceResolver.new
# Register the current dataset for intra-set resolution
resolver.register_self(concepts)
# Register co-loaded GCRs with their URI prefixes
resolver.register_package(iev_concepts, uri_prefix: "urn:iec:std:iec:60050")
resolver.register_package(iso_concepts, uri_prefix: "urn:iso:std:iso:19111")
# Add URI route overrides (e.g. author used wrong URI)
resolver.add_route(from: "urn:iso:std:iso:19115", to: "urn:iso:std:iso:19111")
# Resolve a single reference
ref = ConceptReference.new(term: "equality", concept_id: "102-01-01",
source: "urn:iec:std:iec:60050", ref_type: "urn")
resolver.resolve(ref) # => concept hash
# Validate all references in a package
result = resolver.validate_all(concepts)
result.errors # => structural errors
result.warnings # => unresolvable references
When multiple GCRs are placed together in a directory, a collection.yaml configures resolution:
# collection.yaml
packages:
  - file: iev-2.0.0.gcr
  - file: iso19111-1.0.0.gcr
routes:
  - from: "urn:iso:std:iso:19115"
    to: "urn:iso:std:iso:19111"
remote:
  - uri_prefix: "urn:iec:std:iec:60050"
    endpoint: "https://vocabulary.example.org/api/concepts"
resolver = ReferenceResolver.new
resolver.load_collection("path/to/gcr_collection/")
# Packages auto-registered with their uri_prefix from metadata
# Route overrides applied
# Remote endpoints registered
The resolution framework uses a chain of adapters, each implementing resolve(reference) → concept_hash | nil:
- LocalAdapter
-
Resolves intra-set references by concept ID or designation lookup
- PackageAdapter
-
Resolves inter-set references by matching source URI to a GCR’s uri_prefix
- RouteAdapter
-
Remaps incorrect source URIs before delegation
- RemoteAdapter
-
Resolves via HTTP to an online GCR endpoint
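The chain described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the gem's adapters: the `Sketch*` classes and the `Ref` struct are hypothetical, concepts are assumed to be plain Hashes keyed by concept ID, route remapping is folded into the chain itself, and the remote (HTTP) step is left out.

```ruby
# Hypothetical minimal adapters; the real classes may differ in constructor and behavior.
Ref = Struct.new(:concept_id, :source, keyword_init: true)

class SketchLocalAdapter
  def initialize(concepts)
    @concepts = concepts # concept_id => concept hash
  end

  # Handles only references without a source URI (intra-set).
  def resolve(ref)
    ref.source.nil? ? @concepts[ref.concept_id] : nil
  end
end

class SketchPackageAdapter
  def initialize
    @packages = {} # uri_prefix => { concept_id => concept hash }
  end

  def register(concepts, uri_prefix:)
    @packages[uri_prefix] = concepts
  end

  # Looks the reference up in the package whose uri_prefix equals its source.
  def resolve(ref)
    @packages.dig(ref.source, ref.concept_id)
  end
end

class SketchChain
  def initialize(adapters, routes: {})
    @adapters = adapters
    @routes = routes # incorrect source URI => corrected source URI
  end

  def resolve(ref)
    # Apply any route override before handing the reference to the adapters.
    routed = Ref.new(concept_id: ref.concept_id,
                     source: @routes.fetch(ref.source, ref.source))
    # The first adapter that returns a concept wins.
    @adapters.each do |adapter|
      concept = adapter.resolve(routed)
      return concept if concept
    end
    nil
  end
end

packages = SketchPackageAdapter.new
packages.register({ "3.1.32" => { "id" => "3.1.32" } }, uri_prefix: "urn:iso:std:iso:19111")
chain = SketchChain.new(
  [SketchLocalAdapter.new({ "200" => { "id" => "200" } }), packages],
  routes: { "urn:iso:std:iso:19115" => "urn:iso:std:iso:19111" }
)

p chain.resolve(Ref.new(concept_id: "200"))
p chain.resolve(Ref.new(concept_id: "3.1.32", source: "urn:iso:std:iso:19115"))
p chain.resolve(Ref.new(concept_id: "missing", source: "urn:iec:std:iec:60050"))
```

Returning nil for "not mine" is what lets the adapters stay independent of each other: each one answers only the references it understands and the chain decides the order.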
Concept mentions rendered as hyperlinks need HTTP URLs. The UrnResolver converts URNs to their canonical web locations:
# Class-level convenience
url = UrnResolver.resolve("urn:iec:std:iec:60050-102-01-01")
# => "https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=102-01-01"
url = UrnResolver.resolve("urn:iso:std:iso:19111:ed-3:v1:en:term:3.1.32")
# => "https://www.iso.org/obp/ui/#iso:std:iso:19111:ed-3:v1:en:term:3.1.32"
# Also accepts ConceptReference objects
ref = ConceptReference.new(term: "equality", concept_id: "102-01-01",
source: "urn:iec:std:iec:60050", ref_type: "urn")
url = UrnResolver.resolve(ref)
# => "https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=102-01-01"
Built-in mappings:
| URN Prefix | Target | Example URL |
|---|---|---|
| | IEC Electropedia | |
| | ISO Online Browsing Platform | |
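Prefix-based URN-to-URL mapping can be sketched as a small table of prefixes and blocks. The class below is hypothetical and is not the gem's UrnResolver: the two built-in patterns are inferred from the example URLs shown earlier, and choosing the longest matching prefix is an assumption made for this sketch.

```ruby
# Hypothetical sketch of prefix-based URN-to-URL mapping; not the gem's UrnResolver.
class SketchUrnMapper
  IEV_PREFIX = "urn:iec:std:iec:60050-"

  def initialize
    @schemes = {}
    # Patterns inferred from the example URLs in this section.
    register(IEV_PREFIX) do |urn|
      "https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=" +
        urn.delete_prefix(IEV_PREFIX)
    end
    register("urn:iso:") do |urn|
      "https://www.iso.org/obp/ui/#" + urn.delete_prefix("urn:")
    end
  end

  # Associates a URN prefix with a block that builds the URL.
  def register(prefix, &block)
    @schemes[prefix] = block
  end

  # Uses the longest registered prefix that matches; returns nil when none does.
  def resolve(urn)
    prefix = @schemes.keys.select { |candidate| urn.start_with?(candidate) }
                     .max_by(&:length)
    prefix && @schemes[prefix].call(urn)
  end
end

mapper = SketchUrnMapper.new
puts mapper.resolve("urn:iec:std:iec:60050-102-01-01")
puts mapper.resolve("urn:iso:std:iso:19111:ed-3:v1:en:term:3.1.32")
```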
Register custom schemes:
resolver = UrnResolver.new
resolver.register_scheme("urn:example:") do |urn|
  "https://example.org/concepts/#{urn.sub('urn:example:', '')}"
end
This gem is developed, maintained and funded by Ribose Inc.
The gem is available as open source under the terms of the 2-Clause BSD License.