
glossarist/glossarist-ruby


Glossarist

The Glossarist gem implements the Glossarist model in Ruby. Every entity in the model is available as a class, and every attribute is available as a method on that class. The gem also lets you read and write concept datasets, or build your own collection and save it as a Glossarist model V2 dataset.

The YAML schema for concept and localized_concept is available at Concept model/yaml_schemas

Installation

Add this line to your application’s Gemfile:

gem 'glossarist'

And then execute:

bundle install

Or install it yourself as:

gem install glossarist

Usage

Reading a Glossarist model V2 from files

A Glossarist model V2 dataset is a collection of concepts and their localized concepts, stored as YAML files.

The dataset can be stored in one of two forms:

  1. Each concept is stored in a concept YAML file and its localized concepts are stored in separate YAML files. The concept files are stored in the concept folder and its localized concepts are stored in the localized_concept folder.

  2. Each concept and its related localized concepts are stored in a single YAML file. These concept files are stored directly in the specified path.
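
The two storage forms can be told apart by looking at the directory contents. The following is a minimal sketch of such a check, not part of the gem: the helper name `dataset_layout` and its return labels are invented here, and the folder names are taken from the description above.

```ruby
require "tmpdir"
require "fileutils"

# Guess which of the two storage forms a dataset directory uses.
# Hypothetical helper: the folder names follow the description above and
# the returned labels are invented for this sketch.
def dataset_layout(path)
  split = %w[concept localized_concept].all? do |name|
    Dir.exist?(File.join(path, name))
  end
  return :separate_files if split
  return :grouped_files unless Dir.glob(File.join(path, "*.yaml")).empty?

  :unknown
end

Dir.mktmpdir do |dir|
  FileUtils.mkdir_p(File.join(dir, "concept"))
  FileUtils.mkdir_p(File.join(dir, "localized_concept"))
  puts dataset_layout(dir)
end
```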

To load the glossarist model V2 dataset:

collection = Glossarist::ManagedConceptCollection.new
collection.load_from_files("path/to/glossarist-v2-dataset")

Writing a Glossarist model V2 to files

To write the glossarist model V2 dataset to files:

# load the collection from files
collection = Glossarist::ManagedConceptCollection.new
collection.load_from_files("path/to/glossarist-v2-dataset")

# ... Update the collection ...

collection.save_to_files("path/to/glossarist-v2-dataset")

To write the glossarist model V2 dataset with concepts and their localized concepts grouped into single files:

# load the collection from files
collection = Glossarist::ManagedConceptCollection.new
collection.load_from_files("path/to/glossarist-v2-dataset")

# ... Update the collection ...

collection.save_grouped_concepts_to_files("path/to/glossarist-v2-dataset")

ManagedConceptCollection

This is a collection of managed concepts. It includes Ruby's Enumerable module, so the standard iteration methods are available on it.

collection = Glossarist::ManagedConceptCollection.new
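
Because the collection includes Enumerable, helpers such as map, select, and count should be available on it. The following self-contained sketch shows the pattern with stand-in classes; `TinyConceptCollection` and `TinyConcept` are hypothetical and are not the gem's classes.

```ruby
# Hypothetical stand-in for a managed concept; not the gem's class.
TinyConcept = Struct.new(:id, :status)

# Hypothetical collection showing what including Enumerable provides.
class TinyConceptCollection
  include Enumerable

  def initialize
    @concepts = {}
  end

  # Store a concept under its id.
  def store(concept)
    @concepts[concept.id] = concept
    self
  end

  # Enumerable only requires #each; map, select, count, etc. build on it.
  def each(&block)
    return enum_for(:each) unless block

    @concepts.each_value(&block)
  end
end

collection = TinyConceptCollection.new
collection.store(TinyConcept.new("123", "valid"))
collection.store(TinyConcept.new("124", "retired"))

valid_ids = collection.select { |c| c.status == "valid" }.map(&:id)
puts valid_ids.inspect
```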

ManagedConcept

The following fields are available for ManagedConcept:

id

String identifier for the concept

uuid

UUID for the concept

related

Array of RelatedConcept

status

Enum for the normative status of the term.

dates

Array of ConceptDate

localized_concepts

Hash of all localizations, where keys are language codes and values are the UUIDs of the corresponding localized concepts.

domains

Array of ConceptReference — upper concepts (subject areas, concept schemes, organizing concepts) that this concept belongs to across all languages. Each domain is a typed reference (e.g. { concept_id: "103", ref_type: "domain" }).

localizations

Hash of all localizations for this concept where keys are language codes and values are instances of LocalizedConcept.

There are two ways to initialize and populate a managed concept

  1. Setting the fields by using a hash while initializing

    concept = Glossarist::ManagedConcept.new({
      "data" => {
        "id" => "123",
        "localized_concepts" => {
          "ara" => "<uuid>",
          "eng" => "<uuid>"
        },
        "localizations" => <Array of localized concepts or localized concept hashes>,
        "domains" => [
          { "concept_id" => "103", "ref_type" => "domain" },
        ],
      },
    })
  2. Setting the fields after creating an object

    concept = Glossarist::ManagedConcept.new
    concept.id = "123"
    concept.data.domains = [
      Glossarist::ConceptReference.new(concept_id: "103", ref_type: "domain"),
    ]
    concept.localizations = <Array of localized concepts or localized concept hashes>

LocalizedConcept

Localizations of the term to different languages.

A localized concept has the following fields:

id

An optional identifier for the term, to be used in cross-references.

uuid

UUID for the concept

designations

Array of Designations under which the term being defined is known. This method will also accept an array of hashes for designation and will convert them to their respective classes.

domain

URI reference to the subject area or section concept. Can be a relative URI (e.g. section-103-01), a URN (e.g. urn:iec:std:iec:60050-103-01), or a URL (e.g. https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=103-01). This is the per-language upper concept reference — the subject area for this specific localization. Different languages may assign the same abstract concept to different domains.

related

Array of RelatedConcept — per-language concept relationships. Concept hierarchies can differ across languages (e.g. Russian distinguishes голубой/siniy as coordinate basic colors, while English unifies them under "blue"). Language-specific broader/narrower/equivalent relationships go here.

subject

Subject of the term.

definition

Array of DetailedDefinition entries that define the term.

non_verb_rep

Array of non-verbal representations used to help define the term.

notes

Zero or more notes about the term. A note is in Detailed Definition format.

examples

Zero or more examples of how the term is to be used in Detailed Definition format.

language_code

The language of the localization, as an ISO-639 3-letter code.

script

The script of the localization, as an ISO 15924 4-letter code (e.g. Hans for Simplified Chinese, Latn for Latin, Cyrl for Cyrillic). Optional — when omitted, the default script for the language is assumed.

system

The ISO 24229 conversion system code used to produce this localization (e.g. Var:jpn-Hrkt:Latn:Hepburn-1886 for Hepburn-romanized Japanese). Optional — only set when the localization is a romanization or transliteration.

entry_status

Entry status of the concept. Must be one of the following: notValid, valid, superseded, retired.

classification

Classification of the concept. Must be one of the following: preferred, admitted, deprecated.

Designation

A name under which a managed term is known. Designations follow an inheritance hierarchy based on ISO 10241-1 and the Metanorma concept model.

Designation::Base (common to all types)

designation

String — the term text or symbol.

normative_status

Enum — one of preferred, admitted, deprecated, superseded.

geographical_area

String — geographic usage region (ISO 3166-1 country code).

language

String — language of this designation (ISO 639 code). Usually inherited from the LocalizedConcept’s language_code, but can differ for borrowed terms.

script

String — script of the designation text (ISO 15924 code, e.g. Hani for Kanji, Latn for Latin, Cyrl for Cyrillic).

system

String — ISO 24229 conversion system code used to produce this designation (e.g. Var:jpn-Hrkt:Latn:Hepburn-1886 for Hepburn romanization). Optional — only set when the designation is a romanization or transliteration.

international

Boolean — whether the designation is used internationally.

absent

Boolean — whether the designation is intentionally absent in this language.

pronunciation

Collection of Pronunciation entries — phonetic or romanized representations of the designation.

sources

Collection of ConceptSource entries — bibliographic sources for this designation (ISO 10241-1 §6.8).

term_type

Enum (ISO 12620) — optional classification of the designation’s term type. See ISO 12620 Term Types below.

related

Collection of RelatedConcept entries — term-level (designation-to-designation) relationships within the same concept entry. Used for linking abbreviated forms to full forms, short forms to expanded forms, etc. (TBX xref types).

Each Pronunciation entry has:

Attribute  Standard    Description
content    -           The pronunciation text
language   ISO 639     Language/dialect being pronounced (3-letter code)
script     ISO 15924   Script of the pronunciation text (4-letter code)
country    ISO 3166-1  Country variant (2-letter code, optional)
system     ISO 24229   Conversion system code or identifier (e.g. IPA, Var:jpn-Hrkt:Latn:Hepburn-1886)

Example:

pronunciation:
  - content: "toːkjoː"
    language: jpn
    script: Latn
    system: IPA
  - content: "Tōkyō"
    language: jpn
    script: Latn
    system: "Var:jpn-Hrkt:Latn:Hepburn-1886"
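
As a rough illustration of how such entries can be consumed, the sketch below parses the YAML above with Ruby's standard YAML library and indexes the entries by their system value. It works on plain hashes, not on the gem's Pronunciation class, and the `by_system` name is invented here.

```ruby
require "yaml"

yaml = <<~YAML
  pronunciation:
    - content: "toːkjoː"
      language: jpn
      script: Latn
      system: IPA
    - content: "Tōkyō"
      language: jpn
      script: Latn
      system: "Var:jpn-Hrkt:Latn:Hepburn-1886"
YAML

entries = YAML.safe_load(yaml).fetch("pronunciation")

# Index the entries by conversion system for quick lookup.
by_system = entries.to_h { |entry| [entry["system"], entry["content"]] }

puts by_system["IPA"]
puts by_system["Var:jpn-Hrkt:Latn:Hepburn-1886"]
```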

Designation::Expression (text-based, inherits Base)

prefix

String — text before the designation.

usage_info

String — disambiguation text for the designation.

field_of_application

String — IEC "specific use", appears in angle brackets after the designation (e.g. "in communication theory").

grammar_info

Array of GrammarInfo — gender, number, part of speech.

Designation::Abbreviation (inherits Expression)

acronym

Boolean — is this an acronym?

initialism

Boolean — is this an initialism?

truncation

Boolean — is this a truncation?

Designation::Symbol (inherits Base)

No additional attributes beyond Base.

Designation::LetterSymbol (inherits Symbol)

text

String — the letter symbol text.

Designation::GraphicalSymbol (inherits Symbol)

text

String — description of the symbol.

image

String — the graphical symbol (emoji, path, or data URL).

Factory Method

Designation::Base.from_h(options) creates a new designation instance based on the specified type.

Parameters
  • options (Hash) - The options for creating the designation.

  • "type" (String) - The type of designation (expression, symbol, abbreviation, graphical_symbol, letter_symbol). Note: the "type" key must be a string, not a symbol, so { type: "expression" } will not work.

  • Additional options depend on the specific designation type.

Returns
Designation::{type}

A new instance of specified type.

Example

# Expression with field of application
expr = Glossarist::Designation::Base.from_h({
  "type" => "expression",
  "designation" => "information",
  "normative_status" => "preferred",
  "field_of_application" => "in communication theory",
})

# International abbreviation
abbr = Glossarist::Designation::Base.from_h({
  "type" => "abbreviation",
  "designation" => "ISO",
  "international" => true,
  "acronym" => true,
})
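
To illustrate why the "type" key has to be a string, here is a minimal sketch of a type-keyed factory. It is not the gem's implementation: `SketchExpression`, `SketchAbbreviation`, `SKETCH_TYPES`, and `sketch_from_h` are hypothetical names, and the lookup by string key is an assumption based on the note above.

```ruby
# Hypothetical designation classes; the real gem defines its own hierarchy.
SketchExpression   = Struct.new(:designation, :normative_status)
SketchAbbreviation = Struct.new(:designation, :normative_status)

# Registry keyed by the string value of "type".
SKETCH_TYPES = {
  "expression"   => SketchExpression,
  "abbreviation" => SketchAbbreviation,
}.freeze

# Build an instance from a hash with string keys.
def sketch_from_h(options)
  type = options["type"] # a symbol key (:type) is never consulted here
  klass = SKETCH_TYPES.fetch(type) do
    raise ArgumentError, "unknown designation type: #{type.inspect}"
  end
  klass.new(options["designation"], options["normative_status"])
end

expr = sketch_from_h({ "type" => "expression", "designation" => "information" })
puts expr.class

begin
  sketch_from_h({ type: "expression" }) # symbol key, so "type" is missing
rescue ArgumentError => e
  puts e.message
end
```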

ISO 12620 Term Types

The term_type attribute on Designation::Base classifies designations according to ISO 12620 (also used as TBX termType). This is orthogonal to the structural designation type (expression/abbreviation/symbol): the structural type determines how the designation is serialized, while term_type provides ISO 12620 semantic classification.

Term type Description

abbreviation

A shortened form of a word or phrase (general category)

acronym

An abbreviation pronounced as a word (e.g. NATO, laser)

clipped_term

A term formed by clipping part of a longer term (e.g. "phone" from "telephone")

common_name

A name in common use for a concept (e.g. "water" vs H₂O)

entry_term

The headword or main term in a terminological entry

equation

A mathematical equation used as a designation

formula

A chemical or mathematical formula (e.g. H₂O, E=mc²)

full_form

The complete, unabbreviated form of a designation (e.g. "World Wide Web")

initialism

An abbreviation pronounced letter by letter (e.g. "URL", "FBI")

internationalism

A term used with the same meaning across many languages (e.g. "computer", "algorithm")

international_scientific_term

A term established by international scientific agreement (e.g. "hydrogen")

logical_expression

A logical or Boolean expression used as a designation

part_number

A part number or catalog identifier used as a designation

phraseological_unit

A multi-word expression or phrase functioning as a term (e.g. "software engineering")

transcribed_form

A designation produced by phonetic transcription from another script

transliterated_form

A designation produced by transliteration from another script (e.g. "Moskva" from "Москва")

short_form

A shortened form of a designation that is not an abbreviation (e.g. "US" for "United States")

shortcut

A keyboard shortcut or command sequence (e.g. "Ctrl+V" for paste)

sku

A stock keeping unit identifier

standard_text

A standardized text passage used as a designation

symbol

A non-verbal symbol representing the concept (e.g. Ω for ohm)

synonym

A term with the same meaning in the same language, used as an alternative designation

synonymous_phrase

A phrase that is synonymous with the preferred designation

variant

A spelling, regional, or stylistic variant of another designation

Designation-Level Relationships (TBX xref)

Designations can have intra-entry relationships — links between designations of the same concept. These correspond to TBX xref elements on term information groups (<tig>).

Relationship type Description

abbreviated_form_for

This designation is an abbreviated form of the target (e.g. "WWW" → "World Wide Web")

short_form_for

This designation is a short form of the target (e.g. "US" → "United States of America")

Example:

terms:
  - designation: WWW
    type: abbreviation
    term_type: acronym
    related:
      - type: abbreviated_form_for
        content: "World Wide Web"
  - designation: World Wide Web
    type: expression
    term_type: full_form
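
The sketch below shows one way such intra-entry links could be resolved, working on the parsed YAML as plain hashes. Matching the relationship's content against the designation text of a sibling term is an assumption made for this illustration; the gem may resolve links differently.

```ruby
require "yaml"

entry = YAML.safe_load(<<~YAML)
  terms:
    - designation: WWW
      type: abbreviation
      term_type: acronym
      related:
        - type: abbreviated_form_for
          content: "World Wide Web"
    - designation: World Wide Web
      type: expression
      term_type: full_form
YAML

terms = entry.fetch("terms")
by_designation = terms.to_h { |term| [term["designation"], term] }

# For each relationship, look up the target designation by its text.
links = terms.flat_map do |term|
  Array(term["related"]).map do |rel|
    target = by_designation[rel["content"]]
    [term["designation"], rel["type"], target && target["term_type"]]
  end
end

links.each { |from, type, target_type| puts "#{from} #{type} (#{target_type})" }
```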

RelatedConcept

A concept related to the current concept with a typed relationship.

type

Enum — the relationship type (see Relationship Types below).

content

String — free-text content describing the related concept.

ref

A Citation reference to the related concept.

There are two ways to initialize and populate a related concept

  1. Setting the fields by using a hash while initializing

    related_concept = Glossarist::RelatedConcept.new({
      content: "Test content",
      type: :supersedes,
      ref: <concept citation>
    })
  2. Setting the fields after creating an object

    related_concept = Glossarist::RelatedConcept.new
    related_concept.type = "supersedes"
    related_concept.content = "designation of the related concept"
    related_concept.ref = <Citation object>

Relationship Types

Relationship types are drawn from ISO 10241-1, ISO 25964/SKOS, and ISO 12620/TBX. The table below shows each type with its provenance and cross-standard equivalents.

Glossarist type Category ISO 10241-1 ISO 25964 / SKOS ISO 12620 / TBX

deprecates

Lifecycle

deprecates

supersedes

Lifecycle

supersedes

superseded_by

Lifecycle

superseded by

broader

Hierarchical

broader concept

BT (broaderTerm)

broaderTerm

narrower

Hierarchical

narrower concept

NT (narrowerTerm)

narrowerTerm

broader_generic

Hierarchical (generic)

BTG (broaderGeneric, is-a)

broaderTermGeneric

narrower_generic

Hierarchical (generic)

NTG (narrowerGeneric)

narrowerTermGeneric

broader_partitive

Hierarchical (partitive)

BTP (broaderPartitive, part-whole)

broaderTermPartitive

narrower_partitive

Hierarchical (partitive)

NTP (narrowerPartitive)

narrowerTermPartitive

broader_instantial

Hierarchical (instantial)

BTI (broaderInstantial, instance-of)

broaderTermInstantial

narrower_instantial

Hierarchical (instantial)

NTI (narrowerInstantial)

narrowerTermInstantial

equivalent

Equivalence

equivalent

exactMatch

close_match

Approx. equiv.

closeMatch

broad_match

Cross-vocab mapping

broadMatch

narrow_match

Cross-vocab mapping

narrowMatch

related_match

Cross-vocab mapping

relatedMatch

compare

Comparative

compare

contrast

Comparative

contrast

see

Associative

see also

RT (relatedTerm)

crossReference

related_concept

Associative

relatedConcept

related_concept_broader

Associative (broader)

relatedConceptBroader

related_concept_narrower

Associative (narrower)

relatedConceptNarrower

sequentially_related_concept

Associative (sequential)

sequentiallyRelatedConcept

spatially_related_concept

Associative (spatial)

spatiallyRelatedConcept

temporally_related_concept

Associative (temporal)

temporallyRelatedConcept

homograph

Lexical

homograph

false_friend

Lexical

falseFriend

ConceptReference

A typed reference to another concept, either local (within the same glossary) or external (in another concept registry).

term

String — the display text for the referenced concept.

concept_id

String — the identifier of the target concept.

source

String — the registry URI prefix for external references (e.g. urn:iec:std:iec:60050).

ref_type

String — the reference type: local, designation, or urn.

urn

String — a direct URN for the target concept (e.g. urn:iec:std:iec:60050-102-01-01).

Local references use concept_id without source. External references use source + concept_id or a direct urn.

# Local reference
ref = Glossarist::ConceptReference.new(term: "latitude", concept_id: "200", ref_type: "local")

# External reference via URN
ref = Glossarist::ConceptReference.new(
  term: "equality",
  concept_id: "102-01-01",
  source: "urn:iec:std:iec:60050",
  ref_type: "urn",
)

ref.local?    # => false
ref.external? # => true
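
The local/external distinction described above can be sketched with a small stand-in class. `SketchRef` is hypothetical and only mirrors the stated rule (local references have a concept_id and no source; external references carry a source or a direct urn); it is not the gem's ConceptReference.

```ruby
# Hypothetical stand-in for a concept reference; not the gem's class.
SketchRef = Struct.new(:term, :concept_id, :source, :urn, keyword_init: true) do
  # External when a registry source or a direct URN is present.
  def external?
    [source, urn].any? { |value| !value.nil? && !value.empty? }
  end

  # Local when there is a concept_id and nothing points outside the glossary.
  def local?
    !concept_id.nil? && !external?
  end
end

local_ref = SketchRef.new(term: "latitude", concept_id: "200")
external_ref = SketchRef.new(
  term: "equality",
  concept_id: "102-01-01",
  source: "urn:iec:std:iec:60050",
)

puts local_ref.local?
puts external_ref.external?
```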

Concept Date

A date relevant to the lifecycle of the managed term.

The following fields are available for a Concept Date:

  • date: The date associated with the managed term, in Iso8601Date format.

  • type: An enum denoting the lifecycle event of the managed term that occurred on the given date.

There are two ways to initialize and populate a concept date

  1. Setting the fields by using a hash while initializing

    concept_date = Glossarist::ConceptDate.new({
      date: "2010-11-01T00:00:00+00:00",
      type: :accepted,
    })
  2. Setting the fields after creating an object

    concept_date = Glossarist::ConceptDate.new
    concept_date.type = :accepted
    concept_date.date = "2010-11-01T00:00:00+00:00"
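
The date strings used above are ISO 8601 timestamps. As a quick check outside the gem, Ruby's standard library can parse them; this sketch uses `Time.iso8601` and makes no claim about how the gem's own Iso8601Date type behaves.

```ruby
require "time"

raw = "2010-11-01T00:00:00+00:00"
parsed = Time.iso8601(raw) # raises ArgumentError on malformed input

puts parsed.year
puts parsed.utc.iso8601

# A small predicate built on the same call.
def iso8601_parseable?(value)
  Time.iso8601(value)
  true
rescue ArgumentError
  false
end

puts iso8601_parseable?("not a date")
```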

DetailedDefinition

A definition of the managed term.

It has the following attributes:

content

The text of the definition of the managed term.

sources

List of bibliographic references (Citation) for this particular definition of the managed term.

There are two ways to initialize and populate a detailed definition

  1. Setting the fields by using a hash while initializing

    detailed_definition = Glossarist::DetailedDefinition.new({
      content: "plain text reference",
      sources: [<list of citations>],
    })
  2. Setting the fields after creating an object

    detailed_definition = Glossarist::DetailedDefinition.new
    detailed_definition.content = "plain text reference"
    detailed_definition.sources = [<list of citations>]

Citation

A citation can be either structured or unstructured. It is structured if its reference contains one or more of the keys id, source, and version (e.g. { id: "id", source: "source", version: "version" }), and unstructured if its reference is plain text. The structured? and plain? methods report which kind a citation is.

Citation has the following attributes.

ref

A hash or string based on type of citation. Hash if citation is structured or string if citation is plain.

clause

Referred clause of the document.

link

Link to document.

There are two ways to initialize and populate a Citation

  1. Setting the fields by using a hash while initializing

    # Unstructured Citation
    citation = Glossarist::Citation.new({
      ref: "plain text reference",
      clause: "clause",
      link: "link",
    })
    
    # Structured Citation
    citation = Glossarist::Citation.new({
      ref: { id: "123", source: "source", version: "1.1" },
      clause: "clause",
      link: "link",
    })
  2. Setting the fields after creating an object

    citation = Glossarist::Citation.new
    citation.ref = <plain or structured ref>
    citation.clause = "some clause"
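
The structured/plain distinction can be sketched as follows. `SketchCitation` is a hypothetical stand-in that only mirrors the rule stated above (a hash ref with at least one of id, source, or version is structured); the gem's Citation class may apply additional checks.

```ruby
# Keys that mark a reference as structured, per the description above.
SKETCH_STRUCTURED_KEYS = %i[id source version].freeze

# Hypothetical stand-in for a citation; not the gem's class.
SketchCitation = Struct.new(:ref, :clause, :link, keyword_init: true) do
  # Structured when ref is a hash carrying at least one known key.
  def structured?
    ref.is_a?(Hash) &&
      ref.keys.any? { |key| SKETCH_STRUCTURED_KEYS.include?(key.to_sym) }
  end

  # Plain when the citation is not structured.
  def plain?
    !structured?
  end
end

plain = SketchCitation.new(ref: "plain text reference", clause: "clause")
structured = SketchCitation.new(ref: { id: "123", source: "source", version: "1.1" })

puts plain.plain?
puts structured.structured?
```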

NonVerbRep

Non-verbal representations are associated resources (images, tables, formulas) used to help define a concept (ISO 10241-1 §6.5). They live outside the concept model and are referenced by URI. Resources can be shared across concepts and belong either to the dataset package (relative path) or are externally referenced (URL/URN).

type

String — the type of representation: image, table, or formula.

ref

String — URI reference to the resource (relative path within the GCR package, URN, or URL).

text

String — optional text description or alt text.

sources

Collection of ConceptSource entries — bibliographic sources for the representation.

Example:

non_verbal_rep:
  - type: image
    ref: assets/images/figure-1.svg
    text: Diagram showing the concept hierarchy
  - type: formula
    ref: urn:gcr:assets:formula-eq1
    sources:
      - type: authoritative
        status: identical

ConceptSource

Concept Source has the following fields

status

The status of the managed term in the present context, relative to the term as found in the bibliographic source.

type

The type of the managed term in the present context.

origin

The bibliographic citation for the managed term. This is also aliased as ref.

modification

A description of the modification to the cited definition of the term, if any, as it is to be applied in the present context.

Commands

generate_latex

Convert concepts to LaTeX format.

glossarist generate_latex -p PATH_TO_CONCEPTS

Options:

-p, --concepts-path

Path to the YAML concepts directory

-l, --latex-concepts

Path to a file listing the concepts to convert to LaTeX format

-o, --output-file

Output file path

-e, --extra-attributes

List of extra attributes that are not in the standard Glossarist concept model

package

Create a .gcr ZIP archive from a concept dataset.

glossarist package DIR -o output.gcr --shortname mydataset --version 1.0.0 --uri-prefix urn:iso:std:iso:19111

Options:

-o, --output (required)

Output .gcr file path

--shortname (required)

Machine-readable dataset shortname (e.g. iev, iso19111)

--version (required)

Semantic version (e.g. 1.0.0)

--title

Human-readable dataset title

--description

Dataset description

--owner

Dataset owner

--register-yaml

Path to register.yaml to include in package

--uri-prefix

URI namespace this dataset provides (e.g. urn:iec:std:iec:60050)

--tags

Tags for the dataset

--compiled-formats

Comma-separated compiled formats to bundle (tbx,jsonld,turtle,jsonl)

--concept-uri-template

URI template for concept URIs

Ruby API:

GcrPackage.create_from_directory(
  "path/to/dataset",
  output: "output.gcr",
  shortname: "mydataset",
  version: "1.0.0",
  uri_prefix: "urn:iso:std:iso:19111",
  compiled_formats: ["jsonld", "turtle"],
)

export

Export concepts in machine-readable formats.

glossarist export PATH --format json --output DIR
glossarist export PATH --format jsonld --output DIR --shortname isotc211
glossarist export PATH --format turtle --output DIR
glossarist export PATH --format tbx --output DIR --shortname isotc211
glossarist export PATH --format jsonl --output DIR
glossarist export package.gcr --format json --output DIR

The path can be either a concept dataset directory or a .gcr file. When exporting from a .gcr, the shortname and uri_prefix are automatically resolved from the package metadata.

Output Formats

Format  Output                                       Files
json    Per-concept JSON files                       {concept_id}.json
tbx     Single TBX-XML document (ISO 30042:2019)     {shortname}.tbx.xml
jsonld  Single JSON-LD file with @graph              {shortname}.jsonld
turtle  Single Turtle file with all concept triples  {shortname}.ttl
jsonl   JSONL file with one JSON-LD object per line  {shortname}.jsonl
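
The naming scheme for the output files can be summarized in a few lines of code. This is only a sketch of the mapping listed above; `SKETCH_EXTENSIONS` and `expected_output_files` are hypothetical names, and the exporter itself may name files differently in edge cases.

```ruby
# File extensions for the single-file formats listed above.
SKETCH_EXTENSIONS = {
  "tbx" => "tbx.xml",
  "jsonld" => "jsonld",
  "turtle" => "ttl",
  "jsonl" => "jsonl",
}.freeze

# Return the file names an export in the given format is expected to produce.
def expected_output_files(format, shortname:, concept_ids:)
  return concept_ids.map { |id| "#{id}.json" } if format == "json"

  extension = SKETCH_EXTENSIONS.fetch(format) do
    raise ArgumentError, "unknown format: #{format.inspect}"
  end
  ["#{shortname}.#{extension}"]
end

puts expected_output_files("json", shortname: "isotc211", concept_ids: %w[100 101]).inspect
puts expected_output_files("tbx", shortname: "isotc211", concept_ids: %w[100 101]).inspect
```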

Options:

--format (required)

Output format: json, tbx, jsonld, turtle, or jsonl

-o, --output (required)

Output directory

--shortname

Dataset shortname for concept ID prefixing

--uri-prefix

URI/URN prefix for the dataset

--site-url

Base URL of the glossarist site

--title

Dataset title for document header

Ruby API:

# Export to JSON-LD
cmd = Glossarist::CLI::ExportCommand.new("path/to/dataset",
  format: "jsonld", output: "/tmp/export", shortname: "isotc211")
cmd.run

# Transform a single concept to SKOS
skos = Glossarist::Transforms::ConceptToSkosTransform.transform(concept)
puts skos.to_jsonld
puts skos.to_turtle

import

Import terminology concepts from STS XML files into a new or existing dataset.

# Import one or more STS XML files into a new dataset directory
glossarist import iso-8373.xml -o output_dir

# Import into a new GCR package (--shortname and --version required)
glossarist import iso-8373.xml -o iso-8373.gcr \
  --shortname iso-8373 --version 1.0.0 --title "ISO 8373 Robotics"

# Import multiple files into a new dataset
glossarist import iso-8373.xml iso-9000.xml -o combined_dataset

# Import into an existing dataset (dedup by designation + domain)
glossarist import iso-8373.xml --into existing_dataset/

# Import into an existing GCR (re-packages automatically)
glossarist import iso-8373.xml --into existing.gcr

# Control duplicate handling
glossarist import iso-8373.xml --into existing_dataset/ --on-duplicate replace

Deduplication is based on designation + domain (case-insensitive). When duplicates are found, the --on-duplicate strategy determines the behavior:

skip (default)

Keep the existing concept, skip the new one

replace

Replace the existing concept with the new one

merge

Add new localizations to the existing concept (e.g. add French to an English-only concept)
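
The three strategies can be sketched on simplified data as follows. This is an illustration under stated assumptions, not the importer's code: `SketchConcept`, `dedup_key`, and `import_concept` are hypothetical, each concept carries a single designation, and localizations are reduced to a language-to-text hash.

```ruby
# Hypothetical, simplified concept: one designation, one domain, and a
# language => text hash standing in for localizations.
SketchConcept = Struct.new(:designation, :domain, :localizations, keyword_init: true)

# Case-insensitive deduplication key built from designation + domain.
def dedup_key(concept)
  [concept.designation.downcase, concept.domain.to_s.downcase]
end

# Apply one of the three strategies to an incoming concept and return the
# updated dataset (a hash keyed by dedup key).
def import_concept(existing, incoming, strategy: :skip)
  key = dedup_key(incoming)
  current = existing[key]
  return existing.merge(key => incoming) if current.nil?

  case strategy
  when :skip then existing
  when :replace then existing.merge(key => incoming)
  when :merge
    # Add only the localizations the existing concept lacks.
    merged = current.dup
    merged.localizations = incoming.localizations.merge(current.localizations)
    existing.merge(key => merged)
  else
    raise ArgumentError, "unknown strategy: #{strategy.inspect}"
  end
end

english = SketchConcept.new(designation: "Robot", domain: "robotics",
                            localizations: { "eng" => "robot" })
french = SketchConcept.new(designation: "robot", domain: "Robotics",
                           localizations: { "fra" => "robot" })

dataset = import_concept({}, english)
merged = import_concept(dataset, french, strategy: :merge)
puts merged.values.first.localizations.keys.sort.inspect
```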

Options:

-o, --output

Output directory or .gcr file path (new dataset)

--into

Path to existing dataset directory or .gcr file to merge into

--shortname

Dataset shortname (required for GCR output)

--version

Dataset version (required for GCR output)

--title

Dataset title

--description

Dataset description

--owner

Dataset owner

--uri-prefix

URI prefix for the dataset

--on-duplicate

How to handle duplicates: skip, replace, or merge

Ruby API:

require "glossarist/sts"

importer = Glossarist::Sts::Importer.new

# Import into a new dataset directory
result = importer.import_new(
  ["iso-8373.xml", "iso-9000.xml"],
  output: "output_dir",
)
puts result.concepts.length    # total concepts imported
puts result.conflicts.length   # duplicates detected
puts result.skipped_count      # skipped (strategy: skip)

# Import into a new GCR package
result = importer.import_new(
  ["iso-8373.xml"],
  output: "iso-8373.gcr",
  shortname: "iso-8373",
  version: "1.0.0",
  title: "ISO 8373 Robotics Vocabulary",
)

# Import into an existing dataset with merge strategy
importer = Glossarist::Sts::Importer.new(duplicate_strategy: :merge)
result = importer.import_into_existing(
  ["french_supplement.xml"],
  "existing_dataset/",
)
result.concepts.each do |mc|
  puts "#{mc.data.id}: #{mc.localizations.keys.join(', ')}"
end

Import result

import_new and import_into_existing return an ImportResult with:

concepts

Array<ManagedConcept> — the imported concepts

conflicts

Array<DuplicateConflict> — duplicate pairs detected by designation + domain

source_files

Array<String> — the input file paths

skipped_count

Integer — concepts skipped due to duplicates (strategy: skip)

validate

Validate a dataset directory or .gcr file for schema compliance, structural integrity, cross-reference resolution, and data quality.

glossarist validate PATH
glossarist validate PATH --reference-path path/to/gcrs/
glossarist validate PATH --strict

Options:

--strict

Treat warnings as errors

--format

Output format: text, json, or yaml

--reference-path

Path to directory of .gcr files for cross-dataset reference validation

Ruby API:

result = DatasetValidator.new.validate("path/to/dataset")
result = DatasetValidator.new.validate("path/to/dataset", reference_path: "gcrs/")
result.valid?   # => true/false
result.errors   # => [...]
result.warnings # => [...]

Validation System

Glossarist provides a rule-based validation framework that checks dataset directories and GCR packages for structural, schema, reference, integrity, quality, and localization issues.

Architecture

The validation system uses the rule-registry pattern (Open/Closed Principle). Each check is a self-describing rule class that subclasses Glossarist::Validation::Rules::Base. New rules are added by subclassing and registering — no existing code is modified.

Glossarist::Validation
├── Rules
│   ├── Base                    # Abstract rule: code, category, severity, scope, check
│   ├── Registry                # Global registry: register, all, for_category, for_scope
│   ├── DatasetContext          # Lazy-loaded access to a directory dataset
│   ├── GcrContext              # Lazy-loaded access to a .gcr package
│   └── (26 rule classes)      # One file per rule
├── ValidationIssue             # Single finding: severity, code, message, location, suggestion
├── BibliographyIndex           # Index of bibliography anchors from sources + bibliography.yaml
├── AssetIndex                  # Index of asset paths from images/ directory or GCR ZIP
├── ConceptValidator            # Orchestrator: runs per-concept rules
├── GcrValidator                # Orchestrator: runs GCR-level rules
└── DatasetValidator            # Orchestrator: runs directory-level + collection rules
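
The rule-registry pattern itself is small enough to sketch in plain Ruby. The classes below are hypothetical (`SketchRegistry`, `SketchRule`, `IdPresentRule`) and only show the shape of the pattern: rules describe themselves, register with a registry, and are looked up by category. They are not the classes under Glossarist::Validation.

```ruby
# Hypothetical registry: rules register themselves and are looked up by category.
module SketchRegistry
  @rules = []

  class << self
    def register(rule_class)
      @rules << rule_class unless @rules.include?(rule_class)
    end

    def all
      @rules.dup
    end

    def for_category(category)
      @rules.select { |rule| rule.new.category == category }
    end
  end
end

# Hypothetical abstract rule: subclasses describe themselves and implement check.
class SketchRule
  def code = raise(NotImplementedError)
  def category = raise(NotImplementedError)
  def severity = "error"

  # Returns an array of issue messages for the given concept hash.
  def check(_concept) = []
end

class IdPresentRule < SketchRule
  def code = "SKETCH-001"
  def category = :structure

  def check(concept)
    concept["id"].to_s.empty? ? ["#{code}: concept id is missing"] : []
  end
end

SketchRegistry.register(IdPresentRule)

issues = SketchRegistry.all.flat_map { |rule| rule.new.check({ "id" => "" }) }
puts issues.inspect
```

Adding a new check in this sketch means defining another subclass and registering it; the registry and the existing rules stay untouched.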

Rule Categories

Rules are classified into six MECE (Mutually Exclusive, Collectively Exhaustive) categories:

Category What it checks

structure

File/directory layout, ZIP contents, required parts

schema

Field types, enum values, required fields, YAML syntax

references

Cross-references between concepts, bibliography, assets

integrity

Metadata vs. reality, filename vs. ID, UUID cross-references

quality

Empty content, missing preferred terms, duplicate terms

localization

Language coverage, orphaned/missing localization files

Built-in Rules

The following rules are registered by default. Each rule has a unique code (e.g. GLS-001), a severity (error or warning), and a scope (:concept for per-concept checks or :collection for dataset-wide checks).

Structure Rules

Code          Rule                                   Severity  Scope
GLS-001       Concept ID is present                  error     :concept
GLS-002       At least one localization per concept  error     :concept
GLS-005       Each localization has at least 1 term  error     :concept
GLS-020-YAML  bibliography.yaml is valid YAML        error     :collection

Schema Rules

Code         Rule                                          Severity  Scope
GLS-003      Entry status is a valid enum value            error     :concept
GLS-201      Concept status is a valid enum value          error     :concept
GLS-202/203  Source type and status are valid enums        error     :concept
GLS-200      Related concept type is valid                 error     :concept
GLS-204      Designation normative_status is valid         error     :concept
GLS-205      Date type is a valid enum                     warning   :concept
GLS-206      Language code is exactly 3 lowercase letters  error     :concept
GLS-207      Designation type maps to a known subclass     error     :concept

Reference Rules

Code         Rule                                                   Severity  Scope
GLS-100      {{…}} concept mentions resolve locally                 warning   :concept
GLS-102      [anchor] AsciiDoc xrefs resolve in bibliography index  warning   :concept
GLS-103-105  Image references resolve in asset index                warning   :concept
GLS-110      Related concept references resolve                     warning   :concept
GLS-020      Orphaned bibliography entries                          warning   :collection
GLS-021      Orphaned images                                        warning   :collection
GLS-112      Supersedes/superseded_by symmetry check                warning   :collection
GLS-113      No circular related-concept chains                     error     :collection

Integrity Rules

Code       Rule                                             Severity  Scope
GLS-001-U  Concept IDs are unique                           error     :collection
GLS-011    Concept count matches metadata                   error     :collection
GLS-012    Language list matches actual languages           warning   :collection
GLS-013    Language coverage per concept                    warning   :concept
GLS-015    Filename matches concept ID (GCR)                error     :concept
GLS-016    Concept URI is set or template is applicable     warning   :collection
GLS-018    Localized concept UUID cross-references resolve  error     :concept
GLS-019    Orphaned localization files                      warning   :collection

Quality Rules

Code     Rule                                                 Severity  Scope
GLS-300  Definition content is non-empty                      warning   :concept
GLS-301  At least one preferred designation per localization  warning   :concept
GLS-302  No duplicate preferred terms within a language       warning   :collection
GLS-304  Source citation is not empty                         warning   :concept
GLS-306  At least one authoritative source                    warning   :concept
GLS-307  Date values are parseable                            warning   :concept

Cross-Reference Validation

The validation system checks that all references in concept content point to resources that actually exist:

  • Bibliographic cross-references: AsciiDoc [anchor] xrefs are checked against a BibliographyIndex built from all ConceptSource entries and, if present, bibliography.yaml.

  • Image/asset references: image::path[] references and model-level asset paths (NonVerbRep, GraphicalSymbol) are checked against an AssetIndex built from the images/ directory or GCR ZIP entries.

  • Inter-concept references: {{…}} concept mentions are checked against the concept collection for local references, and against registered GCR packages for inter-set URN references.
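
The image check above can be pictured as "extract every image macro target, then look it up in a set of known assets". The sketch below does this with a regular expression and a Set. The pattern, helper name, and the use of a Set as a stand-in for AssetIndex are all assumptions; the gem's actual index may work differently.

```ruby
require "set"

# Hypothetical pattern for AsciiDoc block image macros: image::path[attrs]
IMAGE_MACRO = /image::([^\[\s]+)\[[^\]]*\]/

# Returns image paths referenced in `text` that are missing from `asset_index`.
def unresolved_images(text, asset_index)
  text.scan(IMAGE_MACRO).flatten.reject { |path| asset_index.include?(path) }
end

assets = Set["images/fig1.png", "images/fig2.svg"]
text = "See image::images/fig1.png[] and image::images/missing.png[Alt text]."
puts unresolved_images(text, assets).inspect
```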

Validation Result

ValidationResult holds the aggregated findings from all rules:

result = DatasetValidator.new.validate("path/to/dataset")
result.valid?    # => true if no errors
result.errors    # => Array of error strings
result.warnings  # => Array of warning strings
result.issues    # => Array of ValidationIssue objects (full detail)

Each ValidationIssue carries structured metadata:

issue = result.issues.first
issue.severity   # => "error" or "warning"
issue.code       # => "GLS-300"
issue.message    # => "definition 1 has empty content"
issue.location   # => "concepts/100.yaml/eng"
issue.suggestion # => "Add definition text or remove the empty entry"
issue.to_s       # => "[ERROR] [GLS-300] concepts/100.yaml/eng: definition 1 has empty content"
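
For readers who want to post-process issues, the following standalone sketch mimics the fields shown above with a plain Struct and groups issues by severity and code. SketchIssue is a hypothetical stand-in, not the gem's ValidationIssue class; only the to_s layout is taken from the example output above.

```ruby
# Hypothetical stand-in for ValidationIssue, for illustration only.
SketchIssue = Struct.new(:severity, :code, :message, :location, keyword_init: true) do
  # Same layout as the to_s example above: [SEVERITY] [CODE] location: message
  def to_s
    "[#{severity.upcase}] [#{code}] #{location}: #{message}"
  end
end

issues = [
  SketchIssue.new(severity: "error", code: "GLS-300",
                  message: "definition 1 has empty content",
                  location: "concepts/100.yaml/eng"),
  SketchIssue.new(severity: "warning", code: "GLS-301",
                  message: "no preferred designation",
                  location: "concepts/200.yaml/fra"),
]

errors = issues.select { |issue| issue.severity == "error" }
by_code = issues.group_by(&:code)

puts errors.map(&:to_s)
puts by_code.keys.inspect
```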

Adding Custom Rules

To add a validation rule, subclass Glossarist::Validation::Rules::Base and register the subclass with the global Registry. This extends validation without modifying existing code:

class MyCustomRule < Glossarist::Validation::Rules::Base
  def code = "CUSTOM-001"
  def category = :quality
  def severity = "warning"
  def scope = :concept

  def applicable?(context)
    context.concept&.localizations&.any?
  end

  def check(context)
    issues = []
    context.concept.localizations.each do |l10n|
      # ... your check logic ...
      if some_condition
        issues << issue("something is wrong",
          location: context.file_name,
          suggestion: "how to fix it")
      end
    end
    issues
  end
end

Glossarist::Validation::Rules::Registry.register(MyCustomRule)

Custom rules are automatically picked up by DatasetValidator, GcrValidator, and ConceptValidator on the next validation run.

upgrade

Upgrade a dataset to the current schema version.

glossarist upgrade SOURCE_DIR -o OUTPUT_DIR

Glossarist Concept Repository (GCR)

A GCR (Glossarist Concept Repository) is a distributable, versioned ZIP archive containing glossary concepts and metadata. GCR packages are created from v2 datasets.

GCR Package Format

A .gcr file is a ZIP archive with the following structure:

metadata.yaml          # Package metadata
register.yaml          # Optional register information
concepts/              # Concept YAML files
  102-01-01.yaml
  200.yaml

Creating a GCR Package

CLI:

glossarist package path/to/v2-dataset -o mydataset-1.0.0.gcr \
  --shortname mydataset --version 1.0.0 --uri-prefix urn:iso:std:iso:19111

Ruby API:

GcrPackage.create_from_directory(
  "path/to/v2-dataset",
  output: "mydataset-1.0.0.gcr",
  shortname: "mydataset",
  version: "1.0.0",
  uri_prefix: "urn:iso:std:iso:19111",
  title: "My Dataset",
  description: "A terminology dataset",
)

Loading a GCR Package

pkg = GcrPackage.load("mydataset-1.0.0.gcr")
pkg.metadata     # => Hash with metadata fields
pkg.concepts     # => Array of concept hashes

GCR Metadata

Metadata fields in metadata.yaml:

shortname

Machine-readable dataset identifier (e.g. iev)

version

Semantic version (e.g. 1.0.0)

title

Human-readable title

description

Dataset description

owner

Dataset owner

tags

Array of tags

concept_count

Number of concepts in the package

languages

Array of language codes present

created_at

ISO 8601 timestamp of package creation

glossarist_version

Version of the Glossarist gem used

schema_version

Schema version of the package format

uri_prefix

URI namespace this dataset provides (e.g. urn:iec:std:iec:60050)

external_references

Array of {uri: "…​"} for URI namespaces this dataset references
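
As a rough picture of how several of these fields relate to the data, the sketch below assembles a metadata hash from concept hashes and round-trips it through YAML using only the standard library. The helper name and the concept hash shape (a "languages" array per concept) are assumptions for illustration; packages should be built with the gem's own tooling.

```ruby
require "yaml"
require "time"

# Hypothetical helper: derives a subset of the metadata fields listed above.
def build_metadata(concepts, shortname:, version:, uri_prefix:, created_at: Time.now.utc)
  {
    "shortname" => shortname,
    "version" => version,
    "concept_count" => concepts.size,
    "languages" => concepts.flat_map { |c| c.fetch("languages", []) }.uniq.sort,
    "created_at" => created_at.iso8601,
    "uri_prefix" => uri_prefix,
  }
end

concepts = [
  { "id" => "100", "languages" => ["eng", "fra"] },
  { "id" => "200", "languages" => ["eng"] },
]
metadata = build_metadata(concepts, shortname: "mydataset", version: "1.0.0",
                                    uri_prefix: "urn:iso:std:iso:19111")
puts YAML.dump(metadata)
```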

GCR Statistics

stats = GcrStatistics.from_concepts(concepts)
stats.total_concepts           # => 150
stats.languages                # => ["eng", "fra", "deu"]
stats.concepts_by_status       # => { "valid" => 140, "draft" => 10 }
stats.concepts_with_definitions # => 148
stats.concepts_with_sources    # => 130
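
The figures above are simple aggregates over the concept list. A standalone sketch of how such numbers could be derived is shown here. The concept hash shape ("status" and "localizations" keys) and the summarize_concepts helper are assumptions, not the structure GcrStatistics actually consumes.

```ruby
# Hypothetical helper; the concept hash shape is assumed for illustration.
def summarize_concepts(concepts)
  languages = concepts.flat_map { |c| c["localizations"].keys }.uniq
  by_status = concepts.map { |c| c["status"] }.tally
  # A concept counts as defined when any localization has non-blank text.
  with_definitions = concepts.count do |c|
    c["localizations"].values.any? { |l10n| !l10n["definition"].to_s.strip.empty? }
  end

  {
    total_concepts: concepts.size,
    languages: languages,
    concepts_by_status: by_status,
    concepts_with_definitions: with_definitions,
  }
end

concepts = [
  { "status" => "valid", "localizations" => { "eng" => { "definition" => "angle from the equator" } } },
  { "status" => "draft", "localizations" => { "eng" => { "definition" => "" }, "fra" => {} } },
]
puts summarize_concepts(concepts).inspect
```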

Concept Mentions

Concepts can reference other concepts within the same dataset (intra-set) or in different datasets (inter-set) using inline mention syntax. All mentions use double braces {{…​}}.

Syntax

The concept mention syntax mirrors an HTML link, <a href="id">display_text</a>: the display text is independent of the target concept’s canonical designation.

Form | Syntax | Example | Resolution
ID only | {{ID}} | {{200}} | Intra-set: concept 200, auto-display
ID + display | {{TEXT, ID}} | {{geodetic latitude, 200}} | Intra-set: concept 200, custom display
Designation | {{TEXT}} | {{geodetic latitude}} | Intra-set: find by designation
URN + display | {{TEXT, URN}} | {{equality, urn:iec:std:iec:60050-102-01-01}} | Inter-set: resolve by URN
URN only | {{URN}} | {{urn:iec:std:iec:60050-102-01-01}} | Inter-set: resolve URN, auto-display
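
The forms above can be told apart with very little parsing. The sketch below is one possible reading of the syntax, assuming the target is whatever follows the last comma and that a target starting with "urn:" is inter-set. SketchMention and parse_mentions are hypothetical names; a single-part local mention is left as "ID or designation" because the two can only be distinguished at lookup time.

```ruby
# Hypothetical value object; not the gem's ConceptReference.
SketchMention = Struct.new(:display, :target, :kind, keyword_init: true)

MENTION_PATTERN = /\{\{(.+?)\}\}/

# Parses every {{...}} mention in `text`.
# display is nil when the mention has no custom display text.
def parse_mentions(text)
  text.scan(MENTION_PATTERN).flatten.map do |body|
    # Split on the last comma so display text may itself contain commas.
    display, _comma, target = body.rpartition(",").map(&:strip)
    SketchMention.new(
      display: display.empty? ? nil : display,
      target: target,
      kind: target.start_with?("urn:") ? :urn : :local
    )
  end
end

text = "See {{equality, urn:iec:std:iec:60050-102-01-01}}, {{lat, 200}} and {{200}}."
parse_mentions(text).each { |m| puts m.to_h.inspect }
```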

URN Schemes

IEC URN (IEV)

urn:iec:std:iec:60050-{code} — source is urn:iec:std:iec:60050, concept_id is the IEV code

ISO URN (RFC 5141)

urn:iso:std:iso:{std}:…​:term:{id} — source is urn:iso:std:iso:{std}, concept_id is the term ID
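
The split into source and concept_id described above can be sketched with two regular expressions. The patterns and the split_concept_urn helper are assumptions written to match the two schemes as stated here; they are not taken from the gem and may not cover every URN variant it accepts.

```ruby
# Hypothetical patterns for the two URN schemes described above.
IEC_URN_PATTERN = /\A(urn:iec:std:iec:60050)-(.+)\z/
ISO_URN_PATTERN = /\A(urn:iso:std:iso:[^:]+)(?::.*)?:term:(.+)\z/

# Returns { source:, concept_id: } or nil when the URN matches neither scheme.
def split_concept_urn(urn)
  match = IEC_URN_PATTERN.match(urn) || ISO_URN_PATTERN.match(urn)
  return nil unless match

  { source: match[1], concept_id: match[2] }
end

puts split_concept_urn("urn:iec:std:iec:60050-102-01-01").inspect
puts split_concept_urn("urn:iso:std:iso:19111:ed-3:v1:en:term:3.1.32").inspect
```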

Extracting Mentions (Ruby API)

extractor = ReferenceExtractor.new

# From a text string
refs = extractor.extract_from_text("See {{equality, urn:iec:std:iec:60050-102-01-01}} and {{lat, 200}}")
# => [ConceptReference(term: "equality", concept_id: "102-01-01",
#                      source: "urn:iec:std:iec:60050", ref_type: "urn"),
#     ConceptReference(term: "lat", concept_id: "200",
#                      source: nil, ref_type: "local")]

# From all text fields in a localized concept
refs = extractor.extract_from_localized(lc_hash)

# From all language blocks in a concept
refs = extractor.extract_from_concept_hash(concept_hash)

Resolving Mentions (Ruby API)

Resolution uses an adapter chain: route overrides → local → package → remote.

resolver = ReferenceResolver.new

# Register the current dataset for intra-set resolution
resolver.register_self(concepts)

# Register co-loaded GCRs with their URI prefixes
resolver.register_package(iev_concepts, uri_prefix: "urn:iec:std:iec:60050")
resolver.register_package(iso_concepts, uri_prefix: "urn:iso:std:iso:19111")

# Add URI route overrides (e.g. author used wrong URI)
resolver.add_route(from: "urn:iso:std:iso:19115", to: "urn:iso:std:iso:19111")

# Resolve a single reference
ref = ConceptReference.new(term: "equality", concept_id: "102-01-01",
                           source: "urn:iec:std:iec:60050", ref_type: "urn")
resolver.resolve(ref)  # => concept hash

# Validate all references in a package
result = resolver.validate_all(concepts)
result.errors    # => structural errors
result.warnings  # => unresolvable references

GCR Collection & Routing

When multiple GCRs are placed together in a directory, a collection.yaml configures resolution:

# collection.yaml
packages:
  - file: iev-2.0.0.gcr
  - file: iso19111-1.0.0.gcr

routes:
  - from: "urn:iso:std:iso:19115"
    to: "urn:iso:std:iso:19111"

remote:
  - uri_prefix: "urn:iec:std:iec:60050"
    endpoint: "https://vocabulary.example.org/api/concepts"

resolver = ReferenceResolver.new
resolver.load_collection("path/to/gcr_collection/")
# Packages auto-registered with their uri_prefix from metadata
# Route overrides applied
# Remote endpoints registered
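
To inspect a collection.yaml outside the resolver, the file can be read with the standard YAML library. The sketch below assumes the three top-level keys shown above (packages, routes, remote); the read_collection_config helper is hypothetical and does not register anything with a resolver.

```ruby
require "yaml"

# Hypothetical helper: turns collection.yaml text into simple Ruby structures.
def read_collection_config(yaml_text)
  config = YAML.safe_load(yaml_text) || {}
  {
    package_files: config.fetch("packages", []).map { |pkg| pkg["file"] },
    routes: config.fetch("routes", []).to_h { |r| [r["from"], r["to"]] },
    remotes: config.fetch("remote", []).to_h { |r| [r["uri_prefix"], r["endpoint"]] },
  }
end

yaml_text = <<~YAML
  packages:
    - file: iev-2.0.0.gcr
    - file: iso19111-1.0.0.gcr
  routes:
    - from: "urn:iso:std:iso:19115"
      to: "urn:iso:std:iso:19111"
  remote:
    - uri_prefix: "urn:iec:std:iec:60050"
      endpoint: "https://vocabulary.example.org/api/concepts"
YAML

puts read_collection_config(yaml_text).inspect
```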

Resolution Adapters

The resolution framework uses a chain of adapters, each implementing resolve(reference) → concept_hash | nil:

LocalAdapter

Resolves intra-set references by concept ID or designation lookup

PackageAdapter

Resolves inter-set references by matching source URI to a GCR’s uri_prefix

RouteAdapter

Remaps incorrect source URIs before delegation

RemoteAdapter

Resolves via HTTP to an online GCR endpoint
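
The adapter chain is a chain-of-responsibility: each adapter either returns a concept hash or nil, and the first hit wins. The sketch below illustrates that idea with two in-memory adapters and a route table applied before delegation. All class names here (SketchRef, LocalLookup, PackageLookup, LookupChain) are hypothetical, and the remote adapter is omitted; this is not the gem's implementation.

```ruby
# Hypothetical reference object with only the fields the sketch needs.
SketchRef = Struct.new(:concept_id, :source, keyword_init: true)

# Resolves references that carry no source URI (intra-set).
class LocalLookup
  def initialize(concepts)
    @by_id = concepts.to_h { |c| [c["id"], c] }
  end

  def resolve(ref)
    ref.source.nil? ? @by_id[ref.concept_id] : nil
  end
end

# Resolves references whose source equals this package's URI prefix.
class PackageLookup
  def initialize(concepts, uri_prefix:)
    @uri_prefix = uri_prefix
    @by_id = concepts.to_h { |c| [c["id"], c] }
  end

  def resolve(ref)
    ref.source == @uri_prefix ? @by_id[ref.concept_id] : nil
  end
end

# Applies route overrides, then asks each adapter in order.
class LookupChain
  def initialize(adapters, routes: {})
    @adapters = adapters
    @routes = routes
  end

  def resolve(ref)
    source = @routes.fetch(ref.source, ref.source)
    routed = SketchRef.new(concept_id: ref.concept_id, source: source)
    @adapters.each do |adapter|
      hit = adapter.resolve(routed)
      return hit if hit
    end
    nil
  end
end
```

Because every adapter shares the same resolve signature, adding another lookup strategy only means appending one more object to the list.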

URN-to-HTTP Resolution

Concept mentions rendered as hyperlinks need HTTP URLs. The UrnResolver converts URNs to their canonical web locations:

# Class-level convenience
url = UrnResolver.resolve("urn:iec:std:iec:60050-102-01-01")
# => "https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=102-01-01"

url = UrnResolver.resolve("urn:iso:std:iso:19111:ed-3:v1:en:term:3.1.32")
# => "https://www.iso.org/obp/ui/#iso:std:iso:19111:ed-3:v1:en:term:3.1.32"

# Also accepts ConceptReference objects
ref = ConceptReference.new(term: "equality", concept_id: "102-01-01",
                           source: "urn:iec:std:iec:60050", ref_type: "urn")
url = UrnResolver.resolve(ref)
# => "https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=102-01-01"

Built-in mappings:

URN Prefix | Target | Example URL
urn:iec:std:iec:60050-* | IEC Electropedia | electropedia.org/iev/iev.nsf/display?openform&ievref=102-01-01
urn:iso:* | ISO Online Browsing Platform | iso.org/obp/ui/#iso:std:iso:19111:term:3.1.32

Register custom schemes:

resolver = UrnResolver.new
resolver.register_scheme("urn:example:") do |urn|
  "https://example.org/concepts/#{urn.sub('urn:example:', '')}"
end
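
For intuition about prefix-based scheme registration, here is a small standalone sketch that maps URN prefixes to URL-building blocks and picks the longest matching prefix. SketchUrlMapper is a hypothetical class; the two built-in URL shapes are copied from the examples above, and how UrnResolver chooses between overlapping prefixes is not stated in this document, so the longest-prefix rule is an assumption.

```ruby
# Hypothetical prefix-to-URL mapper, for illustration only.
class SketchUrlMapper
  IEC_PREFIX = "urn:iec:std:iec:60050-"

  def initialize
    @schemes = {}
    register(IEC_PREFIX) do |urn|
      "https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=" +
        urn.delete_prefix(IEC_PREFIX)
    end
    register("urn:iso:") do |urn|
      "https://www.iso.org/obp/ui/#" + urn.delete_prefix("urn:")
    end
  end

  # Associates a URN prefix with a block that builds the URL.
  def register(prefix, &block)
    @schemes[prefix] = block
  end

  # Uses the longest registered prefix that matches; nil when none does.
  def resolve(urn)
    prefix = @schemes.keys.select { |p| urn.start_with?(p) }.max_by(&:length)
    prefix && @schemes[prefix].call(urn)
  end
end

mapper = SketchUrlMapper.new
mapper.register("urn:example:") do |urn|
  "https://example.org/concepts/#{urn.delete_prefix('urn:example:')}"
end
puts mapper.resolve("urn:iec:std:iec:60050-102-01-01")
puts mapper.resolve("urn:example:abc")
```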

Credits

This gem is developed, maintained and funded by Ribose Inc.

License

The gem is available as open source under the terms of the 2-Clause BSD License.
