Releases: poseidon-framework/poseidon-hs

Release v2.0.0.0

16 Apr 15:07

This release ushers in a new era in the development of poseidon-hs: We merged the poseidon-analysis-hs library, which was previously developed in a separate repository, into poseidon-hs.

We did this to keep xerxes, a software tool for Poseidon data analysis, fully in sync with trident. In the past xerxes often lagged behind and didn't use the most recent innovations and fixes available in poseidon-hs. We hope merging the repositories will make maintenance easier. So both software tools trident and xerxes will be versioned together from now on, starting with this release v2.0.0.0.

The large jump in the version number reflects the conceptual change in development strategy and the significant restructuring that was necessary in the Haskell library code to house poseidon-analysis-hs in poseidon-hs. From a user perspective there were only minor changes in functionality:

trident

Since v1.7.0.0 we only fixed two minor bugs in genoconvert and forge regarding the handling of the new POSEIDON.yml fields referenceGenomeAssembly and referenceGenomeAssemblyURL. These were not properly forwarded/preserved in the respective operations. The genoconvert bug was reported in this issue.

xerxes

The last official xerxes release was v1.0.1.2. See xerxes_legacy_changelog/ for documentation of previous development. Internal development had reached v1.0.2.0, which added VCF writing support for admixpops.

Building process

Note that we had to remove the UPX compression we had applied in the past to some of the static executables built for trident and xerxes upon release. That means the v2.0.0.0 executables will most likely be larger.

Release v1.7.0.0

23 Mar 03:02

This is a major release to add compatibility with Poseidon v3.0.0. It includes features to accommodate the new schema, and various other changes added since the last release v1.6.7.3.

.janno-related changes for Poseidon v3.0.0

For the new schema release we modified the data structures that internally store .janno, .ssf and even POSEIDON.yml files. Please consult the schema changelog for the list of affected columns.

To keep it possible to read older Poseidon packages we introduced "smart" .janno (and .ssf) field constructors based on the relevant Poseidon version. Smart here means that different checks and even minor data transformations are applied depending on the input version. This renders all valid input minimally compatible with Poseidon v3.0.0. trident does not perform a comprehensive "upgrade" of old data, though. That would also entail replacing outdated .janno columns like Source_Tissue. The most intrusive change that is actually implemented is a rescaling of the columns Endogenous and Damage, which are not stored as percentages any more in Poseidon v3.0.0. The output of trident, e.g. of forge, thus always adheres to the latest supported Poseidon version, but may carry along additional columns as free-text.
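
The rescaling of Endogenous and Damage can be pictured as a small version-aware constructor. The following Python sketch is only an illustration and not trident's Haskell implementation: the function name read_fraction_column, the version tuple, and the assumption that values from packages older than v3.0.0 are divided by 100 are all hypothetical.

```python
from typing import Optional, Tuple

Version = Tuple[int, int, int]


def read_fraction_column(raw: str, poseidon_version: Version) -> Optional[float]:
    """Parse an Endogenous- or Damage-like value in a version-aware way.

    Assumption: packages older than Poseidon v3.0.0 store the value as a
    percentage, so it is divided by 100 to match the v3.0.0 convention.
    """
    if raw.strip() in ("", "n/a"):
        return None  # missing value
    value = float(raw)
    if poseidon_version < (3, 0, 0):
        value = value / 100.0  # percentage -> fraction
    return value
```

With this sketch, a value of 12.5 read from a v2.7.1 package and a value of 0.125 read from a v3.0.0 package end up as the same number.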

Another possibly surprising change in this context concerns the handling of _Note columns in the .janno file. Poseidon v3.0.0 does not explicitly define individual _Note columns any more, so trident equally does not validate them. It instead treats them as unspecified, free-text columns. It does sort them, though, so that _Note columns are at least positioned sensibly when trident writes .janno files.

Minor interface changes that emerged as a result

Beyond these changes in the handling of .janno files, v1.7.0.0 also comes with some minor changes in the trident CLI:

  1. The fact that we introduced smart, version-aware constructors when reading .janno and .ssf files has the consequence that the schema version must be known upon reading. We therefore added command line arguments for validate and jannocoalesce to set the expected Poseidon version when no POSEIDON.yml file is available: --pvJanno, --pvSSF, --pvSource, and --pvTarget. By default the latest supported schema version is assumed.
  2. As explained above trident can now read different Poseidon versions more explicitly, but it can always only write data following the latest schema. To avoid any confusion we made -o,--outFile mandatory in jannocoalesce, even when a -t,--targetFile is overwritten. Otherwise --pvTarget may be confused for a way to set the output version number.
  3. Poseidon v3.0.0 recommends that Poseidon_IDs and Group_Names only include the ASCII characters "A-Za-z0-9_-.". trident now prints a warning if it encounters any characters outside of this recommended range in these fields.
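
The character recommendation in the last item can be checked with a short regular expression. This is a minimal Python sketch, not the check trident performs internally; the helper name unexpected_characters is made up for illustration.

```python
import re

# Character set recommended by Poseidon v3.0.0 for Poseidon_IDs and Group_Names.
RECOMMENDED = re.compile(r"[A-Za-z0-9_.\-]")


def unexpected_characters(value: str) -> list:
    """Return the characters outside the recommended range, in order of first appearance."""
    seen = []
    for ch in value:
        if not RECOMMENDED.fullmatch(ch) and ch not in seen:
            seen.append(ch)
    return seen
```

An empty result means the identifier stays within the recommended range; anything else would be a candidate for a warning.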

New features for archive maintenance

trident is not only a CLI tool for personal data management, but also includes essential tooling for the maintenance and distribution of the public Poseidon archives. https://server.poseidon-adna.org is run by trident. In this context trident v1.7.0.0 adds two new features:

  1. The --archiveConfigFile, i.e. the archive specification YAML file of the server, can now include a retiredPackagesFile field to specify retired packages. Retired packages are by default ignored in the /packages, /groups, /bibliography and /individuals endpoints of the web API, as well as in the archive HTML page of the explorer. However, the /zip_file API endpoint still serves retired packages, so that they can be downloaded. The retired packages are also still available in the per-package explorer HTML page. This feature allows us to retire outdated packages, e.g. in the community-archive.
  2. validate now includes a mechanism to check for the presence and completeness of usually optional .janno and .ssf columns with -j,--mandatoryJannoColumn and -s,--mandatorySSFColumn. This feature will allow us to gradually make more fields mandatory in the public archives, beyond the three that are already required by the schema (Poseidon_ID, Genetic_Sex, Group_Names).
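
What "presence and completeness" could mean for a column is sketched below in Python. This is an assumption-laden illustration rather than the logic in validate: the function name is hypothetical, the .janno content is read as tab-separated text, and a column is treated as incomplete if any row holds an empty string or n/a.

```python
import csv
import io


def check_mandatory_columns(janno_text: str, mandatory: list) -> list:
    """Return human-readable problems for missing or incomplete columns.

    Assumption: a column counts as incomplete if any row holds an empty
    string or "n/a" in it.
    """
    reader = csv.DictReader(io.StringIO(janno_text), delimiter="\t")
    header = reader.fieldnames or []
    rows = list(reader)
    problems = []
    for col in mandatory:
        if col not in header:
            problems.append(f"column {col} is missing")
            continue
        for i, row in enumerate(rows, start=1):
            if (row[col] or "").strip() in ("", "n/a"):
                problems.append(f"column {col} is empty in row {i}")
    return problems
```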

Fixed a subtle bug in the forge language

A user reported an issue in the selection language parsing of forge, where package names with multiple hyphens and numbers caused the parsing to fail:

option --forgeString: Error when parsing the forge selection (either -f or --forgeFile):
unexpected "-"
expecting digit

We identified and fixed this bug.

Release v1.6.7.3

20 Aug 11:03

This is a minor release with few changes in the behaviour of trident. It mainly includes internal alterations that allow for better error reporting. On the user side there are three notable changes:

Better reporting of parsing errors for .ssf files

Every .ssf file column is now represented by its own data type, as has already been the case for .janno columns. This allows for more precise reporting of issues. trident now points exactly to the broken column in case something is off.

More extensive warning mechanism for .janno and .ssf entries

We introduced a mechanism to not only report outright parsing failures on a per-column basis for .janno and .ssf files, but also minor deviations that make a given value not per-se wrong, but suspicious. These are now reported as warnings, while the respective Poseidon package is still read. The initial set of such checks in this release is small, but it is now easy to add more in the future.

Loosened requirements on accession ID columns in .ssf file

This release finally does away with the hard requirements on sample_accession, study_accession, and run_accession in the .ssf file reading process. These requirements were based on a particularly strict reading of the Poseidon schema. Now unexpected accession IDs only raise a warning.

Release v1.6.7.1

25 Jun 12:20

This release finally brings two long-anticipated features: VCF writing support and an HTML API for serve. It also includes some minor bugfixes.

Writing support for VCF files

v1.5.7.0 added experimental reading support for .vcf files. In this release trident finally learns to also write them as an output of forge and genoconvert. This new output option is available with --outFormat VCF.

VCF is a rich format (as specified here) and trident currently uses only the features relevant for the genotype data typically handled by Poseidon. In particular, as trident must be able to convert from Plink and Eigenstrat, many fields that are typically expected in VCF files (such as read- and allelic depths or genotype likelihoods) are not written.

On the other hand, VCF files written by trident contain the extra headers ##group_names=Group1,Group2,... and ##genetic_sex=F,F,M,U,... to encode information typically not stored in VCF. This ensures compatibility with the PLINK and EIGENSTRAT data formats. Because VCF files from other sources do not have to define these custom header fields, trident uses modified consistency checks between the individual and the genotype data for VCF input.
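
To illustrate the two custom header lines, here is a minimal Python sketch that collects them from the meta-information section of a VCF file. It is not trident's parser; the function name is hypothetical, and the sketch simply reports None when a header is absent, as can be the case for VCF files not written by trident.

```python
def read_poseidon_vcf_headers(lines):
    """Collect the ##group_names and ##genetic_sex header values, if present."""
    found = {"group_names": None, "genetic_sex": None}
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM header line
        key, sep, value = line[2:].rstrip("\n").partition("=")
        if sep and key in found:
            found[key] = value.split(",")
    return found
```

The function accepts any iterable of lines, so an open file handle can be passed directly.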

Please note that the VCF format support is still not specified in the Poseidon schema version this trident version supports (v2.7.1), so the feature continues to be experimental.

HTML API for the web server implementation

trident includes a web server to host Poseidon packages and relevant meta-information. It can be started with the subcommand serve. The central Poseidon server at https://server.poseidon-adna.org is nothing but a public instance of serve with access to the public package archives. Previously this web server provided only context data through a JSON API and allowed downloading packages as .zip archives (these interfaces are used by list --remote and fetch).

This release now adds HTML output, so a human-readable website, to the server's API. The central, public version is available here, but by running serve locally one can just as well host such a website for a private package archive.

serve can still be started with trident serve -d <name_of_archive>=<path/to/archive>, but a new --archiveConfigFile argument now allows reading a more complex configuration in YAML format.

More info from the POSEIDON.yml file in the list output

Added a new option --fullOutput for list --packages to extend the output with additional information from the underlying packages' POSEIDON.yml files (file names, contributors, etc.).

Fixed two bugs in rectify

Fixed a small bug that prevented calculation of checksums for genotype data in rectify, and another one that prevented trident from reading packages with a wrong individual file (.ind/.fam) checksum even in rectify, where this should be possible.

Release v1.6.2.1

19 Jan 17:01

This is a bigger release with various new features and improvements. It is technically breaking, because a minor, redundant argument of genoconvert was removed.

Writing support for gzipped genotype data

After reading support for zipped data was already added in v1.5.7.0, this release now introduces the complementary writing feature for EIGENSTRAT and PLINK files in genoconvert and forge. Both commands get a new option -z which creates gzipped output.

  -z,--zip                 Should the resulting genotype- and snp-files be
                           gzipped?

Note that this feature includes a smart way of handling already available files to not overwrite them, but still consider them when updating a package's POSEIDON.yml file. -z is also usable with unpackaged genotype data (-p, --onlyGeno).

Future versions of the Poseidon package schema will formally specify this feature.

Bibliography information in list and the Web-API

The list subcommand now supports a new view (next to --packages, --groups and --individuals): --bibliography provides a tabular overview of publications in a package repository.

$ trident list -d 2010_RasmussenNature --bibliography
...
.---------------------.--------------------------------------------------------------.-----------------------.------.---------------------------.---------------.
|       BibKey        |                            Title                             |        Author         | Year |            DOI            | Nr of samples |
:=====================:==============================================================:=======================:======:===========================:===============:
| AADR                | The Allen Ancient DNA Resource (AADR): A curated compendium… | Swapan Mallick et al. | 2023 | 10.1101/2023.04.06.535797 | 1             |
| AADRv424            | The Allen Ancient DNA Resource (AADR): A curated compendium… | S Mallick and D Reich | 2023 | 10.7910/DVN/FFIDCW        | 1             |
| RasmussenNature2010 | Ancient human genome sequence of an extinct Palaeo-Eskimo    | M Rasmussen et al.    | 2010 | 10.1038/nature08835       | 1             |
'---------------------'--------------------------------------------------------------'-----------------------'------'---------------------------'---------------'

Additional fields from the .bib file can be added to this table with -b|--bibField ... (just as -j|--jannoColumn ... for --individuals). --fullBib adds everything that is available (just as --fullJanno). As usual, tab-separated output can be requested with --raw for derived analyses on the command line.

Correspondingly the Web-API supports a new endpoint /bibliography to serve bibliography information via HTTP in JSON format. The optional query argument additionalJannoColumns=... can be used to request extra fields here.

Remove empty .janno columns with rectify

The rectify subcommand was upgraded with a first option to manipulate .janno files in one or multiple packages: --jannoRemoveEmpty. This removes empty columns from .janno files, so columns that only feature empty strings or n/a values.

  --jannoRemoveEmpty       Reorder the .janno file and remove empty colums.
                           Remember to pair this option with --checksumJanno to
                           also update the checksum.
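
The notion of an empty column used here can be sketched in a few lines of Python. This is an illustration under the definition given above (only empty strings or n/a values), not the code behind --jannoRemoveEmpty; the function name is hypothetical and the reordering mentioned in the help text is not reproduced.

```python
def remove_empty_columns(header, rows):
    """Drop columns whose values are all empty strings or "n/a"."""
    keep = [
        i
        for i, _ in enumerate(header)
        if any(row[i].strip() not in ("", "n/a") for row in rows)
    ]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows
```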

With this change came a rewrite of the way trident fills empty fields with n/a when writing .janno and .ssf files. This behaviour now also affects the output of list!

Removed redundant --onlyGeno from genoconvert

We realized that --onlyGeno in genoconvert had the same effect as -o if a different output directory is chosen. We therefore decided to remove this argument and improve the documentation of -o:

  -o,--outPackagePath DIR  Path for the converted genotype files to be written
                           to. If a path is provided, only the converted
                           genotype files are written out, with no change of the
                           original package. If no path is provided, genotype
                           files will be converted in-place, including a change
                           in the POSEIDON.yml file to yield an updated valid
                           package (default: Nothing)

Bug fixes and technical changes

We fixed two bugs that broke the long-form genotype data input option (with --genoFile + --snpFile + ...). They were accidentally introduced with the recent interface changes for v1.5.7.0. This input interface should now be fully functional again.

We finally switched to a new compiler version (GHC 9.6.6) and a new stackage resolver version (lts-22.43). This required some minor adjustments in the server code, but should not have any user-facing consequences.

Release v1.5.7.3

04 Nov 15:42

This patch release fixes three minor bugs, some of which were accidentally introduced with the big changes in v1.5.7.0.

  1. Fixed a bug in the .janno reading triggered by trailing à characters.
  2. Reverted unspecified behaviour: 0 is again allowed in the Nr_SNPs .janno column.
  3. Fixed a bug introduced in v1.5.5.0, where command line input using the -p option would not behave correctly if the input files have multiple file endings, separated by dots.

Release v1.5.7.0

26 Oct 18:38

Warning

On 2024/11/06 we realized that this release includes a breaking change that is not documented below.
The command line input interface for unpackaged genotype data was modified from previously --inFormat EIGENSTRAT|PLINK + --genoFile + --snpFile + --indFile to now --genoFile + --snpFile + --indFile and --bedFile + --bimFile + --famFile. So the format selection with the --inFormat argument was removed and replaced with separate file selectors for EIGENSTRAT and PLINK data.
This affects all trident subcommands that allow reading of unpackaged genotype data, namely init, forge, genoconvert and validate.

This release further improves .janno parsing error messages and adds reading support for gzipped PLINK (.bed and .bim) and EIGENSTRAT (.geno and .snp) files. We also added (experimental) support for reading VCF files.

Better .janno error messages

Working with Poseidon packages generally involves reading and validation of .janno files. trident parses them carefully and reports structural issues that compromise their machine-readability. So far the error reports generally only included the line and type of an offending entry. This sometimes made it hard to determine exactly which column is broken. For this release we introduced individual data types for all specified .janno columns, which allows more precise error messages.

To demonstrate this we modified an existing .janno file in the Poseidon community archive (2012_MeyerScience) and broke some of its columns. We added non-UTF8 encoded characters in the Relation_Note column of line 2, a trailing ; in the Coverage_on_Target_SNPs column of line 3, and a leading x to the Latitude column of line 7.

Here is how these issues were previously reported and how they are shown now:

[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 2:
-parse error (Failed reading: conversion error: Cannot decode byte '\x80': Data.Text.Encoding: Invalid UTF-8 stream)
+parse error (Failed reading: conversion error: Cannot decode byte '\x80': Data.Text.Encoding: Invalid UTF-8 stream in column Relation_Note)
[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 3:
-parse error in one column (expected data type: Double, broken value: "32.12;", problematic characters: ";")
+parse error (Failed reading: conversion error: Coverage_on_Target_SNPs can not be converted to Double, because of a trailing ";")
[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 7:
-parse error (Failed reading: conversion error: expected Double, got "x18.93726" (Failed reading: takeWhile1))
+parse error (Failed reading: conversion error: Latitude can not be converted to Double because input does not start with a digit)

The error messages now include the relevant column name and are more concrete and easy to understand.

Reading support for gzipped genotype data

Although not yet part of the Poseidon 2.7.1 standard, Poseidon packages can now contain gzipped genotype files. Specifically, for EIGENSTRAT-formatted genotype data, the genotype matrix file (.geno) and the snp-list file (.snp) can now also be zipped. This strictly requires file endings with .gz, so .geno.gz and .snp.gz, respectively. Similarly, for PLINK-formatted genotype data, we now also accept .bed.gz and .bim.gz. Any such files with the gz file ending are assumed to be gzipped, and are decoded on the fly using stream-processing. Gzipped and unzipped files can also be mixed within the same package.
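
The rule "a .gz file ending means gzipped, decoded on the fly" can be illustrated with Python's standard library. This is a minimal sketch of the idea and not the stream-processing code in trident; the function name and the binary switch are assumptions for illustration.

```python
import gzip


def open_maybe_gzipped(path: str, binary: bool = False):
    """Open a genotype-related file, decompressing on the fly if it ends in .gz.

    Use binary=True for binary formats such as PLINK .bed files.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rb" if binary else "rt")
    return open(path, "rb" if binary else "r")
```

Because the decision is made per file, gzipped and unzipped files can be mixed freely, as described above.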

For commands that support the --genoOne option (init, forge and genoconvert), note that we make some assumptions, which are summarised in the help text for the option:

 -p,--genoOne FILE        One of the input genotype data files. Expects .bed,
                          .bed.gz, .bim, .bim.gz or .fam for PLINK, or .geno,
                          .geno.gz, .snp, .snp.gz or .ind for EIGENSTRAT. The
                          other files must be in the same directory and must
                          have the same base name. If a gzipped file is given,
                          it is assumed that the file pairs (.geno.gz, .snp.gz)
                          or (.bim.gz, .bed.gz) are both zipped, but not the
                          .fam or .ind file. If a .ind or .fam file is given,
                          it is assumed that none of the file triples is
                          zipped. For VCF please see option --vcfFile
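
The assumptions in this help text can be spelled out as a small Python sketch that infers the full file triple from one given path. It follows the rules stated above, but it is not trident's implementation; the function name companion_files is hypothetical, and VCF input is left out.

```python
import os


def companion_files(geno_one: str):
    """Infer (genotype, snp, individual) file paths from a single input path.

    Follows the stated assumptions: a gzipped input implies that the
    genotype and snp files are both gzipped, while the .ind or .fam
    file never is.
    """
    zipped = geno_one.endswith(".gz")
    stem = geno_one[:-3] if zipped else geno_one
    # splitext only strips the last ending, so base names with dots survive
    base, ext = os.path.splitext(stem)
    if zipped and ext in (".ind", ".fam"):
        raise ValueError("individual files are assumed not to be gzipped")
    gz = ".gz" if zipped else ""
    if ext in (".geno", ".snp", ".ind"):
        return (base + ".geno" + gz, base + ".snp" + gz, base + ".ind")
    if ext in (".bed", ".bim", ".fam"):
        return (base + ".bed" + gz, base + ".bim" + gz, base + ".fam")
    raise ValueError("unsupported file ending: " + geno_one)
```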

At this point, genoconvert and forge do not support writing of gzipped files. This will be added in the future.

VCF support for genotype data

Although not yet part of the Poseidon 2.7.1 standard, Poseidon packages can now contain VCF (Variant Call Format) files as genotype data, optionally gzipped. In contrast to EIGENSTRAT and PLINK format, which require triples of files, the VCF format requires just one file with ending .vcf or .vcf.gz. VCF files contain sample names, but no information about genetic sex or group names. This information is usually provided in .janno files, so there is no loss of information in Poseidon packages. For trident init, which constructs a minimal .janno file from the genotype file, we set the Genetic_Sex column to "U", and the Group_Name column to "unknown".

The VCF file format is very flexible and can encode a large amount of information (see https://samtools.github.io/hts-specs/VCFv4.2.pdf). We do not consider our parsing of VCF files to be complete. The feature is for now experimental, since future users may encounter valid VCF files that cause parsing errors in edge cases. Do not hesitate to file an issue in such a case: https://github.com/poseidon-framework/poseidon-hs/issues.

At this point, genoconvert and forge do not support writing of VCF files. This will be added in the future.

Release v1.5.4.0

12 Jul 08:02

This bigger release adds a number of useful features to trident, some of them long requested. The highlights are ordered output for forge, a way to preserve key information if forge is applied to a singular source package, a new Web-API option to return the content of all available .janno columns, and better error messages for common trident issues.

Order forge output with --ordered

The order of samples in a Poseidon package created with trident forge depends on the order in which the relevant source packages are discovered by trident (e.g. when it crawls for packages in the -d base directories) and then the sample order within these packages. This mechanism did not allow for any convenient way to manually set the output order.

v1.5.4.0 adds a new option --ordered, which causes trident to output the resulting package with samples ordered according to the selection in -f or --forgeFile. This works through an alternative, slower sample selection algorithm that loops through the list of entities and checks for each entity which samples it adds or removes respectively from the final selection.

For simple, positive selection, packages, groups and samples are added as expected. Negative selection removes samples from the list again. If an entity is selected twice via positive selection, then its first occurrence is considered for the ordering.
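
The looping selection described above can be sketched as follows. This Python example is a simplified model and not the algorithm in trident: entities are represented as hypothetical (include, kind, name) triples, samples as dictionaries with package, group and sample keys, and details such as package versions are ignored.

```python
def ordered_selection(entities, samples):
    """Return sample names in the order implied by the entity list.

    entities: list of (include, kind, name), where include is True for
              positive and False for negative selection, and kind is one
              of "package", "group" or "sample".
    samples:  list of dicts with the keys "package", "group", "sample".
    """
    selected = []
    for include, kind, name in entities:
        matching = [s for s in samples if s[kind] == name]
        if include:
            for s in matching:
                if s not in selected:  # the first occurrence fixes the position
                    selected.append(s)
        else:
            selected = [s for s in selected if s not in matching]
    return [s["sample"] for s in selected]
```

Looping over every entity and scanning all samples each time is what makes this kind of approach slower than an unordered selection, but it yields an output order controlled by the selection itself.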

Preserve the source package in forge with --preservePyml

For the specific task of subsetting a singular, existing Poseidon package it can be useful to preserve some fields of the POSEIDON.yml file of the source package, as well as supplementary information in the README.md and the CHANGELOG.md file. These are typically discarded by forge, but can now be copied over to the output package with the new --preservePyml output mode. Naturally this only works with a single source package!

--preservePyml specifically preserves the following POSEIDON.yml fields:

  • description
  • contributor
  • packageVersion
  • lastModified
  • readmeFile
  • changelogFile

Note that this does not include the package title, which can be easily set to be identical to the source with -n or -o if it is desired. The poseidonVersion field is also not copied, because trident can only ever produce output packages with the latest Poseidon schema version.

While implementing this we clearly separated the different forge output modes (--onlyGeno, --minimal, --preservePyml and the default) and made them mutually exclusive. We did so to avoid an increasingly complex set of interactions between them for the future.

One particular application of --preservePyml is the reordering of samples in an existing Poseidon package MyPac with the new --ordered flag. We suggest the following workflow for this application:

  1. Generate a --forgeFile with the desired order of the samples in MyPac. This can be done manually or with any suitable tool. Here is an example, where we employ qjanno to generate a forge selection so that the samples are ordered alphabetically by their Poseidon_ID:
qjanno "SELECT '<'||Poseidon_ID||'>' FROM d(MyPac) ORDER BY Poseidon_ID" --raw --noOutHeader > myOrder.txt
  2. Use trident forge with --ordered and --preservePyml to create the package with the specified order:
trident forge -d MyPac --forgeFile myOrder.txt -o MyPac2 --ordered --preservePyml
  3. Apply trident rectify to increment the package version number and document the reordering:
trident rectify -d MyPac2 --packageVersion Minor --logText "reordered the samples alphabetically by Poseidon_ID"

MyPac2 then acts as a stand-in replacement for MyPac that only differs in the order of samples (and maybe the order of variables/fields in the POSEIDON.yml, .janno, .ssf or .bib files). This workflow is not as convenient as in-place reordering would be -- but much safer.

Request all .janno columns in list and the Web-API

trident list --individuals gives access to per-sample information for Poseidon packages on the command line. With the -j option arbitrary additional columns from the .janno files can be appended to the output. Here, for example, the Country and the Genetic_Sex columns:

 trident list -d 2010_RasmussenNature --individuals -j "Country" -j "Genetic_Sex"

.------------.---------------------.----------------------.----------------.-----------.-----------.-------------.
| Individual |        Group        |       Package        | PackageVersion | Is Latest |  Country  | Genetic_Sex |
:============:=====================:======================:================:===========:===========:=============:
| Inuk.SG    | Greenland_Saqqaq.SG | 2010_RasmussenNature | 2.1.1          | True      | Greenland | M           |
'------------'---------------------'----------------------'----------------'-----------'-----------'-------------'

v1.5.4.0 adds a --fullJanno flag to request all columns at once, without having to list them individually with many -j arguments.

This convenience feature was also added to the Web-API, where it can be triggered with ?additionalJannoColumns=ALL on the /individuals endpoint:

https://server.poseidon-adna.org/individuals?additionalJannoColumns=ALL
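
For scripted access, the same request can be built and sent with Python's standard library. This is a minimal sketch: the helper names are made up, the structure of the returned JSON is not described here and therefore not assumed, and the value of additionalJannoColumns is simply passed through.

```python
import json
import urllib.parse
import urllib.request

SERVER = "https://server.poseidon-adna.org"


def individuals_url(additional_janno_columns: str = "ALL", server: str = SERVER) -> str:
    """Build the /individuals URL with the additionalJannoColumns query argument."""
    query = urllib.parse.urlencode({"additionalJannoColumns": additional_janno_columns})
    return f"{server}/individuals?{query}"


def fetch_individuals(url: str):
    """Download and decode the JSON response (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)
```

Calling fetch_individuals(individuals_url()) would then return the decoded response for all available .janno columns.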

Better error messages

In previous trident versions some common error messages were not well rendered on the command line. This particularly concerned errors when parsing command line input, the POSEIDON.yml file or genotype data. We applied multiple changes here to improve the CLI output.

The behaviour of the global trident option --errLength was also changed. It now only truncates genotype data-related messages, but does so as well if these are raised on the [Warning] log level. This should make the previously often illegible trident output upon broken genotype data more readable.

Release v1.5.0.1

06 May 20:12

This very minor release only affects the static trident executables produced for every release.

It introduces a distinction between pre-built X64 and ARM64 executables for macOS, where changes in the main processor architecture have recently rendered old builds invalid for new systems and vice versa.

That means the executable trident-macOS will no longer exist from now on. It is replaced by the executables trident-macOS-X64 and trident-macOS-ARM64.

In the past we have not explicitly documented changes in the compilation pipeline - v1.5.0.0, for example, came with a major overhaul of the pipeline - but in this case a small version bump seems to be in order to announce the split in available artefacts.

Release v1.5.0.0

03 May 15:56

This is a minor, but technically breaking release. It removes the example contributor Josiah Carberry from new packages created by trident init and trident forge.

Previously every package created by init or forge included an example entry in the contributor field of the POSEIDON.yml file:

- name: Josiah Carberry
  email: carberry@brown.edu
  orcid: 0000-0002-1825-0097

This served the purpose of reminding users to actually set a contributor and giving an example of how to do so. To simplify scripting with Poseidon packages we now remove this slightly gimmicky default.

To encourage setting the contributor field we instead introduce a reading/validation warning in case the contributor field is empty:

[Warning] Contributor missing in POSEIDON.yml file of package 2010_RasmussenNature-2.1.1