High-performance Unicode and Punycode encoding/decoding for internationalized domain names (IDNs) in R.
The punycoder package provides fast, standards-based conversion between Unicode and ASCII representations of domain names, across two distinct surfaces:
- a low-level Punycode codec, puny_encode() / puny_decode(): the raw RFC 3492 transform with xn-- A-label framing (RFC 5890/5891) and letter-digit-hyphen checks, not an IDNA normalization API (no Unicode NFC, UTS #46 mapping, or case folding);
- an IDNA/UTS-46 host-normalization surface, host_normalize(): mapping a host name to its canonical lowercase ASCII comparison form under a pinned UTS #46 non-transitional profile.
host_normalize() is a UTS #46 profile, not IDNA2008 conformance — UTS #46 is compatibility processing and deliberately accepts labels IDNA2008 would reject (e.g. ☕.example → xn--53h.example). See ?host_normalize and normalization_profile_info() for the normative profile and full standards references (RFC 3492/5890/5891/5892/5893, UTS #46, UAX #15/#44, STD 3, RFC 8753).
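For orientation, the raw RFC 3492 transform behind the low-level codec can be sketched in base R. This is an illustrative sketch, not punycoder's implementation: the helper names (sketch_adapt, sketch_digit, sketch_encode_label, sketch_to_ascii) are made up for this example, and it omits everything else the package does (LDH checks, length limits, overflow guards, NFC, UTS #46 mapping).

```r
# RFC 3492 parameters: base 36, tmin 1, tmax 26, skew 38, damp 700,
# initial bias 72, initial n 128.

# Bias adaptation (RFC 3492, section 6.1).
sketch_adapt <- function(delta, numpoints, first) {
  delta <- if (first) delta %/% 700L else delta %/% 2L
  delta <- delta + delta %/% numpoints
  k <- 0L
  while (delta > (35L * 26L) %/% 2L) {
    delta <- delta %/% 35L
    k <- k + 36L
  }
  k + (36L * delta) %/% (delta + 38L)
}

# Map a digit value 0..35 to a-z, 0-9.
sketch_digit <- function(d) {
  substr("abcdefghijklmnopqrstuvwxyz0123456789", d + 1L, d + 1L)
}

# Encode one label that contains at least one non-ASCII code point.
sketch_encode_label <- function(label) {
  cps <- utf8ToInt(enc2utf8(label))
  basic <- cps[cps < 128L]
  out <- if (length(basic) > 0L) paste0(intToUtf8(basic), "-") else ""
  h <- length(basic)
  b <- h
  n <- 128L
  delta <- 0L
  bias <- 72L
  while (h < length(cps)) {
    m <- min(cps[cps >= n])          # next code point to insert
    delta <- delta + (m - n) * (h + 1L)
    n <- m
    for (cp in cps) {
      if (cp < n) delta <- delta + 1L
      if (cp == n) {
        q <- delta
        k <- 36L
        repeat {                     # emit delta as a variable-length integer
          thr <- if (k <= bias) 1L else if (k >= bias + 26L) 26L else k - bias
          if (q < thr) break
          out <- paste0(out, sketch_digit(thr + (q - thr) %% (36L - thr)))
          q <- (q - thr) %/% (36L - thr)
          k <- k + 36L
        }
        out <- paste0(out, sketch_digit(q))
        bias <- sketch_adapt(delta, h + 1L, h == b)
        delta <- 0L
        h <- h + 1L
      }
    }
    delta <- delta + 1L
    n <- n + 1L
  }
  out
}

# Apply the transform label by label; ASCII-only labels pass through unchanged.
sketch_to_ascii <- function(host) {
  labels <- strsplit(enc2utf8(host), ".", fixed = TRUE)[[1]]
  encoded <- vapply(labels, function(label) {
    if (all(utf8ToInt(label) < 128L)) label
    else paste0("xn--", sketch_encode_label(label))
  }, character(1), USE.NAMES = FALSE)
  paste(encoded, collapse = ".")
}

sketch_to_ascii("caf\u00e9.com")     # "xn--caf-dma.com"
sketch_to_ascii("\u2615.example")    # "xn--53h.example"
```

Because ASCII-only labels are returned as-is, running the sketch on an already-encoded host leaves it unchanged.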
punycoder has a small dependency footprint:
- Runtime dependencies: R (>= 3.5.0), Rcpp
- Optional system dependency: libidn2 (detected at compile time)
- Optional build helper: pkg-config (used by configure to detect libidn2)
- Development dependencies: testthat, knitr, rmarkdown
Install the released version of punycoder from CRAN with:
install.packages("punycoder")
Or install the development version from GitHub with:
# install.packages("remotes")
remotes::install_github("bart-turczynski/punycoder")
punycoder works without extra system libraries. If libidn2 is available at
build time, the package enables a native backend automatically; otherwise it
uses the built-in C++ fallback backend.
To install the recommended optional dependency:
- macOS (Homebrew):
brew install libidn2 pkg-config
- Debian/Ubuntu:
sudo apt-get install libidn2-0-dev pkg-config
- Fedora/RHEL/CentOS:
sudo dnf install libidn2-devel pkgconf-pkg-config
- Arch Linux:
sudo pacman -S libidn2 pkgconf
Verify the library is visible before installing punycoder from source:
system("pkg-config --modversion libidn2")
Then install/reinstall punycoder:
remotes::install_github("bart-turczynski/punycoder")

library(punycoder)
# Basic encoding
puny_encode("café.com")
#> [1] "xn--caf-dma.com"
# Check if domain is punycode
is_punycode("xn--example")
#> [1] TRUE
# Validate domains
validate_domain("test.com")
#> Punycoder Domain Validation Results
#> ==================================
#>
#> Domain: test.com
#> Valid: TRUE
- Reliable Encoding/Decoding: RFC 3492-compliant Punycode conversion
- Best-effort host rewriting: Swap the host of a URL-shaped string in place (not a full URL parser; see below)
- High Performance: Vectorized operations for processing large datasets
- Comprehensive Validation: Robust error handling with informative messages
- Flexible Backend: Automatically uses libidn2 when available, with a built-in fallback backend
Process international websites with Unicode domain names:
international_urls <- c(
"https://café.paris.fr/menu",
"https://москва.рф/news",
"https://北京.中国/info"
)
# Convert for HTTP requests (best-effort host rewriting only)
ascii_urls <- url_encode(international_urls)
url_encode(), url_decode(), and parse_url() do best-effort host extraction and rewriting, not RFC 3986 / WHATWG URL parsing or canonicalization. They have no percent encoding/decoding, scheme validation, robust port/path/query semantics, full IPv6 (zone IDs / RFC 6874), or serialization guarantees, and are slated for eventual removal in favor of a dedicated URL package consuming punycoder’s host functions. Use host_normalize() / puny_encode() directly when you control the host parse.
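To make "best-effort host rewriting" concrete, the sketch below swaps the host of a URL-shaped string using a single regular expression. It is a base R illustration of the general technique and its blind spots, not the code behind url_encode(); swap_host() is a hypothetical helper, and tolower() stands in for whatever host transform is applied.

```r
# Replace the host portion of a URL-shaped string with f(host).
# Returns NA when the input has no "scheme://" prefix.
swap_host <- function(url, f) {
  pattern <- "^([A-Za-z][A-Za-z0-9+.-]*://)([^/?#:]+)(.*)$"
  parts <- regmatches(url, regexec(pattern, url))[[1]]
  if (length(parts) == 0L) return(NA_character_)
  paste0(parts[2], f(parts[3]), parts[4])
}

swap_host("https://EXAMPLE.com/Path?q=1", tolower)  # "https://example.com/Path?q=1"
swap_host("http://Host:8080/x", tolower)            # "http://host:8080/x"
swap_host("example.com/x", tolower)                 # NA (no scheme, no match)

# Blind spot: userinfo is mistaken for part of the host and gets rewritten too.
swap_host("https://User@EXAMPLE.com/", tolower)     # "https://user@example.com/"
```

The last call shows why this kind of rewriting is not a substitute for a URL parser: anything the pattern does not model (userinfo, bracketed IPv6 literals, percent-encoding) is silently mishandled.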
Clean and standardize URL datasets:
# Identify international domains
is_idn(c("café.com", "example.com", "москва.рф"))
# Validate domain names
validate_domain(c("valid.com", "invalid..domain"))
punycoder currently provides:
- Low-level Punycode codec: puny_encode(), puny_decode()
- IDNA/UTS-46 host normalization: host_normalize(), normalization_profile_info()
- Best-effort URL host rewriting/extraction (not URL parsing/canonicalization): url_encode(), url_decode(), parse_url()
- Domain validation utilities: is_punycode(), is_idn(), validate_domain()
- Vectorized operations and strict/non-strict handling for malformed input
- Build-time backend selection (libidn2 when present, built-in fallback otherwise)
- Best-effort structured host extraction where invalid inputs are returned as missing components
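One way to picture "strict/non-strict handling" is as a per-element policy: in strict mode the first malformed element raises an error, while in non-strict mode it becomes NA and the rest of the vector is still processed. The sketch below shows that pattern in base R with a simple letter-digit-hyphen label check. It rests on an assumption about what strict and non-strict mean; the exact semantics and argument names in punycoder may differ, and check_ldh_label() and with_policy() are hypothetical names, not package API.

```r
# Validate one label against the letter-digit-hyphen rule and lowercase it.
# Signals an error for anything else.
check_ldh_label <- function(label) {
  ok <- nchar(label) >= 1L && nchar(label) <= 63L &&
    grepl("^[A-Za-z0-9-]+$", label, perl = TRUE) &&
    !startsWith(label, "-") && !endsWith(label, "-")
  if (!ok) stop("not a valid LDH label: ", label, call. = FALSE)
  tolower(label)
}

# Apply f to each element; strict = TRUE propagates errors,
# strict = FALSE turns a failing element into NA.
with_policy <- function(x, f, strict = TRUE) {
  vapply(x, function(el) {
    if (is.na(el)) return(NA_character_)
    if (strict) return(f(el))
    tryCatch(f(el), error = function(e) NA_character_)
  }, character(1), USE.NAMES = FALSE)
}

labels <- c("Example", "-bad", "xn--caf-dma")

# Non-strict: the malformed element becomes NA, the others are processed.
with_policy(labels, check_ldh_label, strict = FALSE)
# "example", NA, "xn--caf-dma"

# Strict: the same input stops at the malformed element.
result <- try(with_policy(labels, check_ldh_label, strict = TRUE), silent = TRUE)
inherits(result, "try-error")
# TRUE
```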
punycoder is a standards primitive for Punycode and host normalization. It is
deliberately agnostic about resolvability and safety; the following are not
part of its acceptance criteria:
- No spoof / homograph / mixed-script / display-safety detection. host_normalize() is not a safety gate: a successful result says the host is valid and normalized under the pinned UTS #46 profile, nothing about whether it is visually safe or non-deceptive. Confusable and restriction-level checks (UTS #39 / UTR #36, which UTS #46 itself recommends only as application/UI-layer steps) belong upstack.
- No URL canonicalization. The url_*/parse_url() helpers do best-effort host rewriting only (see above), not RFC 3986 / WHATWG URL parsing.
- No DNS resolvability or registrability / PSL classification.
- No address parsing. There is no email-to-ASCII helper; splitting an address and IDNA-encoding its domain part is an addressing concern for an upstack consumer, not a Punycode primitive.
- No per-TLD repertoire / allowed-character validation. host_normalize() validates against the pinned UTS #46 profile, not against registry-specific IDN tables (which evolve independently of Unicode). TLD policy belongs upstack.
These opinions belong in higher layers that consume punycoder’s host functions.
Punycode/IDN libraries exist in most ecosystems. punycoder is most directly a
maintained, IDNA2008-era successor to the libidn-based R tooling — its public
API (puny_encode() / puny_decode() / is_punycode()) descends from
hrbrmstr/punycode. The table below
situates it against representative libraries.
| | punycoder (R) | hrbrmstr/punycode (R) | punycoder (Dart) | simonmittag/punycoder (Go) |
|---|---|---|---|---|
| Form | library | library | library | CLI tool |
| RFC 3492 codec | yes | yes | yes | yes |
| Engine | libidn2 + in-tree fallback | GNU libidn | pure Dart | Go x/net/idna |
| IDNA standard | 2008 / UTS #46 (non-transitional) | 2003 (nameprep) | RFC 3492 + IDNA helpers | UTS #46 (via x/net) |
| Unicode NFC | explicit (UAX #15) | implicit in nameprep | not documented | via x/net |
| Pinned Unicode version | yes (16.0.0, regenerable) | no (frozen at build) | no | tracks Go release |
| CheckBidi / CheckJoiners | always on | not surfaced | not documented | partial |
| UTS #46 conformance corpus (IdnaTestV2) | yes | no | no | — |
| Strict / NA per-element policy | yes | undocumented | validate flag | n/a (CLI) |
| Vectorized | yes | yes | n/a | n/a |
| Maintenance | active | last commit 2015 | maintained | maintained |
The most consequential row is IDNA standard. IDNA2003 (GNU libidn, nameprep) and IDNA2008 / UTS #46 disagree on real domains, specifically those containing the deviation characters ß and ς or the joiners ZWJ/ZWNJ. Under IDNA2003 faß.de is mapped to fass.de, a different host, whereas punycoder’s pinned UTS #46 non-transitional profile preserves it as xn--fa-hia.de. A libidn-era pipeline therefore silently rewrites some hosts rather than erroring, which is the class of bug punycoder exists to remove.
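The difference is easy to reproduce. The sketch below imitates only the IDNA2003-style mapping of ß to ss with a plain substitution (a stand-in for nameprep, not a nameprep implementation). If punycoder is installed, it also prints what host_normalize() returns for the same input, so the result can be compared with the xn--fa-hia.de form quoted above.

```r
host <- "fa\u00df.de"   # faß.de

# Stand-in for the IDNA2003 mapping step: sharp s is replaced by "ss".
idna2003_like <- gsub("\u00df", "ss", host, fixed = TRUE)
idna2003_like
# "fass.de"

# The mapped name is a different host from the one that was typed.
identical(idna2003_like, host)
# FALSE

# Only runs when punycoder is available; the printed value is whatever
# the installed version returns.
if (requireNamespace("punycoder", quietly = TRUE)) {
  print(punycoder::host_normalize(host))
}
```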
Comparisons reflect each project’s public documentation as of this writing and describe documented behavior, not an independent audit.
Running the same inputs through the comparable R packages surfaces concrete
behavioral differences (observed against punycode 0.2.5, urltools 1.7.3.1,
and the author’s own upstack toolkit rurl
1.4.0). The raw RFC 3492 codec output agrees byte-for-byte across the codecs
once direction is aligned — the divergences are in multi-label handling,
idempotency, validity philosophy, and input scope. rurl is a URL
parser/normalizer rather than a Punycode codec; it is included to show where the
URL-shaped inputs punycoder deliberately rejects are actually handled (it
delegates IDNA host conversion to punycoder), so a “—” cell below means “out of
scope for that layer,” not a defect:
| Behavior | punycoder | hrbrmstr/punycode | urltools | rurl |
|---|---|---|---|---|
| Primary role | Punycode/IDNA host codec | Punycode codec (IDNA2003) | URL + punycode utilities | URL parser / normalizer |
| puny_encode() direction | Unicode → ASCII | ASCII → Unicode (names inverted) | Unicode → ASCII | — (no codec; IDNA via punycoder) |
| Decode multi-label xn--hxakfddc2amo8b.xn--qxam | ελράδειγμα.ελ ✓ | ελράδειγμα.ελ ✓ | ελράδειγμα.ελράδειγμα ✗ (second label corrupted) | — (no xn-- → Unicode decoder) |
| Re-encode an already-xn-- label | unchanged, idempotent ✓ | unchanged ✓ | xn--xn--…-.xn--xn--…- ✗ (double-encoded) | — |
| Round-trip decode(encode(x)) == x | yes | yes | no (from the decode bug above) | — |
| gr€€n.no (EURO SIGN, valid under UTS #46) | accepted → xn--grn-l50aa.no | rejected by puny_tld_check (IDNA2008) | — | parses; host preserved |
| Full-URL input (http://…) | rejected with an actionable error pointing at a URL parser (rurl) | n/a (domain-only) | passed through unchanged | parsed: scheme/host/domain/TLD extracted; get_clean_url() lowercases the host and resolves dot-segments |
| Required system library | none (libidn2 optional) | GNU libidn (v1) required to build | none | none |
punycode names its functions opposite to the usual convention: punycode::puny_encode() maps xn-- → Unicode and punycode::puny_decode() maps Unicode → xn--. The rows above align by transform direction, not by function name.
punycoder + rurl (+ pslr for the public-suffix/TLD truth) are designed to compose: rurl parses the URL and hands the host to punycoder for IDNA canonicalization, each package owning a single concern.
These packages build on data, libraries, and prior work from many others. See ACKNOWLEDGMENTS.md for the full list of thanks.
punycoder is part of a small ecosystem of R packages by the same author:
- pslr: Public Suffix List engine that uses punycoder for IDNA canonicalization. Use it for eTLD and registrable-domain queries.
- rurl: Full URL parsing, normalization, and joining toolkit built on top of both punycoder and pslr.
If you use punycoder in your work, please cite it. Run citation("punycoder")
for the current citation, or see CITATION.cff.
Each release is archived on Zenodo. Cite the concept DOI 10.5281/zenodo.20973629 to refer to the software in general (it always resolves to the latest version), or the version-specific DOI shown on the Zenodo record for a particular release.
We welcome contributions. See CONTRIBUTING.md for the current development workflow.
Please note that this package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
MIT