feat(grep): real PCRE -P via fancy-regex and GNU long-option aliases #1846

Merged
chaliy merged 3 commits into main from claude/vibrant-noether-hlQLp
Jun 3, 2026

@chaliy chaliy commented Jun 3, 2026

What

Two capability improvements to the grep builtin, plus a security hardening of its regex backtracking surface.

1. Real -P (PCRE) via fancy_regex

-P previously aliased ERE on the default regex crate, so lookaround and backreferences silently didn't work. It now routes to the backtracking fancy_regex engine:

  • Lookahead foo(?=bar), lookbehind (?<=\$)\d+, and backreferences (\w+) \1 work like GNU grep -P.
  • A shared Matcher enum in search_common.rs hides the two engines' differing APIs (fancy_regex returns Result from is_match/find_iter).
  • Recursive -P bypasses the indexed-search fast-path, since the backend's regex engine can't speak PCRE and could otherwise drop real matches.
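
The enum-over-two-engines idea can be sketched without either regex crate. This is a minimal illustration only: `LiteralEngine`, `BudgetedEngine`, `BudgetExceeded`, and the length-based budget are hypothetical stand-ins (they are not `regex` or `fancy_regex` APIs), and the real `Matcher` in search_common.rs may be shaped differently. What it shows is the one property described above: one engine answers with a plain `bool`, the other with a `Result`, and the wrapper presents a single infallible `is_match` in which an engine error counts as "no match".

```rust
/// Stand-in for an infallible engine (plays the role of the default regex
/// engine). Hypothetical: it only matches a literal substring.
struct LiteralEngine {
    needle: String,
}

impl LiteralEngine {
    fn is_match(&self, text: &str) -> bool {
        text.contains(&self.needle)
    }
}

/// Error returned by the fallible stand-in engine.
#[derive(Debug, PartialEq)]
struct BudgetExceeded;

/// Stand-in for a fallible engine (plays the role of a backtracking engine).
/// Hypothetical: it refuses inputs longer than `step_limit` bytes, as a crude
/// substitute for a real backtrack limit.
struct BudgetedEngine {
    needle: String,
    step_limit: usize,
}

impl BudgetedEngine {
    fn is_match(&self, text: &str) -> Result<bool, BudgetExceeded> {
        if text.len() > self.step_limit {
            return Err(BudgetExceeded);
        }
        Ok(text.contains(&self.needle))
    }
}

/// One type for callers, two engines underneath.
enum Matcher {
    Standard(LiteralEngine),
    Fancy(BudgetedEngine),
}

impl Matcher {
    /// Callers always get a `bool`; an engine error is treated as "no match".
    fn is_match(&self, text: &str) -> bool {
        match self {
            Matcher::Standard(engine) => engine.is_match(text),
            Matcher::Fancy(engine) => engine.is_match(text).unwrap_or(false),
        }
    }
}
```

With this shape, call sites never branch on the engine kind, which is the point of hiding the differing APIs behind one enum.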

2. GNU long-option aliases

Scripts using grep --ignore-case etc. previously failed (unknown long options were silently ignored). Added long-form aliases for every supported short flag — --ignore-case, --invert-match, --line-number, --count, --only-matching, --word-regexp, --line-regexp, --fixed-strings, --extended-regexp, --basic-regexp, --perl-regexp, --quiet/--silent, --byte-offset, --text, --null-data, --recursive, --with-filename, --no-filename, --regexp=PAT, --file=FILE, --max-count=N, --after-context=N, --before-context=N, --context=N — accepting both --name=value and --name value. Also added -G/--basic-regexp.
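
The two accepted value forms can be sketched as below. This is an illustrative helper, not bashkit's actual parser: `parse_long_value`, its return shape, and the error strings are assumptions made for the example. A full parser would also need to know which long options take a value at all, since a flag such as --ignore-case must not consume the next argument.

```rust
/// Splits a value-taking long option at `args[i]` into `(name, value,
/// consumed)`, accepting both `--name=value` (consumes 1 argument) and
/// `--name value` (consumes 2). Hypothetical helper for illustration.
fn parse_long_value<'a>(
    args: &[&'a str],
    i: usize,
) -> Result<(&'a str, &'a str, usize), String> {
    let raw: &'a str = args
        .get(i)
        .copied()
        .ok_or_else(|| "argument index out of range".to_string())?;
    let arg = raw
        .strip_prefix("--")
        .ok_or_else(|| format!("not a long option: {raw}"))?;

    // Inline form: everything after the first '=' is the value.
    if let Some((name, value)) = arg.split_once('=') {
        return Ok((name, value, 1));
    }

    // Space-separated form: the value is the next argument, if there is one.
    match args.get(i + 1).copied() {
        Some(value) => Ok((arg, value, 2)),
        None => Err(format!("option '--{arg}' requires an argument")),
    }
}
```

The `consumed` count lets the caller advance its argument index by one or two, so both forms go through the same code path.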

3. Fix + hardening

  • -b with -o now reports the match's byte offset rather than the line start (matches GNU).
  • grep -P backtracking is bounded by FANCY_BACKTRACK_LIMIT (1M steps, same posture as sed); a pattern that exceeds it yields "no match" instead of hanging the sandbox.
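
The -b/-o offset fix comes down to adding the line's starting offset to the match's position within the line. The sketch below shows that arithmetic with a literal needle and std-only string search; `match_offsets` is a hypothetical name, and the real code works on regex match ranges rather than `match_indices`.

```rust
/// Returns `(byte_offset, matched_text)` for every occurrence of `needle`,
/// with offsets counted from the start of `input` rather than from the start
/// of the line containing the match. Hypothetical helper for illustration.
fn match_offsets<'a>(input: &'a str, needle: &str) -> Vec<(usize, &'a str)> {
    let mut out = Vec::new();
    let mut line_start = 0; // byte offset of the current line within `input`
    for line in input.split_inclusive('\n') {
        for (pos, matched) in line.match_indices(needle) {
            // Reporting `line_start` alone would give the line's offset;
            // adding `pos` gives the offset of the match itself.
            out.push((line_start + pos, matched));
        }
        line_start += line.len(); // includes the trailing '\n', if any
    }
    out
}
```

For the input `"ab foo\nxx foo foo\n"` and the needle `foo`, this yields offsets 3, 10, and 14, whereas reporting line starts would yield 0, 7, and 7.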

Why

This was scoped from a review of what's adoptable from uutils/grep. Its -P uses Oniguruma (onig/onig_sys, a C lib) — incompatible with bashkit's pure-Rust / WASM / sandbox constraints — so the crate-frugal path is reusing fancy_regex, which is already an always-on dependency (used by sed, rg, jq). No new crate is added.

How / Tests

  • 14 new unit tests: PCRE lookahead/lookbehind/backreference, --perl-regexp, invalid-pattern error path, long-option aliases (inline + space-separated value forms, missing-value error), -b+-o offset, and a catastrophic-backtracking regression test ((a+)+$) proving the backtrack limit terminates.
  • Default-feature behavior (BRE/ERE/-F) unchanged; differential spec tests don't use -P.
  • Gates: cargo fmt --check, cargo clippy --all-targets -- -D warnings, cargo test all green. TM-INF-022 source-scan passes.

Specs

  • specs/implementation-status.md: updated grep feature list.
  • specs/threat-model.md: TM-DOS-025 moves from Partial → MITIGATED (linear-time default engine; fancy-regex paths capped by FANCY_BACKTRACK_LIMIT), with a // THREAT[TM-DOS-025] marker at the mitigation point.

Generated by Claude Code

chaliy added 2 commits June 3, 2026 03:53
-P now routes to the backtracking fancy_regex engine (bounded by
FANCY_BACKTRACK_LIMIT) instead of aliasing ERE, so lookaround and
backreferences work like GNU grep -P. Recursive -P bypasses the indexed
search fast-path so a backend regex that can't speak PCRE cannot drop
real matches. A shared Matcher enum in search_common.rs hides the two
engines' differing APIs.

Adds GNU long-option aliases for every supported short flag
(--ignore-case, --invert-match, --max-count=N, --regexp=PAT, etc.),
accepting both --name=value and --name value forms, plus -G/--basic-regexp.
Also fixes -b with -o to report the match's byte offset rather than the
line start.

Add a catastrophic-backtracking regression test for grep -P and a
THREAT[TM-DOS-025] marker at the fancy-regex mitigation point. Update
the threat model: the default regex engine is linear-time and the
fancy-regex paths (grep -P, sed) are capped by FANCY_BACKTRACK_LIMIT,
so TM-DOS-025 moves from Partial to MITIGATED.
Copilot AI review requested due to automatic review settings June 3, 2026 04:08

cloudflare-workers-and-pages Bot commented Jun 3, 2026

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ Deployment successful! (View logs) | bashkit | 4f680c8 | Commit Preview URL, Branch Preview URL | Jun 03 2026, 04:17 AM |

Copilot AI left a comment

Pull request overview

This PR enhances the grep builtin by adding true PCRE-style -P support via fancy_regex, introducing GNU-compatible long-option aliases, and documenting/mitigating regex backtracking DoS risk via a bounded backtrack limit.

Changes:

  • Add a shared Matcher abstraction and route grep -P to fancy_regex with a fixed FANCY_BACKTRACK_LIMIT.
  • Implement GNU long-option aliases (including --name=value and --name value forms) and add -G/--basic-regexp.
  • Align -b + -o byte-offset behavior with GNU grep and update specs + add unit tests.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| specs/threat-model.md | Marks TM-DOS-025 as mitigated and documents the new backtracking cap behavior. |
| specs/implementation-status.md | Updates grep's implemented feature list to reflect -G, true -P, long options, and -b+-o behavior. |
| crates/bashkit/src/builtins/search_common.rs | Introduces Matcher, FANCY_BACKTRACK_LIMIT, and a fancy_regex matcher builder. |
| crates/bashkit/src/builtins/grep.rs | Adds -P PCRE path, long-option parsing, indexed-search bypass for -P, byte-offset fix for -o, and new tests. |

Comment on lines +60 to +74
    /// Byte ranges `(start, end)` of all non-overlapping matches, left to
    /// right. Slice `text[start..end]` to recover the matched substring.
    pub(crate) fn find_ranges(&self, text: &str) -> Vec<(usize, usize)> {
        match self {
            Matcher::Standard(re) => re.find_iter(text).map(|m| (m.start(), m.end())).collect(),
            // `find_iter` yields `Result<Match, _>`; `flatten` drops the Err
            // arm (backtrack-limit / internal errors), the same "no match"
            // policy as `is_match`.
            Matcher::Fancy(re) => re
                .find_iter(text)
                .flatten()
                .map(|m| (m.start(), m.end()))
                .collect(),
        }
    }
Contributor Author

Keeping the eager Vec here intentionally. find_ranges is only ever called per line (the caller loops over lines first), so the collection is bounded by the matches within a single line, not the whole file. The hot paths that don't use -o (-q, -l/-L, plain matching) use is_match, which already exits early and never calls find_ranges. --max-count is also enforced across lines via total_matches, so the only extra work is collecting one line's matches before the per-line break.

A lazy alternative would have to bridge two different concrete iterator types (regex::Matches vs fancy_regex::Matches, the latter yielding Result), which means either Box<dyn Iterator> (a per-line heap alloc — no better than the Vec) or a hand-rolled enum-iterator (~30 lines) for a gain that's marginal at per-line scale. Not worth the added surface here; will revisit if profiling on a real workload shows it matters.
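
For reference, the hand-rolled enum-iterator alternative mentioned above could look roughly like this. It is a generic, std-only sketch and not code from this PR: `EitherRanges` and its variant names are hypothetical, and the two type parameters stand in for the two crates' concrete match iterators after mapping each match to a `(start, end)` pair. `Err` items are skipped, mirroring the `flatten` policy in `find_ranges`.

```rust
/// One lazy iterator type over two differently-typed sources. `A` yields
/// ranges directly; `B` yields `Result`s, and its `Err` items are skipped.
/// Hypothetical sketch of the "enum-iterator" option.
enum EitherRanges<A, B> {
    Plain(A),
    Fallible(B),
}

impl<A, B, E> Iterator for EitherRanges<A, B>
where
    A: Iterator<Item = (usize, usize)>,
    B: Iterator<Item = Result<(usize, usize), E>>,
{
    type Item = (usize, usize);

    fn next(&mut self) -> Option<Self::Item> {
        match self {
            EitherRanges::Plain(it) => it.next(),
            // Advance past any `Err` items until the next `Ok` range or the
            // end of the underlying iterator.
            EitherRanges::Fallible(it) => it.find_map(Result::ok),
        }
    }
}
```

Because it is lazy, a caller could stop after the first range with `.take(1)` without collecting the rest of the line's matches, at the cost of the extra type shown here.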


Generated by Claude Code

Comment thread crates/bashkit/src/builtins/grep.rs
Comment thread crates/bashkit/src/builtins/grep.rs
Comment thread crates/bashkit/src/builtins/grep.rs
… in --help

Address PR review:
- Pattern-type flags now go through set_pattern_type(PatternType), so the
  last of -G/-E/-F/-P (and their long forms) wins, matching GNU grep.
  Previously -P set perl_regex without clearing extended/fixed, so a later
  -G/-E had no effect.
- --help now lists the GNU long-option aliases alongside each short flag.

Adds last-wins tests for both short and long forms.
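
The last-wins rule can be sketched as a single field that every pattern-type flag overwrites. Only `set_pattern_type` and `PatternType` are names from this PR; the variant names, the `GrepOptions` struct, the `pattern_type_from_args` helper, and the choice of basic regex as the initial value are assumptions made for the example.

```rust
/// Which regex dialect the pattern is parsed as. Variant names are
/// hypothetical.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PatternType {
    Basic,
    Extended,
    Fixed,
    Perl,
}

/// Hypothetical options struct holding one pattern type instead of
/// independent booleans, so a later flag cannot leave an earlier one set.
struct GrepOptions {
    pattern_type: PatternType,
}

impl GrepOptions {
    fn set_pattern_type(&mut self, pattern_type: PatternType) {
        self.pattern_type = pattern_type;
    }
}

/// Scans the arguments left to right; the last pattern-type flag wins.
fn pattern_type_from_args(args: &[&str]) -> PatternType {
    // Assumption: basic regex is the starting point when no flag is given.
    let mut opts = GrepOptions { pattern_type: PatternType::Basic };
    for arg in args {
        let pattern_type = match *arg {
            "-G" | "--basic-regexp" => PatternType::Basic,
            "-E" | "--extended-regexp" => PatternType::Extended,
            "-F" | "--fixed-strings" => PatternType::Fixed,
            "-P" | "--perl-regexp" => PatternType::Perl,
            _ => continue, // not a pattern-type flag
        };
        opts.set_pattern_type(pattern_type);
    }
    opts.pattern_type
}
```

With a single field there is no state in which both "perl" and "extended" are set, which is the failure mode the earlier boolean flags allowed.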
@chaliy chaliy merged commit 87a5516 into main Jun 3, 2026
35 checks passed
@chaliy chaliy deleted the claude/vibrant-noether-hlQLp branch June 3, 2026 04:29