Summary
I identified several issues in the google/robotstxt library during fuzzing and manual code review. These observations involve URL matching consistency, algorithmic complexity, boundary handling, and integer arithmetic robustness within robots.cc.
The most significant finding is an inconsistency in how percent-encoded paths are normalized and matched. This can lead to different policy decisions for semantically equivalent URLs.
These findings were previously shared with the Google OSS VRP, and I was advised to open an issue here for project-level review and discussion.
Observations & Analysis
1. Inconsistent Percent-Encoding Matching
Detail
MaybeEscapePattern() normalizes lowercase percent-encoded sequences appearing in robots.txt rules (e.g. %aa → %AA).
However, the URI path extracted by GetPathParamsQuery() does not appear to undergo equivalent normalization before matching.
As a result, semantically equivalent percent-encoded representations may produce different matching outcomes because matching is performed using case-sensitive comparisons.
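Example
Rule: Disallow: /secret%aa
Internal representation: /secret%AA
URI: http://example.com/secret%aa
Observed result: the rule does not match, so the URI is allowed
Expected result: the rule matches, so the URI is disallowed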
Impact
This is not an authorization bypass in the traditional sense because robots.txt is not an authorization mechanism.
However, it appears to cause inconsistent policy evaluation where equivalent URL representations can lead to different crawler decisions. Applications relying on robotstxt matching for crawler restrictions or policy enforcement may observe unintended behavior.
2. Algorithmic Complexity in Matches
Detail
The matching algorithm exhibits O(N × M) worst-case behavior, where N is the path length and M is the pattern length.
Specific combinations of long paths and wildcard-heavy patterns (for example *a*a*a...) can cause significant growth in processing time.
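To make the growth concrete, here is a minimal sketch of a position-set wildcard matcher with a step counter. It is written for illustration only and is not copied from robots.cc; the names WildcardMatch, WildcardHeavyPattern, and the steps parameter are hypothetical, and std::string_view stands in for absl::string_view so the block is self-contained.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Illustrative matcher (an assumption, not the library's code): '*' matches
// any run of characters and a trailing '$' anchors the end of the path.
// `steps` counts inner-loop iterations so the work can be compared.
bool WildcardMatch(std::string_view path, std::string_view pattern,
                   std::size_t* steps) {
  // pos[0..numpos) holds every path offset reachable after the pattern
  // prefix consumed so far, in increasing order.
  std::vector<std::size_t> pos(path.size() + 1);
  std::size_t numpos = 1;
  pos[0] = 0;
  for (std::size_t k = 0; k < pattern.size(); ++k) {
    const char c = pattern[k];
    if (c == '$' && k + 1 == pattern.size()) {
      return pos[numpos - 1] == path.size();
    }
    if (c == '*') {
      // A wildcard makes every offset from pos[0] to the end reachable.
      numpos = path.size() - pos[0] + 1;
      for (std::size_t i = 1; i < numpos; ++i) {
        pos[i] = pos[i - 1] + 1;
        ++*steps;
      }
    } else {
      // A literal keeps only the offsets whose next character matches.
      std::size_t kept = 0;
      for (std::size_t i = 0; i < numpos; ++i) {
        ++*steps;
        if (pos[i] < path.size() && path[pos[i]] == c) {
          pos[kept++] = pos[i] + 1;
        }
      }
      numpos = kept;
      if (numpos == 0) return false;
    }
  }
  return true;
}

// Builds the wildcard-heavy pattern "*a*a*a..." with `repeats` copies of "*a".
std::string WildcardHeavyPattern(std::size_t repeats) {
  std::string pattern;
  for (std::size_t i = 0; i < repeats; ++i) pattern += "*a";
  return pattern;
}
```

In this sketch, doubling both the path length (a run of 'a' characters) and the number of "*a" repeats roughly quadruples the step count, which is consistent with the O(N × M) estimate.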
Measurements
| Path Length | Pattern Size | Time |
|---|---|---|
| 10,000 | 200 | 0.05s |
| 80,000 | 1,600 | 3.73s |
Impact
This may lead to excessive CPU consumption in high-throughput parsing environments.
3. Boundary Handling in ExtractUserAgent
Detail
ExtractUserAgent() iterates through a string_view without an explicit p < end boundary check.
While normal usage may provide safe backing storage, the parser can advance beyond the logical range represented by the string_view.
Impact
This is primarily a robustness and defensive-programming concern. Adding explicit boundary validation would ensure that parsing remains constrained to the intended slice.
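A bounded version of the scan could look like the sketch below. It assumes the accepted user-agent characters are [a-zA-Z_-]; that character set, the function name ExtractUserAgentBounded, and the use of std::string_view instead of absl::string_view are assumptions made for a self-contained example, not a patch against robots.cc.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical bounded variant: returns the leading run of [a-zA-Z_-]
// characters and never reads at or past user_agent.size().
std::string_view ExtractUserAgentBounded(std::string_view user_agent) {
  std::size_t len = 0;
  while (len < user_agent.size()) {  // Explicit bound on every iteration.
    const char c = user_agent[len];
    const bool is_alpha = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    if (!is_alpha && c != '-' && c != '_') break;
    ++len;
  }
  return user_agent.substr(0, len);
}
```

With this shape, a view such as std::string_view("FooBotExtra", 6) yields "FooBot": the scan stops at the end of the slice even though the backing buffer continues with more letters.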
4. Integer Arithmetic Robustness in Matches
Detail
numpos is stored as a 32-bit int.
Calculations derived from path length may overflow for sufficiently large inputs, potentially resulting in unexpected behavior.
Impact
Although extremely large paths are uncommon, replacing path-length-derived counters with size_t would improve resilience and eliminate this edge case.
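The arithmetic behind this edge case can be sketched as follows. The helper name PositionCountFitsInInt32 is hypothetical, and the sketch assumes the counter needs to hold up to path length + 1 positions (one per offset, including the end of the path).

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical check: true when `path_length + 1` positions can be counted
// in a signed 32-bit integer without overflow.
bool PositionCountFitsInInt32(std::size_t path_length) {
  constexpr std::size_t kInt32Max =
      static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max());
  // path_length + 1 <= kInt32Max, written without the addition so the
  // comparison itself cannot wrap.
  return path_length < kInt32Max;
}
```

Under that assumption, a 32-bit counter is safe only for paths shorter than 2,147,483,647 bytes, whereas a size_t counter has the same range as the path length itself.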
Reproduction (Observation #1)
The matching inconsistency can be reproduced using the bundled CLI tool.
# Create robots.txt
echo -e "User-agent: *\nDisallow: /secret%aa" > robots.txt
# Run matcher
./build/robots robots.txt FooBot "http://example.com/secret%aa"
Observed: the URI is reported as allowed.
Expected: the URI is reported as disallowed.
Suggested Fixes
Fix 1: Consistent Percent-Encoding Normalization
Apply the same percent-encoding normalization logic to incoming URI paths before matching.
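One possible shape for this, as a minimal sketch: uppercase the hex digits of each well-formed %XX escape in the URI path before it is compared against the rule. NormalizePercentEncodingCase is a hypothetical helper name, not an existing function in robots.cc, and it covers only hex-digit case; any other escaping that MaybeEscapePattern() performs is out of scope here.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: uppercases the two hex digits of every well-formed
// %XX escape and leaves all other bytes untouched.
std::string NormalizePercentEncodingCase(std::string_view path) {
  std::string out(path);
  for (std::size_t i = 0; i + 2 < out.size(); ++i) {
    const unsigned char hi = static_cast<unsigned char>(out[i + 1]);
    const unsigned char lo = static_cast<unsigned char>(out[i + 2]);
    if (out[i] == '%' && std::isxdigit(hi) && std::isxdigit(lo)) {
      out[i + 1] = static_cast<char>(std::toupper(hi));
      out[i + 2] = static_cast<char>(std::toupper(lo));
      i += 2;  // Skip past the escape that was just rewritten.
    }
  }
  return out;
}
```

Applied to the path from the reproduction, /secret%aa becomes /secret%AA, which is the same form the rule is stored in, so a case-sensitive comparison would then treat the two as equal.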
Fix 2: Complexity Safeguards
Consider introducing a complexity threshold, state limit, or other safeguards for wildcard-heavy patterns.
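As one illustration of such a safeguard, the sketch below collapses runs of consecutive wildcards and applies a simple work budget before matching. The names CollapseWildcards and WithinMatchBudget and the kMaxMatchWork value are hypothetical and chosen only for this example; what a caller should do when the budget is exceeded is a policy decision left open here.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical limit on path length times pattern length; the value is an
// assumption for illustration, not a recommendation.
constexpr std::size_t kMaxMatchWork = 1000000;

// Collapses runs of '*' into a single '*'. Consecutive wildcards match the
// same set of strings as one wildcard, so this only shortens the pattern.
std::string CollapseWildcards(std::string_view pattern) {
  std::string out;
  out.reserve(pattern.size());
  for (const char c : pattern) {
    if (c == '*' && !out.empty() && out.back() == '*') continue;
    out.push_back(c);
  }
  return out;
}

// Conservative estimate of matcher work, compared by division so the
// product of the two lengths is never computed and cannot overflow.
bool WithinMatchBudget(std::string_view path, std::string_view pattern) {
  if (pattern.empty()) return true;
  return path.size() <= kMaxMatchWork / pattern.size();
}
```

With this budget, the larger measured case above (path length 80,000, pattern size 1,600) would be flagged before matching starts, while short everyday paths and patterns pass unchanged.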
Fix 3: Boundary Validation
Add explicit bounds checking in ExtractUserAgent(), for example: while (p < end && ...)
Fix 4: Type Safety
Use size_t for path-length-derived counters and indices in Matches().