[Security] Multiple Memory Safety and DoS Vulnerabilities in robots.cc #85

@izumi-hyun

Summary

I identified several issues in the google/robotstxt library during fuzzing and manual code review. These observations involve URL matching consistency, algorithmic complexity, boundary handling, and integer arithmetic robustness within robots.cc.

The most significant finding is an inconsistency in how percent-encoded paths are normalized and matched. This can lead to different policy decisions for semantically equivalent URLs.

These findings were previously shared with the Google OSS VRP, and I was advised to open an issue here for project-level review and discussion.


Observations & Analysis

1. Inconsistent Percent-Encoding Matching

Detail

MaybeEscapePattern() normalizes lowercase percent-encoded sequences appearing in robots.txt rules to uppercase (e.g. %aa becomes %AA).

However, the URI path extracted by GetPathParamsQuery() does not appear to undergo equivalent normalization before matching.

As a result, semantically equivalent percent-encoded representations may produce different matching outcomes because matching is performed using case-sensitive comparisons.

Example

Rule:

Disallow: /secret%aa

Internal representation:

/secret%AA

URI:

/secret%aa

Observed result:

ALLOWED

Expected result:

DISALLOWED

Impact

This is not an authorization bypass in the traditional sense because robots.txt is not an authorization mechanism.

However, it appears to cause inconsistent policy evaluation where equivalent URL representations can lead to different crawler decisions. Applications relying on robotstxt matching for crawler restrictions or policy enforcement may observe unintended behavior.


2. Algorithmic Complexity in Matches

Detail

The matching algorithm exhibits O(N × M) worst-case behavior, where N is the path length and M is the pattern length.

Specific combinations of long paths and wildcard-heavy patterns (for example *a*a*a...) can cause significant growth in processing time.

Measurements

Path Length    Pattern Size    Time
10,000         200             0.05s
80,000         1,600           3.73s

Impact

This may lead to excessive CPU consumption in high-throughput parsing environments.


3. Boundary Handling in ExtractUserAgent

Detail

ExtractUserAgent() iterates through a string_view without an explicit p < end boundary check.

While normal usage may provide safe backing storage, the parser can advance beyond the logical range represented by the string_view.

Impact

This is primarily a robustness and defensive-programming concern. Adding explicit boundary validation would ensure that parsing remains constrained to the intended slice.


4. Integer Arithmetic Robustness in Matches

Detail

numpos is stored as a 32-bit int.

Calculations derived from path length may overflow for sufficiently large inputs, potentially resulting in unexpected behavior.

Impact

Although extremely large paths are uncommon, replacing path-length-derived counters with size_t would improve resilience and eliminate this edge case.


Reproduction (Observation #1)

The matching inconsistency can be reproduced using the bundled CLI tool.

# Create robots.txt (printf is used because echo -e is not portable across shells)
printf 'User-agent: *\nDisallow: /secret%%aa\n' > robots.txt

# Run matcher
./build/robots robots.txt FooBot "http://example.com/secret%aa"

Observed:

ALLOWED

Expected:

DISALLOWED

Suggested Fixes

Fix 1: Consistent Percent-Encoding Normalization

Apply the same percent-encoding normalization logic to incoming URI paths before matching.
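A minimal sketch of what such a normalization step could look like is shown here. The helper name NormalizePercentEncodingCase is hypothetical and not part of robots.cc. It only uppercases the hex digits of well-formed %XX triplets and does not attempt the other escaping that MaybeEscapePattern() may perform.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of robots.cc): uppercases the two hex digits
// of every well-formed %XX triplet and leaves all other bytes untouched.
std::string NormalizePercentEncodingCase(std::string_view path) {
  std::string out(path);
  for (std::size_t i = 0; i + 2 < out.size(); ++i) {
    const unsigned char hi = static_cast<unsigned char>(out[i + 1]);
    const unsigned char lo = static_cast<unsigned char>(out[i + 2]);
    if (out[i] == '%' && std::isxdigit(hi) && std::isxdigit(lo)) {
      out[i + 1] = static_cast<char>(std::toupper(hi));
      out[i + 2] = static_cast<char>(std::toupper(lo));
      i += 2;  // Skip the two digits that were just rewritten.
    }
  }
  return out;
}
```

Under this sketch, both /secret%aa and /secret%AA normalize to /secret%AA, so a case-sensitive comparison against the stored rule would treat them identically.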

Fix 2: Complexity Safeguards

Consider introducing a complexity threshold, state limit, or other safeguards for wildcard-heavy patterns.

Fix 3: Boundary Validation

Add explicit bounds checking in ExtractUserAgent():

while (p < end && ...)
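As a fuller sketch of that loop, the function below scans a product token while staying inside the view. The name ExtractProductTokenBounded is hypothetical, and the accepted character set (ASCII letters, '-' and '_') is an assumption about what the existing function accepts, so it should be checked against robots.cc before being adopted.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical bounded variant: never dereferences at or beyond `end`.
std::string_view ExtractProductTokenBounded(std::string_view user_agent) {
  const char* p = user_agent.data();
  const char* const end = p + user_agent.size();
  // Assumed token alphabet: ASCII letters, '-' and '_'.
  while (p < end &&
         (std::isalpha(static_cast<unsigned char>(*p)) || *p == '-' ||
          *p == '_')) {
    ++p;
  }
  return user_agent.substr(0, static_cast<std::size_t>(p - user_agent.data()));
}
```

The difference matters when the view is a slice of a larger buffer: for a view covering only the first six characters of "FooBotExtra", the bounded loop stops at "FooBot" instead of continuing into the bytes that follow.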

Fix 4: Type Safety

Use size_t for path-length-derived counters and indices in Matches().
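To make the edge case concrete, the small check below shows when a counter derived from the path length stops fitting in an int. It assumes the counter can reach the path length plus one (one entry per possible offset into the path). That bound is my reading of the code rather than something confirmed by the maintainers, and the helper name PositionCountOverflowsInt is hypothetical.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical check: true when a counter that may reach path_len + 1
// cannot be represented in a signed int.
constexpr bool PositionCountOverflowsInt(std::size_t path_len) {
  return path_len >=
         static_cast<std::size_t>(std::numeric_limits<int>::max());
}
```

Declaring the counter as size_t removes the need for a check like this, because the counter then has the same range as the path length itself.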
