[Security] Multiple Memory Safety and DoS Vulnerabilities in robots.cc

### Summary

I identified several issues in the `google/robotstxt` library during fuzzing and manual code review. These observations involve URL matching consistency, algorithmic complexity, boundary handling, and integer arithmetic robustness within `robots.cc`.

The most significant finding is an inconsistency in how percent-encoded paths are normalized and matched. This can lead to different policy decisions for semantically equivalent URLs.

These findings were previously shared with the Google OSS VRP, and I was advised to open an issue here for project-level review and discussion.

---

### Observations & Analysis

#### 1. Inconsistent Percent-Encoding Matching

**Detail**

`MaybeEscapePattern()` normalizes lowercase percent-encoded sequences appearing in `robots.txt` rules (e.g. `%aa` → `%AA`).

However, the URI path extracted by `GetPathParamsQuery()` does not appear to undergo equivalent normalization before matching.

As a result, semantically equivalent percent-encoded representations may produce different matching outcomes because matching is performed using case-sensitive comparisons.

**Example**

Rule:

```text
Disallow: /secret%aa
```

Internal representation:

```text
/secret%AA
```

URI:

```text
/secret%aa
```

Observed result:

```text
ALLOWED
```

Expected result:

```text
DISALLOWED
```

**Impact**

This is not an authorization bypass in the traditional sense because `robots.txt` is not an authorization mechanism.

However, it appears to cause inconsistent policy evaluation where equivalent URL representations can lead to different crawler decisions. Applications relying on `robotstxt` matching for crawler restrictions or policy enforcement may observe unintended behavior.

---

#### 2. Algorithmic Complexity in `Matches`

**Detail**

The matching algorithm exhibits `O(N × M)` worst-case behavior, where `N` is the path length and `M` is the pattern length.

Specific combinations of long paths and wildcard-heavy patterns (for example `*a*a*a...`) can cause significant growth in processing time.

**Measurements**

| Path Length | Pattern Size | Time |
|------------|------------|------|
| 10,000 | 200 | 0.05s |
| 80,000 | 1600 | 3.73s |
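
The growth can be illustrated with a small, self-contained matcher that counts its own inner-loop iterations. This is a simplified sketch of a position-set wildcard matcher, written for this report and not copied from `robots.cc`; the names `CountingMatch` and `MatchResult` are hypothetical, and it only assumes that `*` matches any run of characters and that a trailing `$` anchors the end of the path.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Result of one match attempt plus the number of inner-loop iterations spent.
struct MatchResult {
  bool matched;
  std::size_t steps;
};

// Simplified position-set matcher (illustrative only). `pos` holds every path
// offset that the pattern prefix consumed so far could have reached.
MatchResult CountingMatch(std::string_view path, std::string_view pattern) {
  std::vector<std::size_t> pos(path.size() + 1);
  std::size_t numpos = 1;
  pos[0] = 0;
  std::size_t steps = 0;

  for (std::size_t i = 0; i < pattern.size(); ++i) {
    const char c = pattern[i];
    if (c == '$' && i + 1 == pattern.size()) {
      // End anchor: the furthest live position must be the end of the path.
      return {pos[numpos - 1] == path.size(), steps};
    }
    if (c == '*') {
      // A wildcard makes every offset from the smallest live one reachable.
      numpos = path.size() - pos[0] + 1;
      for (std::size_t k = 1; k < numpos; ++k) {
        pos[k] = pos[k - 1] + 1;
        ++steps;
      }
    } else {
      // A literal keeps only the positions whose next byte equals `c`.
      std::size_t newnumpos = 0;
      for (std::size_t k = 0; k < numpos; ++k) {
        ++steps;
        if (pos[k] < path.size() && path[pos[k]] == c) {
          pos[newnumpos++] = pos[k] + 1;
        }
      }
      numpos = newnumpos;
      if (numpos == 0) return {false, steps};
    }
  }
  return {true, steps};
}
```

In this sketch, a path of 100 `a` characters matched against ten repetitions of `*a` takes 1,920 steps, and doubling both the path and the pattern takes 7,640 steps, close to four times the work.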

**Impact**

This may lead to excessive CPU consumption in high-throughput environments that match many URLs against untrusted `robots.txt` rules.

---

#### 3. Boundary Handling in `ExtractUserAgent`

**Detail**

`ExtractUserAgent()` iterates through a `string_view` without an explicit `p < end` boundary check.

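While normal usage may provide safe backing storage, the loop stops only when it reaches a character outside the allowed set. If the `string_view` is a slice of a larger buffer, the parser can therefore advance beyond the logical range represented by the view.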

**Impact**

This is primarily a robustness and defensive-programming concern. Adding explicit boundary validation would ensure that parsing remains constrained to the intended slice.

---

#### 4. Integer Arithmetic Robustness in `Matches`

**Detail**

`numpos` is stored as a 32-bit `int`.

Calculations derived from path length may overflow once the path approaches 2 GiB (`INT_MAX` is 2,147,483,647 for a 32-bit `int`), potentially resulting in unexpected behavior.
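
As a rough illustration of where the limit sits, the check below tests whether the number of candidate positions (path length plus one) still fits in an `int`. The helper name `PositionCountFitsInInt` is hypothetical, and the sketch assumes that the largest count derived from the path is `path_len + 1`.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper: true when path_len + 1 is representable as an int.
// Written as a comparison on path_len so the check itself cannot overflow.
constexpr bool PositionCountFitsInInt(std::size_t path_len) {
  return path_len < static_cast<std::size_t>(std::numeric_limits<int>::max());
}
```

A check like this could guard the existing `int` counter, although switching the counter to `size_t` removes the need for it.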

**Impact**

Although extremely large paths are uncommon, replacing path-length-derived counters with `size_t` would improve resilience and eliminate this edge case.

---

### Reproduction (Observation #1)

The matching inconsistency can be reproduced using the bundled CLI tool.

```bash
# Create robots.txt (printf is more portable than echo -e; %% emits a literal %)
printf 'User-agent: *\nDisallow: /secret%%aa\n' > robots.txt

# Run matcher
./build/robots robots.txt FooBot "http://example.com/secret%aa"
```

Observed:

```text
ALLOWED
```

Expected:

```text
DISALLOWED
```

---

### Suggested Fixes

#### Fix 1: Consistent Percent-Encoding Normalization

Apply the same percent-encoding normalization logic to incoming URI paths before matching.
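
A minimal sketch of what such a normalization step might look like is shown here. `UppercasePercentEscapes` is a hypothetical helper, not an existing function in `robots.cc`, and it covers only the hex-case normalization described in Observation #1; any other escaping that `MaybeEscapePattern()` performs is out of scope.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: uppercases the two hex digits of every well-formed
// %XX escape and leaves all other bytes, including malformed escapes, as-is.
std::string UppercasePercentEscapes(std::string_view path) {
  std::string out(path);
  for (std::size_t i = 0; i + 2 < out.size(); ++i) {
    const unsigned char h1 = static_cast<unsigned char>(out[i + 1]);
    const unsigned char h2 = static_cast<unsigned char>(out[i + 2]);
    if (out[i] == '%' && std::isxdigit(h1) && std::isxdigit(h2)) {
      out[i + 1] = static_cast<char>(std::toupper(h1));
      out[i + 2] = static_cast<char>(std::toupper(h2));
      i += 2;  // Skip the two digits that were just normalized.
    }
  }
  return out;
}
```

With this helper, `/secret%aa` becomes `/secret%AA`, so the path and the stored rule would compare equal under a case-sensitive match.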

#### Fix 2: Complexity Safeguards

Consider introducing a complexity threshold, state limit, or other safeguards for wildcard-heavy patterns.
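
One possible shape for such a safeguard is sketched below. Both helpers are hypothetical and the budget value is left to the caller; the sketch assumes that consecutive `*` characters are equivalent to a single `*` and that `path.size() × pattern.size()` is an acceptable upper bound on the matching work.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: collapses runs of '*' into one '*'. This shortens
// degenerate patterns such as "/a***b" without changing what they match.
std::string CollapseWildcards(std::string_view pattern) {
  std::string out;
  out.reserve(pattern.size());
  for (const char c : pattern) {
    if (c == '*' && !out.empty() && out.back() == '*') continue;
    out.push_back(c);
  }
  return out;
}

// Hypothetical budget check: true when path.size() * pattern.size() does not
// exceed max_work. Division is used so the product is never computed.
bool WithinMatchBudget(std::string_view path, std::string_view pattern,
                       std::size_t max_work) {
  return pattern.empty() || path.size() <= max_work / pattern.size();
}
```

Collapsing alone does not help against patterns such as `*a*a*a...`, so the budget check is the part that bounds the worst case. How a caller should treat an over-budget rule (skip it, or treat it as a non-match) is a policy decision for the maintainers.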

#### Fix 3: Boundary Validation

Add explicit bounds checking in `ExtractUserAgent()`:

```cpp
while (p < end && ...)
```
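
A self-contained sketch of a bounded scan follows. `LeadingProductToken` is a hypothetical stand-in rather than the library function, and it assumes that the allowed user-agent characters are `[a-zA-Z_-]`.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for the scan in ExtractUserAgent(): returns the
// leading run of allowed characters without reading past user_agent.size().
std::string_view LeadingProductToken(std::string_view user_agent) {
  const auto allowed = [](char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '-' ||
           c == '_';
  };
  std::size_t n = 0;
  while (n < user_agent.size() && allowed(user_agent[n])) {
    ++n;
  }
  return user_agent.substr(0, n);
}
```

Because the length check comes first in the loop condition, the scan stays inside the view even when it is a slice of a larger buffer with no terminating character.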

#### Fix 4: Type Safety

Use `size_t` for path-length-derived counters and indices in `Matches()`.

