|
1 | | -# php spam word filter based on Double-Array Trie tree |
| 1 | +# PHP Double-Array Trie Tree Extension |
2 | 2 |
|
6 | 6 |
|
7 | | -## Dependence |
| 7 | +`php_double_array_trie_tree` is a native PHP extension that provides high-performance keyword matching based on a Double-Array Trie (DAT) index. |
8 | 8 |
|
9 | | -- php >=5.4 |
| 9 | +The extension is designed for workloads such as sensitive-word filtering, keyword detection, lexical lookup, and other string matching scenarios where deterministic lookup latency and low memory overhead are preferred over regular-expression-based scanning. |
10 | 10 |
|
11 | | -## Useage |
| 11 | +Internally, the project embeds a `libdatrie`-style implementation and exposes a compact PHP API for: |
12 | 12 |
|
13 | | -**Compile and Install** |
| 13 | +- offline dictionary compilation into a serialized trie file |
| 14 | +- low-latency loading of the compiled dictionary |
| 15 | +- first-hit lookup |
| 16 | +- full-hit enumeration across an input string |
14 | 17 |
|
15 | | -```shell |
| 18 | +## Why Double-Array Trie |
| 19 | + |
| 20 | +A Double-Array Trie is a trie representation optimized for compactness and traversal efficiency. Compared with naive hash-based keyword scans or large regular-expression alternations, DAT-based matching typically offers: |
| 21 | + |
| 22 | +- predictable traversal semantics |
| 23 | +- efficient prefix-state transitions |
| 24 | +- reduced runtime overhead for large keyword sets |
| 25 | +- good suitability for repeated lookups against a precompiled dictionary |
| 26 | + |
| 27 | +This extension is therefore a strong fit for read-heavy filtering pipelines where the dictionary is built once and queried many times. |
| 28 | + |
| 29 | +## Feature Set |
| 30 | + |
| 31 | +- Native PHP extension implemented in C |
| 32 | +- Dictionary compilation to a reusable on-disk trie artifact |
| 33 | +- Fast lookup through a serialized double-array trie |
| 34 | +- `searchOne()` for first-match retrieval |
| 35 | +- `searchAll()` for exhaustive match enumeration |
| 36 | +- Support for repeated matches in left-to-right scan order |
| 37 | +- Bundled trie implementation, so no external trie library has to be installed to build the extension |
| 38 | + |
| 39 | +## Architecture Overview |
| 40 | + |
| 41 | +The extension follows a two-phase workflow: |
| 42 | + |
| 43 | +1. Build phase: a PHP array of keywords is compiled into a binary trie file via `Linger\TrieTree::build()`. |
| 44 | +2. Query phase: the trie file is loaded by the constructor and reused for matching operations. |
| 45 | + |
| 46 | +At initialization time, the trie alphabet is configured as the full byte range `0x00-0xff`, which makes the implementation byte-oriented rather than language-specific. In practice, this allows the extension to process arbitrary byte sequences, including UTF-8 encoded text, as long as the dictionary and the input content use the same encoding. |
| 47 | + |
| 48 | +## Requirements |
| 49 | + |
| 50 | +- PHP 5.4 or later |
| 51 | +- `phpize` |
| 52 | +- a C compiler toolchain compatible with PHP extension builds |
| 53 | + |
| 54 | +The source tree contains compatibility branches for legacy PHP 5 as well as modern PHP runtimes. |
| 55 | + |
| 56 | +## Installation |
| 57 | + |
| 58 | +Clone the repository and build the extension with the standard PHP extension toolchain: |
| 59 | + |
| 60 | +```bash |
16 | 61 | git clone https://github.com/liubang/php_double_array_trie_tree.git |
17 | 62 | cd php_double_array_trie_tree |
18 | 63 | phpize |
19 | | -./configure |
20 | | -make && sudo make install |
| 64 | +./configure --enable-linger_TrieTree |
| 65 | +make |
| 66 | +sudo make install |
21 | 67 | ``` |
22 | 68 |
|
| 69 | +Enable the extension in `php.ini`: |
| 70 | + |
| 71 | +```ini |
| 72 | +extension=linger_TrieTree.so |
| 73 | +``` |
| 74 | + |
| 75 | +To verify that the module is loaded: |
| 76 | + |
| 77 | +```bash |
| 78 | +php -m | grep linger_TrieTree |
| 79 | +``` |
| 80 | + |
| 81 | +## Public API |
| 82 | + |
| 83 | +The extension exposes the class `Linger\TrieTree`. |
| 84 | + |
| 85 | +### `Linger\TrieTree::build(array $dict, string $path): bool` |
| 86 | + |
| 87 | +Builds a trie from the provided dictionary and serializes it to `path`. |
| 88 | + |
| 89 | +Parameters: |
| 90 | + |
| 91 | +- `dict`: list of keywords to be indexed |
| 92 | +- `path`: output path for the serialized dictionary file |
| 93 | + |
| 94 | +Returns: |
| 95 | + |
| 96 | +- `true` on success |
| 97 | +- `false` on build failure |
| 98 | + |
| 99 | +### `new Linger\TrieTree(string $path)` |
| 100 | + |
| 101 | +Loads a previously compiled trie from disk. |
| 102 | + |
| 103 | +Parameters: |
| 104 | + |
| 105 | +- `path`: path to the serialized trie file |
| 106 | + |
| 107 | +Behavior: |
| 108 | + |
| 109 | +- throws an exception if the file cannot be loaded |
| 110 | + |
| 111 | +### `searchOne(string $content): ?string` |
| 112 | + |
| 113 | +Scans the input from left to right and returns the first matched keyword encountered during traversal. |
| 114 | + |
| 115 | +Behavior: |
| 116 | + |
| 117 | +- returns the first matched term as a string |
| 118 | +- returns `null` when no match is found |
| 119 | +- throws an exception when `content` is empty |
| 120 | + |
| 121 | +### `searchAll(string $content): array` |
| 122 | + |
| 123 | +Scans the input from left to right and returns all matched keywords in encounter order. |
| 124 | + |
| 125 | +Behavior: |
| 126 | + |
| 127 | +- returns an array of matched terms |
| 128 | +- preserves duplicates when the same keyword appears multiple times |
| 129 | +- returns an empty array when no match is found |
| 130 | +- throws an exception when `content` is empty |
| 131 | + |
| 132 | +## Usage Example |
| 133 | + |
| 134 | +The following example demonstrates a typical moderation-oriented workflow: compile a keyword dictionary once, load the serialized trie, and then execute both first-hit and full-hit matching against an input document. |
| 135 | + |
23 | 136 | ```php |
24 | 137 | <?php |
25 | 138 |
|
26 | | -$words = array('管理员','admin','哈哈','我擦'); |
27 | | -linger\TrieTree::build($words, "./bbb.dic"); |
| 139 | +$dictionary = [ |
| 140 | + '管理员', |
| 141 | + 'administrator', |
| 142 | + '敏感词', |
| 143 | + 'restricted term', |
| 144 | + 'internal use only', |
| 145 | +]; |
| 146 | +$dictionaryFile = __DIR__ . '/keywords.dic'; |
| 147 | + |
| 148 | +Linger\TrieTree::build($dictionary, $dictionaryFile); |
| 149 | + |
| 150 | +$trie = new Linger\TrieTree($dictionaryFile); |
28 | 151 |
|
29 | | -$filter = new linger\TrieTree('./bbb.dic'); |
30 | | -$content = 'this is testadmin这是一段测试文字,哈哈的飞洒管理员的司法地方哈哈,火红的萨来开发大健康我擦'; |
31 | | -$res = $filter->searchOne($content); |
32 | | -var_dump($res); |
33 | | -$res = $filter->searchAll($content); |
34 | | -print_r($res); |
| 152 | +$input = 'This document is marked for internal use only. 请立即通知管理员,因为本段内容包含敏感词和 restricted term。'; |
| 153 | + |
| 154 | +$firstMatch = $trie->searchOne($input); |
| 155 | +var_dump($firstMatch); |
| 156 | + |
| 157 | +$allMatches = $trie->searchAll($input); |
| 158 | +print_r($allMatches); |
35 | 159 | ``` |
36 | 160 |
|
37 | | -## Performance |
| 161 | +In this example: |
| 162 | + |
| 163 | +- `searchOne()` returns the earliest detected keyword in scan order |
| 164 | +- `searchAll()` returns every detected keyword, including repeated occurrences |
| 165 | + |
| 166 | +Representative output: |
| 167 | + |
| 168 | +```text |
| 169 | +string(17) "internal use only" |
| 170 | +Array |
| 171 | +( |
| 172 | + [0] => internal use only |
| 173 | + [1] => 管理员 |
| 174 | + [2] => 敏感词 |
| 175 | + [3] => restricted term |
| 176 | +) |
| 177 | +``` |
| 178 | + |
| 179 | +## Matching Semantics |
| 180 | + |
| 181 | +- Matching is performed over bytes, not Unicode code points. |
| 182 | +- `searchOne()` returns the first successful terminal match discovered during the scan. |
| 183 | +- `searchAll()` enumerates matches in left-to-right order and keeps repeated hits. |
| 184 | +- The dictionary should be built with the same character encoding used by the target content. |
| 185 | + |
| 186 | +For multilingual text, UTF-8 is typically the safest choice as long as both dictionary entries and input strings are consistently UTF-8 encoded. |
38 | 187 |
|
39 | | -- Size of dictionary words: 3954 |
40 | | -- Length of destination string: 14352 |
41 | | -- Times: 100000 |
| 188 | +## Performance Notes |
42 | 189 |
|
43 | | -```shell |
44 | | -liubang@venux:~/workspace/php/test/filter$ php test.php |
| 190 | +The original benchmark included in this project reports the following indicative result: |
| 191 | + |
| 192 | +- Dictionary size: `3954` entries |
| 193 | +- Input length: `14352` characters |
| 194 | +- Iterations: `100000` |
| 195 | + |
| 196 | +```text |
45 | 197 | Double-Array Trie tree: 1.86985206604 |
46 | | - regular expression: 63.114347934723 |
| 198 | +regular expression: 63.114347934723 |
| 199 | +``` |
| 200 | + |
| 201 | +Actual performance will depend on: |
| 202 | + |
| 203 | +- PHP version |
| 204 | +- compiler flags |
| 205 | +- CPU architecture |
| 206 | +- dictionary cardinality and keyword distribution |
| 207 | +- input size and match density |
| 208 | + |
| 209 | +Even so, in the benchmark above the trie lookup ran roughly 34 times faster than the regular-expression approach (1.87 vs 63.11), which is the kind of repeated-lookup workload this extension targets. |
| 210 | + |
| 211 | +## Development and Test |
| 212 | + |
| 213 | +Build locally: |
| 214 | + |
| 215 | +```bash |
| 216 | +phpize |
| 217 | +./configure --enable-linger_TrieTree |
| 218 | +make clean |
| 219 | +make |
47 | 220 | ``` |
48 | 221 |
|
49 | | -## Thanks |
| 222 | +Run the bundled PHPT test suite: |
| 223 | + |
| 224 | +```bash |
| 225 | +make test |
| 226 | +``` |
| 227 | + |
| 228 | +## Repository Layout |
| 229 | + |
| 230 | +- `linger_TrieTree.c`: PHP extension entry points and exposed methods |
| 231 | +- `php_linger_TrieTree.h`: extension-level declarations and compatibility macros |
| 232 | +- `src/datrie/`: embedded double-array trie implementation |
| 233 | +- `tests/`: PHPT regression tests |
| 234 | +- `config.m4` / `config.w32`: Unix and Windows extension build configuration |
| 235 | + |
| 236 | +## Use Cases |
| 237 | + |
| 238 | +- sensitive-word filtering |
| 239 | +- moderation pipelines |
| 240 | +- blacklist or denylist matching |
| 241 | +- keyword-trigger routing |
| 242 | +- exact-term lexical detection in high-throughput services |
| 243 | + |
| 244 | +## Acknowledgements |
| 245 | + |
| 246 | +- Inspired by [wulijun/php-ext-trie-filter](https://github.com/wulijun/php-ext-trie-filter) |
| 247 | +- Built on an embedded [libdatrie](https://linux.thai.net/~thep/datrie/datrie.html)-style trie implementation |
50 | 248 |
|
51 | | -Inspired by [wulijun/php-ext-trie-filter](https://github.com/wulijun/php-ext-trie-filter.git) |
| 249 | +## License |
52 | 250 |
|
53 | | -Depends on [libdatrie-0.2.4](https://linux.thai.net/~thep/datrie/datrie.html) |
| 251 | +This project is released under the [MIT License](LICENSE). |
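
The matching semantics described in the new README (byte-oriented, left-to-right, duplicates preserved) can be illustrated without the extension installed. The following is a minimal sketch in plain PHP 7.1+, not the extension's implementation: it swaps the double-array layout for a simple nested-array trie, and the function names `sketch_build_trie`, `sketch_search_all`, and `sketch_search_one` are hypothetical. The overlap policy (take the shortest keyword at an offset, then resume after it) is an assumption; the README does not specify how the extension treats overlapping keywords.

```php
<?php
// Illustrative stand-in only: a nested-array trie keyed by single bytes.
// This is NOT the double-array layout used by the extension.

/** Build a nested-array trie from a list of keywords (hypothetical helper). */
function sketch_build_trie(array $keywords): array
{
    $root = [];
    foreach ($keywords as $word) {
        if ($word === '') {
            continue; // an empty keyword would match at every offset
        }
        $node = &$root;
        $len = strlen($word); // byte length, not code points
        for ($i = 0; $i < $len; $i++) {
            $byte = $word[$i];
            if (!isset($node[$byte])) {
                $node[$byte] = [];
            }
            $node = &$node[$byte];
        }
        // 'end' is three bytes long, so it cannot collide with a byte key.
        $node['end'] = true;
        unset($node); // drop the reference before the next keyword
    }
    return $root;
}

/** Return every keyword found, in scan order, keeping repeated hits. */
function sketch_search_all(array $trie, string $content): array
{
    $matches = [];
    $len = strlen($content);
    $start = 0;
    while ($start < $len) {
        $node = $trie;
        $hitLen = 0;
        for ($i = $start; $i < $len && isset($node[$content[$i]]); $i++) {
            $node = $node[$content[$i]];
            if (isset($node['end'])) {
                $hitLen = $i - $start + 1; // assumed policy: shortest hit wins
                break;
            }
        }
        if ($hitLen > 0) {
            $matches[] = substr($content, $start, $hitLen);
            $start += $hitLen; // assumed policy: resume after the hit
        } else {
            $start++;
        }
    }
    return $matches;
}

/** Return the first keyword found, or null when nothing matches. */
function sketch_search_one(array $trie, string $content): ?string
{
    // Kept simple for clarity; a real implementation would stop at the first hit.
    $all = sketch_search_all($trie, $content);
    return $all[0] ?? null;
}

$trie = sketch_build_trie(['管理员', 'administrator', '敏感词', 'restricted term', 'internal use only']);
$input = 'This document is marked for internal use only. 请立即通知管理员,因为本段内容包含敏感词和 restricted term。';

var_dump(sketch_search_one($trie, $input));
print_r(sketch_search_all($trie, $input));
```

For the README's sample input, this sketch reports the same four keywords in the same order as the representative output. Because the scan works on bytes, the dictionary and the input must share one encoding, which is the same constraint the README states for the extension.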