Commit 00eb095

docs: improve README and usage examples
1 parent ef580ec commit 00eb095

3 files changed

Lines changed: 252 additions & 57 deletions

README.md

Lines changed: 224 additions & 26 deletions
@@ -1,53 +1,251 @@
-# php spam word filter based on Double-Array Trie tree
+# PHP Double-Array Trie Tree Extension
 
 ![Build Status](https://github.com/liubang/php_double_array_trie_tree/workflows/integrate/badge.svg?branch=master)
 ![License: MIT](https://img.shields.io/github/license/liubang/php_double_array_trie_tree?style=flat-square)
 ![GitHub release (latest by date)](https://img.shields.io/github/v/release/liubang/php_double_array_trie_tree?style=flat-square)
 
-## Dependence
+`php_double_array_trie_tree` is a native PHP extension that provides high-performance keyword matching based on a Double-Array Trie (DAT) index.
 
-- php >=5.4
+The extension is designed for workloads such as sensitive-word filtering, keyword detection, lexical lookup, and other string matching scenarios where deterministic lookup latency and low memory overhead are preferred over regular-expression-based scanning.
 
-## Useage
+Internally, the project embeds a `libdatrie`-style implementation and exposes a compact PHP API for:
 
-**Compile and Install**
+- offline dictionary compilation into a serialized trie file
+- low-latency loading of the compiled dictionary
+- first-hit lookup
+- full-hit enumeration across an input string
 
-```shell
+## Why Double-Array Trie
+
+A Double-Array Trie is a trie representation optimized for compactness and traversal efficiency. Compared with naive hash-based keyword scans or large regular-expression alternations, DAT-based matching typically offers:
+
+- predictable traversal semantics
+- efficient prefix-state transitions
+- reduced runtime overhead for large keyword sets
+- good suitability for repeated lookups against a precompiled dictionary
+
+This extension is therefore a strong fit for read-heavy filtering pipelines where the dictionary is built once and queried many times.
+
+## Feature Set
+
+- Native PHP extension implemented in C
+- Dictionary compilation to a reusable on-disk trie artifact
+- Fast lookup through a serialized double-array trie
+- `searchOne()` for first-match retrieval
+- `searchAll()` for exhaustive match enumeration
+- Support for repeated matches in left-to-right scan order
+- Bundled trie implementation, with no external runtime dependency required during build
+
+## Architecture Overview
+
+The extension follows a two-phase workflow:
+
+1. Build phase: a PHP array of keywords is compiled into a binary trie file via `Linger\TrieTree::build()`.
+2. Query phase: the trie file is loaded by the constructor and reused for matching operations.
+
+At initialization time, the trie alphabet is configured as the full byte range `0x00-0xff`, which makes the implementation byte-oriented rather than language-specific. In practice, this allows the extension to process arbitrary byte sequences, including UTF-8 encoded text, as long as the dictionary and the input content use the same encoding.
+
+## Requirements
+
+- PHP 5.4 or later
+- `phpize`
+- a C compiler toolchain compatible with PHP extension builds
+
+The source tree contains compatibility branches for legacy PHP 5 as well as modern PHP runtimes.
+
+## Installation
+
+Clone the repository and build the extension in the standard PHP extension toolchain:
+
+```bash
 git clone https://github.com/liubang/php_double_array_trie_tree.git
 cd php_double_array_trie_tree
 phpize
-./configure
-make && sudo make install
+./configure --enable-linger_TrieTree
+make
+sudo make install
 ```
 
+Enable the extension in `php.ini`:
+
+```ini
+extension=linger_TrieTree.so
+```
+
+To verify that the module is loaded:
+
+```bash
+php -m | grep linger_TrieTree
+```
+
+## Public API
+
+The extension exposes the class `Linger\TrieTree`.
+
+### `Linger\TrieTree::build(array $dict, string $path): bool`
+
+Builds a trie from the provided dictionary and serializes it to `path`.
+
+Parameters:
+
+- `dict`: list of keywords to be indexed
+- `path`: output path for the serialized dictionary file
+
+Returns:
+
+- `true` on success
+- `false` on build failure
+
+### `new Linger\TrieTree(string $path)`
+
+Loads a previously compiled trie from disk.
+
+Parameters:
+
+- `path`: path to the serialized trie file
+
+Behavior:
+
+- throws an exception if the file cannot be loaded
+
+### `searchOne(string $content): ?string`
+
+Scans the input from left to right and returns the first matched keyword encountered during traversal.
+
+Behavior:
+
+- returns the first matched term as a string
+- returns `null` when no match is found
+- throws an exception when `content` is empty
+
+### `searchAll(string $content): array`
+
+Scans the input from left to right and returns all matched keywords in encounter order.
+
+Behavior:
+
+- returns an array of matched terms
+- preserves duplicates when the same keyword appears multiple times
+- returns an empty array when no match is found
+- throws an exception when `content` is empty
+
+## Usage Example
+
+The following example demonstrates a typical moderation-oriented workflow: compile a keyword dictionary once, load the serialized trie, and then execute both first-hit and full-hit matching against an input document.
+
 ```php
 <?php
 
-$words = array('管理员','admin','哈哈','我擦');
-linger\TrieTree::build($words, "./bbb.dic");
+$dictionary = [
+'管理员',
+'administrator',
+'敏感词',
+'restricted term',
+'internal use only',
+];
+$dictionaryFile = __DIR__ . '/keywords.dic';
+
+Linger\TrieTree::build($dictionary, $dictionaryFile);
+
+$trie = new Linger\TrieTree($dictionaryFile);
 
-$filter = new linger\TrieTree('./bbb.dic');
-$content = 'this is testadmin这是一段测试文字,哈哈的飞洒管理员的司法地方哈哈,火红的萨来开发大健康我擦';
-$res = $filter->searchOne($content);
-var_dump($res);
-$res = $filter->searchAll($content);
-print_r($res);
+$input = 'This document is marked for internal use only. 请立即通知管理员,因为本段内容包含敏感词和 restricted term。';
+
+$firstMatch = $trie->searchOne($input);
+var_dump($firstMatch);
+
+$allMatches = $trie->searchAll($input);
+print_r($allMatches);
 ```
 
-## Performance
+In this example:
+
+- `searchOne()` returns the earliest detected keyword in scan order
+- `searchAll()` returns every detected keyword, including repeated occurrences
+
+Representative output:
+
+```text
+string(17) "internal use only"
+Array
+(
+[0] => internal use only
+[1] => 管理员
+[2] => 敏感词
+[3] => restricted term
+)
+```
+
+## Matching Semantics
+
+- Matching is performed over bytes, not Unicode code points.
+- `searchOne()` returns the first successful terminal match discovered during the scan.
+- `searchAll()` enumerates matches in left-to-right order and keeps repeated hits.
+- The dictionary should be built with the same character encoding used by the target content.
+
+For multilingual text, UTF-8 is typically the safest choice as long as both dictionary entries and input strings are consistently UTF-8 encoded.
 
-- Size of dictionary words: 3954
-- Length of destination string: 14352
-- Times: 100000
+## Performance Notes
 
-```shell
-liubang@venux:~/workspace/php/test/filter$ php test.php
+The original benchmark included in this project reports the following indicative result:
+
+- Dictionary size: `3954` entries
+- Input length: `14352` characters
+- Iterations: `100000`
+
+```text
 Double-Array Trie tree: 1.86985206604
-regular expression: 63.114347934723
+regular expression: 63.114347934723
+```
+
+Actual performance will depend on:
+
+- PHP version
+- compiler flags
+- CPU architecture
+- dictionary cardinality and keyword distribution
+- input size and match density
+
+Even so, the extension is clearly intended for scenarios where precompiled trie traversal materially outperforms repeated regular-expression evaluation.
+
+## Development and Test
+
+Build locally:
+
+```bash
+phpize
+./configure --enable-linger_TrieTree
+make clean
+make
 ```
 
-## Thanks
+Run the bundled PHPT test suite:
+
+```bash
+make test
+```
+
+## Repository Layout
+
+- `linger_TrieTree.c`: PHP extension entry points and exposed methods
+- `php_linger_TrieTree.h`: extension-level declarations and compatibility macros
+- `src/datrie/`: embedded double-array trie implementation
+- `tests/`: PHPT regression tests
+- `config.m4` / `config.w32`: Unix and Windows extension build configuration
+
+## Use Cases
+
+- sensitive-word filtering
+- moderation pipelines
+- blacklist or denylist matching
+- keyword-trigger routing
+- exact-term lexical detection in high-throughput services
+
+## Acknowledgements
+
+- Inspired by [wulijun/php-ext-trie-filter](https://github.com/wulijun/php-ext-trie-filter)
+- Built on an embedded [libdatrie](https://linux.thai.net/~thep/datrie/datrie.html)-style trie implementation
 
-Inspired by [wulijun/php-ext-trie-filter](https://github.com/wulijun/php-ext-trie-filter.git)
+## License
 
-Depends on [libdatrie-0.2.4](https://linux.thai.net/~thep/datrie/datrie.html)
+This project is released under the [MIT License](LICENSE).
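
Note on the byte-oriented matching described in the new "Matching Semantics" section: the point can be illustrated in plain PHP, without the extension. The snippet below is a minimal sketch, assuming the source file is saved as UTF-8; the variable names are illustrative and do not come from the commit. It only shows that PHP's own string functions already count bytes, which is the unit a trie over the `0x00-0xff` alphabet would walk.

```php
<?php
// Illustrative only: plain PHP, no extension required.
// Assumption: this source file is saved as UTF-8.
$keyword = '管理员';
$input   = 'ab' . $keyword;

// strlen() counts bytes, the unit a byte-oriented trie walks over.
$byteLength = strlen($keyword);

// strpos() also works on bytes, so the result is a byte offset.
$byteOffset = strpos($input, $keyword);

echo $byteLength, "\n"; // 9: three characters, three bytes each in UTF-8
echo $byteOffset, "\n"; // 2: the two ASCII bytes before the keyword
```

If the dictionary and the input were stored in different encodings, the same visible characters would map to different byte sequences and would presumably not match, which is why the README asks for a consistent encoding.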

tests/001.phpt

Lines changed: 2 additions & 13 deletions
@@ -1,21 +1,10 @@
 --TEST--
-Check for linger_TrieTree presence
+Verify that the linger_TrieTree extension is loaded
 --SKIPIF--
 <?php if (!extension_loaded("linger_TrieTree")) print "skip"; ?>
 --FILE--
-<?php
+<?php
 echo "linger_TrieTree extension is available";
-/*
-you can add regression tests for your extension here
-
-the output of your test code has to be equal to the
-text in the --EXPECT-- section below for the tests
-to pass, differences between the output and the
-expected text are interpreted as failure
-
-see php5/README.TESTING for further information on
-writing regression tests
-*/
 ?>
 --EXPECT--
 linger_TrieTree extension is available

tests/002.phpt

Lines changed: 26 additions & 18 deletions
@@ -1,30 +1,38 @@
 --TEST--
-Check for linger_TrieTree presence
+Verify dictionary build and keyword matching workflow
 --SKIPIF--
 <?php if (!extension_loaded("linger_TrieTree")) print "skip"; ?>
 --FILE--
-<?php
-$dic = './tmp.dic';
-$words = array('管理员','admin','哈哈','我擦');
-linger\TrieTree::build($words, $dic);
+<?php
+$dictionaryFile = './tmp.dic';
+$dictionary = array(
+'管理员',
+'administrator',
+'敏感词',
+'restricted term',
+'internal use only'
+);
+Linger\TrieTree::build($dictionary, $dictionaryFile);
 
-$filter = new linger\TrieTree($dic);
-var_dump($filter);
-$content = 'this is testadmin这是一段测试文字,哈哈的飞洒管理员的司法地方哈哈,火红的萨来开发大健康我擦';
-$res = $filter->searchOne($content);
-var_dump($res);
-$res = $filter->searchAll($content);
-print_r($res);
+$trie = new Linger\TrieTree($dictionaryFile);
+var_dump($trie);
+
+$input = 'This document is marked for internal use only. 请立即通知管理员,因为本段内容包含敏感词和 restricted term。';
+
+$firstMatch = $trie->searchOne($input);
+var_dump($firstMatch);
+
+$allMatches = $trie->searchAll($input);
+print_r($allMatches);
 ?>
 --EXPECTF--
 object(Linger\TrieTree)#1 (0) {
 }
-string(5) "admin"
+string(17) "internal use only"
 Array
 (
-[0] => admin
-[1] => 哈哈
-[2] => 管理员
-[3] => 哈哈
-[4] => 我擦
+[0] => internal use only
+[1] => 管理员
+[2] => 敏感词
+[3] => restricted term
 )