@@ -6,7 +6,6 @@ php-text-analysis
66
77[![Total Downloads](https://poser.pugx.org/yooper/php-text-analysis/downloads)](https://packagist.org/packages/yooper/php-text-analysis)
88
9- [![Latest Unstable Version](https://poser.pugx.org/yooper/php-text-analysis/v/unstable)](https://packagist.org/packages/yooper/php-text-analysis)
109
1110PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language. All the documentation for this project can be found in the wiki.
1211
@@ -21,66 +20,52 @@ Documentation for the library resides in the wiki.
2120https://github.com/yooper/php-text-analysis/wiki
2221
2322
24-
25-
26- Dictionary Installation
27- =============
28-
29- Not required unless you use the dictionary stemmers
30-
31- * For Ubuntu < 16*
32- ```
33- sudo apt-get install libpspell-dev
34- sudo apt-get install php5-pspell
35- sudo apt-get install aspell-en
36- sudo apt-get install php5-enchant
37- ```
38- * For Ubuntu >= 16*
39- ```
40- sudo apt-get install libpspell-dev php7.0-pspell aspell-en php7.0-enchant
23+ ### Tokenization
24+ ``` php
25+ $tokens = tokenize($text);
4126```
4227
43-
44- * For Centos*
45- ```
46- sudo yum install php5-pspell
47- sudo yum install aspell-en
48- sudo yum install php5-enchant
28+ To use a different tokenizer, pass the name of the tokenizer class as the second argument:
29+ ``` php
30+ $tokens = tokenize($text, \TextAnalysis\Tokenizers\PennTreeBankTokenizer::class);
4931```
32+ The default tokenizer is **\TextAnalysis\Tokenizers\GeneralTokenizer::class**. Some tokenizers require parameters to be set upon instantiation.
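To give a rough idea of what a tokenizer returns, the sketch below splits a sentence on whitespace and punctuation using only PHP's built-in `preg_split`. It does not use the library: `simple_tokenize` is a hypothetical helper written for this illustration, and it is not meant to reproduce the behavior of `GeneralTokenizer`.

```php
<?php
// Hypothetical helper for illustration only; not part of php-text-analysis.
function simple_tokenize(string $text): array
{
    // Split on runs of whitespace or punctuation and drop empty pieces.
    return preg_split('/[\s\p{P}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$tokens = simple_tokenize("Time flies like an arrow, and an arrow flies like time.");
print_r($tokens);
```

The result is a flat array of strings, which is the shape the other helpers in this README take as input.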
5033
51- * PHP Pecl Stem* is not currently available in php 7.0.
34+ ### Normalization
35+ By default, **normalize_tokens** lowercases every token with **strtolower**. To customize
36+ the normalization, pass either a function or a function name as a string; it is applied to each token by array_map.
5237
38+ ``` php
39+ $normalizedTokens = normalize_tokens($tokens);
40+ ```
5341
54- Tokenize
55- =============
42+ ``` php
43+ $normalizedTokens = normalize_tokens($tokens, 'mb_strtolower');
5644
57- There are several tokenizers available
45+ $normalizedTokens = normalize_tokens($tokens, function($token){ return mb_strtoupper($token); });
46+ ```
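The normalization step described above amounts to mapping one callable over the token array. The following is a minimal sketch of that idea using only built-in PHP functions. `normalize_sketch` is a hypothetical stand-in, not the library's `normalize_tokens`, and its exact signature is an assumption.

```php
<?php
// Hypothetical stand-in for normalize_tokens(); illustration only.
function normalize_sketch(array $tokens, ?callable $normalizer = null): array
{
    // Fall back to strtolower when no normalizer is supplied.
    return array_map($normalizer ?? 'strtolower', $tokens);
}

$tokens = ['Time', 'Flies', 'LIKE', 'an', 'Arrow'];

// Default: lowercase every token.
$lower = normalize_sketch($tokens);

// A function name passed as a string.
$upper = normalize_sketch($tokens, 'strtoupper');

// A closure, here truncating each token to three characters.
$short = normalize_sketch($tokens, function ($token) {
    return substr($token, 0, 3);
});
```

Because any callable is accepted, the same entry point can cover case folding, stemming, or trimming without further changes.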
5847
59- * FixedLengthTokenizer
60- * GeneralTokenizer
61- * LambdaTokenizer
62- * PennTreeBankTokenizer
63- * RegexTokenizer
64- * SentenceTokenizer
65- * WhitespaceTokenizer
48+ ### Frequency Distributions
6649
67- * Tokenizer Usage*
68- ```
69- $tokenizer = new GeneralTokenizer()
70- $tokens = $tokenizer->tokenize("Enter your text here");
50+ Calling **freq_dist** returns a [FreqDist](https://github.com/yooper/php-text-analysis/blob/master/src/Analysis/FreqDist.php) instance.
51+ ``` php
52+ $freqDist = freq_dist(tokenize($text));
7153```
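For readers unfamiliar with frequency distributions, the sketch below computes the same kind of figures (total tokens, unique tokens, and hapaxes, i.e. tokens that occur exactly once) using only PHP built-ins. It is an illustration of the concept and does not call the library's `FreqDist` class; the variable names are chosen for this example.

```php
<?php
// Illustration only; the library's FreqDist class is not used here.
$tokens = explode(' ', 'time flies like an arrow and an arrow flies like time');

// Map each distinct token to the number of times it occurs.
$counts = array_count_values($tokens);

// Sort by frequency, highest first, keeping the token keys.
arsort($counts);

$totalTokens  = count($tokens);
$uniqueTokens = count($counts);

// Hapaxes are the tokens whose count is exactly one.
$hapaxes = array_keys(array_filter($counts, function ($n) {
    return $n === 1;
}));
```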
7254
73- Frequency Distribution
74- =============
55+ ### Ngram Generation
56+ By default, bigrams are generated.
57+ ``` php
58+ $bigrams = ngrams($tokens);
7559```
76- $tokenizer = new \TextAnalysis\Tokenizers\GeneralTokenizer();
77- $tokens = $tokenizer->tokenize("time flies like an arrow and an arrow flies like time");
78- $freqDist = new \TextAnalysis\Analysis\FreqDist($tokens);
79- $freqDist->getHapaxes(); //Get the Hapaxes
80- $freqDist->getTotalTokens();
81- $freqDist->getTotalUniqueTokens();
60+ To customize the ngrams, pass the ngram size and a delimiter:
61+ ``` php
62+ // create trigrams with a pipe delimiter in between each word
63+ $trigrams = ngrams($tokens, 3, '|');
8264```
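Ngram generation can be pictured as sliding a window of size n over the token array and joining each window with a delimiter. The sketch below shows that with built-in functions only. `ngrams_sketch` is a hypothetical helper, and its single-space default delimiter is an assumption rather than the documented default of the library's `ngrams`.

```php
<?php
// Hypothetical helper for illustration; not the library's ngrams().
function ngrams_sketch(array $tokens, int $n = 2, string $separator = ' '): array
{
    $ngrams = [];
    // Index of the last position where a full window of $n tokens fits.
    $last = count($tokens) - $n;
    for ($i = 0; $i <= $last; $i++) {
        $ngrams[] = implode($separator, array_slice($tokens, $i, $n));
    }
    return $ngrams;
}

$tokens = ['time', 'flies', 'like', 'an', 'arrow'];

$bigrams  = ngrams_sketch($tokens);          // window of two, space-delimited
$trigrams = ngrams_sketch($tokens, 3, '|');  // window of three, pipe-delimited
```

If the input has fewer than n tokens, the loop never runs and an empty array is returned.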
83- Check out the API for full documentation
84- https://github.com/yooper/php-text-analysis/blob/master/src/Analysis/FreqDist.php
85-
8665
66+ Dictionary Installation
67+ =============
68+
69+ To do
70+
71+