@tarekgh tarekgh commented Nov 8, 2024

No description provided.

…vocab files (dotnet#7283)

* Add the governance file cgmanifest.json for tokenizer's vocab files

* Address the feedback

* apply more schema requirements on the doc
* Add Timeout to Regex used in the tokenizers

* Address the feedback
@tarekgh tarekgh changed the title from "Release/4.0" to "Backport tokenizer changes to Release/4.0" Nov 8, 2024
@tarekgh tarekgh requested a review from michaelgsharp November 8, 2024 19:07
codecov bot commented Nov 8, 2024

Codecov Report

Attention: Patch coverage is 79.61477% with 127 lines in your changes missing coverage. Please review.

Project coverage is 68.87%. Comparing base (a9b4212) to head (0613779).
Report is 1 commit behind head on release/4.0.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs | 57.30% | 48 Missing and 28 partials ⚠️ |
| .../Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs | 67.24% | 10 Missing and 9 partials ⚠️ |
| ...icrosoft.ML.Tokenizers/Model/WordPieceTokenizer.cs | 60.00% | 7 Missing and 5 partials ⚠️ |
| ...soft.ML.Tokenizers/Model/SentencePieceTokenizer.cs | 85.71% | 4 Missing and 1 partial ⚠️ |
| src/Microsoft.ML.Tokenizers/Model/Phi2Tokenizer.cs | 0.00% | 4 Missing ⚠️ |
| src/Microsoft.ML.Tokenizers/Tokenizer.cs | 75.00% | 4 Missing ⚠️ |
| src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs | 89.65% | 2 Missing and 1 partial ⚠️ |
| ...oft.ML.Tokenizers/Model/EnglishRobertaTokenizer.cs | 90.90% | 1 Missing ⚠️ |
| ...crosoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs | 88.88% | 0 Missing and 1 partial ⚠️ |
| test/Microsoft.ML.Tokenizers.Tests/CodeGenTests.cs | 98.93% | 0 Missing and 1 partial ⚠️ |
... and 1 more
Additional details and impacted files
@@               Coverage Diff               @@
##           release/4.0    #7292      +/-   ##
===============================================
- Coverage        68.87%   68.87%   -0.01%     
===============================================
  Files             1467     1469       +2     
  Lines           273955   273989      +34     
  Branches         28380    28389       +9     
===============================================
+ Hits            188697   188710      +13     
- Misses           77946    77972      +26     
+ Partials          7312     7307       -5     
| Flag | Coverage Δ |
|---|---|
| Debug | 68.87% <79.61%> (-0.01%) ⬇️ |
| production | 63.33% <69.66%> (-0.01%) ⬇️ |
| test | 89.18% <99.05%> (+<0.01%) ⬆️ |
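
The totals in the coverage diff can be cross-checked with a little arithmetic. The sketch below assumes the usual definition of line coverage, hits / (hits + misses + partials), and uses only numbers quoted in the report. The helper name `coverage_percent` is hypothetical, and the patch size it derives is an inference from the quoted percentage, not a figure the report states.

```python
def coverage_percent(hits: int, misses: int, partials: int) -> float:
    """Coverage as a percentage: hits / (hits + misses + partials)."""
    return 100.0 * hits / (hits + misses + partials)


# Totals copied from the coverage diff: release/4.0 (base) and #7292 (head).
base = coverage_percent(hits=188697, misses=77946, partials=7312)
head = coverage_percent(hits=188710, misses=77972, partials=7307)

# Hits, misses and partials should add up to the reported "Lines" row.
base_lines = 188697 + 77946 + 7312
head_lines = 188710 + 77972 + 7307

# Inferred, not reported: 127 uncovered lines at 79.61477% patch coverage
# imply a patch of about 127 / (1 - 0.7961477) lines.
patch_lines = round(127 / (1 - 0.7961477))
patch_coverage = 100.0 * (patch_lines - 127) / patch_lines

print(f"base: {base:.4f}% of {base_lines} lines")
print(f"head: {head:.4f}% of {head_lines} lines")
print(f"delta: {head - base:+.4f} percentage points")
print(f"patch: {patch_coverage:.5f}% of an inferred {patch_lines} lines")
```

Both project totals fall between 68.87% and 68.88%, which is consistent with the report showing 68.87% on each side together with a small negative delta.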

| Files with missing lines | Coverage Δ |
|---|---|
| src/Microsoft.ML.Tokenizers/Model/BertOptions.cs | 100.00% <100.00%> (ø) |
| ...rc/Microsoft.ML.Tokenizers/Model/LlamaTokenizer.cs | 59.09% <ø> (ø) |
| ...Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs | 78.28% <100.00%> (ø) |
| .../Microsoft.ML.Tokenizers/Model/WordPieceOptions.cs | 100.00% <100.00%> (ø) |
| ...crosoft.ML.Tokenizers/Normalizer/BertNormalizer.cs | 62.85% <100.00%> (ø) |
| ...ft.ML.Tokenizers/PreTokenizer/RegexPreTokenizer.cs | 87.23% <100.00%> (ø) |
| src/Microsoft.ML.TorchSharp/NasBert/NerTrainer.cs | 91.10% <100.00%> (ø) |
| ...oft.ML.Tokenizers.Data.Tests/TokenizerDataTests.cs | 100.00% <ø> (ø) |
| ...icrosoft.ML.Tokenizers.Tests/BertTokenizerTests.cs | 100.00% <100.00%> (ø) |
| test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs | 100.00% <100.00%> (ø) |
... and 17 more

... and 11 files with indirect coverage changes

@tarekgh tarekgh merged commit b0fa194 into dotnet:release/4.0 Nov 8, 2024
25 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Dec 9, 2024