Commit 4a4b64a
Feat/chipper v3 (#308)
This new version of Chipper should substantially improve the output for
tables. In the attached file, the previous output had many cells spread
across multiple columns, largely because of the $ character, which was
inconsistently annotated in the Odetta set. In addition, colspan did not
work properly for the header. This new version of Chipper does not
predict thead and tbody tokens for tables.
To test it, run the code below, which prints the predicted elements. It
should print only one page and one element. The element has a field
named text_as_html. The HTML in that field can be pasted into a new file
with an .html extension and opened in a browser.
Example with Chipperv2
<img width="1146" alt="image"
src="https://github.com/Unstructured-IO/unstructured-inference/assets/3939469/feffe674-8c9b-4c64-bd6d-08bd602c596a">
Example with Chipperv3
<img width="666" alt="image"
src="https://github.com/Unstructured-IO/unstructured-inference/assets/3939469/f06867a9-2636-4055-a158-42badc58dd09">
<img width="677" alt="apple"
src="https://github.com/Unstructured-IO/unstructured-inference/assets/3939469/d7ec628e-0dca-409c-894a-612350fce71f">
```
from unstructured_inference.inference.layout import DocumentLayout
from unstructured_inference.models.base import get_model

# Load the Chipper model and run it on the sample image.
model = get_model("chipper")
doc = DocumentLayout.from_image_file("[point to the location of the file]/apple.png", detection_model=model)

# Print every predicted element, one per line, grouped by page.
for i, page in enumerate(doc.pages):
    print(f"********** Page {i}")
    print(*[element.__dict__ for element in page.elements], sep="\n")
```
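The manual step of pasting text_as_html into a file can also be scripted. The sketch below is only an illustration: `save_table_html` is a hypothetical helper that is not part of unstructured-inference, and the sample string stands in for the value of `element.text_as_html` printed by the code above.

```python
import tempfile
from pathlib import Path


def save_table_html(table_html: str, out_path: Path) -> Path:
    """Wrap an HTML table fragment in a minimal page and write it to out_path.

    Hypothetical helper for inspection only; not part of unstructured-inference.
    """
    page = (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>Chipper table</title></head>\n'
        f"<body>\n{table_html}\n</body></html>\n"
    )
    out_path.write_text(page, encoding="utf-8")
    return out_path


if __name__ == "__main__":
    # Made-up stand-in for element.text_as_html; replace it with the real value.
    sample = (
        '<table><tr><td colspan="2">Net sales</td></tr>'
        "<tr><td>$</td><td>1</td></tr></table>"
    )
    target = save_table_html(sample, Path(tempfile.gettempdir()) / "chipper_table.html")
    print(target)  # path of the file to open in a browser
```

Opening the printed path in a browser should show the table as rendered, which makes it easier to spot cells that end up in the wrong column.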
---------
Co-authored-by: Antonio Jimeno Yepes <[email protected]>
1 parent 4e5c4e6 commit 4a4b64a
File tree
6 files changed, +105 -18 lines changed
- test_unstructured_inference/models
- unstructured_inference
- inference
- models