EN2001a.txt
E: 'Kay Gosh 'Kay.
A:Okay.
A:Does anyone want to see uh Steve's feedback from the specification.
E:Is there much more.
D:I.
E:in it than he.
D:it the last.
A:Right.
E:Is there much more.
D:time.
E:in it than he said yesterday.
A:Not really um.
E:Mm Hmm.
A:just what he's talking about like duplication of effort.
A:and.
E:Hmm.
A:Like duplication of effort and stuff and um yeah he was saying that we should maybe uh think about having a prototype for week six which is next.
D:Next week.
A:week.
A:Yeah So we should probably prioritize our packages.
E:Yeah now I'd say if for the prototype if we just like wherever possible chunk in the stuff that we have um pre-annotated and stuff and for the stuff.
A:Mm.
E:that we don't have pre-annotated write like a stupid baseline then we should probably be able to basically that means we focus on on the interface first sort of so that we we take the the ready-made parts and just see how we get them work together in the interface the way we want and and then we have a working prototype And then we can go back and replace pieces either by our own components or by more sophisticated components of our own So it's probably.
B:Yeah.
E:feasible The.
A:Yeah.
E:thing is I'm away this weekend So that's.
B:Yeah.
E:for me Oh yeah um yeah.
B:I mean if we just want to have um some data for the user interface could even be random data.
D:Yeah.
E:No But also.
B:Uh mm mm Yeah.
E:I might like the the similarity thing like my just my matrix itself for my stuff I I I think I can do that fairly quickly because I have the algorithms.
E:Yeah.
B:I'm.
E:I think today's meeting is really the one where we where we sort of settle down the data structure and as soon.
D:Yeah.
E:as we have that uh probably like after today's meeting we then actually need to well go back first of all and look at NITE X_M_L_ to see in how far that that which we want is compatible with that NITE X_M_L_ offers us And then just sort of everyone make sure everyone understand the interface So I think if today we decide on what data we wanna have now and and later maybe even today we go and look at NITE X_M_L_ or some of look at NITE X_M_L_ in a bit more detail just trying to make some sense of that code and see how does the representation work in their system And then sort of with that knowledge we should be able to then say okay that type of NITE X_M_L_ data we wanna load into it and this is how everyone can access it and then.
A:Hmm Has.
E:we should.
A:has anyone.
E:go from.
A:actually looked at the Java code for.
E:No.
D:No.
A:the huh Hmm Yeah.
E:I've looked at the documentation and like seen enough to make me think that we want to use the NITE X_M_L_ framework because um they have a good event model that synchronizes sort of the data and and every display element So that takes a lot of work away from us Sort of that would be a reason for staying within their framework and using their general classes But beyond that I haven't looked at it at all which is something we should really do.
E:Who actually.
A:I.
E:like for this whole discussion I mean who of us is doing stuff that is happening on-line and who of us is doing stuff that's happening off-line Like my data is coming Hmm.
C:The basic word importance is off-line.
E:Yeah Okay.
C:as well The combined measure might not be if we want to wait what.
C:the user has typed in into the search.
E:Okay 'Kay.
D:Uh mine's gonna be mostly using the off-line But the actual stuff it's doing will be on-line But it won't be very.
D:um processor intensive or memory intensive I don't.
E:So.
D:think.
E:basically apart from the display module the the display itself we don't have an extremely high degree of interaction between sort of our modules that create the stuff and and the interface so the interface is mainly while it's running just working on data that's just loaded from a file I guess There.
A:Yeah.
E:isn't.
A:I I don't know about.
B:Hmm.
A:the search functionality that might be online Depends how it's gonna work.
E:Yeah I know Yeah the search is I guess the search is sort of a strange beast anyway because for the search we're leaving the NITE X_M_L_ framework.
A:Yeah.
E:Um but that's still sort of that's good That means that at least like we don't have the type of situation where somebody has to do like a billion calculations on on data on-line 'Cause that would make it a lot more like that would mean that our interface for the data would have to be a lot more careful about how it performs and and everything And nobody is modifying that data at at on-line time at all it seems Nobody's making any changes to the actual data on-line.
D:Don't think so Yeah.
E:So that's actually making it a lot easier That basically.
E:means our browser really is a viewer mostly which isn't doing much with the data except for sort of selecting a piece of it and and displaying it.
D:Are we still gonna go for dumping it into a database.
E:Hmm.
D:Are we still gonna dump it into a database.
E:Well some parts relevant for the search yes I'd say so.
D:'Cause if we are I reckon we should all read our classes out of the database It'll be so much easier.
E:Hmm.
D:Well if we're gonna dump the part of it into a database anyway we might as well dump all the fields we want into the database calculate everything from there Then we don't even have to worry that much about the underlying X_M_L_ representation We can just query it.
E:Yeah but nobody of us is doing much of searching from the data in the on-line stage And for all together like the display itself I think we are easier if we if it's sitting on the X_M_L_ than if it's sitting on the S_Q_L_ stuff because if it's sitting on the X_M_L_ we have the the NITE X_M_L_ framework with all its functionality for synchronizing through all the different levels whenever there's a change whenever something's moving forward and stuff And we can just more or less look at their code like how their player moves forward and how that moving forward is represented in different windows and stuff So I think in the actual browser itself I don't wanna sit on the S_Q_L_ if we can sit on the X_M_L_ because sitting on the X_M_L_ we have all we have so much help And for for like the the calculations that we're doing apart from the search it seems that everyone needs some special representations anyway.
D:Well if we're gonna do that we should try and store everything in in an X_M_L_ format as well.
E:You mean our results.
D:Yeah.
E:Yeah in in the NITE X_M_L_ X_M_L_ format so with their time stamps and stuff so that it's easy.
B:Yes.
E:to to tie together things.
D:Yeah.
E:What I'm like what we have to think about is if we go with this multi-level idea like this idea that sort of if you start with a whole meeting series as one entity as one thing that you display as one whole sort of that then the individual chunks of the individual meetings whereas and then you can click on a meeting and then sort of the meeting is the whole thing and the chunks are the individual segments that means sort of we have multiple levels of of representation which we probably If we if we do it this way like we we have to discuss that if we do it this way then we should probably find some abstraction model so that the interface in the sense like deals with it as if it's same so that the interface doesn't really have to worry whether it's a meeting in the whole meeting series or a segment within a meeting you know what I mean.
A:Mm-hmm.
B:Hmm yes Hmm.
E:And that's probably stuff that we have to sort of like process twice then Like for example that like the summary of a meeting within the whole meeting corpus or meeting series is meeting series a good word for that I don't really know what how to call it You know what I mean like.
E:not not the whole corpus but every meeting that has to do with one topic.
A:Yeah that makes sense.
E:Um so in in the meeting series so that a summary for a meeting within the meeting series are sort of compiled off-line by a summary module And that is separate from a summary of a segment within a meeting 'Cause I don't think we can.
D:Well we don't even need to.
B:I'm not.
D:do that.
B:so sure.
D:if we.
D:got our information density calculated off-line so all we do is treat the whole lot as one massive document I mean they'll it's not gonna be so big that we can't load in a information density for every utterance And we can just summarise based on that.
B:I I thought we would just have like um one big summary um all the uh different importance levels um displayed on it And depending on what our um zoom level is we just display a part of it.
E:So are we doing that at all levels Are we um.
B:And we would have one very big thing off-line And from that we would just select what we are displaying.
E:And just have different like levels sort of.
B:Yes So for example you would um give a high value to those um sequences you want to.
E:Mm 'Kay So the only.
B:display in the meeting series summary.
B:And you just.
E:thing.
B:cut off.
E:yeah so the only thing that would happen basically if I double-click let's say from the whole meeting series on a single meeting is that the zoom level changes Like the the start and the end position changes and the zoom level changes.
B:That was what I I thought.
D:I think.
B:yeah.
E:I.
D:you can.
B:I.
D:do it on-line.
E:I thought we couldn't do that Like I was under the impression that we couldn't do that because we couldn't load the data for all that But I don't know I mean that.
A:Hmm.
D:I don't think there's really much point in doing like that when it's just gonna feed off in the end the information density measure basically And that's all calculated off-line So what you're really doing is sorting a list is the computationally hard part of it.
E:So I'm not sure if I got it I.
D:Well.
E:was Mm-hmm.
D:like the ideas we're calculating are information density all off-line first for every utterance.
D:in the whole corpus.
E:Mm-hmm Mm-hmm.
D:right So what you do is you say if you're looking at a series of meetings you just say well our whole document comprises of all these.
D:stuck together And then all you have to do is sort them.
E:Mm-hmm Okay.
D:by information density Like maybe weighted.
D:with the search terms.
E:So Okay.
D:and then extract them I don't think it's too slow to do on-line.
D:to be honest Is.
E:I I was just worried about the total memory complexity of it But I I completely admit I mean I just sort of like took that from some thing that Jonathan once said about not loading everything But maybe I was just wrong about it How.
E:many utterances.
D:that.
B:But I think the difference might be that we want just want to have um the words And that's not so much what he meant with not possibly loading everything was.
E:Yeah and I.
B:that.
E:yeah Yeah.
B:um load all the uh.
D:Yeah.
B:annotation stuff all the sound.
A:Hmm.
E:Yeah.
B:files all.
E:Yeah So what we have is we would have a word Like we would have words with some priority levels And they would basically be because even the selection would would the summaries automatically feed from just how prioritized an individual word or how uh prioritized an individual utterance is Or are the summaries sort of refined from it and made by a machine to make sentences and stuff Or are they just sort of taking out the words with the highest priority and then the words of the second.
D:Well.
E:highest priority And the.
D:on the utterance level I was thinking So.
D:the utterances with the.
E:okay Are we doing.
D:highest like mean information density.
B:In.
E:it on the whole thing on the utterance level Or are we doing it on word level like the information.
D:Well the trouble.
E:density.
D:with doing it on the word level is if you want the audio to synch up you've got no way of getting in and extracting just that word I mean it's impossible.
E:We I think we have start and end times for words actually.
D:For every.
E:but.
D:single word.
E:yeah but it it might.
C:Yeah.
D:Oh okay Yeah.
E:but it might sound crazy in the player We should really.
E:maybe we can do that together at some point today that we check out how the player works.
D:I don't think that will.
E:But.
D:it We'll.
E:some.
D:have to.
B:Um.
E:in altogether.
D:buffer.
E:doing it on an utterance level.
B:I I.
E:in the.
B:getting quite lost um at the moment.
E:So.
B:because um what's um our difference between the um um uh the importance and the skimming I mean do we do both or is it the same thing.
D:Well the skimming's gonna use the importance.
E:Yeah.
B:Okay So.
D:But like at.
B:but when.
D:it's.
B:when we.
D:gonna.
B:talk about summaries you talk about this uh about skimming and not about.
D:Well mostly skimming yeah.
E:Well but also about the displays I mean the displays in the in the text body in the in the latest draft that we had sort of we came up with the idea that it isn't displaying utterance for utterance but it's also displaying uh a summarised version in you know like below the below the graph the part.
B:Yeah.
E:Maybe Yeah Hmm.
B:Yeah right isn't that the skimming.
B:Isn't that the skimming.
E:Oh yeah it's just like there there's like audio skimming and there's displayed skimming.
B:Yeah but it use the same data.
E:Yeah.
D:Yeah Well.
E:maybe there's some merit of going altogether for utterance level and not even bother to calculate I mean if you have to do it internally then you can do it But maybe like not even store the importance levels for individual words and just sort of rank utterances as.
E:a whole Hmm.
D:the nice thing about that is it will automatically be in sentences Well.
D:more or less.
E:Yeah 'Cause.
D:So it will make more sense and if you.
D:get just extract.
E:it.
D:words.
E:might be better skimming and less memory required at the same time And I mean if you if you know how to do it for individual words then you can just in the worst case if you can't find anything else just sort of make the mean of the words over the utterance You know what I mean.
C:I'm not quite sure what it did you.
E:it's it's.
C:want to do it you.
C:just wanted to assign.
E:Well what's the smallest chunk at the moment you're thinking of of assigning an importance measure to is it a word or is it an utterance.
C:Uh I thought about words.
E:So we're thinking of like maybe just storing it on a per utterance level Because it's it's less stuff to store probably.
C:Mm.
E:for Dave in the in the audio playing And for in the display it's probably better if you have whole utterances than I don't know like what it's like if you just take.
B:Yeah.
E:single words out of utterances That probably doesn't make any sense at all whereas if you just uh show important.
D:Yeah.
E:utterances but the utterance as a whole it makes more sense So it doesn't actually make a for your algorithm 'cause it just means that if you're working on a word level then we just mean it over the utterance.
C:Mm okay.
B:yeah I think we also thought about combining that measure with um the I get from um uh hot-spots and so on.
E:They are on.
B:So.
B:that would also be on utterance level I think.
E:Oh so that's good anyway.
B:I think.
E:then yeah.
D:I.
E:Because that makes it a lot easier than to put it on utterance.
D:But it'll need.
E:level Oh yeah No.
D:to be calculated at word level though because otherwise there won't be enough occurrences of the terms to make any meaningful.
D:sense Yeah.
E:but I mean like how how Jasmine does it internally I don't know but it's probably yeah you probably have.
E:to work on word levels for importance But there should be ways of easily going from a word level to an utterance level.
D:Yeah I reckon you can just.
E:Okay.
D:mean it over the sentence.
C:Yeah.
C:but how about those words which don't carry any meaning at all the um and and something like that.
E:Yeah.
C:Because if we if.
D:I think.
C:we average.
D:filter.
C:average over over a whole utterance.
E:Hmm.
C:all the words and there are quite unimportant words in there but quite important words as well I think we should just disregard the the.
D:Maybe we should have like um a cut-off So it a word only gets a value if it's above a certain threshold So.
C:Okay Alright.
D:anything that has less than say nought point five importance gets assigned to zero.
E:Well we do a pre-filtering of sort of the whole thing.
D:Yeah that's the other.
E:sort of like.
E:but that like the problem with that is it's easy to do in the text level But that would mean it would still play the uh in your audio.
D:Yeah.
E:unless we sort of also store what pieces we cut out for the audio Yeah I think before we can like answer that specific question how we deal with that it's probably good for us to look at what the audio player is capable of doing.
D:I think we'll have to buffer the audio But.
E:Yes.
D:I don't think it will be very hard I think it would be like an hour or two's work.
E:So what do you mean by buffering Like you think directly.
D:Like just.
E:feeding.
D:build another wave file essentially.
A:Yeah you just concatenate.
E:But.
D:Yeah I mean.
A:them together.
E:yeah but not.
D:I bet.
E:but not.
D:be.
E:on the hard disk and then loaded in but loaded in directly from memory.
D:In memory yeah So just like there's bound to be like a media wave object or something like that And.
E:But it's probably.
D:just build.
E:a stream.
D:in memory I don't.
E:if it exists in Java it would be probably some binary stream going in of some.
E:type Okay.
D:know I have no idea.
D:But it.
E:yeah Okay.
D:must have like classes for dealing with files And if it has classes for concatenating files you can do it in.
D:memory So.
E:Okay so.
E:I mean so that means that there's probably even if you go on an per utterance level there's still some merit on within utterances cutting out stuff which clearly isn't relevant at all and that maybe also for the audio we'd have to do So let's say we play the whole phrase but then in addition to that we have some information that says minus that part of something That's okay that we can do.
D:Well what I think I might try and build is basically a class that you just feed it a linked list of um different wave-forms and it will just string them all together with maybe I don't know tenth of a second silence in between each one or something.
E:Yeah.
D:like that Normalise.
E:maybe even I mean that's sort of that depends on how how advanced we get If maybe if we realise that there's massive differences in in gain or in something you can probably just make some.
E:simple.
D:it yeah Oh yeah.
E:simple normalization but that really depends on how much time we have and and how much is necessary Yeah if like I I don't know anything about audio and I have never seen the player So if you find that the player accepts some input from memory and if it's easy to do then I guess that's that's fairly doable So but that means in the general structure we're actually quite lucky so we we have we load into memory for the whole series of meetings just the utterances and rankings for the utterances and some information probably that says well the I guess that goes with the utterance who's speaking.
B:Yes.
D:yeah we'll need.
E:Because.
D:that.
E:can also do the display about who's speaking.
D:We also really wanna be able to search by who's speaking as well.
E:Yeah But I'm I'm still confused 'cause I thought like that's just what Jonathan we do that we can't do like load a massive document of that size.
D:It doesn't.
E:On.
D:'cause all the calculation's done off-line.
E:The other hand I mean it shouldn't be like should be like fifty megabyte in RAM or something it shouldn't be massive should it Actually fifty.
A:Hmm.
E:hundred megabyte is quite big in RAM Just thinking what's the so We do get an error message with the project if we load everything into the project with all the data they load So we know that doesn't work So our hope is essentially that we load less into it.
B:Yes.
A:Yeah It just.
E:What's this lazy loading thing somebody explain lazy loading to.
E:me.
A:means it loads on demand It only loads when it needs a particular type of file Like when it's being accessed.
E:Ah okay So that is that only by type of file Like if if if the same thing is in different files would it then maybe like you know if if utterances are split over three or ten or hundred different files is then a chance maybe that it doesn't try to load them all into memory at the same time but just.
A:Yeah I think that's the idea it just loads the particular.
E:So why does it fail.
A:ones.
E:then in the first place Then it shouldn't ever fail because then it should never.
A:But if you were doing a search over the whole corpus you'd have to.
E:Yeah but.
A:load.
E:yeah but um it uh it it failed right when you load it right the NITE X_M_L_ kit so that's interesting.
A:Hmm.
B:Oops it does So I define baseline and what it loads For example it loads all the utterances and so on but it doesn't load um the discourse acts and.
E:Hmm.
B:for example not the and what's what else there Not the summaries It only loads those on demand.
E:Let's check that out Um I'll I'll probably ask Jonathan about it So alternatively if we realise we can't do the whole thing in one go we can probably just process some sort of meta-data you know what I mean like sort of sort of for the whole series chunks representing the individual meetings or Like something that represents the whole series in in a in a structure very similar to the structure in which we represent individual um meetings but with data sort of always combined from the whole series so instead of having an single utterance that we display it would probably be like that would be representing a whole um topic a segment in a meeting And sort of so that using the same data.
B:you mean that you um basically split up the big thing into um different summaries For.
E:Well in a sense.
B:example.
B:that you have a very um top-level um summary and a separate file for for.
E:Uh I'm.
B:each.
E:I'm thinking of in a sense of like creating a virtual a virtual meeting out of the whole meeting series sort of.
D:That's easy You just like create a new X_M_L_ document.
E:Yeah sort.
D:in.
E:of like off-line create a virtual meeting which which basically treats the meeting series as if it was a meeting and treats the individual meetings within the series as if they were segments and treats the individual segments within meetings as if they were um utterances You know so we just sort of we shift it one level up.
B:Mm-hmm.
E:And in that way we could probably use the same algorithm and just like make like one or two that say okay if you are on a whole document uh a whole series level and that was a double-click then don't just go into that um segment but load a new file or something like it but in general use the same algorithm That would be an alternative if we can't actually load the whole thing and.
A:Mm.
E:'Cause also like even if we maybe this whole like maybe I'm worrying too much about the whole series in one thing display because actually I mean probably users wouldn't view that one too often.
D:I don't think it's really that much of a problem because if it's too big what we can do is just well all the off-line stuff doesn't really matter And all we can do is just process a bit at a time Like for summarisation say we wanted a hundred utterances in the summary just look at the meeting take the top one hundred utterances in each other meeting If it scores higher than the ones already in the summary so far just replace them And then you only have to process one meeting at a time.
E:Yeah but I'm I'm still worried Like for example for the display if you actually if you want a display uh like for the whole series the information density levels based on and and the and the only granularity you have is individual utterances that means you have to through every single utterance in a series of seventy hours.
D:Okay.
E:of meetings Yeah.
D:so maybe we should build a store a mean measure for the segments.
D:and meetings as well And speaker.
E:Yeah and if you make that structurally very similar to the the like one level down like the way how we uh store individual utterances and stuff then maybe we can more or less use the same code and just make a few and stuff Yeah so so but still so in in general we're having we're having utterances and they have a score And that's as much as we really need And of and they also have a time a time information of course.
E:Hmm And.
D:Speaker and.
E:a and a and a speaker.
E:information yeah.
D:um topic segmenting we'll need as well.
E:Yeah so an information which topic they're in yeah.
D:Yeah Well.
E:And and probably separate to that an information about the different topics like that Yeah So so the skimming can work on that because the skimming just sort of sorts the utterances and puts as many in as.
E:it needs Yeah.
D:yeah and then it'll preserve the order when it's displayed the.
D:Yeah Yeah.
E:Yeah it'll it'll play them in some.
E:order in which they were set because otherwise it's gonna be more entertaining Um but that that's enough data for the skimming.
D:Yeah I think so.
E:and.
E:the the searching so what the searching does is the searching leaves the whole framework goes to the S_Q_L_ database and gets like basically in the end gets just a time marker for where that is like that utterance that we are concerned with And then we have to find I'm sure there's some way in in NITE X_M_L_ to just say set position to that time mark And then it shifts the whole frame and it alerts every single element of the display and.
A:Hmm.
E:the display updates Yeah yeah.
A:Yeah we would want to develop a little tree display as well for multiple results.
A:Yeah but that'd.
E:That.
A:be quite.
E:yeah.
A:easy to do.
E:so.
E:if if so yeah So if in that tree display somebody clicks on something.
A:You just need to find the time stamp.
E:Yeah and then you sort of feed the time stamp to and the NITE X_M_L_ central manager and that central manager.
A:Yeah Yeah.
E:alerts everything that's there like alerts the like the the audio display alerts the text display alerts the visual display and says we have a new time frame and then they all sort of do their update routines with respect to the current level of zoom So how much do they display and starting position at where the or maybe the mid-position of it I don't know like if start where the thing was found or if that thing was found it's in the middle of the part that we display that I don't know But that we can decide about but a general sort of It's the same thing if like whether you play and it moves forward or whether you jump to a position through search it's essentially for all the window handling it's the same event.
E:It's only that the event gets triggered by the search routine which sort of push that into NITE X_M_L_ and says please go there now.
D:So we should basically make our own X_M_L_ document in memory that everyone's um module changes that rather than the underlying data And then have that X_M_L_ uh NITE X_M_L_ document tied to the interface.
E:Why do we have to do it in memory But that stuff's.
D:Well.
E:so.
D:you can make it in a file if you want.
E:I mean like the information is coming from off-line So we probably we don't even have to change the utterance document right because the whole way like the whole beauty of the NITE X_M_L_ is that it ties together lots of different files So we can just create an additional X_M_L_ file which for every utterance like the utterances have I_D_s I presume some references.
D:Mm-hmm.
E:So we just.
C:Yeah.
B:Yes.
E:we tie uh just a very short X_M_L_ file which it's the only information it has that has whatever a number for for the um weight for the information density and we just tie that to the existing utterances and tie them to the existing speaker changes.
C:But there is no I_D_ for an utterance I think It's just for individual words So how do we do that then.
B:Uh no no it's for.
C:We.
B:No you're right Yeah It's.
C:I think it's.
B:for.
C:for one word So we.
A:Yeah I think.
C:have to.
A:those.
A:segments for each utterance are split up.
C:Yeah.
A:Think so Yeah I'm pretty.
E:Well, otherwise we probably have to go over it and, like, add some integer that we just increment from top to bottom, sort of, to every utterance, as an I_D_ of some type. Or try to understand how NITE X_M_L_ I_D_s work, and maybe there's some special route we have to follow when we use these I_D_s.
E:It's hmm.
A:sure it's already there.
A:Pretty sure that's already there The the utterances are numbered.
E:Yeah the the girl said the utterances themselves are not numbered at the moment.
B:Um.
C:Uh I'm not quite sure I have.
E:Okay Okay.
C:only seen that the.
C:uh the individual words have.
E:Okay.
C:got an I_D_.
E:Yeah So I guess that would be solvable if not.
C:Yeah You always could have a look at the time stamps and then take.
E:Mm-hmm.
C:the ones that uh belong together to form an utterance.
B:No.
E:Sorry.
B:I I think we would just take the segments that are already.
C:Yeah if they are.
B:that.
C:already there's it's easy.
B:Yeah.
C:but it.
B:um.
E:Okay.
C:would be possible.
B:this segments.
E:Okay.
B:you know the X_M_L_ segments.
C:Uh.
A:Hmm.
C:Okay.
E:Is that a board marker.
B:Oh That.
E:pen actually.
B:I don't.
A:Yeah.
B:know.
A:so.
E:That's just so, like, to make a list of all this stuff, or probably somebody can do it on paper, all these fancy pens. So what, so the stuff we have: we have utterances and speakers and weights for utterances. So for every utterance, sort of, like, the utterance has a speaker and a weight, which is coming from outside, or we just tie it to it. And there are segments which.
D:They are utterances aren't they.
E:hmm.
D:The segments are utterances aren't.
A:that's.
A:the impression I get yeah.
E:Oh so.
D:they.
E:sorry um Uh topic.
B:Yeah that's.
E:topic segments I meant Like they.
D:Alright okay.
E:are they are.
B:Mm-hmm.
E:So.
A:Oh.
E:so the utterances are tied to topic segments And if the time stamps are on a word level then we somehow have to extract time stamps for utterances where they start.
B:There there.
D:Well.
B:are.
D:easy.
B:time stamps um for well segments um.
E:what segments now.
B:and.
B:um segments is for example when when you look at the data what is displayed in one line.
E:Okay.
A:Hmm.
B:What when.
E:Is.
B:when you look.
E:uh is.
B:at it in.
E:same.
B:hmm.
E:is.
A:Mm.
E:that the same as utterances that.
D:Well it's close enough.
B:think.
D:isn't it.
B:Isn't.
A:Yeah uh.
D:It may.
D:not be exact every time but.
B:Um for.
D:it's a.
B:um I I.
D:we're looking.
B:compared it with what I did for the pause.
E:Mm-hmm Mm-hmm.
B:um duration extraction.
B:Um and basically it's uh words that are uttered in a sequence without pauses But sometimes um however there are um short pauses in it and they're indicated by square.
E:What so that's.
B:pause or something.
A:Right.
E:Oh But that's one.
B:in the.
E:one segment or is that two segments then.
B:Um uh but uh the annotators decided.
E:Yeah.
B:what was one segment and what wasn't.
E:Okay, okay. So but generally, uh sorry, segments, is that which we just called utterances now? Like it's, it's.
B:I.
E:the it's.
D:Yeah.
B:think so.
E:sort of.
E:like one person's contribution at a time sort of.
A:Okay Topics yeah.
E:Okay so yeah so we have those and and then we have some field somewhere else which has.
E:Yeah and and a topic's basically they are just on the I_D_ probably with a start time or something and and the utterances referenced to those topics I guess.
A:Yeah I think that's the.
E:So the.
A:one Hmm.
E:don't contain any redundant thing of, like, showing the whole topic again, but they just sort of say a number, and where they start and where they finish. And the utterances then say which topic they belong to.
B:Yeah but um I think for some annotations um an utterance can have several um types For example for the dialogue acts and so on.
E:Yeah No But I was thinking of the topic segmentation.
B:Okay Yeah.
E:now.
B:that should be.
E:and.
A:Hmm.
B:for Yeah.
E:that there would only be one right because it's sort of like it's just.
E:a time window.
A:Mm-hmm.
B:Should be yeah.
E:Yeah. So if this lazy loading works, then this should definitely fit into, I mean, not memory then, because it wouldn't all be in memory at the same time. So we just have those, sort of, that information, like a long list of all the utterances slash segments, and, like, short or smaller lists which give weight to them. And even though, probably, if there's a lot of over-head in having two different files, we can probably merge the weights into it off-line. You know what I mean? Like.
D:Yeah.
E:if if there's a lot of bureaucracy involved with having two different trees and whether one ties to the other because the one has the weight for the other then it's probably quicker to just.
D:But why don't we just write it as a new X_M_L_ file Can NITE handle just loading arbitrary uh new like attributes and stuff I mean I would have thought they'd make it able to.
E:Yeah I thought that was the whole beauty.
D:Yeah.
E:that like you can just make a new X_M_L_ file and sort of tie that to the other and and it.
D:So why do we need to have two X_M_L_ trees in memory at once.
E:Oh yeah, so no, I didn't mean tree, no. No, I meant just, like, handling two different files internally. Sort of, I was just thinking, you know, like, if the overhead for having the same amount of data coming from two files instead of from one file is massive, then it would probably be easy for us to just, like, off-line, put the weight into the file that has the segments, uh yeah, segments slash utterances, already. But that we can figure out, I mean, if it's going horrendously wrong.
D:The other thing is that would mean we'd be using their parser as well which means we wouldn't have to parse anything which be quite nice.
E:Yeah.
D:'Cause their parser is probably much faster than anything we've come up with anyway.
E:Yeah, yeah, no, we'd be completely using, like, the whole infrastructure, and basically, I mean, the main difference really between our project and theirs is that we load a different part of the data. But otherwise we're doing it the same way that they are doing it, so we're just sort of running different types of queries on it. In a sense, I think we are running queries, it's not just about um what we load and what we don't load, but we're running queries in the sense that we dynamically select by weights, don't we. We have to check how fast that is, like to say, give us all the ones that, whether that works with their query language, whether that's.
A:Mm.
E:too many results and whether we, you know, 'cause if, let's say, I mean, if their query language is strange, and if it would return ten million results and it can't handle it, then we can just write our individual components in a way that they know what the threshold is. So they still get all the data, and just they internally say, oh no, this is less than three and I'm not gonna display it, or something.
D:Yeah I mean we can process it in chunks if it gets too big basically.
E:Hmm Yeah.
D:We can just process it all in chunks.
D:if.
E:No.
D:gets too big to load it into memory.
E:I'm just thinking for this.
A:Hmm.
E:whole thing of, like, a different level, sort of cutting out different pieces, whether we do that through a query where we say give us everything that's above this and this weight, or whether we keep the same infrastructure but every individual module, like the player and the display, say, like, they still get sort of all the different utterances, uh, all the different pieces, but they say, oh, this piece I leave out because it's below the current threshold level.
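The component-side alternative E sketches here, each module receiving everything and skipping what falls below the current threshold, is simple enough to show in a few lines. Field names and weights are invented for illustration:

```python
# Sketch of component-side filtering: every module gets all segments
# (with precomputed weights) and skips anything below the current
# zoom level's threshold. Data shapes are illustrative only.
segments = [
    {"id": "seg.1", "text": "welcome everyone", "weight": 0.2},
    {"id": "seg.2", "text": "we decided on the XML format", "weight": 0.9},
    {"id": "seg.3", "text": "uh okay", "weight": 0.1},
]


def visible_segments(segments, threshold):
    """Return only the segments a module should display at this zoom level."""
    return [s for s in segments if s["weight"] >= threshold]


shown = visible_segments(segments, threshold=0.5)
```

The trade-off against a query-side cut, as discussed, is that every module sees all the data, but no module depends on the query language's ability to express the threshold.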
D:I think we probably want to store, sorry, I think we probably want to store um a hierarchical information density as well. So, like, a density score for each meeting and each topic segment. 'Cause otherwise we'd be recalculating the same thing over and over and over again.
A:Yeah that'd be much more efficient to.
D:Yeah.
A:do that Yeah.
D:And that will obviously make it much easier to display.
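The roll-up D suggests, precomputing one density score per topic segment and per meeting so the display never recalculates them, could be sketched like this. The data shapes are invented; the real scores would come from whatever density measure the group settles on:

```python
# Hedged sketch: roll utterance-level density weights up into one score
# per topic segment, plus a meeting-level average. Input format is
# hypothetical: {topic_id: [utterance_weight, ...]}.
def rollup_density(topics):
    per_topic = {t: sum(ws) / len(ws) for t, ws in topics.items() if ws}
    all_ws = [w for ws in topics.values() for w in ws]
    meeting = sum(all_ws) / len(all_ws) if all_ws else 0.0
    return per_topic, meeting


per_topic, meeting_score = rollup_density({"top.1": [0.2, 0.4], "top.2": [0.8]})
```

Stored alongside the segments off-line, these precomputed scores make the higher zoom levels a pure lookup.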
E:When do we need the one for the.
D:Well it may not for the whole meeting but like.
E:Okay, yeah, I guess for the, so when we have the display, will we display the whole series? Then, if we have, for the individual topic segments within the meetings, if we have ready-calculated um measures, then we don't have to sort of extract that data from.
D:Yeah exactly.
E:the individual utterances.
E:Yeah and that's also fairly easy to store along with our segments isn't it.
D:Yeah.
E:For the segments, are we extracting some type of title for them, that we craft with some fancy algorithm or manually, or are we just taking the single most highly valued key-word utterance for the segment heading?
D:Well we can start off like that Well I was gonna start off I've got sort of half-way through implementing one that does just I_D_F_.
E:Hmm Hmm.
D:And then just I can change that to work on.
D:whatever Yeah.
E:It's probably like in in the end probably it wouldn't be the best thing if it's just the most highly ranked phrase or key-word because like for example for an introduction that would most definitely not be anything that has any.
A:Hmm.
E:title anywhere similar to introduction.
D:And.
E:or something Yeah.
D:it should be weighted by stuff like the hot spots and um the key-words in the.
D:search and stuff like that.
E:Also like for this part maybe if we go over it with named entity in the end if I mean if one of the people doing DIL has some.
A:Hmm.
E:named entity code to spare, and just, like, at least for sort of finding topic titles for segments, just take a named entity which has a really high, what's it called, D_F_I_D_F_, whatever. 'Cause you'd probably be quite likely, if they're talking about a conference or a person, that that would be a named entity which is very highly um frequent in that part.
D:Did he not say something about named entities So I thought he said there wasn't very many.
E:Yeah he said they're quite sparse So.
D:Yeah.
E:that basically was, don't bother basing too much of your general calculation on it. But especially if they're sparse, probably individual named entities which describe what a segment is about would probably be quite good. Like, if there's some name of some conference, they could probably say that name of the conference quite often, even though he's right that they make indirect references to it. Anyway.
C:You uh.
E:Sorry.
C:you said you are currently uh implementing the I_D_F_ What.
D:Yeah It's not.
C:exactly are you.
D:T_F_I_D_F_ it's just inverse document.
C:Okay Okay.
D:frequency 'Cause it's really easy to do basically.
D:There's just like for a baseline really.
C:Mm-hmm.
A:Yeah you're able to do that in Java yeah.
D:Well I'm half-way through It's.
A:Yeah.
D:not working yet but it will do.
E:So you're doing that on a on a per word level.
D:Um yeah.
E:Okay Okay.
D:And then averaging it over the utterances.
D:But it's not like um related to the corpus at all It's just working on an arbitrary text file at the moment.
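What D describes, plain inverse document frequency per word, then averaged over each utterance, can be sketched in a few lines. Here each "document" is simply an utterance; that choice, and the toy data, are assumptions for illustration, since D's actual Java implementation works on an arbitrary text file:

```python
import math

# Minimal IDF sketch: idf(w) = log(N / df(w)), where df(w) is the
# number of "documents" (here: utterances) containing w.
def idf_scores(utterances):
    n = len(utterances)
    df = {}
    for utt in utterances:
        for w in set(utt.split()):
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n / c) for w, c in df.items()}


def utterance_score(utt, idf):
    """Average IDF over the utterance's words, as D describes."""
    words = utt.split()
    return sum(idf[w] for w in words) / len(words) if words else 0.0


utts = ["the budget", "the the schedule", "budget schedule review"]
idf = idf_scores(utts)
```

Rare words like "review" score high, frequent ones like "the" score low, which is why this works as a crude baseline for information density.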
E:Okay cool I was just wondering where you had the corpus from at the moment.
D:No.
A:Huh.
E:So it seems that the data structure isn't a big problem, and that basically we don't have to have all these massive discussions of how we exactly interact with the data structure, because most of our work isn't done with that data structure in memory in the browser, but it's just done off-line, and everyone can represent it any way they want, as long as they sort of store it in a useful X_M_L_ representation in the end. So.
D:It.
E:like.
D:would be useful to know how everyone's gonna store their things though.
A:Hmm.
E:Yeah that would mean understanding the NITE X_M_L_ X_M_L_ sort of.
D:Yeah Yeah.
E:format in a lot more detail We should I think we should just have a long session in the computer room together and like now that we know a bit more what we want take a closer look.
E:at NITE.
D:Well.
E:X_M_L_.
D:got like a few hours free.
A:Yeah I've.
D:Like.
A:had a.
D:this.
A:I've had a look at the the topic segments how.
E:Mm-hmm Mm-hmm.
A:stored. And then, yeah, there are a few per meeting, and it um well it gives a time stamp, and inside each one there's uh the actual, like, utterance segments.
A:And the list of them that occurred And they're all numbered Um so that's where that's stored.
E:Good Yeah I haven't looked at this stuff much at all.
A:Yeah so I guess um if I'm gonna be segmenting it with.
E:Yeah Yeah.
A:a L_C_ then that's like same format I'd want to.
A:um put it back out in so it'd be equivalent.
E:Who's who's sort of doing the the the central coordination of of of the browser application now Like.
A:Well like the integration.
E:Hmm.
A:What do you mean integration.
E:Yeah or but also like all these elements like like the loading and yeah integration and and.
A:Hmm.
E:like handling the data loading and stuff.
A:I don't know I don't think anyone's been allocated to do that yet.
E:Nah I'm sort of like I think I'll take over the display just because I've started with a bit and found it.
A:Yeah yeah Yeah definitely.
E:found it doable.
E:So somebody should sort of be the one person who's who understands most about what's centrally going on with with the with the project like with the with the browser as a whole and where the data comes in and.
A:Hmm yeah.
E:Any volunteers.
D:It's the most boring task.
C:Mm-hmm.
E:It's also a complicated one.
A:Yeah.
D:Yeah.
A:it could be difficult yeah.
E:Yeah I know but uh I guess we can do it like several people together it's probably just those people have to work together a lot and very closely and just make sure that they're always understand what the other one is doing.
A:Yeah Well I guess the important thing is to get the crucial modules built.
D:Or at least um simple versions of them.
A:yeah Yeah.
E:Yeah.
A:and then.
E:or ready-made versions of them for that matter and.
A:Yeah and then we'll maybe have to prioritize somebody into just integrating it.
E:Yeah, but I think actually, like, at the moment the integration comes first. I mean, at the moment building the browser comes first, and only then comes creating new sophisticated data chunks, because that's sort of the whole thing about having a prototype system which is more or less working on chunk data. At least we then have the framework in which we can test everything and look at everything. 'Cause before we have that, it's gonna be very difficult for anyone to really see how much the work that they're doing is making sense, because, well, I guess you can see something from the data that you have in your individual X_M_L_ files that you create, but it would be nice to have some basic system which just displays some stuff.
D:So maybe we should try doing something really simple like just displaying a whole meeting And like just being able to scroll through it or something like that.
E:Or just adapt like their like.
D:Yeah.
E:just sort of go from their system and adapt that piece for piece, and see how we could, like, adapt it to our system. Does anyone want to, like, just sit with me and play for three hours with NITE X_M_L_ at some point?
D:Are you free after this.
E:Uh I wouldn't like to be 'cause I'd like to go to the gym I'm theoretically free But if there's any time.
D:How about Friday then 'Cause.
E:hmm.
D:I'm off all Friday.
E:You have nothing no free time on Wednesday.
D:Uh Wednesday I've got a nine 'til twelve.
E:Hmm Nine 'til twelve and then you have.
D:Yeah nothing.
E:or.
D:in the afternoon.
E:Hmm.
D:I've got nothing in the afternoon So.
E:Anytime Wednesday afternoon I'd be cool.
D:Okay.
E:I think.
D:So you yeah Where about just in Appleton Tower.
E:Yo Forrest Hill whatever one's easier to discuss stuff I don't know.
D:Uh I'll be in um.
E:I'm not biased.
D:the Appleton anyway.
E:Okay What time do you wanna do.
D:Um well I'll be there from twelve I've got some other stuff that needs done on Matlab so if you're not there at twelve I can just work on that.
E:Okay so I'll just meet.
D:So Yeah.
E:you in in eighteen in the afternoon.
E:I guess at the moment nobody critically depends on like the NITE X_M_L_ stuff working right now right Like at the moment you can all.
A:Mm-hmm Yeah I.
E:do your stuff and I can do my L_S_A_ stuff And I can even do the display to a vast degree without actually having their supplying framework working So it's not that crucial.
C:Yeah I.
A:think.
C:I I would need the raw text pretty soon because I have to find out um how I have to put the segments into bins.
E:Yeah actually I need.
C:And.
E:the raw text as well Yeah but.
C:yeah.
E:I was I was I was.
A:Uh.
E:more thinking of the sort of the the whole browser framework as a running.
C:No that's not necessary.
E:programme now.
E:Yeah I think we all need the raw text in different in different flavours don't we.
A:Yeah yeah.
D:Why.
A:Jasmine I thought you just said that you'd uh looked at extracting the text.
C:Yes I did But um I've only just got the notes I have to still have uh to order everything by the time.
A:Yeah So.
C:and.
A:you you said you did it in Python yeah.
C:Yeah I think it's quite easy.
A:Yeah did.
C:after.
A:you use uh the X_L_ uh X_M_L_ parser.
C:Yeah Yeah.
A:in Python.
C:So uh.
A:good.
A:So um 'cause yeah I was having a look in it a look at it as well and.
C:Mm-hmm.
A:I noticed the um the speakers are all in that separate file So did did you have to combine them all and and then re-order them.
C:Yeah I uh that's what I was uh thought That.
A:Yeah yeah.
C:you just combine them and then order.
C:the time stamps accordingly.
A:Right.
A:Yeah so that's approach um well I was going to do So yeah we may as well collaborate.
C:Okay Um what I found out was that there are quite a lot of things without without time stamps in the beginning.
A:In the word files.
C:Yeah and uh X_M_L_.
B:Yes.
C:files Yeah that's just an I_D_ or something I don't.
B:Yeah everything.
C:know Just.
B:that's a word has a time stamp.
C:Yes but what are the other things that's uh some kind of number maybe the file number or something that is in the beginning What is that.
A:I'm not.
C:Do.
A:sure I what you mean.
C:know.
C:Um I think there are quite a lot of numbers in the beginning where there is no time stamp for the numbers. I think they say um quite a lot of numbers, and before that uh um there's this number. Was.
E:But.
C:it.
E:number within the X_M_L_ context.
C:Yeah there are numbers in the um the W_ tag but there are no time stamps.
E:Are they spoken numbers Like do they look like they're utterances.
C:Yeah.
E:numbers There's the number task isn't there That's.
B:That's at.
E:part.
B:the end.
E:the.
B:That's at the end I think.
E:Okay.
C:Yeah in.
B:her.
C:the beginning as well sometimes I think.
A:Oh right.
C:At least I saw some.
B:Yeah maybe Didn't.
E:Hmm.
B:have a look at.
E:have to probably.
B:our meetings.
E:that out anyway for our project I don't know It's probably gonna screw up a lot of our data otherwise.
C:Yeah.
E:If.
A:Hmm.
B:Uh I.
E:Not sure if it what it.
E:does to document.
B:I.
B:think it wouldn't as it occurs I mean it would be it occurs in every meeting So.
E:It would probably make the yeah if if you have segments for that probably.
B:And.
E:the Okay.
B:I think it even has uh its own annotation like digits or something So.
B:that should be really easy to.
E:Uh I'm just thinking like.
B:cut.
E:it it probably like the L_S_A_ would perform quite well on it It would probably find another number task quite easily seeing that it's.
B:Yeah I'm sure.
E:a constrained.
E:vocabulary with a high co-occurrence of the same nine words So that.
C:But what it.
E:ten.
C:is it actually that numbers.
E:Hmm Yeah.
B:Ah it's just to test the system I think.
C:Okay so but there.
B:So.
C:are no time stamps annotated.
E:I think it's also something.
C:to that.
E:that they they said.
C:it's quite.
E:the numbers in order right.
B:Mm they have to read numbers.
E:Yeah.
B:from Uh.
E:I think it sounded like they wanted to check out how well they were doing with overlapping and stuff, because basically it's like they're reading them at different speeds, but you know in which order they are said.
E:Anyway.
B:I didn't have a look at that So.
E:ICSI has some reasons for doing it.
A:Hmm.
E:They must have been pissed off saying like numbers at the end of every meeting.
C:And also um there are different um combinations of letters B_R_E_ and something like that Is it everything ordered are the time stamps global or uh are they local at any point.
A:Mm I thought they were local.
B:They.
A:to a particular meeting.
C:Okay.
E:Um Dave, if you would, or actually, well, if you're doing I_D_F_s or whatever you call your frequencies, I always mix up the name, uh you need some dictionary for that at some point though. Like, you need to have some representation of a word as not that specific occurrence of that word token, but of a given word form. Because you're making.
D:Yeah I'm.
E:counts for word.
D:just building a dictionary.
E:right.
E:Yeah so we should work together on that because I need a dictionary as well.
D:Oh mine's just gonna use the um hash map one in um Java 'Cause I'm only gonna do it on small documents.
E:Okay 'Kay.
D:It's just like until the information density.
D:is up and running Just something to get give me something to work with.
E:Okay Didn't you say that the.
D:So it's only.
E:the.
D:gonna use quite small documents you see to start with.
E:Yeah but for I'm just wondering for the whole thing Does somebody who was it of you two who said that um there's some programme which spits out a dictionary probably with frequencies.
C:Yeah it's Rainbow It's um I think it's just the dictionary in the first place But.
E:Okay Is anyone of you for the for the document frequency over total frequency you gonna have total frequencies of words then with that right Like over the whole corpus.
C:Um.
E:sort of Or.
C:no I have to bin it up and so I will only have counts for each each bin or something.
D:Why does it need to be classified into like different segments.
C:It's because um Rainbow is a text classification system, and I think it's not possible to have just one class. That's the problem.
D:Can we just fill a second class with junk that we don't care about.
C:Maybe we could.
D:Like I don't know copies of Shakespeare or something.
C:Yeah, sure, we could do that, but I don't think that makes sense.
D:'Cause if what we're looking for is the um frequency statistics I don't see how that would be changed by the classification.
C:If we need just frequencies maybe we should just calculate them by.
D:I the.
C:using Perl or something I don't know.
D:Well there maybe another tool available.
C:Yeah it's quite easy to just count and or sort.
D:Yeah.
C:them by um frequency.
E:using which tool are you talking about.
C:Just using a Perl script.
E:Be careful with that Like my experience with the British National Corpus was that there's.
C:Is it too.
E:far.
C:big Yeah.
E:more word types than you ever think, because anything that's sort of unusual generally is a new word type. Like any typo, or any strange thing where they put two words together, and also any number as a word type of its own. So you can easily end up with hundreds of thousands of words when you.
C:Hmm.
E:didn't expect them. So generally dictionaries can grow bigger than you think they do.
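The ideas in this stretch of the discussion, a hash-map frequency dictionary, regex pre-filtering of digit-only or punctuation-bearing tokens, and dropping the N most frequent words, fit together in a short sketch. It is written in Python with `collections.Counter` rather than the Java HashMap the group mentions; the tokenisation and cut-offs are illustrative assumptions:

```python
import re
from collections import Counter

# keep only plain lowercase alphabetic tokens, so digits and tokens
# with special characters (dots, hyphens, etc.) never enter the
# dictionary and blow up its size
TOKEN_OK = re.compile(r"^[a-z]+$")


def build_dictionary(text, drop_top_n=0):
    counts = Counter(t for t in text.lower().split() if TOKEN_OK.match(t))
    # optionally drop the N most frequent words (articles and the like),
    # as a cheap substitute for a curated stop-list
    for w, _ in counts.most_common(drop_top_n):
        del counts[w]
    return counts


counts = build_dictionary("the cat saw the 2nd cat and the dog 42", drop_top_n=1)
```

On the real corpus the same approach would just be run over the extracted raw text, with `drop_top_n` set to something like the hundred E mentions.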
C:I don't know how you how many terms you can handle in Perl.
E:Well you can probably also, you can probably pre-filter, like with regular expressions, even just say if it consists of only digits then skip it, or even.
C:Mm yeah.
E:if it consists of any special characters, then skip it, because it's probably something with a dot in between, which is usually not something you wanna have, and.
D:Um I can't remember who's got it, might be WordNet, but one of these big ones has a list of stop words that you can download, and they're just basically lists of really uninteresting boring words that we could filter out before we do that. Like, that's one of the papers I read, that's um one of the things they did right at the beginning: they've got this big stop-list and they just ignore all of those throughout.
E:What I did.
D:the experiment Yeah.
E:for my project, I just ignored the hundred most frequent words, because they actually end up all being articles and everything and stuff. So we need, like, several of us need a dictionary. Am I the only one who needs.
E:it with frequencies.
D:I it would be useful for me as well.
E:Am I the only one who needs it with frequencies Or.
D:It uh I think that'd be useful for me as well.
E:Frequencies.
D:Yeah Yeah.
E:Yeah Well I guess as soon as we have the raw text we can probably just start with the Java hash map and like.
E:just hash map over it and see how far we get. I mean, we can probably, on a machine with a few hundred megabytes of RAM, you can go quite far. You can run it on beefy, so even if it goes wrong and even if it has a million.
D:Well all you really.
E:words.
D:wanna do is look into getting some sub-set of the ICSI corpus off the DICE machines, 'cause I hate working on DICE, it's awful. Like, so I can use my home machine.
E:Oh yeah burning it on a like we should be able to burn the.
D:has a C_D_.
E:whole corpus.
D:burner.
E:hmm.
D:though has a C_D_ burner.
E:Ah, I see. I asked support about that two days ago. In the Informatics building there, oh sorry, in Appleton Tower five, the two machines closest to the support office. So I presume, oh wait, I have the exact email, I think he's talking.
D:Yeah The right-hand corner far.
E:about sort of the ones that.
E:Yeah.
D:right Yeah.
E:if you if you enter the big room in the right-hand corner I think.
B:Mm-hmm.
E:Um the thing is, like, you can only burn from the local file-system. So if it's from, well, actually I think if it's mounted you can directly burn from there, but the problem is I have my data on beefy, and so I have to get it into the local temp directory and burn it from there. But you can burn it from there.
D:How big is it without um.
E:Uh we looked that up and.
D:the.
E:I we.
D:files and.
E:looked that up and I forgot.
D:'Cause I could just set it um going over S_C_P_ one night and just leave it going all night if I had to.
E:Yeah, yeah. No, we should be able to get it. I don't think it was, I don't think it was a gigabyte.
D:It's yeah I mean.
E:Hmm.
D:the wave data are obviously not gonna get off there completely.
E:See I would I would offer you to to get it on this one and then um like copy it But you know what I figured out I'm quicker down-loading over broad-band into my.
D:Really Oh.
E:computer than using this.
D:right.
E:There's something strange about the way how they access the hard disk how they mount it which is unfortunate.
D:I'll see if I can S_C_P_ it I suppose.
E:Hmm What operating system do you have.
D:I've got a Linux box and a Windows.
E:Okay.
D:box So Broad-band.
E:what connection do you have at home.
E:Yeah So if anyone of us gets it we can then just use.
D:Put on to C_D_.
E:an.
D:I can if I get.
E:Yeah.
D:down I.
E:it to.
D:can put to C_D_.
E:or.
E:yeah put it on on hard disk whatever.
D:Yeah not sure.
E:Question is if you're not quicker if you uh, because you should get massive compression out of that, like fifty percent or something with a good algorithm. So if you could compress it and just put it into a directory.
E:Like The.
D:if there's enough space Is.
D:how much do we get.
E:they usually have four gigabytes, three or.
D:Really Okay.
E:two The.
E:yeah, I, I mean there's no guarantee that anything stays there, but overnight it'll stay. And I think they usually have. Ah yeah, but that would have to be the directory on the machine you can S_S_H_ into, the S_S_H_ directory.
D:Yeah but I can do it from that session can't I You can compress it from a remote session and S_C_P_ it from the same session.
E:Yeah, they'd probably hate you for doing it. But they'd probably like you more if you S_S_H_ uh into another computer, compress it there, and then sort of copy it onto the gateway machine.
D:Do you think Yeah.
E:They have um, if you S_S_H_, you know, if you S_S_H_ they have this big warning about doing nothing.
E:at all in the gateway machine.
D:Oh no no I was thinking of just into some machine.
E:Yeah To your home machine.
D:and then just it from there.
D:Yeah I mean it has to go through the gateway.
E:I haven't.
D:But Can you.
E:I haven't figured out how to tunnel through the gateway into another.
E:machine yet.
D:not do.
E:It's not it's not easy definitely That's why I end up sort of copying stuff into the directory at the gateway.
D:Mm I see.
E:machine.
E:Sorry if this is boring everybody else This is just details and how to get stuff home from what we.
B:Uh yeah.
E:can probably just.
D:Yeah.
E:when we're meeting I'm sorry.
B:'Kay. Um I just um wondered, so who's uh then doing um the frequencies on the words? Because I think I will also um, I could also make use of it um for the agreement and disagreement thing. Because um in my outline.
B:I talked about um using the um discourse acts first.
E:Mm-hmm Well.
B:and um then in the chunks of text I found looking for word patterns and so on So um I would for example need the um most um frequent words.
B:So if you cut.
E:yeah As soon.
B:off all that I'd won't be.
B:use or.
E:as.
E:somebody gives me the raw text of the whole thing I can probably just implement like a five line Java hash table frequency dictionary builder and see.
B:Yeah I I but I need it for my chunks then I would You know.
E:Oh did you not say frequencies of words in the whole sorry did.
B:Yeah.
E:uh.
B:but I'd uh I would like to look at the frequency of words in my um in the regions of text I found out to be interesting So.
D:So you.
B:I.
D:could just.
B:need it It it.
E:So you'd.
B:would have to be re-calculated.
E:Yeah you'd.
B:only.
E:have to count it yourself yeah.
B:segments Huh.
D:But first uh how big are the chunks.
D:How big are the chunks you're looking at.
B:Uh uh mm I think it would be you know as as big at as the hot-spot annotation.
D:So quite small then.
B:things.
D:So you could just.
B:that's.
D:um you could use just.
A:Hmm.
D:the same thing we used to build the big dictionary. You just do that on-line, 'cause that won't take long, to build a little dictionary that big, will it? I mean, just use the same tool that.
B:Yes Yeah.
D:we.
B:yeah.
D:Yeah Yeah.