Codebook

The codebook specifies the data types, possible values, and other information for each column in the data files.

Word features

Contains the word features for each of the stimulus texts.

Please find the files at this link: Word features

| Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
| --- | --- | --- | --- | --- | --- | --- |
| word | | string | Words as they appear in the stimulus texts. Words are split at whitespace. | 0 | nan | nan |
| word_with_punct | | string | The word as it appears in the text, including punctuation. | 0 | nan | nan |
| word_index_in_text | 1-180 | Integer | The index of the word in the text. Indexing starts at 1. | 0 | nan | nan |
| word_index_in_sent | 1-51 | Integer | The index of the word in the sentence. Indexing starts at 1. | 0 | nan | nan |
| sent_index_in_text | 1-12 | Integer | The index of a sentence in the respective text. Indexing starts at 1. | 0 | nan | nan |
| word_limit_char_indices | no stats? | | Specifies the limits of each word in character indices. Format: [word_start],[word_end]. For example: 3,7 means a word starts at character index 3 in the text and ends at character index 7. The properties of the character indices are specified in char_index_in_text. | 0 | nan | nan |
| text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
| text_id_numeric | 0-11 | Integer | Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5 | 0 | nan | Manually created |
| text_domain | biology: 954, physics: 941 | Categorical | The domain of the stimulus text. | 0 | nan | Manually tagged |
| text_domain_numeric | 0: 954, 1: 941 | Categorical | Numerical value of text_domain; 0=biology, 1=physics. | 0 | nan | Manually created |
| word_length | 2-33 | Integer | Word length in characters, counting symbols such as hyphens but not sentence-final punctuation (e.g., z.B. = 4 characters; DNA-Kette = 9 characters; eats. = 4 characters). | 0 | nan | nan |
| STTS_punctuation_before | nan: 1883, $(: 12 | Categorical | If a word is preceded by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here. | 1883 | nan | Manually tagged |
| STTS_punctuation_after | nan: 1689, $.: 101, $,: 93, $(: 10, $($,: 2 | Categorical | If a word is followed by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here. | 1689 | nan | Manually tagged |
| is_in_quote | 0: 1881, 1: 14 | Categorical | Whether or not the word is part of an expression in quotes. | 0 | nan | Manually tagged |
| is_in_parentheses | 0: 1890, 1: 5 | Categorical | Whether or not the word is part of a phrase in parentheses. | 0 | nan | Manually tagged |
| is_clause_beginning | 0: 1796, 1: 99 | Categorical | Whether or not the word is the beginning of a clause. | 0 | nan | Manually tagged |
| is_sent_beginning | 0: 1798, 1: 97 | Categorical | Whether or not the word is the beginning of a new sentence. | 0 | nan | Manually tagged |
| is_clause_end | 0: 1797, 1: 98 | Categorical | Whether or not the word is the end of a clause. | 0 | nan | Manually tagged |
| is_sent_end | 0: 1798, 1: 97 | Categorical | Whether or not the word is the end of a sentence. | 0 | nan | Manually tagged |
| is_abbreviation | 0: 1890, 1: 5 | Categorical | Whether or not the entire word is an abbreviation. | 0 | nan | Manually tagged |
| is_expert_technical_term | 0: 1740, 1: 155 | Categorical | 1 if the word is a technical term that is not generally understandable. E.g.: "Agarose". | 0 | nan | Manually tagged |
| is_general_technical_term | 0: 1646, 1: 249 | Categorical | 1 if the word is a technical term that is generally understandable. E.g.: "elektrisch". | 0 | nan | Manually tagged |
| contains_symbol | 0: 1887, 1: 8 | Categorical | Whether or not the word contains a symbol. E.g.: β-D-Glucose | 0 | nan | Manually tagged |
| contains_hyphen | 0: 1866, 1: 29 | Categorical | Whether or not the word contains a hyphen, e.g. 1 for DNA-Fragment. Words tagged TRUNC (the truncated first element of a compound) do not count: in "Sekundär- und Tertiärstrukturen", "Sekundär-" is not counted as having a hyphen. | 0 | nan | Manually tagged |
| contains_abbreviation | 0: 1883, 1: 12 | Categorical | Whether or not the word contains an abbreviation. 0 for words that are only an abbreviation. See is_abbreviation. E.g. 1 for DNA-Fragment, 0 for DNA. | 0 | nan | Manually tagged |
| STTS_PoS_tag | ADJA: 154, ADJD: 53, ADV: 73, APPR: 184, APPRART: 48, APZR: 1, ART: 276, CARD: 9, KOKOM: 17, KON: 66, KOUI: 6, KOUS: 16, NE: 4, NN: 515, PAV: 18, PDAT: 16, PDS: 7, PIAT: 5, PIDAT: 9, PIS: 10, PPER: 25, PPOSAT: 7, PRELAT: 6, PRELS: 29, PRF: 25, PTKA: 1, PTKNEG: 4, PTKVZ: 13, PTKZU: 10, PWAV: 1, TRUNC: 5, VAFIN: 73, VAINF: 8, VMFIN: 25, VMINF: 1, VVFIN: 102, VVINF: 33, VVIZU: 2, VVPP: 38 | Categorical | Part-of-speech tags according to the STTS-tagset. See stimuli/ANNOTATION.MD for more information. | 0 | nan | Manually tagged |
| type | | string | The orthographical representation of a word as found in the corpus; this data is case sensitive, i.e. there is a distinction between name and Name. | 4 | nan | dlexDB |
| type_length_chars | 2.0-33.0 | Integer | The length of the type of a word in characters. See the description of word_length for a definition of how characters are counted. | 1 | nan | nan |
| PoS_tag | adja: 162, adjd: 54, adv: 91, appr: 182, apprart: 48, art: 280, card: 9, kokom: 17, kon: 63, koui: 5, kous: 16, ne: 7, nn: 508, pdat: 16, pds: 7, piat: 5, pidat: 2, pis: 14, pper: 24, pposat: 7, prelat: 6, prels: 24, prf: 25, ptka: 1, ptkneg: 4, ptkvz: 15, ptkzu: 10, pwav: 1, trunc: 5, vafin: 73, vainf: 8, vmfin: 24, vminf: 1, vvfin: 103, vvinf: 33, vvizu: 2, vvpp: 38, xy: 5 | Categorical | Part-of-speech tag as defined by the dlexDB query. | 0 | nan | dlexDB |
| lemma | | string | nan | 4 | nan | dlexDB |
| lemma_length_chars | 1.0-32.0 | Integer | nan | 3 | nan | dlexDB |
| syllables | | string | nan | 25 | nan | dlexDB |
| type_length_syllables | 1.0-14.0 | Integer | nan | 24 | nan | dlexDB |
| annotated_type_frequency_normalized | min: 0.00817507899599, max: 24738.5901996, mean: 3889.8532, std: 6967.089 | Float | The number of occurrences of an annotated type in corpus. An annotated type is a unique combination of a type, its part-of-speech tag and its lemma. | 127 | nan | dlexDB |
| type_frequency_normalized | min: 0.00817507899599, max: 26530.3631386, mean: 4409.2283, std: 7712.5287 | Float | nan | 115 | nan | dlexDB |
| lemma_frequency_normalized | min: 0.00817507899599, max: 80100.3069113, mean: 13063.8057, std: 25247.1898 | Float | nan | 115 | nan | dlexDB |
| familiarity_normalized | min: 0.0, max: 26530.3631386, mean: 4074.0362, std: 7634.0602 | Float | nan | 117 | nan | dlexDB |
| regularity_normalized | min: 0.0, max: 2123.30585022, mean: 37.6119, std: 123.3575 | Float | nan | 116 | nan | dlexDB |
| document_frequency_normalized | min: 0.126068429944, max: 9372.80956103, mean: 3073.6225, std: 3377.4549 | Float | nan | 116 | nan | dlexDB |
| sentence_frequency_normalized | min: 0.0155184320176, max: 30912.3596552, mean: 6119.8019, std: 9642.457 | Float | nan | 116 | nan | dlexDB |
| cumulative_syllable_corpus_frequency_normalized | min: 1.40611358731, max: 125126.524676, mean: 16825.508, std: 15793.39 | Float | nan | 116 | nan | dlexDB |
| cumulative_syllable_lexicon_frequency_normalized | min: 0.428085856899, max: 218985.607753, mean: 23221.2613, std: 31879.0143 | Float | nan | 119 | nan | dlexDB |
| cumulative_character_corpus_frequency_normalized | min: 15533.2550482, max: 7810554.20193, mean: 1917789.2641, std: 1253328.3202 | Float | nan | 116 | nan | dlexDB |
| cumulative_character_lexicon_frequency_normalized | min: 47003.8270876, max: 18380479.713, mean: 4265792.357, std: 2812004.0938 | Float | nan | 116 | nan | dlexDB |
| cumulative_character_bigram_corpus_frequency_normalized | min: 5138.64210483, max: 1322150.62097, mean: 363265.3368, std: 217175.5613 | Float | nan | 116 | nan | dlexDB |
| cumulative_character_bigram_lexicon_frequency_normalized | min: 12677.7626521, max: 2788357.77704, mean: 590209.5889, std: 442407.5129 | Float | nan | 116 | nan | dlexDB |
| cumulative_character_trigram_corpus_frequency_normalized | min: 4358.04468689, max: 603427.130456, mean: 227949.9158, std: 122856.9432 | Float | nan | 116 | nan | dlexDB |
| cumulative_character_trigram_lexicon_frequency_normalized | min: 11942.3111499, max: 899592.89035, mean: 237804.6839, std: 171696.6712 | Float | nan | 116 | nan | dlexDB |
| initial_letter_frequency_normalized | min: 199.202149895, max: 110461.430317, mean: 38381.0963, std: 33346.9984 | Float | nan | 116 | nan | dlexDB |
| initial_bigram_frequency_normalized | min: 1.57779024623, max: 53801.2331077, mean: 12768.0203, std: 14670.9631 | Float | nan | 116 | nan | dlexDB |
| initial_trigram_frequency_normalized | min: -0.00817507899599, max: 29048.3692201, mean: 5888.4981, std: 8949.4325 | Float | nan | 116 | nan | dlexDB |
| avg_cond_prob_in_bigrams | min: 1.2e-07, max: 0.5006180465, mean: 0.0451, std: 0.0448 | Float | The conditional probability of the bigram, given the occurrence of its first component. In other words, how likely it is for the second component to follow directly after the first. Here, this measure is computed on the basis of the annotated type information. | 116 | nan | dlexDB |
| avg_cond_prob_in_trigrams | min: 3.153e-06, max: 25.0, mean: 0.2526, std: 0.6009 | Float | The conditional probability of the trigram, given the occurrence of its initial bigram. In other words, how likely it is for the third component to follow directly after the initial pair. Here, this measure is computed on the basis of the annotated type information. | 116 | nan | dlexDB |
| neighbors_coltheart_higher_freq_cum_freq_normalized | min: 0.0, max: 44055.247282, mean: 2248.7136, std: 7540.5582 | Float | nan | 116 | nan | dlexDB |
| neighbors_coltheart_higher_freq_count_normalized | min: 0.0, max: 8.13363128109, mean: 0.2077, std: 0.5007 | Float | nan | 116 | nan | dlexDB |
| neighbors_coltheart_all_cum_freq_normalized | min: 0.0, max: 49782.1108458, mean: 5076.6032, std: 10127.1033 | Float | nan | 116 | nan | dlexDB |
| neighbors_coltheart_all_count_normalized | min: 0.0, max: 47.5175301158, mean: 15.7971, std: 14.4153 | Float | nan | 116 | nan | dlexDB |
| neighbors_levenshtein_higher_freq_cum_freq_normalized | min: 0.0, max: 44055.247282, mean: 2879.4346, std: 7921.0448 | Float | nan | 116 | nan | dlexDB |
| neighbors_levenshtein_higher_freq_count_normalized | min: 0.0, max: 11.9864039932, mean: 0.3277, std: 0.6576 | Float | nan | 116 | nan | dlexDB |
| neighbors_levenshtein_all_cum_freq_normalized | min: 0.0, max: 54875.2749862, mean: 6722.366, std: 11598.2601 | Float | nan | 116 | nan | dlexDB |
| neighbors_levenshtein_all_count_normalized | min: 0.0, max: 75.7711966712, mean: 24.6418, std: 22.5295 | Float | nan | 116 | nan | dlexDB |
| sent_surprisal_gpt2-base | min: 0.0005104430601932, max: 56.804420471191406, mean: 6.9134, std: 6.601 | Float | Surprisal value extracted from a language model (GerPT2-base) with the sentence as context. | 0 | nan | See script get_surprisal.py |
| text_surprisal_gpt2-base | min: 0.0002225389762315, max: 53.041446685791016, mean: 5.5822, std: 5.709 | Float | Surprisal value extracted from a language model (GerPT2-base) with the text as context. | 0 | nan | See script get_surprisal.py |
| sent_surprisal_gpt2-large | min: 0.0002048997703241, max: 42.28059005737305, mean: 6.1407, std: 5.8854 | Float | Surprisal value extracted from a language model (GerPT2-large) with the sentence as context. | 0 | nan | See script get_surprisal.py |
| text_surprisal_gpt2-large | min: 0.0001027531252475, max: 35.38883209228516, mean: 4.735, std: 4.8645 | Float | Surprisal value extracted from a language model (GerPT2-large) with the text as context. | 0 | nan | See script get_surprisal.py |
| sent_surprisal_llama-7b | min: 0.0001720042055239, max: 42.96158599853516, mean: 6.1564, std: 5.7273 | Float | Surprisal value extracted from a language model (LeoLM-7b) with the sentence as context. | 0 | nan | See script get_surprisal.py |
| text_surprisal_llama-7b | min: 1.990775308513548e-05, max: 35.62324142456055, mean: 3.4794, std: 3.8552 | Float | Surprisal value extracted from a language model (LeoLM-7b) with the text as context. | 0 | nan | See script get_surprisal.py |
| sent_surprisal_llama-13b | min: 8.702239938429557e-06, max: 46.25139999389648, mean: 6.0065, std: 5.8588 | Float | Surprisal value extracted from a language model (LeoLM-13b) with the sentence as context. | 0 | nan | See script get_surprisal.py |
| text_surprisal_llama-13b | min: 9.298280929215252e-06, max: 36.29869842529297, mean: 3.2454, std: 3.8091 | Float | Surprisal value extracted from a language model (LeoLM-13b) with the text as context. | 0 | nan | See script get_surprisal.py |
| sent_surprisal_bert-base | min: 1.1920928244535389e-07, max: 101.79562616348268, mean: 6.4507, std: 11.6184 | Float | Surprisal value extracted from a language model (BERT-base) with the sentence as context. | 0 | nan | See script get_surprisal.py |
| text_surprisal_bert-base | min: -0.0, max: 88.84420316047726, mean: 6.2599, std: 11.5846 | Float | Surprisal value extracted from a language model (BERT-base) with the text as context. | 0 | nan | See script get_surprisal.py |
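
The two purely structural columns, word_limit_char_indices and text_id_numeric, can be decoded without any external library. The following is a minimal sketch, not the authors' code: the helper name `parse_word_limits` and the constant names are hypothetical, and the codebook does not state whether the end index is inclusive, so the function only splits the value into two integers.

```python
# Hypothetical helpers; names are illustrative and not part of the dataset.

# Order taken from the text_id_numeric description: 0=b0, ..., 5=b5, 6=p0, ..., 11=p5.
TEXT_IDS = ["b0", "b1", "b2", "b3", "b4", "b5", "p0", "p1", "p2", "p3", "p4", "p5"]
TEXT_ID_TO_NUMERIC = {text_id: number for number, text_id in enumerate(TEXT_IDS)}


def parse_word_limits(value: str) -> tuple[int, int]:
    """Split a '[word_start],[word_end]' string into two integers."""
    start, end = value.split(",")
    return int(start), int(end)


print(parse_word_limits("3,7"))   # (3, 7)
print(TEXT_ID_TO_NUMERIC["p0"])   # 6
```

Whether the returned end index points at the last character of the word or one past it should be checked against the AOI files before slicing text with it.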

Stimuli and comprehension questions

Contains the stimulus information including the questions for each text.

Please find the file at this link: Stimuli including comprehension questions

| Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
| --- | --- | --- | --- | --- | --- | --- |
| text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
| text_id_numeric | 0-11 | Integer | Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5 | 0 | nan | Manually created |
| text_domain | biology: 6, physics: 6 | Categorical | The domain of the stimulus text. | 0 | nan | Manually tagged |
| text_domain_numeric | 0: 6, 1: 6 | Categorical | Numerical value of text_domain; 0=biology, 1=physics. | 0 | nan | Manually created |
| source | no stats? | | The source of the stimulus text. | 0 | nan | nan |
| headline | | string | The header of the respective stimulus text. | 0 | nan | nan |
| tq_1 | | string | Text question 1. | 0 | nan | Manually created |
| tq_1_option1 | | string | Option 1 for text question 1. | 0 | nan | Manually created |
| tq_1_option2 | | string | Option 2 for text question 1. | 0 | nan | Manually created |
| tq_1_option3 | | string | Option 3 for text question 1. | 0 | nan | Manually created |
| tq_1_option4 | | string | Option 4 for text question 1. | 0 | nan | Manually created |
| tq_2 | | string | Text question 2. | 0 | nan | Manually created |
| tq_2_option1 | | string | Option 1 for text question 2. | 0 | nan | Manually created |
| tq_2_option2 | | string | Option 2 for text question 2. | 0 | nan | Manually created |
| tq_2_option3 | | string | Option 3 for text question 2. | 0 | nan | Manually created |
| tq_2_option4 | | string | Option 4 for text question 2. | 0 | nan | Manually created |
| tq_3 | | string | Text question 3. | 0 | nan | Manually created |
| tq_3_option1 | | string | Option 1 for text question 3. | 0 | nan | Manually created |
| tq_3_option2 | | string | Option 2 for text question 3. | 0 | nan | Manually created |
| tq_3_option3 | | string | Option 3 for text question 3. | 0 | nan | Manually created |
| tq_3_option4 | | string | Option 4 for text question 3. | 0 | nan | Manually created |
| bq_1 | | string | Background question 1. | 0 | nan | Manually created |
| bq_1_option1 | | string | Option 1 for background question 1. | 0 | nan | Manually created |
| bq_1_option2 | | string | Option 2 for background question 1. | 0 | nan | Manually created |
| bq_1_option3 | | string | Option 3 for background question 1. | 0 | nan | Manually created |
| bq_1_option4 | | string | Option 4 for background question 1. | 0 | nan | Manually created |
| bq_2 | | string | Background question 2. | 0 | nan | Manually created |
| bq_2_option1 | | string | Option 1 for background question 2. | 0 | nan | Manually created |
| bq_2_option2 | | string | Option 2 for background question 2. | 0 | nan | Manually created |
| bq_2_option3 | | string | Option 3 for background question 2. | 0 | nan | Manually created |
| bq_2_option4 | | string | Option 4 for background question 2. | 0 | nan | Manually created |
| bq_3 | | string | Background question 3. | 0 | nan | Manually created |
| bq_3_option1 | | string | Option 1 for background question 3. | 0 | nan | Manually created |
| bq_3_option2 | | string | Option 2 for background question 3. | 0 | nan | Manually created |
| bq_3_option3 | | string | Option 3 for background question 3. | 0 | nan | Manually created |
| bq_3_option4 | | string | Option 4 for background question 3. | 0 | nan | Manually created |
| correct_ans_tq_1 | 1-4 | Integer | The index of the correct answer for text question 1, given as the option number in this file. For example, 2 means that the answer in the column "tq_1_option2" is the correct answer to this question. | 0 | nan | nan |
| correct_ans_tq_2 | 1-4 | Integer | The index of the correct answer for text question 2, given as the option number in this file. For example, 2 means that the answer in the column "tq_2_option2" is the correct answer to this question. | 0 | nan | nan |
| correct_ans_tq_3 | 1-4 | Integer | The index of the correct answer for text question 3, given as the option number in this file. For example, 2 means that the answer in the column "tq_3_option2" is the correct answer to this question. | 0 | nan | nan |
| correct_ans_bq_1 | 1-4 | Integer | The index of the correct answer for background question 1, given as the option number in this file. For example, 2 means that the answer in the column "bq_1_option2" is the correct answer to this question. | 0 | nan | nan |
| correct_ans_bq_2 | 1-4 | Integer | The index of the correct answer for background question 2, given as the option number in this file. For example, 2 means that the answer in the column "bq_2_option2" is the correct answer to this question. | 0 | nan | nan |
| correct_ans_bq_3 | 1-4 | Integer | The index of the correct answer for background question 3, given as the option number in this file. For example, 2 means that the answer in the column "bq_3_option2" is the correct answer to this question. | 0 | nan | nan |
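
The correct_ans_* columns hold an option number that points at one of the four option columns of the same question. A minimal sketch of that lookup follows; the function name `correct_answer_text` is hypothetical, the row is modelled as a plain dictionary, and the option strings are placeholders rather than real stimulus content.

```python
def correct_answer_text(row: dict, question: str) -> str:
    """Return the text of the correct option for a question such as 'tq_1' or 'bq_3'.

    Assumes `row` maps column names of the stimuli file to their values.
    """
    option_number = int(row[f"correct_ans_{question}"])
    return row[f"{question}_option{option_number}"]


# Placeholder row for illustration only; not taken from the dataset.
example_row = {
    "tq_1_option1": "option A",
    "tq_1_option2": "option B",
    "tq_1_option3": "option C",
    "tq_1_option4": "option D",
    "correct_ans_tq_1": "2",
}

print(correct_answer_text(example_row, "tq_1"))  # option B
```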

Items

Contains, for each item version, the text order (trial) and the order in which the answer options of each question were presented.

Please find the file at this link: Items

| Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
| --- | --- | --- | --- | --- | --- | --- |
| version | 0-119 | Integer | Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in the items.tsv. | 0 | nan | nan |
| text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
| text_domain | biology: 720, physics: 720 | Categorical | The domain of the stimulus text. | 0 | nan | Manually tagged |
| order_bq_1_ans | no stats? | | The order in which the answers for background question 1 were presented. | 0 | nan | nan |
| order_bq_2_ans | no stats? | | See description of order_bq_1_ans | 0 | nan | nan |
| order_bq_3_ans | no stats? | | See description of order_bq_1_ans | 0 | nan | nan |
| order_tq_1_ans | no stats? | | See description of order_bq_1_ans | 0 | nan | nan |
| order_tq_2_ans | no stats? | | See description of order_bq_1_ans | 0 | nan | nan |
| order_tq_3_ans | no stats? | | See description of order_bq_1_ans | 0 | nan | nan |
| trial | 1-12 | Integer | Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text. | 0 | nan | nan |
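
Because trial encodes the presentation position of a text within a version, the reading order for one version can be recovered by sorting that version's rows by trial. The sketch below assumes the rows have already been loaded as dictionaries; the function name `reading_order` and the three example rows are hypothetical.

```python
def reading_order(rows: list[dict], version: int) -> list[str]:
    """Return the text_ids of one item version, sorted by trial number."""
    selected = [row for row in rows if int(row["version"]) == version]
    selected.sort(key=lambda row: int(row["trial"]))
    return [row["text_id"] for row in selected]


# Made-up rows for illustration; a real version contains all 12 texts.
example_rows = [
    {"version": 0, "text_id": "b0", "trial": 2},
    {"version": 0, "text_id": "p3", "trial": 1},
    {"version": 1, "text_id": "b0", "trial": 1},
]

print(reading_order(example_rows, 0))  # ['p3', 'b0']
```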

Areas of interest (AOI)

Contains the aoi files for each of the stimulus texts.

Please find the files at this link: AOI

| Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
| --- | --- | --- | --- | --- | --- | --- |
| aoi_type | | | The shape of the area of interest. In this corpus, all aois are rectangles around the characters. | 0 | nan | SR Research data viewer |
| aoi | 1-1121 | Integer | The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated. | 0 | nan | SR Research experiment builder |
| start_x | 80-1622 | Integer | The x-coordinate in pixels of the top left corner of the aoi rectangle. | 0 | nan | nan |
| start_y | 21-920 | Integer | The y-coordinate in pixels of the top left corner of the aoi rectangle. | 0 | nan | nan |
| end_x | 92-1634 | Integer | The x-coordinate in pixels of the bottom right corner of the aoi rectangle. | 0 | nan | nan |
| end_y | 99-998 | Integer | The y-coordinate in pixels of the bottom right corner of the aoi rectangle. | 0 | nan | nan |
| character | | string | Character as text. | 0 | nan | nan |
| line | 1-12 | Integer | The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1. | 0 | nan | nan |
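
Since each AOI is a rectangle given by its top left (start_x, start_y) and bottom right (end_x, end_y) corners, a gaze position can be mapped to a character index with a simple containment test. This is a sketch under assumptions: the function name `find_aoi` is hypothetical, the rectangles in the example are invented, and treating the rectangle borders as inclusive is a choice made here, not something the codebook specifies.

```python
from typing import Optional


def find_aoi(x: float, y: float, aois: list[dict]) -> Optional[int]:
    """Return the `aoi` index of the first rectangle containing (x, y), else None.

    Borders are treated as inclusive (an assumption of this sketch).
    """
    for aoi in aois:
        inside_x = aoi["start_x"] <= x <= aoi["end_x"]
        inside_y = aoi["start_y"] <= y <= aoi["end_y"]
        if inside_x and inside_y:
            return aoi["aoi"]
    return None


# Invented rectangles for illustration only.
example_aois = [
    {"aoi": 1, "start_x": 80, "start_y": 21, "end_x": 92, "end_y": 99},
    {"aoi": 2, "start_x": 92, "start_y": 21, "end_x": 104, "end_y": 99},
]

print(find_aoi(85, 50, example_aois))   # 1
print(find_aoi(100, 50, example_aois))  # 2
print(find_aoi(10, 50, example_aois))   # None
```

With inclusive borders, a point lying exactly on a shared edge is assigned to whichever rectangle appears first in the list.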

Manually corrected constituency trees

The constituency trees that have been corrected manually.

Please find the file at this link:

| Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
| --- | --- | --- | --- | --- | --- | --- |
| sent_index_in_text | 1-12 | Integer | The index of a sentence in the respective text. Indexing starts at 1. | 0 | nan | nan |
| sentence | | string | The sentence in the text. | 0 | nan | nan |
| spacy_constituency_tree | no stats? | | The constituency tree of the sentence in the text as constructed by spacy. | 0 | nan | Spacy |
| str_constituents | no stats? | | The constituency tree in string format. This way it can be parsed easily and be displayed. | 0 | nan | Spacy |
| spacy_pos | no stats? | | The part-of-speech tags of the words in the sentence as tagged by spacy. | 0 | nan | Spacy |
| constituents | no stats? | | The constituents of the sentence tree as constructed by spacy. | 0 | nan | Spacy |
| text_id_numeric | 0-11 | Integer | Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5 | 0 | nan | Manually created |
| text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
| manually_corrected | False: 19, True: 79 | Categorical | Whether the sentence tree was manually corrected. | 0 | nan | Manually tagged |

Dependency trees

Contains the dependency trees for all stimuli which have been manually corrected.

Please find the file at this link: Dependency trees

| Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
| --- | --- | --- | --- | --- | --- | --- |
| spacy_word | | | The words in the sentence as tokenized by spacy. | 0 | nan | Spacy |
| spacy_lemma | | | The lemmas of the words in the sentence as constructed by spacy. | 0 | nan | Spacy |
| spacy_pos | no stats? | | The part-of-speech tags of the words in the sentence as tagged by spacy. | 0 | nan | Spacy |
| spacy_tag | | | The detailed part-of-speech tags of the words in the sentence as constructed by spacy (more fine-grained than spacy_pos). | 0 | nan | Spacy |
| dependency | no stats? | | The dependency relations of the words in the sentence as constructed by spacy. | 0 | nan | Spacy |
| dependency_head | no stats? | | The head of the dependency relation of the words in the sentence as constructed by spacy. | 0 | nan | Spacy |
| dependency_head_pos | no stats? | | The part-of-speech tag of the head of the dependency relation of the words in the sentence as constructed by spacy. | 0 | nan | Spacy |
| dependency_children | no stats? | | The children of the dependency relation of the words in the sentence as constructed by spacy. | 0 | nan | Spacy |
| text_id_numeric | 0-11 | Integer | Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5 | 0 | nan | Manually created |
| text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
| sent_index_in_text | 1.0-12.0 | Integer | The index of a sentence in the respective text. Indexing starts at 1. | 1 | nan | nan |
| manually_corrected | False: 1768, True: 193, nan: 153, Flse: 2 | Categorical | Whether the sentence tree was manually corrected. | 153 | nan | Manually tagged |
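
The manually_corrected column of this file lists four distinct values (False, True, nan, Flse), so a consumer may want to normalise it before filtering. The sketch below is one possible way to do that and rests on an assumption that is not confirmed by the codebook: that "Flse" is a misspelling of False. The function name `parse_manually_corrected` is hypothetical.

```python
from typing import Optional


def parse_manually_corrected(value) -> Optional[bool]:
    """Map the raw column value to True, False, or None for missing entries.

    ASSUMPTION: 'Flse' is treated as a misspelling of 'False'.
    """
    if value is None:
        return None
    text = str(value).strip().lower()
    if text in ("", "nan"):
        return None
    if text == "true":
        return True
    if text in ("false", "flse"):
        return False
    raise ValueError(f"Unexpected manually_corrected value: {value!r}")


print(parse_manually_corrected("True"))  # True
print(parse_manually_corrected("Flse"))  # False
print(parse_manually_corrected("nan"))   # None
```

Raising on unknown values rather than guessing keeps any further irregular labels visible.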

Raw data files (samples)

The raw eye tracking data (i.e. each line contains a sample) for each trial.

Please find the files at this link: Raw ET data

| Column name | Value type | Description | Source |
| --- | --- | --- | --- |
| time | Float | The time stamp of the sample. | edf file created by EyeLink |
| x | Float | The x-coordinate of the sample. | edf file created by EyeLink |
| y | Float | The y-coordinate of the sample. | edf file created by EyeLink |
| pupil_diameter | Float | The pupil diameter of the sample. | edf file created by EyeLink |

Fixations

Computed gaze events of all trials for each reader.

Please find the files at this link: Fixations

| Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
| --- | --- | --- | --- | --- | --- | --- |
| fixation_index | 1-1469 | Integer | The index of the fixation in temporal order. | 0 | nan | SR Research data viewer |
| text_domain | bio: 203667, biology: 1032, physics: 199721 | Categorical | The domain of the stimulus text. | 0 | nan | Manually tagged |
| trial | 1-12 | Integer | Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text. | 0 | nan | nan |
| acc_bq_1 | min: 0.0, max: 1.0, mean: 0.3869, std: 0.487 | Float | The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
| acc_bq_2 | min: 0.0, max: 1.0, mean: 0.3564, std: 0.4789 | Float | The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
| acc_bq_3 | min: 0.0, max: 1.0, mean: 0.4217, std: 0.4938 | Float | The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
| acc_tq_1 | min: 0.0, max: 1.0, mean: 0.6625, std: 0.4729 | Float | The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
| acc_tq_2 | min: 0.0, max: 1.0, mean: 0.6326, std: 0.4821 | Float | The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
| acc_tq_3 | min: 0.0, max: 1.0, mean: 0.6564, std: 0.4749 | Float | The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
| fixation_duration | 2-4474 | Integer | The duration of the fixation in milliseconds. | 0 | nan | SR Research data viewer |
| next_saccade_duration | 1.0-9491.0 | Integer | The duration of the saccade that follows a fixation in milliseconds. | 46 | nan | SR Research data viewer |
| previous_saccade_duration | nan-nan | Integer | The duration of the saccade that precedes a fixation in milliseconds. | 515 | nan | SR Research data viewer |
| version | 0-105 | Integer | Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in the items.tsv. | 0 | nan | nan |
| line | 1-12 | Integer | The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1. | 0 | nan | nan |
| aoi | 1-1121 | Integer | The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated. | 0 | nan | SR Research experiment builder |
| char_index_in_line | 1-100 | Integer | Index of a character in the line. Indexing starts at 1. | 0 | nan | nan |
| original_fixation_index | 1-1478 | Integer | The index of the uncorrected fixation. | 0 | nan | SR Research data viewer |
| is_fixation_adjusted | False: 382202, True: 22218 | Categorical | Whether or not the fixation has been adjusted manually. | 0 | nan | Manually tagged |
| reader_id | 0-105 | Integer | The unique identifier given to each reader. Reader IDs start at 0. | 0 | nan | Manually created |
| text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
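
Two properties of this table matter when aggregating: text_domain contains both "bio" and "biology", and the acc_* columns have missing values. The counts for "bio" and "biology" (203667 + 1032) add up to 204699, which equals the count for text_domain_numeric = 0 (biology) in the scanpaths table, suggesting that both labels denote the same domain. The sketch below relies on that reading; the names `normalize_domain` and `mean_accuracy` are hypothetical.

```python
import math
from typing import Optional

# ASSUMPTION: 'bio' is an alternative spelling of 'biology' (see the counts above).
DOMAIN_ALIASES = {"bio": "biology"}


def normalize_domain(value: str) -> str:
    """Map alias labels of text_domain onto the canonical label."""
    return DOMAIN_ALIASES.get(value, value)


def mean_accuracy(values: list) -> Optional[float]:
    """Average 0/1 accuracy values, skipping None and NaN entries."""
    valid = [v for v in values if v is not None and not math.isnan(v)]
    if not valid:
        return None
    return sum(valid) / len(valid)


print(normalize_domain("bio"))                  # biology
print(mean_accuracy([1.0, 0.0, float("nan")]))  # 0.5
```

Note that the fixation files repeat the trial-level accuracy on every fixation row, so averaging over rows weights trials by their number of fixations; deduplicating by reader_id and text_id first avoids that.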

Scanpaths

The scanpaths for each trial (i.e. fixations in fixation order).

Please find the files at this link: Scanpaths

Column name Possible values Value type Description Num missing values Missing value description Source
fixation_index 1-1469 Integer The index of the fixation in temporal order. 0 nan SR Research data viewer
text_domain bio: 4682, biology: 200017, physics: 199721 Categorical The domain of the stimulus text. 0 nan Manually tagged
trial 1-12 Integer Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text. 0 nan nan
acc_bq_1 min: 0.0, max: 1.0, mean: 0.3869, std: 0.487 Float The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1. 5785 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_bq_2 min: 0.0, max: 1.0, mean: 0.3564, std: 0.4789 Float The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1. 5785 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_bq_3 min: 0.0, max: 1.0, mean: 0.4217, std: 0.4938 Float The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1. 5785 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_1 min: 0.0, max: 1.0, mean: 0.6625, std: 0.4729 Float The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1. 5785 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_2 min: 0.0, max: 1.0, mean: 0.6326, std: 0.4821 Float The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1. 5785 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_3 min: 0.0, max: 1.0, mean: 0.6564, std: 0.4749 Float The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1. 5785 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
fixation_duration 2-4474 Integer The duration of the fixation in milliseconds. 0 nan SR Research data viewer
next_saccade_duration 1.0-9491.0 Integer The duration of the saccade that follows a fixation in milliseconds. 46 nan SR Research data viewer
previous_saccade_duration 1.0-9491.0 Integer The duration of a saccade that preceeds a fixation in milliseconds. 515 nan SR Research data viewer
version 0-105 Integer Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in the items.tsv. 0 nan nan
line 1-12 Integer The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1. 0 nan nan
aoi 1-1121 Integer The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated. 0 nan SR Research experiment builder
char_index_in_line 1-100 Integer Index of a character in the line. Indexing starts at 1. 0 nan nan
original_fixation_index 1-1478 Integer The index of the uncorrected fixation. 0 nan SR Research data viewer
is_fixation_adjusted False: 382202, True: 22218 Categorical Whether or not the fixation has been adjusted manually. 0 nan Manually tagged.
reader_id 0-105 Integer The unique identifier given to each reader. Reader IDs start at 0. 0 nan Manually created
text_id b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 Unique identifier given to each stimulus text. 0 nan nan
word_index_in_text 1-180 Integer The index of the word in the text. Indexing starts at 1. 0 nan nan
sent_index_in_text 1-12 Integer The index of a sentence in the respective text. Indexing starts at 1. 0 nan nan
char_index_in_text 1-1121 Integer Index of a character in the text. Indexing starts at 1. 0 nan nan
word string Words as they appear in the stimuli texts. Words are split at white-space. 0 nan nan
character string Character as text. 0 nan nan
text_id_numeric 0-11 Integer Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5 0 nan Manually created
text_domain_numeric 0: 204699, 1: 199721 Categorical Numerical value of text_domain; 0=biology, 1=physics. 0 nan Manually created
reader_discipline_numeric 0: 223158, 1: 181262 Categorical Numerical encoding of the reader discipline; 0=biology, 1=physics. 0 nan Manually created
level_of_studies_numeric 0: 154333, 1: 250087 Categorical Numerical value of level_of_studies; 0=beginner, 1=expert. 0 nan demographic questionnaire
expert_reading_label_numeric 0: 290883, 1: 113537 Categorical Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert). 1=expert_reading, 0=non-expert_reading 0 nan Manually tagged
expert_reading_label expert_reading: 113537, non-expert_reading: 290883 Categorical Whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert) 0 nan Manually tagged
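
The rule given for expert_reading_label (text_domain == reader_discipline and the reader is an expert) can be written out as a small helper. This is only a sketch of that rule using the numeric columns described above. The function name and the toy rows are made up for illustration and are not taken from the repository or the data files.

```python
def expert_reading_label(row: dict) -> tuple[int, str]:
    """Return (expert_reading_label_numeric, expert_reading_label) for one row.

    Assumes the numeric encodings listed in the table:
    text_domain_numeric / reader_discipline_numeric: 0=biology, 1=physics
    level_of_studies_numeric: 0=beginner, 1=expert
    """
    same_domain = row["text_domain_numeric"] == row["reader_discipline_numeric"]
    is_expert = row["level_of_studies_numeric"] == 1
    if same_domain and is_expert:
        return 1, "expert_reading"
    return 0, "non-expert_reading"


# Toy rows (illustrative values only)
toy_rows = [
    {"text_domain_numeric": 0, "reader_discipline_numeric": 0, "level_of_studies_numeric": 1},
    {"text_domain_numeric": 0, "reader_discipline_numeric": 1, "level_of_studies_numeric": 1},
    {"text_domain_numeric": 1, "reader_discipline_numeric": 1, "level_of_studies_numeric": 0},
]
toy_labels = [expert_reading_label(r) for r in toy_rows]
```

Only the first toy row counts as expert reading: the second reader is an expert in the other discipline, and the third is a beginner in the matching discipline.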

Reading measures

The word-level reading measures in a short format.

Please find the files at this link: Reading measures

Column name Possible values Value type Description Num missing values Missing value description Source
word_index_in_sent 1-51 Integer The index of the word in the sentence. Indexing starts at 1. 0 nan nan
sent_index_in_text 1-12 Integer The index of a sentence in the respective text. Indexing starts at 1. 0 nan nan
line 1-12 Integer The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1. 0 nan nan
FFD min: 0, max: 2144, mean: 166.4158, std: 132.8433 Float First-fixation duration: duration of the first fixation on a word if this word is fixated in first-pass reading, otherwise 0. 0 nan compute_reading_measures.py
SFD min: 0, max: 2144, mean: 118.8309, std: 135.573 Float Single-fixation duration: duration of the only first-pass fixation on a word, 0 if the word was skipped or more than one fixation occurred in the first-pass (equals FFD in case of a single first-pass fixation). 0 nan compute_reading_measures.py
FD min: 0, max: 2144, mean: 203.5219, std: 116.9324 Float First duration: duration of the first fixation on a word (identical to FFD if not skipped in the first-pass). 0 nan compute_reading_measures.py
FPRT min: 0, max: 9649, mean: 247.1511, std: 298.6889 Float First-pass reading time: sum of the durations of all first-pass fixations on a word (0 if the word was skipped in the first-pass). 0 nan compute_reading_measures.py
FRT min: 0, max: 9649, mean: 291.8272, std: 288.631 Float First-reading time: sum of the durations of all fixations from first fixating the word (regardless of whether the first fixation occurs in first-pass reading) until leaving the word for the first time (equals FPRT if the word was fixated in the first-pass). 0 nan compute_reading_measures.py
TFT min: 0, max: 25314, mean: 632.8199, std: 720.3975 Float Total-fixation time: sum of all fixations on a word (FPRT+RRT). 0 nan compute_reading_measures.py
RRT min: 0, max: 23902, mean: 385.6688, std: 597.5206 Float Re-reading time: sum of the durations of all fixations on a word that do not belong to the first-pass (TFT-FPRT). 0 nan compute_reading_measures.py
RPD_inc min: 0, max: 318898, mean: 632.8199, std: 3881.7376 Float Inclusive regression-path duration: Sum of all fixation durations starting from the first first-pass fixation on a word until fixation on a word to the right of this word (including all regressive fixations on previous words), 0 if the word was not fixated in the first-pass (RPD_exc+RBRT). 0 nan compute_reading_measures.py
RPD_exc min: 0, max: 315640, mean: 342.295, std: 3815.3786 Float Exclusive regression-path duration: Sum of all fixation durations after initiating a first-pass regression from a word until fixating a word to the right of this word, without counting fixations on the word itself (RPD_inc-RBRT). 0 nan compute_reading_measures.py
RBRT min: 0, max: 10675, mean: 290.5249, std: 358.8929 Float Right-bounded reading time: Sum of all fixation durations on a word until a word to the right of this word is fixated (RPD_inc-RPD_exc). 0 nan compute_reading_measures.py
Fix 0: 14182, 1: 127943 Categorical Fixation: 1 if the word was fixated, otherwise 0 (FPF or RR). 0 nan compute_reading_measures.py
FPF 0: 38408, 1: 103717 Categorical First-pass fixation: 1 if the word was fixated in the first-pass, otherwise 0. 0 nan compute_reading_measures.py
RR 0: 48283, 1: 93842 Categorical Re-reading: 1 if the word was fixated after the first-pass reading, otherwise 0 (sign(RRT)). 0 nan compute_reading_measures.py
FPReg 0: 119060, 1: 23065 Categorical First-pass regression: 1 if a regression was initiated in the first-pass reading of the word, otherwise 0 (sign(RPD_exc)). 0 nan compute_reading_measures.py
TRC_out min: 0, max: 15, mean: 0.4226, std: 0.7828 Float Total count of outgoing regressions: total number of regressive saccades initiated from this word. 0 nan compute_reading_measures.py
TRC_in min: 0, max: 12, mean: 0.4219, std: 0.7892 Float Total count of incoming regressions: total number of regressive saccades landing on this word. 0 nan compute_reading_measures.py
LP min: 0, max: 28, mean: 2.7791, std: 2.0942 Float Landing position: position of the first saccade on the word expressed by ordinal position of the fixated character. 0 nan compute_reading_measures.py
SL_in min: -162, max: 156, mean: 1.077, std: 3.0552 Float Incoming saccade length: length of the saccade that leads to first fixation on a word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression. 0 nan compute_reading_measures.py
SL_out min: -179, max: 63, mean: 0.1881, std: 7.0821 Float Outgoing saccade length: length of the first saccade that leaves the word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression; 0 if the word is never fixated. 0 nan compute_reading_measures.py
TFC min: 0, max: 87, mean: 2.8392, std: 2.9135 Float The total fixation count on the word. 0 nan compute_reading_measures.py
text_domain_numeric 0: 71550, 1: 70575 Categorical Numerical value of text_domain; 0=biology, 1=physics. 0 nan Manually created
trial 1-12 Integer Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text. 0 nan nan
text_id b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 Unique identifier given to each stimulus text. 0 nan nan
text_id_numeric 0-11 Integer Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5 0 nan Manually created
reader_id 0-105 Integer The unique identifier given to each reader. Reader IDs start at 0. 0 nan Manually created
gender_numeric 0.0: 66325, 1.0: 73905, nan: 1895 Categorical Numerical value of gender; 0=male, 1=female. 1895 nan nan
reader_discipline_numeric 0: 81485, 1: 60640 Categorical Numerical encoding of the reader discipline; 0=biology, 1=physics. 0 nan Manually created
level_of_studies_numeric 0: 53060, 1: 89065 Categorical Numerical value of level_of_studies; 0=beginner, 1=expert. 0 nan demographic questionnaire
discipline_level_of_studies_numeric 0: 30320, 1: 51165, 2: 22740, 3: 37900 Categorical Numerical value of discipline_level_of_studies; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert. 0 nan demographic questionnaire
expert_reading_label_numeric 0: 97547, 1: 44578 Categorical Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert). 1=expert_reading, 0=non-expert_reading 0 nan Manually tagged
expert_reading_label expert_reading: 44578, non-expert_reading: 97547 Categorical Whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert) 0 nan Manually tagged
age min: 18.0, max: 41.0, mean: 24.1644, std: 4.1809 Float Reader's age. 3790 nan demographic questionnaire
mean_acc_bq min: 0.0, max: 0.999250936329588, mean: 0.6381, std: 0.3139 Float The mean accuracy of all text questions for one text read by one reader. 0 nan nan
mean_acc_tq min: 0.0, max: 0.9991603694374476, mean: 0.3875, std: 0.3161 Float The mean accuracy of all background questions for one text read by one reader. 0 nan nan
acc_bq_1 min: 0.0, max: 0.9993197278911564, mean: 0.3858, std: 0.4857 Float The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1. 0 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_bq_2 min: 0.0, max: 0.9993197278911564, mean: 0.3559, std: 0.4778 Float The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1. 0 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_bq_3 min: 0.0, max: 0.9992429977289932, mean: 0.4207, std: 0.4925 Float The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1. 0 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_1 min: 0.0, max: 0.9993197278911564, mean: 0.6364, std: 0.4794 Float The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1. 0 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_2 min: 0.0, max: 0.999250936329588, mean: 0.6322, std: 0.4805 Float The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1. 0 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_3 min: 0.0, max: 0.9993197278911564, mean: 0.6456, std: 0.4766 Float The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1. 0 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
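
Several descriptions in this table state how the measures relate to each other (TFT = FPRT+RRT, RPD_inc = RPD_exc+RBRT, Fix = FPF or RR, RR = sign(RRT), FPReg = sign(RPD_exc)). The sketch below checks those stated identities for a single row. It is not part of compute_reading_measures.py. The function name and the toy values are hypothetical.

```python
def violated_identities(row: dict) -> list[str]:
    """Return the identities from the codebook that do not hold for this row."""

    def sign(x: float) -> int:
        # The measures are non-negative, so sign() is 1 for any positive value.
        return int(x > 0)

    checks = {
        "TFT == FPRT + RRT": row["TFT"] == row["FPRT"] + row["RRT"],
        "RPD_inc == RPD_exc + RBRT": row["RPD_inc"] == row["RPD_exc"] + row["RBRT"],
        "Fix == FPF or RR": row["Fix"] == int(bool(row["FPF"]) or bool(row["RR"])),
        "RR == sign(RRT)": row["RR"] == sign(row["RRT"]),
        "FPReg == sign(RPD_exc)": row["FPReg"] == sign(row["RPD_exc"]),
    }
    return [name for name, holds in checks.items() if not holds]


# Toy rows (illustrative values only, not taken from the data files)
consistent_row = {
    "FPRT": 200, "RRT": 150, "TFT": 350,
    "RPD_exc": 0, "RBRT": 200, "RPD_inc": 200,
    "Fix": 1, "FPF": 1, "RR": 1, "FPReg": 0,
}
inconsistent_row = dict(consistent_row, TFT=400)

ok_result = violated_identities(consistent_row)
bad_result = violated_identities(inconsistent_row)
```

Exact equality is used here on the assumption that durations are whole milliseconds. If the rows come from csv.DictReader, the values arrive as strings and need to be converted to numbers first.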

Merged: fixations, participant info, reading measures and word features

The word-level reading measures merged with trial, session and reader information, as well as more information on the words.

Please find the files at this link: Reading measures merged

Column name Possible values Value type Description Num missing values Missing value description Source
word string Words as they appear in the stimuli texts. Words are split at white-space. 0 nan nan
word_with_punct string The word as it appears in the text, including punctuation. 0 nan nan
word_index_in_text 1-180 Integer The index of the word in the text. Indexing starts at 1. 0 nan nan
word_index_in_sent 1-51 Integer The index of the word in the sentence. Indexing starts at 1. 0 nan nan
sent_index_in_text 1-12 Integer The index of a sentence in the respective text. Indexing starts at 1. 0 nan nan
text_id_numeric 0-11 Integer Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5 0 nan Manually created
text_domain biology: 71550, physics: 70575 Categorical The domain of the stimulus text. 0 nan Manually tagged
word_length 2-33 Integer Word length is defined in number of characters including symbols like hyphens but without sentence punctuation at the end (i.e., z.B. = 4 characters; DNA-Kette =9 characters; eats.=4 characters). 0 nan nan
STTS_punctuation_before 0.0: 70800, 0: 70425, $(: 900 Categorical If a word is preceded by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here. 0 nan Manually tagged
STTS_punctuation_after $(: 750, $($,: 150, $,: 6975, $.: 7575, 0: 126675 Categorical If a word is followed by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here. 0 nan Manually tagged
is_in_quote 0: 141075, 1: 1050 Categorical Whether or not the word is part of an expression in quotes. 0 nan Manually tagged
is_in_parentheses 0: 141750, 1: 375 Categorical Whether or not the word is part of a phrase in parentheses. 0 nan Manually tagged
is_clause_beginning 0: 134700, 1: 7425 Categorical Whether or not the word is the beginning of a clause. 0 nan Manually tagged
is_sent_beginning 0: 134850, 1: 7275 Categorical Whether or not the word is the beginning of a new sentence. 0 nan Manually tagged
is_clause_end 0: 134775, 1: 7350 Categorical Whether or not the word is the end of a clause. 0 nan Manually tagged
is_sent_end 0: 134850, 1: 7275 Categorical Whether or not the word is the end of a sentence. 0 nan Manually tagged
is_abbreviation 0: 141750, 1: 375 Categorical Whether or not the entire word is an abbreviation. 0 nan Manually tagged
is_expert_technical_term 0: 130500, 1: 11625 Categorical 1 if the word is a technical term that is not generally understandable. E.g.: "Agarose". 0 nan Manually tagged
is_general_technical_term 0: 123450, 1: 18675 Categorical 1 if the word is a technical term that is generally understandable. E.g.: "elektrisch" 0 nan Manually tagged
contains_symbol 0: 141525, 1: 600 Categorical Whether or not the word contains a symbol. E.g.: β-D-Glucose 0 nan Manually tagged
contains_hyphen 0: 139950, 1: 2175 Categorical Whether or not the word contains a hyphen. E.g.: 1 for DNA-Fragment. Words tagged TRUNC (compositional first elements) do not count as having a hyphen: in "Sekundär- und Tertiärstrukturen", "Sekundär-" is 0. 0 nan Manually tagged
contains_abbreviation 0: 141225, 1: 900 Categorical Whether or not the word contains an abbreviation. 0 for words that are only an abbreviation. See is_abbreviation. E.g. 1 for DNA-Fragment, 0 for DNA. 0 nan Manually tagged
STTS_PoS_tag ADJA: 11550, ADJD: 3975, ADV: 5475, APPR: 13800, APPRART: 3600, APZR: 75, ART: 20700, CARD: 675, KOKOM: 1275, KON: 4950, KOUI: 450, KOUS: 1200, NE: 300, NN: 38625, PAV: 1350, PDAT: 1200, PDS: 525, PIAT: 375, PIDAT: 675, PIS: 750, PPER: 1875, PPOSAT: 525, PRELAT: 450, PRELS: 2175, PRF: 1875, PTKA: 75, PTKNEG: 300, PTKVZ: 975, PTKZU: 750, PWAV: 75, TRUNC: 375, VAFIN: 5475, VAINF: 600, VMFIN: 1875, VMINF: 75, VVFIN: 7650, VVINF: 2475, VVIZU: 150, VVPP: 2850 Categorical Part-of-speech tags according to the STTS-tagset. See stimuli/ANNOTATION.MD for more information. 0 nan Manually tagged
type string The orthographical representation of a word as found in the corpus; this data is case sensitive, i.e. there is a distinction between name and Name. 0 nan dlexDB
type_length_chars 0.0-33.0 Integer The length of the type of a word in characters. See the description of word_length for a definition of how characters are counted. 0 nan nan
PoS_tag adja: 12150, adjd: 4050, adv: 6825, appr: 13650, apprart: 3600, art: 21000, card: 675, kokom: 1275, kon: 4725, koui: 375, kous: 1200, ne: 525, nn: 38100, pdat: 1200, pds: 525, piat: 375, pidat: 150, pis: 1050, pper: 1800, pposat: 525, prelat: 450, prels: 1800, prf: 1875, ptka: 75, ptkneg: 300, ptkvz: 1125, ptkzu: 750, pwav: 75, trunc: 375, vafin: 5475, vainf: 600, vmfin: 1800, vminf: 75, vvfin: 7725, vvinf: 2475, vvizu: 150, vvpp: 2850, xy: 375 Categorical Part-of-speech tag as defined by the dlexDB query. 0 nan dlexDB
lemma string nan 0 nan dlexDB
lemma_length_chars 0.0-32.0 Integer nan 0 nan dlexDB
syllables string nan 0 nan dlexDB
type_length_syllables 0.0-14.0 Integer nan 0 nan dlexDB
annotated_type_frequency_normalized min: 0.0, max: 24738.5901996, mean: 3629.1612, std: 6797.6492 Float The number of occurrences of an annotated type in corpus. An annotated type is a unique combination of a type, its part-of-speech tag and its lemma. 0 nan dlexDB
type_frequency_normalized min: 0.0, max: 26530.3631386, mean: 4141.6498, std: 7546.5578 Float nan 0 nan dlexDB
lemma_frequency_normalized min: 0.0, max: 80100.3069113, mean: 12271.0154, std: 24660.3797 Float nan 0 nan dlexDB
familiarity_normalized min: 0.0, max: 26530.3631386, mean: 3822.4994, std: 7457.3314 Float nan 0 nan dlexDB
regularity_normalized min: 0.0, max: 2123.30585022, mean: 35.3095, std: 119.8288 Float nan 0 nan dlexDB
document_frequency_normalized min: 0.0, max: 9372.80956103, mean: 2885.4746, std: 3353.4877 Float nan 0 nan dlexDB
sentence_frequency_normalized min: 0.0, max: 30912.3596552, mean: 5745.1861, std: 9454.5921 Float nan 0 nan dlexDB
cumulative_syllable_corpus_frequency_normalized min: 0.0, max: 125126.524676, mean: 15795.556, std: 15820.9152 Float nan 0 nan dlexDB
cumulative_syllable_lexicon_frequency_normalized min: 0.0, max: 218985.607753, mean: 21763.0396, std: 31363.3366 Float nan 0 nan dlexDB
cumulative_character_corpus_frequency_normalized min: 0.0, max: 7810554.20193, mean: 1800394.2485, std: 1298158.5605 Float nan 0 nan dlexDB
cumulative_character_lexicon_frequency_normalized min: 0.0, max: 18380479.713, mean: 4004667.3367, std: 2909455.8454 Float nan 0 nan dlexDB
cumulative_character_bigram_corpus_frequency_normalized min: 0.0, max: 1322150.62097, mean: 341028.5141, std: 227677.2532 Float nan 0 nan dlexDB
cumulative_character_bigram_lexicon_frequency_normalized min: 0.0, max: 2788357.77704, mean: 554080.6642, std: 451286.9101 Float nan 0 nan dlexDB
cumulative_character_trigram_corpus_frequency_normalized min: 0.0, max: 603427.130456, mean: 213996.2534, std: 130950.6249 Float nan 0 nan dlexDB
cumulative_character_trigram_lexicon_frequency_normalized min: 0.0, max: 899592.89035, mean: 223247.7744, std: 175811.3775 Float nan 0 nan dlexDB
initial_letter_frequency_normalized min: 0.0, max: 110461.430317, mean: 36031.6466, std: 33586.1123 Float nan 0 nan dlexDB
initial_bigram_frequency_normalized min: 0.0, max: 53801.2331077, mean: 11986.4422, std: 14536.7787 Float nan 0 nan dlexDB
initial_trigram_frequency_normalized min: -0.00817507899599, max: 29048.3692201, mean: 5528.0412, std: 8782.9659 Float nan 0 nan dlexDB
avg_cond_prob_in_bigrams min: 0.0, max: 0.5006180465, mean: 0.0423, std: 0.0447 Float The conditional probability of the bigram, given the occurrence of its first component. In other words, how likely it is for the second component to follow directly after the first. Here, this measure is computed on the basis of the annotated type information. 0 nan dlexDB
avg_cond_prob_in_trigrams min: 0.0, max: 25.0, mean: 0.2371, std: 0.5852 Float The conditional probability of the trigram, given the occurrence of its initial bigram. In other words, how likely it is for the third component to follow directly after the initial pair. Here, this measure is computed on the basis of the annotated type information. 0 nan dlexDB
neighbors_coltheart_higher_freq_cum_freq_normalized min: 0.0, max: 44055.247282, mean: 2111.0615, std: 7323.9586 Float nan 0 nan dlexDB
neighbors_coltheart_higher_freq_count_normalized min: 0.0, max: 8.13363128109, mean: 0.195, std: 0.4875 Float nan 0 nan dlexDB
neighbors_coltheart_all_cum_freq_normalized min: 0.0, max: 49782.1108458, mean: 4765.8454, std: 9884.7277 Float nan 0 nan dlexDB
neighbors_coltheart_all_count_normalized min: 0.0, max: 47.5175301158, mean: 14.8301, std: 14.4676 Float nan 0 nan dlexDB
neighbors_levenshtein_higher_freq_cum_freq_normalized min: 0.0, max: 44055.247282, mean: 2703.1737, std: 7703.635 Float nan 0 nan dlexDB
neighbors_levenshtein_higher_freq_count_normalized min: 0.0, max: 11.9864039932, mean: 0.3077, std: 0.6418 Float nan 0 nan dlexDB
neighbors_levenshtein_all_cum_freq_normalized min: 0.0, max: 54875.2749862, mean: 6310.865, std: 11349.5391 Float nan 0 nan dlexDB
neighbors_levenshtein_all_count_normalized min: 0.0, max: 75.7711966712, mean: 23.1334, std: 22.6083 Float nan 0 nan dlexDB
sent_surprisal_gpt2-base min: 0.0005104430601932, max: 56.804420471191406, mean: 6.9134, std: 6.5992 Float Surprisal value extracted from a language model (GerPT2-base) with the sentence as context. 0 nan See script get_surprisal.py
text_surprisal_gpt2-base min: 0.0002225389762315, max: 53.041446685791016, mean: 5.5822, std: 5.7075 Float Surprisal value extracted from a language model (GerPT2-base) with the text as context. 0 nan See script get_surprisal.py
sent_surprisal_gpt2-large min: 0.0002048997703241, max: 42.28059005737305, mean: 6.1407, std: 5.8838 Float Surprisal value extracted from a language model (GerPT2-large) with the sentence as context. 0 nan See script get_surprisal.py
text_surprisal_gpt2-large min: 0.0001027531252475, max: 35.38883209228516, mean: 4.735, std: 4.8632 Float Surprisal value extracted from a language model (GerPT2-large) with the text as context. 0 nan See script get_surprisal.py
sent_surprisal_llama-7b min: 0.0001720042055239, max: 42.96158599853516, mean: 6.1564, std: 5.7258 Float Surprisal value extracted from a language model (LeoLM-7b) with the sentence as context. 0 nan See script get_surprisal.py
text_surprisal_llama-7b min: 1.990775308513548e-05, max: 35.62324142456055, mean: 3.4794, std: 3.8542 Float Surprisal value extracted from a language model (LeoLM-7b) with the text as context. 0 nan See script get_surprisal.py
sent_surprisal_llama-13b min: 8.702239938429557e-06, max: 46.25139999389648, mean: 6.0065, std: 5.8573 Float Surprisal value extracted from a language model (LeoLM-13b) with the sentence as context. 0 nan See script get_surprisal.py
text_surprisal_llama-13b min: 9.298280929215252e-06, max: 36.29869842529297, mean: 3.2454, std: 3.8081 Float Surprisal value extracted from a language model (LeoLM-13b) with the text as context. 0 nan See script get_surprisal.py
sent_surprisal_bert-base min: 1.1920928244535389e-07, max: 101.79562616348268, mean: 6.4507, std: 11.6153 Float Surprisal value extracted from a language model (BERT-base) with the sentence as context. 0 nan See script get_surprisal.py
text_surprisal_bert-base min: -0.0, max: 88.84420316047726, mean: 6.2599, std: 11.5816 Float Surprisal value extracted from a language model (BERT-base) with the text as context. 0 nan See script get_surprisal.py
line 1-12 Integer The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1. 0 nan nan
FFD min: 0, max: 2144, mean: 166.4158, std: 132.8433 Float First-fixation duration: duration of the first fixation on a word if this word is fixated in first-pass reading, otherwise 0. 0 nan compute_reading_measures.py
SFD min: 0, max: 2144, mean: 118.8309, std: 135.573 Float Single-fixation duration: duration of the only first-pass fixation on a word, 0 if the word was skipped or more than one fixation occurred in the first-pass (equals FFD in case of a single first-pass fixation). 0 nan compute_reading_measures.py
FD min: 0, max: 2144, mean: 203.5219, std: 116.9324 Float First duration: duration of the first fixation on a word (identical to FFD if not skipped in the first-pass). 0 nan compute_reading_measures.py
FPRT min: 0, max: 9649, mean: 247.1511, std: 298.6889 Float First-pass reading time: sum of the durations of all first-pass fixations on a word (0 if the word was skipped in the first-pass). 0 nan compute_reading_measures.py
FRT min: 0, max: 9649, mean: 291.8272, std: 288.631 Float First-reading time: sum of the durations of all fixations from first fixating the word (regardless of whether the first fixation occurs in first-pass reading) until leaving the word for the first time (equals FPRT if the word was fixated in the first-pass). 0 nan compute_reading_measures.py
TFT min: 0, max: 25314, mean: 632.8199, std: 720.3975 Float Total-fixation time: sum of all fixations on a word (FPRT+RRT). 0 nan compute_reading_measures.py
TFC min: 0, max: 87, mean: 2.8392, std: 2.9135 Float The total fixation count on the word. 0 nan compute_reading_measures.py
RRT min: 0, max: 23902, mean: 385.6688, std: 597.5206 Float Re-reading time: sum of the durations of all fixations on a word that do not belong to the first-pass (TFT-FPRT). 0 nan compute_reading_measures.py
RPD_inc min: 0, max: 318898, mean: 632.8199, std: 3881.7376 Float Inclusive regression-path duration: Sum of all fixation durations starting from the first first-pass fixation on a word until fixation on a word to the right of this word (including all regressive fixations on previous words), 0 if the word was not fixated in the first-pass (RPD_exc+RBRT). 0 nan compute_reading_measures.py
RPD_exc min: 0, max: 315640, mean: 342.295, std: 3815.3786 Float Exclusive regression-path duration: Sum of all fixation durations after initiating a first-pass regression from a word until fixating a word to the right of this word, without counting fixations on the word itself (RPD_inc-RBRT). 0 nan compute_reading_measures.py
RBRT min: 0, max: 10675, mean: 290.5249, std: 358.8929 Float Right-bounded reading time: Sum of all fixation durations on a word until a word to the right of this word is fixated (RPD_inc-RPD_exc). 0 nan compute_reading_measures.py
Fix 0: 14182, 1: 127943 Categorical Fixation: 1 if the word was fixated, otherwise 0 (FPF or RR). 0 nan compute_reading_measures.py
FPF 0: 38408, 1: 103717 Categorical First-pass fixation: 1 if the word was fixated in the first-pass, otherwise 0. 0 nan compute_reading_measures.py
RR 0: 48283, 1: 93842 Categorical Re-reading: 1 if the word was fixated after the first-pass reading, otherwise 0 (sign(RRT)). 0 nan compute_reading_measures.py
FPReg 0: 119060, 1: 23065 Categorical First-pass regression: 1 if a regression was initiated in the first-pass reading of the word, otherwise 0 (sign(RPD_exc)). 0 nan compute_reading_measures.py
TRC_out min: 0, max: 15, mean: 0.4226, std: 0.7828 Float Total count of outgoing regressions: total number of regressive saccades initiated from this word. 0 nan compute_reading_measures.py
TRC_in min: 0, max: 12, mean: 0.4219, std: 0.7892 Float Total count of incoming regressions: total number of regressive saccades landing on this word. 0 nan compute_reading_measures.py
LP min: 0, max: 28, mean: 2.7791, std: 2.0942 Float Landing position: position of the first saccade on the word expressed by ordinal position of the fixated character. 0 nan compute_reading_measures.py
SL_in min: -162, max: 156, mean: 1.077, std: 3.0552 Float Incoming saccade length: length of the saccade that leads to first fixation on a word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression. 0 nan compute_reading_measures.py
SL_out min: -179, max: 63, mean: 0.1881, std: 7.0821 Float Outgoing saccade length: length of the first saccade that leaves the word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression; 0 if the word is never fixated. 0 nan compute_reading_measures.py
acc_bq_1 min: 0.0, max: 0.9993197278911564, mean: 0.3858, std: 0.4857 Float The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1. 0 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_bq_2 min: 0.0, max: 0.9993197278911564, mean: 0.3559, std: 0.4778 Float The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1. 0 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_bq_3 min: 0.0, max: 0.9992429977289932, mean: 0.4207, std: 0.4925 Float The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1. 0 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_1 min: 0.0, max: 0.9993197278911564, mean: 0.6364, std: 0.4794 Float The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1. 0 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_2 min: 0.0, max: 0.999250936329588, mean: 0.6322, std: 0.4805 Float The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1. 0 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_3 min: 0.0, max: 0.9993197278911564, mean: 0.6456, std: 0.4766 Float The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1. 0 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
mean_acc_tq min: 0.0, max: 0.9991603694374476, mean: 0.3875, std: 0.3161 Float The mean accuracy of all background questions for one text read by one reader. 0 nan nan
mean_acc_bq min: 0.0, max: 0.999250936329588, mean: 0.6381, std: 0.3139 Float The mean accuracy of all text questions for one text read by one reader. 0 nan nan
text_domain_numeric 0: 71550, 1: 70575 Categorical Numerical value of text_domain; 0=biology, 1=physics. 0 nan Manually created
trial 1-12 Integer Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text. 0 nan nan
text_id b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 Unique identifier given to each stimulus text. 0 nan nan
reader_id 0-105 Integer The unique identifier given to each reader. Reader IDs start at 0. 0 nan Manually created
gender_numeric 0.0: 66325, 1.0: 73905, nan: 1895 Categorical Numerical value of gender; 0=male, 1=female. 1895 nan nan
reader_discipline_numeric 0: 81485, 1: 60640 Categorical Numerical encoding of the reader discipline; 0=biology, 1=physics. 0 nan Manually created
age min: 18.0, max: 41.0, mean: 24.1644, std: 4.1809 Float Reader's age. 3790 nan demographic questionnaire
level_of_studies_numeric 0: 53060, 1: 89065 Categorical Numerical value of level_of_studies; 0=beginner, 1=expert. 0 nan demographic questionnaire
discipline_level_of_studies_numeric 0: 30320, 1: 51165, 2: 22740, 3: 37900 Categorical Numerical value of discipline_level_of_studies; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert. 0 nan demographic questionnaire
expert_reading_label_numeric 0: 97547, 1: 44578 Categorical Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert). 1=expert_reading, 0=non-expert_reading 0 nan Manually tagged
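
The mapping listed for discipline_level_of_studies_numeric (0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert) is consistent with combining the two binary columns as 2 * reader_discipline_numeric + level_of_studies_numeric. Whether the column was actually computed this way is an assumption. The helpers below are hypothetical and only illustrate the encoding.

```python
DISCIPLINES = {0: "biology", 1: "physics"}
LEVELS = {0: "beginner", 1: "expert"}


def discipline_level_code(reader_discipline_numeric: int, level_of_studies_numeric: int) -> int:
    """Combine the two binary codes into the four-valued code from the table."""
    return 2 * reader_discipline_numeric + level_of_studies_numeric


def discipline_level_name(code: int) -> str:
    """Decode a four-valued code back into a 'discipline-level' string."""
    discipline, level = divmod(code, 2)
    return f"{DISCIPLINES[discipline]}-{LEVELS[level]}"


all_codes = [discipline_level_code(d, l) for d in (0, 1) for l in (0, 1)]
all_names = [discipline_level_name(c) for c in all_codes]
```

The value counts in the table agree with this reading: codes 0 and 1 sum to the biology reader count (30320 + 51165 = 81485), and codes 0 and 2 sum to the beginner count (30320 + 22740 = 53060).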

Merged: scanpaths, participant info, reading measures and word features

Contains the scanpaths for each trial merged with information on the reader, texts, etc.

Please find the files at this link: Scanpaths merged

Column name Possible values Value type Description Num missing values Missing value description Source
fixation_index 1-1469 Integer The index of the fixation in temporal order. 0 nan SR Research data viewer
text_domain bio: 4682, biology: 200017, physics: 199721 Categorical The domain of the stimulus text. 0 nan Manually tagged
trial 1-12 Integer Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text. 0 nan nan
acc_bq_1 min: 0.0, max: 1.0, mean: 0.3869, std: 0.487 Float The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1. 5785 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_bq_2 min: 0.0, max: 1.0, mean: 0.3564, std: 0.4789 Float The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1. 5785 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_bq_3 min: 0.0, max: 1.0, mean: 0.4217, std: 0.4938 Float The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1. 5785 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_1 min: 0.0, max: 1.0, mean: 0.6625, std: 0.4729 Float The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1. 5785 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_2 min: 0.0, max: 1.0, mean: 0.6326, std: 0.4821 Float The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1. 5785 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_3 min: 0.0, max: 1.0, mean: 0.6564, std: 0.4749 Float The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1. 5785 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
fixation_duration 2-4474 Integer The duration of the fixation in milliseconds. 0 nan SR Research data viewer
next_saccade_duration 1.0-9491.0 Integer The duration of the saccade that follows a fixation in milliseconds. 46 nan SR Research data viewer
previous_saccade_duration 1.0-9491.0 Integer The duration of the saccade that precedes a fixation in milliseconds. 515 nan SR Research data viewer
version 0-105 Integer Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in the items.tsv. 0 nan nan
line 1-12 Integer The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1. 0 nan nan
aoi 1-1121 Integer The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated. 0 nan SR Research experiment builder
char_index_in_line 1-100 Integer Index of a character in the line. Indexing starts at 1. 0 nan nan
original_fixation_index 1-1478 Integer The index of the uncorrected fixation. 0 nan SR Research data viewer
is_fixation_adjusted False: 382202, True: 22218 Categorical Whether or not the fixation has been adjusted manually. 0 nan Manually tagged.
reader_id 0-105 Integer The unique identifier given to each reader. Reader IDs start at 0. 0 nan Manually created
text_id b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 Unique identifier given to each stimulus text. 0 nan nan
word_index_in_text 1-180 Integer The index of the word in the text. Indexing starts at 1. 0 nan nan
sent_index_in_text 1-12 Integer The index of a sentence in the respective text. Indexing starts at 1. 0 nan nan
char_index_in_text 1-1121 Integer Index of a character in the text. Indexing starts at 1. 0 nan nan
word string Words as they appear in the stimuli texts. Words are split at white-space. 0 nan nan
character string Character as text. 0 nan nan
text_id_numeric 0-11 Integer Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5 0 nan Manually created
text_domain_numeric 0: 204699, 1: 199721 Categorical Numerical value of text_domain; 0=biology, 1=physics. 0 nan Manually created
reader_discipline_numeric 0: 223158, 1: 181262 Categorical Numerical encoding of the reader discipline; 0=biology, 1=physics. 0 nan Manually created
level_of_studies_numeric 0: 154333, 1: 250087 Categorical Numerical value of level_of_studies; 0=beginner, 1=expert. 0 nan demographic questionnaire
expert_reading_label_numeric 0: 290883, 1: 113537 Categorical Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert). 1=expert_reading, 0=non-expert_reading 0 nan Manually tagged
expert_reading_label expert_reading: 113537, non-expert_reading: 290883 Categorical Whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert) 0 nan Manually tagged
word_with_punct string The word as it appears in the text, including punctuation. 96 nan nan
word_index_in_sent 1-51 Integer The index of the word in the sentence. Indexing starts at 1. 0 nan nan
word_length 2-33 Integer Word length is defined in number of characters including symbols like hyphens but without sentence punctuation at the end (i.e., z.B. = 4 characters; DNA-Kette =9 characters; eats.=4 characters). 0 nan nan
STTS_punctuation_before 0.0: 211108, 0: 189407, $(: 3905 Categorical If a word is preceded by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here. 0 nan Manually tagged
STTS_punctuation_after $(: 3260, $($,: 573, $,: 22559, $.: 25794, 0: 352234 Categorical If a word is followed by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here. 0 nan Manually tagged
is_in_quote 0: 399715, 1: 4705 Categorical Whether or not the word is part of an expression in quotes. 0 nan Manually tagged
is_in_parentheses 0: 403155, 1: 1265 Categorical Whether or not the word is part of a phrase in parentheses. 0 nan Manually tagged
is_clause_beginning 0: 388232, 1: 16188 Categorical Whether or not the word is the beginning of a clause. 0 nan Manually tagged
is_sent_beginning 0: 386681, 1: 17739 Categorical Whether or not the word is the beginning of a new sentence. 0 nan Manually tagged
is_clause_end 0: 381545, 1: 22875 Categorical Whether or not the word is the end of a clause. 0 nan Manually tagged
is_sent_end 0: 380027, 1: 24393 Categorical Whether or not the word is the end of a sentence. 0 nan Manually tagged
is_abbreviation 0: 403478, 1: 942 Categorical Whether or not the entire word is an abbreviation. 0 nan Manually tagged
is_expert_technical_term 0: 332354, 1: 72066 Categorical 1 if the word is a technical term that is not generally understandable. E.g.: "Agarose". 0 nan Manually tagged
is_general_technical_term 0: 325333, 1: 79087 Categorical 1 if the word is a technical term that is generally understandable. E.g.: "elektrisch" 0 nan Manually tagged
contains_symbol 0: 400458, 1: 3962 Categorical Whether or not the word contains a symbol. E.g.: β-D-Glucose 0 nan Manually tagged
contains_hyphen 0: 388149, 1: 16271 Categorical Whether or not the word contains a hyphen, e.g. 1 for DNA-Fragment. Words tagged TRUNC (compositional first element) do not count as having a hyphen; e.g. in "Sekundär- und Tertiärstrukturen", "Sekundär-" gets 0. 0 nan Manually tagged
contains_abbreviation 0: 399423, 1: 4997 Categorical Whether or not the word contains an abbreviation. 0 for words that are only an abbreviation. See is_abbreviation. E.g. 1 for DNA-Fragment, 0 for DNA. 0 nan Manually tagged
STTS_PoS_tag ADJA: 51041, ADJD: 12714, ADV: 12236, APPR: 22470, APPRART: 5566, APZR: 91, ART: 37340, CARD: 1594, KOKOM: 2428, KON: 5798, KOUI: 654, KOUS: 2521, NE: 955, NN: 162980, PAV: 3444, PDAT: 3292, PDS: 1374, PIAT: 791, PIDAT: 1653, PIS: 1322, PPER: 2511, PPOSAT: 1360, PRELAT: 1302, PRELS: 4193, PRF: 3606, PTKA: 97, PTKNEG: 687, PTKVZ: 1490, PTKZU: 583, PWAV: 76, TRUNC: 1137, VAFIN: 10340, VAINF: 1206, VMFIN: 3953, VMINF: 153, VVFIN: 23854, VVINF: 7713, VVIZU: 578, VVPP: 9317 Categorical Part-of-speech tags according to the STTS-tagset. See stimuli/ANNOTATION.MD for more information. 0 nan Manually tagged
type string The orthographical representation of a word as found in the corpus; this data is case sensitive, i.e. there is a distinction between name and Name. 0 nan dlexDB
type_length_chars 0.0-33.0 Integer The length of the type of a word in characters. See the description of word_length for a definition of how characters are counted. 0 nan nan
PoS_tag adja: 53330, adjd: 12226, adv: 15728, appr: 22193, apprart: 5566, art: 37918, card: 1594, kokom: 2428, kon: 5405, koui: 559, kous: 2521, ne: 1386, nn: 160585, pdat: 3292, pds: 1374, piat: 791, pidat: 352, pis: 2063, pper: 2434, pposat: 1360, prelat: 1302, prels: 4076, prf: 3606, ptka: 97, ptkneg: 687, ptkvz: 1891, ptkzu: 583, pwav: 76, trunc: 1137, vafin: 10340, vainf: 1206, vmfin: 3829, vminf: 153, vvfin: 23978, vvinf: 7713, vvizu: 578, vvpp: 9317, xy: 746 Categorical Part-of-speech tag as defined by the dlexDB query. 0 nan dlexDB
lemma string nan 0 nan dlexDB
lemma_length_chars 0.0-32.0 Integer nan 0 nan dlexDB
syllables string nan 0 nan dlexDB
type_length_syllables 0.0-14.0 Integer nan 0 nan dlexDB
annotated_type_frequency_normalized min: 0.0, max: 24738.5901996, mean: 1950.9055, std: 5185.3006 Float The number of occurrences of an annotated type in corpus. An annotated type is a unique combination of a type, its part-of-speech tag and its lemma. 0 nan dlexDB
type_frequency_normalized min: 0.0, max: 26530.3631386, mean: 2247.4523, std: 5847.2187 Float nan 0 nan dlexDB
lemma_frequency_normalized min: 0.0, max: 80100.3069113, mean: 7203.2409, std: 19769.4428 Float nan 0 nan dlexDB
familiarity_normalized min: 0.0, max: 26530.3631386, mean: 2191.7786, std: 5759.2592 Float nan 0 nan dlexDB
regularity_normalized min: 0.0, max: 2123.30585022, mean: 46.8657, std: 137.5046 Float nan 0 nan dlexDB
document_frequency_normalized min: 0.0, max: 9372.80956103, mean: 1684.1043, std: 2829.0626 Float nan 0 nan dlexDB
sentence_frequency_normalized min: 0.0, max: 30912.3596552, mean: 3137.4539, std: 7374.8037 Float nan 0 nan dlexDB
cumulative_syllable_corpus_frequency_normalized min: 0.0, max: 125126.524676, mean: 15768.7784, std: 17529.5528 Float nan 0 nan dlexDB
cumulative_syllable_lexicon_frequency_normalized min: 0.0, max: 218985.607753, mean: 27232.3183, std: 36883.9628 Float nan 0 nan dlexDB
cumulative_character_corpus_frequency_normalized min: 0.0, max: 7810554.20193, mean: 2053804.334, std: 1596380.3916 Float nan 0 nan dlexDB
cumulative_character_lexicon_frequency_normalized min: 0.0, max: 18380479.713, mean: 4612580.9638, std: 3597155.0404 Float nan 0 nan dlexDB
cumulative_character_bigram_corpus_frequency_normalized min: 0.0, max: 1322150.62097, mean: 356831.454, std: 269772.388 Float nan 0 nan dlexDB
cumulative_character_bigram_lexicon_frequency_normalized min: 0.0, max: 2788357.77704, mean: 629626.1651, std: 539088.9742 Float nan 0 nan dlexDB
cumulative_character_trigram_corpus_frequency_normalized min: 0.0, max: 603427.130456, mean: 200341.8076, std: 144122.7012 Float nan 0 nan dlexDB
cumulative_character_trigram_lexicon_frequency_normalized min: 0.0, max: 899592.89035, mean: 236423.2776, std: 199573.1416 Float nan 0 nan dlexDB
initial_letter_frequency_normalized min: 0.0, max: 110461.430317, mean: 28045.0077, std: 30618.9167 Float nan 0 nan dlexDB
initial_bigram_frequency_normalized min: 0.0, max: 53801.2331077, mean: 8706.0335, std: 12743.2638 Float nan 0 nan dlexDB
initial_trigram_frequency_normalized min: -0.00817507899599, max: 29048.3692201, mean: 3754.6304, std: 7393.1224 Float nan 0 nan dlexDB
avg_cond_prob_in_bigrams min: 0.0, max: 0.5006180465, mean: 0.0313, std: 0.0466 Float The conditional probability of the bigram, given the occurrence of its first component. In other words, how likely it is for the second component to follow directly after the first. Here, this measure is computed on the basis of the annotated type information. 0 nan dlexDB
avg_cond_prob_in_trigrams min: 0.0, max: 25.0, mean: 0.2251, std: 0.8814 Float The conditional probability of the trigram, given the occurrence of its initial bigram. In other words, how likely it is for the third component to follow directly after the initial pair. Here, this measure is computed on the basis of the annotated type information. 0 nan dlexDB
neighbors_coltheart_higher_freq_cum_freq_normalized min: 0.0, max: 44055.247282, mean: 1276.643, std: 5775.4034 Float nan 0 nan dlexDB
neighbors_coltheart_higher_freq_count_normalized min: 0.0, max: 8.13363128109, mean: 0.1556, std: 0.4321 Float nan 0 nan dlexDB
neighbors_coltheart_all_cum_freq_normalized min: 0.0, max: 49782.1108458, mean: 2794.1781, std: 7982.6321 Float nan 0 nan dlexDB
neighbors_coltheart_all_count_normalized min: 0.0, max: 47.5175301158, mean: 9.0448, std: 12.679 Float nan 0 nan dlexDB
neighbors_levenshtein_higher_freq_cum_freq_normalized min: 0.0, max: 44055.247282, mean: 1683.6273, std: 6153.8504 Float nan 0 nan dlexDB
neighbors_levenshtein_higher_freq_count_normalized min: 0.0, max: 11.9864039932, mean: 0.2681, std: 0.5814 Float nan 0 nan dlexDB
neighbors_levenshtein_all_cum_freq_normalized min: 0.0, max: 54875.2749862, mean: 3761.4734, std: 9299.5647 Float nan 0 nan dlexDB
neighbors_levenshtein_all_count_normalized min: 0.0, max: 75.7711966712, mean: 14.1417, std: 19.6383 Float nan 0 nan dlexDB
sent_surprisal_gpt2-base min: 0.0005104430601932, max: 56.804420471191406, mean: 10.0061, std: 9.1114 Float Surprisal value extracted from a language model (GerPT2-base) with the sentence as context. 0 nan See script get_surprisal.py
text_surprisal_gpt2-base min: 0.0002225389762315, max: 53.041446685791016, mean: 8.0061, std: 8.0873 Float Surprisal value extracted from a language model (GerPT2-base) with the text as context. 0 nan See script get_surprisal.py
sent_surprisal_gpt2-large min: 0.0002048997703241, max: 42.28059005737305, mean: 8.76, std: 8.0159 Float Surprisal value extracted from a language model (GerPT2-large) with the sentence as context. 0 nan See script get_surprisal.py
text_surprisal_gpt2-large min: 0.0001027531252475, max: 35.38883209228516, mean: 6.6792, std: 6.6522 Float Surprisal value extracted from a language model (GerPT2-large) with the text as context. 0 nan See script get_surprisal.py
sent_surprisal_llama-7b min: 0.0001720042055239, max: 42.96158599853516, mean: 8.0373, std: 7.0611 Float Surprisal value extracted from a language model (LeoLM-7b) with the sentence as context. 0 nan See script get_surprisal.py
text_surprisal_llama-7b min: 1.990775308513548e-05, max: 35.62324142456055, mean: 4.7991, std: 4.9022 Float Surprisal value extracted from a language model (LeoLM-7b) with the text as context. 0 nan See script get_surprisal.py
sent_surprisal_llama-13b min: 8.702239938429557e-06, max: 46.25139999389648, mean: 7.7768, std: 7.1775 Float Surprisal value extracted from a language model (LeoLM-13b) with the sentence as context. 0 nan See script get_surprisal.py
text_surprisal_llama-13b min: 9.298280929215252e-06, max: 36.29869842529297, mean: 4.5172, std: 4.9048 Float Surprisal value extracted from a language model (LeoLM-13b) with the text as context. 0 nan See script get_surprisal.py
sent_surprisal_bert-base min: 1.1920928244535389e-07, max: 101.79562616348268, mean: 8.1926, std: 13.1873 Float Surprisal value extracted from a language model (BERT-base) with the sentence as context. 0 nan See script get_surprisal.py
text_surprisal_bert-base min: -0.0, max: 88.84420316047726, mean: 7.487, std: 12.7275 Float Surprisal value extracted from a language model (BERT-base) with the text as context. 0 nan See script get_surprisal.py
FFD min: 0, max: 2144, mean: 195.9741, std: 124.5597 Float First-fixation duration: duration of the first fixation on a word if this word is fixated in first-pass reading, otherwise 0. 0 nan compute_reading_measures.py
SFD min: 0, max: 2144, mean: 107.9483, std: 134.474 Float Single-fixation duration: duration of the only first-pass fixation on a word, 0 if the word was skipped or more than one fixation occurred in the first-pass (equals FFD in case of a single first-pass fixation). 0 nan compute_reading_measures.py
FD min: 0, max: 2144, mean: 226.9857, std: 103.7904 Float First duration: duration of the first fixation on a word (identical to FFD if not skipped in the first-pass). 0 nan compute_reading_measures.py
FPRT min: 0, max: 9649, mean: 408.9247, std: 526.0428 Float First-pass reading time: sum of the durations of all first-pass fixations on a word (0 if the word was skipped in the first-pass). 0 nan compute_reading_measures.py
FRT min: 0, max: 9649, mean: 456.8788, std: 518.1388 Float First-reading time: sum of the durations of all fixations from first fixating the word (regardless of whether the first fixation occurs in first-pass reading) until leaving the word for the first time (equals FPRT in case the word was fixated in the first-pass). 0 nan compute_reading_measures.py
TFT min: 0, max: 25314, mean: 1333.0163, std: 1428.494 Float Total-fixation time: sum of all fixations on a word (FPRT+RRT). 0 nan compute_reading_measures.py
TFC min: 0, max: 87, mean: 5.8238, std: 5.5152 Float The total fixation count on the word. 0 nan compute_reading_measures.py
RRT min: 0, max: 23902, mean: 924.0916, std: 1240.0587 Float Re-reading time: sum of the durations of all fixations on a word that do not belong to the first-pass (TFT-FPRT). 0 nan compute_reading_measures.py
RPD_inc min: 0, max: 318898, mean: 1076.7946, std: 5339.73 Float Inclusive regression-path duration: Sum of all fixation durations starting from the first first-pass fixation on a word until fixation on a word to the right of this word (including all regressive fixations on previous words), 0 if the word was not fixated in the first-pass (RPD_exc+RBRT). 0 nan compute_reading_measures.py
RPD_exc min: 0, max: 315640, mean: 557.5849, std: 5209.143 Float Exclusive regression-path duration: Sum of all fixation durations after initiating a first-pass regression from a word until fixating a word to the right of this word, without counting fixations on the word itself (RPD_inc-RBRT). 0 nan compute_reading_measures.py
RBRT min: 0, max: 10675, mean: 519.2098, std: 638.9024 Float Right-bounded reading time: Sum of all fixation durations on a word until a word to the right of this word is fixated (RPD_inc-RPD_exc). 0 nan compute_reading_measures.py
Fix 0: 110, 1: 404310 Categorical Fixation: 1 if the word was fixated, otherwise 0 (FPF or RR). 0 nan compute_reading_measures.py
FPF 0: 56838, 1: 347582 Categorical First-pass fixation: 1 if the word was fixated in the first-pass, otherwise 0. 0 nan compute_reading_measures.py
RR 0: 48241, 1: 356179 Categorical Re-reading: 1 if the word was fixated after the first-pass reading, otherwise 0 (sign(RRT)). 0 nan compute_reading_measures.py
FPReg 0: 308156, 1: 96264 Categorical First-pass regression: 1 if a regression was initiated in the first-pass reading of the word, otherwise 0 (sign(RPD_exc)). 0 nan compute_reading_measures.py
TRC_out min: 0, max: 15, mean: 0.8249, std: 1.193 Float Total count of outgoing regressions: total number of regressive saccades initiated from this word. 0 nan compute_reading_measures.py
TRC_in min: 0, max: 12, mean: 0.7776, std: 1.1734 Float Total count of incoming regressions: total number of regressive saccades landing on this word. 0 nan compute_reading_measures.py
LP min: 1, max: 28, mean: 3.3887, std: 2.3225 Float Landing position: position of the first saccade on the word expressed by ordinal position of the fixated character. 0 nan compute_reading_measures.py
SL_in min: -162, max: 156, mean: 1.3449, std: 2.928 Float Incoming saccade length: length of the saccade that leads to first fixation on a word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression. 0 nan compute_reading_measures.py
SL_out min: -179, max: 63, mean: -0.0835, std: 7.9375 Float Outgoing saccade length: length of the first saccade that leaves the word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression; 0 if the word is never fixated. 0 nan compute_reading_measures.py
mean_acc_tq min: 0.0, max: 0.9991603694374476, mean: 0.3819, std: 0.3148 Float The mean accuracy of all background questions for one text read by one reader. 0 nan nan
mean_acc_bq min: 0.0, max: 0.999250936329588, mean: 0.6398, std: 0.312 Float The mean accuracy of all text questions for one text read by one reader. 0 nan nan
gender_numeric 0.0: 187536, 1.0: 212874, nan: 4010 Categorical Numerical value of gender; 0=male, 1=female. 4010 nan nan
age min: 18.0, max: 41.0, mean: 24.0283, std: 4.1436 Float Reader's age. 8459 nan demographic questionnaire
discipline_level_of_studies_numeric 0: 89325, 1: 133833, 2: 65008, 3: 116254 Categorical Numerical value of discipline_level_of_studies; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert. 0 nan demographic questionnaire
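
Several reading measures in the table above are defined in terms of each other (for example TFT as FPRT+RRT, or RR as sign(RRT)). The sketch below checks those stated relations for one row. The helper names and the example values are hypothetical, and the actual computation lives in compute_reading_measures.py, which may handle edge cases differently.

```python
def sign(value: float) -> int:
    """Return 1 for a positive value, otherwise 0 (the measures are non-negative)."""
    return int(value > 0)


def violated_identities(row: dict) -> list:
    """Return the names of the codebook relations that do not hold for this row."""
    checks = {
        "TFT = FPRT + RRT": row["TFT"] == row["FPRT"] + row["RRT"],
        "RPD_inc = RPD_exc + RBRT": row["RPD_inc"] == row["RPD_exc"] + row["RBRT"],
        "RR = sign(RRT)": row["RR"] == sign(row["RRT"]),
        "FPReg = sign(RPD_exc)": row["FPReg"] == sign(row["RPD_exc"]),
        "Fix = FPF or RR": row["Fix"] == int(bool(row["FPF"]) or bool(row["RR"])),
    }
    return [name for name, holds in checks.items() if not holds]


# Hypothetical values for a single word, not taken from the corpus.
example = {
    "FPRT": 200, "RRT": 150, "TFT": 350,
    "RBRT": 200, "RPD_exc": 0, "RPD_inc": 200,
    "Fix": 1, "FPF": 1, "RR": 1, "FPReg": 0,
}
print(violated_identities(example))
```

Running such a check over every row is one way to spot rows where a merge or a manual correction left the measures out of sync.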

AOI to word mapping

Contains the mapping of each aoi to the respective word in each of the texts.

Please find the file at this link: aoi to word mapping

| Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
| --- | --- | --- | --- | --- | --- | --- |
| text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
| word_index_in_text | 1-180 | Integer | The index of the word in the text. Indexing starts at 1. | 0 | nan | nan |
| char_index_in_text | 1-1121 | Integer | Index of a character in the text. Indexing starts at 1. | 0 | nan | nan |
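
Because the `aoi` column in the scanpath files is a character index, this mapping can be used to look up the word a fixation landed on. The following is a minimal sketch: the function name is hypothetical, and the three sample rows are invented for illustration rather than copied from the mapping file.

```python
def build_aoi_lookup(rows):
    """Map (text_id, char_index_in_text) to word_index_in_text."""
    lookup = {}
    for row in rows:
        key = (row["text_id"], int(row["char_index_in_text"]))
        lookup[key] = int(row["word_index_in_text"])
    return lookup


# Hypothetical rows using the three columns of the mapping file.
sample_rows = [
    {"text_id": "b0", "word_index_in_text": "1", "char_index_in_text": "1"},
    {"text_id": "b0", "word_index_in_text": "1", "char_index_in_text": "2"},
    {"text_id": "b0", "word_index_in_text": "2", "char_index_in_text": "4"},
]

lookup = build_aoi_lookup(sample_rows)
# Word index for a fixation on character 4 of text b0.
print(lookup[("b0", 4)])
# Characters without an entry (if any exist) yield None here.
print(lookup.get(("b0", 3)))
```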

Participants

In the participants' data file, all demographic information is stored.

Please find the file at this link: Participant information

Column name Possible values Value type Description Num missing values Missing value description Source
reader_id 0-105 Integer The unique identifier given to each reader. Reader IDs start at 0. 0 nan Manually created
reader_discipline biology: 43, physics: 32 Categorical The area of expertise of the reader. All readers are students whose major is either physics or biology. 0 nan demographic questionnaire
reader_discipline_numeric 0: 43, 1: 32 Categorical Numerical encoding of the reader discipline; 0=biology, 1=physics. 0 nan Manually created
level_of_studies graduate: 47, undergraduate: 28 Categorical Reader's level of studies. Readers are considered experts if they are either MSc or PhD students. 1st semester BSc students are considered beginners. 0 nan demographic questionnaire
level_of_studies_numeric 0: 28, 1: 47 Categorical Numerical value of level_of_studies; 0=beginner, 1=expert. 0 nan demographic questionnaire
discipline_level_of_studies biology-graduate: 27, biology-undergraduate: 16, physics-graduate: 20, physics-undergraduate: 12 Categorical The combination of the readers' major (reader_discipline) and their expertise (level_of_studies). 0 nan demographic questionnaire
discipline_level_of_studies_numeric 0: 16, 1: 27, 2: 12, 3: 20 Categorical Numerical value of discipline_level_of_studies; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert. 0 nan demographic questionnaire
glasses no: 54, yes: 20, nan: 1 Categorical Whether or not reader is wearing glasses. 1 nan demographic questionnaire
age min: 18.0, max: 41.0, mean: 24.1644, std: 4.2098 Float Reader's age. 2 nan demographic questionnaire
handedness right: 68, left: 6, nan: 1 Categorical Reader's handedness. 1 nan demographic questionnaire
hours_sleep min: 0.0, max: 11.0, mean: 7.2095, std: 1.3138 Float The hours of sleep of the participant before the experiment. 1 nan demographic questionnaire
alcohol no: 71, yes: 3, nan: 1 Categorical Whether or not a participant consumed alcohol within 24 hours before the experiment start. 1 nan demographic questionnaire
gender female: 39, male: 35, nan: 1 Categorical Reader's gender. 1 nan demographic questionnaire
gender_numeric 0.0: 35, 1.0: 39, nan: 1 Categorical Numerical value of gender; 0=male, 1=female. 1 nan nan
semester string The semester the reader is currently enrolled in. 1 nan demographic questionnaire
bilingual n: 73, j: 1, nan: 1 Categorical Whether the reader is bilingual. 1 nan demographic questionnaire
state string The German state the reader is from. 1 nan demographic questionnaire
grade string The grade of the reader in their university entrance diploma. 4 nan demographic questionnaire
subject_detailed The detailed subject of the reader's major. 1 nan demographic questionnaire
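
The group sizes in this table determine the row counts of the per-trial files. The sketch below recomputes them under the assumption that every reader contributes one row for each of the 12 texts (6 biology, 6 physics); the variable names are hypothetical, while the group sizes are copied from the discipline_level_of_studies row above.

```python
# Group sizes from the discipline_level_of_studies row.
group_sizes = {
    "biology-undergraduate": 16,
    "biology-graduate": 27,
    "physics-undergraduate": 12,
    "physics-graduate": 20,
}

# Numeric codes from discipline_level_of_studies_numeric
# (undergraduate corresponds to beginner, graduate to expert).
numeric_codes = {
    "biology-undergraduate": 0,
    "biology-graduate": 1,
    "physics-undergraduate": 2,
    "physics-graduate": 3,
}

texts_per_domain = 6  # b0-b5 and p0-p5
n_readers = sum(group_sizes.values())
n_trials = n_readers * 2 * texts_per_domain

# Only graduates reading texts from their own domain count as expert reading.
n_expert_readings = (
    group_sizes["biology-graduate"] + group_sizes["physics-graduate"]
) * texts_per_domain

print(n_readers, n_trials, n_expert_readings)
```

Under that assumption the totals agree with the counts listed for reader_discipline and expert_reading_label in the participants' response accuracy table.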

Participants' response accuracy

The response accuracy for each participant for each question.

Please find the file at this link: Participant response accuracy

Column name Possible values Value type Description Num missing values Missing value description Source
reader_id 0-105 Integer The unique identifier given to each reader. Reader IDs start at 0. 0 nan Manually created
reader_discipline biology: 516, physics: 384 Categorical The area of expertise of the reader. All readers are students whose major is either physics or biology. 0 nan demographic questionnaire
reader_discipline_numeric 0: 516, 1: 384 Categorical Numerical encoding of the reader discipline; 0=biology, 1=physics. 0 nan Manually created
level_of_studies graduate: 564, undergraduate: 336 Categorical Reader's level of studies. Readers are considered experts if they are either MSc or PhD students. 1st semester BSc students are considered beginners. 0 nan demographic questionnaire
level_of_studies_numeric 0: 336, 1: 564 Categorical Numerical value of level_of_studies; 0=beginner, 1=expert. 0 nan demographic questionnaire
text_id b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 Unique identifier given to each stimulus text. 0 nan nan
text_domain biology: 450, physics: 450 Categorical The domain of the stimulus text. 0 nan Manually tagged
expert_reading_label expert-reading: 282, non-expert-reading: 618 Categorical Whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert) 0 nan Manually tagged
expert_reading_label_numeric 0: 618, 1: 282 Categorical Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert). 1=expert_reading, 0=non-expert_reading 0 nan Manually tagged
acc_tq_1 min: 0.0, max: 1.0, mean: 0.6475, std: 0.478 Float The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1. 12 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_2 min: 0.0, max: 1.0, mean: 0.6441, std: 0.479 Float The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1. 12 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_3 min: 0.0, max: 1.0, mean: 0.6509, std: 0.477 Float The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1. 12 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_bq_1 min: 0.0, max: 1.0, mean: 0.393, std: 0.4887 Float The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1. 12 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_bq_2 min: 0.0, max: 1.0, mean: 0.366, std: 0.482 Float The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1. 12 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_bq_3 min: 0.0, max: 1.0, mean: 0.4234, std: 0.4944 Float The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1. 12 For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
mean_acc_tq min: 0.0, max: 1.0, mean: 0.6475, std: 0.3082 Float The mean accuracy of all text questions for one text read by one reader. 12 nan nan
mean_acc_bq min: 0.0, max: 1.0, mean: 0.3941, std: 0.3163 Float The mean accuracy of all background questions for one text read by one reader. 12 nan nan
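
The mean accuracies summarise the three per-question values of a trial. A minimal sketch of that aggregation follows; the function name is hypothetical, and returning NaN when any question is missing is an assumption (it is consistent with the identical missing-value counts above, but the original script may treat missing values differently).

```python
import math


def mean_accuracy(accuracies):
    """Mean of the per-question accuracies (each 0 or 1) for one trial.

    Assumption: if any question is missing (None or NaN), the mean is NaN.
    """
    values = list(accuracies)
    if any(v is None or math.isnan(v) for v in values):
        return math.nan
    return sum(values) / len(values)


# Two of three text questions answered correctly.
print(mean_accuracy([1.0, 0.0, 1.0]))
# A trial with a missing measurement.
print(mean_accuracy([1.0, math.nan, 0.0]))
```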

Coding of the answers of the online survey

This file explains the values used in the online survey answer file (response_data_online_survey.csv). Each variable has four different options, each of which is expressed as a numerical value and mapped to the text option the participant saw.

Please find the file at this link: Answer coding online survey

| Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
| --- | --- | --- | --- | --- | --- | --- |
| VAR | | string | Variable name of the fields in the participant online survey. These are explanations of the names of the columns in the file response_data_online_survey.csv. | 0 | nan | online survey tool |
| RESPONSE | -9: 46, 0: 2, 1: 95, 2: 92, 3: 93, 4: 54, 5: 13, 6: 13, 7: 12, 8: 12, 9: 12, 10: 12, 11: 12, 12: 12 | Categorical | The response code given by the online survey tool. These codes are used in the answer file. | 0 | nan | online survey tool |
| MEANING | | | The literal meaning of the response, i.e. what the participant could see in the online survey. | 0 | nan | online survey tool |
| CORRECT_ANSWER | nan: 312, False: 126, True: 42 | Categorical | Whether or not this answer was a correct answer. | 312 | The value is missing if correctness is not applicable, e.g. if the code means that the participant did not answer. | online survey tool |

Response accuracy online survey

This file contains the response accuracy for the participants from the online survey.

Please find the file at this link: Response accuracy

| Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
| --- | --- | --- | --- | --- | --- | --- |
| text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
| text_domain | biology: 210, physics: 210 | Categorical | The domain of the stimulus text. | 0 | nan | Manually tagged |
| mean_acc_tq | min: 0.0, max: 1.0, mean: 0.2619, std: 0.2495 | Float | The mean accuracy of all background questions for one text read by one reader. | 0 | nan | nan |
| reader_discipline | biology: 108, other: 156, physics: 156 | Categorical | The area of expertise of the reader. All readers are students whose major is either physics or biology. | 0 | nan | demographic questionnaire |
| level_of_studies | graduate: 264, other: 156 | Categorical | Reader's level of studies. Readers are considered experts if they are either MSc or PhD students. 1st semester BSc students are considered beginners. | 0 | nan | demographic questionnaire |

Response data online survey

The original response data from the online survey. The coding of the values contained here is found in the answer_coding_online_survey.csv file, which is why the table below is empty. Please note that there are many values in this file which are not relevant for this corpus. E.g., all columns starting with RA specify the randomization and all columns starting with TIME contain response time information.

Please find the file at this link: Response data online survey

Column name Possible values Value type Description Num missing values Missing value description Source
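
Because the response file stores only numeric codes, reading it requires a lookup into the answer coding file. The following is a minimal sketch of such a lookup using only the Python standard library. The column names VAR, RESPONSE, MEANING and CORRECT_ANSWER come from the codebook; the variable name `Q01`, the MEANING texts, the comma delimiter and the empty string for missing values are assumptions made for illustration.

```python
import csv
import io

# Hypothetical excerpt in the layout of answer_coding_online_survey.csv.
# "Q01" and the MEANING texts are invented; the real file may differ in
# delimiter and in how missing CORRECT_ANSWER values are written.
CODING_CSV = """VAR,RESPONSE,MEANING,CORRECT_ANSWER
Q01,-9,not answered,
Q01,1,option A,False
Q01,2,option B,True
"""


def load_coding(csv_text):
    """Build a lookup {(VAR, RESPONSE): (MEANING, CORRECT_ANSWER or None)}."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row["CORRECT_ANSWER"]
        # An empty field is treated as "not applicable" (assumption).
        correct = None if raw == "" else (raw == "True")
        lookup[(row["VAR"], int(row["RESPONSE"]))] = (row["MEANING"], correct)
    return lookup


def decode(lookup, var, response):
    """Translate one numeric response code into its meaning and correctness."""
    return lookup[(var, response)]


coding = load_coding(CODING_CSV)
print(decode(coding, "Q01", 2))
```

Keying the lookup on the pair (VAR, RESPONSE) matters because the same numeric code can carry a different meaning for each survey variable.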

Merged: scanpaths, participant info, reading measures and word features

Contains the scanpaths for each trial merged with information on the reader, texts, etc.

Please find the files at this link: Scanpaths merged

Column name Possible values Value type Description Num missing values Missing value description Source
fixation_index 1-1469 Integer The index of the fixation in temporal order. 0 nan SR Research data viewer
text_domain bio: 4682, biology: 200017, physics: 199721 Categorical The domain of the stimulus text. 0 nan Manually tagged
trial 1-12 Integer Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text. 0 nan nan
acc_bq_1 min: 0.0, max: 1.0, mean: 0.3869, std: 0.487 Float The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1. 5785 For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_bq_2 min: 0.0, max: 1.0, mean: 0.3564, std: 0.4789 Float The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1. 5785 For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_bq_3 min: 0.0, max: 1.0, mean: 0.4217, std: 0.4938 Float The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1. 5785 For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_1 min: 0.0, max: 1.0, mean: 0.6625, std: 0.4729 Float The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1. 5785 For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_2 min: 0.0, max: 1.0, mean: 0.6326, std: 0.4821 Float The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1. 5785 For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_3 min: 0.0, max: 1.0, mean: 0.6564, std: 0.4749 Float The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1. 5785 For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
fixation_duration 2-4474 Integer The duration of the fixation in milliseconds. 0 nan SR Research data viewer
next_saccade_duration 1.0-9491.0 Integer The duration of the saccade that follows a fixation in milliseconds. 46 nan SR Research data viewer
previous_saccade_duration 1.0-9491.0 Integer The duration of the saccade that precedes a fixation in milliseconds. 515 nan SR Research data viewer
version 0-105 Integer Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in the items.tsv. 0 nan nan
line 1-12 Integer The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1. 0 nan nan
aoi 1-1121 Integer The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated. 0 nan SR Research experiment builder
char_index_in_line 1-100 Integer Index of a character in the line. Indexing starts at 1. 0 nan nan
original_fixation_index 1-1478 Integer The index of the uncorrected fixation. 0 nan SR Research data viewer
is_fixation_adjusted False: 382202, True: 22218 Categorical Whether or not the fixation has been adjusted manually. 0 nan Manually tagged.
reader_id 0-105 Integer The unique identifier given to each reader. Reader IDs start at 0. 0 nan Manually created
text_id b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 Unique identifier given to each stimulus text. 0 nan nan
word_index_in_text 1-180 Integer The index of the word in the text. Indexing starts at 1. 0 nan nan
sent_index_in_text 1-12 Integer The index of a sentence in the respective text. Indexing starts at 1. 0 nan nan
char_index_in_text 1-1121 Integer Index of a character in the text. Indexing starts at 1. 0 nan nan
word string Words as they appear in the stimuli texts. Words are split at white-space. 0 nan nan
character string Character as text. 0 nan nan
text_id_numeric 0-11 Integer Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5 0 nan Manually created
text_domain_numeric 0: 204699, 1: 199721 Categorical Numerical value of text_domain; 0=biology, 1=physics. 0 nan Manually created
reader_discipline_numeric 0: 223158, 1: 181262 Categorical Numerical encoding of the reader discipline; 0=biology, 1=physics. 0 nan Manually created
level_of_studies_numeric 0: 154333, 1: 250087 Categorical Numerical value of level_of_studies; 0=beginner, 1=expert. 0 nan demographic questionnaire
expert_reading_label_numeric 0: 290883, 1: 113537 Categorical Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert). 1=expert_reading, 0=non-expert_reading 0 nan Manually tagged
expert_reading_label expert_reading: 113537, non-expert_reading: 290883 Categorical Whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert) 0 nan Manually tagged
word_with_punct string The word as it appears in the text, including punctuation. 96 nan nan
word_index_in_sent 1-51 Integer The index of the word in the sentence. Indexing starts at 1. 0 nan nan
word_length 2-33 Integer Word length is defined in number of characters including symbols like hyphens but without sentence punctuation at the end (e.g., z.B. = 4 characters; DNA-Kette = 9 characters; eats. = 4 characters). 0 nan nan
STTS_punctuation_before 0.0: 211108, 0: 189407, $(: 3905 Categorical If a word is preceded by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here. 0 nan Manually tagged
STTS_punctuation_after $(: 3260, $($,: 573, $,: 22559, $.: 25794, 0: 352234 Categorical If a word is followed by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here. 0 nan Manually tagged
is_in_quote 0: 399715, 1: 4705 Categorical Whether or not the word is part of an expression in quotes. 0 nan Manually tagged
is_in_parentheses 0: 403155, 1: 1265 Categorical Whether or not the word is part of a phrase in parentheses. 0 nan Manually tagged
is_clause_beginning 0: 388232, 1: 16188 Categorical Whether or not the word is the beginning of a clause. 0 nan Manually tagged
is_sent_beginning 0: 386681, 1: 17739 Categorical Whether or not the word is the beginning of a new sentence. 0 nan Manually tagged
is_clause_end 0: 381545, 1: 22875 Categorical Whether or not the word is the end of a clause. 0 nan Manually tagged
is_sent_end 0: 380027, 1: 24393 Categorical Whether or not the word is the end of a sentence. 0 nan Manually tagged
is_abbreviation 0: 403478, 1: 942 Categorical Whether or not the entire word is an abbreviation. 0 nan Manually tagged
is_expert_technical_term 0: 332354, 1: 72066 Categorical 1 if the word is a technical term that is not generally understandable. E.g.: "Agarose". 0 nan Manually tagged
is_general_technical_term 0: 325333, 1: 79087 Categorical 1 if the word is a technical term that is generally understandable. E.g.: "elektrisch" 0 nan Manually tagged
contains_symbol 0: 400458, 1: 3962 Categorical Whether or not the word contains a symbol. E.g.: β-D-Glucose 0 nan Manually tagged
contains_hyphen 0: 388149, 1: 16271 Categorical Whether or not the word contains a hyphen, e.g. 1 for DNA-Fragment. Words tagged TRUNC (compositional first elements) do not count: in "Sekundär- und Tertiärstrukturen", "Sekundär-" is not counted as having a hyphen. 0 nan Manually tagged
contains_abbreviation 0: 399423, 1: 4997 Categorical Whether or not the word contains an abbreviation. 0 for words that are only an abbreviation. See is_abbreviation. E.g. 1 for DNA-Fragment, 0 for DNA. 0 nan Manually tagged
STTS_PoS_tag ADJA: 51041, ADJD: 12714, ADV: 12236, APPR: 22470, APPRART: 5566, APZR: 91, ART: 37340, CARD: 1594, KOKOM: 2428, KON: 5798, KOUI: 654, KOUS: 2521, NE: 955, NN: 162980, PAV: 3444, PDAT: 3292, PDS: 1374, PIAT: 791, PIDAT: 1653, PIS: 1322, PPER: 2511, PPOSAT: 1360, PRELAT: 1302, PRELS: 4193, PRF: 3606, PTKA: 97, PTKNEG: 687, PTKVZ: 1490, PTKZU: 583, PWAV: 76, TRUNC: 1137, VAFIN: 10340, VAINF: 1206, VMFIN: 3953, VMINF: 153, VVFIN: 23854, VVINF: 7713, VVIZU: 578, VVPP: 9317 Categorical Part-of-speech tags according to the STTS-tagset. See stimuli/ANNOTATION.MD for more information. 0 nan Manually tagged
type string The orthographical representation of a word as found in the corpus; this data is case sensitive, i.e. there is a distinction between name and Name. 0 nan dlexDB
type_length_chars 0.0-33.0 Integer The length of the type of a word in characters. See the description of word_length for a definition of how characters are counted. 0 nan nan
PoS_tag adja: 53330, adjd: 12226, adv: 15728, appr: 22193, apprart: 5566, art: 37918, card: 1594, kokom: 2428, kon: 5405, koui: 559, kous: 2521, ne: 1386, nn: 160585, pdat: 3292, pds: 1374, piat: 791, pidat: 352, pis: 2063, pper: 2434, pposat: 1360, prelat: 1302, prels: 4076, prf: 3606, ptka: 97, ptkneg: 687, ptkvz: 1891, ptkzu: 583, pwav: 76, trunc: 1137, vafin: 10340, vainf: 1206, vmfin: 3829, vminf: 153, vvfin: 23978, vvinf: 7713, vvizu: 578, vvpp: 9317, xy: 746 Categorical Part-of-speech tag as defined by the dlexDB query. 0 nan dlexDB
lemma string nan 0 nan dlexDB
lemma_length_chars 0.0-32.0 Integer nan 0 nan dlexDB
syllables string nan 0 nan dlexDB
type_length_syllables 0.0-14.0 Integer nan 0 nan dlexDB
annotated_type_frequency_normalized min: 0.0, max: 24738.5901996, mean: 1950.9055, std: 5185.3006 Float The number of occurrences of an annotated type in corpus. An annotated type is a unique combination of a type, its part-of-speech tag and its lemma. 0 nan dlexDB
type_frequency_normalized min: 0.0, max: 26530.3631386, mean: 2247.4523, std: 5847.2187 Float nan 0 nan dlexDB
lemma_frequency_normalized min: 0.0, max: 80100.3069113, mean: 7203.2409, std: 19769.4428 Float nan 0 nan dlexDB
familiarity_normalized min: 0.0, max: 26530.3631386, mean: 2191.7786, std: 5759.2592 Float nan 0 nan dlexDB
regularity_normalized min: 0.0, max: 2123.30585022, mean: 46.8657, std: 137.5046 Float nan 0 nan dlexDB
document_frequency_normalized min: 0.0, max: 9372.80956103, mean: 1684.1043, std: 2829.0626 Float nan 0 nan dlexDB
sentence_frequency_normalized min: 0.0, max: 30912.3596552, mean: 3137.4539, std: 7374.8037 Float nan 0 nan dlexDB
cumulative_syllable_corpus_frequency_normalized min: 0.0, max: 125126.524676, mean: 15768.7784, std: 17529.5528 Float nan 0 nan dlexDB
cumulative_syllable_lexicon_frequency_normalized min: 0.0, max: 218985.607753, mean: 27232.3183, std: 36883.9628 Float nan 0 nan dlexDB
cumulative_character_corpus_frequency_normalized min: 0.0, max: 7810554.20193, mean: 2053804.334, std: 1596380.3916 Float nan 0 nan dlexDB
cumulative_character_lexicon_frequency_normalized min: 0.0, max: 18380479.713, mean: 4612580.9638, std: 3597155.0404 Float nan 0 nan dlexDB
cumulative_character_bigram_corpus_frequency_normalized min: 0.0, max: 1322150.62097, mean: 356831.454, std: 269772.388 Float nan 0 nan dlexDB
cumulative_character_bigram_lexicon_frequency_normalized min: 0.0, max: 2788357.77704, mean: 629626.1651, std: 539088.9742 Float nan 0 nan dlexDB
cumulative_character_trigram_corpus_frequency_normalized min: 0.0, max: 603427.130456, mean: 200341.8076, std: 144122.7012 Float nan 0 nan dlexDB
cumulative_character_trigram_lexicon_frequency_normalized min: 0.0, max: 899592.89035, mean: 236423.2776, std: 199573.1416 Float nan 0 nan dlexDB
initial_letter_frequency_normalized min: 0.0, max: 110461.430317, mean: 28045.0077, std: 30618.9167 Float nan 0 nan dlexDB
initial_bigram_frequency_normalized min: 0.0, max: 53801.2331077, mean: 8706.0335, std: 12743.2638 Float nan 0 nan dlexDB
initial_trigram_frequency_normalized min: -0.00817507899599, max: 29048.3692201, mean: 3754.6304, std: 7393.1224 Float nan 0 nan dlexDB
avg_cond_prob_in_bigrams min: 0.0, max: 0.5006180465, mean: 0.0313, std: 0.0466 Float The conditional probability of the bigram, given the occurrence of its first component. In other words, how likely it is for the second component to follow directly after the first. Here, this measure is computed on the basis of the annotated type information. 0 nan dlexDB
avg_cond_prob_in_trigrams min: 0.0, max: 25.0, mean: 0.2251, std: 0.8814 Float The conditional probability of the trigram, given the occurrence of its initial bigram. In other words, how likely it is for the third component to follow directly after the initial pair. Here, this measure is computed on the basis of the annotated type information. 0 nan dlexDB
neighbors_coltheart_higher_freq_cum_freq_normalized min: 0.0, max: 44055.247282, mean: 1276.643, std: 5775.4034 Float nan 0 nan dlexDB
neighbors_coltheart_higher_freq_count_normalized min: 0.0, max: 8.13363128109, mean: 0.1556, std: 0.4321 Float nan 0 nan dlexDB
neighbors_coltheart_all_cum_freq_normalized min: 0.0, max: 49782.1108458, mean: 2794.1781, std: 7982.6321 Float nan 0 nan dlexDB
neighbors_coltheart_all_count_normalized min: 0.0, max: 47.5175301158, mean: 9.0448, std: 12.679 Float nan 0 nan dlexDB
neighbors_levenshtein_higher_freq_cum_freq_normalized min: 0.0, max: 44055.247282, mean: 1683.6273, std: 6153.8504 Float nan 0 nan dlexDB
neighbors_levenshtein_higher_freq_count_normalized min: 0.0, max: 11.9864039932, mean: 0.2681, std: 0.5814 Float nan 0 nan dlexDB
neighbors_levenshtein_all_cum_freq_normalized min: 0.0, max: 54875.2749862, mean: 3761.4734, std: 9299.5647 Float nan 0 nan dlexDB
neighbors_levenshtein_all_count_normalized min: 0.0, max: 75.7711966712, mean: 14.1417, std: 19.6383 Float nan 0 nan dlexDB
sent_surprisal_gpt2-base min: 0.0005104430601932, max: 56.804420471191406, mean: 10.0061, std: 9.1114 Float Surprisal value extracted from a language model (GerPT2-base) with the sentence as context. 0 nan See script get_surprisal.py
text_surprisal_gpt2-base min: 0.0002225389762315, max: 53.041446685791016, mean: 8.0061, std: 8.0873 Float Surprisal value extracted from a language model (GerPT2-base) with the text as context. 0 nan See script get_surprisal.py
sent_surprisal_gpt2-large min: 0.0002048997703241, max: 42.28059005737305, mean: 8.76, std: 8.0159 Float Surprisal value extracted from a language model (GerPT2-large) with the sentence as context. 0 nan See script get_surprisal.py
text_surprisal_gpt2-large min: 0.0001027531252475, max: 35.38883209228516, mean: 6.6792, std: 6.6522 Float Surprisal value extracted from a language model (GerPT2-large) with the text as context. 0 nan See script get_surprisal.py
sent_surprisal_llama-7b min: 0.0001720042055239, max: 42.96158599853516, mean: 8.0373, std: 7.0611 Float Surprisal value extracted from a language model (LeoLM-7b) with the sentence as context. 0 nan See script get_surprisal.py
text_surprisal_llama-7b min: 1.990775308513548e-05, max: 35.62324142456055, mean: 4.7991, std: 4.9022 Float Surprisal value extracted from a language model (LeoLM-7b) with the text as context. 0 nan See script get_surprisal.py
sent_surprisal_llama-13b min: 8.702239938429557e-06, max: 46.25139999389648, mean: 7.7768, std: 7.1775 Float Surprisal value extracted from a language model (LeoLM-13b) with the sentence as context. 0 nan See script get_surprisal.py
text_surprisal_llama-13b min: 9.298280929215252e-06, max: 36.29869842529297, mean: 4.5172, std: 4.9048 Float Surprisal value extracted from a language model (LeoLM-13b) with the text as context. 0 nan See script get_surprisal.py
sent_surprisal_bert-base min: 1.1920928244535389e-07, max: 101.79562616348268, mean: 8.1926, std: 13.1873 Float Surprisal value extracted from a language model (BERT-base) with the sentence as context. 0 nan See script get_surprisal.py
text_surprisal_bert-base min: -0.0, max: 88.84420316047726, mean: 7.487, std: 12.7275 Float Surprisal value extracted from a language model (BERT-base) with the text as context. 0 nan See script get_surprisal.py
FFD min: 0, max: 2144, mean: 195.9741, std: 124.5597 Float First-fixation duration: duration of the first fixation on a word if this word is fixated in first-pass reading, otherwise 0. 0 nan compute_reading_measures.py
SFD min: 0, max: 2144, mean: 107.9483, std: 134.474 Float Single-fixation duration: duration of the only first-pass fixation on a word, 0 if the word was skipped or more than one fixation occurred in the first-pass (equals FFD in case of a single first-pass fixation). 0 nan compute_reading_measures.py
FD min: 0, max: 2144, mean: 226.9857, std: 103.7904 Float First duration: duration of the first fixation on a word (identical to FFD if not skipped in the first-pass). 0 nan compute_reading_measures.py
FPRT min: 0, max: 9649, mean: 408.9247, std: 526.0428 Float First-pass reading time: sum of the durations of all first-pass fixations on a word (0 if the word was skipped in the first-pass). 0 nan compute_reading_measures.py
FRT min: 0, max: 9649, mean: 456.8788, std: 518.1388 Float First-reading time: sum of the durations of all fixations from first fixating the word (regardless of whether that first fixation occurs in first-pass reading) until leaving the word for the first time (equals FPRT if the word was fixated in the first-pass). 0 nan compute_reading_measures.py
TFT min: 0, max: 25314, mean: 1333.0163, std: 1428.494 Float Total-fixation time: sum of all fixations on a word (FPRT+RRT). 0 nan compute_reading_measures.py
TFC min: 0, max: 87, mean: 5.8238, std: 5.5152 Float The total fixation count on the word. 0 nan compute_reading_measures.py
RRT min: 0, max: 23902, mean: 924.0916, std: 1240.0587 Float Re-reading time: sum of the durations of all fixations on a word that do not belong to the first-pass (TFT-FPRT). 0 nan compute_reading_measures.py
RPD_inc min: 0, max: 318898, mean: 1076.7946, std: 5339.73 Float Inclusive regression-path duration: Sum of all fixation durations starting from the first first-pass fixation on a word until fixation on a word to the right of this word (including all regressive fixations on previous words), 0 if the word was not fixated in the first-pass (RPD_exc+RBRT). 0 nan compute_reading_measures.py
RPD_exc min: 0, max: 315640, mean: 557.5849, std: 5209.143 Float Exclusive regression-path duration: Sum of all fixation durations after initiating a first-pass regression from a word until fixating a word to the right of this word, without counting fixations on the word itself (RPD_inc-RBRT). 0 nan compute_reading_measures.py
RBRT min: 0, max: 10675, mean: 519.2098, std: 638.9024 Float Right-bounded reading time: Sum of all fixation durations on a word until a word to the right of this word is fixated (RPD_inc-RPD_exc). 0 nan compute_reading_measures.py
Fix 0: 110, 1: 404310 Categorical Fixation: 1 if the word was fixated, otherwise 0 (FPF or RR). 0 nan compute_reading_measures.py
FPF 0: 56838, 1: 347582 Categorical First-pass fixation: 1 if the word was fixated in the first-pass, otherwise 0. 0 nan compute_reading_measures.py
RR 0: 48241, 1: 356179 Categorical Re-reading: 1 if the word was fixated after the first-pass reading, otherwise 0 (sign(RRT)). 0 nan compute_reading_measures.py
FPReg 0: 308156, 1: 96264 Categorical First-pass regression: 1 if a regression was initiated in the first-pass reading of the word, otherwise 0 (sign(RPD_exc)). 0 nan compute_reading_measures.py
TRC_out min: 0, max: 15, mean: 0.8249, std: 1.193 Float Total count of outgoing regressions: total number of regressive saccades initiated from this word. 0 nan compute_reading_measures.py
TRC_in min: 0, max: 12, mean: 0.7776, std: 1.1734 Float Total count of incoming regressions: total number of regressive saccades landing on this word. 0 nan compute_reading_measures.py
LP min: 1, max: 28, mean: 3.3887, std: 2.3225 Float Landing position: position of the first saccade on the word expressed by ordinal position of the fixated character. 0 nan compute_reading_measures.py
SL_in min: -162, max: 156, mean: 1.3449, std: 2.928 Float Incoming saccade length: length of the saccade that leads to first fixation on a word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression. 0 nan compute_reading_measures.py
SL_out min: -179, max: 63, mean: -0.0835, std: 7.9375 Float Outgoing saccade length: length of the first saccade that leaves the word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression; 0 if the word is never fixated. 0 nan compute_reading_measures.py
mean_acc_tq min: 0.0, max: 0.9991603694374476, mean: 0.3819, std: 0.3148 Float The mean accuracy of all background questions for one text read by one reader. 0 nan nan
mean_acc_bq min: 0.0, max: 0.999250936329588, mean: 0.6398, std: 0.312 Float The mean accuracy of all text questions for one text read by one reader. 0 nan nan
gender_numeric 0.0: 187536, 1.0: 212874, nan: 4010 Categorical Numerical value of gender; 0=male, 1=female. 4010 nan nan
age min: 18.0, max: 41.0, mean: 24.0283, std: 4.1436 Float Reader's age. 8459 nan demographic questionnaire
discipline_level_of_studies_numeric 0: 89325, 1: 133833, 2: 65008, 3: 116254 Categorical Numerical value of discipline_level_of_studies; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert. 0 nan demographic questionnaire
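
Several reading measures in the table above are described in terms of each other (for example TFT as FPRT+RRT, and RBRT as RPD_inc-RPD_exc). The sketch below restates those relations as a small consistency check on a single row. The example row is invented, the helper names are hypothetical, and whether every row of the real file satisfies the relations exactly has not been verified here.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def violated_identities(row):
    """Return the names of the codebook relations that this row violates."""
    checks = {
        "TFT = FPRT + RRT": row["TFT"] == row["FPRT"] + row["RRT"],
        "RPD_inc = RPD_exc + RBRT": row["RPD_inc"] == row["RPD_exc"] + row["RBRT"],
        "Fix = FPF or RR": row["Fix"] == int(bool(row["FPF"] or row["RR"])),
        "RR = sign(RRT)": row["RR"] == sign(row["RRT"]),
        "FPReg = sign(RPD_exc)": row["FPReg"] == sign(row["RPD_exc"]),
    }
    return [name for name, ok in checks.items() if not ok]


# Invented example row (durations in milliseconds).
example_row = {
    "FPRT": 200, "RRT": 150, "TFT": 350,
    "RPD_exc": 120, "RBRT": 200, "RPD_inc": 320,
    "Fix": 1, "FPF": 1, "RR": 1, "FPReg": 1,
}

print(violated_identities(example_row))
```

A check of this kind can help spot rows where the merge or a manual fixation correction left the measures out of sync.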

AOI to word mapping

Contains the mapping of each aoi to the respective word in each of the texts.

Please find the file at this link: aoi to word mapping

Column name Possible values Value type Description Num missing values Missing value description Source
text_id b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 Unique identifier given to each stimulus text. 0 nan nan
word_index_in_text 1-180 Integer The index of the word in the text. Indexing starts at 1. 0 nan nan
char_index_in_text 1-1121 Integer Index of a character in the text. Indexing starts at 1. 0 nan nan
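
Since the aoi column of the merged file is a character index, this mapping can be used to find the word a fixation landed on. The following is a minimal sketch with invented rows; the helper names are hypothetical, and the assumption that some character indices (such as white-space) have no entry is made only for the toy data.

```python
# Invented toy rows in the column order of the mapping file:
# (text_id, word_index_in_text, char_index_in_text)
AOI_ROWS = [
    ("b0", 1, 1), ("b0", 1, 2), ("b0", 1, 3),
    ("b0", 2, 5), ("b0", 2, 6),
]


def build_aoi_lookup(rows):
    """Map (text_id, char_index_in_text) to word_index_in_text."""
    return {(text_id, char_idx): word_idx for text_id, word_idx, char_idx in rows}


def word_for_aoi(lookup, text_id, aoi):
    """Return the word index for a fixated character, or None if unmapped."""
    return lookup.get((text_id, aoi))


lookup = build_aoi_lookup(AOI_ROWS)
print(word_for_aoi(lookup, "b0", 5))
```

The text identifier is part of the key because character indices restart at 1 in every text.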

Participants

In the participants' data file, all demographic information is stored.

Please find the file at this link: Participant information

Column name Possible values Value type Description Num missing values Missing value description Source
reader_id 0-105 Integer The unique identifier given to each reader. Reader IDs start at 0. 0 nan Manually created
reader_discipline biology: 43, physics: 32 Categorical The area of expertise of the reader. All readers are students whose major is either physics or biology. 0 nan demographic questionnaire
reader_discipline_numeric 0: 43, 1: 32 Categorical Numerical encoding of the reader discipline; 0=biology, 1=physics. 0 nan Manually created
level_of_studies graduate: 47, undergraduate: 28 Categorical Reader's level of studies. Readers are considered experts if they are either MSc or PhD students. 1st semester BSc students are considered beginners. 0 nan demographic questionnaire
level_of_studies_numeric 0: 28, 1: 47 Categorical Numerical value of level_of_studies; 0=beginner, 1=expert. 0 nan demographic questionnaire
discipline_level_of_studies biology-graduate: 27, biology-undergraduate: 16, physics-graduate: 20, physics-undergraduate: 12 Categorical The combination of the readers' major (reader_discipline) and their expertise (level_of_studies). 0 nan demographic questionnaire
discipline_level_of_studies_numeric 0: 16, 1: 27, 2: 12, 3: 20 Categorical Numerical value of discipline_level_of_studies; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert. 0 nan demographic questionnaire
glasses no: 54, yes: 20, nan: 1 Categorical Whether or not reader is wearing glasses. 1 nan demographic questionnaire
age min: 18.0, max: 41.0, mean: 24.1644, std: 4.2098 Float Reader's age. 2 nan demographic questionnaire
handedness right: 68, left: 6, nan: 1 Categorical Reader's handedness. 1 nan demographic questionnaire
hours_sleep min: 0.0, max: 11.0, mean: 7.2095, std: 1.3138 Float The hours of sleep of the participant before the experiment. 1 nan demographic questionnaire
alcohol no: 71, yes: 3, nan: 1 Categorical Whether or not a participant consumed alcohol within 24 hours before the experiment start. 1 nan demographic questionnaire
gender female: 39, male: 35, nan: 1 Categorical Reader's gender. 1 nan demographic questionnaire
gender_numeric 0.0: 35, 1.0: 39, nan: 1 Categorical Numerical value of gender; 0=male, 1=female. 1 nan nan
semester string The semester the reader is currently enrolled in. 1 nan demographic questionnaire
bilingual n: 73, j: 1, nan: 1 Categorical Whether the reader is bilingual (n = no, j = yes). 1 nan demographic questionnaire
state string The German state the reader is from. 1 nan demographic questionnaire
grade string The grade of the reader in their university entrance diploma. 4 nan demographic questionnaire
subject_detailed The detailed subject of the reader's major. 1 nan demographic questionnaire
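
The numeric codes listed above for reader_discipline_numeric, level_of_studies_numeric and discipline_level_of_studies_numeric fit together arithmetically. The sketch below shows that relation and checks it against the counts given in the table. The formula `2 * discipline + level` is an observation about the listed codes, not a rule stated by the corpus authors, and the function name is hypothetical.

```python
# Codes as listed in the codebook.
COMBINED_LABELS = {
    0: "biology-beginner",
    1: "biology-expert",
    2: "physics-beginner",
    3: "physics-expert",
}


def combined_code(reader_discipline_numeric, level_of_studies_numeric):
    """Combine discipline (0=biology, 1=physics) and level (0=beginner, 1=expert)."""
    return 2 * reader_discipline_numeric + level_of_studies_numeric


# Counts copied from the table above.
discipline_counts = {0: 43, 1: 32}
level_counts = {0: 28, 1: 47}
combined_counts = {0: 16, 1: 27, 2: 12, 3: 20}

# Each marginal count should equal the sum of the matching combined counts.
for discipline, expected in discipline_counts.items():
    total = sum(combined_counts[combined_code(discipline, level)] for level in (0, 1))
    assert total == expected

print(COMBINED_LABELS[combined_code(1, 1)])
```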

Participants' response accuracy

The response accuracy for each participant for each question.

Please find the file at this link: Participant response accuracy

Column name Possible values Value type Description Num missing values Missing value description Source
reader_id 0-105 Integer The unique identifier given to each reader. Reader IDs start at 0. 0 nan Manually created
reader_discipline biology: 516, physics: 384 Categorical The area of expertise of the reader. All readers are students whose major is either physics or biology. 0 nan demographic questionnaire
reader_discipline_numeric 0: 516, 1: 384 Categorical Numerical encoding of the reader discipline; 0=biology, 1=physics. 0 nan Manually created
level_of_studies graduate: 564, undergraduate: 336 Categorical Reader's level of studies. Readers are considered experts if they are either MSc or PhD students. 1st semester BSc students are considered beginners. 0 nan demographic questionnaire
level_of_studies_numeric 0: 336, 1: 564 Categorical Numerical value of level_of_studies; 0=beginner, 1=expert. 0 nan demographic questionnaire
text_id b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 Unique identifier given to each stimulus text. 0 nan nan
text_domain biology: 450, physics: 450 Categorical The domain of the stimulus text. 0 nan Manually tagged
expert_reading_label expert-reading: 282, non-expert-reading: 618 Categorical Whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert) 0 nan Manually tagged
expert_reading_label_numeric 0: 618, 1: 282 Categorical Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert). 1=expert_reading, 0=non-expert_reading 0 nan Manually tagged
acc_tq_1 min: 0.0, max: 1.0, mean: 0.6475, std: 0.478 Float The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1. 12 For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_2 min: 0.0, max: 1.0, mean: 0.6441, std: 0.479 Float The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1. 12 For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_tq_3 min: 0.0, max: 1.0, mean: 0.6509, std: 0.477 Float The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1. 12 For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_bq_1 min: 0.0, max: 1.0, mean: 0.393, std: 0.4887 Float The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1. 12 For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_bq_2 min: 0.0, max: 1.0, mean: 0.366, std: 0.482 Float The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1. 12 For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
acc_bq_3 min: 0.0, max: 1.0, mean: 0.4234, std: 0.4944 Float The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1. 12 For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements). nan
mean_acc_tq min: 0.0, max: 1.0, mean: 0.6475, std: 0.3082 Float The mean accuracy of all text questions for one text read by one reader. 12 nan nan
mean_acc_bq min: 0.0, max: 1.0, mean: 0.3941, std: 0.3163 Float The mean accuracy of all background questions for one text read by one reader. 12 nan nan
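
Two columns in this table are derived from others: the mean accuracies and the expert reading label. The sketch below illustrates one way they could be computed for a single trial. It assumes that a mean is missing whenever any of its three question accuracies is missing, and that the level_of_studies value "graduate" corresponds to "expert"; both are assumptions for illustration, and the function names are hypothetical.

```python
def mean_accuracy(scores):
    """Average a trial's question accuracies; None if any score is missing."""
    if any(score is None for score in scores):
        return None
    return sum(scores) / len(scores)


def expert_reading_label(text_domain, reader_discipline, level_of_studies):
    """Label a trial as expert reading if an expert reads a text from their field."""
    is_expert = level_of_studies == "graduate"  # assumption: graduate = expert
    if is_expert and text_domain == reader_discipline:
        return "expert-reading"
    return "non-expert-reading"


# Invented example trial; None marks a missing measurement.
trial = {
    "text_domain": "physics",
    "reader_discipline": "physics",
    "level_of_studies": "graduate",
    "acc_tq": [1.0, 0.0, 1.0],
    "acc_bq": [None, None, None],
}

print(mean_accuracy(trial["acc_tq"]))
print(mean_accuracy(trial["acc_bq"]))
print(expert_reading_label(
    trial["text_domain"], trial["reader_discipline"], trial["level_of_studies"]
))
```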
