Codebook

The codebook specifies the data types, possible values, and other information for each column in the data files.

Word features
Stimuli and comprehension questions
Items
Areas of interest (AOI)
Dependency trees
Fixations
Scanpaths
Reading measures
Reading measures merged
Scanpaths merged
AOI to word mapping
Participants
Participant's response accuracy
Coding online survey
Participant's response accuracy online survey
Response data online survey

Word features

Contains the word features for each of the stimulus texts.

Please find the files at this link: Word features

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
word		string	Words as they appear in the stimuli texts. Words are split at white-space.	0	nan	nan
word_with_punct		string	The word as it appears in the text, including punctuation.	0	nan	nan
word_index_in_text	1-180	Integer	The index of the word in the text. Indexing starts at 1.	0	nan	nan
word_index_in_sent	1-51	Integer	The index of the word in the sentence. Indexing starts at 1.	0	nan	nan
sent_index_in_text	1-12	Integer	The index of a sentence in the respective text. Indexing starts at 1.	0	nan	nan
word_limit_char_indices			Specifies the limits of each word in character indices. Format: [word_start],[word_end]. For example, 3,7 means that a word starts at character index 3 in the text and ends at character index 7. The properties of the character indices are specified in char_index_in_text.	0	nan	nan
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	0	nan	nan
text_id_numeric	0-11	Integer	Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5	0	nan	Manually created
text_domain	biology: 954, physics: 941	Categorical	The domain of the stimulus text.	0	nan	Manually tagged
text_domain_numeric	0: 954, 1: 941	Categorical	Numerical value of text_domain; 0=biology, 1=physics.	0	nan	Manually created
word_length	2-33	Integer	Word length in number of characters, including symbols such as hyphens but excluding sentence punctuation at the end of the word (e.g., z.B. = 4 characters; DNA-Kette = 9 characters; eats. = 4 characters).	0	nan	nan
STTS_punctuation_before	nan: 1883, $(: 12	Categorical	If a word is preceded by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here.	1883	nan	Manually tagged
STTS_punctuation_after	nan: 1689, $.: 101, $,: 93, $(: 10, $($,: 2	Categorical	If a word is followed by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here.	1689	nan	Manually tagged
is_in_quote	0: 1881, 1: 14	Categorical	Whether or not the word is part of an expression in quotes.	0	nan	Manually tagged
is_in_parentheses	0: 1890, 1: 5	Categorical	Whether or not the word is part of a phrase in parentheses.	0	nan	Manually tagged
is_clause_beginning	0: 1796, 1: 99	Categorical	Whether or not the word is the beginning of a clause.	0	nan	Manually tagged
is_sent_beginning	0: 1798, 1: 97	Categorical	Whether or not the word is the beginning of a new sentence.	0	nan	Manually tagged
is_clause_end	0: 1797, 1: 98	Categorical	Whether or not the word is the end of a clause.	0	nan	Manually tagged
is_sent_end	0: 1798, 1: 97	Categorical	Whether or not the word is the end of a sentence.	0	nan	Manually tagged
is_abbreviation	0: 1890, 1: 5	Categorical	Whether or not the entire word is an abbreviation.	0	nan	Manually tagged
is_expert_technical_term	0: 1740, 1: 155	Categorical	1 if the word is a technical term that is not generally understandable. E.g.: "Agarose".	0	nan	Manually tagged
is_general_technical_term	0: 1646, 1: 249	Categorical	1 if the word is a technical term that is generally understandable. E.g.: "elektrisch"	0	nan	Manually tagged
contains_symbol	0: 1887, 1: 8	Categorical	Whether or not the word contains a symbol. E.g.: β-D-Glucose	0	nan	Manually tagged
contains_hyphen	0: 1866, 1: 29	Categorical	Whether or not the word contains a hyphen, e.g. 1 for DNA-Fragment. Words with the tag TRUNC (truncated first element of a compound) do not count as containing a hyphen: in "Sekundär- und Tertiärstrukturen", "Sekundär-" is coded 0.	0	nan	Manually tagged
contains_abbreviation	0: 1883, 1: 12	Categorical	Whether or not the word contains an abbreviation. 0 for words that are only an abbreviation. See is_abbreviation. E.g. 1 for DNA-Fragment, 0 for DNA.	0	nan	Manually tagged
STTS_PoS_tag	ADJA: 154, ADJD: 53, ADV: 73, APPR: 184, APPRART: 48, APZR: 1, ART: 276, CARD: 9, KOKOM: 17, KON: 66, KOUI: 6, KOUS: 16, NE: 4, NN: 515, PAV: 18, PDAT: 16, PDS: 7, PIAT: 5, PIDAT: 9, PIS: 10, PPER: 25, PPOSAT: 7, PRELAT: 6, PRELS: 29, PRF: 25, PTKA: 1, PTKNEG: 4, PTKVZ: 13, PTKZU: 10, PWAV: 1, TRUNC: 5, VAFIN: 73, VAINF: 8, VMFIN: 25, VMINF: 1, VVFIN: 102, VVINF: 33, VVIZU: 2, VVPP: 38	Categorical	Part-of-speech tags according to the STTS-tagset. See stimuli/ANNOTATION.MD for more information.	0	nan	Manually tagged
type		string	The orthographical representation of a word as found in the corpus; this data is case sensitive, i.e. there is a distinction between name and Name.	4	nan	dlexDB
type_length_chars	2.0-33.0	Integer	The length of the type of a word in characters. See the description of word_length for a definition of how characters are counted.	1	nan	nan
PoS_tag	adja: 162, adjd: 54, adv: 91, appr: 182, apprart: 48, art: 280, card: 9, kokom: 17, kon: 63, koui: 5, kous: 16, ne: 7, nn: 508, pdat: 16, pds: 7, piat: 5, pidat: 2, pis: 14, pper: 24, pposat: 7, prelat: 6, prels: 24, prf: 25, ptka: 1, ptkneg: 4, ptkvz: 15, ptkzu: 10, pwav: 1, trunc: 5, vafin: 73, vainf: 8, vmfin: 24, vminf: 1, vvfin: 103, vvinf: 33, vvizu: 2, vvpp: 38, xy: 5	Categorical	Part-of-speech tag as defined by the dlexDB query.	0	nan	dlexDB
lemma		string	nan	4	nan	dlexDB
lemma_length_chars	1.0-32.0	Integer	nan	3	nan	dlexDB
syllables		string	nan	25	nan	dlexDB
type_length_syllables	1.0-14.0	Integer	nan	24	nan	dlexDB
annotated_type_frequency_normalized	min: 0.00817507899599, max: 24738.5901996, mean: 3889.8532, std: 6967.089	Float	The number of occurrences of an annotated type in corpus. An annotated type is a unique combination of a type, its part-of-speech tag and its lemma.	127	nan	dlexDB
type_frequency_normalized	min: 0.00817507899599, max: 26530.3631386, mean: 4409.2283, std: 7712.5287	Float	nan	115	nan	dlexDB
lemma_frequency_normalized	min: 0.00817507899599, max: 80100.3069113, mean: 13063.8057, std: 25247.1898	Float	nan	115	nan	dlexDB
familiarity_normalized	min: 0.0, max: 26530.3631386, mean: 4074.0362, std: 7634.0602	Float	nan	117	nan	dlexDB
regularity_normalized	min: 0.0, max: 2123.30585022, mean: 37.6119, std: 123.3575	Float	nan	116	nan	dlexDB
document_frequency_normalized	min: 0.126068429944, max: 9372.80956103, mean: 3073.6225, std: 3377.4549	Float	nan	116	nan	dlexDB
sentence_frequency_normalized	min: 0.0155184320176, max: 30912.3596552, mean: 6119.8019, std: 9642.457	Float	nan	116	nan	dlexDB
cumulative_syllable_corpus_frequency_normalized	min: 1.40611358731, max: 125126.524676, mean: 16825.508, std: 15793.39	Float	nan	116	nan	dlexDB
cumulative_syllable_lexicon_frequency_normalized	min: 0.428085856899, max: 218985.607753, mean: 23221.2613, std: 31879.0143	Float	nan	119	nan	dlexDB
cumulative_character_corpus_frequency_normalized	min: 15533.2550482, max: 7810554.20193, mean: 1917789.2641, std: 1253328.3202	Float	nan	116	nan	dlexDB
cumulative_character_lexicon_frequency_normalized	min: 47003.8270876, max: 18380479.713, mean: 4265792.357, std: 2812004.0938	Float	nan	116	nan	dlexDB
cumulative_character_bigram_corpus_frequency_normalized	min: 5138.64210483, max: 1322150.62097, mean: 363265.3368, std: 217175.5613	Float	nan	116	nan	dlexDB
cumulative_character_bigram_lexicon_frequency_normalized	min: 12677.7626521, max: 2788357.77704, mean: 590209.5889, std: 442407.5129	Float	nan	116	nan	dlexDB
cumulative_character_trigram_corpus_frequency_normalized	min: 4358.04468689, max: 603427.130456, mean: 227949.9158, std: 122856.9432	Float	nan	116	nan	dlexDB
cumulative_character_trigram_lexicon_frequency_normalized	min: 11942.3111499, max: 899592.89035, mean: 237804.6839, std: 171696.6712	Float	nan	116	nan	dlexDB
initial_letter_frequency_normalized	min: 199.202149895, max: 110461.430317, mean: 38381.0963, std: 33346.9984	Float	nan	116	nan	dlexDB
initial_bigram_frequency_normalized	min: 1.57779024623, max: 53801.2331077, mean: 12768.0203, std: 14670.9631	Float	nan	116	nan	dlexDB
initial_trigram_frequency_normalized	min: -0.00817507899599, max: 29048.3692201, mean: 5888.4981, std: 8949.4325	Float	nan	116	nan	dlexDB
avg_cond_prob_in_bigrams	min: 1.2e-07, max: 0.5006180465, mean: 0.0451, std: 0.0448	Float	The conditional probability of the bigram, given the occurrence of its first component. In other words, how likely it is for the second component to follow directly after the first. Here, this measure is computed on the basis of the annotated type information.	116	nan	dlexDB
avg_cond_prob_in_trigrams	min: 3.153e-06, max: 25.0, mean: 0.2526, std: 0.6009	Float	The conditional probability of the trigram, given the occurrence of its initial bigram. In other words, how likely it is for the third component to follow directly after the initial pair. Here, this measure is computed on the basis of the annotated type information.	116	nan	dlexDB
neighbors_coltheart_higher_freq_cum_freq_normalized	min: 0.0, max: 44055.247282, mean: 2248.7136, std: 7540.5582	Float	nan	116	nan	dlexDB
neighbors_coltheart_higher_freq_count_normalized	min: 0.0, max: 8.13363128109, mean: 0.2077, std: 0.5007	Float	nan	116	nan	dlexDB
neighbors_coltheart_all_cum_freq_normalized	min: 0.0, max: 49782.1108458, mean: 5076.6032, std: 10127.1033	Float	nan	116	nan	dlexDB
neighbors_coltheart_all_count_normalized	min: 0.0, max: 47.5175301158, mean: 15.7971, std: 14.4153	Float	nan	116	nan	dlexDB
neighbors_levenshtein_higher_freq_cum_freq_normalized	min: 0.0, max: 44055.247282, mean: 2879.4346, std: 7921.0448	Float	nan	116	nan	dlexDB
neighbors_levenshtein_higher_freq_count_normalized	min: 0.0, max: 11.9864039932, mean: 0.3277, std: 0.6576	Float	nan	116	nan	dlexDB
neighbors_levenshtein_all_cum_freq_normalized	min: 0.0, max: 54875.2749862, mean: 6722.366, std: 11598.2601	Float	nan	116	nan	dlexDB
neighbors_levenshtein_all_count_normalized	min: 0.0, max: 75.7711966712, mean: 24.6418, std: 22.5295	Float	nan	116	nan	dlexDB
sent_surprisal_gpt2-base	min: 0.0005104430601932, max: 56.804420471191406, mean: 6.9134, std: 6.601	Float	Surprisal value extracted from a language model (GerPT2-base) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_gpt2-base	min: 0.0002225389762315, max: 53.041446685791016, mean: 5.5822, std: 5.709	Float	Surprisal value extracted from a language model (GerPT2-base) with the text as context.	0	nan	See script get_surprisal.py
sent_surprisal_gpt2-large	min: 0.0002048997703241, max: 42.28059005737305, mean: 6.1407, std: 5.8854	Float	Surprisal value extracted from a language model (GerPT2-large) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_gpt2-large	min: 0.0001027531252475, max: 35.38883209228516, mean: 4.735, std: 4.8645	Float	Surprisal value extracted from a language model (GerPT2-large) with the text as context.	0	nan	See script get_surprisal.py
sent_surprisal_llama-7b	min: 0.0001720042055239, max: 42.96158599853516, mean: 6.1564, std: 5.7273	Float	Surprisal value extracted from a language model (LeoLM-7b) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_llama-7b	min: 1.990775308513548e-05, max: 35.62324142456055, mean: 3.4794, std: 3.8552	Float	Surprisal value extracted from a language model (LeoLM-7b) with the text as context.	0	nan	See script get_surprisal.py
sent_surprisal_llama-13b	min: 8.702239938429557e-06, max: 46.25139999389648, mean: 6.0065, std: 5.8588	Float	Surprisal value extracted from a language model (LeoLM-13b) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_llama-13b	min: 9.298280929215252e-06, max: 36.29869842529297, mean: 3.2454, std: 3.8091	Float	Surprisal value extracted from a language model (LeoLM-13b) with the text as context.	0	nan	See script get_surprisal.py
sent_surprisal_bert-base	min: 1.1920928244535389e-07, max: 101.79562616348268, mean: 6.4507, std: 11.6184	Float	Surprisal value extracted from a language model (BERT-base) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_bert-base	min: -0.0, max: 88.84420316047726, mean: 6.2599, std: 11.5846	Float	Surprisal value extracted from a language model (BERT-base) with the text as context.	0	nan	See script get_surprisal.py
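
The `[word_start],[word_end]` format of word_limit_char_indices can be unpacked as in the following minimal sketch. It assumes that both limits are 1-based and inclusive (in line with char_index_in_text, but not stated explicitly in this codebook). The helper names `word_span` and `slice_word` and the sample sentence are hypothetical and not part of the corpus.

```python
def word_span(limits: str) -> tuple[int, int]:
    """Parse a '[word_start],[word_end]' string into two integers."""
    start, end = (int(part) for part in limits.split(","))
    return start, end


def slice_word(text: str, limits: str) -> str:
    """Return the substring of `text` covered by `limits`.

    Assumption: both character indices are 1-based and inclusive.
    """
    start, end = word_span(limits)
    return text[start - 1:end]


# Hypothetical example sentence, not taken from the stimulus texts.
sample_text = "Die DNA-Kette ist lang."
print(word_span("5,13"))
print(slice_word(sample_text, "5,13"))
```

If the end index turns out to be exclusive in the actual files, only the slice in `slice_word` needs to change.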

Stimuli and comprehension questions

Contains the stimulus information including the questions for each text.

Please find the file at this link: Stimuli including comprehension questions

Column name	Possible values	Value type	Description	Missing value description	Source
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	nan	nan
text_id_numeric	0-11	Integer	Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5	nan	Manually created
text_domain	biology: 6, physics: 6	Categorical	The domain of the stimulus text.	nan	Manually tagged
text_domain_numeric	0: 6, 1: 6	Categorical	Numerical value of text_domain; 0=biology, 1=physics.	nan	Manually created
source			The source of the stimulus text.	nan	nan
headline		string	The header of the respective stimulus text.	nan	nan
tq_1		string	Text question 1.	nan	Manually created
tq_1_option1		string	Option 1 for text question 1.	nan	Manually created
tq_1_option2		string	Option 2 for text question 1.	nan	Manually created
tq_1_option3		string	Option 3 for text question 1.	nan	Manually created
tq_1_option4		string	Option 4 for text question 1.	nan	Manually created
tq_2		string	Text question 2.	nan	Manually created
tq_2_option1		string	Option 1 for text question 2.	nan	Manually created
tq_2_option2		string	Option 2 for text question 2.	nan	Manually created
tq_2_option3		string	Option 3 for text question 2.	nan	Manually created
tq_2_option4		string	Option 4 for text question 2.	nan	Manually created
tq_3		string	Text question 3.	nan	Manually created
tq_3_option1		string	Option 1 for text question 3.	nan	Manually created
tq_3_option2		string	Option 2 for text question 3.	nan	Manually created
tq_3_option3		string	Option 3 for text question 3.	nan	Manually created
tq_3_option4		string	Option 4 for text question 3.	nan	Manually created
bq_1		string	Background question 1.	nan	Manually created
bq_1_option1		string	Option 1 for background question 1.	nan	Manually created
bq_1_option2		string	Option 2 for background question 1.	nan	Manually created
bq_1_option3		string	Option 3 for background question 1.	nan	Manually created
bq_1_option4		string	Option 4 for background question 1.	nan	Manually created
bq_2		string	Background question 2.	nan	Manually created
bq_2_option1		string	Option 1 for background question 2.	nan	Manually created
bq_2_option2		string	Option 2 for background question 2.	nan	Manually created
bq_2_option3		string	Option 3 for background question 2.	nan	Manually created
bq_2_option4		string	Option 4 for background question 2.	nan	Manually created
bq_3		string	Background question 3.	nan	Manually created
bq_3_option1		string	Option 1 for background question 3.	nan	Manually created
bq_3_option2		string	Option 2 for background question 3.	nan	Manually created
bq_3_option3		string	Option 3 for background question 3.	nan	Manually created
bq_3_option4		string	Option 4 for background question 3.	nan	Manually created
correct_ans_tq_1	1-4	Integer	The index of the correct answer for text question 1, given as the option number of that question in this file. For example, 2 means that the answer in the column "tq_1_option2" is the correct answer to this question.	nan	nan
correct_ans_tq_2	1-4	Integer	The index of the correct answer for text question 2, given as the option number of that question in this file. For example, 2 means that the answer in the column "tq_2_option2" is the correct answer to this question.	nan	nan
correct_ans_tq_3	1-4	Integer	The index of the correct answer for text question 3, given as the option number of that question in this file. For example, 2 means that the answer in the column "tq_3_option2" is the correct answer to this question.	nan	nan
correct_ans_bq_1	1-4	Integer	The index of the correct answer for background question 1, given as the option number of that question in this file. For example, 2 means that the answer in the column "bq_1_option2" is the correct answer to this question.	nan	nan
correct_ans_bq_2	1-4	Integer	The index of the correct answer for background question 2, given as the option number of that question in this file. For example, 2 means that the answer in the column "bq_2_option2" is the correct answer to this question.	nan	nan
correct_ans_bq_3	1-4	Integer	The index of the correct answer for background question 3, given as the option number of that question in this file. For example, 2 means that the answer in the column "bq_3_option2" is the correct answer to this question.	nan	nan
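
As an illustration of how the correct_ans_* columns relate to the option columns, the sketch below looks up the text of the correct answer for one question. It assumes that the index always refers to the option columns of the same question. The function name `correct_option_column` and the row contents are hypothetical placeholders.

```python
def correct_option_column(question: str, row: dict) -> str:
    """Name of the column that holds the correct answer for `question`.

    `question` is one of tq_1..tq_3 or bq_1..bq_3.
    Assumption: correct_ans_<question> indexes the options of that question.
    """
    index = int(row[f"correct_ans_{question}"])  # 1-4 according to the codebook
    return f"{question}_option{index}"


# Hypothetical row; the real file has one row per stimulus text.
row = {
    "tq_1": "Placeholder question?",
    "tq_1_option1": "answer A",
    "tq_1_option2": "answer B",
    "tq_1_option3": "answer C",
    "tq_1_option4": "answer D",
    "correct_ans_tq_1": "2",
}

column = correct_option_column("tq_1", row)
print(column, "->", row[column])
```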

Items

Contains the version number of the question and answer randomization for each text.

Please find the file at this link: Items

Column name	Possible values	Value type	Description	Missing value description	Source
version	0-119	Integer	Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in the items.tsv.	nan	nan
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	nan	nan
text_domain	biology: 720, physics: 720	Categorical	The domain of the stimulus text.	nan	Manually tagged
order_bq_1_ans			The order in which the answers for background question 1 were presented.	nan	nan
order_bq_2_ans			See description of order_bq_1_ans.	nan	nan
order_bq_3_ans			See description of order_bq_1_ans.	nan	nan
order_tq_1_ans			See description of order_bq_1_ans.	nan	nan
order_tq_2_ans			See description of order_bq_1_ans.	nan	nan
order_tq_3_ans			See description of order_bq_1_ans.	nan	nan
trial	1-12	Integer	Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text.	nan	nan
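
The relation between version, text_id, and trial can be sketched as follows: sorting the rows of one version by trial gives the order in which the texts were read. This is a minimal sketch using only the Python standard library. The inline TSV excerpt is hypothetical (three texts instead of twelve), and `reading_order` is an illustrative helper name.

```python
import csv
import io

# Hypothetical excerpt of items.tsv for one version; only three texts shown.
items_tsv = "version\ttext_id\ttrial\n0\tb0\t2\n0\tp3\t1\n0\tb4\t3\n"


def reading_order(rows, version):
    """Return the text_ids of one version, sorted by trial number."""
    selected = [r for r in rows if int(r["version"]) == version]
    return [r["text_id"] for r in sorted(selected, key=lambda r: int(r["trial"]))]


rows = list(csv.DictReader(io.StringIO(items_tsv), delimiter="\t"))
order = reading_order(rows, version=0)
print(order)
```

In this excerpt, text b0 has trial number 2 and therefore appears second in the returned list.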

Areas of interest (AOI)

Contains the AOI files for each of the stimulus texts.

Please find the files at this link: AOI

Column name	Possible values	Value type	Description	Missing value description	Source
aoi_type			The shape of the area of interest. In this corpus, all aois are rectangles around the characters.	nan	SR Research data viewer
aoi	1-1121	Integer	The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated.	nan	SR Research experiment builder
start_x	80-1622	Integer	The x-coordinate in pixels of the top left corner of the aoi rectangle.	nan	nan
start_y	21-920	Integer	The y-coordinate in pixels of the top left corner of the aoi rectangle.	nan	nan
end_x	92-1634	Integer	The x-coordinate in pixels of the bottom right corner of the aoi rectangle.	nan	nan
end_y	99-998	Integer	The y-coordinate in pixels of the bottom right corner of the aoi rectangle.	nan	nan
character		string	Character as text.	nan	nan
line	1-12	Integer	The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1.	nan	nan
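
Because each AOI is a rectangle given by its top left and bottom right corners, a gaze position can be assigned to a character with a simple containment test. The sketch below is one possible way to do this and not necessarily how the corpus was processed. The AOI rows are hypothetical, and treating the left and top edges as inclusive and the right and bottom edges as exclusive is an assumption.

```python
from typing import Optional

# Hypothetical AOI rows; coordinates are in pixels as described above.
aois = [
    {"aoi": 1, "character": "D", "start_x": 80, "start_y": 21, "end_x": 92, "end_y": 99},
    {"aoi": 2, "character": "i", "start_x": 92, "start_y": 21, "end_x": 104, "end_y": 99},
]


def find_aoi(x: float, y: float, aoi_rows) -> Optional[dict]:
    """Return the first AOI whose rectangle contains (x, y), else None.

    Assumption: left/top edges are inclusive and right/bottom edges are
    exclusive, so adjacent rectangles sharing an edge never both match.
    """
    for row in aoi_rows:
        inside_x = row["start_x"] <= x < row["end_x"]
        inside_y = row["start_y"] <= y < row["end_y"]
        if inside_x and inside_y:
            return row
    return None


hit = find_aoi(95.0, 50.0, aois)
print(hit["character"] if hit else None)
```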

Manually corrected constituency trees

The constituency trees that have been corrected manually.

Please find the file at this link:

Column name	Possible values	Value type	Description	Missing value description	Source
sent_index_in_text	1-12	Integer	The index of a sentence in the respective text. Indexing starts at 1.	nan	nan
sentence		string	The sentence in the text.	nan	nan
spacy_constituency_tree			The constituency tree of the sentence in the text as constructed by spacy.	nan	Spacy
str_constituents			The constituency tree in string format, so that it can be parsed and displayed easily.	nan	Spacy
spacy_pos			The part-of-speech tags of the words in the sentence as tagged by spacy.	nan	Spacy
constituents			The constituents of the sentence tree as constructed by spacy.	nan	Spacy
text_id_numeric	0-11	Integer	Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5	nan	Manually created
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	nan	nan
manually_corrected	False: 19, True: 79	Categorical	Whether the sentence tree was manually corrected.	nan	Manually tagged

Dependency trees

Contains the dependency trees for all stimuli which have been manually corrected.

Please find the file at this link: Dependency trees

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
spacy_word			The words in the sentence as tokenized by spacy.	0	nan	Spacy
spacy_lemma			The lemmas of the words in the sentence as constructed by spacy.	0	nan	Spacy
spacy_pos			The part-of-speech tags of the words in the sentence as tagged by spacy.	0	nan	Spacy
spacy_tag			The detailed part-of-speech tags of the words in the sentence as constructed by spacy (more fine-grained than spacy_pos).	0	nan	Spacy
dependency			The dependency relations of the words in the sentence as constructed by spacy.	0	nan	Spacy
dependency_head			The head of the dependency relation of the words in the sentence as constructed by spacy.	0	nan	Spacy
dependency_head_pos			The part-of-speech tag of the head of the dependency relation of the words in the sentence as constructed by spacy.	0	nan	Spacy
dependency_children			The children of the dependency relation of the words in the sentence as constructed by spacy.	0	nan	Spacy
text_id_numeric	0-11	Integer	Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5	0	nan	Manually created
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	0	nan	nan
sent_index_in_text	1.0-12.0	Integer	The index of a sentence in the respective text. Indexing starts at 1.	1	nan	nan
manually_corrected	False: 1768, True: 193, nan: 153, Flse: 2	Categorical	Whether the sentence tree was manually corrected.	153	nan	Manually tagged

Raw data files (samples)

The raw eye-tracking data for each trial; each line contains one sample.

Please find the files at this link: Raw ET data

Column name	Value type	Description	Source
time	Float	The time stamp of the sample.	edf file created by EyeLink
x	Float	The x-coordinate of the sample.	edf file created by EyeLink
y	Float	The y-coordinate of the sample.	edf file created by EyeLink
pupil_diameter	Float	The pupil diameter of the sample.	edf file created by EyeLink

Fixations

Computed gaze events of all trials for each reader.

Please find the files at this link: Fixations

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
fixation_index	1-1469	Integer	The index of the fixation in temporal order.	0	nan	SR Research data viewer
text_domain	bio: 203667, biology: 1032, physics: 199721	Categorical	The domain of the stimulus text.	0	nan	Manually tagged
trial	1-12	Integer	Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text.	0	nan	nan
acc_bq_1	min: 0.0, max: 1.0, mean: 0.3869, std: 0.487	Float	The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_2	min: 0.0, max: 1.0, mean: 0.3564, std: 0.4789	Float	The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_3	min: 0.0, max: 1.0, mean: 0.4217, std: 0.4938	Float	The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_1	min: 0.0, max: 1.0, mean: 0.6625, std: 0.4729	Float	The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_2	min: 0.0, max: 1.0, mean: 0.6326, std: 0.4821	Float	The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_3	min: 0.0, max: 1.0, mean: 0.6564, std: 0.4749	Float	The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
fixation_duration	2-4474	Integer	The duration of the fixation in milliseconds.	0	nan	SR Research data viewer
next_saccade_duration	1.0-9491.0	Integer	The duration of the saccade that follows a fixation in milliseconds.	46	nan	SR Research data viewer
previous_saccade_duration	nan-nan	Integer	The duration of the saccade that precedes a fixation in milliseconds.	515	nan	SR Research data viewer
version	0-105	Integer	Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in the items.tsv.	0	nan	nan
line	1-12	Integer	The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1.	0	nan	nan
aoi	1-1121	Integer	The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated.	0	nan	SR Research experiment builder
char_index_in_line	1-100	Integer	Index of a character in the line. Indexing starts at 1.	0	nan	nan
original_fixation_index	1-1478	Integer	The index of the uncorrected fixation.	0	nan	SR Research data viewer
is_fixation_adjusted	False: 382202, True: 22218	Categorical	Whether or not the fixation has been adjusted manually.	0	nan	Manually tagged.
reader_id	0-105	Integer	The unique identifier given to each reader. Reader IDs start at 0.	0	nan	Manually created
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	0	nan	nan
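
The text_domain counts above list both "bio" and "biology". A minimal sketch for harmonizing the labels before analysis is shown below. It assumes that "bio" is a shorthand spelling of "biology". This assumption is supported, though not proven, by the fact that the two counts add up to 204699, the count listed for 0 (biology) in text_domain_numeric of the scanpaths table. `DOMAIN_ALIASES` and `normalize_domain` are hypothetical names.

```python
from collections import Counter

# Assumption: "bio" is a shorthand spelling of "biology".
DOMAIN_ALIASES = {"bio": "biology"}


def normalize_domain(label: str) -> str:
    """Map alias spellings to the canonical domain label."""
    return DOMAIN_ALIASES.get(label, label)


# Counts as listed for text_domain in the fixations table above.
listed_counts = {"bio": 203667, "biology": 1032, "physics": 199721}

merged = Counter()
for label, count in listed_counts.items():
    merged[normalize_domain(label)] += count

print(dict(merged))
```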

Scanpaths

The scanpaths for each trial (i.e. fixations in fixation order).

Please find the files at this link: Scanpaths

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
fixation_index	1-1469	Integer	The index of the fixation in temporal order.	0	nan	SR Research data viewer
text_domain	bio: 4682, biology: 200017, physics: 199721	Categorical	The domain of the stimulus text.	0	nan	Manually tagged
trial	1-12	Integer	Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text.	0	nan	nan
acc_bq_1	min: 0.0, max: 1.0, mean: 0.3869, std: 0.487	Float	The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_2	min: 0.0, max: 1.0, mean: 0.3564, std: 0.4789	Float	The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_3	min: 0.0, max: 1.0, mean: 0.4217, std: 0.4938	Float	The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_1	min: 0.0, max: 1.0, mean: 0.6625, std: 0.4729	Float	The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_2	min: 0.0, max: 1.0, mean: 0.6326, std: 0.4821	Float	The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_3	min: 0.0, max: 1.0, mean: 0.6564, std: 0.4749	Float	The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
fixation_duration	2-4474	Integer	The duration of the fixation in milliseconds.	0	nan	SR Research data viewer
next_saccade_duration	1.0-9491.0	Integer	The duration of the saccade that follows a fixation in milliseconds.	46	nan	SR Research data viewer
previous_saccade_duration	1.0-9491.0	Integer	The duration of the saccade that precedes a fixation in milliseconds.	515	nan	SR Research data viewer
version	0-105	Integer	Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in the items.tsv.	0	nan	nan
line	1-12	Integer	The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1.	0	nan	nan
aoi	1-1121	Integer	The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated.	0	nan	SR Research experiment builder
char_index_in_line	1-100	Integer	Index of a character in the line. Indexing starts at 1.	0	nan	nan
original_fixation_index	1-1478	Integer	The index of the uncorrected fixation.	0	nan	SR Research data viewer
is_fixation_adjusted	False: 382202, True: 22218	Categorical	Whether or not the fixation has been adjusted manually.	0	nan	Manually tagged.
reader_id	0-105	Integer	The unique identifier given to each reader. Reader IDs start at 0.	0	nan	Manually created
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	0	nan	nan
word_index_in_text	1-180	Integer	The index of the word in the text. Indexing starts at 1.	0	nan	nan
sent_index_in_text	1-12	Integer	The index of a sentence in the respective text. Indexing starts at 1.	0	nan	nan
char_index_in_text	1-1121	Integer	Index of a character in the text. Indexing starts at 1.	0	nan	nan
word		string	Words as they appear in the stimuli texts. Words are split at white-space.	0	nan	nan
character		string	Character as text.	0	nan	nan
text_id_numeric	0-11	Integer	Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5	0	nan	Manually created
text_domain_numeric	0: 204699, 1: 199721	Categorical	Numerical value of text_domain; 0=biology, 1=physics.	0	nan	Manually created
reader_discipline_numeric	0: 223158, 1: 181262	Categorical	Numerical encoding of the reader discipline; 0=biology, 1=physics.	0	nan	Manually created
level_of_studies_numeric	0: 154333, 1: 250087	Categorical	Numerical value of level_of_studies; 0=beginner, 1=expert.	0	nan	demographic questionnaire
expert_reading_label_numeric	0: 290883, 1: 113537	Categorical	Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert). 1=expert_reading, 0=non-expert_reading	0	nan	Manually tagged
expert_reading_label	expert_reading: 113537, non-expert_reading: 290883	Categorical	Whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert)	0	nan	Manually tagged
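
The rule given in the description of expert_reading_label can be written out as a small function. This is only a sketch of the stated rule (text_domain == reader_discipline and reader is expert), using the numeric encodings from this codebook; the labels in the data files are listed as manually tagged, and the function name and example rows below are hypothetical.

```python
def expert_reading_label_numeric(text_domain_numeric,
                                 reader_discipline_numeric,
                                 level_of_studies_numeric):
    """Return 1 for an expert reading, otherwise 0.

    Encodings as in the codebook:
      text_domain_numeric / reader_discipline_numeric: 0=biology, 1=physics
      level_of_studies_numeric: 0=beginner, 1=expert
    """
    same_domain = text_domain_numeric == reader_discipline_numeric
    is_expert = level_of_studies_numeric == 1
    return int(same_domain and is_expert)


# String labels as listed for expert_reading_label.
LABELS = {1: "expert_reading", 0: "non-expert_reading"}

# Hypothetical rows, for illustration only (not taken from the data files).
rows = [
    {"text_domain_numeric": 0, "reader_discipline_numeric": 0, "level_of_studies_numeric": 1},
    {"text_domain_numeric": 1, "reader_discipline_numeric": 0, "level_of_studies_numeric": 1},
    {"text_domain_numeric": 1, "reader_discipline_numeric": 1, "level_of_studies_numeric": 0},
]
labels = [LABELS[expert_reading_label_numeric(**row)] for row in rows]
print(labels)
```

Only the first row is an expert reading: an expert biologist reading a physics text, or a beginner reading a text from their own discipline, both count as non-expert readings.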

Reading measures

The word-level reading measures in a short format.

Please find the files at this link: Reading measures

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
word_index_in_sent	1-51	Integer	The index of the word in the sentence. Indexing starts at 1.	0	nan	nan
sent_index_in_text	1-12	Integer	The index of a sentence in the respective text. Indexing starts at 1.	0	nan	nan
line	1-12	Integer	The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1.	0	nan	nan
FFD	min: 0, max: 2144, mean: 166.4158, std: 132.8433	Float	First-fixation duration: duration of the first fixation on a word if this word is fixated in first-pass reading, otherwise 0.	0	nan	compute_reading_measures.py
SFD	min: 0, max: 2144, mean: 118.8309, std: 135.573	Float	Single-fixation duration: duration of the only first-pass fixation on a word, 0 if the word was skipped or more than one fixation occurred in the first-pass (equals FFD in case of a single first-pass fixation).	0	nan	compute_reading_measures.py
FD	min: 0, max: 2144, mean: 203.5219, std: 116.9324	Float	First duration: duration of the first fixation on a word (identical to FFD if not skipped in the first-pass).	0	nan	compute_reading_measures.py
FPRT	min: 0, max: 9649, mean: 247.1511, std: 298.6889	Float	First-pass reading time: sum of the durations of all first-pass fixations on a word (0 if the word was skipped in the first-pass).	0	nan	compute_reading_measures.py
FRT	min: 0, max: 9649, mean: 291.8272, std: 288.631	Float	First-reading time: sum of the durations of all fixations from first fixating the word (regardless of whether the first fixation occurs in first-pass reading) until leaving the word for the first time (equals FPRT in case the word was fixated in the first-pass).	0	nan	compute_reading_measures.py
TFT	min: 0, max: 25314, mean: 632.8199, std: 720.3975	Float	Total-fixation time: sum of all fixations on a word (FPRT+RRT).	0	nan	compute_reading_measures.py
RRT	min: 0, max: 23902, mean: 385.6688, std: 597.5206	Float	Re-reading time: sum of the durations of all fixations on a word that do not belong to the first-pass (TFT-FPRT).	0	nan	compute_reading_measures.py
RPD_inc	min: 0, max: 318898, mean: 632.8199, std: 3881.7376	Float	Inclusive regression-path duration: Sum of all fixation durations starting from the first first-pass fixation on a word until fixation on a word to the right of this word (including all regressive fixations on previous words), 0 if the word was not fixated in the first-pass (RPD_exc+RBRT).	0	nan	compute_reading_measures.py
RPD_exc	min: 0, max: 315640, mean: 342.295, std: 3815.3786	Float	Exclusive regression-path duration: Sum of all fixation durations after initiating a first-pass regression from a word until fixating a word to the right of this word, without counting fixations on the word itself (RPD_inc-RBRT).	0	nan	compute_reading_measures.py
RBRT	min: 0, max: 10675, mean: 290.5249, std: 358.8929	Float	Right-bounded reading time: Sum of all fixation durations on a word until a word to the right of this word is fixated (RPD_inc-RPD_exc).	0	nan	compute_reading_measures.py
Fix	0: 14182, 1: 127943	Categorical	Fixation: 1 if the word was fixated, otherwise 0 (FPF or RR).	0	nan	compute_reading_measures.py
FPF	0: 38408, 1: 103717	Categorical	First-pass fixation: 1 if the word was fixated in the first-pass, otherwise 0.	0	nan	compute_reading_measures.py
RR	0: 48283, 1: 93842	Categorical	Re-reading: 1 if the word was fixated after the first-pass reading, otherwise 0 (sign(RRT)).	0	nan	compute_reading_measures.py
FPReg	0: 119060, 1: 23065	Categorical	First-pass regression: 1 if a regression was initiated in the first-pass reading of the word, otherwise 0 (sign(RPD_exc)).	0	nan	compute_reading_measures.py
TRC_out	min: 0, max: 15, mean: 0.4226, std: 0.7828	Float	Total count of outgoing regressions: total number of regressive saccades initiated from this word.	0	nan	compute_reading_measures.py
TRC_in	min: 0, max: 12, mean: 0.4219, std: 0.7892	Float	Total count of incoming regressions: total number of regressive saccades landing on this word.	0	nan	compute_reading_measures.py
LP	min: 0, max: 28, mean: 2.7791, std: 2.0942	Float	Landing position: position at which the first saccade lands on the word, expressed as the ordinal position of the fixated character.	0	nan	compute_reading_measures.py
SL_in	min: -162, max: 156, mean: 1.077, std: 3.0552	Float	Incoming saccade length: length of the saccade that leads to first fixation on a word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression.	0	nan	compute_reading_measures.py
SL_out	min: -179, max: 63, mean: 0.1881, std: 7.0821	Float	Outgoing saccade length: length of the first saccade that leaves the word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression; 0 if the word is never fixated.	0	nan	compute_reading_measures.py
TFC	min: 0, max: 87, mean: 2.8392, std: 2.9135	Float	The total fixation count on the word.	0	nan	compute_reading_measures.py
text_domain_numeric	0: 71550, 1: 70575	Categorical	Numerical value of text_domain; 0=biology, 1=physics.	0	nan	Manually created
trial	1-12	Integer	Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text.	0	nan	nan
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	0	nan	nan
text_id_numeric	0-11	Integer	Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5	0	nan	Manually created
reader_id	0-105	Integer	The unique identifier given to each reader. Reader IDs start at 0.	0	nan	Manually created
gender_numeric	0.0: 66325, 1.0: 73905, nan: 1895	Categorical	Numerical value of gender; 0=male, 1=female.	1895	nan	nan
reader_discipline_numeric	0: 81485, 1: 60640	Categorical	Numerical encoding of the reader discipline; 0=biology, 1=physics.	0	nan	Manually created
level_of_studies_numeric	0: 53060, 1: 89065	Categorical	Numerical value of level_of_studies; 0=beginner, 1=expert.	0	nan	demographic questionnaire
discipline_level_of_studies_numeric	0: 30320, 1: 51165, 2: 22740, 3: 37900	Categorical	Numerical value of discipline_level_of_studies; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert.	0	nan	demographic questionnaire
expert_reading_label_numeric	0: 97547, 1: 44578	Categorical	Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert). 1=expert_reading, 0=non-expert_reading	0	nan	Manually tagged
expert_reading_label	expert_reading: 44578, non-expert_reading: 97547	Categorical	Whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert)	0	nan	Manually tagged
age	min: 18.0, max: 41.0, mean: 24.1644, std: 4.1809	Float	Reader's age.	3790	nan	demographic questionnaire
mean_acc_bq	min: 0.0, max: 0.999250936329588, mean: 0.6381, std: 0.3139	Float	The mean accuracy of all text questions for one text read by one reader.	0	nan	nan
mean_acc_tq	min: 0.0, max: 0.9991603694374476, mean: 0.3875, std: 0.3161	Float	The mean accuracy of all background questions for one text read by one reader.	0	nan	nan
acc_bq_1	min: 0.0, max: 0.9993197278911564, mean: 0.3858, std: 0.4857	Float	The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1.	0	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_2	min: 0.0, max: 0.9993197278911564, mean: 0.3559, std: 0.4778	Float	The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1.	0	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_3	min: 0.0, max: 0.9992429977289932, mean: 0.4207, std: 0.4925	Float	The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1.	0	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_1	min: 0.0, max: 0.9993197278911564, mean: 0.6364, std: 0.4794	Float	The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1.	0	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_2	min: 0.0, max: 0.999250936329588, mean: 0.6322, std: 0.4805	Float	The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1.	0	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_3	min: 0.0, max: 0.9993197278911564, mean: 0.6456, std: 0.4766	Float	The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1.	0	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
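
The parenthesised formulas in the descriptions above (TFT = FPRT+RRT, RPD_inc = RPD_exc+RBRT, Fix = FPF or RR, RR = sign(RRT), FPReg = sign(RPD_exc)) can serve as a consistency check on a single row. The following is a minimal sketch of such a check, not the logic of compute_reading_measures.py; the function names, the tolerance and the example values are hypothetical.

```python
import math


def sign(x):
    """Return 1 if x is positive, otherwise 0 (durations are never negative)."""
    return int(x > 0)


def violated_identities(row, tol=1e-6):
    """Return the names of the codebook identities that `row` does not satisfy.

    `row` maps column names of the reading-measures file to values.
    """
    def close(a, b):
        return math.isclose(a, b, abs_tol=tol)

    checks = {
        "TFT == FPRT + RRT": close(row["TFT"], row["FPRT"] + row["RRT"]),
        "RPD_inc == RPD_exc + RBRT": close(row["RPD_inc"], row["RPD_exc"] + row["RBRT"]),
        "Fix == (FPF or RR)": row["Fix"] == int(bool(row["FPF"]) or bool(row["RR"])),
        "RR == sign(RRT)": row["RR"] == sign(row["RRT"]),
        "FPReg == sign(RPD_exc)": row["FPReg"] == sign(row["RPD_exc"]),
    }
    return [name for name, ok in checks.items() if not ok]


# Hypothetical row (values in milliseconds), for illustration only.
example = {
    "FPRT": 200.0, "RRT": 150.0, "TFT": 350.0,
    "RBRT": 200.0, "RPD_exc": 180.0, "RPD_inc": 380.0,
    "Fix": 1, "FPF": 1, "RR": 1, "FPReg": 1,
}
print(violated_identities(example))

# The same row with an inconsistent total fixation time.
broken = dict(example, TFT=400.0)
print(violated_identities(broken))
```

The comparison uses a small absolute tolerance because the duration columns are stored as floats; a row of all zeros (a word that was never fixated) passes every check.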

Merged: fixations, participant info, reading measures and word features

The word-level reading measures merged with trial, session and reader information, as well as more information on the words.

Please find the files at this link: Reading measures merged

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
word		string	Words as they appear in the stimuli texts. Words are split at white-space.	0	nan	nan
word_with_punct		string	The word as it appears in the text, including punctuation.	0	nan	nan
word_index_in_text	1-180	Integer	The index of the word in the text. Indexing starts at 1.	0	nan	nan
word_index_in_sent	1-51	Integer	The index of the word in the sentence. Indexing starts at 1.	0	nan	nan
sent_index_in_text	1-12	Integer	The index of a sentence in the respective text. Indexing starts at 1.	0	nan	nan
text_id_numeric	0-11	Integer	Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5	0	nan	Manually created
text_domain	biology: 71550, physics: 70575	Categorical	The domain of the stimulus text.	0	nan	Manually tagged
word_length	2-33	Integer	Word length is defined in number of characters including symbols like hyphens but without sentence punctuation at the end (e.g., z.B. = 4 characters; DNA-Kette = 9 characters; eats. = 4 characters).	0	nan	nan
STTS_punctuation_before	0.0: 70800, 0: 70425, $(: 900	Categorical	If a word is preceded by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here.	0	nan	Manually tagged
STTS_punctuation_after	$(: 750, $($,: 150, $,: 6975, $.: 7575, 0: 126675	Categorical	If a word is followed by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here.	0	nan	Manually tagged
is_in_quote	0: 141075, 1: 1050	Categorical	Whether or not the word is part of an expression in quotes.	0	nan	Manually tagged
is_in_parentheses	0: 141750, 1: 375	Categorical	Whether or not the word is part of a phrase in parentheses.	0	nan	Manually tagged
is_clause_beginning	0: 134700, 1: 7425	Categorical	Whether or not the word is the beginning of a clause.	0	nan	Manually tagged
is_sent_beginning	0: 134850, 1: 7275	Categorical	Whether or not the word is the beginning of a new sentence.	0	nan	Manually tagged
is_clause_end	0: 134775, 1: 7350	Categorical	Whether or not the word is the end of a clause.	0	nan	Manually tagged
is_sent_end	0: 134850, 1: 7275	Categorical	Whether or not the word is the end of a sentence.	0	nan	Manually tagged
is_abbreviation	0: 141750, 1: 375	Categorical	Whether or not the entire word is an abbreviation.	0	nan	Manually tagged
is_expert_technical_term	0: 130500, 1: 11625	Categorical	1 if the word is a technical term that is not generally understandable. E.g.: "Agarose".	0	nan	Manually tagged
is_general_technical_term	0: 123450, 1: 18675	Categorical	1 if the word is a technical term that is generally understandable. E.g.: "elektrisch"	0	nan	Manually tagged
contains_symbol	0: 141525, 1: 600	Categorical	Whether or not the word contains a symbol. E.g.: β-D-Glucose	0	nan	Manually tagged
contains_hyphen	0: 139950, 1: 2175	Categorical	Whether or not the word contains a hyphen, e.g. 1 for DNA-Fragment. Words tagged TRUNC (truncated first element of a compound) do not count: in "Sekundär- und Tertiärstrukturen", "Sekundär-" is not counted as containing a hyphen.	0	nan	Manually tagged
contains_abbreviation	0: 141225, 1: 900	Categorical	Whether or not the word contains an abbreviation. 0 for words that are only an abbreviation. See is_abbreviation. E.g. 1 for DNA-Fragment, 0 for DNA.	0	nan	Manually tagged
STTS_PoS_tag	ADJA: 11550, ADJD: 3975, ADV: 5475, APPR: 13800, APPRART: 3600, APZR: 75, ART: 20700, CARD: 675, KOKOM: 1275, KON: 4950, KOUI: 450, KOUS: 1200, NE: 300, NN: 38625, PAV: 1350, PDAT: 1200, PDS: 525, PIAT: 375, PIDAT: 675, PIS: 750, PPER: 1875, PPOSAT: 525, PRELAT: 450, PRELS: 2175, PRF: 1875, PTKA: 75, PTKNEG: 300, PTKVZ: 975, PTKZU: 750, PWAV: 75, TRUNC: 375, VAFIN: 5475, VAINF: 600, VMFIN: 1875, VMINF: 75, VVFIN: 7650, VVINF: 2475, VVIZU: 150, VVPP: 2850	Categorical	Part-of-speech tags according to the STTS-tagset. See stimuli/ANNOTATION.MD for more information.	0	nan	Manually tagged
type		string	The orthographical representation of a word as found in the corpus; this data is case sensitive, i.e. there is a distinction between name and Name.	0	nan	dlexDB
type_length_chars	0.0-33.0	Integer	The length of the type of a word in characters. See the description of word_length for a definition of how characters are counted.	0	nan	nan
PoS_tag	adja: 12150, adjd: 4050, adv: 6825, appr: 13650, apprart: 3600, art: 21000, card: 675, kokom: 1275, kon: 4725, koui: 375, kous: 1200, ne: 525, nn: 38100, pdat: 1200, pds: 525, piat: 375, pidat: 150, pis: 1050, pper: 1800, pposat: 525, prelat: 450, prels: 1800, prf: 1875, ptka: 75, ptkneg: 300, ptkvz: 1125, ptkzu: 750, pwav: 75, trunc: 375, vafin: 5475, vainf: 600, vmfin: 1800, vminf: 75, vvfin: 7725, vvinf: 2475, vvizu: 150, vvpp: 2850, xy: 375	Categorical	Part-of-speech tag as defined by the dlexDB query.	0	nan	dlexDB
lemma		string	nan	0	nan	dlexDB
lemma_length_chars	0.0-32.0	Integer	nan	0	nan	dlexDB
syllables		string	nan	0	nan	dlexDB
type_length_syllables	0.0-14.0	Integer	nan	0	nan	dlexDB
annotated_type_frequency_normalized	min: 0.0, max: 24738.5901996, mean: 3629.1612, std: 6797.6492	Float	The number of occurrences of an annotated type in corpus. An annotated type is a unique combination of a type, its part-of-speech tag and its lemma.	0	nan	dlexDB
type_frequency_normalized	min: 0.0, max: 26530.3631386, mean: 4141.6498, std: 7546.5578	Float	nan	0	nan	dlexDB
lemma_frequency_normalized	min: 0.0, max: 80100.3069113, mean: 12271.0154, std: 24660.3797	Float	nan	0	nan	dlexDB
familiarity_normalized	min: 0.0, max: 26530.3631386, mean: 3822.4994, std: 7457.3314	Float	nan	0	nan	dlexDB
regularity_normalized	min: 0.0, max: 2123.30585022, mean: 35.3095, std: 119.8288	Float	nan	0	nan	dlexDB
document_frequency_normalized	min: 0.0, max: 9372.80956103, mean: 2885.4746, std: 3353.4877	Float	nan	0	nan	dlexDB
sentence_frequency_normalized	min: 0.0, max: 30912.3596552, mean: 5745.1861, std: 9454.5921	Float	nan	0	nan	dlexDB
cumulative_syllable_corpus_frequency_normalized	min: 0.0, max: 125126.524676, mean: 15795.556, std: 15820.9152	Float	nan	0	nan	dlexDB
cumulative_syllable_lexicon_frequency_normalized	min: 0.0, max: 218985.607753, mean: 21763.0396, std: 31363.3366	Float	nan	0	nan	dlexDB
cumulative_character_corpus_frequency_normalized	min: 0.0, max: 7810554.20193, mean: 1800394.2485, std: 1298158.5605	Float	nan	0	nan	dlexDB
cumulative_character_lexicon_frequency_normalized	min: 0.0, max: 18380479.713, mean: 4004667.3367, std: 2909455.8454	Float	nan	0	nan	dlexDB
cumulative_character_bigram_corpus_frequency_normalized	min: 0.0, max: 1322150.62097, mean: 341028.5141, std: 227677.2532	Float	nan	0	nan	dlexDB
cumulative_character_bigram_lexicon_frequency_normalized	min: 0.0, max: 2788357.77704, mean: 554080.6642, std: 451286.9101	Float	nan	0	nan	dlexDB
cumulative_character_trigram_corpus_frequency_normalized	min: 0.0, max: 603427.130456, mean: 213996.2534, std: 130950.6249	Float	nan	0	nan	dlexDB
cumulative_character_trigram_lexicon_frequency_normalized	min: 0.0, max: 899592.89035, mean: 223247.7744, std: 175811.3775	Float	nan	0	nan	dlexDB
initial_letter_frequency_normalized	min: 0.0, max: 110461.430317, mean: 36031.6466, std: 33586.1123	Float	nan	0	nan	dlexDB
initial_bigram_frequency_normalized	min: 0.0, max: 53801.2331077, mean: 11986.4422, std: 14536.7787	Float	nan	0	nan	dlexDB
initial_trigram_frequency_normalized	min: -0.00817507899599, max: 29048.3692201, mean: 5528.0412, std: 8782.9659	Float	nan	0	nan	dlexDB
avg_cond_prob_in_bigrams	min: 0.0, max: 0.5006180465, mean: 0.0423, std: 0.0447	Float	The conditional probability of the bigram, given the occurrence of its first component. In other words, how likely it is for the second component to follow directly after the first. Here, this measure is computed on the basis of the annotated type information.	0	nan	dlexDB
avg_cond_prob_in_trigrams	min: 0.0, max: 25.0, mean: 0.2371, std: 0.5852	Float	The conditional probability of the trigram, given the occurrence of its initial bigram. In other words, how likely it is for the third component to follow directly after the initial pair. Here, this measure is computed on the basis of the annotated type information.	0	nan	dlexDB
neighbors_coltheart_higher_freq_cum_freq_normalized	min: 0.0, max: 44055.247282, mean: 2111.0615, std: 7323.9586	Float	nan	0	nan	dlexDB
neighbors_coltheart_higher_freq_count_normalized	min: 0.0, max: 8.13363128109, mean: 0.195, std: 0.4875	Float	nan	0	nan	dlexDB
neighbors_coltheart_all_cum_freq_normalized	min: 0.0, max: 49782.1108458, mean: 4765.8454, std: 9884.7277	Float	nan	0	nan	dlexDB
neighbors_coltheart_all_count_normalized	min: 0.0, max: 47.5175301158, mean: 14.8301, std: 14.4676	Float	nan	0	nan	dlexDB
neighbors_levenshtein_higher_freq_cum_freq_normalized	min: 0.0, max: 44055.247282, mean: 2703.1737, std: 7703.635	Float	nan	0	nan	dlexDB
neighbors_levenshtein_higher_freq_count_normalized	min: 0.0, max: 11.9864039932, mean: 0.3077, std: 0.6418	Float	nan	0	nan	dlexDB
neighbors_levenshtein_all_cum_freq_normalized	min: 0.0, max: 54875.2749862, mean: 6310.865, std: 11349.5391	Float	nan	0	nan	dlexDB
neighbors_levenshtein_all_count_normalized	min: 0.0, max: 75.7711966712, mean: 23.1334, std: 22.6083	Float	nan	0	nan	dlexDB
sent_surprisal_gpt2-base	min: 0.0005104430601932, max: 56.804420471191406, mean: 6.9134, std: 6.5992	Float	Surprisal value extracted from a language model (GerPT2-base) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_gpt2-base	min: 0.0002225389762315, max: 53.041446685791016, mean: 5.5822, std: 5.7075	Float	Surprisal value extracted from a language model (GerPT2-base) with the text as context.	0	nan	See script get_surprisal.py
sent_surprisal_gpt2-large	min: 0.0002048997703241, max: 42.28059005737305, mean: 6.1407, std: 5.8838	Float	Surprisal value extracted from a language model (GerPT2-large) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_gpt2-large	min: 0.0001027531252475, max: 35.38883209228516, mean: 4.735, std: 4.8632	Float	Surprisal value extracted from a language model (GerPT2-large) with the text as context.	0	nan	See script get_surprisal.py
sent_surprisal_llama-7b	min: 0.0001720042055239, max: 42.96158599853516, mean: 6.1564, std: 5.7258	Float	Surprisal value extracted from a language model (LeoLM-7b) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_llama-7b	min: 1.990775308513548e-05, max: 35.62324142456055, mean: 3.4794, std: 3.8542	Float	Surprisal value extracted from a language model (LeoLM-7b) with the text as context.	0	nan	See script get_surprisal.py
sent_surprisal_llama-13b	min: 8.702239938429557e-06, max: 46.25139999389648, mean: 6.0065, std: 5.8573	Float	Surprisal value extracted from a language model (LeoLM-13b) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_llama-13b	min: 9.298280929215252e-06, max: 36.29869842529297, mean: 3.2454, std: 3.8081	Float	Surprisal value extracted from a language model (LeoLM-13b) with the text as context.	0	nan	See script get_surprisal.py
sent_surprisal_bert-base	min: 1.1920928244535389e-07, max: 101.79562616348268, mean: 6.4507, std: 11.6153	Float	Surprisal value extracted from a language model (BERT-base) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_bert-base	min: -0.0, max: 88.84420316047726, mean: 6.2599, std: 11.5816	Float	Surprisal value extracted from a language model (BERT-base) with the text as context.	0	nan	See script get_surprisal.py
line	1-12	Integer	The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1.	0	nan	nan
FFD	min: 0, max: 2144, mean: 166.4158, std: 132.8433	Float	First-fixation duration: duration of the first fixation on a word if this word is fixated in first-pass reading, otherwise 0.	0	nan	compute_reading_measures.py
SFD	min: 0, max: 2144, mean: 118.8309, std: 135.573	Float	Single-fixation duration: duration of the only first-pass fixation on a word, 0 if the word was skipped or more than one fixation occurred in the first-pass (equals FFD in case of a single first-pass fixation).	0	nan	compute_reading_measures.py
FD	min: 0, max: 2144, mean: 203.5219, std: 116.9324	Float	First duration: duration of the first fixation on a word (identical to FFD if not skipped in the first-pass).	0	nan	compute_reading_measures.py
FPRT	min: 0, max: 9649, mean: 247.1511, std: 298.6889	Float	First-pass reading time: sum of the durations of all first-pass fixations on a word (0 if the word was skipped in the first-pass).	0	nan	compute_reading_measures.py
FRT	min: 0, max: 9649, mean: 291.8272, std: 288.631	Float	First-reading time: sum of the durations of all fixations from first fixating the word (regardless of whether the first fixation occurs in first-pass reading) until leaving the word for the first time (equals FPRT in case the word was fixated in the first-pass).	0	nan	compute_reading_measures.py
TFT	min: 0, max: 25314, mean: 632.8199, std: 720.3975	Float	Total-fixation time: sum of all fixations on a word (FPRT+RRT).	0	nan	compute_reading_measures.py
TFC	min: 0, max: 87, mean: 2.8392, std: 2.9135	Float	The total fixation count on the word.	0	nan	compute_reading_measures.py
RRT	min: 0, max: 23902, mean: 385.6688, std: 597.5206	Float	Re-reading time: sum of the durations of all fixations on a word that do not belong to the first-pass (TFT-FPRT).	0	nan	compute_reading_measures.py
RPD_inc	min: 0, max: 318898, mean: 632.8199, std: 3881.7376	Float	Inclusive regression-path duration: Sum of all fixation durations starting from the first first-pass fixation on a word until fixation on a word to the right of this word (including all regressive fixations on previous words), 0 if the word was not fixated in the first-pass (RPD_exc+RBRT).	0	nan	compute_reading_measures.py
RPD_exc	min: 0, max: 315640, mean: 342.295, std: 3815.3786	Float	Exclusive regression-path duration: Sum of all fixation durations after initiating a first-pass regression from a word until fixating a word to the right of this word, without counting fixations on the word itself (RPD_inc-RBRT).	0	nan	compute_reading_measures.py
RBRT	min: 0, max: 10675, mean: 290.5249, std: 358.8929	Float	Right-bounded reading time: Sum of all fixation durations on a word until a word to the right of this word is fixated (RPD_inc-RPD_exc).	0	nan	compute_reading_measures.py
Fix	0: 14182, 1: 127943	Categorical	Fixation: 1 if the word was fixated, otherwise 0 (FPF or RR).	0	nan	compute_reading_measures.py
FPF	0: 38408, 1: 103717	Categorical	First-pass fixation: 1 if the word was fixated in the first-pass, otherwise 0.	0	nan	compute_reading_measures.py
RR	0: 48283, 1: 93842	Categorical	Re-reading: 1 if the word was fixated after the first-pass reading, otherwise 0 (sign(RRT)).	0	nan	compute_reading_measures.py
FPReg	0: 119060, 1: 23065	Categorical	First-pass regression: 1 if a regression was initiated in the first-pass reading of the word, otherwise 0 (sign(RPD_exc)).	0	nan	compute_reading_measures.py
TRC_out	min: 0, max: 15, mean: 0.4226, std: 0.7828	Float	Total count of outgoing regressions: total number of regressive saccades initiated from this word.	0	nan	compute_reading_measures.py
TRC_in	min: 0, max: 12, mean: 0.4219, std: 0.7892	Float	Total count of incoming regressions: total number of regressive saccades landing on this word.	0	nan	compute_reading_measures.py
LP	min: 0, max: 28, mean: 2.7791, std: 2.0942	Float	Landing position: position at which the first saccade lands on the word, expressed as the ordinal position of the fixated character.	0	nan	compute_reading_measures.py
SL_in	min: -162, max: 156, mean: 1.077, std: 3.0552	Float	Incoming saccade length: length of the saccade that leads to first fixation on a word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression.	0	nan	compute_reading_measures.py
SL_out	min: -179, max: 63, mean: 0.1881, std: 7.0821	Float	Outgoing saccade length: length of the first saccade that leaves the word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression; 0 if the word is never fixated.	0	nan	compute_reading_measures.py
acc_bq_1	min: 0.0, max: 0.9993197278911564, mean: 0.3858, std: 0.4857	Float	The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1.	0	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_2	min: 0.0, max: 0.9993197278911564, mean: 0.3559, std: 0.4778	Float	The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1.	0	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_3	min: 0.0, max: 0.9992429977289932, mean: 0.4207, std: 0.4925	Float	The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1.	0	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_1	min: 0.0, max: 0.9993197278911564, mean: 0.6364, std: 0.4794	Float	The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1.	0	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_2	min: 0.0, max: 0.999250936329588, mean: 0.6322, std: 0.4805	Float	The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1.	0	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_3	min: 0.0, max: 0.9993197278911564, mean: 0.6456, std: 0.4766	Float	The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1.	0	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
mean_acc_tq	min: 0.0, max: 0.9991603694374476, mean: 0.3875, std: 0.3161	Float	The mean accuracy of all background questions for one text read by one reader.	0	nan	nan
mean_acc_bq	min: 0.0, max: 0.999250936329588, mean: 0.6381, std: 0.3139	Float	The mean accuracy of all text questions for one text read by one reader.	0	nan	nan
text_domain_numeric	0: 71550, 1: 70575	Categorical	Numerical value of text_domain; 0=biology, 1=physics.	0	nan	Manually created
trial	1-12	Integer	Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text.	0	nan	nan
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	0	nan	nan
reader_id	0-105	Integer	The unique identifier given to each reader. Reader IDs start at 0.	0	nan	Manually created
gender_numeric	0.0: 66325, 1.0: 73905, nan: 1895	Categorical	Numerical value of gender; 0=male, 1=female.	1895	nan	nan
reader_discipline_numeric	0: 81485, 1: 60640	Categorical	Numerical encoding of the reader discipline; 0=biology, 1=physics.	0	nan	Manually created
age	min: 18.0, max: 41.0, mean: 24.1644, std: 4.1809	Float	Reader's age.	3790	nan	demographic questionnaire
level_of_studies_numeric	0: 53060, 1: 89065	Categorical	Numerical value of level_of_studies; 0=beginner, 1=expert.	0	nan	demographic questionnaire
discipline_level_of_studies_numeric	0: 30320, 1: 51165, 2: 22740, 3: 37900	Categorical	Numerical value of discipline_level_of_studies; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert.	0	nan	demographic questionnaire
expert_reading_label_numeric	0: 97547, 1: 44578	Categorical	Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert). 1=expert_reading, 0=non-expert_reading	0	nan	Manually tagged
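
The numeric encodings listed for text_id_numeric and discipline_level_of_studies_numeric follow a regular pattern, which the sketch below reproduces. It is an illustration derived from the value lists in this codebook, not the code used to create the columns (they are listed as manually created or taken from the demographic questionnaire); the names TEXT_IDS, TEXT_ID_TO_NUMERIC and the helper function are hypothetical.

```python
# Text identifiers in the order given for text_id_numeric: 0=b0, ..., 5=b5, 6=p0, ..., 11=p5.
TEXT_IDS = ["b0", "b1", "b2", "b3", "b4", "b5",
            "p0", "p1", "p2", "p3", "p4", "p5"]
TEXT_ID_TO_NUMERIC = {text_id: i for i, text_id in enumerate(TEXT_IDS)}


def discipline_level_of_studies_numeric(reader_discipline_numeric,
                                        level_of_studies_numeric):
    """Combine discipline (0=biology, 1=physics) and level (0=beginner, 1=expert).

    Reproduces the listed encoding:
      0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert
    """
    return 2 * reader_discipline_numeric + level_of_studies_numeric


print(TEXT_ID_TO_NUMERIC["p0"])
print(discipline_level_of_studies_numeric(1, 0))
```

Decoding works the other way round: integer division of the combined code by 2 gives the discipline, and the remainder gives the level of studies.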

Merged: scanpaths, participant info, reading measures and word features

Contains the scanpaths for each trial merged with information on the reader, texts, etc.

Please find the files at this link: Scanpaths merged

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
fixation_index	1-1469	Integer	The index of the fixation in temporal order.	0	nan	SR Research data viewer
text_domain	bio: 4682, biology: 200017, physics: 199721	Categorical	The domain of the stimulus text.	0	nan	Manually tagged
trial	1-12	Integer	Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text.	0	nan	nan
acc_bq_1	min: 0.0, max: 1.0, mean: 0.3869, std: 0.487	Float	The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_2	min: 0.0, max: 1.0, mean: 0.3564, std: 0.4789	Float	The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_3	min: 0.0, max: 1.0, mean: 0.4217, std: 0.4938	Float	The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_1	min: 0.0, max: 1.0, mean: 0.6625, std: 0.4729	Float	The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_2	min: 0.0, max: 1.0, mean: 0.6326, std: 0.4821	Float	The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_3	min: 0.0, max: 1.0, mean: 0.6564, std: 0.4749	Float	The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
fixation_duration	2-4474	Integer	The duration of the fixation in milliseconds.	0	nan	SR Research data viewer
next_saccade_duration	1.0-9491.0	Integer	The duration of the saccade that follows a fixation in milliseconds.	46	nan	SR Research data viewer
previous_saccade_duration	1.0-9491.0	Integer	The duration of the saccade that precedes a fixation in milliseconds.	515	nan	SR Research data viewer
version	0-105	Integer	Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in the items.tsv.	0	nan	nan
line	1-12	Integer	The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1.	0	nan	nan
aoi	1-1121	Integer	The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated.	0	nan	SR Research experiment builder
char_index_in_line	1-100	Integer	Index of a character in the line. Indexing starts at 1.	0	nan	nan
original_fixation_index	1-1478	Integer	The index of the uncorrected fixation.	0	nan	SR Research data viewer
is_fixation_adjusted	False: 382202, True: 22218	Categorical	Whether or not the fixation has been adjusted manually.	0	nan	Manually tagged.
reader_id	0-105	Integer	The unique identifier given to each reader. Reader IDs start at 0.	0	nan	Manually created
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	0	nan	nan
word_index_in_text	1-180	Integer	The index of the word in the text. Indexing starts at 1.	0	nan	nan
sent_index_in_text	1-12	Integer	The index of a sentence in the respective text. Indexing starts at 1.	0	nan	nan
char_index_in_text	1-1121	Integer	Index of a character in the text. Indexing starts at 1.	0	nan	nan
word		string	Words as they appear in the stimuli texts. Words are split at white-space.	0	nan	nan
character		string	Character as text.	0	nan	nan
text_id_numeric	0-11	Integer	Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5	0	nan	Manually created
text_domain_numeric	0: 204699, 1: 199721	Categorical	Numerical value of text_domain; 0=biology, 1=physics.	0	nan	Manually created
reader_discipline_numeric	0: 223158, 1: 181262	Categorical	Numerical encoding of the reader discipline; 0=biology, 1=physics.	0	nan	Manually created
level_of_studies_numeric	0: 154333, 1: 250087	Categorical	Numerical value of level_of_studies; 0=beginner, 1=expert.	0	nan	demographic questionnaire
expert_reading_label_numeric	0: 290883, 1: 113537	Categorical	Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert). 1=expert_reading, 0=non-expert_reading	0	nan	Manually tagged
expert_reading_label	expert_reading: 113537, non-expert_reading: 290883	Categorical	Whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert)	0	nan	Manually tagged
word_with_punct		string	The word as it appears in the text, including punctuation.	96	nan	nan
word_index_in_sent	1-51	Integer	The index of the word in the sentence. Indexing starts at 1.	0	nan	nan
word_length	2-33	Integer	Word length is defined in number of characters including symbols like hyphens but without sentence punctuation at the end (e.g., z.B. = 4 characters; DNA-Kette = 9 characters; eats. = 4 characters).	0	nan	nan
STTS_punctuation_before	0.0: 211108, 0: 189407, $(: 3905	Categorical	If a word is preceded by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here.	0	nan	Manually tagged
STTS_punctuation_after	$(: 3260, $($,: 573, $,: 22559, $.: 25794, 0: 352234	Categorical	If a word is followed by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here.	0	nan	Manually tagged
is_in_quote	0: 399715, 1: 4705	Categorical	Whether or not the word is part of an expression in quotes.	0	nan	Manually tagged
is_in_parentheses	0: 403155, 1: 1265	Categorical	Whether or not the word is part of a phrase in parentheses.	0	nan	Manually tagged
is_clause_beginning	0: 388232, 1: 16188	Categorical	Whether or not the word is the beginning of a clause.	0	nan	Manually tagged
is_sent_beginning	0: 386681, 1: 17739	Categorical	Whether or not the word is the beginning of a new sentence.	0	nan	Manually tagged
is_clause_end	0: 381545, 1: 22875	Categorical	Whether or not the word is the end of a clause.	0	nan	Manually tagged
is_sent_end	0: 380027, 1: 24393	Categorical	Whether or not the word is the end of a sentence.	0	nan	Manually tagged
is_abbreviation	0: 403478, 1: 942	Categorical	Whether or not the entire word is an abbreviation.	0	nan	Manually tagged
is_expert_technical_term	0: 332354, 1: 72066	Categorical	1 if the word is a technical term that is not generally understandable. E.g.: "Agarose".	0	nan	Manually tagged
is_general_technical_term	0: 325333, 1: 79087	Categorical	1 if the word is a technical term that is generally understandable. E.g.: "elektrisch"	0	nan	Manually tagged
contains_symbol	0: 400458, 1: 3962	Categorical	Whether or not the word contains a symbol. E.g.: β-D-Glucose	0	nan	Manually tagged
contains_hyphen	0: 388149, 1: 16271	Categorical	Whether or not the word contains a hyphen, e.g. 1 for DNA-Fragment. Words with the tag TRUNC (first element of a compound) do not count as containing a hyphen: in "Sekundär- und Tertiärstrukturen", "Sekundär-" is coded 0.	0	nan	Manually tagged
contains_abbreviation	0: 399423, 1: 4997	Categorical	Whether or not the word contains an abbreviation. 0 for words that are only an abbreviation. See is_abbreviation. E.g. 1 for DNA-Fragment, 0 for DNA.	0	nan	Manually tagged
STTS_PoS_tag	ADJA: 51041, ADJD: 12714, ADV: 12236, APPR: 22470, APPRART: 5566, APZR: 91, ART: 37340, CARD: 1594, KOKOM: 2428, KON: 5798, KOUI: 654, KOUS: 2521, NE: 955, NN: 162980, PAV: 3444, PDAT: 3292, PDS: 1374, PIAT: 791, PIDAT: 1653, PIS: 1322, PPER: 2511, PPOSAT: 1360, PRELAT: 1302, PRELS: 4193, PRF: 3606, PTKA: 97, PTKNEG: 687, PTKVZ: 1490, PTKZU: 583, PWAV: 76, TRUNC: 1137, VAFIN: 10340, VAINF: 1206, VMFIN: 3953, VMINF: 153, VVFIN: 23854, VVINF: 7713, VVIZU: 578, VVPP: 9317	Categorical	Part-of-speech tags according to the STTS-tagset. See stimuli/ANNOTATION.MD for more information.	0	nan	Manually tagged
type		string	The orthographical representation of a word as found in the corpus; this data is case sensitive, i.e. there is a distinction between name and Name.	0	nan	dlexDB
type_length_chars	0.0-33.0	Integer	The length of the type of a word in characters. See the description of word_length for a definition of how characters are counted.	0	nan	nan
PoS_tag	adja: 53330, adjd: 12226, adv: 15728, appr: 22193, apprart: 5566, art: 37918, card: 1594, kokom: 2428, kon: 5405, koui: 559, kous: 2521, ne: 1386, nn: 160585, pdat: 3292, pds: 1374, piat: 791, pidat: 352, pis: 2063, pper: 2434, pposat: 1360, prelat: 1302, prels: 4076, prf: 3606, ptka: 97, ptkneg: 687, ptkvz: 1891, ptkzu: 583, pwav: 76, trunc: 1137, vafin: 10340, vainf: 1206, vmfin: 3829, vminf: 153, vvfin: 23978, vvinf: 7713, vvizu: 578, vvpp: 9317, xy: 746	Categorical	Part-of-speech tag as defined by the dlexDB query.	0	nan	dlexDB
lemma		string	nan	0	nan	dlexDB
lemma_length_chars	0.0-32.0	Integer	nan	0	nan	dlexDB
syllables		string	nan	0	nan	dlexDB
type_length_syllables	0.0-14.0	Integer	nan	0	nan	dlexDB
annotated_type_frequency_normalized	min: 0.0, max: 24738.5901996, mean: 1950.9055, std: 5185.3006	Float	The number of occurrences of an annotated type in corpus. An annotated type is a unique combination of a type, its part-of-speech tag and its lemma.	0	nan	dlexDB
type_frequency_normalized	min: 0.0, max: 26530.3631386, mean: 2247.4523, std: 5847.2187	Float	nan	0	nan	dlexDB
lemma_frequency_normalized	min: 0.0, max: 80100.3069113, mean: 7203.2409, std: 19769.4428	Float	nan	0	nan	dlexDB
familiarity_normalized	min: 0.0, max: 26530.3631386, mean: 2191.7786, std: 5759.2592	Float	nan	0	nan	dlexDB
regularity_normalized	min: 0.0, max: 2123.30585022, mean: 46.8657, std: 137.5046	Float	nan	0	nan	dlexDB
document_frequency_normalized	min: 0.0, max: 9372.80956103, mean: 1684.1043, std: 2829.0626	Float	nan	0	nan	dlexDB
sentence_frequency_normalized	min: 0.0, max: 30912.3596552, mean: 3137.4539, std: 7374.8037	Float	nan	0	nan	dlexDB
cumulative_syllable_corpus_frequency_normalized	min: 0.0, max: 125126.524676, mean: 15768.7784, std: 17529.5528	Float	nan	0	nan	dlexDB
cumulative_syllable_lexicon_frequency_normalized	min: 0.0, max: 218985.607753, mean: 27232.3183, std: 36883.9628	Float	nan	0	nan	dlexDB
cumulative_character_corpus_frequency_normalized	min: 0.0, max: 7810554.20193, mean: 2053804.334, std: 1596380.3916	Float	nan	0	nan	dlexDB
cumulative_character_lexicon_frequency_normalized	min: 0.0, max: 18380479.713, mean: 4612580.9638, std: 3597155.0404	Float	nan	0	nan	dlexDB
cumulative_character_bigram_corpus_frequency_normalized	min: 0.0, max: 1322150.62097, mean: 356831.454, std: 269772.388	Float	nan	0	nan	dlexDB
cumulative_character_bigram_lexicon_frequency_normalized	min: 0.0, max: 2788357.77704, mean: 629626.1651, std: 539088.9742	Float	nan	0	nan	dlexDB
cumulative_character_trigram_corpus_frequency_normalized	min: 0.0, max: 603427.130456, mean: 200341.8076, std: 144122.7012	Float	nan	0	nan	dlexDB
cumulative_character_trigram_lexicon_frequency_normalized	min: 0.0, max: 899592.89035, mean: 236423.2776, std: 199573.1416	Float	nan	0	nan	dlexDB
initial_letter_frequency_normalized	min: 0.0, max: 110461.430317, mean: 28045.0077, std: 30618.9167	Float	nan	0	nan	dlexDB
initial_bigram_frequency_normalized	min: 0.0, max: 53801.2331077, mean: 8706.0335, std: 12743.2638	Float	nan	0	nan	dlexDB
initial_trigram_frequency_normalized	min: -0.00817507899599, max: 29048.3692201, mean: 3754.6304, std: 7393.1224	Float	nan	0	nan	dlexDB
avg_cond_prob_in_bigrams	min: 0.0, max: 0.5006180465, mean: 0.0313, std: 0.0466	Float	The conditional probability of the bigram, given the occurrence of its first component. In other words, how likely it is for the second component to follow directly after the first. Here, this measure is computed on the basis of the annotated type information.	0	nan	dlexDB
avg_cond_prob_in_trigrams	min: 0.0, max: 25.0, mean: 0.2251, std: 0.8814	Float	The conditional probability of the trigram, given the occurrence of its initial bigram. In other words, how likely it is for the third component to follow directly after the initial pair. Here, this measure is computed on the basis of the annotated type information.	0	nan	dlexDB
neighbors_coltheart_higher_freq_cum_freq_normalized	min: 0.0, max: 44055.247282, mean: 1276.643, std: 5775.4034	Float	nan	0	nan	dlexDB
neighbors_coltheart_higher_freq_count_normalized	min: 0.0, max: 8.13363128109, mean: 0.1556, std: 0.4321	Float	nan	0	nan	dlexDB
neighbors_coltheart_all_cum_freq_normalized	min: 0.0, max: 49782.1108458, mean: 2794.1781, std: 7982.6321	Float	nan	0	nan	dlexDB
neighbors_coltheart_all_count_normalized	min: 0.0, max: 47.5175301158, mean: 9.0448, std: 12.679	Float	nan	0	nan	dlexDB
neighbors_levenshtein_higher_freq_cum_freq_normalized	min: 0.0, max: 44055.247282, mean: 1683.6273, std: 6153.8504	Float	nan	0	nan	dlexDB
neighbors_levenshtein_higher_freq_count_normalized	min: 0.0, max: 11.9864039932, mean: 0.2681, std: 0.5814	Float	nan	0	nan	dlexDB
neighbors_levenshtein_all_cum_freq_normalized	min: 0.0, max: 54875.2749862, mean: 3761.4734, std: 9299.5647	Float	nan	0	nan	dlexDB
neighbors_levenshtein_all_count_normalized	min: 0.0, max: 75.7711966712, mean: 14.1417, std: 19.6383	Float	nan	0	nan	dlexDB
sent_surprisal_gpt2-base	min: 0.0005104430601932, max: 56.804420471191406, mean: 10.0061, std: 9.1114	Float	Surprisal value extracted from a language model (GerPT2-base) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_gpt2-base	min: 0.0002225389762315, max: 53.041446685791016, mean: 8.0061, std: 8.0873	Float	Surprisal value extracted from a language model (GerPT2-base) with the text as context.	0	nan	See script get_surprisal.py
sent_surprisal_gpt2-large	min: 0.0002048997703241, max: 42.28059005737305, mean: 8.76, std: 8.0159	Float	Surprisal value extracted from a language model (GerPT2-large) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_gpt2-large	min: 0.0001027531252475, max: 35.38883209228516, mean: 6.6792, std: 6.6522	Float	Surprisal value extracted from a language model (GerPT2-large) with the text as context.	0	nan	See script get_surprisal.py
sent_surprisal_llama-7b	min: 0.0001720042055239, max: 42.96158599853516, mean: 8.0373, std: 7.0611	Float	Surprisal value extracted from a language model (LeoLM-7b) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_llama-7b	min: 1.990775308513548e-05, max: 35.62324142456055, mean: 4.7991, std: 4.9022	Float	Surprisal value extracted from a language model (LeoLM-7b) with the text as context.	0	nan	See script get_surprisal.py
sent_surprisal_llama-13b	min: 8.702239938429557e-06, max: 46.25139999389648, mean: 7.7768, std: 7.1775	Float	Surprisal value extracted from a language model (LeoLM-13b) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_llama-13b	min: 9.298280929215252e-06, max: 36.29869842529297, mean: 4.5172, std: 4.9048	Float	Surprisal value extracted from a language model (LeoLM-13b) with the text as context.	0	nan	See script get_surprisal.py
sent_surprisal_bert-base	min: 1.1920928244535389e-07, max: 101.79562616348268, mean: 8.1926, std: 13.1873	Float	Surprisal value extracted from a language model (BERT-base) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_bert-base	min: -0.0, max: 88.84420316047726, mean: 7.487, std: 12.7275	Float	Surprisal value extracted from a language model (BERT-base) with the text as context.	0	nan	See script get_surprisal.py
FFD	min: 0, max: 2144, mean: 195.9741, std: 124.5597	Float	First-fixation duration: duration of the first fixation on a word if this word is fixated in first-pass reading, otherwise 0.	0	nan	compute_reading_measures.py
SFD	min: 0, max: 2144, mean: 107.9483, std: 134.474	Float	Single-fixation duration: duration of the only first-pass fixation on a word, 0 if the word was skipped or more than one fixation occurred in the first-pass (equals FFD in case of a single first-pass fixation).	0	nan	compute_reading_measures.py
FD	min: 0, max: 2144, mean: 226.9857, std: 103.7904	Float	First duration: duration of the first fixation on a word (identical to FFD if not skipped in the first-pass).	0	nan	compute_reading_measures.py
FPRT	min: 0, max: 9649, mean: 408.9247, std: 526.0428	Float	First-pass reading time: sum of the durations of all first-pass fixations on a word (0 if the word was skipped in the first-pass).	0	nan	compute_reading_measures.py
FRT	min: 0, max: 9649, mean: 456.8788, std: 518.1388	Float	First-reading time: sum of the durations of all fixations from first fixating the word (regardless of whether the first fixation occurs in first-pass reading) until leaving the word for the first time (equals FPRT in case the word was fixated in the first-pass).	0	nan	compute_reading_measures.py
TFT	min: 0, max: 25314, mean: 1333.0163, std: 1428.494	Float	Total-fixation time: sum of all fixations on a word (FPRT+RRT).	0	nan	compute_reading_measures.py
TFC	min: 0, max: 87, mean: 5.8238, std: 5.5152	Float	The total fixation count on the word.	0	nan	compute_reading_measures.py
RRT	min: 0, max: 23902, mean: 924.0916, std: 1240.0587	Float	Re-reading time: sum of the durations of all fixations on a word that do not belong to the first-pass (TFT-FPRT).	0	nan	compute_reading_measures.py
RPD_inc	min: 0, max: 318898, mean: 1076.7946, std: 5339.73	Float	Inclusive regression-path duration: Sum of all fixation durations starting from the first first-pass fixation on a word until fixation on a word to the right of this word (including all regressive fixations on previous words), 0 if the word was not fixated in the first-pass (RPD_exc+RBRT).	0	nan	compute_reading_measures.py
RPD_exc	min: 0, max: 315640, mean: 557.5849, std: 5209.143	Float	Exclusive regression-path duration: Sum of all fixation durations after initiating a first-pass regression from a word until fixating a word to the right of this word, without counting fixations on the word itself (RPD_inc-RBRT).	0	nan	compute_reading_measures.py
RBRT	min: 0, max: 10675, mean: 519.2098, std: 638.9024	Float	Right-bounded reading time: Sum of all fixation durations on a word until a word to the right of this word is fixated (RPD_inc-RPD_exc).	0	nan	compute_reading_measures.py
Fix	0: 110, 1: 404310	Categorical	Fixation: 1 if the word was fixated, otherwise 0 (FPF or RR).	0	nan	compute_reading_measures.py
FPF	0: 56838, 1: 347582	Categorical	First-pass fixation: 1 if the word was fixated in the first-pass, otherwise 0.	0	nan	compute_reading_measures.py
RR	0: 48241, 1: 356179	Categorical	Re-reading: 1 if the word was fixated after the first-pass reading, otherwise 0 (sign(RRT)).	0	nan	compute_reading_measures.py
FPReg	0: 308156, 1: 96264	Categorical	First-pass regression: 1 if a regression was initiated in the first-pass reading of the word, otherwise 0 (sign(RPD_exc)).	0	nan	compute_reading_measures.py
TRC_out	min: 0, max: 15, mean: 0.8249, std: 1.193	Float	Total count of outgoing regressions: total number of regressive saccades initiated from this word.	0	nan	compute_reading_measures.py
TRC_in	min: 0, max: 12, mean: 0.7776, std: 1.1734	Float	Total count of incoming regressions: total number of regressive saccades landing on this word.	0	nan	compute_reading_measures.py
LP	min: 1, max: 28, mean: 3.3887, std: 2.3225	Float	Landing position: position of the first saccade on the word expressed by ordinal position of the fixated character.	0	nan	compute_reading_measures.py
SL_in	min: -162, max: 156, mean: 1.3449, std: 2.928	Float	Incoming saccade length: length of the saccade that leads to first fixation on a word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression.	0	nan	compute_reading_measures.py
SL_out	min: -179, max: 63, mean: -0.0835, std: 7.9375	Float	Outgoing saccade length: length of the first saccade that leaves the word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression; 0 if the word is never fixated.	0	nan	compute_reading_measures.py
mean_acc_tq	min: 0.0, max: 0.9991603694374476, mean: 0.3819, std: 0.3148	Float	The mean accuracy of all text questions for one text read by one reader.	0	nan	nan
mean_acc_bq	min: 0.0, max: 0.999250936329588, mean: 0.6398, std: 0.312	Float	The mean accuracy of all background questions for one text read by one reader.	0	nan	nan
gender_numeric	0.0: 187536, 1.0: 212874, nan: 4010	Categorical	Numerical value of gender; 0=male, 1=female.	4010	nan	nan
age	min: 18.0, max: 41.0, mean: 24.0283, std: 4.1436	Float	Reader's age.	8459	nan	demographic questionnaire
discipline_level_of_studies_numeric	0: 89325, 1: 133833, 2: 65008, 3: 116254	Categorical	Numerical value of discipline_level_of_studies; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert.	0	nan	demographic questionnaire
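
Several reading-measure descriptions above state relations between columns (TFT = FPRT + RRT, RPD_inc = RPD_exc + RBRT, Fix = FPF or RR, RR = sign(RRT), FPReg = sign(RPD_exc)). The sketch below shows one way to check them on a single row. It is only an illustration: the function names and the example values are made up, and whether every row of the released files satisfies the relations exactly has not been verified here.

```python
def sign(x):
    """Return 1 for positive values, 0 for zero, -1 for negative values."""
    return (x > 0) - (x < 0)


def violated_identities(row):
    """Return the names of the documented relations that `row` violates.

    `row` is a hypothetical dict mapping column names to numbers, e.g. one
    line of the merged file after converting the strings to int or float.
    """
    checks = {
        "TFT == FPRT + RRT": row["TFT"] == row["FPRT"] + row["RRT"],
        "RPD_inc == RPD_exc + RBRT": row["RPD_inc"] == row["RPD_exc"] + row["RBRT"],
        "Fix == (FPF or RR)": row["Fix"] == int(bool(row["FPF"]) or bool(row["RR"])),
        "RR == sign(RRT)": row["RR"] == sign(row["RRT"]),
        "FPReg == sign(RPD_exc)": row["FPReg"] == sign(row["RPD_exc"]),
    }
    return [name for name, holds in checks.items() if not holds]


# Made-up values for illustration only; they are not taken from the corpus.
example = {
    "FPRT": 200, "RRT": 300, "TFT": 500,
    "RBRT": 200, "RPD_exc": 150, "RPD_inc": 350,
    "FPF": 1, "RR": 1, "Fix": 1, "FPReg": 1,
}
print(violated_identities(example))
```

An empty list means that all five relations hold for the row; any names returned point to the columns worth inspecting.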

AOI to word mapping

Contains the mapping of each aoi to the respective word in each of the texts.

Please find the file at this link: aoi to word mapping

Column name	Possible values	Value type	Description	Missing value description	Source
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	nan	nan
word_index_in_text	1-180	Integer	The index of the word in the text. Indexing starts at 1.	nan	nan
char_index_in_text	1-1121	Integer	Index of a character in the text. Indexing starts at 1.	nan	nan
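
Since the aoi column of the fixation files is a character index, this mapping can be used to look up the word a fixation landed on. The following is a minimal sketch under that assumption; the function name and the three sample rows are hypothetical, and the rows are written as dicts of strings, as csv.DictReader would produce them.

```python
def build_aoi_to_word(mapping_rows):
    """Map (text_id, char_index_in_text) to word_index_in_text."""
    lookup = {}
    for r in mapping_rows:
        key = (r["text_id"], int(r["char_index_in_text"]))
        lookup[key] = int(r["word_index_in_text"])
    return lookup


# Made-up rows for illustration; real rows come from the mapping file.
mapping_rows = [
    {"text_id": "b0", "word_index_in_text": "1", "char_index_in_text": "1"},
    {"text_id": "b0", "word_index_in_text": "1", "char_index_in_text": "2"},
    {"text_id": "b0", "word_index_in_text": "2", "char_index_in_text": "4"},
]
lookup = build_aoi_to_word(mapping_rows)

# A fixation on character 4 of text b0; .get returns None for unmapped AOIs.
fixated_word = lookup.get(("b0", 4))
```

Keying on the pair (text_id, char_index_in_text) matters because character indices restart at 1 in every text.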

Participants

In the participants' data file, all demographic information is stored.

Please find the file at this link: Participant information

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
reader_id	0-105	Integer	The unique identifier given to each reader. Reader IDs start at 0.	0	nan	Manually created
reader_discipline	biology: 43, physics: 32	Categorical	The area of expertise of the reader. All readers are students whose major is either physics or biology.	0	nan	demographic questionnaire
reader_discipline_numeric	0: 43, 1: 32	Categorical	Numerical encoding of the reader discipline; 0=biology, 1=physics.	0	nan	Manually created
level_of_studies	graduate: 47, undergraduate: 28	Categorical	Reader's level of studies. Readers are considered experts if they are either MSc or PhD students. 1st semester BSc students are considered beginners.	0	nan	demographic questionnaire
level_of_studies_numeric	0: 28, 1: 47	Categorical	Numerical value of level_of_studies; 0=beginner, 1=expert.	0	nan	demographic questionnaire
discipline_level_of_studies	biology-graduate: 27, biology-undergraduate: 16, physics-graduate: 20, physics-undergraduate: 12	Categorical	The combination of the readers' major (reader_discipline) and their expertise (level_of_studies).	0	nan	demographic questionnaire
discipline_level_of_studies_numeric	0: 16, 1: 27, 2: 12, 3: 20	Categorical	Numerical value of discipline_level_of_studies; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert.	0	nan	demographic questionnaire
glasses	no: 54, yes: 20, nan: 1	Categorical	Whether or not reader is wearing glasses.	1	nan	demographic questionnaire
age	min: 18.0, max: 41.0, mean: 24.1644, std: 4.2098	Float	Reader's age.	2	nan	demographic questionnaire
handedness	right: 68, left: 6, nan: 1	Categorical	Reader's handedness.	1	nan	demographic questionnaire
hours_sleep	min: 0.0, max: 11.0, mean: 7.2095, std: 1.3138	Float	The hours of sleep of the participant before the experiment.	1	nan	demographic questionnaire
alcohol	no: 71, yes: 3, nan: 1	Categorical	Whether or not a participant consumed alcohol within 24 hours before the experiment start.	1	nan	demographic questionnaire
gender	female: 39, male: 35, nan: 1	Categorical	Reader's gender.	1	nan	demographic questionnaire
gender_numeric	0.0: 35, 1.0: 39, nan: 1	Categorical	Numerical value of gender; 0=male, 1=female.	1	nan	nan
semester		string	The semester the reader is currently enrolled in.	1	nan	demographic questionnaire
bilingual	n: 73, j: 1, nan: 1	Categorical	Whether the reader is bilingual.	1	nan	demographic questionnaire
state		string	The German state the reader is from.	1	nan	demographic questionnaire
grade		string	The grade of the reader in their university entrance diploma.	4	nan	demographic questionnaire
subject_detailed			The detailed subject of the reader's major.	1	nan	demographic questionnaire
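
The numeric codes of discipline_level_of_studies_numeric listed above are consistent with combining the two binary codes as 2 * reader_discipline_numeric + level_of_studies_numeric. The sketch below illustrates this. The function and dictionary names are hypothetical, and the mapping undergraduate=0 (beginner), graduate=1 (expert) is inferred from the matching counts in the table rather than stated explicitly.

```python
# Codes as listed in this codebook: 0=biology, 1=physics.
DISCIPLINE = {"biology": 0, "physics": 1}
# Assumed from the counts: undergraduate=beginner=0, graduate=expert=1.
LEVEL = {"undergraduate": 0, "graduate": 1}


def discipline_level_of_studies_numeric(reader_discipline, level_of_studies):
    """Hypothetical helper returning 0..3 as in the codebook:
    0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert.
    """
    return 2 * DISCIPLINE[reader_discipline] + LEVEL[level_of_studies]
```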

Participants' response accuracy

The response accuracy for each participant for each question.

Please find the file at this link: Participant response accuracy

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
reader_id	0-105	Integer	The unique identifier given to each reader. Reader IDs start at 0.	0	nan	Manually created
reader_discipline	biology: 516, physics: 384	Categorical	The area of expertise of the reader. All readers are students whose major is either physics or biology.	0	nan	demographic questionnaire
reader_discipline_numeric	0: 516, 1: 384	Categorical	Numerical encoding of the reader discipline; 0=biology, 1=physics.	0	nan	Manually created
level_of_studies	graduate: 564, undergraduate: 336	Categorical	Reader's level of studies. Readers are considered experts if they are either MSc or PhD students. 1st semester BSc students are considered beginners.	0	nan	demographic questionnaire
level_of_studies_numeric	0: 336, 1: 564	Categorical	Numerical value of level_of_studies; 0=beginner, 1=expert.	0	nan	demographic questionnaire
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	0	nan	nan
text_domain	biology: 450, physics: 450	Categorical	The domain of the stimulus text.	0	nan	Manually tagged
expert_reading_label	expert-reading: 282, non-expert-reading: 618	Categorical	Whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert)	0	nan	Manually tagged
expert_reading_label_numeric	0: 618, 1: 282	Categorical	Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert). 1=expert_reading, 0=non-expert_reading	0	nan	Manually tagged
acc_tq_1	min: 0.0, max: 1.0, mean: 0.6475, std: 0.478	Float	The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1.	12	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_2	min: 0.0, max: 1.0, mean: 0.6441, std: 0.479	Float	The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1.	12	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_3	min: 0.0, max: 1.0, mean: 0.6509, std: 0.477	Float	The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1.	12	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_1	min: 0.0, max: 1.0, mean: 0.393, std: 0.4887	Float	The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1.	12	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_2	min: 0.0, max: 1.0, mean: 0.366, std: 0.482	Float	The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1.	12	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_3	min: 0.0, max: 1.0, mean: 0.4234, std: 0.4944	Float	The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1.	12	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
mean_acc_tq	min: 0.0, max: 1.0, mean: 0.6475, std: 0.3082	Float	The mean accuracy of all text questions for one text read by one reader.	12	nan	nan
mean_acc_bq	min: 0.0, max: 1.0, mean: 0.3941, std: 0.3163	Float	The mean accuracy of all background questions for one text read by one reader.	12	nan	nan
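
The per-text means can be related to the per-question columns as sketched below. The sketch assumes that the mean is the plain average of the three acc_tq_* (or acc_bq_*) values, reading tq as text question and bq as background question as in the acc_* descriptions. The function name is hypothetical, and returning None for trials with missing accuracies is an assumption about how such trials might be handled, not a documented rule.

```python
def mean_accuracy(row, question_type):
    """Average the three accuracies of one type for one reader and text.

    `question_type` is "tq" (text questions) or "bq" (background questions).
    `row` is a hypothetical dict with numeric values, None marking a
    missing measurement.
    """
    values = [row[f"acc_{question_type}_{i}"] for i in (1, 2, 3)]
    if any(v is None for v in values):
        return None  # assumption: no mean when a measurement is missing
    return sum(values) / len(values)
```

With two correct answers out of three, the function returns 2/3.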

Coding of the answers of the online survey

This file explains the values used in the online survey answer file (response_data_online_survey.csv). Each variable has four different options, each of which is expressed as a numerical value and mapped to the text option the participant saw.

Please find the file at this link: Answer coding online survey

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
VAR		string	Variable name of the fields in the participant online survey. These are explanations of the names of the columns in the file: response_data_online_survey.csv	0	nan	online survey tool
RESPONSE	-9: 46, 0: 2, 1: 95, 2: 92, 3: 93, 4: 54, 5: 13, 6: 13, 7: 12, 8: 12, 9: 12, 10: 12, 11: 12, 12: 12	Categorical	The response code given by the online survey tool. In the answer file these codes are used.	0	nan	online survey tool
MEANING			The literal meaning of the response. What the participant could see in the online survey.	0	nan	online survey tool
CORRECT_ANSWER	nan: 312, False: 126, True: 42	Categorical	Whether or not this answer is a correct answer.	312	The value is missing if this is not applicable, for example if the response code means that the participant did not answer.	online survey tool
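
A response code in the answer file can be translated back into its text by keying the coding table on the pair (VAR, RESPONSE). The following is a minimal sketch of that lookup; the function name, the variable name Q01 and the meanings are hypothetical placeholders, not values from the actual coding file.

```python
def build_answer_coding(coding_rows):
    """Map (VAR, RESPONSE) to MEANING for rows of the coding file."""
    return {(r["VAR"], int(r["RESPONSE"])): r["MEANING"] for r in coding_rows}


# Made-up rows for illustration; real rows come from the coding file.
coding_rows = [
    {"VAR": "Q01", "RESPONSE": "1", "MEANING": "option A"},
    {"VAR": "Q01", "RESPONSE": "2", "MEANING": "option B"},
]
coding = build_answer_coding(coding_rows)

# Decode the code 2 that a participant gave for the hypothetical variable Q01.
meaning = coding.get(("Q01", 2))
```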

Response accuracy online survey

This file contains the response accuracy for the participants from the online survey.

Please find the file at this link: Response accuracy

Column name	Possible values	Value type	Description	Missing value description	Source
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	nan	nan
text_domain	biology: 210, physics: 210	Categorical	The domain of the stimulus text.	nan	Manually tagged
mean_acc_tq	min: 0.0, max: 1.0, mean: 0.2619, std: 0.2495	Float	The mean accuracy of all background questions for one text read by one reader.	nan	nan
reader_discipline	biology: 108, other: 156, physics: 156	Categorical	The area of expertise of the reader. All readers are students whose major is either physics or biology.	nan	demographic questionnaire
level_of_studies	graduate: 264, other: 156	Categorical	Reader's level of studies. Readers are considered experts if they are either MSc or PhD students. 1st semester BSc students are considered beginners.	nan	demographic questionnaire

Response data online survey

The original response data from the online survey. The coding of the values contained in this file is found in answer_coding_online_survey.csv, which is why the table below is empty. Please note that many values in this file are not relevant for this corpus. E.g., all columns starting with RA specify the randomization and all columns starting with TIME contain response time information.

Please find the file at this link: Response data online survey

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
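
Since the RA and TIME columns are described as not relevant for this corpus, a reader may want to drop them before working with the file. The sketch below filters a header by prefix; the header shown is hypothetical, and the function name `relevant_columns` is introduced here for illustration only.

```python
def relevant_columns(header):
    """Drop randomization (RA...) and response-time (TIME...) columns."""
    return [name for name in header if not name.startswith(("RA", "TIME"))]

# Hypothetical header; the real field names are listed in the VAR column
# of the answer coding file.
header = ["CASE", "RA01", "Q01", "TIME001", "Q02"]
print(relevant_columns(header))  # prints: ['CASE', 'Q01', 'Q02']
```

A pure prefix match would also drop any other column whose name happens to start with RA or TIME, so the result is worth checking against the coding file.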

Merged: scanpaths, participant info, reading measures and word features

Contains the scanpaths for each trial merged with information on the reader, texts, etc.

Please find the files at this link: Scanpaths merged

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
fixation_index	1-1469	Integer	The index of the fixation in temporal order.	0	nan	SR Research data viewer
text_domain	bio: 4682, biology: 200017, physics: 199721	Categorical	The domain of the stimulus text.	0	nan	Manually tagged
trial	1-12	Integer	Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text.	0	nan	nan
acc_bq_1	min: 0.0, max: 1.0, mean: 0.3869, std: 0.487	Float	The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_2	min: 0.0, max: 1.0, mean: 0.3564, std: 0.4789	Float	The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_3	min: 0.0, max: 1.0, mean: 0.4217, std: 0.4938	Float	The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_1	min: 0.0, max: 1.0, mean: 0.6625, std: 0.4729	Float	The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_2	min: 0.0, max: 1.0, mean: 0.6326, std: 0.4821	Float	The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_3	min: 0.0, max: 1.0, mean: 0.6564, std: 0.4749	Float	The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
fixation_duration	2-4474	Integer	The duration of the fixation in milliseconds.	0	nan	SR Research data viewer
next_saccade_duration	1.0-9491.0	Integer	The duration of the saccade that follows a fixation in milliseconds.	46	nan	SR Research data viewer
previous_saccade_duration	1.0-9491.0	Integer	The duration of the saccade that precedes a fixation in milliseconds.	515	nan	SR Research data viewer
version	0-105	Integer	Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in the items.tsv.	0	nan	nan
line	1-12	Integer	The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1.	0	nan	nan
aoi	1-1121	Integer	The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated.	0	nan	SR Research experiment builder
char_index_in_line	1-100	Integer	Index of a character in the line. Indexing starts at 1.	0	nan	nan
original_fixation_index	1-1478	Integer	The index of the uncorrected fixation.	0	nan	SR Research data viewer
is_fixation_adjusted	False: 382202, True: 22218	Categorical	Whether or not the fixation has been adjusted manually.	0	nan	Manually tagged.
reader_id	0-105	Integer	The unique identifier given to each reader. Reader IDs start at 0.	0	nan	Manually created
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	0	nan	nan
word_index_in_text	1-180	Integer	The index of the word in the text. Indexing starts at 1.	0	nan	nan
sent_index_in_text	1-12	Integer	The index of a sentence in the respective text. Indexing starts at 1.	0	nan	nan
char_index_in_text	1-1121	Integer	Index of a character in the text. Indexing starts at 1.	0	nan	nan
word		string	Words as they appear in the stimuli texts. Words are split at white-space.	0	nan	nan
character		string	Character as text.	0	nan	nan
text_id_numeric	0-11	Integer	Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5	0	nan	Manually created
text_domain_numeric	0: 204699, 1: 199721	Categorical	Numerical value of text_domain; 0=biology, 1=physics.	0	nan	Manually created
reader_discipline_numeric	0: 223158, 1: 181262	Categorical	Numerical encoding of the reader discipline; 0=biology, 1=physics.	0	nan	Manually created
level_of_studies_numeric	0: 154333, 1: 250087	Categorical	Numerical value of level_of_studies; 0=beginner, 1=expert.	0	nan	demographic questionnaire
expert_reading_label_numeric	0: 290883, 1: 113537	Categorical	Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert). 1=expert_reading, 0=non-expert_reading	0	nan	Manually tagged
expert_reading_label	expert_reading: 113537, non-expert_reading: 290883	Categorical	Whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert)	0	nan	Manually tagged
word_with_punct		string	The word as it appears in the text, including punctuation.	96	nan	nan
word_index_in_sent	1-51	Integer	The index of the word in the sentence. Indexing starts at 1.	0	nan	nan
word_length	2-33	Integer	Word length is defined as the number of characters, including symbols like hyphens but excluding sentence punctuation at the end (e.g., z.B. = 4 characters; DNA-Kette = 9 characters; eats. = 4 characters).	0	nan	nan
STTS_punctuation_before	0.0: 211108, 0: 189407, $(: 3905	Categorical	If a word is preceded by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here.	0	nan	Manually tagged
STTS_punctuation_after	$(: 3260, $($,: 573, $,: 22559, $.: 25794, 0: 352234	Categorical	If a word is followed by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here.	0	nan	Manually tagged
is_in_quote	0: 399715, 1: 4705	Categorical	Whether or not the word is part of an expression in quotes.	0	nan	Manually tagged
is_in_parentheses	0: 403155, 1: 1265	Categorical	Whether or not the word is part of a phrase in parentheses.	0	nan	Manually tagged
is_clause_beginning	0: 388232, 1: 16188	Categorical	Whether or not the word is the beginning of a clause.	0	nan	Manually tagged
is_sent_beginning	0: 386681, 1: 17739	Categorical	Whether or not the word is the beginning of a new sentence.	0	nan	Manually tagged
is_clause_end	0: 381545, 1: 22875	Categorical	Whether or not the word is the end of a clause.	0	nan	Manually tagged
is_sent_end	0: 380027, 1: 24393	Categorical	Whether or not the word is the end of a sentence.	0	nan	Manually tagged
is_abbreviation	0: 403478, 1: 942	Categorical	Whether or not the entire word is an abbreviation.	0	nan	Manually tagged
is_expert_technical_term	0: 332354, 1: 72066	Categorical	1 if the word is a technical term that is not generally understandable. E.g.: "Agarose".	0	nan	Manually tagged
is_general_technical_term	0: 325333, 1: 79087	Categorical	1 if the word is a technical term that is generally understandable. E.g.: "elektrisch"	0	nan	Manually tagged
contains_symbol	0: 400458, 1: 3962	Categorical	Whether or not the word contains a symbol. E.g.: β-D-Glucose	0	nan	Manually tagged
contains_hyphen	0: 388149, 1: 16271	Categorical	Whether or not the word contains a hyphen, e.g. 1 for DNA-Fragment. Words tagged TRUNC (compositional first elements) do not count: in "Sekundär- und Tertiärstrukturen", "Sekundär-" is not counted as containing a hyphen.	0	nan	Manually tagged
contains_abbreviation	0: 399423, 1: 4997	Categorical	Whether or not the word contains an abbreviation. 0 for words that are only an abbreviation. See is_abbreviation. E.g. 1 for DNA-Fragment, 0 for DNA.	0	nan	Manually tagged
STTS_PoS_tag	ADJA: 51041, ADJD: 12714, ADV: 12236, APPR: 22470, APPRART: 5566, APZR: 91, ART: 37340, CARD: 1594, KOKOM: 2428, KON: 5798, KOUI: 654, KOUS: 2521, NE: 955, NN: 162980, PAV: 3444, PDAT: 3292, PDS: 1374, PIAT: 791, PIDAT: 1653, PIS: 1322, PPER: 2511, PPOSAT: 1360, PRELAT: 1302, PRELS: 4193, PRF: 3606, PTKA: 97, PTKNEG: 687, PTKVZ: 1490, PTKZU: 583, PWAV: 76, TRUNC: 1137, VAFIN: 10340, VAINF: 1206, VMFIN: 3953, VMINF: 153, VVFIN: 23854, VVINF: 7713, VVIZU: 578, VVPP: 9317	Categorical	Part-of-speech tags according to the STTS-tagset. See stimuli/ANNOTATION.MD for more information.	0	nan	Manually tagged
type		string	The orthographical representation of a word as found in the corpus; this data is case sensitive, i.e. there is a distinction between name and Name.	0	nan	dlexDB
type_length_chars	0.0-33.0	Integer	The length of the type of a word in characters. See the description of word_length for a definition of how characters are counted.	0	nan	nan
PoS_tag	adja: 53330, adjd: 12226, adv: 15728, appr: 22193, apprart: 5566, art: 37918, card: 1594, kokom: 2428, kon: 5405, koui: 559, kous: 2521, ne: 1386, nn: 160585, pdat: 3292, pds: 1374, piat: 791, pidat: 352, pis: 2063, pper: 2434, pposat: 1360, prelat: 1302, prels: 4076, prf: 3606, ptka: 97, ptkneg: 687, ptkvz: 1891, ptkzu: 583, pwav: 76, trunc: 1137, vafin: 10340, vainf: 1206, vmfin: 3829, vminf: 153, vvfin: 23978, vvinf: 7713, vvizu: 578, vvpp: 9317, xy: 746	Categorical	Part-of-speech tag as defined by the dlexDB query.	0	nan	dlexDB
lemma		string	nan	0	nan	dlexDB
lemma_length_chars	0.0-32.0	Integer	nan	0	nan	dlexDB
syllables		string	nan	0	nan	dlexDB
type_length_syllables	0.0-14.0	Integer	nan	0	nan	dlexDB
annotated_type_frequency_normalized	min: 0.0, max: 24738.5901996, mean: 1950.9055, std: 5185.3006	Float	The number of occurrences of an annotated type in corpus. An annotated type is a unique combination of a type, its part-of-speech tag and its lemma.	0	nan	dlexDB
type_frequency_normalized	min: 0.0, max: 26530.3631386, mean: 2247.4523, std: 5847.2187	Float	nan	0	nan	dlexDB
lemma_frequency_normalized	min: 0.0, max: 80100.3069113, mean: 7203.2409, std: 19769.4428	Float	nan	0	nan	dlexDB
familiarity_normalized	min: 0.0, max: 26530.3631386, mean: 2191.7786, std: 5759.2592	Float	nan	0	nan	dlexDB
regularity_normalized	min: 0.0, max: 2123.30585022, mean: 46.8657, std: 137.5046	Float	nan	0	nan	dlexDB
document_frequency_normalized	min: 0.0, max: 9372.80956103, mean: 1684.1043, std: 2829.0626	Float	nan	0	nan	dlexDB
sentence_frequency_normalized	min: 0.0, max: 30912.3596552, mean: 3137.4539, std: 7374.8037	Float	nan	0	nan	dlexDB
cumulative_syllable_corpus_frequency_normalized	min: 0.0, max: 125126.524676, mean: 15768.7784, std: 17529.5528	Float	nan	0	nan	dlexDB
cumulative_syllable_lexicon_frequency_normalized	min: 0.0, max: 218985.607753, mean: 27232.3183, std: 36883.9628	Float	nan	0	nan	dlexDB
cumulative_character_corpus_frequency_normalized	min: 0.0, max: 7810554.20193, mean: 2053804.334, std: 1596380.3916	Float	nan	0	nan	dlexDB
cumulative_character_lexicon_frequency_normalized	min: 0.0, max: 18380479.713, mean: 4612580.9638, std: 3597155.0404	Float	nan	0	nan	dlexDB
cumulative_character_bigram_corpus_frequency_normalized	min: 0.0, max: 1322150.62097, mean: 356831.454, std: 269772.388	Float	nan	0	nan	dlexDB
cumulative_character_bigram_lexicon_frequency_normalized	min: 0.0, max: 2788357.77704, mean: 629626.1651, std: 539088.9742	Float	nan	0	nan	dlexDB
cumulative_character_trigram_corpus_frequency_normalized	min: 0.0, max: 603427.130456, mean: 200341.8076, std: 144122.7012	Float	nan	0	nan	dlexDB
cumulative_character_trigram_lexicon_frequency_normalized	min: 0.0, max: 899592.89035, mean: 236423.2776, std: 199573.1416	Float	nan	0	nan	dlexDB
initial_letter_frequency_normalized	min: 0.0, max: 110461.430317, mean: 28045.0077, std: 30618.9167	Float	nan	0	nan	dlexDB
initial_bigram_frequency_normalized	min: 0.0, max: 53801.2331077, mean: 8706.0335, std: 12743.2638	Float	nan	0	nan	dlexDB
initial_trigram_frequency_normalized	min: -0.00817507899599, max: 29048.3692201, mean: 3754.6304, std: 7393.1224	Float	nan	0	nan	dlexDB
avg_cond_prob_in_bigrams	min: 0.0, max: 0.5006180465, mean: 0.0313, std: 0.0466	Float	The conditional probability of the bigram, given the occurrence of its first component. In other words, how likely it is for the second component to follow directly after the first. Here, this measure is computed on the basis of the annotated type information.	0	nan	dlexDB
avg_cond_prob_in_trigrams	min: 0.0, max: 25.0, mean: 0.2251, std: 0.8814	Float	The conditional probability of the trigram, given the occurrence of its initial bigram. In other words, how likely it is for the third component to follow directly after the initial pair. Here, this measure is computed on the basis of the annotated type information.	0	nan	dlexDB
neighbors_coltheart_higher_freq_cum_freq_normalized	min: 0.0, max: 44055.247282, mean: 1276.643, std: 5775.4034	Float	nan	0	nan	dlexDB
neighbors_coltheart_higher_freq_count_normalized	min: 0.0, max: 8.13363128109, mean: 0.1556, std: 0.4321	Float	nan	0	nan	dlexDB
neighbors_coltheart_all_cum_freq_normalized	min: 0.0, max: 49782.1108458, mean: 2794.1781, std: 7982.6321	Float	nan	0	nan	dlexDB
neighbors_coltheart_all_count_normalized	min: 0.0, max: 47.5175301158, mean: 9.0448, std: 12.679	Float	nan	0	nan	dlexDB
neighbors_levenshtein_higher_freq_cum_freq_normalized	min: 0.0, max: 44055.247282, mean: 1683.6273, std: 6153.8504	Float	nan	0	nan	dlexDB
neighbors_levenshtein_higher_freq_count_normalized	min: 0.0, max: 11.9864039932, mean: 0.2681, std: 0.5814	Float	nan	0	nan	dlexDB
neighbors_levenshtein_all_cum_freq_normalized	min: 0.0, max: 54875.2749862, mean: 3761.4734, std: 9299.5647	Float	nan	0	nan	dlexDB
neighbors_levenshtein_all_count_normalized	min: 0.0, max: 75.7711966712, mean: 14.1417, std: 19.6383	Float	nan	0	nan	dlexDB
sent_surprisal_gpt2-base	min: 0.0005104430601932, max: 56.804420471191406, mean: 10.0061, std: 9.1114	Float	Surprisal value extracted from a language model (GerPT2-base) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_gpt2-base	min: 0.0002225389762315, max: 53.041446685791016, mean: 8.0061, std: 8.0873	Float	Surprisal value extracted from a language model (GerPT2-base) with the text as context.	0	nan	See script get_surprisal.py
sent_surprisal_gpt2-large	min: 0.0002048997703241, max: 42.28059005737305, mean: 8.76, std: 8.0159	Float	Surprisal value extracted from a language model (GerPT2-large) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_gpt2-large	min: 0.0001027531252475, max: 35.38883209228516, mean: 6.6792, std: 6.6522	Float	Surprisal value extracted from a language model (GerPT2-large) with the text as context.	0	nan	See script get_surprisal.py
sent_surprisal_llama-7b	min: 0.0001720042055239, max: 42.96158599853516, mean: 8.0373, std: 7.0611	Float	Surprisal value extracted from a language model (LeoLM-7b) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_llama-7b	min: 1.990775308513548e-05, max: 35.62324142456055, mean: 4.7991, std: 4.9022	Float	Surprisal value extracted from a language model (LeoLM-7b) with the text as context.	0	nan	See script get_surprisal.py
sent_surprisal_llama-13b	min: 8.702239938429557e-06, max: 46.25139999389648, mean: 7.7768, std: 7.1775	Float	Surprisal value extracted from a language model (LeoLM-13b) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_llama-13b	min: 9.298280929215252e-06, max: 36.29869842529297, mean: 4.5172, std: 4.9048	Float	Surprisal value extracted from a language model (LeoLM-13b) with the text as context.	0	nan	See script get_surprisal.py
sent_surprisal_bert-base	min: 1.1920928244535389e-07, max: 101.79562616348268, mean: 8.1926, std: 13.1873	Float	Surprisal value extracted from a language model (BERT-base) with the sentence as context.	0	nan	See script get_surprisal.py
text_surprisal_bert-base	min: -0.0, max: 88.84420316047726, mean: 7.487, std: 12.7275	Float	Surprisal value extracted from a language model (BERT-base) with the text as context.	0	nan	See script get_surprisal.py
FFD	min: 0, max: 2144, mean: 195.9741, std: 124.5597	Float	First-fixation duration: duration of the first fixation on a word if this word is fixated in first-pass reading, otherwise 0.	0	nan	compute_reading_measures.py
SFD	min: 0, max: 2144, mean: 107.9483, std: 134.474	Float	Single-fixation duration: duration of the only first-pass fixation on a word, 0 if the word was skipped or more than one fixation occurred in the first-pass (equals FFD in case of a single first-pass fixation).	0	nan	compute_reading_measures.py
FD	min: 0, max: 2144, mean: 226.9857, std: 103.7904	Float	First duration: duration of the first fixation on a word (identical to FFD if not skipped in the first-pass).	0	nan	compute_reading_measures.py
FPRT	min: 0, max: 9649, mean: 408.9247, std: 526.0428	Float	First-pass reading time: sum of the durations of all first-pass fixations on a word (0 if the word was skipped in the first-pass).	0	nan	compute_reading_measures.py
FRT	min: 0, max: 9649, mean: 456.8788, std: 518.1388	Float	First-reading time: sum of the durations of all fixations from first fixating the word (regardless of whether that first fixation occurs in first-pass reading) until leaving the word for the first time (equals FPRT if the word was fixated in the first-pass).	0	nan	compute_reading_measures.py
TFT	min: 0, max: 25314, mean: 1333.0163, std: 1428.494	Float	Total-fixation time: sum of all fixations on a word (FPRT+RRT).	0	nan	compute_reading_measures.py
TFC	min: 0, max: 87, mean: 5.8238, std: 5.5152	Float	The total fixation count on the word.	0	nan	compute_reading_measures.py
RRT	min: 0, max: 23902, mean: 924.0916, std: 1240.0587	Float	Re-reading time: sum of the durations of all fixations on a word that do not belong to the first-pass (TFT-FPRT).	0	nan	compute_reading_measures.py
RPD_inc	min: 0, max: 318898, mean: 1076.7946, std: 5339.73	Float	Inclusive regression-path duration: Sum of all fixation durations starting from the first first-pass fixation on a word until fixation on a word to the right of this word (including all regressive fixations on previous words), 0 if the word was not fixated in the first-pass (RPD_exc+RBRT).	0	nan	compute_reading_measures.py
RPD_exc	min: 0, max: 315640, mean: 557.5849, std: 5209.143	Float	Exclusive regression-path duration: Sum of all fixation durations after initiating a first-pass regression from a word until fixating a word to the right of this word, without counting fixations on the word itself (RPD_inc-RBRT).	0	nan	compute_reading_measures.py
RBRT	min: 0, max: 10675, mean: 519.2098, std: 638.9024	Float	Right-bounded reading time: Sum of all fixation durations on a word until a word to the right of this word is fixated (RPD_inc-RPD_exc).	0	nan	compute_reading_measures.py
Fix	0: 110, 1: 404310	Categorical	Fixation: 1 if the word was fixated, otherwise 0 (FPF or RR).	0	nan	compute_reading_measures.py
FPF	0: 56838, 1: 347582	Categorical	First-pass fixation: 1 if the word was fixated in the first-pass, otherwise 0.	0	nan	compute_reading_measures.py
RR	0: 48241, 1: 356179	Categorical	Re-reading: 1 if the word was fixated after the first-pass reading, otherwise 0 (sign(RRT)).	0	nan	compute_reading_measures.py
FPReg	0: 308156, 1: 96264	Categorical	First-pass regression: 1 if a regression was initiated in the first-pass reading of the word, otherwise 0 (sign(RPD_exc)).	0	nan	compute_reading_measures.py
TRC_out	min: 0, max: 15, mean: 0.8249, std: 1.193	Float	Total count of outgoing regressions: total number of regressive saccades initiated from this word.	0	nan	compute_reading_measures.py
TRC_in	min: 0, max: 12, mean: 0.7776, std: 1.1734	Float	Total count of incoming regressions: total number of regressive saccades landing on this word.	0	nan	compute_reading_measures.py
LP	min: 1, max: 28, mean: 3.3887, std: 2.3225	Float	Landing position: position of the first saccade on the word expressed by ordinal position of the fixated character.	0	nan	compute_reading_measures.py
SL_in	min: -162, max: 156, mean: 1.3449, std: 2.928	Float	Incoming saccade length: length of the saccade that leads to first fixation on a word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression.	0	nan	compute_reading_measures.py
SL_out	min: -179, max: 63, mean: -0.0835, std: 7.9375	Float	Outgoing saccade length: length of the first saccade that leaves the word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression; 0 if the word is never fixated.	0	nan	compute_reading_measures.py
mean_acc_tq	min: 0.0, max: 0.9991603694374476, mean: 0.3819, std: 0.3148	Float	The mean accuracy of all background questions for one text read by one reader.	0	nan	nan
mean_acc_bq	min: 0.0, max: 0.999250936329588, mean: 0.6398, std: 0.312	Float	The mean accuracy of all text questions for one text read by one reader.	0	nan	nan
gender_numeric	0.0: 187536, 1.0: 212874, nan: 4010	Categorical	Numerical value of gender; 0=male, 1=female.	4010	nan	nan
age	min: 18.0, max: 41.0, mean: 24.0283, std: 4.1436	Float	Reader's age.	8459	nan	demographic questionnaire
discipline_level_of_studies_numeric	0: 89325, 1: 133833, 2: 65008, 3: 116254	Categorical	Numerical value of discipline_level_of_studies; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert.	0	nan	demographic questionnaire
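
Several reading-measure descriptions in the table above state identities between columns (for example TFT = FPRT + RRT and RPD_inc = RPD_exc + RBRT). The following sketch checks those identities for a single row. It is not taken from compute_reading_measures.py; the function names and the example values are made up for illustration.

```python
def sign(value):
    """Return 1 for positive values, else 0 (the measures are non-negative)."""
    return 1 if value > 0 else 0

def check_reading_measures(row):
    """Return the names of the stated identities that the row violates."""
    checks = {
        "TFT == FPRT + RRT": row["TFT"] == row["FPRT"] + row["RRT"],
        "RPD_inc == RPD_exc + RBRT": row["RPD_inc"] == row["RPD_exc"] + row["RBRT"],
        "Fix == FPF or RR": row["Fix"] == int(bool(row["FPF"]) or bool(row["RR"])),
        "RR == sign(RRT)": row["RR"] == sign(row["RRT"]),
        "FPReg == sign(RPD_exc)": row["FPReg"] == sign(row["RPD_exc"]),
    }
    return [name for name, ok in checks.items() if not ok]

# Made-up values for one word: fixated in the first pass, one first-pass
# regression, and re-read later.
row = {
    "FPRT": 200, "RRT": 150, "TFT": 350,
    "RBRT": 200, "RPD_exc": 120, "RPD_inc": 320,
    "Fix": 1, "FPF": 1, "RR": 1, "FPReg": 1,
}
print(check_reading_measures(row))  # prints: []

# Breaking one value makes the corresponding identity show up.
print(check_reading_measures({**row, "TFT": 300}))  # prints: ['TFT == FPRT + RRT']
```

A check of this kind can serve as a quick sanity test after loading or merging the files.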

AOI to word mapping

Contains the mapping of each aoi to the respective word in each of the texts.

Please find the file at this link: aoi to word mapping

Column name	Possible values	Value type	Description	Missing value description	Source
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	nan	nan
word_index_in_text	1-180	Integer	The index of the word in the text. Indexing starts at 1.	nan	nan
char_index_in_text	1-1121	Integer	Index of a character in the text. Indexing starts at 1.	nan	nan
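
Because an aoi is specified as a character index in the text, this mapping can be used to translate fixated characters into word indices. The sketch below builds such a lookup from made-up rows; the function names are hypothetical, and whether every character (for example white-space) has an entry in the real file is not stated here, so characters without an entry yield `None`.

```python
def build_aoi_lookup(rows):
    """Map (text_id, char_index_in_text) to word_index_in_text."""
    return {
        (row["text_id"], row["char_index_in_text"]): row["word_index_in_text"]
        for row in rows
    }

def fixated_words(text_id, aois, lookup):
    """Translate a sequence of fixated aoi values into word indices.

    Characters without an entry in the mapping yield None.
    """
    return [lookup.get((text_id, aoi)) for aoi in aois]

# Made-up rows: in text b0, characters 1-3 form word 1 and 5-6 form word 2.
rows = [
    {"text_id": "b0", "char_index_in_text": 1, "word_index_in_text": 1},
    {"text_id": "b0", "char_index_in_text": 2, "word_index_in_text": 1},
    {"text_id": "b0", "char_index_in_text": 3, "word_index_in_text": 1},
    {"text_id": "b0", "char_index_in_text": 5, "word_index_in_text": 2},
    {"text_id": "b0", "char_index_in_text": 6, "word_index_in_text": 2},
]
lookup = build_aoi_lookup(rows)
print(fixated_words("b0", [2, 6, 4], lookup))  # prints: [1, 2, None]
```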

Participants

In the participants' data file, all demographic information is stored.

Please find the file at this link: Participant information

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
reader_id	0-105	Integer	The unique identifier given to each reader. Reader IDs start at 0.	0	nan	Manually created
reader_discipline	biology: 43, physics: 32	Categorical	The area of expertise of the reader. All readers are students whose major is either physics or biology.	0	nan	demographic questionnaire
reader_discipline_numeric	0: 43, 1: 32	Categorical	Numerical encoding of the reader discipline; 0=biology, 1=physics.	0	nan	Manually created
level_of_studies	graduate: 47, undergraduate: 28	Categorical	Reader's level of studies. Readers are considered experts if they are either MSc or PhD students. 1st semester BSc students are considered beginners.	0	nan	demographic questionnaire
level_of_studies_numeric	0: 28, 1: 47	Categorical	Numerical value of level_of_studies; 0=beginner, 1=expert.	0	nan	demographic questionnaire
discipline_level_of_studies	biology-graduate: 27, biology-undergraduate: 16, physics-graduate: 20, physics-undergraduate: 12	Categorical	The combination of the readers' major (reader_discipline) and their expertise (level_of_studies).	0	nan	demographic questionnaire
discipline_level_of_studies_numeric	0: 16, 1: 27, 2: 12, 3: 20	Categorical	Numerical value of discipline_level_of_studies; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert.	0	nan	demographic questionnaire
glasses	no: 54, yes: 20, nan: 1	Categorical	Whether or not reader is wearing glasses.	1	nan	demographic questionnaire
age	min: 18.0, max: 41.0, mean: 24.1644, std: 4.2098	Float	Reader's age.	2	nan	demographic questionnaire
handedness	right: 68, left: 6, nan: 1	Categorical	Reader's handedness.	1	nan	demographic questionnaire
hours_sleep	min: 0.0, max: 11.0, mean: 7.2095, std: 1.3138	Float	The hours of sleep of the participant before the experiment.	1	nan	demographic questionnaire
alcohol	no: 71, yes: 3, nan: 1	Categorical	Whether or not a participant consumed alcohol within 24 hours before the experiment start.	1	nan	demographic questionnaire
gender	female: 39, male: 35, nan: 1	Categorical	Reader's gender.	1	nan	demographic questionnaire
gender_numeric	0.0: 35, 1.0: 39, nan: 1	Categorical	Numerical value of gender; 0=male, 1=female.	1	nan	nan
semester		string	The semester the reader is currently enrolled in.	1	nan	demographic questionnaire
bilingual	n: 73, j: 1, nan: 1	Categorical	Whether the reader is bilingual.	1	nan	demographic questionnaire
state		string	The German state the reader is from.	1	nan	demographic questionnaire
grade		string	The grade of the reader in their university entrance diploma.	4	nan	demographic questionnaire
subject_detailed			The detailed subject of the reader's major.	1	nan	demographic questionnaire
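
The numeric encodings listed above are consistent with a simple rule: discipline_level_of_studies_numeric equals 2 × reader_discipline_numeric + level_of_studies_numeric. The sketch below spells this out. It is an observation drawn from the listed codes and counts, not necessarily how the column was actually computed, and the function name is hypothetical.

```python
# Encodings as listed in the table above.
DISCIPLINE = {"biology": 0, "physics": 1}
LEVEL = {"undergraduate": 0, "graduate": 1}  # 0=beginner, 1=expert

def combined_code(reader_discipline, level_of_studies):
    """Combine the two binary codes into the four-valued code 0..3."""
    return 2 * DISCIPLINE[reader_discipline] + LEVEL[level_of_studies]

for discipline in DISCIPLINE:
    for level in LEVEL:
        print(discipline, level, combined_code(discipline, level))
# prints:
# biology undergraduate 0
# biology graduate 1
# physics undergraduate 2
# physics graduate 3
```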

Participants' response accuracy

The response accuracy for each participant for each question.

Please find the file at this link: Participant response accuracy

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
reader_id	0-105	Integer	The unique identifier given to each reader. Reader IDs start at 0.	0	nan	Manually created
reader_discipline	biology: 516, physics: 384	Categorical	The area of expertise of the reader. All readers are students whose major is either physics or biology.	0	nan	demographic questionnaire
reader_discipline_numeric	0: 516, 1: 384	Categorical	Numerical encoding of the reader discipline; 0=biology, 1=physics.	0	nan	Manually created
level_of_studies	graduate: 564, undergraduate: 336	Categorical	Reader's level of studies. Readers are considered experts if they are either MSc or PhD students. 1st semester BSc students are considered beginners.	0	nan	demographic questionnaire
level_of_studies_numeric	0: 336, 1: 564	Categorical	Numerical value of level_of_studies; 0=beginner, 1=expert.	0	nan	demographic questionnaire
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	0	nan	nan
text_domain	biology: 450, physics: 450	Categorical	The domain of the stimulus text.	0	nan	Manually tagged
expert_reading_label	expert-reading: 282, non-expert-reading: 618	Categorical	Whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert)	0	nan	Manually tagged
expert_reading_label_numeric	0: 618, 1: 282	Categorical	Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_discipline and reader is expert). 1=expert_reading, 0=non-expert_reading	0	nan	Manually tagged
acc_tq_1	min: 0.0, max: 1.0, mean: 0.6475, std: 0.478	Float	The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1.	12	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_2	min: 0.0, max: 1.0, mean: 0.6441, std: 0.479	Float	The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1.	12	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_3	min: 0.0, max: 1.0, mean: 0.6509, std: 0.477	Float	The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1.	12	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_1	min: 0.0, max: 1.0, mean: 0.393, std: 0.4887	Float	The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1.	12	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_2	min: 0.0, max: 1.0, mean: 0.366, std: 0.482	Float	The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1.	12	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_3	min: 0.0, max: 1.0, mean: 0.4234, std: 0.4944	Float	The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1.	12	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
mean_acc_tq	min: 0.0, max: 1.0, mean: 0.6475, std: 0.3082	Float	The mean accuracy of all text questions for one text read by one reader.	12	nan	nan
mean_acc_bq	min: 0.0, max: 1.0, mean: 0.3941, std: 0.3163	Float	The mean accuracy of all background questions for one text read by one reader.	12	nan	nan
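
The derived columns in this table can be sketched from the definitions given in the descriptions. The example below assumes that `graduate` corresponds to expert, and that mean_acc_tq averages acc_tq_1 to acc_tq_3 (the summary statistics in the table are consistent with this reading). The function names are hypothetical and the values are made up.

```python
def expert_reading_label(text_domain, reader_discipline, level_of_studies):
    """Expert reading: the reader is an expert and the text is in their domain."""
    is_expert = level_of_studies == "graduate"
    if is_expert and text_domain == reader_discipline:
        return "expert-reading"
    return "non-expert-reading"

def mean_accuracy(values):
    """Mean of the available accuracies, or None if all are missing."""
    present = [value for value in values if value is not None]
    return sum(present) / len(present) if present else None

print(expert_reading_label("biology", "biology", "graduate"))  # prints: expert-reading
print(expert_reading_label("physics", "biology", "graduate"))  # prints: non-expert-reading
print(mean_accuracy([None, None, None]))                       # prints: None
```

Returning `None` when all three accuracies are missing mirrors the trials with missing measurements, for which the mean accuracies are missing as well.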

Coding of the answers of the online survey

Please find the file at this link: Answer coding online survey

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
VAR		string	Variable name of the fields in the participant online survey. These are explanations of the names of the columns in the file: response_data_online_survey.csv	0	nan	online survey tool
RESPONSE	-9: 46, 0: 2, 1: 95, 2: 92, 3: 93, 4: 54, 5: 13, 6: 13, 7: 12, 8: 12, 9: 12, 10: 12, 11: 12, 12: 12	Categorical	The response code given by the online survey tool. In the answer file these codes are used.	0	nan	online survey tool
MEANING			The literal meaning of the response. What the participant could see in the online survey.	0	nan	online survey tool
CORRECT_ANSWER	nan: 312, False: 126, True: 42	Categorical	Whether or not this answer was a correct answer or not.	312	The value is missing if this is not applicable. If the answer means that the participant did not even answer.	online survey tool

Response accuracy online survey

This file contains the response accuracy for the participants from the online survey.

Please find the file at this link: Response accuracy

Column name	Possible values	Value type	Description	Missing value description	Source
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	nan	nan
text_domain	biology: 210, physics: 210	Categorical	The domain of the stimulus text.	nan	Manually tagged
mean_acc_tq	min: 0.0, max: 1.0, mean: 0.2619, std: 0.2495	Float	The mean accuracy of all text questions for one text read by one reader.	nan	nan
reader_discipline	biology: 108, other: 156, physics: 156	Categorical	The area of expertise of the reader. All readers are students whose major is either physics or biology.	nan	demographic questionnaire
level_of_studies	graduate: 264, other: 156	Categorical	Reader's level of studies. Readers are considered experts if they are either MSc or PhD students. 1st semester BSc students are considered beginners.	nan	demographic questionnaire

Response data online survey

Please find the file at this link: Response data online survey

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
