Keep original order of pronunciation variants (#1) #2
Open
dietmar wants to merge 1 commit into
Fixes #1
Rather than writing to an intermediate file and then sorting via `sort` in the shell script `extract_de_ipa.sh`, we collect the result rows in Python and sort there, but only considering the text in the first column (and not the IPA string). This keeps the original order (on Wiktionary) of pronunciation variants for the same text, because Python's `list.sort()` is stable.

I also switched to writing the result file using Python's built-in `csv` library instead of just `print`ing result lines, because:

- `printIpa` is now called `buildRow` and returns a list of three strings (text, IPA, comments).
- The `csv` library takes care of quoting for us, so we don't need something like `return "\"" + s + "\"" if "," in s else s` anymore.
- The `csv` library makes the result file actually valid CSV. Until now, rows with a comment had three fields (two commas) and rows without a comment had only two fields (one comma). The new result can be read, e.g., by `pandas.read_csv`, which threw an error before, complaining about rows with an incorrect number of fields.
- `print` is now used only for status and progress information, some of which I added.

Finally, I also switched from plain `open()` to `bz2.open()` so we don't need to decompress the file downloaded from Wikimedia before running the script.
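
For reviewers, here is a minimal sketch of the three ideas described above (stable sort on the first column only, writing rows through `csv.writer`, and reading with `bz2.open()`). It is not the code from this PR: `build_row`, `sort_rows`, `write_rows`, the sample rows, and the file name `sample.xml.bz2` are hypothetical and only serve as illustration.

```python
import bz2
import csv
import io
import os
import tempfile


def build_row(text, ipa, comment=""):
    # Hypothetical stand-in for buildRow: always returns three fields,
    # so every CSV row has the same number of columns.
    return [text, ipa, comment]


def sort_rows(rows):
    # Sort by the first column only. list.sort() is stable, so rows that
    # share the same text keep the order in which they were collected.
    rows.sort(key=lambda row: row[0])
    return rows


def write_rows(rows, out):
    # csv.writer quotes fields that contain commas or quotes.
    csv.writer(out).writerows(rows)


# Illustrative sample rows (not taken from the actual dump).
rows = [
    build_row("Weg", "veːk"),
    build_row("Chemie", "çeˈmiː"),
    build_row("Chemie", "keˈmiː", "southern German, Austrian"),
]
sort_rows(rows)

buf = io.StringIO()
write_rows(rows, buf)
csv_text = buf.getvalue()

# bz2.open() in text mode reads (and writes) compressed files directly,
# so no separate decompression step is needed.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml.bz2")  # hypothetical file name
    with bz2.open(path, "wt", encoding="utf-8") as f:
        f.write("<page>Chemie</page>\n")
    with bz2.open(path, "rt", encoding="utf-8") as f:
        first_line = f.readline()
```

When writing to a real file rather than an in-memory buffer, the `csv` documentation recommends opening it with `newline=""`, e.g. `open(path, "w", newline="", encoding="utf-8")`.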