Problem Statement
The semcor corpus is currently distributed in nltk_data under the Princeton WordNet License. However, a review of its provenance reveals that this license is likely invalid for distributing the underlying text, making semcor a non-free package.
Reasoning for semcor License Invalidity
- Derivative Work Status: semcor is not a new work; it is the Brown Corpus with added semantic annotations. Under copyright law, this makes it a derivative work of the original Brown Corpus.
- Restrictive Source License: The Brown Corpus is unequivocally licensed under restrictive terms by the Linguistic Data Consortium (LDC). The LDC is the sole official licensing authority.
- No Evidence of Sublicensing: There is no public evidence that Princeton University secured a sublicensing agreement from the LDC that would permit them to strip the LDC's restrictions and re-license the underlying Brown text under a permissive license.
- Princeton is Not an LDC Member: Investigation confirms Princeton University is not a member of the LDC consortium, eliminating the possibility of special institutional rights that could justify this re-licensing.
Conclusion: Therefore, the Princeton WordNet License attached to semcor is an overreach and is almost certainly invalid for its core content. Distributing semcor relies on academic leniency, not a sound legal basis. It must be classified as non-free.
Why This Does NOT Affect wordnet
It is crucial to understand that the invalidity of the semcor license does not "infect" the WordNet database itself. The legal arguments are distinct:
- semcor is a Corpus: It contains the full, expressive text of the copyrighted Brown Corpus. Distributing it copies protected expression.
- WordNet is a Database of Facts: WordNet used semcor to derive sense frequencies, which are statistical facts about language use. Copyright protects expression, not facts. The structure of WordNet is its own creative, copyrightable work, and the facts it contains are unprotectable.
The creation of WordNet's data from semcor is a textbook example of fair use (highly transformative, non-expressive purpose) and falls under the fact/expression dichotomy. The WordNet database remains on solid legal ground under its permissive Princeton WordNet License.
Proposed Action
- Officially reclassify the semcor package from "free" to "non-free" in the nltk_data index and documentation.
- Ensure semcor is included in the nltk-edu (restricted) pip package and excluded from the nltk-free (commercial-safe) package.
- Update the semcor documentation to clearly state: "Distributed under the Princeton WordNet License, but this license is likely invalid because semcor is a derivative work of the LDC-licensed Brown Corpus. For academic use only."
This action is necessary to maintain the legal integrity of the NLTK project and protect its users.
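As a rough illustration of the split proposed above, the sketch below partitions package ids from an index into a free set and a non-free set. The XML excerpt, its attribute names, and the helper split_packages are hypothetical and written only for this example; they are not taken from the real nltk_data index or from any existing NLTK tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of a package index. The element and attribute names
# are assumptions for illustration, not copied from the real nltk_data index.
SAMPLE_INDEX = """
<nltk_data>
  <packages>
    <package id="semcor" license="Princeton WordNet License"/>
    <package id="wordnet" license="Princeton WordNet License"/>
  </packages>
</nltk_data>
"""

# Package ids this issue proposes to reclassify as non-free.
NON_FREE_IDS = {"semcor"}


def split_packages(index_xml, non_free_ids):
    """Return (free, non_free) lists of package ids, in index order."""
    root = ET.fromstring(index_xml)
    free, non_free = [], []
    for package in root.iter("package"):
        package_id = package.get("id")
        if package_id in non_free_ids:
            non_free.append(package_id)
        else:
            free.append(package_id)
    return free, non_free


free, non_free = split_packages(SAMPLE_INDEX, NON_FREE_IDS)
print("nltk-free:", free)
print("restricted (nltk-edu only):", non_free)
```

Keeping the non-free ids in one explicit set means a later reclassification, in either direction, would be a one-line change that is easy to review.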
Seeking Clarification
To resolve this ambiguity, we would welcome clarification from the WordNet team at Princeton University, particularly Dr. Christiane Fellbaum, regarding the rights obtained for creating and distributing semcor as a derivative work of the Brown Corpus. I am also contacting the WordNet team by email, with an invitation to take part in this discussion.