
Semcor License is likely invalid #250

@ekaf

Problem Statement

The semcor corpus is currently distributed in nltk_data under the Princeton WordNet License. However, a review of its provenance reveals that this license is likely invalid for distributing the underlying text, making semcor a non-free package.

Reasoning for semcor License Invalidity

  1. Derivative Work Status: semcor is not a new work; it is a subset of the Brown Corpus with added semantic annotations. Under copyright law, this makes it a derivative work of the original Brown Corpus.

  2. Restrictive Source License: The Brown Corpus is unequivocally licensed under restrictive terms by the Linguistic Data Consortium (LDC). The LDC is the sole official licensing authority.

  3. No Evidence of Sublicensing: There is no public evidence that Princeton University secured a sublicensing agreement from the LDC that would permit them to strip the LDC's restrictions and re-license the underlying Brown text under a permissive license.

  4. Princeton is Not an LDC Member: Investigation confirms Princeton University is not a member of the LDC consortium, eliminating the possibility of special institutional rights that could justify this re-licensing.

Conclusion: The Princeton WordNet License attached to semcor is an overreach and is almost certainly invalid for its core content, the Brown text. Distributing semcor relies on academic leniency, not a sound legal basis. It must be classified as non-free.

Why This Does NOT Affect wordnet

It is crucial to understand that the invalidity of the semcor license does not "infect" the WordNet database itself. The legal arguments are distinct:

  • semcor is a Corpus: It contains the full, expressive text of the copyrighted Brown Corpus samples it annotates. Distributing it copies protected expression.
  • WordNet is a Database of Facts: WordNet used semcor to derive sense frequencies, which are statistical facts about language use. Copyright protects expression, not facts. The structure of WordNet is its own creative, copyrightable work, and the facts it contains are unprotectable.

The creation of WordNet's data from semcor is a textbook example of fair use (highly transformative, non-expressive purpose) and falls under the fact/expression dichotomy. The WordNet database remains on solid legal ground under its permissive Princeton WordNet License.

Proposed Action

  1. Officially reclassify the semcor package from "free" to "non-free" in the nltk_data index and documentation.
  2. Ensure semcor is included in the nltk-edu (restricted) pip package and excluded from the nltk-free (commercial-safe) package.
  3. Update the semcor documentation to clearly state:

    "Distributed under the Princeton WordNet License, but this license is likely invalid because semcor is a derivative work of the LDC-licensed Brown Corpus. For academic use only."

This action is necessary to maintain the legal integrity of the NLTK project and protect its users.
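
As a rough illustration of steps 1 and 2, here is a minimal Python sketch of how packages could be split between the two distributions. It is not the actual nltk_data tooling: `NON_FREE_OVERRIDES` and `split_packages` are hypothetical names, and the sketch assumes for simplicity that each package goes into exactly one distribution.

```python
# Hypothetical sketch: these names are illustrative, not real nltk_data tooling.

# Packages classified as non-free regardless of the license they declare.
NON_FREE_OVERRIDES = {"semcor"}


def split_packages(package_ids, non_free=NON_FREE_OVERRIDES):
    """Split package ids into (free, restricted) lists, preserving input order.

    Assumption: `free` would feed nltk-free and `restricted` would feed nltk-edu.
    """
    free = [pid for pid in package_ids if pid not in non_free]
    restricted = [pid for pid in package_ids if pid in non_free]
    return free, restricted


free, restricted = split_packages(["wordnet", "semcor"])
print("nltk-free:", free)
print("nltk-edu:", restricted)
```

Keeping the override as an explicit set, separate from the declared license string, would let the index keep recording "Princeton WordNet License" for semcor while the packaging step treats it as restricted.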

Seeking Clarification

To resolve this ambiguity, we would welcome clarification from the WordNet team at Princeton University, particularly Dr. Christiane Fellbaum, regarding the rights obtained for creating and distributing semcor as a derivative work of the Brown Corpus. I am also contacting the WordNet team by email, with an invitation to take part in this discussion.
