Semcor License is likely invalid #250


## Problem Statement

The `semcor` corpus is currently distributed in `nltk_data` under the Princeton WordNet License. However, a review of its provenance reveals that this license is likely invalid for distributing the underlying text, making `semcor` a non-free package.

## Reasoning for `semcor` License Invalidity

1.  **Derivative Work Status:** `semcor` is not a new work; it is the Brown Corpus with added semantic annotations. Under copyright law, this makes it a derivative work of the original Brown Corpus.

2.  **Restrictive Source License:** The Brown Corpus is unequivocally licensed under restrictive terms by the Linguistic Data Consortium (LDC). The LDC is the sole official licensing authority.

3.  **No Evidence of Sublicensing:** There is no public evidence that Princeton University secured a sublicensing agreement from the LDC that would permit them to strip the LDC's restrictions and re-license the underlying Brown text under a permissive license.

4.  **Princeton is Not an LDC Member:** Investigation confirms Princeton University is not a member of the LDC consortium, eliminating the possibility of special institutional rights that could justify this re-licensing.

**Conclusion:** The Princeton WordNet License attached to `semcor` is therefore an overreach and is almost certainly **invalid** for the corpus's core content, the Brown text. Distribution of `semcor` currently relies on academic leniency rather than a sound legal basis, so it must be classified as **non-free**.

## Why This Does NOT Affect `wordnet`

It is crucial to understand that the invalidity of the `semcor` license does not "infect" the WordNet database itself. The legal arguments are distinct:

*   **`semcor` is a Corpus:** It contains the full, expressive text of the copyrighted Brown Corpus. Distributing it copies protected expression.
*   **WordNet is a Database of Facts:** WordNet used `semcor` to derive **sense frequencies**, which are statistical facts about language use. Copyright protects *expression*, not *facts*. The structure of WordNet is its own creative, copyrightable work, and the facts it contains are unprotectable.

The creation of WordNet's data from `semcor` is a textbook example of **fair use** (highly transformative, non-expressive purpose) and falls under the **fact/expression dichotomy**. The WordNet database remains on solid legal ground under its permissive Princeton WordNet License.

## Proposed Action

1.  Officially reclassify the `semcor` package from "free" to **"non-free"** in the `nltk_data` index and documentation.
2.  Ensure `semcor` is included in the `nltk-edu` (restricted) pip package and excluded from the `nltk-free` (commercial-safe) package.
3.  Update the `semcor` documentation to clearly state:
    > "Distributed under the Princeton WordNet License, but this license is likely invalid because `semcor` is a derivative work of the LDC-licensed Brown Corpus. For academic use only."

This action is necessary to maintain the legal integrity of the NLTK project and protect its users.
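
Steps 1 and 2 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual `nltk_data` tooling: the `Package` record, the `classification` field, and the helper names `reclassify` and `split_by_distribution` are all hypothetical, and the sketch assumes that `nltk-edu` bundles every package while `nltk-free` bundles only those classified as free.

```python
from dataclasses import dataclass, replace

# Hypothetical labels; the real nltk_data index may record this differently.
FREE = "free"
NON_FREE = "non-free"


@dataclass(frozen=True)
class Package:
    """Hypothetical in-memory stand-in for one entry of the package index."""
    id: str
    license: str
    classification: str


def reclassify(packages, package_id, new_classification):
    """Return a new list in which one package has a changed classification."""
    if not any(pkg.id == package_id for pkg in packages):
        raise KeyError(f"unknown package: {package_id}")
    return [
        replace(pkg, classification=new_classification)
        if pkg.id == package_id else pkg
        for pkg in packages
    ]


def split_by_distribution(packages):
    """Map each distribution name to the set of package ids it would ship.

    Assumption: nltk-edu ships everything, nltk-free ships only free packages.
    """
    return {
        "nltk-edu": {pkg.id for pkg in packages},
        "nltk-free": {pkg.id for pkg in packages
                      if pkg.classification == FREE},
    }


# Usage: both packages start out as "free"; only semcor is reclassified.
index = [
    Package("wordnet", "Princeton WordNet License", FREE),
    Package("semcor", "Princeton WordNet License", FREE),
]
bundles = split_by_distribution(reclassify(index, "semcor", NON_FREE))
```

Keeping the classification separate from the license string reflects the point of this issue: the stated license of `semcor` stays the same, while its free/non-free status changes.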

## Seeking Clarification

To resolve this ambiguity, I would welcome clarification from the WordNet team at Princeton University, particularly Dr. Christiane Fellbaum, regarding the rights obtained for creating and distributing `semcor` as a derivative work of the Brown Corpus. I am also contacting the WordNet team by email to invite them to take part in this discussion.
