Proposal: Clarifying Punkt's Legal Status and a Path Forward

The legal status of the pre-trained Punkt tokenizer models is currently ambiguous, which prevents their inclusion in a commercial-friendly pip package.

The Core Problem:

- The NLTK code for training the tokenizer is Apache 2.0 licensed.
- The pre-trained model (punkt.pickle) itself lacks an explicit license.
- Crucially, the provenance of the training data is undocumented. It was likely trained on a standard corpus (e.g., Reuters/LDC data) common in academia at the time, which would have restrictive licensing terms.

Because the model could be considered a derivative work of this data, its current license is effectively "unknown." Distributing it without clarity poses a legal risk to anyone who redistributes it, especially in commercial products.

A Practical Solution: Generate a New, Truly Free Punkt Model

Creating a new, unambiguously licensed model is a feasible path forward:

Data Source: Train the model on a large corpus of Public Domain texts (e.g., from Project Gutenberg) or texts under a permissive license (e.g., CC0, or CC-BY if its attribution requirements are met).
Process: Use NLTK's existing Apache 2.0 licensed training code.
Result: The resulting model would be a combined work of:
- Apache 2.0 licensed code
- Public Domain data
This should allow the entire artifact, the new model, to be released under Apache 2.0 with confidence.
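
The training step itself is small. The following is a minimal sketch, assuming NLTK is installed and that the public-domain corpus has already been collected as a list of plain-text strings. The helper names `train_punkt` and `save_model`, the output filename, and the placeholder sample text are all hypothetical; a real run would need a far larger corpus to learn useful abbreviations and collocations.

```python
import pickle
from pathlib import Path

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer


def train_punkt(texts):
    """Train a Punkt sentence tokenizer from an iterable of plain-text strings."""
    trainer = PunktTrainer()
    # Consider all word pairs as collocation candidates, not only those
    # that follow a known abbreviation (slower, but learns more).
    trainer.INCLUDE_ALL_COLLOCS = True
    for text in texts:
        # Accumulate statistics per document; defer the final pass.
        trainer.train(text, finalize=False)
    trainer.finalize_training()
    return PunktSentenceTokenizer(trainer.get_params())


def save_model(tokenizer, path):
    """Pickle the trained tokenizer (hypothetical helper and file layout)."""
    Path(path).write_bytes(pickle.dumps(tokenizer))


# Placeholder text standing in for a public-domain corpus (assumption:
# in practice this would be many megabytes of vetted text).
sample_corpus = [
    "It was a bright morning. Mr. Hale walked to the station. "
    "He met Dr. Brook there. They spoke for an hour.",
]

tokenizer = train_punkt(sample_corpus)
sentences = tokenizer.tokenize("The cat sat down. The dog ran away.")
print(sentences)
```

Calling `save_model(tokenizer, "punkt_free.pickle")` would then write the artifact to disk. Recording the exact list of source texts and their licenses alongside the model would address the provenance gap described above.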

This approach would resolve the legal ambiguity for Punkt permanently and allow it to be included in a safe nltk-essentials pip package.

license of punkt in nltk_data #188
