license of punkt in nltk_data #188

Open
happyMindHaha opened this issue Jun 23, 2022 · 5 comments

Comments

@happyMindHaha

Is it possible to use punkt in nltk_data for commercial use freely?
What is the license of punkt in nltk_data?

Thank you.

@stevenbird
Member

@janstrunk do you have any advice please?

@realchandan

realchandan commented Nov 30, 2024

I have the same question. Some other models clearly mention their licenses. Since it's not explicitly mentioned there, I'd assume it's not open-sourced.

@ykirpichev

@ekaf I saw that you've closed #236.

I would like to let you know that I also have #239.

Could you please clarify what the main issues are with those two components: punkt in nltk_data?
Why can't a pull request simply be submitted?

@ekaf
Member

ekaf commented May 9, 2025

Hi @ykirpichev, let's see how it goes, now that you have submitted your PR.
I think that the authors' initial intention was to open-source Punkt.
But it seems that nobody can or wants to provide a legally binding confirmation.

@ekaf
Member

ekaf commented Jun 1, 2025

It seems very likely that commercial products that rely on NLTK's Punkt tokenizer already exist. FWIW, please consider the following snippet generated by Gemini:

It's highly probable that IBM (especially with Watson NLU and related AI products) and Oracle (within their data intelligence and AI offerings), among other large tech companies, do indeed rely on NLTK's Punkt tokenizer. They might not advertise it as a feature, but it serves as a critical underlying component for tasks like sentence segmentation, which is foundational to more complex natural language processing. The licensing disclosures and security patches are strong indicators of this unadvertised, yet essential, integration.

The mentioned security patch, "Security Bulletin: Vulnerability in Natural Language Toolkit (NLTK) (CVE-2024-39705) affects IBM watsonx Assistant for IBM Cloud Pak for Data", does not mention Punkt explicitly, though. But it shows that they were using at least one of the five NLTK data packages affected by the vulnerability, and among those Punkt seems the most likely.

So, as long as the licensing terms of the Punkt data are unclear, it might be worthwhile to look at how its use is acknowledged in commercial products, which have presumably been reviewed by lawyers.

5 participants