
question on license #1

Open
joprice opened this issue Jan 19, 2025 · 13 comments

Comments


joprice commented Jan 19, 2025

Regarding your comment on this thread explosion/spaCy#3052 (comment), I'm curious if the license could be more detailed:

  • Is the code itself licensed the same as the model?
  • Are the outputs of the model considered derived works?
  • Which corpora specifically trigger the non-commercial license? (This might be useful for people building similar projects who have to be careful about licensing.)
thjbdvlt (Owner) commented:

Thank you for your questions.

Actually, it should be pretty easy to drop the non-commercial clause, because it only comes from the corpora used to train the static word vectors, and static word vectors are trained on unannotated data. You can find the list of these corpora here: https://github.com/thjbdvlt/french-word-vectors

The morphologizer itself is trained on a corpus that I annotated myself (https://github.com/thjbdvlt/corpus-narraFEATS), using public-domain texts from Wikisource and articles from Wikipedia (CC-BY-SA), all heavily modified by me to ensure linguistic diversity. If CC-BY-SA is too restrictive, I should be able to replace the Wikipedia articles with other material and put the corpus under an MIT license (which would be a good idea anyway).
The outputs are not derived works, so you will be able to use the model freely anyway.

I'll train another set of static word vectors on a corpus without a non-commercial clause; it will probably not be as good as the current word vectors.
If you have such unannotated data, you could train the pipeline yourself. I used 8 million sentences, which is small.

For the code itself: I think I will split this repository and put the pipeline somewhere else, so the code (which is trivial) can be under an MIT license. I'll do that in the coming days.

Does that answer your questions?


joprice commented Jan 22, 2025

It's good to hear that the license could be made finer-grained.

Regarding the comment

The outputs are not derived works

Do you mean this applies to the existing model, or only with the license changes you're suggesting? If the current outputs are not considered derived works and are OK for commercial use, then perhaps the README could call out that only modifications and contributions need to be contributed back. Or would changing the license have other practical implications for commercial use?

These questions may be naive, but I'd like to raise my confidence in both using and contributing back to this and similar projects. Licenses around data products are the most confusing to me, and they often lead me to assume the products are purely for academic/research purposes. I wish they were a bit more explicit, like 'if you change things, contribute them back' or 'the outputs are valid for any use' (not your fault, obviously, as there's already a lot of material trying to clarify the terms of these licenses). I try to contribute back to projects I make use of either way, but sometimes I avoid them instead of being encouraged to contribute.

As for training the model myself, I haven't run a spaCy training pipeline yet, and it's on my todo list to do exactly that with yours. I've hit many issues similar to the imperative one that motivated this issue, and I need to get my head around improving output quality with my own customizations. Do you have an estimate of how long the training took and what the minimum system requirements might be?

thjbdvlt (Owner) commented:

With the current license, you can't use the pipeline for commercial use, unfortunately. That's because of the linguistic corpora used to train the vectors: as you guessed, those are made for academic research and produced in that context (e.g. this one).

But the data used to train the morphologizer itself (which tags words as Mood=Imp and so on -- it's the most important part of the pipeline) are licensed as CC-BY-SA without the non-commercial clause. The share-alike clause doesn't mean that you have to share the work, but that if you share it, it has to be under this same license: "You must comply with the conditions [...] if You Share all or a substantial portion of the contents of the database."

To me, the output is not a derived work. The license defines adapted material as "Licensed Material [that] is translated, altered, arranged, transformed, or otherwise modified". But you will just be using the pipeline, and the ShareAlike clause says nothing about using. That's different from the non-commercial ("NonCommercial") clause, which says that you cannot use the licensed work for commercial purposes: you only have "the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database for NonCommercial purposes only".

This is messy, anyway. That's because CC licenses are actually a bad fit for this kind of object (software), and were not designed for it. A GPLv3 license would be better. It grants explicit freedoms for such cases:

  • the freedom to use the software for any purpose,
  • the freedom to change the software to suit your needs,
  • the freedom to share the software with your friends and neighbors, and
  • the freedom to share the changes you make.

With such a license, if you make changes to the pipeline itself, you cannot sell the modified pipeline, but you can keep it for yourself and do whatever you want with it, including commercial things.

As you can read here, I "have permission to adapt another licensor's work under CC BY-SA 4.0 and release your contributions to the adaptation under GPLv3" -- but not under CC BY-SA-NC 4.0. Hence, I'll go for a GPLv3 license once I've replaced the word vectors corpora. As you suggest, I will probably write in the README that "the outputs are valid for any use", or something like that.

Is that clearer?

About training: the training is not long in itself (around 20 minutes on my 10-year-old laptop), but since you will repeat it at least a few times to find good hyperparameters, it can take some time. Still, it's nothing compared to the most important part, which is annotating a corpus (that took me weeks), so I recommend you don't do that: just use the one I've made, or a Universal Dependencies corpus. That said, I found that UD corpora are not very good for French; they are either too small (like ParisStories) or incomplete (the one used by spaCy doesn't include a single "tu", I think, nor any inclusive language forms).

But if solipcysme achieves good results (at least in my uses), it's not only because of the training corpus, but also because of the custom MultiHash architecture and FeatureExtractor, which make it possible to use Hunspell outputs as features. Before I did that, imperative verb forms were rarely recognized. If you want to understand how it works, you can read these files (but you will need to dig into spaCy's source code to understand):

Also, I forgot to write in the README that there is a dependency needed to run pyhunspell: the C library headers for Hunspell (libhunspell-dev on Linux). I will add it to the dependencies list. It's not a big thing, though.
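
To give an idea of what the pipeline gets out of Hunspell, here is a minimal sketch of querying it from Python with pyhunspell (the dictionary paths are assumptions -- adjust them to wherever your French dictionary lives, e.g. the hunspell-fr package on most Linux distributions):

```python
import hunspell  # pyhunspell; building it needs libhunspell-dev

# Assumed paths: the usual French dictionary location on Linux.
h = hunspell.HunSpell("/usr/share/hunspell/fr_FR.dic",
                      "/usr/share/hunspell/fr_FR.aff")

print(h.spell("lisez"))    # True if the form is known to the dictionary
print(h.stem("lisez"))     # candidate stems (if the dictionary provides them)
print(h.suggest("lisz"))   # spelling suggestions

# h.analyze("lisez") additionally returns morphological tags, provided the
# dictionary ships morphological data -- that is the kind of signal the
# custom FeatureExtractor can feed to the morphologizer as features.
```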


Here is the cited legal code: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.en
A page on the Creative Commons wiki about CC licenses states similar interpretations: https://wiki.creativecommons.org/wiki/ShareAlike_interpretation


joprice commented Jan 22, 2025

Thank you for the in-depth explanation, and for the intent to make the license friendlier for broader use. But the crux of the question for me lies in the ambiguity between two things you said:

You can't use the pipeline for commercial use.

And

To me, the output is not a derived work.

Are you implying that running the pipeline as part of a company's infrastructure would violate the license, but that using it as a script to prepare data (which is then, e.g., loaded into a database) would make the output usable? Or were you comparing the current state to a future state with less strict licensing?

On the Hunspell topic, I spent my morning getting it working on my Mac, due to a hardcoded path to the headers and lib 😆: pyhunspell/pyhunspell#67 (comment)

thjbdvlt (Owner) commented:

Yes, I was comparing the current and future states. The output is not a derived work, so CC BY-SA lets you use it commercially, but CC BY-SA-NC doesn't, because CC BY-SA-NC restricts every use, not only derived works. Sorry, I am not being very clear.

Great that you were able to install the Hunspell dev files / pyhunspell. I see that pyhunspell is rarely updated despite the open issues, and that there are more recent packages that do the same thing (e.g. https://github.com/MSeal/cython_hunspell or https://pypi.org/project/pynuspell/). Did you try any of these, and do you think one of them should be used instead?


joprice commented Jan 22, 2025

I didn't try any others. I can make a quick empty Poetry project, try installing each, and report back.


joprice commented Jan 22, 2025

For some reason, I get "Unable to find installation candidates" for both of those and haven't been able to figure out why yet.


joprice commented Jan 23, 2025

@thjbdvlt On the topic of retraining on a different dataset: since I'm actively trying to learn some of these skills, I'd be willing to contribute some time to help you complete that task, if you're interested in spreading the work over two people and can fill me in on the approach. It might be a bit slower given your deep familiarity with the tools and data, but if you create an issue and add some instructions, I could try to put a PR together. I'm also happy to get on a call to talk it through.

thjbdvlt (Owner) commented:

For some reason, I get "Unable to find installation candidates" for both of those and haven't been able to figure out why yet.

Thank you for trying! Let's stick with pyhunspell, then. Maybe I could put a link to your comment in a Dependencies section of the README.

I'll train the static word vectors today, on discussion messages extracted from Wikipedia dumps. I didn't want to use the articles because personal pronouns (especially "je" and "tu") are pretty rare in them but very common in discussions, so it took me some time to find a way to extract the text; now I can train them.

If you're curious how it's done: the static word vectors are not trained with spaCy but with Gensim, using the Word2Vec algorithm. (It statically represents all the words of a corpus in a (kind of) semantic space, based on their contextual co-occurrences, where closeness between two vectors means closeness between two words' meanings -- e.g. lire and relire will be very close, while lire and plâtre won't be. It's actually way more interesting than that; there are a lot of texts on the topic, but this simple property already makes them really useful in a spaCy pipeline.) Then the word vectors are converted into spaCy's format using the spaCy command-line tool: spacy init vectors fr /path/to/word2vec.txt ./output/vectors. It's very straightforward. The word vectors will be ready soon, so I'll release the pipeline under GPLv3 within a few days.
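
For reference, a minimal sketch of that Gensim step (the toy sentences and file names here are placeholders, not the actual training corpus):

```python
from gensim.models import Word2Vec

# Placeholder corpus: an iterable of tokenized sentences. The real
# training data would be the extracted Wikipedia discussion messages.
sentences = [["je", "lis", "le", "livre"],
             ["tu", "relis", "la", "page"]]

model = Word2Vec(
    sentences=sentences,
    vector_size=300,  # dimensionality of the vectors
    window=5,         # context window for co-occurrences
    min_count=1,      # raise this (e.g. to 5) on a real corpus
    workers=4,
)

# Save in the plain-text word2vec format that spaCy can read, then:
#   python -m spacy init vectors fr word2vec.txt ./vectors
model.wv.save_word2vec_format("word2vec.txt", binary=False)
```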

I'd be willing to contribute some time to help you complete that task, if you're interested in spreading the work over two people and can fill me in on the approach.

Nice to hear! Some help would be welcome :)
Here is where you could help:

  1. Finding good data for better dependency parsing: the parser is trained on 4 concatenated UD corpora, which may actually be inconsistent with each other -- that could explain the poor results (89%). Then, training a parser on these data (I'll suggest a config to start with; see the sketch after this list for merging the corpora).
  2. Improving the rule-based lemmatizer. For now, it uses viceverser_lemmatizer, a module I've written, and I'm sure it's not very good in some cases (especially compound words).
  3. Testing the pipeline: I have no idea whether it works on anything other than Linux, or whether the results are good enough on datasets other than mine. I've probably made some unfortunate choices.
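
For point 1, merging UD corpora into one training file could look roughly like this -- the .conllu file names are placeholders, and I believe spaCy exposes its converter as conllu_to_docs (the `python -m spacy convert` CLI does the same thing):

```python
from pathlib import Path

from spacy.tokens import DocBin
from spacy.training.converters import conllu_to_docs

# Placeholder file names: any French UD treebanks you want to merge.
corpora = ["fr_gsd-ud-train.conllu", "fr_sequoia-ud-train.conllu"]

db = DocBin()
for path in corpora:
    # conllu_to_docs parses the raw CoNLL-U text into spaCy Docs;
    # the CLI equivalent is `python -m spacy convert file.conllu out/`.
    for doc in conllu_to_docs(Path(path).read_text(encoding="utf-8")):
        db.add(doc)

# The resulting .spacy file can be pointed to by [paths.train] in the config.
db.to_disk("train.spacy")
```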

I'll open some issues about these three things. Maybe some of them will be of interest to you?


joprice commented Jan 24, 2025

Those are all tasks I'm interested in, especially the compound nouns. I just wrote some hacks to detect them using a dictionary lookup and a sliding window of width 3-5, to handle things like "caserne de pompiers". I'll start digging into the lemmatizer, make sure I can build it, and test the pipeline locally.

thjbdvlt (Owner) commented:

Great! By "compound words" I only meant things like auto-construisez, autrices-compositrices-interprète (which is lemmatized as auteur·rice-compositeur·rice-interprète) or peut-être (which my component lemmatized, until recently, as... pouvoir-être) -- i.e. things that the tokenizer doesn't split -- but what you're talking about is way more interesting! Instead of a hack, you could use the spaCy PhraseMatcher. I think that's exactly the kind of task it's designed for (I haven't used it yet). I wanted to do something similar for expressions like par conséquent (an adverbial phrase), but I have no time for this.
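
A minimal sketch of what that could look like (the model name is just an assumption; any French pipeline would work, including this one):

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumed model: any French pipeline with a tokenizer will do.
nlp = spacy.load("fr_core_news_sm")

# Match exact phrases from a fixed list, case-insensitively.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["caserne de pompiers", "par conséquent"]
matcher.add("EXPRESSIONS", [nlp.make_doc(t) for t in terms])

doc = nlp("Par conséquent, la caserne de pompiers reste ouverte.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```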


joprice commented Jan 24, 2025

Ah right, I misinterpreted that. That would be great to have as well.

I didn't know spaCy had the phrase-matcher feature. I need to sit down and go through the docs more thoroughly. So it seems my "hack" is valid, just at a later stage in my pipeline, since you just pass it a list of words to look for. But it seems you can also look for general patterns, maybe "noun-prep-noun". I'll try using that with a fixed list to validate it. I'm sure the performance will be better than my current unbatched setup.

thjbdvlt (Owner) commented:

@joprice I actually sent you the wrong link: it's not the PhraseMatcher but the Matcher that may be similar to your hack (and maybe faster, if it accesses the token properties at the C level -- I haven't read the code there).
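
A sketch of the Matcher version of the same idea, matching on token attributes rather than a fixed phrase list (again, the model name is an assumption):

```python
import spacy
from spacy.matcher import Matcher

# Assumed model: any French pipeline that assigns POS tags will do.
nlp = spacy.load("fr_core_news_sm")

matcher = Matcher(nlp.vocab)
# A general token-level pattern: noun, preposition, noun.
matcher.add("NOUN_PREP_NOUN",
            [[{"POS": "NOUN"}, {"POS": "ADP"}, {"POS": "NOUN"}]])

doc = nlp("La caserne de pompiers est fermée.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # e.g. "caserne de pompiers"
```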

I've updated the word vectors and just released the full pipeline under the GPLv3 license. I've also made a few additions to the corpus used to train the morphologizer, which should have improved its accuracy.

When I have time, I'll upload the files and scripts for the complete training, and open some issues.
