question on license #1
Thank you for your questions. Actually, it could be pretty easy to drop the non-commercial clause, because it only comes from the corpora used to train the static word vectors, and static word vectors are trained on unannotated data. You can find the list of these corpora here: https://github.com/thjbdvlt/french-word-vectors

The morphologizer itself is trained on a corpus that I annotated myself (https://github.com/thjbdvlt/corpus-narraFEATS), using public-domain texts from Wikisource and articles from Wikipedia (CC-BY-SA), all heavily modified by me in order to ensure linguistic diversity. If CC-BY-SA is too restrictive, I should be able to replace the Wikipedia articles with other material, so I could put the corpus under the MIT license (which would be a good idea anyway).

I'll train another set of static word vectors on a corpus without a non-commercial clause; it will probably not be as good as the current word vectors.

For the code itself: I think I will split this repository and put the pipeline somewhere else, so the code (which is trivial) can be under the MIT license. I'll do that in the next few days.

Does that answer your questions?
That's good to hear that the license could be made finer-grained. Regarding the comment
Does that apply to the existing model, or only with the way you're suggesting to modify the licenses? If the current outputs are not considered derived works and are OK for commercial use, then perhaps the readme could call out that only modifications and contributions need to be contributed back. Or would changing the license have other practical implications for commercial use?

These questions are maybe naive, but I'd like to raise my confidence in both using and contributing back to this and similar projects. Licenses around data products are the most confusing for me and often lead me to believe that they are purely for academic / research purposes. I wish they were a bit more explicit, like "if you change things, contribute them back", "the outputs are valid for any use", etc. (not your fault obviously, as there's already a lot of material trying to clarify the terms of these licenses). I try to contribute back to projects I make use of either way, but sometimes I avoid them instead of being encouraged to contribute.

As for training the model myself, I haven't run through a spaCy pipeline yet, and I have it on my todo list to do exactly that with your pipeline. I've hit many issues similar to the imperative one that motivated this issue, and I need to get my head around improving the quality of outputs with my own customizations. Do you have an estimate of how long the training took and what the minimum system requirements might be?
With the current license, you can't use the pipeline for commercial purposes, unfortunately. That's because of the linguistic corpora used to train the vectors: as you guessed, those are made for academic research and produced in that context (e.g. this one). But the data used to train the morphologizer itself (which tags words as

To me, the output is not a derived work. The license defines derived material as "Licensed Material [that] is translated, altered, arranged, transformed, or otherwise modified". But you will just *use* the pipeline, and the ShareAlike clause says nothing about use. That's different from the non-commercial clause ("No Commercial Use"), which says that you cannot use the licensed work for commercial purposes: you only have "the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database for NonCommercial purposes only".

This is messy, anyway. That's because CC licenses are actually bad for this kind of object (software) and were not designed for it. A GPLv3 license would be better: it gives explicit rights for such cases:
With such a license, if you make changes to the pipeline itself, you cannot sell it (the pipeline itself), but you can keep it for yourself and do whatever you want with it, including commercial things. As you can read here, I "have permission to adapt another licensor’s work under CC BY-SA 4.0 and release your contributions to the adaptation under GPLv3" -- but not under CC BY-SA-NC 4.0. Hence, I'll go for a GPLv3 license once I've replaced the word vectors corpora. As you suggest, I will probably write in the README that "the outputs are valid for any use", or something like that. Is that clearer?

About training: the training itself is not long (around 20 minutes on my 10-year-old laptop), but since you will repeat it at least a few times to find good hyperparameters, it can take some time. But it's nothing compared to the most important part, which is annotating a corpus (it took me weeks), so I recommend that you do not do that; just use the one I've made or a Universal Dependencies corpus. But I found that UD corpora are not very good for French: either too small (like Paris French stories) or incomplete (the one used by spaCy doesn't include a single "tu", I think, nor inclusive language forms). But if
Also, I forgot to write in the README that there is a dependency to run

Here is the cited legal code: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.en
Thank you for the in-depth explanation, and the intent to make it friendlier for broader use. But the crux of the question for me is the ambiguity in my interpretation of two things you said
And
Are you implying that if you ran the pipeline as part of a company's infrastructure, that would violate the licensing, but if you used it as a script to prepare data, e.g. then plugged that data into a database, the output is usable? Or were you comparing the current state to the future state of less strict licensing?

On the hunspell topic, I spent my morning getting it working on my Mac, due to a hardcoded path to the headers and lib 😆: pyhunspell/pyhunspell#67 (comment)
Yes, I was comparing the current and future states. It's not a derived work, so CC BY-SA lets you use it commercially, but CC BY-SA-NC doesn't, because CC BY-SA-NC restricts every use, not only derived works. Sorry, I'm not being very clear.

Great that you were able to install the Hunspell dev files / PyHunspell. I see that PyHunspell is rarely updated despite the issues, and that there are more recent packages that do the same thing (e.g. https://github.com/MSeal/cython_hunspell or https://pypi.org/project/pynuspell/). Did you try any of these, and do you think one should be used instead?
I didn't try any others. I can make a quick empty poetry project, try installing each, and report back.
For some reason, I get
@thjbdvlt On the topic of retraining on a different dataset: since I'm actively trying to learn some of these skills, I'd be willing to contribute some time to help you complete that task, if you're interested in spreading the work over two people and filling me in on the approach. It might be a bit slower given your deep familiarity with the tools and data, but if you create an issue and add some instructions, I could try to throw a PR together. I can also get on a call to chat, feel free to talk it through.
Thanks for trying! Let's stick with PyHunspell, then. Maybe I could put a link to your commit in a Dependencies section of the README.

I'll train the static word vectors today, on discussion messages extracted from Wikipedia dumps. I didn't want to use the articles because personal pronouns (especially "je", "tu") are pretty rare in them, but very common in discussions, so it took me some time to find a way to extract the text, but now I can train them. If you're curious how it's done: the static word vectors are not trained with spaCy, but with Gensim, using the Word2Vec algorithm. (It statically represents all words of a corpus in a (kind of) semantic space, based on their contextual co-occurrences, where closeness between two vectors means closeness between two words' meanings -- e.g.
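As an aside, "closeness between two vectors" in this kind of semantic space is usually measured with cosine similarity. A minimal, dependency-free sketch (the toy 3-dimensional vectors and the word choices are made up for illustration; real Word2Vec vectors have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 means same direction,
    # values near 0 mean unrelated directions.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors: "je" and "tu" (both personal pronouns) are
# placed close together, "table" points elsewhere.
vec_je = [0.9, 0.1, 0.2]
vec_tu = [0.8, 0.2, 0.3]
vec_table = [0.1, 0.9, 0.1]

print(cosine_similarity(vec_je, vec_tu))     # high
print(cosine_similarity(vec_je, vec_table))  # low
```

With Gensim, the same comparison is what `model.wv.similarity("je", "tu")` computes under the hood.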
Nice to hear! Some help would be welcome :)
I'll open some issues about these three things. Maybe some of them are of interest to you?
Those are all tasks I'm interested in, especially the compound nouns. I just wrote some hacks to detect them using a dictionary lookup and a sliding window of width 3-5, to handle things like "caserne de pompiers". I'll start digging into the lemmatizer and make sure I can build it, and test the pipeline locally.
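For reference, the dictionary-lookup-plus-sliding-window hack described above could look something like this (a rough sketch, not the actual code; the function name and the tiny dictionary are invented for the example):

```python
def find_compounds(tokens, dictionary, min_len=2, max_len=5):
    """Scan a token list with a sliding window of min_len..max_len tokens,
    reporting any window that matches an entry in the compound dictionary,
    e.g. "caserne de pompiers"."""
    found = []
    i = 0
    while i < len(tokens):
        matched = None
        # Prefer the longest match starting at the current position.
        for size in range(max_len, min_len - 1, -1):
            candidate = " ".join(tokens[i:i + size])
            if candidate in dictionary:
                matched = candidate
                break
        if matched:
            found.append(matched)
            i += len(matched.split())  # skip past the matched compound
        else:
            i += 1
    return found

# Toy dictionary of known compounds.
compounds = {"caserne de pompiers", "pomme de terre"}
tokens = "la caserne de pompiers est fermée".split()
print(find_compounds(tokens, compounds))  # ['caserne de pompiers']
```

This is unbatched and O(n · max_len), which matches the perf concern mentioned later in the thread.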
Great! By "compound words" I only meant stuff like
Ah right, I misinterpreted that. That would be great to have as well. I didn't know spaCy had the phrase matcher feature. I need to sit down and go through the docs more thoroughly. So it seems my "hack" is valid, just at a later stage in my pipeline, since you just pass it a list of words to look for. But it seems you can also look for general patterns, maybe "noun-prep-noun". I'll try using that with a fixed list to validate it. I'm sure the perf will be better than my current unbatched setup.
@joprice I actually sent you the wrong link: it's not the PhraseMatcher but the Matcher that may be similar to your hack (and maybe faster, if it accesses the C-level token properties -- I haven't read the code there).

I've updated the word vectors and just released the full pipeline under the GPLv3 license. I've also made a few additions to the corpus used to train the morphologizer, which should have improved its accuracy. When I have time, I'll upload the files and scripts for the complete training, and open some issues.
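A "noun-prep-noun" rule with spaCy's `Matcher` might look like the sketch below. The POS tags are set by hand here so the snippet runs without a trained model; in a real pipeline they would come from the tagger/morphologizer, and the pattern and label are illustrative:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("fr")
matcher = Matcher(nlp.vocab)
# One pattern: NOUN followed by an adposition followed by NOUN.
matcher.add("NOUN_PREP_NOUN", [[{"POS": "NOUN"}, {"POS": "ADP"}, {"POS": "NOUN"}]])

# Hand-built Doc with manually assigned POS tags for the demo.
doc = Doc(
    nlp.vocab,
    words=["la", "caserne", "de", "pompiers", "est", "fermée"],
    pos=["DET", "NOUN", "ADP", "NOUN", "AUX", "ADJ"],
)

for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # caserne de pompiers
```

Unlike the dictionary-lookup approach, this matches any noun-ADP-noun sequence, so in practice you'd likely still filter the matches against a fixed list.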
Regarding your comment on this thread explosion/spaCy#3052 (comment), I'm curious if the license could be more detailed: