cuda: unary ops - perform the final computation as a float, and de-duplicate the code #1130
Conversation
Performs the final computation as a float and de-duplicates the code. Avoids compilation errors with CUDA 11.7 and SM50. Also, the performance is memory-bound, so the simplified code is better on balance. Thanks @slaren for the suggestion.
I tried to test this on the CI runner (by pushing to a branch on the main repo), but it didn't pick up the change - https://github.com/ggml-org/ggml/commits/cmdr2-fp16-unary/ Does the CI need to be triggered in some other way?
You also need to add `ggml-ci` to the commit message.
Thanks! That worked (and all the runs passed). Test results: https://github.com/ggml-org/ggml/commits/cmdr2-fp16-unary/
@slaren Is there any reason not to run the broader set of CI runners (that llama.cpp uses) on the ggml branches as well? ggml is significantly less active (as a repo), and CI doesn't run for every PR, so the compute cost wouldn't be that high. Thanks!
The llama.cpp CI runs llama.cpp, so it would be tricky to run it from the ggml repository. It would need some script to fetch llama.cpp and build it with a different version of ggml, or something to that effect. A big part of the problem is likely the lack of resources to keep the CI updated. It's not something that receives many contributions, and almost all the work is done by @ggerganov. It may make more sense to restrict contributions to this repository to changes to the examples only, and redirect everything else to llama.cpp.
We can do that. Btw, I am hoping that we will make ggml a submodule of llama.cpp and whisper.cpp at some point.
I am not very familiar with submodules - what would be the workflow to add changes to llama.cpp that require changes in ggml as well? Would we need to open a PR first in ggml, and then after that is merged open the PR in llama.cpp?
It would require 2 PRs:

- a PR in the ggml repo with the actual changes
- a PR in llama.cpp that updates the submodule to point to the new ggml commit

The developer can iterate on the 2 branches (i.e. push changes to both, and always update the submodule to the latest version of the ggml branch). Definitely more work, which is the main reason I have avoided the submodules (along with the extra steps like `git submodule update` when cloning). If you have some alternatives in mind, let me know. I'm quite used to git submodules, but they will likely add some complexity and make the work of contributors harder.
Ok, thanks for the explanation. An obvious alternative to keep the workflow simple would be to merge the repositories. Maybe a reasonable way to do that would be to move all development to the llama.cpp repository, and sync the changes back to the ggml repository from there.
I am becoming more convinced that a monorepo would be the best solution for our case. Git submodules seem to work well for adding 3rd party dependencies that almost never need to be updated, but they add too much overhead when a PR requires making changes in multiple projects, which is the case in many (if not most) of our PRs. In fact, every change to ggml eventually needs to be synced to llama.cpp and whisper.cpp. I am afraid that if we go with submodules it will make contributing to these projects harder.
You mean they could be essentially empty repos (i.e. without the source code), just used to pull the monorepo and make the releases?
Just my 2-cents: +1 to making ggml a submodule. Also, a big +1 to keeping ggml separate from llama.cpp - several domains (e.g. Stable Diffusion or lightweight ML inference on edge) never need that code.

--

I understand that right now most of the contributions to ggml are from llama.cpp, but I'd urge a longer-term outlook. Even for existing llama.cpp contributors, updating a submodule now and then seems like a manageable overhead.
--

In a broader sense, my experience is that separate repos make each project stronger. That's been my experience with Easy Diffusion, sdkit and torchruntime being separate repos, after being a monorepo for a while. Each benefits from focused attention, instead of code being buried inside folders. I fully agree with the approach of staying focused on specific use-cases, doing a few things well, and not spreading too thin. In this case though, I'd suggest that keeping ggml separate isn't spreading too thin. ggml is a big deal. :)
I'd basically say: Unix philosophy.
I would say so, but it would also be possible to keep the repository read-only (PRs disabled) and have a github workflow regularly pull the changes from the monorepo.
I understand this is not optimal for other projects that depend on ggml. In the end, though, everybody benefits from a more efficient development process - it means that fixes and improvements land in ggml sooner.
Only if those repos need the new changes. If ggml is considered a project in its own right (as I feel it should be), then it's similar to upgrading a dependency only when you need its latest features.
It would involve compiling and linking all the bits that don't need to be in the project. And if ggml continues to exist mainly for llama.cpp and whisper.cpp, I'd say it's natural for it to start accumulating shortcuts and hacks meant mainly for llama and whisper use-cases, rather than remaining a general-purpose tensor library. If ggml is meant to be a standalone tensor library, then the choice of how it is organized will incentivize the correct behavior.

--

I think the main point of contention is the need to update the submodule - whether that overhead is worth it, and how much effort it really is. I don't have much else to add here :) Just that ggml has a lot of potential (independent of llama/whisper), and incentivizing the right repo structure matters over time.
People rarely contribute to ggml directly.
If you are only pulling ggml as a dependency, a monorepo would not change anything for you.
I see your point too, so it's fine if ggml is combined into a single monorepo. We'll have to make the CMake targets suitable for lean-and-mean compilation, and also ensure that it's a monorepo only in terms of the git repo, while keeping separate project identities, so that people are aware of ggml and its power. They shouldn't think it's just another folder in the repo.
I only came to know of ggml when I saw the post on huggingface. It blew my mind. With no hard feelings, there isn't a single page of documentation or API reference for ggml :) It's barely known. And I totally understand why, not complaining. But you can see why ggml contributors are likely to be really really motivated, and very few.
Absolutely, we need to do better in that aspect. But ultimately the problem is that the resources we have are very limited, and time spent writing documentation is time not spent on other things. Increasing the overhead would only make this worse.
But the people who do know about ggml are really excited. sd.cpp was written with it. NVIDIA folks knew about ggml (and asked what they could do to help) when they asked me about Easy Diffusion's plans. So I think there's a loyal group of people who love ggml and find it very refreshing.
I totally agree :) That's why I said I'm not saying this as a complaint. In fact, I'll go and argue against myself: I split Easy Diffusion's rendering code into sdkit, and that was great for many reasons. But I eventually realized that I wasn't really interested in providing people a library. I didn't mind if others used it, but my focus was on Easy Diffusion. So I think ggml-org first needs to figure out whether it wants ggml to be a standalone project. That's an introspection that I can't say much about. My gut says that keeping ggml in the same repo as whisper.cpp and llama.cpp will harm ggml's prospects, but that's just a gut feeling.
I'll still keep sdkit in a separate repo (or whatever Easy Diffusion's next rendering backend is), but that's mainly for code organization. sdkit is not meant to be a general-purpose project for others, though others are free to use it. Maybe that's a question for ggml as well - to figure out its identity. Anyway, I think this might be getting too meta, so I'll stop :)
I think it is a legitimate concern that it may diminish the visibility of ggml.
Agreed. All of these are reversible decisions, so any of them could be tried. In any case, if this decision isn't a key pain-point right now, then we could also add the same CI runners to ggml (happy to help here) and continue syncing manually for a little while longer, with the intention of making the sync less work - a routine chore rather than a big event. This won't scale forever, but might still be okay for a while. (Easy for me to say, though.) I wouldn't want ggerganov burning out on chores.
And yeah, it's tricky because each project is a big deal. llama.cpp is already a big deal, and I think it will continue being one. So it would also be a bit unfair to make it just an example project. It's got a huge amount of potential, since the world isn't going back to a pre-LLM world.
Following up from ggml-org/llama.cpp#12104
Avoids compilation errors with CUDA 11.7 and SM50, by performing the actual operations in float. As @slaren pointed out, these operations are memory-bound.
I tested that the performance of fp16 operations is unaffected by this change.
`clamp` continues to take `8ms` vs `13ms` for fp16 vs fp32 with a 1 GB tensor on my 3060 12 GB.

Sorry for combining two tasks in one PR - this converts the operators to inline float functions, and uses a `binbcast.cu`-inspired pattern for de-duplicating the code. I'd originally planned to de-duplicate in a separate PR, but the code was easier to read without a large number of `(float)` casts in every function.
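As a rough sanity check on the memory-bound claim (assuming the 3060 12 GB's nominal ~360 GB/s memory bandwidth - a figure not stated in this PR): a unary op reads the 1 GB tensor once and writes it once, moving about 2 GB, so the bandwidth floor is

$$ t \approx \frac{2\ \text{GB}}{360\ \text{GB/s}} \approx 5.6\ \text{ms} $$

which is the same order as the measured 8 ms, leaving little room for the op to be compute-limited.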
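For readers who want a concrete picture of the pattern described above, here is a minimal sketch of the idea, using hypothetical names (`op_silu`, `unary_op_kernel`) rather than the actual identifiers from this PR: each op is a plain inline float function, and a single kernel template supplies the fp16/fp32 load/compute/store boilerplate, similar in spirit to `binbcast.cu`:

```cuda
#include <cuda_fp16.h>

// Each op is an ordinary inline float function - no (float) casts needed
// inside the op body itself.
__device__ __forceinline__ float op_silu(float x) {
    return x / (1.0f + expf(-x));
}

// One kernel template provides the load/compute/store boilerplate for every
// (op, element type) combination.
template <float (*op)(float), typename T>
static __global__ void unary_op_kernel(const T * x, T * dst, const int k) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    // Widen to float, compute, narrow back. The kernel is memory-bound, so
    // the conversions are effectively free, and computing in float avoids
    // the half-precision intrinsics that fail to build on CUDA 11.7 / SM50.
    dst[i] = (T)op((float)x[i]);
}
```

A launcher would then instantiate the template per op and element type, e.g. `unary_op_kernel<op_silu, half><<<num_blocks, block_size>>>(src, dst, k);`.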