cuda: unary ops - perform the final computation as a float, and de-duplicate the code #1130

cmdr2 · 2025-03-03T14:30:06Z

Following up from ggml-org/llama.cpp#12104

Avoids compilation errors with CUDA 11.7 and SM50, by performing the actual operations in float. As @slaren pointed out, these operations are memory-bound.

I tested that the performance of fp16 operations is unaffected by this change. clamp continues to take 8ms vs 13ms for fp16-vs-fp32 with a 1 GB tensor on my 3060 12 GB.

Sorry for combining two tasks in one PR - this converts the operators to inline float functions, and uses a binbcast.cu-inspired pattern for de-duplicating the code. I'd originally planned to de-duplicate in a separate PR, but the code was easier-to-read without a large number of (float) casts in every function.

@slaren

…plicate the code. Avoids compilation errors with CUDA 11.7 and SM50. Also, the performance is memory-bound, so the simplified code is better on balance. Thanks @slaren for the suggestion.

cmdr2 · 2025-03-03T14:41:23Z

I tried to test this on the CI runner (by pushing to a branch on the main repo), but it didn't pick up the change - https://github.com/ggml-org/ggml/commits/cmdr2-fp16-unary/

Does ggml/ci only run on the master branch? Thanks

slaren · 2025-03-03T14:44:41Z

You also need to add ggml-ci somewhere in the commit message for it to be picked by the ggml CI.

cmdr2 · 2025-03-03T15:03:34Z

Thanks! That worked (and all the runs passed). Test results: https://github.com/ggml-org/ggml/commits/cmdr2-fp16-unary/

cmdr2 · 2025-03-03T15:08:06Z

@slaren Is there any reason not to run the broader set of CI runners (that llama.cpp uses) in the ggml branches as well? ggml is significantly less active (as a repo), and doesn't run for every PR (so compute cycles won't be that much).

Thanks!

slaren · 2025-03-03T15:21:34Z

The llama.cpp CI runs llama.cpp, so it would be tricky to run from the ggml repository. It would need some script to fetch llama.cpp and build it with a different version of ggml, or something to that effect.

A big part of the problem is likely the lack of resources to keep the CI updated. It's not something that receives many contributions, and almost all the work is done by @ggerganov. It may make more sense to restrict all contributions on this repository strictly to changes to the examples only, and redirect everything else to llama.cpp.

ggerganov · 2025-03-03T15:28:54Z

It may make more sense to restrict all contributions on this repository strictly to changes to the examples only, and redirect everything else to llama.cpp.

We can do that. Btw I am hoping that we will make ggml a submodule for llama.cpp. This would allow to run llama.cpp CI with a specific ggml branch and would eliminate the manual sync process.

slaren · 2025-03-03T15:34:25Z

I am not very familiar with submodules, what would be the workflow to add changes to llama.cpp that require changes in ggml as well? Would we need to open a PR first in ggml, and then after that is merged open the PR in llama.cpp?

ggerganov · 2025-03-03T15:47:57Z

It would require 2 PRs:

Open a PR in ggml repo - not merged yet
Open a PR in llama.cpp repo - update the submodule to point to the tip of the new ggml branch

The developer can iterate on the 2 branches (i.e. push changes to both and update the submodule always to the latest version of the ggml branch) and when ready, they need to first merge ggml PR and then update the llama.cpp PR to point to the tip of the ggml master branch. And finally merge the llama.cpp PR.

Definitely more work, which is the main reason I have avoided the submodules (along with the extra steps like git submodule init/update), but as it is now, the manual sync is basically the same thing but instead of the developers doing it, I am doing it manually for everyone.

If you have some alternatives in mind, let me know. I'm quite used to git submodules, but they will likely add some complexity and make the work of ggml contributors a bit harder.

slaren · 2025-03-03T15:54:12Z

Ok, thanks for the explanation. An obvious alternative to keep the workflow simple would be to merge the ggml, llama.cpp and whisper.cpp repositories into a single one, but it would be a more disruptive for the users.

Maybe a reasonable way to do that would be to move all development to the ggml repository, and keep the other repositories for releases only.

slaren · 2025-03-03T17:10:35Z

I am becoming more convinced that a monorepo would be the best solution for our case. Git submodules seem to work well for adding 3rd party dependencies that almost never need to be updated, but they add too much overhead when a PR requires making changes in multiple projects, which is the case in many (if not most) of our PRs. In fact, every change to ggml, even if it doesn't require changes to llama.cpp/whisper.cpp, would also require a PR in the other repositories to update the submodule.

I am afraid that if we go with submodules it will make contributing to llama.cpp less attractive due to the friction of dealing with submodules. It would certainly make me think carefully before opening a small PR, I would likely avoid doing that because it is already the case that the overhead of dealing with git and github PRs is higher than the time it takes to make a small code change, and submodules would effectively double that overhead. A monorepo would solve the same problems without adding this friction. If we continue making (ideally automated) releases in the llama.cpp and whisper.cpp repositories, and keep the repositories active for discussion, it would not really change anything for final users, only contributors would need to switch to the ggml repository.

ggerganov · 2025-03-03T17:30:39Z

If we continue making (ideally automated) releases in the llama.cpp and whisper.cpp repositories, and keep the repositories active for discussion, it would not really change anything for final users, only contributors would need to switch to the ggml repository.

You mean they could be essentially empty repos (i.e. without the source code), just used to pull the monorepo and make the releases?

cmdr2 · 2025-03-03T17:30:46Z

Just my 2-cents:

+1 to making ggml a submodule. ggml is a super-exciting development, and IMHO its impact can exceed even llama.cpp and whisper.cpp, with all due respect. :)

Also, a big +1 to keeping ggml light, lean and mean. Which means domain-specific code in their respective repositories. For e.g. llama-related or whisper-related projects have code that don't need to be in ggml.

Several domains (e.g. Stable Diffusion or lightweight ML inference on edge) never need that code.

--

I understand that right now most of the contributions to ggml are from llama.cpp, but I'd urge a longer-term outlook.

Even for existing llama.cpp contributors, updating ggml and then sending a PR to update llama isn't as much work as it sounds IMO. cd ggml; git pull; git checkout master; cd .. - it's a very common practice for projects - to split across repos and use submodules. Even for small projects.

ggml is already used as a submodule, for e.g. in sd.cpp. Even my throwaway test repos use ggml as a submodule.

--

At a broader sense, my experience is that it makes each project stronger. That's been my experience with Easy Diffusion, sdkit and torchruntime being separate repos, after being a mono-repo for a while. Each benefits from focused attention, instead of code being buried inside folders.

I fully agree with the approach of staying focused on specific use-cases, doing a few things well, and not spreading too thin. In this scenario though, I'd suggest that's not the case.

ggml is a big deal. :)

cmdr2 · 2025-03-03T17:32:30Z

I'd basically say, unix philosophy.

slaren · 2025-03-03T17:38:02Z

You mean they could be essentially empty repos (i.e. without the source code), just used to pull the monorepo and make the releases?

I would say so, but it would also be possible to keep the repository read-only (PRs disabled) and have a github workflow regularly pull the llama.cpp code from the ggml repository. I suppose that this could allow other projects that depend on llama.cpp or whisper.cpp to continue using the same repository.

Also, a big +1 to keeping ggml light, lean and mean. Which means domain-specific code in their respective repositories. For e.g. llama-related or whisper-related projects have code that don't need to be in ggml.

Several domains (e.g. Stable Diffusion or lightweight ML inference on edge) never need that code.

I understand this is not optimal for other projects that depend on ggml, but in practice the changes for them would be minimal. They would have to pull more code from ggml, but it doesn't meaningfully change their build process. At some point whisper.cpp was in ggml, and llama.cpp is already in whisper.cpp as part of the examples.

In the end, everybody benefits from a more efficient development process, it means that ggml will continue evolving at a faster pace. It will also mean that issues with outdated examples and out of sync tests will become more likely to be fixed in a timely manner.

cmdr2 · 2025-03-03T17:46:02Z

every change to ggml, even if it doesn't require changes to llama.cpp/whisper.cpp, would also require a PR in the other repositories to update the submodule.

Only if those repos need the new changes. If ggml is considered a project in its own right (as I feel it should), then it's similar to upgrading a dependency only when you need its latest features.

it doesn't meaningfully change their build process.

It would involve compiling and linking all the bits that don't need to be in the project.

And if ggml continues to be mainly for llama.cpp and whisper.cpp, I'd say it's natural for it to start getting specific shortcuts and hacks that are meant mainly for llama and whisper use-cases, rather than being a general purpose tensor library.

I think if ggml is meant to be a standalone tensor library, then the choice of how it is organized will incentivize the correct behavior.

--

I think the main point of contention seems to be around the need to update the submodule. And whether it's worth the overhead of updating the submodule, and how much effort that really is.

I don't have much else to add here :) Just that ggml has a lot of potential (independent of llama/whisper), and incentivizing the right repo structure matters over time.

slaren · 2025-03-03T17:54:07Z

Only if those repos need the new changes. If ggml is considered a project in its own right (as I feel it should), then it's similar to upgrading a dependency only when you need its latest features.

People rarely contribute to ggml directly, they contribute to llama.cpp and to a lesser extent whisper.cpp. Nearly every meaningful change to ggml in the last two years came from work motivated by these projects.

It would involve compiling and linking all the bits that don't need to be in the project.

If you are only pulling ggml for the libggml target as part of a cmake build script, you don't need to build the other targets. If you using some other process where you build ggml separately, you can still do so without building the examples with very minor changes to the build scripts. We could probably make that the default.

cmdr2 · 2025-03-03T17:54:29Z

I see your point too, so it's fine if ggml is combined into a single monorepo too.

We'll have to make the CMake targets suitable for lean-and-mean compilation.

And also ensure that it's just a monorepo in terms of git repo, but have separate project identities, so that people are aware of ggml and its power.

They shouldn't think it's just another folder in the repo.

cmdr2 · 2025-03-03T17:57:11Z

People rarely contribute to ggml directly

I only came to know of ggml when I saw the post on huggingface. It blew my mind.

With no hard feelings, there isn't a single page of documentation or API reference for ggml :) It's barely known. And I totally understand why, not complaining. But you can see why ggml contributors are likely to be really really motivated, and very few.

slaren · 2025-03-03T17:59:53Z

there isn't a single page of documentation or API reference for ggml :) It's barely known.

Absolutely, we need to do better in that aspect. But ultimately the problem is that the resources that we have are very limited, and spending time writing documentation means that less time is spent in other things. Increasing the overhead would only make this worse.

cmdr2 · 2025-03-03T18:01:04Z

But the people who do know about ggml are really excited. sd.cpp was written. NVIDIA folks knew about ggml (and asked what they could do to help), when they asked me about Easy Diffusion's plans.

So I think there's a loyal group of people who love ggml, and find it very refreshing.

cmdr2 · 2025-03-03T18:05:36Z

resources that we have are very limited

I totally agree :) That's why I said I'm not at all saying this as a complaint.

Infact, I'll go and argue against myself.

I split Easy Diffusion's rendering code to sdkit, and that was great for many reasons. But I eventually realized that I wasn't really interested in providing people a library. I didn't mind if others used it, but my focus was on Easy Diffusion.

So I think first ggml-org needs to figure out whether it wants ggml to be a standalone project. That's an introspection that I can't say much about.

My gut says that keeping ggml in the same repo as whisper.cpp and llama.cpp will harm ggml's prospects, but that's just a gut feeling.

cmdr2 · 2025-03-03T18:08:36Z

I'll still continue keeping sdkit in a separate repo (or whatever is Easy Diffusion's next rendering backend), but that's mainly for code organization. But sdkit is not likely to be a general-purpose project meant for others, but others are free to use it. Maybe that's a question for ggml as well, to figure out its identity.

Anyway, I think this might be getting too meta, so I'll stop :)

slaren · 2025-03-03T18:27:38Z

I think it is a legitimate concern that it may diminish the visibility of ggml, but it doesn't need to be the case if we continue making the ggml repository about ggml, in which llama.cpp and whisper.cpp are examples or derived projects. If done properly, we could take advantage of the popularity of llama.cpp to increase the visibility of ggml instead. Many people are not aware that ggml is a separate component even when contributing to llama.cpp, and this could solve that.

cmdr2 · 2025-03-03T18:43:47Z

Agreed. All of these are reversible decisions, so any of them could be tried.

In any case, if this decision isn't a key pain-point right now, then we can also add the same CI runners to ggml (happy to help here), and continue syncing manually (for a little while more). With the intention of making the sync less work, and trying to make it just a routine chore rather than a big event.

This won't scale forever, but might still be okay for a while. (Easy for me to say though, hah). Wouldn't want ggerganov burning out with chores though.

cmdr2 · 2025-03-03T18:46:45Z

And yeah, it's tricky because each project is a big deal. llama.cpp is already a big deal, and will continue being one I think. So it would also be a bit unfair to make it just an example project. It's got a huge amount of potential, since the world isn't going back to the pre-LLM world.

cuda: unary ops - perform the final computation as a float, and de-du…

e1b0bf1

…plicate the code. Avoids compilation errors with CUDA 11.7 and SM50. Also, the performance is memory-bound, so the simplified code is better on balance. Thanks @slaren for the suggestion.

cmdr2 mentioned this pull request Mar 3, 2025

sync : ggml ggml-org/llama.cpp#12104

Merged

slaren approved these changes Mar 3, 2025

View reviewed changes

ggerganov merged commit 58ecf6b into ggml-org:master Mar 3, 2025
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cuda: unary ops - perform the final computation as a float, and de-duplicate the code #1130

cuda: unary ops - perform the final computation as a float, and de-duplicate the code #1130

cmdr2 commented Mar 3, 2025

cmdr2 commented Mar 3, 2025

slaren commented Mar 3, 2025

cmdr2 commented Mar 3, 2025 •

edited

Loading

cmdr2 commented Mar 3, 2025

slaren commented Mar 3, 2025

ggerganov commented Mar 3, 2025

slaren commented Mar 3, 2025

ggerganov commented Mar 3, 2025 •

edited

Loading

slaren commented Mar 3, 2025 •

edited

Loading

slaren commented Mar 3, 2025

ggerganov commented Mar 3, 2025

cmdr2 commented Mar 3, 2025

cmdr2 commented Mar 3, 2025

slaren commented Mar 3, 2025 •

edited

Loading

cmdr2 commented Mar 3, 2025

slaren commented Mar 3, 2025

cmdr2 commented Mar 3, 2025

cmdr2 commented Mar 3, 2025 •

edited

Loading

slaren commented Mar 3, 2025

cmdr2 commented Mar 3, 2025 •

edited

Loading

cmdr2 commented Mar 3, 2025 •

edited

Loading

cmdr2 commented Mar 3, 2025 •

edited

Loading

slaren commented Mar 3, 2025

cmdr2 commented Mar 3, 2025

cmdr2 commented Mar 3, 2025

cuda: unary ops - perform the final computation as a float, and de-duplicate the code #1130

cuda: unary ops - perform the final computation as a float, and de-duplicate the code #1130

Conversation

cmdr2 commented Mar 3, 2025

cmdr2 commented Mar 3, 2025

slaren commented Mar 3, 2025

cmdr2 commented Mar 3, 2025 • edited Loading

cmdr2 commented Mar 3, 2025

slaren commented Mar 3, 2025

ggerganov commented Mar 3, 2025

slaren commented Mar 3, 2025

ggerganov commented Mar 3, 2025 • edited Loading

slaren commented Mar 3, 2025 • edited Loading

slaren commented Mar 3, 2025

ggerganov commented Mar 3, 2025

cmdr2 commented Mar 3, 2025

cmdr2 commented Mar 3, 2025

slaren commented Mar 3, 2025 • edited Loading

cmdr2 commented Mar 3, 2025

slaren commented Mar 3, 2025

cmdr2 commented Mar 3, 2025

cmdr2 commented Mar 3, 2025 • edited Loading

slaren commented Mar 3, 2025

cmdr2 commented Mar 3, 2025 • edited Loading

cmdr2 commented Mar 3, 2025 • edited Loading

cmdr2 commented Mar 3, 2025 • edited Loading

slaren commented Mar 3, 2025

cmdr2 commented Mar 3, 2025

cmdr2 commented Mar 3, 2025

cmdr2 commented Mar 3, 2025 •

edited

Loading

ggerganov commented Mar 3, 2025 •

edited

Loading

slaren commented Mar 3, 2025 •

edited

Loading

slaren commented Mar 3, 2025 •

edited

Loading

cmdr2 commented Mar 3, 2025 •

edited

Loading

cmdr2 commented Mar 3, 2025 •

edited

Loading

cmdr2 commented Mar 3, 2025 •

edited

Loading

cmdr2 commented Mar 3, 2025 •

edited

Loading