
Commit 413bb91

feat: add GPU buffer loader for IndexProvider integration
Add GpuBufferLoader that bridges zvec's segment-based storage with GPU compute
pipelines (Metal, CUDA/cuVS). Streams vectors through the existing
IndexProvider::Iterator into contiguous float32 buffers ready for direct GPU
transfer.

GpuBufferLoader (gpu_buffer_loader.h):
- load(): stream all vectors from any IndexProvider into a GpuBuffer
- load_chunk(): chunked loading for datasets larger than GPU memory
- Automatic FP16/INT8 → FP32 conversion
- Works with Flat, HNSW, and IVF index providers

Replaces the previous standalone RocksDB VectorStorage approach (PR alibaba#174,
now closed) with proper integration into zvec's existing storage architecture.

Also adds Metal C++ backend documentation (docs/METAL_CPP.md) with an updated
architecture diagram showing the IndexProvider → GpuBuffer → Metal/CUDA pipeline.

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
1 parent 48083ab commit 413bb91

2 files changed

Lines changed: 380 additions & 0 deletions

File tree

docs/METAL_CPP.md

Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
# Metal C++ Backend

GPU-accelerated vector operations for Apple Silicon using Metal shaders.

## Architecture

```
IndexProvider (Flat/HNSW/IVF)
│
├── Iterator ──→ GpuBufferLoader::load()
│                        │
│                GpuBuffer (contiguous float32)
│                        │
│                 ┌──────┴──────┐
│                 │             │
│           Metal device    cudaMemcpy
│             buffer        (CUDA/cuVS)
│                 │
│           Metal Kernels
│        (L2, IP, Cosine, TopK)
│                 │
│           Results Buffer
│
└── get_vector(key) ──→ single vector lookup
```

## GPU Buffer Loading

The `GpuBufferLoader` bridges zvec's segment-based storage with GPU compute
pipelines. It streams vectors through `IndexProvider::Iterator` into a
contiguous float32 buffer ready for GPU transfer.

```cpp
#include <ailego/gpu/gpu_buffer_loader.h>

// Load all vectors from any index type
auto provider = index->create_provider();
auto buffer = zvec::GpuBufferLoader::load(provider);

// buffer.vectors is contiguous (N x dim) float32
// buffer.keys[i] corresponds to buffer.vector_at(i)

// Metal: create device buffer
id<MTLBuffer> mtl_buf = [device newBufferWithBytes:buffer.vectors.data()
                                            length:buffer.byte_size()
                                           options:MTLResourceStorageModeShared];

// CUDA: copy to device
cudaMemcpy(d_vectors, buffer.vectors.data(),
           buffer.byte_size(), cudaMemcpyHostToDevice);
```

### Chunked Loading

For datasets larger than GPU memory:

```cpp
auto iter = provider->create_iterator();
size_t chunk_size = 100000;  // vectors per chunk

while (iter->is_valid()) {
  auto chunk = zvec::GpuBufferLoader::load_chunk(
      iter.get(), provider->dimension(),
      provider->data_type(), chunk_size);

  // Process chunk on GPU...
}
```

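One plausible way to fill in the "Process chunk on GPU" step with CUDA is sketched below; the device buffer, stream handling, and kernel launches are illustrative assumptions rather than zvec APIs:

```cpp
#include <cuda_runtime.h>

// Allocate one chunk-sized device buffer and a stream, once, before the loop.
float *d_vectors = nullptr;
cudaMalloc(&d_vectors, chunk_size * provider->dimension() * sizeof(float));
cudaStream_t stream;
cudaStreamCreate(&stream);

// Then, for each chunk returned by load_chunk():
cudaMemcpyAsync(d_vectors, chunk.vectors.data(), chunk.byte_size(),
                cudaMemcpyHostToDevice, stream);
// ... launch distance / top-k kernels for this chunk on `stream` ...
cudaStreamSynchronize(stream);  // finish before reusing d_vectors for the next chunk
```
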
## Metal Kernels

### Distance Kernels

| Kernel | Description |
|--------|-------------|
| `metal_l2_distance` | Basic L2 distance (1 thread per pair) |
| `metal_l2_distance_simd` | float4-vectorized L2 |
| `metal_l2_distance_fp16` | Half-precision L2 |
| `metal_l2_distance_batch` | One query against all database vectors |
| `metal_l2_distance_simdgroup` | Simdgroup cooperative L2 (32 threads per pair) |
| `metal_inner_product` | Basic inner product |
| `metal_inner_product_simdgroup` | Simdgroup cooperative inner product |
| `metal_cosine_similarity_simdgroup` | Simdgroup cosine similarity |

### Utility Kernels

| Kernel | Description |
|--------|-------------|
| `metal_matmul_batch` | Basic matrix multiplication (C = A * B^T) |
| `metal_matmul_tiled` | Tiled matmul with shared memory |
| `metal_normalize_simdgroup` | In-place L2 normalization |
| `metal_topk_simdgroup` | Per-query top-k selection |

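Host-side dispatch is the same for all of these kernels; the sketch below launches `metal_l2_distance_batch` on a `GpuBuffer`. Only the kernel name comes from the table above — the argument layout, buffer indices, and threadgroup size are assumptions for illustration:

```cpp
#import <Metal/Metal.h>

// Assumes `buffer` is a populated zvec::GpuBuffer and `query` points to
// buffer.dim floats.
id<MTLDevice> device = MTLCreateSystemDefaultDevice();
id<MTLComputePipelineState> pso = [device
    newComputePipelineStateWithFunction:[[device newDefaultLibrary]
                                            newFunctionWithName:@"metal_l2_distance_batch"]
                                  error:nil];

id<MTLBuffer> db_buf = [device newBufferWithBytes:buffer.vectors.data()
                                           length:buffer.byte_size()
                                          options:MTLResourceStorageModeShared];
id<MTLBuffer> q_buf = [device newBufferWithBytes:query
                                          length:buffer.dim * sizeof(float)
                                         options:MTLResourceStorageModeShared];
id<MTLBuffer> out_buf = [device newBufferWithLength:buffer.n_vectors * sizeof(float)
                                            options:MTLResourceStorageModeShared];
uint32_t dim = static_cast<uint32_t>(buffer.dim);

id<MTLCommandQueue> queue = [device newCommandQueue];
id<MTLCommandBuffer> cmd = [queue commandBuffer];
id<MTLComputeCommandEncoder> enc = [cmd computeCommandEncoder];
[enc setComputePipelineState:pso];
[enc setBuffer:q_buf offset:0 atIndex:0];
[enc setBuffer:db_buf offset:0 atIndex:1];
[enc setBuffer:out_buf offset:0 atIndex:2];
[enc setBytes:&dim length:sizeof(dim) atIndex:3];

// One thread per database vector.
[enc dispatchThreads:MTLSizeMake(buffer.n_vectors, 1, 1)
    threadsPerThreadgroup:MTLSizeMake(64, 1, 1)];
[enc endEncoding];
[cmd commit];
[cmd waitUntilCompleted];

// out_buf.contents now holds buffer.n_vectors float32 distances.
```
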
## Simdgroup Optimization

The `*_simdgroup` kernels use Metal's cooperative SIMD intrinsics (`simd_sum`, `simd_min`, `simd_shuffle`) to perform reductions across 32 SIMD lanes without shared memory barriers. Each simdgroup of 32 threads collaborates on a single (query, database) distance computation, splitting the dimension across lanes and reducing with hardware-accelerated cross-lane operations.

Dispatch model:

- Threadgroup size: 32 (one simdgroup)
- Grid: `(n_database, n_queries)` threadgroups

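The kernel sketch below illustrates this dispatch model. It is illustrative only — the shipped kernels' exact names, signatures, and buffer indices may differ:

```metal
#include <metal_stdlib>
using namespace metal;

// One threadgroup = one simdgroup (32 lanes) = one (query, database) pair.
// Dispatch: grid of (n_database, n_queries) threadgroups, 32 threads each.
kernel void l2_distance_simdgroup_sketch(
    device const float *queries    [[buffer(0)]],   // n_queries x dim
    device const float *database   [[buffer(1)]],   // n_database x dim
    device float       *distances  [[buffer(2)]],   // n_queries x n_database
    constant uint      &dim        [[buffer(3)]],
    constant uint      &n_database [[buffer(4)]],
    uint2 tg_pos [[threadgroup_position_in_grid]],  // (db index, query index)
    uint  lane   [[thread_index_in_simdgroup]]) {
  device const float *q = queries  + tg_pos.y * dim;
  device const float *d = database + tg_pos.x * dim;

  // Each lane accumulates a strided partial sum of squared differences.
  float partial = 0.0f;
  for (uint i = lane; i < dim; i += 32) {
    float diff = q[i] - d[i];
    partial += diff * diff;
  }

  // Cross-lane reduction, no threadgroup memory or barriers required.
  float total = simd_sum(partial);

  if (lane == 0) {
    distances[tg_pos.y * n_database + tg_pos.x] = total;
  }
}
```
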
## C++ Quantization

### Product Quantizer (`product_quantizer.h`)

Splits D-dimensional vectors into M sub-vectors and quantizes each with k-means.

```cpp
#include <ailego/algorithm/product_quantizer.h>

zvec::ailego::ProductQuantizer pq(/*m=*/8, /*k=*/256);
pq.train(data, n_vectors, dim);

std::vector<uint8_t> codes(n * 8);
pq.encode(data, n, codes.data());
```

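With `m = 8` and `k = 256`, every sub-vector code fits in one byte, so each vector compresses to 8 bytes regardless of its dimensionality. A back-of-envelope sizing, assuming (purely for illustration) 128-dimensional float32 vectors:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sizing: dim = 128, n = 1,000,000 vectors, m = 8, k = 256.
constexpr std::size_t dim = 128, n = 1'000'000, m = 8, k = 256;

constexpr std::size_t raw_bytes      = n * dim * sizeof(float);           // ~512 MB
constexpr std::size_t code_bytes     = n * m * sizeof(std::uint8_t);      // ~8 MB (~64x smaller)
constexpr std::size_t codebook_bytes = m * k * (dim / m) * sizeof(float); // 128 KiB, stored once
```
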
### Optimized PQ (`opq.h`)

Learns an orthogonal rotation matrix R via SVD-based Procrustes before PQ, minimizing quantization distortion.

```cpp
#include <ailego/algorithm/opq.h>

zvec::ailego::OptimizedProductQuantizer opq(/*m=*/8, /*k=*/256, /*n_iter=*/20);
opq.train(data, n_vectors, dim);

std::vector<uint8_t> codes(n * 8);
opq.encode(data, n, codes.data());
```

## Build

```bash
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
```

Metal shaders are compiled automatically on macOS via CMake.

## Future Work

- CUDA backend for NVIDIA GPUs (cuVS integration)
- ANE (Apple Neural Engine) backend via Core ML
- Distributed vector search across multiple nodes

src/ailego/gpu/gpu_buffer_loader.h

Lines changed: 234 additions & 0 deletions
@@ -0,0 +1,234 @@
// Copyright 2025-present the zvec project
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#pragma once

#include <cstdint>
#include <cstring>
#include <vector>
#include <zvec/core/framework/index_meta.h>
#include <zvec/core/framework/index_provider.h>

namespace zvec {

/**
 * @brief Result of loading vectors into a contiguous GPU-ready buffer.
 *
 * Contains parallel arrays: keys[i] corresponds to vectors[i * dim .. (i+1) *
 * dim). The vectors buffer is contiguous row-major float32, suitable for direct
 * transfer to Metal (device buffer) or CUDA (cudaMemcpy).
 */
struct GpuBuffer {
  std::vector<uint64_t> keys;  ///< Document keys, one per vector
  std::vector<float> vectors;  ///< Contiguous (n × dim) float32 buffer
  size_t n_vectors = 0;        ///< Number of vectors loaded
  size_t dim = 0;              ///< Dimensionality of each vector

  /// @brief Get a pointer to the i-th vector
  const float *vector_at(size_t i) const { return vectors.data() + i * dim; }

  /// @brief Total bytes in the vector buffer
  size_t byte_size() const { return vectors.size() * sizeof(float); }
};

/**
 * @brief Loads vectors from an IndexProvider into a contiguous GPU-ready buffer.
 *
 * This bridges zvec's segment-based storage with GPU compute pipelines
 * (Metal, CUDA/cuVS). It streams vectors through the IndexProvider::Iterator
 * into a single contiguous float32 buffer that can be directly mapped or
 * copied to GPU memory.
 *
 * Architecture:
 *   IndexProvider (Flat/HNSW/IVF) → Iterator → GpuBufferLoader → GpuBuffer
 *                                                                    │
 *                                                      Metal device buffer
 *                                                      or cudaMemcpy
 *
 * Usage:
 * @code
 * auto provider = index->create_provider();
 * auto buffer = GpuBufferLoader::load(provider);
 *
 * // Metal: create device buffer from contiguous data
 * id<MTLBuffer> mtl_buf = [device newBufferWithBytes:buffer.vectors.data()
 *                                             length:buffer.byte_size()
 *                                            options:MTLResourceStorageModeShared];
 *
 * // CUDA: copy to device
 * cudaMemcpy(d_vectors, buffer.vectors.data(), buffer.byte_size(),
 *            cudaMemcpyHostToDevice);
 *
 * // cuVS: build index directly
 * cagra::build(params, buffer.vectors.data(), buffer.n_vectors, buffer.dim);
 * @endcode
 */
class GpuBufferLoader {
 public:
  /**
   * @brief Load all vectors from a provider into a contiguous GPU buffer.
   *
   * Iterates through the provider's vectors and packs them into a single
   * contiguous float32 array. Handles FP32, FP16, and INT8 source types with
   * automatic conversion to float32.
   *
   * @param provider The index provider to stream vectors from.
   * @return GpuBuffer with contiguous vectors and associated keys.
   *
   * @note For large datasets, consider load_chunk() to load in chunks
   *       that fit in GPU memory.
   */
  static GpuBuffer load(const core::IndexProvider::Pointer &provider) {
    GpuBuffer buf;
    buf.dim = provider->dimension();
    buf.n_vectors = provider->count();

    // Pre-allocate for the known count
    buf.keys.reserve(buf.n_vectors);
    buf.vectors.reserve(buf.n_vectors * buf.dim);

    auto data_type = provider->data_type();
    auto iter = provider->create_iterator();

    while (iter->is_valid()) {
      buf.keys.push_back(iter->key());
      append_as_float32(buf.vectors, iter->data(), buf.dim, data_type);
      iter->next();
    }

    // Update actual count (may differ if iterator had fewer than count())
    buf.n_vectors = buf.keys.size();
    return buf;
  }

  /**
   * @brief Load a range of vectors (for chunked GPU transfer).
   *
   * Useful when the full dataset doesn't fit in GPU memory. The caller
   * manages the iterator lifetime across multiple calls.
   *
   * @param iter Iterator (caller manages; position is advanced).
   * @param dim Vector dimensionality.
   * @param data_type Source data type for conversion.
   * @param max_count Maximum number of vectors to load in this chunk.
   * @return GpuBuffer with up to max_count vectors.
   */
  static GpuBuffer load_chunk(core::IndexHolder::Iterator *iter, size_t dim,
                              core::IndexMeta::DataType data_type,
                              size_t max_count) {
    GpuBuffer buf;
    buf.dim = dim;

    buf.keys.reserve(max_count);
    buf.vectors.reserve(max_count * dim);

    size_t loaded = 0;
    while (iter->is_valid() && loaded < max_count) {
      buf.keys.push_back(iter->key());
      append_as_float32(buf.vectors, iter->data(), dim, data_type);
      iter->next();
      ++loaded;
    }

    buf.n_vectors = buf.keys.size();
    return buf;
  }

 private:
  /**
   * @brief Append a single vector to the float32 buffer, converting if needed.
   */
  static void append_as_float32(std::vector<float> &dst, const void *src,
                                size_t dim,
                                core::IndexMeta::DataType data_type) {
    size_t offset = dst.size();
    dst.resize(offset + dim);

    switch (data_type) {
      case core::IndexMeta::DT_FP32: {
        std::memcpy(dst.data() + offset, src, dim * sizeof(float));
        break;
      }
      case core::IndexMeta::DT_FP16: {
        // Convert half → float. Metal and CUDA both use IEEE 754 half.
        const uint16_t *half_ptr = static_cast<const uint16_t *>(src);
        for (size_t i = 0; i < dim; ++i) {
          dst[offset + i] = half_to_float(half_ptr[i]);
        }
        break;
      }
      case core::IndexMeta::DT_INT8: {
        const int8_t *int8_ptr = static_cast<const int8_t *>(src);
        for (size_t i = 0; i < dim; ++i) {
          dst[offset + i] = static_cast<float>(int8_ptr[i]);
        }
        break;
      }
      default: {
        // Fallback: assume float-sized elements, memcpy
        std::memcpy(dst.data() + offset, src, dim * sizeof(float));
        break;
      }
    }
  }

  /**
   * @brief Convert IEEE 754 half-precision to single-precision.
   *
   * Handles normals, denormals, inf, and NaN.
   */
  static float half_to_float(uint16_t h) {
    uint32_t sign = (h & 0x8000u) << 16;
    uint32_t exponent = (h >> 10) & 0x1Fu;
    uint32_t mantissa = h & 0x03FFu;

    if (exponent == 0) {
      if (mantissa == 0) {
        // Zero
        uint32_t bits = sign;
        float f;
        std::memcpy(&f, &bits, sizeof(f));
        return f;
      }
      // Denormalized: convert to normalized float
      while (!(mantissa & 0x0400u)) {
        mantissa <<= 1;
        exponent--;
      }
      exponent++;
      mantissa &= ~0x0400u;
      exponent += (127 - 15);
      uint32_t bits = sign | (exponent << 23) | (mantissa << 13);
      float f;
      std::memcpy(&f, &bits, sizeof(f));
      return f;
    } else if (exponent == 31) {
      // Inf or NaN
      uint32_t bits = sign | 0x7F800000u | (mantissa << 13);
      float f;
      std::memcpy(&f, &bits, sizeof(f));
      return f;
    }

    // Normalized
    exponent += (127 - 15);
    uint32_t bits = sign | (exponent << 23) | (mantissa << 13);
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
  }
};

}  // namespace zvec
