This repository was archived by the owner on Jan 13, 2025. It is now read-only.

Commit 89bc647

Updated portBLAS documentation (#487)
* Fixed outdated text in README files
* Updated the Dockerfile

Co-authored-by: Ouadie EL FAROUKI <[email protected]>
1 parent 81bf6b7 commit 89bc647

15 files changed (+236 -162 lines)

Dockerfile

+2 -2

@@ -48,7 +48,7 @@ CMD cd /portBLAS && \
 if [ "${COMMAND}" = 'build-test' ]; then \
 if [ "${SYCL_IMPL}" = 'DPCPP' ]; then \
 export LD_LIBRARY_PATH="/tmp/dpcpp/lib" && mkdir -p build && cd build && \
-cmake .. -DBLAS_ENABLE_STATIC_LIBRARY=ON -DGEMM_TALL_SKINNY_SUPPORT=OFF \
+cmake .. -DGEMM_TALL_SKINNY_SUPPORT=OFF \
 -DSYCL_COMPILER=dpcpp -DCMAKE_PREFIX_PATH=/tmp/OpenBLAS/build \
 -DBLAS_ENABLE_CONST_INPUT=OFF -DCMAKE_BUILD_TYPE=Release && \
 make -j$(nproc) && cd test && ctest -VV --timeout 1200; \
@@ -58,7 +58,7 @@ CMD cd /portBLAS && \
 elif [ "${COMMAND}" = 'auto-tuner' ]; then \
 export LD_LIBRARY_PATH="/tmp/dpcpp/lib" && mkdir -p tools/auto_tuner/build \
 && cd tools/auto_tuner/build && \
-cmake .. -DBLAS_ENABLE_STATIC_LIBRARY=ON -DGEMM_TALL_SKINNY_SUPPORT=OFF \
+cmake .. -DGEMM_TALL_SKINNY_SUPPORT=OFF \
 -DSYCL_COMPILER=dpcpp -DCMAKE_PREFIX_PATH=/tmp/OpenBLAS/build \
 -DBLAS_ENABLE_CONST_INPUT=OFF -DCMAKE_BUILD_TYPE=Release && \
 make -j$(nproc); \

README.md

+84 -65
Large diffs are not rendered by default.

Roadmap.md

-30
This file was deleted.

benchmark/README.md

+98 -32
@@ -5,53 +5,82 @@ Benchmarks
 
 The portBLAS benchmarks are intended to measure the evolution of the
 performance of this BLAS implementation and how it compares with other tuned
-implementations, such as [CLBLAST](https://github.com/CNugteren/CLBlast)
-(a very performant OpenCL BLAS library).
+implementations, such as [CUBLAS](https://docs.nvidia.com/cuda/cublas/index.html), the native, optimized CUDA BLAS library for Nvidia GPUs,
+[ROCBLAS](https://github.com/ROCm/rocBLAS), the native HIP/ROCm BLAS
+library optimized for AMD GPUs, and [CLBLAST](https://github.com/CNugteren/CLBlast),
+a very performant OpenCL BLAS library for CPU targets.
 
 The benchmarks use Google's [benchmark](https://github.com/google/benchmark)
-library and generate a report with indicative metrics (see instructions below).
+library and generate a report with indicative metrics *(see instructions below)*.
 
 ## How to compile the benchmarks
+### portBLAS Benchmarks
+The portBLAS default benchmarks are compiled with the project if the
+`BLAS_ENABLE_BENCHMARK` CMake option is activated *(which is the case by
+default)*. For the results to be relevant, portBLAS needs to be built in Release
+mode *(passing `-DCMAKE_BUILD_TYPE=Release` to the `cmake` command)*.
 
-The benchmarks are compiled with the project if the `BLAS_ENABLE_BENCHMARK`
-CMake option is activated (which is the case by default).
-
+### CLBlast Benchmarks
 The CLBLAST benchmarks are compiled only if the `BUILD_CLBLAST_BENCHMARKS` CMake
 option is activated. If so, if CLBlast cannot be found the build will fail. The
 location of CLBlast can be given with `CLBLAST_ROOT`.
 To install CLBlast, see:
 [CLBlast: Building and installing](
-https://github.com/CNugteren/CLBlast/blob/master/doc/installation.md))
+https://github.com/CNugteren/CLBlast/blob/master/doc/installation.md)
+
+### cuBLAS Benchmarks
+cuBLAS benchmarks can be built by enabling the `BUILD_CUBLAS_BENCHMARKS` option and
+require an installation of CUDA *(>=11.x)* and the cuBLAS library *(both can be obtained by
+installing the CUDA Toolkit:
+[installation guide](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html))*.
 
+### rocBLAS Benchmarks
+rocBLAS benchmarks can be built by enabling the `BUILD_ROCBLAS_BENCHMARKS` option
+and require an installation of ROCm *(>=4.5.x)* and the rocBLAS library *(please refer
+to this [installation guide](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/Linux_Install_Guide.html)
+for the setup)*.
+
+## General Notes
 After the compilation, the binaries will be available:
-* in the build folder, in `benchmark/portblas/` and `benchmark/clblast/`
+* in the build folder, in `benchmark/[Library Name]/`
 * if you provide an installation directory with the CMake variable
 `CMAKE_INSTALL_PREFIX`, and run the installation command, e.g
 `ninja install`, in your installation folder, in `portblas/bin/`
 
 A verification of the results is enabled by default and can be disabled with the
 CMake option `BLAS_VERIFY_BENCHMARK` set to `OFF` or `0`. The verification will
-be run a small number of times (more than once because of the way the benchmark
+be run a small number of times *(more than once because of the way the benchmark
 library works, but much less than the usual number of iterations of the
-benchmarks). The verification requires that a reference implementation of BLAS
+benchmarks)*. The verification requires that a reference implementation of BLAS
 like OpenBLAS is installed, which path can be given with the `CMAKE_PREFIX_PATH`
 CMake parameter.
 
 ## How to run the benchmarks
 
 The benchmarks take two kinds of command-line options: those for the benchmark
-library and those specific to the portBLAS projects.
-
-Essentially, the benchmarks can take a CSV configuration file (or will use
-defaults), and if your machine has more than one OpenCL device, you can specify
-which one to use. The other options specify how to output the results.
+library and those specific to portBLAS:
+- Benchmark library options: these specify, for instance, the output
+format and the verbosity level, and can filter specific benchmarks with regex
+arguments.
+- portBLAS options: these are portBLAS-specific and specify
+the target SYCL device *(portBLAS benchmarks only)* as well as a custom set
+of operator-specific configurations *(applies to rocBLAS and cuBLAS benchmarks as
+well)*. If no CSV file is specified, the benchmark will use default values
+*(found in `include/common/common_utils.hpp`)*.
+
+For portBLAS and CLBlast benchmarks, if the target machine has more than one
+OpenCL device, you can specify which one to use at runtime(\*).
+
+(\*): Preferably, for portBLAS benchmarks, this device should match the
+`TUNING_TARGET`, as some operators are configured differently depending on the
+SYCL target for optimal performance.
 
 The most useful options for us are:
 
 |option|parameter|description|
 |------|:-------:|-----------|
 | `--help` | | Show help message |
-| `--device` | device name | Select a device to run on (e.g `intel:gpu`) |
+| `--device` | device name | Select a device to run on (e.g. `intel:gpu`); useful for portBLAS and CLBlast benchmarks |
 | `--csv-param` | file path | Path to a CSV file with the benchmark parameters |
 | `--benchmark_format` | `console` / `json` / `csv` | Specify the format of the standard output |
 | `--benchmark_out` | file path | Specify a file where to write the report |
@@ -64,21 +93,27 @@ The most useful options for us are:
 You can check the [GitHub repository](https://github.com/google/benchmark) of
 the library for information about the other supported command-line arguments.
 
-Here is an example of an invocation of the GEMM benchmark running on Intel GPU,
-displaying the results in the console and saving a json report:
+Here is an example of an invocation of the portBLAS GEMM benchmark running on
+Intel GPU, displaying the results in the console and saving a JSON report:
 
 ```bash
-./bench_gemm --device=intel:gpu --csv-param=parameters.csv \
+./benchmark/portblas/bench_gemm --device=intel:gpu --csv-param=parameters.csv \
 --benchmark_out=../results.json --benchmark_out_format=json \
 --benchmark_format=console
 ```
 
+Here is the same benchmark through cuBLAS:
+```bash
+./benchmark/cublas/bench_cublas_gemm --csv-param=parameters.csv --benchmark_out=../results.json \
+--benchmark_out_format=json --benchmark_format=console
+```
+
 ### CSV format
 
 The benchmarks can be given a CSV file containing the parameters to run with
-(matrix/vector dimensions, transpose or not, etc), in the following format: one
-line corresponds to one set of parameters, i.e. one name for the library (though
-it will be iterated many times for statistical accuracy).
+*(matrix/vector dimensions, transpose or not, etc.)*, in the following format: one
+line corresponds to one set of parameters, i.e. one name for the library *(though
+it will be iterated many times for statistical accuracy)*.
 
 The formats for the different BLAS levels are:
 
@@ -88,24 +123,51 @@ The formats for the different BLAS levels are:
 | blas 2 | *transpose_A,m,n,alpha,beta* | Action on the matrix (`n`, `t`, `c`), dimensions, and scalars alpha and beta |
 | blas 3 | | |
 | gemm | *transpose_A,transpose_B,m,k,n,alpha,beta* | Action on the matrices (`n`, `t`, `c`), dimensions (A: mk, B: kn, C: mn), and scalars alpha and beta |
-| gemm (Batched) | *transpose_A,transpose_B,m,k,n,alpha,beta,batch_size* | Action on the matrices (`n`, `t`, `c`), dimensions (A: mk, B: kn, C: mn), scalars alpha and beta, batch size |
+| gemm (Batched) | *transpose_A,transpose_B,m,k,n,alpha,beta,batch_size,batch_type* | Action on the matrices (`n`, `t`, `c`), dimensions (A: mk, B: kn, C: mn), scalars alpha and beta, batch size, batch type |
 | trsm | *side,triangle,transpose,diagonal,m,n,alpha* | Position of A (`l`, `r`), A is upper or lower triangular (`u`, `l`), transposition of A (`n`, `t`), A is unit or non-unit diagonal (`u`, `n`), dimensions, scalar alpha |
 
-Note: for operations that support a stride, the benchmarks will use a stride of
-1 (contiguous values), except for the GEMM batched operation where valid default stride values are used depending on batch type *(strided or interleaved)*. For operations that support a leading dimension, the
-benchmarks use the minimum possible value (the actual leading dimension of the
-matrix).
+### Notes
+
+For operations that support an increment, the benchmarks will use an increment of
+1 *(contiguous values)*, except for the GEMM batched operation, where valid
+default increment values are used depending on `batch_type` *(strided or
+interleaved)*. For operations that support a leading dimension, the
+benchmarks use the minimum possible value *(the actual leading dimension
+of the matrix)*.
 
-Here is an example of a valid CSV file for the GEMM benchmark:
+For batched-strided operations that expect strides, the benchmarking suite
+expects **stride multipliers** instead of explicit stride values. For example, in the
+`gemm_batched_strided` operator, a `stride_a_mul=3` for matrix `A` of `size_a=m*k`
+is equivalent to an actual stride of `stride_a = stride_a_mul * size_a`.
+
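To make the stride-multiplier rule above concrete, here is a minimal Python sketch *(the helper is illustrative, not part of portBLAS; only the `stride = multiplier * size` rule comes from the text above)*:

```python
# Sketch of the stride-multiplier convention: benchmarks take a
# multiplier, and the explicit stride is multiplier * matrix size.
def actual_stride(rows: int, cols: int, stride_mul: int) -> int:
    size = rows * cols        # e.g. size_a = m * k for matrix A
    return stride_mul * size  # stride_a = stride_a_mul * size_a

# With m=4, k=8 and stride_a_mul=3, consecutive batch items of A
# start 3 * 32 = 96 elements apart:
assert actual_stride(4, 8, 3) == 96
```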
+For operations that support the Complex data type *(and when the CMake option
+`BLAS_ENABLE_COMPLEX` is enabled)*, scalars such as `alpha` and `beta` are
+expected to have real and imaginary parts as separate values. If only single
+scalar values are passed in the CSV configuration files, complex values
+for these arguments will be constructed by duplicating the same real value
+for both real and imaginary parts *(e.g. alpha_cplx={alpha_real, alpha_real})*.
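A small sketch of that duplication rule *(the parser below is hypothetical; only the duplicate-the-real-value behaviour is described above)*:

```python
# Sketch: a single scalar field becomes a complex value with equal
# real and imaginary parts; two fields give both parts explicitly.
def parse_complex_scalar(fields):
    if len(fields) == 2:
        return complex(float(fields[0]), float(fields[1]))
    value = float(fields[0])
    return complex(value, value)  # alpha_cplx = {alpha_real, alpha_real}

assert parse_complex_scalar(["0.5"]) == complex(0.5, 0.5)
assert parse_complex_scalar(["1", "2"]) == complex(1.0, 2.0)
```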
+
+
+Here are two examples of valid CSV files for the GEMM benchmark:
+
+- Scalar-only alpha & beta values:
 
 ```
 n,n,42,42,42,1,0
 n,t,64,128,64,0.5,0.5
 t,n,13,3,7,0,0.7
 ```
 
+- Complex alpha & beta values *(two scalars each)*:
+```
+n,n,42,42,42,1,2,3,0
+n,t,64,128,64,0.5,1,0.5,1
+t,n,13,3,7,0,1,0.7,3
+```
+
+
 The folder `config_csv` provides a few files corresponding to sizes that are
-relevant for neural networks, but you can use your own files, see the next
+relevant for neural networks, but you can provide your own; see the next
 section for more info on how to generate them.
 
 ### Python tool to generate a CSV file
@@ -262,6 +324,9 @@ following keys:
 * `cpu_time`: actual CPU time spent running the benchmark
 * `time_unit`: unit used for these times. Should be `ns`, if not please file an
 issue.
+* `label`: contains the name of the benchmark backend *(`@backend`: "portblas",
+"cublas" or "rocblas")* and target-device-related information *(`device_name`,
+`device_version`, `driver_version`, etc.)*
 * `avg_event_time`: the average of the CL/SYCL event times in nanoseconds. This
 time depends on the events returned by the BLAS functions used and might not
 be accurate in some cases
@@ -280,9 +345,10 @@ following keys:
 * `bytes_processed`: total number of bytes read and written in memory. It is
 calculated theoretically based on the operations that we think the benchmark
 is doing.
-* a few benchmark parameters (e.g `m`, `n`, `k` for GEMM)
+* other operator-specific parameters that affect the computations *(e.g. `m`, `n`,
+`k` for GEMM; `alpha` is skipped as it doesn't affect the operation directly)*
 * some other keys from the benchmark library
 
 **Note:** to calculate the performance in Gflops, you can divide `n_fl_ops` by one
-of the best or average time metrics, e.g `avg_overall_time` (the event and wall
-time usually converge for large dimensions).
+of the best or average time metrics, e.g. `avg_overall_time` *(the event and wall
+time usually converge for large dimensions)*.
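As an illustration of that note, a short sketch that post-processes a saved JSON report *(assuming the key layout described above, with times in `ns`; `results.json` is the file from the invocation examples)*:

```python
# Sketch: derive Gflops from a benchmark JSON report. Dividing
# n_fl_ops by a time in nanoseconds directly yields Gflops, since
# 1e9 ns per second cancels 1e9 flops per Gflop.
import json

with open("results.json") as f:
    report = json.load(f)

for bench in report["benchmarks"]:
    if "n_fl_ops" in bench and "avg_overall_time" in bench:
        gflops = float(bench["n_fl_ops"]) / float(bench["avg_overall_time"])
        print(f"{bench['name']}: {gflops:.2f} Gflops")
```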

benchmark/gen_param.py

+3 -1

@@ -1,5 +1,5 @@
 #!/usr/bin/env python3
-#/***************************************************************************
+# /***************************************************************************
 # *
 # * @license
 # * Copyright (C) Codeplay Software Limited
@@ -32,12 +32,14 @@
 import itertools
 import argparse
 
+
 def main(args):
     """Generate the csv file according to the given arguments
     """
     # Match DSL to Python names
     nd_range = itertools.product
     value_range = lambda *v: list(v)
+
     def size_range(low, high, mult):
         val = low
         while val <= high:
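For context, the DSL that `gen_param.py` maps onto Python above *(`nd_range` is `itertools.product`, `value_range` a literal list, `size_range` a multiplicative sweep)* expands into one CSV row per parameter combination. A self-contained sketch with an illustrative GEMM-style expression *(not a shipped configuration)*:

```python
import itertools

# Same DSL-to-Python mapping as in gen_param.py above.
nd_range = itertools.product
value_range = lambda *v: list(v)

def size_range(low, high, mult):
    val = low
    while val <= high:
        yield val
        val *= mult

# Illustrative expression: transA, transB, m, k, n, alpha, beta.
rows = nd_range(value_range('n', 't'), value_range('n'),
                size_range(64, 256, 2), value_range(64), value_range(64),
                value_range(1), value_range(0))
for row in rows:
    print(','.join(str(x) for x in row))  # one benchmark CSV line each
```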

doc/AddingBlas3Op.md

+31 -10

@@ -35,7 +35,7 @@ typename sb_handle_t::event_t _trsm(sb_handle_t& sb_handle, char Side,
 container_0_t A, index_t lda,
 container_1_t B, index_t ldb);
 
-}
+} // namespace internal
 
 // User-facing function to call the TRSM operation
 template <typename sb_handle_t, typename container_0_t, typename container_1_t,
@@ -47,7 +47,6 @@ typename sb_handle_t::event_t inline _trsm(
 return internal::_trsm(sb_handle, Side, Triangle, Transpose, Diagonal, M, N, alpha, A, lda,
 B, ldb);
 }
-} // namespace internal
 } // namespace blas
 ```
@@ -98,14 +97,19 @@ void run_test(const combination_t<scalar_t> combi) {
 auto q = make_queue();
 SB_Handle sb_handle(q);
 
+//
+// Perform any host-to-device copies here
+//
+
 // Invoke the newly added operation
-_trsm(sb_handle, side, triangle, transA, diag, m, n, alpha, a_gpu, lda, b_gpu, ldb);
+_trsm(sb_handle, side, triangle, transA, diag, m, n, alpha, a_gpu, lda, b_gpu, ldb, dependencies);
 
 // Verify the results
 }
 
 // Create the combinations of parameters to invoke the test
-const auto combi = ::testing::Combine(::testing::Values(7, 513, 1027), // m
+const auto combi = ::testing::Combine(::testing::Values("usm", "buf"), // allocation type
+::testing::Values(7, 513, 1027), // m
 ::testing::Values(7, 513, 1027), // n
 ::testing::Values('n', 't'), // transA
 ::testing::Values('l', 'r'), // side
@@ -143,7 +147,8 @@ template <typename sb_handle_t, typename container_0_t, typename container_1_t,
 typename sb_handle_t::event_t _trsm(
 sb_handle_t& sb_handle, char Side, char Triangle, char Transpose, char Diagonal,
 index_t M, index_t N, element_t alpha, container_0_t A, index_t lda,
-container_1_t B, index_t ldb) {
+container_1_t B, index_t ldb,
+const typename sb_handle_t::event_t& _dependencies) {
 // Implementation of the new operation
 // This will probably invoke a kernel which we don't yet have defined.
 
@@ -153,7 +158,7 @@ typename sb_handle_t::event_t _trsm(
 auto gemmEvent = internal::_gemm(
 sb_handle, 'n', isTranspose ? 't' : 'n', M, currentBlockSize,
 currentBlockSize, (i == 0) ? alpha : element_t{1}, B + i * ldb, ldb,
-invA + i * blockSize, blockSize, element_t{0}, X + i * ldx, ldx);
+invA + i * blockSize, blockSize, element_t{0}, X + i * ldx, ldx, intermediate_events);
 trsmEvents = concatenate_vectors(trsmEvents, gemmEvent);
 
 // Ultimately, a list of all events created in this function is returned
@@ -173,8 +178,9 @@ binary is linked against portBLAS, the linker will find the definition of the mi
 
 To do this, we create the source file that will contain instantiations of the new `_trsm` operation.
 The file is located at `src/interface/blas3/trsm.cpp.in`. This is not the file that will be
-compiled, but a template file that the python script `python_generator/py_gen_blas_binary.py`
-will use to generate the actual source file where the instantiation of `_trsm` will happen.
+compiled, but a template file that the python script `python_generator/py_gen_blas_ops.py`
+will use to generate the actual source file where the instantiation of `_trsm` will happen. The call
+to this generator is wrapped within the cmake function `generate_blas_objects`.
 
 The file `src/interface/blas3/trsm.cpp.in` must include all files that are necessary to successfully
 compile `blas::internal::_trsm`, for this particular example, this file looks like the following:
@@ -193,13 +199,28 @@ compile `blas::internal::_trsm`, for this particular example, this file looks li
 namespace blas {
 namespace internal {
 
-
+// Buffer Declaration
 template typename SB_Handle::event_t _trsm(
 SB_Handle sb_handle, char Side, char Triangle, char Transpose, char Diagonal,
 ${INDEX_TYPE} M, ${INDEX_TYPE} N, ${DATA_TYPE} alpha,
 BufferIterator<${DATA_TYPE}> A, ${INDEX_TYPE} lda,
-BufferIterator<${DATA_TYPE}> B, ${INDEX_TYPE} ldb);
+BufferIterator<${DATA_TYPE}> B, ${INDEX_TYPE} ldb,
+const typename SB_Handle::event_t& _dependencies);
 
+// USM Declarations
+#ifdef SB_ENABLE_USM
+template typename SB_Handle::event_t _trsm(
+SB_Handle& sb_handle, char side, char uplo, char trans, char diag,
+${INDEX_TYPE} M, ${INDEX_TYPE} N, ${DATA_TYPE} alpha, ${DATA_TYPE} * A,
+${INDEX_TYPE} lda, ${DATA_TYPE} * B, ${INDEX_TYPE} ldb,
+const typename SB_Handle::event_t& _dependencies);
+
+template typename SB_Handle::event_t _trsm(
+SB_Handle& sb_handle, char side, char uplo, char trans, char diag,
+${INDEX_TYPE} M, ${INDEX_TYPE} N, ${DATA_TYPE} alpha,
+const ${DATA_TYPE} * A, ${INDEX_TYPE} lda, ${DATA_TYPE} * B,
+${INDEX_TYPE} ldb, const typename SB_Handle::event_t& _dependencies);
+#endif
 
 } // namespace internal
 } // namespace blas

python_generator/py_gen_blas_binary.py

Whitespace-only changes.

python_generator/py_gen_blas_binary_special.py

Whitespace-only changes.

python_generator/py_gen_blas_gemm_launcher.py

+1 -1

@@ -1,4 +1,4 @@
-#/***************************************************************************
+# /***************************************************************************
 # *
 # * @license
 # * Copyright (C) Codeplay Software Limited

python_generator/py_gen_blas_ops.py

+1 -1

@@ -1,4 +1,4 @@
-#/***************************************************************************
+# /***************************************************************************
 # *
 # * @license
 # * Copyright (C) Codeplay Software Limited
