JACC v1.0 API Design Documentation #283
Replies: 6 comments 16 replies
Questions:
Do you have examples or other documentation on how to combine JACC.jl with other parts of the JuliaGPU ecosystem?
I think adding Apple Metal support would be good, since iTensor.jl supports it.
Is it possible to pass GPU parameters for more advanced users?
I would say that having the Kokkos signature might make it easier for Kokkos folks to use JACC.
Thank you all for your input and @PhilipFackler for the great work. We have a release candidate: https://github.com/JuliaORNL/JACC.jl/releases/tag/v1.0.0-rc1 Original APIs are kept.
To do:
Goal: discuss the JACC APIs with the broader community, working towards a stable public-facing API for JACC v1.0.
This is the closest thing to a JACC governance process, in which community members can provide input for decision making.
The JACC (Julia for Accelerators) API is designed for users to write parallel code once, in the form of "kernels" represented by independent parallel for loops (`parallel_for`) and reduction operations (`parallel_reduce`), and run it on CPUs and GPUs by leveraging the JuliaGPU ecosystem.
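As a rough illustration of the "write once" idea, here is a minimal sketch that uses only calls described in this post (`@init_backend`, `ones`, `parallel_for`, `to_host`). The kernel name `axpy` and the sizes are illustrative choices, and the exact behavior depends on the JACC version installed.

```julia
import JACC
JACC.@init_backend   # autodetect the configured backend (threads if none was set)

# Kernel body: called once per index i; iterations must be independent.
# `axpy` is an illustrative name, not part of JACC.
function axpy(i, alpha, x, y)
    @inbounds x[i] += alpha * y[i]
end

N = 1_000
alpha = 2.0
x = JACC.ones(Float64, N)   # allocated on the active backend (CPU or GPU)
y = JACC.ones(Float64, N)

# Same call regardless of backend; JACC picks the launch configuration.
JACC.parallel_for(N, axpy, alpha, x, y)

x_h = JACC.to_host(x)       # copy back to the host before inspecting values
```

The same script should run unchanged on any supported backend once the backend has been configured outside the code.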
Users interact with four major components:

- `set_backend` and `unset_backend`, inside code for autodetection: `@init_backend`
- `zeros`, `ones` and `array`
- `parallel_for` and `parallel_reduce`, advanced: `launch_spec`
- `atomic`, `shared`

Backend configuration APIs

- `JACC.set_backend(:backend)` or `JACC.set_backend("backend")`: JACC uses [Preferences.jl](https://github.com/JuliaPackaging/Preferences.jl) to modify an entry in `LocalPreferences.toml`. `:backend` can be any of the supported backends: `{:threads, :cuda, :amdgpu, :oneAPI}` (current plan is to extend to Apple). Threads is the assumed backend if this function is never called. Example in CI. `LocalPreferences.toml` example:
- `JACC.remove_backend("backend")`: when `JACC.set_backend("backend")` is called, the backend package (e.g. CUDA.jl, AMDGPU.jl) is downloaded and added, polluting the `Project.toml` file. This function is aimed at developers who want to clean up this file, and it prevents committing JACC and backend dependencies in the same `Project.toml`.

Note: `set_backend` and `unset_backend` are not meant to be used in code, only as preliminary steps to configure JACC.

- `JACC.@init_backend`: mandatory right after `import JACC` or `using JACC`. It is the autodetection mechanism that keeps user code portable and backend-independent. Not using it results in backend-related errors. Example

Array APIs for memory management

`dims` is a tuple of integers of the form `(Nx, Ny, Nz, ...)`

- `JACC.zeros(T, Nx, Ny, Nz, …)` or `JACC.zeros(T, dims)` returns an nd-array of type `T` (e.g. `Int`, `Float32`, `Float64`) filled with zeros. Example
- `JACC.ones(T, Nx, Ny, Nz, …)` or `JACC.ones(T, dims)` returns an nd-array of type `T` (e.g. `Int`, `Float32`, `Float64`) filled with ones. Example
- `JACC.array(Core.AbstractArray)` returns an nd-array that is a subtype of `Core.AbstractArray` with uninitialized memory allocated on device. Example
- `JACC.fill(value, Nx, Ny, Nz, ...)` or `JACC.fill(value, dims)` returns an nd-array of type `T` (e.g. `Int`, `Float32`, `Float64`) filled with a particular value. The array takes the value type. Example
- `JACC.to_host(x_d)`: returns a copy of a device-allocated array (e.g. GPU) on the host (e.g. CPU). If the array is already allocated on the CPU, it returns the same array. Example

Kernel Launching APIs: `parallel_for`, `parallel_reduce` and `synchronize`

Note: functions can also be passed using the `do` and `end` syntax construct in our examples after the `parallel_for`/`parallel_reduce` functions.

`parallel_for` APIs

`parallel_for` executes nested for loops over the index range under the assumption that each (i, j) operation is independent. `f` is the workload and `x...` are the arguments used inside the for loops. Note: on CPUs only the top-level loop is parallelized.

basic "high-level" `parallel_for` APIs

Requires no familiarity with GPUs. Users don't need to pass any configuration, e.g. threads, blocks. JACC will make a "best guess": on CPUs it uses the threads passed to Julia, and on GPUs it estimates blocks and threads.
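A small sketch of the basic form on a 2D range may help here. It uses the `do`/`end` syntax mentioned above, which relies on the "function first" signature `JACC.parallel_for(f, dims, x...)` discussed in this post; whether that signature is available depends on the installed JACC version, and the array names and sizes are illustrative.

```julia
import JACC
JACC.@init_backend

M, N = 64, 32
a = JACC.ones(Float64, M, N)
b = JACC.zeros(Float64, M, N)

# Conceptually equivalent to the serial loops
#   for j in 1:N, i in 1:M
#       b[i, j] = 2.0 * a[i, j]
#   end
# with every (i, j) iteration assumed independent.
# The do-block becomes the first argument `f` of parallel_for.
JACC.parallel_for((M, N), a, b) do i, j, a, b
    @inbounds b[i, j] = 2.0 * a[i, j]
end

b_h = JACC.to_host(b)   # host copy for inspection
```

No threads or blocks are passed; the launch configuration is left to JACC's "best guess".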
- `JACC.parallel_for(dims, f, x…)` Follows Kokkos syntax. Should we remove this pattern? Example
- `JACC.parallel_for(f, dims, x...)` Follows Julia's `map` "function first" syntax. Should we prefer this signature?
- `JACC.parallel_for(dims = N, args = (arg1, arg2, arg3, ...), f = function)` Keyword-argument based, order doesn't matter. Performance is not studied. Example

Note: remove launch spec entries for a basic API.

advanced `parallel_for` APIs

Introduces a struct to add launch configurations for more advanced users. Most key/value entries target tuning capabilities for GPUs and require some familiarity with them.
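As a hedged sketch of the advanced form, the snippet below builds a launch configuration with `JACC.launch_spec` and passes it as the first argument, following the `JACC.parallel_for(spec, dims, f, x...)` signature from this post. The kernel name `scale!` is illustrative, the block and thread counts are simply chosen so that `blocks .* threads` covers the 32×32 range, and how a CPU backend treats these entries is an assumption not confirmed here.

```julia
import JACC
JACC.@init_backend

# Illustrative kernel: scale every element of a 2D array in place.
function scale!(i, j, a, s)
    @inbounds a[i, j] *= s
end

a = JACC.ones(Float32, 32, 32)

# 2x2 blocks of 16x16 threads cover the 32x32 index range on a GPU.
spec = JACC.launch_spec(; blocks = (2, 2), threads = (16, 16))

JACC.parallel_for(spec, (32, 32), scale!, a, 3.0f0)

a_h = JACC.to_host(a)
```

Good values for `blocks` and `threads` are device dependent, which is why this form is aimed at users with some GPU familiarity.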
- `JACC.launch_spec(; kwargs)` creates a `JACC.LaunchSpec` struct to tune `parallel_for` configurations, e.g. `spec = JACC.launch_spec(; blocks = (2,2), threads = (16,16))`. Example
- `JACC.parallel_for(spec, dims, f, x…)` Lower-level, users need to pass configuration, e.g. threads/blocks. Follows Kokkos syntax. Example.
- `JACC.parallel_for(f, spec, dims, x...)` Lower-level, users need to pass configuration, e.g. threads/blocks. Follows Julia's `map` "function first" syntax. Should we favor this pattern and deprecate the Kokkos order in the previous pattern?
- `JACC.parallel_for(dims = N, args = (arg1, arg2, arg3, ...), f = function, blocks = blocks, threads = threads, sync = false)` Adds LaunchSpec entries to dims, args, and function. Example.

`parallel_reduce` APIs

basic "high-level" `parallel_reduce` APIs

Requires no familiarity with GPUs. Users don't need to pass any configuration, e.g. threads, blocks. JACC will make a "best guess": on CPUs it uses the threads passed to Julia, and on GPUs it estimates blocks and threads.
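The following sketch exercises three of the basic reduction forms discussed in this section: the default sum, a chosen operator, and a per-element function over a size. The helper name `elem_prod` is illustrative, and the exact return type of `parallel_reduce` (plain scalar or otherwise) depends on the JACC version, so no printed output is shown.

```julia
import JACC
JACC.@init_backend

N = 100
a = JACC.ones(Float64, N)
b = JACC.fill(2.0, N)

# Default reduction: sum of all elements of `a`.
total = JACC.parallel_reduce(a)

# Reduction with an explicit supported operator.
smallest = JACC.parallel_reduce(min, b)

# Per-element quantity, summed over 1:N (a dot product here).
# `elem_prod` is an illustrative helper, not part of JACC.
elem_prod(i, x, y) = @inbounds x[i] * y[i]
dot_ab = JACC.parallel_reduce(N, elem_prod, a, b)
```

With these inputs the three reductions correspond mathematically to 100, 2 and 200.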
- `red = JACC.parallel_reduce(a)` High-level, minimal function returning the sum of all `a` elements, a[1] + ... + a[N]. Example.
- `red = JACC.parallel_reduce(op, a)` High-level, lets the user choose a supported operator from `{+, *, min, max}`, e.g. `JACC.parallel_reduce(min, a)`. Example.
- `red = JACC.parallel_reduce(dims, f, x...)` Requires a size, a function to declare the per-element reduced quantity, and arguments. Default op is sum (+). Example
- `red = JACC.parallel_reduce(dims, op, f, x...; init)` Requires a size, an operator, a function to declare the per-element reduced quantity, arguments, and an initial value.
- `red = JACC.parallel_reduce(f, dims, x...)` Requires a size, a function to declare the per-element reduced quantity, and arguments. Default op is sum (+). Follows Julia's `map` "function first" syntax. Should we favor this pattern? Deprecate "dims" first?
- `red = JACC.parallel_reduce(f, dims, op, x...; init)` Requires a size, an operator, a function to declare the per-element reduced quantity, arguments, and an initial value. Follows Julia's `map` "function first" syntax. Should we favor this pattern? Deprecate "dims" first?
- `red = JACC.parallel_reduce(dims, dot, x1, x2)` Special op = dot product reduction, requires two arrays `x1`, `x2`. Example
- `red = JACC.parallel_reduce(; dims, f, args, [type, op, init,])` Keyword-argument based function.

advanced `parallel_reduce` APIs

Introduces a struct to add launch configurations for more advanced users. Most key/value entries target tuning capabilities for GPUs and require some familiarity with them.
- `red = JACC.parallel_reduce(spec, a)` High-level, minimal function returning the sum of all `a` elements, a[1] + ... + a[N]. Example
- `red = JACC.parallel_reduce(spec, op, a)` High-level, lets the user choose a supported operator from `{+, *, min, max}`, e.g. `JACC.parallel_reduce(spec, min, a)`. Example
- `red = JACC.parallel_reduce(spec, dims, f, x...)` Requires a size, a function to declare the per-element reduced quantity, and arguments. Default op is sum (+).
- `red = JACC.parallel_reduce(spec, dims, op, f, x...; init)` Requires a size, an operator, a function to declare the per-element reduced quantity, arguments, and an initial value.
- `red = JACC.parallel_reduce(f, spec, dims, x...)` Requires a size, a function to declare the per-element reduced quantity, and arguments. Default op is sum (+). Follows Julia's `map` "function first" syntax. Should we favor this pattern? Deprecate "dims" first?
- `red = JACC.parallel_reduce(f, spec, dims, op, x...; init)` Requires a size, an operator, a function to declare the per-element reduced quantity, arguments, and an initial value. Follows Julia's `map` "function first" syntax. Should we favor this pattern? Deprecate "dims" first?
- `red = JACC.parallel_reduce(spec, dims, dot, x1, x2)` Special op = dot product reduction, requires two arrays `x1`, `x2`.
- `red = JACC.parallel_reduce(; dims, f, args, [type, op, init,] kw...)` Keyword-argument based function. `kw...` can be any of the same keyword arguments used in `launch_spec(kw...)`. Example

synchronize

- `JACC.synchronize()`: synchronizes the default stream (if launched with `sync=false`). Example
- `JACC.synchronize(stream)`: synchronizes the input stream (not portable to CPUs, if launched with `sync=false`).

Inside kernel capabilities APIs

These functions are provided by JACC and can be used inside kernel functions passed to `parallel_for` and `parallel_reduce`.

- `JACC.shared(x)` bumps an array into "shared" memory, e.g. block memory on GPUs, for fast access. Note: only 1D arrays are returned. If an ND array is passed, it is flattened to 1D. Example
- `JACC.@atomic` declares an operation "atomic" (one at a time), leveraging the Atomix.jl support across backends. Care must be taken: atomic operations serialize parallel work and can therefore cost performance. Example
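To make the atomic case concrete, here is a minimal sketch in which every iteration increments a shared one-element counter. The kernel name `axpy_counter!` and the sizes are illustrative; the only in-kernel JACC feature used is `JACC.@atomic` as described above.

```julia
import JACC
JACC.@init_backend

# Every iteration updates its own x[i] (independent) and also bumps a
# single shared counter, which needs an atomic update to avoid races.
function axpy_counter!(i, alpha, x, y, counter)
    @inbounds x[i] += alpha * y[i]
    JACC.@atomic counter[1] += 1
end

N = 1_000
x = JACC.ones(Float64, N)
y = JACC.ones(Float64, N)
counter = JACC.zeros(Int32, 1)

JACC.parallel_for(N, axpy_counter!, 2.0, x, y, counter)

count_h = JACC.to_host(counter)   # host copy of the counter
```

Because all iterations contend for the same memory location, this pattern is correct but serializes the counter update, which is the performance cost the note above warns about.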