DTB2.0 Python Interface Documentation
1 basic software info
os : CentOS 7.6
kernel version : 3.10.0-957.el7.x86_64
Python version : 3.7.7
devtools version : 7.3.1
hpcx version : 2.4.1
rocm version : dtk22.10.1
2 basic hardware info
CPU : Hygon C86 7185 32-core Processor
DRAM : 128GB
DCU : 4 * Z100 (1319MHz, 16GB)
Server Type : X785
3 environment preparation
DTB1.2 adopts a server-client architecture in which messages are defined in proto files. The server responds over gRPC to service requests that the client sends encapsulated in the protobuf protocol, as illustrated in Figure 1. The server is launched on the compute nodes via dist_simulator.sh, and each server process executes the dist_simulator executable. The process that communicates with the client runs on rank 0, while the remaining processes run on the other ranks; communication and data transfer between rank 0 and the other ranks go through MPI. Rank 0 occupies one node exclusively, while the remaining ranks run four per node, each bound to one DCU and sharing the node's 200 Gbps of IB bandwidth equally, so each process gets 50 Gbps. The launch scripts are provided below:
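For illustration only, here is a small sketch of the process layout described above; the node/DCU mapping is an assumption drawn from this paragraph, not from dist_simulator.sh itself.

```python
# Sketch of the rank layout described above (assumed, not taken from dist_simulator.sh):
# rank 0 has a node to itself; every other rank is packed four per node,
# bound to one DCU, and gets an equal share of the node's 200 Gbps IB link.
NODE_IB_GBPS = 200
RANKS_PER_NODE = 4

def rank_layout(rank: int) -> dict:
    if rank == 0:
        return {"node": 0, "dcu": None, "ib_gbps": NODE_IB_GBPS}
    return {
        "node": 1 + (rank - 1) // RANKS_PER_NODE,   # compute node index
        "dcu": (rank - 1) % RANKS_PER_NODE,         # local DCU bound to this rank
        "ib_gbps": NODE_IB_GBPS // RANKS_PER_NODE,  # 200 Gbps / 4 = 50 Gbps per process
    }

print(rank_layout(0))   # rank 0: exclusive node, full IB link
print(rank_layout(5))   # e.g. node 2, DCU 0, 50 Gbps
```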
3.1 launch server
tbd
3.2 launch client
tbd
4 Client Python Interface
In DTB2.0, the client-side Python interface is implemented through the BlockWrapper class in cuda/python/dist_blockwrapper_pytorch.py. This interface can be categorized into two types based on their functionality: operation interfaces and data retrieval interfaces.
4.1 operation interface
__init__
The constructor interface for the BlockWrapper class is defined as follows; it initializes a new instance of the BlockWrapper class.
address: The IP address and port number of the server, e.g., "10.11.2.10:50051".
path: The absolute path where the address table (e.g., block_0.npz) is stored.
delta_t: The biological time interval between two consecutive simulations, defaulting to 1 ms.
route_path: The absolute path where the routing table is stored, defaulting to None, which indicates peer-to-peer transmission.
print_stat: Option to output statistical items for each iteration, defaulting to False.
force_rebase: In the assimilation program, whether to forcefully perform non-accumulative sorting on population IDs from multiple address tables (re-sorting from 0), defaulting to False.
allow_rebase: In the assimilation program, whether to allow non-accumulative sorting, defaulting to True.
overlap: In the assimilation program, the number of populations into which a single voxel in the address table can be split, defaulting to 2.
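A minimal construction sketch, assuming the repository root is on PYTHONPATH; the address and path values are placeholders, and the keyword arguments simply restate the documented defaults.

```python
from cuda.python.dist_blockwrapper_pytorch import BlockWrapper

# Placeholder server address and table path; substitute your own deployment.
block = BlockWrapper(
    address="10.11.2.10:50051",                    # gRPC server, "ip:port"
    path="/public/home/user/tables/block_0.npz",   # hypothetical address-table location
    delta_t=1.0,                                   # 1 ms of biological time per iteration (default)
    route_path=None,                               # None -> peer-to-peer transmission (default)
    print_stat=False,                              # per-iteration statistics off (default)
)
```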
run
Network simulation execution interface, defined as follows:
The send strategy defaults to STRATEGY_SEND_PAIRWISE and can also be set to STRATEGY_SEND_SEQUENTIAL or STRATEGY_SEND_RANDOM.
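A hypothetical call sketch; the actual run() signature lives in dist_blockwrapper_pytorch.py and may take additional arguments (for example, which statistics to return and the send strategy mentioned above).

```python
# Assumed call form: run() is given the number of 1 ms iterations to simulate.
steps = 800          # simulate 800 ms of biological time
block.run(steps)
```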
update_property
Update property interface, defined as follows:
bid: The ID of the population to be updated, defaulting to None, meaning all populations are updated.
mul_property_by_subblk
Multiplies the attribute parameters in a population by a constant; in the DA (data assimilation) process, it is used to update parameters that are sampled from a distribution. Defined as follows:
property_hyper_parameter: The parameter to be updated.
accumulate: In the assimilation program, divide the previous parameter by the parameter passed before multiplying it, making the currently passed parameter the value set for this property; defaults to True.
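A hedged sketch of the DA-style update; the tensor layout and the exact argument list of mul_property_by_subblk are assumptions for illustration only.

```python
import torch

# Hypothetical layout: one row per (population, property) pair to scale,
# columns = [population_id, property_index, multiplicative factor].
hp = torch.tensor([[2, 10, 1.05],
                   [3, 10, 0.98]], device="cuda:0")

# With accumulate=True the previously applied factor is divided out first,
# so the factor passed here becomes the effective value for this property.
block.mul_property_by_subblk(hp, accumulate=True)
```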
assign_property_by_subblk
Assign property parameters in the population, defined as follows:
gamma_property_by_subblk
Selects a property of a given population and updates it with values generated from a gamma distribution parameterized by alpha and beta, defined as follows:
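The gamma_property_by_subblk signature is not shown in this document; the snippet below only illustrates what drawing property values from a gamma distribution parameterized by alpha and beta means, using torch.

```python
import torch

alpha, beta = 5.0, 5.0                      # shape (concentration) and rate
gamma = torch.distributions.Gamma(alpha, beta)
samples = gamma.sample((1000,))             # candidate property values
print(samples.mean())                       # ≈ alpha / beta = 1.0
```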
set_samples
Set sampling neurons, defined as follows:
set_state_rule
load_state_from_file
Load checkpoint state from file:
update_ou_background_stimuli
update_ttype_ca_stimuli
g_t: Specific constant for each population; for some populations (such as STN and GPe) it takes a constant of 0.06 nS, while for other populations it takes 0.
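A sketch of building the per-population g_t vector described above; which population IDs correspond to STN/GPe is model-specific, and the call form of update_ttype_ca_stimuli is an assumption.

```python
import torch

subblk_ids = block.subblk_id                         # documented attribute: population IDs
g_t = torch.zeros(subblk_ids.shape, dtype=torch.float32, device=subblk_ids.device)

# Hypothetical ID set; replace with the real STN/GPe population IDs of your model.
stn_gpe_ids = torch.tensor([101, 102], device=subblk_ids.device)
is_stn_or_gpe = torch.isin(subblk_ids, stn_gpe_ids)
g_t[is_stn_or_gpe] = 0.06                            # 0.06 nS for STN/GPe, 0 elsewhere

block.update_ttype_ca_stimuli(g_t)                   # assumed call form
```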
check_sample_conn_weight
set_samples_by_specifying_popu_idx
4.2 data retrieval interface
total_neurons
Returns the total number of neurons, such as:
tensor(6826335882, device='cuda:0').
block_id
Returns the sequence of DCU card numbers, such as
[1, 2, 3, …, 2000].
subblk_id
Gets the sequence of population numbers, such as:
tensor([2, 3, 4, ..., 227017, 227026, 227027], device='cuda:0').
total_subblks
Gets the total number of populations.
neurons_per_subblk
Sequence of the number of neurons for each population, such as
tensor([90906, 25639, 96272, ..., 17933, 62897, 15725], device='cuda:0').
neurons_per_block
Number of neurons per DCU card (sequence).
last_time_stat
Statistics of each card in the previous iteration, including 34 indicators.
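A quick inspection sketch using the block instance from the constructor example; the printed values are only illustrative.

```python
print(block.total_neurons)        # e.g. tensor(6826335882, device='cuda:0')
print(block.block_id)             # DCU card numbers, e.g. [1, 2, 3, ..., 2000]
print(block.subblk_id)            # population IDs
print(block.total_subblks)        # total number of populations
print(block.neurons_per_subblk)   # neurons in each population
print(block.neurons_per_block)    # neurons on each DCU card
stats = block.last_time_stat      # 34 per-card indicators from the last iteration
```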
5 test
To verify whether neuron spikes are emitted correctly in the network, i.e., whether the membrane potentials computed by the GPU are accurate, the same computation can be reproduced on the CPU and compared against the GPU results. However, this is not feasible at large scale because serial CPU computation takes too long. In DTB1.2, a sampling network composed of a number of sampled neurons is therefore taken from the full network; this sampling network participates in the simulation of the entire network, while a sampling network of the same scale and with the same parameters is simulated on the CPU for comparison. If the two agree, the membrane potential calculation is considered accurate.
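A minimal sketch of the comparison step, assuming the sampled membrane-potential traces of the DCU run and the CPU reference run have been saved as tensors of the same shape (the file names are hypothetical).

```python
import torch

v_gpu = torch.load("v_sample_gpu.pt")   # [iterations, n_sample_neurons], from the DCU simulation
v_cpu = torch.load("v_sample_cpu.pt")   # same shape, computed serially on the CPU

max_err = (v_gpu - v_cpu).abs().max()
print(f"max abs error: {max_err.item():.3e}")

# If the traces agree within tolerance, the membrane-potential computation is
# considered accurate.
assert torch.allclose(v_gpu, v_cpu, atol=1e-5)
```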
Network Simulation
The computational time required to simulate 1 ms of biological time (i.e., one iteration) is referred to as the slowdown ratio, which is an important performance indicator for DTB. The test results are as follows:

| Scale | Firing Rate | Communication Method | Slowdown Ratio |
| --- | --- | --- | --- |
| 7.5 Billion | 14 Hz | p2p | 50 |
| 7.5 Billion | 14 Hz | p2p | 65 |
| 15 Billion | 14 Hz | p2p | 63 |
| 15 Billion | 14 Hz | p2p | 80 |
| 86 Billion | 7 Hz | p2p | 65 |
| 86 Billion | 15 Hz | p2p | 79 |
| 86 Billion | 30 Hz | p2p | 119 |
| 100 Billion | 15 Hz | p2p | 63 |

The slower speed at the 86-billion scale compared with the 100-billion scale is mainly due to the poor state of the cluster's IB network during testing.
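A timing sketch for measuring the slowdown ratio on your own runs; as before, the run() call form is an assumption.

```python
import time

steps = 100                          # 100 iterations of 1 ms biological time
t0 = time.time()
block.run(steps)                     # assumed call form, as above
wall_ms = (time.time() - t0) * 1000.0

slowdown_ratio = wall_ms / steps     # wall-clock ms per 1 ms of biological time
print(f"slowdown ratio ≈ {slowdown_ratio:.1f}")
```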