CAUTION: This is beta, non-production software. Do not use it on production clusters.
The Network Operator automates the configuration of RDMA NICs and makes them easier to use with Intel AI accelerators.
The Network Operator currently supports Intel Gaudi and its integrated scale-out network interfaces.
Intel Gaudi and its integrated NICs are supported in two modes: L2 and L3.
Once configuration is done, the ready nodes will be labeled (via NFD) with intel.feature.node.kubernetes.io/gaudi-scale-out=true
The L2 mode is where the scale-out interfaces are only brought up without IP addresses. The Gaudi FW will leverage the interfaces for scale-out operations without IPs. The scale-out network topology can be simple without L3 switching or routing protocols.
The L3 mode refers to a scale-out network that has L3 switching enabled. The supported provisioning method for Intel Gaudi is a custom LLDP-aided provisioning. It expects LLDP to be configured on the switches with specific settings. For IP provisioning, the LLDP Port Description field must end with the switch port's IP address and netmask, e.g. no-alert 10.200.10.2/30. This information is used to calculate the Gaudi NIC IP.
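The Port Description convention can be illustrated with a short Python sketch. It is not the operator's code: the function names are hypothetical, and it assumes that the Gaudi NIC takes the other usable host address of the switch port's /30 network, which the text above does not state explicitly.

```python
import ipaddress


def parse_port_description(description):
    """Return the switch port interface found at the end of the description.

    Assumes the last whitespace-separated token is "IP/prefix",
    as in "no-alert 10.200.10.2/30".
    """
    token = description.split()[-1]
    return ipaddress.ip_interface(token)


def peer_address(switch_if):
    """Return the other usable host address in the switch port's /30.

    Assumption for illustration only: the NIC is given the remaining
    host address of the point-to-point network.
    """
    if switch_if.network.prefixlen != 30:
        raise ValueError("expected a /30 point-to-point network")
    others = [h for h in switch_if.network.hosts() if h != switch_if.ip]
    if len(others) != 1:
        raise ValueError("switch IP is not a usable host address")
    return others[0]


switch = parse_port_description("no-alert 10.200.10.2/30")
print(switch.network, peer_address(switch))
```

For the example description above, the sketch derives the 10.200.10.0/30 network and picks 10.200.10.1 as the peer address.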
The operator deploys configuration Pods to the worker nodes. The Pods listen for LLDP packets and then configure the node's network interfaces. In addition to the IP addresses for the Gaudi NICs, the configurator also sets up routes and creates configuration files for the Gaudi SW to use. The configurator creates two routes for each NIC: 1) a route to the /30 point-to-point network, and 2) a route to the larger /16 network.
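As a rough illustration of the two route destinations, the following Python sketch derives both prefixes from a NIC address. The function name is hypothetical, and it assumes that the /16 is simply the enclosing /16 of the NIC address; the operator may derive it differently.

```python
import ipaddress


def route_prefixes(nic_cidr):
    """Return (point-to-point /30 network, enclosing /16 network).

    Assumption: the larger network is obtained by widening the
    NIC address to a /16 prefix.
    """
    nic = ipaddress.ip_interface(nic_cidr)
    point_to_point = nic.network
    # strict=False lets ip_network mask off the host bits for us.
    larger = ipaddress.ip_network(f"{nic.ip}/16", strict=False)
    return point_to_point, larger


p2p, larger = route_prefixes("10.200.10.1/30")
print(p2p, larger)
```

With a NIC address of 10.200.10.1/30, the two destinations are 10.200.10.0/30 and 10.200.0.0/16.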
More info on the switch topology and configurations is available here.
- Enable Host-NIC use in cluster
- Support to install Host-NIC KMD
- Configure RDMA NICs to be used with Intel AI accelerators
The operator depends on the following Kubernetes components:
- Intel Gaudi base operator
- Node Feature Discovery
- Cert-manager
- go version v1.23+
- docker version 17.03+.
- kubectl version v1.31+.
- Access to a Kubernetes v1.31+ cluster.
Images are available at dockerhub.io.
Install NFD Gaudi device rules into the cluster:
kubectl apply -f config/nfd/gaudi-device-rule.yaml
Install operator into the cluster:
kubectl apply -k config/operator/default/
Create instances of your solution
Ensure that the samples have the desired operator configuration values (see the configuration options below). Then apply a sample, for example the Gaudi L3 sample:
kubectl apply -f config/operator/samples/gaudi-l3.yaml
Delete the instances (CRs) from the cluster:
kubectl delete -f config/operator/samples/gaudi-l3.yaml
Uninstall the controller from the cluster:
kubectl delete -k config/operator/default/
Remove NFD Gaudi device rules from the cluster:
kubectl delete -f config/nfd/gaudi-device-rule.yaml
See the README for Helm installation.
The most important Network Operator CRD properties are:
- disableNetworkManager (boolean): Disable Gaudi scale-out interfaces in NetworkManager. For nodes where NetworkManager tries to configure the Gaudi interfaces, prevent it from doing so.
- enableLLDPAD (boolean): Enable LLDP for Priority Flow Control in a dedicated container. Keep this value as false if the lldpad LLDP daemon is already present and running on the host.
- layer (enum): Link layer where the scale-out communication should occur. Possible options are L2 and L3.
- mtu (integer): MTU for the scale-out interfaces. Maximum 9000, minimum 1500.
- pfcPriorities (string): Bitmask of Priority Flow Control priorities to enable. Requires 'lldpad' on the host or enabled in a container with the above enableLLDPAD boolean. Currently the only two accepted values are 00000000 and 11110000.
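The value constraints listed above can be checked before a sample is applied. The following Python sketch is a hypothetical client-side pre-check, not the operator's own validation; it takes the property values as a plain dictionary and only checks properties that are present.

```python
ALLOWED_LAYERS = {"L2", "L3"}
ALLOWED_PFC_PRIORITIES = {"00000000", "11110000"}
MTU_MIN, MTU_MAX = 1500, 9000


def check_policy_values(values):
    """Return a list of problems found in the given property values."""
    problems = []

    for name in ("disableNetworkManager", "enableLLDPAD"):
        if name in values and not isinstance(values[name], bool):
            problems.append(f"{name} must be a boolean")

    if "layer" in values and values["layer"] not in ALLOWED_LAYERS:
        problems.append("layer must be L2 or L3")

    if "mtu" in values:
        mtu = values["mtu"]
        if not isinstance(mtu, int) or not MTU_MIN <= mtu <= MTU_MAX:
            problems.append(f"mtu must be between {MTU_MIN} and {MTU_MAX}")

    if "pfcPriorities" in values:
        if values["pfcPriorities"] not in ALLOWED_PFC_PRIORITIES:
            problems.append("pfcPriorities must be 00000000 or 11110000")

    return problems


print(check_policy_values({"layer": "L3", "mtu": 8000, "pfcPriorities": "11110000"}))
```

An empty list means that none of the documented constraints were violated. The check cannot tell whether lldpad is actually running on the host.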
The full set of properties is available in the NetworkClusterPolicy CRD definition. Examples of Network Operator CRDs are found in the samples directory.
Contributions to this project are welcome as issues (bugs, enhancement requests) or via pull requests. Please review our Code of Conduct and our note on security policy.
Copyright 2024 Intel Corporation. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Intel, the Intel logo and Gaudi are trademarks of Intel Corporation or its subsidiaries.