18 changes: 18 additions & 0 deletions ai/1/1/b.txt
@@ -0,0 +1,18 @@
# AI1.1 HPC AI Architectures

This skill introduces the foundational architectural components that support AI workloads in high-performance computing environments. It covers the integration of accelerators, memory hierarchies, interconnects, and the software stacks that enable scalable training and inference.

## Requirements

* External: Basic understanding of parallel computing and AI model training
* Internal: None

## Learning Outcomes

* Identify key components of HPC architectures relevant to AI, including GPUs, TPUs, and memory systems.
* Compare different node-level and system-level configurations for AI workloads.
* Explain the role of interconnects (e.g., NVLink, InfiniBand) in distributed AI performance.
* Recognize the impact of hardware-software co-design in AI system performance.
* Describe how different architectural features affect scalability and throughput of AI training and inference tasks.

**Caution: All text is AI generated**
18 changes: 18 additions & 0 deletions ai/1/2/b.txt
@@ -0,0 +1,18 @@
# AI1.2 AI Workflow Management

This skill covers tools, strategies, and principles for designing and managing AI workflows on HPC systems. It addresses orchestration, scheduling, automation, and reproducibility of AI pipelines across distributed infrastructure.

## Requirements

* External: Familiarity with AI training/inference steps and command-line environments
* Internal: None

## Learning Outcomes

* Define the components of a typical AI workflow (data preprocessing, training, evaluation, deployment).
* Describe the role of workflow engines (e.g., Snakemake, Nextflow, Airflow) in managing AI pipelines.
* Demonstrate how to schedule and monitor multi-stage AI tasks on HPC resources.
* Apply versioning and reproducibility best practices in AI workflow design.
* Understand error handling, checkpointing, and dependency resolution in distributed AI pipelines.
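
Dependency resolution, the last outcome above, can be illustrated with a short sketch. This is not code from any particular workflow engine (Snakemake, Nextflow, and Airflow declare dependencies in their own formats); the stage names are hypothetical and the sketch only shows the core idea of ordering stages by their declared dependencies.

```python
# Hypothetical four-stage AI pipeline: each stage lists the stages it
# depends on, and we derive an execution order via depth-first search.

def resolve_order(stages):
    """Return an execution order where every stage runs after its dependencies."""
    order, done = [], set()

    def visit(name, seen=()):
        if name in done:
            return
        if name in seen:
            raise ValueError(f"cyclic dependency involving {name}")
        for dep in stages[name]:
            visit(dep, seen + (name,))
        done.add(name)
        order.append(name)

    for name in stages:
        visit(name)
    return order

pipeline = {
    "preprocess": [],
    "train": ["preprocess"],
    "evaluate": ["train"],
    "deploy": ["evaluate"],
}
order = resolve_order(pipeline)  # ['preprocess', 'train', 'evaluate', 'deploy']
```

A real engine layers scheduling, retries, and checkpoint reuse on top of exactly this kind of dependency graph.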

**Caution: All text is AI generated**
18 changes: 18 additions & 0 deletions ai/1/3/b.txt
@@ -0,0 +1,18 @@
# AI1.3 Agents

This skill introduces the concept of software agents in AI systems, including autonomous and semi-autonomous agents used for decision-making, task execution, or coordination. It focuses on their design, deployment, and integration in HPC-based AI workflows.

## Requirements

* External: Basic knowledge of reinforcement learning or AI model behavior
* Internal: None

## Learning Outcomes

* Define what constitutes an AI agent in the context of HPC workloads.
* Differentiate between reactive, deliberative, and hybrid agent architectures.
* Explain use cases for agents in model orchestration, data interaction, and simulation-based learning.
* Describe the lifecycle of an agent from initialization to termination in an HPC pipeline.
* Evaluate agent behavior in terms of autonomy, adaptability, and communication.

**Caution: All text is AI generated**
18 changes: 18 additions & 0 deletions ai/1/4/b.txt
@@ -0,0 +1,18 @@
# AI1.4 Fine Tuning

This skill focuses on techniques and strategies for fine-tuning pre-trained AI models in HPC environments. It includes adapting models to domain-specific data, optimizing resource usage, and applying transfer learning for efficient training.

## Requirements

* External: Understanding of basic deep learning and model training processes
* Internal: None

## Learning Outcomes

* Explain the purpose and benefits of fine-tuning pre-trained models.
* Identify key hyperparameters and architectural considerations during fine-tuning.
* Apply methods for efficient fine-tuning, including layer freezing and learning rate scheduling.
* Describe how fine-tuning strategies differ for large-scale models on HPC infrastructure.
* Recognize potential pitfalls such as overfitting, catastrophic forgetting, and data leakage.
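
Two of the techniques above, layer freezing and learning-rate scheduling, can be sketched framework-agnostically. The layer names and hyperparameter values below are illustrative assumptions, not taken from any specific model.

```python
import math

# Hypothetical layer list: freeze everything except the last block and
# the task head, a common starting point for fine-tuning.
layers = ["embeddings", "block_1", "block_2", "block_3", "classifier"]
trainable = {name: name in ("block_3", "classifier") for name in layers}

def lr_at(step, total_steps, base_lr=3e-5, warmup=100):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

In a deep learning framework, the `trainable` flags would translate to disabling gradient computation for the frozen parameters, which shrinks both optimizer state and backward-pass cost.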

**Caution: All text is AI generated**
19 changes: 19 additions & 0 deletions ai/1/b.txt
@@ -0,0 +1,19 @@
# AI1 AI System Design and Deployment

This node encompasses the key concepts and practical knowledge required to design, organize, and operate AI systems on HPC platforms. It brings together architectural considerations, workflow coordination, fine-tuning strategies, and agent-based designs relevant to scalable AI deployment.

## Learning Outcomes

* Identify major architectural design choices and their impact on AI scalability in HPC environments.
* Describe how AI workflows are coordinated and executed across distributed computing systems.
* Explain the function and design of intelligent agents in HPC-based AI systems.
* Summarize techniques for fine-tuning AI models, considering performance and resource constraints.

## Subskills

* [[skill-tree:ai:1:1:b]]
* [[skill-tree:ai:1:2:b]]
* [[skill-tree:ai:1:3:b]]
* [[skill-tree:ai:1:4:b]]

**Caution: All text is AI generated**
18 changes: 18 additions & 0 deletions ai/2/1/b.txt
@@ -0,0 +1,18 @@
# AI2.1 Hosting AI Models

This skill focuses on strategies and technologies for deploying and serving AI models in high-performance and hybrid computing environments. It includes model packaging, service interfaces, containerization, and compatibility with HPC infrastructure.

## Requirements

* External: Familiarity with model training and inference concepts
* Internal: None

## Learning Outcomes

* Compare different approaches for deploying AI models in HPC, cloud, and hybrid environments.
* Describe the role of containers (e.g., Docker, Singularity) in hosting AI models.
* Identify tools and frameworks used for serving models (e.g., TorchServe, Triton Inference Server).
* Explain compatibility issues and solutions for model execution on HPC systems.
* Demonstrate how to expose AI models as services or endpoints for internal or external access.

**Caution: All text is AI generated**
18 changes: 18 additions & 0 deletions ai/2/2/b.txt
@@ -0,0 +1,18 @@
# AI2.2 Scaling and Inference Optimization

This skill explores techniques for scaling AI inference across HPC systems and optimizing throughput and latency. It includes performance tuning, parallelism, model simplification, and the use of hardware accelerators.

## Requirements

* External: Basic understanding of AI inference and performance bottlenecks
* Internal: None

## Learning Outcomes

* Describe methods to parallelize inference across multiple compute nodes or GPUs.
* Identify and apply model optimization techniques such as pruning, quantization, and distillation.
* Explain how to balance latency, throughput, and resource usage in production environments.
* Evaluate the impact of batch size, I/O overhead, and memory footprint on inference performance.
* Use profiling tools to locate bottlenecks and improve inference efficiency in HPC workflows.
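
Of the optimization techniques listed, quantization is the easiest to show in a few lines. The sketch below assumes symmetric int8 quantization of a flat weight list; production systems would use a framework's quantization toolkit rather than this bare arithmetic.

```python
def quantize_int8(weights):
    """Map floats to [-127, 127] integers using one symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.27, 0.02, 1.0]
q, s = quantize_int8(w)      # q = [50, -127, 2, 100], s = 0.01
w_hat = dequantize(q, s)     # reconstruction error bounded by half a step
```

The appeal for inference is that int8 weights take a quarter of the memory of fp32 and map onto fast integer units on most accelerators, at the cost of the rounding error visible in `w_hat`.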

**Caution: All text is AI generated**
18 changes: 18 additions & 0 deletions ai/2/3/b.txt
@@ -0,0 +1,18 @@
# AI2.3 Resource-Aware Deployment

This skill focuses on deploying AI workloads in a manner that accounts for the limitations and availability of HPC resources such as compute, memory, storage, and energy. It teaches techniques to maximize efficiency and sustainability.

## Requirements

* External: Familiarity with AI workload characteristics and HPC job environments
* Internal: None

## Learning Outcomes

* Define what makes a deployment “resource-aware” in the context of HPC.
* Select appropriate compute and memory configurations based on model size and workload type.
* Use resource profiling tools to guide allocation decisions.
* Apply strategies to reduce energy consumption during training and inference.
* Explain the trade-offs between performance, resource use, and scheduling constraints.
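
Selecting compute and memory configurations, as in the second outcome above, often starts from a back-of-the-envelope estimate like the sketch below. The multipliers are rough assumptions (fp16 weights plus gradients and Adam-style optimizer states), not a universal rule.

```python
def training_memory_gb(n_params, bytes_per_param=2,
                       grad_factor=1, optimizer_factor=4):
    """Rough estimate: weights + gradients + optimizer states, in GB.
    Ignores activation memory, which can dominate for long sequences."""
    total_factor = 1 + grad_factor + optimizer_factor
    return n_params * bytes_per_param * total_factor / 1e9

# A hypothetical 7B-parameter model under these assumptions:
need = training_memory_gb(7e9)   # 84.0 GB -> plan for multiple GPUs
```

Even this crude number immediately rules out single-GPU configurations and motivates profiling before requesting an allocation.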

**Caution: All text is AI generated**
17 changes: 17 additions & 0 deletions ai/2/b.txt
@@ -0,0 +1,17 @@
# AI2 Engineering and Infrastructure

This node encompasses the core infrastructure and engineering principles required to support scalable and efficient deployment of AI systems in HPC environments. It covers physical and virtual hosting strategies, performance optimization, and resource-aware deployment.

## Learning Outcomes

* Describe the requirements and challenges of hosting AI models in HPC or hybrid infrastructure.
* Identify methods to optimize inference workloads through model and system-level engineering.
* Explain the principles of resource-aware AI deployment, including compute, memory, and energy considerations.

## Subskills

* [[skill-tree:ai:2:1:b]]
* [[skill-tree:ai:2:2:b]]
* [[skill-tree:ai:2:3:b]]

**Caution: All text is AI generated**
18 changes: 18 additions & 0 deletions ai/3/1/b.txt
@@ -0,0 +1,18 @@
# AI3.1 Language Models (LLMs)

This skill covers the structure, training, and deployment of large language models (LLMs) in HPC environments. It includes tokenization, transformer architectures, training dynamics, and considerations for inference scalability and memory usage.

## Requirements

* External: Familiarity with basic deep learning concepts and NLP tasks
* Internal: None

## Learning Outcomes

* Describe the transformer architecture and how it underpins most LLMs.
* Explain tokenization strategies and their impact on model efficiency.
* Identify memory and compute bottlenecks in LLM training and inference.
* Compare distributed training strategies used for scaling LLMs across HPC resources.
* Evaluate the performance and limitations of LLMs on different HPC configurations.

**Caution: All text is AI generated**
18 changes: 18 additions & 0 deletions ai/3/2/b.txt
@@ -0,0 +1,18 @@
# AI3.2 Image Models

This skill introduces key concepts and architectures for image-based AI models, such as CNNs and vision transformers, along with their training and deployment in HPC environments. It emphasizes dataset handling, parallel processing, and acceleration techniques.

## Requirements

* External: Basic understanding of computer vision and convolutional neural networks
* Internal: None

## Learning Outcomes

* Identify common architectures used in image modeling (e.g., ResNet, EfficientNet, Vision Transformers).
* Describe the data pipeline requirements for large-scale image datasets.
* Apply techniques for distributed training of image models in HPC environments.
* Explain GPU/TPU acceleration strategies for image model training and inference.
* Evaluate performance trade-offs between model size, accuracy, and runtime efficiency.

**Caution: All text is AI generated**
18 changes: 18 additions & 0 deletions ai/3/3/b.txt
@@ -0,0 +1,18 @@
# AI3.3 Audio and Voice Models

This skill introduces models designed to process audio signals and voice data, including spectrogram-based models, recurrent architectures, and transformers. It emphasizes preprocessing, training techniques, and deployment challenges in HPC environments.

## Requirements

* External: Familiarity with signal processing concepts and neural networks
* Internal: None

## Learning Outcomes

* Describe preprocessing techniques used to transform raw audio into model-ready formats (e.g., spectrograms, MFCCs).
* Compare model architectures suited for audio and voice tasks (e.g., RNNs, CNNs, transformers).
* Explain challenges in training audio models at scale, such as input length variability and I/O throughput.
* Evaluate model performance across metrics like accuracy, latency, and noise robustness.
* Demonstrate how to optimize audio inference on HPC systems using batching and compression.
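
The first preprocessing outcome above (turning raw audio into model-ready frames) can be sketched with the standard library. The 25 ms frame and 10 ms hop at 16 kHz are conventional but assumed values, and the FFT and mel filtering that would complete a spectrogram or MFCC pipeline are omitted.

```python
import math

FRAME_LEN, HOP = 400, 160   # 25 ms frames, 10 ms hop at 16 kHz

# Hamming window, computed once and applied to every frame.
WINDOW = [0.54 - 0.46 * math.cos(2 * math.pi * i / (FRAME_LEN - 1))
          for i in range(FRAME_LEN)]

def frame_signal(signal):
    """Slice a waveform into overlapping, Hamming-windowed frames."""
    frames = []
    for start in range(0, len(signal) - FRAME_LEN + 1, HOP):
        frame = signal[start:start + FRAME_LEN]
        frames.append([s * w for s, w in zip(frame, WINDOW)])
    return frames

# One second of a 440 Hz tone at 16 kHz yields 98 frames.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
frames = frame_signal(tone)
```

The overlapping layout is also why audio I/O throughput matters at scale: each second of input expands into many partially redundant frames before it ever reaches the model.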

**Caution: All text is AI generated**
18 changes: 18 additions & 0 deletions ai/3/4/b.txt
@@ -0,0 +1,18 @@
# AI3.4 Video Generation

This skill focuses on AI models that generate or synthesize video content, addressing temporal consistency, multi-frame modeling, and large-scale training challenges. It covers architectures, datasets, and HPC-specific resource demands.

## Requirements

* External: Knowledge of deep learning and image/video data structures
* Internal: None

## Learning Outcomes

* Explain the structure of video generation models and how they differ from static image models.
* Identify key challenges in modeling temporal dynamics and visual coherence.
* Describe the resource demands of video generation, including GPU memory and disk I/O.
* Apply strategies to manage large-scale video datasets in distributed environments.
* Evaluate performance and quality trade-offs in video synthesis models.

**Caution: All text is AI generated**
18 changes: 18 additions & 0 deletions ai/3/5/b.txt
@@ -0,0 +1,18 @@
# AI3.5 Multimodal Models

This skill introduces AI models that process and integrate data from multiple modalities such as text, images, audio, and video. It covers model architectures, training strategies, and synchronization techniques in distributed HPC environments.

## Requirements

* External: Basic understanding of multiple data types (e.g., text, image, audio) and neural networks
* Internal: None

## Learning Outcomes

* Define what constitutes a multimodal model and describe its typical input/output structures.
* Compare fusion strategies (early, late, and hybrid) used to combine modalities in model architectures.
* Explain challenges in synchronizing and batching multimodal inputs during training.
* Identify common datasets and benchmarks used for evaluating multimodal models.
* Describe how HPC systems handle distributed training and scaling of multimodal networks.
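
The difference between the fusion strategies above is simply where the combination happens, which a toy sketch makes concrete. The feature vectors, scores, and the 0.6 weighting are invented for illustration.

```python
def early_fusion(text_feats, image_feats):
    """Early fusion: concatenate features before a joint model sees them."""
    return text_feats + image_feats

def late_fusion(text_score, image_score, w_text=0.6):
    """Late fusion: combine per-modality predictions after separate models run."""
    return w_text * text_score + (1 - w_text) * image_score

joint_input = early_fusion([0.1, 0.2], [0.7, 0.4, 0.9])   # length 5
combined = late_fusion(0.8, 0.5)                          # 0.68
```

Hybrid fusion mixes the two, joining some features early while merging other modality-specific predictions late; the choice interacts directly with the batching and synchronization challenges listed above.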

**Caution: All text is AI generated**
18 changes: 18 additions & 0 deletions ai/3/6/b.txt
@@ -0,0 +1,18 @@
# AI3.6 Graph Neural Networks

This skill introduces graph neural networks (GNNs), which operate on structured data represented as graphs. It focuses on graph-based learning, message passing, and scaling GNNs on HPC platforms.

## Requirements

* External: Understanding of basic machine learning and graph theory concepts
* Internal: None

## Learning Outcomes

* Explain how graph neural networks represent and process relational data.
* Describe core GNN operations such as message passing and aggregation.
* Identify use cases for GNNs in scientific computing, recommendation systems, and bioinformatics.
* Apply techniques for batching and sampling large graphs in distributed training.
* Evaluate performance and scalability of GNNs in multi-node HPC environments.
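
One round of the message passing and aggregation described above can be sketched on a toy graph. Scalar node features and mean aggregation are simplifying assumptions; real GNN layers apply learned transforms to feature vectors before and after aggregating.

```python
# Tiny undirected graph with scalar node features.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
features = [1.0, 2.0, 3.0, 4.0]

neighbors = {i: [] for i in range(len(features))}
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)

def message_pass(feats):
    """Aggregate step: each node averages its neighbors' features."""
    return [sum(feats[j] for j in neighbors[i]) / len(neighbors[i])
            for i in range(len(feats))]

updated = message_pass(features)   # node 3 takes on node 2's value, 3.0
```

Stacking such rounds lets information travel multiple hops, which is also why distributed GNN training must ship neighbor features across node boundaries, motivating the sampling and batching techniques above.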

**Caution: All text is AI generated**
18 changes: 18 additions & 0 deletions ai/3/7/b.txt
@@ -0,0 +1,18 @@
# AI3.7 Scientific Models

This skill focuses on AI models tailored for scientific domains, such as physics-informed neural networks (PINNs), surrogate models, and models used in simulations. It emphasizes integration with traditional HPC workflows and scientific data.

## Requirements

* External: Familiarity with scientific computing or domain-specific simulation tasks
* Internal: None

## Learning Outcomes

* Describe how AI models can accelerate or complement scientific simulations.
* Explain the concept and application of physics-informed neural networks (PINNs).
* Identify the challenges of incorporating domain knowledge into AI model design.
* Discuss the integration of scientific models with HPC job scheduling and simulation pipelines.
* Evaluate the accuracy, efficiency, and generalizability of scientific AI models.

**Caution: All text is AI generated**
18 changes: 18 additions & 0 deletions ai/3/8/b.txt
@@ -0,0 +1,18 @@
# AI3.8 Explainable AI (XAI)

This skill introduces methods and frameworks that enable interpretability of AI models, especially in high-stakes or scientific contexts. It emphasizes tools, techniques, and best practices for understanding, auditing, and communicating model behavior.

## Requirements

* External: Understanding of basic AI model structure and outputs
* Internal: None

## Learning Outcomes

* Define explainability and distinguish it from transparency and interpretability.
* Identify common XAI methods (e.g., SHAP, LIME, saliency maps) and their applications.
* Apply interpretability techniques to evaluate model decisions in classification or regression tasks.
* Describe use cases for explainability in scientific research, safety-critical systems, and compliance.
* Evaluate trade-offs between model complexity and interpretability.
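
The intuition behind the perturbation-style methods above (occlusion tests, and loosely LIME) fits in a short sketch: score each feature by how much the output moves when that feature is removed. The stand-in linear model and its weights are invented; SHAP and LIME are more principled versions of this idea.

```python
def model(x):
    """Stand-in model; in practice this would be a trained network."""
    weights = [0.5, -2.0, 0.1]
    return sum(w * v for w, v in zip(weights, x))

def perturbation_importance(x):
    """|change in output| when each feature is zeroed in turn."""
    base = model(x)
    scores = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] = 0.0
        scores.append(abs(base - model(perturbed)))
    return scores

scores = perturbation_importance([1.0, 1.0, 1.0])   # ~ [0.5, 2.0, 0.1]
```

For this linear stand-in the scores recover the weight magnitudes exactly; for a real network they only approximate local behavior, which is one of the interpretability trade-offs the outcomes above ask learners to evaluate.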

**Caution: All text is AI generated**
29 changes: 29 additions & 0 deletions ai/3/b.txt
@@ -0,0 +1,29 @@
# AI3 AI Modalities

This node introduces various types of AI model modalities and their applications in HPC environments. It groups foundational knowledge and deployment considerations for language, image, audio, video, multimodal, graph, and scientific models, along with principles of interpretability.

## Learning Outcomes

* Explain the structure and application of large language models and how they scale in HPC environments.
* Describe the training and inference workflows of image-based models and their resource requirements.
* Summarize techniques used in audio and voice model processing and deployment.
* Identify the challenges of video generation models and their compute/memory implications.
* Understand how multimodal models combine inputs from various domains and the synchronization strategies involved.
* Describe graph neural network architectures and their relevance in scientific and relational data modeling.
* Explain the role of domain-specific scientific models in physics-informed AI and simulation-enhanced learning.
* Apply principles of explainable AI (XAI) to interpret predictions and assess model reliability across modalities.

## Subskills

* [[skill-tree:ai:3:1:b]]
* [[skill-tree:ai:3:2:b]]
* [[skill-tree:ai:3:3:b]]
* [[skill-tree:ai:3:4:b]]
* [[skill-tree:ai:3:5:b]]
* [[skill-tree:ai:3:6:b]]
* [[skill-tree:ai:3:7:b]]
* [[skill-tree:ai:3:8:b]]

**Caution: All text is AI generated**
18 changes: 18 additions & 0 deletions ai/4/1/b.txt
@@ -0,0 +1,18 @@
# AI4.1 Responsible AI Use

This skill explores the ethical and societal implications of AI systems, focusing on fairness, transparency, safety, and accountability. It introduces frameworks and best practices for responsible AI design and deployment in HPC contexts.

## Requirements

* External: General understanding of AI system design and deployment
* Internal: None

## Learning Outcomes

* Define key principles of responsible AI, including fairness, non-discrimination, and inclusiveness.
* Identify potential sources of bias in AI data and models and methods to mitigate them.
* Describe strategies for ensuring transparency and explainability in model outputs.
* Recognize ethical risks in deploying large-scale AI systems and how to address them.
* Apply responsible AI guidelines or frameworks (e.g., EU AI Act, OECD Principles) to HPC workflows.

**Caution: All text is AI generated**
18 changes: 18 additions & 0 deletions ai/4/2/b.txt
@@ -0,0 +1,18 @@
# AI4.2 Data Privacy and Compliance

This skill focuses on the legal and technical aspects of handling sensitive data in AI systems, particularly in regulated environments. It covers compliance requirements, anonymization techniques, and policy enforcement in HPC settings.

## Requirements

* External: Familiarity with AI workflows and basic data management concepts
* Internal: None

## Learning Outcomes

* Identify major privacy regulations relevant to AI (e.g., GDPR, HIPAA) and their implications.
* Describe data anonymization and pseudonymization techniques.
* Apply access controls and audit mechanisms for secure data handling in HPC workflows.
* Explain how compliance requirements influence dataset selection, storage, and retention policies.
* Evaluate tools and frameworks that support compliance monitoring and reporting.
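
Pseudonymization, one of the techniques above, can be sketched with the standard library: replace identifiers with keyed digests so records remain joinable without exposing the raw ID. The key below is a placeholder; in practice it must live in a managed secret store, and keyed pseudonyms are still personal data under GDPR because the mapping is reversible for whoever holds the key.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-managed-secret"   # placeholder only

def pseudonymize(identifier: str) -> str:
    """Stable keyed pseudonym: same input and key give the same output."""
    digest = hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

token = pseudonymize("patient-0042")          # joinable across tables
same = pseudonymize("patient-0042") == token  # True: deterministic
```

An unkeyed hash would be weaker here, since anyone could rebuild the mapping by hashing candidate identifiers; the HMAC key is what distinguishes pseudonymization from mere obfuscation.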

**Caution: All text is AI generated**