From 635bc1794f7ce8f99ae648885aa3b8387fda1a69 Mon Sep 17 00:00:00 2001 From: Kevin Luedemann Date: Wed, 4 Jun 2025 21:00:59 +0200 Subject: [PATCH 1/2] Add the first batch of skills and the new tree --- ai/1/1/b.txt | 18 ++++++++ ai/1/2/b.txt | 18 ++++++++ ai/1/3/b.txt | 18 ++++++++ ai/1/4/b.txt | 18 ++++++++ ai/1/b.txt | 19 ++++++++ ai/2/1/b.txt | 18 ++++++++ ai/2/2/b.txt | 18 ++++++++ ai/2/3/b.txt | 18 ++++++++ ai/2/b.txt | 17 +++++++ ai/3/1/b.txt | 18 ++++++++ ai/3/2/b.txt | 18 ++++++++ ai/3/3/b.txt | 18 ++++++++ ai/3/4/b.txt | 18 ++++++++ ai/3/5/b.txt | 18 ++++++++ ai/3/6/b.txt | 18 ++++++++ ai/3/7/b.txt | 18 ++++++++ ai/3/8/b.txt | 18 ++++++++ ai/3/b.txt | 29 ++++++++++++ ai/4/1/b.txt | 18 ++++++++ ai/4/2/b.txt | 18 ++++++++ ai/4/3/b.txt | 14 ++++++ ai/4/b.txt | 17 +++++++ ai/5/1/b.txt | 18 ++++++++ ai/5/2/b.txt | 18 ++++++++ ai/5/3/b.txt | 18 ++++++++ ai/5/b.txt | 18 ++++++++ ai/6/1/b.txt | 18 ++++++++ ai/6/2/b.txt | 18 ++++++++ ai/6/3/b.txt | 18 ++++++++ ai/6/b.txt | 17 +++++++ ai/b.txt | 26 +++++++++++ b.txt | 14 +++--- bda/3/1/b.txt | 18 ++++++++ bda/3/2/b.txt | 18 ++++++++ bda/3/b.txt | 33 ++++---------- bda/5/1/1/b.txt | 18 ++++++++ bda/5/1/2/b.txt | 18 ++++++++ bda/5/1/b.txt | 16 +++++++ bda/5/2/1/b.txt | 18 ++++++++ bda/5/2/2/b.txt | 18 ++++++++ bda/5/2/b.txt | 15 ++++++ bda/5/3/1/b.txt | 18 ++++++++ bda/5/3/2/b.txt | 18 ++++++++ bda/5/3/3/b.txt | 18 ++++++++ bda/5/3/b.txt | 16 +++++++ bda/5/4/1/b.txt | 18 ++++++++ bda/5/4/2/b.txt | 18 ++++++++ bda/5/4/3/b.txt | 18 ++++++++ bda/5/4/b.txt | 16 +++++++ bda/5/5/1/b.txt | 18 ++++++++ bda/5/5/2/b.txt | 18 ++++++++ bda/5/5/3/b.txt | 18 ++++++++ bda/5/5/4/b.txt | 18 ++++++++ bda/5/5/b.txt | 17 +++++++ bda/5/b.txt | 41 ++++++++--------- bda/7/1/b.txt | 19 ++++++++ bda/7/2/b.txt | 19 ++++++++ bda/7/b.txt | 16 +++++++ bda/b.txt | 4 +- skill-tree.mm | 118 ++++++++++++++++++++++++++++++++++++++++-------- 60 files changed, 1131 insertions(+), 72 deletions(-) create mode 100644 ai/1/1/b.txt create mode 100644 
ai/1/2/b.txt create mode 100644 ai/1/3/b.txt create mode 100644 ai/1/4/b.txt create mode 100644 ai/1/b.txt create mode 100644 ai/2/1/b.txt create mode 100644 ai/2/2/b.txt create mode 100644 ai/2/3/b.txt create mode 100644 ai/2/b.txt create mode 100644 ai/3/1/b.txt create mode 100644 ai/3/2/b.txt create mode 100644 ai/3/3/b.txt create mode 100644 ai/3/4/b.txt create mode 100644 ai/3/5/b.txt create mode 100644 ai/3/6/b.txt create mode 100644 ai/3/7/b.txt create mode 100644 ai/3/8/b.txt create mode 100644 ai/3/b.txt create mode 100644 ai/4/1/b.txt create mode 100644 ai/4/2/b.txt create mode 100644 ai/4/3/b.txt create mode 100644 ai/4/b.txt create mode 100644 ai/5/1/b.txt create mode 100644 ai/5/2/b.txt create mode 100644 ai/5/3/b.txt create mode 100644 ai/5/b.txt create mode 100644 ai/6/1/b.txt create mode 100644 ai/6/2/b.txt create mode 100644 ai/6/3/b.txt create mode 100644 ai/6/b.txt create mode 100644 ai/b.txt create mode 100644 bda/3/1/b.txt create mode 100644 bda/3/2/b.txt create mode 100644 bda/5/1/1/b.txt create mode 100644 bda/5/1/2/b.txt create mode 100644 bda/5/1/b.txt create mode 100644 bda/5/2/1/b.txt create mode 100644 bda/5/2/2/b.txt create mode 100644 bda/5/2/b.txt create mode 100644 bda/5/3/1/b.txt create mode 100644 bda/5/3/2/b.txt create mode 100644 bda/5/3/3/b.txt create mode 100644 bda/5/3/b.txt create mode 100644 bda/5/4/1/b.txt create mode 100644 bda/5/4/2/b.txt create mode 100644 bda/5/4/3/b.txt create mode 100644 bda/5/4/b.txt create mode 100644 bda/5/5/1/b.txt create mode 100644 bda/5/5/2/b.txt create mode 100644 bda/5/5/3/b.txt create mode 100644 bda/5/5/4/b.txt create mode 100644 bda/5/5/b.txt create mode 100644 bda/7/1/b.txt create mode 100644 bda/7/2/b.txt create mode 100644 bda/7/b.txt diff --git a/ai/1/1/b.txt b/ai/1/1/b.txt new file mode 100644 index 0000000..0385e10 --- /dev/null +++ b/ai/1/1/b.txt @@ -0,0 +1,18 @@ +# AI1.1 HPC AI Architectures + +This skill introduces the foundational architectural components that support AI 
workloads in high-performance computing environments. It covers the integration of accelerators, memory hierarchies, interconnects, and the software stacks that enable scalable training and inference. + +## Requirements + +* External: Basic understanding of parallel computing and AI model training +* Internal: None + +## Learning Outcomes + +* Identify key components of HPC architectures relevant to AI, including GPUs, TPUs, and memory systems. +* Compare different node-level and system-level configurations for AI workloads. +* Explain the role of interconnects (e.g., NVLink, InfiniBand) in distributed AI performance. +* Recognize the impact of hardware-software co-design in AI system performance. +* Describe how different architectural features affect scalability and throughput of AI training and inference tasks. + +** Caution: All text is AI generated ** diff --git a/ai/1/2/b.txt b/ai/1/2/b.txt new file mode 100644 index 0000000..18e4784 --- /dev/null +++ b/ai/1/2/b.txt @@ -0,0 +1,18 @@ +# AI1.2 AI Workflow Management + +This skill covers tools, strategies, and principles for designing and managing AI workflows on HPC systems. It addresses orchestration, scheduling, automation, and reproducibility of AI pipelines across distributed infrastructure. + +## Requirements + +* External: Familiarity with AI training/inference steps and command-line environments +* Internal: None + +## Learning Outcomes + +* Define the components of a typical AI workflow (data preprocessing, training, evaluation, deployment). +* Describe the role of workflow engines (e.g., Snakemake, Nextflow, Airflow) in managing AI pipelines. +* Demonstrate how to schedule and monitor multi-stage AI tasks on HPC resources. +* Apply versioning and reproducibility best practices in AI workflow design. +* Understand error handling, checkpointing, and dependency resolution in distributed AI pipelines. 
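The dependency-resolution and checkpointing ideas listed above can be illustrated with a plain-Python sketch. The stage names and the in-memory "completed" set are invented for illustration; a real pipeline would rely on a workflow engine such as Snakemake, Nextflow, or Airflow, which persists this state on disk.

```python
# Minimal sketch of dependency-resolved, checkpointed pipeline execution.
# Stage names and the in-memory checkpoint set are illustrative only.

# Each stage maps to the stages it depends on (a small DAG).
STAGES = {
    "preprocess": [],
    "train": ["preprocess"],
    "evaluate": ["train"],
    "deploy": ["train", "evaluate"],
}

def run_pipeline(stages, completed=None):
    """Run stages in dependency order, skipping already-completed ones."""
    completed = set(completed or ())
    order = []

    def run(stage):
        if stage in completed:
            return  # checkpoint hit: do not re-run
        for dep in stages[stage]:
            run(dep)  # resolve dependencies first
        order.append(stage)
        completed.add(stage)

    for stage in stages:
        run(stage)
    return order

# Fresh run executes everything in a valid topological order.
print(run_pipeline(STAGES))
# Restart after a crash: preprocess/train are checkpointed, so only the rest run.
print(run_pipeline(STAGES, completed={"preprocess", "train"}))
```

The second call shows why checkpointing matters on shared HPC resources: after a node failure, only the unfinished stages are re-executed.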
+ +** Caution: All text is AI generated ** diff --git a/ai/1/3/b.txt b/ai/1/3/b.txt new file mode 100644 index 0000000..6ac6a82 --- /dev/null +++ b/ai/1/3/b.txt @@ -0,0 +1,18 @@ +# AI1.3 Agents + +This skill introduces the concept of software agents in AI systems, including autonomous and semi-autonomous agents used for decision-making, task execution, or coordination. It focuses on their design, deployment, and integration in HPC-based AI workflows. + +## Requirements + +* External: Basic knowledge of reinforcement learning or AI model behavior +* Internal: None + +## Learning Outcomes + +* Define what constitutes an AI agent in the context of HPC workloads. +* Differentiate between reactive, deliberative, and hybrid agent architectures. +* Explain use cases for agents in model orchestration, data interaction, and simulation-based learning. +* Describe the lifecycle of an agent from initialization to termination in an HPC pipeline. +* Evaluate agent behavior in terms of autonomy, adaptability, and communication. + +** Caution: All text is AI generated ** diff --git a/ai/1/4/b.txt b/ai/1/4/b.txt new file mode 100644 index 0000000..07ec7e2 --- /dev/null +++ b/ai/1/4/b.txt @@ -0,0 +1,18 @@ +# AI1.4 Fine Tuning + +This skill focuses on techniques and strategies for fine-tuning pre-trained AI models in HPC environments. It includes adapting models to domain-specific data, optimizing resource usage, and applying transfer learning for efficient training. + +## Requirements + +* External: Understanding of basic deep learning and model training processes +* Internal: None + +## Learning Outcomes + +* Explain the purpose and benefits of fine-tuning pre-trained models. +* Identify key hyperparameters and architectural considerations during fine-tuning. +* Apply methods for efficient fine-tuning, including layer freezing and learning rate scheduling. +* Describe how fine-tuning strategies differ for large-scale models on HPC infrastructure. 
+* Recognize potential pitfalls such as overfitting, catastrophic forgetting, and data leakage. + +** Caution: All text is AI generated ** diff --git a/ai/1/b.txt b/ai/1/b.txt new file mode 100644 index 0000000..c0e26f8 --- /dev/null +++ b/ai/1/b.txt @@ -0,0 +1,19 @@ +# AI1 AI System Design and Deployment + +This node encompasses the key concepts and practical knowledge required to design, organize, and operate AI systems on HPC platforms. It brings together architectural considerations, workflow coordination, fine-tuning strategies, and agent-based designs relevant to scalable AI deployment. + +## Learning Outcomes + +* Identify major architectural design choices and their impact on AI scalability in HPC environments. +* Describe how AI workflows are coordinated and executed across distributed computing systems. +* Explain the function and design of intelligent agents in HPC-based AI systems. +* Summarize techniques for fine-tuning AI models, considering performance and resource constraints. + +## Subskills + +* [[skill-tree:ai:1:1:b]] +* [[skill-tree:ai:1:2:b]] +* [[skill-tree:ai:1:3:b]] +* [[skill-tree:ai:1:4:b]] + +** Caution: All text is AI generated ** diff --git a/ai/2/1/b.txt b/ai/2/1/b.txt new file mode 100644 index 0000000..7c4249e --- /dev/null +++ b/ai/2/1/b.txt @@ -0,0 +1,18 @@ +# AI2.1 Hosting AI Models + +This skill focuses on strategies and technologies for deploying and serving AI models in high-performance and hybrid computing environments. It includes model packaging, service interfaces, containerization, and compatibility with HPC infrastructure. + +## Requirements + +* External: Familiarity with model training and inference concepts +* Internal: None + +## Learning Outcomes + +* Compare different approaches for deploying AI models in HPC, cloud, and hybrid environments. +* Describe the role of containers (e.g., Docker, Singularity) in hosting AI models.
+* Identify tools and frameworks used for serving models (e.g., TorchServe, Triton Inference Server). +* Explain compatibility issues and solutions for model execution on HPC systems. +* Demonstrate how to expose AI models as services or endpoints for internal or external access. + +** Caution: All text is AI generated ** diff --git a/ai/2/2/b.txt b/ai/2/2/b.txt new file mode 100644 index 0000000..cace19f --- /dev/null +++ b/ai/2/2/b.txt @@ -0,0 +1,18 @@ +# AI2.2 Scaling and Inference Optimization + +This skill explores techniques for scaling AI inference across HPC systems and optimizing throughput and latency. It includes performance tuning, parallelism, model simplification, and the use of hardware accelerators. + +## Requirements + +* External: Basic understanding of AI inference and performance bottlenecks +* Internal: None + +## Learning Outcomes + +* Describe methods to parallelize inference across multiple compute nodes or GPUs. +* Identify and apply model optimization techniques such as pruning, quantization, and distillation. +* Explain how to balance latency, throughput, and resource usage in production environments. +* Evaluate the impact of batch size, I/O overhead, and memory footprint on inference performance. +* Use profiling tools to locate bottlenecks and improve inference efficiency in HPC workflows. + +** Caution: All text is AI generated ** diff --git a/ai/2/3/b.txt b/ai/2/3/b.txt new file mode 100644 index 0000000..0dbf1dd --- /dev/null +++ b/ai/2/3/b.txt @@ -0,0 +1,18 @@ +# AI2.3 Resource-Aware Deployment + +This skill focuses on deploying AI workloads in a manner that accounts for the limitations and availability of HPC resources such as compute, memory, storage, and energy. It teaches techniques to maximize efficiency and sustainability. 
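The resource-aware selection of compute and memory configurations described in AI2.3 can be sketched with back-of-the-envelope arithmetic. The 16-bytes-per-parameter estimate (fp32 weights, gradients, and Adam moments) is a common rule of thumb that ignores activations; the candidate node configurations are invented for illustration.

```python
# Back-of-the-envelope sizing sketch for a resource-aware deployment decision.
# Byte counts are rules of thumb (activations excluded); node configs are hypothetical.

def training_memory_gb(n_params, bytes_per_param=4, optimizer_factor=4):
    """Rough GPU memory for training: weights + gradients + optimizer state.

    With fp32 weights (4 B), gradients (4 B), and two Adam moments (8 B),
    total bytes per parameter are 4 * 4 = 16, hence optimizer_factor=4.
    """
    return n_params * bytes_per_param * optimizer_factor / 1e9

def pick_config(n_params, configs):
    """Return the smallest config whose total GPU memory fits the model."""
    need = training_memory_gb(n_params)
    for name, gpus, mem_per_gpu_gb in configs:  # assumed sorted small -> large
        if gpus * mem_per_gpu_gb >= need:
            return name
    return None  # nothing fits: model must be sharded or offloaded

# Hypothetical node types: (name, GPUs per job, GB per GPU).
CONFIGS = [("1xA100-40", 1, 40), ("4xA100-40", 4, 40), ("8xA100-80", 8, 80)]

print(pick_config(1.3e9, CONFIGS))  # ~20.8 GB of state fits a single GPU
print(pick_config(13e9, CONFIGS))   # ~208 GB of state needs the 8-GPU node
```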
+ +## Requirements + +* External: Familiarity with AI workload characteristics and HPC job environments +* Internal: None + +## Learning Outcomes + +* Define what makes a deployment “resource-aware” in the context of HPC. +* Select appropriate compute and memory configurations based on model size and workload type. +* Use resource profiling tools to guide allocation decisions. +* Apply strategies to reduce energy consumption during training and inference. +* Explain the trade-offs between performance, resource use, and scheduling constraints. + +** Caution: All text is AI generated ** diff --git a/ai/2/b.txt b/ai/2/b.txt new file mode 100644 index 0000000..a6b4525 --- /dev/null +++ b/ai/2/b.txt @@ -0,0 +1,17 @@ +# AI2 Engineering and Infrastructure + +This node encompasses the core infrastructure and engineering principles required to support scalable and efficient deployment of AI systems in HPC environments. It covers physical and virtual hosting strategies, performance optimization, and resource-aware deployment. + +## Learning Outcomes + +* Describe the requirements and challenges of hosting AI models in HPC or hybrid infrastructure. +* Identify methods to optimize inference workloads through model and system-level engineering. +* Explain the principles of resource-aware AI deployment, including compute, memory, and energy considerations. + +## Subskills + +* [[skill-tree:ai:2:1:b]] +* [[skill-tree:ai:2:2:b]] +* [[skill-tree:ai:2:3:b]] + +** Caution: All text is AI generated ** diff --git a/ai/3/1/b.txt b/ai/3/1/b.txt new file mode 100644 index 0000000..d43974e --- /dev/null +++ b/ai/3/1/b.txt @@ -0,0 +1,18 @@ +# AI3.1 Language Models (LLMs) + +This skill covers the structure, training, and deployment of large language models (LLMs) in HPC environments. It includes tokenization, transformer architectures, training dynamics, and considerations for inference scalability and memory usage. 
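One concrete instance of the memory bottlenecks named in the AI3.1 outcomes is the inference-time KV cache: each transformer layer stores a key and a value vector per token. The arithmetic below uses hypothetical model dimensions chosen only to make the numbers tangible.

```python
# Sketch of an LLM inference memory bottleneck: the KV cache.
# Per token, each layer stores one key and one value vector of size `hidden`.
# Model dimensions below are hypothetical, chosen only for the arithmetic.

def kv_cache_gb(batch, seq_len, n_layers, hidden, bytes_per_elem=2):
    """KV-cache bytes: batch * seq_len * layers * 2 (K and V) * hidden * dtype."""
    return batch * seq_len * n_layers * 2 * hidden * bytes_per_elem / 1e9

# A hypothetical 32-layer model with hidden size 4096 and an fp16 cache:
print(kv_cache_gb(batch=1, seq_len=4096, n_layers=32, hidden=4096))   # ~2.1 GB
print(kv_cache_gb(batch=16, seq_len=4096, n_layers=32, hidden=4096))  # ~34 GB
```

Scaling the batch from 1 to 16 multiplies the cache accordingly, which is why batch size and context length dominate GPU memory planning for serving.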
+ +## Requirements + +* External: Familiarity with basic deep learning concepts and NLP tasks +* Internal: None + +## Learning Outcomes + +* Describe the transformer architecture and how it underpins most LLMs. +* Explain tokenization strategies and their impact on model efficiency. +* Identify memory and compute bottlenecks in LLM training and inference. +* Compare distributed training strategies used for scaling LLMs across HPC resources. +* Evaluate the performance and limitations of LLMs on different HPC configurations. + +** Caution: All text is AI generated ** diff --git a/ai/3/2/b.txt b/ai/3/2/b.txt new file mode 100644 index 0000000..71f646e --- /dev/null +++ b/ai/3/2/b.txt @@ -0,0 +1,18 @@ +# AI3.2 Image Models + +This skill introduces key concepts and architectures for image-based AI models, such as CNNs and vision transformers, along with their training and deployment in HPC environments. It emphasizes dataset handling, parallel processing, and acceleration techniques. + +## Requirements + +* External: Basic understanding of computer vision and convolutional neural networks +* Internal: None + +## Learning Outcomes + +* Identify common architectures used in image modeling (e.g., ResNet, EfficientNet, Vision Transformers). +* Describe the data pipeline requirements for large-scale image datasets. +* Apply techniques for distributed training of image models in HPC environments. +* Explain GPU/TPU acceleration strategies for image model training and inference. +* Evaluate performance trade-offs between model size, accuracy, and runtime efficiency. + +** Caution: All text is AI generated ** diff --git a/ai/3/3/b.txt b/ai/3/3/b.txt new file mode 100644 index 0000000..6db96cf --- /dev/null +++ b/ai/3/3/b.txt @@ -0,0 +1,18 @@ +# AI3.3 Audio and Voice Models + +This skill introduces models designed to process audio signals and voice data, including spectrogram-based models, recurrent architectures, and transformers. 
It emphasizes preprocessing, training techniques, and deployment challenges in HPC environments. + +## Requirements + +* External: Familiarity with signal processing concepts and neural networks +* Internal: None + +## Learning Outcomes + +* Describe preprocessing techniques used to transform raw audio into model-ready formats (e.g., spectrograms, MFCCs). +* Compare model architectures suited for audio and voice tasks (e.g., RNNs, CNNs, transformers). +* Explain challenges in training audio models at scale, such as input length variability and I/O throughput. +* Evaluate model performance across metrics like accuracy, latency, and noise robustness. +* Demonstrate how to optimize audio inference on HPC systems using batching and compression. + +** Caution: All text is AI generated ** diff --git a/ai/3/4/b.txt b/ai/3/4/b.txt new file mode 100644 index 0000000..865190a --- /dev/null +++ b/ai/3/4/b.txt @@ -0,0 +1,18 @@ +# AI3.4 Video Generation + +This skill focuses on AI models that generate or synthesize video content, addressing temporal consistency, multi-frame modeling, and large-scale training challenges. It covers architectures, datasets, and HPC-specific resource demands. + +## Requirements + +* External: Knowledge of deep learning and image/video data structures +* Internal: None + +## Learning Outcomes + +* Explain the structure of video generation models and how they differ from static image models. +* Identify key challenges in modeling temporal dynamics and visual coherence. +* Describe the resource demands of video generation, including GPU memory and disk I/O. +* Apply strategies to manage large-scale video datasets in distributed environments. +* Evaluate performance and quality trade-offs in video synthesis models. 
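The GPU-memory point in the AI3.4 outcomes comes down to one extra tensor dimension: a video batch carries a temporal axis, so activation sizes grow linearly with clip length. A toy calculation (all sizes hypothetical):

```python
# Rough arithmetic for video-generation memory: the temporal dimension
# multiplies tensor sizes. Shapes and fp16 precision are illustrative only.

def tensor_gb(*dims, bytes_per_elem=2):
    """Size in GB of a dense tensor with the given dimensions."""
    size = bytes_per_elem
    for d in dims:
        size *= d
    return size / 1e9

# Hypothetical sizes: batch 8, 3 channels, 512x512 resolution, fp16.
image_batch = tensor_gb(8, 3, 512, 512)
video_batch = tensor_gb(8, 16, 3, 512, 512)  # 16 frames per clip
print(image_batch, video_batch)  # the video batch is 16x larger
```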
+ +** Caution: All text is AI generated ** diff --git a/ai/3/5/b.txt b/ai/3/5/b.txt new file mode 100644 index 0000000..6259736 --- /dev/null +++ b/ai/3/5/b.txt @@ -0,0 +1,18 @@ +# AI3.5 Multimodal Models + +This skill introduces AI models that process and integrate data from multiple modalities such as text, images, audio, and video. It covers model architectures, training strategies, and synchronization techniques in distributed HPC environments. + +## Requirements + +* External: Basic understanding of multiple data types (e.g., text, image, audio) and neural networks +* Internal: None + +## Learning Outcomes + +* Define what constitutes a multimodal model and describe its typical input/output structures. +* Compare fusion strategies (early, late, and hybrid) used to combine modalities in model architectures. +* Explain challenges in synchronizing and batching multimodal inputs during training. +* Identify common datasets and benchmarks used for evaluating multimodal models. +* Describe how HPC systems handle distributed training and scaling of multimodal networks. + +** Caution: All text is AI generated ** diff --git a/ai/3/6/b.txt b/ai/3/6/b.txt new file mode 100644 index 0000000..63517da --- /dev/null +++ b/ai/3/6/b.txt @@ -0,0 +1,18 @@ +# AI3.6 Graph Neural Networks + +This skill introduces graph neural networks (GNNs), which operate on structured data represented as graphs. It focuses on graph-based learning, message passing, and scaling GNNs on HPC platforms. + +## Requirements + +* External: Understanding of basic machine learning and graph theory concepts +* Internal: None + +## Learning Outcomes + +* Explain how graph neural networks represent and process relational data. +* Describe core GNN operations such as message passing and aggregation. +* Identify use cases for GNNs in scientific computing, recommendation systems, and bioinformatics. +* Apply techniques for batching and sampling large graphs in distributed training. 
+* Evaluate performance and scalability of GNNs in multi-node HPC environments. + +** Caution: All text is AI generated ** diff --git a/ai/3/7/b.txt b/ai/3/7/b.txt new file mode 100644 index 0000000..c72edb3 --- /dev/null +++ b/ai/3/7/b.txt @@ -0,0 +1,18 @@ +# AI3.7 Scientific Models + +This skill focuses on AI models tailored for scientific domains, such as physics-informed neural networks (PINNs), surrogate models, and models used in simulations. It emphasizes integration with traditional HPC workflows and scientific data. + +## Requirements + +* External: Familiarity with scientific computing or domain-specific simulation tasks +* Internal: None + +## Learning Outcomes + +* Describe how AI models can accelerate or complement scientific simulations. +* Explain the concept and application of physics-informed neural networks (PINNs). +* Identify the challenges of incorporating domain knowledge into AI model design. +* Discuss the integration of scientific models with HPC job scheduling and simulation pipelines. +* Evaluate the accuracy, efficiency, and generalizability of scientific AI models. + +** Caution: All text is AI generated ** diff --git a/ai/3/8/b.txt b/ai/3/8/b.txt new file mode 100644 index 0000000..cde4d19 --- /dev/null +++ b/ai/3/8/b.txt @@ -0,0 +1,18 @@ +# AI3.8 Explainable AI (XAI) + +This skill introduces methods and frameworks that enable interpretability of AI models, especially in high-stakes or scientific contexts. It emphasizes tools, techniques, and best practices for understanding, auditing, and communicating model behavior. + +## Requirements + +* External: Understanding of basic AI model structure and outputs +* Internal: None + +## Learning Outcomes + +* Define explainability and distinguish it from transparency and interpretability. +* Identify common XAI methods (e.g., SHAP, LIME, saliency maps) and their applications. +* Apply interpretability techniques to evaluate model decisions in classification or regression tasks. 
+* Describe use cases for explainability in scientific research, safety-critical systems, and compliance. +* Evaluate trade-offs between model complexity and interpretability. + +** Caution: All text is AI generated ** diff --git a/ai/3/b.txt b/ai/3/b.txt new file mode 100644 index 0000000..df65309 --- /dev/null +++ b/ai/3/b.txt @@ -0,0 +1,29 @@ +# AI3 AI Modalities + +This node introduces various types of AI model modalities and their applications in HPC environments. It groups foundational knowledge and deployment considerations for language, image, audio, video, multimodal, graph, and scientific models, along with principles of interpretability. + +## Learning Outcomes + +* Explain the structure and application of large language models and how they scale in HPC environments. +* Describe the training and inference workflows of image-based models and their resource requirements. +* Summarize techniques used in audio and voice model processing and deployment. +* Identify the challenges of video generation models and their compute/memory implications. +* Understand how multimodal models combine inputs from various domains and the synchronization strategies involved. +* Describe graph neural network architectures and their relevance in scientific and relational data modeling. +* Explain the role of domain-specific scientific models in physics-informed AI and simulation-enhanced learning. +* Apply principles of explainable AI (XAI) to interpret predictions and assess model reliability across modalities. 
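The model-agnostic flavor of the XAI methods named in AI3.8 can be sketched with an occlusion-style importance score: perturb one input feature and measure how much the output moves. The "model" here is a hand-written linear scorer used only for illustration; SHAP and LIME are far more principled versions of the same perturbation idea.

```python
# Toy sketch of occlusion-style feature importance (a model-agnostic XAI idea).
# The linear "model" and its weights are invented for illustration.

def model(x):
    """Hypothetical scoring model: output = 3*x0 + 0.5*x1 - 2*x2."""
    w = [3.0, 0.5, -2.0]
    return sum(wi * xi for wi, xi in zip(w, x))

def occlusion_importance(f, x, baseline=0.0):
    """Importance of feature i = |f(x) - f(x with feature i set to baseline)|."""
    scores = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline
        scores.append(abs(f(x) - f(occluded)))
    return scores

x = [1.0, 1.0, 1.0]
print(occlusion_importance(model, x))  # [3.0, 0.5, 2.0]: feature 0 matters most
```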
+ + + +## Subskills + +* [[skill-tree:ai:3:1:b]] +* [[skill-tree:ai:3:2:b]] +* [[skill-tree:ai:3:3:b]] +* [[skill-tree:ai:3:4:b]] +* [[skill-tree:ai:3:5:b]] +* [[skill-tree:ai:3:6:b]] +* [[skill-tree:ai:3:7:b]] +* [[skill-tree:ai:3:8:b]] + +** Caution: All text is AI generated ** diff --git a/ai/4/1/b.txt b/ai/4/1/b.txt new file mode 100644 index 0000000..c6ba2a1 --- /dev/null +++ b/ai/4/1/b.txt @@ -0,0 +1,18 @@ +# AI4.1 Responsible AI Use + +This skill explores the ethical and societal implications of AI systems, focusing on fairness, transparency, safety, and accountability. It introduces frameworks and best practices for responsible AI design and deployment in HPC contexts. + +## Requirements + +* External: General understanding of AI system design and deployment +* Internal: None + +## Learning Outcomes + +* Define key principles of responsible AI, including fairness, non-discrimination, and inclusiveness. +* Identify potential sources of bias in AI data and models and methods to mitigate them. +* Describe strategies for ensuring transparency and explainability in model outputs. +* Recognize ethical risks in deploying large-scale AI systems and how to address them. +* Apply responsible AI guidelines or frameworks (e.g., EU AI Act, OECD Principles) to HPC workflows. + +** Caution: All text is AI generated ** diff --git a/ai/4/2/b.txt b/ai/4/2/b.txt new file mode 100644 index 0000000..e1dcadb --- /dev/null +++ b/ai/4/2/b.txt @@ -0,0 +1,18 @@ +# AI4.2 Data Privacy and Compliance + +This skill focuses on the legal and technical aspects of handling sensitive data in AI systems, particularly in regulated environments. It covers compliance requirements, anonymization techniques, and policy enforcement in HPC settings. + +## Requirements + +* External: Familiarity with AI workflows and basic data management concepts +* Internal: None + +## Learning Outcomes + +* Identify major privacy regulations relevant to AI (e.g., GDPR, HIPAA) and their implications. 
+* Describe data anonymization and pseudonymization techniques. +* Apply access controls and audit mechanisms for secure data handling in HPC workflows. +* Explain how compliance requirements influence dataset selection, storage, and retention policies. +* Evaluate tools and frameworks that support compliance monitoring and reporting. + +** Caution: All text is AI generated ** diff --git a/ai/4/3/b.txt b/ai/4/3/b.txt new file mode 100644 index 0000000..8c920be --- /dev/null +++ b/ai/4/3/b.txt @@ -0,0 +1,14 @@ +# AI4.3 Data Provenance and Auditability + +This skill addresses the tracking and documentation of data origins, transformations, and usage throughout the AI lifecycle. It focuses on enabling reproducibility, transparency, and auditability of AI workflows in HPC environments. + +## Learning Outcomes + +* Define data provenance and explain its importance in scientific and regulated AI use cases. +* Identify tools and metadata standards used for tracking data lineage. +* Describe how audit trails can be maintained across distributed HPC workflows. +* Implement strategies to ensure reproducibility of AI experiments, including versioning of data and models. +* Evaluate systems that integrate provenance tracking with workflow engines or data lakes. + +** Caution: All text is AI generated ** + diff --git a/ai/4/b.txt b/ai/4/b.txt new file mode 100644 index 0000000..dbcfa97 --- /dev/null +++ b/ai/4/b.txt @@ -0,0 +1,17 @@ +# AI4 Governance and Compliance + +This node covers the policies, frameworks, and practices needed to ensure responsible, lawful, and auditable use of AI technologies in HPC environments. It includes responsible AI principles, legal compliance, and data governance techniques. + +## Learning Outcomes + +* Explain the principles of responsible AI development and deployment. +* Identify key regulatory and legal frameworks relevant to AI usage and data handling.
+* Describe how data provenance, auditability, and reproducibility are maintained in HPC AI workflows. + +## Subskills + +* [[skill-tree:ai:4:1:b]] +* [[skill-tree:ai:4:2:b]] +* [[skill-tree:ai:4:3:b]] + +** Caution: All text is AI generated ** diff --git a/ai/5/1/b.txt b/ai/5/1/b.txt new file mode 100644 index 0000000..a3bf0ec --- /dev/null +++ b/ai/5/1/b.txt @@ -0,0 +1,18 @@ +# AI5.1 Prompt Engineering + +This skill focuses on designing and optimizing inputs (prompts) to guide the behavior of large language models (LLMs) and other generative AI systems. It includes patterns, templates, and evaluation methods tailored to HPC-scale applications. + +## Requirements + +* External: Basic understanding of language models and inference processes +* Internal: None + +## Learning Outcomes + +* Define prompt engineering and describe its role in controlling generative model output. +* Identify common prompt types (e.g., zero-shot, few-shot, chain-of-thought) and their use cases. +* Design structured prompts for tasks such as summarization, code generation, and question answering. +* Evaluate prompt effectiveness using criteria such as output quality, consistency, and efficiency. +* Apply prompt optimization techniques for performance on large-scale inference systems. + +** Caution: All text is AI generated ** diff --git a/ai/5/2/b.txt b/ai/5/2/b.txt new file mode 100644 index 0000000..f2bcded --- /dev/null +++ b/ai/5/2/b.txt @@ -0,0 +1,18 @@ +# AI5.2 Retrieval Augmented Generation + +This skill introduces the concept of Retrieval-Augmented Generation (RAG), where external knowledge sources are queried and integrated into the generation process. It covers retrieval pipelines, indexing strategies, and deployment in HPC environments. + +## Requirements + +* External: Familiarity with LLMs and vector search concepts +* Internal: None + +## Learning Outcomes + +* Define the RAG architecture and explain how it improves generative model performance. 
+* Describe the components of a retrieval pipeline, including query formulation, embedding, and indexing. +* Identify vector databases and similarity metrics used in AI retrieval tasks. +* Integrate retrieval results into prompt templates or model input streams. +* Evaluate RAG systems based on latency, accuracy, and grounding quality. + +** Caution: All text is AI generated ** diff --git a/ai/5/3/b.txt b/ai/5/3/b.txt new file mode 100644 index 0000000..1adfe23 --- /dev/null +++ b/ai/5/3/b.txt @@ -0,0 +1,18 @@ +# AI5.3 Agentic Interfaces and Collaboration + +This skill explores AI systems designed to act as agents—capable of multi-step reasoning, memory, and collaboration with users or other agents. It includes the architecture, behavior models, and interface strategies used to implement intelligent and adaptive workflows. + +## Requirements + +* External: Understanding of basic prompting and LLM capabilities +* Internal: AI5.1 Prompt Engineering (recommended) + +## Learning Outcomes + +* Define agentic interfaces and describe their role in structured decision-making and automation. +* Differentiate between tool-using, memory-enabled, and collaborative agent types. +* Describe the looped reasoning and execution cycles used by AI agents. +* Identify strategies for agent evaluation, feedback integration, and task adaptation. +* Apply best practices for integrating agents into interactive or multi-agent HPC workflows. + +** Caution: All text is AI generated ** diff --git a/ai/5/b.txt b/ai/5/b.txt new file mode 100644 index 0000000..765a7ae --- /dev/null +++ b/ai/5/b.txt @@ -0,0 +1,18 @@ +# AI5 Intelligent Interactions and Retrieval Systems + +This node explores techniques for enhancing AI model interactions with users and data sources. It includes prompting strategies, retrieval-augmented generation (RAG), and agentic interfaces designed for complex task execution and collaboration. 
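The retrieval pipeline described in AI5.2 can be sketched end to end in a few lines. The toy corpus, the bag-of-words "embedding", and the prompt template are all illustrative stand-ins for a real embedding model, vector database, and LLM call.

```python
# Minimal retrieval-augmented generation (RAG) sketch. The corpus, the
# bag-of-words "embedding", and the prompt template are illustrative stand-ins.
import math
import re
from collections import Counter

CORPUS = [
    "Slurm is the batch scheduler used on the cluster.",
    "GPUs are requested with the --gres flag.",
    "Checkpoint files are written to the scratch filesystem.",
]

def embed(text):
    """Toy embedding: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, k=1):
    """Rank documents by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, corpus):
    """Ground the generation step by pasting retrieved context into the prompt."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("How do I request GPUs?", CORPUS))
```

In a production pipeline the `embed` step would call an embedding model, `retrieve` would query a vector index, and the built prompt would be passed to an LLM; the grounding logic stays the same.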
+ +## Learning Outcomes + +* Describe prompting techniques and their influence on generative model behavior. +* Explain how retrieval-augmented generation improves accuracy and grounding in generative systems. +* Understand the architecture and behavior of agentic interfaces designed for iterative or autonomous workflows. + +## Subskills + +* [[skill-tree:ai:5:1:b]] +* [[skill-tree:ai:5:2:b]] +* [[skill-tree:ai:5:3:b]] + +** Caution: All text is AI generated ** + diff --git a/ai/6/1/b.txt b/ai/6/1/b.txt new file mode 100644 index 0000000..942f169 --- /dev/null +++ b/ai/6/1/b.txt @@ -0,0 +1,18 @@ +# AI6.1 Integrating External AI APIs + +This skill covers how to incorporate third-party AI services—such as foundation models, transcription engines, and vision APIs—into HPC workflows. It includes authentication, latency handling, and resource-aware integration techniques. + +## Requirements + +* External: Familiarity with web APIs and basic programming +* Internal: None + +## Learning Outcomes + +* Identify common external AI APIs and services (e.g., OpenAI, Hugging Face, Google Vision). +* Demonstrate secure API integration using authentication methods like OAuth or API keys. +* Handle latency, rate limits, and failures in long-running HPC jobs using retry/backoff strategies. +* Apply methods for data formatting and streaming between HPC systems and external APIs. +* Evaluate trade-offs between in-house models and external service integration. + +** Caution: All text is AI generated ** diff --git a/ai/6/2/b.txt b/ai/6/2/b.txt new file mode 100644 index 0000000..f30ad3e --- /dev/null +++ b/ai/6/2/b.txt @@ -0,0 +1,18 @@ +# AI6.2 Building AI APIs + +This skill focuses on exposing AI models as APIs to support integration, automation, and user interaction. It covers API design, containerization, deployment, and scalability in HPC or hybrid cloud environments. 
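The API-building workflow in AI6.2 can be sketched with only the Python standard library; a production service would use a framework such as FastAPI or a dedicated server such as Triton. The `predict` "model" is a placeholder and the port is arbitrary.

```python
# Minimal sketch of exposing a model as a JSON-over-HTTP inference endpoint,
# standard library only. The "model" is a placeholder returning a fake score.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(payload):
    """Placeholder model: score the input text by its length."""
    text = payload.get("text", "")
    return {"score": len(text) / 100.0}

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON request body.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(predict(payload)).encode()
        # Return the prediction as a JSON response.
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To actually serve requests (blocking call):
#   HTTPServer(("127.0.0.1", 8000), InferenceHandler).serve_forever()
print(predict({"text": "hello"}))  # {'score': 0.05}
```

Separating `predict` from the HTTP handler keeps the model logic testable on its own, which is the same separation frameworks like FastAPI encourage.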
+ +## Requirements + +* External: Experience with Python and web frameworks (e.g., Flask, FastAPI) +* Internal: None + +## Learning Outcomes + +* Design RESTful APIs for AI model inference, including input/output schema definition. +* Implement API endpoints to expose model functionality securely and efficiently. +* Containerize API services using tools like Docker or Singularity for deployment. +* Deploy APIs in scalable environments using orchestration tools (e.g., Kubernetes, Slurm). +* Monitor, log, and benchmark API performance in production. + +** Caution: All text is AI generated ** diff --git a/ai/6/3/b.txt b/ai/6/3/b.txt new file mode 100644 index 0000000..dec7376 --- /dev/null +++ b/ai/6/3/b.txt @@ -0,0 +1,18 @@ +# AI6.3 AI Frameworks + +This skill introduces widely used AI frameworks that support model development, training, and deployment. It emphasizes selecting the right framework based on use case, hardware compatibility, and scalability in HPC environments. + +## Requirements + +* External: Understanding of deep learning workflows +* Internal: None + +## Learning Outcomes + +* Compare major AI frameworks such as TensorFlow, PyTorch, JAX, and ONNX. +* Describe the strengths and limitations of each framework in HPC use cases. +* Identify tools for mixed precision training, distributed computing, and hardware acceleration. +* Demonstrate how to port models between frameworks for deployment or optimization. +* Select appropriate frameworks based on model architecture, team skills, and resource constraints. + +** Caution: All text is AI generated ** diff --git a/ai/6/b.txt b/ai/6/b.txt new file mode 100644 index 0000000..4b60c12 --- /dev/null +++ b/ai/6/b.txt @@ -0,0 +1,17 @@ +# AI6 AI Services + +This node focuses on the development, integration, and delivery of AI capabilities as services. It includes building and deploying APIs, integrating external AI providers, and managing frameworks for scalable service-based AI in HPC environments. 
+
+## Learning Outcomes
+
+* Describe how AI services are developed and delivered via APIs.
+* Explain how external AI models and APIs are integrated into HPC workflows.
+* Identify common AI frameworks and how they support scalable, service-oriented deployment.
+
+## Subskills
+
+* [[skill-tree:ai:6:1:b]]
+* [[skill-tree:ai:6:2:b]]
+* [[skill-tree:ai:6:3:b]]
+
+** Caution: All text is AI generated **
diff --git a/ai/b.txt b/ai/b.txt
new file mode 100644
index 0000000..a5ee8a8
--- /dev/null
+++ b/ai/b.txt
@@ -0,0 +1,26 @@
+# AI Artificial Intelligence
+
+Using and deploying artificial intelligence software is becoming increasingly relevant.
+This includes deploying models on HPC systems and offering them as a service.
+
+## Learning Outcomes
+
+* Explain the function and design of intelligent agents in HPC-based AI systems and investigate scalability.
+* Summarize techniques for fine-tuning AI models, considering performance and resource constraints.
+* Describe the requirements and challenges of hosting AI models in HPC or hybrid infrastructure.
+* Explain the principles of resource-aware AI deployment, including compute, memory, and energy considerations.
+* Summarize the processing and deployment techniques used by different model types.
+* Identify key regulatory and legal frameworks relevant to AI usage and data handling, and identify principles of responsible AI use.
+* Explain how retrieval-augmented generation improves accuracy and grounding in generative systems.
+* Describe how AI services are developed and delivered via APIs.
+ +## Subskills + +* [[skill-tree:ai:1:b]] +* [[skill-tree:ai:2:b]] +* [[skill-tree:ai:3:b]] +* [[skill-tree:ai:4:b]] +* [[skill-tree:ai:5:b]] +* [[skill-tree:ai:6:b]] + +** Caution: All text is AI generated ** diff --git a/b.txt b/b.txt index 4d4e796..5f5c5b7 100644 --- a/b.txt +++ b/b.txt @@ -1,9 +1,11 @@ # SkillTree ## Subskills - * [[skill-tree:adm:b]] - * [[skill-tree:bda:b]] - * [[skill-tree:k:b]] - * [[skill-tree:pe:b]] - * [[skill-tree:sd:b]] - * [[skill-tree:use:b]] + +* [[skill-tree:adm:b]] +* [[skill-tree:bda:b]] +* [[skill-tree:k:b]] +* [[skill-tree:pe:b]] +* [[skill-tree:sd:b]] +* [[skill-tree:use:b]] +* [[skill-tree:ai:b]] diff --git a/bda/3/1/b.txt b/bda/3/1/b.txt new file mode 100644 index 0000000..cc8a2b3 --- /dev/null +++ b/bda/3/1/b.txt @@ -0,0 +1,18 @@ +# BDA3.1 Parallel File Systems + +This skill focuses on file systems optimized for concurrent access by distributed processes in HPC environments. It covers architecture, performance tuning, and usage best practices for systems like Lustre, GPFS, and BeeGFS. + +## Requirements + +* External: Basic knowledge of file systems and parallel computing +* Internal: None + +## Learning Outcomes + +* Describe the architecture and operational principles of parallel file systems. +* Compare commonly used systems (e.g., Lustre, GPFS, BeeGFS) in terms of scalability and performance. +* Identify key configuration parameters that influence I/O throughput. +* Apply best practices for managing I/O contention and metadata performance. +* Monitor and optimize parallel file system performance for big data workflows. + +** Caution: All text is AI generated ** diff --git a/bda/3/2/b.txt b/bda/3/2/b.txt new file mode 100644 index 0000000..35925fc --- /dev/null +++ b/bda/3/2/b.txt @@ -0,0 +1,18 @@ +# BDA3.2 Object Storage for Big Data + +This skill introduces object storage systems designed for scalability, durability, and ease of access in big data environments. 
It covers access patterns, API usage, and integration with HPC and cloud-based workflows. + +## Requirements + +* External: Familiarity with data formats and file access in distributed systems +* Internal: None + +## Learning Outcomes + +* Explain the concept of object storage and how it differs from file/block storage. +* Identify popular object storage platforms (e.g., S3, Ceph, MinIO) and their use cases in big data. +* Describe data access mechanisms including REST APIs, SDKs, and CLI tools. +* Evaluate performance considerations such as consistency, latency, and throughput. +* Integrate object storage into HPC data pipelines using connectors or abstraction layers. + +** Caution: All text is AI generated ** diff --git a/bda/3/b.txt b/bda/3/b.txt index c9f3458..52d1745 100644 --- a/bda/3/b.txt +++ b/bda/3/b.txt @@ -1,30 +1,15 @@ # BDA3 Technology -Understanding the underlying technologies and infrastructure is crucial for effectively managing and analyzing large volumes of data in big data analytics. This module covers the foundational technologies, tools, and platforms used in big data processing, storage, and analysis. - -## Requirements +This node introduces technologies used to support large-scale data workflows in HPC environments. It focuses on file systems and object storage solutions that ensure high throughput, scalability, and reliability for big data applications. ## Learning Outcomes -* **Understand the principles** of distributed computing and parallel processing in big data analytics. -* **Explore the architecture** of distributed file systems like Hadoop Distributed File System (HDFS) and its role in storing and managing large datasets. -* **Analyze the components** of the Hadoop ecosystem, including Hadoop MapReduce, YARN, and Hadoop Common, and their contributions to big data processing. 
-* **Examine the role** of NoSQL databases such as Apache Cassandra, MongoDB, and Apache HBase in handling unstructured and semi-structured data in distributed environments. -* **Understand the principles** of data replication, fault tolerance, and high availability in distributed storage systems for ensuring data reliability and resilience. -* **Explore the concepts** of stream processing frameworks such as Apache Kafka, Apache Storm, and Apache Flink for real-time data ingestion, processing, and analysis. -* **Analyze the architecture** of distributed batch processing frameworks such as Apache Spark, Apache Flink, and Apache Beam for processing large volumes of data in parallel. -* **Understand the principles** of resource management and workload scheduling in distributed computing environments for optimizing resource utilization and performance. -* **Explore the role** of containerization technologies such as Docker and Kubernetes in deploying and managing distributed big data applications at scale. -* **Analyze the features** of cloud-based big data platforms such as Amazon EMR, Google Cloud Dataproc, and Microsoft Azure HDInsight, and their advantages for scalable data processing and analytics. -* **Understand the principles** of data compression and serialization techniques for optimizing storage efficiency and reducing data transfer overhead in distributed systems. -* **Explore the concepts** of data lakes, data warehouses, and data marts in organizing and structuring data for analytics and business intelligence purposes. -* **Analyze the architecture** of distributed stream processing systems such as Apache Beam, Apache Samza, and Apache Apex for processing continuous streams of data with low latency and high throughput. -* **Understand the principles** of graph processing and graph databases such as Neo4j, Amazon Neptune, and Apache Giraph for analyzing and querying interconnected data. 
-* **Explore the role** of indexing and search technologies such as Apache Solr, Elasticsearch, and Apache Lucene in enabling fast and efficient retrieval of information from large datasets. -* **Analyze the challenges** of data integration, data quality, and data governance in big data environments and strategies for overcoming these challenges. -* **Understand the principles** of data encryption, access control, and data masking techniques for securing sensitive data in distributed storage and processing systems. -* **Explore the concepts** of data preprocessing, feature engineering, and data transformation techniques for preparing raw data for machine learning and predictive analytics. -* **Analyze the features** of data governance tools and metadata management solutions for tracking data lineage, ensuring data quality, and enforcing regulatory compliance. -* **Understand the principles** of data virtualization and federated query processing in integrating heterogeneous data sources and enabling cross-platform data analytics. +* Describe the characteristics and trade-offs of file system technologies used in big data HPC environments. +* Explain how object storage solutions are structured and integrated with HPC and analytics workflows. + +## Subskills + +* [[skill-tree:bda:3:1:b]] +* [[skill-tree:bda:3:2:b]] -AI generated content +** Caution: All text is AI generated ** diff --git a/bda/5/1/1/b.txt b/bda/5/1/1/b.txt new file mode 100644 index 0000000..8cbdd20 --- /dev/null +++ b/bda/5/1/1/b.txt @@ -0,0 +1,18 @@ +# BDA5.1.1 Supervised and Unsupervised Learning + +This skill introduces the fundamental principles of supervised and unsupervised learning, focusing on algorithm types, data requirements, and typical use cases in HPC-scale machine learning. + +## Requirements + +* External: Basic statistics and data analysis knowledge +* Internal: None + +## Learning Outcomes + +* Define and distinguish between supervised and unsupervised learning paradigms. 
+* Identify common algorithms such as decision trees, SVMs, k-means, and PCA. +* Explain how labeled and unlabeled datasets are used in each approach. +* Describe typical workflows and applications of each paradigm in scientific and industrial settings. +* Understand data partitioning, overfitting, and model generalization in supervised learning. + +** Caution: All text is AI generated ** diff --git a/bda/5/1/2/b.txt b/bda/5/1/2/b.txt new file mode 100644 index 0000000..ac4c523 --- /dev/null +++ b/bda/5/1/2/b.txt @@ -0,0 +1,18 @@ +# BDA5.1.2 Evaluation Metrics + +This skill focuses on the metrics and methods used to assess the performance of machine learning models. It includes evaluation for classification, regression, and unsupervised learning, with emphasis on interpretability and reliability in HPC-scale workflows. + +## Requirements + +* External: Familiarity with supervised and unsupervised learning concepts +* Internal: BDA5.1.1 Supervised and Unsupervised Learning (recommended) + +## Learning Outcomes + +* Identify common metrics for classification (e.g., accuracy, precision, recall, F1-score) and regression (e.g., RMSE, MAE). +* Explain confusion matrices, ROC curves, and AUC for classification tasks. +* Describe cluster evaluation metrics like silhouette score and Davies-Bouldin index. +* Interpret metric values to assess underfitting, overfitting, and model calibration. +* Apply cross-validation to estimate model performance and generalizability. + +** Caution: All text is AI generated ** diff --git a/bda/5/1/b.txt b/bda/5/1/b.txt new file mode 100644 index 0000000..b2ef0e4 --- /dev/null +++ b/bda/5/1/b.txt @@ -0,0 +1,16 @@ +# BDA5.1 Machine Learning Fundamentals + +This node introduces core machine learning concepts and techniques used in HPC environments. It includes supervised and unsupervised learning paradigms as well as the evaluation metrics necessary to assess model performance. 
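The classification metrics listed above all follow directly from the four confusion-matrix counts; a small hand computation with invented counts illustrates the relationships.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive standard metrics from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual positives, how many were found
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Example: 80 true positives, 10 false positives, 20 false negatives, 90 true negatives.
print(classification_metrics(tp=80, fp=10, fn=20, tn=90))
```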
+ +## Learning Outcomes + +* Explain the principles of supervised and unsupervised learning in data-driven tasks. +* Identify key metrics used to evaluate classification and regression models. + +## Subskills + +* [[skill-tree:bda:5:1:1:b]] +* [[skill-tree:bda:5:1:2:b]] + +** Caution: All text is AI generated ** + diff --git a/bda/5/2/1/b.txt b/bda/5/2/1/b.txt new file mode 100644 index 0000000..d0e764a --- /dev/null +++ b/bda/5/2/1/b.txt @@ -0,0 +1,18 @@ +# BDA5.2.1 Basics of Neural Networks + +This skill introduces the foundational building blocks of neural networks, including how they are structured, trained, and optimized. It covers feedforward networks, loss functions, activation functions, and the backpropagation algorithm. + +## Requirements + +* External: Knowledge of basic linear algebra and calculus +* Internal: None + +## Learning Outcomes + +* Describe the structure of a feedforward neural network. +* Explain the role of activation functions and compare common types (ReLU, sigmoid, tanh). +* Understand how neural networks learn through backpropagation and gradient descent. +* Identify common loss functions for classification and regression tasks. +* Outline the training loop and how it is implemented in modern frameworks. + +** Caution: All text is AI generated ** diff --git a/bda/5/2/2/b.txt b/bda/5/2/2/b.txt new file mode 100644 index 0000000..5ed30c0 --- /dev/null +++ b/bda/5/2/2/b.txt @@ -0,0 +1,18 @@ +# BDA5.2.2 Neural Network Architectures + +This skill introduces major neural network architectures used across domains such as image processing, sequence modeling, and large-scale representation learning. It emphasizes the use cases, design patterns, and performance considerations of each. + +## Requirements + +* External: Understanding of neural network basics +* Internal: BDA5.2.1 Basics of Neural Networks (recommended) + +## Learning Outcomes + +* Describe the architecture and applications of convolutional neural networks (CNNs). 
+* Explain how recurrent neural networks (RNNs), LSTMs, and GRUs process sequential data.
+* Understand the transformer architecture and its impact on modern AI models.
+* Compare trade-offs among architectures in terms of complexity, performance, and parallelism.
+* Select appropriate architectures based on task type and data modality.
+
+** Caution: All text is AI generated **
diff --git a/bda/5/2/b.txt b/bda/5/2/b.txt
new file mode 100644
index 0000000..2ada564
--- /dev/null
+++ b/bda/5/2/b.txt
@@ -0,0 +1,15 @@
+# BDA5.2 Deep Learning Fundamentals
+
+This node covers the foundational elements of deep learning, including the structure and training of neural networks. It introduces basic neural network components and various deep learning architectures relevant to HPC workflows.
+
+## Learning Outcomes
+
+* Describe how neural networks are structured, trained, and applied in data-driven tasks.
+* Differentiate between core deep learning architectures such as CNNs, RNNs, and transformers.
+
+## Subskills
+
+* [[skill-tree:bda:5:2:1:b]]
+* [[skill-tree:bda:5:2:2:b]]
+
+** Caution: All text is AI generated **
diff --git a/bda/5/3/1/b.txt b/bda/5/3/1/b.txt
new file mode 100644
index 0000000..973511b
--- /dev/null
+++ b/bda/5/3/1/b.txt
@@ -0,0 +1,18 @@
+# BDA5.3.1 PyTorch
+
+This skill introduces the PyTorch framework for developing and training machine learning and deep learning models. It emphasizes model definition, training loops, and debugging in a dynamic computation graph environment.
+
+## Requirements
+
+* External: Programming experience with Python
+* Internal: BDA5.2.1 Basics of Neural Networks (recommended)
+
+## Learning Outcomes
+
+* Define and train neural networks using the PyTorch nn.Module interface.
+* Implement forward passes, loss computation, and backward propagation.
+* Use DataLoader and Dataset classes to efficiently handle input data.
+* Apply GPU acceleration and mixed precision with PyTorch utilities.
+* Debug models interactively using PyTorch’s dynamic computation graph.
+
+** Caution: All text is AI generated **
diff --git a/bda/5/3/2/b.txt b/bda/5/3/2/b.txt
new file mode 100644
index 0000000..f9772e8
--- /dev/null
+++ b/bda/5/3/2/b.txt
@@ -0,0 +1,18 @@
+# BDA5.3.2 TensorFlow
+
+This skill introduces TensorFlow for building and training machine learning models. It covers the Keras API, static vs. dynamic graph execution, and strategies for using TensorFlow effectively in research and production.
+
+## Requirements
+
+* External: Experience with Python and basic deep learning workflows
+* Internal: BDA5.2.1 Basics of Neural Networks (recommended)
+
+## Learning Outcomes
+
+* Build and train models using TensorFlow's Keras API.
+* Explain the difference between eager execution and static graph mode.
+* Use tf.data pipelines for scalable and efficient data input.
+* Implement model checkpoints and callbacks for training control.
+* Apply GPU acceleration, distribution strategies, and TensorBoard for visualization.
+
+** Caution: All text is AI generated **
diff --git a/bda/5/3/3/b.txt b/bda/5/3/3/b.txt
new file mode 100644
index 0000000..4a3f1b6
--- /dev/null
+++ b/bda/5/3/3/b.txt
@@ -0,0 +1,18 @@
+# BDA5.3.3 Distributed Training
+
+This skill introduces techniques for training machine learning and deep learning models across multiple GPUs or nodes. It covers data and model parallelism, communication frameworks, and integration with PyTorch and TensorFlow.
+
+## Requirements
+
+* External: Understanding of model training workflows
+* Internal: BDA5.3.1 PyTorch or BDA5.3.2 TensorFlow (recommended)
+
+## Learning Outcomes
+
+* Distinguish between data parallelism and model parallelism strategies.
+* Use PyTorch’s DistributedDataParallel and TensorFlow’s tf.distribute APIs.
+* Configure multi-GPU and multi-node training in HPC environments.
+* Monitor and debug performance bottlenecks in distributed training setups.
+* Apply best practices for checkpointing, fault tolerance, and reproducibility at scale. + +** Caution: All text is AI generated ** diff --git a/bda/5/3/b.txt b/bda/5/3/b.txt new file mode 100644 index 0000000..eb0423f --- /dev/null +++ b/bda/5/3/b.txt @@ -0,0 +1,16 @@ +# BDA5.3 ML and DL Frameworks + +This node focuses on the use of modern machine learning and deep learning frameworks in HPC contexts. It includes practical usage of PyTorch and TensorFlow, as well as strategies for scaling training across distributed systems. + +## Learning Outcomes + +* Use leading ML/DL frameworks to define, train, and evaluate models. +* Apply distributed training strategies using framework-native tools for scalability and performance. + +## Subskills + +* [[skill-tree:bda:5:3:1:b]] +* [[skill-tree:bda:5:3:2:b]] +* [[skill-tree:bda:5:3:3:b]] + +** Caution: All text is AI generated ** diff --git a/bda/5/4/1/b.txt b/bda/5/4/1/b.txt new file mode 100644 index 0000000..153e9fa --- /dev/null +++ b/bda/5/4/1/b.txt @@ -0,0 +1,18 @@ +# BDA5.4.1 Batch Size and Data Parallelism + +This skill focuses on tuning batch sizes and applying data parallelism to accelerate training across multiple compute units. It covers trade-offs in memory usage, convergence behavior, and hardware utilization. + +## Requirements + +* External: Familiarity with model training and GPU compute +* Internal: BDA5.3.3 Distributed Training (recommended) + +## Learning Outcomes + +* Explain how batch size affects training stability, convergence, and throughput. +* Identify the relationship between batch size and memory usage on accelerators. +* Apply data parallelism techniques across GPUs or nodes for scalable training. +* Use gradient accumulation to simulate large batch sizes under memory constraints. +* Evaluate performance trade-offs using throughput and loss convergence metrics. 
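The gradient-accumulation idea above can be illustrated without a GPU or any framework: averaging micro-batch gradients before a single optimizer step reproduces exactly the step a full batch would take. The one-parameter least-squares model below is purely illustrative.

```python
def grad(w, x, y):
    """Gradient of the squared error 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

def accumulate_update(w, batch, micro_size, lr=0.1):
    """One optimizer step built from micro-batch gradients (gradient accumulation)."""
    total, count = 0.0, 0
    for i in range(0, len(batch), micro_size):
        micro = batch[i:i + micro_size]
        total += sum(grad(w, x, y) for x, y in micro)  # accumulate, do not update yet
        count += len(micro)
    return w - lr * total / count  # single update using the mean gradient

batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
full = accumulate_update(1.0, batch, micro_size=len(batch))  # one large batch
accum = accumulate_update(1.0, batch, micro_size=1)          # four micro-batches
print(full, accum)  # identical: accumulation simulates the large-batch step
```

This is why accumulation lets memory-constrained accelerators train with effectively large batch sizes.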
+
+** Caution: All text is AI generated **
diff --git a/bda/5/4/2/b.txt b/bda/5/4/2/b.txt
new file mode 100644
index 0000000..f7d0996
--- /dev/null
+++ b/bda/5/4/2/b.txt
@@ -0,0 +1,18 @@
+# BDA5.4.2 Mixed Precision Training
+
+This skill introduces the use of mixed precision (FP16 + FP32) to accelerate training while maintaining model accuracy. It focuses on numerical stability, hardware support, and integration with modern ML frameworks.
+
+## Requirements
+
+* External: Basic understanding of floating-point computation
+* Internal: BDA5.3.1 PyTorch or BDA5.3.2 TensorFlow (recommended)
+
+## Learning Outcomes
+
+* Define mixed precision training and describe its benefits in performance and memory usage.
+* Identify hardware and software prerequisites for mixed precision support (e.g., NVIDIA Tensor Cores, AMP).
+* Apply automatic mixed precision (AMP) in PyTorch and TensorFlow workflows.
+* Monitor for numerical instability and apply scaling techniques as needed.
+* Benchmark training speed and accuracy trade-offs using mixed precision.
+
+** Caution: All text is AI generated **
diff --git a/bda/5/4/3/b.txt b/bda/5/4/3/b.txt
new file mode 100644
index 0000000..e70e5be
--- /dev/null
+++ b/bda/5/4/3/b.txt
@@ -0,0 +1,18 @@
+# BDA5.4.3 Checkpointing and Recovery
+
+This skill covers techniques for saving and restoring training state to ensure fault tolerance and efficient recovery in long-running ML jobs. It includes strategies for storage management, frequency control, and integration with batch schedulers.
+
+## Requirements
+
+* External: Familiarity with training loops and storage systems
+* Internal: None
+
+## Learning Outcomes
+
+* Explain the importance of checkpointing for resiliency in HPC training workflows.
+* Implement model, optimizer, and scheduler state saving in popular ML frameworks.
+* Choose checkpointing frequency based on job length, stability, and system load.
+* Manage checkpoint file size, compression, and storage placement.
+* Integrate checkpointing with job resubmission and monitoring tools in HPC environments. + +** Caution: All text is AI generated ** diff --git a/bda/5/4/b.txt b/bda/5/4/b.txt new file mode 100644 index 0000000..0e5ce21 --- /dev/null +++ b/bda/5/4/b.txt @@ -0,0 +1,16 @@ +# BDA5.4 HPC Optimization for ML + +This node covers performance tuning strategies that enhance machine learning training efficiency on HPC systems. It includes batch size tuning, mixed precision training, and mechanisms for recovery and checkpointing. + +## Learning Outcomes + +* Optimize batch sizes and parallelism settings to improve training scalability. +* Apply mixed precision techniques and implement robust checkpointing strategies for long-running jobs. + +## Subskills + +* [[skill-tree:bda:5:4:1:b]] +* [[skill-tree:bda:5:4:2:b]] +* [[skill-tree:bda:5:4:3:b]] + +** Caution: All text is AI generated ** diff --git a/bda/5/5/1/b.txt b/bda/5/5/1/b.txt new file mode 100644 index 0000000..e7d51d0 --- /dev/null +++ b/bda/5/5/1/b.txt @@ -0,0 +1,18 @@ +# BDA5.5.1 Hyperparameter Search Methods + +This skill introduces core approaches to searching hyperparameter spaces, including manual, grid, random, and probabilistic methods. It focuses on balancing coverage, efficiency, and reproducibility in ML experiments. + +## Requirements + +* External: Familiarity with model training and tuning concepts +* Internal: None + +## Learning Outcomes + +* Describe the role of hyperparameters in machine learning model performance. +* Compare manual tuning, grid search, and random search techniques. +* Explain how Bayesian optimization and other probabilistic methods guide search using prior results. +* Evaluate search efficiency and effectiveness across multiple performance metrics. +* Apply reproducible tuning experiments using configuration management tools or frameworks. 
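The contrast between grid and random search can be sketched in a few lines. The `objective` function here is a hypothetical stand-in for a full training-and-validation run, and the fixed seed shows the reproducibility point made above.

```python
import itertools
import random

def objective(lr, batch_size):
    """Hypothetical validation loss; in practice this is an expensive training run."""
    return (lr - 0.01) ** 2 + (batch_size - 64) ** 2 / 1e4

def grid_search(lrs, batch_sizes):
    # Exhaustively evaluate the Cartesian product of the candidate values.
    return min(itertools.product(lrs, batch_sizes), key=lambda c: objective(*c))

def random_search(n_trials, seed=0):
    # Sample the learning rate on a log scale; a fixed seed keeps runs reproducible.
    rng = random.Random(seed)
    candidates = [(10 ** rng.uniform(-4, -1), rng.choice([16, 32, 64, 128]))
                  for _ in range(n_trials)]
    return min(candidates, key=lambda c: objective(*c))

print("grid best:", grid_search([0.001, 0.01, 0.1], [32, 64, 128]))
print("random best:", random_search(20))
```

Bayesian methods differ in that each new candidate is proposed from a model fitted to the previous results rather than drawn independently.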
+ +** Caution: All text is AI generated ** diff --git a/bda/5/5/2/b.txt b/bda/5/5/2/b.txt new file mode 100644 index 0000000..8b0aaad --- /dev/null +++ b/bda/5/5/2/b.txt @@ -0,0 +1,18 @@ +# BDA5.5.2 Distributed Hyperparameter Search + +This skill focuses on running hyperparameter tuning jobs in parallel across multiple compute nodes or GPUs. It includes orchestration tools, job scheduling integration, and best practices for large-scale experimentation in HPC settings. + +## Requirements + +* External: Understanding of hyperparameter tuning strategies and batch job submission +* Internal: BDA5.5.1 Hyperparameter Search Methods (recommended) + +## Learning Outcomes + +* Explain the benefits and challenges of distributing hyperparameter search in HPC environments. +* Use orchestration tools such as Ray Tune, Optuna, or Hyperopt for parallel execution. +* Integrate distributed tuning with HPC schedulers like Slurm or PBS. +* Monitor, resume, and manage many concurrent tuning jobs. +* Optimize compute resource usage to balance speed, cost, and coverage. + +** Caution: All text is AI generated ** diff --git a/bda/5/5/3/b.txt b/bda/5/5/3/b.txt new file mode 100644 index 0000000..204032a --- /dev/null +++ b/bda/5/5/3/b.txt @@ -0,0 +1,18 @@ +# BDA5.5.3 Early Stopping and Pruning Strategies + +This skill introduces methods to halt underperforming model runs early during training or tuning. It focuses on decision rules, integration with tuning frameworks, and efficiency improvements in HPC resource usage. + +## Requirements + +* External: Experience with model evaluation and validation techniques +* Internal: BDA5.5.1 Hyperparameter Search Methods (recommended) + +## Learning Outcomes + +* Define early stopping and its role in improving training efficiency. +* Describe criteria such as validation loss plateaus or metric stagnation. +* Apply pruning strategies to terminate weak candidates during hyperparameter search. 
+* Integrate early stopping with tuning frameworks and logging systems. +* Evaluate trade-offs between training completeness and resource savings. + +** Caution: All text is AI generated ** diff --git a/bda/5/5/4/b.txt b/bda/5/5/4/b.txt new file mode 100644 index 0000000..9d72678 --- /dev/null +++ b/bda/5/5/4/b.txt @@ -0,0 +1,18 @@ +# BDA5.5.4 Resource Aware Tuning + +This skill focuses on hyperparameter tuning strategies that account for resource constraints in HPC environments. It includes adaptive scheduling, prioritization, and energy-aware optimization techniques. + +## Requirements + +* External: Basic understanding of HPC scheduling and tuning processes +* Internal: BDA5.5.1 Hyperparameter Search Methods (recommended) + +## Learning Outcomes + +* Explain how resource availability and cost influence tuning strategy selection. +* Use adaptive or multi-fidelity tuning approaches to allocate compute efficiently. +* Apply constraints on runtime, memory, or energy usage in tuning campaigns. +* Integrate resource metrics into scheduling or pruning decisions. +* Evaluate tuning success with respect to both model performance and resource efficiency. + +** Caution: All text is AI generated ** diff --git a/bda/5/5/b.txt b/bda/5/5/b.txt new file mode 100644 index 0000000..0e901f8 --- /dev/null +++ b/bda/5/5/b.txt @@ -0,0 +1,17 @@ +# BDA5.5 Hyperparameter Tuning + +This node focuses on methods and tools for optimizing hyperparameters to improve model accuracy, robustness, and efficiency. It includes search strategies, distributed tuning, and resource-aware techniques suited for HPC environments. + +## Learning Outcomes + +* Apply search algorithms to explore hyperparameter spaces effectively. +* Scale tuning processes across HPC resources while minimizing overhead and waste. 
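The patience-based early-stopping rule behind the pruning strategies above can be sketched in a few lines; the hard-coded loss curve stands in for the per-epoch validation results of a real run.

```python
def train_with_early_stopping(val_losses, patience=2):
    """Stop once validation loss fails to improve for `patience` consecutive epochs.

    `val_losses` stands in for per-epoch validation results of a real training run.
    Returns (epoch of best loss, epoch at which training stopped).
    """
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0  # improvement: reset patience
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch  # plateau detected: halt early
    return best_epoch, len(val_losses) - 1

# Loss plateaus after epoch 3, so the run halts at epoch 5 instead of using all 8.
print(train_with_early_stopping([0.9, 0.7, 0.6, 0.5, 0.52, 0.51, 0.50, 0.49]))
```

Note the trade-off stated above: the curve eventually dips to 0.49 at epoch 7, which the early stop sacrifices in exchange for the saved compute.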
+ +## Subskills + +* [[skill-tree:bda:5:5:1:b]] +* [[skill-tree:bda:5:5:2:b]] +* [[skill-tree:bda:5:5:3:b]] +* [[skill-tree:bda:5:5:4:b]] + +** Caution: All text is AI generated ** diff --git a/bda/5/b.txt b/bda/5/b.txt index ea5a9bd..1b74d79 100644 --- a/bda/5/b.txt +++ b/bda/5/b.txt @@ -1,28 +1,25 @@ # BDA5 Machine Learning -Machine learning techniques are essential for extracting valuable insights and predictions from big data. This module covers the principles, algorithms, and applications of machine learning in the context of big data analytics. - -## Requirements +This node encompasses foundational and advanced skills for building, training, evaluating, and optimizing machine learning models in HPC environments. It brings together concepts from classical ML, deep learning, software frameworks, system optimization, and hyperparameter tuning. ## Learning Outcomes -* **Understand the fundamentals** of machine learning, including supervised learning, unsupervised learning, and reinforcement learning. -* **Explore the principles** of feature engineering and feature selection for preparing input data for machine learning models. -* **Analyze the role** of data preprocessing techniques such as normalization, standardization, and missing value imputation in improving model performance. -* **Understand the concepts** of model evaluation, including performance metrics such as accuracy, precision, recall, and F1-score. -* **Explore the concepts** of cross-validation and hyperparameter tuning for optimizing model performance and generalization. -* **Analyze the architecture** of popular machine learning libraries and frameworks such as scikit-learn, TensorFlow, and PyTorch. -* **Understand the principles** of linear regression and logistic regression for solving regression and classification problems, respectively. -* **Explore the concepts** of decision trees, random forests, and gradient boosting for building ensemble learning models. 
-* **Analyze the principles** of support vector machines (SVMs) for binary classification and kernel methods for nonlinear decision boundaries. -* **Understand the concepts** of clustering algorithms such as k-means, hierarchical clustering, and density-based clustering for unsupervised learning tasks. -* **Explore the role** of dimensionality reduction techniques such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) for visualizing and compressing high-dimensional data. -* **Analyze the principles** of neural networks and deep learning architectures for solving complex machine learning problems. -* **Understand the concepts** of convolutional neural networks (CNNs) for image classification and object detection tasks. -* **Explore the role** of recurrent neural networks (RNNs) and long short-term memory (LSTM) networks for sequence modeling and natural language processing tasks. -* **Analyze the principles** of generative adversarial networks (GANs) for generating synthetic data and images. -* **Understand the concepts** of transfer learning and fine-tuning pre-trained models for domain adaptation and task-specific learning. -* **Explore the role** of autoencoders and variational autoencoders (VAEs) for unsupervised feature learning and data generation. -* **Analyze the principles** of reinforcement learning algorithms such as Q-learning and deep Q-networks (DQN) for learning optimal decision-making policies. +* Explain the fundamental differences between supervised, unsupervised, and deep learning approaches. +* Interpret evaluation metrics for different model types and assess performance quality. +* Describe how neural networks are structured and how architecture choice affects modeling tasks. +* Use major ML/DL frameworks such as PyTorch and TensorFlow to implement models. +* Apply distributed training techniques to scale model development across HPC resources. 
+* Tune performance using batch sizing, mixed precision, and checkpointing for long-running jobs.
+* Design and execute hyperparameter search experiments efficiently at scale.
+* Implement resource-aware tuning strategies that consider runtime, memory, and energy trade-offs.
+* Integrate optimization, monitoring, and reproducibility practices for ML workflows on HPC platforms.
+
+## Subskills
+
+* [[skill-tree:bda:5:1:b]]
+* [[skill-tree:bda:5:2:b]]
+* [[skill-tree:bda:5:3:b]]
+* [[skill-tree:bda:5:4:b]]
+* [[skill-tree:bda:5:5:b]]
 
-AI generated content
+** Caution: All text is AI generated **
diff --git a/bda/7/1/b.txt b/bda/7/1/b.txt
new file mode 100644
index 0000000..7463189
--- /dev/null
+++ b/bda/7/1/b.txt
@@ -0,0 +1,19 @@
+# BDA7.1 Cross-Validation
+
+This skill introduces cross-validation techniques used to assess model generalization and prevent overfitting. It focuses on different strategies, statistical robustness, and integration with HPC workflows.
+
+## Requirements
+
+* External: Familiarity with model training and evaluation
+* Internal: BDA5.1.2 Evaluation Metrics (recommended)
+
+## Learning Outcomes
+
+* Define cross-validation and explain its role in evaluating model generalization.
+* Compare k-fold, stratified, and leave-one-out cross-validation strategies.
+* Implement cross-validation efficiently on large datasets in parallel environments.
+* Use cross-validation results to guide model selection and hyperparameter tuning.
+* Evaluate statistical significance and variance across validation folds.
+
+
+** Caution: All text is AI generated **
diff --git a/bda/7/2/b.txt b/bda/7/2/b.txt
new file mode 100644
index 0000000..02e0aa0
--- /dev/null
+++ b/bda/7/2/b.txt
@@ -0,0 +1,19 @@
+# BDA7.2 Scalable Model Evaluation
+
+This skill focuses on evaluating AI and ML models at scale in HPC environments. It covers techniques for benchmarking accuracy, performance, and resource efficiency across large datasets and distributed systems.
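As a concrete aside for readers working through BDA7.1, the k-fold strategy named in its learning outcomes can be sketched in a few lines of plain Python. This is a minimal, standard-library-only illustration; the function names and the scoring callback are ours, not part of the skill tree.

```python
# Minimal k-fold cross-validation sketch (standard library only).
# score_fn is a placeholder for a real train-then-evaluate step.

def kfold_indices(n_samples, k):
    """Yield (train, test) index lists covering k roughly equal folds."""
    indices = list(range(n_samples))
    start = 0
    for i in range(k):
        # Spread the remainder over the first (n_samples % k) folds.
        size = n_samples // k + (1 if i < n_samples % k else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(score_fn, n_samples, k=5):
    """Average a per-fold score; score_fn(train, test) trains and evaluates."""
    scores = [score_fn(train, test) for train, test in kfold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

The stratified and leave-one-out variants compared in BDA7.1 change only how the indices are partitioned: grouping by class label, or setting `k = n_samples`.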
+
+## Requirements
+
+* External: Understanding of evaluation metrics and HPC resource usage
+* Internal: BDA7.1 Cross-Validation (recommended)
+
+## Learning Outcomes
+
+* Design scalable evaluation pipelines for large datasets and models.
+* Benchmark model performance across multiple hardware configurations or datasets.
+* Monitor compute, memory, and I/O usage during model evaluation.
+* Compare models using normalized, reproducible metrics and visualizations.
+* Ensure fair evaluation through consistent preprocessing, baselining, and version control.
+
+
+** Caution: All text is AI generated **
diff --git a/bda/7/b.txt b/bda/7/b.txt
new file mode 100644
index 0000000..6123380
--- /dev/null
+++ b/bda/7/b.txt
@@ -0,0 +1,16 @@
+# BDA7 Evaluation and Benchmarking
+
+This node introduces the principles and tools for evaluating machine learning models in large-scale HPC environments. It covers statistical evaluation methods, reproducibility techniques, and scalable benchmarking strategies for AI workloads.
+
+## Learning Outcomes
+
+* Apply statistical evaluation techniques to assess the generalization and reliability of ML models.
+* Use cross-validation and related methods to quantify model performance.
+* Benchmark AI models at scale with a focus on consistency, fairness, and comparability.
+
+## Subskills
+
+* [[skill-tree:bda:7:1:b]]
+* [[skill-tree:bda:7:2:b]]
+
+** Caution: All text is AI generated **
diff --git a/bda/b.txt b/bda/b.txt
index c6da705..a0aef42 100644
--- a/bda/b.txt
+++ b/bda/b.txt
@@ -14,6 +14,7 @@ AI can be used inside simulations or to steer workflows, while data science can
 * Contrast the different steps in data processing from preparation, to post-processing, and visualization and analysis.
 * Describe and apply the concepts of artificial intelligence and data science.
 * Compose a workflow consisting of HPC and big data tools to analyze the data.
+* Apply statistical evaluation techniques to assess the generalization and reliability of ML models, and benchmark AI models at scale with a focus on consistency, fairness, and comparability.
 
 # Subskills
 
@@ -21,6 +22,6 @@ AI can be used inside simulations or to steer workflows, while data science can
 * [[skill-tree:bda:2:b]]
 * [[skill-tree:bda:3:b]]
 * [[skill-tree:bda:4:b]]
-* [[skill-tree:bda:6:b]]
 * [[skill-tree:bda:5:b]]
 * [[skill-tree:bda:6:b]]
+* [[skill-tree:bda:7:b]]
diff --git a/skill-tree.mm b/skill-tree.mm
index b62c693..3100e8d 100644
--- a/skill-tree.mm
+++ b/skill-tree.mm
[hunks omitted: the XML node markup of the mind-map file was stripped during text extraction and cannot be reconstructed]

From 5aa8d44207cc8633d12f6b2f9440769b2d463c16 Mon Sep 17 00:00:00 2001
From: Kevin Luedemann
Date: Wed, 6 Aug 2025 08:49:02 +0200
Subject: [PATCH 2/2] Update AI skill tree part

---
 skill-tree.mm | 117 ++++++++++++++++++++++++++------------------------
 use/8/1/b.txt |  17 ++++++++
 use/8/2/b.txt |  16 +++++++
 use/8/b.txt   |  20 +++++++++
 4 files changed, 115 insertions(+), 55 deletions(-)
 create mode 100644 use/8/1/b.txt
 create mode 100644 use/8/2/b.txt
 create mode 100644 use/8/b.txt

diff --git a/skill-tree.mm b/skill-tree.mm
index 3100e8d..a9d3752 100644
--- a/skill-tree.mm
+++ b/skill-tree.mm
[hunks omitted: the XML node markup of the mind-map file was stripped during text extraction and cannot be reconstructed]
diff --git a/use/8/1/b.txt b/use/8/1/b.txt
new file mode 100644
index 0000000..0832692
--- /dev/null
+++ b/use/8/1/b.txt
@@ -0,0 +1,17 @@
+# USE8.1 Prompt Engineering
+
+Prompt engineering focuses on how users interact with AI models by crafting input (prompts) that guides the output behavior.
+At its core, this skill explores how the structure, wording, and formatting of prompts can significantly affect the results returned by large language models.
+It provides users with practical techniques to phrase questions, give context, and evaluate output, all without needing to understand the internal mechanics of the models.
+The skill includes key strategies to refine outputs and avoid common mistakes in prompt usage.
+
+## Requirements
+
+## Learning Outcomes
+
+* Understand how the structure and phrasing of a prompt can influence the output of a language model.
+* Identify common prompt formats used for AI tasks such as summarization, generation, classification, and translation.
+* Recognize and avoid common prompt-related errors (e.g., ambiguous wording, lack of context).
+* Refine prompts iteratively to improve output quality.
+* Apply techniques such as role prompting, few-shot prompting, and instruction-based prompting.
+
diff --git a/use/8/2/b.txt b/use/8/2/b.txt
new file mode 100644
index 0000000..a94a72b
--- /dev/null
+++ b/use/8/2/b.txt
@@ -0,0 +1,16 @@
+# USE8.2 AI Infrastructure
+
+This skill introduces users to AI services hosted on HPC systems via web and API interfaces.
+Rather than deploying or managing models, users focus on accessing and using them through user-friendly platforms such as web dashboards or APIs.
+The emphasis is on understanding how to send queries to models, interpret responses, and integrate them into tasks.
+Part of this skill is becoming familiar with tokens, endpoints, and response handling, which is essential for using AI tools programmatically.
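To make the few-shot and role-prompting techniques listed under USE8.1 concrete, here is a minimal sketch of a prompt template in Python. The classification task, the example reviews, and the template wording are invented purely for illustration.

```python
# Few-shot prompt sketch: a role line, an instruction, two worked
# examples, and a slot for the user's input. Content is illustrative.
FEW_SHOT_TEMPLATE = """\
You are a concise review classifier.

Classify the sentiment of each review as positive or negative.

Review: "The new cluster queue is fast."
Sentiment: positive

Review: "My job crashed twice with no error log."
Sentiment: negative

Review: "{review}"
Sentiment:"""

def build_prompt(review: str) -> str:
    """Fill the user's review into the few-shot template."""
    return FEW_SHOT_TEMPLATE.format(review=review)
```

Ending the prompt at `Sentiment:` nudges the model to complete with just a label, which is the kind of structural choice USE8.1 asks learners to reason about.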
+
+## Requirements
+
+## Learning Outcomes
+
+* Access AI tools via a web interface and understand basic interaction workflows.
+* Send requests to AI models using API interfaces and interpret structured responses.
+* Understand how to use API keys, endpoints, and basic request formatting (e.g., JSON).
+* Distinguish between synchronous and asynchronous inference requests.
+* Integrate simple AI responses into command-line or scripting workflows for automation.
diff --git a/use/8/b.txt b/use/8/b.txt
new file mode 100644
index 0000000..6eed9fa
--- /dev/null
+++ b/use/8/b.txt
@@ -0,0 +1,20 @@
+# USE8 AI Services
+
+AI services on HPC systems offer users access to large language models and other AI tools through interactive and programmatic interfaces.
+These services are designed to be simple to use, requiring no deep background in AI or infrastructure.
+Whether through web applications or basic scripting, users can explore, prompt, and benefit from modern AI capabilities.
+This skill introduces the general principles of interacting with AI systems using prompts, interfaces, and hosted models.
+
+## Requirements
+
+## Learning Outcomes
+
+* Understand the purpose and availability of AI services within an HPC environment.
+* Identify common methods of interacting with AI models, including prompt-based and API-based usage.
+* Describe the difference between web interface access and programmatic access to AI tools.
+* Recognize the scope and limitations of using hosted AI tools without managing the underlying infrastructure.
+
+## Subskills
+
+* [[skill-tree:use:8:1:b]]
+* [[skill-tree:use:8:2:b]]
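The token/endpoint/JSON workflow described in USE8.2 can be sketched with Python's standard library alone. The endpoint URL, payload fields, and bearer-token scheme below are hypothetical placeholders, since the skill tree names no specific service; a real deployment would document its own request format.

```python
import json
from urllib import request

# Hypothetical inference endpoint; substitute your site's documented URL.
ENDPOINT = "https://hpc.example.org/api/v1/completions"

def build_request(prompt, api_key, max_tokens=128):
    """Format a JSON inference request with bearer-token authentication."""
    payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens})
    return request.Request(
        ENDPOINT,
        data=payload.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def query(prompt, api_key):
    """Send a blocking (synchronous) request and decode the JSON response."""
    with request.urlopen(build_request(prompt, api_key)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

`query` is the synchronous style from the learning outcomes: it blocks until the model replies. An asynchronous service would instead return a job ID from the first call and expose a second endpoint to poll for the result.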