
Conversation

@zhencliu (Contributor) commented Oct 5, 2025

Depends on #4242

YongxueHong and others added 11 commits September 24, 2025 16:57
Introduces VT Agent, a new XML-RPC based agent for Avocado VT.

This agent allows for remote execution of commands and services on a
target system. It features a core RPC server, dynamic service loading,
and basic API functions for agent control and logging.

Key components include:
- Main application entry point and argument parsing (`__main__.py`, `app/`).
- Core server logic, service discovery, data/logging utilities (`core/`).
- Example services (`services/examples/`) to demonstrate extensibility.

This provides a foundation for more advanced remote interactions
and test automation capabilities within Avocado VT.

Signed-off-by: Yongxue Hong <[email protected]>
Co-authored-by: Xu Han <[email protected]>
Co-authored-by: Zhenchao Liu <[email protected]>
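To illustrate the core idea, here is a minimal sketch of exposing a service over XML-RPC in the spirit of the agent's RPC server; the module, port, and service names are illustrative only and are not the actual vt_agent API:

    # Minimal sketch of an XML-RPC agent exposing a "hello" example service
    # and a control API; names are placeholders, not vt_agent's real ones.
    from xmlrpc.server import SimpleXMLRPCServer


    def hello(name):
        """Example service function, in the spirit of services/examples/hello.py."""
        return "Hello, %s" % name


    def quit_agent():
        """Placeholder for an agent-control API (e.g. stopping the server)."""
        return True


    if __name__ == "__main__":
        server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
        # Services are registered under dotted names so the controller can
        # call e.g. proxy.examples.hello("world").
        server.register_function(hello, "examples.hello")
        server.register_function(quit_agent, "api.quit")
        server.serve_forever()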
This commit introduces a new `vt_cluster` module to enable distributed,
multi-host testing within the `virttest` framework. It provides the
foundation for orchestrating complex test scenarios that span across
multiple machines (nodes).

The new module implements a controller-agent architecture:
- A central controller orchestrates tests and manages the state of the cluster.
- Remote agents run on test nodes, executing commands received via XML-RPC.

Key components of this new feature include:

- **`Cluster` (`__init__.py`):** A singleton object that manages the
  state of all nodes and partitions in the cluster. It persists the
  cluster state to a file.

- **`Node` (`node.py`):** Represents a single machine (remote or controller).
  It handles agent setup, lifecycle management (start/stop), and log
  collection on remote nodes using SSH and SCP.

- **`Partition` (`__init__.py`):** A logical grouping of nodes allocated
  for a specific test run, allowing for resource isolation.

- **`_ClientProxy` (`proxy.py`):** An XML-RPC client proxy for
  communication between the controller and agents.

This framework allows tests to request a partition of nodes, execute
commands on them, and collect logs centrally, which is essential for
multi-host migration tests and other distributed scenarios.

Signed-off-by: Yongxue Hong <[email protected]>
Co-authored-by: Xu Han <[email protected]>
Co-authored-by: Zhenchao Liu <[email protected]>
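A rough sketch of the controller side of this architecture, approximating the role of `_ClientProxy` in `proxy.py`; the URL, port, and service name are assumptions, not the module's real defaults:

    # The controller talks to a remote agent over XML-RPC; a dotted call is
    # executed on the node and the result is returned to the controller.
    from xmlrpc.client import ServerProxy

    agent = ServerProxy("http://host1:8000", allow_none=True)
    print(agent.examples.hello("controller"))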
This commit introduces a centralized logging server for the `vt_cluster`
framework. The new `logger.py` module provides a `LoggerServer` that
listens for log records from remote agents, enabling a unified view of
events across the entire distributed environment.

Key features:
- A `LoggerServer` that runs on the controller node and collects logs from
  all registered nodes.
- Log records are serialized to JSON for secure transmission over the
  network.
- Each log message is tagged with its originating node name and IP address,
  providing a clear, chronological stream of logs from the entire cluster.

This new logging mechanism simplifies debugging and monitoring of distributed
tests by consolidating all log output into a single location.

Signed-off-by: Yongxue Hong <[email protected]>
Co-authored-by: Xu Han <[email protected]>
Co-authored-by: Zhenchao Liu <[email protected]>
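An illustrative sketch of a log-collecting server that receives JSON-encoded records and tags them with the sending node, loosely mirroring what `LoggerServer` is described to do; the port and the one-record-per-line framing are assumptions:

    # Toy log server: each connection streams JSON log records, one per line,
    # which are tagged with the node name and the sender's IP address.
    import json
    import logging
    import socketserver

    logging.basicConfig(level=logging.INFO)
    LOG = logging.getLogger("cluster")


    class _LogHandler(socketserver.StreamRequestHandler):
        def handle(self):
            for line in self.rfile:
                record = json.loads(line)
                node_ip = self.client_address[0]
                LOG.info("[%s %s] %s", record.get("node", "?"), node_ip,
                         record.get("msg", ""))


    if __name__ == "__main__":
        with socketserver.ThreadingTCPServer(("0.0.0.0", 9020), _LogHandler) as srv:
            srv.serve_forever()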
This commit introduces a framework for collecting and managing properties
from all nodes within a virt test cluster. This provides a centralized
and persistent way for tests to access node-specific hardware and software
configurations.

The new module handles the collection, caching, and retrieval of this data.
On initialization, it gathers information from each node and stores it in
a JSON file.

As an initial implementation, the following metadata is collected:
- Hostname
- CPU vendor and model name

To support this, a new agent service has been added to expose CPU details
from the worker nodes. This framework is designed to be extensible for
gathering more properties in the future.

Signed-off-by: Yongxue Hong <[email protected]>
Co-authored-by: Xu Han <[email protected]>
Co-authored-by: Zhenchao Liu <[email protected]>
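A minimal sketch of collecting per-node properties and caching them to a JSON file; the property names follow the commit message (hostname, CPU vendor and model), while the /proc/cpuinfo parsing, file path, and helper names are illustrative assumptions for a Linux worker node:

    import json
    import platform


    def collect_node_properties():
        # Gather hostname plus CPU vendor/model from /proc/cpuinfo (Linux only).
        info = {"hostname": platform.node()}
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("vendor_id"):
                    info["cpu_vendor"] = line.split(":", 1)[1].strip()
                elif line.startswith("model name"):
                    info["cpu_model"] = line.split(":", 1)[1].strip()
        return info


    def cache_properties(nodes_info, path="cluster_metadata.json"):
        # Persist the collected properties so later tests can read them back.
        with open(path, "w") as f:
            json.dump(nodes_info, f, indent=2)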
This commit introduces a new node selection mechanism to the vt_cluster
framework.

The selector allows tests to dynamically request nodes based on specific
attributes, rather than relying on hardcoded node names. This enables more
flexible and resource-aware test scheduling.

Key components:
 - selector.py: A new module containing the core selection logic.
 - select_node(): The main function that filters a list of candidate nodes
                  based on a set of selector rules.
 - _NodeSelector: A class that matches rules against node metadata.
 - _MatchExpression & _Operator: Helper classes for parsing and executing
                                 selection rules.

The selector works by querying properties from active agents on remote nodes.
It uses a simple expression format (key, operator, values) to define
requirements such as CPU vendor, memory size, or other custom metadata.

This change empowers tests to specify their hardware or software needs,
for example: "select a node where memory_gb >= 32"

Signed-off-by: Yongxue Hong <[email protected]>
Co-authored-by: Xu Han <[email protected]>
Co-authored-by: Zhenchao Liu <[email protected]>
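A sketch of the (key, operator, values) expression idea behind the selector; the operator set and rule syntax below are assumed for illustration and are not taken from selector.py itself:

    # Each rule is (key, operator, values); a node matches if every rule holds
    # against its metadata.
    _OPERATORS = {
        "==": lambda actual, expected: str(actual) == expected[0],
        ">=": lambda actual, expected: float(actual) >= float(expected[0]),
        "in": lambda actual, expected: str(actual) in expected,
    }


    def select_node(candidates, rules):
        """Return the first node whose metadata satisfies every rule."""
        for node in candidates:
            if all(_OPERATORS[op](node.get(key), values)
                   for key, op, values in rules):
                return node
        return None


    # e.g. "select a node where memory_gb >= 32 and the CPU vendor is AMD"
    nodes = [{"name": "host1", "memory_gb": 16, "cpu_vendor": "GenuineIntel"},
             {"name": "host2", "memory_gb": 64, "cpu_vendor": "AuthenticAMD"}]
    print(select_node(nodes, [("memory_gb", ">=", ["32"]),
                              ("cpu_vendor", "in", ["AuthenticAMD"])]))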
This change introduces support for setting up a multi-host testing
environment directly through the vt-bootstrap command.

A new command-line option, --vt-cluster-config, has been added. This
option accepts a path to a JSON file that defines the cluster topology,
including the hosts and a central controller.

The bootstrap process now includes steps to:
 - Parse the provided cluster configuration file.
 - Register each host as a node in the cluster, preparing its agent
   environment.

Signed-off-by: Yongxue Hong <[email protected]>
Co-authored-by: Xu Han <[email protected]>
Co-authored-by: Zhenchao Liu <[email protected]>
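The shape of such a configuration file and how bootstrap might consume it could look like the sketch below; the key names ("controller", "hosts", "address") are assumptions based on the description above, not the exact schema:

    import json


    def register_cluster(config_path):
        # Parse the cluster topology and register each host as a node,
        # preparing its agent environment.
        with open(config_path) as f:
            topology = json.load(f)
        controller = topology.get("controller", {})
        for name, host in topology.get("hosts", {}).items():
            print("registering node %s at %s" % (name, host.get("address")))
        return controller, topology.get("hosts", {})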
This commit introduces the `VTCluster` plugin, designed to manage the
lifecycle of a multi-node test environment within Avocado-VT.

The plugin hooks into the job's pre-test and post-test phases to
automate cluster management:

- In `pre_tests`, it initializes the test environment by starting agent
  servers on all configured cluster nodes and loading necessary metadata.
  Setup failures are treated as fatal, terminating the job to prevent
  tests from running in a misconfigured environment.

- In `post_tests`, it handles the teardown process. This includes
  stopping the agent servers, collecting their logs into a dedicated
  `cluster/` directory within the job results, and unloading metadata.
  The cleanup process is designed to be robust, logging failures without
  halting execution to ensure as much cleanup and log collection as
  possible is performed.

Custom exceptions are included for clear error reporting, and the
structure provides placeholders for future cluster manager logic.

Signed-off-by: Yongxue Hong <[email protected]>
Co-authored-by: Xu Han <[email protected]>
Co-authored-by: Zhenchao Liu <[email protected]>
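A skeleton of a job pre-/post-tests plugin in the style described above; the real plugin lives in avocado_vt/plugins/vt_cluster.py and will differ, so treat this as an outline only:

    from avocado.core.plugin_interfaces import JobPostTests, JobPreTests


    class VTCluster(JobPreTests, JobPostTests):

        name = "vt-cluster"
        description = "Manages the multi-node cluster lifecycle for a job"

        def pre_tests(self, job):
            # Start agent servers on all configured nodes and load metadata;
            # a failure here is fatal so tests never run against a
            # misconfigured cluster.
            pass

        def post_tests(self, job):
            # Stop agents and collect their logs into <job results>/cluster/,
            # logging (not raising) any cleanup failure so teardown continues.
            pass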
This commit introduces the core capability to run tests across multiple hosts.
It extends the testing framework to manage a cluster of nodes, allowing tests
to orchestrate and validate complex scenarios that involve distributed
systems.

Key features include:
- A cluster management system for allocating and releasing nodes.
- A mechanism for tests to request and interact with multiple remote machines.
- Centralized log collection from all participating nodes to simplify debugging.
- Robust setup and teardown logic to ensure a clean test environment.

Signed-off-by: Yongxue Hong <[email protected]>
Co-authored-by: Xu Han <[email protected]>
Co-authored-by: Zhenchao Liu <[email protected]>

coderabbitai bot commented Oct 5, 2025

Warning

Rate limit exceeded

@zhencliu has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 21 minutes and 58 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 72943c1 and 58b6738.

📒 Files selected for processing (82)
  • avocado_vt/plugins/vt_bootstrap.py (1 hunks)
  • avocado_vt/plugins/vt_cluster.py (1 hunks)
  • avocado_vt/test.py (6 hunks)
  • avocado_vt/vt_agent/pyproject.toml (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/README.md (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/__main__.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/app/__init__.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/app/args.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/app/cmd.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/core/__init__.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/core/data_dir.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/core/logger.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/core/rpc/__init__.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/core/rpc/server.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/core/rpc/service.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/__init__.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/connect.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/console.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/image_manager.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/images/__init__.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/images/qemu/__init__.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/images/qemu/qemu_image_handlers.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/resource_backing_manager.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/resource_backings/__init__.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/resource_backings/backing.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/resource_backings/pool_connection.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/resource_backings/storage/__init__.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/resource_backings/storage/dir/__init__.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/resource_backings/storage/dir/dir_pool_connection.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/resource_backings/storage/dir/dir_volume_backing.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/resource_backings/storage/file_volume_backing.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/resource_backings/storage/nfs/__init__.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/resource_backings/storage/nfs/nfs_pool_connection.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/resource_backings/storage/nfs/nfs_volume_backing.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/managers/resource_backings/storage/volume_backing.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/services/core.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/services/examples/hello.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/services/host/cpu.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/services/host/platform.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/services/image.py (1 hunks)
  • avocado_vt/vt_agent/src/avocado_vt/agent/services/resource.py (1 hunks)
  • setup.py (1 hunks)
  • spell.ignore (8 hunks)
  • virttest/bootstrap.py (5 hunks)
  • virttest/env_process.py (15 hunks)
  • virttest/vt_cluster/README.md (1 hunks)
  • virttest/vt_cluster/__init__.py (1 hunks)
  • virttest/vt_cluster/logger.py (1 hunks)
  • virttest/vt_cluster/node.py (1 hunks)
  • virttest/vt_cluster/node_properties.py (1 hunks)
  • virttest/vt_cluster/proxy.py (1 hunks)
  • virttest/vt_cluster/selector.py (1 hunks)
  • virttest/vt_imgr/__init__.py (1 hunks)
  • virttest/vt_imgr/logical_image_manager.py (1 hunks)
  • virttest/vt_imgr/logical_images/__init__.py (1 hunks)
  • virttest/vt_imgr/logical_images/layer_image.py (1 hunks)
  • virttest/vt_imgr/logical_images/logical_image.py (1 hunks)
  • virttest/vt_imgr/logical_images/qemu/__init__.py (1 hunks)
  • virttest/vt_imgr/logical_images/qemu/images/__init__.py (1 hunks)
  • virttest/vt_imgr/logical_images/qemu/images/luks_qemu_layer_image.py (1 hunks)
  • virttest/vt_imgr/logical_images/qemu/images/qcow2_qemu_layer_image.py (1 hunks)
  • virttest/vt_imgr/logical_images/qemu/images/raw_qemu_layer_image.py (1 hunks)
  • virttest/vt_imgr/logical_images/qemu/qemu_layer_image.py (1 hunks)
  • virttest/vt_imgr/logical_images/qemu/qemu_logical_image.py (1 hunks)
  • virttest/vt_resmgr/__init__.py (1 hunks)
  • virttest/vt_resmgr/resource_manager.py (1 hunks)
  • virttest/vt_resmgr/resources/__init__.py (1 hunks)
  • virttest/vt_resmgr/resources/pool.py (1 hunks)
  • virttest/vt_resmgr/resources/pool_selector.py (1 hunks)
  • virttest/vt_resmgr/resources/resource.py (1 hunks)
  • virttest/vt_resmgr/resources/storage/__init__.py (1 hunks)
  • virttest/vt_resmgr/resources/storage/block_volume.py (1 hunks)
  • virttest/vt_resmgr/resources/storage/dir/__init__.py (1 hunks)
  • virttest/vt_resmgr/resources/storage/dir/dir_pool.py (1 hunks)
  • virttest/vt_resmgr/resources/storage/dir/dir_volume.py (1 hunks)
  • virttest/vt_resmgr/resources/storage/file_volume.py (1 hunks)
  • virttest/vt_resmgr/resources/storage/net_volume.py (1 hunks)
  • virttest/vt_resmgr/resources/storage/nfs/__init__.py (1 hunks)
  • virttest/vt_resmgr/resources/storage/nfs/nfs_pool.py (1 hunks)
  • virttest/vt_resmgr/resources/storage/nfs/nfs_volume.py (1 hunks)
  • virttest/vt_resmgr/resources/storage/volume.py (1 hunks)
  • virttest/vt_utils/image/qemu.py (1 hunks)


@zhencliu force-pushed the env_setup branch 13 times, most recently from 234ffa5 to 9c7352a on October 6, 2025 at 16:12
zhencliu and others added 5 commits October 7, 2025 17:15
Introduced a comprehensive unified resource management system,
establishing centralized coordination of test resources across
cluster worker nodes.

Master Node Resource Management:
  - PoolSelector with configurable criteria-based matching
  - ResourceManager: Central coordinator managing all pools and resources
  - ResourcePool: Collections of resources accessible by specific nodes
  - Resource: Individual assets (volumes, ports, etc.) within pools

Worker Node Resource Backing management:
  - Resource service API: exposing operations via cluster node proxy
  - ResourceBackingManager: handling node-local resource implementations
  - ResourcePoolConnection: handling pool connectivity on worker nodes
  - ResourceBacking: Node-specific implementations of resources

Signed-off-by: Zhenchao Liu <[email protected]>
Co-authored-by: Xu Han <[email protected]>
Co-authored-by: Yongxue Hong <[email protected]>
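A toy illustration of criteria-based pool selection as described above; the attribute names and rules are invented for the example and do not reflect the real PoolSelector interface:

    def select_pool(pools, node, pool_type):
        """Pick the first pool of the given type that the node can access."""
        for pool in pools:
            if pool["type"] == pool_type and node in pool["access"]["nodes"]:
                return pool
        return None


    pools = [{"name": "dir_pool1", "type": "filesystem",
              "access": {"nodes": ["host1"]}},
             {"name": "nfs_pool", "type": "nfs",
              "access": {"nodes": ["host1", "host2"]}}]
    print(select_pool(pools, "host2", "nfs"))  # -> the nfs_pool entry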
bootstrap: Set up/clean up the resource manager. Register all resource pools
configured by the user at bootstrap. As avocado-vt doesn't support an
environment-level cleanup, we need to do the cleanup before the setup.
vt_cluster: Start up/tear down the resource manager. Attach/detach all
configured resource pools to/from their accessing nodes.

Signed-off-by: Zhenchao Liu <[email protected]>
Signed-off-by: Zhenchao Liu <[email protected]>
A local filesystem storage pool can supply file-based volumes.
A filesystem pool can be attached (connected) from only one worker
node. Users can configure a filesystem pool in cluster.json:
      "dir_pool1": {
        "type": "filesystem",
        "path": "/home/dirpool1",
        "access": {
          "nodes": ["host1"]
        }
      },

Required:
  - type: filesystem
Optional:
  - path: Use get_data_dir()/root_dir by default
  - access: Use all worker nodes of the cluster by default; note that
            access.nodes must be set if there is more than one node
            in the cluster.

If no filesystem pool is configured in cluster.json, one is created
for each worker node by default.

Signed-off-by: Zhenchao Liu <[email protected]>
An NFS storage pool can supply file-based volume resources.
An NFS pool can be attached (connected) from more than one worker
node. Users can configure an NFS pool in cluster.json:
      "nfs_pool": {
        "type": "nfs",
	"server": "nfs-server-host",
        "export": "/nfs/exported/dir",
	"mount_point": {
		"host1": "/var/tmp/mnt",
		"host2": "/tmp/mnt"
	},
	"mount_options": {"*": "rw"},
        "access": {
          "nodes": ["host1", "host2"]
        }
      }
Required:
  - type: nfs
  - server: The NFS server hostname or IP
  - export: The exported directory
Optional:
  - mount_point: Use get_data_dir()/nfs_mnt/{server} by default
  - mount_options: Use NFS's default options by default
  - access: Use all worker nodes of the cluster by default

Signed-off-by: Zhenchao Liu <[email protected]>
zhencliu and others added 4 commits October 7, 2025 17:20
Introduced the foundational image management infrastructure that
enables hierarchical image handling with distributed storage coordination.

Master Node Components:
- LogicalImageManager: Coordinates complex image topologies and delegates
  storage operations to the unified resource management system
- LogicalImage/Image abstractions: Define hierarchical image structures
  where each layer can be stored across different resource pools

Worker Node Components:
- ImageHandlerManager: Executes image operations (clone, update) on worker nodes
- Image service interface: Provides RPC endpoints for distributed image operations

Key Features:
- Hierarchical Image Support: Enables complex image topologies, e.g.
  backing chains for QEMU image snapshots
- Resource Integration: Images backed by volume resources managed through
  the unified resource system
- Distributed Operations: Clone and update operations coordinated across
  cluster nodes
- Extensible Architecture: Plugin-based design for different image types
  and formats

Architecture Flow:
  LogicalImageManager → LogicalImage → Image → Volume Resource → Storage Backing

The infrastructure provides the foundation for advanced image operations like
snapshot management, image cloning, and distributed image access while
maintaining tight integration with the cluster resource management system.

Signed-off-by: Zhenchao Liu <[email protected]>
Co-authored-by: Xu Han <[email protected]>
Co-authored-by: Yongxue Hong <[email protected]>
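A very small model of the hierarchy named in the "Architecture Flow" above; the class and attribute names are simplified stand-ins for the vt_imgr classes, not their real definitions:

    class Image:
        """One layer of a logical image, backed by a volume resource."""
        def __init__(self, name, volume_id, backing=None):
            self.name = name
            self.volume_id = volume_id   # handle into the resource manager
            self.backing = backing       # previous layer in the backing chain


    class LogicalImage:
        """An ordered chain of layers, e.g. a base qcow2 plus snapshots."""
        def __init__(self, name):
            self.name = name
            self.layers = []

        def add_layer(self, image):
            if self.layers:
                image.backing = self.layers[-1]
            self.layers.append(image)


    guest = LogicalImage("guest-image")
    guest.add_layer(Image("base", volume_id="vol-1"))
    guest.add_layer(Image("snap1", volume_id="vol-2"))  # backed by "base"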
When the cluster feature is enabled (checked by whether nodes is set), use
the image management system to handle the images defined in the 'images'
param in the pre-/post-processing.

Signed-off-by: Zhenchao Liu <[email protected]>