Feature: dry run support for deploy command#106
Conversation
…e checkers - Add DryRunEngine orchestrating phase-specific validation checkers - Implement 11 specialized checkers (bootstrap, bundle, catalog, connectivity, dependency, git, manifest, permission, project, quicksight, storage, workflow) - Add PermissionChecker using iam:SimulatePrincipalPolicy for IAM validation - Add DependencyChecker validating pre-existing AWS resources and DataZone types - Add DryRunReport model collecting findings classified as OK/WARNING/ERROR - Integrate dry-run as pre-deployment validation step in deploy command - Add --dry-run flag for standalone validation mode - Add --skip-validation flag to bypass pre-deployment checks - Add --output option for JSON report generation - Include comprehensive unit and integration tests for all checkers - Add design, requirements, and testing documentation - Update CLI and project dependencies for dry-run support
|
|
||
| ## Introduction | ||
|
|
||
| The Deploy Dry Run feature adds a `--dry-run` option to the existing `smus-cicd deploy` command. When enabled, the CLI walks through every phase of the deployment pipeline — manifest loading, bundle exploration, project initialization, storage deployment, git deployment, catalog import, QuickSight dashboard deployment, workflow creation, and bootstrap actions — without creating, modifying, or deleting any actual resources. It also proactively verifies IAM permissions, S3 bucket accessibility, DataZone domain/project reachability, and catalog asset availability, producing a structured report of what would happen and any issues detected. The goal is to let operators confirm a deployment will succeed before committing to it, avoiding partial deployment failures. |
There was a problem hiding this comment.
Hmmm, are we saying
If the
--dry-rundeploy does not fail, the actual deploy will not fail
?
We might want to stay away from such claims because a lot of things (like notebooks for example) are in customer control and we cannot predict whether these will execute okay.
There was a problem hiding this comment.
Good catch, though i'm not able to see it anymore. Looks like its updated. Will do a search for it on other docs. Do keep an eye when you review the updated revision.
There was a problem hiding this comment.
Sorry, my comment was around this phrasing:
The goal is to let operators confirm a deployment will succeed before committing to it,...
The deployment will not 100% succeed even after successful dry-run. We need to make it clear
| 1. WHEN a bundle archive path is provided, THE Dry_Run_Engine SHALL open the bundle archive and enumerate all files contained within it. | ||
| 2. WHEN a bundle archive path is not provided, THE Dry_Run_Engine SHALL attempt to locate the bundle in the `./artifacts` directory using the same resolution logic as the deploy command. | ||
| 3. THE Dry_Run_Engine SHALL verify that each storage item referenced in the Deployment_Configuration has corresponding files in the bundle or on the local filesystem. | ||
| 4. THE Dry_Run_Engine SHALL verify that each git item referenced in the Deployment_Configuration has corresponding content in the bundle or is accessible via the configured repository URL. |
There was a problem hiding this comment.
I think we still do not have a github workflow for git stuff. We should add it to verify this.
| 2. WHEN dry-run mode is active, THE Permission_Checker SHALL verify that the current IAM identity has DataZone permissions (`datazone:GetDomain`, `datazone:GetProject`, `datazone:SearchListings`) required for the target domain and project. | ||
| 3. WHEN the Deployment_Configuration includes catalog assets, THE Permission_Checker SHALL verify that the current IAM identity has catalog import permissions (`datazone:CreateAsset`, `datazone:CreateGlossary`, `datazone:CreateGlossaryTerm`, `datazone:CreateFormType`). | ||
| 4. WHEN the manifest configures IAM role creation or update, THE Permission_Checker SHALL verify that the current IAM identity has `iam:CreateRole`, `iam:AttachRolePolicy`, and `iam:PutRolePolicy` permissions. | ||
| 5. WHEN the manifest configures QuickSight dashboard deployment, THE Permission_Checker SHALL verify that the current IAM identity has QuickSight permissions (`quicksight:DescribeDashboard`, `quicksight:CreateDashboard`, `quicksight:UpdateDashboard`). |
There was a problem hiding this comment.
There is actually another QuickSightServiceRole which is used by default to perform dashboard refresh where I faced issues last time: https://github.com/aws/CICD-for-SageMakerUnifiedStudio/tree/main/examples/analytic-workflow/dashboard-glue-quick#quicksight-dataset-refresh-fails
We might want to figure out, if we can set a different role for it in the examples
| 2. WHEN dry-run mode is active, THE Dry_Run_Engine SHALL simulate storage deployment and report the target S3 bucket, prefix, and file count for each storage item. | ||
| 3. WHEN dry-run mode is active, THE Dry_Run_Engine SHALL simulate git deployment and report the target connection, repository, and file count for each git item. | ||
| 4. WHEN dry-run mode is active AND the bundle contains catalog export data, THE Dry_Run_Engine SHALL simulate catalog import and report the count and types of catalog resources that would be created, updated, or deleted. | ||
| 5. WHEN dry-run mode is active AND the manifest configures QuickSight dashboards, THE Dry_Run_Engine SHALL simulate QuickSight deployment and report which dashboards would be exported and imported. |
There was a problem hiding this comment.
Wondering: do we have any more "special" resources like QuickSight? Does it make sense to create a unique logic for handling it?
There was a problem hiding this comment.
Technically, the CLI directly interacts only with Quicksight and MWAA serverless services. I can explore what's needed in addition for MWAA-S
|
|
||
| **User Story:** As a DevOps engineer, I want the dry run to verify that target AWS resources are reachable, so that I can detect network or configuration issues before deployment. | ||
|
|
||
| #### Acceptance Criteria |
There was a problem hiding this comment.
Should we add necessary connection checks to this as well?
- Add form type status validation in target domain to detect DISABLED form types - Implement _check_disabled_form_types method to query DataZone API for form type status - Resolve target_domain_id and target_region in dry-run engine after manifest validation - Add target_domain_id and target_region fields to DryRunContext for downstream checkers - Update catalog_import to re-enable DISABLED form types during import via create_form_type upsert - Add add_policy_grants integration test utility for catalog import workflows - Enhance cleanup_catalog_resources with improved resource deletion handling - Emit WARNING findings when form types exist but are DISABLED in target environment
| 2. WHEN the `--dry-run` flag is not provided, THE CLI SHALL execute the deploy command with its current behavior unchanged, preceded by an automatic dry-run validation step (see Requirement 14). | ||
| 3. THE CLI SHALL accept the `--dry-run` flag in combination with all existing deploy options (`--manifest`, `--targets`, `--bundle-archive-path`, `--emit-events/--no-events`, `--event-bus-name`). | ||
| 4. WHEN the `--dry-run` flag is provided, THE CLI SHALL suppress EventBridge event emission regardless of the `--emit-events` setting. | ||
| 5. THE CLI SHALL accept a `--skip-validation` flag that, when provided, skips the automatic pre-deployment dry-run validation step and proceeds directly to deployment. |
There was a problem hiding this comment.
Is this like --force? Maybe we should use that since we already use it for many other commands, though it is less descriptive
|
|
||
| 1. WHEN the deploy command is invoked without `--dry-run` and without `--skip-validation`, THE CLI SHALL automatically execute the Dry_Run_Engine as a pre-deployment validation step before calling the actual deploy logic. | ||
| 2. IF the pre-deployment dry-run validation produces one or more ERROR findings, THEN THE CLI SHALL abort the deployment, display the Dry_Run_Report, and exit with a non-zero exit code without executing any deployment actions. | ||
| 3. IF the pre-deployment dry-run validation produces only OK and WARNING findings (zero ERROR findings), THEN THE CLI SHALL log a summary of the validation results and proceed with the actual deployment. |
There was a problem hiding this comment.
Shall we request a go-ahead from the customer to allow them time to review the summary and the warnings?
There was a problem hiding this comment.
Not sure if its practical to expect customers to enter input as part of CI/CD deployment. Instead, the option is setup to return error code, if there is a failure expected failing the step.
|
|
||
| ## Overview | ||
|
|
||
| The deploy dry-run feature adds a `--dry-run` flag to the existing `smus-cicd deploy` command. When active, the CLI executes every deployment phase in read-only mode: it loads and validates the manifest, explores the bundle archive, verifies IAM permissions, checks AWS resource reachability, simulates each deployment phase (project init, storage, git, catalog import, QuickSight, workflows, bootstrap actions), validates pre-existing resource dependencies (Glue Data Catalog resources including tables, views, and databases; data sources; form types; asset types), and produces a structured report — all without creating, modifying, or deleting any resources. |
There was a problem hiding this comment.
Should be aws-smus-cicd-cli throughout the PR
There was a problem hiding this comment.
yeah. Mainline merge hadn't happened. It should now all be 'aws-smus-cicd-cli'
There was a problem hiding this comment.
Can you rebase the branch against main, so that we get rid of them all?
There was a problem hiding this comment.
i had already rebased it and but still saw a few of them. But they should all be addressed now.
| # Resolve project_name — may come from config or target_config. | ||
| project_name = config.get("project_name") | ||
| if not project_name: | ||
| project_cfg = getattr(context.target_config, "project", None) |
There was a problem hiding this comment.
target config is the config of the target in the manifest.
| Finding( | ||
| severity=Severity.WARNING, | ||
| message=( | ||
| f"DataZone project '{project_name}' not found " |
There was a problem hiding this comment.
We can check the manifest and see if the project is getting created through the CLI. If it is, then we are fine. If it is not, we should throw an ERROR. I do not think we need this warning at all
| if not self._db_cache[db_name]: | ||
| findings.append( | ||
| Finding( | ||
| severity=Severity.ERROR, |
There was a problem hiding this comment.
Why are we marking this one ERROR and the above ones WARNING? Are we advising the customer to have the data prior to deployment or are we blocking the deployment?
There was a problem hiding this comment.
we are expecting customers to have the assets with the same names.
There was a problem hiding this comment.
Then I think previous ones should be marked ERRORs as well, no?
There was a problem hiding this comment.
yup. already fixed.
| permissions: Dict[str, List[str]], | ||
| dz_arn: str, | ||
| ) -> None: | ||
| """Add catalog import and grant permissions when catalog assets exist.""" |
There was a problem hiding this comment.
But it can also contain catalog assets with disabled: true. We should probably include this edge-case
There was a problem hiding this comment.
yup. updated this already in the code to make sure we do this, only if disable is not true.
Shnekit
left a comment
There was a problem hiding this comment.
Done, reviewing. Please take a look at the comments :)
…ramework - Add preflight_checker.py to consolidate pre-deployment validation logic - Implement comprehensive dry-run validation step in GitHub Actions workflow before test deployment - Update all dry-run checker modules with improved error handling and validation logic - Enhance dry-run engine and models to support preflight validation patterns - Update catalog import/export helpers and integration tests with refined validation - Correct CLI command references from smus-cli to aws-smus-cicd-cli across documentation and specs - Add unit tests for new preflight_checker module - Update requirements and testing guide documentation for catalog import/export feature - Improve manifest and quickstart documentation with updated command references
- Move DryRunEngine import after describe_command import for logical grouping - Reformat deploy_command function calls to split arguments across multiple lines - Improve code readability and maintain consistent import organization
| bundle_path = find_bundle_file( | ||
| "./artifacts", | ||
| context.manifest.application_name, | ||
| str(context.config.get("region", "")) if context.config else "", |
There was a problem hiding this comment.
Can we remove defaults altogether and fail when the region is not present? With this, it is going to use empty string as the region, right?
There was a problem hiding this comment.
yup. updated the code to remove any region fallbacks and fail if region is not available.
Shnekit
left a comment
There was a problem hiding this comment.
I see there are several comments which are not addressed, but I think they are all minor. Feel free to address them later, if needed
Will catchup offline to understand which comments were not addressed. Will post a separate PR for those. |
feat(dry-run): add deploy dry-run validation engine with comprehensive checkers
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.