Skip to content

Conversation

demikl
Copy link

@demikl demikl commented Sep 4, 2025

Description

Fixes nil pointer panic in updateComputeConfig and prevents spurious Auto Mode updates for non-Auto Mode clusters.

Related Issue

Fixes aws-controllers-k8s/community#2619

Changes

  • Add isAutoModeCluster() function to properly detect Auto Mode clusters
  • Add validateAutoModeConfig() to validate AWS requirement that compute, storage, and load balancing must all be enabled/disabled together
  • Only invoke updateComputeConfig() for actual Auto Mode clusters
  • Ignore elasticLoadBalancing absent vs false differences for non-Auto Mode clusters

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ack-prow ack-prow bot requested review from a-hilaly and michaelhtm September 4, 2025 14:24
Copy link

ack-prow bot commented Sep 4, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: demikl
Once this PR has been reviewed and has the lgtm label, please assign jlbutler for approval by writing /assign @jlbutler in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

ack-prow bot commented Sep 4, 2025

Hi @demikl. Thanks for your PR.

I'm waiting for a aws-controllers-k8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@demikl
Copy link
Author

demikl commented Sep 4, 2025

⚠️ Breaking Change Notice

This PR introduces a breaking change in validation behavior for Auto Mode configurations. Users who previously had partial Auto Mode configurations in their Custom Resources will now encounter validation errors.

What Changed

Previously, the controller allowed partial Auto Mode configurations where only one or two of the three required capabilities were specified. This led to:

  • Nil pointer panics when ComputeConfig was present but other configs were missing
  • Confusing behavior where missing fields were implicitly treated as false
  • Spurious reconciliation loops due to AWS API inconsistencies

New Validation Rules

With this change, if any Auto Mode configuration is specified, ALL three capabilities must be explicitly declared:

  1. spec.computeConfig.enabled
  2. spec.storageConfig.blockStorage.enabled
  3. spec.kubernetesNetworkConfig.elasticLoadBalancing.enabled

Example of Breaking Change

This configuration will now fail validation:

apiVersion: eks.services.k8s.aws/v1alpha1
kind: Cluster
metadata:
  name: self
spec:
  computeConfig:
    enabled: false  # Only one capability specified
  kubernetesNetworkConfig:
    ipFamily: ipv4
    serviceIPv4CIDR: 172.20.0.0/16

Error result:

status:
  conditions:
  - lastTransitionTime: "2025-09-04T15:03:00Z"
    status: "True"
    type: ACK.ResourceSynced
  - message: 'invalid Auto Mode configuration: when configuring Auto Mode, all three
      capabilities must be specified (compute=true, storage=false, elb=false)'
    status: "True"
    type: ACK.Terminal

How to Fix Your Custom Resources

Users have three options:

Option 1: Enable Auto Mode completely

spec:
  computeConfig:
    enabled: true
  storageConfig:
    blockStorage:
      enabled: true
  kubernetesNetworkConfig:
    elasticLoadBalancing:
      enabled: true

Option 2: Disable Auto Mode completely

spec:
  computeConfig:
    enabled: false
  storageConfig:
    blockStorage:
      enabled: false
  kubernetesNetworkConfig:
    elasticLoadBalancing:
      enabled: false

Option 3: Don't use Auto Mode

Remove all Auto Mode configurations from your CR entirely.

Why This Change is Necessary

  1. Prevents crashes: Eliminates nil pointer panics when partial configurations are present
  2. Follows AWS requirements: EKS Auto Mode requires all three capabilities to be configured together
  3. Improves user experience: Clear validation errors instead of silent failures or crashes
  4. Reduces confusion: No more implicit false treatment of missing fields

This aligns the controller behavior with AWS EKS Auto Mode documentation and prevents the undefined behavior that was causing production issues.

@demikl
Copy link
Author

demikl commented Sep 4, 2025

Auto Mode Configuration Logic

This PR adds proper validation and handling for EKS Auto Mode, which has specific requirements from AWS:

Auto Mode Requirements:

  • Activate: All three capabilities (compute, storage, elasticLoadBalancing) must be true
  • Deactivate: All three capabilities must be false
  • Invalid: Any partial or inconsistent configuration (e.g., compute=true, storage=false)

Changes Made:

  1. Strict Validation: The controller now validates that Auto Mode configurations are complete and consistent. If a Custom Resource specifies only some capabilities or has inconsistent true/false values, the resource status is set to failed with a clear error message.
  2. Auto Mode Detection: Only clusters with valid Auto Mode configurations (all true or all false) will trigger Auto Mode-specific logic. Other clusters skip Auto Mode processing entirely.
  3. AWS API Resilience: If the EKS API returns inconsistent responses (which we've observed with the elasticLoadBalancing field), the controller ignores these inconsistencies rather than crashing, preventing API quirks from affecting cluster management.

Breaking Change: Custom Resources with partial Auto Mode configurations that previously succeeded will now fail validation with clear error messages, guiding users toward correct configurations.

This ensures the controller behavior matches AWS Auto Mode requirements and prevents the nil pointer crashes reported in issue aws-controllers-k8s/community#2619.

@demikl demikl force-pushed the fix-automode-nil-pointer-issue-2619 branch from 8290936 to 85a97c1 Compare September 4, 2025 15:42
@rushmash91
Copy link
Member

/ok-to-test

@ack-prow ack-prow bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 5, 2025
@rushmash91
Copy link
Member

Hey @demikl , Thank you for this!!

Can you add a test for the auto-mode behavior? Creating a non-auto cluster and making it auto mode and one trying to turn it off with incorrect parameterS..

@demikl
Copy link
Author

demikl commented Sep 8, 2025

/retest

@demikl
Copy link
Author

demikl commented Sep 9, 2025

Hi @rushmash91,

I’m seeing behavior in the Auto Mode activation test that I can’t explain:

Test flow (test_enable_auto_mode_on_standard_cluster in test_cluster_automode_updates.py):

  1. Create a standard (non–Auto Mode) cluster using the existing simple-cluster template.
  2. Wait until the cluster is ACTIVE (both CR status and eks:DescribeCluster agree).
  3. Patch the Cluster CR to add:
  • spec.computeConfig.enabled = true
  • spec.storageConfig.blockStorage.enabled = true
  • spec.kubernetesNetworkConfig.elasticLoadBalancing.enabled = true
  1. Cluster transitions to UPDATING (observed in CR status and eks:DescribeCluster).
  2. Cluster returns to ACTIVE.
  3. The CR now contains the three Auto Mode sections with enabled=true and the controller reports no further drift.
  4. However, eks:DescribeCluster still does not include computeConfig, storageConfig, or kubernetesNetworkConfig.elasticLoadBalancing (the fields are absent, not just disabled).

Additional debugging I added:

  • Logged intermediate CR + DescribeCluster snapshots (pre‑patch, during UPDATING, after ACTIVE).
  • Called list_updates + describe_update for each updateId: there is exactly one update of type AutoModeUpdate with status Successful.
  • This suggests the service accepted and applied the transition.
  • When I manually reproduce the same workflow in my company’s AWS account (create standard cluster → patch to enable Auto Mode → describe), the DescribeCluster response DOES include the three sections as expected.

Questions / hypotheses:

  • Do Prow tests run against the real EKS API or some proxy/mock that may not yet surface these new fields?
  • Is there longer propagation latency in this environment? (I currently poll up to 180s.)
  • Could the region under test have partial rollout of the Auto Mode describe fields?
  • Are additional fields (e.g. nodeRoleARN, nodePools) required in this environment for the sections to appear in DescribeCluster even though just enabled=true works elsewhere?
  • Is the API silently accepting the update but withholding the fields until another condition is met?

Next steps I can take if helpful:

  • Enrich the patch with nodeRoleARN (from bootstrap) and optional nodePools.
  • Extend polling window and dump the raw JSON describe payload.
  • Add a temporary xfail if this is a known service-side lag or rollout gap.

Could you please confirm:

  • That the test environment is using the live EKS control plane.
  • Whether any extra prerequisites are required for DescribeCluster to return the Auto Mode sections after an update.

Happy to adjust the test once I understand the expected behavior here.

Thanks!

@rushmash91
Copy link
Member

Hi @demikl ,

Questions:
That the test environment is using the live EKS control plane - Yes, the environment is an EKS cluster
Whether any extra prerequisites are required for DescribeCluster to return the Auto Mode sections after an update - No

I usually test the api describe and update behavior directly via the CLI to see if there are any caveats being missed.

@demikl
Copy link
Author

demikl commented Sep 11, 2025

Thanks for the clarification, @rushmash91.

I can reliably reproduce the expected behavior (DescribeCluster showing the three Auto Mode sections) when running the same workflow locally/manually.

Because I still can’t determine why the Prow run’s DescribeCluster response omits those sections after a Successful AutoModeUpdate, I’ve updated the test to assert the transition using list-updates + describe-update only (type=AutoModeUpdate, status=Successful, three params all enabled). This has been consistent across runs.

Let me know if you’d prefer that I keep a (soft) DescribeCluster check as a best-effort, or leave it as-is with the update-based validation.

@demikl
Copy link
Author

demikl commented Sep 11, 2025

/retest

1 similar comment
@demikl
Copy link
Author

demikl commented Sep 12, 2025

/retest

@demikl
Copy link
Author

demikl commented Sep 18, 2025

Hi, do you need anything more from me for this PR to be accepted?

@rushmash91
Copy link
Member

Hey @demikl , thank you! this is great!
One comment i had is can we have the auto-mode test be part of the cluster auto mode test file already there..

- Add isAutoModeCluster() function to detect valid Auto Mode configurations
- Add validateAutoModeConfig() to enforce AWS requirement that compute,
  storage, and load balancing must all be enabled/disabled together
- Only call updateComputeConfig() for actual Auto Mode clusters
- Ignore elasticLoadBalancing absent vs false diffs for non-Auto Mode clusters

Fixes aws-controllers-k8s/community#2619
@demikl demikl force-pushed the fix-automode-nil-pointer-issue-2619 branch from a00dd91 to d5a11c1 Compare September 22, 2025 06:29
@demikl
Copy link
Author

demikl commented Sep 22, 2025

I've merged the tests as requested. It looks like there is a flaky test that impacts my PR checks 😞 : test_cluster_adopt_update

Copy link

ack-prow bot commented Sep 22, 2025

@demikl: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
eks-verify-attribution 086fe41 link false /test eks-verify-attribution

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@demikl
Copy link
Author

demikl commented Sep 22, 2025

/test eks-kind-e2e

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

eks-controller: Nil pointer panic in updateComputeConfig + false drift on elasticLoadBalancing for non–Auto Mode clusters
2 participants