Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix SimulatedTemperatureSensor Module Status Issues on Stop command #7420

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

sush-101
Copy link

@sush-101 sush-101 commented Feb 3, 2025

Issue description:
Expected Behavior
When changing the module status from Running to Stopped via the Azure portal, the module on the device should stop successfully.

Actual Behavior
After setting the module to Stopped, running sudo iotedge list shows the SimulatedTemperatureSensor module in a Failed (139) state instead of Stopped.

Fix:
This pull request addresses the issues causing the SimulatedTemperatureSensor module to fail with exit code 139 when it receives a stop command from the Azure portal. The fixes involve handling task cancellation exceptions and adding a delay to ensure a graceful shutdown.

Initially, the SimulatedTemperatureSensor module encountered exit code 139 due to improper handling of task cancellations, leading to a segmentation fault. To address this, handling for TaskCanceledException was added in the SendEvents method to log a message when the task is canceled.

Additionally, a 2-second delay was added upon receiving the SIGTERM signal from Docker to ensure proper time is given for a graceful shutdown. This delay allows the module to complete any ongoing operations and release resources properly before shutting down.

Changes made:

  • Added handling for TaskCanceledException in the SendEvents method to log a message when the task is canceled.
  • Added a 2-second delay upon receiving the SIGTERM signal from Docker to ensure proper time for graceful shutdown.
  • Updated ShutdownHandler to include the delay for ensuring all operations complete gracefully before shutdown.

These changes address the issues causing the SimulatedTemperatureSensor module to fail with exit codes 139 and 143, ensuring the module transitions to the stopped state correctly.
image

Azure IoT Edge PR checklist:

This checklist is used to make sure that common guidelines for a pull request are followed.

General Guidelines and Best Practices

  • I have read the contribution guidelines.
  • Title of the pull request is clear and informative.
  • Description of the pull request includes a concise summary of the enhancement or bug fix.

Testing Guidelines

  • Pull request includes test coverage for the included changes.
  • Description of the pull request includes
    • concise summary of tests added/modified
    • local testing done.

Draft PRs

  • Open the PR in Draft mode if it is:
    • Work in progress or not intended to be merged.
    • Encountering multiple pipeline failures and working on fixes.

Note: We use the kodiakhq bot to merge PRs once the necessary checks and approvals are in place. When it merges a PR, kodiakhq converts the PR title to the commit title, PR description to the commit description, and squashes all the commits in the PR to a single commit. The net effect is that entire PR becomes a single commit. Please follow the best practices mentioned here for the PR title and description

@sush-101
Copy link
Author

sush-101 commented Feb 3, 2025

@microsoft-github-policy-service agree company="Microsoft"

@sush-101
Copy link
Author

sush-101 commented Feb 3, 2025 via email

@@ -47,6 +48,7 @@ void CancelProgram()
// Wait for shutdown operations to complete.
if (completed.Wait(shutdownWaitPeriod))
{
Task.Delay(TimeSpan.FromSeconds(2)).Wait();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we add a simple comment, explaining - why wait

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, added a comment.

@@ -7,5 +7,6 @@
"machinePressureMin": 1,
"machinePressureMax": 10,
"ambientTemp": 21,
"ambientHumidity": 25
"ambientHumidity": 25,
"RuntimeLogLevel": "Debug"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need this to be debug level. If added for debugging purpose, then we should remove it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the level to info.

@@ -49,67 +50,96 @@ public enum ControlCommandEnum

static async Task<int> MainAsync()
{
Console.WriteLine("SimulatedTemperatureSensor Main() started.");
ManualResetEventSlim completed = null;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems this change is a code re-arrangement, is it intentional? If not, would suggest to only keep the actual changes in PR for batter change review.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

…rom Azure Portal and Ensure Graceful Shutdown

This pull request addresses the issues causing the SimulatedTemperatureSensor module to fail with exit code 139 when it receives a stop command from the Azure portal. The fixes involve handling task cancellation exceptions and adding a delay to ensure a graceful shutdown.

Initially, the SimulatedTemperatureSensor module encountered exit code 139 due to improper handling of task cancellations, leading to a segmentation fault. To address this, handling for TaskCanceledException was added in the SendEvents method to log a message when the task is canceled.

Additionally, a 2-second delay was added upon receiving the SIGTERM signal from Docker to ensure proper time is given for a graceful shutdown. This delay allows the module to complete any ongoing operations and release resources properly before shutting down.

Changes made:
- Added handling for TaskCanceledException in the SendEvents method to log a message when the task is canceled.
- Added a 2-second delay upon receiving the SIGTERM signal from Docker to ensure proper time for graceful shutdown.
- Updated ShutdownHandler to include the delay for ensuring all operations complete gracefully before shutdown.

These changes address the issues causing the SimulatedTemperatureSensor module to fail with exit codes 139 and 143, ensuring the module transitions to the stopped state correctly.
- Updated the log message in the TaskCanceledException catch block to correctly display the number of sent messages by replacing 'count' with 'count-1'
}
catch (TaskCanceledException)
{
Console.WriteLine($"SendEvents has been canceled, sent {count-1} messages.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the Console.Writeline calls be converted to logger.LogInformation/LogWarning/LogError/etc. calls since you added the logger?

@@ -47,6 +48,8 @@ void CancelProgram()
// Wait for shutdown operations to complete.
if (completed.Wait(shutdownWaitPeriod))
{
// Delay to allow the module to complete any ongoing operations and release resources properly before shutting down
Task.Delay(TimeSpan.FromSeconds(2)).Wait();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shutdown handler is used by all our modules, not just tempSensor--we should carefully consider whether we want to add this behavior to all our modules. Also, a two-second fixed delay isn't guaranteed to solve the problem; on slower devices like Raspberry Pi, for example, it might not be enough time.

It should be the responsibility of completed.Wait(shutdownWaitPeriod) to provide enough time for operations to complete and resources to shut down. Have you looked into why this call isn't sufficient for tempSensor? Maybe tempSensor-specific shutdown code is doing something wrong? Or did you add this delay "just in case"?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants