
"Non-Deterministic" error which I have high confidence to believe cause by host rather than my code. #2563

Open
jjshao-ms opened this issue Aug 31, 2023 · 5 comments
Labels
P2 Priority 2 Reliability Durable functions get stuck or don’t run as expected.

Comments

@jjshao-ms

jjshao-ms commented Aug 31, 2023

Description


(dotnet-isolated)
Hi all, my orchestrator is getting "Non-Deterministic" errors which I believe aren't caused by my code.
The "Non-Deterministic" error message I got looks like this: "Non-Deterministic workflow detected: A previous execution of this orchestration scheduled an activity task with sequence ID 0 and name 'xxxxx' (version ''), but the current replay execution hasn't (yet?) scheduled this task. Was a change made to the orchestrator code after this instance had already started running?"
As you can see, the sequence ID is "0", which means it is the first activity that got scheduled, but my orchestrator looks like the following:

[Function(OrchName)]
public async Task Run([OrchestrationTrigger] TaskOrchestrationContext context, FunctionContext functionContext)
{
    var data = await context.CallActivityAsync<List<string>>(nameof(this.MyActivity), null);

    if (!context.IsReplaying)
    {
        var log = functionContext.GetLogger(OrchName);
        log.LogInformation($"{data.Count} sources need to be updated.");
    }

    context.SetCustomStatus(new { count = data.Count });

    // more code to process data
    foreach (var item in data) /* ... */
}

The very first line of code in my orchestrator calls the first activity, so there is no chance my orchestrator didn't schedule a task with sequence ID "0". And I can guarantee that I didn't do a deployment, so the code didn't change.
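
To make the error message concrete, here is a simplified, self-contained illustration of the replay consistency check behind it (hypothetical types and names for illustration only; the real logic lives in DurableTask.Core.TaskOrchestrationContext.HandleTaskScheduledEvent): during replay, every TaskScheduled event in the stored history must be matched by an activity that the re-executed orchestrator code schedules at the same sequence ID, otherwise the framework concludes the orchestrator code has changed.

using System;
using System.Collections.Generic;

// What the stored history says was scheduled, and what the current replay has scheduled.
record HistoryTaskScheduled(int SequenceId, string Name);
record ScheduledActivity(int SequenceId, string Name);

static class ReplayConsistencyCheck
{
    // Every TaskScheduled event in the history must be matched by an activity that the
    // re-executed orchestrator code scheduled at the same sequence ID; otherwise the
    // framework reports "Non-Deterministic workflow detected".
    public static void Verify(
        IEnumerable<HistoryTaskScheduled> history,
        IReadOnlyDictionary<int, ScheduledActivity> scheduledThisReplay)
    {
        foreach (var evt in history)
        {
            if (!scheduledThisReplay.ContainsKey(evt.SequenceId))
            {
                throw new InvalidOperationException(
                    $"Non-Deterministic workflow detected: sequence ID {evt.SequenceId}, name '{evt.Name}'.");
            }
        }
    }
}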

And in case you're wondering, we don't have any custom worker middleware at the moment, so there is no await operation inside user-code middleware.
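
As an aside on the logging pattern in the orchestrator above: if the installed SDK version exposes it, TaskOrchestrationContext.CreateReplaySafeLogger can replace the manual IsReplaying check. A minimal sketch, reusing the OrchName and data from the snippet above (requires using Microsoft.Extensions.Logging;):

// Minimal sketch, assuming the installed Durable SDK version exposes CreateReplaySafeLogger.
// The replay-safe logger automatically suppresses output while the orchestration is replaying.
ILogger log = context.CreateReplaySafeLogger(OrchName);
log.LogInformation("{Count} sources need to be updated.", data.Count);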

Additional information

Our orchestration never got a "Non-Deterministic" error when we used the in-proc dotnet function model; it only started reporting this error after we moved to the dotnet-isolated worker.
Some context on the nature of our durable functions: the main activity uses a lot of named pipes, so the function host frequently cancels the JobHost due to health monitor alerts on the named pipe count. According to our logs, the JobHost was in the middle of being cancelled while this orchestration was running, and we got an "Unable to cancel invocation xxx" log indicating something went wrong with the cancellation.
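
For reference on the JobHost recycling mentioned above: the host health monitor that watches counters such as the named pipe count is configured through the healthMonitor section of host.json. The snippet below is only a sketch showing those settings with their documented default values, not a recommended configuration:

{
  "version": "2.0",
  "healthMonitor": {
    "enabled": true,
    "healthCheckInterval": "00:00:10",
    "healthCheckWindow": "00:02:00",
    "healthCheckThreshold": 6,
    "counterThreshold": 0.80
  }
}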

Expected behavior


My orchestrator shouldn't report a "Non-Deterministic" error. At least, not for sequence ID "0".

Actual behavior


"Non-Deterministic"

Relevant source code snippets

See the orchestrator snippet above.

Known workarounds


Hope you can provide one for me.

App Details

  • Durable Functions extension version (e.g. v1.8.3):
<PackageReference Include="Microsoft.Azure.Functions.Worker" Version="1.17.0" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask" Version="1.0.2" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Storage" Version="5.1.2" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Timer" Version="4.2.0" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.11.0" />
  • Azure Functions runtime version (1.0 or 2.0): 4.0
  • Programming language used: C# (isolated)

Screenshots


If you need any, tell me.

If deployed to Azure


The latest one:

  • Timeframe issue observed: "createdTime": "2023-08-25T10:59:22Z", "lastUpdatedTime": "2023-08-26T15:43:24Z"
  • Function App name: reach me through Teams if you need, search my GitHub full name
  • Function name(s): same as above
  • Azure region: wus
  • Orchestration instance ID(s): "abaa0e2b34d15370a66898d6d3f625e9"
  • Azure storage account name: same as above

Other old ones:
"createdTime": "2023-08-19T10:17:17Z"
"lastUpdatedTime": "2023-08-19T23:52:17Z"
"instanceId": "0079d620afca5ef3a531007facad5c12"

"createdTime": "2023-08-13T10:20:31Z"
"lastUpdatedTime": "2023-08-15T11:13:28Z"
"instanceId": "5e0251be87eb597c84a6d957f49c2e28"


@davidmrdavid
Contributor

@jviau: are you aware of any non-determinism errors that may be caused by a failed cancellation flow?

@davidmrdavid davidmrdavid added the P1 Priority 1 label Sep 5, 2023
@jviau
Contributor

jviau commented Sep 6, 2023

Could it be related to this? microsoft/durabletask-dotnet#158

@YanumVlad

YanumVlad commented Sep 6, 2023

I am experiencing a similar issue with a Python-based durable function (containerized).
It started on Friday (Sept 1, 2023), the day after the v2.11.1 release, so I suspect the two might be correlated.
The same code base had been working without issues for months.

@jjshao-ms
Author

Could it be related to this? microsoft/durabletask-dotnet#158

@jviau I don't think so, because our function did not have any custom middleware when this problem happened.
We did add a custom middleware this week, but it only logs a message to help triage this problem. It looks like the following; I don't think it would cause a Non-Deterministic error, right?

public Task Invoke(FunctionContext context, FunctionExecutionDelegate next)
{
    if (!IsOrchestrationTrigger(context))
    {
        return next(context);
    }

    this.logger.LogInformation("Orchestration start executing function: {functionId}, invocation: {invocationId}", context.FunctionId, context.InvocationId);
    return next(context);
}

private static bool IsOrchestrationTrigger(FunctionContext context)
{
    return context.FunctionDefinition.InputBindings.Values.Any(x => x.Type == "orchestrationTrigger");
}
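
For completeness, middleware like this is registered in Program.cs through the worker application builder. A minimal sketch (the class name OrchestrationLoggingMiddleware is assumed for illustration, not taken from the actual code):

using Microsoft.Azure.Functions.Worker;
using Microsoft.Extensions.Hosting;

var host = new HostBuilder()
    .ConfigureFunctionsWorkerDefaults(workerApplication =>
    {
        // Register the logging middleware shown above so it runs for every function invocation.
        workerApplication.UseMiddleware<OrchestrationLoggingMiddleware>();
    })
    .Build();

host.Run();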

@lilyjma lilyjma added the Reliability Durable functions get stuck or don’t run as expected. label Jan 3, 2024
@lilyjma lilyjma added P2 Priority 2 and removed P1 Priority 1 labels Jan 23, 2024
@well0549

I have exactly the same error message:

31ee961e38234446930f35bec246e125: Orchestration completed with a 'Failed' status and 324 bytes of output. Details: Unhandled exception while executing orchestration: DurableTask.Core.Exceptions.NonDeterministicOrchestrationException: Non-Deterministic workflow detected: A previous execution of this orchestration scheduled an activity task with sequence ID 0 and name 'GetServiceUrl' (version ''), but the current replay execution hasn't (yet?) scheduled this task. Was a change made to the orchestrator code after this instance had already started running?
at DurableTask.Core.TaskOrchestrationContext.HandleTaskScheduledEvent(TaskScheduledEvent scheduledEvent) in /_/src/DurableTask.Core/TaskOrchestrationContext.cs:line 275
at DurableTask.Core.TaskOrchestrationExecutor.<ExecuteCore>g__ProcessEvents|11_0(IEnumerable`1 events) in /_/src/DurableTask.Core/TaskOrchestrationExecutor.cs:line 115
at DurableTask.Core.TaskOrchestrationExecutor.ExecuteCore(IEnumerable`1 pastEvents, IEnumerable`1 newEvents) in /_/src/DurableTask.Core/TaskOrchestrationExecutor.cs:line 123
at DurableTask.Core.TaskOrchestrationContext.HandleTaskScheduledEvent(TaskScheduledEvent scheduledEvent) in /_/src/DurableTask.Core/TaskOrchestrationContext.cs:line 275
at DurableTask.Core.TaskOrchestrationExecutor.<ExecuteCore>g__ProcessEvents|11_0(IEnumerable`1 events) in /_/src/DurableTask.Core/TaskOrchestrationExecutor.cs:line 115
at DurableTask.Core.TaskOrchestrationExecutor.ExecuteCore(IEnumerable`1 pastEvents, IEnumerable`1 newEvents) in /_/src/DurableTask.Core/TaskOrchestrationExecutor.cs:line 123
