Tags: temporalio/temporal
Fix license spelling (#6019) Fixed the inconsistent license spelling in the repo. Co-authored-by: Alex Shtin <alex@temporal.io>
Fix missing Workflow Execution on completed Update (#6322)

## What changed?
Make sure that `WorkflowExecution` is set.

## Why?
If the user sends an Update request to a closed Workflow and that Update was already completed earlier, the response is missing the `WorkflowExecution`.

## How did you test it?
Updated existing test.

## Is hotfix candidate?
Maybe.
Fix: RespondWorkflowTaskCompleted should write the workflow termination event into the same event batch as the workflow task failure event (#6304)

## What changed?
We [previously updated RespondWorkflowTaskCompleted](#6180) to terminate on certain non-retryable errors (payload size exceeded) instead of failing. Termination indicates that the failure was sourced from the service rather than from customer code/SDK.

The change inadvertently resulted in the wrong `CompletionEventBatchID` being written into mutable state, since both `failWorkflowTask` and `TerminateWorkflow` allocate a new event batch. `CompletionEventBatchID` should be set to the batch started in `failWorkflowTask`, but it was overridden by `TerminateWorkflow`. When replication to another cluster was involved, or the events cache was otherwise evicted, `MutableState.GetCompletionEvent()` was no longer able to load the failed event, as the `CompletionEventBatchID` used to load from persistence is past the WFT failure event. This caused active transfer tasks involving the terminated workflow to be sent to the DLQ on secondary/standby clusters.

This fix:
- writes the correct event batch ID into the workflow termination event by using `AddWorkflowExecutionTerminatedEvent` instead of `TerminateWorkflow`
- updates `GetCompletionEvent()` to handle internal errors for terminated workflows by iterating through the last batch of events via persistence

## How did you test it?
Set up a local Temporal server and a simple test worker/workflow. The test workflow simply launches a child workflow and awaits the result; the child workflow returns an output with a large (2 MB) payload to trigger termination.
✅ Validating test case:
- start with unpatched Temporal
- run test workflow, observe termination
- run `tdbg w d` and observe an incorrect `completionEventBatchId`
- restart server
- run `tdbg w rt` to refresh tasks
- observe `"error":"unable to get workflow completion event"` message in logs

✅ Validating backfill for corrupted `completionEventBatchId` workflows:
- (continue from the "Validating test case" steps)
- stop unpatched Temporal server
- start patched Temporal server
- run `tdbg w rt` to refresh tasks
- observe the server no longer shows the error message and makes progress

✅ Validating newly-terminated workflows:
- start with patched Temporal server
- run test workflow, observe termination
- run `tdbg w d` and observe the correct `completionEventBatchId`, ensuring it matches the `WorkflowTaskFailed` event ID

## Potential risks
- introduces new reverse-history search logic into `GetCompletionEvent()` for edge cases, which previously wasn't there; tried to mitigate impact by only requesting a single page + batch of results
- once again changes the `CompletionEventBatchID` for workflow termination events, which may cause transfer task errors if incorrect

## Is hotfix candidate?
Yes
Fix state machine timer task metrics (#6286)

## What changed?
- Add `state_machine_timer_skips` metric
- Fix `state_machine_timer_processing_failures` to be emitted on failures instead of skips
Add feature flag for workflowIDReuse start time validation (#6231) (#6235)

## What changed?
- Add a feature flag for workflowIDReuse start time validation, defaulting to **_disabled_**

## Why?
- The start time validation introduces one more DB read for the start workflow operation. This can be avoided by adding the start time to the workflow execution state.
- We should also only fetch the workflow start time when the TerminateIfRunning policy is used.

## How did you test it?
- Existing functional test `TestStartWorkflowExecution_Terminate`

## Is hotfix candidate?
- Yes

Co-authored-by: Yichao Yang <yichao@temporal.io>
Cherry-pick Nexus fixes into v1.25.0-116 (#6233)

## What changed?
Cherry-pick Nexus fixes into v1.25.0-116:
- [Deep copy attributes in SyncHSM task](bf089c9)
- [Record SM task transition count on transition](46bd485)
- [Translate NamespaceNotActive error in nexus completion API to retryab…](a9840ee)
- [Always register outbound category, make outbound processor registrati…](f8743f7)

Co-authored-by: Yichao Yang <yichao@temporal.io>
Co-authored-by: Roey Berman <roey@temporal.io>
Add feature flag for workflowIDReuse start time validation (#6231)

## What changed?
- Add a feature flag for workflowIDReuse start time validation, defaulting to **_disabled_**

## Why?
- The start time validation introduces one more DB read for the start workflow operation. This can be avoided by adding the start time to the workflow execution state.
- We should also only fetch the workflow start time when the TerminateIfRunning policy is used.

## How did you test it?
- Existing functional test `TestStartWorkflowExecution_Terminate`

## Is hotfix candidate?
- Yes