Tags: temporalio/temporal

v1.25.0-118.0

Fix license spelling (#6019)

Fixed the inconsistent license spelling in the repo.

Co-authored-by: Alex Shtin <alex@temporal.io>

v1.25.0-117.2

Fix missing Workflow Execution on completed Update (#6322)

## What changed?

Make sure that `WorkflowExecution` is set.

## Why?

If a user sends an Update request to a closed Workflow and that Update
had already completed earlier, the response was missing the
`WorkflowExecution`.
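
A hypothetical sketch of the intent of the fix (the helper name and code path are illustrative, assuming the execution reference lives on the response's `UpdateRef`; this is not the server's actual code):

```go
package example

import (
	commonpb "go.temporal.io/api/common/v1"
	updatepb "go.temporal.io/api/update/v1"
	"go.temporal.io/api/workflowservice/v1"
)

// ensureWorkflowExecutionSet shows the intent of the fix: even when the
// Update already completed and the Workflow is closed, the response must
// carry the Workflow Execution it refers to. Hypothetical helper, not the
// server's actual code path.
func ensureWorkflowExecutionSet(
	resp *workflowservice.UpdateWorkflowExecutionResponse,
	updateID, workflowID, runID string,
) {
	if resp.GetUpdateRef().GetWorkflowExecution() != nil {
		return // already populated
	}
	resp.UpdateRef = &updatepb.UpdateRef{
		UpdateId: updateID,
		WorkflowExecution: &commonpb.WorkflowExecution{
			WorkflowId: workflowID,
			RunId:      runID,
		},
	}
}
```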

## How did you test it?

Updated existing test.

## Potential risks

## Documentation

## Is hotfix candidate?

Maybe.

v1.25.0-rc.0

Fix: RespondWorkflowTaskCompleted should put the workflow termination event into the same event batch as the workflow task failure event (#6304)

## What changed?
We [previously updated
RespondWorkflowTaskCompleted](#6180)
to terminate on certain non-retryable errors (payload size exceeded)
instead of failing. Termination indicates that the failure was sourced from
the service, rather than from customer code/the SDK. The change inadvertently
resulted in the wrong `CompletionEventBatchID` being written into
mutable state, since both `failWorkflowTask` and `TerminateWorkflow`
allocate a new event batch. `CompletionEventBatchID` should be set to
the batch started in `failWorkflowTask`, but it was overridden by
`TerminateWorkflow`.

When replication to another cluster was involved, or the events cache was
otherwise evicted, `MutableState.GetCompletionEvent()` was no longer
able to load the failed event, because the `CompletionEventBatchID` used to
load from persistence is past the WFT failure event. This caused active
transfer tasks involving the terminated workflow to be sent to the DLQ
on secondary/standby clusters.

This fix:
- writes the correct event batch ID into the workflow termination event by
using `AddWorkflowExecutionTerminatedEvent` instead of `TerminateWorkflow`
(see the sketch after this list)
- updates `GetCompletionEvent()` to handle internal errors for
terminated workflows by iterating through the last batch of events via
persistence
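
To make the batch bookkeeping concrete, here is a toy model of the problem and the fix. It is not the server's real mutable-state API; types and method names are simplified stand-ins for illustration only:

```go
package example

// historyEvent carries the ID of the first event of the batch it belongs to,
// mirroring how CompletionEventBatchID is used to locate the completion event.
type historyEvent struct {
	ID      int64
	Type    string
	BatchID int64
}

type toyMutableState struct {
	nextEventID            int64
	events                 []historyEvent
	completionEventBatchID int64
}

// addEvent appends an event, either opening a new batch or reusing the
// current one.
func (ms *toyMutableState) addEvent(eventType string, newBatch bool) historyEvent {
	batchID := ms.nextEventID
	if !newBatch && len(ms.events) > 0 {
		batchID = ms.events[len(ms.events)-1].BatchID
	}
	ev := historyEvent{ID: ms.nextEventID, Type: eventType, BatchID: batchID}
	ms.nextEventID++
	ms.events = append(ms.events, ev)
	return ev
}

// Before the fix, failing the WFT and terminating each opened a batch, and the
// later terminate call overwrote completionEventBatchID. After the fix, the
// termination event is appended to the WFT-failure batch, so the ID keeps
// pointing at the batch that actually contains the failure/termination events.
func (ms *toyMutableState) failAndTerminate() {
	failed := ms.addEvent("WorkflowTaskFailed", true /* new batch */)
	ms.completionEventBatchID = failed.BatchID
	ms.addEvent("WorkflowExecutionTerminated", false /* same batch */)
}
```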

## How did you test it?
Set up a local temporal server and a simple test worker/workflow. The
test workflow simply launches a child workflow and awaits the result;
the child workflow returns an output with a large (2MB) payload to
trigger termination.
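
A minimal Go SDK sketch of the kind of test workflow described above (workflow names and the 2MB figure follow the description; this is not the exact test code from the PR):

```go
package example

import (
	"bytes"

	"go.temporal.io/sdk/workflow"
)

// ParentWorkflow launches a child and awaits its result. Because the child's
// output is ~2MB, completing the parent's workflow task exceeds the payload
// size limit and the server terminates the workflow instead of failing the
// task.
func ParentWorkflow(ctx workflow.Context) ([]byte, error) {
	var result []byte
	err := workflow.ExecuteChildWorkflow(ctx, ChildWorkflow).Get(ctx, &result)
	return result, err
}

// ChildWorkflow returns a large (~2MB) payload to trip the size limit.
func ChildWorkflow(ctx workflow.Context) ([]byte, error) {
	return bytes.Repeat([]byte{'x'}, 2*1024*1024), nil
}
```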

✅ Validating test case:
- start with unpatched temporal
- run test workflow, observe termination
- run `tdbg w d` and observe an incorrect `completionEventBatchId`
- restart server
- run `tdbg w rt` to refresh tasks
- observe `"error":"unable to get workflow completion event"` message in
logs

✅ Validating backfill for corrupted `completionEventBatchId` workflows:
- (continue from the "Validating test case" steps)
- stop unpatched temporal server
- start patched temporal server
- run `tdbg w rt` to refresh tasks
- observe that the server no longer logs the error message and makes progress

✅ Validating newly-terminated workflows:
- start with patched temporal server
- run test workflow, observe termination
- run `tdbg w d` and observe the correct `completionEventBatchId`,
ensuring it matches the `WorkflowTaskFailed` event ID

## Potential risks
- introduces new reverse-history search logic into
`GetCompletionEvent()` for edge cases that previously weren't handled;
impact is mitigated by requesting only a single page + batch of
results
- once again changes the `CompletionEventBatchID` for workflow
termination events, which may cause transfer task errors if incorrect

## Documentation

## Is hotfix candidate?
Yes

v1.25.0-117.1

Fix: RespondWorkflowTaskCompleted should put the workflow termination event into the same event batch as the workflow task failure event (#6304)

## What changed?
We [previously updated
RespondWorkflowTaskCompleted](#6180)
to terminate on certain non-retryable errors (payload size exceeded)
instead of failing. Termination indicates that the failure was sourced from
the service, rather than from customer code/the SDK. The change inadvertently
resulted in the wrong `CompletionEventBatchID` being written into
mutable state, since both `failWorkflowTask` and `TerminateWorkflow`
allocate a new event batch. `CompletionEventBatchID` should be set to
the batch started in `failWorkflowTask`, but it was overridden by
`TerminateWorkflow`.

When replication to another cluster was involved, or the events cache was
otherwise evicted, `MutableState.GetCompletionEvent()` was no longer
able to load the failed event, because the `CompletionEventBatchID` used to
load from persistence is past the WFT failure event. This caused active
transfer tasks involving the terminated workflow to be sent to the DLQ
on secondary/standby clusters.

This fix:
- writes the correct event batch ID into the workflow termination event by
using `AddWorkflowExecutionTerminatedEvent` instead of `TerminateWorkflow`
- updates `GetCompletionEvent()` to handle internal errors for
terminated workflows by iterating through the last batch of events via
persistence

## How did you test it?
Set up a local temporal server and a simple test worker/workflow. The
test workflow simply launches a child workflow and awaits the result;
the child workflow returns an output with a large (2MB) payload to
trigger termination.

✅ Validating test case:
- start with unpatched temporal
- run test workflow, observe termination
- run `tdbg w d` and observe an incorrect `completionEventBatchId`
- restart server
- run `tdbg w rt` to refresh tasks
- observe `"error":"unable to get workflow completion event"` message in
logs

✅ Validating backfill for corrupted `completionEventBatchId` workflows:
- (continue from the "Validating test case" steps)
- stop unpatched temporal server
- start patched temporal server
- run `tdbg w rt` to refresh tasks
- observe that the server no longer logs the error message and makes progress

✅ Validating newly-terminated workflows:
- start with patched temporal server
- run test workflow, observe termination
- run `tdbg w d` and observe the correct `completionEventBatchId`,
ensuring it matches the `WorkflowTaskFailed` event ID

## Potential risks
- introduces new reverse-history search logic into
`GetCompletionEvent()` for edge cases that previously weren't handled;
impact is mitigated by requesting only a single page + batch of
results
- once again changes the `CompletionEventBatchID` for workflow
termination events, which may cause transfer task errors if incorrect

## Documentation

## Is hotfix candidate?
Yes

v1.25.0-117.0

Fix state machine timer task metrics (#6286)

## What changed?

- Add a `state_machine_timer_skips` metric
- Fix `state_machine_timer_processing_failures` so that it is emitted on
failures instead of skips (a sketch of the distinction follows below)
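
A minimal sketch of the corrected accounting; the `counter` interface and the processing loop are stand-ins for the server's metrics handler and timer task executor, not its real API:

```go
package example

// counter is a stand-in for the server's metrics counter.
type counter interface{ Record(delta int64) }

// processStateMachineTimers illustrates the distinction the fix makes: timers
// that are skipped increment state_machine_timer_skips, while only genuine
// processing errors increment state_machine_timer_processing_failures.
func processStateMachineTimers(
	timers []func() (skipped bool, err error),
	skips, failures counter,
) error {
	for _, process := range timers {
		skipped, err := process()
		switch {
		case err != nil:
			failures.Record(1) // emitted on failure, not on skip
			return err
		case skipped:
			skips.Record(1) // new metric for skipped timers
		}
	}
	return nil
}
```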

v1.25.0-116.4

Add feature flag for workflowIDReuse start time validation (#6231) (#6235)

## What changed?

- Add a feature flag for workflowIDReuse start time validation, default
to **_disabled_**

## Why?

- The start time validation introduces one more DB read for the start
workflow operation. This can be avoided by adding the start time to the
workflow execution state.
- Also, we should only fetch the workflow start time when the
TerminateIfRunning policy is used (see the sketch after this list).
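
A minimal sketch of the intended guard; the flag accessor, policy constant, and helper names are hypothetical stand-ins, not the server's actual identifiers (the interval check assumes the `WorkflowIdReuseMinimalInterval` semantics referenced by a later tag):

```go
package example

import "time"

const terminateIfRunning = "TerminateIfRunning" // hypothetical policy value

type startRequest struct {
	WorkflowID  string
	ReusePolicy string
}

// shouldValidateStartTime shows the intent of the feature flag: the extra DB
// read that fetches the prior run's start time happens only when the flag is
// enabled AND the TerminateIfRunning reuse policy is in play.
func shouldValidateStartTime(enableStartTimeValidation func() bool, req startRequest) bool {
	return enableStartTimeValidation() && req.ReusePolicy == terminateIfRunning
}

// validateMinimalReuseInterval sketches the check itself: allow the start only
// if the prior run began at least the configured minimal interval ago.
func validateMinimalReuseInterval(priorStart time.Time, minInterval time.Duration, now time.Time) bool {
	return now.Sub(priorStart) >= minInterval
}
```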

## How did you test it?

- Existing functional test `TestStartWorkflowExecution_Terminate`

## Potential risks

## Documentation

## Is hotfix candidate?

- Yes


Co-authored-by: Yichao Yang <yichao@temporal.io>

v1.25.0-116.3

Cherry-pick Nexus fixes into v1.25.0-116 (#6233)

## What changed?
Cherry-pick Nexus fixes into v1.25.0-116

[Deep copy attributes in SyncHSM task](bf089c9)
[Record SM task transition count on transition](46bd485)
[Translate NamespaceNotActive error in nexus completion API to retryab…](a9840ee)
[Always register outbound category, make outbound processor registrati…](f8743f7)

## Why?

## How did you test it?

## Potential risks

## Documentation

## Is hotfix candidate?

---------

Co-authored-by: Yichao Yang <yichao@temporal.io>
Co-authored-by: Roey Berman <roey@temporal.io>

v1.25.0-115.5

Fix WorkflowIdReuseMinimalInterval config

v1.25.0-115.4

Add feature flag for workflowIDReuse start time validation (#6231)

## What changed?

- Add a feature flag for workflowIDReuse start time validation, default
to **_disabled_**

## Why?

- The start time validation introduces one more DB read for the start
workflow operation. This can be avoided by adding the start time to the
workflow execution state.
- Also, we should only fetch the workflow start time when the
TerminateIfRunning policy is used.

## How did you test it?

- Existing functional test `TestStartWorkflowExecution_Terminate`

## Potential risks

## Documentation

## Is hotfix candidate?

- Yes

v1.25.0-115.3

Add feature flag for workflowIDReuse start time validation (#6231)

## What changed?

- Add a feature flag for workflowIDReuse start time validation, default
to **_disabled_**

## Why?

- The start time validation introduces one more DB read for the start
workflow operation. This can be avoided by adding the start time to the
workflow execution state.
- Also, we should only fetch the workflow start time when the
TerminateIfRunning policy is used.

## How did you test it?

- Existing functional test `TestStartWorkflowExecution_Terminate`

## Potential risks

## Documentation

## Is hotfix candidate?

- Yes