
I recently implemented a DLQ (Dead Letter Queue) for SQS. I have done the following three configurations:

  1. The visibility timeout in SQS is left at the default.
  2. In the Dead Letter Queue configuration, the DLQ is enabled and Maximum receives is set to 3.
  3. In the Lambda trigger configuration, Report batch item failures is enabled.

But the problem is that all messages, both successful and failed, are processed three times and then moved to the DLQ.

For success cases, the correct JSON response is returned.

Once I disable "Report batch item failures", messages are deleted for both success and failure cases.
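
For reference, the queue-side part of this setup (configurations 1 and 2) could be expressed with the AWS SDK for .NET roughly as in the sketch below; the queue URL and DLQ ARN are placeholders, and configuration 3 is the "Report batch item failures" setting on the Lambda trigger itself:

    // Rough sketch of the redrive configuration described above (AWSSDK.SQS package).
    // The queue URL and DLQ ARN are placeholder values.
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Amazon.SQS;
    using Amazon.SQS.Model;

    public static class QueueSetup
    {
        public static async Task ConfigureRedriveAsync()
        {
            var sqs = new AmazonSQSClient();

            await sqs.SetQueueAttributesAsync(new SetQueueAttributesRequest
            {
                QueueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/main-queue",
                Attributes = new Dictionary<string, string>
                {
                    // Send a message to the DLQ after it has been received 3 times,
                    // matching "Maximum receives = 3"; the visibility timeout stays at its default.
                    ["RedrivePolicy"] =
                        "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:main-queue-dlq\"," +
                        "\"maxReceiveCount\":\"3\"}"
                }
            });
        }
    }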

  • I'm guessing you went through those best practices: docs.aws.amazon.com/prescriptive-guidance/latest/…
    – McKean, Nov 15, 2023 at 11:35
  • Enabling just "Report batch item failures" in the Lambda configuration will not give any benefit unless the Lambda code has been modified to support partial batch failures. To handle partial batch failures properly, the list of failed items needs to be returned from the Lambda handler method in the SQS batch response. You may refer to this AWS documentation for details: docs.aws.amazon.com/lambda/latest/dg/… Without this, if your Lambda handler method throws an exception while processing the batch, the whole batch becomes visible again.
    – MANISH, Nov 17, 2023 at 16:55

2 Answers


Once we enable "Report batch item failures", we should change the response type of the handler function.

Old code: public async Task<String> FunctionHandlerAsync(SQSEvent sqsEvent)

New code: public async Task<SQSBatchResponse> FunctionHandlerAsync(SQSEvent sqsEvent)

Due to the response type change, we should also change the code in the function implementation.

First, create the list that will hold the failed items:

 List<SQSBatchResponse.BatchItemFailure> batchItemFailures = new List<SQSBatchResponse.BatchItemFailure>();

For exception cases, add the failed record's message ID:

batchItemFailures.Add(new SQSBatchResponse.BatchItemFailure { ItemIdentifier = record.MessageId });

Finally, return the batch response from the handler:

return new SQSBatchResponse(batchItemFailures);

After the above changes, successfully processed messages are correctly deleted from the queue.
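
Putting the snippets above together, a minimal end-to-end handler might look like this sketch (it assumes the Amazon.Lambda.SQSEvents package; ProcessMessageAsync is a hypothetical placeholder for your own processing logic):

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Amazon.Lambda.SQSEvents;

    public class Function
    {
        public async Task<SQSBatchResponse> FunctionHandlerAsync(SQSEvent sqsEvent)
        {
            var batchItemFailures = new List<SQSBatchResponse.BatchItemFailure>();

            foreach (var record in sqsEvent.Records)
            {
                try
                {
                    await ProcessMessageAsync(record);
                }
                catch (Exception)
                {
                    // Report only the failed message; successfully processed ones are deleted by Lambda.
                    batchItemFailures.Add(new SQSBatchResponse.BatchItemFailure
                    {
                        ItemIdentifier = record.MessageId
                    });
                }
            }

            // An empty list means the whole batch succeeded.
            return new SQSBatchResponse(batchItemFailures);
        }

        // Hypothetical placeholder for your own processing logic.
        private Task ProcessMessageAsync(SQSEvent.SQSMessage message)
        {
            return Task.CompletedTask;
        }
    }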

Reference: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting


There are two immediate solutions that I think could help you debug this:

  1. The queue's visibility timeout may be shorter than the time it takes your Lambda function to process the entire batch. This causes every message in the batch that wasn't deleted by your function to become visible again, until it hits the maximum receive count and is sent to the dead-letter queue. To solve this, make sure that your SQS queue's visibility timeout is at least six times your function's timeout (see the sketch after this list).
  2. If solution 1 doesn't fix your problem, check whether any of the messages in the batch are failing, and make sure you handle them in such a way that they are sent to storage dedicated to observing failed messages. This ensures that your message batch gets processed even if some of the messages cause failures.
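
As a rough illustration of point 1, the visibility timeout could be aligned with the function timeout like this (the queue URL and the 30-second function timeout are placeholder values, not taken from the question):

    // Sketch: set the queue's visibility timeout to six times the Lambda timeout.
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Amazon.SQS;
    using Amazon.SQS.Model;

    public static class VisibilityTimeoutSetup
    {
        public static async Task AlignVisibilityTimeoutAsync()
        {
            const int functionTimeoutSeconds = 30;                     // your Lambda timeout
            int visibilityTimeoutSeconds = 6 * functionTimeoutSeconds; // 180 seconds

            var sqs = new AmazonSQSClient();
            await sqs.SetQueueAttributesAsync(new SetQueueAttributesRequest
            {
                QueueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/main-queue",
                Attributes = new Dictionary<string, string>
                {
                    ["VisibilityTimeout"] = visibilityTimeoutSeconds.ToString()
                }
            });
        }
    }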

More Tips:

  • It is also considered good practice to implement idempotency in your Lambda function to prevent re-processing of messages that were already processed successfully (see the sketch after these tips).
  • It also helps to hook up a notification system to monitor the dedicated storage for messages that cause failures or poison the queue.
  • Batch failures also cause Lambda to reduce the number of SQS pollers (defaults to 5) if retries are detected on your queue.
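
For the idempotency tip, a minimal sketch could record each processed MessageId in a hypothetical DynamoDB table named processed-messages using a conditional put; this is only one possible approach and not part of the original answer:

    // Sketch: skip messages whose MessageId was already recorded in DynamoDB.
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Amazon.DynamoDBv2;
    using Amazon.DynamoDBv2.Model;

    public class IdempotencyGuard
    {
        private readonly AmazonDynamoDBClient _dynamo = new AmazonDynamoDBClient();

        // Returns true if this MessageId has not been processed before.
        public async Task<bool> TryMarkProcessedAsync(string messageId)
        {
            try
            {
                await _dynamo.PutItemAsync(new PutItemRequest
                {
                    TableName = "processed-messages",          // hypothetical table name
                    Item = new Dictionary<string, AttributeValue>
                    {
                        ["MessageId"] = new AttributeValue { S = messageId }
                    },
                    // Fails if an item with this MessageId already exists.
                    ConditionExpression = "attribute_not_exists(MessageId)"
                });
                return true;
            }
            catch (ConditionalCheckFailedException)
            {
                return false; // already processed; safe to skip
            }
        }
    }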
