Poison queue handling

In Async Caller we use Azure Storage Queues to queue messages such as events. In some cases a message might end up in a so-called poison queue. This article explains what that is and how to handle messages that end up there.

Queue and Function App

The Async Caller adds all incoming requests as messages on a queue that is unique for every customer and is located in the customer's Azure subscription.

The customer's Azure subscription also hosts a Function App that handles the messages on the queue. The Function App contains a function that is configured to be called by Azure whenever there are messages available on the queue.

If the function handles a message successfully, the message is removed from the queue. If the function throws an exception back to Azure, the message is not removed from the queue and will be tried again. After a number of consecutive failures, the message is removed from the queue and put on a poison queue.

The name of the poison queue is always the name of the original queue with -poison appended to it, e.g. messages from the queue request-queue-1 will be put on the poison queue request-queue-1-poison.
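
The naming rule can be expressed in a couple of lines. This is only an illustrative sketch; the helper names poison_queue_name and original_queue_name are made up for this example and are not part of Async Caller or Azure.

```python
POISON_SUFFIX = "-poison"


def poison_queue_name(queue_name: str) -> str:
    """Return the name of the poison queue that belongs to queue_name."""
    return queue_name + POISON_SUFFIX


def original_queue_name(poison_name: str) -> str:
    """Return the name of the queue that a poison queue belongs to."""
    if not poison_name.endswith(POISON_SUFFIX):
        raise ValueError(f"{poison_name!r} is not a poison queue name")
    return poison_name[: -len(POISON_SUFFIX)]


print(poison_queue_name("request-queue-1"))           # request-queue-1-poison
print(original_queue_name("request-queue-1-poison"))  # request-queue-1
```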

Example: In the old version of the Async Caller, the function app made a REST call to the Async Caller service to hand over the responsibility of sending the request and handling the result. If the Async Caller wasn’t available for some seconds, messages would end up on the poison queue.

Dequeue Count and Poison Queue

Every message on a queue has a dequeue count that is initially zero. Each time the function throws an exception for a specific message, that message remains on the queue, but its dequeue count increases by one.

The function app has a default maxDequeueCount of 5. When the dequeue count reaches maxDequeueCount, the message will be moved to the poison queue.
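
The behaviour described above can be simulated in a few lines. The sketch below is a simplified model that follows the description in this section, not the actual Azure Functions runtime; the Message class and the process function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_DEQUEUE_COUNT = 5  # the default mentioned in the text


@dataclass
class Message:
    body: str
    dequeue_count: int = 0


def process(
    message: Message,
    handler: Callable[[Message], None],
    poison_queue: List[Message],
    max_dequeue_count: int = MAX_DEQUEUE_COUNT,
) -> bool:
    """Call handler until it succeeds or the dequeue count reaches the limit.

    Returns True if the message was handled, False if it was moved to
    the poison queue.
    """
    while message.dequeue_count < max_dequeue_count:
        try:
            handler(message)
            return True  # success: the message is removed from the queue
        except Exception:
            # failure: the message stays on the queue, the count goes up
            message.dequeue_count += 1
    poison_queue.append(message)
    return False


def always_fails(message: Message) -> None:
    raise RuntimeError("downstream service unavailable")


poison: List[Message] = []
handled = process(Message("event-1"), always_fails, poison)
print(handled, len(poison), poison[0].dequeue_count)  # False 1 5
```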

Function App Configuration

By adding the queues extension configuration to the host.json file of your Function App, you can configure maxDequeueCount and some other settings.

"extensions": {
  "queues": {                         // Queue settings...
    "maxPollingInterval": "00:00:10", // Default: 00:00:01
    "visibilityTimeout": "00:00:10",  // Default: 00:00:00
    "batchSize": 4,                   // Default: 16
    "maxDequeueCount": 5,             // Default: 5
    "newBatchThreshold": 2            // Default: 8
  }
}

The retries are done in very rapid succession unless you set visibilityTimeout, in which case a failed message is not retried until that time span has passed. So with the settings above, a message is tried 5 times, with at least 10 seconds between attempts, before it is put on the poison queue. This takes roughly 40 to 50 seconds in total.
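
The timing can be worked out with a small calculation. This is a rough sketch that assumes the first attempt happens at time zero and that each failed attempt hides the message for exactly visibilityTimeout; polling delays are ignored, so real timings will be somewhat longer. The function name earliest_attempt_offsets is made up for this example.

```python
from datetime import timedelta
from typing import List


def earliest_attempt_offsets(
    max_dequeue_count: int, visibility_timeout: timedelta
) -> List[timedelta]:
    """Earliest time of each attempt, measured from the first attempt."""
    return [visibility_timeout * i for i in range(max_dequeue_count)]


offsets = earliest_attempt_offsets(5, timedelta(seconds=10))
print([int(o.total_seconds()) for o in offsets])  # [0, 10, 20, 30, 40]
```

With a visibilityTimeout of zero, all offsets are zero, which corresponds to the retries in rapid succession mentioned above.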

Handle the Poison Queue

When messages end up on the poison queue they must be handled manually. They could, for example, be deleted, moved to a temporary queue for later handling, or moved back to the originating queue for immediate handling. In some cases they may need to be changed before they are put back on the queue. What to do depends on your business case, so we won’t advise you here.

In Microsoft Azure Storage Explorer it is possible to move 1 to 32 messages by selecting them. It is also possible to move all messages in one queue to another.

You could also create your own custom console application to move messages with more control.
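
As a starting point, the core of such a tool could look like the sketch below. It is a minimal example, not a finished application. The function move_messages and the class InMemoryQueue are hypothetical names; the function only assumes queue objects with the methods receive_messages(), send_message(content) and delete_message(message), which are the method names used by QueueClient in the azure-storage-queue package. The in-memory class is a stand-in so the example runs without an Azure account.

```python
from types import SimpleNamespace
from typing import Iterable, List, Optional


def move_messages(source, target, limit: Optional[int] = None) -> int:
    """Move up to limit messages from source to target (all if limit is None).

    Returns the number of messages moved.
    """
    if limit is not None and limit <= 0:
        return 0
    moved = 0
    for message in source.receive_messages():
        target.send_message(message.content)  # copy to the target first ...
        source.delete_message(message)        # ... then delete from the source
        moved += 1
        if limit is not None and moved >= limit:
            break
    return moved


class InMemoryQueue:
    """Stand-in for a real queue client, used only for demonstration."""

    def __init__(self, contents: Iterable[str] = ()):
        self._messages = [SimpleNamespace(content=c) for c in contents]

    def receive_messages(self):
        return list(self._messages)  # snapshot, so deleting while looping is safe

    def send_message(self, content: str) -> None:
        self._messages.append(SimpleNamespace(content=content))

    def delete_message(self, message) -> None:
        self._messages.remove(message)

    def contents(self) -> List[str]:
        return [m.content for m in self._messages]


poison = InMemoryQueue(["m1", "m2", "m3"])
original = InMemoryQueue()
print(move_messages(poison, original, limit=2))  # 2
print(original.contents())                       # ['m1', 'm2']
print(poison.contents())                         # ['m3']
```

Against real queues, the two objects could be created with QueueClient.from_connection_string(conn_str, queue_name). Check that the message encoding used by the client matches what your Function App expects, and note that sending before deleting means a crash between the two steps can leave a duplicate rather than lose a message.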

Async Caller and Retries

Async Caller is responsible for making the actual call to the URL in the message. If an error occurs when calling the URL, Async Caller is responsible for handling the error and for adding the message to the end of the normal queue again. This is seen as a new message, which will have a dequeue count of zero.

When Async Caller adds a message to the queue again, it calculates a new delivery time based on an exponential back-off algorithm with some random variation, i.e. the message is tried again after exponentially increasing time periods. These retries do not have anything to do with the dequeue count or the poison queue.
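
To illustrate the idea of exponentially increasing delays with random variation, here is a generic sketch. It is not the actual Async Caller algorithm: the function name retry_delay, the base delay of 60 seconds, the cap of one day and the jitter of 10 % are all assumptions chosen for the example.

```python
import random
from datetime import timedelta


def retry_delay(
    attempt: int,
    base_seconds: float = 60.0,     # assumed delay before the first retry
    max_seconds: float = 86400.0,   # assumed upper limit (one day)
    jitter: float = 0.1,            # assumed random variation of +/- 10 %
) -> timedelta:
    """Delay before retry number attempt, where 1 is the first retry."""
    if attempt < 1:
        raise ValueError("attempt must be 1 or greater")
    # Double the delay for every retry; cap the exponent to avoid overflow.
    delay = min(base_seconds * 2.0 ** min(attempt - 1, 30), max_seconds)
    # Add random variation so that many messages do not retry at the same moment.
    delay *= 1.0 + random.uniform(-jitter, jitter)
    return timedelta(seconds=delay)


for attempt in range(1, 6):
    print(attempt, retry_delay(attempt, jitter=0.0))
```

With jitter set to zero, the delays for the first five retries are 1, 2, 4, 8 and 16 minutes.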

Async Caller and Expiration Time

Every customer can set an ExpirationTime for Async Caller. The expiration time is when the Async Caller should give up on a message. The default value is 7 days.

If Async Caller has not been able to deliver a message when the expiration time is reached, it gives up on the message and logs an error ("Giving up on envelope...").
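
The check itself is simple, as the sketch below shows. It assumes that the expiration time is a time span measured from the moment the request was first accepted; that starting point, and the name has_expired, are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_EXPIRATION = timedelta(days=7)  # the default mentioned in the text


def has_expired(
    first_accepted: datetime,
    now: datetime,
    expiration: timedelta = DEFAULT_EXPIRATION,
) -> bool:
    """True if the message should be given up on instead of retried."""
    return now - first_accepted >= expiration


accepted = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(has_expired(accepted, accepted + timedelta(days=6)))  # False
print(has_expired(accepted, accepted + timedelta(days=7)))  # True
```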

Function App Timeout (avoiding problems)

For the Function App to be able to pick up a timeout error from Async Caller, the request timeout of the HttpClient must be longer than the timeout of the Async Caller service itself. As the timeout of the Async Caller service is 100 seconds, we recommend a timeout of at least 120 seconds for the HttpClient in the Function App.

If the Function App uses a too short timeout when calling Async Caller, the logic for adding the message back to the queue will be executed by both Azure and Async Caller. This could result in a huge number of duplicates of each message on the queue.