Handling transient exceptions with Elastic Scale

There are quite a lot of differences between having an on-premise data center and using the cloud. One of these differences is the (guaranteed) uptime and the latency between the different servers. When creating your local on-premise datacenter you will have a pretty stable network connection between the different servers and it’s probably really fast. The cloud can be pretty fast also, especially when you are located in the same datacenter/container. However, you don’t have any real guarantees on where your stuff gets deployed, let alone the latency or connection between the servers.

This might lead to some small problems when you have to hit a different server, like when accessing files from storage or querying a database. This is why the Microsoft Patterns & Practices team has published the Transient Fault Handling Application Block. This application block states there are several transient exceptions which can occur when you try to access a service. These exceptions are known to sometimes be automatically be resolved over a small period of time, therefore retrying within a small period of time might fix the problem.

Basically, all this application block does is helping you add a retry-mechanism so you don’t have to worry much about these transient errors yourself. There are a couple of retrying strategies available out of the box when using the Entity Framework. The SQL team has also provided a new retrying mechanism to be used with the Elastic Scale libraries.

When you have downloaded the Elastic Scale sample applications you’ll have access to the SqlDatabaseUtils in which you will find the retry policy defined for the sample application.

/// <summary>
/// Gets the retry policy to use for connections to SQL Server.
/// </summary>
public static RetryPolicy SqlRetryPolicy
{
	get { return new RetryPolicy<SqlDatabaseTransientErrorDetectionStrategy>(10, TimeSpan.FromSeconds(5)); }
}

If an exception occurs within the scope of this policy, it’ll check if the exception is transient. If so, the command is executed again in 5 seconds. To determine if a SqlException is transient the policy checks if the error number matches one of a specified list. Below is an excerpt of the decompiled piece code.

foreach (SqlError error in sqlException.Errors)
{
switch (error.Number)
{
    case 40501:
    ThrottlingCondition throttlingCondition = ThrottlingCondition.FromError(error);
    sqlException.Data[(object) throttlingCondition.ThrottlingMode.GetType().Name] = (object) throttlingCondition.ThrottlingMode.ToString();
    sqlException.Data[(object) throttlingCondition.GetType().Name] = (object) throttlingCondition;
    return true;
    case 40540:
    case 40613:
    case 10928:
    case 10929:
    case 40143:
    case 40197:
    case 233:
    case 10053:
    case 10054:
    case 10060:
    case 20:
    case 64:
    return true;
    default:
    continue;
}
}

If you want to know the details about these errors, just check out your SQL Server.

SELECT *
FROM [master].[dbo].[sysmessages]
WHERE error in (40501, 40540, 40613, 10928, 10929, 40143, 40197, 233, 10053, 10054, 10060, 20, 64)
AND msglangid = 1033

This will give you a list, in English, of all errors known to the SQL Server you are running on. If you need a different language, just change or remove the mslangid constraint.

Using this retry policy is rather straightforward. The basics are shown in the sample applications.

SqlDatabaseUtils.SqlRetryPolicy.ExecuteAction(() =>
{
    using (var db = new ElasticScaleContext<int>(sharding.ShardMap, tenantId2, connStrBldr.ConnectionString))
    {
        var blog = new Blog { Name = name2 };
        db.Blogs.Add(blog);
        db.SaveChanges();
    }
});

When executing this piece of code and you are faced with a transient SqlException, the request is executed again to the same connection/context until the constraints of the policy is met.

The code runs a bit different in my project, since I’m not creating the database connection and the database context in the same place. The reason for this is because sometimes I have to do a query/command on multiple shards and the so I had to resort to a small work around to get the proper connections and database contexts.

One of my connection managers now contains the following piece of code


return shards.Select(s =>
{
	return DatabaseUtility.SqlRetryPolicy.ExecuteAsync(() =>
	{
		var task = s.OpenConnectionAsync(credentialsConnectionString, ConnectionOptions.Validate);
		return task.ContinueWith(t => (IDbConnection) t.Result, TaskContinuationOptions.OnlyOnRanToCompletion | TaskContinuationOptions.ExecuteSynchronously);
	});
});

At this place, I’m creating all of the IDbConnections for the necessary shards. All of this is done within the scope of the retry policy.

Creating the database context and executing the request is done in this piece of code.

public async Task<IList<T>> InvokeQuery<T>(IDbConnection connection, Func<CustomerDatabaseEntities, T> command)
{
	var resultSet = new List<T>();
	Exception exception = null;
	return await TransientDatabase.SqlRetryPolicy.ExecuteAsync(async () =>
	{
		CustomerDatabaseEntities context = null;
		try
		{
			context = await this.SetCurrentContext(connection);
			var result = await Task<T>.Factory.FromAsync(command.BeginInvoke, command.EndInvoke, context, null);
			if (result != null)
			{
				resultSet.Add(result);
			}
		}
		catch (Exception ex)
		{
			exception = ex;
		}
		finally
		{
			if (context != null &amp;&amp; this.TransactionManager == null)
			{
				this.contexts.Remove(context);
				context.Dispose();
			}
		}
	})
	.ContinueWith(t =>
	{
		this.ThrowIfRealException(exception);
		return resultSet;
	});
}
private void ThrowIfRealException(Exception exception)
{
	if (exception != null &amp;&amp; !TransientDatabase.SqlRetryPolicy.ErrorDetectionStrategy.IsTransient(exception))
	{
		throw exception;
	}
}

You might notice a rather strange piece of code over here in the catch block and the ContinueWith method. This has a reason though. When I ran my tests on this piece of code it became clear to me the exceptions thrown within the retry policy scope were swallowed, even when they weren’t transient exceptions.

As you can imagine, this isn’t a desirable scenario so I had to think of a work around for this. Now, when an exception occurs, I’m able to store it locally and throw it later on, out of the scope of the retry policy.

Using the above examples it’s rather doable to create a stable connection to the database. Hope it helps!


Share

comments powered by Disqus