Now You See Me, Now You Don't - the Mystery of the Vanishing S3 Objects
An investigation into why S3 objects were disappearing in an event-driven workflow, how we identified the issues and the cause of the problem.
PartitionedLogs/Year=yy/Month=mm/Day=dd/bucket, replacing the placeholders with appropriate values. Athena could then use this to partition the data, so we could have more efficient queries for a specific day or a specific bucket.
PartitionedLogs.
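As a rough illustration of the key layout above, here is a minimal sketch of how such a target key could be built. The function name `build_partition_key`, the four-digit year, and the idea of appending the original file name after the bucket segment are all assumptions made for the example; the original code is not shown in this post.

```python
from datetime import date


def build_partition_key(source_bucket: str, log_date: date, filename: str,
                        prefix: str = "PartitionedLogs") -> str:
    """Build a Hive-style partitioned key (hypothetical helper).

    Assumes a four-digit year and zero-padded month and day; adjust the
    format codes if your partition scheme differs.
    """
    return (
        f"{prefix}/Year={log_date:%Y}/Month={log_date:%m}/"
        f"Day={log_date:%d}/{source_bucket}/{filename}"
    )


# Example: an access log written for "my-bucket" on 7 March 2021.
print(build_partition_key("my-bucket", date(2021, 3, 7), "access-log-0001"))
# PartitionedLogs/Year=2021/Month=03/Day=07/my-bucket/access-log-0001
```

Keeping the `Name=value` form in each path segment is what lets Athena treat the segments as partition columns.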
try ... except clause to capture any errors. With these in place, a simplified version of our code would look like:
    for object in event['Messages']:
        ...
        log.debug('Notified of key %s', objectName)
        try:
            # Copy the log file to its partitioned location...
            s3_client.copy_object(
                CopySource=copy_object,
                Bucket=bucket_name,
                Key=target_key)
            # ...then remove the original once the copy has succeeded.
            s3_client.delete_object(Bucket=s3access_bucket, Key=key)
        except botocore.exceptions.ClientError as exception:
            # Record the failure rather than letting it stop the loop.
            log.debug(exception)
Notified of and 350 errors logged as the copy_object failed. This meant we were seeing a failure rate of 0.02%, and this was approximately the same on other days.
NoSuchKey - The specified key does not exist.
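A failure rate like this can be estimated straight from the log output by comparing the number of notification messages with the number of logged errors. The sketch below assumes the logs are available as plain text lines. The function name `failure_rate` and the sample lines are illustrative, not taken from our real logs.

```python
def failure_rate(log_lines):
    """Return (notified, failed, rate) for an iterable of log lines.

    A line counts as a notification if it contains the 'Notified of key'
    debug message, and as a failure if it mentions the NoSuchKey error code.
    """
    notified = 0
    failed = 0
    for line in log_lines:
        if "Notified of key" in line:
            notified += 1
        elif "NoSuchKey" in line:
            failed += 1
    rate = failed / notified if notified else 0.0
    return notified, failed, rate


# Illustrative sample only: four notifications, one failed copy.
sample_lines = [
    "Notified of key logs/a",
    "Notified of key logs/b",
    "An error occurred (NoSuchKey) when calling the CopyObject operation: "
    "The specified key does not exist.",
    "Notified of key logs/c",
    "Notified of key logs/d",
]

notified, failed, rate = failure_rate(sample_lines)
print(f"{failed} failures out of {notified} notifications ({rate:.2%})")
```

Running the same count over several days of logs is a quick way to check whether the rate is stable, as it was for us.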
The system was designed to provide a data availability of 99.99%; all data is stored in multiple locations.
NoSuchKey message.
When we launched S3 back in 2006, I discussed its virtually unlimited capacity (“…easily store any number of blocks…”), the fact that it was designed to provide 99.99% availability, and that it offered durable storage, with data transparently stored in multiple locations. Since that launch, our customers have used S3 in an amazing diverse set of ways: backup and restore, data archiving, enterprise applications, web sites, big data, and (at last count) over 10,000 data lakes.
One of the more interesting (and sometimes a bit confusing) aspects of S3 and other large-scale distributed systems is commonly known as eventual consistency. In a nutshell, after a call to an S3 API function such as PUT that stores or modifies data, there’s a small time window where the data has been accepted and durably stored, but not yet visible to all GET or LIST requests.
Effective immediately, all S3 GET, PUT, and LIST operations, as well as operations that change object tags, ACLs, or metadata, are now strongly consistent. What you write is what you will read, and the results of a LIST will be an accurate reflection of what’s in the bucket. This applies to all existing and new S3 objects, works in all regions, and is available to you at no extra charge! There’s no impact on performance, you can update an object hundreds of times per second if you’d like, and there are no global dependencies.
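To make the difference between the two models in the quoted passages concrete, here is a toy in-memory store in which a write only becomes visible after a configurable lag. It is purely a teaching model under stated assumptions: the class `LaggedStore`, its `lag` parameter, and the explicit `now` clock are invented for this example and say nothing about how S3 is implemented.

```python
class LaggedStore:
    """Toy key-value store where writes become readable after `lag` seconds.

    lag > 0 models an eventual-consistency window; lag == 0 models
    strong read-after-write consistency. Time is passed in explicitly
    so the behaviour is deterministic.
    """

    def __init__(self, lag: float):
        self.lag = lag
        self._writes = {}  # key -> list of (visible_at, value), in write order

    def put(self, key, value, now: float):
        # The write is accepted immediately but only visible from now + lag.
        self._writes.setdefault(key, []).append((now + self.lag, value))

    def get(self, key, now: float):
        visible = [value for visible_at, value in self._writes.get(key, [])
                   if visible_at <= now]
        if not visible:
            raise KeyError(key)  # analogous to a "no such key" response
        return visible[-1]


eventual = LaggedStore(lag=2.0)
eventual.put("report.csv", "v1", now=0.0)
try:
    eventual.get("report.csv", now=1.0)
except KeyError:
    print("written, but not yet visible")
print(eventual.get("report.csv", now=2.0))  # visible once the lag has passed

strong = LaggedStore(lag=0.0)
strong.put("report.csv", "v1", now=0.0)
print(strong.get("report.csv", now=0.0))  # visible immediately
```

Code that assumes a read will always succeed straight after a write behaves correctly against the second store and intermittently fails against the first.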
copy_object.
NoSuchKey warning.
- Make sure you understand consistency and how it can impact your system. Don't expect that the systems you work with will always be strongly consistent.
- When you read announcements or documentation, don't skim read or just read the headline; always dig into the detail.
- Everything fails eventually (no pun intended) - your code should never assume that it will always work. Ensure that you capture any possible error conditions and work out how to deal with them; at a minimum, flag the issue, but where possible, try to find a solution to resolve the problem.
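As one possible way of acting on the last point, the sketch below retries the copy a few times when the error code is NoSuchKey, flags the issue if the key never appears, and re-raises anything else. It is a hedged example rather than the fix we deployed: `move_object`, `error_code`, and the retry parameters are hypothetical names, and whether a retry helps at all depends on why the key is missing. The broad `except Exception` keeps the sketch self-contained; with boto3 you would normally catch `botocore.exceptions.ClientError`, whose `response['Error']['Code']` holds the error code.

```python
import logging
import time

log = logging.getLogger(__name__)


def error_code(exc):
    """Return the service error code from a botocore-style exception, or None."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code")


def move_object(s3_client, source_bucket, key, target_bucket, target_key,
                attempts=3, base_delay=0.5, sleep=time.sleep):
    """Copy then delete an object, retrying the copy on NoSuchKey.

    Returns True if the object was moved, False if the key was still
    missing after the final attempt. Any other error is re-raised.
    """
    copy_source = {"Bucket": source_bucket, "Key": key}
    for attempt in range(1, attempts + 1):
        try:
            s3_client.copy_object(CopySource=copy_source,
                                  Bucket=target_bucket, Key=target_key)
            s3_client.delete_object(Bucket=source_bucket, Key=key)
            return True
        except Exception as exc:  # with boto3: botocore.exceptions.ClientError
            if error_code(exc) != "NoSuchKey":
                raise  # unexpected failure: do not hide it
            if attempt == attempts:
                # At a minimum, flag the issue loudly.
                log.error("Giving up on %s after %d attempts", key, attempts)
                return False
            log.warning("Key %s not found (attempt %d), retrying", key, attempt)
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return False
```

Passing `sleep` as a parameter makes the backoff easy to stub out in tests, and returning a boolean lets the caller decide what to do with objects that could not be moved.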