data:image/s3,"s3://crabby-images/75fa4/75fa4f3f6d53d12fd6b69d75149c628e244e9b80" alt=""
Let’s implement leader election using Amazon S3’s If-Match
condition by building a distributed lock with it.
In August 2024, Gunnar Morling published a blog post that shows how to do it with the If-None-Match
condition. Back then, If-Match
had not yet been released.
This post shows another way to solve the same problem.
The post is intended to stand on its own so you don’t need to read Gunnar’s post first. But do read it as well to see how the solutions compare!
What’s If-Match
PutObject
is the API call that you use to upload data to Amazon S3. By default, the PutObject
calls are upserts: they will replace the object contents or create an object if one does not already exist.
In 2024, Amazon introduced two conditions for the PutObject calls If-Match
(announcement) and If-None-Match
(announcement). They allow you to restrict the behavior in the following ways:
- If you set
If-None-Match: *
, the call will only succeed if the object does not already exist. - If you set
If-Match: <value>
, the call will only succeed if the object exists and its content has the matching entity tag (ETag) value. Entity tag is essentially checksum for the object content.1
DeleteObject
also takes the If-Match
condition, so you can delete an object only if it has matching ETag.
If the call fails, you’ll get a 412 error response (or, in some cases, another 4xx error).
Together with S3’s consistency guarantees these conditions allow you to do compare-and-swap (CAS) operations. They are a key building block for distributed systems.
What’s leader election?
Many distributed systems require designating one of the nodes as the leader. Typically the leader accepts the write requests from the clients and then sends them to the other nodes that process read requests.
How do the nodes choose the leader? Martin Kleppmann in Designing Data-Intensive Applications writes:
One way of electing a leader is to use a lock: every node that starts up tries to acquire the lock, and the one that succeeds becomes the leader.
If we can build a distributed lock, we can perform leader election. Let’s see how to do that on S3.
The locking protocol
We will use a single object in the bucket for locking. Let’s call it lock
. It will be a JSON blob that looks like this:
{
"expires_at": 1740151473.206179
}
Here expires_at
is a timestamp in seconds since the UNIX epoch for when the lock expires.
To acquire the lock, the nodes do the following.
- Read the contents of
lock
. If the object does not exist, there’s no lock and we can jump to step 3. - If
expires_at
is in the past, the lock has expired and we can continue. Otherwise acquiring the lock has failed. - Put a new version of
lock
with the desired expiration time and with one of the conditions:- If
lock
existed in step 1, useIf-Match
with its ETag value. - If
lock
did not exist in step 1, useIf-None-Match
.
- If
If the put in step 3 succeeds, the node has acquired the lock.
S3 has strong read-after-write consistency, so if there is a lock, in step 1 every node is guaranteed to see up-to-date version of the lock data. In step 3, the use of the conditions guarantees that only one node will succeed at acquiring the lock.
If the leader wants to release the lock, it can delete the object using If-Match with the ETag value received in step 3.
Fencing tokens
The elephant in the room is that this relies on the nodes having their clocks in sync, which is a famously difficult problem. Consider what happens if the leader’s clock is behind the others or the clock of one of the secondaries is ahead the others: the leader thinks it still holds the lock while the secondary thinks it has expired. If the secondary now grabs the lock, the former leader can end up issuing zombie requests.
In his post How to Distributed Locking, Martin Kleppman explains that you can use fencing tokens to solve the issue. Fencing token is a number that increases every time a node acquires the lock. The token should then be included in the requests to the system that we hold the lock over, and it should track the highest token it has seen and reject the requests with lower tokens. This filters out the zombie requests.
In our case, even expires_at
could work as a fencing token if the lock duration is always the same.
The protocol guarantees that it will always increase.
However, we do not have to make the lock duration fixed.
We can add another field token
to the JSON object:
{
"expires_at": 1740151473.206179,
"token": 1
}
token
is a number, starting at zero, that should be incremented every time the lock is acquired.
The node acquiring the lock reads it in step 1 and it can increase it in step 3.
Releasing the lock by deleting object does not work anymore as that would reset the token.
You can release the lock by setting expires_at
to zero without incrementing token
.
{
"expires_at": 0,
"token": 1
}
Python implementation
Here’s a basic implementation in Python using boto3. Adding support for the fencing tokens and releasing the lock is left as an exercise for the reader.
import dataclasses
import json
from dataclasses import dataclass
from datetime import UTC, datetime, timedelta
from typing import TYPE_CHECKING, Self
import boto3
import botocore.exceptions
if TYPE_CHECKING:
from mypy_boto3_s3.client import S3Client
s3_client: "S3Client" = boto3.client("s3")
@dataclass(frozen=True)
class LockData:
expires_at: float
def to_json(self) -> str:
return json.dumps(dataclasses.asdict(self))
@classmethod
def from_json(cls, data: str) -> Self:
return cls(**json.loads(data))
def acquire_lock(
s3_client: "S3Client",
bucket: str,
key: str = "lock",
expires_in: timedelta = timedelta(seconds=60),
) -> bool:
"""Try to acquire a lock using S3 as the coördination mechanism.
Args:
s3_client: boto3 S3 client
bucket: S3 bucket name
key: S3 object key
expires_in_seconds: Lock timeout
Returns:
bool: True if the lock was acquired, False otherwise
"""
try:
existing_lock = s3_client.get_object(
Bucket=bucket,
Key=key,
)
except botocore.exceptions.ClientError as e:
if e.response["Error"]["Code"] == "NoSuchKey":
existing_lock = None
else:
raise
if existing_lock is not None:
existing_data = LockData.from_json(existing_lock["Body"].read().decode("utf-8"))
if datetime.now(UTC).timestamp() <= existing_data.expires_at:
return False
condition = {"IfMatch": existing_lock["ETag"]}
else:
condition = {"IfNoneMatch": "*"}
lock_data = LockData(expires_at=(datetime.now(UTC) + expires_in).timestamp())
try:
s3_client.put_object(
Bucket=bucket,
Key=key,
Body=lock_data.to_json(),
**condition, # type: ignore[arg-type]
)
except botocore.exceptions.ClientError as error:
if error.response["Error"]["Code"] in (
"ConditionalRequestConflict",
"PreconditionFailed",
):
# We could alternatively retry on ConditionalRequestConflict (409)
return False
raise
return True
Here’s another exercise for the reader:
The lock
object does not include information about who is holding the lock as it’s not necessary for the protocol.
However, it would be handy in a real-world implementation in case you ever need to debug this.
data:image/s3,"s3://crabby-images/14202/14202bb5f0a2ebc0cb3b6a23146ca8aabac93ae4" alt=""
Does this make sense?
What’s nice about this compared to Gunnar’s version is that there’s no need for a background process to delete the stale lock objects. Gunnar’s design creates a new object every time a lock is acquired but in this version, there’s only a single object that gets modified.
However, with both designs you have to ask whether they make sense in the real world.
As I’ve mentioned before,
while S3 storage is fairly inexpensive, the requests are not cheap: in the standard tier and us-east-1
region, PUTs cost $0.005 per 1000 requests
and GETs cost $0.0004 per 1000 requests. The latencies are in double-digit milliseconds.
S3 Express One Zone makes the requests only 2x cheaper, so it does not materially change the situation.
This means that if you’re looking to build a high-performance, low-cost distributed lock, S3 is not going to be your first choice.
You would probably use it because you’re already using S3 for something else and you want to hold a lock over S3 resources.
Unfortunately S3 does not support fencing tokens for PutObject
calls, which limits the usefulness of this approach.
Meta
This is a companion post for my lightning talk at HYTRADBOI that shows how to use If-None-Match
.
It presents an idea similar to Gunnar’s and to what Delta Lake uses in practice.
I’ll add a link here once the talk has been posted online.
Thanks to Joel Kaasinen, Juuso Takalainen, Iivari Äikäs, and Waltteri for giving feedback on the talk and thanks to Joel Kaasinen for feedback on this post. Any mistakes are my own.
Photos: The first one shows a rock and a tree at the frozen Lake Meiko in Kirkkonummi, Finland on a cloudy winter day. The second one is a cliff at Vepsu, a small island in the sea in front of Hamina, Finland.