Enhancing Running Apps with RIFL: A Retry Safety Approach

When I started building my runs-app, the main intention was simple. I wanted to track my running data in a way that made sense to me. I have Garmin runs, Strava runs, and other running data that I want to bring together and eventually use for my own running analysis.

What I ended up with was more than just added idempotency. I took a serious distributed-systems idea from RIFL and made it work inside my own Spring Boot running application, protecting a real endpoint that can corrupt my running history if retries are handled badly. The problem itself showed up once the application started becoming more real. It is the kind of problem that does not look big at first but can become a real production issue: what happens when a client sends a request to create a Garmin run, the backend saves it successfully, but the network fails before the client receives the response?

From the client side, it looks like the request failed.

So the client retries.

Now the backend may create the same run again.

For a running app, duplicate runs are not just a small UI problem. They can affect total mileage, training history, race preparation, weekly summaries, and eventually any AI model or analytics that depends on clean running data. That is where I wanted to try something deeper than just saying, “Let us add a unique key and move on.” I wanted to learn how distributed systems solve this problem.

Why I Looked at RIFL

RIFL stands for Reusable Infrastructure for Linearizability. I came across this idea from Seo Jin Park’s Stanford PhD work. The core idea that interested me was exactly-once execution. In simple terms, if a client sends a request and does not know whether it succeeded or failed, it should be able to retry safely. The server should know whether this request was already completed. If it was completed, the server should replay the earlier result instead of executing the same operation again. That idea felt very relevant to my runs-app.

The endpoint I wanted to protect was:

POST /api/garminRuns

Before the request reaches my Garmin runs controller, the client has to send three RIFL headers:

X-Client-Id: 1
X-Sequence-Number: 42
X-First-Incomplete: 40

These three values are what make the retry safe.

X-Client-Id tells the server which client is making the request.

X-Sequence-Number gives each request from that client a unique number.

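X-First-Incomplete acts as a cleanup watermark. It is the lowest sequence number for which the client has not yet received a response, so the server knows that completion records for anything below it are safe to garbage collect.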
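
One way a client could produce these three values is sketched here. The class RiflClientSequencer and its method names are hypothetical and only illustrate the bookkeeping: a new logical request gets a fresh sequence number, a retry reuses the same number, and the watermark is the smallest number still waiting for a response.

```java
import java.util.Map;
import java.util.TreeSet;

// Hypothetical client-side helper; an illustration, not the runs-app client.
final class RiflClientSequencer {
    private final long clientId;
    private long nextSeq = 1;
    // Sequence numbers that were sent but have not been answered yet, kept sorted.
    private final TreeSet<Long> outstanding = new TreeSet<>();

    RiflClientSequencer(long clientId) {
        this.clientId = clientId;
    }

    // Allocate a sequence number for a brand-new logical request.
    synchronized long begin() {
        long seq = nextSeq++;
        outstanding.add(seq);
        return seq;
    }

    // Call once a response for seq has arrived; any retry before that reuses seq.
    synchronized void complete(long seq) {
        outstanding.remove(seq);
    }

    // Lowest sequence number still awaiting a response, or the next unused one.
    synchronized long firstIncomplete() {
        return outstanding.isEmpty() ? nextSeq : outstanding.first();
    }

    // Headers for the first attempt or any retry of the request identified by seq.
    synchronized Map<String, String> headersFor(long seq) {
        return Map.of(
            "X-Client-Id", Long.toString(clientId),
            "X-Sequence-Number", Long.toString(seq),
            "X-First-Incomplete", Long.toString(firstIncomplete()));
    }
}
```

The point of the design is that the retry loop never calls begin() again. It calls headersFor(seq) with the sequence number it already holds, which is what lets the server recognize the request as a repeat.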

POST /api/garminRuns is the kind of endpoint where retry behavior matters. A duplicate GET request is usually harmless, but a duplicate POST request can create bad data. So what I ended up doing here was taking an academic distributed systems idea and making it practical inside a Spring Boot running application.

Not as a research paper.

Not as a toy example.

But as part of my actual runs-app.

Core Architecture and Implementation

The implementation relies on a few key components to intercept and track requests:

LeaseManager: Tracks client heartbeats; leases are set to expire after 60 seconds.

ResultTracker: Maintains an in-memory cache of completion records, which are keyed by the combination of clientId and sequenceNumber.

RiflFilter: Intercepts incoming requests, looks up the cache, replays the response on a hit, and caches the result on a miss.

RiflGcScheduler: Handles the periodic cleanup of expired clients’ records every 30 seconds.

The idea is that every client gets a lease and sends request sequence numbers. The server tracks completed requests using the combination of:

clientId + sequenceNumber

When the same request comes again, the server can identify that it has already processed it and return the earlier response instead of creating another Garmin run.

LeaseManager

The LeaseManager is responsible for tracking active clients. A client opens a lease and receives a clientId. After that, the client is expected to renew the lease periodically. In my implementation, the lease expires after 60 seconds if the client does not renew it.

This is important because the server is keeping completion records in memory. Without lease expiration, the memory can keep growing forever. So the lease gives the server a clean way to know: this client is still alive (keep its completion records) or this client is gone (clean up its completion records).
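
A minimal sketch of how such a lease table could look is shown here. The method names expiredClients and expire match the calls used by the GC scheduler later in this post, while open, renew, the injected Clock, and all method bodies are assumptions made for illustration rather than the actual runs-app code.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;

// Illustrative lease table; bodies are assumptions, not the runs-app source.
final class LeaseManager {
    private final Clock clock;
    private final Duration ttl;
    private final AtomicLong nextClientId = new AtomicLong(1);
    // clientId -> instant at which the lease lapses unless it is renewed
    private final Map<Long, Instant> expiry = new ConcurrentHashMap<>();

    LeaseManager(Clock clock, Duration ttl) {
        this.clock = clock;
        this.ttl = ttl;
    }

    // Open a lease and hand the caller its new clientId.
    long open() {
        long id = nextClientId.getAndIncrement();
        expiry.put(id, clock.instant().plus(ttl));
        return id;
    }

    // Heartbeat: push the expiry forward; returns false if the lease no longer exists.
    boolean renew(long clientId) {
        return expiry.computeIfPresent(clientId, (id, old) -> clock.instant().plus(ttl)) != null;
    }

    // Clients whose lease has lapsed, so their completion records may be collected.
    Set<Long> expiredClients() {
        Instant now = clock.instant();
        return expiry.entrySet().stream()
            .filter(e -> !e.getValue().isAfter(now))
            .map(Map.Entry::getKey)
            .collect(Collectors.toSet());
    }

    // Forget the lease once its records have been reaped.
    void expire(long clientId) {
        expiry.remove(clientId);
    }
}
```

With a 60 second lease this would be constructed as new LeaseManager(Clock.systemUTC(), Duration.ofSeconds(60)). Passing the Clock in is a design choice of the sketch so that expiry can be tested without sleeping.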

ResultTracker

The ResultTracker is where the completed request result is stored. The key is clientId + sequenceNumber and the value is the completed response. So when a request comes in, the application can ask: have I already completed this request for this client and sequence number? If yes, it returns the earlier result. If no, it allows the request to proceed and then stores the result. That is the heart of the retry safety.
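
The lookup-or-record idea can be sketched as an in-memory map. The names lookup, record, and reapAll follow the calls shown in the filter and GC code in this post. CompletionRecord, reapBelow, and every method body are hypothetical and only show one plausible shape for the tracker.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Identifies one logical request: which client sent it and which request of theirs it is.
record RpcId(long clientId, long sequenceNumber) {}

// The stored outcome that a retry gets back (hypothetical shape).
record CompletionRecord(int status, String body) {}

// Illustrative in-memory tracker; bodies are assumptions, not the runs-app source.
final class ResultTracker {
    private final Map<RpcId, CompletionRecord> completed = new ConcurrentHashMap<>();

    // Has this client already completed this sequence number?
    Optional<CompletionRecord> lookup(RpcId id) {
        return Optional.ofNullable(completed.get(id));
    }

    // putIfAbsent keeps the first recorded result if two attempts race.
    void record(RpcId id, int status, String body) {
        completed.putIfAbsent(id, new CompletionRecord(status, body));
    }

    // Drop records the client has acknowledged: sequence numbers below its watermark.
    int reapBelow(long clientId, long firstIncomplete) {
        return removeMatching(id -> id.clientId() == clientId
            && id.sequenceNumber() < firstIncomplete);
    }

    // Drop every record of a client whose lease has expired.
    int reapAll(long clientId) {
        return removeMatching(id -> id.clientId() == clientId);
    }

    // Removes matching keys and returns how many records were actually removed.
    private int removeMatching(Predicate<RpcId> matches) {
        int removed = 0;
        for (RpcId id : completed.keySet()) {
            if (matches.test(id) && completed.remove(id) != null) {
                removed++;
            }
        }
        return removed;
    }
}
```

Using a record as the map key gives equals and hashCode for free, so two RpcId values with the same clientId and sequenceNumber always hit the same entry.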

RiflFilter

Instead of putting retry logic inside every controller method, the filter intercepts the request before it reaches the controller. It checks the RIFL headers, looks for a completion record, and decides whether to replay the earlier response or allow the request to execute.

The controller can still focus on Garmin run creation. The RIFL filter handles retry safety. That separation is what makes this feel like real architecture and not just a quick patch.

The most important part of the implementation is in the filter. If the same client retries the same sequence number, I do not let the request hit the controller again. I replay the earlier response:

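    RpcId id = new RpcId(clientId, seq);
    var cached = resultTracker.lookup(id);
    if (cached.isPresent()) {
        // Retry of a completed request: replay the stored response and skip the controller.
        replay(response, cached.get());
        return;
    }
    // First attempt: let the controller run against a wrapper that captures its output.
    RiflResponseWrapper wrapper = new RiflResponseWrapper(response);
    try {
        chain.doFilter(request, wrapper);
        // Record only when the chain returned normally with a 2xx status. If doFilter
        // throws, nothing is recorded, so a failed attempt is never replayed as a success.
        int status = wrapper.getStatus();
        if (status >= 200 && status < 300) {
            resultTracker.record(id, status, wrapper.capturedBody());
        }
    } finally {
        // Always flush whatever was captured to the real response.
        wrapper.copyBodyToResponse();
    }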

The Garmin runs controller does not need to know about retries. The filter handles the retry boundary. On the first call, the controller runs and the response is captured. On a retry, the cached response is replayed and the duplicate write is avoided.
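
To see the first-call versus retry behavior without a servlet container, the same decision can be modeled with plain Java. Everything in this sketch (ReplayGate, Key, Result, handle) is a hypothetical stand-in for the filter, and it only covers sequential retries. Two copies of the same request arriving at the same instant could still both miss the cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Servlet-free stand-in for the filter's decision; all names are hypothetical.
final class ReplayGate {
    record Key(long clientId, long seq) {}
    record Result(int status, String body) {}

    private final Map<Key, Result> completed = new ConcurrentHashMap<>();

    // Sequential attempts with the same (clientId, seq) run the handler only until one succeeds.
    Result handle(long clientId, long seq, Supplier<Result> handler) {
        Key key = new Key(clientId, seq);
        Result cached = completed.get(key);
        if (cached != null) {
            return cached;                 // retry: replay, do not execute again
        }
        Result fresh = handler.get();      // first attempt: run the real operation
        if (fresh.status() >= 200 && fresh.status() < 300) {
            completed.put(key, fresh);     // only successful results are replayable
        }
        return fresh;
    }
}
```

Calling handle twice with the same clientId and seq and a handler that returns a 201 invokes the handler once and returns the same Result both times. A handler that returns a 500 is invoked again on the next attempt, because failures are not cached.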

RiflGcScheduler

Because my current implementation keeps completion records in memory, I needed garbage collection. The RiflGcScheduler runs periodically and removes completion records for expired clients. This is a practical trade-off. A full RIFL implementation is more involved (for example, it stores completion records durably along with the data they protect), but for my runs-app I wanted something that teaches the concept and solves the real problem without over-engineering the project.

    // Runs every 30 seconds unless rifl.gc-interval overrides it (value in milliseconds).
    @Scheduled(fixedDelayString = "${rifl.gc-interval:30000}")
    void reapExpiredClients() {
        Set<Long> expired = leaseManager.expiredClients();
        for (Long clientId : expired) {
            // Drop the client's completion records first, then forget its lease.
            int reaped = resultTracker.reapAll(clientId);
            leaseManager.expire(clientId);
            log.debug("RIFL GC: reaped {} records for expired client {}", reaped, clientId);
        }
        if (!expired.isEmpty()) {
            log.info("RIFL GC: expired {} clients", expired.size());
        }
    }

Why This Is a Big Deal for My Project

For me, the important part is not just that I added one more feature. The important part is that the runs-app is slowly becoming a place where I can apply serious software architecture ideas to a personal running problem. Earlier, I worked on syncing runs, multiple databases, Testcontainers, Dockerized services, and running-app data flows. Now I added a distributed systems concept. RIFL is usually discussed in the context of distributed systems, linearizability, and exactly-once execution. I brought that idea into a practical Spring Boot application where duplicate Garmin runs can become a real data quality problem.

This gives me three benefits: safer retries for POST /api/garminRuns, cleaner running data for future analytics, and a deeper learning path into distributed systems.

The Practical Trade-Off

My implementation is intentionally simplified. The completion records are stored in memory. That means if the JVM crashes, the in-memory cache is lost, and a retry that arrives after the restart will reach the controller again. But I still protect the business data using the database: the Garmin activity id has to remain unique, so even if the cache is lost, the database layer gives me another safety net against duplicate runs.

So I look at this design as practical exactly-once behavior for client retries, with PostgreSQL uniqueness as the final guardrail. That is good enough for this stage of the runs-app, and it keeps the implementation understandable.

What I Learned

The biggest learning for me is this: exactly-once behavior is not magic. It comes from carefully tracking client identity, request sequence, completed results, leases, and cleanup. Before implementing this, retry sounded like a simple client-side concern. After implementing this, I see retry as a full architecture concern.

The server has to participate.

The client has to participate.

The database has to protect the final state.

And the application has to be honest about the trade-offs.

More to Come

This RIFL implementation is currently focused on the Garmin runs endpoint. I want to keep improving this as the runs-app grows.

Originally published on Medium.
