I’ve never run a big system like this, but like the lead character in the story, I always figured exponential backoff would be enough. Turns out there’s more.
All of what you’re saying seems correct. I think this is more of a meta discussion on how (in this case) retries, even with exponential backoff, aren’t a solution by themselves when you look at the system overall. Common solutions tend to have interesting hidden caveats, and this is one I personally wasn’t aware of.
Practically, adding a timeout budget so that the clients themselves just error out (forcing a manual refresh) sorta accomplishes the same as what you’re positing.
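For what it's worth, here's a rough Python sketch of what I mean by a timeout budget. The names (`call_with_budget`, `BudgetExceeded`) and the default numbers are made up for illustration; they aren't from the story or any real library. The idea is just that the client keeps backing off exponentially, but only until a fixed deadline, after which it gives up and surfaces an error instead of adding more load.

```python
import time


class BudgetExceeded(Exception):
    """Raised when the retry budget runs out before a call succeeds."""


def call_with_budget(op, budget_s=10.0, base_delay_s=0.5, max_delay_s=4.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Retry op() with exponential backoff, but only within budget_s seconds.

    clock and sleep are injectable so the behavior can be tested without
    actually waiting. All defaults are illustrative, not recommendations.
    """
    deadline = clock() + budget_s
    delay = base_delay_s
    attempts = 0
    while True:
        attempts += 1
        try:
            return op()
        except Exception as err:
            remaining = deadline - clock()
            # If the next wait would reach or pass the deadline, stop
            # retrying and let the caller (or the user) decide what to do.
            if remaining <= delay:
                raise BudgetExceeded(
                    f"gave up after {attempts} attempts"
                ) from err
            sleep(delay)
            # Double the wait each time, capped at max_delay_s.
            delay = min(delay * 2, max_delay_s)
```

The client would catch `BudgetExceeded` and show an error with a manual refresh option, so a struggling backend stops getting hit by that client until a human decides to try again.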