I think there’s a subtle but serious issue in the reports scheduler around how the checkpoint (`REPORT_SCHEDULER_LAST_CHECK_CACHE_KEY`) is handled.
Right now in `check_and_enqueue`, the scheduler updates the checkpoint to `now` before it actually enqueues any report trigger messages. This effectively marks the entire time window as “processed” even though the enqueue step hasn’t completed yet.
That becomes a problem in failure scenarios:
- If the process crashes (or restarts) after the checkpoint is written but before the enqueue loop finishes, the next run will start from the new checkpoint and skip all the pending hour buckets in that window.
- If the message queue is unavailable (e.g. RabbitMQ down), `push_to_reports_queue` can fail for all reports, but the function still returns `Ok(())` and the checkpoint has already moved forward, so those triggers are never retried.
- Errors from enqueue are only logged, and the result of `cache.insert(...)` is ignored (`let _ =`), so there’s no strong signal that anything went wrong.
In practice this can lead to silently missing scheduled reports (daily/weekly), which is pretty hard to detect unless someone notices the absence of emails.
Repro idea:
- Let the scheduler run with at least one report due
- Kill the process right after the checkpoint is written (before enqueue completes), or simulate MQ failure
- Restart and observe that the missed time buckets are not retried
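To make the MQ-failure variant of the repro concrete, here is a minimal in-memory model of the ordering described above. It is only a sketch: `FakeCache`, `DownQueue`, and `check_and_enqueue_current` are hypothetical stand-ins I made up for illustration, not the real scheduler types, and buckets are modeled as plain hour counters. The model returns the number of enqueued buckets (the real function returns `Ok(())`) purely so the outcome is observable.

```rust
// Hypothetical stand-in for the cache entry that holds the checkpoint.
#[derive(Debug)]
struct FakeCache {
    last_check: Option<u64>, // checkpoint, modeled as an hour counter
}

/// A queue that rejects every message, standing in for an unreachable broker.
struct DownQueue;

impl DownQueue {
    fn push(&self, _bucket: u64) -> Result<(), String> {
        Err("broker unavailable".to_string())
    }
}

/// Models the ordering described in this issue: the checkpoint is written
/// first, then the buckets are enqueued, and enqueue errors are only logged.
fn check_and_enqueue_current(
    cache: &mut FakeCache,
    queue: &DownQueue,
    now: u64,
) -> Result<usize, String> {
    let start = cache.last_check.unwrap_or(now);
    cache.last_check = Some(now); // window is marked as processed up front

    let mut enqueued = 0;
    for bucket in start..now {
        match queue.push(bucket) {
            Ok(()) => enqueued += 1,
            // Logged and dropped: the caller never learns about the failure.
            Err(e) => eprintln!("enqueue failed for bucket {bucket}: {e}"),
        }
    }
    Ok(enqueued)
}
```

With a checkpoint of 10 and `now` at 13, a run against `DownQueue` returns `Ok(0)` and leaves the checkpoint at 13. A second run then has an empty window, so buckets 10 through 12 are never attempted again.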
Root cause (as I understand it):
The checkpoint is being treated as “we attempted this window” instead of “we successfully handed off the work”.
Possible fix directions:
- Move the checkpoint update to after successful enqueue
- Or advance it incrementally (per hour bucket / per successful enqueue)
- Avoid ignoring errors from `cache.insert`
- Optionally add retry/backoff for queue failures
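A rough sketch of what the incremental variant could look like, under assumptions: the `ReportQueue` trait, `FlakyQueue`, and `check_and_enqueue_fixed` are hypothetical names for illustration, the checkpoint is a plain hour counter rather than the real cache entry, and the actual signature of `push_to_reports_queue` may differ. The idea is that the checkpoint only moves past a bucket once that bucket has been handed off, and the first failure is returned to the caller instead of being logged and dropped.

```rust
/// Hypothetical queue abstraction; the real enqueue call may look different.
trait ReportQueue {
    fn push(&mut self, bucket: u64) -> Result<(), String>;
}

/// Enqueues each pending hour bucket in order and advances the checkpoint
/// only after that bucket was handed off. Stops at the first failure and
/// returns the error, so the next run resumes from the failed bucket.
fn check_and_enqueue_fixed<Q: ReportQueue>(
    checkpoint: &mut u64,
    queue: &mut Q,
    now: u64,
) -> Result<(), String> {
    while *checkpoint < now {
        queue.push(*checkpoint)?;
        // In the real code this would be the cache write, with its
        // result propagated instead of discarded.
        *checkpoint += 1;
    }
    Ok(())
}

/// Test double that accepts a fixed number of pushes, then fails.
struct FlakyQueue {
    remaining_ok: usize,
    delivered: Vec<u64>,
}

impl ReportQueue for FlakyQueue {
    fn push(&mut self, bucket: u64) -> Result<(), String> {
        if self.remaining_ok == 0 {
            return Err(format!("broker unavailable for bucket {bucket}"));
        }
        self.remaining_ok -= 1;
        self.delivered.push(bucket);
        Ok(())
    }
}
```

Starting from checkpoint 10 with `now` at 14 and a queue that accepts two pushes, the first run returns an error with the checkpoint at 12; once the queue recovers, the next run picks up buckets 12 and 13. One trade-off to be aware of: a crash between a successful push and the checkpoint write would re-enqueue that bucket on restart, so this gives at-least-once delivery and the consumer side would need to tolerate duplicates.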
Happy to take a shot at a fix if this direction makes sense 👍