To actually answer your question: you need some kind of job scheduling service that manages the whole operation, whether that’s SSM, Ansible, or something else. With Ansible, you can set a batch-size parameter on the play (`serial`, if I remember right) so that it only updates 3 or so hosts at a time until they are all done. As far as I know, a host that fails just gets dropped from the run, and the process only aborts on its own if every host in a batch fails. There’s also a parameter to make it die if any single host fails; I believe it’s `any_errors_fatal`.
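Here is a rough sketch of what that could look like as a playbook, so treat it as a starting point and not a tested setup. As far as I know, the batch-size setting is the play-level `serial` keyword and the stop-on-any-failure setting is `any_errors_fatal`. The `webservers` group name and the apt upgrade task are placeholders I made up, assuming Debian/Ubuntu hosts.

```yaml
# Sketch only: the inventory group and the upgrade task are assumptions.
- name: Rolling upgrade, a few hosts at a time
  hosts: webservers          # hypothetical inventory group
  become: true
  serial: 3                  # work through the hosts in batches of 3
  any_errors_fatal: true     # stop the whole run as soon as any host fails
  tasks:
    - name: Upgrade packages (assumes apt-based hosts)
      ansible.builtin.apt:
        update_cache: true
        upgrade: dist
```

If stopping on the first failure is too strict, `max_fail_percentage` is the softer option: it aborts only once more than the given share of hosts in a batch has failed.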
That would be great. Just a bit that sends an email from a different innocuous-sounding Gmail address every month with a generic problem like “app crashes on <random device>” to see if there is a response. If you miss 3 in a row, you’re out.