Continuity

Continuity in the IT world means that if we lose a teammate (hopefully only temporarily), work and life in our environment can still go on.

The most important thing to do is to automate everything that you can. Alerts, backups, index investigation, data integrity checks and other maintenance processes must be automated. Make sure that failure notices go to a team rather than an individual, but also make sure that one named individual is responsible for responding to the messages. On top of that, make sure that whatever is sending the messages (SQL Server Agent, for example) is itself up.

The second thing is to document the jobs, the location of the backups, and the tolerances for performance and capacity issues. In other words, create a runbook. If you document your normal tolerances and share your past stress-causing experiences in the runbook, you help the rest of the team handle situations when a team member is missing.
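To make the idea of documented tolerances concrete, here is a minimal sketch of what one runbook entry could look like if you kept it in a structured form. Everything in it is an assumption for illustration: the `RunbookEntry` fields, the job name, the addresses, the backup path and the tolerance numbers are hypothetical, not a standard format.

```python
from dataclasses import dataclass, field


@dataclass
class RunbookEntry:
    """One documented job. All field names here are hypothetical."""
    job_name: str
    owner: str                # the individual responsible for responding
    notify: str               # the team address that receives failure notices
    backup_location: str
    # metric name -> highest value still considered normal
    tolerances: dict = field(default_factory=dict)


def out_of_tolerance(entry, observed):
    """Return the observed metrics that exceed the documented maximum."""
    return {
        metric: value
        for metric, value in observed.items()
        if metric in entry.tolerances and value > entry.tolerances[metric]
    }


# Example values are made up for illustration.
entry = RunbookEntry(
    job_name="nightly_full_backup",
    owner="primary-dba",
    notify="dba-team@example.com",
    backup_location="/backups/sql/full",
    tolerances={"duration_minutes": 45, "disk_used_pct": 80},
)

breaches = out_of_tolerance(entry, {"duration_minutes": 62, "disk_used_pct": 71})
print(breaches)  # {'duration_minutes': 62}
```

The point is not the tooling. Once the normal range is written down, anyone on the team can tell whether tonight's number is a problem, not only the person who built the job.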

We create runbooks that share these tolerances, but the runbooks have to be kept up to date. Many times, teams treat runbooks as a historical record, but they should be live documents, updated for every change in the environment, including any new jobs that are added.
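One way to catch a runbook that has drifted from the environment is to compare the list of jobs that are actually running with the list of jobs the runbook documents. The sketch below assumes you can already export both lists as plain job names; the function name and the job names are hypothetical.

```python
def runbook_drift(running_jobs, documented_jobs):
    """Compare what runs in the environment with what the runbook describes."""
    running = set(running_jobs)
    documented = set(documented_jobs)
    return {
        # jobs that exist but nobody wrote down
        "undocumented": sorted(running - documented),
        # runbook entries for jobs that no longer run
        "stale": sorted(documented - running),
    }


# Example job names are made up for illustration.
drift = runbook_drift(
    running_jobs=["nightly_full_backup", "index_maintenance", "integrity_check"],
    documented_jobs=["nightly_full_backup", "integrity_check", "old_log_shipping"],
)
print(drift)
# {'undocumented': ['index_maintenance'], 'stale': ['old_log_shipping']}
```

Both lists matter: an undocumented job is a secret process, and a stale entry sends the next person looking for something that is no longer there.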

Another key point is to ensure that alerts don’t go to only one team member. They should go to a group email address or a shared file, or be emailed to every team member. When you have secret alerts, you have secret processes that no one else knows how to handle, and the information can be lost forever if that team member leaves your business.
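A quick audit for secret alerts could look something like this. It assumes you have pulled your alert routing into a simple mapping of alert name to recipients and that you know which addresses are shared group addresses; the alert names and email addresses are hypothetical.

```python
def alerts_without_group(alert_routes, group_addresses):
    """Return the alerts whose recipients include no shared group address."""
    groups = {address.lower() for address in group_addresses}
    return sorted(
        name
        for name, recipients in alert_routes.items()
        if not any(r.lower() in groups for r in recipients)
    )


# Example routing is made up for illustration.
routes = {
    "backup_failed": ["dba-team@example.com"],
    "disk_space_low": ["alice@example.com"],
    "agent_stopped": ["alice@example.com", "dba-team@example.com"],
}

secret_alerts = alerts_without_group(routes, ["dba-team@example.com"])
print(secret_alerts)  # ['disk_space_low']
```

Any alert this flags is one that only a single person will ever see, which makes it the first thing to reroute.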

You should create rules for job creation, job documentation and alert responses. Every so often we see a business with a DBA who wants to be the hero. They are always first on the trail for a response. It is great that they are that engaged, but single-threaded information is dangerous for your team if that person is on vacation or out sick. It can also be a security issue, because information could continue to be sent to this person after they have left the business permanently, or to a dead email address.

We suggest that you have a meeting soon, today or tomorrow. Review the runbooks and review the jobs that are running. Interview each other about the knowledge each of you has about your system, and record it all. Record the meeting in case something doesn’t make it into the runbook. Version the runbook, treat it as a live document, and create rules for updating it.

Next, have someone test and validate the information in the runbook. Have them provide feedback, and set a hard deadline for getting the information out of each person’s head and into the document. (Note: this is a great task for the newest team member, for lots of reasons.)

Then the final, living document should be stored on a file share that is monitored and has controlled access. Quarterly, or at least twice a year depending on how volatile your environment is, someone should be assigned to review and validate the information in the document again.
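If you track when the runbook was last validated, checking whether a review is due takes only a few lines. This is a sketch under assumptions: the 90-day default stands in for "quarterly", 182 days stands in for "twice a year", and the dates are made up.

```python
from datetime import date, timedelta


def review_overdue(last_reviewed, today, interval_days=90):
    """True when more than interval_days have passed since the last review."""
    return today - last_reviewed > timedelta(days=interval_days)


# Example dates are made up for illustration.
last_reviewed = date(2024, 1, 10)
today = date(2024, 5, 1)

quarterly_due = review_overdue(last_reviewed, today)                       # volatile environment
twice_yearly_due = review_overdue(last_reviewed, today, interval_days=182)  # stable environment
print(quarterly_due, twice_yearly_due)  # True False
```

Pick the interval that matches how often your environment changes, and assign a named person to the review when the check comes back true.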
