Data Disruption


We usually use terms like data corruption or attack to talk about data interruption issues. I am using the term Data Disruption to discuss a topic that may surprise you: most data security issues are self-inflicted by the business's own employees. At least 40–50% of system outages are caused by human error, which includes poor hardware and software management, poor maintenance, and untested upgrades of system software or application code.

Many data disruptions are preventable by ensuring that your IT staff follow solid maintenance processes and manage capacity on disks, networks, and data pipes. It is easy to set up email alerts and direct them to multiple addresses or, better, to a group.
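As an illustration, here is a minimal sketch of the kind of capacity check that could feed such an alert. The threshold, path, and function names are assumptions for the example, not a prescription; the message returned would be handed to whatever mail system you use, addressed to a group alias rather than an individual.

```python
# Sketch: a disk-capacity check whose output could be emailed to an alert group.
# The 80% threshold and the "/" path are illustrative assumptions.
import shutil

def check_disk_capacity(path="/", warn_at=0.80):
    """Return an alert message if usage on `path` exceeds `warn_at`, else None."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    if used_fraction >= warn_at:
        return (f"ALERT: {path} is {used_fraction:.0%} full "
                f"({usage.free // 2**30} GiB free)")
    return None
```

Run on a schedule (cron, a systemd timer, or your monitoring tool of choice), a check like this only earns its keep if the threshold is set so that every message it sends is worth acting on.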

Ensuring the alerts are examined, validated, and corrected is another story. Many shops send so many alerts, or take so long to make changes, that the alerts turn into noise. The key to preventing this is to make sure every alert is actionable.

If management won’t support the expense, or won’t take the time to examine and approve the needed changes, the team will learn to look the other way too. Most IT teams are overwhelmed and work long days. Sending requests to fix things only to have them ignored by upper management causes a malaise that can affect the entire team.

My first recommendation if you are a manager or business owner: listen to your IT team and help them help themselves. Many system outages that cost your company money can be prevented by providing the requested support to your team and ensuring they have the tools and permissions necessary to monitor and maintain your systems.

Poor rollout plans and access rules also cause outages and slow systems. Review with your IT staff the rules concerning who can roll out changes to your production systems, who does the testing, and who approves new code or any other changes. Frequently, system slowness occurs after a bad query, report, or new application code is rolled out by the wrong person at the wrong time.

When staff are allowed to do whatever they want in the system, bad things happen to good people and good companies, especially at the worst times. You should also ensure that rollouts are communicated in advance to the team, or at least scheduled at regular intervals. Creating a specific rollout schedule will help identify issues when they occur and will help your team plan to monitor upgrades, keeping greater vigilance directly after changes are made. This prevents a culture of constantly putting out fires and letting those fires delay deadlines.
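A rollout schedule can even be enforced in tooling. This sketch shows one way a deployment script might refuse to run outside an agreed change window; the specific windows (Tuesday and Thursday evenings) are a made-up example, and your team would substitute its own schedule.

```python
# Sketch: a guard that refuses a rollout outside an approved change window.
# The windows below are illustrative assumptions, not a recommendation.
from datetime import datetime

# weekday -> (start_hour, end_hour); 1 = Tuesday, 3 = Thursday
APPROVED_WINDOWS = {1: (18, 21), 3: (18, 21)}

def rollout_allowed(now=None):
    """True only if `now` falls inside an approved change window."""
    now = now or datetime.now()
    window = APPROVED_WINDOWS.get(now.weekday())
    return window is not None and window[0] <= now.hour < window[1]
```

Wiring a check like this into the deployment pipeline turns the schedule from a suggestion into a default, while still letting a named approver override it deliberately rather than by accident.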

Here is a checklist to start these discussions with your team:

  1. Who owns the system? Who is the system administrator?
    1. Top roles must be limited to a handful of people; this is not a role everyone should have access to, especially in production, but consider limiting it in development environments as well
  2. Who is the owner of all truth? (Honestly, this should be more than one person)
    1. Where is all the information kept?
      1. Runbooks
      1. Disaster Recovery plans
      1. Encryption Key locker
      1. Who has access to what?
      1. User accountability review
      1. Current version of your application code
      1. Current network map
      1. Security Rules
  3. Rollouts
    1. Who tests rollouts?
    1. Who is allowed to roll out the code?
    1. Is there always a rollback plan? If not, who approves skipping that step?
    1. Who approves the rollout?
    1. When are rollouts scheduled?
    1. Who tests production after the rollout?
    1. Who determines when the rollback option must be enacted?
    1. Who is the final approver?
  4. What maintenance is being done in the environment?
    1. Capacity checks
    1. Network checks
    1. Security checks
    1. Backups
      1. What backup process did you select, and why?
      1. Where are the backups stored?
    1. Performance checks
    1. Data integrity checks
    1. Who is responsible for each check, how often is it run, who should issues be reported to, and who can approve changes to resolve them?
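Several of the checks above, backups and data integrity in particular, come down to one habit: verifying that what you stored is what you can restore. This sketch shows one common approach, comparing a backup file against a recorded SHA-256 checksum; the function names are assumptions for the example.

```python
# Sketch: verifying a backup file against a recorded SHA-256 checksum.
# A backup you never verify is a backup you only hope you have.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_is_intact(path, expected_checksum):
    """True if the file at `path` still matches the checksum recorded at backup time."""
    return sha256_of(path) == expected_checksum
```

Recording the checksum at backup time and re-checking it before you need the restore is a cheap way to move "do we have good backups?" from the unknown column to the known one.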

In summary, if you already know all of this and you have a fabulous team that communicates with you, celebrate for a moment. If not, talk to your team and get these answers written down and stored in your system for the next time you have a data disruption. It may save you money, clients, and sanity.