What is an Incident?
Before we can define our incident response process, we should first define what an incident (and a major incident) is.
What is an incident?
Any unplanned disruption or degradation of service that is actively affecting customers' ability to use Silverline.
What is a major incident?
Any incident that significantly interrupts service for multiple customers. Resolving one usually requires involving multiple teams.
What triggers our incident response process?
Our incident response process should be initiated for any major incident. It provides a framework for responding effectively and resolving the incident quickly. The process can be triggered in one of two ways: via automated monitoring and alerting, or manually via human action.
Throughout our system, we monitor various metrics to determine whether the system is in a state that requires a co-ordinated human response to resolve. To decide which metrics to monitor, and what to monitor them for, we ask ourselves the questions below. If any answer points to customer impact ("No" to the first or third question, "Yes" to the second or fourth), we should trigger our incident response process.
- Are all customers able to access and configure settings in the Silverline Portal?
- Are customers reporting loss of traffic or blocked hosts in any of our datacenters?
- Are we able to stop current attacks on our customers within SLA?
- Are several customers affected?
We also trigger on any unplanned disruption or degradation of service that a Silverline analyst or manager deems to require a co-ordinated incident response.
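The trigger criteria above can be sketched as a small decision function. This is only an illustration, not how our monitoring is actually implemented: the `HealthCheck` type, the `should_trigger` function, and the choice of which answer counts as "bad" for each question are assumptions based on our reading of the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheck:
    """One monitoring question and the answer that signals customer impact."""
    question: str
    bad_answer: bool  # True means "Yes" is the worrying answer, False means "No" is


# The four questions from the list; the bad_answer values are an assumption.
CHECKS = [
    HealthCheck("Are all customers able to access and configure settings in the Silverline Portal?", bad_answer=False),
    HealthCheck("Are customers reporting loss of traffic or blocked hosts in any of our datacenters?", bad_answer=True),
    HealthCheck("Are we able to stop current attacks on our customers within SLA?", bad_answer=False),
    HealthCheck("Are several customers affected?", bad_answer=True),
]


def should_trigger(answers, manual_request=False):
    """Return True if the incident response process should be started.

    answers: one boolean per entry in CHECKS (True = "Yes", False = "No").
    manual_request: True when an analyst or manager asks for a co-ordinated response.
    """
    if len(answers) != len(CHECKS):
        raise ValueError("expected one answer per check")
    # Any single worrying answer is enough to trigger.
    automated = any(answer == check.bad_answer for check, answer in zip(CHECKS, answers))
    # A human request always triggers, regardless of what monitoring says.
    return automated or manual_request
```

For example, `should_trigger([True, False, True, False])` describes a healthy system, while flipping any one answer, or passing `manual_request=True`, starts the process.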
Is a response required?
If you are unsure whether a response is required, trigger our incident response process. All you need to do to start the process is send an email to firstname.lastname@example.org.
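If you want to start the process from a script rather than a mail client, a minimal sketch using Python's standard library might look like the following. The function name, subject line, and message body are hypothetical; the process itself only requires that an email reaches the address above.

```python
from email.message import EmailMessage


def build_trigger_email(sender, recipient, summary):
    """Build (but do not send) an email asking for incident response.

    sender, recipient: email addresses as plain strings.
    summary: a one-line description of what you are seeing.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    # The subject format is an illustrative assumption, not a required convention.
    msg["Subject"] = f"Incident response requested: {summary}"
    msg.set_content(
        "Requesting co-ordinated incident response.\n\n"
        f"Summary: {summary}\n"
    )
    return msg
```

The resulting message can be handed to `smtplib.SMTP(host).send_message(msg)`, where `host` is whatever mail relay is available to you (an assumption, since this document does not name one).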
Our severity definitions determine how severe we think an incident is, based on some pre-defined guidelines. The intent is to guide responders on the type of response they can provide. For example, the higher the severity, the riskier the decisions you can take to return the system to normal.
Severities are useful to quickly determine whether something requires a more complex response, or whether it requires a co-ordinated response at all. However, they are not a black-and-white definition of what constitutes a major incident. If something is not covered by our severity definitions, but you think it requires incident response, then it requires incident response. We only need to know one thing: "Is this a major incident?" The severity level can be determined later and isn't a requirement for triggering our response process.