Service Assurance - the Knowledge of Power

Image Credit: Sashkin/www.Bigstockphoto.com

We have all become highly reliant on power and especially power in the form of electricity - both at home and at work. Our heating and lighting, our entertainment devices and our communications devices are all dependent on power from electricity. At work we are also reliant on electricity to power our laptops, our phones, the computer networks, servers, machines and even “the cloud”. We all take this ubiquitous power for granted. We anticipate that with a flick of a switch something happens. Further, once on, we expect it to stay on and always be 100% available.

However, at any one time there are power outages somewhere. Some of these are ‘planned’, e.g. for maintenance reasons or power ‘shedding’, while others are ‘unplanned’, caused by technical failures, storm damage and so on. An outage may affect a single home or a wider community, and it hits businesses and public services such as hospitals indiscriminately. Unless the home or business has an alternative back-up power source, all mains-powered equipment stops working. Mobiles and laptops with internal batteries will run for a few more hours but may still lose communications.

In the business world, all IT equipment and services need electricity. Without it, the business service is no longer available to the organisation and its users.

IT is a complex machine with interconnected systems including laptops, wi-fi access points, data networks and routers, the “internet” and user applications that are delivered from the cloud or enterprise servers. It is a vulnerable chain of systems that cannot operate without power or the services of the upstream systems.

When procuring these critical services, it makes sense to set an appropriate expectation of the service's performance and availability. For example, the service level agreement (SLA) may specify 99.9% availability and a 4-hour restoration time.
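To make the 99.9% figure concrete, the short sketch below converts an availability target into a downtime allowance. The measurement periods (a 30-day month and a 365-day year) are assumptions for illustration; a real SLA defines its own measurement period.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float) -> float:
    """Maximum downtime, in minutes, permitted by an availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Assumed measurement periods; the contract would state the real one.
monthly_minutes = allowed_downtime_minutes(99.9, 30)
yearly_minutes = allowed_downtime_minutes(99.9, 365)

print(f"Monthly allowance: {monthly_minutes:.1f} minutes")
print(f"Yearly allowance:  {yearly_minutes / 60:.2f} hours")
```

Under these assumptions the allowance is roughly 43 minutes per month or just under 9 hours per year. A single outage that takes the full 4 hours (240 minutes) to restore would therefore exceed a monthly 99.9% target while still fitting within a yearly one, which is why the measurement period matters as much as the headline percentage.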

Figure 1: Example of the relationships between users, IT components and the power network

In contrast, electricity suppliers do not make any such assurances on continuity of their service or restoration times in the event of an outage.

In a well-implemented IT Service Management (ITSM) system, some of the inter-relationships and dependencies may be mapped as service relationships within the Configuration Management Database (CMDB). A benefit of this mapping is that if an IT device fails, Service Assurance can quickly identify the relationships and therefore the root cause of the Incident.

IT failures have many causes, including human error, configuration issues, hardware failures and power interruptions or outages. It is therefore important to discover the underlying root cause of an Incident quickly, minimising disruption to the business and its customers and, in a major outage, limiting potential reputational or brand damage.
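As a rough illustration of how such a mapping helps, the sketch below models a deliberately simplified set of CMDB relationships as a dependency map and looks for the upstream items that all failed devices have in common. The configuration item names and the map itself are hypothetical and do not represent any particular ITSM product's data model.

```python
from collections import deque

# Hypothetical CMDB relationships: each configuration item (CI) maps to the
# upstream CIs it depends on. Site power is modelled as a CI like any other.
DEPENDS_ON = {
    "laptop-01": ["wifi-ap-01"],
    "wifi-ap-01": ["switch-01", "site-a-power"],
    "switch-01": ["router-01", "site-a-power"],
    "router-01": ["site-a-power"],
    "server-01": ["switch-02", "site-a-power"],
    "switch-02": ["site-a-power"],
    "site-a-power": [],
}


def upstream(ci):
    """Return every CI that `ci` depends on, directly or indirectly."""
    seen = set()
    queue = deque(DEPENDS_ON.get(ci, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(DEPENDS_ON.get(node, []))
    return seen


def shared_dependencies(failed_cis):
    """CIs that every failed item depends on: candidate root causes."""
    sets = [upstream(ci) for ci in failed_cis]
    return set.intersection(*sets) if sets else set()


candidates = shared_dependencies(["wifi-ap-01", "server-01"])
print(candidates)
```

In this example the only dependency shared by the failed access point and the failed server is the site power supply, so power becomes the first candidate to check. The approach only works if power is recorded in the CMDB in the first place, which is an assumption here.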

Managed services

Where an IT service is managed by a third-party supplier, the portfolio of services provided will carry the contractual assurances mentioned above, e.g. availability and restoration SLAs, amongst many others. However, these SLAs will normally be restricted to the service covered by the contract itself. Any service outages that are not within the scope of the managed services contract will be excluded from the contractual assurances provided.

Unless power is part of the contracted portfolio of services, any outages caused by power would be excluded from the SLA arrangements and, where the contract includes a service credit scheme, from its financial penalties.

Note that continuous availability is an approach or design to achieve 100% availability. A continuously available IT service has no ‘planned’ or ‘unplanned’ downtime. However, this is rarely achieved, and allowances must be made.

This raises an important point that stakeholders and users of a business service need to understand. Although a third-party company may offer a resilient, high-availability service, events can still make the service unavailable, for example planned engineering works, failures at third-party suppliers and, of course, power outages.

Because the service performance measures will exclude outages which are either ‘planned’ or outside of the scope of the contracted services, this can lead to a perception of poor service performance despite the commercial SLAs being satisfactory.
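The gap between contractual and perceived performance can be shown with a small worked example. All figures below are hypothetical: a 30-day measurement period and three outages, only one of which falls within the scope of the contract.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float
    cause: str
    counts_against_sla: bool  # False for planned or out-of-scope outages


PERIOD_MINUTES = 30 * 24 * 60  # assumed 30-day measurement period

# Hypothetical outage log for one month.
outages = [
    Outage(20, "device fault", True),
    Outage(120, "planned engineering works", False),
    Outage(180, "site power outage", False),
]


def availability_pct(downtime_minutes, period_minutes=PERIOD_MINUTES):
    """Availability as a percentage of the measurement period."""
    return 100 * (1 - downtime_minutes / period_minutes)


contractual = availability_pct(
    sum(o.minutes for o in outages if o.counts_against_sla)
)
experienced = availability_pct(sum(o.minutes for o in outages))

print(f"Contractual availability: {contractual:.2f}%")
print(f"Experienced availability: {experienced:.2f}%")
```

With these numbers the supplier reports about 99.95% and meets a 99.9% target, while users actually experienced about 99.26%. Both figures are correct; they simply measure different things.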

Service assurance dilemma

There is a further problem. If a business service has an ‘unplanned’ outage caused by a specific device that cannot be reached for management and troubleshooting, how can an Incident Manager know whether the device has a genuine fault or has simply lost power?

Here there is a service management dilemma. Does the Incident Manager wait and see what happens, i.e. hope that the device returns to normal working status, or request that a field engineer be dispatched to investigate the incident?

Waiting to see if the device does come back on-line could delay getting the service restored and potentially jeopardise one or more of the SLAs. Alternatively, dispatching a field engineer is not without cost and could impact another incident that the engineer could have been dealing with.

When the power is restored and the device reboots, the Incident may be closed without the correct cause of the outage being recorded. It would not be surprising for a significant proportion of Incident reports to be closed as ‘right when tested’, ‘no fault found’, ‘false alarm’ or something equally ambiguous. Unless it can be proven otherwise, most of these Incident records will count against the service provider, which again impacts the perceived service performance and SLAs.

Irrespective of the Incident closure code recorded, it is hypothesised that a significant percentage of Incidents may be caused by power outages.

Figure 2: Example breakdown of Incident closure codes

This hypothesis may be disputed; however, the example in Figure 2 shows that in one month 30% of Incidents were found to be power related. In the same month a further 24% were recorded as False Alarms and an additional 6% as No Fault Found. It is possible that a significant proportion of these ‘unknown’ cause Incidents were also power related. Interestingly, in this example, only 40% were genuine device or network issues.
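The arithmetic behind the example can be laid out explicitly. The percentages are those quoted for the example month; treating every ambiguous closure as potentially power related gives an upper bound, not a measured result.

```python
# Closure code percentages quoted for the example month.
closure_codes = {
    "Power related": 30,
    "False Alarm": 24,
    "No Fault Found": 6,
    "Genuine device or network issue": 40,
}

# Codes whose true cause is unknown and might have been power.
ambiguous = ("False Alarm", "No Fault Found")

confirmed_power = closure_codes["Power related"]
possible_power = confirmed_power + sum(closure_codes[c] for c in ambiguous)

print(f"Confirmed power related: {confirmed_power}%")
print(f"Upper bound if all ambiguous closures were power: {possible_power}%")
```

The share of power-related Incidents in this example therefore lies somewhere between 30% and 60%.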

Is it possible that 60% of all Incidents could be power related and if so, what solutions are available?

The case for AIOps

AIOps and automation can be deployed alongside existing service management tools to help detect and manage a significant proportion of Incidents, reducing the volume handled by Incident Managers. In some organisations, targeting power outages would be an ideal first step for AIOps, enabling Incident Managers to focus on the genuine IT issues that need human attention.

Using AIOps to target specific Incidents could lead to a more cost-effective IT service model and, at the same time, improvements to service performance. With careful analysis, this AIOps approach could then be extended to other types of Incidents and even other Service Assurance functions.
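As a minimal sketch of what a first automated rule might look like, the example below flags a site as a probable power outage when several devices at that site become unreachable within a short time window. This heuristic, the 60-second window, the two-device threshold and the alarm fields are all assumptions made for illustration; they are not a description of any specific AIOps product or of a proven detection method.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Alarm:
    device: str
    site: str
    down_at: float  # time the device became unreachable, in seconds


WINDOW_SECONDS = 60  # assumed correlation window
MIN_DEVICES = 2      # assumed number of devices needed to suspect power


def probable_power_outages(alarms):
    """Map each suspect site to the largest group of devices lost together."""
    by_site = defaultdict(list)
    for alarm in alarms:
        by_site[alarm.site].append(alarm)

    suspects = {}
    for site, site_alarms in by_site.items():
        site_alarms.sort(key=lambda a: a.down_at)
        start = 0
        for end in range(len(site_alarms)):
            # Shrink the window until it spans at most WINDOW_SECONDS.
            while site_alarms[end].down_at - site_alarms[start].down_at > WINDOW_SECONDS:
                start += 1
            devices = {a.device for a in site_alarms[start:end + 1]}
            if len(devices) >= MIN_DEVICES and len(devices) > len(suspects.get(site, [])):
                suspects[site] = sorted(devices)
    return suspects


alarms = [
    Alarm("router-01", "site-a", 0),
    Alarm("switch-01", "site-a", 10),
    Alarm("ap-01", "site-a", 25),
    Alarm("router-02", "site-b", 500),
]
suspects = probable_power_outages(alarms)
print(suspects)
```

Here site-a is flagged because three devices dropped within 25 seconds of each other, while the single alarm at site-b is left for an Incident Manager to assess. A flagged site is only a candidate: a shared upstream network failure would produce the same pattern, so a real deployment would need a further signal to confirm loss of power.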

Author

Andrew Catchpole is an innovator, and the founder of Poweye. With a background in telecommunications and managed network services, Andrew is focused on developing innovative solutions for IT service management problems.
