By Andrew Vermes, Kepner-Tregoe 

Over the past few years, cloud services have greatly expanded the IT capabilities available to support companies’ diverse business needs. Home-grown software is being replaced by SaaS, and company-run data centers are being replaced by IaaS and PaaS offerings. While the move to the cloud brings significant benefits in business functionality, scalability and reduced capital costs, managing these environments to provide service assurance to users can be challenging.

One area where IT Service Management (ITSM) teams often struggle is diagnosing issues and their causes when the symptoms, data and impacts extend beyond the boundaries of the company. Doing root cause analysis (RCA) in the cloud requires looking at the IT environment differently, relying more heavily on data as an analysis tool, and knowing when to bring partners into the conversation. Here are five tips for performing root cause analysis in the cloud.

1. Embrace Automation

With cloud services, you typically don’t have access to source code for debugging, and you can’t physically touch most of the devices in the environment either. Monitoring and diagnosing cloud environments requires becoming a skilled user of automation to serve as your “eyes and ears”. Most cloud services have their own administrative tools that can help you understand what is going on inside the service itself. However, external monitoring and diagnostic capabilities may be necessary for tracking the availability and performance of services as end users experience them.

An example of using monitoring tools for investigation is:

Users are affected by extreme slowness in a core application. Changes were implemented over the weekend, and incident managers are naturally drawn to that as a possible cause.

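However, a look at load times in Citrix shows that the excessive latency is confined to roaming users. Because users on other connections are unaffected, the weekend’s application changes are an unlikely cause, and the investigation can focus on what is different about roaming access.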
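The kind of comparison used in this example can be sketched in a few lines of Python. The sample records, the segment names and the 2x threshold are all hypothetical and do not come from any Citrix tool; the sketch only illustrates how splitting one latency measure by user group can narrow down where to look.

```python
from collections import defaultdict
from statistics import median

# Hypothetical load-time samples (seconds), tagged by how the user connects.
SAMPLES = [
    ("office", 1.1), ("office", 1.3), ("office", 1.2),
    ("vpn", 1.4), ("vpn", 1.2), ("vpn", 1.5),
    ("roaming", 6.8), ("roaming", 7.4), ("roaming", 5.9),
]


def median_by_segment(samples):
    """Group load times by user segment and return the median for each."""
    grouped = defaultdict(list)
    for segment, seconds in samples:
        grouped[segment].append(seconds)
    return {segment: median(values) for segment, values in grouped.items()}


def slow_segments(samples, factor=2.0):
    """Return segments whose median exceeds `factor` times the overall median."""
    overall = median(seconds for _, seconds in samples)
    medians = median_by_segment(samples)
    return sorted(s for s, m in medians.items() if m > factor * overall)


if __name__ == "__main__":
    print(median_by_segment(SAMPLES))
    print(slow_segments(SAMPLES))
```

With this sample data only the roaming segment is flagged. One caveat of this simple approach: if the slow group made up most of the samples, the overall median would be skewed, so a baseline taken from a known-good period would likely be a better reference.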

2. Leverage Partners

When you use cloud services, you aren’t just taking a dependency on the technology: you are extending your service operations to include the supplier organizations that provide and manage the services. When a problem requires diagnosis and troubleshooting, the cloud provider should be there to assist. Taking advantage of these resources requires you to do two things differently than before. The first (and the most challenging for most companies) is acknowledging that troubleshooting is no longer an individual activity but a team effort. You need to understand who is on the team and how to engage with them. The second is understanding your Service Level Agreements (the formal contracts with suppliers) to ensure that suppliers are committed to providing the responsiveness and resources your company needs.

Partners have an interest in helping you: a long-running incident is not only a nuisance to your users, it also consumes your partner’s time in diagnosing the fault. The more effectively the whole partner ecosystem cooperates, the better for everyone.

Sometimes this needs a little contractual push: instead of focusing solely on availability and numeric measures, require your service providers to furnish a root cause for every significant outage. Knowing that they have to provide a detailed, credible explanation will influence the way they treat incident investigations for the better.

3. Manage Service Interfaces

Cloud services are meant to be treated as “black boxes”, with the details of what goes into providing the service visible only to the service providers and obscured from your view. This can be a good thing, as it makes your IT environment much less complex. For some IT staff members, though, not seeing how things work can be frustrating. The key is learning to focus on managing the scope and interfaces of the service: understanding what is going into the box, what is coming out of it, and which functions are expected to be performed within the service. Managing service interfaces may require some changes to your company’s notion of what configuration items belong in your CMDB, what needs to be monitored, and how SLAs should be structured.

4. Understand What Services Are Made Of

Not being able to see the detailed inner workings of a cloud service doesn’t eliminate the need for a basic understanding of what the services you use are composed of. Most cloud services depend on underlying technology, connectivity from external service providers and other cloud services (like hosting or data storage). It is important to understand, at a high level, what these dependencies are even if you don’t manage them directly. They still present a potential cause of failure that needs to be considered in the root cause analysis process.
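A high-level dependency map does not need to be elaborate to be useful during an investigation. The sketch below assumes a hand-maintained map with invented service names (`crm-saas`, `identity-provider` and so on are purely illustrative) and lists every direct and indirect dependency of a service as a candidate cause to rule in or out.

```python
# Hypothetical dependency map: each service lists what it directly relies on.
DEPENDENCIES = {
    "crm-saas": ["identity-provider", "cloud-storage", "internet-link"],
    "identity-provider": ["internet-link"],
    "cloud-storage": ["hosting-provider"],
    "hosting-provider": [],
    "internet-link": [],
}


def candidate_causes(service, dependencies):
    """Return every direct and indirect dependency of `service`.

    Uses an iterative depth-first walk and tolerates cycles in the map.
    """
    seen = []
    stack = list(dependencies.get(service, []))
    while stack:
        current = stack.pop()
        if current in seen or current == service:
            continue
        seen.append(current)
        stack.extend(dependencies.get(current, []))
    return sorted(seen)


if __name__ == "__main__":
    for name in candidate_causes("crm-saas", DEPENDENCIES):
        print(name)
```

In practice the map would more likely come from a CMDB or from supplier documentation than from a hard-coded dictionary; the traversal stays the same.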

5. Don’t Forget About Connectivity

When using cloud services, you need to pay special attention to the connectivity components that enable users and administrators to access the services. It’s great if the service is up and running, but if you can’t get to it, you still have a problem. The same tip applies to monitoring and diagnostic tools. If the only tools you have available are hosted by the service provider, you may not be able to access them in the event of a connectivity issue.
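One way to avoid relying solely on provider-hosted tools is to keep a small reachability check that runs from your own machines. The following is a minimal sketch using only the Python standard library; the two vantage points ("inside" and "outside" your network) and the wording of the hypotheses are assumptions made for illustration, and the host name in the usage lines is a placeholder.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def classify(reachable_from_inside, reachable_from_outside):
    """Turn results from two vantage points into a first hypothesis."""
    if reachable_from_inside and reachable_from_outside:
        return "service reachable: look beyond connectivity"
    if reachable_from_outside:
        return "reachable externally only: suspect your own network path"
    if reachable_from_inside:
        return "reachable internally only: suspect the external vantage point or its route"
    return "unreachable from both: suspect the provider or a shared dependency"


if __name__ == "__main__":
    # Placeholder host; the second result would come from a probe run elsewhere.
    inside = can_connect("service.example.com", 443)
    print(classify(inside, reachable_from_outside=True))
```

A TCP connection succeeding only shows that the network path and the listening port are available; it says nothing about whether the application behind it is healthy, so this check complements rather than replaces service-level monitoring.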

Cloud services are one of the biggest advances in the IT industry over the past 5 years and provide tremendous productivity and cost-saving potential for companies that use them.  They do require your IT service management staff to think differently about how they manage, monitor and repair issues when they occur.

Kepner-Tregoe has been an industry leader in problem-solving and root cause analysis processes and techniques for over 60 years – helping companies achieve Service Excellence.  To learn more, visit www.kepner-tregoe.com
