Often, the CxO community of large multinational firms approaches THBS seeking help not just in planning the maintenance and life-cycle management of their IT estate (in other words, managed services), but also in implementing such managed services using scientific techniques and frameworks that guarantee a phased RoI. A major part of any such managed-services initiative is ‘application support’, which is the subject of this post.

Figure: Bank Branch Counters Framework

Typically, we start with a ‘knowledge sharing’ initiative to help our clients understand the relationship between support models and service levels. The idea is to clearly model the relationship between required service levels and cost, so that application owners can make an informed decision on service levels versus cost. The main cost of IT application support today is directly proportional to the number of support staff deployed. This is very unlike a cloud hardware model, where resources can be ramped up or down based on real-time demand. In this article, we explore the characteristics of a support model and its similarities to the waiting-line situations we face in everyday life, and we apply queuing theory to support optimal decision making.

We can visualize a stream of issues or incident tickets being raised by application owners or end users to the application support team. The support team consists of a set of “servers” who resolve each issue and change the incident status to “fixed”. However, not every incident can be attended to immediately, since all the support staff may be busy with existing incidents. In that case, the new incident joins a queue.

Queues or waiting lines form because people or things arrive at the servicing function, or server, faster than they can be served. However, this does not mean that the service operation is understaffed or lacks the overall capacity to handle the influx of customers. In fact, most businesses and organizations have sufficient serving capacity to handle their customers in the long run. Waiting lines result because customers do not arrive at a constant, evenly paced rate, nor are they all served in an equal amount of time. Customers arrive at random times, and the time required to serve each of them is not the same. Thus, a waiting line continually increases and decreases in length (and is sometimes empty), while in the long run the system settles to an average rate of customer arrivals and an average time to serve a customer.

For example, the checkout counters at a grocery store may have enough clerks to serve an average of 100 customers in an hour, while in any particular hour only 60 customers might arrive. However, at specific points during the hour, waiting lines may still form because more customers than average arrive at once and they make more purchases than average.

In this blog, we discuss the metrics that are critical to determining the relationship between service levels and cost to serve (or staffing levels), and describe a “discrete event simulation” model of incident management, so that appropriate staffing levels can be arrived at for the required service levels.

An example:

When a banking customer visits a branch, a dispatcher helps categorize the request and directs him or her to the right counter, where the request is serviced.

If we map the customer experience in this scenario, the customer is interested in the below metrics:

Metric | Description
Arrival time | What time did I arrive?
Service start time | At what time did the bank start processing my request?
Waiting time | The difference between Service start time and Arrival time
Service time | How long did the counter take to process my request?
Completion time | At what time was my request processed completely?
Time in system | How much time did I spend at the branch?
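The metrics in this table are related by simple arithmetic. The sketch below illustrates this for a single visit; the class name `Visit`, its field names and the example values are hypothetical, and all times are assumed to be in minutes from the start of the clock.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    """One customer visit; all times are in minutes (illustrative names)."""
    arrival_time: float
    service_start_time: float
    service_time: float

    @property
    def waiting_time(self) -> float:
        # Time spent in the queue before a counter became free.
        return self.service_start_time - self.arrival_time

    @property
    def completion_time(self) -> float:
        # Clock time at which the request was fully processed.
        return self.service_start_time + self.service_time

    @property
    def time_in_system(self) -> float:
        # Total time spent at the branch: waiting plus service.
        return self.completion_time - self.arrival_time


# Hypothetical visit: arrives at minute 10, is served from minute 14 for 6 minutes.
visit = Visit(arrival_time=10.0, service_start_time=14.0, service_time=6.0)
print(visit.waiting_time, visit.completion_time, visit.time_in_system)
```

Note that time in system always equals waiting time plus service time, which is why reducing either one improves the customer's overall experience.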

 

Additionally, while a client is interested in his or her own metrics, a bank may process hundreds of requests each hour. Hence, it needs to consider a few additional metrics that summarize the experience of a group of clients.

Metric | Description
Average time in system | Across a group of clients who came into the branch, what was the average time they spent in the branch?
Maximum time in system | What was the longest time spent in the branch among all the clients who came to the branch?
Number waiting | The number of clients who had to wait before a counter was free to service them
Probability of waiting | If a client walks into the branch, what is the probability that the client has to wait before a counter is free to service him or her?
Average waiting time | On average, how long did a customer have to wait before a counter was free to service him or her?
Maximum waiting time | What was the longest time a customer had to wait before a counter was free to service him or her?
Average service time | On average, how long did a counter take to service a request?
Maximum service time | What was the longest time a counter took to service a request?
Average counter utilization | On average, what percentage of the time was a counter occupied with servicing a request?
Probability of time in system < X min (threshold) | What is the probability that a client can enter and exit the branch (with the request serviced) within X min?
Cost of waiting | What does it cost the bank when a customer waits, in terms of lost opportunities, lost goodwill, lost brand image, etc.?
No. of servers | How many servers are needed per category of service and in total?
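Most of these summary metrics can be computed directly from a log of individual visits. The following is a minimal sketch under stated assumptions: the function name `summarize`, the dictionary keys and the sample data are hypothetical, and counter utilization is assumed to mean total service time divided by the total counter time available between the start of the clock and the last completion.

```python
def summarize(visits, num_counters, threshold):
    """Summarize a list of (arrival, service_start, service_time) tuples.

    All times are in minutes from the start of the clock.
    """
    n = len(visits)
    waits = [start - arrival for arrival, start, _ in visits]
    services = [service for _, _, service in visits]
    in_system = [w + s for w, s in zip(waits, services)]
    # Elapsed time: from the start of the clock to the last completion.
    elapsed = max(start + service for _, start, service in visits)
    number_waiting = sum(1 for w in waits if w > 0)
    return {
        "average_time_in_system": sum(in_system) / n,
        "maximum_time_in_system": max(in_system),
        "number_waiting": number_waiting,
        "probability_of_waiting": number_waiting / n,
        "average_waiting_time": sum(waits) / n,
        "maximum_waiting_time": max(waits),
        "average_service_time": sum(services) / n,
        "maximum_service_time": max(services),
        # Assumed definition: busy time over available counter time.
        "average_counter_utilization": sum(services) / (num_counters * elapsed),
        "probability_within_threshold": sum(1 for t in in_system if t < threshold) / n,
    }


# Hypothetical log for a branch with two counters.
visits = [(0, 0, 5), (2, 2, 6), (3, 5, 4), (10, 10, 3)]
stats = summarize(visits, num_counters=2, threshold=6)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

The cost of waiting and the number of servers are not derived from the log; they are inputs to the cost trade-off discussed later in this post.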

 

The bank’s objective is to provide the best customer experience possible. The way to do this is to minimize waiting time and service time. Assuming that the service time at individual counters is already fully optimized, the remaining way to reduce waiting time (and hence time in system) is to increase the number of counters per category of service. However, this increases the bank’s cost to serve. Hence, there is a need to understand both the customer’s service expectations and the cost constraints involved in serving the customer at that level of expectation. Once these two criteria are understood, trade-offs can be studied and optimal staffing levels (optimized for cost and service levels) can be arrived at.

At this stage, the similarities between IT Application Support and the above bank counter example are clearly noticeable. In IT Application Support, the “request” takes the form of an “incident ticket”, the dispatcher is the first level of support, and incident resolution is provided by support resources who specialize in specific categories of applications. In IT Application Support, the cost of waiting is easier to quantify than in the banking example, since the revenue generated by the application during a time interval can be quantified using historic data.

Let us assume that the client’s IT estate is grouped into the below layers (or domains):

  • User interaction layer consisting of web, mobile, USSD, SMS, kiosks and other similar interface applications
  • Business process management layer
  • Middleware and services or APIs for integration
  • Data warehouse & business intelligence layer
  • Core-platforms (Core-banking systems for banks, billing and network elements in communication service providers, etc.)

If we start designing a support model for these layers, the basic premise is that the support staff skillset needed for each layer is different. The visual model starts to look very similar to other day-to-day queue scenarios, such as the above bank example and hundreds of others, including hospitals and government citizen services.


Figure: IT Application Support Framework


The same metrics we discussed in the bank counter example are also applicable in the case of IT Application Support.

The cost trade-off relationship is summarized in the graph below. As the level of service increases, the cost of service goes up and the waiting cost goes down. The sum of these two costs gives a total cost curve, and the level of service that should be maintained is the level at which this total cost curve is at its minimum. (However, this does not mean we can determine an exact minimum-cost solution, because the service and waiting characteristics we can determine are averages and thus uncertain.)


Figure: The Cost Trade-Off Relationship
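The trade-off can be illustrated with a small calculation. The sketch below is purely illustrative: every figure in it (cost per resource, waiting cost per minute, incident volume and the average waiting time at each staffing level) is a hypothetical assumption, not data from an actual engagement. In practice, the waiting times per staffing level would be estimated from a model or from historic data.

```python
# All figures below are hypothetical and for illustration only.
COST_PER_RESOURCE = 100.0        # cost of one support resource per day
WAITING_COST_PER_MINUTE = 2.0    # business cost per minute an incident waits
INCIDENTS_PER_DAY = 50

# Assumed average waiting time (minutes per incident) at each staffing level.
avg_wait_by_staff = {1: 30.0, 2: 8.0, 3: 2.5, 4: 1.0, 5: 0.4}


def total_cost(staff):
    """Total daily cost = cost of service + cost of waiting."""
    service_cost = staff * COST_PER_RESOURCE
    waiting_cost = avg_wait_by_staff[staff] * INCIDENTS_PER_DAY * WAITING_COST_PER_MINUTE
    return service_cost + waiting_cost


for staff in sorted(avg_wait_by_staff):
    print(f"staff={staff}  total cost={total_cost(staff):.0f}")

# The staffing level to maintain is the one with the lowest total cost.
best = min(avg_wait_by_staff, key=total_cost)
print("lowest total cost at staffing level:", best)
```

With these assumed figures, adding staff beyond the minimum point costs more in service than it saves in waiting, which is the shape of the total cost curve described above.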


In general, in IT Application Support outsourcing scenarios, the below metrics are of primary importance in service level agreements (SLAs):

Metric | Description
Response time | Time taken to acknowledge an incident ticket
Restoration time (or initial resolution time) | Time taken to restore an application to working condition through work-around fixes
Final resolution time | Time taken to provide a permanent fix for the issue causing the incident
Mean time between failures | Average time between two consecutive incidents


In this blog, we confine ourselves to restoration time, since the maximum risk for both the IT Application Support provider and the client is associated with this metric. In the present context, restoration time is the same as time in system, and its average is the average time in system. Also, daily standard checks and dip-checks are not considered in the modelling, since the effort involved in them is generally fixed; only the incident-driven effort is variable.

The best way to arrive at optimal staffing levels (and cost) for the expected service levels is to model the application support scenario and to vary the inputs until an optimal result is arrived at. Modelling involves understanding the rate at which incidents arrive, the time taken to service them, the number of servers and the costs associated with them.

DISCRETE EVENT SIMULATION MODEL

In this model, we simulate the arrival and servicing of 1000 incidents. We assume a two-member support team (for one category of applications). However, the model can be extended to a larger support team, using macros or any programming language, by applying the same logic described for the two-member team.

The inputs to this model are:

Input | Description | Category
Minimum inter-arrival time | Minimum time between two incidents based on historic data | Incident arrival, based on uniform distribution
Maximum inter-arrival time | Maximum time between two incidents based on historic data | Incident arrival, based on uniform distribution
Mean service time | Average time taken to service an incident based on historic data | Incident resolution, based on normal distribution
Service time standard deviation | Deviation from the mean for service time as per normal distribution; can be assumed to be 30% of the mean service time | Incident resolution, based on normal distribution


Arrival times and service times are generated using a random number generator, based on the probability distributions and inputs listed in the above table. The clock runs in minutes and is assumed to start at 0 minutes; all time measures are in minutes.
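As a rough sketch of how these random inputs might be generated, the snippet below uses Python's standard `random` module. The function name `generate_incidents`, its parameter names and the example values are hypothetical. One assumption goes beyond the description above: since a normal distribution can occasionally produce a negative value, service times are clamped at zero.

```python
import random


def generate_incidents(n, min_gap, max_gap, mean_service, sd_service=None, seed=None):
    """Return a list of (arrival_time, service_time) pairs, in minutes."""
    rng = random.Random(seed)  # a fixed seed makes a run reproducible
    if sd_service is None:
        # Default taken from the inputs table: 30% of the mean service time.
        sd_service = 0.3 * mean_service
    clock = 0.0
    incidents = []
    for _ in range(n):
        # Inter-arrival time follows a uniform distribution.
        clock += rng.uniform(min_gap, max_gap)
        # Service time follows a normal distribution (clamped at zero: an assumption).
        service = max(0.0, rng.gauss(mean_service, sd_service))
        incidents.append((clock, service))
    return incidents


# Hypothetical inputs: incidents arrive 5 to 25 minutes apart, mean service time 25 minutes.
sample = generate_incidents(5, min_gap=5, max_gap=25, mean_service=25, seed=1)
for arrival, service in sample:
    print(f"arrival={arrival:7.2f}  service={service:6.2f}")
```

Each arrival time is the running sum of the inter-arrival times generated so far, which matches the minute clock starting at 0.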

Simulation of the first 2 incidents is described below:

  • Incident 1
    • Incident 1 arrives X minutes after the start of the clock (X is generated by the random generator for Inter-Arrival Time)
    • Hence, Arrival Time is X minutes
    • Since both support resources are idle, Service Start Time is the same as the Arrival Time, i.e., X minutes. (In general, if both resources are busy when an incident arrives, Service Start Time is the lower of the two resources' Time Available values after the previous incident.)
    • Resource 1 is assigned to the incident (on a policy basis only, since both resources are idle)
    • Hence, Waiting Time is zero (in general, if both resources are busy, Waiting Time is the lower of the two resources' Time Available values minus the Arrival Time of the incident)
    • Service Time is generated using the random generator
    • Completion Time is Service Start Time + Service Time
    • Time in System is Completion Time – Arrival Time
    • After this incident is processed, Resource 1 is available to take up the next incident at Completion Time. Hence, Resource 1 – Time Available equals the Completion Time.
    • Resource 2 is idle and is available to take up an incident at any time after the start of the clock. Hence, Resource 2 – Time Available equals 0.
  • Incident 2
    • Incident 2 arrives Y minutes after Incident 1 (Y is generated by the random generator for Inter-Arrival Time)
    • Hence, Arrival Time is X+Y minutes
    • A resource is assigned to this incident based on the lower Time Available after the previous incident. Since Resource 2 has the lower Time Available (zero minutes), Resource 2 is assigned.
    • Since Resource 2 – Time Available is zero, which is earlier than the Arrival Time, Service Start Time is the same as the Arrival Time. Otherwise, Service Start Time would be the lower of the two resources' Time Available values after Incident 1.
    • Hence, Waiting Time is zero (otherwise, Waiting Time would be the lower of the two resources' Time Available values after Incident 1 minus the Arrival Time of Incident 2)
    • Service Time is generated using the random generator
    • Completion Time is Service Start Time + Service Time
    • Time in System is Completion Time – Arrival Time
    • After this incident is processed, Resource 2 is available to take up the next incident at Completion Time. Hence, Resource 2 – Time Available equals the Completion Time.
    • Resource 1 did not take up this incident, but was working on Incident 1. Hence, Resource 1 – Time Available remains equal to Resource 1 – Time Available at the end of Incident 1.
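The walk-through above can be sketched in code as follows. This is a minimal illustration rather than the spreadsheet model itself: the function name `simulate`, the dictionary keys and all input values are hypothetical, ties between idle resources are broken in favour of the lower-numbered resource (the "policy" assumed for Incident 1), and service times are clamped at zero as an added assumption.

```python
import random


def simulate(incidents, num_resources=2):
    """Process (arrival_time, service_time) pairs in arrival order; times in minutes."""
    time_available = [0.0] * num_resources  # when each resource is next free
    rows = []
    for arrival, service in incidents:
        # Assign the resource with the lowest Time Available (ties: lowest number).
        r = min(range(num_resources), key=lambda i: time_available[i])
        # Service starts on arrival if that resource is free, else when it frees up.
        start = max(arrival, time_available[r])
        completion = start + service
        time_available[r] = completion
        rows.append({
            "arrival_time": arrival,
            "resource": r + 1,
            "service_start_time": start,
            "waiting_time": start - arrival,
            "completion_time": completion,
            "time_in_system": completion - arrival,
        })
    return rows


# Small deterministic example to trace by hand (hypothetical values).
rows = simulate([(5, 20), (12, 10), (15, 8), (18, 5)])
for row in rows:
    print(row)

# A 1000-incident run with hypothetical inputs: uniform inter-arrival times of
# 5 to 25 minutes, normal service times with mean 25 and standard deviation 7.5.
rng = random.Random(42)
clock, incidents = 0.0, []
for _ in range(1000):
    clock += rng.uniform(5, 25)
    incidents.append((clock, max(0.0, rng.gauss(25, 7.5))))
results = simulate(incidents)
average = sum(row["time_in_system"] for row in results) / len(results)
print(f"average time in system: {average:.1f} minutes")
```

In the hand-traceable example, the first two incidents start immediately on Resources 1 and 2, while the third and fourth each wait 7 minutes for a resource to become free. Passing a different `num_resources` extends the same logic to a larger team, so staffing levels can be compared by rerunning the simulation.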

The full simulation of 1000 incidents can be provided on request.

 

About the Author

Arun Vasudeva Rao, PMP®, works as the Regional Manager - Technical Services at Torry Harris Business Solutions and is responsible for managing IT services engagements with telecom and banking majors in the African region.