Mining Invariant Relationships for Failure Analysis of Batch Software

Scuola Politecnica e delle Scienze di Base
Corso di Laurea Magistrale in Ingegneria Informatica
Mining Invariant Relationships for Failure Analysis of Batch Software Systems
tesi di laurea magistrale
Mining Invariant Relationships for Failure Analysis of
Batch Software Systems
Anno Accademico 2012/2013
relatori
Ch.mo Prof. Stefano Russo
Ch.mo Prof. Marcello Cinque
correlatori
Ch.mo Ing. Flavio Frattini
Ch.mo Dr. Santonu Sarkar, Infosys Ltd., Bangalore, INDIA
candidato
Agostino Savignano
Matr. M63000176
Scuola Politecnica e delle Scienze di Base
Corso di Laurea Magistrale in Ingegneria Informatica

This thesis is in part the result of a research work done during the
InStep internship experience with Infosys Ltd



Mining Invariant Relationships for Failure Analysis of Batch Software Systems
multinational provider of business consulting, information technology,
software engineering and outsourcing services.
160.000 employees, 8 billion revenues.
Next Gen Computing Labs, within the Infosys
campus of Bangalore, India.

June-September 2013
Scuola Politecnica e delle Scienze di Base
Corso di Laurea Magistrale in Ingegneria Informatica
Mining Invariant Relationships for Failure Analysis of Batch Software Systems
Context

A batch software system is a system where jobs run without end
user interaction, or can be scheduled to run as resources permit.
This class of systems shows a very repetitive behaviour. They are
often used:
 to process large amount of transaction data
 for scientific computation

as there may be failures in the execution of jobs, malfunctions
have to be detected and actions have to be taken accordingly;
manual detection is inefficient.

A feasible approach to automatically and dynamically characterise
the behaviour of a system, in order to detect execution problems,
is represented by flow intensity invariants.
Scuola Politecnica e delle Scienze di Base
Corso di Laurea Magistrale in Ingegneria Informatica
Mining Invariant Relationships for Failure Analysis of Batch Software Systems
A simple example of flow intensity invariant

A platform for the batch processing of files: each file goes through a
stage of elaboration. The processing time is proportional to the size

Measuring the file size and the time spent in a stage, I(x) and I(y), (the
flow intensities), the equation
I ( y )  k  I ( x)
invariant relationship characterising the expected behaviour of the
batch system.
 If there is an execution problem (e.g., file processing hangs) the equation
does not hold any more.
Scuola Politecnica e delle Scienze di Base
Corso di Laurea Magistrale in Ingegneria Informatica
Mining Invariant Relationships for Failure Analysis of Batch Software Systems
Contribution
 An approach for the automatic discovery of a set of flow
intensity invariant relationships in batch software systems,
useful to characterise the expected behaviour of the
system.
 A detection mechanism that helps the operations personnel in
identifying deviations of the actual system behaviour from the correct
expected one is also presented.
 The application of a tool implementing the approach,
MinerVa, in two real world batch systems:
 A SaaS application by Infosys *
 SCoPE, the Supercomputer of the Federico II University
* S. Sarkar, R. Ganesan, M. Cinque, F. Frattini, S. Russo, and A. Savignano,
“Mining Invariants from SaaS Application Logs”,
accepted for publication in the proceedings of the European Dependable Computing Conference,
May 2014, Newcastle upon Tyne, UK‫‏‬
Scuola Politecnica e delle Scienze di Base
Corso di Laurea Magistrale in Ingegneria Informatica
Mining Invariant Relationships for Failure Analysis of Batch Software Systems
Approach


Source of knowledge of the system behaviour: event logs
Objective
 automatically mine a set of invariant relationships between
time series measurements from system logs (learning phase)
 use such relationships for detecting deviations of the system
behaviour (detection phase)
Scuola Politecnica e delle Scienze di Base
Corso di Laurea Magistrale in Ingegneria Informatica
Mining Invariant Relationships for Failure Analysis of Batch Software Systems
Approach Data Flow
 Learning phase
 Off-line
 Past observations
 RESULTS: a set of linear
relationships,
representing a model of
the system
 Detection phase
 Either on-line or off-line
 Uses the model to
predict the behaviour of
a system on new
observations
Scuola Politecnica e delle Scienze di Base
Corso di Laurea Magistrale in Ingegneria Informatica
Mining Invariant Relationships for Failure Analysis of Batch Software Systems
Sampling: from Event Logs to Time Series
 PROBLEM How to retrieve time series from event logs
 SOLUTION With a certain time interval:
 retrieve the events in the application logs
 count the events, according to the different flow intensities
 the “sampling time” is determined using a tupling
algorithm, applied on particular events, e.g., job arrivals or
completions.
Scuola Politecnica e delle Scienze di Base
Corso di Laurea Magistrale in Ingegneria Informatica
Mining Invariant Relationships for Failure Analysis of Batch Software Systems
Mining: from Time Series to Invariants
 PROBLEM How to discover relationships between time series
 SOLUTION For each possible pair of flow intensities:

Perform a linear regression using:


AutoRegressive models with eXogenous inputs (ARX)
Recursive Least Squares (RLS) algorithm

Test the goodness of the regression:
 R2 coefficient of determination

All the pairwise relationships which have a R2 value greater than a
threshold are considered as linear :

their regressed coefficient parameters are considered invariants
Scuola Politecnica e delle Scienze di Base
Corso di Laurea Magistrale in Ingegneria Informatica
Mining Invariant Relationships for Failure Analysis of Batch Software Systems
Detection: from Invariants to Alerts
 PROBLEM How to detect a deviation from the expected behaviour
 SOLUTION Using the regressed coefficients to predict the
expected behaviour
 For each regressed invariant relationship:
 compute the expected output
 compare the expected output with the actual output
 If the difference between the expected and the actual output is
greater than an adaptive threshold, raise an alert.
 prediction interval based threshold
Scuola Politecnica e delle Scienze di Base
Corso di Laurea Magistrale in Ingegneria Informatica
Mining Invariant Relationships for Failure Analysis of Batch Software Systems
SaaS CPG application by Infosys
A platform provided as a service by means of virtualized machines that:
 Receives transaction details data daily from several subscribers all
around the world.
 Stores the data and performs analysis on the stored data.

The work items consisting of several records are provided as a file in a
specific format and with different frequencies by the various
customers.
Scuola Politecnica e delle Scienze di Base
Corso di Laurea Magistrale in Ingegneria Informatica
Mining Invariant Relationships for Failure Analysis of Batch Software Systems
SaaS CPG application by Infosys: Mining

9 months of observations


Tupling on job completions


Sampling time: 10 min
Identified flow intensities: 33


108.000 workitems
Linear regressions performed: 528
Discovered invariant relationships: 11







Time spent in post-processing validation - Time spent in archiving (i1)
Time spent in transaction validation - # rows completed (i4)
# files arrived - # files completed (i7)
# files started - # files completed (i8)
# files arrived - # files started (i9)
# rows started - # rows completed(i10)
...
Scuola Politecnica e delle Scienze di Base
Corso di Laurea Magistrale in Ingegneria Informatica
Mining Invariant Relationships for Failure Analysis of Batch Software Systems
SaaS CPG application: Broken Invariants
Month
M1
M2
M4

Samples
4320
4464
4598
Threshold
…
i1
i7
i8
…
i9
10%
4035
…
3865
3754
3182
…
p.i.
5
…
14
13
4
…
10%
4204
…
3988
3903
3050
…
p.i.
2
…
17
18
6
…
10%
4306
…
4208
4106
3422
…
p.i.
0
…
8
12
4
...
Detected broken invariants with p.i. threshold correspond or
include the SLA violations discussed with the operations
personnel.
 Maintenance staff confirmed the possibility of violations for invariants
i1-i11.
Scuola Politecnica e delle Scienze di Base
Corso di Laurea Magistrale in Ingegneria Informatica
Mining Invariant Relationships for Failure Analysis of Batch Software Systems
SCoPE: Mining

8 months of observations


Tupling on job arrivals



Sampling time: 60 s
Regression performed on a subset of node processing the same
kind of jobs
Identified flow intensities: 15


1.104.139 jobs
Linear regressions performed: 105
Discovered invariant relationships: 69








# jobs active - # jobs arrived (i5)
WORKLOAD_VMEMORY - # jobs started (i13)
# jobs active - # jobs started (i16)
# jobs started - # jobs arrived (i19)
virtual memory - # of cores (i40)
CPU% - # jobs active (i48)
memory - # jobs started (i56)
memory - CPU% (i66)
Scuola Politecnica e delle Scienze di Base
Corso di Laurea Magistrale in Ingegneria Informatica
Mining Invariant Relationships for Failure Analysis of Batch Software Systems
SCoPE: Broken Invariants
Month
Samples
i5
i13
i16
i19
i40
i48
i66
M1
44640
13
0
0
431
41443
3328
6442
M2
43200
3
0
0
555
42097
4026
6743
M3
44640
2
0
1
371
44172
2958
6800

i5, i13, i16 are almost never broken

i21 is broken for more than the 50% of samples and


i40 is broken for almost each samples
i19, i36, i48, i56, i59, i63, i66 are broken for less than the 2% of
samples.
Scuola Politecnica e delle Scienze di Base
Corso di Laurea Magistrale in Ingegneria Informatica
Mining Invariant Relationships for Failure Analysis of Batch Software Systems
Conclusions and Future Work
It is possible to automatically
discover invariants in batch
systems
They are useful to the operation
personnel to
reduce the quantity of data to be
analysed for understanding the
system behaviour in case of
execution problems
In some cases, to detect the
execution problem itself (SLA
violations)
Future avenues of research:
Use logs from different layers
Improve the detection
mechanism:
Different adaptive threshold
Improve the mining process:
Perform a continuous invariant
mining: forgetting factor RLS
algorithm.
Invariant identification with
Prediction Error Method instead of
the Least Squared Method
Model relationships between
multiple time series instead of only
pairwise relationships
Special thanks to Infosys and Dr. Sarkar, project mentor of the internship experience