Extended Classification of Software Faults Based on Aging
Kalyanaraman Vaidyanathan and Kishor S. Trivedi
Dept. of ECE, Duke University
Durham, NC 27708, USA
1. Introduction
Jim Gray classifies software faults into Bohrbugs and Heisenbugs. Bohrbugs are essentially permanent design faults and hence almost deterministic in nature. They can be identified easily and weeded out during the testing and debugging phase (or early deployment phase) of the software life cycle. Heisenbugs, on the other hand, belong to the class of temporary internal faults and are intermittent. They are essentially permanent faults whose conditions of activation occur rarely or are not easily reproducible. Hence these faults result in transient failures, i.e., failures which may not recur if the software is restarted. Some typical situations in which Heisenbugs might surface are boundaries between various software components, improper or insufficient exception handling and interdependent timing of various events. It is for this reason that Heisenbugs are extremely difficult to identify through testing. Most recent studies on failure data have reported that a large proportion of software failures are transient in nature, caused by phenomena such as overloads or timing and exception errors. In this paper, we extend Gray’s classification of software faults based on the phenomenon of software aging, which has recently gained recognition and importance.
2. Software Aging and Related Faults
The phenomenon of software aging has been reported by several recent studies [3, 4]. It was observed that once the software was started, potential fault conditions gradually accumulated with time, leading to either performance degradation or transient failures or both. Failures may be of crash/hang type or those resulting from data inconsistency because of aging. Typical causes of aging, i.e., slow degradation, are memory bloating or leaking, unreleased file-locks, data corruption, storage space fragmentation and accumulation of round-off errors. Software aging has not only been observed in software used on a mass scale but also in specialized software used in high availability and safety-critical applications [3].
[Figure: tree diagram in which Software Faults branch into Bohrbugs, Heisenbugs and "Aging-related" faults]
Figure 1. Extended classification
We designate faults attributed to software aging, which are quite different from Bohrbugs and Heisenbugs, as aging-related faults. These faults are similar to Heisenbugs in that they are activated under certain conditions (for example, lack of OS resources) which may not be easily reproducible. However, as discussed in Section 3, their modes and methods of recovery differ significantly. Figure 1 shows our extended classification.
3. Software Rejuvenation
To counteract the phenomenon of software aging, Huang et al. [4] proposed a proactive approach to fault management, called software rejuvenation. It involves occasionally terminating an application or a system, cleaning its internal state and restarting it. This process removes the accumulated errors and frees up operating system resources, thus proactively preventing the unplanned and potentially expensive system outages due to software aging. Since the preventive action can be done at optimal times, for example when the load on the system is low, it reduces the cost of system downtime compared to reactive recovery from failure. Thus, software rejuvenation is a cost-effective technique for dealing with software faults that protects not only against hard failures but against performance degradation as well. Numerous examples of software rejuvenation exist in real-life applications [3]. More recently, rejuvenation has been implemented in IBM’s xSeries servers to improve performance and availability [1]. The important difference in the treatment of Heisenbugs and aging-related faults is that in the former, the treatment is reactive, while in the latter, it can be proactive as well.

© ISSRE and Chillarege Corp., 2001
3.1 Approaches to Software Rejuvenation
Software rejuvenation can be divided broadly into two approaches as follows.
Open-loop approach: In this approach, rejuvenation is performed without any feedback from the system. Rejuvenation in this case can be based just on elapsed time (periodic rejuvenation) and/or the instantaneous/cumulative number of jobs on the system [5].
Closed-loop approach: In the closed-loop approach, rejuvenation is performed based on information on the system “health”. The system is monitored continuously (in practice, at small deterministic intervals) and data is collected on the operating system resource usage and system activity. This data is then analyzed to estimate the time to exhaustion of a resource, which may lead to degradation or a crash of a component or of the entire system. This estimation can be based purely on time [1, 3] or on both time and system workload [5]. Another approach to estimating the optimal time to rejuvenation could be based on system failure data [2].
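The open-loop trigger described above can be illustrated with a minimal Python sketch. The class name `OpenLoopPolicy` and both thresholds are hypothetical and chosen only for illustration; the paper does not prescribe specific values or an implementation.

```python
from dataclasses import dataclass


@dataclass
class OpenLoopPolicy:
    """Open-loop rule: decide from elapsed time and job count only.

    No health feedback from the system is used. Both thresholds are
    hypothetical values, not taken from the paper.
    """
    max_uptime_s: float  # period for periodic rejuvenation, in seconds
    max_jobs: int        # cumulative number of jobs before rejuvenating

    def should_rejuvenate(self, uptime_s: float, jobs_done: int) -> bool:
        # Either condition on its own is enough to trigger a restart.
        return uptime_s >= self.max_uptime_s or jobs_done >= self.max_jobs


# Example: rejuvenate weekly, or earlier after one million jobs.
policy = OpenLoopPolicy(max_uptime_s=7 * 24 * 3600, max_jobs=1_000_000)
```

Setting `max_jobs` very high reduces this rule to purely periodic rejuvenation.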
The closed-loop approach can also be classified based on whether the data analysis is done off-line or on-line. Off-line data analysis is done based on system data collected over a period of time (usually weeks or months) [3, 5]. The analysis is done to estimate time to rejuvenation. This off-line analysis approach is best suited for systems whose behavior is fairly deterministic. The on-line closed-loop approach, on the other hand, performs on-line analysis of system data collected at deterministic intervals [1]. The analysis is done after every new set of data is collected to estimate time to rejuvenate. This approach is very general and can work with systems with unpredictable behavior or whose behavior cannot be easily determined. In this case, future system behavior is computed based on the current system parameter values and weighted historical values.
This classification of approaches to rejuvenation is shown in Figure 2.
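As a rough illustration of time-based closed-loop analysis, the sketch below fits a straight line to monitored resource usage and extrapolates to the point where the resource would be exhausted. The ordinary least-squares fit is an assumption made here for simplicity and is not claimed to be the estimator used in [1] or [3]; the function name `time_to_exhaustion` and the sample data are likewise hypothetical.

```python
def time_to_exhaustion(times, usage, capacity):
    """Estimate the time left until `usage` reaches `capacity`.

    Fits usage = intercept + slope * time by ordinary least squares
    (an assumed, simplified trend model) and extrapolates. Returns the
    time remaining after the last sample, or None when usage is not
    trending upward.
    """
    n = len(times)
    if n < 2 or n != len(usage):
        raise ValueError("need at least two paired samples")
    mean_t = sum(times) / n
    mean_u = sum(usage) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("sample times must not all be equal")
    sxy = sum((t - mean_t) * (u - mean_u) for t, u in zip(times, usage))
    slope = sxy / sxx
    if slope <= 0:
        return None  # no upward trend, so no exhaustion is predicted
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - times[-1])


# Hypothetical samples: memory used (MB) observed every 60 seconds.
remaining = time_to_exhaustion([0, 60, 120, 180], [100, 110, 120, 130], 200)
```

With these samples the usage grows by 10 MB per minute, so `remaining` comes out at about 420 seconds. An on-line variant would call the function again after every new sample, for instance over a sliding window of recent observations, while an off-line variant would run it once over weeks of collected data.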
3.2 Rejuvenation Granularity
[Figure: tree diagram in which Software Rejuvenation branches into the open-loop approach (elapsed time (periodic); elapsed time and load) and the closed-loop approach (off-line and on-line), with leaves for time-based analysis, time & workload-based analysis and failure data analysis]
Figure 2. Rejuvenation approaches

Rejuvenation is a very general proactive fault management approach and can be performed at different levels: the system level or the application level. An example of a system level rejuvenation is a hardware reboot. At the application level, rejuvenation is performed by stopping and restarting a particular offending application, process or group of processes. This is also known as partial rejuvenation. The above rejuvenation approaches, when performed on a single node, can lead to undesired and often costly downtime. Rejuvenation has recently been extended to cluster systems, in which two or more nodes work together as a single system [1]. In this case, rejuvenation can be performed with no or minimal downtime by failing over applications to another spare node.
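The failover idea for clusters can be sketched as a simple plan generator. This is only an illustrative sketch under assumed conditions (a single spare node and one node rejuvenated at a time); the function name `rolling_rejuvenation` and the step labels are hypothetical and do not describe the mechanism implemented in [1].

```python
def rolling_rejuvenation(active_nodes, spare):
    """Plan a node-by-node rejuvenation that uses one spare node.

    Each active node's applications are failed over to the currently
    free node, the drained node is restarted, and it then becomes the
    free node for the next step. Returns the list of steps and the
    node left as spare at the end.
    """
    steps = []
    free = spare
    for node in active_nodes:
        steps.append(("failover", node, free))  # move applications off `node`
        steps.append(("restart", node))         # rejuvenate the drained node
        free = node                             # restarted node is the new spare
    return steps, free


steps, final_spare = rolling_rejuvenation(["n1", "n2"], "spare")
```

Because applications are moved before each restart, no service has to stop for longer than the failover itself takes.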
4. Conclusion
In this paper, we extended Gray’s classification of software faults to include aging-related faults. We also dealt with the treatment of these faults by software rejuvenation and discussed the various approaches and current methods of rejuvenation in practice.
References
[1] V. Castelli et al. Proactive Management of Software Aging. IBM JRD, Vol. 45, No. 2, March 2001.
[2] T. Dohi, K. Goševa-Popstojanova and K. Trivedi. Statistical Non-parametric Algorithms to Estimate the Optimal Rejuvenation Schedule. In Proc. PRDC 2000, Los Angeles, CA.
[3] S. Garg, A. van Moorsel, K. Vaidyanathan and K. Trivedi. A Methodology for Detection and Estimation of Software Aging. In Proc. ISSRE-98, Paderborn, Germany.
[4] Y. Huang et al. Software Rejuvenation: Analysis, Module and Applications. In Proc. FTCS-25, Pasadena, CA.
[5] K. Trivedi, K. Vaidyanathan and K. Goševa-Popstojanova. Modeling and Analysis of Software Aging and Rejuvenation. In Proc. of the 33rd Ann. Simul. Symp., Washington D.C., April 2000.