Abstract
Several recent studies have established that most system outages are due to software faults. Given the ever-increasing complexity of software and the well-developed techniques and analysis for hardware reliability, this trend is not likely to change in the near future. In this paper, we classify software faults and discuss various techniques to deal with them in the testing/debugging phase and the operational phase of the software. We discuss the phenomenon of software aging and a preventive maintenance technique, called software rejuvenation, that deals with this problem. We describe stochastic models that evaluate the effectiveness of preventive maintenance in operational software systems and determine optimal times to perform rejuvenation for different scenarios. We also present measurement-based methodologies to detect software aging and estimate its effect on various system resources. These models are intended to help develop software rejuvenation policies. An automated online measurement-based approach has been used in the software rejuvenation agent implemented in a major commercial server.
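As a rough illustration of the measurement-based idea mentioned in the abstract, the sketch below estimates a trend in a monitored resource and extrapolates the time until it is exhausted. The abstract does not specify an estimator, so the choice here (the median of pairwise slopes, a Sen-style robust slope) is an assumption, as are the function names, the sample data, and the 500 MB capacity.

```python
from statistics import median


def sen_slope(times, values):
    """Return the median of all pairwise slopes (a robust trend estimate)."""
    slopes = [
        (values[j] - values[i]) / (times[j] - times[i])
        for i in range(len(times))
        for j in range(i + 1, len(times))
        if times[j] != times[i]
    ]
    if not slopes:
        raise ValueError("need at least two samples at distinct times")
    return median(slopes)


def time_to_exhaustion(times, used, capacity):
    """Estimate the time left until `used` reaches `capacity`.

    Returns None when no upward trend is observed, i.e. when the
    samples give no sign of aging in this resource.
    """
    slope = sen_slope(times, used)
    if slope <= 0:
        return None
    return (capacity - used[-1]) / slope


# Hypothetical samples: hours since restart and memory in use (MB).
hours = [0, 1, 2, 3, 4]
used_mb = [100, 110, 120, 130, 140]

print(sen_slope(hours, used_mb))                # 10.0 MB per hour
print(time_to_exhaustion(hours, used_mb, 500))  # 36.0 hours
```

An estimate of this kind could feed a rejuvenation policy, for example by scheduling a restart some margin before the predicted exhaustion time. Real measurements are noisy and often periodic, so a practical detector would also test whether the trend is statistically significant before acting on it.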
Keywords
- Multiple Input Multiple Output
- Preventive Maintenance
- Software Aging
- Software Reliability
- Software Failure
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Trivedi, K.S., Vaidyanathan, K. (2002). Software Reliability and Rejuvenation: Modeling and Analysis. In: Calzarossa, M.C., Tucci, S. (eds) Performance Evaluation of Complex Systems: Techniques and Tools. Performance 2002. Lecture Notes in Computer Science, vol 2459. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45798-4_14
DOI: https://doi.org/10.1007/3-540-45798-4_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44252-3
Online ISBN: 978-3-540-45798-5
eBook Packages: Springer Book Archive