Human Error in Software Development and Inspection

Ray Panko's Human Error Website

Synopsis

Professional programmers are taught that they will make errors. In fact, data from over 10,000 code inspections in industry suggest that they will make undetected errors in 2% to 5% of all lines of code at the end of module development. This knowledge has led to extensive testing in commercial software development, where testing consumes between a quarter and half of all development resources [Jones 1998, Kimberland 2004], and this does not even count rework by developers.

There are several types of testing for software. One is code inspection, in which a team of software engineers examines a module of code to identify errors. This work is done in teams because individual inspectors find only a minority of all errors in a module. Even teams typically find only about 60% to 75% of the errors in a module.

Consequently, there are multiple rounds of testing during commercial software development. By the time the product is delivered, the error rate is slashed but never eliminated. Putnam & Myers [1992] surveyed data from 1486 projects involving 117 million lines of code written in 78 languages and measured faults per line of code at final inspection, after unit testing of individual pieces. The error rate at final inspection was 0.1% to 0.3%.

Errors at the Module Level

One early lesson in software development is that testing cannot be delayed until the end of development. It must be done at each step. The first testing phase is unit testing. This is done after the developer has created a module and completed it to his or her satisfaction. This is when code inspection by a team of developers and testers takes place.

Code inspection using the Fagan [1976, 1986] method has the tenet that overall results from code inspections should be reported. Although this is not universally done, it is done often enough to produce data on tens of thousands of real-world code inspections. There is nothing like this in any other field. Table 1 shows results from four large studies that collectively involved about 10,000 code inspections in commercial software development firms.

Table 1: Code Inspection Corpuses during Unit Testing for Commercial Vendor Programs

Study	Description	Measured Error Rate	Adjusted (70% detection)	Residual
Weller [1993]	More than 6,000 code inspections in industry, per line of code	1.9%	2.7%	0.8%
O'Neill [1994]	Writing a software statement. National Software Quality Experiment	2.0%	2.9%	0.9%
Cohen [2006]	Writing a software statement. 2,500 inspections at Cisco Systems	3.2%	4.6%	1.4%
Graden & Horsley [1986]	Writing a software statement. AT&T. 2.5 million lines of code over 8 software releases	3.7%	5.3%	1.6%

Measured error rates indicate that there are errors in about 2% to 4% of all lines of code. However, these measured rates understate the true error density. Although code inspections typically use teams of three to eight, they still catch only about 60% to 75% of the errors in a software module at unit testing. To reflect this, the "adjusted" column shows the likely true average error rate if code inspection catches 70% of the errors. This indicates that the actual error density is about 3% to 5%. Looked at another way, a 70% detection rate indicates that there are still errors in about 1% of all lines of code after unit testing.
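The arithmetic behind the "adjusted" and "residual" columns of Table 1 can be sketched in a few lines of Python. This is only an illustration, assuming a single 70% detection rate for every study, as the text does; the function and variable names are hypothetical and do not come from the original studies.

```python
# Measured error rates (percent of lines of code) from Table 1.
measured = {
    "Weller (1993)": 1.9,
    "O'Neill (1994)": 2.0,
    "Cohen (2006)": 3.2,
    "Graden & Horsley (1986)": 3.7,
}

DETECTION_RATE = 0.70  # assumed share of errors that inspection catches


def adjust(measured_pct, detection_rate=DETECTION_RATE):
    """Return (adjusted, residual) error rates in percent.

    adjusted = measured / detection_rate  (estimated true error rate)
    residual = adjusted - measured        (errors left after inspection)
    """
    adjusted = measured_pct / detection_rate
    residual = adjusted - measured_pct
    return round(adjusted, 1), round(residual, 1)


table = {study: adjust(rate) for study, rate in measured.items()}
for study, (adjusted, residual) in table.items():
    print(f"{study}: adjusted {adjusted}%, residual {residual}%")
```

With a different assumed detection rate, for example 0.60 or 0.75, the same function shows how sensitive the adjusted figures are to that assumption.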

Error rates in unit testing vary in several systematic ways. They vary widely, especially when there are differences in the complexity of the code. They also vary by how an error (called a fault in software terminology) is defined. However, the means in these large studies were very close. In addition, means do not vary much over time. In cognitive science, this is often the case because brains do not change over time. Although claims have been made that newer programming languages reduce errors, such claims are rarely tested.

Final Error Rates

Programs undergo constant testing during development. In functional testing, groups of modules that constitute a function are tested. There is further testing during each phase of integration. This includes the testing of final programs.

Several large studies have investigated final error rates based on bugs filed after product delivery. These are shown in Table 2. The Putnam & Myers [1992] corpus is enormous, and the Beizer [1990] corpus is also extremely large. Note that there are few errors at the end, but there are still some.

Table 2: Final Error Rates in Commercial Software

Study	Details	Error Rate
Beizer [1990]	Late inspection. 6.9 million lines of code. Faults per line of code at final inspection, after unit testing of individual pieces.	0.24%
Endress [1975]	Late inspection, DOS/VS mainframe operating system. 96 KLOC.	0.20%
Grady [1992]	Hewlett-Packard, 5 systems.	0.2% - 2.3%
Grady [1992]	Simpler program / More complex programs	0.4% / 0.8%
Nikora & Ryu [1996]	Systems testing, 5 systems at Jet Propulsion Laboratory.	0.4% - 1.0%
Putnam & Myers [1992]	Late inspection. Sample of 1486 projects, 117 million lines of code, 78 languages. Faults per line of code at final inspection, after unit testing of individual pieces.	0.30%
	~1,000 Lines of code.
	~10,000 Lines of code.	0.30%
	~100,000 Lines of code.	0.20%
	~1,000,000 Lines of code.	0.10%
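
To make concrete the point that few errors remain but some always do, the following sketch converts the Putnam & Myers rates by program size into an approximate count of remaining faults. It is a minimal illustration that assumes each rate applies uniformly across a program of that size; the names are hypothetical.

```python
# Final fault rates by program size from the Putnam & Myers rows of Table 2.
# The ~1,000-line row is omitted because the table gives no rate for it.
final_rates = {
    10_000: 0.0030,
    100_000: 0.0020,
    1_000_000: 0.0010,
}


def expected_faults(lines_of_code, fault_rate):
    """Expected number of faulty lines, assuming the rate applies uniformly."""
    return round(lines_of_code * fault_rate)


faults = {size: expected_faults(size, rate) for size, rate in final_rates.items()}
for size, count in faults.items():
    print(f"{size:>9,} lines -> about {count:,} remaining faults")
```

Although the rate per line falls as programs grow, the absolute number of remaining faults in this sketch still rises with program size.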

Comparison with Spreadsheet Studies

In spreadsheet development experiments, participant spreadsheets are compared with a master spreadsheet that has a unique solution. This allows the experimenter to catch all errors. The weighted average cell error rate across spreadsheet development experiments is 3.9%.

Only three inspections ("audits") of operational spreadsheets have reported cell error rates. Hicks [1995], who conducted a three-person inspection of a budgeting spreadsheet at NYNEX, found a cell error rate of 1.2%. Lukasik [1998] reproduced two operational spreadsheets in a financial modeling program and compared the two versions for errors. He detected errors in 2.2% and 2.5% of the cells in the two operational spreadsheets. Powell, Baker, and Lawson [2008], who used a methodology that depended more heavily on a static analysis program than on human inspection and averaged only 3.25 hours per spreadsheet, found a cell error rate of 0.9% if we eliminate numbers that the developer used in formulas but entered as numbers instead of cell references. ("Hardcoding" is normally considered a qualitative error.) As in software code inspection, these numbers are undoubtedly underestimates of the true error rate. Given the cursory nature of some of these studies, they may have found only a low percentage of all errors.

Cell error rates found in experiments are fairly consistent with error rates in inspections of operational spreadsheets. They should not be. Experiments produce spreadsheets that are module size; they are representative of unit development. Inspections of operational spreadsheets represent final development. We would expect inspections of operational spreadsheets to produce far lower error rates than module development in experiments. In fact, they should be an order of magnitude lower.
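
The size of this gap can be checked with the figures quoted on this page. The sketch below is a rough comparison only: it takes the 3% to 5% adjusted unit-level range and the 0.1% to 0.3% final range for software, and the 3.9% experimental average and the four operational audit figures for spreadsheets. The variable names are hypothetical.

```python
# Software (percent of lines): adjusted unit-level range from the text,
# final-inspection range from Putnam & Myers.
software_unit = (3.0, 5.0)
software_final = (0.1, 0.3)

# Spreadsheets (percent of cells): experimental average and the four
# operational audit figures quoted above.
spreadsheet_unit = 3.9
spreadsheet_operational = [1.2, 2.2, 2.5, 0.9]

# Ratio of unit-level error rate to final (or operational) error rate.
software_ratios = (
    software_unit[0] / software_final[1],  # smallest reduction
    software_unit[1] / software_final[0],  # largest reduction
)
spreadsheet_ratios = sorted(spreadsheet_unit / r for r in spreadsheet_operational)

print(f"software: {software_ratios[0]:.0f}x to {software_ratios[1]:.0f}x")
print(f"spreadsheets: {spreadsheet_ratios[0]:.1f}x to {spreadsheet_ratios[-1]:.1f}x")
```

On these numbers, software error rates fall by roughly a factor of 10 to 50 between unit development and final inspection, while the spreadsheet figures differ by well under a factor of 5.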

It is possible and even likely that testing is so limited in spreadsheet development that relatively few errors are caught during development. We certainly know that spreadsheet testing has typically been very limited [Caulkins et al., 2007; Cragg & King, 1993; Fernandez, 2002; Floyd et al., 1995; Galletta & Hufnagel, 1992; Gosling, 2003; Hall, 1996; Hendry & Green, 1994; Nardi, 1993; Schultheis & Sumner, 1994; Wagner, 2003].

Experiments in Software Inspection

Spreadsheet testing experiments have found that individual inspectors catch about 60% of all errors in the spreadsheets they inspect. In software inspection experiments, individual participants examining code modules with seeded errors found, on average, 40% or fewer of all errors [Basili & Perricone, 1993; Basili & Selby, 1986; Johnson & Tjahjono, 1997; Myers, 1978; Porter, Votta, & Basili, 1995; Porter & Johnson, 1997; Porter, Sly, Toman, & Votta, 1997; Porter & Votta, 1994]. Reinforcing the supposition that team detection rates are higher than individual detection rates, the Johnson and Tjahjono [1997] study found that individuals caught only 23% of the seeded errors, while teams of three found 44%.
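
As a point of comparison for the Johnson and Tjahjono figures, the sketch below computes what a team of three would catch if each member found errors independently of the others. The independence model is an assumption introduced here purely for illustration; it is not a claim made by the cited studies, and the function name is hypothetical.

```python
def team_detection_rate(individual_rate, team_size):
    """Share of errors found by at least one team member, assuming
    members detect errors independently of one another."""
    return 1 - (1 - individual_rate) ** team_size


individual = 0.23     # Johnson & Tjahjono [1997], individuals
observed_team = 0.44  # Johnson & Tjahjono [1997], teams of three

predicted_team = team_detection_rate(individual, 3)
print(f"independent model: {predicted_team:.0%}, observed: {observed_team:.0%}")
```

The independent model predicts a somewhat higher team rate than the one observed. One possible reading, offered only as a hypothesis, is that inspectors tend to find many of the same errors, so adding inspectors helps less than independence would suggest.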

References

Basili, V. R., & Perricone, B. T. (1993). Software Errors and Complexity: An Empirical Investigation. In M. Sheppard (Ed.), Software Engineering Metrics (Vol. I: Measures and Validation, pp. 168-183). Berkshire, England: McGraw-Hill International.

Basili, V. R., & Selby, R. W., Jr. (1986). Four Applications of a Software Data Collection and Analysis Methodology. In J. K. Skwirzynski (Ed.), Software System Design Methodology (pp. 3-33). Berlin: Springer-Verlag.

Beizer, B. 1990. Software testing techniques. (2nd ed.). New York: Van Nostrand.

Caulkins, J. P., Morrison, E. L., and Weidermann, T. 2007. “Spreadsheet Errors and Decision–Making: Evidence from Field Interviews,” Journal of Organizational and End User Computing (19:3), pp. 1-23.

Cohen, J. 2006. Best Kept Secrets of Peer Code Review, Austin Texas: Smart Bear.

Cragg, P. G., and King, M. 1993. “Spreadsheet Modelling Abuse: An Opportunity for OR?” Journal of the Operational Research Society (44:8), pp. 743-752.

Endress, A. 1975. “An Analysis of Errors and their Causes in System Programs,” IEEE Transactions on Software Engineering, (SE-1:2), pp. 140-149.

Fagan, M. E. 1976. “Design and Code Inspections to Reduce Errors in Program Development,” IBM Systems Journal (15:3), pp. 182-211.

Fagan, M. E. 1986, July. “Advances in Software Inspections,” IEEE Transactions on Software Engineering, (SE-12:7), pp. 744-751.

Fernandez, K. 2002. Investigation and Management of End User Computing Risk. Unpublished MSc thesis, University of Wales Institute Cardiff (UWIC) Business School.

Floyd, B. D., Walls, J., and Marr, K. 1995. “Managing Spreadsheet Model Development,” Journal of Systems Management (46:1), pp. 38-43, 68.

Galletta, D. F., and Hufnagel, E. M. 1992. “A Model of End-User Computing Policy: Context, Process, Content and Compliance,” Information and Management (22:1), January, pp. 1-28.

Gosling, C. 2003. To What Extent are Systems Design and Development Used in the Production of Non-Clinical Corporate Spreadsheets at a Large NHS Trust? Unpublished MBA thesis, University of Wales Institute Cardiff (UWIC) Business School.

Graden, M., and Horsley, P. 1986. “The Effects of Software Inspection on a Major Telecommunications Project,” AT&T Technical Journal, 65.

Grady, R. B. 1995. “Successfully Applying Software Metrics,” Communications of the ACM, (38:3), pp. 18-25.

Hall, M. J. J. 1996. “A Risk and Control Oriented Study of the Practices of Spreadsheet Application Developers,” Proceedings of the Twenty-Ninth Hawaii International Conference on Systems Sciences, Vol. II, Kihei, Hawaii, Los Alamitos, CA: IEEE Computer Society Press, January, pp. 364-373.

Hendry, D. G., and Green, T. R. G. 1994. “Creating, Comprehending and Explaining Spreadsheets: A Cognitive Interpretation of What Discretionary Users Think of the Spreadsheet Model,” International Journal of Human-Computer Studies (40:6), June, pp. 1033-1065.

Hicks, L. (1995). NYNEX, personal communication with the first author via electronic mail.

Johnson, P. & Tjahjono, D. (1997, May), “Exploring the Effectiveness of Formal Technical Review Factors with CSRS, A Collaborative Software Review System”, Proceedings of the 1997 International Conference on Software Engineering, Boston, MA.

Jones, T. C. 1998. Estimating Software Costs, New York: McGraw-Hill.

Joseph, Jimmie L. (2002). The effect of group size on spreadsheet error debugging, unpublished doctoral dissertation, Katz Graduate School of Business, University of Pittsburgh, Pittsburgh, Pennsylvania.

Kimberland, K. 2004. “Microsoft’s Pilot of TSP Yields Dramatic Results,” news@sei, No. 2. http://www.sei.cmu.edu/news-at-sei/.

Lukasik, T., CPS. (1998, August 10). Personal communication with the author.

Myers, G. J. (1978, September), “A Controlled Experiment in Program Testing and Code Walkthroughs/Inspections”, Communications of the ACM, 21(9), 760-768.

Nardi, B. A. 1993. A Small Matter of Programming: Perspectives on End User Computing, Cambridge, Massachusetts: MIT Press.

Nikora & Ryu [1996]

O'Neill, D. 1994, October. “National Software Quality Experiment,” 4th International Conference on Software Quality Proceedings.

Panko, R. R. (1999). Applying code inspection to spreadsheet testing. Journal of Management Information Systems, 16(2), 176.

Porter, A. A., & Johnson, P. M. (1997), “Assessing Software Review Meetings: Results of a Comparative Analysis of Two Experimental Studies”, IEEE Transactions on Software Engineering, 23(3), 129-145.

Porter, A. A., & Votta, L. G. (1994, May 16-21), “An Experiment to Assess Different Defect Detection Methods for Software Requirements Inspections”, Proceedings of the 16th International Conference on Software Engineering, Sorrento, Italy.

Porter, A., Votta, L. G., Jr., & Basili, V. R. (1995), “Comparing Detection Methods for Software Requirements Inspections: A Replicated Experiment”, IEEE Transactions on Software Engineering, 21(6), 563-575.

Powell, S. G., Baker, K. R., and Lawson, B. 2008. “An Auditing Protocol for Spreadsheet Models,” Information & Management, 45, pp. 312-320.

Putnam, L. H., and Myers, W. 1992. Measures for Excellence: Reliable Software on Time, on Budget, Englewood Cliffs, NJ: Yourdon.

Schultheis, R., and Sumner, M. 1994. “The Relationship of Application Risks to Application Controls: A Study of Microcomputer-Based Spreadsheet Applications,” Journal of End User Computing (6:2), Spring, pp. 11-18.

Wagner, J. 2003, October. “Mission Critical Spreadsheets in a Larger Urban University,” INFORMS, Atlanta, GA.

Weller, M. 1993. “Lessons from Three Years of Inspection Data,” IEEE Software (10:5), pp. 38-45.