1 Introduction

During program comprehension, developers need to understand how software modules relate to each other. It is especially important when changes are being made to the software and developers need to assess the impact of their changes. One way to understand such relationships is to measure the coupling between parts of the software. Coupling is one of the fundamental properties of software with a strong influence on comprehension and maintenance of large software systems. Proposed coupling measures are used in software engineering tasks, such as change impact analysis (Briand et al. 1999a; Wilkie and Kitchenham 2000), assessing the fault-proneness of classes (El-Emam and Melo 1999; Yu et al. 2002; Gyimóthy et al. 2005; Olague et al. 2007), software re-modularization (Abreu et al. 2000; Yang et al. 2005), identifying software components (Lee et al. 2001) and design patterns (Antoniol et al. 1998), assessing software quality (Briand et al. 2000), etc.

Depending on the programming paradigm used, the choice of programming language for the implementation, and the design of a software system, coupling is influenced by several factors—such as control and data flow—and hence it may be measured differently. Researchers proposed a variety of coupling measures, but recent studies (Briand et al. 2000) suggest that some of these metrics tend to compute the same form of coupling, though through different measuring mechanisms.

In this work we define a set of coupling measures, which capture new dimensions of coupling, based on the textual information shared between modules of the source code. While elements of the source code written in a programming language help identify control or data flow between software modules, the comments and identifiers express the intent of the software. Two parts of the software with similar intent will most likely refer to the same (or related) concepts in the problem or solution domains of the system. Hence, they are conceptually related. This has also been confirmed by the earlier work of other researchers who examined the overlap of semantic information in comments and identifiers among different software modules (Etzkorn and Delugach 2000; Stein et al. 2004). This relationship is the foundation for the new coupling measures, named conceptual coupling. The measures are computed using information retrieval (IR) techniques that help extract and analyze the textual information embedded in software (i.e., in the comments and identifiers). While any of several IR techniques could be used, in this work we use Latent Semantic Indexing (LSI) (Deerwester et al. 1990). The set of conceptual coupling metrics can be defined and used for any type of programming paradigm, but we define and use them here in the context of object-oriented (OO) software systems.

Existing coupling measures have previously been used to support the impact analysis process, where the task is to identify all classes that would change when a given class is being changed. Existing models (Briand et al. 1999a) do not capture all the ripple effects of changes in existing software. Given that the conceptual coupling metrics reflect different relationships than structural coupling metrics, we assume that they also propagate changes in software. The paper focuses on the use of the conceptual coupling metrics to predict classes that will change during impact analysis. We conducted a case study on a large open-source software system (i.e., Mozilla) to see how the conceptual coupling metrics compare with nine existing structural coupling metrics when used during impact analysis. The case study indicates that one of our conceptual coupling metrics provides the best results for predicting classes that need to be changed.

2 Related Work

We discuss here the major approaches to coupling measurement, in order to contrast existing approaches with our proposed metrics. The conceptual coupling metrics are based on the use of IR methods and constitute a novel application compared to previous uses of IR in program comprehension, which we also present here. Coupling measures have been used to support impact analysis, and we present those approaches here as well.

2.1 Coupling Measurement

Coupling measurement is a rich and interesting body of research work, resulting in many different measuring approaches for structural coupling metrics (Chidamber and Kemerer 1991; Chidamber and Kemerer 1994; Lee et al. 1995; Briand et al. 1997), dynamic coupling measures (Arisholm et al. 2004; Hassoun et al. 2004), evolutionary and logical coupling (Gall 2003), coupling measures based on information entropy (Allen et al. 2001), coupling metrics for specific types of software applications such as procedural systems (Offutt et al. 1993), knowledge-based systems (Kramer and Kaindl 2004), ontology-based systems (Orme et al. 2006) and systems developed using an aspect-oriented approach (Zhao 2004).

The structural coupling metrics have received significant attention in the literature. These metrics are comprehensively described and classified within the unified framework for coupling measurement (Briand et al. 1999b). The best known among these metrics are CBO (coupling between objects) and CBO' (Chidamber and Kemerer 1991; 1994), RFC (response for class) (Chidamber and Kemerer 1991) and RFC (Chidamber and Kemerer 1994), MPC (message passing coupling) (Li and Henry 1993), DAC (data abstraction coupling) and DAC1 (Li and Henry 1993), ICP (information-flow-based coupling) (Lee et al. 1995), the suite of coupling measures by Briand et al. (Briand et al. 1997) (IFCAIC, ACAIC, OCAIC, FCAEC, etc). Other structural metrics such as Ce (efferent coupling), Ca (afferent coupling) and COF (coupling factor) are also overviewed by Briand et al. (Briand et al. 1999b).

Many of the coupling measures listed above are based on method invocations and attribute references. For example, the RFC, MPC, and ICP measures are based on method invocations only. CBO and COF measures count method invocations and references to both methods and attributes. The suite of measures defined by Briand et al. (Briand et al. 1997) captures several types of interactions between classes such as class–attribute, class–method, and method–method interactions. The measures from the suite also differentiate between import and export coupling as well as other types of relationships including friends, ancestors, descendants etc.

Dynamic coupling measures (Arisholm et al. 2004; Hassoun et al. 2004) were introduced as a refinement of existing coupling measures, because static structural coupling measures leave gaps in addressing polymorphism, dynamic binding, and the presence of unused code.

Another important family of coupling measures derives from the evolution of a software system, in contrast to structural coupling, which is determined by program analysis of a single version of the software, and dynamic coupling, which is obtained by executing the program. These evolutionary couplings among parts of a system are determined by past common changes, or co-changes (Gall 2003).

Another form of coupling, namely interaction coupling, captures relations among software artifacts which are relevant to a particular software engineering task (Zou et al. 2007). Interaction coupling uses information gleaned using an Integrated Development Environment on when artifacts are being used or modified in the same development task.

Recently, several specialized coupling metrics were proposed for different types of software systems. They are coupling metrics for knowledge-based systems (Kramer and Kaindl 2004) as well as coupling metrics for aspect-oriented programs (Zhao 2004).

Existing work on clustering software (Maletic and Marcus 2001; Kuhn et al. 2007), retrieving similar components in software libraries (Michail and Notkin 1999) and measuring semantic overlap of information in comments and identifiers among software modules (Etzkorn and Delugach 2000) uses the concept of semantic similarity between elements of the source code (Marcus et al. 2008), which stands at the foundation of the conceptual coupling, as defined in this paper.

2.2 The Use of IR Methods in Program Comprehension

IR methods were proposed and used successfully to address tasks of extracting and analyzing textual information existing in software artifacts. Early models were used to construct software libraries (Maarek et al. 1991; Fischer 1998) and support reuse tasks (Helm and Maarek 1991; Etzkorn and Davis 1997; Michail and Notkin 1999; Pan et al. 2004; Ye and Fischer 2005), while more recent work focused on specific software maintenance and development tasks such as recovery of traceability links. Several approaches have been proposed to recover traceability links between source code and external documentation using probabilistic IR, vector space models (Antoniol et al. 2002) and LSI (Marcus et al. 2005a). Other work proposed a set of approaches to recover traceability links among requirements (Cleland-Huang et al. 2005; Lo et al. 2006), requirements and source code (Hayes et al. 2006), requirements and test cases (Lormans and Van Deursen 2006), etc. A set of tools that integrates facilities to manage traceability links among different types of software artifacts was developed and evaluated recently (De Lucia et al. 2007).

IR methods have also been used successfully for concept and feature location (Marcus et al. 2004; Zhao et al. 2006; Poshyvanyk et al. 2007; Poshyvanyk and Marcus 2007; Eaddy et al. 2008) in the source code. Other approaches use IR methods to classify software systems based on their source code in open-source repositories (Kawaguchi et al. 2006) as well as to cluster source code to obtain high-level views of software systems (Maletic and Marcus 2001; Kuhn et al. 2007).

IR techniques were also used to identify the starting impact set of a maintenance request (Antoniol et al. 2000), and to link change request descriptions to the set of historical file revisions impacted by similar past change requests (Canfora and Cerulo 2005). An approach to automatically classify the type of maintenance activity based on a textual description of changes was also proposed in (Mockus and Votta 2000). IR approaches have been used in the context of software measurement to assess the quality of identifiers and comments (Lawrie et al. 2006), measure complexity of the underlying software (Etzkorn et al. 2002), compute conceptual cohesion (Patel et al. 1992; Marcus and Poshyvanyk 2005) and coupling (Poshyvanyk and Marcus 2006) of classes.

In addition, IR techniques have been applied to several other tasks, such as identification of duplicate bug reports (Runeson et al. 2007; Wang et al. 2008), classification of software maintenance requests (Di Lucca et al. 2002), recommendation rendering for novice programmers (Cubranic et al. 2005) and identification of developer contributions (Linstead et al. 2007).

2.3 Impact Analysis Approaches

During software change, programmers need to modify the source code of existing software systems. The first step during software change is to identify a part of the source code that needs to be changed. Once the starting point of the change is identified, developers need to identify the other components that need to be changed. Bohner et al. (Bohner 1996) recognized impact analysis as an activity that estimates all components to be changed. One of the techniques of impact analysis was proposed in the work of Queille et al. (Queille et al. 1994), where an interactive process was suggested, in which the programmer, guided by dependencies among program components (i.e., classes, functions), inspects components one-by-one and identifies the ones that are going to change—this process involves both searching and browsing activities. This interactive process was supported via a formal model, based on graph rewriting rules (Chen and Rajlich 2000).

More recent work appears in (Bohner and Gracanin 2003; Robillard 2005; Hill et al. 2007), where proposed tools can help navigate and prioritize system dependencies during various software maintenance tasks. The work in (Hill et al. 2007) relates to our approach in as much as it also uses lexical (textual) clues from the source code to identify related methods. Several recent papers presented algorithms that estimate the impact of a change on tests (Rountev et al. 2001; Kosara et al. 2003). A comparison of different impact analysis algorithms was provided in (Orso et al. 2004).

Coupling measures have been used to support impact analysis in OO systems (Briand et al. 1999a; Wilkie and Kitchenham 2000). Wilkie and Kitchenham (Wilkie and Kitchenham 2000) investigated if classes with high CBO coupling metric values are more likely to be affected by change ripple effects. Although CBO was found to be an indicator of change-proneness in general, it was not sufficient to account for all possible changes. The work of Briand et al. (Briand et al. 1999a) investigated the use of coupling measures and derived decision models for identifying classes likely to be changed during impact analysis. The results of empirical investigation of the structural coupling measures and their combinations showed that the coupling measures can be used to focus underlying dependency analysis and reduce impact analysis effort. On the other hand, the study revealed a substantial number of ripple effects, which are not accounted by the highly coupled (structurally) classes. This work motivated our quest for novel coupling measures, which use alternative sources of information (i.e., text in identifiers and comments) to capture dependencies that are not captured by the existing structural coupling measures.

3 Using IR Methods for Coupling Measurement

Our approach to coupling measurement is based on the hypothesis that modules (or classes) in OO software systems are related in more than one way. The evident and most explored set of relationships is based on data and control dependencies. In addition to such relationships, classes are also related conceptually, as they may contribute together to the implementation of a domain concept. In this work, we propose a mechanism, based on IR techniques, to capture and measure this form of coupling, named conceptual coupling. Our choice of IR technique in this type of application is LSI.

Developers use comments and identifiers to represent elements of the problem or solution domain of the software. In our previous work (Maletic and Marcus 2001; Marcus et al. 2004; Poshyvanyk and Marcus 2006; Poshyvanyk et al. 2007; Poshyvanyk and Marcus 2007; Marcus et al. 2008) we investigated approaches to extract, encode, and analyze the semantic information embedded in the comments and identifiers of the software. We use the same type of information in the definition of the conceptual coupling metrics.

In order to compute the conceptual coupling of classes, the source code of the software system is converted into a text corpus, where each document contains elements of the implementations of a method. Comments and identifiers are extracted from the source code, as well as structural information. The user has an option to choose the desired granularity (e.g., class or method level) for documents (see more details in Section 0). LSI uses this corpus to create a term-by-document matrix, which captures the distribution of words in methods. The main idea behind LSI is that the information about word contexts in which a particular word appears or does not appear provides a set of mutual constraints that determines the statistical similarity of meaning of sets of words to each other. LSI relies on a Singular Value Decomposition (SVD) (Salton and McGill 1983) of a term-by-document matrix derived from a corpus that pertains to knowledge in the particular domain of interest. SVD is applied to the term-by-document matrix to construct a subspace, called an LSI subspace. Each document from the corpus (i.e., method from the source code) is represented as a vector in this LSI subspace. Once the documents are represented in the LSI subspace, conceptual coupling measures can be computed between methods and classes. We use the cosine between the vectors corresponding to the methods as a measure of the conceptual coupling between the two methods.
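The pipeline described above (corpus, term-by-document matrix, SVD, cosine) can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the tooling used in this work: the method names and terms are hypothetical, the matrix holds raw term counts with no weighting, identifier splitting, or stop-word removal, and the subspace dimension k = 2 is chosen only because the toy corpus is tiny.

```python
import numpy as np

# Toy corpus: one "document" per method, holding its identifier and comment terms.
# Method names and terms are hypothetical, chosen only for illustration.
docs = {
    "parse_conflict": "parse conflict file revision marker",
    "list_conflicts": "list conflict file dialog",
    "commit_files": "commit file revision message dialog",
    "draw_button": "button dialog paint window",
}

names = list(docs)
vocab = sorted({term for text in docs.values() for term in text.split()})
index = {term: i for i, term in enumerate(vocab)}

# Term-by-document matrix of raw counts (rows: terms, columns: methods).
A = np.zeros((len(vocab), len(names)))
for j, name in enumerate(names):
    for term in docs[name].split():
        A[index[term], j] += 1.0

# Truncated SVD: keep the k largest singular values to form the LSI subspace.
k = 2  # assumption: far smaller than any value used on a real system
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per method


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


# Pairwise cosine between every two methods in the LSI subspace.
sim = np.array([[cosine(a, b) for b in doc_vectors] for a in doc_vectors])
```

Here `sim[i, j]` plays the role of the cosine between the vectors of methods `names[i]` and `names[j]`; on a real system the corpus would be built by extracting comments and identifiers from the source code.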

The definition of the conceptual coupling and the methodology for measuring it would not change radically if another IR method were used. The only significant change would be in the definition of the conceptual coupling between methods (see Definition 3 in the next section).

3.1 System Representation and Coupling Measures

In order to define and compute the conceptual coupling measures, we introduce a graph based representation of a software system, similar to those used to compute other coupling measures.

Definition 1 (System, Classes)

We consider an OO system as a set of classes C = {c1, c2…cn}. The number of classes in the system C is n = |C|.

Definition 2 (Methods of a Class)

A class has a set of methods. For each class c ∈ C, M(c) = {m1, …, mz} represents its set of methods, where z = |M(c)| is the number of methods in a class c. The set of all methods in the system is defined as M(C).

Definition 3 (Conceptual Coupling Between Methods—CCM)

The conceptual coupling between two methods mk ∈ M(C) and mj ∈ M(C), CCM(mk, mj), is computed as the cosine between the vectors vmk and vmj, corresponding to mk and mj in the semantic space constructed by LSI.

$${\text{CCM}}\left( {{\text{m}}_{\text{k}} {\text{,}}\;{\text{m}}_{\text{j}} } \right) = \frac{{vm_k^T vm_j }}{{\left| {vm_k } \right|_2 \times \left| {vm_j } \right|_2 }}$$

As defined, the value of CCM(mk, mj) ∈ [−1, 1], as CCM is a cosine in the LSI space. In order to comply with non-negativity property of coupling metrics (Briand et al. 1999b), we refine CCM as:

$$CCM^1 \left( m_k ,\; m_j \right) = \begin{cases} CCM\left( m_k , m_j \right) & \text{if } CCM\left( m_k , m_j \right) \geqslant 0 \\ 0 & \text{otherwise} \end{cases}$$

Definition 4 (Conceptual Coupling Between a Method and a Class—CCMC)

Let ck ∈ C and cj ∈ C be two distinct (ck ≠ cj) classes in the system. Each class has a set of methods M(ck) = {mk1, …, mkr}, where r = |M(ck)| and M(cj) = {mj1 , …, mjt}, where t = |M(cj)|. Between every pair of methods (mk, mj) there is a conceptual coupling measure—CCM(mk, mj). We define the conceptual coupling between a method mk and a class cj as follows:

$$CCMC\left( {m_k ,\;c_j } \right) = \frac{{\sum\limits_{q = 1}^t {CCM^1 \left( {m_k ,m_{jq} } \right)} }}{t},$$

which is the average of the conceptual couplings between method mk and all the methods from class cj.

Definition 5 (Conceptual Coupling Between two Classes—CCBC)

We define the conceptual coupling between two classes ck ∈ C and cj ∈ C as:

$$CCBC\left( {c_k ,\;c_j } \right) = \frac{{\sum\limits_{l = 1}^r {CCMC\left( {m_{kl} ,c_j } \right)} }}{r},$$

which is the average of the couplings between all unordered pairs of methods from class ck and class cj. The definition ensures that the conceptual coupling between two classes is symmetrical, as CCBC(ck , cj ) = CCBC(cj , ck ).
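Definitions 3 to 5 translate directly into three small functions. The sketch below assumes the CCM cosines between the methods of two classes are already available as a matrix; the numeric values are hypothetical and serve only to show the clamping step and the symmetry of CCBC.

```python
from statistics import mean


def ccm1(ccm):
    """CCM^1: clamp a cosine in [-1, 1] to [0, 1] (negative values become 0)."""
    return ccm if ccm >= 0 else 0.0


def ccmc(row):
    """CCMC(m_k, c_j): mean CCM^1 between one method and every method of c_j."""
    return mean(ccm1(value) for value in row)


def ccbc(matrix):
    """CCBC(c_k, c_j): mean of CCMC over the methods of c_k.

    matrix[l][q] holds CCM(m_kl, m_jq), with r rows and t columns.
    """
    return mean(ccmc(row) for row in matrix)


# Hypothetical cosines: 2 methods of c_k (rows) against 3 methods of c_j (columns).
ccm_kj = [[0.5, -0.25, 0.25],
          [0.0, 0.75, 0.25]]
ccm_jk = [list(column) for column in zip(*ccm_kj)]  # same pairs seen from c_j

forward = ccbc(ccm_kj)   # CCBC(c_k, c_j)
backward = ccbc(ccm_jk)  # CCBC(c_j, c_k)
```

Both directions average the same r × t clamped values, which is why `forward` and `backward` agree up to floating-point rounding.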

3.2 The Conceptual Coupling of a Class

With this system representation, we define a measure that approximates the coupling of a class in an OO software system by measuring the degree to which the methods of the class are conceptually related to the methods of the other classes.

Definition 6 (Conceptual Coupling of a Class—CoCC)

For a class c ∈ C, conceptual coupling is defined as:

$$CoCC{\left( c \right)} = \frac{{{\sum\limits_{i = 1}^n {CCBC{\left( {c,d_{i} } \right)}} }}}{{n - 1}},$$

where n = |C|, di ∈ C, and c≠di.

Based on the above definitions, CoCC(c) ∈ [0, 1] ∀ c ∈ C. If a class c ∈ C is strongly coupled to the rest of the classes in the system, then CoCC(c) should be closer to one, meaning that the methods in the class are strongly related conceptually with the methods of the other classes. In this case, the class most likely implements concepts that overlap with concepts implemented in other classes (which are related in the context of the software system).

If the methods of the class have low conceptual coupling values with methods of other classes, then the class implements one or more concepts with limited interaction with the rest of the system. The value of CoCC(c) in this case will be close to zero.

In this form, CoCC does not make distinction between method types. If needed, CoCC can be altered to account for overloaded, friend, and other method stereotypes, as discussed in (Briand et al. 1997).

3.2.1 An Example of Measuring the Conceptual Coupling of a Class

In order to illustrate how the CoCC metric is computed, let us consider three classes from the source code of TortoiseCVS software system (see Fig. 1) with similarities between the methods outlined in Table 1. To simplify the example, we computed similarities only between a few methods in every class, however, in a real setting similarities will be computed for all pairs of methods among classes. We will refer to the class CVSServerFeatures as c1 and to its methods as m1 and m2; to the class ConflictListDialog as c2 and its methods as m3, m4, and m5; to the class CommitDialog as c3 and its methods as m6, m7, and m8.

Fig. 1 Source code of the CVSServerFeatures, CommitDialog, and ConflictListDialog classes from the TortoiseCVS system

Table 1 Conceptual couplings between the methods of the classes CVSServerFeatures (m1, m2), ConflictListDialog (m3, m4, m5), and CommitDialog (m6, m7, m8). Conceptual couplings between methods of the same class

In order to compute CoCC for class c1, we need to compute conceptual similarities between classes (c1, c2) and (c1, c3), since CoCC (c1) = (CCBC(c1, c2) + CCBC(c1, c3) )/2.

In order to compute the conceptual similarities between c1 and c2, we use the following formula: CCBC(c1, c2) = (CCMC(m1, c2) + CCMC(m2, c2))/2. In this case, CCMC(m1, c2) is the average of the conceptual similarities between method m1 and all the methods in class c2. Thus, CCMC(m1, c2) = (CCM1(m1, m3) + CCM1(m1, m4) + CCM1(m1, m5))/3 = (0.7 + 0.27 + 0.13)/3 = 0.366. Similarly, CCMC(m2, c2) = (0.68 + 0.34 + 0.25)/3 = 0.423. Therefore, CCBC(c1, c2) = (0.366 + 0.423)/2 = 0.3945.

Similarly, we compute the conceptual coupling between classes c1 and c3, CCBC(c1, c3) = 0.4515.

Now we are able to compute CoCC(c1), since CoCC(c1) = (CCBC(c1, c2) + CCBC(c1, c3))/2 = (0.3945 + 0.4515)/2 = 0.423. Similarly, CoCC(c2) = 0.357 and CoCC(c3) = 0.385.
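The arithmetic of this example can be replayed as follows. The sketch uses only the CCM1 values quoted in the text for (c1, c2) and takes CCBC(c1, c3) = 0.4515 as reported, since the remaining entries of Table 1 are not reproduced here; the variable names are illustrative.

```python
# CCM1 values quoted in the example: rows are c1's methods (m1, m2),
# columns are c2's methods (m3, m4, m5).
ccm1_c1_c2 = [[0.70, 0.27, 0.13],
              [0.68, 0.34, 0.25]]

# CCMC(m, c2) for each method m of c1: average over the methods of c2.
ccmc_values = [sum(row) / len(row) for row in ccm1_c1_c2]

# CCBC(c1, c2): average of the CCMC values over the methods of c1.
ccbc_c1_c2 = sum(ccmc_values) / len(ccmc_values)

# CCBC(c1, c3) is taken as reported in the text, not recomputed.
ccbc_c1_c3 = 0.4515

# CoCC(c1): average of CCBC over the other classes (here n - 1 = 2).
cocc_c1 = (ccbc_c1_c2 + ccbc_c1_c3) / 2

print(ccmc_values, ccbc_c1_c2, cocc_c1)
```

Without intermediate rounding, CCBC(c1, c2) comes out at about 0.395 rather than 0.3945; the small difference arises because the text truncates the two CCMC values to three decimals before averaging. CoCC(c1) still agrees with 0.423 to three decimals.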

3.3 The Maximum Conceptual Coupling of a Class

If a class c ∈ C has a high CoCC value, one can easily infer that it is strongly related to most other classes in the system. The opposite conclusion can be inferred if the CoCC value is low. Little can be said if the CoCC value is neither high nor low. This is a general drawback of average-based metrics. In these cases we can still have classes strongly related to c, which are important from a program comprehension point of view. These strong relationships can also propagate changes between classes.

An analogous logic can be applied to the coupling between two classes (e.g., if two methods from different classes are conceptually similar, they might need to be changed in concert).

With that in mind, we refine CoCC to capture only the strongest couplings among methods. The goal here is to make sure that our measuring mechanism does not miss classes that are highly coupled even to a part of the system, as developers need to be aware of such classes. Thus, we define:

$$CCMC_m \left( m_k ,\; c_j \right) = \max \left\{ CCM^1 \left( m_k ,\; m_{jq} \right),\ \forall\, q = 1..\left| M\left( c_j \right) \right| \right\}$$

The maximum conceptual coupling between method mk and class cj is the highest conceptual coupling among all pairs formed by mk and a method of class cj.

The maximum conceptual coupling between two classes based on CCMCm is defined as the following:

$$CCBC_m \left( {{\text{c}}_{\text{k}} {\text{,c}}_{\text{j}} } \right) = \frac{{\sum\limits_{l = 1}^r {CCMC_m \left( {m_{kl} ,c_j } \right)} }}{r}$$

The maximum conceptual coupling metric CoCCm for a class c, is defined:

$$CoCC_m \left( c \right) = \frac{{\sum\limits_{i = 1}^n {CCBC_m \left( {c,d_i } \right)} }}{{n - 1}},$$

where n = |C|, di ∈ C ,c≠di.

Referring back to the example in the previous subsection, with these new definitions, CoCCm(c1) = (CCBCm(c1, c2) + CCBCm(c1, c3))/2 = 0.645. Similarly, CoCCm(c2) = 0.486 and CoCCm(c3) = 0.515.
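As a partial check of the maximum-based variant, the sketch below applies CCMCm and CCBCm to the CCM1 values quoted earlier for classes c1 and c2 (0.7, 0.27, 0.13 for m1 and 0.68, 0.34, 0.25 for m2). Only this pair is computed, because the other entries of Table 1 are not reproduced here; the function names are illustrative.

```python
# CCM1 values quoted in the running example: rows are c1's methods (m1, m2),
# columns are c2's methods (m3, m4, m5).
ccm1_c1_c2 = [[0.70, 0.27, 0.13],
              [0.68, 0.34, 0.25]]


def ccmc_max(row):
    """CCMCm(m_k, c_j): strongest CCM1 between one method and the methods of c_j."""
    return max(row)


def ccbc_max(matrix):
    """CCBCm(c_k, c_j): average of CCMCm over the methods of c_k (the rows)."""
    return sum(ccmc_max(row) for row in matrix) / len(matrix)


ccbc_m_c1_c2 = ccbc_max(ccm1_c1_c2)  # (0.70 + 0.68) / 2

# The same pairs viewed from c2: rows become m3, m4, m5.
ccm1_c2_c1 = [list(column) for column in zip(*ccm1_c1_c2)]
ccbc_m_c2_c1 = ccbc_max(ccm1_c2_c1)  # (0.70 + 0.34 + 0.25) / 3
```

For these values CCBCm(c1, c2) is 0.69, while the same formula applied from c2's side gives 0.43. Under the formula as written, then, the maximum-based pairwise measure depends on which class supplies the rows, unlike the average-based CCBC.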

Class c1 in our example is the one with the highest values of the CoCC and CoCCm metrics, whereas class c2 has the lowest conceptual coupling.

4 Comparing Structural and Conceptual Coupling Measures

As CoCC and CoCCm are new coupling measures, we evaluated them accordingly. In our previous work (Poshyvanyk and Marcus 2006) we analyzed theoretical properties of the proposed measures, such as, non-negativity, null value, monotonicity, merging of classes, and merging of unconnected classes. Additionally, we compared the conceptual coupling with existing structural coupling measures on ten different open source software systems. The key findings of those studies are presented in the following sub-sections.

4.1 Principal Component Analysis of Metrics Data

We compared the following set of coupling measures: nine structural (CBO, RFC, MPC, DAC, ICP, ACAIC, OCAIC, ACMIC, and OCMIC) and two conceptual coupling measures (CoCC and CoCCm). In order to identify the causal, orthogonal dimensions captured by the coupling measures we performed Principal Component Analysis (PCA) (Jolliffe 1986) on the metrics measured on the set of 979 classes in ten open-source software systems. All studied measures were subjected to an orthogonal rotation. The results of the PCA revealed that the CoCC and CoCCm measures defined two new dimensions on their own (they were two separate significant factors in identified principal components). These results clearly indicated that conceptual coupling measures capture different types of coupling between classes, than those captured by the structural metrics. This unique result derives from the fact that CoCC and CoCCm are coupling measures that are based on completely different ideas and measurements than the existing coupling measures; CoCC and CoCCm are based on the semantic information obtained from the source code encoded in identifiers and comments, whereas the existing metrics use the structure of the software as the basis for measurement. In addition, we compared the results of the PCA with those reported elsewhere in the literature (Briand et al. 1998; 2000). Although the PCs and loadings obtained in our case and those reported in the literature do not completely overlap, they were relatively similar (Poshyvanyk and Marcus 2006).

4.2 Differences Between Conceptual and Structural Coupling Measures

To obtain more insights into how the conceptual coupling metrics differ from the structural ones, we chose several classes from different software systems for detailed analysis. As the cases where the two sets of metrics agree are of little interest, we were interested in those cases with different values of conceptual and structural metrics, e.g., high conceptual metric values and low structural metrics values, and vice versa. We considered both the CoCC and CoCCm measures that capture the coupling of the classes to the rest of the system. In this subsection, we present some of the noted differences between the conceptual coupling measures (CoCC and CoCCm) and the CBO and RFC structural coupling measures.

The classes chosen for detailed analysis are from the WinMerge and TortoiseCVS systems (Table 2). We selected these classes based on high values of CoCC and CoCCm and low values of the CBO and RFC metrics.

Table 2 Classes with highest conceptual coupling in WinMerge and TortoiseCVS according to CoCC and CoCCm

The IVSSItems, IVSSUsers, and IVSSCheckouts classes from WinMerge show high conceptual and low structural coupling to the rest of the system. Closer inspection revealed that these classes are part of a larger cluster of related classes, which contribute to the implementation of a feature related to accessing functions of other ActiveX objects; they all implement the COleDispatchDriver interface. All the classes in the cluster have several common characteristics: they are all wrappers; the majority of the methods in these classes call the InvokeHelper() function to execute specific functionality in the ActiveX object; and the majority of pairs of classes from the cluster have high conceptual similarities. The “IVSS” cluster consists of eleven classes wrapping similar functionalities. This explains the high values for CoCC and CoCCm, since these classes are conceptually related to the other classes from the cluster, as well as to other classes in the system. Their construction as wrappers and their main usage explain the low structural coupling.

The classes ConflictParser and ConflictListDialog from the TortoiseCVS system implement important domain concepts—identifying conflicts in the working version of the file and current file revision as well as dialog to list the conflicts in the file. These concepts are important in the system, which extends the file system’s interface to support collaborative software development with CVS. The high values of CoCC and CoCCm metrics for these classes from TortoiseCVS can be explained by the fact that these classes use domain concept terms like “parse” and “conflict”, which are spread across many methods of this system. These terms have high global frequencies, meaning that they frequently occur as parts of identifiers or comments across different methods in the system compared to the other 1,915 unique terms indexed in this system. The terms “conflict” and “parse” occur more than a thousand times in 679 methods of TortoiseCVS system.

The classes analyzed in this section implement domain concepts, which relate to the rest of the system, yet they are loosely coupled to the rest of the system. It is important to identify these classes from a maintenance point of view. The loose structural coupling may indicate a low architectural importance, but the high conceptual coupling indicates that these classes are most likely contributing to the implementation of the main domain concepts. The classes which relate conceptually to the majority of classes in the system may exhibit a form of dependency, called hidden dependency (Yu and Vaclav 2001), which is not always expressed by structural coupling measures. Modifications in these classes may trigger special types of ripple effects, which are currently not captured by existing coupling measures (Briand et al. 1999a).

4.3 Conceptual Coupling Between Pairs of Classes

From the impact analysis point of view, even more important are pairs of classes that relate conceptually, yet not structurally. To better understand the pair-wise conceptual coupling measures and how they can be used to rank classes during impact analysis, we also analyzed the CCBC and CCBCm measures, computed for pairs of classes in WinMerge and TortoiseCVS software systems. For illustration purposes, we selected several pairs of classes with highest CCBC and CCBCm values (see Table 3).

Table 3 Pairs of classes from WinMerge and TortoiseCVS with highest CCBC values

It came as no surprise that pairs of classes, mentioned in Section 4.2 as part of the “IVSS” cluster, were among those with highest CCBC values. These classes implement different, but related tasks, which are all based on implementation of client side of Object Linking and Embedding (OLE) automation. Detailed inspection of the source code for these classes has shown that they are not directly connected structurally, meaning that they do not use each other’s services. On the other hand after inspecting the history of co-changes for these files (using CVS data for WinMerge project) we noticed that these classes are not only strongly conceptually coupled together, but they also have a history of common changes (i.e., they were changed and submitted to the repository at the same time).

Another pair of classes MergeDlg and UpdateDlg from TortoiseCVS system has high conceptual coupling values for CCBC and CCBCm metrics. This is once again not surprising, since both classes implement similar concepts—front end dialogs for merging and updating file revisions. Both classes share similar terms which come from names of classes used to create elements of user interface: “button”, “static text”, “check box”, etc., as well as terms more specific to the concepts which are implemented in these classes: “fetch”, “revision”, “tag”, “branch”, etc. Again these classes do not have direct structural dependencies between them. This is a case of unconnected classes, which implement similar functionality (Marcus and Maletic 2001).

5 Using Coupling Measures for Impact Analysis

The coupling measures can help order (rank) classes in software systems, based on different types of dependencies among classes, captured by the coupling measures (Briand et al. 1999a). Such coupling measures and derived ranks of classes can be computed automatically. The next section describes probabilistic decision models based on coupling measurement to support impact analysis.

5.1 Ranking Classes Using Coupling Measures

For a given class c ∈ C (which may be the starting point of a change, identified by the programmer based on his experience, or automatically with some feature location technique), the other classes in a software system are ranked according to their strength of coupling to the class c, based on a coupling measure or a combination of such measures (Briand et al. 1999a). The list of ranked classes is provided to the developer for further inspection. Since software systems may be large, sometimes containing thousands of classes, focusing impact analysis on strongly coupled classes may significantly reduce the burden on the developer.

In Section 2.1 we summarized the best known structural coupling measures. In the literature, these coupling measures are defined and used at the system level (classic definitions of coupling measures), meaning that they count, for a given class c, all dependencies (connections) from c to all other classes in the system. In order to use the coupling measures for impact analysis, they need to be modified to account for coupling between pairs of classes only. Table 4 shows how we redefined some of the structural coupling measures. More details on how other structural coupling measures are redefined on a pair-wise basis are provided by Briand et al. (Briand et al. 1999a). Section 3.1 provides details on how we defined conceptual coupling measures on a pair-wise basis.

Table 4 Examples of redefined structural coupling measures used to rank classes during impact analysis

5.2 An Example of Using Coupling Measures for Impact Analysis in Mozilla

The following example illustrates how conceptual and structural coupling metrics are used to rank classes to focus impact analysis. The bug #232570 (see Footnote 4) reports some problems associated with ‘ldap2.server.position values for ab pane and search order’ in Mozilla. In order to fix the bug, the developer needs to find and change the classes in the source code containing this bug. Assume that the starting point of this change, the class nsAbDirectoryQuery, is identified via some available feature location technique. Given the starting point, the developer needs to perform impact analysis to identify the remaining classes in order to complete the change. In our approach, we compute the set of pair-wise coupling measures for all possible pairs between nsAbDirectoryQuery and other classes. Using these coupling measures, all the classes in Mozilla are ranked based on the strength of their coupling to the nsAbDirectoryQuery class (different types of coupling are captured by different measures). The idea is that classes strongly coupled to the given class are more likely to change (Briand et al. 1999a). In our example, Table 5 provides the list of top classes ranked by the values of two coupling metrics, CCBCm and ICP. These measures provide a quantitative estimate of the strength of coupling between the class nsAbDirectoryQuery and the classes in Table 5. In order to determine the number of candidate classes suggested for inspection during impact analysis, different strategies can be used. The most common approaches are to use a cut point cp (i.e., select the top n classes from the list or the top n%) or a threshold t (i.e., select all classes that have a coupling value higher/lower than some metric value t). Combinations of the two approaches are also used. For example, the top n% of classes will be retrieved if they have a coupling value higher or lower than t.

Table 5 Classes strongly coupled with nsAbDirectoryQuery and ranked according to CCBCm and ICP coupling measures

In this example, for each metric, a cut point strategy is used (e.g., the top five classes from each rank list are retrieved).
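The ranking and selection strategies described above can be sketched as follows. This is a minimal illustration rather than the tooling used in the study: the function names, the class names, and the coupling values are hypothetical, and the threshold is assumed to keep classes whose coupling value is strictly higher than t.

```python
from typing import Dict, List, Optional


def rank_classes(coupling: Dict[str, float]) -> List[str]:
    """Order candidate classes by decreasing coupling to the starting class.

    Ties are broken alphabetically so that the ranking is deterministic.
    """
    return sorted(coupling, key=lambda name: (-coupling[name], name))


def select_candidates(
    coupling: Dict[str, float],
    cut_point: Optional[int] = None,
    threshold: Optional[float] = None,
) -> List[str]:
    """Apply a threshold t, a cut point cp, or both to the ranked list."""
    ranked = rank_classes(coupling)
    if threshold is not None:
        # Assumption: keep only classes coupled more strongly than t.
        ranked = [name for name in ranked if coupling[name] > threshold]
    if cut_point is not None:
        # Keep the top n classes of whatever survived the threshold.
        ranked = ranked[:cut_point]
    return ranked


# Hypothetical pair-wise coupling values to one starting class.
example = {"ClassA": 0.9, "ClassB": 0.4, "ClassC": 0.7, "ClassD": 0.1}
top_two = select_candidates(example, cut_point=2)
above_t = select_candidates(example, threshold=0.3)
```

With a cut point the developer always inspects a fixed number of classes, whereas with a threshold the number of suggested classes depends on how the metric values are distributed.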

While using CCBCm to rank classes conceptually similar to the nsAbDirectoryQuery class, we retrieve five classes out of 4,853 (see Table 5). Two of these classes, nsAbMDBDirectory and nsAbLDAPDirectory, are among the ten classes in the official patch that were changed to fix this bug (nsAbAutoCompleteSession, nsAbBSDirectory, nsAbCardProperty, nsAbDirProperty, nsAbDirectoryDataSource, nsAbDirectoryProperties, nsAbDirectoryQuery, nsAbLDAPDirectory, nsAbMDBDirectory, nsMsgCompose). However, when the ICP metric is used with this cut point, only two classes are suggested and neither of them is among the changed classes. Precision and recall for these two metrics are computed as follows. Precision for CCBCm is 2/5 × 100% = 40%, while recall is 2/9 × 100% = 22% (we use nine classes instead of ten in the denominator, since one of the changed classes, nsAbDirectoryQuery, is already identified and used as a starting point in impact analysis). None of the changed classes has structural dependencies with the nsAbDirectoryQuery class that are captured by the ICP coupling measure, and thus precision and recall for the ICP measure are zero.
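The arithmetic of this example can be reproduced with a short sketch. The ten changed classes and the two correctly suggested classes are taken from the text above. The three remaining entries of the CCBCm list and both entries of the ICP list are placeholders, since only their counts matter here, and the function name is hypothetical.

```python
def precision_recall(suggested, changed, start):
    """Precision and recall of a suggested list for one starting class.

    The starting class is excluded from the relevant set because it is
    already known to the developer.
    """
    relevant = set(changed) - {start}
    hits = relevant.intersection(suggested)
    precision = len(hits) / len(suggested) if suggested else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Classes changed by the official patch for bug #232570 (from the text).
changed = [
    "nsAbAutoCompleteSession", "nsAbBSDirectory", "nsAbCardProperty",
    "nsAbDirProperty", "nsAbDirectoryDataSource", "nsAbDirectoryProperties",
    "nsAbDirectoryQuery", "nsAbLDAPDirectory", "nsAbMDBDirectory",
    "nsMsgCompose",
]
start = "nsAbDirectoryQuery"

# Top five by CCBCm: two real hits plus three placeholder names.
ccbcm_top5 = ["nsAbMDBDirectory", "nsAbLDAPDirectory",
              "placeholder_1", "placeholder_2", "placeholder_3"]
# ICP suggests only two classes, neither of them changed (placeholders).
icp_suggested = ["placeholder_4", "placeholder_5"]

p_ccbcm, r_ccbcm = precision_recall(ccbcm_top5, changed, start)
p_icp, r_icp = precision_recall(icp_suggested, changed, start)
```

Here p_ccbcm is 2/5 and r_ccbcm is 2/9, which round to the 40% and 22% reported above, while both ICP values are zero.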

6 Case Study on Using Coupling Measures to Support Impact Analysis

In this section we present a case study, where we empirically investigated how conceptual coupling metrics can be used during impact analysis as well as compared them to a set of existing structural coupling measures used for the same task.

6.1 Design of the Case Study

The case study is designed in a similar fashion to the one presented in the work by Briand et al. (Briand et al. 1999a), where a set of structural coupling metrics was used to rank classes during impact analysis in an OO system. While designing and conducting the case study, we followed the guidelines in two papers written by Yin and by Flyvbjerg, respectively (Yin 2003; Flyvbjerg 2006).

6.1.1 Objectives and Methodology

In this case study, the CCBC and CCBCm measures are compared with nine existing structural coupling measures (i.e., PIM, ICP, CBO, MPC, OCMIC, DAC, OCAIC, ACMIC, and ACAIC) to evaluate whether they provide better support for impact analysis or not. The premise is that given the nature of the captured information (e.g., textual information in identifiers and comments) and the counting mechanism employed by CoCC and CoCCm, these measures should capture different aspects of coupling among classes as compared to the nine existing coupling metrics, which utilize only structural information.

In the case study, we used the source code of Mozilla v1.6, which is an open-source web browser ported to almost all known software and hardware platforms. It is large enough to represent a real-world software system and it comes with an available history of changes. The source code of Mozilla consists of 4,853 classes implemented in approximately four million lines of source code (including 738,180 lines of comments).

Our case study addresses the following question: Do CCBC and CCBCm provide better support for ranking classes during impact analysis than any of the following structural coupling measures: PIM, ICP, CBO, MPC, OCMIC, DAC, OCAIC, ACMIC and ACAIC?

6.1.2 Settings of the Case Study

All the structural coupling measures, including pair-wise versions of coupling measures, were computed using Columbus (Ferenc et al. 2002). The conceptual coupling measures were computed with the IRC2M tool (Poshyvanyk and Marcus 2006), which can be used with several settings for the underlying LSI-based analysis. In the case study, we used the following common settings.

We used method-level granularity to construct the corpus for Mozilla, meaning that the implementation (source code) of every method from the software system was extracted and represented as a separate document in the corpus. We extracted all types of methods from classes in the source code, including constructors, destructors, and accessors. Comments and identifiers were extracted from each method as well. The resulting text from source code is pre-processed as follows: some of the tokens are eliminated (e.g., operators, special symbols, some numbers, keywords of the C++ programming language, standard library function names including standard template library, etc.); the identifier names in the source code are split into parts based on observed coding standards and naming conventions. For example, all the following identifiers are broken into the separate words ‘coupling’ and ‘measures’: ‘coupling_measures’, ‘Coupling_measures’, ‘CouplingMeasures’, etc. Given that we did not consider n-grams, the order of words in text passages is of no significance. In this process, LSI does not use a predefined vocabulary or a predefined grammar; therefore, no morphological analysis or transformations, such as stemming or abbreviation expansion, are required. We believe that such methods may even improve the results. Based on our previous experience with LSI on corpora of similar size (Marcus and Poshyvanyk 2005; Poshyvanyk et al. 2007; Marcus et al. 2008), we used a reduction factor of 500 for the Mozilla software system corpus.
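The identifier splitting step can be illustrated with a small sketch. The regular expressions below are one plausible encoding of underscore and camel-case conventions, not the rules implemented in the tool used in the study. The function name is hypothetical, and digits are simply discarded here.

```python
import re
from typing import List


def split_identifier(identifier: str) -> List[str]:
    """Split an identifier into lower-case words.

    Underscores and other non-alphanumeric characters act as separators;
    inside each part, camel-case boundaries start a new word and runs of
    capitals (acronyms) are kept together.
    """
    words = []
    for part in re.split(r"[_\W]+", identifier):
        # Either an acronym not followed by a lower-case letter,
        # or an optional capital followed by lower-case letters.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [word.lower() for word in words]


# The three identifiers from the text all yield the same two words.
variants = ["coupling_measures", "Coupling_measures", "CouplingMeasures"]
split_variants = [split_identifier(name) for name in variants]
```

Because word order is not significant for the analysis, the resulting words can be added to the document of the enclosing method as an unordered bag of terms.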

6.1.3 Collecting Change Data in Mozilla for Evaluation

In order to compare conceptual and structural coupling measures for identifying classes that change together (i.e., changes related to the same bug report and having the same identification number in the configuration management system) during impact analysis, we utilized the history of changes in Mozilla. We used Bugzilla (see Footnote 5), a bug-tracking system used in the development of Mozilla, collected the bugs between two versions of Mozilla (i.e., 1.6 and 1.7), and correlated each bug with specific classes. The Bugzilla database contained around 256,613 different bug entries for all the versions of the system; however, we restricted the scope of mining to the bugs which appear between versions 1.6 and 1.7 and were fixed (meaning that the bug was officially closed and contained an official patch file with modifications). In our analysis we did not consider bug reports for accessory software systems such as Bonsai, Tinderbox, etc. We extracted 1,021 different bug entries which satisfied all the aforementioned requirements. By analyzing the patch files associated with bug reports, we assigned bugs to particular intervals in the source code. This could be done automatically, since each patch file contained the name of the changed file and described how many lines were deleted from a given line number and how many lines were inserted at a given line number. Using this line-level information about changes, we determined intervals of actual changes in the file and localized the bugs to implementations in the source code of specific classes. To ensure that the files in the patch were changed at the same time, we checked log messages in the configuration management system to confirm that check-in messages for those files contained the same identification number, assigned by the Bugzilla system.
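The mapping from patch files to classes can be sketched as below. The sketch assumes that patches are in the unified diff format, with `---` file headers and `@@ -start,count +start,count @@` hunk headers, and that the line range of each class implementation is already known. Both are assumptions made for illustration, as are all function, file, and class names. The intervals include the context lines of each hunk, so they are a coarse approximation of the changed lines.

```python
import re
from typing import Dict, List, Set, Tuple

# Hunk header of a unified diff; a missing count means one line.
HUNK = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def changed_intervals(patch_text: str) -> Dict[str, List[Tuple[int, int]]]:
    """Map each patched file to (first, last) line intervals in the old file."""
    intervals: Dict[str, List[Tuple[int, int]]] = {}
    current = None
    for line in patch_text.splitlines():
        if line.startswith("--- "):
            # File header: keep the path, drop an optional tab-separated date.
            current = line[4:].split("\t")[0].strip()
            intervals.setdefault(current, [])
        elif current is not None:
            match = HUNK.match(line)
            if match:
                start = int(match.group(1))
                count = int(match.group(2)) if match.group(2) is not None else 1
                intervals[current].append((start, start + max(count, 1) - 1))
    return intervals


def classes_touched(
    intervals: List[Tuple[int, int]],
    class_ranges: Dict[str, Tuple[int, int]],
) -> Set[str]:
    """Return classes whose line range overlaps any changed interval."""
    touched = set()
    for low, high in intervals:
        for name, (class_low, class_high) in class_ranges.items():
            if low <= class_high and class_low <= high:
                touched.add(name)
    return touched


# Hypothetical patch and class layout for one file.
patch = (
    "--- Foo.cpp\n+++ Foo.cpp\n"
    "@@ -10,3 +10,4 @@\n ctx\n-old\n+new\n+more\n ctx\n"
    "@@ -40 +41,2 @@\n-x\n+y\n+z\n"
)
found = changed_intervals(patch)
touched = classes_touched(found["Foo.cpp"], {"Foo": (1, 20), "Bar": (30, 39)})
```

In this example only the hypothetical class Foo overlaps a changed interval, so the bug would be assigned to Foo alone.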

After collecting a set of bug reports and the corresponding sets of changed classes, we filtered the data to eliminate those bugs which contained only one modified class. After the filtering we ended up with 391 bug reports, containing on average 7.3 modified classes (standard deviation: 14.6). However, we also removed some outliers in the data. For example, one of the removed outliers, the patch in the bug report #226439 (see Footnote 6), modified a record number of classes (149) to fix this bug.

6.1.4 Evaluation Methodology

Our evaluation strategy is to utilize the history of changes, observed in Mozilla, to determine whether existing structural and conceptual coupling measures can be used during impact analysis to identify classes with common changes (i.e., changes in classes related to the same bug report and having the same bug identification number in the configuration management system). The history of changes can be used to evaluate rankings of classes produced with different coupling measures against actual changes in the software system. We expect that the conceptual coupling measures, namely CCBC and CCBCm, will be at least as effective as the nine existing coupling measures in ranking classes during impact analysis.

The evaluation methodology can be summarized in the following steps:

  • For a given software system, a set of bug reports B = {b1, b2…bn} is mined from the bug tracking system. The set of classes which had been changed to fix each bug (e.g., c(b1) = {c1, c2…cn}) is mined from the configuration management system. Specific details on how the bug reports and changed classes are identified are provided in Section 6.1.3.

  • For each class in c(bi), pair-wise structural and conceptual coupling metrics are computed. The values of each metric are used to compute ranks of the remaining classes in the software system.

  • Using a specific cut point criterion (which ranges anywhere from 10 to 500 classes), defined as cp, select the top n classes in each ranked list of results generated by every metric. For every class in c(bi), which is used in the evaluation, we assess the effectiveness of identifying relevant classes (i.e., the other classes in c(bi)) via the rankings of a specific coupling metric.

  • In order to evaluate each coupling measure and compare all the coupling measures used in the case study, the suggested ranked lists of classes are compared against classes that were actually changed. Average precision (P), recall (R), and F-measure (F) for each class in c(bi) for each i = 1..|B| are computed for every metric. In our case, precision is the percentage of classes suggested by a metric that are actually changed together with the given class according to the bug report. Recall is the percentage of the classes that are changed together with the given class and are successfully retrieved using the coupling measures (see Section 5.2 for an example of how precision and recall values are computed). The F-measure is a weighted harmonic mean of precision and recall. It is calculated as (2 × precision × recall) / (precision + recall) and can be used as a comprehensive indicator of combined precision and recall values. The F-measure is better suited than techniques like averaging, since it weights the lower measure more heavily. For example, a coupling measure producing 80% precision but only 20% recall provides only a few suggested classes, but most of these classes are relevant. The F-measure for this case is 32%, whereas the average of precision and recall is 50%. Thus the F-measure (32%) better reflects a coupling measure’s effectiveness, since the measure helps to identify only 20% of the relevant classes. For each measure, a higher value is more desirable.
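The steps above can be sketched as a short evaluation loop. This is an illustration under stated assumptions, not the scripts used in the study: `rank` stands for any function that returns the other classes ordered by one coupling metric, the F-measure is assumed to be computed from the averaged precision and recall, and all names and data are hypothetical.

```python
from typing import Callable, List, Set, Tuple


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(
    bug_reports: List[Set[str]],
    rank: Callable[[str], List[str]],
    cut_point: int,
) -> Tuple[float, float, float]:
    """Average precision and recall over every class of every bug report."""
    precisions, recalls = [], []
    for changed in bug_reports:
        for start in sorted(changed):
            relevant = changed - {start}          # the other classes in c(b_i)
            suggested = rank(start)[:cut_point]   # top n classes for this metric
            hits = len(relevant.intersection(suggested))
            precisions.append(hits / len(suggested) if suggested else 0.0)
            recalls.append(hits / len(relevant) if relevant else 0.0)
    if not precisions:
        return 0.0, 0.0, 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return p, r, f_measure(p, r)


# Hypothetical data: one bug report and a fixed ranking order.
order = ["D", "A", "B", "C", "E"]
result = evaluate([{"A", "B", "C"}], lambda s: [c for c in order if c != s], 2)
worked = f_measure(0.8, 0.2)  # the 80% / 20% example from the text
```

The value of worked is 0.32, matching the 32% in the example above, while the plain average of the two inputs would be 0.5.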

The complete results on using conceptual and structural coupling measures to rank classes for all mined bug reports are presented and discussed in Section 6.2.

6.2 Comparing Conceptual and Structural Coupling Metrics for Impact Analysis

In order to compare the coupling measures, we followed the evaluation methodology presented in Section 6.1.4. We computed precision and recall values for each coupling metric for every class in each of the 391 bug reports. We computed 1,490 precision and recall values for eleven coupling measures. In addition, we studied the impact of different cut points, on precision, recall and F-measure values for particular coupling measures. We computed precision, recall and F-measure values for each cut point for each coupling metric.

The results of using nine structural and two conceptual coupling measures to rank classes during impact analysis in Mozilla are presented in Tables 6, 7, 8. The values of the computed coupling metrics for classes in Mozilla can be downloaded from http://www.cs.wayne.edu/~severe/CoCC/Mozilla_coupling_metrics.zip. The results in Tables 6, 7, 8 contain precision and recall values for using CCBCm, ICP, PIM, CCBC, CBO, MPC, OCMIC, OCAIC, DAC, ACMIC and ACAIC coupling measures with different cut points ranging from 10 classes to 500 classes.

Table 6 Precision (Pre) and recall (Rec) values for using conceptual and structural coupling measures to rank classes during impact analysis based on different cut points from 10 to 50 classes
Table 7 Precision (Pre) and recall (Rec) values for using conceptual and structural coupling measures to rank classes during impact analysis based on different cut points from 60 to 100 classes
Table 8 Precision (Pre) and recall (Rec) values for using conceptual and structural coupling measures to rank classes during impact analysis based on different cut points from 200 to 500 classes

Only two of the coupling metrics, CCBC and CCBCm, are normalized (see Section 3.2), thus we could compute precision, recall, and F-measure values for various thresholds within the complete spectrum of metric values (see Fig. 2). The other coupling metrics are not normalized as they count the total number of coupling connections of a class with other classes in the system (the larger the metric value, the stronger the coupling between two classes). The only exception is the CBO coupling measure, which has a binary value indicating whether two classes have a coupling connection or not. In the case of CBO, we based our evaluation on choosing n classes coupled to a given class instead of using actual metric values (as is done for the other structural coupling measures).

Fig. 2 Precision, recall and F-measure values for using CCBCm and CCBC conceptual coupling measures to rank classes during impact analysis. The values are based on different thresholds (pertinent to each metric). The number of actually retrieved classes for every threshold is given in parentheses

For each coupling measure we varied the cut point from 10 to 500 classes. For instance, when using the CCBCm metric (see Table 6) with a cut point of 10 classes, the obtained precision was 27.8%, recall was 14.6% and F-measure was 19.1%. Increasing the cut point to 20 classes provides more candidate classes, thus decreasing precision to 24.7%, but significantly increasing recall to 22.1% and increasing F-measure to 23.3%. Also notice that with a cut point of 10 classes, the CCBCm value for the class at the cut point is 0.64, whereas with a cut point of 20 classes the class at the cut point has a smaller CCBCm value of 0.61, meaning that the conceptual similarities for ten of the candidate classes lie within the [0.61, 0.64] interval of CCBCm metric values.

Analysis of the results presented in Tables 6, 7, 8, 9 reveals that the CCBCm conceptual coupling metric is the best coupling measure for ranking classes during impact analysis (in terms of precision, recall and F-measure). None of the other coupling metrics achieves the same value of F-measure (i.e., maximum of 24.8% for a cut point of 30/40 classes and 19.0% on average across all the cut points) for any given cut point. For example, when using a cut point of 30 classes with CCBCm, around 28% of actually changed classes are recovered (recall) and one in five suggested classes is correct (precision). These are encouraging results, as the source code of Mozilla consists of 4,853 classes and focusing developers on a set of relevant classes can significantly reduce the amount of time developers spend on impact analysis.

Table 9 F-measure values for using conceptual and structural coupling measures to rank classes during impact analysis based on different cut points from 10 to 500 classes

The results for using structural coupling measures for the same task are less encouraging. The second best metric after CCBCm is the structural coupling measure ICP (based on the average F-measure, see Table 9), which captures information flow based coupling. This coupling measure captures the number of invocations in a class ci ∈ C of methods in a class di ∈ C, weighted by the number of parameters of the invoked methods. This coupling measure also takes polymorphism into account. While using the cut point of 20 classes, the precision of identifying relevant classes using ICP is 10.1%, recall is 9.7% and F-measure is 9.89%. The best value of F-measure for the ICP metric, which is 11.7%, is obtained while using a cut point of 60 classes (see Table 9).

The next metric after ICP is PIM (see Table 9), which captures the number of method invocations in class ci ∈ C of methods in class di ∈ C. The measure also takes polymorphism into account. For example, when using the first twenty classes with the highest PIM values as a cut point, the precision of identifying relevant classes is 9.84%, the achieved recall is 9.56%, and the F-measure is only 9.7%. The best value of F-measure for the PIM metric, which is 11.6%, is obtained while using a cut point of 60 classes (see Table 9). PIM has been shown to be a relatively effective coupling measure (as compared to other structural measures) to rank classes during impact analysis in other case studies (Briand et al. 1999a).

The MPC coupling measure shows higher precision values as compared to other coupling measures in some cases (more than 7%), however it has low recall (around 2% on average) for all of the studied cut points.

The other coupling measures, such as CBO, DAC, ACAIC, ACMIC, OCAIC and OCMIC have low precision and recall values (less than 10%) for all of the computed cut points.

While CCBC coupling measure uses the same type of information as CCBCm, it uses a different counting mechanism based on average similarities as opposed to CCBCm, which is based on the strongest coupling link between classes. According to the results (see Tables 6, 7, 8, 9), CCBCm significantly outperforms CCBC. Moreover CCBC is outperformed by some of the structural coupling measures such as ICP and PIM.
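The difference between the two counting mechanisms can be illustrated on a matrix of method-to-method similarities. The sketch below is one plausible reading of the description above and may differ in detail from the definitions in Section 3. It assumes that `sim[i][j]` already holds the conceptual similarity between method i of the first class and method j of the second class. The function names and the example values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def ccbc_average(sim: Matrix) -> float:
    """Average over all method pairs (the averaging mechanism)."""
    values = [value for row in sim for value in row]
    return sum(values) / len(values)


def ccbc_maximum(sim: Matrix) -> float:
    """Average, over methods of the first class, of the strongest link
    each method has to any method of the second class."""
    return sum(max(row) for row in sim) / len(sim)


# Hypothetical similarities: one strongly related method pair (0.9)
# among otherwise weakly related methods.
sim = [
    [0.9, 0.1],
    [0.2, 0.2],
]
avg_based = ccbc_average(sim)
max_based = ccbc_maximum(sim)
```

In this example the average-based value is 0.35 and the maximum-based value is 0.55: a single strong link between two methods is diluted by averaging but preserved when only the strongest link per method is counted.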

The results show that CCBCm is a useful indicator (the best among the studied coupling measures) of an external property of classes in OO systems: change proneness. This coupling measure can be effectively used to rank relevant classes during impact analysis in OO systems. The measure performed better on average than any of the structural metrics we compared it to. While we do not investigate to what extent structural and conceptual coupling measures complement each other in this case study, there is a noticeable potential in combining these coupling measures for ranking classes during impact analysis. As observed in several examples in Section 4.2, there are cases where high conceptual coupling metric values capture dependencies between classes which are not structurally connected, and vice versa. Thus, combining conceptual and structural coupling measures may lead to a significant reduction of the programmer’s effort during impact analysis by increasing precision (and recall) of identifying relevant classes.

6.3 Testing Statistical Significance of Differences Among Precision and Recall Values

In order to compare values of precision and recall for the coupling measures for each of the cut points and conclude whether or not the difference is statistically significant, we executed the Kruskal–Wallis test, which is a nonparametric alternative to the one-way analysis of variance (ANOVA) in those cases when more than two independent samples are present. In our case, the Kruskal–Wallis test is an appropriate alternative to the ANOVA test, because we have eleven independent samples of precision and recall values, one for each of the coupling measures.

We executed the Kruskal–Wallis test separately for precision and recall values for all of the coupling measures (see Table 10). For more details on the Kruskal–Wallis test, the reader is referred to the work of Siegel and Castellan (Siegel and Castellan 1988).

Table 10 The results of running two Kruskal–Wallis tests for precision (test 1) and recall (test 2) values of eleven coupling metrics across the different cut points

In both tests, at the level of significance alpha = 0.05, the decision was to reject the null hypothesis of absence of differences between the eleven metric values. In other words, both tests have shown that the differences between precision (first test) and recall (second test) values for the eleven coupling metrics were statistically significant.
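For readers unfamiliar with the test, the H statistic can be computed as in the following sketch. It is a plain textbook version without the correction for ties, it uses synthetic numbers rather than the values from Table 10, and the function name is hypothetical. Under the null hypothesis H approximately follows a chi-square distribution with k − 1 degrees of freedom, so for eleven samples the critical value at alpha = 0.05 is 18.307 (10 degrees of freedom).

```python
from typing import List, Sequence


def kruskal_wallis_h(samples: Sequence[Sequence[float]]) -> float:
    """Kruskal-Wallis H statistic (no tie correction)."""
    # Pool all observations, remembering the sample each one came from.
    pooled = sorted((value, group)
                    for group, sample in enumerate(samples)
                    for value in sample)
    n = len(pooled)
    rank_sums: List[float] = [0.0] * len(samples)
    i = 0
    while i < n:
        # Find the run of tied values pooled[i:j] and give each the mean rank.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mean_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += mean_rank
        i = j
    total = sum(r * r / len(s) for r, s in zip(rank_sums, samples))
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)


# Synthetic example with three clearly separated samples (k = 3).
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Chi-square critical value for 2 degrees of freedom at alpha = 0.05.
reject_null = h > 5.991
```

For the synthetic data H is 7.2, which exceeds 5.991, so the null hypothesis of no difference among the three samples would be rejected. With samples this small the chi-square approximation is rough and serves only to show the mechanics.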

6.4 Threats to Validity

We identify several issues that affected the results of our case study and limited our interpretations.

In the case study we considered only structural metrics that were based on the static information obtained from the source code. The results can differ to some extent if dynamic coupling measures are used (Arisholm et al. 2004; Mitchell and Power 2006).

The conceptual coupling measures depend on rational naming conventions for identifiers and comments in source code. When these are missing, the only hope for measuring any aspects of coupling rests on the structural coupling measures.

CoCC, CoCCm, CCBC and CCBCm measures, as currently defined, do not take into account polymorphism and inheritance. The measures only consider methods of a class that are implemented or overloaded in the class. One of the solutions, which accounts for inheritance, consists of extending the measures to include the source code of inherited methods into the documents of derived classes, as it is currently done by Kuhn et al. (Kuhn et al. 2007).

In our case study we used one large software system, however, to allow for generalization of results, large-scale evaluation is necessary, which should take into account several releases of software systems from different domains, developed using different programming languages.

Also, our evaluation is based on the changed classes extracted from patches in related bug reports. This could have affected the evaluation procedure, as these patches may contain incomplete information about the classes that were actually changed, or these changes could have introduced other bugs. We alleviate this issue by considering only patches which are officially approved by module owners in Mozilla.

7 Conclusions and Future Work

The paper defines a novel set of operational measures for the conceptual coupling of classes, based on IR, which are theoretically valid and empirically studied. These new metrics capture new dimensions in coupling measurement, compared to existing structural metrics. Moreover, one of the conceptual coupling measures, CCBCm measure, appears to be a superior indicator of change ripple effects as compared to existing structural coupling measures and can be effectively used to rank classes in the course of impact analysis in a large OO system.

The paper lays the foundation for a wealth of work that makes use of the coupling metrics which use lexical (textual) information in software. The proposed metrics could be further extended and refined, for example by taking into account inheritance in measurement. The IRC2M tool will be adapted to compute conceptual coupling measures in other programming languages such as Java or C#. We are also planning on comparing and combining the conceptual coupling metrics with the evolutionary based coupling (Gall et al. 2003). Since conceptual coupling measures use textual information, we are considering including external documentation in the corpus. This will allow extending the context in which words are used in the software and identifying inconsistencies between source code and external documentation.

More importantly, we will investigate combinations of the conceptual and structural coupling measures for impact analysis and detection of hidden dependencies. In addition, we will use these metrics to extend prior work on software clustering (Kuhn et al. 2007), concept location (Marcus et al. 2004; Marcus et al. 2005b; Poshyvanyk and Marcus 2007), and high-level concept clone detection (Marcus and Maletic 2001). We are also planning on investigating how changes in the structure and lexicon of software during software evolution (Antoniol et al. 2007) are reflected in structural and conceptual coupling measures.