One of the fundamental elements of functional assessment for challenging behaviors is the collection of data. Data are used to help determine the nature of the problem, create a case formulation, carry out the functional assessment, and monitor the progress of interventions (Hartmann, Barrios, & Wood, 2004). All decisions made during the course of a functional assessment are based on data. Most of the data collected during this process are collected in vivo; that is, most data are collected through real-time observations of individuals in their natural environment.

There are a number of advantages to in vivo data collection, primarily that observations are carried out in the same setting in which the behavior naturally occurs (Gardner, 2000). Take, for example, an individual who engages in self-injurious behavior when he or she is hungry. In vivo assessment allows data to be collected in this individual’s home, where he or she typically engages in the behavior. Were the individual brought into a clinic for observation, the properties of the behavior, such as frequency, intensity, and duration, might differ because of the novel environment. The primary advantage of in vivo data collection is thus that data are collected in the natural environment in which the behavior typically occurs. It does not require experimental manipulation of the environment, which can create an artificial setting and reduce the external validity of the data (Martens, DiGennaro, Reed, Szczech, & Rosenthal, 2008).

In vivo data collection also allows for greater specificity of data, as definitions of challenging behavior can be modified to fit each individual’s unique pattern of behavior (Matson & Nebel-Schwalm, 2007). This makes in vivo data collection procedures very flexible, as they can be adapted to target a variety of behaviors in a variety of settings (Hartmann et al., 2004; Martens et al., 2008). Due to this specificity and flexibility, real-time data are more sensitive to treatment effects than data collected via scaling methods (Matson & Nebel-Schwalm, 2007).

Another advantage is that observers are able to directly see interactions between an individual and his or her environment (Gardner, 2000; Iwata, Vollmer, & Zarcone, 1990). Indirect assessments from outside informants, such as parent reports, may be susceptible to personal biases, such as expectations, attributions, and mood (Eddy, Dishion, & Stoolmiller, 1998; Fergusson, Lynskey, & Horwood, 1993). In contrast, real-time data collection allows for direct observations, thereby bypassing the need for an informant. In vivo data collection, in principle, allows for objective evaluation of the effects of treatment (Iwata et al., 1990; Lipinski & Nelson, 1974). However, as we will see later in this chapter, that is not always the case.

While in vivo data collection has a number of advantages and clear utility, there are several problems that must be considered. This chapter discusses a variety of problems associated with the collection and use of real-time data. These problems are broken into five general categories: defining the behavior, collecting the data, reliability, validity, and the interpretation and use of the data. Problems common to each of these areas are discussed in depth.

Defining the Behavior

The first step in data collection is to determine and define on what behavior data will be collected. In order for this to be done, an accurate, operational definition must be established for the behavior. For the purposes of in vivo data collection, the behavior must be defined in clearly observable terms (Bijou, Peterson, & Ault, 1968). Hawkins and Dobes (1977) suggest three characteristics of a well-formed operational definition: (1) the definition should include objective terms and refer only to observable characteristics of the behavior; (2) the definition should be unambiguous and clear to experienced observers; and (3) the definition should be complete, defining what should be included and what should be excluded, thereby reducing inference on the part of the observer. Additionally, they suggest that the definition be explicitly stated to data collectors, as implicit definitions are more prone to error. These guidelines help prevent multiple observers from using varying definitions and allow for replication of data collection.

Barrios (1993) provides a four-step process for creating such an operational definition. The first step is to research how the behavior has previously been operationalized, as it may be possible to adopt a similar definition. The second step is to construct a definition appropriate for the current behavior in the current setting. Next, the definition should be reviewed by people with knowledge of the subject matter, as well as by those who will be using it for observation. Finally, if the definition is found to be appropriate and clear, it should be field tested by two observers who have only the definition to work from. If high agreement is found, the definition is ready to be used; if not, additional revisions may be required.

Finally, it is necessary that the criteria used to operationally define the behavior easily distinguish the target behavior from similar behaviors. For example, if the target behavior is hitting, the criteria must clearly distinguish this from other similar behaviors, such as pushing (Bijou et al., 1968). A lack of clarity in definitions will often lead to a decrease in reliability, as will be discussed later (Bijou et al., 1968).

Collecting the Data

After defining the target behavior, the next step is to determine how data will be collected. There are many ways of classifying data collection techniques. Two primary categories of in vivo data collection procedures will be discussed in this chapter: event recording and time sampling. There are a number of methods of collecting data within each of these categories, each of which will be discussed along with associated problems.

Event Recording

Event recording involves counting the number of times a specific behavior occurs during an interval (Sulzer-Azaroff & Mayer, 1977). This is most appropriate for collecting data on discrete behaviors that have clear beginning and end points (Sulzer-Azaroff & Mayer, 1977). The most basic type of event recording is to simply write down what behavior is occurring during certain periods of time (Lipinski & Nelson, 1974). This method involves writing a descriptive account of everything relevant that occurs during an observation. Although this method can provide a very thorough description of what has occurred, there are a number of problems with its use. Firstly, it can be very difficult to complete such recordings, as the observer is required to accurately note everything that is occurring. This requires the observer to direct the majority of their attention toward the act of recording, rather than toward observing the individual’s behavior (Lipinski & Nelson, 1974). This can result in the observer failing to fully observe the behavior, leading to inaccuracies in the recording. Additionally, as the recording is a narrative account of the behavior, it is difficult to compare findings with other researchers and clinicians (Bijou et al., 1968). As described above, a properly operationalized target behavior is necessary for communication between researchers and clinicians; a narrative summary does not allow for such communication. Moreover, as there is no clear structure, different observers might record different information. For example, one observer might record the location of the observation, whereas another may omit this information. Finally, it can be difficult, if not impossible, to objectively determine the duration, latency, or intensity of a behavior from a narrative account (Bijou et al., 1968). Without the use of standardized language and recording procedures, there is too much room for subjective interpretation.

To combat many of these weaknesses, one can add behavioral codes to the above method. Behavioral codes identify target behaviors and can be either specific or general (Bijou et al., 1968). A specific code lists specific, operationally defined behaviors to be observed. In contrast, a general code lists a class of behaviors, allowing for the recording of multiple behaviors (e.g., verbal responses). Specific symbols can be used to represent the operationally defined target behavior, making the recording process much simpler (Lipinski & Nelson, 1974). Additionally, the frequency of target behaviors can be recorded with a checklist, hand counters, or electronic counters (Sulzer-Azaroff & Mayer, 1977). This methodology improves upon the previous one, primarily by standardizing the definitions and procedures. This reduces the subjectivity involved, thereby allowing for the communication and comparison of results. As the behaviors are discretely and operationally defined, constructs such as duration, latency, and intensity can be objectively recorded. For example, if the target behavior is tantrum behavior, defined as crying and pounding fists, the observer could easily record how long the behavior lasts and how much time elapses between occurrences. Additionally, in comparison to the previous method, much less attention needs to be directed toward the actual recording, allowing the data collector to more accurately and thoroughly observe the behavior.
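As a rough illustration of how coded, timestamped event records support such measures, the sketch below derives frequency, per-occurrence duration, and the time elapsing between occurrences from a handful of hypothetical records. The record structure, behavior code, and values are invented for illustration and are not drawn from any published coding system.

```python
from dataclasses import dataclass

# Hypothetical timestamped records for a single coded target behavior
# ("T" = tantrum, defined as crying and pounding fists); values are invented.
@dataclass
class Event:
    code: str        # behavioral code assigned to the occurrence
    start_s: float   # onset, in seconds from the start of the session
    end_s: float     # offset, in seconds from the start of the session

session = [
    Event("T", 30.0, 75.0),
    Event("T", 210.0, 240.0),
    Event("T", 600.0, 660.0),
]

frequency = len(session)                             # number of recorded occurrences
durations = [e.end_s - e.start_s for e in session]   # duration of each occurrence
inter_response_times = [                             # time elapsing between occurrences
    nxt.start_s - prev.end_s for prev, nxt in zip(session, session[1:])
]

print(f"Frequency: {frequency}")
print(f"Mean duration (s): {sum(durations) / len(durations):.1f}")
print(f"Inter-response times (s): {inter_response_times}")
```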

Despite these improvements, there are still a number of problems associated with this methodology. Firstly, it requires prior selection of behaviors to be recorded (Goldfried, 1982; Lipinski & Nelson, 1974). Any other behaviors that occur during the observation are not recorded. For example, if tantrums are selected as the target behavior, the occurrences of other challenging behaviors during the observation are not recorded. Potentially valuable information can be lost because of this. Challenging behaviors that occur at low frequencies can also be problematic for this type of event recording (Singh et al., 2006; Tarbox et al., 2009). If observations are carried out at specific times for data collection, the behavior must occur during the observation for any information to be recorded. Take, for example, an individual who engages in self-injury when hungry. If he or she is not hungry during the observation period, it is unlikely the behavior will occur, and therefore no information can be gathered on the behavior. This has been noted to be especially problematic in individuals with mental illness, as challenging behaviors typically occur at low frequencies but high intensities in this population (Singh et al., 2006). A final problem, especially with respect to functional assessment, is that event recording says nothing about the function of the behavior (Bijou et al., 1968). No information is provided about the antecedents and consequences that may be influencing the behavior.

This final limitation can be addressed through the use of Antecedent–Behavior–Consequence (ABC) cards or sequence analysis (Bijou et al., 1968; Sulzer-Azaroff & Mayer, 1977). In addition to recording the occurrence of a behavior, the events that occur immediately before and after the behavior are recorded. Similar to the recording of the behavior itself, the antecedents and consequences can either be recorded in a narrative fashion, or target antecedents and consequences can be selected beforehand and included on a checklist. This provides considerably more information than the previous methods, as it begins to provide information about possible functions of the behavior. However, one should keep in mind that this method only describes interactions between behaviors and environmental events; this does not, in and of itself, establish a functional relationship (Bijou et al., 1968). An additional problem is that it may be difficult to quantify the data obtained through this method (Lerman & Iwata, 1993). It is difficult to determine whether the antecedent events are similarly correlated with nonoccurrences of the target behavior. For example, suppose being asked to complete a task is identified as an antecedent. While this may frequently occur before the target behavior, it may just as frequently, or more frequently, occur without being followed by the target behavior. As data are only collected on occurrences, it is not possible to make this comparison. Finally, while some strategies for examining sequential data have been proposed (Martens et al., 2008), there is no consensus within the field on how such data should be analyzed (Tarbox et al., 2009). Additional problems with the use of ABC data will be discussed below in the section on the use and interpretation of data.
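To make the quantification problem concrete, the following sketch tallies how often each recorded antecedent and consequence accompanies occurrences of a target behavior, using invented ABC checklist entries. Because nonoccurrences are not recorded, the resulting proportions cannot show whether the same events are just as common when the behavior does not occur.

```python
from collections import Counter

# Hypothetical ABC checklist entries; each entry describes one recorded
# occurrence of the target behavior (values invented for illustration).
abc_records = [
    {"antecedent": "task demand", "behavior": "tantrum", "consequence": "task removed"},
    {"antecedent": "task demand", "behavior": "tantrum", "consequence": "attention"},
    {"antecedent": "denied item", "behavior": "tantrum", "consequence": "attention"},
    {"antecedent": "task demand", "behavior": "tantrum", "consequence": "attention"},
]

n = len(abc_records)
for field in ("antecedent", "consequence"):
    counts = Counter(record[field] for record in abc_records)
    for event, count in counts.items():
        # Proportion of recorded occurrences accompanied by this event.
        # Because only occurrences are recorded, this says nothing about how
        # often the same event occurs WITHOUT the target behavior following.
        print(f"{field}: {event} -> {count / n:.0%} of recorded occurrences")
```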

Time Sampling

A second category of in vivo data collection is termed time sampling (Bijou et al., 1968; Sulzer-Azaroff & Mayer, 1977). Time sampling involves recording the occurrence or nonoccurrence of a behavior in a specified time interval. Time sampling, in contrast to event recording, can be appropriate for both discrete and nondiscrete behaviors (Sulzer-Azaroff & Mayer, 1977). For example, take an individual who engages in self-injury by hitting his or her head. If this individual hits his or her head twice, pauses for 20 s, and then hits his or her head two more times, should this be counted as one or two occurrences? While this is a problem for event recording, it is not at all problematic for time sampling. Additionally, unlike event recording, time sampling allows for quantification of data (Lerman & Iwata, 1993). For example, a 30-min data collection session on self-injury could be broken into thirty 1-min intervals. If self-injury occurs during 15 intervals, this can be quantified, allowing one to say that the behavior occurred during 50% of the intervals. This number can then be compared to subsequent observations to determine if there is an increase or decrease in the behavior. Given these advantages, time sampling has frequently been used in a variety of settings with a variety of challenging behaviors (Lerman & Iwata, 1993).

Similar to event recording, there are a number of methods of collecting time-sampling data. The three primary methods of time sampling are whole-interval time sampling, partial-interval time sampling, and momentary time sampling (Sulzer-Azaroff & Mayer, 1977). Whole-interval time sampling requires the behavior to occur throughout the interval. In the above example, the individual must engage in self-injury for the entire minute for the interval to be scored. Conversely, in partial-interval time sampling, only one instance of the behavior must occur during the interval. In the above example, if the individual engages in self-injury, even for just 1 s during the interval, the interval is scored. Finally, momentary time sampling requires the behavior to be occurring at the moment the interval ends.
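The sketch below applies these three scoring rules to the same simulated second-by-second record of a behavior and reports the percentage of intervals scored under each. The stream of observations is randomly generated purely for illustration, so the particular percentages carry no empirical meaning.

```python
import random

# Simulated second-by-second record of whether the behavior was occurring
# (True/False) across a 3-min session, scored with 60-s intervals.
random.seed(0)
behavior_stream = [random.random() < 0.3 for _ in range(180)]  # illustrative only

interval_s = 60
intervals = [behavior_stream[i:i + interval_s]
             for i in range(0, len(behavior_stream), interval_s)]

whole = [all(iv) for iv in intervals]      # scored only if behavior fills the whole interval
partial = [any(iv) for iv in intervals]    # scored if behavior occurs at any point in the interval
momentary = [iv[-1] for iv in intervals]   # scored if behavior is occurring as the interval ends

for name, scores in (("whole-interval", whole),
                     ("partial-interval", partial),
                     ("momentary", momentary)):
    print(f"{name}: behavior scored in {100 * sum(scores) / len(scores):.0f}% of intervals")
```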

While time sampling clearly has some advantages, as with event recording, there are a number of problems that must be considered. Firstly, time sampling is not practical for infrequent behaviors (Sulzer-Azaroff & Mayer, 1977). For example, take an individual who engages in self-injury approximately once a week. While this behavior can be very serious, time-sampling procedures will provide little to no information on it. Again, as this is often the case in those with mental illness, the effectiveness of this approach is limited (Singh et al., 2006). Additionally, it is much more difficult to identify antecedents and consequences with time-sampling procedures. Specific incidents of the behavior are not recorded; thus, it is not feasible to record the antecedents and consequences of the behavior. This limits the utility of this method for establishing functional relationships.

Another problem is that time-sampling procedures do not record all behaviors that occur during an observation (Johnston & Pennypacker, 1993). For example, in momentary time sampling, no recording occurs until the end of the interval, which results in a large period of nonobservation time. The data collected will, therefore, typically be under- or over-representative of the true behavior, and there is no way to assess the extent to which the data misrepresent the behavior (Johnston & Pennypacker, 1993). Johnston and Pennypacker (1993), therefore, suggest limiting the amount of nonobservation time that occurs during a session. They also recommend limiting the interpretation of data collected through time sampling, periodically assessing the accuracy of the data, and matching procedures with the distribution of responding.

A considerable amount of research has examined the various methods of time sampling, highlighting problems inherent in each. One set of researchers compared frequency recording (i.e., event recording), interval recording (i.e., partial-interval time sampling), and time sampling (i.e., momentary time sampling) for use with different rates of behavior (Repp, Roberts, Slack, Repp, & Berkler, 1976). Results showed that momentary time sampling did not produce representative data, particularly when the behavior occurred frequently and did not occur at a constant rate. Partial-interval recording was more accurate for behaviors that occurred at low and medium rates; however, it underestimated behaviors that occurred at high rates.

Powell, Martindale, and Kulp (1975) compared all three methods of time sampling with frequency recording for measuring in-seat behavior. For frequency recording, the behavior was continuously measured over the course of the session. Whole-interval time sampling was found to consistently underestimate the frequency of the behavior, partial-interval time sampling was found to consistently overestimate the frequency of the behavior, and momentary time sampling both over- and underestimated the frequency of the behavior. However, it was noted that as the intervals were made shorter (i.e., more observations were made), the time-sampling methods became more accurate. A follow-up study (Powell, Martindale, Kulp, Martindale, & Bauman, 1977) similarly found that partial-interval time sampling overestimated the frequency of behavior, while whole-interval time sampling underestimated the frequency of behavior. As error occurred in only one direction for each method, conducting a large number of observations could not control for this error. Additionally, error remained large even when intervals lasted only 30 s and was not directly related to either the frequency or duration of the behavior. The authors suggest that this may lead researchers and clinicians to inaccurately interpret changes in behavior due to treatment. Conversely, momentary time sampling was fairly accurate when observations were conducted at 5-, 10-, 20-, or 60-s intervals. However, when intervals went beyond this length, error began to increase. Finally, momentary time sampling was found superior to both types of interval time sampling for estimating the duration of a behavior.

Another study compared momentary time sampling and partial-interval time sampling in measuring behavior change, both absolute and relative (Harrop & Daniels, 1986). Both methods tended to overestimate the absolute rate of behaviors. Additionally, partial-interval time sampling overestimated the absolute duration of behaviors, especially when behaviors occurred at lower rates and shorter durations. Conversely, momentary time sampling did not produce such errors. Based on these findings, the authors suggested that duration, not rate, should be the dependent measure when using momentary time sampling. However, when measuring relative changes, the authors found partial-interval time sampling to be more sensitive than momentary time sampling. Despite this superiority, partial-interval time sampling underestimated the change in a behavior if it was of high frequency and short duration.

One shortcoming of these studies was their use of simulated or computer-simulated behaviors. Therefore, it is unclear to what extent their findings apply to behavioral observations in natural settings. Unfortunately, few studies have been conducted that examine these methods with naturally occurring challenging behaviors (Matson & Nebel-Schwalm, 2007). One of the only such studies, by Gardenier, MacDonald, and Green (2004), compared partial-interval time sampling and momentary time sampling for recording stereotypies in children with autism spectrum disorders. In this study, partial-interval time sampling was found to consistently overestimate the duration of stereotypies. Momentary time sampling was found to both overestimate and underestimate duration, but to a lesser extent than partial-interval time sampling. Across all samples, partial-interval time sampling overestimated the duration by an average of 164%, whereas momentary time sampling over- and underestimated the duration by an average of 12–28% (depending on the interval length). The authors concluded that momentary time sampling should, therefore, be used for duration recording of stereotypy. There is clearly a need for additional research examining the use of these methods with challenging behaviors.

As a whole, these studies suggest a number of strengths and limitations inherent in each method of time sampling. It does not appear that any form of time sampling provides a true representation of the frequency at which behaviors occur, although partial-interval may be more representative for behaviors that do not occur at high frequencies (Harrop & Daniels, 1986; Repp et al., 1976). Conversely, momentary time sampling may be less susceptible to error when recording the duration of behaviors (Gardenier et al., 2004; Harrop & Daniels, 1986; Powell et al., 1975, 1977). The limitations of these methods must be taken into account when considering how data will be collected.

Electronic Data Collection

Although most data are collected by hand (i.e., using pen and paper), electronic equipment is increasingly being used to collect real-time data (Tarbox, Wilke, Findel-Pyles, Bergstrom, & Granpeesheh, 2010). Potential advantages of electronic data collection include simplicity, electronic storage of data, the ability to electronically analyze data, and simpler recording (Tarbox et al., 2010). Kahng and Iwata (1998) conducted a review of 15 computerized systems for collecting real-time data. Although the reviewers were unable to systematically analyze these systems, they provided descriptive reports of each. Most of the systems reviewed included software for analyzing data, and many could be used on handheld devices. Systems ranged in price from free to over $1,500.

Unfortunately, there has been little empirical research comparing electronic data collection with hand-collected data. One such study compared the two methods for recording responses of children with autism during discrete trial training (Tarbox et al., 2010). Results showed that electronic data collection required more time than pen and paper collection for all four participants. Accuracy was high for both methods, with the average accuracy of electronic data ranging from 83.75% to 95% and the average accuracy of pen and paper data ranging from 98.13% to 100%. Graphing data was accomplished faster via electronic data collection for all participants. Although this study employed a small sample size, it is one of the only studies to empirically examine electronic data collection. The researchers concluded that although electronic data collection may save time outside of therapy sessions, it may require more time during actual sessions.

General Problems with Data Collection

In addition to the problems discussed thus far, there are additional problems general to all in vivo data collection procedures. In vivo data collection procedures can be very time consuming, especially if narrative accounts are required (Arndorfer & Miltenberger, 1993; Iwata et al., 1990). This decreases the likelihood of compliance with data collection procedures. For example, teachers may not have time to complete ABC or time-sampling data, which require their frequent attention (Arndorfer & Miltenberger, 1993). On the other hand, they may be much more likely to complete interviews and indirect assessments, which can be conducted in one session. An additional problem is that most of these techniques require extensive training (Gardner, 2000; Hartmann et al., 2004; Tarbox et al., 2009). Inadequate training may lead to decreases in the reliability of collected data, as will be discussed later (Bijou et al., 1968).

It is important that the data collected be representative of the individual’s typical behavior. However, how does one determine when enough data have been collected? How does one know that the data are now representative? One may need to collect many observations across many settings to obtain a full picture of the behavior (Lipinski & Nelson, 1974). Unfortunately, there is no objective criterion for making this determination (Lipinski & Nelson, 1974). This is especially problematic, considering the time and training needed to implement these techniques.

In order for accurate in vivo data collection to occur, the observer must maintain contact with the subject of data collection. A number of factors, such as movement during the observation session, may interfere with this contact or make observation of the target behavior more difficult (Johnston & Pennypacker, 1993). For example, if the target behavior is biting one’s hands, the observer must be able to see the individual’s hands and mouth. If the individual moves during the observation so that these are no longer visible, the observation can no longer occur. Other behaviors by the individual, or other individuals in the environment, may similarly make data collection difficult, if not impossible (Barrios, 1993; Johnston & Pennypacker, 1993). For example, a noisy environment may make it difficult to record instances of cursing. To counter such problems, one may need to manipulate the environment to restrict these possibilities (Johnston & Pennypacker, 1993). However, this adds a possible confounding variable, as the setting is no longer truly the natural setting in which the behavior typically occurs.

Another problem with in vivo data collection is the presence of frequently occurring variables unrelated to the target behavior. These unrelated events might overshadow or mask relevant variables that occur less frequently (Iwata et al., 1990). Consider an observation in which another individual is constantly yelling and screaming next to the subject of the observation. This variable (i.e., others yelling) may have no relationship to the target behavior; however, its presence may prevent the observer from detecting other important, but less frequently occurring, antecedents.

A final problem is that some behaviors and stimuli are difficult, if not impossible, to quantify (Bijou et al., 1968). This is especially true of biological or internal stimuli. For example, how does one quantify feelings of anxiousness through observation? Challenging behaviors may serve physical functions (Paclawskyj, Matson, Rush, Smalls, & Vollmer, 2000), such as occurring when the individual is uncomfortable or feeling ill. However, how can one record the frequency, duration, or intensity of these feelings through in vivo data collection? Similarly, social stimuli can be very difficult to objectively quantify. However, Bijou et al. (1968) stress that such specific biological and social variables must be assessed for a thorough functional assessment to take place. Additional problems that may occur while collecting data, such as reactivity and observer effects, are discussed in more depth below.

Reliability of Data

One of the most important factors to consider when examining real-time data is the reliability of the data. It is extremely important that the data collected be both consistent and accurate. It should be noted that agreement and accuracy are not synonymous (Kazdin, 1977). Agreement exists when multiple raters make similar recordings, regardless of whether these recordings are correct. For example, if both raters record that a behavior occurs 10 times, agreement is 100%, regardless of how often the behavior actually occurred. In contrast, accuracy reflects whether raters record how often the behavior truly occurs. Typically, interobserver agreement is calculated and agreement is assumed to reflect accuracy (Kazdin, 1977). However, agreement alone may not be enough to ensure the quality of data; accuracy and generalizability should also be reported when possible (Mitchell, 1979).

While on the surface it seems that reliability should be easy to achieve, there are a number of factors that can affect the achieved reliability. Firstly, as discussed above, the definition of the behavior itself can affect reliability (Bijou et al., 1968). If there is room for subjective interpretation, two observers may define the behavior differently. The two observers may, therefore, be recording two different behaviors, thereby affecting the reliability of the data. Additional factors that will be discussed below include the method of calculating reliability, the coding system employed, inadequate training, reactivity to the reliability assessment, observer drift, and characteristics of the observers (Bijou et al., 1968; Kazdin, 1977; Lipinski & Nelson, 1974).

Calculating Reliability

There are a number of methods for calculating the reliability of data, each with its advantages and disadvantages. While a full discussion of reliability techniques is beyond the scope of this chapter, a brief discussion about the importance of selecting an appropriate technique is given. This is extremely important, as selecting an inappropriate method for calculating reliability may be one reason that inadequate reliability of data is found (Bijou et al., 1968).

Interobserver agreement is one of the most common methods by which researchers calculate the reliability of data (Hartmann, 1977). Interobserver agreement refers to the extent to which the data recorded by different observers agree with one another (Mudford, Martin, Hui, & Taylor, 2009). There are a number of ways to calculate interobserver agreement, and the limitations of each method should be known before selecting one for use. For example, one can divide the number of sessions in which the two observers agreed by the total number of sessions and multiply by 100. This method, while commonly used, is very stringent and does not use all of the information available (Hartmann, 1977). For a more detailed review of methods of calculating interobserver agreement, the reader is directed elsewhere in the literature (e.g., Hopkins & Hermann, 1977; House, House, & Campbell, 1981; Mudford et al., 2009). For the purposes of this chapter, it is merely important to understand that there is more than one method for calculating the reliability of data. It is important to understand the advantages and disadvantages of each, so that the results can be interpreted correctly. Incorrectly used methods may inflate or deflate the perceived reliability of the data, leading to incorrect interpretations.
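As an illustration of how the choice of formula changes the resulting figure, the sketch below applies the stringent exact-agreement method described above and, for comparison, a commonly used total-count method (smaller count divided by larger count, averaged across sessions) to the same hypothetical pair of observers; the session counts are invented.

```python
# Session totals recorded by two observers (hypothetical counts).
observer_a = [10, 7, 12, 9, 11]
observer_b = [10, 8, 12, 9, 10]

# Stringent method described above: percentage of sessions in which the two
# observers agreed exactly.
sessions_agreed = sum(a == b for a, b in zip(observer_a, observer_b))
exact_session_agreement = 100 * sessions_agreed / len(observer_a)

# A commonly used, less stringent alternative: smaller count divided by larger
# count within each session, averaged across sessions (total-count agreement).
per_session = [min(a, b) / max(a, b) if max(a, b) else 1.0
               for a, b in zip(observer_a, observer_b)]
total_count_agreement = 100 * sum(per_session) / len(per_session)

print(f"Exact session agreement: {exact_session_agreement:.0f}%")   # 60%
print(f"Mean total-count agreement: {total_count_agreement:.0f}%")  # 96%
```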

Coding Systems

There are many different ways to code behavior, each of which can impact reliability. As discussed previously, one decision that must be made is whether to use specific or general codes (Bijou et al., 1968). General codes allow for more complex behavioral patterns to be recorded; however, they leave more room for interpretation. This, in turn, can lead to a decrease in reliability. The more comprehensive and specific the code is, the higher reliability will be (Bijou et al., 1968).

A second reliability factor related to coding is the complexity of the coding system. Complexity can be defined as the number of categories in the coding system, the number of behaviors being observed, or the number of individuals being observed (Kazdin, 1977). As these numbers increase, the complexity of the coding system increases. The question then becomes: what impact does increasing complexity have on reliability? A number of researchers have sought to answer this question. Mash and McElwee (1974) examined the effects of complexity, defined as the number of categories, on both accuracy and agreement. They compared the use of two coding systems, one with four behavior categories and the other with eight. Additionally, the eight-category system required the observer to make more complex discriminations between categories. The authors found an inverse relationship between complexity and reliability. Agreement increased over time in the more complex group, to the point that significant differences were no longer found after the fourth trial. Accuracy similarly increased over time, although it remained significantly lower in the more complex group throughout the study. This was despite the fact that both groups showed mastery of the coding systems during training. The predictability of the behavior did not have any effect on accuracy.

Taplin and Reid (1973) conducted a study on the effects of instructions and experimenters on observer reliability. While not the primary aim of the study, the researchers also examined the effect of complexity, defined as the number of different codes used, on reliability. They found a moderate negative correlation (r = −0.52) between complexity and reliability.

Kazdin (1977) provides a number of implications of these findings. Firstly, estimates of reliability must be interpreted with respect to complexity. Additionally, the complexity of the data collection system may change over time. For example, if interventions are successful, the number of behaviors being recorded may decrease. Therefore, calculations of reliability may not be comparable across different phases.

Barrios (1993) calls for a rational appraisal of the demands of the coding system. This involves having the system evaluated by those creating the system, colleagues, and potential observers. If the demands are found to be too high, one may need to decrease the number of behaviors being tracked, simplify the nature of the behavior being tracked, or decrease the duration of observations.

Training

Another factor that can influence the reliability of data is the training of observers. If observers are not properly trained, they may inaccurately collect data and fail to control their own behaviors (Bijou et al., 1968). Bijou et al. (1968) provide some recommendations for ensuring adequate training, such as familiarizing observers with recording tools and employing a second observer during training.

Barrios (1993) provides a six-step model for training and monitoring observers. The first step is orientation. This involves conveying the importance of objective data collection to observers. In this step, observers are also told what they will be doing and what is expected of them. This includes warnings against potential sources of error, including biases, observer drift, and reactivity (as discussed below). The second step is to educate observers about the operational definition that will be used and how data will be recorded. This may be accomplished through written materials, filmed instructions, or in-person demonstrations. The third step is to evaluate the observer’s training. Observers are assessed to ensure that they have an adequate understanding of the operational definition and coding system. Feedback and corrections are given at this step, until the observer has mastered the system. Additionally, the operational definition and coding system may be altered at this step if they are found to be inadequate. The fourth step is application; observers begin using the data collection system, first in analog situations and then in real situations. Observers must attain sufficient agreement and accuracy to progress. This ensures that observers have mastered the system before collecting data in the situation of interest. Observers are gradually introduced to data collection in the setting of interest, as mastery in analog sessions does not ensure mastery in actual sessions (as discussed further below). Additionally, observers are continually provided feedback concerning their reliability and reminded that reliability will be periodically checked. The fifth step is recalibration. This is where reliability of data collection is assessed in the actual situation of interest. The final step in training and monitoring is termination. After data collection is completed, observers are asked for feedback on the data collection system, provided information on what was found and how it will be used, reminded of confidentiality, and thanked for their assistance. Hartmann and Wood (1982) provide a similar seven-step model.

An additional aspect of training is the type of behavior that is trained. Mash and McElwee (1974) found that observers who were trained to code unpredictable behavior had better accuracy in novel situations than those trained to code predictable behavior sequences. This is critical as in vivo data collection necessitates that observations occur in varying settings and situations.

As mentioned above, even if observers achieve mastery in training, this does not necessarily mean that they have been adequately trained. Mash and McElwee (1974) found that the reliability of data collected by observers using a complex coding system was inadequate, despite the fact that they had achieved mastery of the coding system during training. Similarly, Taplin and Reid (1973) found that observers never attained the same level of performance during data collection that they achieved in training. In fact, reliability decreased, on average, 15% once training was completed and data collection began. Thus, the reliability achieved during training is likely to be an overestimate of the reliability that will be achieved during actual data collection. The steps outlined above by Barrios (1993) are intended to help protect against this drop in reliability.

Reactivity to the Assessment of Reliability

As the reliability of data is crucial, researchers and clinicians will often compute reliability as data are being collected. However, knowledge of this reliability assessment can, in and of itself, affect reliability. Reid (1970) conducted an experiment in an analog setting to examine this effect. Participants were instructed to code behavioral data in a lab, believing that reliability was being assessed only during the first two sessions (reliability was covertly assessed during the remaining three sessions). The reliability of observers significantly decreased when they did not believe that reliability was being checked, dropping by an average of approximately 25% during covert assessment. This change was not gradual, occurring immediately upon switching conditions. Although a complex coding system was used in this study, it should be noted that in the previously discussed study by Mash and McElwee (1974), reliability increased over time in the complex condition; the opposite was found in Reid’s study due to the covert reliability assessment.

A follow-up study by Taplin and Reid (1973) compared three conditions: covert reliability assessment, random overt reliability assessment, and spot-check assessment. That is to say, participants believed either that no reliability assessment would take place, that about 20% of their data would be randomly checked for reliability after observations were completed, or that periodic overt checks would take place. Results showed that reliability during spot-checks was significantly higher than both before and after the checks. No significant differences were found between the groups with respect to reliability, likely because reliability for all groups dropped significantly after training.

Romanczyk, Kent, Diament, and O’Leary (1973) conducted a similar experiment to examine the differential effects of overt and covert reliability assessment. Two assessors accompanied observers during all observations. In the overt condition, observers were led to believe that either one assessor or both were computing reliability. In reality, both assessors were computing reliability in all conditions, each using a unique code. This allowed the researchers to compare reliability with an identified assessor and reliability with an unidentified assessor. In the covert condition, observers were not told that reliability was being assessed. Reliability during covert assessment was consistently lower than reliability during overt assessment. Disruptive behaviors were reported at lower rates (approximately 25% lower) when observers did not believe that reliability was being checked. Additionally, reliability was consistently higher with the identified assessor than with the unidentified assessor. This suggests that, when observers knew who was computing reliability, they shifted their observational criteria to match those of the assessor.

These findings strongly suggest that observers are reactive to knowledge of reliability assessments. Observers are more likely to record reliable data when they believe that reliability is being assessed. Unfortunately, it is not typically possible to continuously assess for reliability. Similarly, it may be difficult to convince observers that their reliability is constantly being checked (Kazdin, 1977). Taplin and Reid (1973) suggest that it may be necessary to overtrain observers, so that reliability remains adequate even if it decreases after training.

Observer Drift

One possible explanation for decreases in reliability after training is known as observer drift. While different researchers have defined the term differently, observer drift chiefly refers to changes in an observer’s definition of a behavior over time (Kazdin, 1977; Smith, 1986). For example, observers may be trained to collect data on physical aggression, defined as hitting, pushing, and pinching. If pinching is low in frequency or intensity, the observer may eventually stop including it in his or her working definition of physical aggression. Thus, while early data would include pinching, later data would no longer include this aspect of the behavior. As pinching should be recorded according to the behavior’s definition, reliability will decrease over time.

Kent, O’Leary, Diament, and Dietz (1974) examined the variance in behavioral recordings accounted for by observers. Observer pairs accounted for a total of about 17% of the variance in recordings of disruptive behavior, with about 5% representing consistent differences between pairs of observers throughout the experiment. Additionally, about 12% of the total variance in disruptive behavior recordings was accounted for by interactions of observer pair with other factors.

Observer drift may be difficult to detect, as agreement between observers may remain high while accuracy decreases (Kazdin, 1977). Observers who work together closely may make similar changes to their definitions, thereby maintaining agreement while losing accuracy. This effect is known as consensual observer drift (Johnson & Bolstad, 1973; Power, Paul, Licht, & Engel, 1982; Smith, 1986). Observer drift may also affect within-subjects research designs, as the observer may change his or her definition over the course of the study (Lipinski & Nelson, 1974). Taking the above example of pinching, data might show a decrease in physical aggression over time. However, this decrease would be confounded by the observer’s modification of the definition of physical aggression. The best way to combat this problem is likely to periodically retrain observers, ensuring they are applying the definition correctly (Hartmann et al., 2004; Kazdin, 1977). Observer drift is, therefore, another reason that in vivo data collection can require extensive training and costs.

Observer Bias, Distraction, and Discontent

Characteristics of the observer can also threaten the reliability of real-time data. Implicit and explicit biases held by the observer, distractions, and discontent can all affect the reliability of collected data. Additionally, the manner in which data are presented by observers can affect the way in which the data are interpreted.

Observer expectancies and biases can influence the reliability of collected data. Such biases include hypotheses about the purpose of the data collection, hypotheses as to how the subject of data collection should behave, and beliefs about what the data should look like (Hartmann & Wood, 1982). Biases can also be developed based on subject characteristics and information expressed by the primary investigator (Hartmann & Wood, 1982).

A study by Kent et al. (1974) examined the effect of expectation biases on behavior recording. Observers were either told to predict changes in behavior or to predict no change. They found that the evaluations of the treatment effects were significantly affected by predictions. Those told to predict changes in behavior reported seeing a global change in behavior. However, actual behavior recordings did not significantly differ between the groups. Those who believed the behavior would change were just as accurate as those who did not believe the behavior would change. This would seem to suggest that expectancies do not bias in vivo data collection, although they may affect global evaluations.

A follow-up study (O’Leary, Kent, & Kanowitz, 1975) examined the influence of both instructions and feedback on data collection. Observers were told that two behaviors were expected to decrease, whereas two other behaviors would experience no change. Positive feedback was given when observers reported decreases in the behavior that was expected to change, whereas negative feedback was provided when they reported no change or increases in the target behaviors. After feedback, observers recorded the target behaviors significantly less frequently, suggesting that significant bias had occurred. No changes were found with respect to the control behaviors. These findings suggest that while expectancies may not bias data collection, a combination of both expectancies and feedback can have a significant impact.

Another possible source of bias involves the presentation and analysis of data. Although often overlooked, inappropriate analysis and misleading presentation of data can be another form of bias (McNamara & MacDonough, 1972). For example, data may be biased if no statistical analyses are conducted. Similarly, graphical data can be misleading if not displayed accurately. It is critical that data be reported as unambiguously as possible, so that anyone who uses the data can come to similar conclusions (McNamara & MacDonough, 1972).

If observers are distracted, either externally or internally, the collected data may not be reliable (Barrios, 1993). For example, if there is a lot of noise in the environment, the observer may be unfocused and unable to record all relevant behaviors. Similarly, worries or preoccupations may distract the observer from accurately recording data. Barrios (1993) also discusses discontent as a type of internal distractor that may be especially problematic. If observers are treated in a disrespectful or harsh manner, discontent may arise. Discontent may also arise from unpleasant interactions between observers and others involved in the data collection process. While steps can be taken to reduce possible external distractors, such steps are likely to reduce the external validity of the data. Barrios (1993) suggests monitoring observers for signs of internal distractions and intervening if necessary. However, it may not be possible to detect all internal distractions.

Validity of Data

In addition to ensuring accuracy and agreement, it is critical that data collection remain valid. One of the main advantages of in vivo data collection is that it can occur in the natural environment, increasing the external validity of data. Unfortunately, there are still a number of extraneous factors that may influence in vivo data collection, threatening the validity of the data. In vivo data collection is often used to measure the frequency, intensity, and duration of a target challenging behavior. Extraneous variables, such as observer effects and reactivity, can alter these target variables during data collection sessions, making the data no longer representative of the behavior.

Observer Effects and Reactivity

As discussed, one of the primary advantages of in vivo data collection is that it can be conducted in the natural environment in which the behavior occurs (Gardner, 2000). This assumes, however, that the presence of an observer does not affect this environment. Research has shown that this is not necessarily the case (Hartmann & Wood, 1982; Lipinski & Nelson, 1974; Repp, Nieminen, Olinger, & Brusca, 1988). Those being observed may even be hostile about the fact that they are being observed (Lipinski & Nelson, 1974). Conversely, they may try to impress observers or reduce challenging behaviors in the presence of observers (Lipinski & Nelson, 1974). Such changes in behavior during observations are often termed reactive effects or reactivity (Hartmann & Wood, 1982). A number of factors can contribute to reactive effects, such as the child’s gender and age, the gender of the observer, the familiarity of the participant with the observer, and the observation setting (Gardner, 2000).

The issue of reactivity is a complex one, and its effects on the validity of data are unclear (Goldfried, 1982; Hartmann & Wood, 1982). Hartmann and Wood (1982) outlined five factors that may contribute to reactivity: social desirability, subject characteristics, conspicuousness of observation, observer attributes, and the rationale for observation. Individuals may try to suppress undesirable behaviors or engage in more socially appropriate behaviors when being observed. Therefore, an individual may be less likely to engage in the challenging behavior during observations, as such behaviors are not socially desirable. Characteristics of those being observed may also contribute to reactivity. Hartmann and Wood (1982) suggest that those who are more sensitive, less confident, and older than 6 years may be more reactive than others during observations. Additionally, the more obtrusive the data collection is, the more likely reactive effects are to occur. However, findings on obtrusiveness are not consistent and do not guarantee that the data will be invalid. The fourth factor outlined by Hartmann and Wood (1982) is the characteristics of the observer; attributes such as race and gender may influence reactive effects. Finally, the rationale for observation is a potentially influential factor. If there is not a good rationale for the presence of the observer, reactive effects may be more likely.

Unfortunately, there is a general lack of research on the effects of such characteristics, particularly with respect to challenging behaviors (Goldfried, 1982; Harris & Lahey, 1982). If substantial changes in behavior occur, reactivity can affect the external and internal validity of collected data. It is not possible to separate out the effects of reactivity, so one cannot be sure the data are similar to what would be found in its absence (Repp et al., 1988).

Researchers have also found that children can be reactive to parents’ behavior during observations (Harris & Lahey, 1982). Lobitz and Johnson (1975) examined the effects of parental manipulation on their children’s behaviors. Parents were asked to present their child as bad, good, or normal. Significantly more challenging behaviors were found under the bad condition, when compared with the other two. This was true for both children with a history of challenging behaviors and those without such a history. No significant differences were found between the good and normal conditions, suggesting that parents could not make their children look better, only worse.

In a similar study examining compliance, parents were asked to make their children appear obedient and later disobedient in a clinic playroom (Green, Forehand, & McMahon, 1979). Significant changes in compliance were found in children with a history of challenging behaviors and those without such a history. Parental behaviors, such as use of rewards and questioning versus commanding, differed between the two conditions, likely accounting for the changes in compliance.

Taken together, these findings are quite significant. Suppose parents want to ensure that their children will receive treatment. The parents could manipulate their own behavior during an observation to make their children’s behavior appear worse than it typically is. The data collected in this observation would not be representative of the true behavior and thus would no longer be valid (Harris & Lahey, 1982). However, it may be difficult to detect when this is occurring. While certain parental behaviors were associated with changes in child behavior (e.g., commanding), it is unlikely that the observer will know whether this behavior is typical of the parent.

Interpretation and Use

The use of real-time data for functional assessment is described in greater detail elsewhere in this text. However, a brief consideration should be given to how in vivo data are used and interpreted. When in vivo data are used to examine the maintaining variable of a challenging behavior, the process is often termed descriptive analysis (Iwata et al., 1990; Lerman & Iwata, 1993). There are many ways to conduct descriptive analyses, many of which are slight adaptations of others. A brief description of two of the more basic methods is given below, along with associated problems. The reader is directed to Chap. 8 for a more in-depth discussion of these methods.

The first method, as discussed above, is the use of ABC cards, also known as sequential analysis (Bijou et al., 1968; Sulzer-Azaroff & Mayer, 1977). As the data are quantified, one can calculate the probability that a target behavior will follow a specific antecedent or be followed by a specific consequence (Iwata et al., 1990). However, as mentioned previously, there are no standards for interpreting these data (Tarbox et al., 2009). Additionally, the results from such an analysis are often inconsistent; the function suggested by the antecedent may be inconsistent with the function suggested by the consequence (Tarbox et al., 2009).

A second method, developed by Touchette, MacDonald, and Langer (1985), is known as scatter plot analysis. Data are graphed on a scatter plot, with the level of behavior observed during specified time intervals recorded throughout the day. The main purpose of this method is to see whether there is a pattern in the distribution of behaviors throughout the day; the scatter plot shows when a behavior typically occurs and when it rarely occurs. This allows for the identification of possible temporal variables that may be affecting the target behavior. The authors suggest that scatter plot analysis be used when a behavior occurs frequently, as informal observation may not reveal a reliable functional relationship. The main problem with scatter plot analyses is that the scatter plot only provides information on environmental variables related to the time of day (Axelrod, 1987; Iwata et al., 1990). For example, suppose an individual engages in self-injury when he or she sees another individual with a preferred item. As this could occur at any point during the day, the scatter plot is unlikely to reveal information about this relationship. Thus, the scatter plot will only identify antecedents and consequences that are related to the behavior on a fixed, regular basis (Axelrod, 1987).
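A minimal sketch of how such a grid might be assembled and plotted is given below. It assumes invented 30-min interval levels over a single week and uses filled and open squares for high and low levels of responding, loosely following the format Touchette et al. (1985) describe; the axes, interval length, and symbol choices are illustrative rather than prescribed.

```python
import random
import matplotlib.pyplot as plt

# Hypothetical scatter plot data: for each day and each 30-min interval, the
# behavior is coded as absent (0), low level (1), or high level (2).
random.seed(1)
days = range(1, 8)                                   # one week of observation
intervals = [f"{8 + i // 2:02d}:{'00' if i % 2 == 0 else '30'}" for i in range(20)]  # 08:00-17:30
levels = [[random.choice([0, 0, 0, 1, 2]) for _ in intervals] for _ in days]

fig, ax = plt.subplots(figsize=(5, 6))
for d, day in enumerate(days):
    for i in range(len(intervals)):
        if levels[d][i] == 2:
            ax.plot(day, i, "ks")                    # filled square: high level of responding
        elif levels[d][i] == 1:
            ax.plot(day, i, "ks", fillstyle="none")  # open square: low level of responding
ax.set_xlabel("Successive days")
ax.set_ylabel("Time of day")
ax.set_yticks(range(len(intervals)))
ax.set_yticklabels(intervals, fontsize=6)
fig.tight_layout()
fig.savefig("scatter_plot.png")
```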

A major problem for all methods of descriptive analysis is that the established relationships are correlational (Bijou et al., 1968; Iwata et al., 1990; Lerman & Iwata, 1993; Tarbox et al., 2009). The fact that a relationship has been established does not necessarily mean it reflects a functional relationship. For example, as described above, the target behavior may be highly correlated with frequently occurring, but unrelated, events (Iwata et al., 1990). Conversely, the true functional variable may be one that is only reinforced intermittently. For example, take a child who engages in tantrums in an attempt to escape hygiene-related tasks. Escape may serve as the function even if the probability of escaping is very low; therefore, a very low correlation would be found between the behavior and the consequence of escape. Conversely, if the mother of the child provides attention during tantrums, a high correlation would be found between the behavior and the consequence of attention. Thus, descriptive analysis may incorrectly identify attention as the maintaining variable when it is, in fact, escape.

Conclusions

The collection of data is one of the primary aspects of conducting a functional assessment. Data are used to help understand the nature of the challenging behavior, including characteristics such as the frequency, duration, intensity, and function of the behavior. The primary method by which these data are collected is in vivo observation. The chief advantage of in vivo data is that they are collected in the natural environment in which the behavior occurs, bypassing the sources of bias and error that may accompany indirect data.

While in vivo data collection no doubt has its advantages, many potential problems with such data have been discussed. A number of factors can influence the reliability and validity of in vivo data. Careful consideration should be given to these factors at each stage of data collection, from defining the behavior to using the data. While the problems of in vivo data have been discussed at length in this chapter, such data are not without utility. Several researchers have provided recommendations on how to minimize many of these sources of error (e.g., Barrios, 1993; Repp et al., 1988). Additionally, while research does not support conducting functional assessments based solely on in vivo data (Iwata et al., 1990; Lerman & Iwata, 1993; Tarbox et al., 2009), there may be value in combining this approach with others.

While a great deal of research has been conducted on this subject matter, there is still a need for more. Much of the research on in vivo data collection predates the popularization of modern functional assessment. Thus, little research has examined these sources of error with respect to their role in the functional assessment process. While many of the findings are likely to hold true, there is a need for research examining this empirically. Additional research in this area will help to ensure the reliability and validity of data, so that more meaningful interpretations can be made.