Participants in many psychological experiments have to compare the magnitudes of two stimuli. The outcome of such comparisons does not always match what “common sense” would predict, and this discrepancy is still not fully explained. This is the point of departure of this study.

It is often assumed that comparative judgment is determined only by the difference between the stimuli’s magnitudes, as experienced one by one. According to this simple difference model of comparison (Thurstone, 1927a, 1927b), no systematic underestimation or overestimation of one stimulus relative to the other should occur, regardless of the order in which they are presented. Nevertheless, such effects do occur: Often, when two physically equal stimuli are compared, one of them tends to be judged as being greater (e.g., heavier or of longer duration) than the other. This kind of effect was first noted by the founder of psychophysics, Gustav Fechner (1860), who named it the time-order error (TOE). When the first stimulus is overestimated relative to the second stimulus, the TOE is positive, and in the opposite case, negative.

The Fechnerian TOEs have been the subject of much research throughout the years (see Hellström, 1985, for a review), and several explanations have been given. Most of these have assumed that the TOE is a perceptual/cognitive phenomenon. Yet, during the era of S. S. Stevens’s “new psychophysics,” it became an established “truth” that the TOE was due to a methodological flaw (Stevens, 1957) or to some form of judgment bias (Allan, 1977; Allan & Kristofferson, 1974; Engen, 1971; Luce & Galanter, 1963; Restle, 1961). However, Jamieson and Petrusic (1975) and Hellström (1977) varied the response format in TOE experiments and concluded from their results that a bias-based explanation could not hold: The TOE proved virtually insensitive to the response format—for instance, judging the second stimulus as less or greater than the first, or the first as less or greater than the second. Whereas Ulrich and Vorberg (2009) as well as Alcalá-Quintana and García-Pérez (2011) and García-Pérez and Alcalá-Quintana (2017, 2019) have maintained that judgment bias is the major determining factor of the TOE, most contemporary researchers emphasize perceptual-cognitive mechanisms (e.g., Bausenhart, Dyjas, & Ulrich, 2015; Hellström & Rammsayer, 2015; Patching, Englund, & Hellström, 2012; Preuschhof, Schubert, Villringer, & Heekeren, 2010; Raviv, Ahissar, & Loewenstein, 2012; van den Berg, Lindskog, Poom, & Winman, 2017). Nonetheless, stimulus comparison, like human judgment in general, cannot be expected to be free from bias, and this fact has to be taken into account. The most likely kind of bias in stimulus comparison seems to be “indecision bias” (García-Pérez & Alcalá-Quintana, 2017, 2019): When the participant compares two stimuli and must select one as being the greater, they have to guess when uncertain.

Measurement of difference limens

Studies of the comparison of stimuli are often performed in order to measure discriminability, which is usually conceived in terms of a difference limen (DL; also, just noticeable difference). In typical experimental designs, based on the constant method (Guilford, 1954), a standard stimulus (St) and a comparison stimulus (Co) are presented in succession, St being held at a constant magnitude, and Co varying from trial to trial. Two so-called limens (thresholds) can then be determined: the upper limen (the value of Co that evokes 75% judgments of Co > St) and the lower one (the value of Co that evokes 75% judgments of Co < St). Both of the limens are affected when there is a TOE, so the DL is usually taken as half the difference between the upper and the lower limen (e.g., Luce & Galanter, 1963).

One problem with the DL is that its size has been found to depend on the presentation order of St and Co—that is, on whether the changes to be detected are in the first stimulus or the second one. Holding the first stimulus constant and varying the second one (order StCo) has an impact on the proportion of judgments of “second greater” that is often found to differ from what is obtained in the reverse procedure (order CoSt). Thereby, the two DLs will differ. This is called the Type B effect (Bausenhart et al., 2015; Ulrich & Vorberg, 2009), or standard position effect (SPE; Hellström & Rammsayer, 2015; Rammsayer & Wittkowski, 1990). In terms of DLs, the Type B effect can be defined as the difference DLStCo − DLCoSt. Most often, the DL has been found to be smaller with the presentation order StCo than with CoSt, so that there is a negative Type B effect (Ellinghaus, Ulrich, & Bausenhart, 2018).

The TOE (also called the Type A effect) and the Type B effect make accurate determination of stimulus discriminability a methodological challenge that has been largely neglected, but it is a challenge that needs to be addressed. For instance, adequate assessment of duration discrimination is important in research on the neuropsychological basis of time perception (Rammsayer, 2008). To take account of the presentation-order effects, the simple difference model has to be replaced by a better one. This is also required for a deeper understanding of what goes on in our minds when we carry out the experimental—and also everyday—task of comparing two successive stimulus magnitudes.

Modeling successive stimulus comparison

Michels–Helson (MH) model

Michels and Helson (1954; also in Helson, 1964, Ch. 4) studied comparison of the magnitudes of two successive stimuli on a difference rating scale. They found, besides the TOE, that the scaled difference between the two stimuli was determined to a greater extent by the second-presented stimulus than by the first-presented one. The MH model states that the second-presented stimulus in the pair is not compared directly to the first-presented one, but to a weighted compound of the first-presented stimulus and the series adaptation level (AL). The latter is, in turn, a weighted geometric mean of previously experienced stimuli with weights according to their degree of recency—termed by Helson (1964) as series, background, and residual stimuli. Hence, d12* = u {[s · ψ1 + (1 − s) ψa] – ψ2}, where d12* is the scaled stimulus difference, u is a scale factor, ψ1 and ψ2 are the subjective stimulus magnitudes, ψa is the subjective magnitude corresponding to the series AL, and s is the stimulus weight.

Internal reference (IR) model

This model (Dyjas, Bausenhart, & Ulrich, 2012) is similar to the MH model. The second stimulus in a pair is compared not with the first stimulus, but with an IR. This IR is updated in a dynamic process, where the IR in the current trial is a weighted mean of the magnitude of the first stimulus in the current pair (weight g; 0 < g < 1) and the IR in the previous trial (weight 1 − g): d12 = IR − ψ2 = [g · ψ1 + (1 − g) IRp] − ψ2, where ψ1 is the magnitude of the first stimulus of the current pair and IRp is the previous IR. Thus, g also becomes the impact weight of the first stimulus in its comparison with the second stimulus, which enters the comparison directly with a weight of 1. Therefore, in the constant method, the DL is predicted to be smaller when the second stimulus is varied (presentation order StCo) than with the order CoSt. This is, by definition, a negative Type B effect. The IR model predicts no TOE because (unlike in the MH model) stimuli outside the series have no influence on the internal reference. As noted by Dyjas and Ulrich (2014), “the [IR model] implicitly assumes that the Type B effect and the [TOE] are independent and that these effects reflect different underlying mechanisms” (p. 1139).
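The updating rule can be illustrated with a minimal Python sketch. The function names and all numerical values are hypothetical and chosen only for illustration; they are not taken from Dyjas et al. (2012). The sketch shows that a change in the first stimulus enters the comparison with weight g only, whereas the second stimulus enters with full weight.

```python
def update_internal_reference(ir_previous, psi1, g):
    """Weighted mean of the current first stimulus (weight g) and the previous IR (weight 1 - g)."""
    return g * psi1 + (1.0 - g) * ir_previous


def ir_difference(ir_previous, psi1, psi2, g):
    """Return (d12, updated IR), with d12 = IR - psi2 as in the IR model."""
    ir_current = update_internal_reference(ir_previous, psi1, g)
    return ir_current - psi2, ir_current


# Hypothetical values: g = 0.6, previous IR = 100, first stimulus = 110, second = 100.
d12, ir = ir_difference(ir_previous=100.0, psi1=110.0, psi2=100.0, g=0.6)
print(d12, ir)
```

In this example the first stimulus exceeds the second by 10 units, but the resulting difference is only 6 units because the IR is pulled toward its previous value.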

Sensation-weighting (SW) model

For clarity, it is pertinent to revisit the origins of the SW model. Hellström (1979) carried out a loudness comparison experiment with 16 stimulus magnitude combinations in each of 16 combinations of stimulus duration and interstimulus interval. To describe the total set of data, a preliminary linear model was adopted which, in terms of subjective magnitudes, was d12* = B1k · ψ1 − B2k · ψ2 + Ck, where d12* is the scaled subjective difference (calculated, for each stimulus combination [k], on group data for 12 participants, different for each condition), ψ1 and ψ2 are the magnitudes of the first and the second stimulus, B1k and B2k their regression coefficients, and Ck the intercept. This model was fitted to d12* and to the physical stimulus magnitudes via a power function with a fitted exponent. Across conditions, Ck proved highly linearly dependent on B1k and B2k. Using the best-fitting account of this dependence, Ck = a2 · B2k − a1 · B1k + c, the total number of fitted parameters in the model was reduced from 49 to 36, while preserving an excellent fit to the data (error variance 3.50% in the raw model and 4.94% in the accepted model). By analogy with the MH model, a1 and a2 were interpreted as reference levels (ReLs), ψr1 and ψr2, associated with the first and the second stimulus, respectively. The constant c was interpreted as u (ψr1 − ψr2), where u is a scale factor. This resulted in the SW model, which can be written (Hellström, 1979; cf. Hellström, 1985, 2000, 2003; Hellström & Rammsayer, 2004, 2015):

$$ {d_{12}}^{\ast }=u\ \left\{\left[{s}_1\cdotp {\uppsi}_1+\left(1-{s}_1\right)\ {\uppsi}_{r1}\right]\hbox{--} \left[{s}_2\cdotp {\uppsi}_2+\left(1-{s}_2\right)\ {\uppsi}_{r2}\right]\right\}+b, $$
(1)

where s1 and s2 are the weighting coefficients of the stimuli, and ψr1 and ψr2 are their current ReLs. Judgment bias is represented by b (which was not included in the original version of the SW model).
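The step from the preliminary linear model to Equation 1 can be made explicit by expanding Equation 1 and matching its terms with those of the linear model. The following is a worked sketch for b = 0 (as in the original version of the model). It assumes, as is implicit in treating a1, a2, and c as constants, that u and the ReLs do not vary across the conditions k.

$$ {d_{12}}^{\ast }=u\ {s}_1\ {\uppsi}_1-u\ {s}_2\ {\uppsi}_2+u\ \left(1-{s}_1\right)\ {\uppsi}_{r1}-u\ \left(1-{s}_2\right)\ {\uppsi}_{r2} $$

Matching coefficients gives B1k = u · s1 and B2k = u · s2, so that the intercept becomes

$$ {C}_k=u\ \left(1-{s}_1\right)\ {\uppsi}_{r1}-u\ \left(1-{s}_2\right)\ {\uppsi}_{r2}={\uppsi}_{r2}\ {B}_{2k}-{\uppsi}_{r1}\ {B}_{1k}+u\ \left({\uppsi}_{r1}-{\uppsi}_{r2}\right) $$

This has the form Ck = a2 · B2k − a1 · B1k + c, with a1 = ψr1, a2 = ψr2, and c = u (ψr1 − ψr2).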

The SW model is a natural generalization of the MH model, assuming that an adaptation-weighting mechanism operates on each of the compared stimuli, not only on the first one, so that the real comparison is not between the stimuli as such, but between two weighted compounds. Each of these compounds combines the subjective magnitudes of a stimulus and of its reference level (ReL). A ReL is conceptually similar to Helson’s (1964) adaptation level in being a product of the pooling of stimulus information from various sources. However, in the SW model the ReLs are not tied to Helson’s specifications of adaptation levels as weighted geometric means. The ReLs should usually be located near the center of the stimulus range, but have often been found to be slightly lower. ψr2 may differ from ψr1: Hellström (1979) found sound pressure levels of 67.38 dB and 68.20 dB corresponding to ψr1 and ψr2. Both of these are in the middle range of the stimulus magnitudes, but clearly below their mean dB value, 69.75 (the series AL value predicted by Helson’s theory). The difference between the two ReLs is likely to be due to the updating of ψr2 with fresh magnitude information on the current ψ1.

Importantly, the formulation of the SW model in Equation 1 allows estimation of the scale factor u, and thereby of the “absolute” values of s1 and s2. These values, or their relation, are not subject to any formal restrictions. Although s values may usually be expected to stay between 0 and 1, indicating compromise or assimilation, Hellström (1979) obtained s values >1 in many stimulus conditions, implying negative weights for ψr1 or ψr2, that is, a contrast effect (Hellström, 1985).

The three models discussed are all built on the common, empirically well-grounded notion of stimulus comparison, as described by a linear model with different weights for the two stimuli. The SW model emerged as an extension of the MH model, generalized by assuming a weighting process for both of the stimuli, not just the first one. Like the MH model, the IR model corresponds to the SW model with s2 = 1 (cf. Bausenhart et al., 2015; Dyjas et al., 2012). However, unlike the MH model, the IR model recognizes no influence by stimuli external to the current experimental series (but see Bausenhart, Bratzke, & Ulrich, 2016). It may be noted that this limitation may be more realistic for studies where the standard stimulus is fixed within a block, as in the studies just cited, than for experiments where stimulus magnitudes show greater variation between trials (e.g., Hellström, 1979, 2003; Michels & Helson, 1954).
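The relation between the models can be checked numerically. The following Python sketch implements Equation 1 and the MH equation and shows that the two coincide when s2 = 1, b = 0, and the first ReL is set equal to the series AL. The function names and the numerical values are hypothetical, chosen only for illustration.

```python
def sw_difference(psi1, psi2, s1, s2, psi_r1, psi_r2, u=1.0, b=0.0):
    """Scaled subjective difference d12* according to Equation 1 of the SW model."""
    compound1 = s1 * psi1 + (1.0 - s1) * psi_r1
    compound2 = s2 * psi2 + (1.0 - s2) * psi_r2
    return u * (compound1 - compound2) + b


def mh_difference(psi1, psi2, s, psi_a, u=1.0):
    """Scaled difference according to the MH model: only the first stimulus is weighted."""
    return u * ((s * psi1 + (1.0 - s) * psi_a) - psi2)


# Hypothetical magnitudes and weights, chosen only for illustration.
sw = sw_difference(psi1=12.0, psi2=10.0, s1=0.7, s2=1.0, psi_r1=9.0, psi_r2=9.0)
mh = mh_difference(psi1=12.0, psi2=10.0, s=0.7, psi_a=9.0)
print(sw, mh)
```

With s2 = 1 the second ReL drops out of Equation 1, which is why its value does not matter in this special case.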

Unlike the other models discussed, the SW model places no restrictions on the values of s1 and s2. Thereby, it can account for such stimulus-condition dependent patterns of negative and positive TOEs and Type B effects as were found by Hellström (1979, 2003). The SW model has proved extremely useful for analysis and interpretation of the data in a number of later studies (e.g., Hellström & Cederström, 2014; Hellström & Rammsayer, 2015). In the present study, the SW model correctly predicts an experimental outcome.

Explaining the TOE

In a common special case, ψr1 can be assumed equal to ψr2, and thereby both can be denoted by ψr. In this case, letting ψ1 = ψ2 = ψ, Equation 1 becomes

$$ {d}_{12}=u\ \left({s}_2-{s}_1\right)\ \left({\uppsi}_r-\uppsi \right)+b $$
(2)

When two stimuli of equal magnitude are compared, a value of d12 ≠ 0 implies, by definition, a TOE. So, the SW model basically accounts for the TOE as being caused by the difference between stimulus weights, multiplied by the subjective difference between the ReL and the stimulus level, and, additionally, a judgment bias. With s1 < s2 and ψr below the mean level of ψ, this results in the common finding of a generally negative TOE. Also, in experiments with varying stimulus magnitude level, the TOE becomes negatively related to the current level, a relation that reverses in the rarer case of s1 > s2 (Hellström, 1979, 2003).
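A short Python sketch of Equation 2 illustrates both statements. All parameter values are hypothetical: the ReL is placed at 9 units, below the mean (10) of five stimulus levels, and the weights are set to 0.6 and 0.9 in one case and reversed in the other.

```python
def toe_equal_stimuli(psi, psi_r, s1, s2, u=1.0, b=0.0):
    """d12 for two equal stimuli (Equation 2): u * (s2 - s1) * (psi_r - psi) + b."""
    return u * (s2 - s1) * (psi_r - psi) + b


# Hypothetical values: ReL at 9, stimulus levels from 6 to 14 (mean 10).
levels = [6.0, 8.0, 10.0, 12.0, 14.0]
toe_s1_smaller = [toe_equal_stimuli(p, psi_r=9.0, s1=0.6, s2=0.9) for p in levels]
toe_s1_larger = [toe_equal_stimuli(p, psi_r=9.0, s1=0.9, s2=0.6) for p in levels]

mean_toe = sum(toe_s1_smaller) / len(toe_s1_smaller)
print(toe_s1_smaller, mean_toe)
print(toe_s1_larger)
```

With s1 < s2 the TOE falls as the stimulus level rises and is negative on average. With s1 > s2 the relation to the stimulus level reverses.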

Type B effect in the SW model

The SW model accounts for the Type B effect as being, like the TOE, a consequence of the differential weighting: The stimulus that is changed has an impact on the discriminative response in proportion to its weight (in presentation order StCo, s2, and in order CoSt, s1) and the DL is therefore inversely proportional to this weight.

Recently, Ellinghaus et al. (2018) surveyed the Type B effect across several stimulus continua, and maintained that when it is found, it is consistently negative, as predicted by the IR model. In contrast, results of Hellström and Rammsayer (2015) suggest that also positive Type B effects occur. Furthermore, results by Hellström (2003) and, in particular, Hellström (1979), obtained with methods that did not directly assess the DL, show equivalents (in terms of the SW model, s1 > s2) of large positive Type B effects for tonal loudness with brief stimuli and short interstimulus intervals. Verifying the results of Hellström and Rammsayer (2015) would therefore be of theoretical importance, as this would refute the MH and IR models, but would be consistent with the SW model. Such verification was attempted in the present study, for the case of duration discrimination, which is no exceptional case with regard to the phenomena just discussed (Eisler, Eisler, & Hellström, 2008; Ellinghaus et al., 2018).

The present study

Hellström and Rammsayer (2004, 2015) used an adaptive staircase method to measure the DL for interval duration, with separate blocks for different stimulus presentation conditions. Experiment 2 in Hellström and Rammsayer (2015) employed filled auditory intervals, with St durations of 100, 215, 464, and 1,000 ms. In the present Experiment 1 we replicated this experiment with an improved procedure (see the Appendix). We also conducted two experiments with empty visual intervals (bounded by brief flashes): Experiment 2 (analogous to Experiment 1) and Experiment 3. In the first two experiments, we addressed perceptual-cognitive processes in duration discrimination, their expression as the TOE and the Type B effect, and their separation from judgment bias. In Experiment 3, we investigated whether, as is predicted by the SW model, the TOE can be shifted by manipulation of the ReLs. This manipulation was attempted by using two St durations, instead of one as in Experiment 2, in each separate block of trials. The prediction was tested by comparing the results of Experiments 2 and 3.

Experiments 1 and 2

In Experiments 1 and 2, duration discrimination was assessed with different presentation orders of standard (St) and comparison (Co) stimuli, and different St durations. DLs were measured using an adaptive two-alternative, forced-choice staircase method. Four interval durations were used in separate blocks. In Experiment 1, the intervals were filled auditory, and in Experiment 2, empty visual. These stimulus types were selected from those (also empty auditory and filled visual) used in Experiment 1 of Hellström and Rammsayer (2015) in order to confirm and further investigate the effect of stimulus duration on the size and direction of the Type B effect, which was found by Hellström and Rammsayer (Experiments 1 and 2) for these particular stimulus types.

Method

Participants

Undergraduate psychology students at the University of Bern took part in the experiments. In Experiment 1, there were 57 females and eight males ranging in age from 19 to 48 years (M ± SD = 22.4 ± 4.3 years), and in Experiment 2, 44 females and 11 males, 19 through 29 years of age (21.3 ± 2.0 years). The participants received course credit. All of them were naïve about the purpose of the study and reported normal hearing and normal or corrected-to-normal vision. Because of the clear audibility or visibility of the stimuli, and the task being to compare the duration of the stimuli, not their magnitude, no further screening of hearing or vision was deemed necessary. All participants gave their written, informed consent (see Footnote 1).

Apparatus and stimuli

Presentation of stimuli and recording of the participants’ responses were controlled by a computer program written in Turbo Pascal and an assembler-based timing routine. Timing accuracy of stimulus presentation was better than ±1 ms. Filled auditory stimuli (Experiment 1) were white-noise bursts presented binaurally through headphones (Sony CD 450) at an intensity of 66 dBA. Empty visual intervals (Experiment 2) were bounded by 3-ms flashes of a red light-emitting diode (LED; diameter 0.38°, viewing distance 60 cm, luminance 68 cd/m²) positioned at the eye level of the participant. The intensity of the LED was clearly above threshold, but not dazzling.

Procedure

The procedure was identical in Experiments 1 and 2. The participant was seated at a table with a keyboard and a computer monitor in a sound-attenuated and dimly lit room. To initiate the first trial, the participant pressed the space bar; the first stimulus interval was then presented after 900 ms, and then, after the 900-ms interstimulus interval, the second stimulus interval. Thereafter, the response was given by pressing one of two designated keys on the keyboard, labeled “first interval longer” and “second interval longer,” respectively (see Footnote 2). Accuracy, not speed, was emphasized in the instructions. The next trial started 900 ms after the participant’s response. No correctness feedback was given.

Adaptive staircase method

A more detailed description of the psychophysical procedure is given in Rammsayer (2012). Participants compared the durations of two successive intervals, standard (St) and comparison (Co), using a two-alternative forced-choice response: “first interval longer” or “second interval longer.” On each trial of a series, the Co was increased or decreased in duration after having been judged as shorter or longer, respectively, than the St. A step that increased the absolute difference between Co and St was three times longer than a step that decreased this difference, which made performance settle at 75% responses of “first longer” or “second longer” (see Hellström & Rammsayer, 2015, for an explanation). Each participant took part in only one experiment, which was run in one experimental session consisting of eight blocks, with a 1-min break following each block. After six practice trials, the experimental session comprised four pairs of 64-trial blocks, each block pair using one St duration, with the order of the four St durations (100; 215; 464; and 1,000 ms) balanced across participants. Each block pair comprised one Hi-Co block, where Co was initially longer than St, and one Lo-Co block, where Co was initially shorter than St. For half of the participants, each block pair started with a Hi-Co block, and for the other half, with a Lo-Co block. Each block comprised two randomly interleaved 32-trial series, one series of pairs with an Up (U) profile, where the second interval was initially longer than the first, and one with a Down (D) profile, where the second interval was initially shorter than the first. So, with StCo and CoSt indicating the presentation order, the four series types were StCoU, StCoD, CoStU, and CoStD. Trials in a Hi-Co block were, equally often and in random order, from the StCoU and the CoStD series, and in a Lo-Co block, from the StCoD and the CoStU series.

When the St was 100 (215; 464; 1,000) ms, the initial duration of the Co in a series was 35 (70, 100, 500) ms below the St duration (in Lo-Co blocks) or above it (in Hi-Co blocks). The Co duration was then changed, using the weighted up–down method as described above, to estimate the upper or the lower DL (i.e., the duration difference for which 75% judgments of “first interval longer” or “second interval longer,” as pertinent, were obtained). In a Lo-Co (Hi-Co) block, the Co was increased (decreased) by 5 (9, 15, 100) ms after having been judged as shorter (longer) than the St, and decreased (increased) by 15 (27, 45, 300) ms after having been judged as longer (shorter) than the St. These steps were used for Trials 1–6; in Trials 7–32, the corresponding steps were 3 (6, 10, 25) and 9 (18, 30, 75) ms. See Table 6 in the Appendix for a summary of the procedure.
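The reason why a 3:1 step ratio targets the 75% point can be seen from a balance argument: the staircase is in equilibrium when the expected change in Co is zero, that is, when p · (small step) = (1 − p) · (large step), where p is the proportion of responses that trigger the small step. The following Python sketch computes this proportion and applies the step rule described above. The function names are hypothetical, and the sketch illustrates the rule as stated in the text rather than the program actually used in the experiments.

```python
def equilibrium_proportion(small_step, large_step):
    """Proportion of small-step responses at which the expected change in Co is zero."""
    return large_step / (small_step + large_step)


def next_co(co, judged_co_longer, block, small_step, large_step):
    """Apply one weighted up-down step to the Co duration (ms).

    block is "Lo-Co" (Co starts below St) or "Hi-Co" (Co starts above St).
    """
    if block == "Lo-Co":
        # Judged shorter: Co is increased by the small step; judged longer: decreased by the large step.
        return co - large_step if judged_co_longer else co + small_step
    if block == "Hi-Co":
        # Judged longer: Co is decreased by the small step; judged shorter: increased by the large step.
        return co - small_step if judged_co_longer else co + large_step
    raise ValueError("block must be 'Lo-Co' or 'Hi-Co'")


# St = 100 ms, Lo-Co block: Co starts 35 ms below St; steps of 5 and 15 ms (Trials 1-6).
co = 100 - 35
co = next_co(co, judged_co_longer=False, block="Lo-Co", small_step=5, large_step=15)  # 70
co = next_co(co, judged_co_longer=True, block="Lo-Co", small_step=5, large_step=15)   # 55

print(equilibrium_proportion(5, 15), equilibrium_proportion(3, 9), co)
```

Both step pairs used for the 100-ms St (5 and 15 ms, then 3 and 9 ms) give the same equilibrium proportion, so reducing the step size after Trial 6 changes the resolution of the staircase but not its target.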

Measurement and modeling

Raw DLs.

In experiments where d12 is measured on each experimental trial (e.g., Hellström, 1979, 2003), fitting the SW model (Equation 1) to the data is quite straightforward. In contrast, what is measured in each condition of the present experiments is the value of Co that evokes 75% or 25% judgments of “first interval longer.” For each participant and each of the four conditions per St duration, the mean, across the last 20 trials, of the duration difference between the first and second presented stimulus (i.e., Co − St in CoSt series and St − Co in StCo series) was computed. From this we obtained the raw DL: rDLD in D series and rDLU in U series. At the rDLD the d12 value corresponds to the 75th percentile, and at the rDLU to the 25th percentile, in this participant’s distribution of d12 across trials. We denote these d12 values by d12x and −d12x, respectively. The measured rDL values are, as is detailed in the text, subject to condition-specific effects, and they should not be taken as indices of discriminability.
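The averaging step can be sketched in Python as follows. The function name and the trial sequence are hypothetical; the sequence is a made-up StCoD series (St = 100 ms, Co starting at 65 ms) and not data from the experiments. The function returns the signed mean of the first-minus-second difference, and any further sign handling for U series is left open here.

```python
def raw_dl(first_durations, second_durations, n_last=20):
    """Mean first-minus-second duration difference (ms) over the last n_last trials of a series."""
    diffs = [f - s for f, s in zip(first_durations, second_durations)]
    tail = diffs[-n_last:]
    return sum(tail) / len(tail)


# Hypothetical 32-trial StCoD series: St (first) fixed at 100 ms, Co (second) varied.
st_values = [100] * 32
co_values = [65, 70, 75, 80, 85, 90] + [88, 91] * 13
rdl_d = raw_dl(st_values, co_values)
print(rdl_d)
```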

Modeling approach

To model the participant’s comparison behavior, the SW model (Equation 1) was adapted to the particular type of experimental data obtained. Similar modeling was used in Hellström and Rammsayer (2004, 2015). The psychophysical function was assumed to be the identity function, ψ = ϕ, over the range of Co intervals for each St duration (no assumption was made concerning its shape across St durations). Also, d12 is specified in ϕ units, so that the scale factor u can be dropped. From Equation 1 we obtain

$$ {d}_{12}=\left[{s}_1{\upphi}_1+\left(1-{s}_1\right)\ {\upphi}_{r1}\right]-\left[{s}_2{\upphi}_2+\left(1-{s}_2\right)\ {\upphi}_{r2}\right]+b $$
(3)

For Experiments 1 and 2, the blocked design, with only one St duration per block, makes it reasonable to assume that the two ReLs are equal, ϕr1 = ϕr2 = ϕr (cf. Hellström, 2000), which yields the simpler expression

$$ {d}_{12}={s}_1{\upphi}_1-{s}_2{\upphi}_2+\left({s}_2-{s}_1\right)\ {\upphi}_r+b $$
(4)

The “noise” dispersion of d12 across trials, σd12, may be termed the comparatal dispersion (Gulliksen, 1958), and we assume it to be proportional to the mean subjective stimulus magnitude (as per Ekman’s law; see Eisler et al., 2008). For simplicity, in the equations the physical magnitudes of the St and the Co, ϕSt and ϕCo, are abbreviated S and C. Our assumption ψ = ϕ then yields d12x = wi · S (as per Weber’s law in its simple form), where wi is the participant-specific value of σd12 / S, multiplied by 0.6745 (i.e., the standard normal deviate corresponding to the 75th percentile). We term w the Weber constant; w is not the same thing as a measured Weber fraction, but is assumed to underlie it. Judgment bias is likewise modeled as a participant-specific proportion of the St duration, bi · S.
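The constant 0.6745 and the resulting value of w can be reproduced with Python’s standard library. In this sketch the dispersion value and the function name are hypothetical; only the percentile and the relation d12x = w · S come from the text.

```python
from statistics import NormalDist

# Standard normal deviate at the 75th percentile (about 0.6745).
Z75 = NormalDist().inv_cdf(0.75)


def weber_constant(sigma_d12, standard):
    """w = z(0.75) * sigma_d12 / S, so that d12x = w * S."""
    return Z75 * sigma_d12 / standard


# Hypothetical participant: comparatal dispersion of 20 ms at S = 100 ms.
w = weber_constant(sigma_d12=20.0, standard=100.0)
d12x = w * 100.0
print(round(Z75, 4), round(w, 4), round(d12x, 2))
```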

Weight ratio and Type B effect

As appropriate for each of the four series types (StCoU, StCoD, CoStU, CoStD), S and C, or C and S, were substituted in Equation 4 for ϕ1 and ϕ2, and the value of d12 was specified as either d12x (in D series) or −d12x (in U series). This resulted in Equations 14–17 (see the Appendix). From these equations we obtain, in terms of Weber fractions (WFs), where WF = DL/S and the WF for an individual series type is called a raw WF (rWF),

$$ {\mathrm{WF}}_{\mathrm{StCo}}=\left({\mathrm{rWF}}_{\mathrm{StCo}\mathrm{U}}+{\mathrm{rWF}}_{\mathrm{StCo}\mathrm{D}}\right)/2=w/{s}_2 $$
(5)
$$ {\mathrm{WF}}_{\mathrm{CoSt}}=\left({\mathrm{rWF}}_{\mathrm{CoSt}\mathrm{U}}+{\mathrm{rWF}}_{\mathrm{CoSt}\mathrm{D}}\right)/2=w/{s}_1 $$
(6)

Hence,

$$ {\mathrm{WF}}_{\mathrm{StCo}}/{\mathrm{WF}}_{\mathrm{CoSt}}={s}_1/{s}_2 $$
(7)

Estimation of model parameters from Weber fractions

For the mean WF across presentation orders, WFM, we have,

$$ {\mathrm{WF}}_{\mathrm{M}}=\frac{1}{2}\left({\mathrm{WF}}_{\mathrm{StCo}}+{\mathrm{WF}}_{\mathrm{CoSt}}\right)=\frac{1}{2}w\left({s}_1+{s}_2\right)/\left({s}_1{s}_2\right) $$
(8)

For s1 = s2 = s, WFM= w/s. From the data given in Table 7, in the Appendix, we obtained, with WFs estimated (by interpolation) at s1/s2 ≈ 1, rough estimates of w/s: 11.7% for Experiment 1 and 23.3% for Experiment 2.

The Type B effect is here defined as the Type B effect quotient (QTBE), the difference between the WFs in presentation orders StCo and CoSt as a fraction of WFM,

$$ \mathrm{QTBE}=\left({\mathrm{WF}}_{\mathrm{StCo}}-{\mathrm{WF}}_{\mathrm{CoSt}}\right)/{\mathrm{WF}}_{\mathrm{M}}=\left[w\ \left({s}_1-{s}_2\right)/\left({s}_1{s}_2\right)\right]/\left[\frac{1}{2}\ w\ \left({s}_1+{s}_2\right)/\left({s}_1{s}_2\right)\right]=2\left({s}_1-{s}_2\right)/\left({s}_1+{s}_2\right), $$
(9)

so that s1/s2 < 1 implies a negative, and s1/s2 > 1 a positive Type B effect.
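Equations 5 through 9 amount to a few lines of arithmetic on the four raw Weber fractions. The following Python sketch shows the computation; the function name and the input values are hypothetical and are not data from the experiments.

```python
def type_b_summary(rwf_stco_u, rwf_stco_d, rwf_cost_u, rwf_cost_d):
    """Compute WFs per presentation order, their ratio, their mean, and QTBE (Equations 5-9)."""
    wf_stco = (rwf_stco_u + rwf_stco_d) / 2.0
    wf_cost = (rwf_cost_u + rwf_cost_d) / 2.0
    wf_mean = (wf_stco + wf_cost) / 2.0
    return {
        "WF_StCo": wf_stco,
        "WF_CoSt": wf_cost,
        "weight_ratio": wf_stco / wf_cost,  # estimates s1/s2 (Equation 7)
        "WF_M": wf_mean,
        "QTBE": (wf_stco - wf_cost) / wf_mean,
    }


# Hypothetical raw Weber fractions (proportions of S).
summary = type_b_summary(0.13, 0.07, 0.16, 0.09)
print(summary)
```

Here WFStCo = 0.10 and WFCoSt = 0.125, so the estimated weight ratio s1/s2 is 0.8 and QTBE is negative, in agreement with the closed form 2 (s1 − s2) / (s1 + s2).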

Time-order errors (TOEs)

A positive (negative) TOE means that the first stimulus is overestimated (underestimated) relative to the second one. Thus, with a positive TOE, rDLU (in U series) becomes larger than the corresponding rDLD (in D series). One might attempt to estimate the TOE, for each presentation order (StCo or CoSt), as (rDLU − rDLD)/2. However, it may theoretically be expected that the psychometric function, while symmetric on a logarithmic scale, is somewhat asymmetric on the linear duration scale, its slope being steeper at low than at high stimulus magnitudes (Eisler et al., 2008). Such an asymmetry would increase the DL in blocks of StCoU and CoStD (Hi-Co blocks; see the Appendix) as compared with blocks of StCoD and CoStU (Lo-Co blocks), and so bias the QTOE estimates (positively with the StCo order and negatively with the CoSt order). Such an effect is balanced out by defining the QTOE as its mean across presentation orders StCo and CoSt. Therefore, only this measure will be discussed in the following.

Adapting the SW model, as described in the Appendix, to fit the S and rDL values in each of the four series types yields Equations 17–20 (in the Appendix), which in turn yield Equations 18–21 that predict the rWFs from the SW model parameters. From these equations, the TOE quotient (QTOE), TOE/S, can be predicted as follows:

$$ \mathrm{QTOE}=\frac{1}{2}\ \left[\left({\mathrm{rWF}}_{\mathrm{StCoU}}-{\mathrm{rWF}}_{\mathrm{StCoD}}\right)/2+\left({\mathrm{rWF}}_{\mathrm{CoStU}}-{\mathrm{rWF}}_{\mathrm{CoStD}}\right)/2\right]=\frac{1}{2}\ \left\{\left[b+\left({s}_2-{s}_1\right)\ Q\right]/{s}_2+\left[b+\left({s}_2-{s}_1\right)\ Q\right]/{s}_1\right\}=\frac{1}{2}\ \left[b\ \left({s}_1+{s}_2\right)+Q\ \left({s_2}^2-{s_1}^2\right)\right]/{s}_1{s}_2, $$
(10)

where Q is the ReL distance quotient, that is, the relative distance of the ReL from the St: Q = (ϕr − S) / S.
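Equation 10 can be checked numerically. The Appendix equations are not reproduced in this section, so the per-series expressions in the following Python sketch are an assumption about their form: they were rederived here from Equation 4 by setting d12 = ±w · S, modeling bias as b · S, and writing ϕr = (1 + Q) · S. They reproduce Equations 5 and 6 as well as the intermediate terms of Equation 10. Function names and parameter values are hypothetical.

```python
def predicted_rwfs(w, b, s1, s2, q):
    """Raw Weber fractions per series type, rederived from Equation 4 (assumed form)."""
    shift = b + (s2 - s1) * q  # common term produced by bias and the ReL distance
    return {
        "StCoU": (w + shift) / s2,
        "StCoD": (w - shift) / s2,
        "CoStU": (w + shift) / s1,
        "CoStD": (w - shift) / s1,
    }


def qtoe_from_rwfs(rwf):
    """First form of Equation 10: mean over orders of (rWF_U - rWF_D) / 2."""
    return 0.5 * ((rwf["StCoU"] - rwf["StCoD"]) / 2.0 + (rwf["CoStU"] - rwf["CoStD"]) / 2.0)


def qtoe_closed_form(b, s1, s2, q):
    """Last form of Equation 10."""
    return 0.5 * (b * (s1 + s2) + q * (s2 ** 2 - s1 ** 2)) / (s1 * s2)


# Hypothetical parameter values.
params = dict(w=0.10, b=0.03, s1=0.8, s2=1.0, q=-0.2)
rwf = predicted_rwfs(**params)
qtoe_a = qtoe_from_rwfs(rwf)
qtoe_b = qtoe_closed_form(params["b"], params["s1"], params["s2"], params["q"])
print(qtoe_a, qtoe_b)
```

In this example the positive bias (b = 0.03) is outweighed by the product of the weight difference and the negative Q, so the predicted QTOE is slightly negative.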

Origin of QTOE

Equation 10 implies that QTOE depends on the weight difference as well as on the judgment bias, b. When the ReL is at a distance from the St, a QTOE arises from multiplication of Q by (s2² − s1²). With Q < 0, QTOE will be negatively related to (s2² − s1²), and thereby positively related to s1/s2.

Furthermore, it follows from the SW model that QTOE is closely related to QTBE. From Equations 9 and 10 we get

$$ \mathrm{QTOE}=\frac{1}{2}\ \left[b\ \left({s}_1+{s}_2\right)+Q\ \left({s_2}^2-{s_1}^2\right)\right]/{s}_1{s}_2=\frac{1}{2}\ b\ \left({s}_1+{s}_2\right)/{s}_1{s}_2-\frac{1}{2}\ Q\ \left({s}_1-{s}_2\right)\ \left({s}_1+{s}_2\right)/{s}_1{s}_2=\frac{1}{2}\ b\ \left({s}_1+{s}_2\right)/{s}_1{s}_2-Q\cdotp \mathrm{QTBE}\ \left[\frac{1}{4}\ {\left({s}_1+{s}_2\right)}^2/{s}_1{s}_2\right] $$
(11)

For s1 = s2 = s, QTBE = 0, and QTOE = b / s. For a wide range of s1/s2 ratios, the factor ¼ (s1 + s2)² / (s1s2) is close to 1, so that for moderate b values the slope of QTOE versus QTBE is predicted to be close to −Q (with QTOE and Q expressed in percentages).
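How close to 1 the factor stays can be tabulated directly, because it depends only on the ratio r = s1/s2: ¼ (s1 + s2)² / (s1s2) = (1 + r)² / (4r). The following Python sketch evaluates it for a few ratios; the function name and the choice of ratios are illustrative assumptions.

```python
def coupling_factor(ratio):
    """(1/4) * (s1 + s2)**2 / (s1 * s2), written in terms of ratio = s1 / s2."""
    return (1.0 + ratio) ** 2 / (4.0 * ratio)


# Illustrative weight ratios from 0.5 to 2.
factors = {r: coupling_factor(r) for r in (0.5, 0.8, 1.0, 1.25, 2.0)}
for r, value in factors.items():
    print(r, round(value, 4))
```

The factor equals 1 at r = 1, is the same for r and 1/r, and rises only to 1.125 when one weight is twice the other.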

Results

All statistical analyses were conducted using IBM SPSS Statistics, Versions 25 and 26 for MacOS X.

Outlier exclusion

An initial screening for multivariate outliers (i.e., unusually deviating data patterns) was conducted, using the procedure described in Tabachnick and Fidell (2007, p. 74). Each participant’s squared Mahalanobis distance (based on the 16 rDLs) was tested against the χ² distribution with df = 16 (matching the number of variables). Because of the limited number of participants in each experiment, failing to exclude a multivariate outlier might incur misleading results. Therefore, a criterion of p < .025 was used, instead of p < .001 as recommended by Tabachnick and Fidell. The test resulted in exclusion of the data from four participants in Experiment 1 and five participants in Experiment 2. Their exclusion was further justified by their squared Mahalanobis distances deviating clearly from the straight line in “Q–Q” plots of their quantiles against those of the χ²(16) distribution (cf. Garrett, 1989). Consequently, the analyses were based on n = 61 in Experiment 1, and n = 50 in Experiment 2.
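The decision rule can be sketched in Python without external libraries, because the upper-tail probability of the χ² distribution has a closed form when df is even. The sketch covers only the test of a given squared Mahalanobis distance; computing the distances themselves (from the mean vector and covariance matrix of the 16 rDLs) is not shown. The function names and the three distance values are hypothetical, and the analyses reported here were run in SPSS, not with this code.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of the chi-square distribution for even df (closed form)."""
    if df % 2 != 0 or df <= 0:
        raise ValueError("the closed form used here requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)**i / i!
        total += term
    return math.exp(-half) * total


def is_multivariate_outlier(d_squared, df=16, alpha=0.025):
    """Flag a participant whose squared Mahalanobis distance has p < alpha."""
    return chi2_sf_even_df(d_squared, df) < alpha


# Hypothetical squared distances for three participants.
flags = [is_multivariate_outlier(d2) for d2 in (14.2, 27.0, 33.5)]
print(flags)
```

With df = 16 the p < .025 criterion corresponds to a squared distance of about 28.8, compared with about 39.3 for p < .001, which shows how much more inclusive the criterion adopted here is.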

Weber fractions

For each experiment, descriptive statistics of rWF are given in Table 7, in the Appendix, for each of the four series types, as well as mean WFs for each St duration and across St durations. Nonpositive rWF values were observed in 7.1% and 4.6% of the cases in Experiments 1 and 2, respectively. For each experiment and St duration, the mean (M) and standard error of the mean (SEM) of the WF for each presentation order are shown in Fig. 1, as well as the estimate of WFStCo/WFCoSt (indicating s1/s2).

Fig. 1. For Experiments 1 and 2, mean Weber fractions for presentation orders StCo and CoSt are plotted against standard (St) duration (logarithmic time scale). Error bars show the standard error of the mean (for clarity, drawn as one sided). Below the graph, the WF ratio WFStCo/WFCoSt (which estimates s1/s2) is given for each St duration.

For each experiment, the values of WFStCo and WFCoSt for each of the four St durations were submitted to a repeated-measures ANOVA, with St duration (100; 215; 464; 1,000 ms) and presentation order (StCo, CoSt) as within-participant factors. Here, as in all our ANOVAs, multivariate (Pillai) tests were used. The results are given in Table 1.

Table 1. ANOVA table for Weber fractions (WFs) from Experiments 1 and 2

TOE Quotient (QTOE)

Descriptive statistics of QTOE for each St duration are given in Table 7, in the Appendix. The means and their standard errors are shown in Fig. 2. For St durations that yielded values of s1/s2 near 1 (i.e., 215 and 464 ms) QTOE was positive, indicating b > 0—that is, a judgment bias in the direction of “first interval longer.” Using Equation 11, b/s was preliminarily and roughly estimated as the mean QTOE value for these durations, about +3.5% for both experiments.

Fig. 2

For Experiments 1 and 2, (a) TOE quotient (QTOE) is plotted against standard (St) duration (logarithmic time scale). Error bars show the standard error of the means (for clarity, drawn as one sided); (b) Type B effect quotient (QTBE; i.e., [WFStCo − WFCoSt] / WFM) is plotted against standard (St) duration (logarithmic time scale); (c) QTOE is plotted against QTBE

For each experiment, the eight QTOE values were submitted to a repeated-measures ANOVA with St duration (100; 215; 464; 1,000 ms) and presentation order (StCo, CoSt) as within-participant factors. The results are given in Table 2.

Table 2. ANOVA table for time-order error quotients (QTOEs) from Experiments 1 and 2

Interpretation of univariate results

The SW model (Equation 1) describes the perceptual stimulus-comparison mechanism as being based on a comparison between two weighted compounds, each comprising a stimulus magnitude and a ReL. Accordingly, the model predicts that the weighting is reflected in Weber fractions as well as in TOEs.

Weber fractions

Equation 9 predicts that QTBE changes with the weighting balance (specifically, [s1 − s2] / [s1 + s2]) across St durations. In accordance with this, the ANOVA of WFs for Experiment 2 showed a significant St Duration × Order interaction, p = .003, to which the linear effect of St duration made the greatest contribution. Thus, the Type B effect (the effect of presentation order on the WF) was not constant, but changed with the St duration. However, in post hoc t tests the only clearly significant evidence for a nonzero Type B effect occurred for the 1,000-ms St duration, where the effect was negative (implying s1/s2 < 1).

For Experiment 1, the St Duration × Order interaction failed to reach statistical significance, p = .076. Still, one may note that the linear contribution of St duration to this interaction was significant, p = .008.

TOE quotients

Equation 10 implies that QTOE should be directly related to Q (s2² − s1²) / (s1 s2). Figure 2 gives some support to this, as it shows QTOE to be generally positively related to QTBE, and thereby to s1 − s2. This suggests that in each block Q < 0 (i.e., the ReL falls below the St). From the slopes of the linear regressions (QTOE vs. QTBE, group data) depicted in Fig. 2c, Q was estimated as −26.0% for Experiment 1 (r = .91) and −14.6% for Experiment 2 (r = .92). The b values were estimated as equal to the regression intercepts, +3.7% (Experiment 1) and +3.3% (Experiment 2).

The negative Q values are as could be expected from the results of Hellström and Rammsayer (2015). They are also in harmony with results for weight comparison with a single standard (Hellström, 2000). A parallel is the finding in temporal bisection experiments, where participants classify intervals as long or short, that the bisection (neutral) point is located below the arithmetic mean of the interval durations (Brown, McCormack, Smith, & Stewart, 2005; Wiener, Thompson, & Coslett, 2014). Similar findings were addressed by Helson (e.g., 1964) by specifying the adaptation level as a weighted geometric mean of the stimulus magnitudes.

Model fitting by NLR

For additional guidance regarding model parameters, Equations 18–21, in the Appendix, were used to fit the SW model, using the SPSS routine nonlinear regression (NLR). For each experiment, all the individual rWF estimates were entered together. Q, w, and b were assumed to be constant across conditions, and s1 and s2 to be condition specific. Only the value of Q could be uniquely estimated; s1, s2, b, and w were estimated relative to each other. Using the formula WFM = w/s with the above WFM estimates of 11.7% (Experiment 1) and 23.5% (Experiment 2), the values of w were fixed at 5.85% for Experiment 1 and at 11.75% for Experiment 2 to yield plausible average values for s1 and s2 of about 0.5 (cf. Hellström, 2003). The NLR results are given in Table 3. The model used in this analysis is obviously simplified, and R2 (corrected) is modest: .133 (Experiment 1) and .152 (Experiment 2), so the results should only be taken as guidance. Nevertheless, they generally confirm the preliminary results.

Table 3. Results from model fitting by SPSS NLR

Multivariate approach: Principal component analyses of raw Weber fractions

Although the Type B effect clearly changed with St duration, unequivocal statistical evidence of its reversal (from negative to positive) for brief St durations was not obtained from our univariate analyses, as reported in Table 1. It also appears hazardous to build theoretical conclusions solely on measures built up by combinations of different forms of the rWF, each of which is highly variable across individuals.

However, this interindividual variability of the rWFs is a liability that can be turned into an asset: It carries information that is lost in univariate statistics. An attempt was therefore made to assess the parameters of the SW model by analyzing the multivariate variability of the rWFs.

Multivariate model

The multivariate model and its application to each of the series types is described in the Appendix. Equation 22, in the Appendix, corresponds to the basic model of principal component analysis (PCA), with components corresponding to w (Weber constant), b (judgment bias), and Q (relative distance of ReL from St). These components were therefore expected to emerge in a PCA of the rWFs for the 16 conditions (without rotation of extracted components). The eigenvalue of each component should then measure its contribution to the variability in rWFs. The calculated component scores for the ith participant should estimate this participant’s standardized values of wi, bi, and Qi, respectively. The three components’ loadings for the kth condition should estimate its values of ωk (discrimination difficulty), βk (bias expression), and δk (weight difference expression), respectively.

Analogy with ability testing

A useful analogy could be to think of each experiment as an ability-test battery, the ith participant’s characteristics (Weber constant, wi; judgment bias, bi; ReL distance quotient, Qi) being scores on three basic abilities, and the kth condition being one of 16 heterogeneous tests. Each test (i.e., condition) has loadings on w, b, as well as Q. As there is thus no “simple structure” that could be revealed by rotation, an unrotated PCA is appropriate. When the PCA is conducted on the “battery”—that is, the rWFs in the 16 conditions of the experiment—three components, corresponding to w, b, and Q, respectively, would then be expected to be extracted, in an order corresponding to their contribution to the total variance.

Principal component analyses (PCAs)

For each experiment, the 16 rWFs in the four series types (i.e., StCoU, StCoD, CoStU, and CoStD) for each of the four St durations (100; 215; 464; and 1,000 ms) were submitted to a PCA, using the FACTOR routine in SPSS. The Kaiser–Meyer–Olkin (KMO) measure of sampling adequacy (Kaiser, 1974; see Footnote 3) was .657 for Experiment 1 and .635 for Experiment 2. For each experiment, three components were extracted, with eigenvalues of 3.9 (explaining 24.4% of the variance), 3.1 (19.2%), and 1.6 (9.9%) for Experiment 1, and 4.4 (27.2%), 2.8 (17.7%) and 1.7 (10.4%) for Experiment 2.
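As a reading aid, the relation between the eigenvalues and the percentages of explained variance can be sketched as follows. The sketch assumes that the PCA was run on the correlation matrix of the 16 rWFs, so that the total variance equals 16; this is an assumption, not something stated in the text. Because the eigenvalues are entered as reported (rounded to one decimal), the computed percentages differ slightly from the reported ones.

```python
N_VARIABLES = 16  # the 16 rWF conditions entered into each PCA

# Eigenvalues as reported in the text (rounded to one decimal).
eigenvalues = {
    "Experiment 1": [3.9, 3.1, 1.6],
    "Experiment 2": [4.4, 2.8, 1.7],
}

def percent_variance(eigenvalue: float, n_variables: int = N_VARIABLES) -> float:
    """Share of total variance for one component of a correlation-matrix PCA."""
    return 100.0 * eigenvalue / n_variables

shares = {
    exp: [percent_variance(ev) for ev in evs] for exp, evs in eigenvalues.items()
}
cumulative = {exp: sum(vals) for exp, vals in shares.items()}

for exp in shares:
    print(exp, [round(v, 1) for v in shares[exp]], round(cumulative[exp], 1))
```

Under this assumption, the three extracted components together account for somewhat more than half of the total variance in each experiment.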

Results of the PCAs

The unrotated component loadings are given in Table 8 in the Appendix. Scores of the three extracted components (wi, bi, Qi) were also computed for each participant. For an interpretation of the loadings, note that in Equations 18–21, in the Appendix, w always occurs as a positively signed term, whereas the b term is positively signed for Up (U) series, and negatively signed for Down (D) series.

For Experiment 1, the first component had (after reversal of loading signs) positive loadings for U series and negative loadings for D series, and individual component scores correlated highly with QTOE (see Fig. 5). It could thereby be identified as b, the loading for condition k indicating this condition’s bias expression, βk. The second component, whose scores correlated highly with WFM and whose loadings (except one) were positive, could be identified as w, the loading for condition k indicating this condition’s discrimination difficulty, ωk.

For Experiment 2, the first component was identified as w (all loadings positive, highly correlated with WFM) and the second (after reversal of signs) as b (scores highly correlated with QTOE, loadings generally positive for U series and negative for D series). For each experiment, the third component was identified as Q (ReL distance quotient), its loading for condition k reflecting the weight difference, δk, in this condition, that is, the multiplier of Qi in determining the QTOE. The results are consistent with weight ratios s1/s2 > 1 for St durations of 100 and 215 ms, and s1/s2 < 1 for 464 and 1,000 ms (as was found from the analysis of WFStCo/WFCoSt ratios) in combination with Q < 0 (i.e., the ReL being situated below the St) for each St duration.

In Table 8, in the Appendix, mean values of ω, β, and δ for each St duration are given, as estimated from the mean component loadings using Equation 22, in the Appendix. For Experiment 1, β (bias expression) was positive for each St duration, which indicates, in accordance with the estimated positive b value for s1/s2 = 1, a judgment bias that favors judgments of “first interval longer” for all St durations. For Experiment 2, such a bias was obtained for all St durations except 1,000 ms, where the bias was close to zero.

Variance components in the comparison process

As predicted by Equations 18–21, in the Appendix, the measured rWF is affected by the SW mechanism as well as by two participant-specific factors, namely, Weber constant (w) and judgment bias (b). The present experimental design made it possible to estimate, using PCA, the contributions of each of these factors to the total variance of the rWFs. As assessed by eigenvalues from PCAs of the rWFs, w and b dominated in this respect, leaving about 10% for the ReL distance quotient Q, the latter factor generating systematic TOEs by multiplication with the weight difference (s2 − s1). This effect was limited by the blocked design, with the St duration fixed within each block, which minimized the possible asymmetry of Q as well as its interindividual variation. As is demonstrated in the next section, the role of Q in modulating the shift of QTOE with the St duration was still considerable, as was predicted from the SW model.

Relating PCA-estimated model parameters to univariate results: Comparison of univariate results from participants with low, medium, and high PCA component scores

For each of the three extracted components, the scores were partitioned at their low, medium, and high tertiles. Each of Figs. 3, 4, and 5 shows mean WF or QTOE for each partition of a component score, and is supplemented with ANOVA results.

Fig. 3

For Experiments 1 and 2, mean Weber fraction is plotted against standard (St) duration (logarithmic time scale) at low, medium, and high third score levels of w component. Included are ANOVA results for Weber fractions

Fig. 4

For Experiments 1 and 2, mean TOE quotient (QTOE) is plotted against standard (St) duration (logarithmic time scale) at low, medium, and high third score levels of b component. Included are ANOVA results for QTOEs

Fig. 5

For Experiments 1 and 2, mean TOE quotient (QTOE) is plotted against standard (St) duration (logarithmic time scale) at low, medium, and high third score levels of Q component. Correlation between Q component and QTOE is given for each St duration (Bonferroni corrected: ***p < .001, **p < .01, ns = not significant). Included are ANOVA results for QTOEs

Weber fractions (WFs)

Figure 3 shows, plotted against the St duration, the mean WF for participants with lowest, medium, and highest third levels of the w (Weber constant) component score. As expected, mean WFs increased with increasing w scores.

TOE quotients (QTOEs)

Figure 4 shows, in the same manner, the mean QTOE for participants with lowest, medium, and highest third levels of the b (judgment bias) component score. Mean QTOEs were directly related to b scores, except for Experiment 2 with St = 1,000 ms.

Finally, Fig. 5 shows the mean QTOE for participants with lowest, medium, and highest third levels of the Q (ReL distance quotient) component score. Correlations of the Q score with QTOE are also given for each St duration. According to the SW model, QTOE is proportional to the squared-weight difference (s2² − s1²), multiplied by Q. As is shown in Fig. 5, and verified by the ANOVA results, scores of the Q component indeed modulated the slope of QTOE against St duration, and thereby against weight difference. This slope did not become positive even with the highest Q scores.

This suggests that most individual Q values stayed on the negative side. In the univariate analyses we found evidence (clearly significant only for Experiment 2) that the difference s2 − s1 was positive for St = 1,000 ms. This is confirmed by the significantly positive correlations between QTOE and Q component score for this St duration. Conversely, the significantly negative correlations for, in particular, St = 100 ms in both experiments indicate negative values of (s2 − s1). So, the univariate indications were confirmed: The weighting balance did reverse into s1/s2 > 1 (equivalent to a positive Type B effect) for brief St durations; significantly so for St = 100 ms (Experiments 1 and 2) and for St = 215 ms (Experiment 1).

Response times

Response times in Experiments 1 and 2 are reported and discussed in the Appendix.

Discussion of Experiments 1 and 2

Weighting change and its interpretation

The present results are generally consistent with those of Hellström and Rammsayer (2015). In particular, in both studies, the ratio s1/s2 tended to decrease with increasing stimulus duration. This parallels the decrease of s1/s2 with increasing interstimulus interval that generally occurs in TOE experiments (e.g., Hellström, 1979, 2003). The interval between the onsets of the first and the second stimulus increases with the interstimulus interval as well as with stimulus duration, so it seems likely that both of these temporal factors contribute to the change of the weighting balance.

This change, to the disadvantage of the first stimulus, has been proposed to reflect the tuning of a mechanism that increases discrimination sensitivity by optimal weighting-in of ReL magnitude information (Hellström, 1989; Patching et al., 2012; cf. Preuschhof et al., 2010). In particular, the weighting change is thought to reflect a transition, with longer interstimulus intervals and/or stimulus durations, from stimulus interference to memory loss.

Taking advantage of the interindividual variability provided the extra statistical power needed to confirm the reversal of the weighting pattern (i.e., yielding s1 > s2) with brief St durations. Similarly, in Hellström and Rammsayer (2004), for duration comparison of filled auditory intervals across interstimulus intervals of 100–2,700 ms, s1/s2 > 1 was generally found for St durations of 50 ms, and s1/s2 < 1 for 1,000 ms.

Time-order errors (TOEs)

Figures 3, 4 and 5 suggest that our univariate and multivariate analyses of the rWFs captured the essential factors in the build-up of the TOEs: sensation weighting and judgment bias. Importantly, positive as well as negative TOEs were shown to occur even with a blocked design, that is, in the absence of trial-to-trial variation of the St duration.

Fig. 6

For Experiment 3 (empty visual intervals) mean Weber fraction, for presentation orders StCo and CoSt, is plotted against standard (St) duration (logarithmic time scale). Error bars show the standard error of the mean. Below the graph, the ratio WFStCo/WFCoSt (which estimates s1/s2) is given for each St duration.

Judgment bias (b) contributes considerably to the interindividual variation of the TOE, but only moderately to its mean value across individuals. The bias and its interindividual variation are most easily understood as being due to individual guessing habits in cases of uncertainty (García-Pérez & Alcalá-Quintana, 2017). In Experiment 2, the impact of judgment bias vanished for the St duration of 1,000 ms. This may be due to participants using different guessing strategies for uncertain cases with the longest St duration than with shorter durations.

According to the present results, judgment bias does not account for the existence of the TOE or its variation across St durations and presentation orders. Instead, sensation weighting appears to be a major factor behind the TOE. In Experiment 3, this interpretation was put to a direct test.

Experiment 3

Background

In Experiments 1 and 2, one single St duration was used in each experimental block. This resulted, according to our findings, in values of Q (ReL distance quotient; i.e., relative dislocation of ϕr from the St duration) that were consistently negative.

Manipulating the TOE

So far, only indirect evidence was obtained for the corollary of the SW model that Q, multiplied by the weight difference (s2 − s1), affects the subjective stimulus difference, and thereby determines the QTOE. So, in Experiment 3, using empty visual intervals as in Experiment 2, an attempt was made to manipulate Q, and thereby the QTOE.

Double-standard design

A variation of the blocked experimental design, intermixing two St durations in the same block, offers an opportunity for an experimental test of this prediction. Thus, the procedure was modified so that in each block two St durations, short (100 and 215 ms) or long (464 and 1,000 ms), alternated randomly.

Modeling for the double-standard design

For this type of design, it cannot be assumed that the two ReLs are equal (i.e., that ϕr1 = ϕr2). We therefore return to the basic version of the SW model, in the form of Equation 3. This results in equations for the rWF in the four series types. These equations (24–27) are given in the Appendix. From those equations we obtain

$$ \mathrm{QTOE}=\left[\left({\mathrm{rWF}}_{\mathrm{StCoU}}-{\mathrm{rWF}}_{\mathrm{StCoD}}\right)+\left({\mathrm{rWF}}_{\mathrm{CoStU}}-{\mathrm{rWF}}_{\mathrm{CoStD}}\right)\right]/4=\left[\left(1-{s}_1\right)\ {Q}_1-\left(1-{s}_2\right)\ {Q}_2+b\right]\ \left(1/{s}_1+1/{s}_2\right)/2 $$
(12)

It follows that if, under otherwise unchanged conditions, Q1 or Q2 is manipulated, this will shift QTOE, in a manner determined by the values of (1 − s1) or (1 − s2), respectively. In Experiment 3, such manipulation was attempted by including pairs with two different St durations in random order (100 and 215 ms, or 464 and 1,000 ms) in the same experimental block.

In the double-standard design, when awaiting the first interval in the pair, participants cannot prepare for a particular approximate interval duration, and adjust ϕr1 accordingly. Instead, they are expected to use a default value of ϕr1. Having perceived the first-presented interval, the participant will then adjust ϕr2 in the direction of this interval. It is here assumed that ϕr1 will be close to the geometric mean of the two St durations in the block (cf. Helson, 1964), and that, in logarithmic measure, ϕr2 will be adjusted from this in the direction of the first stimulus in the current pair by 20% of the distance (by analogy with results in Hellström, 1979, 2003). Expressed in terms of weighted geometric means, we have, on average, ϕr1 = StLower^0.5 · StHigher^0.5, ϕr2Lower = ϕr1^0.8 · StLower^0.2, and ϕr2Higher = ϕr1^0.8 · StHigher^0.2.
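The assumed reference levels can be illustrated with a short numerical sketch. It takes the nominal St durations (100, 215, 464, and 1,000 ms), treats the first stimulus as equal to the St, and expresses Q as the relative distance of the ReL from the St, that is, Q = (ϕr − St) / St. The function names and this explicit formula for Q are assumptions made for the illustration.

```python
import math

def default_rel(st_lower: float, st_higher: float) -> float:
    """Assumed default ReL (phi_r1): geometric mean of the two St durations."""
    return math.sqrt(st_lower * st_higher)

def adjusted_rel(rel1: float, first_stimulus: float, weight: float = 0.2) -> float:
    """Assumed phi_r2: moved from phi_r1 toward the first stimulus by 20% in log measure."""
    return rel1 ** (1.0 - weight) * first_stimulus ** weight

def q_percent(rel: float, st: float) -> float:
    """Relative distance of the ReL from the St, in percent (assumed definition of Q)."""
    return 100.0 * (rel - st) / st

q_values = {}
for st_lower, st_higher in [(100.0, 215.0), (464.0, 1000.0)]:
    rel1 = default_rel(st_lower, st_higher)
    for st in (st_lower, st_higher):
        rel2 = adjusted_rel(rel1, st)
        q_values[st] = (q_percent(rel1, st), q_percent(rel2, st))

for st, (q1, q2) in q_values.items():
    print(f"St = {st:g} ms: Q1 = {q1:+.1f}%, Q2 = {q2:+.1f}%")
```

With these assumptions, Q1 and Q2 are positive for the lower St in each block and negative for the higher one, and Q2 is always closer to zero than Q1, because ϕr2 has been pulled toward the current St.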

Equation 12 predicts that in comparison with results from Experiment 2, QTOE will shift by the amount

$$ \Delta \mathrm{QTOE}=\left[\left(1-{s}_1\right)\ \Delta {Q}_1-\left(1-{s}_2\right)\ \Delta {Q}_2\right]\ \left(1/{s}_1+1/{s}_2\right)/2, $$
(13)

where ΔQ1 = (Q1,Exp. 3 − Q1,Exp. 2), and ΔQ2 = (Q2,Exp. 3 − Q2,Exp. 2). From the above, it is predicted that |ΔQ2| < |ΔQ1|. This is because ϕr2, but not ϕr1, is partially adjusted in the direction of the current St duration.

Predicting shifts in QTOE

To get an idea of the likely shifts in QTOE between Experiments 2 and 3, rough estimates of Q1 and Q2 can be made from the above assumptions, using the NLR results (see Table 3). For Experiment 2, Q1 and Q2 are both estimated as −13.6% throughout. For Experiment 3, estimates of Q1 are +46.7% for St = 100 ms (blocked with 215 ms) and St = 464 ms (blocked with 1,000 ms), and −31.8% for St = 215 ms (blocked with 100 ms) and St = 1,000 ms (blocked with 464 ms); estimates of Q2 are +35.8% for St = 100 ms and St = 464 ms, and −26.4% for St = 215 ms and St = 1,000 ms. From this we get, for St = 100 ms and 464 ms, ΔQ1 = +60.3% and ΔQ2 = +49.4%; and for St = 215 ms and St = 1,000 ms, ΔQ1 = −18.2% and ΔQ2 = −12.8%. Also, using the NLR results (see Table 3), s1 is estimated (for Experiment 2 as well as Experiment 3) as 0.391, 0.526, 0.485, and 0.434 for St = 100; 215; 464; and 1,000 ms, respectively, and s2 as 0.339, 0.455, 0.536, and 0.729 for the same durations. Using Equation 13, we then roughly predict QTOE shifts of +11.0% (100 ms), −3.4% (215 ms), +15.9% (464 ms), and −12.6% (1,000 ms). Most importantly, these shifts in QTOE are predicted to form a zig-zag pattern when plotted against St duration. This is because as long as s1 < 1, s2 < 1, and |ΔQ2| < |ΔQ1|, the shift in QTOE will generally be positive in series with St intervals of 100 ms and 464 ms, which are blocked with longer St intervals (215 ms and 1,000 ms, respectively), and negative for series with St intervals of 215 ms and 1,000 ms, which are blocked with shorter St intervals (100 ms and 464 ms, respectively). (A possible exception could occur for [1 − s1] / [1 − s2] << 1, for instance, with s1 close to 1.)
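The arithmetic behind these rough predictions can be retraced with a minimal sketch that applies the ΔQTOE formula (Equation 13) to the ΔQ1, ΔQ2, s1, and s2 values quoted in this section. The function and variable names are chosen for the illustration only; because the inputs are rounded, the results agree with the quoted shifts only to within a few tenths of a percentage point.

```python
def delta_qtoe(s1: float, s2: float, dq1: float, dq2: float) -> float:
    """Predicted shift in QTOE (percent) from shifts in Q1 and Q2 (percent)."""
    return ((1.0 - s1) * dq1 - (1.0 - s2) * dq2) * (1.0 / s1 + 1.0 / s2) / 2.0

# St duration (ms) -> (s1, s2, delta Q1 %, delta Q2 %), as quoted in the text.
inputs = {
    100: (0.391, 0.339, +60.3, +49.4),
    215: (0.526, 0.455, -18.2, -12.8),
    464: (0.485, 0.536, +60.3, +49.4),
    1000: (0.434, 0.729, -18.2, -12.8),
}

predicted = {st: delta_qtoe(*params) for st, params in inputs.items()}

for st, shift in predicted.items():
    print(f"St = {st} ms: predicted QTOE shift = {shift:+.1f}%")
```

The signs alternate across the four St durations (positive, negative, positive, negative), which is the zig-zag pattern referred to in the text.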

With the standard deviations (SDs) of QTOE for Experiment 2 given in Table 7 in the Appendix, the predicted shifts with the four standard durations represent Cohen’s d values of 1.15, 0.35, 1.91, and 1.57, respectively. The predicted zig-zag effect (calculated as the mean, 10.75%, of the unsigned shift percentages) represents (as compared with the SD, 5.80, of the grand mean QTOE in Experiment 2) a Cohen’s d of 1.85, and with the current sample sizes even an effect half as large should be detected with a probability > 0.99 at α = 0.05.

Predictions of increased Weber fractions

It was further predicted that, due to the intermixing of St durations in a block, Q1 and Q2 would be less stable across trials in Experiment 3 than in Experiment 2, where the standard was fixed within each block. This would make perception of the duration difference (d12) in the pair more variable from trial to trial. As a result, WFs would be larger for corresponding conditions in Experiment 3 than in Experiment 2 (cf. Hellström, 2000). The extent of this effect is hard to predict, but a moderate shift, with Cohen’s d = 0.5, of the mean WF (across St durations and presentation orders) would be detectable (at α = 0.05) with a power of 0.76.
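The stated power can be approximated with a simple calculation. The sketch below uses a normal approximation to a two-sided, two-sample test and assumes group sizes of n = 50 (Experiment 2) and n = 65 (Experiment 3), that is, the sample sizes after outlier exclusion; both the approximation and the choice of these sample sizes are assumptions made for the illustration, not a description of how the reported value was obtained.

```python
import math
from statistics import NormalDist

def approx_power_two_sample(d: float, n1: int, n2: int, alpha: float = 0.05) -> float:
    """Normal-approximation power of a two-sided, two-sample test for effect size d."""
    z = NormalDist()
    # Expected value of the test statistic under the alternative hypothesis.
    delta = d * math.sqrt(n1 * n2 / (n1 + n2))
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    return z.cdf(delta - z_crit) + z.cdf(-delta - z_crit)

power = approx_power_two_sample(0.5, 50, 65)
print(round(power, 2))
```

Under these assumptions the approximation gives a power of about 0.76 for d = 0.5.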

Method

Participants

Participants were undergraduate psychology students at the University of Bern, 67 females and six males, ranging in age from 18 to 32 years (21.7 ± 2.6 years). The participants received course credit. All of them were naïve about the purpose of the study and reported normal hearing and normal or corrected-to-normal vision. None of them had participated in Experiment 1 or Experiment 2. All participants gave their written informed consent (see Footnote 1).

Procedure

Apparatus and stimuli were the same as in Experiment 2. The experimental session comprised a total of eight blocks, with a 1-min break between blocks. In four of the blocks, Co was initially longer than St (Hi-Co blocks) while in the other four blocks Co was initially shorter than St (Lo-Co blocks). Furthermore, the St durations in four of the blocks were short (100 and 215 ms) and in the other four blocks, they were long (464 and 1,000 ms). Each block consisted of two randomly interleaved series of 32 trials each. In one of these series, the stimuli were always presented in the order StCo, and in the other series, in the order CoSt. As in Experiments 1 and 2, series types were StCoU, StCoD, CoStU, and CoStD. If the St duration in the StCo series of a block was 100 (464) ms, the St duration in the CoSt series of the same block was 215 (1,000) ms, and vice versa. Block order was balanced across participants.

Results

Following Experiments 1 and 2, a Mahalanobis distance criterion of p = .025 was applied for outlier detection, which resulted in the exclusion of eight participants, so that analyses are based on n = 65.

Descriptives

In Table 9, in the Appendix, descriptive statistics for rWFStCoU, rWFStCoD, rWFCoStU, rWFCoStD, WFM, and QTOE are given for each St duration in Experiment 3, as well as for mean WFM across St durations. Figure 6 shows the mean (M) and standard error of the mean (SEM) of the WF for each presentation order, as well as the ratio of the estimates of WFStCo and WFCoSt (indicating s1/s2).

Figure 7 displays mean WFs (left panel) and QTOEs (right panel) for Experiments 2 and 3 together, plotted against St duration in a logarithmic time scale. As can be seen, WFs show a similar dependence on St duration in Experiment 3 as in Experiment 2, albeit at a higher level. For QTOEs, the results for Experiment 3 depict, when superimposed on the sloping curve from Experiment 2, a zig-zag pattern with maxima for St = 100 ms and St = 464 ms, and minima for St = 215 ms and St = 1,000 ms. The change in QTOE from Experiment 2 to Experiment 3 was, for St = 100 ms, +5.17% (SEM = 2.19), for St = 215 ms, −4.80% (SEM = 1.91), for St = 464 ms, +6.03% (SEM = 1.89), and for St = 1,000 ms, −8.70% (SEM = 1.94). The mean change in QTOE in the predicted directions was 6.18% (SEM = 0.71).

Fig. 7

For Experiments 2 and 3 (empty visual intervals), mean Weber fraction across stimulus orders (left) and TOE quotient (QTOE; right) is plotted against standard (St) duration (logarithmic time scale). Error bars indicate the standard error

ANOVA results

Experiment 3

The WFs and the QTOEs from Experiment 3 were submitted to repeated-measures ANOVAs, with presentation order (StCo, CoSt) and St duration as within-participant factors. The results are given in Table 4.

Table 4. ANOVA table for analysis of Weber fractions and QTOEs from Experiment 3 (empty visual intervals)

Weber fractions (WFs)

For WFs, the pattern (see Fig. 6) was similar to that obtained in Experiment 2. Again, the Duration × Order interaction was significant, showing a Type B effect that shifted with St duration. Paired t tests (with Bonferroni corrections) of WFs were conducted for orders StCo versus CoSt. For St = 100 ms, another piece of evidence for a positive Type B effect was obtained: WFStCo − WFCoSt > 0, p < .001.

TOE quotients (QTOEs)

For QTOEs, not only was the linear trend of the main effect of duration statistically significant (p < .001), as in Experiment 2, but so were the quadratic and cubic trends, confirming the predicted zig-zag pattern. The shifts are smaller than our rough predictions above (which are highly dependent on the estimates of s1, s2, Q1, and Q2), but what is important is that their zig-zag pattern was correctly predicted. It may well be the case that ReLs are more resilient to manipulation within an experiment (e.g., due to effects of residual stimulation) than we expected.

Experiments 2 and 3 together

Each measure (WF, QTOE) was submitted to a repeated-measures ANOVA, with presentation order (StCo, CoSt) and St duration (100; 215; 464; 1,000 ms) as within-participant factors, and experiment (2, 3) as a between-participants factor. The results are shown in Table 5.

Table 5. ANOVA tables for analyses of Weber fractions and QTOEs from Experiments 2 and 3 combined

Weber fractions (WFs)

As predicted, WFs were significantly higher in Experiment 3 than in Experiment 2. The M (SD, SEM) of the mean WF was, for Experiment 2, 25.33% (8.19, 1.16) and for Experiment 3, 29.69% (7.98, 0.99), yielding an actual Cohen’s d value of 0.54.
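The reported effect size and standard errors can be retraced from the summary statistics. The sketch below assumes that Cohen's d was computed with the pooled standard deviation and with n = 50 (Experiment 2) and n = 65 (Experiment 3); the function names are hypothetical.

```python
import math

def pooled_sd(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def cohens_d(m1: float, sd1: float, n1: int, m2: float, sd2: float, n2: int) -> float:
    """Standardized mean difference (group 2 minus group 1)."""
    return (m2 - m1) / pooled_sd(sd1, n1, sd2, n2)

# Mean WF (%), SD, and n as reported for Experiments 2 and 3.
m_exp2, sd_exp2, n_exp2 = 25.33, 8.19, 50
m_exp3, sd_exp3, n_exp3 = 29.69, 7.98, 65

d = cohens_d(m_exp2, sd_exp2, n_exp2, m_exp3, sd_exp3, n_exp3)
sem_exp2 = sd_exp2 / math.sqrt(n_exp2)
sem_exp3 = sd_exp3 / math.sqrt(n_exp3)
print(round(d, 2), round(sem_exp2, 2), round(sem_exp3, 2))
```

With these inputs, d comes out at about 0.54 and the standard errors at about 1.16 and 0.99, in line with the values given above.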

TOE quotients (QTOEs)

For the QTOEs, the main effects of duration and order were significant, as was the Duration × Order interaction. Most importantly, the Duration × Experiment interaction was significant. The effect size, ηp2 = .397, could serve as an index of the degree of impact of the weighting mechanism on the QTOE in the combined Experiments 2 and 3; p values were < .001 for the linear and cubic contributions of duration to the interaction, highlighting the contrast of the zig-zag pattern of Experiment 3 with the regular negative slope for Experiment 2 (see Fig. 7, right).

The model used in the analysis of the results from Experiment 3 is not compatible with the simplified model (assuming one single ReL for each St duration), which was used in the multivariate and NLR analyses of data from Experiments 1 and 2. Therefore, no such analyses were conducted on the data from Experiment 3.

Response times

Response times in Experiment 3 are reported and discussed in the Appendix.

Discussion of Experiment 3

The results of Experiment 3, which are shown in Table 5 and in Fig. 6, confirm the theoretical predictions from the SW model of how QTOEs change as a function of the design-generated level of Q1. They demonstrate the predictive power of the SW model, and also strengthen the concept of the ReL as the result of pooling of stimulus magnitude information (cf. Helson, 1964). This ReL constitutes a realistic expectation for the duration of the upcoming stimulus interval, which is weighted-in to enhance the efficiency of the comparison process (Patching et al., 2012).

General discussion

Type B effects: Not always negative

Ellinghaus et al. (2018) state that “Type B effects reported in the literature . . . are almost exclusively negative . . . . Positive Type B effects have rarely been reported in the case of very short-duration stimuli, especially when presented with very short interstimulus intervals” (p. 8). This may be true for the stimulus conditions usually employed, but this fact seems to be due to researchers’ strange reluctance to use interstimulus intervals other than about 1,000 ms, or stimuli briefer than 500 ms. With shorter interstimulus intervals and/or briefer stimuli, cases of (in terms of the SW model) s1/s2 > 1, with large TOEs and positive Type B effects or equivalent results, have been found (Hellström, 1979, 2003; Hellström & Rammsayer, 2004). In our view, to fully explore the effects of stimulus presentation conditions, psychophysical research should not avoid brief stimuli or fast stimulus presentation.

The results of Ellinghaus et al. (2018), which were obtained by using only an interstimulus interval of 1,000 ms and an St duration of 500 ms, across 10 different stimulus types, highlight the similarity between the comparison of durations and of other stimuli. Bausenhart et al. (2015) used auditory durations, with St durations of 100 ms and 1,000 ms, and found consistently negative Type B effects when the interstimulus interval was 1,000 ms. In contrast, when it was 300 ms, there was an interaction of presentation order (StCo, CoSt) and St duration, the Type B effect being negative for St = 1,000 ms, but slightly and nonsignificantly positive for St = 100 ms. Bausenhart et al. (2015) acknowledge that “we cannot refute the findings of a positive Type B effect under specific conditions. . . . A more general framework [than the IR model], such as Sensation Weighting . . . would be needed to account for any reversal of the Type B effect” (p. 1038).

The Type B effect can be seen primarily as an indicator of the sensation-weighting balance, but a rather insensitive one, as it is based on the comparison of measures of discrimination, such as DLs. In Experiments 1 and 2, this balance, as evidenced also by the QTOE, was again found to depend heavily on the stimulus conditions. The present results affirm once more (cf. Hellström, 1979, 1985, 2003; Patching et al., 2012) that it is unwarranted to conclude that s1/s2 < 1 is a general rule in the comparison of successive stimuli.
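The link between the weighting balance and the direction of the Type B effect can be sketched numerically. The following is a minimal illustration, not the fitting procedure used in the experiments. It assumes the commonly stated form of the SW model, d = k{[s1·ψ1 + (1 − s1)·ψr1] − [s2·ψ2 + (1 − s2)·ψr2]} + b, and it assumes equal judgment noise in both presentation orders, so that discrimination of Co depends only on the weight of the interval position in which Co appears. All function and parameter names are hypothetical.

```python
def sw_difference(psi1, psi2, s1, s2, ref1=0.0, ref2=0.0, k=1.0, b=0.0):
    """Subjective difference d under the assumed SW form.

    psi1, psi2: sensation magnitudes of the first and second stimulus.
    s1, s2:     weights of the first and second stimulus.
    ref1, ref2: reference-level (ReL) magnitudes, weighted by (1 - s).
    k, b:       scale factor and bias term (illustrative defaults).
    """
    first = s1 * psi1 + (1.0 - s1) * ref1
    second = s2 * psi2 + (1.0 - s2) * ref2
    return k * (first - second) + b


def co_sensitivity(order, s1, s2, k=1.0):
    """Absolute change in d per unit change in the comparison stimulus (Co)."""
    if order == "CoSt":   # Co is presented first, so its weight is s1
        return abs(k * s1)
    if order == "StCo":   # Co is presented second, so its weight is s2
        return abs(k * s2)
    raise ValueError("order must be 'StCo' or 'CoSt'")


def type_b_sign(s1, s2):
    """Direction of the Type B effect implied by the weighting balance.

    Assumes equal noise in both orders, so the DL is inversely related
    to co_sensitivity. 'negative' means better discrimination in StCo.
    """
    if s1 < s2:
        return "negative"
    if s1 > s2:
        return "positive"
    return "none"
```

Under these assumptions, s1/s2 < 1 implies a negative Type B effect and s1/s2 > 1 a positive one, which is why the Type B effect can serve as an indicator, albeit an insensitive one, of the weighting balance.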

Conclusion

Our results demonstrate that, when assessing stimulus discrimination, it is necessary to consider methodological factors such as the presentation order of St and Co, which the time-honored simple difference model does not recognize. Even in a design with a single standard duration per stimulus block, TOEs depend systematically on stimulus conditions (here, St duration) in combination with participant-specific factors such as judgment bias and ReL location. This means that a model for comparison of interval durations, and of stimulus magnitudes in general, must be able to account for both the Type B effect and the TOE, as well as for each of these going in either direction. Because it has these capabilities, the SW model has proved useful in previous studies using various study designs and stimulus modalities (e.g., Englund & Hellström, 2012, 2013; Hellström, 1979, 1985, 2000, 2003; Hellström, Aaltonen, Raimo, & Vilkman, 1994; Hellström & Cederström, 2014; Patching et al., 2012). The SW model also predicts the close relation between the TOE and the Type B effect. Although, by necessity, it gives a simplified account of what actually happened in the present experiments, the SW model has once more helped us to understand the contributions and the interplay of the perceptual-cognitive factors behind the discrimination and comparison of stimulus magnitudes.

Our multivariate results from Experiments 1 and 2, as well as the univariate results of Experiment 3, provide clear evidence for a reversal of the weighting balance, yielding s1/s2 > 1 and thereby positive Type B effects, for brief St durations (cf. Hellström, 1979, 2003; Hellström & Rammsayer, 2004, 2015). This casts doubt on theoretical models, like the MH and IR models, that do not allow for such cases. It is also a serious challenge for those models (e.g., Preuschhof et al., 2010; Raviv et al., 2012) that rest on the notion of Bayesian inference of the true magnitude of the first stimulus from its internal representation, which inevitably yields s1/s2 < 1. The limitation of these models seems to be their disregard of the possibility that, for optimality in the comparison of the two stimuli, the true magnitude of the second one also has to be inferred. Like the MH and IR models, they consider the representation only of the first stimulus as being subject to modification or supplementation, while the second stimulus enters the comparison in a direct way. Instead, as pointed out by Hellström (1979), both of the stimuli should be seen as being in memory at the time of comparison; an analogy with perceptual aftereffects, affecting the perception of the second of two successive stimuli, may also be made (cf. Hellström, 1985). In summary, we argue that a more flexible model of stimulus comparison has to be adopted, which allows stimulus weighting to be optimized for this task (Hellström, 1989; Patching et al., 2012). The SW model allows for such weighting, and also suggests an underlying mechanism: the weighting-in of supplementary magnitude information by way of reference levels.
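The point about Bayesian inference can be illustrated with a standard Gaussian case. This is only a sketch under assumptions that are not taken from the cited models: a Gaussian prior with mean \mu and variance \sigma_p^2, Gaussian internal noise with variances \sigma_1^2 and \sigma_2^2 for the two stimuli, and internal representations x_1 and x_2. The symbols w_1 and w_2 are introduced here for illustration only.

```latex
\begin{align*}
% Inference applied to the first stimulus only:
\hat{\psi}_1 &= w_1 x_1 + (1 - w_1)\,\mu,
  \qquad w_1 = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_1^2} < 1, \\
d &= \hat{\psi}_1 - x_2
  \quad\Longrightarrow\quad \frac{s_1}{s_2} = \frac{w_1}{1} < 1 . \\[6pt]
% Inference applied to both stimuli:
\hat{\psi}_2 &= w_2 x_2 + (1 - w_2)\,\mu,
  \qquad w_2 = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_2^2}, \\
d &= \hat{\psi}_1 - \hat{\psi}_2
  \quad\Longrightarrow\quad \frac{s_1}{s_2} = \frac{w_1}{w_2}
  = \frac{\sigma_p^2 + \sigma_2^2}{\sigma_p^2 + \sigma_1^2} .
\end{align*}
```

In this simplified setting, the ratio s1/s2 is forced below 1 only when the second stimulus enters the comparison unmodified. Once both magnitudes are inferred, the ratio exceeds 1 whenever \sigma_1^2 < \sigma_2^2, so a weighting balance in either direction becomes possible.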