What makes a prototype a prototype? Averaging visual features in a sequence

Tong, Ke; Dubé, Chad; Sekuler, Robert

doi:10.3758/s13414-019-01697-5

Published: 01 May 2019

Volume 81, pages 1962–1978 (2019)

Attention, Perception, & Psychophysics

Abstract

After viewing a series of sequentially presented visual stimuli, subjects can readily generate mean representations of various visual features. Although these average representations seem to be formed effortlessly and obligatorily, it is unclear how such averages are actually computed. According to conventional prototype models, the computation entails an equally weighted average taken over all the stimuli. To test this hypothesis, we had subjects estimate the running averages of some feature in a series of sequentially presented stimuli. Partway through the series, we perturbed the distribution from which stimuli were drawn, which allowed us to test alternative models of the computations behind subjects’ estimates. In both explanatory and predictive tests, a model in which the most recent items had disproportionately high weight outperformed a model in which all items carried equal weight. Such recency-weighted behavior was shown consistently in multiple experiments in which subjects estimated running averages of the length of vertical lines. However, the degree to which recent items were prioritized varied with the type of stimulus: when estimating the running averages of a series of numerals, subjects showed less recency prioritization. We conclude that previous evaluations of prototype models have made unrealistic assumptions about the nature of a prototype, and that a reassessment of prototype models of visual memory and perceptual categorization may be in order.

Introduction

A fundamental question in cognitive neuroscience is how the brain deals so effectively with the overwhelming amount of sensory input it receives, encoding fine details of selectively attended stimuli, while also retaining a stable representation of the larger environmental context. Various theoretical accounts postulate that some balance is struck between the quality and quantity of information that is extracted and maintained, but the details of that balance between quality and quantity are unresolved.

On one hand, many influential accounts of perception, memory, and categorization assign a central, or even exclusive, role to the outcome of matches between a probe stimulus’ features and features of individual items in memory (Estes, 1994; Hintzman & Ludlam, 1980; Nosofsky et al., 2011; Shiffrin & Steyvers, 1997). On the other hand, some accounts of both perceptual and short-term memory tasks assume that summary statistical representations (also called ensemble representations), extracted from sets of stimuli, are key to subjects’ performance, even when memory for individual items is reduced to chance (Ariely, 2001; Corbett & Oriet, 2011). Moreover, memory for averages appears to influence memory retrieval even in the absence of instruction or encouragement to report or compute an average (Dubé & Sekuler, 2015). Such evidence suggests that, at least in some cases, summary representations may be obligatorily encoded, stored, retrieved, and deployed.

These findings raise important questions, such as (i) whether summary representations rely on mechanisms that are distinct from those that encode and store individual item representations; (ii) how ensemble and item representations impact memory retrieval; (iii) how statistical moments are represented at the neural level; and (iv) whether sequential and spatial averaging rely on the same or different mechanisms. To answer these questions, computational models seem to be indispensable.

Any computational model of summary statistical representation must specify (i) the number of items that enter into the computation and (ii) the form of the computation. On the first question, Whitney and Leib (2018) pooled results from 21 related studies and argued that subjects effectively integrate approximately the square root of the number of displayed objects. Because our study focuses on tasks in which subjects average sequentially presented stimuli (henceforth, “sequential averaging”), we evaluated this square root relationship only for the five sequential averaging studies. Figure 1 shows that the results of those five studies deviate substantially from a square root relationship. We are therefore cautious about taking the square root relation as a given when modeling sequential averaging.

Answers to the second question, how the stimuli are combined in the average, are also inconsistent across studies. In studies focused on perceptual classification, for instance, the question is often not even posed. Instead, it is assumed that, if an average or “prototype” were computed, it would be an equally weighted average over all prior stimuli (Nosofsky, 1987; Smith & Minda, 2000). However, results from some studies of ensemble perception are consistent with non-equally weighted averaging schemes (Juni et al., 2012; Hubert-Wallander & Boynton, 2015).

In the following, we first discuss the critical findings regarding sequential averaging in visual short-term memory (VSTM). We then present experiments designed to identify the weighting scheme that subjects apply when producing estimates of averages over trials. We conclude that the averages extracted from a sequence of stimuli reflect a recency-prioritized weighted average. We discuss the implications of our findings for existing models of memory and perceptual categorization, and underscore the need for a reassessment of prototype models in these domains.

Weighting scheme in sequential averaging

What ensemble features are encoded in sequential averaging tasks? Subjects can form representations of the mean over time both when explicitly instructed to do so (Albrecht & Scholl, 2010; Hubert-Wallander & Boynton, 2015) and without explicit instruction (Dubé et al., 2014; Oriet & Hozempa, 2016). Beyond the mean, variance priming (Michael et al., 2014) and perceptual adaptation to variance (Norman et al., 2015) suggest implicit encoding of variance information. A recent study by Chetverikov et al. (2016) also established that subjects implicitly encode the entire feature distribution of distractors in visual search tasks over time.

The current study focuses on the ensemble perception of sequentially presented stimuli, especially the mechanism of item integration in sequential averaging. While conventional prototype models typically assume an averaging process in which all items are weighted equally, an alternative hypothesis assumes that subjects’ estimates of averages give more weight to the most recent items. A recency-prioritized weighting scheme has been demonstrated for the sequential averaging of many features, e.g., size, emotion, and motion direction, although not for the averaging of spatial locations (Hubert-Wallander & Boynton, 2015).

The recency hypothesis for averaging assumes that although the average may be stored in long-term memory, the items factoring into its computation on a given trial may be in various stages of serial position-dependent decay (Wilken & Ma, 2004; Huang & Sekuler, 2010). Since recent items have stronger memory representations at the time of an average’s computation, those items will be given more weight in that computation. In fact, such differential weighting would be consistent with a mathematically optimal strategy for item integration, in which subjects assign more weight to items at serial positions with less noise (Juni et al., 2012).

In what follows, we report four experiments meant to identify the weighting scheme that subjects use when they report running averages over sequentially presented stimuli. In doing so, we include models based on both serial positions and temporal positions, and with multiple weighting schemes. Our results support the operation of an averaging computation that is recency-weighted. This suggests that prototypes are not simple averages, and that a reconsideration of prototype models is in order.

Experiment 1

Mean-shift design

Most previous studies of sequential averaging drew stimuli randomly from a single distribution, and at the end of a series of stimuli, subjects estimated the mean of what they had seen (e.g., Hubert-Wallander & Boynton, 2015). The current study introduces two design changes. First, after every new stimulus, subjects report the mean value of all the stimuli they have seen up to that point (Weiss & Anderson, 1969). This greatly increases the amount and granularity of the resulting data, supporting more efficient and reliable modeling of the weighting schemes. Second, the mean value of the distribution from which stimuli were drawn shifts midway through the sequence (Parducci, 1956). By letting us examine subjects’ estimates in the aftermath of the shift, this mean-shift design makes alternative weighting schemes easier to discriminate in the empirical data.

In short, the mean-shift design allows fine-grained tracking of subjects’ estimates of the running means and better differentiates between item integration mechanisms. We illustrated the advantages of these design changes with a simulation of two quite different weighting schemes. Specifically, we simulated ideal observers’ estimates of the running means under a model, Equal, in which subjects give equal weight to all items, and a model, Recency, in which subjects use all previous items but prioritize more recent ones.

The simulated estimates and stimuli are plotted in Fig. 2. When all the stimuli were drawn from the same Gaussian distribution (left panel), the simulated responses from the two weighting schemes (solid black for Recency and dashed black for Equal) largely overlapped, making it difficult to tell which weighting scheme was in use. However, when there was a mean shift in the stimulus values (right panel), the two weighting schemes were clearly differentiated after the shift. So, compared with an experiment that uses only a single stimulus distribution, suddenly shifting the mean of the stimulus distribution can reveal the weighting scheme that subjects are using. For this reason, we incorporated the mean-shift design in Experiment 1 and the other experiments.
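To make the logic of Fig. 2 concrete, here is a minimal sketch of two ideal observers. It assumes noiseless stimuli fixed at each distribution's mean and exponential recency weights (weight r raised to each item's lag, with r = 0.86). These choices and the function names are ours; this is not the simulation behind Fig. 2. Before the shift, both observers report 15. After it, the Recency observer moves toward 25 much faster than the Equal observer.

```python
# Minimal sketch (assumptions: noiseless stimuli fixed at each distribution's
# mean; Recency rate r = 0.86). Function names are hypothetical.

def equal_estimate(stimuli):
    """Equal model: unweighted running mean of all stimuli seen so far."""
    return sum(stimuli) / len(stimuli)


def recency_estimate(stimuli, r):
    """Recency model: weight r**lag, where lag = 0 for the newest stimulus."""
    n = len(stimuli)
    weights = [r ** (n - 1 - k) for k in range(n)]
    return sum(w * s for w, s in zip(weights, stimuli)) / sum(weights)


# 60 pre-shift trials at 15 JND, then 60 post-shift trials at 25 JND.
sequence = [15.0] * 60 + [25.0] * 60

for trial in (30, 60, 61, 65, 70, 90, 120):
    seen = sequence[:trial]
    print(trial, round(equal_estimate(seen), 2), round(recency_estimate(seen, 0.86), 2))
```

By trial 70, for instance, the Equal observer reports about 16.4, while the Recency observer is already near 22.8.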

In addition to these benefits, we also asked whether the mean-shift design could alter subjects’ item integration mechanism, specifically by promoting a higher degree of recency weighting after the shift. On post-shift trials, stimuli from the pre-shift distribution may be down-weighted or discounted because they differ markedly from the current stimulus distribution; outliers have been shown to be excluded from mean estimates (Haberman & Whitney, 2010).

To address this potential confound, we asked each subject to complete three sequences with different mean-shift configurations, “Small Shift”, “Large Shift”, and “Large Variance” (see Methods for details). If the mean shift does promote recency in item integration, we would expect more recency in the “Large Shift” condition, where the shift is most distinctive, and less recency in the “Large Variance” condition, where the elevated variance makes the mean shift less obvious to subjects.

Methods

Subjects

Fifteen University of South Florida undergraduates participated in the experiment for course credit (ten female, mean age = 19.53 years, SD = 1.36 years). All had normal or corrected-to-normal vision. All procedures were approved by the IRB of University of South Florida.

Procedure

Subjects were presented with a sequence of gray vertical lines, one at a time. Each line was displayed in the center of the screen for one second. After each gray line, subjects were asked to estimate the average length of all the gray lines they had seen to that point in the sequence by using up and down arrow keys on a computer keyboard to adjust the length of a white probe line to represent that estimate. This adjustable, white probe was presented on the screen immediately after the disappearance of each gray line. Stimuli and probes were presented in different colors to reduce the potentially confounding influence of the probe length on estimates of prior stimuli. When subjects completed an adjustment, they pressed the keyboard’s Enter key to proceed. The next stimulus appeared on the screen right after the subject submitted his/her estimate. The same presentation procedure was used in Experiments 1 and 2, as detailed in Fig. 3.

Prior to the experiment, the QUEST (Watson & Pelli, 1983) routine was used to measure each subject’s Weber fraction for line length. On each trial, QUEST controlled the successive presentation of two vertical lines (500 ms each, 1000 ms ISI) at the center of the screen. After each pair of stimuli, subjects judged which had been longer. Feedback was given after each response (“correct” or “wrong”). Subjects’ individual Weber fractions were obtained from a QUEST run of 40 trials and were used to scale all stimuli in just noticeable difference (JND) units for the experiment.

The stimuli were scaled with individual Weber fractions and a base length of 100 pixels, so the actual stimulus size in pixels was 100 × (1 + w_b)^x, where w_b is the Weber fraction and x is the JND value (specific JND values are detailed in the next section). The prior for the Weber fraction used in QUEST had mean = .03 (Teghtsoonian, 1971) and SD = .04; this large SD made the prior vague. For the 20 subjects (15 from Experiment 1 and five from Experiment 2), the mean Weber fraction was .064 (SD = .016).
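The scaling rule fits in a one-line function. This is a minimal sketch: the name `stimulus_length_px` is hypothetical, and the example uses the group-mean Weber fraction (.064) only for illustration, since each subject's own QUEST estimate was used in practice. With that value, 15 and 25 JND correspond to roughly 2.5 and 4.7 times the 100-pixel base length.

```python
def stimulus_length_px(jnd_value, weber_fraction, base_px=100.0):
    """Line length in pixels: base_px * (1 + weber_fraction) ** jnd_value.

    Hypothetical helper implementing the scaling rule described in the text.
    """
    return base_px * (1.0 + weber_fraction) ** jnd_value


# Illustration only: the group-mean Weber fraction stands in for a subject's own.
for jnd in (15, 25, 30):
    print(jnd, round(stimulus_length_px(jnd, 0.064), 1))
```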

Subjects were provided with detailed instructions and practice trials to ensure their understanding of the task. The instructions can be found in the Supplementary Materials.

Design and stimuli

The stimulus values presented in this paragraph are all in JND units. In the “Small Shift” condition, the pre- and post-shift means are 15 and 25, with an SD of 5. The “Large Shift” condition used a larger mean shift (pre/post-shift means = 15/30, SD = 5). The “Large Variance” condition used a larger variance across the sequence (pre/post-shift means = 15/25, SD = 8). The order of the three mean-shift conditions was counterbalanced across subjects.

For each sequence, line lengths were drawn randomly from one of two Gaussian distributions with different means but the same SD. Line lengths were sampled from the range spanned by ± 2 SDs around a distribution’s mean. Each entire stimulus sequence comprised 120 trials, split equally between pre- and post-shift trials. Starting on the 61st trial, the mean value of the distribution from which stimuli were drawn was altered. Subjects were not informed that stimuli might change during a sequence.
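Drawing one sequence can be sketched as follows. The text says only that lengths were sampled within ±2 SDs of the mean. Rejection sampling from the Gaussian is our assumption, as are the function names and the seed. Values are in JND units and would then be converted to pixels with each subject's Weber fraction.

```python
import random

# Condition -> (pre-shift mean, post-shift mean, SD), in JND units (from the text).
CONDITIONS = {
    "Small Shift": (15.0, 25.0, 5.0),
    "Large Shift": (15.0, 30.0, 5.0),
    "Large Variance": (15.0, 25.0, 8.0),
}


def truncated_gauss(rng, mean, sd, n_sd=2.0):
    """Rejection-sample N(mean, sd) until the draw lies within mean +/- n_sd * sd."""
    while True:
        x = rng.gauss(mean, sd)
        if abs(x - mean) <= n_sd * sd:
            return x


def mean_shift_sequence(pre_mean, post_mean, sd, n_trials=120, seed=None):
    """First half from the pre-shift distribution; the mean shifts on trial n_trials // 2 + 1."""
    rng = random.Random(seed)
    half = n_trials // 2
    pre = [truncated_gauss(rng, pre_mean, sd) for _ in range(half)]
    post = [truncated_gauss(rng, post_mean, sd) for _ in range(n_trials - half)]
    return pre + post


# One hypothetical sequence per condition (seed chosen arbitrarily).
sequences = {name: mean_shift_sequence(*params, seed=1) for name, params in CONDITIONS.items()}
```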

With the exception of the very first trial, the initial length of the adjustable probe line on successive trials was set to the value of the subject’s immediate prior response. On the first trial, the length of the first probe was fixed at 5.4 degrees of visual angle (Experiment 2 examines whether the probe’s initial value affects judgments). Stimuli were presented on a Dell 1905FP LCD computer monitor, with a resolution of 1280 × 1024, at a viewing distance of approximately 60 cm.

Because stimulus sizes were scaled by each subject’s individual Weber fraction, they differed across subjects. Across all subjects and conditions, stimulus lengths had a mean of 9.63° and an SD of 5.85° of visual angle.

Modeling analysis

The data comprise subjects’ successive estimates of the running means and the stimulus values. For each complete sequence, 120 data points were recorded.

The data were fitted with three models representing different item integration mechanisms: the Equal, Recency, and Compression models. The Equal model predicts that subjects’ estimates equal the actual running averages of the stimuli, with all items weighted equally regardless of their serial positions. We included the Equal model as a null model against which to compare the other two models.

In the Recency model, subjects’ estimates were modeled as the dot product of a stimulus vector and a weight vector (Weiss & Anderson, 1969; Juni et al., 2012; Hubert-Wallander & Boynton, 2015). The weight vector is recency-prioritized (newer items receive more weight). Additionally, a bias term was added to capture any systematic bias in observers’ estimates (Eq. 3).

$$ \mathbf{s}_{i} = (s_{1}, s_{2}, \ldots, s_{i}) \tag{1} $$

$$ \mathbf{w}_{i} = \frac{\left(r^{\,i-1}, r^{\,i-2}, \ldots, r^{\,0}\right)}{\sum_{k=1}^{i} r^{\,i-k}} \tag{2} $$

$$ Est_{i} = \mathbf{w}_{i} \cdot \mathbf{s}_{i} + bias + \varepsilon \tag{3} $$

In the serial position-based modeling analysis, the weight vector is assumed to be strictly serial-position dependent. The Recency weights were modeled as a normalized exponential function defined over the serial positions of successive stimuli (Brown et al., 2007). The exponents represent the serial positions of the stimuli counted backward from the newest stimulus, so the newest item carries the largest weight whenever r < 1. The rate parameter, r, ranges from 0 to 1, allowing the model to capture different degrees of recency prioritization: a smaller r indicates a higher degree of recency prioritization, and when r equals 1, every weight reduces to 1/i for each of the i stimuli, representing Equal averaging. Dividing by the summed weights of all stimuli within a trial normalizes the weights so that they sum to one for each trial’s estimate. Because r determines the shape of the weight distribution, it is the parameter of interest. We aim to evaluate the best-fitting r values and compare them with the null hypothesis of an Equal weighting scheme (r = 1).
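Eqs. 1–3 can be sketched directly. We read each exponent as the item's lag behind the newest stimulus (0 for the newest item), which is the reading under which a smaller r produces more recency and r = 1 gives Equal weights. The function names are hypothetical, and the noise term ε is omitted because the sketch returns only the model's deterministic prediction.

```python
def recency_weights(i, r):
    """Normalized Recency-s weights for stimuli 1..i, ordered oldest first.

    Stimulus k (1-based) has lag i - k, so the newest stimulus gets r**0 = 1
    before normalization. With r = 1 every weight equals 1 / i.
    """
    raw = [r ** (i - k) for k in range(1, i + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def recency_prediction(stimuli, r, bias=0.0):
    """Deterministic part of Eq. 3: dot(w_i, s_i) + bias (noise omitted)."""
    weights = recency_weights(len(stimuli), r)
    return sum(w * s for w, s in zip(weights, stimuli)) + bias


print([round(w, 3) for w in recency_weights(5, 0.86)])
print(recency_weights(4, 1.0))  # Equal averaging: four weights of 0.25
```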

We also tested an alternative formulation of the Recency model that used a normalized power function to model the weights. The power-based model performed worse than the exponential-based model, so we kept the Recency model as specified in Eq. 2. See Supplementary Materials for details.

It is worth noting that in sequential averaging studies, the same stimulus can be characterized in terms of either temporal position (e.g., the item was presented x seconds ago) or serial position (e.g., the item was presented y items back), so the influences on the averaging computation may come from temporal position, serial position, or both. To address this issue, we ran an alternative temporal position-based modeling analysis, in which the weight vector was assumed to be strictly time-dependent. The centers of the stimulus presentation intervals were used as temporal positions. On each trial, each prior item’s temporal distance (in seconds) to the newest item was used as the exponent of the rate parameter, so each prior item was weighted according to its temporal distance from the current trial. This temporal position-based Recency model was added to the model comparison. In the following sections, we refer to these two models as Recency-s (serial) and Recency-t (temporal).
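The temporal variant swaps serial lags for time differences. This is a minimal sketch, assuming stimulus onset times (in seconds) are logged and that each stimulus lasts 1 s, as in the procedure. The onset values and function names are hypothetical. Because trials are self-paced, onsets are not evenly spaced, so the temporal weights need not match the serial ones exactly.

```python
def presentation_centers(onsets, duration=1.0):
    """Temporal position of each stimulus: the midpoint of its presentation interval."""
    return [t + duration / 2.0 for t in onsets]


def temporal_weights(centers, r):
    """Recency-t weights: r ** (seconds between each item and the newest item), normalized."""
    t_newest = centers[-1]
    raw = [r ** (t_newest - t) for t in centers]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical onsets: 1-s stimuli separated by variable, self-paced response times.
onsets = [0.0, 4.2, 7.9, 12.5]
print([round(w, 3) for w in temporal_weights(presentation_centers(onsets), 0.86)])
```

If stimuli happened to be exactly 1 s apart, each temporal distance would equal the serial lag and the two weightings would coincide.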

The Compression model provides an alternative account in which subjects complete their estimation by updating their immediately previous estimate (a single “compressed” representation of the previously shown items) with the newest stimulus. This strategy is plausibly encouraged in Experiment 1, because subjects are asked to estimate the running average on every trial and the initial value of the adjustable probe on each trial is set to the subject’s estimate on the previous trial.

$$ Est_{i} = w_{old} \cdot Est_{i-1} + w_{new} \cdot s_{i} + bias + \varepsilon \tag{4} $$

$$ w_{old} = (i-1) \cdot f \cdot w_{new} \tag{5} $$

$$ w_{old} + w_{new} = 1 \tag{6} $$

In the Compression model, an ideal observer should adjust the relative weights of the previous estimate and the new stimulus, putting less weight on the new stimulus as the sequence extends to include additional stimuli. We modeled subjects’ estimates as a weighted average of their previous estimate (“old”) and the newly shown item (“new”), plus a constant bias term and random noise, as summarized in Eq. 4. Constraints on the weights for the “old” and “new” terms are given in Eqs. 5 and 6; together they imply w_new = 1/(1 + (i - 1)f). The weighting relationship between the “old” and “new” terms is thus modulated by a factor parameter f, which ranges from 0 to 1. When f = 1, the new stimulus takes a weight of 1/i in the estimate, which is the ideal ratio for the task and corresponds to equally weighted averaging. When f = 0, the weight on the new item is 1, which means subjects rely solely on the newest item.
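Below is a one-step-ahead sketch of the Compression model. We assume, as our reading of Eq. 4, that Est_{i-1} is the subject's recorded response on the previous trial. The function name is hypothetical and the noise term is omitted. The weights come from solving Eqs. 5 and 6 for w_new.

```python
def compression_predictions(stimuli, responses, f, bias=0.0):
    """Compression-model prediction for each trial (noise omitted).

    responses[k] is the subject's response on trial k + 1 and serves as Est_(i-1)
    for the next trial. Eqs. 5 and 6 give w_new = 1 / (1 + (i - 1) * f).
    """
    preds = []
    for i, s in enumerate(stimuli, start=1):
        w_new = 1.0 / (1.0 + (i - 1) * f)
        w_old = 1.0 - w_new  # equals 0 on trial 1, so no previous estimate is needed
        prev = responses[i - 2] if i > 1 else 0.0
        preds.append(w_old * prev + w_new * s + bias)
    return preds
```

As a check on the constraints: with f = 1 and responses equal to the true running means, each prediction is again the running mean. With f = 0, each prediction is simply the newest stimulus.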

For each model, we separately fit each stimulus sequence from each subject. To obtain the best-fitting parameters, we computed the sum of squared errors between model predictions and observed estimates and minimized it using the “L-BFGS-B” bounded optimization method (Byrd et al., 1995). The initial value of r and f was 1, with bounds (0, 1]. The initial value of the bias term was 0, with bounds [-50, 50].
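The paper fit parameters with L-BFGS-B. The sketch below is a dependency-free stand-in, not the authors' procedure. It fits the Recency-s model by grid search over r in (0, 1] and solves for the bias in closed form, clamped to [-50, 50]. Because the squared error is quadratic in the bias, the clamped mean residual is the optimal bias for each r. It also uses the fact that exponential weights admit a recursive update. The function names and grid resolution are our assumptions.

```python
def recency_predictions(stimuli, r):
    """Bias-free Recency-s predictions for every trial of a sequence.

    With weights r**lag, the weighted sum and the weight total both update as
    N_i = r * N_(i-1) + s_i and D_i = r * D_(i-1) + 1, so Est_i = N_i / D_i.
    """
    preds, num, den = [], 0.0, 0.0
    for s in stimuli:
        num = r * num + s
        den = r * den + 1.0
        preds.append(num / den)
    return preds


def fit_recency(stimuli, responses, steps=1000, bias_bounds=(-50.0, 50.0)):
    """Grid-search r over (0, 1]; return (r, bias, rmse) with the smallest squared error."""
    n = len(responses)
    best = None
    for k in range(1, steps + 1):
        r = k / steps
        preds = recency_predictions(stimuli, r)
        # Least-squares bias for this r is the mean residual, clamped to the bounds.
        bias = sum(y - p for y, p in zip(responses, preds)) / n
        bias = min(max(bias, bias_bounds[0]), bias_bounds[1])
        sse = sum((y - p - bias) ** 2 for y, p in zip(responses, preds))
        if best is None or sse < best[0]:
            best = (sse, r, bias)
    sse, r, bias = best
    return r, bias, (sse / n) ** 0.5
```

Fixing r = 1 turns the recursion into the running mean, so the same code also evaluates the Equal model (with or without a bias term). The returned RMSE is the measure used in the model comparison below.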

Results and discussion

Data from three representative subjects are plotted in Fig. 4 (all individual plots can be found in the Supplementary Materials). Data from all mean-shift types (SS for “Small Shift”, LS for “Large Shift”, and LV for “Large Variance”) showed a general pattern: subjects’ estimates of the mean did not follow predictions from the Equal model.

After the mean shift (the vertical dotted line in the middle of the sequence denotes the final pre-shift trial), the influence of recent stimuli grew more evident as subjects’ estimates rose toward the mean of the post-shift distribution, increasingly deviating from the equally weighted running averages. Individual differences were observed. For instance, Subject 9’s estimates closely tracked each new stimulus, demonstrating a greater influence of the most recent item. Subject 15’s estimates changed less on post-shift trials. Subject 12’s estimates reflected an intermediate degree of recency. Despite these individual differences, no subject produced estimates that aligned with predictions of the Equal model.

These results suggest that subjects’ estimates were unlikely to arise from the Equal model. In the following section, we compare the performance of the non-equal models, Recency and Compression.

Model comparison

For each model, we conducted model fitting for all sequences separately. The best-fitting parameters are shown in Table 1. To evaluate the performance of non-equal models, we conducted both explanatory and predictive tests.

Table 1 Best-fitting parameters of Experiment 1

In the explanatory test, all the observed data were fitted with the competing models. As a result, we obtained best-fitting parameters for each model, and compared the model fits to the data using root mean squared error (RMSE). The model with the smallest RMSE “explains” the observed data the best.

Note that a model that excels in the explanatory test can fail in the predictive test, in which part of the observed data is used to estimate parameters that then predict the remaining data not used in training. One notable reason is overfitting: if explanatory performance is the sole criterion, a model can end up fitting meaningless noise and error in the data. Unfortunately, a lack of predictive assessment is common in psychological modeling studies (Shmueli, 2010; Yarkoni & Westfall, 2017). We follow Shmueli’s (2010) suggestion to treat explanatory and predictive performance as two dimensions of model assessment.

Explanatory tests

Explanatory performance is summarized in Table 2. The three non-equal models (Compression, Recency-s, and Recency-t) consistently outperformed the Equal model, again suggesting that the Equal model is the least likely of the competing models to capture the item integration mechanism.

Table 2 Mean explanatory and predictive RMSE of Experiment 1. Models are Recency-Serial (RS), Recency-Temporal (RT), Compression Model (CM), and Equal (EQ)

Among the non-equal models, the Recency models (both Recency-s and Recency-t) outperformed the Compression model. This result suggests that subjects are more likely to integrate multiple recent items (Recency models) rather than updating a compressed prior estimation (Compression model).

The explanatory performance of the two Recency models is almost identical. The mean RMSE difference between the two Recency models is 0.003, compared with mean RMSE differences of 1.19 relative to the Compression model and 27.99 relative to the Equal model. This near-identity reflects the high correlation (mean correlation coefficient = 0.99) between the serial and temporal positions used in the two Recency models. So, with the current design and analysis, it is unclear whether recency weighting is driven by serial positions or temporal positions. Future studies could control timing more tightly to separate the influences of these two factors.

Going forward, we will use Recency-s as the representative model due to its excellent explanatory performance. The three mean-shift conditions (“Small Shift”, “Large Shift”, and “Large Variance”) did not affect the recency rate parameters, F(2, 28) = 0.041, p = 0.96. If the abrupt up-shift of the mean had significantly influenced subjects’ item integration, we would expect the modeling results to differ among the three conditions. The similar best-fit parameters among mean-shift conditions suggest this possibility is unlikely.

Since the mean-shift manipulation did not affect the modeling parameters, we averaged each subject’s best-fitting r parameters across the three conditions. The group average of r is 0.86, indicating recency prioritization (this group-averaged value is used later to estimate the effective number of items integrated). A one-sample t test rejects the Equal null hypothesis (r = 1), t(14) = -4.71, p < 0.001.

Predictive tests

In the predictive tests, we used the averaged best-fitting parameters from ten randomly sampled subjects to predict data from a new subject. For training, best-fitting parameters from all mean-shift conditions were averaged together. For testing, one sequence was randomly sampled from a new subject whose responses had not been used in training. This predictive process was repeated with random sampling 100 times, and predictive performance (RMSE) was averaged over iterations.
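The resampling scheme can be sketched as follows. The sketch assumes each subject's fits and sequences are stored in dictionaries keyed by subject. The data layout, seed, and function names are hypothetical, and only the Recency-s model is shown. Parameters are pooled across all conditions of the ten training subjects, and one sequence from the held-out subject is scored per iteration.

```python
import random


def recency_predictions(stimuli, r, bias=0.0):
    """Recency-s predictions for every trial (recursive exponential weighting)."""
    preds, num, den = [], 0.0, 0.0
    for s in stimuli:
        num = r * num + s
        den = r * den + 1.0
        preds.append(num / den + bias)
    return preds


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    n = len(observed)
    return (sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n) ** 0.5


def predictive_rmse(fits, sequences, n_train=10, n_iter=100, seed=0):
    """Mean predictive RMSE over repeated random train/test draws.

    fits:      subject -> list of (r, bias) best fits, one per sequence
    sequences: subject -> list of (stimuli, responses) pairs
    """
    rng = random.Random(seed)
    subjects = sorted(fits)
    scores = []
    for _ in range(n_iter):
        test_subject = rng.choice(subjects)
        pool = [s for s in subjects if s != test_subject]
        train = rng.sample(pool, n_train)
        # Pool the best fits from all mean-shift conditions of the training subjects.
        params = [p for s in train for p in fits[s]]
        r = sum(p[0] for p in params) / len(params)
        bias = sum(p[1] for p in params) / len(params)
        # Score one randomly chosen sequence from the held-out subject.
        stimuli, responses = rng.choice(sequences[test_subject])
        scores.append(rmse(recency_predictions(stimuli, r, bias), responses))
    return sum(scores) / len(scores)
```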

In general, the non-equal models outperformed the Equal model (Fig. 5). Among the non-equal models, the Recency models were better than the Compression model. Detailed results are reported in Table 2 (Predictive RMSE).

Model comparison results

In both explanatory and predictive tests, the Equal model performed the worst of all models. Is the advantage of the non-equal models due merely to their bias term? We tested an alternative form of the Equal model with a bias term to capture systematic under- or over-estimation. In both explanatory and predictive tests, the Recency-s model outperformed this Equal model (ps < 0.001). So the determinant of model performance is not the bias term but the weight distribution in item integration.

Among the non-equal models, the Recency model outperformed the Compression model in both tests. The Recency models based on serial and temporal positions performed similarly in both tests (Table 2).

To sum up, the Equal averaging scheme does not seem like a plausible explanation or prediction mechanism for subjects’ estimates of the running means of sequentially presented stimuli. To estimate the running averages, subjects are more likely to utilize multiple recent representations rather than updating a single compressed representation of all prior stimuli. Subjects are likely to assign more weight to more recent items in their item integration.

Effective number of items integrated (ENI)

Both explanatory and predictive tests favored the Recency weighting scheme. The averaged best-fitting rate parameter for the Recency-s model was 0.86, suggesting that when tracking the running means of sequentially displayed stimuli, subjects rely more on recent stimuli rather than treating all items equally. The group-averaged weight distributions, plotted in the left panel of Fig. 6, show that a few of the most recent stimuli account for the majority of the weight, leaving the older stimuli to share a small fraction of it.

To quantify this observation, we defined Effective Number of items Integrated (ENI) as the fewest items that are needed to accumulate weights over a certain threshold. We found that to account for 90% and 95% of the cumulative weight required the most recent 16 and 20 stimuli, respectively. Since the Recency-s model assumes the weights follow an exponential function of serial positions, the weights assigned to these prioritized items are not uniformly distributed over the items either (e.g., the most recent five stimuli alone provide > 50% of the cumulative weights). Hence, a conclusion on ENI depends on the criterion that is used. For the current study using vertical lines, ENI₉₀ = 16, ENI₉₅ = 20, a small fraction of the total sequence length of 120.
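ENI follows directly from the Recency-s weights. In this minimal sketch (function names are ours), the group-averaged r = 0.86 and a 120-item sequence give 16 for the 90% criterion and 20 for the 95% criterion, matching the values above. The five newest items carry just over half of the total weight.

```python
def newest_first_weights(r, seq_len):
    """Normalized Recency-s weights ordered from the newest item (lag 0) backward."""
    raw = [r ** lag for lag in range(seq_len)]
    total = sum(raw)
    return [w / total for w in raw]


def eni(r, seq_len, threshold):
    """Fewest newest items whose cumulative weight reaches the threshold."""
    cumulative = 0.0
    for n, w in enumerate(newest_first_weights(r, seq_len), start=1):
        cumulative += w
        if cumulative >= threshold:
            return n
    return seq_len  # guards against rounding when threshold is 1.0


print(eni(0.86, 120, 0.90), eni(0.86, 120, 0.95))
print(round(sum(newest_first_weights(0.86, 120)[:5]), 3))
```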

Whitney and Leib (2018) suggested a square root relationship between the ENI and the total number of stimuli, although the studies they cited for ENI values did not share a unified definition of ENI. Our data suggest a ceiling on the ENI regardless of the total number of items in the sequence. The five sequential averaging studies cited in Fig. 1 all used small total numbers of stimuli (fewer than 20), and their estimated ENIs (maximum of 10) are far below the ceiling suggested by the current results (around 20). This raises the possibility of a two-stage relationship: the ENI may follow some function of the total number of stimuli (e.g., the square root relation) but be capped at an upper limit.

The mean-shift design

Earlier, we raised two concerns about the mean-shift design: (i) the distinctive shift in the mean may encourage recency in item integration, and (ii) reporting the mean on every trial may encourage subjects to compress prior items into a single representation rather than averaging them. Modeling results from Experiment 1 ruled out both concerns.

First, the best-fitting parameters from the three mean-shift conditions were not significantly different from each other, contrary to the idea that the distinctiveness of the mean shift encourages recency in item integration. Second, the Recency model’s advantage over the Compression model suggests that, even when subjects report the mean on every trial, they are unlikely to compress the preceding stimuli into a single value, as the Compression model assumes.

In sum, our results challenge prior treatments of prototype models as equally weighted averages over all items (Nosofsky, 1987; Smith & Minda, 2000), and call into question the conclusions drawn on the basis of those assumptions.

Experiment 2

In Experiment 1, the initial value of the adjustable probe on each trial (except the first) was set to the subject’s response on the immediately preceding trial. This may have encouraged subjects to base their estimates on their previous responses. That strategy is reasonable because responses are naturally autocorrelated in this task, but we wondered whether the starting values were responsible for the results that suggest recency weighting. Existing studies using the method of adjustment have used either random starting values (Haberman & Whitney, 2010) or fixed values outside the range of the regular stimuli (Huang & Sekuler, 2010). In Experiment 2, we changed the starting value of the adjustable probe to a fixed small value.

Additionally, we aimed to determine whether the findings of Experiment 1 are limited to line length. To this end, we included a condition in which subjects were asked to keep track of the running averages of sequentially displayed numerals. Unlike simple visual stimuli such as vertical lines, numerals are symbols that carry conceptual information directly related to subjects’ estimation responses. So, Experiment 2 afforded a direct comparison of response patterns from the line and numeral tasks.