1 Introduction and Background

The history of statistics as part of the school mathematics curriculum is relatively short compared with topics related to geometry and algebra. Approximately 30 years ago, the National Council of Teachers of Mathematics (NCTM) published its Curriculum and Evaluation Standards for School Mathematics (1989), including statistics and probability at all grade levels from kindergarten. In introducing what might be termed the “practice of statistics” for grades K-4, the NCTM suggested the inclusion of experiences for students to

  • collect, organize, and describe data;

  • construct, read, and interpret displays of data;

  • formulate and solve problems that involve collecting and analyzing data;

as well as to

  • explore concepts of chance. (p. 54)

At about the same time, Moore (1990) wrote his seminal manuscript on “uncertainty”. Among the many significant points advanced were three that provide background for this chapter. In recognising the importance of the new curriculum (NCTM, 1989), he said, “However, because of the emphasis that these recommendations place on data analysis, it is easy to view statistics in particular as a collection of specific skills (or even as a bag of tricks)” (p. 95). Second, although not the first to see the importance of context to the field of statistics, as Rao (1975) had made the point earlier, Moore amplified the word “data”, which was key in the NCTM’s Standards (1989). Moore emphasised that “… data are not merely numbers, but numbers with a context … Teachers who understand that data are numbers in a context will always provide an appropriate context when posing problems for students” (p. 96). Third, in summing up his message about uncertainty, Moore (1990) focussed on the fundamental concept underlying data and data analysis: variation. “The core elements of statistical thinking” were

  1. The omnipresence of variation in processes …

  2. The need for data about processes …

  3. The design of data production with variation in mind …

  4. The quantification of variation …

  5. The explanation of variation. (p. 135)

Shortly after this, Green (1993) started asking questions related to Moore’s claims for variation: “What do students understand of variability and how does this originate?” and “What are the essential experiences needed to develop a full appreciation of variability?” (pp. 227–228).

Furthering the issues raised by the questions of Green (1993), Shaughnessy (1997), echoing the opinion of Moore (1990) above, suggested that one of the problems associated with the lack of focus on variability in the classroom was the procedural nature of teaching combined with the complex computations needed to calculate the standard deviation, which was the measure of variation used by statisticians. Because the belief was common that a “measure” was needed for every concept in mathematics, variation had to wait for the standard deviation, which required, for example, the square root. In the meantime, the arithmetic mean was obtained by a simple procedure using addition and division, and hence, expectation became the focus of the curriculum.

Expectation arises out of the variation in data when data are summarised, perhaps with a measure of centre or a measure of association. Konold and Pollatsek (2002) used the metaphor of “signals within noisy processes” to characterise expectation and variation. Both sets of terms also apply when considering trials of random processes related to probability models. Often in authentic settings, expectations are materialised as predictions of outcomes from data in a context involving variation. In other settings, however, stated expectations or predictions may be the catalyst leading to consideration of the variation creating them.

Shaughnessy (1997) followed his supposition about the cause of the delay in focussing on variation in the curriculum with a call to “investigate students’ conceptions of variability and try some research approaches that uncover what our students can do in problem solving in chance and data, rather than merely documenting what they are unable to do” (p. 18).

More recently, the Guidelines for Assessment and Instruction in Statistics Education (GAISE) Report (Franklin et al., 2007) brought variation to the forefront in its description of the four steps of statistical problem-solving.

  I. Formulate Questions, Anticipating Variability—Making the Statistics Question Distinction

  II. Collect Data, Acknowledging Variability—Designing for Difference

  III. Analyse Data, Accounting of Variability—Using Distributions

  IV. Interpret Results, Allowing for Variability—Looking beyond the Data. (pp. 11–12)

The nature of variability that GAISE sees as relevant at Level A (the lowest of three levels across the school years) includes measurement variability, natural variability, and induced variability, but at Level A these are considered only within a data set (p. 15). GAISE also speaks of helping young children distinguish variability from error and of how these notions are used to explain outliers, gaps, and clusters (p. 33). Although the Common Core State Standards for Mathematics (Common Core State Standards Initiative, 2010) recognises variability as an essential starting point for the study of statistics, statistics and probability are not included in the curriculum until Grade 6.

The most recent Australian Curriculum: Mathematics (Australian Curriculum, Assessment and Reporting Authority [ACARA], 2016) recognises both variation and expectation in its early years, Foundation to Year 2. In these years, “[c]hildren have the opportunity to access mathematical ideas … by developing an awareness of the collection, presentation and variation of data and a capacity to make predictions about chance events” (p. 8). This chapter expands on this extract by considering both variation and prediction (i.e. expectation) for both “data” and “chance”. Further, reflecting Moore (1990), it is the uncertainty created by data, whether collected from surveys, experiments, or random devices, which means that a prediction made from the results of the data collected must be expressed with a corresponding degree of uncertainty. Statisticians may express this uncertainty in confidence intervals or p values, but the goal for young children is to use the evidence they have collected or the information available to them to express informally how confident they are in their concluding expectation.

Although research on school students’ understanding of the practice of statistics has grown rapidly since it was introduced in curriculum documents, the focus has been on older students (e.g. Konold & Pollatsek, 2002; Lehrer, Kim, & Schauble, 2007; Watson, Callingham, & Kelly, 2007). This focus is likely to reflect the assumptions associated with the requirements to draw conventional graphs and calculate statistical measures. Research on young children’s statistical understanding has been relatively sparse. Recently, the work of English (2010, 2012), Hourigan and Leavy (2015), and Kinnear (2013) has focussed on representations created to model situations based on picture books. In particular, English (2012) and Kinnear and Clark (2014) engaged children in activities based on picture books in the context of recycling and rubbish collection that brought attention to variation in contexts leading to making predictions. In this very concrete, yet imaginary context, children were able to consider variation in the data presented and make predictions, that is, state expectations, that were reasonable. In considering the relationship of variation and expectation across the grades Watson (2005) claimed that appreciation of variation developed before expectation, opposite to the focus of curricula suggested by Shaughnessy (1997). In continuing to consider the relationship of variation and expectation in detail, this study gathers more evidence to investigate this claim for quite young children.

In the light of the evolving appreciation of the importance of introducing statistical and probabilistic notions in early childhood, of using meaningful contexts in which to do so, and of developing a foundation based on understanding variation, this chapter presents the outcomes of interviews with seven 6-year-old children in their beginning year of formal schooling. The interviews sought to present the children with meaningful contexts that would allow them to display their naïve understanding of data, variation, and expectation. How would the students respond to the data presented to them or imagined? Would they recognise or create variation and how would they deal with it? Would their predictions reflect meaningful expectations from the data and variation experienced?

2 Methodology

2.1 Interview Protocols

As a part of previous projects focussing on the statistical understanding of students in Grades 3–9, four interview protocols had been developed, three related to the data section of the curriculum and one to the chance section (Watson & Kelly, 2005; Watson & Moritz, 1999, 2001). Three of the protocols included the use of concrete manipulative materials. The first was based on a container containing 100 lollies of different colours, of which 50 were red. The lollies questions used with students are shown in Fig. 4.1. The second involved creating a pictograph to show how many books some children had read using cut-out images of books and of the named children. The books questions are shown in Fig. 4.2. The third protocol employed a large moveable bar chart, which could be manipulated to show the number of children who arrived at school by four different means of transport. The transport questions are shown in Fig. 4.3. The fourth protocol involved speculating about the maximum daily temperature in their city given the information that the average daily maximum temperature for the year was 17 °C. The weather questions are shown in Fig. 4.4. The materials used in the first three protocols are shown in Fig. 4.5: the container with 100 lollies of which 50 were red, 30 green, and 20 yellow; the images of books and named children; and the moveable bar chart with the bars set for the initial question asked.

Fig. 4.1 Lollies interview protocol

Fig. 4.2 Books interview protocol

Fig. 4.3 Transport interview protocol

Fig. 4.4 Weather interview protocol

Fig. 4.5 Materials for the protocol for lollies (left), books (centre), and transport (right)

The results of analysing the responses to these protocols have been published for older children based on other research questions, at times including some of the responses of the 6-year-old children (e.g. Kelly & Watson, 2002; Watson & Kelly, 2005; Watson & Moritz, 1999, 2001). The data reported in this chapter are a comprehensive summary of all exchanges with the 6-year-old children in the four contexts to gauge the starting points for their appreciation of variation and expectation.

Table 4.1 summarises the focus of the protocols on the relationship of variation and expectation for the data in each of the four contexts. Some contexts were expected to be more difficult, and the protocols were ordered as presented in Table 4.1 because of the increasingly complex contexts associated with decreasing concrete hands-on contact with materials. Two of the protocols (books and weather) were shortened from use with older children by eliminating more complex explanations of representations at the end of the protocols. The lollies protocol was exactly as developed (e.g. Shaughnessy, Watson, Moritz, & Reading, 1999) and the transport protocol had one extra question, c), added near the beginning.

Table 4.1 Data, variation, and expectation in the protocols

2.2 Participants

The seven children (five boys and two girls) were in a preparatory class (before Grade 1) in a government school with a teacher who had implemented an innovative mathematics programme but who had not yet introduced material related to chance and data that year. The children were chosen by the teacher, from a class of 25, as articulate and willing to talk with “visitors from the university”. Each interview took place individually in a quiet room for approximately 45 min, including all protocols. Students showed interest in all questions and did not appear to experience fatigue. Parental permission was obtained and the interviews were video-recorded, from which transcripts were produced.

2.3 Analysis

For the purpose of this chapter, the data from the interviews were reanalysed specifically with respect to three aspects of the students’ developing understanding:

  • DATA—the children’s interaction with the data related to the contexts presented in the protocols;

  • VARIATION—the children’s capacities to (a) recognise and/or describe variation in data presented or created and (b) include acknowledgement of variation within predictions made; and

  • EXPECTATION—to use the variation implicit or explicit in the context to make predictions that reflect meaningful expectations.

Given the children’s age and lack of experience, the aim was not to classify the responses to the protocol questions hierarchically but to document the interaction of the basic concepts of variation and expectation in the contexts exposing the students to data. Following the example of Russell (1990) in exploring how “children construct their ideas about data” (p. 158), the analysis goes beyond the data to explore how children use data to construct ideas about variation and expectation. A descriptive account is presented to illustrate how children are capable at quite young ages of engaging with these big ideas intuitively, although often without the ability to provide statistical justifications. The terminology of “data”, “variation”, and “expectation” was not used during the interviews, and the language suggested in Figs. 4.1–4.4 was closely followed.

3 Results

The results are presented for each protocol with a summary at the end for the three aspects of student understanding explored across the protocols: data, variation, and expectation.

3.1 Lollies Protocol

Expectation was the main contextual motivation in the lollies protocol (Figs. 4.1 and 4.5) with interest in the contribution variation made in the predictions of students or in their explanations of the outcomes they obtained from their trials. All of the students understood the setting and the drawing out of the lollies that created the data with which they worked, although the word “data” was not used. Most of the questions were based on predictions of outcomes of drawing lollies from the container, although the questions were posed in a manner to allow recognition of variation in the outcomes.

Given the contents of the container, the initial expectations for the number of reds in ten draws (no replacement) were reasonable: 4, 5, or 6, with qualifications of “or more”, “about”, or “maybe” for four responses, recognising potential variation. The reasons, however, were not based on proportional reasoning:

  • There are some reds on top and bottom—in the corner.

  • 5 + 5 = 10, 5 of one colour and 5 of another.

  • 5 + 5 = 10, one more makes 6, and 4 is 10.

One student appeared to have an intuition about the proportion but did not have the language to express it: “Because there’s 50, and 5 … like 10”. All students said either “No” or “Don’t know” when asked if repeated draws would produce the same result. Responses reflected appreciation of variation in sampling, for example, “Might get a different number every time” or “If it’s mixed up I might get 4 yellow, 3 red, and 3 green”. When asked how many reds would be a surprise, five said a higher number such as “maybe 10”, with two saying “6” or “6 or 5”. Justification for these answers generally reflected other possibilities or “don’t know” with the response for “6” being “it’s my favourite number”.

When asked to predict the outcomes for the number of reds in six separate trials, six of the seven responses contained no repeated numbers of reds, whereas one had “4” listed twice. Four predictions were consistent with a mean of 5 reds, with two sets considered high and one set low. In terms of variation, four were judged as wide and three as reasonable. Only one set of values was both centred on 5 and with reasonable variation.
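The setting of the lollies protocol can be made concrete with a short calculation. The following is a minimal sketch and not part of the original study: it assumes that each handful is ten lollies drawn without replacement from the full container of 100 (50 red), and that the lollies are returned and mixed between trials. Under these assumptions the number of reds follows a hypergeometric distribution. The names `prob_reds` and `simulate_trials` and the random seed are illustrative choices only.

```python
import random
from math import comb

# Container contents and handful size as described in the protocol.
TOTAL, RED, DRAW = 100, 50, 10


def prob_reds(k: int) -> float:
    """Hypergeometric probability of exactly k reds in one handful of DRAW lollies."""
    return comb(RED, k) * comb(TOTAL - RED, DRAW - k) / comb(TOTAL, DRAW)


# Exact distribution, mean, and standard deviation of the number of reds.
pmf = [prob_reds(k) for k in range(DRAW + 1)]
mean = sum(k * p for k, p in enumerate(pmf))
sd = sum((k - mean) ** 2 * p for k, p in enumerate(pmf)) ** 0.5


def simulate_trials(n_trials: int = 6, seed: int = 0) -> list[int]:
    """Simulate n_trials handfuls; ASSUMES the container is refilled between trials."""
    rng = random.Random(seed)
    container = ["red"] * RED + ["other"] * (TOTAL - RED)
    # rng.sample draws without replacement within a single handful.
    return [rng.sample(container, DRAW).count("red") for _ in range(n_trials)]


if __name__ == "__main__":
    print(f"mean = {mean:.2f}, sd = {sd:.2f}")
    print("six simulated trials:", simulate_trials())
```

Under these assumptions the mean is 5 reds with a standard deviation of about 1.5, which is in line with treating predictions centred on 5 and spread over a few values either side as reasonable. The criteria actually used to judge the students’ predictions are not reproduced by this sketch.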

Asked which of five outcomes of six draws (see Fig. 4.1) would best describe the most likely outcome, four chose the best response, “3, 7, 5, 8, 5, 4”, whereas one each chose “all 10s”, “all 5s”, and “2, 3, 4, 3, 4, 2”. The four reasons for the best response were similar to “Mixed up, different amounts” with only one specifically mentioning “5”. The reasoning for “all 10s” was “you could get the same number”, whereas for “all 5s” it was “there are more red”. Asked which set described the likely outcomes least well, students either replied “all 5s” or “all 10s” with intuitive reasoning reflecting the list (“it’s got heaps of 10s”) or the contents of the container (“not enough 10s”).

When asked the range of outcomes for six trials, responses varied from “0–10” to “2–8” and “3–9” with five responses including “10”. Reasons were generally idiosyncratic, for example, “I can fit 10 in my hand” or “2 + 8 = 10”, or reflecting single outcomes, for example, “I might grab them all from the red part”.

Asked to show a way to record the results from many trials, six students produced the displays shown in Fig. 4.6. The other student responded orally, “ask everyone—get a clipboard”. Drawings (ii), (iii), and (vi) represented the setting of the trials, whereas (i) and (v) recorded numbers for the outcomes. Drawing (iv) was accompanied by the explanation, “write each number up to 8 and then write the number of people next to each of the numbers that got that many”.

Fig. 4.6 Representations for data collection for lollies protocol

When provided with grid paper and asked to show how many reds 40 students might draw from the container, five appeared to understand the task as they were colouring in the squares, although they did not necessarily fill the squares from the bottom of the grid. None were urged to complete the task for all 40 students. Two students explicitly said they had not seen a graph like the grid before. Two graphs are shown in Fig. 4.7. The choices of squares to fill reflected either “possible” outcomes or “numbers I like”, with no further explanation.
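For comparison with the grids the children produced, one possible outcome for 40 students can be generated by simulation. This is a hedged sketch rather than a reconstruction of the task materials: it assumes each of the 40 students draws ten lollies without replacement from a freshly mixed container of 50 red and 50 other lollies, and it renders the tally as a text grid filled upward from the bottom. The function names `tally_reds` and `grid_rows`, the seed, and the `X`/`.` symbols are all illustrative.

```python
import random
from collections import Counter


def tally_reds(n_students: int = 40, seed: int = 1) -> Counter:
    """Count how many simulated students drew each number of reds (0 to 10)."""
    rng = random.Random(seed)
    container = ["red"] * 50 + ["other"] * 50  # contents as in the protocol
    return Counter(
        rng.sample(container, 10).count("red") for _ in range(n_students)
    )


def grid_rows(tally: Counter) -> list[str]:
    """Build text rows of a grid: one column per number of reds, filled from the bottom."""
    height = max(tally.values())
    rows = []
    for level in range(height, 0, -1):
        # A cell is filled when at least `level` students drew that many reds.
        rows.append(
            "".join(" X " if tally[k] >= level else " . " for k in range(11))
        )
    rows.append("".join(f"{k:^3}" for k in range(11)))  # axis labels 0..10
    return rows


if __name__ == "__main__":
    print("\n".join(grid_rows(tally_reds())))
```

In a grid of this kind each filled square stands for one student, so the column heights show the variation across students and the tallest columns indicate where the expectation lies.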

Fig. 4.7 Grids filled in for number of reds in repeated draws of ten lollies for “other” students

3.2 Books Protocol

The books protocol (Figs. 4.2 and 4.5) gave students the opportunity to use concrete materials to represent a data set. Of interest was how they used their displays to show the variation in the data and ultimately how they would make predictions about implications from within or outside the data displayed.

When presented with the cards for the books and the characters, six of the seven students could distribute the cards next to the characters in such a way that the numbers of books read could be compared for the characters. Two students distributed the books to one side of the character, two students distributed the books vertically below the character, and two students distributed the books around the character (non-overlapping). The remaining student ignored the data information and piled cards on top of each other next to the character. When asked what someone new to the room could tell from their displays, some responses repeated the information displayed or said “she could tell by counting”. One student made up a story about the girl with the most books “winning” and if “someone came along and stole one of her books she’ll only have 3 left”.

When Andrew, having read five books, and Jane, having read four books, were introduced, all students added the characters and their books as before, including the student who put the books in piles. Given a card for Ian with no books, one student put him to the side, whereas the others put him with the other characters (with no books). When asked to show an additional library book for each character, all students added a book to each character, but those with books scattered around the character had trouble keeping track of which had received an extra book, and the student with piles of books got confused about which character owned which book and missed out one character.

Asked what a new visitor could now tell from the display, responses varied widely from “people are reading books”, to “did you put one more on each person?”, to again reading a count of how many books each character had read. Asked who likes reading the most, all said Lisa because she had the most books; none expressed any uncertainty in the suggestion. Asked how they could tell how many books the characters had read altogether, some just said “count them”, whereas others tried to do so, with mixed success and only one reaching the correct total of 33. Asked who was most likely to want a book for Christmas, again five students said Lisa because she “reads the most” or “likes reading”. The remaining two students gave different imaginary accounts, “Ian, because he’s only got one book” and “Terry, he’s got 5 books—dinosaur one, a skeleton one, and a giraffe one and he wants one on plants so he can see how they grow”.

Finally, students were asked to predict how many books two new students, Paul and Helen, might have read. Responses were quite varied. One boy would not predict for either new character saying, “Don’t know, my sister always makes me guess, I have to put up with it!” The rest made predictions, including “0, because it was his first time in the library and he doesn’t know how to choose books”, “10, I just think he would”, and “3, because one of my sisters is 3”. None of the responses used the information in the display (data) to make a prediction about the new students.

3.3 Transport Protocol

At the beginning of the transport protocol (Fig. 4.3), four of the students said they had seen a bar graph like the moveable bar graph in Fig. 4.5 before; three had not. Initial questions checked if students could read the graph and distinguish the variation presented. Two students required initial help in reading the graph, but then all said that “most” children came by bus. When asked how many came by car, six replied “6” and one who had trouble initially said “5”. The two questions requiring basic mathematical calculations caused difficulty. Five of the students worked out that 4 more of Mr. Smith’s class came by bus than car with two counting the lines between the bars and two justifying their answer with “3 + 4 = 7”. Only one student, however, obtained the total of 18 children in the class, by counting on his fingers. Others required help and one responded, “Just go down there [to the class] properly and count and the ones I count can stand up and when I go back down they can sit down”. One student suggested 10 because that was the largest number on the vertical axis of the bar graph.

Asked to predict how a new child would come to school, students provided a wide range of responses and explanations, as shown in Table 4.2. Only one response was based on the bar graph, whereas the others were based on the students’ own experiences or imagined scenarios in the context.

Table 4.2 How a new child would come to school

Students were then asked to predict how Mr. Smith’s class would arrive if there were no bus and to move the bars on the graph to show their predictions. One student pushed bus down to zero and moved car to 7, bike to 7, and walking to 5 for the total of 18. Others required help to make the adjustments necessary to adjust for the 7 in the bus.

After returning the bar graph to its original position (Fig. 4.5), students were asked to adjust the graph again to show how Mr. Smith’s class would get to school if it were raining tomorrow. Again this was not an easy task for the students, with none making all of the adjustments for the correct total without prompts. Two students adjusted car and bus upward appropriately but did not initially move bike and walk to zero. Two others moved bike and walk to zero but had difficulty adjusting car and bus upward by the correct numbers, one agreeing that some of Mr. Smith’s class did not come to school that day. Adjusting the variation in the data to fit the prediction in this context was very difficult, even with the concrete representation of the bar graph to help.

3.4 Weather Protocol

The weather protocol was the most difficult for the 6-year-old students, and many parts used with older students were not included for these students. The initial question (see Fig. 4.4), worded for the possible suggestion of variation by the students, only elicited one response about the temperature that appeared to acknowledge variation, “it’s a little bit cold, lower than today”. Two responses in the context reflected an interpretation of the expectation: “quite hot at 17 °C” and “going to be hot for the whole year”. Other responses commented on the TV news and it being wrong with the weather. One student was “not sure” what the average temperature meant.

When asked the more explicit question about all days of the year having 17 °C as the highest temperature, five said “no”, one said “sometimes”, and one said “yes, maybe”. Explanations included the following, acknowledging variation.

  • No. The temperature always changes.

  • No. You get summer, spring, winter, autumn, and summer again. You get hot, mild/cool, cold, mild/cool, and hot again.

  • No. One day it might be cold, the next day it might be colder.

  • No. Every single thing is different, so they do different things every single day.

  • Sometimes. Sometimes it’s raining.

Predicting the temperatures for six different days of the year revealed an acknowledgement of variation but not necessarily appropriate values for temperatures in the city. Three of the responses were within ranges reasonable for the city, for example “11–30” ( °C), whereas three others had maxima of “70” °C or higher and one had a range of “5–10” °C (too low for the city). Similar responses were given for January and July.
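The relationship the weather protocol probes, that a single average is compatible with considerable day-to-day variation, can be illustrated with a small example. The six temperatures below are hypothetical values chosen for illustration so that they average 17 °C; they are not data from the interviews or from the city concerned.

```python
from statistics import mean

# HYPOTHETICAL daily maximum temperatures (°C) for six days of the year.
# Chosen only so that they vary while averaging 17; not interview data.
six_days = [11, 14, 16, 18, 20, 23]

average = mean(six_days)                 # the "expectation"
spread = max(six_days) - min(six_days)   # a simple measure of the variation

if __name__ == "__main__":
    print(f"average = {average} °C")
    print(f"range   = {min(six_days)}–{max(six_days)} °C (spread of {spread})")
```

None of the six values equals 17, yet their average does, which is the distinction between expectation and variation that the explicit question about every day having a maximum of 17 °C was designed to draw out.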

Only four students were asked to draw a representation of the temperature over a year. Their representations are shown in Fig. 4.8. Figure 4.8a represents the variation from a sunny day to a rainy day, in Fig. 4.8b the circle represents “the land with how hot it would be written on it” (perhaps from seeing a weather map on television), Fig. 4.8c shows a beach and the student described how “hot it is when we go to the beach”, and Fig. 4.8d shows a picture of the student in the sun and she explained what she wore “when it was hot or cold”. Figures 4.8a and 4.8d, and perhaps Fig. 4.8b, together with the accompanying explanations, recognise the variation present in the weather.

Fig. 4.8 Representations for the temperature over a year

3.5 Summary

Students’ familiarity with the context within which the data were presented or created influenced their ability to comprehend the questions asked. Lollies, books, and transport contexts were all quite familiar, with weather related to temperature less so. For lollies, the “data” were based on the actual sweets in a container; for books, the data were cards representing books and children in a one-to-one match with the context; for transport, the data were represented in a moveable bar chart, although never individually; and for weather, no data were presented and hence they needed to be created by the students.

For the lollies protocol, students appreciated the variation in outcomes but did not have the language to explain random behaviour. Only one student used the same number twice in reporting the number of red lollies in six different trials. For the others, this may reflect a naïve view that the numbers were being “used up” as they were chosen, or because the students were given a choice, they would be “fair” in choosing as many different numbers as possible. A similar tendency appeared for Grade 3 in a related study with 56% of students not including repeated values, whereas this decreased to 27% for older students (Kelly & Watson, 2002). In a different context of predicting outcomes following the presentation of the data in a table, Kinnear and Clark (2014) found 5-year-olds more likely to reuse numbers from the table.

For the books protocol, when asked to make predictions about the characters based on the data visible in their displays, six students could do so with reasons based on the data (e.g. “most” and “least”). When asked about characters outside those visible, however, they did not use the visible data to inform a prediction. In the transport protocol, predicting how a new child outside of the data displayed would come to school posed a difficulty similar to that of predicting how many books a new character would have read. In the weather context, students clearly understood about changes in weather conditions but struggled with actual numerical temperatures, which is not surprising at their age. Working backward from a specific expectation expressed as an “average”, however, was very difficult in the context.

4 Discussion

The four protocols, initially devised for older students, helped distinguish the limits of understanding of variation and expectation for these 6-year-olds. The results support the view that recognising and discussing variation in data in their experience are very natural to 6-year-olds, even though they may not be able to explain its origin. Dealing with variation also generally develops before the ability to express meaningful expectation related to that variation. In the two protocols that began with variation (books and transport), students’ expectations, expressed as predictions, were often not based on reasoning associated with the data but instead with imaginary situations, within or outside the context. In the two protocols initiated with expectation, the lollies task was easier because the concrete materials were in front of the student and the prediction was based on “visible” data. Being presented with the fixed expectation in the weather protocol was more difficult because it was a single value associated with a less familiar context (temperature). For lollies, there was variation in the predictions made, based on variation in the lollies seen in the container. The predictions had well-understood boundaries (0–10 red), whereas for weather, the variation was in the data without boundaries as such, and with which students had much less familiarity. This made the task more difficult but the students understood enough about the context to suggest numbers for temperatures.

In Kinnear’s (2013) study, the responses where students gave predictions or explanations based on the context of the protocol but not based on the data presented were called abductive reasoning. In her study, the context was a picture book including a plot, which some of her 5-year-old students used to make predictions, rather than examining the actual data provided in the context. Similar examples from the current protocols include discussing where red lollies may be in the container or how many would fit in the hand, suggesting that a character is not familiar with the library for selecting a book, discussing the amount of time it takes to reach school or familiarity with bus routes, and providing general characteristics of weather and seasons. For these protocols, however, there were also other responses that were based completely on imagination, not context, such as, “my sister is 3” or choosing “numbers I like”. Studies such as these with young children suggest there is a progression in thinking from what might be called imaginary reasoning outside of the context presented, to abductive reasoning using only the context presented, to the beginning of statistical (or inferential) reasoning using the data within the context in decision-making (Ben-Zvi, Aridor, Makar, & Bakker, 2012; Makar & Rubin, 2009). Asking for predictions for books read or transport to school for children outside of the visible data set could be considered precursors to introducing samples and populations, elements of inferential reasoning. Ben-Zvi et al. and Makar and Rubin, however, were working with students in Grades 4–6 and also focussing on acknowledging uncertainty in decision-making. Students in this study were not questioned about the certainty of the responses given, although the impression gained from some answers was that they were guesses, indicating that certainty was not an issue. More research with young children should shed light on this suggested pathway and propose ways of scaffolding children into the practice of statistics.

The predominance of variation throughout the interviews, which students had virtually no trouble recognising or creating, supports the views of Moore (1990) and Shaughnessy (2003) that variation is in fact the foundation of all statistical enquiry. In terms of expectation, it is variation that either creates the prediction or provides supporting evidence that the expectation is reasonable, supporting Watson’s (2005) claim that appreciation of variation is the starting point for children’s engagement with the practice of statistics.