12.1 Introduction

Predicting the outcomes of football games has attracted the attention of several researchers, mostly in the context of league championships. For instance, Keller [8] fitted the Poisson distribution to the number of goals scored by England, Ireland, Scotland, and Wales in the British International Championship from 1883 to 1980. Lee [9] also considers the Poisson distribution, but allows the parameter to depend on a general home-ground effect and on individual offensive and defensive effects. Moreover, Karlis [7] applied the Skellam distribution to model the goal difference between home and away teams. The authors argue that this approach relies neither on independence nor on the marginal Poisson distribution assumptions for the number of goals scored by the teams. A Bayesian analysis for predicting match outcomes for the English Premier League (2006–2007 season) is carried out using a log-linear link function and noninformative prior distributions for the model parameters.

Taking another approach, Brillinger [1] proposed modeling the win, draw, and loss probabilities directly by applying a trinomial regression model to the Brazilian 2006 Series A championship. By means of simulation, the total points, the probability of winning the championship, and the probability of ending the season in the top four places are estimated for each team.

In spite of the vast literature devoted to league championship prediction, few articles concern score prediction for the World Cup tournament (WCT) [5, 12, 13]. The WCT is organized by the Fédération Internationale de Football Association (FIFA, French for International Federation of Association Football) and occurs every 4 years. The shortage of research on the WCT is probably due to the limited amount of valuable data on international matches and to the fact that few competitions bring together teams from different continents.

A log-linear Poisson regression model which takes the FIFA ratings as covariates is presented by Dyte and Clarke [5]. The authors present some results on the predictive power of the model as well as simulation results that estimate the probabilities of winning the championship for the 1998 WCT. Volf [13], using a counting process approach, modeled the development of a match score as two interacting time-dependent random point processes. The interaction between teams is modeled via a semiparametric multiplicative regression model of intensity. The author applied his model to the analysis of the performance of the eight teams that reached the quarter-finals of the 2006 WCT. Suzuki et al. [12] proposed a Bayesian methodology for predicting match outcomes using experts’ opinions and the FIFA ratings as prior information. The method is applied to calculate the win, draw, and loss probabilities for each match and also to estimate the probabilities of classification in the group stage and of winning the tournament for each team in the 2006 WCT.

In this chapter, we propose a Bayesian method for predicting match outcomes that uses the experts’ opinions and the FIFA ratings as prior information. Differently from [12], however, the experts’ opinions for all 48 matches of the first phase (the group stage, in which each team played three matches) were provided before the beginning of the 2010 WCT. The motivation was the difficulty of obtaining the experts’ opinions (four sportswriters contributed their opinions) at the end of each round of the group stage. The drawback is that these matches were played within 15 days, mostly on different dates and at different times. For the second phase, the experts’ opinions were provided before each round. Moreover, we incorporate time-effect weights for the matches, that is, we consider that the outcomes of matches played earlier are less important than the outcomes of more recent matches. An attractive advantage of our approach is the possibility of calibrating the weight of the experts’ opinions as well as the importance of previous match outcomes in the model, which gives some control over the model’s predictive capability. By considering a grid of values for the experts’ opinion weight \(a_0\) and for the importance of past matches, the \(p_i\)’s, we can assess the impact of these weights on the model’s predictive capability.

We used the predictive distributions to perform a simulation based on 10,000 runs of the whole competition, with the purpose of estimating various probabilities of interest, such as the probability that a given team wins the tournament, reaches the final, qualifies for the knockout stage, and so on.

The chapter is outlined as follows. In Sect. 12.2, we present the probabilistic model and expressions for priors and posterior distribution of parameters, as well as for predictive distributions. In Sect. 12.3, we present the method used to estimate the probabilities of winning the tournament. In Sect. 12.4, we give our final considerations about the results and further work.

12.2 Probabilistic Model

In the current format, the WCT gathers 32 teams, where the host nation(s) has a guaranteed place and the others are selected in a qualifying phase which occurs in the 3-year period preceding the tournament. The tournament consists of a group stage followed by a knockout stage. In the group stage, the teams play against each other within their group and the top two teams in each group advance to the next stage. In the knockout stage, 16 teams play one-off matches in a single-elimination system, with extra time of 30 min (divided into 2 halves of 15 min each) and penalty shootouts used to decide the winners when necessary.

The probabilistic model is derived as follows. Consider a match between teams A and B with respective FIFA ratings \(R_A\) and \(R_B\). In the following we shall assume that, given the parameters \(\lambda_{A}\) and \(\lambda_{B}\), the numbers of goals \(X_{AB}\) and \(X_{BA}\) scored by teams A and B, respectively, are two independent random variables with

$$\begin{aligned} X_{AB} \mid \lambda_A &\sim \textrm{Poisson} \bigg(\lambda_{A} \frac{R_A}{R_B} \bigg),\end{aligned}$$
(12.1)
$$\begin{aligned} X_{BA} \mid \lambda_B &\sim \textrm{Poisson} \bigg(\lambda_{B} \frac{R_B}{R_A} \bigg).\end{aligned}$$
(12.2)

In this model, the ratings are used to quantify each team’s ability, and the mean number of goals A scores against B is directly proportional to team A’s rating and inversely proportional to team B’s rating. If A and B have the same ratings (\(R_A = R_B\)), then the mean score for that match is \((\lambda_A, \lambda_B)\). So, the parameter \(\lambda_{A}\) can be interpreted as the mean number of goals team A scores against a team with the same ability, and an analogous interpretation applies to \(\lambda_{B}\).
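As a minimal illustration of models (12.1) and (12.2), the following sketch computes the two Poisson means from the ratings. The function name `goal_means` and the numerical ratings are hypothetical and serve only as an example.

```python
def goal_means(lam_a, lam_b, r_a, r_b):
    """Poisson means of X_AB and X_BA under models (12.1) and (12.2)."""
    if r_a <= 0 or r_b <= 0:
        raise ValueError("ratings must be positive")
    mean_ab = lam_a * r_a / r_b  # expected goals of A against B
    mean_ba = lam_b * r_b / r_a  # expected goals of B against A
    return mean_ab, mean_ba


# Equal ratings: the means reduce to (lambda_A, lambda_B).
print(goal_means(1.5, 1.2, 1000.0, 1000.0))
# Team A rated twice as high as team B (illustrative values only).
print(goal_means(1.5, 1.2, 1600.0, 800.0))
```

Doubling the rating ratio doubles the mean for A and halves the mean for B, which is the proportionality described above.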

We first consider the formulation of the prior distribution. To formulate it, a number of experts provide their expected final scores for the upcoming matches which we intend to predict. This kind of elicitation procedure is natural and simple, since the model parameters are directly related to the number of goals and the requested information is readily understood by the respondents without any extra explanation. We have adopted multiple experts since we believe this aggregates more information than using only one expert.

Assuming that the experts’ opinions are independent and follow a Poisson distribution, we obtain the prior distribution for the parameters using a procedure analogous to the power prior method [2], with the historical data replaced by the experts’ expected scores. The proposed elicitation process is based on the assumption that the experts are able to provide plausible outcomes for the upcoming matches, that is, outcomes that could be observed but in fact were not. This elicitation process is in accordance with the Bayesian paradigm for prior elicitation as discussed in [6] and [3], as we will see later in this section. Moreover, although the independence assumption is made mainly for mathematical simplicity, we can argue that it holds at least approximately in our case, since the selected experts work for different media and do not maintain any contact.

Suppose we intend to predict a match between teams A and B. Consider that \(s\) experts provide their expected scores for \(m\) upcoming matches of teams A and B. Denote by \(y_{i,j}\) the \(j\)th expert’s expected number of goals scored by team A against opponent \(OA_i\) and by \(z_{i, j}\) the expected number of goals scored by team B against opponent \(OB_i\), \(i = 1, \ldots, m\), \(j = 1, \ldots, s\).

In the following, we shall assume the probability density functions as the initial information about the parameters given by

$$\pi_0(\lambda_{A}) \propto \lambda_A^{\delta_{0}-1}\exp\{-\beta_{0}\lambda_A\} \quad \text{and} \quad \pi_0(\lambda_{B}) \propto \lambda_B^{\delta_{0}-1}\exp\{-\beta_{0}\lambda_B\},$$
(12.3)

where \(\delta_{0}>0\) and \(\beta_{0}\geq 0\). Note that if \(\delta_{0}=1/2\) and \(\beta_{0}=0\) we have the Jeffreys’ prior for the Poisson model, if \(\delta_{0}=1\) and \(\beta_{0}=0\) we have a uniform distribution over the interval \((0, \infty)\), and if \(\delta_{0}> 0\) and \(\beta_{0}> 0\) we have a proper gamma distribution. The first two cases are usual choices to represent a noninformative distribution for the parameter.

Updating this initial prior distribution with the experts’ expected scores, we obtain the power prior of \(\lambda_A\)

$$\pi \left(\lambda_A | \mathcal{D}_0 \right) \propto \lambda_A^{a_0 \sum\limits_{i=1}^{m} \sum\limits_{j=1}^{s} y_{i, j} +\delta_{0}-1} {\rm exp}\bigg\{ -\left(a_0 s \sum\limits_{i=1}^{m} \frac{R_A}{R_{OA_i}}+\beta_{0}\right)\lambda_A\bigg\},$$
(12.4)

where \(0 \leq a_0 \leq 1\) represents a “weight” given to the experts’ information and \(\mathcal{D}_0\) denotes all the experts’ expected scores. Thus, if \(0 < a_0 \leq 1\), the prior distribution of \(\lambda_A\) is

$$ \lambda_A | \mathcal{D}_0 \sim{\rm Gamma} \left(a_0 \sum\limits_{i=1}^{m} \sum\limits_{j=1}^{s} y_{i, j} + \delta_{0}, a_0 s \sum\limits_{i=1}^{m} \frac{R_A}{R_{OA_i}}+\beta_{0} \right),$$

and if \(a_0 = 0\) the prior for \(\lambda_A\) is the initial prior (12.3) (the Jeffreys’ prior when \(\delta_0 = 1/2\) and \(\beta_0 = 0\)), which corresponds to disregarding all the experts’ information. In particular, if \(a_0 = 1\) the prior for \(\lambda_A\) equals the posterior which would be obtained if all the expected scores were in fact real data. Thus, the parameter \(a_0\) can be interpreted as a degree of confidence in the experts’ information.
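The gamma parameters of the power prior can be computed directly from the experts’ scores and the ratings. The sketch below is one possible implementation; the function name `power_prior_params`, the default values \(\delta_0 = 1/2\), \(\beta_0 = 0\), and all numbers in the example are assumptions made for illustration, not values used in the chapter.

```python
def power_prior_params(expected_goals, rating, opp_ratings, a0,
                       delta0=0.5, beta0=0.0):
    """Shape and rate of the gamma power prior (12.4) for one team.

    expected_goals[i][j] is the j-th expert's expected number of goals
    against the i-th opponent; opp_ratings[i] is that opponent's rating.
    """
    if not 0.0 <= a0 <= 1.0:
        raise ValueError("a0 must lie in [0, 1]")
    if len(expected_goals) != len(opp_ratings):
        raise ValueError("one row of expert scores per opponent is required")
    s = len(expected_goals[0])  # number of experts
    total_goals = sum(sum(row) for row in expected_goals)
    exposure = sum(rating / r for r in opp_ratings)
    shape = a0 * total_goals + delta0
    rate = a0 * s * exposure + beta0
    return shape, rate


# Hypothetical input: m = 3 opponents, s = 4 experts.
y = [[2, 1, 2, 1], [1, 1, 0, 2], [3, 2, 2, 1]]
print(power_prior_params(y, 1000.0, [800.0, 1000.0, 1250.0], a0=1 / 12))
# With a0 = 0 the experts are ignored and (delta0, beta0) is returned.
print(power_prior_params(y, 1000.0, [800.0, 1000.0, 1250.0], a0=0.0))
```

Note that with \(a_0 = 0\) and \(\beta_0 = 0\) the rate is zero, so the prior is improper, in line with the noninformative cases of (12.3).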

The elicitation of the prior distribution (12.4) can also be viewed in the light of the Bayesian paradigm of elicitation [6], when we consider the likelihood for the experts’ information

$$L'(\lambda_A | y_{1, 1}, \ldots, y_{m, s}) \propto \prod\limits_{i = 1}^m \prod\limits_{j = 1}^s \bigg[\lambda_A^{y_{i, j}} {\rm exp} \bigg\{ - \lambda_A \dfrac{R_A}{R_{OA_i}} \bigg\} \bigg]^{a_0}$$
(12.5)

and combine it with the initial noninformative prior distribution (12.3) by applying Bayes’ theorem. The likelihood (12.5) provides information about the parameter through the experts’ expected scores, which are treated like data, i.e., we assume them to follow a Poisson distribution.

Analogously, the prior distribution of \(\lambda_{B}\) is

$$\pi \left(\lambda_B | \mathcal{D}_0 \right) \propto \lambda_B^{a_0 \sum\limits_{i=1}^{m} \sum\limits_{j=1}^{s} z_{i, j} +\delta_{0}-1} {\rm exp}\bigg\{ -\left(a_0 s \sum\limits_{i=1}^{m} \frac{R_B}{R_{OB_i}}+\beta_{0}\right) \lambda_B \bigg\}.$$
(12.6)

It is important to note that, because of the way the experts present their guesses, the information may be contradictory. There is some literature on the elicitation of group opinions aimed at removing such inconsistencies. According to O’Hagan et al. [10], the approaches range from informal methods, such as the Delphi method [11], which encourages the experts to discuss the issue in the hope of reaching a consensus, to formal ones, such as weighted averages, opinion polling, or logarithmic opinion pools. For a review of methods of pooling expert opinions see [6].

We now present the posterior and predictive distributions. Our interest is to predict the number of goals that team A scores against team B, using all the available information (hereafter denoted by \(\mathcal{D}\)). This information originates from two sources: the experts’ expected scores and the actual scores of matches already played. So, we may be in two distinct situations: (i) we have the experts’ information but no matches have been played, and (ii) we have both the experts’ opinions and the scores of played matches.

In situation (i), we only have the experts’ information. So, from the model (12.1) and the prior distribution (12.4), it follows that the prior predictive distribution of \(X_{AB}\) is

$$X_{AB} \sim \text{NB} \left(a_{0}\sum\limits_{i=1}^{m} \sum\limits_{j=1}^{s} y_{i,j} + \delta_{0}, \frac{a_{0} s \sum\limits_{i=1}^{m} \frac{R_{A}}{R_{OA_i}}+\beta_{0}}{a_{0} s \sum\limits_{i=1}^{m} \frac{R_{A}}{R_{OA_i}} +\frac{R_{A}}{R_B}+ \beta_{0}} \right),$$
(12.7)

where \(NB(r, \gamma)\) denotes the negative binomial distribution with probability function given by

$$f(k; r, \gamma) = \dfrac{\Gamma(r + k)}{k! \Gamma(r)} (1 - \gamma)^k \gamma^r, \qquad k = 0, 1, \ldots,$$

with parameters \(r> 0\) and \(0 < \gamma < 1\).
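As a brief check of (12.7), write \(\alpha\) and \(\beta\) for the shape and rate of the gamma prior (12.4) and \(c = R_A/R_B\) for the rating ratio (these three symbols are shorthand introduced only for this derivation). Assuming \(\beta > 0\), integrating the Poisson model (12.1) against the prior gives

$$P(X_{AB} = k) = \int_0^\infty \frac{(\lambda_A c)^k e^{-\lambda_A c}}{k!} \, \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda_A^{\alpha - 1} e^{-\beta \lambda_A} \, d\lambda_A = \frac{\Gamma(\alpha + k)}{k! \, \Gamma(\alpha)} \left(\frac{c}{\beta + c}\right)^{k} \left(\frac{\beta}{\beta + c}\right)^{\alpha},$$

which is \(f(k; r, \gamma)\) with \(r = \alpha\) and \(\gamma = \beta/(\beta + c)\).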

Analogously, from model (12.2) and the prior distribution (12.6), it follows that the prior predictive distribution of \(X_{BA}\) is given by

$$X_{BA} \sim \text{NB} \left(a_{0}\sum\limits_{i=1}^{m} \sum\limits_{j=1}^{s} z_{i,j} + \delta_{0}, \frac{a_{0} s \sum\limits_{i=1}^{m} \frac{R_{B}}{R_{OB_i}}+\beta_{0}}{a_{0} s \sum\limits_{i=1}^{m} \frac{R_{B}}{R_{OB_i}} +\frac{R_{B}}{R_A}+ \beta_{0}} \right).$$
(12.8)

In situation (ii), assume that team A has played \(k\) matches, the first against team \(C_1\), the second against team \(C_2\), and so on until the \(k\)th match against team \(C_k\). Suppose also that, given \(\lambda_A\), \(X_{A, C_1}, \ldots, X_{A, C_k}\) are independent Poisson random variables with parameters \(\lambda_A \frac{R_A}{R_{C_1}}, \ldots, \lambda_A \frac{R_A}{R_{C_k}}\). Hence, from model (12.1) it follows that the weighted likelihood is given by

$$\begin{aligned} L^{\mathbf{p}}(\lambda_A | \mathcal{D}) &= \prod\limits_{i=1}^{k} P\left[ X_{A, C_i} = x_A^i \right]^{p_{i}}\propto {\rm exp} \left\{-\lambda_A \sum\limits_{i = 1}^{k} p_{i} \frac{R_A}{R_{C_i}} \right\} \lambda_A^{\sum\limits_{i=1}^{k} x_A^i p_{i}},\end{aligned}$$
(12.9)

where \(x_{A}^i\) is the number of goals scored by A against the \(i\)th opponent, \(i = 1, \ldots, k\), and \(\mathbf{p} = (p_1, \ldots, p_k)\), \(0 < p_i < 1\), is the vector of fixed weights assigned to the matches in order to decrease the influence of past matches.

From the likelihood (12.9) and the prior distribution (12.4), it follows that the posterior distribution of \(\lambda_A\) is

$$\lambda_A | \mathcal{D} \sim{\rm Gamma} \left(a_0 \sum\limits_{i=1}^{m} \sum\limits_{j=1}^{s} y_{i, j} + \sum\limits_{l = 1}^{k} p_{l}x_{A}^l +\delta_{0}, \sum\limits_{l=1}^{k} p_{l}\frac{R_A}{R_{C_l}}+a_0 s \sum\limits_{i=1}^{m} \frac{R_A}{R_{OA_i}}+\beta_{0} \right),$$
(12.10)

which implies, by the model (12.1), that the posterior predictive distribution of \(X_{AB}\) is

$$X_{AB} | \mathcal{D}\sim{\rm NB} \left(a_{0}\sum\limits_{i=1}^{m} \sum\limits_{j=1}^{s} y_{i,j} + \sum_{l = 1}^{k} p_{l}x_{A}^l +\delta_{0}, \frac{\sum\limits_{l=1}^{k} p_{l}\frac{R_{A}}{R_{C_l}} +a_{0} s \sum\limits_{i=1}^{m} \frac{R_{A}}{R_{OA_i}}+\beta_{0}} {\sum\limits_{l=1}^{k} p_{l}\frac{R_{A}}{R_{C_l}} +a_{0} s \sum\limits_{i=1}^{m} \frac{R_{A}}{R_{OA_i}} +\frac{R_{A}}{R_B}+ \beta_{0}} \right).$$
(12.11)

Analogously, the posterior distribution of \(\lambda_B\) is given by

$$\lambda_B | \mathcal{D} \sim{\rm Gamma} \left(a_0 \sum\limits_{i=1}^{m} \sum\limits_{j=1}^{s} z_{i, j} + \sum\limits_{l = 1}^{k} p_{l}x_{B}^l +\delta_{0}, \sum\limits_{l=1}^{k} p_{l}\frac{R_B}{R_{D_l}}+a_0 s \sum\limits_{i=1}^{m} \frac{R_B}{R_{OB_i}} +\beta_{0}\right),$$
(12.12)

where \(x_{B}^l\) is the number of goals team B scores against the \(l\)th opponent, \(D_l, l = 1, \ldots, k\). Hence, from the model (12.2) and the posterior (12.12), it follows that the posterior predictive distribution of \(X_{BA}\) is

$$X_{BA} | \mathcal{D} \sim \text{NB} \left(a_{0}\sum\limits_{i=1}^{m} \sum\limits_{j=1}^{s} z_{i,j} + \sum_{l = 1}^{k} p_{l}x_{B}^l +\delta_{0}, \frac{\sum\limits_{l=1}^{k} p_{l}\frac{R_{B}}{R_{D_l}} +a_{0} s \sum\limits_{i=1}^{m} \frac{R_{B}}{R_{OB_i}}+\beta_{0}} {\sum\limits_{l=1}^{k} p_{l}\frac{R_{B}}{R_{D_l}} +a_{0} s \sum\limits_{i=1}^{m} \frac{R_{B}}{R_{OB_i}} +\frac{R_{B}}{R_A}+ \beta_{0}} \right).$$
(12.13)

At this point, it is important to note that the matches used to construct the prior distribution are distinct from those used in the likelihood function, that is, the matches already played have their contribution (through their final scores) included in the likelihood function but not in the prior.
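The update from the prior to the posterior (12.10) only adds weighted sums to the two gamma parameters. The following sketch shows this step; the function name `posterior_params` and the numerical inputs are hypothetical, and the prior shape and rate are assumed to have been computed beforehand from (12.4).

```python
def posterior_params(prior_shape, prior_rate, goals, rating,
                     opp_ratings, weights):
    """Gamma posterior (12.10) for one team.

    goals[l] is the number of goals scored in the l-th played match,
    opp_ratings[l] the rating of that opponent, and weights[l] the
    time-effect weight p_l of that match.
    """
    if not (len(goals) == len(opp_ratings) == len(weights)):
        raise ValueError("goals, opp_ratings and weights must align")
    # Weighted goals are added to the shape parameter.
    shape = prior_shape + sum(p * x for p, x in zip(weights, goals))
    # Weighted rating ratios are added to the rate parameter.
    rate = prior_rate + sum(p * rating / r
                            for p, r in zip(weights, opp_ratings))
    return shape, rate


# Two played matches, the older one down-weighted (illustrative values).
shape, rate = posterior_params(2.0, 1.0, goals=[1, 3], rating=1000.0,
                               opp_ratings=[1000.0, 1000.0],
                               weights=[0.5, 0.9])
print(shape, rate, shape / rate)  # last value is the posterior mean
```

The NB parameters of (12.11) then follow as `shape` and `rate / (rate + R_A / R_B)`.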

12.3 Methods

In this section, we shall consider the competition divided into seven rounds, where the first three rounds are in the group stage (first phase) and the last four in the knockout stage (second phase). The four experts were asked for their expected final scores at five distinct times: just before the beginning of the tournament and just before each of the four rounds of the knockout stage. At the beginning of the competition, the experts provided their expected final scores for all matches in the group stage at once, while in the knockout stage they provided their expected final scores only for the matches of the upcoming round. To account for the experts’ mean opinion, we chose \(a_{0} = 1/(3 \times 4) = 1/12\) for the group stage (three matches per team, four experts) and \(a_{0} = 1/4\) for the knockout stage (one match per round, four experts). With these choices, the posterior distribution of the parameter is the same as the one that would be obtained from a single observation, equal to the mean expected score, from the sampling distribution. It is important to note that the experts were selected from different sports media, in order to make their guesses as independent as possible.
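To see why this choice of \(a_0\) corresponds to a single averaged observation, substitute \(a_0 = 1/(ms)\) into the two parameters of the prior (12.4), where \(\bar{y}\) denotes the mean of the \(ms\) expected scores (a symbol introduced here for convenience):

$$a_0 \sum\limits_{i=1}^{m} \sum\limits_{j=1}^{s} y_{i,j} = \frac{1}{ms} \sum\limits_{i=1}^{m} \sum\limits_{j=1}^{s} y_{i,j} = \bar{y}, \qquad a_0 s \sum\limits_{i=1}^{m} \frac{R_A}{R_{OA_i}} = \frac{1}{m} \sum\limits_{i=1}^{m} \frac{R_A}{R_{OA_i}}.$$

The prior is then \(\text{Gamma}\big(\bar{y} + \delta_0, \frac{1}{m} \sum_{i=1}^{m} R_A/R_{OA_i} + \beta_0\big)\), which has the form obtained from one Poisson observation equal to \(\bar{y}\) whose rating ratio is the average of the \(m\) ratios. Taking \(m = 3\), \(s = 4\) gives \(a_0 = 1/12\), and \(m = 1\), \(s = 4\) gives \(a_0 = 1/4\).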

In the knockout stage, teams play an additional 30 min if they are level after the 90 min of regulation time and, if the tie persists, proceed to a penalty shootout. For the extra time, we considered a Poisson distribution with the parameter multiplied by one third, since the 30 min of extra time correspond to one third of the regulation match time (90 min). For the penalty shootout, we simulated a Bernoulli random variable with success probability proportional to the ratio of the parameter estimates (posterior means).
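A knockout match under these rules could be simulated roughly as follows. This is only a sketch under stated assumptions: it plugs the posterior means of \(\lambda_A\) and \(\lambda_B\) into Poisson distributions instead of sampling from the full predictive distribution, and it takes the shootout probability of team A to be \(\lambda_A / (\lambda_A + \lambda_B)\), which is one possible reading of the ratio mentioned above. The function names are hypothetical.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw a Poisson variate with Knuth's multiplication method
    (adequate for the small means typical of football scores)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def knockout_winner(lam_a, lam_b, r_a, r_b, rng):
    """Simulate one knockout match and return 'A' or 'B'."""
    mean_a = lam_a * r_a / r_b
    mean_b = lam_b * r_b / r_a
    goals_a = sample_poisson(mean_a, rng)
    goals_b = sample_poisson(mean_b, rng)
    if goals_a == goals_b:
        # Extra time is 30 of 90 minutes, so the means are scaled by 1/3.
        goals_a += sample_poisson(mean_a / 3.0, rng)
        goals_b += sample_poisson(mean_b / 3.0, rng)
    if goals_a != goals_b:
        return "A" if goals_a > goals_b else "B"
    # Penalty shootout. ASSUMPTION: A wins with probability
    # lam_a / (lam_a + lam_b); the exact form is not given in the text.
    total = lam_a + lam_b
    p_a = lam_a / total if total > 0 else 0.5
    return "A" if rng.random() < p_a else "B"


rng = random.Random(2010)
results = [knockout_winner(1.4, 1.1, 1200.0, 1100.0, rng)
           for _ in range(10_000)]
print(results.count("A") / len(results))
```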

Exact calculation of the probabilities is possible only for the prediction of a single match, in which case we compute them directly from the predictive distributions. Probabilities regarding qualification, winning the tournament, and similar events must be estimated by simulation, since they may involve many combinations of match results.
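For a single match, the win, draw, and loss probabilities can be obtained by summing the product of the two negative binomial predictive probabilities over all score pairs. The sketch below truncates the sums at `max_goals`, an assumed cut-off beyond which the probability mass is negligible for realistic parameters; the function names and the parameter values in the example are hypothetical.

```python
import math


def nb_pmf(k, r, gamma):
    """Negative binomial probability f(k; r, gamma) as defined in Sect. 12.2."""
    if r <= 0 or not 0.0 < gamma < 1.0:
        raise ValueError("need r > 0 and 0 < gamma < 1")
    log_p = (math.lgamma(r + k) - math.lgamma(k + 1) - math.lgamma(r)
             + k * math.log1p(-gamma) + r * math.log(gamma))
    return math.exp(log_p)


def match_probabilities(r_a, gamma_a, r_b, gamma_b, max_goals=60):
    """(P_W, P_D, P_L) for team A from the independent NB predictive
    distributions of X_AB and X_BA, truncated at max_goals."""
    pmf_a = [nb_pmf(k, r_a, gamma_a) for k in range(max_goals + 1)]
    pmf_b = [nb_pmf(k, r_b, gamma_b) for k in range(max_goals + 1)]
    win = draw = loss = 0.0
    for i, pa in enumerate(pmf_a):
        for j, pb in enumerate(pmf_b):
            if i > j:
                win += pa * pb
            elif i == j:
                draw += pa * pb
            else:
                loss += pa * pb
    return win, draw, loss


# Illustrative NB parameters (r, gamma) for the two teams.
print(match_probabilities(4.0, 0.7, 3.0, 0.75))
```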

One way to measure the quality of a prediction is the De Finetti distance [4], which is the square of the Euclidean distance between the point corresponding to the outcome and the one corresponding to the prediction. It is useful to consider the set of all possible forecasts given by the simplex

$$S = \left\{(P_W, P_D, P_{L}) \in [0, 1]^3: P_W + P_D + P_{L} = 1 \right\}.$$

Observe that the vertices \((1,0,0), (0,1,0)\), and \((0,0,1)\) of \(S\) represent the outcomes win, draw, and loss, respectively. Thus, if a prediction is \((0.2, 0.65, 0.15)\) and the outcome is a draw \((0, 1, 0)\), then the De Finetti distance is \((0.2 - 0)^2 + (0.65 - 1)^2 + (0.15 - 0)^2 = 0.185\). Also, we can associate with a set of predictions the average of its De Finetti distances, known as the De Finetti measure. Among several prediction methods, we shall consider the best to be the one with the smallest De Finetti measure.
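The De Finetti distance and measure are straightforward to compute. The following minimal sketch reproduces the worked example; the function names and the string labels for the outcomes are choices made here for illustration.

```python
def de_finetti_distance(prediction, outcome):
    """Squared Euclidean distance between a forecast (P_W, P_D, P_L)
    and the simplex vertex of the observed outcome."""
    vertices = {"win": (1, 0, 0), "draw": (0, 1, 0), "loss": (0, 0, 1)}
    target = vertices[outcome]
    return sum((p - t) ** 2 for p, t in zip(prediction, target))


def de_finetti_measure(predictions, outcomes):
    """Average De Finetti distance over a set of matches."""
    if not predictions or len(predictions) != len(outcomes):
        raise ValueError("need one outcome per prediction")
    distances = [de_finetti_distance(p, o)
                 for p, o in zip(predictions, outcomes)]
    return sum(distances) / len(distances)


# The example from the text: forecast (0.2, 0.65, 0.15), outcome draw.
print(round(de_finetti_distance((0.2, 0.65, 0.15), "draw"), 3))
```

An equiprobable forecast \((1/3, 1/3, 1/3)\) has distance \(2/3\) whatever the outcome, which gives a natural reference value for the measure.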

To assess the impact of the experts’ information on the quality of the predictions, Table 12.1 displays the relative differences (in %) of the De Finetti measure of our pool of experts, with \(a_0\) fixed at 0 and at 0.25 (denoting, respectively, the total absence of experts’ opinions and the amount of experts’ information considered in our method), relative to a fictitious pool of “perfect” experts, who always forecast the exact score of each match, with \(a_0\) fixed at 0.25.

Table 12.1 Relative differences (in %) of the De Finetti measure of our pool of experts, with \(a_0\) fixed at 0 and at 0.25, relative to a fictitious pool of perfect experts

In the initial rounds, the use of the experts’ information greatly improves prediction, with the De Finetti measure for \(a_0 = 0.25\) always closer to the results of the “perfect” experts. However, as observed data enter the model with the progress of the competition, the gain from using the experts’ information decreases. This feature is in full agreement with our initial motivation for considering experts’ information in our modeling: compensating for the shortage of objective information (data). Note that the De Finetti measure without the experts’ opinions (\(a_0 = 0\)) in the first and second rounds is much larger than the corresponding measure for an equiprobable predictor, which assigns equal probability to all outcomes. This is further evidence in favor of the usefulness of our modeling. Our model also took part in a prediction competition for the matches of the group stage of the 2010 WCT organized by the Brazilian Society of Operational Research, the World Cup 2010 Football Forecast Competition, and reached first place. The inclusion of subjective information in our modeling through the experts’ opinions was crucial for this achievement. For the knockout stage matches, the experts’ information did not improve prediction, which can be explained in part by the small difference in skill level between the teams and by the experts’ lack of confidence in the teams of Spain and the Netherlands, who defeated the traditional teams of Germany and Brazil, respectively.

12.3.1 Predictions for the Whole Tournament

We use the predictive distributions to perform a simulation of 10,000 replications of the whole competition. The purpose of this simulation is to estimate the probability that a given team wins the tournament. The probabilities are estimated by the percentage of replications in which the event of interest occurs.
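The frequency-based estimation can be illustrated on a single match, which is the building block of the tournament simulation. The sketch below draws scores from the negative binomial predictive distributions by inverse transform and reports the share of replications in which team A outscores team B. The function names, the seed, the cap `max_goals`, and the parameter values are assumptions for illustration; a full tournament run would repeat such draws over the whole fixture list.

```python
import random


def sample_nb(r, gamma, rng, max_goals=200):
    """Inverse-transform draw from f(k; r, gamma), using the recursion
    f(k + 1) = f(k) * (r + k) / (k + 1) * (1 - gamma)."""
    u = rng.random()
    k = 0
    prob = gamma ** r  # f(0; r, gamma)
    cdf = prob
    # The cap on k guards against rounding when u is very close to 1.
    while u > cdf and k < max_goals:
        prob *= (r + k) / (k + 1) * (1.0 - gamma)
        k += 1
        cdf += prob
    return k


def estimate_win_probability(nb_a, nb_b, n_runs=10_000, seed=2010):
    """Share of replications in which A scores more goals than B.

    nb_a and nb_b are the (r, gamma) pairs of the two predictive
    distributions.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_runs):
        if sample_nb(*nb_a, rng) > sample_nb(*nb_b, rng):
            wins += 1
    return wins / n_runs


print(estimate_win_probability((4.0, 0.7), (3.0, 0.75)))
```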

Considering only the four semifinalists (Spain, Netherlands, Germany, and Uruguay), we obtained, just before each round of the knockout stage, the probabilities of winning the tournament for different values of the weight \(a_0\) attached to the experts’ information and different choices for the sequence of weights \(p_i\) attached to the previously observed matches. Table 12.2 presents the results of this simulation study. Various time weighting values were considered within the range (0, 1), including some that may not make sense from a practical point of view but allow us to assess the impact of the \(p_i\)’s on the probabilities of winning the tournament. From these results, we can observe that, for a fixed value of \(a_0\), different values of the \(p_i\)’s do not alter the predictions significantly, particularly in the advanced stages of the championship. On the other hand, for fixed values of the \(p_i\)’s, changes in the value of \(a_0\) have a noticeable influence on the predictions. For instance, observe the probabilities for the Netherlands and Germany in the semifinals, and for the Netherlands and Spain in the quarterfinals.

Table 12.2 Percentage of tournament wins

12.4 Final Remarks

In this chapter, we propose a Bayesian simulation methodology for predicting match outcomes of the 2010 Football World Cup, which makes use of the FIFA ratings and experts’ opinions. The FIFA rating system is based on the performance of the teams over the previous 4 years. The drawbacks of this rating system are the large changes in team composition over such a long period and the small number of games played between teams of different continents in comparison with those played between teams of the same continent. Other measures of team strength should be considered and compared with the FIFA ratings. Moreover, the development of two ratings for each team, one for attack and another for defense, possibly via experts’ information or even FIFA documentation, could improve prediction, since there are teams with a strong defense but a weak attack and vice versa. A simple way to incorporate those abilities would be to add one parameter for each team directly to the Poisson model, as was done, for instance, in [13]. Such an extension can be seen as a direct generalization of our model and should be considered in future research in the field.

The prior distributions are updated every round, which makes the model flexible, since the experts’ opinions are influenced by all events preceding the match. The use of experts’ opinions may compensate, at least in part, for the lack of information on factors that can influence a football match during a competition, such as tactical discipline, the teams’ psychological condition, the referee, and injured or suspended players, among others.

The method may be used to calculate the win, draw, and loss probabilities for each single match, as well as to simulate the whole competition in order to estimate, for instance, the probabilities of classification in the group stage, of reaching the knockout stage or the final match, and of winning the tournament.

Moreover, the method is computationally efficient within a simulation framework, since the predictive distributions are obtained in known closed form. This allows values to be generated rapidly from the predictive distributions, so the probabilities of interest are obtained quickly.

Overall, running the Bayesian simulation methodology with different weights for the played matches and for the experts’ opinions gives a better idea of the impact of the latest matches and of the experts’ opinions on the estimated probabilities of interest, evidencing the advantage of incorporating time-effect weights for the match results. In our analysis, the weights of the experts’ opinions are fixed and known. As further work, one may consider a distinct value of \(a_0\) for each expert and round, allowing the values to change over the rounds.

One possible interpretation is that, for a particular expert, if \(a_0\) increases over the rounds, the confidence in the information given by this expert increases as well.

Alternatively, if \(a_0\) decreases, the information provided by this expert in previous rounds was not reliable. Furthermore, we can treat \(a_0\) as a random variable and use a fully hierarchical structure, specifying a parametric distribution for \(a_0\), such as a beta distribution, as suggested in [2].