1 Introduction

This paper presents a methodological illustration of a simple, cost-effective and replicable technique to estimate the homeless population on the national level. In the Netherlands, estimates have been produced for five consecutive years, and the results show an increase in the homeless population from 17.8 thousand in 2009 to 27.3 thousand in 2012. In 2013 no further increase was found. This trend is completely congruent with the development of the financial crisis in this time span. It is the first time that this promising technique to assess the homeless population has been successfully applied on a national level. This approach is highly recommendable for other countries as it saves time, reduces costs and yields reliable and plausible figures about the homeless. However, adoption of this methodology is only worthwhile if some requirements are met, including the availability of national population registers of sufficient quality, and the possibility to uniquely identify people. We will outline this new methodology and put the homeless in the context of the difficult to survey groups.

Homeless people are a hard-to-reach population: in the Netherlands, for example, they are not identifiably included in the basic municipal registration and are thus excluded from sample frames and general population statistics. As a result, they are under-represented in large scale social surveys conducted by national statistical institutes. And—even if feasible—approaching the homeless for interviews using standard sample survey methods would result in excessive non-response rates and consequently in non-response bias.

For a number of reasons, it is important to know how many people in a society are homeless. First, the number of homeless people is potentially an indicator of social exclusion within a society (Coumans and Schmeets 2014; Schmeets and Coumans 2013). An indication of the size of the homeless population can play a pivotal role in determining whether the problem will be included on a government’s policy agenda in the first place (Amore et al. 2011). Second, homeless people often have multiple health problems (physical, mental and substance abuse problems); large numbers of homeless, and especially roofless, people may pose a threat to public health. Third, homeless people may trigger feelings of insecurity among fellow-citizens and may cause nuisance in public spaces as a result of their behaviour, public drinking and drug use and/or petty crime. And fourth, homeless people have specific characteristics, particularly in terms of quality-of-life indicators such as health, lifestyle, and social participation. Because non-coverage of this specific group results in biased population statistics, an effort should be made to estimate the number of homeless people and their characteristics.

An accurate and reliable identification and classification of homeless people is also necessary to develop policies for different manifestations of homelessness and to monitor the effectiveness of such interventions (Amore et al. 2011). An estimate of roofless people may, for example, provide insight into the number of housing units needed for homeless people.

For most homeless people providing housing in combination with intensive support will result in a considerable increase in their quality of life and participation in society, as well as in their sense of safety (Busch-Geertsema 2013; Maas et al. 2012; Patterson et al. 2013). Also, from a more economic perspective, housing is a first condition to take part in education and the labour market, which will also reduce the number of people dependent on social benefits. Furthermore, by decreasing the number of roofless people the quality of public life will also improve.

Notwithstanding the importance and relevance of meaningful statistics on the homeless population for informed policy-making, an accurate and reliable estimate of the size of the total Dutch homeless population has not been available for years. Although reliable and accurate estimates of homeless populations have been constructed for several Dutch municipalities in the past years (Hulschbosch et al. 2003, 2005, 2006; Reinking et al. 2001; Van Zwieten et al. 2008; Vocks et al. 2008), the most recent estimation at national level was compiled in 2003 (De Bruin et al. 2003). This number was based on extrapolated estimates of roofless and homeless people from interviews with 95 experts of 44 municipalities, and quick scans in 38 smaller municipalities (De Bruin et al. 2003). Older estimates were established in 1989 and 2000 by prognostic studies (PLOTT; Heydendael 2000; Heydendael and Brouwers 1990; Heydendael and Brouwers 1989) and in 1987 by the SGBO/VNG, the umbrella organisation of Dutch municipalities (Van der Zwet et al. 1990), which mainly used extrapolation of municipal estimations. Also, the Federatie Opvang, an umbrella organisation for social services, provides yearly counts of the homeless population using their services. However, since their target population differs substantially from that of the present study, a comparison is not possible.

This study presents an advanced application of the capture-recapture approach (CRC) to produce updated official figures on homeless people in The Netherlands for five subsequent reference dates (January 1st of 2009, 2010, 2011, 2012 and 2013). For all dates the same method as well as the same updated registers were used, which makes this study pre-eminently useful for signalling trends for the whole population and subpopulations. The Netherlands is the first country adopting this procedure to produce such national estimates. Moreover, since 2009 the figures have been published by Statistics Netherlands on a yearly basis as the official statistics on homelessness.

This paper also illustrates how the CRC approach can be applied to assess the homeless population in terms of background characteristics. By using covariates, we are able to describe the estimated population by the levels of these covariates, giving insight into characteristics of individuals not included in any registers.

Thus the main aim of the current paper is to present a cost-effective strategy to estimate the homeless population on the national level, including information on various subpopulations. To this end, the data of 2009 were used as an illustration. Furthermore, in order to test the robustness of the selected statistical model yielding the estimate of 2009, the selected models and their resulting estimates for the years 2010–2013 are also presented.

This paper is structured as follows. Section 2 outlines the definition of homelessness and the system of services provision for the homeless in the Netherlands. Section 3 offers a literature review on the various methods for estimating homeless populations, including the CRC approach and its merits compared to other approaches. Section 4 details the registers used, as well as the CRC approach and its assumptions, and explains the process of model selection. Section 5 presents the results, and in Sect. 6 we make some concluding remarks and discuss both the application of CRC and its resulting estimates. Also, we briefly discuss some of the potential policy implications.

2 Definition of the Homeless Population in the Netherlands

How homelessness should be defined is a fundamental and persistent problem. Definitions of homelessness vary considerably across the world and only a few are based on a defined concept (Amore et al. 2011). In the Netherlands there is no broadly accepted definition of homelessness. The authorities in the Netherlands (Department of Health, Welfare and Sport, central local authorities) have not formally established a definition of homelessness for policy aims. In principle, each of the municipalities is free to determine the way in which homelessness and groups of homeless people are defined. Furthermore, continuity in handled definitions has not been guaranteed in each municipality due to policy changes after municipal elections. However, the government and local authorities increasingly acknowledge the necessity to distinguish and to assess factual homeless people, residentially homeless people and marginally housed people. In the Netherlands, this distinction was introduced in a large scale homelessness study in The Hague (Wolf et al. 2002) and was derived from homelessness research internationally (Koegel et al. 1996). In the Homelessness Action Plan (Maas and Planije 2009), implemented between 2006 and 2013 by the Dutch government together with the municipal authorities of the four largest cities (Amsterdam, The Hague, Rotterdam and Utrecht), the distinction between factually homeless people and residentially homeless people was also implemented.

Consensus also emerged in other European countries, and led to the European Typology of Homelessness and Housing Exclusion (ETHOS), which is classified on the basis of living situation. This typology, also adopted by Statistics Netherlands, consists of the four dimensions: roofless, houseless, insecure accommodation, and inadequate accommodation. The roofless and houseless dimensions together define homelessness; insecure and inadequate accommodation refer to housing exclusion (www.feantsa.org; Amore et al. 2011).

In this study we focus on roofless people, whom we define in accordance with Wolf et al. (2002) as individuals who had no permanent accommodation on the reference date. Also in line with Wolf et al. (2002), we distinguish the following categories:

  • people who sleep outdoors, either in the open air or in covered public spaces such as doorways, bicycle sheds, railway stations, shopping centres or cars;

  • people who spend the night indoors in transient accommodation run for the homeless, including emergency shelters;

  • people who sleep indoors in the homes of friends, acquaintances or relatives, without knowing where they can sleep the following night.

In the Dutch institutional setting this population is mainly served by so called low-threshold services according to their basic needs such as a need for food, shelter and a safe haven during the day or night. These shelters also include drop-in services and night shelters. In addition, workers of outreach services make contact with this marginalized category on the streets and support them in finding services that match their needs.

Although the term roofless thus refers to the specific group of the literal or factual homeless, for the sake of clarity and recognisability, throughout this paper the term ‘homeless’ will be used. Focusing on the roofless means that this study will focus on a specific subgroup of the homeless population, i.e. those who often are publicly most visible. This also implies that our study will not provide estimates of the following houseless categories within the homeless population (see ETHOS): (1) people staying in residential services, such as a homeless hostel, temporary accommodation or transitional supported accommodation; (2) people in women’s shelter; (3) people in accommodation for immigrants; (4) people due to be released from institutions (for example, penal institutions); and (5) people receiving longer-term support (due to homelessness) in supported housing settings in which clients have their own apartment (sometimes shared with others) and receive targeted support on a regular basis from professional workers.

3 Literature Review on Size Estimations of the Homeless Population

Several methods have been developed to estimate the number of homeless people, particularly at the local level. This section offers an overview of the literature on several variants of population size estimations in the field of homelessness; it does not intend to be exhaustive. This review distinguishes the following methods for estimating or counting homeless populations: (1) link tracing methods, (2) indirect estimations, (3) single-contact censuses and their evaluation, and (4) capture-recapture methods. Although this classification reflects the existing literature on this broad subject, we are aware that the distinction is fairly rough as some methods partly overlap and can be applied simultaneously to estimate homelessness.

  1.

    Link-tracing methods are well-known: they include all kinds of sampling techniques that use people from the target population as a starting point and ask them to nominate other members of their group, the so-called nominees. The nominees are in turn asked to do the same, and this is repeated until a predetermined point of saturation is reached (Spreen 1992; Thompson and Frank 2000). Specific examples of the use of this method among the homeless population are snowball sampling (Beata and Snijders 2002) and network sampling (Coumans and Spreen 2003; Spreen 1999; Spreen and Coumans 2001).

  2.

    Other techniques to estimate the size of the homeless population are more indirect in nature and comprise document analysis, analysis of administrative data, key-informant interviews, or general population surveys (Benjaminsen and Dyb 2008; Collins 2010; De Bruin et al. 2003; Fitzpatrick et al. 2012; Hombs 2011; Metraux et al. 2001; Salo and Campanelli 1991; Wood et al. 2014). A less often referred to evaluative estimation technique is that of small area estimators (Hudson and Vissing 2010; Kriegler and Berk 2007, 2010). Some studies have combined several more or less evaluative approaches or sources (Collins 2010; Beata and Snijders 2002; De Bruin et al. 2003; D’Onise et al. 2007; Fitzpatrick et al. 2012; Pointing et al. 2013), which may be more valuable than opting for one single approach (D’Onise et al. 2007). General population surveys are often applied to study life-time or 5-year prevalence of homelessness (Link et al. 1994, 1995; Ringwalt et al. 1998; Tompsett et al. 2006; Toro et al. 2007).

  3.

    A specific part of the literature concerning estimations of the homeless comprises the use of techniques to monitor and evaluate the count or enumeration of homeless people during S-night, the census in the USA and Australia (Cousineau and Ward 1992; Devine and Wright 1992; Edin 1992; Hopper 1992; Hopper et al. 2008; Stark 1992; Williams 2011; Wright and Devine 1992). Cowan et al. (1986) referred to this approach as a single-contact census. Critical observation of the enumerators, interviews with homeless people, post-count survey methods and a plant-capture methodology were most common as evaluative approaches. Ethnography is also often used as an additional source of information (Salo and Campanelli 1991).

  4.

    Finally, probably one of the best-known and most widely discussed methods to estimate the number of homeless people is the capture-recapture (CRC) approach (Beata and Snijders 2002; Berry 2007; Brent 2007; Chao et al. 2001; D’Onise et al. 2007; Félix-Medina and Thompson 2004; Fisher et al. 1994; International Working Group For Disease Monitoring and Forecasting, IWGDMF 1995; Shaw et al. 1996; Van der Heijden et al. 2009; Van der Heijden et al. 2012; Williams and Cheal 2002). This estimation technique has its origins in biology and refers to the estimation of an unobserved part of a certain population, originally populations of animals. The approach has evolved into a useful technique with applications in epidemiological research and the social sciences. The methodology has proven to be especially useful for estimating hidden populations, such as drug users and homeless people. In the field of homelessness, there are two main variants of CRC applications: (1) using two or more independent observations of the study population in their own environment, and (2) using incomplete lists with members of the study population. In both cases, assuming we have two observers or two lists, the unobserved part of the population is calculated by using the ratio between individuals observed or registered only once and individuals observed by both observers or registered on both lists. In this study, we applied CRC of linked administrative sources, which will be elaborated in the next section.

4 Methods

4.1 CRC of Linked Administrative Sources

Compared with many of the approaches discussed above, CRC of linked administrative sources has the advantage that it is cost-effective for a statistical bureau in need of a national estimate of the number of homeless. Many of the above approaches may be used successfully to provide local or regional estimates, but even at this level are time-consuming and demanding. Also it is difficult, if not impossible, to carry out a series of local surveys and generalise these to a national estimate. Another major advantage is that this approach deals with incomplete lists, which is often an evident problem using registers of this population.

However, CRC of linked administrative sources assumes that the administrative sources are of high quality. If the number of sources to be linked is two, there is also the assumption that inclusion of a homeless person in one source is independent of him/her being registered in the other. However, as we shall discuss in more detail in Sect. 4.3, if the number of sources to be linked is larger than two, this restrictive independence assumption is not necessary to find unbiased estimators, but is replaced by the less restrictive assumption that, in the case of k registers, the so-called k-factor interaction is absent. Another assumption is that the population is closed, and that there are no problems in linking the individuals in the different registers to each other. For this latter assumption each of the registers must contain the relevant information to do this, and privacy regulations must not obstruct the linking of individuals from different registers. For the Netherlands, each of the registers used contains a key variable as well as detailed information on the homeless, which is derived from the same official source, the national population register. By law, Statistics Netherlands is entitled to link the registers used.

4.2 Data and Selections

For all reference dates, i.e. January 1 of 2009–2013 the estimations were based on the same three data sources, i.e.:

  1.

    A list compiled especially for this study, consisting of individuals who, according to the official Dutch population registration, the basic municipal administration system (GBA), are residing at an address where a low-threshold service (shelter) for the homeless is located. Homeless people can use these services as a postal address, in order to receive a social benefit. To identify these individuals uniquely, we matched a list of shelter addresses with the GBA on the reference date. The addresses were provided yearly by the main umbrella organisations offering low threshold services for the homeless, i.e. the Federatie Opvang, the Army and the Centrum voor Dienstverlening (located in Rotterdam). The coverage of the initial list was an estimated 90–95 %. In addition, each consecutive year Statistics Netherlands updated this address list by internet research and contacting institutions. This list is further referred to as Shelter.

  2.

    A list of individuals between 18 and 65 years old who received income support (WWB), but did not have a permanent residence according to the official WWB registration on the reference date of January 1. Apart from not having a permanent address, there are some specific conditions to get WWB-support. First of all, one is supposed not to be able to support oneself, which means that someone does not have enough money to pay the rent, health insurance, food, and bills for electricity, gas and light. Other criteria are that the (household) income has to be below a certain level, that the applicant has the Dutch nationality and lives or stays in the Netherlands, and has a dwelling way of life. People who are expected to be temporarily homeless, or who are residentially homeless, are excluded from this arrangement. Over the years, the quality of this register has been consistent. This list is called WWB below.

  3.

    A list of individuals registered as being homeless in the National Alcohol and Drugs Information System (Ladis; source: Stichting Informatie Voorziening Zorg, SIVZ). This list is further referred to as Ladis. It has to be noted that the housing situation of individuals was recorded at the moment of their registration in the institute. This means that possibly their housing situation had been changed between registration and our reference date from homeless to having a residence or the other way around. Unfortunately, it was not possible to determine the exact size of this potential bias to our estimates.

Each person on all three registers was assigned a unique identification number based on the same Dutch population registration, the GBA, which aims to register all inhabitants in the Netherlands. This number is assigned to every citizen at birth or in the case of permanent residence. Moreover, we were also able to link for each homeless person the officially registered background characteristics according to the GBA.
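The linkage step described above can be sketched as follows. This is a minimal illustration only, assuming each register has already been reduced to a set of person identifiers; the identifiers, the variable names and the helper `cross_classify` are hypothetical and not part of the procedure used by Statistics Netherlands.

```python
from collections import Counter
from itertools import product


def cross_classify(shelter, wwb, ladis):
    """Count persons by inclusion pattern (S, W, L) across three linked registers.

    Each argument is a set of person identifiers (hypothetical here).
    Returns a dict mapping (in_S, in_W, in_L) -> count. The cell
    (False, False, False) cannot be observed and is returned as None.
    """
    everyone = shelter | wwb | ladis
    counts = Counter((p in shelter, p in wwb, p in ladis) for p in everyone)
    table = {cell: counts.get(cell, 0) for cell in product([True, False], repeat=3)}
    table[(False, False, False)] = None  # has to be estimated, not counted
    return table


# Made-up identifiers, for illustration only.
shelter = {"p1", "p2", "p3", "p4"}
wwb = {"p3", "p4", "p5"}
ladis = {"p4", "p6"}

table = cross_classify(shelter, wwb, ladis)
observed_total = sum(v for v in table.values() if v is not None)
```

Because all three lists share the same identifier, linkage reduces to set membership; the resulting table has seven observable cells and one structurally missing cell.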

Within these lists, further selections were made which are important for correct interpretation of the data. This study focuses on homeless people aged between 18 and 65. This selection was made for two reasons. First, by definition people who receive income support according to the WWB are between 18 and 65 years old (see also point 2 above). Second, the two other lists comprise hardly any people younger than 18 or older than 65 years. For the youngest category this is mainly due to Dutch policy on aid for the homeless, which has a separate circuit for homeless youngsters. Older people generally have changed street life for living in residential services, such as homeless hostels, or they stay in nursing homes for homeless people. Therefore, their number is not expected to be substantial among the factual homeless group.

Lastly, people registered at addresses at which more than one kind of service was located were not selected for this study. Otherwise the list would be biased as it would also include residentially homeless people or people living in women’s refuges. After all, we are only interested in estimating the size of the actually roofless population. The only exception to this was when people registered at these addresses were also registered as receiving income support according to the WWB. As income support is only provided under strict conditions (see above), it is plausible to assume that these people were actually roofless.

In the analysis the variables Shelter (S), WWB (W) and Ladis (L) were used to indicate the three lists.

Gender (G), age (A; 18–29/30–49/50–64 years), place of living (P; in one of the four largest cities, i.e. The Hague, Amsterdam, Rotterdam and Utrecht, vs. other) and origin (O; native Dutch/western foreign or non-western foreign) were used as covariates that enable description of the estimated population in terms of these variables (see Sect. 5.2). For the coding scheme see Table 5 of appendix.

Table 1 presents the cross-classification of the variables Shelter, WWB and Ladis. The total number of identified homeless is 5169. The cell (no, no, no) is empty by definition and its value has to be estimated. Subsequently, the identified 5169 homeless added to this estimate will yield an estimate of the total population size.

Table 1 Cross-classification of homeless included in Shelter, WWB and Ladis

4.3 CRC and the Model Selection Process

A well-known CRC technique for estimating the size of a population is to select two or more registers of this population, link the individuals in the registers, and estimate the number of individuals that occur in none of the registers (Bishop et al. 1975; Fienberg 1972; Cormack 1989; International Working Group for Disease Monitoring and Forecasting, IWGDMF 1995; Van der Heijden et al. 2012). For example, with two registers A and B, linkage gives a count of individuals in A but not in B, a count of individuals in B but not in A, and a count of individuals both in A and B. These counts form a contingency table denoted by A × B with the variable labeled A short for ‘inclusion in register A’, taking the levels ‘yes’ and ‘no’, and likewise for register B. In this table the cell ‘no, no’ has a zero count by definition, and the statistical problem is to estimate this value in the population. A population size estimate is obtained by adding this estimated count of missed individuals to the counts of individuals found in at least one of the registers.

The frequency of the missing ‘no, no’ cell can be obtained by fitting a log-linear model to the incomplete contingency table. Log-linear models describe (the logarithm of) the observed cell frequencies of the contingency table in terms of main and interaction effects of the variables in the model. To distinguish between different log-linear models we use the notation by Bishop et al. (1975). In this notation, variables that interact with each other are enclosed by a single set of square brackets, while variables that do not interact are enclosed by different sets of square brackets.

For example, consider a 2 × 2 contingency table of the registers A and B. The log-linear model [AB] for these two registers is given by

$$\log m_{ab} = \lambda + \lambda_{a}^{A} + \lambda_{b}^{B} + \lambda_{ab}^{AB} ,$$
(1)

where \(m_{ab}\) denotes the expected frequency of the cell ab, for a, b = {‘yes’, ‘no’}. The parameter λ denotes the intercept, the parameters \(\lambda_{a}^{A}\) and \(\lambda_{b}^{B}\) the main effects of A and B, and the parameter \(\lambda_{ab}^{AB}\) the interaction effect of A and B. The inclusion of \(\lambda_{ab}^{AB}\) in the model implies that the probability to be in A depends on whether the subject is in B or not, and vice versa. The model is called saturated since it has as many parameters as cell frequencies. However, because the cell \(m_{no,no}\) is not observed, model [AB] has one parameter too many, and can therefore not be estimated.

The independence model [A][B] on the other hand, as given by

$$\log \,m_{ab} = \lambda + \lambda_{a}^{A} + \lambda_{b}^{B},$$
(2)

has only three parameters, and the absence of the interaction parameter \(\lambda_{ab}^{AB}\) implies that the inclusion probabilities of the registers A and B are assumed to be independent. Note that for a 2 × 2 contingency table with one unobserved cell the model [A][B] is actually a saturated model, since it has exactly as many parameters as observed cell frequencies. After fitting this model to the three observed cell frequencies, we can use the parameter estimates to derive an estimate of the frequency of the missing ‘no, no’ cell, and hence of the total population size.
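As a brief worked illustration of this last step, the missing cell follows directly from the multiplicative structure of model (2). The notation \(n_{ab}\) for the observed counts and the numbers used below are illustrative assumptions and are not taken from the data of this study. Since \(m_{ab} = \exp (\lambda + \lambda_{a}^{A} + \lambda_{b}^{B})\), the expected frequencies satisfy

$$m_{yes,yes} \, m_{no,no} = m_{yes,no} \, m_{no,yes} .$$

Because the model is saturated for the three observed cells, the fitted values equal the observed counts, so that

$$\hat{m}_{no,no} = \frac{n_{yes,no} \, n_{no,yes}}{n_{yes,yes}}, \qquad \hat{N} = n_{yes,yes} + n_{yes,no} + n_{no,yes} + \hat{m}_{no,no} .$$

For example, with hypothetical counts \(n_{yes,yes} = 50\), \(n_{yes,no} = 150\) and \(n_{no,yes} = 100\), we obtain \(\hat{m}_{no,no} = 150 \times 100/50 = 300\) and \(\hat{N} = 50 + 150 + 100 + 300 = 600\).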

Aside from the independence assumption, the population size estimate is derived under the assumptions that in at least one of the two registers the inclusion probabilities are homogeneous (see Chao et al. 2001; Zwane et al. 2004), that the population is closed and that it is possible to perfectly link the individuals in registers A and B. However, it is generally agreed that these assumptions are unlikely to hold for human populations. We discuss two approaches that may be adopted to make the impact of possible violations less severe.

The first approach is to include covariates in the model, in particular covariates whose levels have heterogeneous inclusion probabilities for both registers (Baker 1990; see Bishop et al. 1975). For example, with the covariate X we can then expand the two-way contingency table to a three-way contingency table, and fit a log-linear model [AX][BX], as given by

$$\log \,m_{abx} = \lambda + \lambda_{a}^{A} + \lambda_{b}^{B} + \lambda_{x}^{X} + \lambda_{ax}^{AX} + \lambda_{bx}^{BX} ,$$
(3)

where the two-factor interaction parameters \(\lambda_{ax}^{AX}\) and \(\lambda_{bx}^{BX}\) denote interactions between the covariate X and the registers A and B, respectively. The restrictive independence assumption is replaced by a less restrictive assumption of independence of A and B conditional on the covariates; and sub-population size estimates are derived (one for every level of the covariates) that add up to the total population size estimate.
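The effect of conditioning on a covariate can be sketched in a few lines of Python. The sketch assumes a single categorical covariate X and uses made-up counts; the function names are hypothetical. Under model [AX][BX] the registers are independent within each level of X, so the two-register estimate can be applied per level and summed.

```python
def missing_cell(n_both, n_a_only, n_b_only):
    """Estimated 'no, no' count under independence of registers A and B."""
    return n_a_only * n_b_only / n_both


def stratified_estimate(strata):
    """Population size under model [AX][BX]: independence within each level of X.

    `strata` maps a covariate level to (n_both, n_a_only, n_b_only).
    Returns the per-level size estimates and their sum.
    """
    per_level = {}
    for level, (n_both, n_a_only, n_b_only) in strata.items():
        observed = n_both + n_a_only + n_b_only
        per_level[level] = observed + missing_cell(n_both, n_a_only, n_b_only)
    return per_level, sum(per_level.values())


# Made-up counts for two levels of a covariate X (illustration only).
strata = {"x1": (40, 60, 20), "x2": (10, 90, 80)}
per_level, total = stratified_estimate(strata)

# Ignoring X (model [A][B]) pools the counts before estimating.
pooled = tuple(sum(s[i] for s in strata.values()) for i in range(3))
pooled_total = sum(pooled) + missing_cell(*pooled)
```

In this constructed example the pooled estimate (600) is far below the stratified one (1050), which illustrates how heterogeneous inclusion probabilities can bias an estimate that ignores the covariate.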

The second approach is to include a third register C, and to analyze the three-way contingency table with log-linear models that may include one or more two-factor interactions. The saturated model is then given by

$$\log \,m_{abc} = \lambda + \lambda_{a}^{A} + \lambda_{b}^{B} + \lambda_{c}^{C} + \lambda_{ab}^{AB} + \lambda_{ac}^{AC} + \lambda_{bc}^{BC} ,$$
(4)

which in short-hand notation is expressed by [AB][AC][BC], which is the commonly used and more comprehensible notation for log linear modeling.

Note that this model allows for the pairwise dependence of registers but—given the absence of the parameter \(\lambda_{abc}^{ABC}\)—not for the three-factor interaction. Unfortunately, it is not always possible to include a third register, either because such a register is not available, or because there is no information that enables linkage of individuals in the third register to both other registers.
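For the saturated three-register model (4), the missing cell has a closed form: the absence of the three-factor interaction means that the odds ratio between A and B is the same for individuals inside and outside C. The following Python sketch applies this with made-up counts; it ignores covariates and is not the model selected in this study, and the function name is hypothetical.

```python
def missing_cell_three_lists(n):
    """Estimate the (0, 0, 0) cell under model [AB][AC][BC].

    `n` maps inclusion patterns (a, b, c), with 1 = included and 0 = not
    included, to observed counts. With seven observed cells and seven
    parameters the fitted values equal the observed counts, and equating
    the A-B odds ratio for c = 1 and c = 0 gives the expression below.
    """
    numerator = n[(1, 1, 1)] * n[(1, 0, 0)] * n[(0, 1, 0)] * n[(0, 0, 1)]
    denominator = n[(1, 1, 0)] * n[(1, 0, 1)] * n[(0, 1, 1)]
    return numerator / denominator


# Made-up counts (illustration only, not Table 1).
counts = {
    (1, 1, 1): 10,
    (1, 1, 0): 20,
    (1, 0, 1): 30,
    (0, 1, 1): 40,
    (1, 0, 0): 200,
    (0, 1, 0): 300,
    (0, 0, 1): 400,
}

missed = missing_cell_three_lists(counts)
population_estimate = sum(counts.values()) + missed
```

The estimator is undefined when one of the pairwise-overlap cells in the denominator is zero, which is one practical reason to prefer more parsimonious models when overlaps are small.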

In this study, we have a third register as well as a number of covariates at our disposal, and the assumptions underlying population size estimation can therefore be substantially relaxed. Since we have three registers, we can model pairwise dependencies between the registers by including the interaction terms \(\lambda_{ab}^{AB}\), \(\lambda_{ac}^{AC}\) and \(\lambda_{bc}^{BC}\), and testing whether these are significant. Furthermore, since we have covariates, we no longer need to assume homogeneity of inclusion probabilities. For example, we can investigate whether the inclusion probabilities of the registers Shelter and Ladis are independent, and whether the inclusion probabilities for males in each of the three registers differ from those for females. As noted above, by using covariates we are also able to give insight into characteristics of individuals not included in any of the registers.

For model selection we follow the standard approach in log-linear modeling. The goodness-of-fit of a model is evaluated by the deviance. The deviance compares the fit of the fitted model against that of the saturated model, and a non-significant deviance indicates that the model fits adequately. The deviance is chi-square distributed with degrees of freedom equal to the number of observed cell frequencies minus the number of estimated parameters. The relative fit of a model is evaluated by the AIC, which is the standard measure of fit for this kind of model testing. The AIC allows for the comparison of non-nested models; the model with the lowest AIC is to be preferred (for an example in the context of population size estimation, see Burnham and Anderson 2002).
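The two fit measures can be written out as a short Python sketch, assuming Poisson sampling for the observed cells. The helper names and the numbers are hypothetical; the AIC is computed here only up to an additive constant that is identical for all models fitted to the same table, which is sufficient for ranking models.

```python
import math


def poisson_deviance(observed, fitted):
    """Deviance of a log-linear (Poisson) model against the saturated model."""
    total = 0.0
    for o, e in zip(observed, fitted):
        if o > 0:
            total += o * math.log(o / e)
        total -= o - e
    return 2.0 * total


def degrees_of_freedom(n_observed_cells, n_parameters):
    """Number of observed cell frequencies minus number of estimated parameters."""
    return n_observed_cells - n_parameters


def aic_from_deviance(deviance, n_parameters):
    """AIC up to a constant shared by all models fitted to the same table."""
    return deviance + 2 * n_parameters


# Made-up observed and fitted frequencies (illustration only).
observed = [10, 20]
fitted = [15.0, 15.0]
dev = poisson_deviance(observed, fitted)
aic = aic_from_deviance(dev, n_parameters=1)
```

A model that reproduces the observed counts exactly has a deviance of zero; adding parameters lowers the deviance but is penalised by the term `2 * n_parameters` in the AIC.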

Although the availability of the three register variables S, W and L and the four covariates G, A, P and O allows for substantial relaxation of the assumptions, it severely complicates model selection. With seven variables the number of potential log-linear models is extremely large. In combination with the absence of a definite theoretical explanation, the use of some exploratory model selection procedure is indispensable. To this end we use the STEP function of the statistical software package R (see Hastie and Pregibon 1992; Venables and Ripley 2002). Note that we use the coding scheme as introduced in Sect. 4.2 to indicate the variables in our model (see also Table 5 of appendix).

The STEP function searches for the best model by starting with the independence model [S][W][L][G][A][P][O], and adding and deleting parameters in a stepwise fashion on the basis of the AIC, with the restriction that the saturated model is [SWGAPO][SLGAPO][WLGAPO] and not [SWLGAPO]. It is well known, however, that model selection procedures such as STEP do not guarantee finding the model with the lowest AIC. In an attempt to further improve the AIC, we performed an ad hoc stepwise model search on the basis of the model selected by STEP. The same procedure to optimize the model was used for model selection in 2010–2013. The resulting model was used to obtain a point estimate of the population size, and a 95 % confidence interval of the population size estimate was based on the parametric bootstrap (compare Van der Heijden et al. 2012).
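The idea behind a parametric bootstrap interval can be sketched in Python for the simplest case. This is a deliberately simplified stand-in that uses the two-register independence model rather than the selected log-linear model with covariates; the counts, the function names, the number of replications and the percentile rule are all assumptions made for illustration.

```python
import random


def two_list_estimate(n_both, n_a_only, n_b_only):
    """Population size estimate under the independence model [A][B]."""
    return n_both + n_a_only + n_b_only + n_a_only * n_b_only / n_both


def parametric_bootstrap_ci(n_both, n_a_only, n_b_only, reps=500, level=0.95, seed=1):
    """Percentile interval from resampling the fitted model (simplified sketch)."""
    rng = random.Random(seed)
    n_hat = two_list_estimate(n_both, n_a_only, n_b_only)
    size = round(n_hat)
    # Fitted cell probabilities; the remainder is the unobserved cell.
    p_both = n_both / n_hat
    p_a = n_a_only / n_hat
    p_b = n_b_only / n_hat
    estimates = []
    for _ in range(reps):
        both = a_only = b_only = 0
        for _ in range(size):
            u = rng.random()
            if u < p_both:
                both += 1
            elif u < p_both + p_a:
                a_only += 1
            elif u < p_both + p_a + p_b:
                b_only += 1
            # otherwise the simulated person is in neither register
        if both == 0:
            continue  # estimator undefined for this replicate
        estimates.append(two_list_estimate(both, a_only, b_only))
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * len(estimates))]
    upper = estimates[int((1 + level) / 2 * len(estimates)) - 1]
    return n_hat, lower, upper


# Made-up counts (illustration only).
point, lower, upper = parametric_bootstrap_ci(50, 150, 100)
```

Each replicate simulates a complete population from the fitted model, discards the unobserved cell, and re-estimates the population size; the spread of these re-estimates approximates the sampling variability of the point estimate.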

5 Results

5.1 Model Selection

Table 2 shows the results of our model search. Model 1 is the independence model. As expected it does not fit the data, with a deviance statistic \(X^{2} = 1653\), \(df = 242\), \(p < .001\). Model 2 is the model selected by the STEP procedure. It fits the data well, \(X^{2} = 175\), \(df = 198\), \(p = .88\), but yields an unrealistically high population size estimate of 31,415 and an excessively large confidence interval, ranging from 23,682 to 47,948. In an effort to explain this awkward result, we found that the interaction term [WL] is largely responsible for the high estimate. We suspect that the interaction term [WL] is either due to sampling error, or that it is part of the more general interaction term [SWL]. In the former case, we would expect that there is a model without the term [WL] that fits the data almost equally well. In the latter case, the omission of the three-factor interaction parameter \(\lambda_{swl}^{SWL}\) might have resulted in a biased estimate. Unfortunately, this cannot be verified, since the parameter \(\lambda_{swl}^{SWL}\) cannot be estimated: the frequency of the cell for people who are in none of the three registers is unobserved.

Table 2 Model selection in 2009

An alternative model was found by rerunning the STEP procedure with the restriction that the selected models should not include the interaction term [WL]. This resulted in model 3 of Table 2. This model also fits the data well, \(X^{2} = 180\), \(df = 196\), \(p = .79\), and yields a population size estimate of 17,584. Our own model search entailed some small adjustments to model 3. The most important refinements include the introduction of the four-factor interaction terms [SWGP], [WOGP] and [WGAP] and the three-factor interaction terms [LGA] and [GAO], and the elimination of the three-factor interaction [SGA]. The resulting model 4 has more parameters than model 3, but is to be preferred due to its lower AIC. It yields a slightly higher population size estimate of 17,767. Its AIC is also very close to that of the rejected model 2 (285.6 vs. 282.8), which seems to support our suggestion that the interaction term [WL] in model 2 might be due to sampling error.

Table 3 presents the selected models, the estimates and their associated confidence intervals for the reference dates of January 1st 2010–2013, using the same method as well as the same updated registers. These results show that the models of the subsequent years are very similar to the finally selected model of the year 2009 (model 4, Table 2). The main differences are that in 2010 and 2011 the interaction term [GAP], indicating the interaction between gender, age and place of living, was included, and that in 2012 and 2013 the interaction between origin and place of living, expressed by [OP], was adopted in the model. The models for 2010 and 2011 are identical, as are those for 2012 and 2013. This suggests that the model for this population estimation is fairly robust, which is an important indicator of its reliability and supports the trustworthiness of the resulting estimates.

Table 3 Selected models in 2010–2013

The estimated homeless population increased from about 18,000 in 2009 to about 27,000 in 2012. Although there seems to be a considerable increase over the years, the accompanying confidence intervals overlap for consecutive years and the only significant difference was found between 2009 and 2012.

5.2 Findings

We describe model 4, as selected for the reference date of January 1 2009, in some more detail. Based on 5169 registered homeless, we estimate that there are 12,589 homeless people who are not included in any of the three registers. Together, this yields the estimate of 17,767 homeless in the Netherlands, with a 95 % confidence interval ranging from 15,601 to 21,225. In this model, two pair-wise interactions between the registers are included, i.e. between Shelter (S) and WWB (W), and between Shelter (S) and Ladis (L). The interdependence between Shelter and WWB is reflected by the four-factor interaction term [SWGP], which also includes the covariates gender (G) and location in or outside one of the four largest cities (P). The corresponding parameter estimates (see Table 4 of appendix) reveal a positive interdependence, which is strongest for homeless women living in one of the four largest Dutch cities, and weakest for men living in one of these cities. In other words, the probability of being registered in both Shelter and WWB is highest for women in the large cities and lowest for men in the large cities. Note that this interaction term is also consistently included in all models of the subsequent years 2010–2013.

The interdependence between Shelter (S) and Ladis (L) is reflected by the term [SLO], which also includes the covariate O, denoting a person’s origin. Interpretation of the corresponding parameter estimates shows a negative dependence, which is strongest for western immigrants and weakest for non-western immigrants. In the models of 2010–2013 the pairwise interdependence of being on Ladis with being on the Shelter list as expressed by [SLO] is absent.

As log-linear parameter estimates are sometimes difficult to interpret, this section also explains some of the relevant margins of the fitted values of the table of registers by covariates, based on the data and model of 2009.

As an illustration, Fig. 1 shows the estimated distribution of Dutch homeless people recorded on one of the three lists by the covariates used in the model, i.e. gender, age, origin and place of living. The distribution shows that most of the homeless are registered in the WWB and fewest are registered in Ladis. It seems plausible that many homeless people receive income support, and that only a limited percentage receives help for alcohol and drug addiction.

Fig. 1

Estimated inclusion probabilities on one of the three lists by gender, age, origin and place of living, 2009 (CBS/SIVZ). Note that these values are fitted values using the log-linear modelling estimates

Furthermore, the findings show that more men than women are registered in Shelter and WWB, which means that those lists register males better than females. All three lists seem to register the 30–49 age category better than older and younger categories. Homeless living in one of the big four cities are better registered by WWB than those living elsewhere. This is true to a lesser extent for Ladis, while Shelter shows the opposite picture: this list registers better outside ‘the big four’. Lastly, compared to native Dutch homeless and western immigrants, non-western immigrants are best registered by the WWB. For the other two lists there are no or only slight differences in the extent to which the three groups are registered.

The homeless population can be characterised by the covariates included in our model. Figure 2 shows the percentages of the homeless population compared to those of the general Dutch population at the same reference date of January 1.

Fig. 2

Estimated homeless 18–64 aged population by gender, age, and country of origin, 2009 (CBS/SIVZ). Note that these values are fitted values using the log-linear modelling estimates

Figure 2 clearly shows that, compared to the total Dutch population, males are largely overrepresented. Also people aged 30–49 years are overrepresented at the expense of the youngest age category. The category with a non-western background is over three times as large as that of the total Dutch population.

If we compare the background characteristics of homeless males and females, the results show that about 30 % of the women are between 18 and 29 years old, compared with one-fifth of homeless men (Fig. 3). Relatively many homeless men are aged between 30 and 49 years: 59 %, compared with 47 % of homeless women. Homeless women are more likely to be native Dutch than homeless men (39 % against 26 %); men account for most of the non-western homeless. There are no remarkable differences between male and female homeless in terms of their place of living: 37 % of both sexes live in the big four cities.

Fig. 3

Estimated homeless population, gender by age, place of living and origin, 2009 (CBS/SIVZ). Note that these values are fitted values using the log-linear modelling estimates

A comparison of the homeless living in one of the four large cities with those who live elsewhere reveals no differences by gender and almost none by age category (Fig. 4).

Fig. 4

Estimated homeless population, location by gender, age, and origin, 2009 (CBS/SIVZ). Note that these values are fitted values using the log-linear modelling estimates

The ethnic background of the homeless differs strongly by place of living: 58 % of homeless people in the four large cities have a non-western foreign background, compared with only around a quarter of homeless people living elsewhere.

The distributions for the subpopulations in 2010–2013 are quite similar to the distribution found in 2009 (see also Statline: http://statline.cbs.nl/Statweb/publication/?DM=SLNL&PA=80799ned&D1=0-1&D2=a&D3=a&VW=T).

6 Discussion

Recent studies have shown the Netherlands to be a high-trust country with high levels of social capital (Schmeets and Te Riele 2014). Against this background, 18 in every 10 thousand inhabitants being completely deprived of societal rights and services could be considered a rather high figure. The question may therefore arise whether this figure is plausible for a relatively small country such as the Netherlands.

Taking into account a confidence interval (CI) ranging from 15,601 to 21,225, our estimate does not seem to deviate substantially from the study of De Bruin et al. (2003), who estimated the homeless population at about 15,200 people, with a CI from 12,500 to 18,000. On the other hand, our estimate of 17,767 homeless people is lower than the estimate of Heydendael and Brouwers (1989), which was 30,000 and ranged from 26,000 to 34,000 roofless and homeless. To a lesser extent the same can be said of the estimate of Van der Zwet et al. (1990), which was 20,000, ranging from 17,500 to 34,000. However, both studies were conducted some 30 years ago, in a completely different time frame. Furthermore, there are other explanations for why the figures do not match. The first is that each study used different definitions and focused on different categories of homeless people. Our estimation also includes people who stay with friends and relatives, while the study of De Bruin et al. (2003) did not cover this homeless category. Conversely, our study did not estimate the residentially homeless, while theirs did include some residentially homeless categories. The studies by Heydendael and Brouwers (1989) and Van der Zwet et al. (1990) also covered some residentially homeless groups. The second reason for the different estimates lies in the methods used: extrapolation, in particular, may lead to overestimation, especially for smaller municipalities. Van der Zwet et al. (1990) acknowledged this problem in their research report. Overall, these differences in definitions as well as methods seriously complicate a direct comparison.

There seems to be a considerable increase of the homeless population over the years. However, as noted before, the accompanying confidence intervals overlap for consecutive years and the only significant difference found is between the 2009 and 2012 results. Furthermore, it is important to note that the estimate for 2009 is not in all respects comparable to those for 2010–2013, mainly because the quality of all three registrations has increased in the course of the years, especially that of the Shelter register and Ladis. The Shelter list has been improved by yearly updating the original list of institutions, and the quality of Ladis by including more people who were assigned a personal identification number and could therefore be linked to the GBA and thus be included in our estimation. These differences in quality complicate the interpretation of the apparent rise of this population since 2009. However, in view of the financial crisis an increase in this population would not be unrealistic. Experts from the largest relevant organisations confirm that for a few years now, they have also observed an increase in the number of certain homeless people, such as homeless people who visit night shelters or crisis centres. Apart from that, they also mention the emergence of a new category of homeless people, the “new poor”: people who, despite their low income, were just able to make ends meet, but got into difficulty when they lost their jobs or their social benefits, and consequently became homeless. Although it is hard to quantify the causes of the increase of the homeless population in exact percentages, we suspect that most of the increase is due to the crisis and that a smaller share is caused by the improved data quality. The fact that we used the same data sources as well as the same methodology over the years, and that this yielded similar or even identical statistical models, supports our confidence that the increase was not an artefact.

To explore the plausibility of our estimate further, we should compare this figure with the findings of similar European countries. However, such a comparison is not feasible. Edgar (2009) wrote an extensive review of homelessness in Europe in 2009, including an “Appendix” with the numbers of homeless people according to the ETHOS typology. For some European countries, the number of homeless people, and more specifically roofless people, is not available, or at least not available for all the categories included in this study, i.e. people sleeping rough, people in emergency accommodation, and people temporarily living with family/friends. And for countries that do provide this information, it is still questionable whether their figures are comparable with those for the Netherlands. Several countries, for example, provide figures only for some main cities (e.g. Belgium, France, Hungary, Czech Republic, Portugal, Italy). For some countries the figures were based on available bed spaces (England and Scotland), for others on survey data (Spain) or period prevalence data (Germany and Latvia), and for some countries stock data were used for the estimations. Apart from this, it is important to note that the present study also includes people without a home who are temporarily staying with friends or relatives in the roofless population, while this is not the case in the ETHOS typology. In addition to the review by Edgar (2009), Benjaminsen and Dyb (2008) estimated the homeless population in Denmark, Norway and Sweden using surveys of services and organisations in the field of homelessness. Their estimates indicate homeless rates of 1.3 per 1000 inhabitants in Sweden, 1.2 in Norway and 1.0 in Denmark. However, these estimates are based on the whole population, and taking into account that the homeless rates among youngsters and the elderly might be substantially lower, such rates will very probably be higher for those aged 18–65.
Consequently, compared to the rates found in the Scandinavian studies, the rate of 1.8 per 1000 inhabitants in the Netherlands for the 18–65 age group is fairly high, but does not seem to be out of range. Apart from the plausibility of the estimate itself, it is also relevant to compare the characteristics of the estimated population with those of other studies. Findings show that four out of five homeless people are men, over half are aged between 30 and 50 years, and more than one-third (37 %) are of non-western origin. This profile is consistent over the years and also in line with that of other studies about Dutch homeless populations. De Bruin et al. (2003) found that 87 % of the people they met during their street survey were men, that most of them were between 25 and 55 years old and that 41 % were not born in the Netherlands. Also, recent local studies in several Dutch municipalities show largely the same profile (Hulschbosch et al. 2003, 2005, 2006; Reinking et al. 2001; Van Zwieten et al. 2008; Vocks et al. 2008). Furthermore, 37 % of the homeless population are estimated to live in one of the largest cities. This matches exactly the proportion found by De Bruin et al. (2003). The absolute figures are also similar: De Bruin et al. estimated that 5667 homeless people lived in one of the four largest cities, while our findings show 5818.

One last remark we should make about the plausibility of the outcome is that our study focused on the population of literally homeless people aged between 18 and 65 years. As mentioned before, the Dutch homeless policy has a separate focus on homeless young people; most older people have left the streets and live in nursing homes or other accommodation for older formerly homeless people. Instead of roofless, older people often become residentially homeless. In 2003, De Bruin et al. reported that in their street sample survey they met almost no homeless people younger than 18 years and that the number of homeless aged 65 or older was also almost zero. However, as the younger and older homeless populations are particularly relevant for policy, future research should also attempt to estimate the size of these populations. The same is true for other subgroups this study did not take into account, such as people staying in hostels, temporary or transitional sheltered accommodation or women’s shelters, people in accommodation for immigrants, people due to be released from penal institutions and people receiving long-term support.

Since the plausibility of the results also depends on the applied methodology, a discussion of the chosen approach is indispensable. Over the last decades the CRC methodology has been used and further developed for application in a wide range of population size estimations. A recent approach proposed by Böhning et al. (2013) has extended the Chao estimator in order to take covariate information into account. The Chao estimator uses data stemming from one single source in which people can appear multiple times. For example, in the context of the homeless, this could be a sleeping address for homeless people that registers over a limited time period how many homeless appear once at this address, how many appear twice and so on. The estimator then calculates the number of homeless that appeared zero times at this address, and by adding this estimate to the number of homeless that did appear at this address, a population size estimate of the homeless population is obtained. Although this approach is fruitful for obtaining local estimates of the homeless, for example in a city or in a region, this type of data is difficult to obtain for a whole country.
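
The single-source logic described above can be sketched with the basic Chao lower-bound estimator, which estimates the number of people seen zero times as \(f_{1}^{2}/(2f_{2})\), where \(f_{1}\) and \(f_{2}\) are the numbers of people seen exactly once and exactly twice. This is a minimal sketch: the visit frequencies are hypothetical, the function name is invented for illustration, and the covariate extension of Böhning et al. (2013) is not implemented here.

```python
def chao_estimate(freq_counts):
    """Basic Chao lower-bound estimate of the population size.

    freq_counts maps k to the number of individuals observed exactly k times.
    """
    f1 = freq_counts.get(1, 0)
    f2 = freq_counts.get(2, 0)
    if f2 == 0:
        raise ValueError("estimator undefined when nobody is seen exactly twice")
    n_observed = sum(freq_counts.values())
    f0_hat = f1 ** 2 / (2 * f2)  # estimated number of people never observed
    return n_observed + f0_hat


# Hypothetical frequencies for one sleeping address over a fixed period:
# 120 people seen once, 40 twice, 15 three times, 5 four times.
visits = {1: 120, 2: 40, 3: 15, 4: 5}
print(chao_estimate(visits))  # 180 observed + 180 estimated unseen = 360.0
```

Because the estimator is a lower bound, the result is best read as a conservative figure for the population that could in principle turn up at this single source.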

Therefore, in this study a different approach was used, which provides some important insights into how to estimate a hidden population at the national level. First, it yielded a recent estimate at the national level, which had not been available for years. Second, periodically applying the same methodology to the same registers made it possible to show trends and developments over time. Third, the CRC as applied to registers is efficient in terms of both costs and time. Fourth, this approach provides valuable insights into characteristics of the homeless population in the Netherlands, including the part of that population missed by the three registers. Lastly, the unique opportunity to use the Dutch official municipal population registration (GBA) in combination with the availability of three national registers of fairly high to very high quality increased the accuracy of the findings.

Compared to estimations based on two registers, the current estimation makes use of assumptions that are not very restrictive. First, we assume that the three-factor interaction between the registers is absent. This is difficult to verify, yet it is much less restrictive than assuming that inclusion probabilities of pairs of registers are independent. Second, it is unlikely that the covariates in the model explain all the heterogeneity of the inclusion probabilities. However, it is uncertain whether adding covariates would lead to a more satisfactory solution, as this would increase the number of parameters exponentially, together with the risk of unstable and non-informative solutions. Therefore, we do not expect the probability of being included in the municipal population registration (GBA) for the people living in “mismatched” shelters to deviate substantially from that for the people living in the successfully matched locations.
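
The difference between the two-register and three-register assumptions can be made concrete with a small sketch. It ignores covariates, uses hypothetical cell counts rather than the data of this study, and relies on two textbook closed-form estimators: the Lincoln-Petersen estimator for two lists, which assumes the lists are independent, and the three-list estimator that only assumes the absence of the three-factor interaction. All names in the code are invented for illustration.

```python
# Hypothetical counts keyed by (in S, in W, in L); NOT the paper's data.
cells = {
    (1, 1, 1): 60, (1, 1, 0): 300, (1, 0, 1): 120, (0, 1, 1): 150,
    (1, 0, 0): 900, (0, 1, 0): 2400, (0, 0, 1): 700,
}


def two_list_estimate(c, a=0, b=1):
    """Lincoln-Petersen estimate from lists a and b, assuming independence."""
    n_a = sum(v for k, v in c.items() if k[a] == 1)
    n_b = sum(v for k, v in c.items() if k[b] == 1)
    n_ab = sum(v for k, v in c.items() if k[a] == 1 and k[b] == 1)
    return n_a * n_b / n_ab


def three_list_estimate(c):
    """Estimate allowing all pairwise dependencies between the lists,
    assuming only that the three-factor interaction is absent."""
    num = c[(1, 1, 1)] * c[(1, 0, 0)] * c[(0, 1, 0)] * c[(0, 0, 1)]
    den = c[(1, 1, 0)] * c[(1, 0, 1)] * c[(0, 1, 1)]
    return sum(c.values()) + num / den


print(two_list_estimate(cells))    # S and W only: 1380 * 2910 / 360 = 11155.0
print(three_list_estimate(cells))  # 4630 observed + 16800 unseen = 21430.0
```

With these made-up counts the two estimates differ by almost a factor of two, which illustrates how strongly the result can depend on whether pairwise dependence between registers is allowed.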

Another assumption is that the population is closed, i.e. that the population did not change during the period of observation. One way to comply with this assumption is to keep the period of observation as short as possible. As we used one reference date in this study for each of the three registers (1 January of each year), this assumption is met. Note that the chosen reference date itself is not relevant for the estimate as long as it is the same for all registers as well as for the date of their linkage to the municipal population registration. However, strictly speaking, the estimated population size consequently refers only to this date. This type of estimate, which is based on stock data, has been criticised because it does not take into account the duration of homelessness or transitions into and out of homelessness and therefore does not give any information about the persistence and complexity of the problem (Hulchanski 2000). Also, using stock data may raise the question of whether all people on the lists are really roofless. Acknowledging the fact that homelessness is a dynamic process, this study intends to estimate the size of the homeless population at one moment in time, and the results do not give any information about routes into or out of homelessness. However, taking into account the conditions and criteria to be included on the WWB and the Shelter list, the probability that someone is registered erroneously or for a very short period of time is small. Inclusion on Shelter or WWB almost certainly indicates rooflessness, and thus deprivation in several areas, for a longer period of time. For example, a person will not be entitled to income support for only a few days, and will not use the address of a shelter as an official postal address unnecessarily or incidentally.
For the Shelter register, we checked the influence of different selections of institutions on the estimates by including and excluding certain locations according to their potential degree of bias. To this end, we repeated all estimates with different selections of locations, i.e. (1) including all low threshold services; (2) including low threshold services that, according to a closer investigation, offer shelter to the literally homeless; (3) excluding all locations that offer shelter to more than one single target population, regardless of whether they are literally homeless or not. This check showed that the effect of using other selections was limited. As noted before, Ladis may be biased by the inclusion of people who are no longer homeless on the reference date or became homeless between the moment of registration and the reference date. In order to quantify the effect of this potential bias on our estimates, an extra effort should be made to ascertain the exact size of this shortcoming. However, this is only feasible in close cooperation with the institution that gathered the data and provided this list, the Stichting Informatie Voorziening Zorg (SIVZ).

The last assumption made in the CRC approach is that the linkage of the registers is perfect. In this study we used identification numbers based on the numbers in the municipal population registration, which aims to register all inhabitants of the Netherlands. A unique identification number is assigned to every citizen at birth or in the case of permanent residence. These numbers were used for all three sources, and as they are unique for each inhabitant of the Netherlands, data linkage can be assumed to be nearly perfect. We could not link all addresses of low threshold shelters to the addresses in the GBA. This meant that the location of the service was not registered as an address in the GBA register and thus not officially used as a place of residence. As the inclusion probability of people living at successfully matched addresses may differ from that of people at mismatched addresses, we checked the effect of removing some shelters from the address list. These analyses showed that the removal of these shelters or agencies from the list did not substantially affect the estimation.

7 Conclusion

For years, no accurate and reliable estimate of the size of the Dutch homeless population at the national level has been available. Even the Homelessness Action Plan that the Dutch cabinet and the four largest cities have implemented in recent years was based on guesstimates of local homeless populations. This study has shown that with relatively little extra effort, an accurate, reliable and replicable estimate of the size of the homeless population can be calculated, as well as a description of its characteristics.

According to the estimation presented in this study, 17.8 thousand homeless people aged 18–64 years old lived in the Netherlands on 1 January 2009. This corresponds to almost 18 homeless people per 10 thousand inhabitants in this age category at that date. To put this in some perspective, 4.2 % of the Dutch population are socially excluded, i.e. deprived of social contacts, social values, and basic rights and materially deprived (Coumans and Schmeets 2014; Schmeets and Coumans 2013). This figure is based on a general population survey, in which—as already mentioned—hardly any homeless people participated. It could be argued, therefore, that this figure should be raised to 4.4 % by adding the roofless population. We discussed the plausibility of our results and provided some methodological remarks. We also highlighted the potential value of the findings for policy.

All in all, the discussion justifies the twofold conclusion that (1) applying the CRC approach to three registers and covariates has proven to be a worthwhile venture, and (2) the resulting estimates are plausible both in the Dutch context and in the actual timeframe. This can serve as an indispensable basis for informed policy-making. These statistics not only provide insight into which specific services and interventions are necessary to meet the basic needs of the current homeless population, but also shed light on systematic longitudinal measures necessary to prevent circumstances that increase the risk of homelessness: poverty, social exclusion and non-participation (unemployment, social isolation etc.), and lack of affordable housing. The relevance of information on the characteristics of homeless people should not be underestimated, as these statistics also point to factors at the individual level, such as non-western background, that may increase the risk of social exclusion and homelessness (Coumans and Schmeets 2014).