Introduction

A geodemographic classification is essentially a grouping and labelling of geographical neighbourhoods, or other small areas, in terms of their social and economic characteristics. Attempts to classify areas from the characteristics of the people living there go back some way before computational approaches emerged, for example, Booth’s 1903 maps of London and Rowntree’s exploration of poverty in York in 1900. Some of these classifications have proved remarkably resilient: Orford et al. (2002) demonstrated that a poverty index constructed from Booth’s survey and maps was a better predictor of mortality rates for the over 65s for the period 1991–1995 than some contemporary poverty measures.

Singleton and Longley (2009), Longley (2012) and more recently Singleton and Spielman (2014) have critiqued the development of geodemographic systems. Longley outlines a rationale that the increasing differentiation in lifestyles has been used to justify the addition of a wide range of datasets to the census data variables routinely used to derive the classifications. He notes, however, that the additional sources can lack scientific validity, which therefore precludes scientific scrutiny of the resultant classification. Regardless of academic critique, the geodemographics industry has proved remarkably resilient, including its recent UK diversification into encouraging consumers to check their credit ratings.

As geodemographics of this type have moved towards classifying consumption and consumers, there have been parallel developments in the production of deprivation indices, which focus more on social need. The basic classifications produced in the 1980s and 1990s (e.g. Townsend, Carstairs) have also been subsumed into a deprivation industry using increasing numbers of indicators, e.g. the Index of Multiple Deprivation (England and Wales) and the Pobal HP Deprivation Index (Ireland). Again these can be used uncritically, and it is possible to see examples of these indices used as predictors for outcomes which can themselves be found in the morass of indicators used to construct the indices: something of a circular argument. We leave readers to identify these themselves. Applications through commercial products, and more worryingly academic research, have promoted a view that a single classification is suitable for an extraordinarily wide range of applications. Most existing classifications operate at a national scale. This can be problematic when a capital city does not follow the socio-economic ‘norms’ which are determined and applied nationally, as illustrated for Greater London by Singleton and Longley (2015).

A classification is generally achieved by applying a clustering algorithm such as k-means (Hartigan and Wong 1979) to a dataset of social and demographic variables computed for each of the areas. A key reason to do this is that links may be identified between these geodemographic classifications of areas and other processes. For example, Brunsdon et al. (2011) use geodemographic approaches to predict participation in higher education in the UK. Another influential motivation is that there are many commercial and marketing applications of geodemographics, for example identifying which particular neighbourhood groups are most likely to yield customers for certain products, so that marketing campaigns can then target these areas. These kinds of application have led to several commercially available geodemographic classifications - one such example being A Classification of Residential Neighbourhoods (ACORN, http://acorn.caci.co.uk) - a system produced and sold by CACI [Footnote 1]. In addition, the use of geodemographics has gained attention in the public sector (where it is gaining credibility as ‘social marketing’), for example to target areas for initiatives to encourage people to stop smoking (Tomintz et al. 2009). Indeed, the proliferation continues, so there are now products for specific sectors, e.g. ACORN Health and MOSAIC Health, available and in use within the UK. Whilst geodemographics have been subject to some critical evaluations (e.g. Feng and Flowerdew 1999; Openshaw and Wymer 1994) and methodological enhancement in the academic literature, application elsewhere has been worryingly uncritical as the commercialization potential has been quickly identified and diffused.

There are commercial geodemographic segmentation products available in Ireland. These include Data Ireland’s OGHAM product, which classifies households into 34 ‘Lifestyle and Affluence’ groups [Footnote 2]; Experian Ireland’s MOSAIC classification [Footnote 3]; and Gamma’s Inca segmentation system [Footnote 4]. There is no open source system currently available in the Republic of Ireland.

More recently attention has been focused on freely available geodemographic classifications, in particular the UK’s Output Area Classification (OAC) system produced by Vickers et al. (2005), which provides a geodemographic classification based on the 2001 UK Census. The focus here arguably moves away from market research and towards social applications, and a notable, and laudable, characteristic of OAC is that information relating to the data and clustering method used is freely available (see also Singleton and Longley 2015; Spielman and Singleton 2015). This offers a number of advantages - it ensures that others are able to scrutinise the code, adapt the approach for a different data set or different spatial units, or employ an alternative classification algorithm. In addition, many studies involve analysing the linkage between exogenous dependent variables and the geodemographic groups. However, it is necessary to know which variables were used to determine the groups to ensure that none of the dependent variables are included, and hence avoid the discovery of a misleading association.

It is in this spirit of availability and openness that the classification system discussed here has been created. The authors have produced an open geodemographic classification of the 2011 Irish Census, based on the Small Area areal units (CSO 2011), the genesis and production of which is the subject of this paper.

Methods

In Ireland, a population census is conducted every 5 years. The administrative unit designed for this was the Electoral Division (ED). However, by 2006 the populations of the EDs varied from 76 to 32,288, which had become problematic for data collection, reporting and subsequent analysis. For 2011 a new set of ‘small area’ units was commissioned, which resulted in the production of 18,488 small areas, with a median population of 240.

As the Irish Census differs from the UK ONS census in the questions asked, and the size and geography of the underlying population, our process of clustering and analysis differs from OAC - but the intention of producing an open and freely available area classification remains. Some of the features unique to our approach are:

  (i) Use of the Partitioning Around Medoids (PAM) cluster analysis algorithm (Kaufman and Rousseeuw 1987) instead of k-means. The algorithm is outlined in Appendix 1. This approach is proposed for a number of reasons. As its creators observe, the method is based on minimising an L1 metric - that is, it chooses clusters based on minimising absolute distances from cluster centres (medoids), rather than squared distances, as k-means does. This makes it more robust to outliers. Less outlier-resistant approaches tend to assign unusual observations to single-item clusters, which is not desirable here. In addition, the approach defines each cluster in terms of a representative case - an observation from the data (in this case a Small Area) that typifies the cluster. This is very helpful when attempting to describe and interpret the characteristics of each cluster.

  (ii) Use of heat maps as an approach to interpreting the clusters – these are also to be made publicly available.

  (iii) Use of a reproducible research approach - so that in addition to providing a public description of the analytical techniques and variables, the actual code and data will be made available, allowing third parties to reproduce the exact results. This also facilitates adaptation of the methodology e.g. using a different clustering method, different areal units, or updating with new data. A number of arguments for reproducibility in academic work are made, for example, by Peng (2009) and Laine et al. (2007).

An overview of the approach is illustrated as a workflow in Fig. 1; specific details will now be considered.

Fig. 1. Workflow of the approach to geodemographic classification. Blue, sharp-cornered boxes indicate computational (and therefore reproducible) activities

Choice of Variables

The aim of the exercise was to create a general-purpose classification of the Small Areas in Ireland. This raises the question of an appropriate set of variables to use as the basis for the classification. There appears to be little theoretical guidance, but one is cautioned against “the mindless approach in which numbers of variables … easily culled from census volumes … are picked over like cans on a rubbish tip” (Mather and Openshaw 1974, p.290). Early exercises in classification have useful suggestions for variables. Perusing the lists in Cullingford et al. (1975), Webber (1975), Webber (1977), and Webber and Craig (1978) allows us to identify common themes. Appendix 1 in Webber (1977) contains a list of 40 variables, which form a basis for later work by Charlton et al. (1985). Some choices in the earlier studies point to social problems which are no longer pressing; for example, Cullingford et al. (1975) include indicators for shared bathrooms and outside toilets.

Vickers et al. (2005) describe the genesis of the UK’s 2001 Output Area Classification (OAC), and for the 2011 OAC exercise the Office for National Statistics (2015a) demonstrates that a similar methodology and variable choice was used for both classifications. To enable some element of international comparison we have attempted to identify a parallel set of variables from the Irish Census. It should be noted that harmonisation of the definitions is not always straightforward. Not only are there differences in the definitions of the indicators, there can be differences in the derivation of the data: in the UK a de jure count is the basis, and in Ireland, the basis is the de facto population. The essential difference is that the UK counts the usually resident population on census night (so visitors are returned to their place of usual residence) but in Ireland the count is of the population present on census night.

There is an additional challenge in that a comparable set of areal units is required. It has been known since the early 1930s that there are scale effects on the correlation structure of variables for modifiable areal units (Gehlke and Biehl 1934). Ordnance Survey Ireland commissioned a set of ‘Small Areas’ which have a common and consistent definition, and are comparable in population size and spatial scale with the Output Areas used in the United Kingdom. The Small Areas are the finest grained spatial units for which 2011 Census of Population data are available in Ireland. It is well established that graded grains make the finest units (Homepride, 1990, personal communication).

Choices in Clustering

There are a number of additional choices concerning the clustering methodology. An approach which has been used widely, including in many of the commercial classification systems, is to provide a hierarchical classification. This may have only two layers (for example that of Charlton et al. 1985), although the UK 2001 and 2011 OACs have a three layer classification. This raises the question of methodology. One approach is to cluster the individual spatial units into a moderate number of groups. There is little theory to guide the analyst on the choice of the number of clusters, although a scree plot is often used to guide the selection. With datasets of several tens or hundreds of thousands of cases, the number that this process yields may be inconveniently large (60 or 70 will not conveniently list on a single page, although it might encapsulate some dimensions of the social structure of the study area). A smaller number of groups can be arrived at by using a hierarchical clustering procedure (Ward’s method (1963) is a frequent choice for this) using the cluster centroids (the mean vector for each cluster) as data. Vickers et al. (2005) describe a process where the hierarchy is created from the top down. The initial classification of the Output Areas was into 8 ‘Supergroups’. The members of each Supergroup are then classified into 2, 3, or 4 Groups, and finally the members of each Group are classified into Subgroups. Thus there are 76 Subgroups and 26 Groups in the 2011 OAC (ONS 2015a).

We have followed the general workflow described in Charlton et al. (1985). The data are subjected to an orthogonalising transform, and the component scores from this are used as the basis for a non-hierarchical classification into 18 clusters. The cluster centroids are then grouped into 8 classes using complete linkage. There are two differences: we do not scale the component scores by their eigenvalues, and we use a non-parametric clustering algorithm. The first component has an eigenvalue of 9.78 - there are about 10 variables which are measurements of a single dimension, so to scale the scores by 9.78 would give undue prominence to this component in the clustering and undo the effect of the component transformation.
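The second stage of this workflow, grouping cluster centres by complete linkage, can be illustrated with a minimal sketch. It is written in Python using only the standard library, purely for illustration (the published analysis is in R); the function name `complete_linkage` and the toy two-dimensional centres are assumptions and are not taken from the authors' code.

```python
import math

def complete_linkage(points, n_groups):
    """Agglomerative grouping: repeatedly merge the two groups whose
    farthest-apart members are closest, until n_groups remain."""
    groups = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: distance between the most distant pair of members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(groups) > n_groups:
        pairs = [(linkage(groups[a], groups[b]), a, b)
                 for a in range(len(groups))
                 for b in range(a + 1, len(groups))]
        _, a, b = min(pairs)          # closest pair of groups
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups

# Hypothetical cluster centres in a two-dimensional component-score space.
centres = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
groups = complete_linkage(centres, 3)
print(groups)
```

In the classification itself the same idea would reduce 18 cluster centres, measured on the retained component scores, to 8 classes.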

Interpreting the Clusters

One area of effort which should not be overlooked is that of attempting to characterise the members of the clusters. This tends to be an activity dominated by the subjective tendencies of the persons responsible. The OAC Pen Portraits are typical as outputs (ONS 2015b): each cluster is given a short verbal description of its most notable characteristics and then provided with a two or three word short title. As an example, ‘Renting rural retirement’ Output Areas are members of the ‘Ageing rural dweller’ group, which itself is a member of the ‘Rural residents’ Supergroup. It is difficult not to confuse the characteristics of the areas with the characteristics of the residents of those areas through ecological fallacy, and the short names can have a tendency towards stereotyping e.g. ‘Ageing juveniles’.

How do we identify the characteristics of a group? One technique is to compare the values of the mean vector for a cluster with the population values for the same variables. If the value of variable p for a cluster lies solidly in the upper or lower tail of the global distribution of values for that variable, then it may be taken as ‘characteristic’ in some sense. We can compute the mean vectors, and tabulate those variables in each cluster whose mean values are in the tails of their global equivalents. However, this asks the analyst to make comparisons not only between clusters but also within clusters. ONS (2015c) provides radial plots showing the relationship between the 60 variables of the classification for every individual Supergroup, Group and Subgroup with the global values. One hundred and ten pages of plots become daunting after a time, and lead us to consider whether the richness of the data reduction can be encapsulated in a single graphic.
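The tabulation described above can be sketched as follows. This is an illustrative Python example, not the authors' R code: the variable names, the toy percentages and the choice of one global standard deviation as the cut-off for 'in the tail' are all assumptions made for the sketch.

```python
from statistics import mean, pstdev

def characteristic_variables(data, labels, names, threshold=1.0):
    """For each cluster, list variables whose cluster mean lies more than
    `threshold` global standard deviations above or below the global mean."""
    columns = list(zip(*data))
    g_mean = [mean(c) for c in columns]
    g_sd = [pstdev(c) for c in columns]
    result = {}
    for cluster in sorted(set(labels)):
        rows = [row for row, lab in zip(data, labels) if lab == cluster]
        flagged = []
        for j, name in enumerate(names):
            z = (mean(r[j] for r in rows) - g_mean[j]) / g_sd[j]
            if abs(z) > threshold:
                flagged.append((name, "high" if z > 0 else "low"))
        result[cluster] = flagged
    return result

# Hypothetical percentages for eight areas in three clusters.
names = ["septic_tank_pct", "student_pct", "broadband_pct"]
data = [
    [80, 2, 40], [75, 3, 45], [85, 1, 35],   # cluster 0
    [5, 4, 80], [3, 6, 75], [4, 5, 85],      # cluster 1
    [2, 60, 90], [1, 70, 95],                # cluster 2
]
labels = [0, 0, 0, 1, 1, 1, 2, 2]
result = characteristic_variables(data, labels, names)
for cluster, flagged in result.items():
    print(cluster, flagged)
```

A cluster with no flagged variables (cluster 1 in the toy data) is one whose means sit close to the global values, which is itself informative when naming it.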

The R heatmap is one solution which is appropriate for our case. The rows represent the variables, and the columns the clusters. The stronger the shade of green that a cell is coloured, the more positively characteristic is the variable of that cluster; the stronger the shade of brown, the more negatively characteristic. The heatmap also clusters the variables which are related among the clusters, and clusters the cluster centroids themselves, providing a second tier in a hierarchy. Inspection of Fig. 2 suggests a transition from more ‘rural’ clusters on the left part of the diagram to ‘urban’ on the right.

Fig. 2. Heatmap of PAM cluster characteristics

Additionally the cluster locations can be mapped - whether the plots are static or not depends on the application. In the exercise described in this paper, the dataset consists of 18,488 spatial units. However, there is additional information we can add to aid interpretation of the plots. Recall that cluster membership for k-means requires the spatial unit to be closest to the centroid of the cluster to which it is assigned - there is a distance between the mean vector and the vector of values for the spatial unit in question. It would be unfortunate to examine a sample of spatial units which are not representative of the cluster under scrutiny. The values of the membership distance should also be mapped, either as a choropleth map, or in some other mode (for example Wood et al.’s (2012) ‘sketchy’ approach).
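A minimal Python sketch of the membership distance follows, assuming each cluster has a known representative (a centroid for k-means, a medoid for PAM). The function names and the toy coordinates are hypothetical and serve only to show how the most representative units of a cluster might be picked out before mapping.

```python
import math

def membership_distances(scores, labels, centres):
    """Distance from each unit to the representative of its own cluster."""
    return [math.dist(s, centres[lab]) for s, lab in zip(scores, labels)]

def most_typical(distances, labels, cluster, n=2):
    """Indices of the n units in `cluster` closest to its representative."""
    members = [i for i, lab in enumerate(labels) if lab == cluster]
    return sorted(members, key=lambda i: distances[i])[:n]

# Hypothetical component scores for five units in two clusters.
scores = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 0, 1, 1]
centres = {0: (0.0, 0.0), 1: (10.0, 10.0)}

distances = membership_distances(scores, labels, centres)
typical = most_typical(distances, labels, cluster=0)
print(distances, typical)
```

Units with large membership distances are the ones least safe to treat as exemplars of their cluster.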

Details of the Analysis

The PAM approach (Kaufman and Rousseeuw 1987, 1990) was applied to principal components of a number of variables derived from the 2011 Irish Census. This approach detects clusters by identifying a set of medoids (typical cases for each cluster) and assigning the other observations to clusters on the basis of the closest medoid. Here, ‘closest’ can be defined flexibly. All that is required is a distance matrix for the n observations being clustered. There are no constraints on how this distance may be defined - here it is defined by treating the first j principal components as coordinates in Euclidean space and computing distances on that basis [Footnote 5]. A characteristic of this approach is that it attempts to minimise sums of absolute distances rather than squared distances (as is the case with k-means), and as a consequence it is more robust to outlying cases than k-means and less inclined to produce classifications with very small numbers of cases – this sometimes being a consequence of the effect of outliers in k-means. Full listings of the variables and the code used to compute them (in the case of derived variables) can be found by visiting the ‘Rpubs’ web site describing this procedure (http://rpubs.com/chrisbrunsdon/14988).
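The medoid-based procedure described above can be sketched in a few lines. This is a simplified Python illustration that needs only a distance matrix: it uses a greedy BUILD step followed by first-improvement swaps, which is an assumption of the sketch rather than a reproduction of the authors' R analysis or of any particular library implementation. The toy scores are hypothetical.

```python
import math

def total_cost(dist, medoids):
    """Sum over all observations of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def pam(dist, k):
    """Simplified PAM: greedy BUILD, then SWAP until no swap lowers the cost."""
    n = len(dist)
    # BUILD: start from the most central observation, then add greedily.
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(min(candidates,
                           key=lambda c: total_cost(dist, medoids + [c])))
    # SWAP: replace a medoid by a non-medoid whenever that lowers the cost.
    improved = True
    while improved:
        improved = False
        best = total_cost(dist, medoids)
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    # Assign every observation to its closest medoid.
    labels = [min(range(k), key=lambda j: dist[i][medoids[j]])
              for i in range(n)]
    return medoids, labels

# Hypothetical component scores: two tight groups plus one outlying unit.
scores = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.3), (9.0, 9.0)]
dist = [[math.dist(a, b) for b in scores] for a in scores]
medoids, labels = pam(dist, 2)
print(medoids, labels)
```

Each medoid is an index into the data, so every cluster is summarised by an actual observation; in the toy data the outlying unit joins the nearer group rather than forming a cluster of its own.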

A justification of the use of principal components is that both PAM and k-means clustering make use of the idea of distance between different locations, in terms of the variables associated with them. The distances are defined in an m-dimensional space, where m is the number of variables measured. If the raw variables are used (even if standardized to have zero mean and a variance of one), pairs of correlated variables tend to have similar values – with the effect of increasing the weighting on the underlying cause driving both variables. The issue with this is that unintentional over-representation of certain correlated groups of variables (for example by having a large number of age-category variables) will place a spuriously high emphasis on this group in the distance metric. However, a principal components transformation (turning the m variables into the corresponding principal components) overcomes this problem, as the components are uncorrelated. The components effectively represent a set of independent underlying factors ‘driving’ the data – but each factor is allocated precisely one dimension, so the problem of unintentional over-representation is addressed.
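A small worked example may make the over-representation concrete. The Python sketch below uses hypothetical standardised values: two areas differ by one unit on an 'age' factor recorded as four perfectly correlated age-band variables, and by one unit on a single 'tenure' variable. The function name is an assumption for illustration.

```python
def share_of_squared_distance(a, b, indices):
    """Fraction of the squared Euclidean distance between a and b that is
    contributed by the variables at the given positions."""
    total = sum((x - y) ** 2 for x, y in zip(a, b))
    part = sum((a[i] - b[i]) ** 2 for i in indices)
    return part / total

# Four age-band variables (positions 0 to 3) and one tenure variable.
area_a = [1.0, 1.0, 1.0, 1.0, 1.0]
area_b = [0.0, 0.0, 0.0, 0.0, 0.0]

age_share_raw = share_of_squared_distance(area_a, area_b, [0, 1, 2, 3])

# After collapsing the age bands into a single factor of the same size,
# age and tenure contribute equally.
age_share_one_factor = share_of_squared_distance([1.0, 1.0], [0.0, 0.0], [0])
print(age_share_raw, age_share_one_factor)
```

With the raw variables the single underlying age difference accounts for four fifths of the squared distance; with one dimension per factor it accounts for half.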

An additional issue for clustering is the computation of the distances. If we are using Euclidean distances, then the angles between the axes of the multidimensional space must be π/2. If a pair of variables is correlated with correlation ρ, then the angle between the axes is given by cos⁻¹(ρ). With higher correlations, ρ approaches 1, and the influence of one of the variables should disappear from the calculation. If we assume that ρ is zero in such cases, we give one of the variables in question an undue influence in the clustering process. A principal components transform yields variables with orthogonal axes.
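The relationship between correlation and the angle between axes can be checked numerically. The following Python sketch is illustrative only; the helper names and the toy data are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def axis_angle_degrees(rho):
    """Angle between two variable axes implied by correlation rho."""
    rho = max(-1.0, min(1.0, rho))   # guard against rounding just outside [-1, 1]
    return math.degrees(math.acos(rho))

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
rho = pearson(x, y)
angle = axis_angle_degrees(rho)
print(rho, angle)
```

Uncorrelated variables give a right angle, as Euclidean distance assumes, while the angle shrinks towards zero as ρ approaches 1.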

Results

Overview of Clusters

Although the PCA approach is helpful in the reliable formation of clusters, to interpret the clusters once assigned, it is then helpful to return to the original variables. For each cluster, the cluster mean of each variable is computed, and the relative values of these are shown in the heatmap of Fig. 2 above.

Here, the blue-green shaded elements correspond to higher average values of a variable within a cluster, compared to the Irish national average. In contrast, the brown values correspond to low values. The clusters were then subjected to a hierarchical cluster analysis, that due to Ward (1963), to attempt to identify similar clusters. The resultant dendrogram is shown on the x-axis of the heatmap; this also drives the ordering of the categories on the axis (Fig. 3). Similarly, variables that are associated by being linked with similar profiles of clusters are also subject to Ward’s hierarchical clustering, with a dendrogram as seen against the y-axis, and again their ordering is determined by the dendrogram. The dendrograms convey information not only about the structure of the clusters, but also the degree of difference in the splits between groups. Divides higher up in the tree are based on greater differences. Thus, for example, the split between group 8 and the remainder is based on the greatest level of difference, since the highest branch in the tree represents this division.

Fig. 3. Broad-scale cluster naming

Descriptions of the clusters appear in Appendix 2. The dendrogram has suggested a higher order grouping. Clusters 1, 16 and 2 form a very coherent ‘rural’ group. This is suggested not only by their positions on the heatmap, but also by the relatively low population density (see Fig. 6 below). The septic tank variable is a strong discriminator for this group. Agricultural employment is a strong feature of groups 1 and 16, with 16 forming the more remote rural communities - the noticeable lack of broadband connectivity reinforces this interpretation. While a ‘rural’ cluster, members of 2 are closer to main settlements, and are characterised as reasonably well-off older residents.

Apart from the ‘Students’ group which stands out markedly in character from the other groups, the others may all be sub-divided. For example, cluster 16 is in the broad ‘rural’ group but is characterized as having particularly low broadband uptake. Similarly, in the ‘Struggling’ group, group 6 is characterized as having a higher level of Limiting Long-Term Illness (LLTI) than the other member of this broad group (group 13).

Geographical Pattern

Although all of the clusters may also be mapped, here just one (corresponding to the ‘Students’ category above, in the Dublin area) is shown as Fig. 4 below.

Fig. 4. ‘Student’ Small Areas in the Dublin Region (Dublin and Dun Laoghaire/Rathdown)

As a first pass verification, the highlighted areas in Fig. 4 correspond to the locations of universities and halls of residence in Dublin.

We can examine the variation in population density between the clusters as a further clue to geographical pattern. A histogram of population density at Small Area level reveals a long right tail. If we log the density (in residents per hectare) then we obtain the histogram in Fig. 5:

Fig. 5. Population density at Small Area level

The red line represents a density estimate. Notice the bi-modal nature of the distribution - the modes corresponding to ‘more rural’ on the left and ‘more urban’ on the right. The minimum between the modes corresponds to a population density of about 2.21 residents per hectare.
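One way to locate such a threshold is to find the lowest point between the two main peaks of the logged distribution and transform it back to the original scale. The Python sketch below does this for a hypothetical histogram; the bin centres, counts, function name and the use of natural logarithms are all assumptions, and it does not reproduce the 2.21 figure, which comes from the density estimate on the real data.

```python
import math

def antimode(centres, counts):
    """Bin centre with the lowest count between the two highest local peaks.
    Assumes the histogram has at least two interior local maxima."""
    peaks = [i for i in range(1, len(counts) - 1)
             if counts[i] >= counts[i - 1] and counts[i] >= counts[i + 1]]
    top_two = sorted(peaks, key=lambda i: counts[i], reverse=True)[:2]
    left, right = sorted(top_two)
    trough = min(range(left, right + 1), key=lambda i: counts[i])
    return centres[trough]

# Hypothetical histogram of log(residents per hectare).
centres = [-3, -2, -1, 0, 1, 2, 3, 4, 5]
counts = [5, 20, 40, 25, 10, 30, 60, 35, 8]

log_threshold = antimode(centres, counts)
threshold = math.exp(log_threshold)   # back to residents per hectare
print(log_threshold, threshold)
```

A smoothed density estimate, as used for the red line, would give a less bin-dependent answer than raw histogram counts.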

This allows us to create a boxplot of population density by cluster, with the boxes in dendrogram order, and width proportional to the square root of the number of objects in each cluster (Fig. 6). The design of the Small Areas has the constraint that they should have similar populations, so the box widths are also approximately proportional to the population in each cluster.

Fig. 6. Population density by cluster

The pecked red line in the boxplot corresponds to the ‘more rural’/‘more urban’ threshold identified in the histogram above. Clusters 1, 16 and 2 are predominantly rural in location, with cluster 16 having the lowest general level of population density - remoter rural locations. The rest of the clusters are more ‘urban’ in character, although there are some notable left tails. Clusters 8, 15, 18, and 17 are almost exclusively urban. Clusters 5, 11, 6, and 13 have some noticeable outliers in rural areas. Cluster 2, while predominantly ‘rural’, has a noticeable right tail of ‘urban’ locations.

An initial attempt at naming and characterising the clusters is provided in the Appendix. One possibility, given the open nature of this classification, may be to provide access to the heat maps and geographical maps relating to the clusters on the internet, and use some kind of crowd-sourced approach to cluster naming.

Clusters and Deprivation

It is sometimes considered that a single deprivation index will suffice to encapsulate social variation in the population. We examine the Kelly-Teljeur 2011 score (Kelly and Teljeur 2013) for the clusters in the different groups. Positive numbers on the Kelly-Teljeur Index indicate higher levels of deprivation. A multiple boxplot is shown in Fig. 7:

Fig. 7. Kelly-Teljeur deprivation scores by cluster

The boxplots are ordered according to their cluster’s position on the dendrogram in the heatmap, and Groups are alternately shaded light grey and white to aid interpretation. The pecked red line represents the median score. Of the clusters in Group A, many of the Small Areas in each cluster are below the median, although cluster 16, the ‘remoter rural communities’, has a noticeable tail of deprived Small Areas. Group B, ‘Urban poor’, shows evidence of deprivation, particularly in cluster 12, and in the case of Group E, ‘Struggling Urban Peripheries’, there is strong evidence of deprivation. Of Group H, ‘City diversity’, cluster 15, ‘Stressed inner city singles’, shows some evidence of deprivation, as does cluster 17. It would be a mistake to suggest that the Classification and Deprivation Index are substitutes for one another. The index is the result of a spatially uniform Principal Components Analysis, so each value on the index has arisen via the same spatial process. By contrast, there is a different relationship between the variables within each cluster; this is clear from the heatmap.

The Overall Pattern

There are 18,488 Small Areas. To show the complete picture on a map in the Irish National Grid or Irish Transverse Mercator projections would result in spatial units less than 1 pixel wide in the urban areas. One solution lies in a cartogram transform, and we have used Brunsdon’s getcartr package, available on Github [Footnote 6]. The resulting cartogram gives more prominence to urban areas, but it is still difficult to see the detail (Fig. 8):

Fig. 8. Cartogram showing groups of each Small Area in Ireland

We can also examine the structure of the groups in Dublin - the same colours have been used. This map is based on the Irish National Grid projection (Fig. 9).

Fig. 9. The four counties of Dublin (Dublin City, Fingal, South Dublin, and Dún-Laoghaire-Rathdown) as well as parts of Louth, Meath, Kildare, and Wicklow

Discussion and Conclusion: Applying Reproducible Research

This paper has outlined a methodology for providing a classification of Irish Small Areas based on publicly available census data and cluster analysis techniques, similar in intention to the UK’s OAC classifications. A distinct feature is that not only is the data publicly available, but also the code used to carry out the analysis. Thus, as well as a classification that may be of use in its own right, this could also be a springboard to alternative classification schemes created by modifying this code. For example, a scheme for a different set of spatial units, or one adding some extra variables could be created relatively easily by modifying this code, and possibly supplying some extra or alternative data.

There are a number of other advantages to this approach, one of which is a kind of future-proofing. Should an alternative approach to cluster analysis be proposed at some future point that is more reliable, more robust or simply more appropriate for geographical data, then this could be easily ‘grafted’ into the existing analysis template, and results compared to the current classification. More generally the need for reproducibility is becoming apparent. Data analyses are becoming more complex and make use of computationally intensive techniques, but with this there is a danger that the analyses become opaque, and characteristics that could play a key role in interpreting the outputs can become hidden inside a ‘black box’. Reproducibility admits the possibility of viewing these analyses critically, which is often not possible if only general details are outlined in a written summary as part of an academic paper or an official report.

In arguing for increased reproducible activity in the social sciences, we need to examine work which permits reproducibility, though that may not have been the original aim. One example is Kelly and Teljeur’s Deprivation Index (2013). Appendix II of their report contains descriptions and definitions of the indicators they use in sufficient detail to allow a researcher to return to their analysis and arrive, with a high degree of reliability, at their results. Whether this would be possible for, say, Cullingford et al.’s (1975) work is debatable: it took place 40 years ago; the software appears to be bespoke code; the data have been obtained from a variety of sources, and were available on a now obsolete medium: magnetic tape.

In order to ensure reproducibility, as defined earlier in this paper, a web-based document outlining the analysis is provided at the web site (http://rpubs.com/chrisbrunsdon/11732). This document contains all of the R code executed to obtain the classification, and information about data sources. The document was produced using RMarkdown by RStudio, a tool designed to facilitate reproducible research, by storing documents with embedded R code, so that reporting of results and the code used to obtain the results are integrated in the same document [Footnote 7].

A recent innovation in Ireland has been the launch of postcodes. The Eircode is a service operated by a private company. Information on the structure, design and analytical capabilities offered by the codes is limited, although Irish residents were being informed of their Eircodes in summer 2015. The postcode is an address-based code - this resolves to an individual address point for residential addresses and to building level for commercial address points. The codes have seven characters organised as 3 characters-space-4 characters. The first three characters, the ‘routing key’, refer to areas of varying sizes which appear to be unsuitable for analytical purposes (that for Mallow, P51, is over 100 km from west to east, and is cut into two parts by the area for Fermoy, P61). The last four characters are assigned randomly from a 25 character alphanumeric set. Whilst this gives in theory 25⁴ = 390,625 combinations, a subset is used to remove the possibility of unsuitable words appearing. Adjacent addresses have entirely dissimilar Eircodes.

Hence the location of the address cannot be inferred from the code. The user is required to access a database, known as the Eircode Address Database (ECAD), which contains, inter alia, the small area code for each Eircode. There is a complex pricing structure for Eircodes which contrasts strongly with the open availability of both Census of Population data and digital boundaries for the Small Areas. However, in time, acceptance and wider use of the Eircode may open opportunities for market analysis. We observe that the once closed Postal Address File in the UK is now freely available, as are tables linking Postcodes and Output Area codes - this has allowed the development of a range of private and public analytical possibilities, and it is to be hoped that Ireland may benefit from similar developments in future.