Machine Learning Based Approach for Sustainable Social Protection Policies in Developing Societies

Mumtaz, Zahid; Whiteford, Peter

doi:10.1007/s11036-020-01696-z

Published: 07 January 2021

Volume 26, pages 159–173, (2021)

Mobile Networks and Applications

Zahid Mumtaz¹ &
Peter Whiteford¹

Abstract

Machine learning is increasingly used to inform public policy decisions; however, its application to social protection in developing societies has been largely overlooked. We employed the unsupervised machine learning K-means clustering technique to explore a big data set comprising 88 attributes and 570 instances, with the aim of better targeting households that are in urgent need of welfare from the government. The clusters formed showed common patterns of insecurity faced by the households in each cluster, including loss of income and property, unemployment, disasters and disease. We found that households in rural jurisdictions face severe insecurities compared to other localities and are in urgent need of social protection interventions. We conclude that, with the K-means clustering approach, big data (even if it is limited) can be explored effectively to better target social protection interventions in both developing and smart societies. The unsupervised machine learning technique presented in this study is efficient because it can be used by societies that face data constraints, and it can help them achieve better results in increasing the welfare of the poor.

1 Introduction

We are in an era of information revolution in which data are produced and stored in every field at an unprecedented rate [1, 2]. This gives social scientists and policy makers an opportunity to build and test theories using the latest data analysis techniques [3]. By combining social theory with computer science, we can use big data to predict, and hopefully answer, major problems faced by different societies [4]. The use of artificial intelligence (AI) in the area of public policy is relatively new [5, 6]. One AI instrument that is becoming widely popular is machine learning (ML) [7]. ML techniques emerged primarily from computer science and engineering, but their application in public policy has recently increased because of the growing availability of data, open-source software and sophisticated ML data analysis techniques [8]. Social policy is an important area of public policy that deals with poverty reduction and increasing the welfare of the population [9]. It aims to improve the well-being and livelihoods of those who are disadvantaged in a society through a range of mechanisms such as social protection and the provision of health and education.^{Footnote 1} [9]. Social policy mechanisms in most high-income countries are well developed, play a vital role in combatting poverty and are considered an essential tool for economic development in these countries [10]. However, social policy outside the developed world is fragmented, and governments in developing countries face fiscal, data and capacity constraints that are a major impediment to the effective implementation of social policy instruments such as social protection [11, 12]. In addition, poor targeting of the people who need welfare, together with institutional weaknesses, is a major impediment to the successful formulation and implementation of social protection programmes in developing countries.

Pakistan is a developing country with the fifth largest population in the world. Its GDP per capita in 2018 was USD 1482 in purchasing power terms, ranking 132nd out of 189 countries and regions. Around 31% of the population are estimated to live in poverty, and its 2018 ranking on the United Nations Development Programme (UNDP) Human Development Index is 152 out of 189 countries. While there is a range of official data sources on living conditions and wellbeing in Pakistan, many of these are dated, which represents a major data constraint. Therefore, for the purpose of this study, a survey of 570 households from 14 different cities in Pakistan was conducted, which provides more in-depth data^{Footnote 2} on living conditions and their relationships to different forms of social protection. The complexity of the big data collected motivates the use of the latest data analysis techniques to explore disadvantage more comprehensively and to target it for the provision of welfare. This article therefore proposes a novel methodology for exploring this large data set using the unsupervised machine learning (UML) K-means clustering technique. It argues that applying this technique allows better targeting of the population for social protection interventions, which will not only be useful for formulating those interventions but will also help to overcome institutional weaknesses in developing countries.

The article is structured as follows. First, a review of the literature showing the application of ML techniques in various areas of public policy is presented. Second, based on the findings from the literature, the lack of application of ML techniques in the area of social protection is identified as a major gap. We then explain how this paper bridges this gap by using the UML K-means clustering technique to explore survey data, followed by a brief explanation of the concept of social protection. Third, the methodology of survey data collection used in this article is explained, along with the reasons for using the UML K-means clustering technique to explore the survey data. We then explain the UML K-means clustering technique and use it to explore the survey data. We also compare the results of the K-means clustering technique with those of the UML DBSCAN (density-based spatial clustering of applications with noise) clustering approach in order to use the best results. Fourth, the four clusters formed by the UML K-means clustering technique are explained using descriptive statistics to show the need for, and priority of, social protection interventions for the households surveyed. Finally, we present the conclusions and implications of this study for future research.

2 Use of machine learning in public policy – A review of the literature

ML algorithms such as decision trees, dimension reduction methods, K-nearest neighbours, support vector machines and penalized regression can be used to improve the effectiveness of public policies that have significant social and economic implications, and can go beyond policy management to have a theoretical impact [13,14,15]. Several studies indicate that ML techniques have been used for making informed decisions in policy areas such as improving health policy, reforming the education sector, improving tax policy and addressing climate change issues [14]. The advantage of ML methods over traditional statistical tools is that they provide new approaches to improving the estimation of causal effects, which can reduce the reliance of these estimates on modelling assumptions and thereby enhance the credibility of policy analysis. In addition, ML places great emphasis on model checking (through holdout samples and cross-validation) and model shrinkage (adjusting predictions toward the mean to reduce overfitting), making it a better approach for policy analysis [16]. The following paragraphs review studies from various areas of public policy in which ML has been used for policy analysis.

Burscher et al. [17] used an ML approach for the automatic coding of policy issues, applied it to news articles and parliamentary questions, and compared it with human annotations. The results showed that ML algorithms performed better than human coders and that generalizations can be made across contexts, with implications for methodological advances and empirical theory testing. Andini et al. [14] argue that the effectiveness of a tax rebate scheme in Italy can be improved by using ML algorithms to select the beneficiaries of the scheme. Using the ML approach to target beneficiaries helped save 29.5% (about 2 billion euro) of the funds earmarked for the scheme. Kasy [18], in a quasi-experiment, combines optimal taxation and insurance theory with ML and nonparametric Bayesian decision theory to propose a framework based on a standard social welfare function, using a data set from a health insurance experiment. When the ML algorithms were applied to the data set, the values obtained for the optimal policy choice were substantially different from those obtained using the standard statistical approach. The results obtained through ML algorithms point toward a large area of potential applications for these methods in informing policy decisions. Ballestar et al. [7] conducted a study in the area of higher education to identify the long-term effects of research conducted by university researchers, using six years of program data developed in Madrid. They designed an ML multilevel model, automated nested longitudinal clustering, to discover on whom, when and for how long the policies adopted as a result of the research have an effect. They argue that the findings of this study are relevant for government agencies and universities seeking to understand the productivity of academics working under long-term incentive-based programs and to maximize the generation of knowledge. Chalfin et al. [19], in a similar study, used data on teacher tenure decisions to show that large social welfare gains can be achieved by using ML tools to predict worker productivity.

Kleinberg et al. [15] and Ashrafian and Darzi [20] argue that the ML approach can be used to achieve the objectives and social welfare gains of health policy, such as creating the conditions that ensure good health and social care for an entire population through preventive strategies, protection from disease, promotion of healthy lifestyles, and population screening through knowledge capture. Brady et al. [21] used a large data set from the US Census Bureau to compare the performance of ML algorithms with manual classification of public health expenditures, in order to determine whether the ML approach could provide a faster and cheaper alternative. Compared with manual classification, the ML algorithms produced more accurate estimates, showing that ML is a time- and cost-saving tool for estimating public health spending in the US that public health organizations can use to evaluate the impact of evidence-based public health resource allocation. Pan et al. [22] used administrative data on 6457 women, collected by the Department of Human Services, Illinois, over a period of one year, to develop a model for adverse birth prediction and to improve upon the existing paper-based risk assessment using an ML approach. ML algorithms were developed and then compared with the paper-based risk assessment for early assessment of adverse birth risk among pregnant women, as a means of improving the allocation of social services. The ML algorithms outperformed the current paper-based risk assessment by up to 36%. It was estimated that the improvements obtained from the ML algorithms would allow 100 to 170 additional high-risk pregnant women screened for program eligibility each year to receive services that would otherwise have been unobtainable. This shows the potential for machine learning to move government agencies toward a more data-informed approach to evaluating risk and providing social services.

Benites-Lazaro et al. [23] argue that ML algorithms can be a very powerful tool for handling complex issues such as climate change and energy in a different way. Using a mixed-method approach, unsupervised probabilistic modelling was combined with discourse analysis to examine the changes in debates related to ethanol production in Brazil and its relationship with climate change and food security. The approach was useful in explaining the discourse of the various actors on climate change, ethanol and food security issues in Brazil, and the narratives of those actors over a period of ten years. Hino et al. [24] argue that public agencies aiming to enforce environmental regulation with limited resources can use ML algorithms to achieve their objectives, for example by predicting the likelihood of a facility failing a water-pollution inspection and proposing inspections for high-risk facilities. Despite all the advantages that ML provides for informed policy decisions, Athey [8] argues that ML-driven policies may deprive stakeholders of knowledge about how and why policies are made, raising issues such as transparency, interpretability, fairness and discrimination; the public should therefore be informed of the processes undertaken by public agencies.

3 Findings and gaps in the literature

As discussed in the previous section, various studies indicate the use of ML-based approaches in different areas of public policy, such as health, education, tax and climate change policy, for making and improving policy decisions. In addition, the ML approach has been combined with other interpretative research techniques, such as discourse analysis, for a more in-depth examination of a policy problem. Ballestar et al. [7] argue that for big data to achieve its full potential in policy studies, multi-disciplinary approaches are needed that build on new computational algorithms from the ML literature, but that also bring in the methods and practical learning from decades of multi-disciplinary research using empirical evidence to inform policy decisions. Although ML techniques have been applied in various areas of public policy, their application in the area of social protection, a major field of social policy in developing countries, for the better identification and targeting of populations within a country who require immediate social protection interventions has been largely overlooked. This presents a major gap in the literature. The next section highlights how this paper fills this gap, first by explaining the concept of social protection and then by proposing a methodology that uses the UML K-means clustering technique to accurately identify populations, in various regions of a country, who are in urgent need of social protection interventions.

4 Social protection and data constraints in developing countries

Poverty is a social problem, and in the absence of active redistributive governmental policies coupled with widely shared economic growth, it can span generations, giving rise to serious health, education and other societal problems [25]. Social policies are a subset of public policies that include state actions to protect the weakest members of a community in particular, as well as to respond to the social needs of all members of a society in general [9, 26]. Social protection is an important social policy tool that has been adopted by several developing countries and international donor agencies to combat poverty and increase the welfare of the poor [27, 28]. However, developing countries face financial constraints, which not only limit their capacity to fund large-scale social protection programmes but also reduce the coverage of those programmes [12]. In addition, factors such as poor targeting and lack of data availability remain a major impediment to the successful implementation of social protection programmes [29,30,31]. By applying the latest data analysis techniques, some data constraints can be overcome, which can lead to improvements in the well-being of individuals in developing countries [32].

5 Methodology for data collection and reasons for using K-means clustering technique

This article uses a cross-sectional survey data set collected as part of the first author’s PhD research.^{Footnote 3} The survey was conducted in 14 different cities in Pakistan and covered 570 households that were receiving informal assistance from religious institutions. The cities were randomly selected based on the multi-dimensional poverty index (MPI).^{Footnote 4} From every decile of the MPI, at least one city was randomly selected. Three to eight religious institutions were randomly selected from the rural and urban areas of each selected city, and four to eight households were randomly selected from the records of every religious institution for the survey. The survey questions covered household characteristics; the income and jobs or activities of household members; their assets; the risks and shocks faced by the households; the kinds and duration of formal social protection received by the households; and the kinds and duration of informal support received through sources such as family, friends, landlords, non-governmental organisations (NGOs), religious institutions and employers. There were 88 attributes (variables) against which the responses of each household were recorded. Based on the research objectives, which are to identify the dimensions of need in a developing country context and to use them to determine better targeting of social protection interventions, this study chose the UML K-means clustering technique. In addition, at this stage the purpose of the study is not to make predictions, so a UML clustering technique best suits the desired outcome. An advantage of UML clustering is that no parameters (explicit labels) need to be provided to the UML algorithms, which aims to minimize human bias in forming clusters. In contrast, other statistical software such as SPSS or STATA requires input parameters (explicit labels) in order to form clusters. It is for this reason that this study uses UML algorithms to explore a large survey data set. As far as we are aware, this is the first study in which the UML K-means clustering technique has been used to explore survey data in order to identify populations and regions within a country for social protection interventions.
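The multi-stage random selection described above (MPI decile, then city, then religious institution, then household) can be illustrated with a minimal Python sketch. It is not the authors' sampling procedure: the sampling frame is synthetic, every name in it (`multistage_sample`, `cities_by_decile` and so on) is hypothetical, it draws exactly one city per decile, and it ignores the rural and urban split for brevity.

```python
import random


def multistage_sample(cities_by_decile, institutions_by_city,
                      households_by_institution, seed=0):
    """Draw a city per MPI decile, then institutions, then households."""
    rng = random.Random(seed)
    sample = []
    for decile, cities in sorted(cities_by_decile.items()):
        # Stage 1: one randomly chosen city from each MPI decile
        # (the survey took "at least one"; one is an assumption here).
        city = rng.choice(cities)
        # Stage 2: three to eight institutions from that city.
        institutions = institutions_by_city[city]
        n_inst = min(len(institutions), rng.randint(3, 8))
        for inst in rng.sample(institutions, n_inst):
            # Stage 3: four to eight households from each institution's record.
            households = households_by_institution[inst]
            n_hh = min(len(households), rng.randint(4, 8))
            for household in rng.sample(households, n_hh):
                sample.append((decile, city, inst, household))
    return sample


# Synthetic sampling frame (assumed sizes, for illustration only).
cities_by_decile = {d: [f"city_{d}_{i}" for i in range(3)] for d in range(1, 11)}
institutions_by_city = {
    city: [f"{city}_inst_{j}" for j in range(10)]
    for cities in cities_by_decile.values() for city in cities
}
households_by_institution = {
    inst: [f"{inst}_hh_{k}" for k in range(20)]
    for insts in institutions_by_city.values() for inst in insts
}

sample = multistage_sample(cities_by_decile, institutions_by_city,
                           households_by_institution)
print(len(sample), "households selected")
```

Fixing the seed makes the draw reproducible, which is useful when a sampling design has to be documented or audited.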

5.1 K-means clustering

Clustering [33], a UML approach, determines how data are distributed in some space, a task known as density estimation. In other words, clustering is the process of grouping similar instances together based on the similarity of their features or attributes, without using training data or assigning labels to instances.^{Footnote 5} There are various ways [34] of measuring feature similarity, depending on the attribute type: cosine similarity for vector-based data, Jaccard similarity for set-based data, or Euclidean distance for point data. This article employs a Euclidean distance-based similarity measure, since the available data can only be interpreted as independent points. Clustering algorithms come in different variations [35] that can be chosen based on the desired output, the nature of the data and the experimental parameters. The clustering algorithms evaluated for this study were K-means clustering, DBSCAN (density-based spatial clustering of applications with noise), hierarchical clustering and Gaussian mixture clustering. After comprehensive analysis, K-means clustering was selected for clustering and DBSCAN for comparison. K-means clustering is also called exclusive clustering: it necessarily assigns every instance to a cluster (leaving no outliers), and an instance that is part of one cluster can never be part of another (the clusters are non-overlapping). A crucial step in K-means clustering is deciding how many clusters to form over the existing distribution of dataset instances. The Elbow method [36] was therefore used: the “inertia value” (the within-cluster sum of squared distances to the cluster centres) is computed for a range of candidate numbers of clusters, and the number beyond which additional clusters no longer reduce the inertia appreciably is taken as optimal (here, 4 clusters modelled the instances optimally). Moreover, the silhouette metric was used, which compares how close each instance is to its own cluster with how close it is to the nearest other cluster. The K-means clustering model with the maximum silhouette value is regarded as the best [37,38,39]; in this case that value was 0.44063. A detailed view of the evaluation of the different parameters of K-means clustering is provided in Table 1.
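The three similarity measures named above can be written out in a few lines of standard-library Python. This is only an illustrative sketch: the function names are chosen here, and the inputs are assumed to be equal-length numeric sequences (for the first two) or non-empty collections (for Jaccard).

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors; 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


print(euclidean_distance((0, 0), (3, 4)))
print(cosine_similarity((1, 2), (2, 4)))
print(jaccard_similarity({"tv", "bike", "land"}, {"bike", "land", "cattle"}))
```

Euclidean distance is sensitive to the scale of each attribute, so in practice survey variables measured in different units are usually standardised before distances are computed; whether and how that was done for this data set is not stated here.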

Table 1 Evaluation of Parameters for K-Means Clustering
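The kind of evaluation that Table 1 refers to, in which K-means is run for several candidate numbers of clusters and the inertia and silhouette values are compared, can be sketched as follows. This is a minimal standard-library illustration on synthetic two-dimensional points, not the authors' code or data; the function names, the random seeds, the number of restarts and the range of k are all assumptions.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k random data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an emptied cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    inertia = sum(dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
    return labels, centroids, inertia


def silhouette_score(points, labels):
    """Mean silhouette over all points (needs at least two clusters)."""
    clusters = {}
    for i, l in enumerate(labels):
        clusters.setdefault(l, []).append(i)
    scores = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(dist(p, points[j]) for j in other) / len(other)
                for l, other in clusters.items() if l != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Synthetic data: four loose groups of 30 points each (illustration only).
rng = random.Random(42)
centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
points = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
          for cx, cy in centers for _ in range(30)]

results = {}
for k in range(2, 7):
    # Keep the lowest-inertia run out of several random restarts.
    labels, _, inertia = min((kmeans(points, k, seed=s) for s in range(10)),
                             key=lambda run: run[2])
    results[k] = (inertia, silhouette_score(points, labels))
    print(k, round(inertia, 2), round(results[k][1], 3))

best_k = max(results, key=lambda k: results[k][1])
print("k with the highest silhouette:", best_k)
```

Several restarts per value of k are used because K-means can converge to a poor local optimum from an unlucky initialisation. The inertia column is what the Elbow method inspects, and the silhouette column is the value that is maximised when choosing between models.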
