Abstract
West Papua is reportedly the second-most populous province in Indonesia. The United Nations International Children’s Emergency Fund (UNICEF) highlights Papua’s performance on selected Sustainable Development Goals (SDG) indicators compared with other provinces in the country. The data define multidimensional child poverty in terms of food, nutrition, health, education, housing, water, sanitation, and protection. Population statistics and poverty figures show that inter-provincial equity in Indonesia needs to be re-measured. In 2008, the Regional Governments of Papua and West Papua Provinces implemented a Community Empowerment Program called “PNPM RESPEK”, which provided direct community assistance of IDR 100 million per village. To determine the people’s level of understanding and perception of this program, PNPM RESPEK, in collaboration with the Central Statistics Agency, conducted an integrated PNPM RESPEK Evaluation Survey in July 2009. Based on the survey results, this paper identifies a model (pattern) of how the people of Papua and West Papua understand the program and seeks the best method to build this model through classification techniques. The data were also analyzed using unsupervised learning, namely clustering. The experimental results show that, among the classifiers, the J48 decision tree produces the highest accuracy, 97.31%, and that, among the clustering methods, hierarchical clustering provides the best accuracy. In this case, the J48 model correctly predicts, for 97.31% of the people of Papua and West Papua who received direct community assistance, whether they meet the level of understanding and perception of the PNPM RESPEK Program.
1 Introduction
Indonesia’s easternmost provinces of Papua and West Papua, generally referred to as Papua, are the country’s most violent and resource-rich areas [1]. However, health care standards are lower in West Papua than in other regions of Indonesia [2]. The World Health Organization (WHO) reports that poverty is a significant cause of ill health and a barrier to accessing health care when needed. Part of this relationship is financial: the poor cannot afford to purchase the things needed for good health, including food of sufficient quality and health care. The relationship also involves other poverty-related factors, such as a lack of information on appropriate health-promoting practices or a lack of the voice needed to make social services work for them [3].
In 2007, the Government of Indonesia launched the Mandiri National Program for Community Empowerment (PNPM), which aims to reduce poverty, strengthen the capacity of local government and community institutions, and improve local governance. In 2008, this program covered approximately 40,000 villages in Indonesia and was expected to cover nearly 80,000 villages by 2009 [4]. In line with this, the regional governments of Papua and West Papua Provinces implemented a Community Empowerment Program called “PNPM RESPEK” in 2008. RESPEK is funded by the Provincial Expenditure Budget (APBD Propinsi) and provides IDR 100 million directly to every village in the province [1]. In total, the Regional Governments of Papua and West Papua provide this direct community assistance (Indonesian: Bantuan Langsung Masyarakat) to 3,923 villages in 388 sub-districts. Meanwhile, the Ministry of Home Affairs provides more than 1,000 facilitators through PNPM [4].
The main component of PNPM is its Community-Driven Development (CDD) approach [5]. Having adopted the CDD approach, and with technical and financial assistance from the International Bank for Reconstruction and Development, the PNPM is now a national program covering all villages and cities in the country [6, 7]. To determine the level of understanding and perception of Papua and West Papua’s people towards the PNPM RESPEK Program, PNPM RESPEK, in collaboration with BPS-Statistics Indonesia, conducted the PNPM RESPEK Evaluation Survey, which was integrated into the National Socio-Economic Survey (SUSENAS) in July 2009. This research aims to identify a model (pattern) of how the people of Papua and West Papua understand the program and to find the best method for building this model through experiments with classification and clustering techniques. The primary goals of this research are to help PNPM RESPEK improve remote areas from a data perspective and to understand the principles of extracting valuable knowledge from data.
2 Methodology
Data mining is the process of extracting and identifying patterns from large sets of data to produce useful information or knowledge that could not previously be obtained by inspecting the raw data manually. Data mining is carried out using statistical methods, mathematical algorithms, artificial intelligence, or machine learning. In general, the stages of data mining include data selection, data pre-processing, data transformation, modeling, and interpretation [8].
2.1 Classification
Classification is a supervised learning technique in data mining. In supervised learning, the label of each data item is already defined. Classification assigns each item in a data set to one of a predefined set of classes or groups: a model, or classifier, is constructed from labeled data and then used to predict class labels. The goal of classification is to accurately predict the target class for each case in the data. Algorithms for classification include J48 [9] and Logistic Regression [10].
The J48 algorithm is derived from C4.5 [10]. It generates a rule-based decision tree for classification. The data are repeatedly divided into smaller subsets, and these subsets form the basis of the decisions. J48 examines the information gain that results from splitting the data on each candidate attribute [11]. Mathematically, the J48 algorithm uses the concepts of entropy and information gain (IG). The information gain ratio (IGR), which normalizes IG by the split information (SplitInfo), is the splitting criterion used to build the J48 decision tree. The IG, SplitInfo, and IGR are formulated as follows
$$ IG\left( {S,j} \right) = Entropy\left( S \right) - Entropy\left( {S|j} \right) $$(1)
$$ SplitInfo_{j} \left( S \right) = - \sum\limits_{{k = k_{0} }}^{{k_{c} }} {\frac{{\left| {S_{j} \left( k \right)} \right|}}{\left| S \right|} \log_{2} \frac{{\left| {S_{j} \left( k \right)} \right|}}{\left| S \right|}} $$(2)
$$ IGR\left( j \right) = \frac{{IG\left( {S,j} \right)}}{{SplitInfo_{j} \left( S \right)}} $$(3)
where S is a parent node, j denotes the j-th attribute of a sample x, \(Entropy\left( S \right) = - \sum\limits_{m} {p_{m} \log_{2} p_{m} }\) with \(p_{m}\) the proportion of samples in S that belong to class label m, \(Entropy\left( {S|j} \right)\) is the conditional entropy with \(Entropy\left( {S|j} \right) = \sum\limits_{{k = k_{0} }}^{{k_{c} }} {\frac{{\left| {S_{j} \left( k \right)} \right|}}{\left| S \right|}} \cdot Entropy\left( {S_{j} \left( k \right)} \right)\), and \(S_{j} \left( k \right) = \left\{ {x \in S|x_{j} = k} \right\}\).
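The quantities in (1)-(3) can be illustrated with a small Python sketch. It assumes a single categorical attribute and uses made-up toy data; the function names `entropy` and `gain_ratio` and the attribute `attended` are hypothetical and are not taken from WEKA or from the survey.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Return (IG, SplitInfo, IGR) for one categorical attribute."""
    n = len(labels)
    # Partition the labels by attribute value: S_j(k) = {x in S | x_j = k}.
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    # Conditional entropy: weighted entropy of the subsets.
    conditional = sum(len(s) / n * entropy(s) for s in subsets.values())
    ig = entropy(labels) - conditional
    split_info = -sum(
        len(s) / n * math.log2(len(s) / n) for s in subsets.values()
    )
    # Guard against division by zero when the attribute has a single value.
    igr = ig / split_info if split_info > 0 else 0.0
    return ig, split_info, igr


# Toy data (hypothetical, not from the survey): the attribute separates
# the two classes perfectly, so the gain equals the parent entropy.
attended = ["yes", "yes", "no", "no", "yes", "no"]
understood = [1, 1, 0, 0, 1, 0]
ig, split_info, igr = gain_ratio(attended, understood)
print(ig, split_info, igr)
```

A tree builder in the spirit of C4.5 would evaluate such a ratio for every candidate attribute at a node and split on the attribute with the largest value.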
Logistic regression is an approach to creating predictive models using equations that describe the relationship between two or more variables [13]. The dependent variable in logistic regression is dichotomous, that is, measured on a nominal scale with two categories such as Yes and No, Success and Failure, or High and Low [12]. These two categories are often encoded as binary-valued labels, where the correct label y is either 0 or 1 \(\left( {y^{(i)} \in \left\{ {0,1} \right\}} \right)\). Mathematically, the probability that a data sample belongs to the “Yes” class and the probability that it belongs to the “No” class are defined as follows
$$ P\left( {y = Yes|x} \right) = h_{\theta } \left( x \right) = \frac{1}{{1 + \exp \left( { - \theta^{T} x} \right)}} \equiv \sigma \left( {\theta^{T} x} \right) $$(4)
$$ P\left( {y = No|x} \right) = 1 - P\left( {y = Yes|x} \right) = 1 - h_{\theta } \left( x \right) $$(5)
where \(\sigma \left( r \right) = \frac{1}{{1 + \exp \left( { - r} \right)}}\) is the sigmoid or logistic function, and \(\theta^{T} x\) is the linear predictor, as in linear regression; the sigmoid maps it to a value \(h_{\theta } \left( x \right) \in \left[ {0,1} \right]\). The cost function for a set of training examples with binary labels \(\left\{ {\left( {x^{(i)} ,y^{(i)} } \right):i = 1,2, \ldots ,n} \right\}\), which measures how close a given \(h_{\theta }\) is to the correct output y, is expressed as below
$$ J\left( \theta \right) = - \sum\limits_{i = 1}^{n} {\left( {y^{(i)} \log \left( {h_{\theta } \left( {x^{(i)} } \right)} \right) + \left( {1 - y^{(i)} } \right)\log \left( {1 - h_{\theta } \left( {x^{(i)} } \right)} \right)} \right)} $$(6)
Substituting the definition \(h_{\theta } \left( {x^{(i)} } \right) = \sigma \left( {\theta^{T} x^{(i)} } \right)\) into (6) gives the loss function below
$$ J\left( \theta \right) = - \sum\limits_{i = 1}^{n} {\left( {y^{(i)} \log \left( {\sigma \left( {\theta^{T} x^{(i)} } \right)} \right) + \left( {1 - y^{(i)} } \right)\log \left( {1 - \sigma \left( {\theta^{T} x^{(i)} } \right)} \right)} \right)} $$(7)
Note that the smaller the value of the cost function, the better the model. Conversely, a model with a larger cost predicts the labels \(y^{(i)}\) less well.
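The sigmoid in (4) and the cost in (7) can be evaluated directly. The following Python sketch does so for two hand-picked parameter vectors on toy data; the data, the parameter values, and the names `predict_proba` and `cost` are assumptions made for illustration, and no parameter fitting is performed.

```python
import math


def sigmoid(r):
    """Logistic function sigma(r) = 1 / (1 + exp(-r))."""
    return 1.0 / (1.0 + math.exp(-r))


def predict_proba(theta, x):
    """P(y = 1 | x) = sigma(theta^T x), as in Eq. (4)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(theta, X, y):
    """Negative log-likelihood J(theta), as in Eq. (7)."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict_proba(theta, xi)
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total


# Toy data (hypothetical): the first column is a constant bias term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]

uninformative = [0.0, 0.0]   # predicts 0.5 for every sample
better = [-3.0, 2.0]         # decision boundary between x = 1 and x = 2

print(cost(uninformative, X, y))
print(cost(better, X, y))
```

The second parameter vector separates the two classes and therefore yields the smaller cost, which matches the remark that a smaller cost indicates a better model.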
2.2 Clustering
Clustering is a powerful tool in data analysis. It is used to discover the cluster structure of a data set, such that similarity is greatest within the same cluster and dissimilarity is greatest between different clusters. Cluster analysis is a branch of multivariate statistical analysis and an unsupervised learning approach in machine learning [13, 14]. Algorithms for clustering include K-Means [15], Hierarchical clustering (HCA) [16, 17], and DBSCAN [18].
K-Means
K-means is the simplest and most common clustering method, because it can cluster large amounts of data with fast and efficient computation. K-Means divides n data points in d dimensions into k clusters, where the clustering is carried out by minimizing the sum of squared distances between the data points and their cluster centers [15]. In its implementation, the K-Means method requires three parameters that are entirely user-defined, namely the number of clusters (k), the cluster initialization, and the distance measure. The objective function of K-Means is formulated as
$$ J_{K - Means} \left( {U,V} \right) = \sum\limits_{k = 1}^{c} {\sum\limits_{i = 1}^{n} {\mu_{ik} \sum\limits_{j = 1}^{d} {\left( {x_{ij} - v_{kj} } \right)^{2} } } } $$(8)
$$ s.t., \, \mu_{ik} \in \left\{ {0,1} \right\}, \, \sum\limits_{k = 1}^{c} {\mu_{ik} } = 1, \, i = 1, \ldots ,n, \, k = 1, \ldots ,c $$(9)
where \(\mu_{ik}\) indicates whether point i belongs to cluster k and \(v_{kj}\) is the j-th coordinate of the k-th cluster center. The objective function in (8) is optimized using Lagrange multipliers, which yields the updating equations for \(v_{kj}\) and \(\mu_{ik}\) as follows
$$ v_{kj} = \frac{{\sum\limits_{i = 1}^{n} {\mu_{ik} x_{ij} } }}{{\sum\limits_{i = 1}^{n} {\mu_{ik} } }} $$(10)
$$ \mu_{ik} = \begin{cases} 1, & \text{if } \sum\limits_{j = 1}^{d} {\left( {x_{ij} - v_{kj} } \right)^{2} } = \mathop {\min }\limits_{1 \le k' \le c} \sum\limits_{j = 1}^{d} {\left( {x_{ij} - v_{k'j} } \right)^{2} } \\ 0, & \text{otherwise.} \end{cases} $$(11)
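The two updating equations can be alternated in a short loop. The Python sketch below is a minimal illustration under stated assumptions: the initial centers are supplied by hand, the number of iterations is fixed, and the toy points are made up; it is not the WEKA implementation used in the experiments.

```python
def squared_distance(x, v):
    """Squared Euclidean distance between a point x and a center v."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def kmeans(points, centers, iterations=10):
    """Alternate the assignment step (Eq. 11) and the center step (Eq. 10)."""
    centers = [list(c) for c in centers]
    assignment = []
    for _ in range(iterations):
        # Eq. (11): each point joins the cluster with the nearest center.
        assignment = [
            min(range(len(centers)),
                key=lambda k: squared_distance(x, centers[k]))
            for x in points
        ]
        # Eq. (10): each center becomes the mean of its member points.
        for k in range(len(centers)):
            members = [x for x, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


# Toy data (hypothetical): two well-separated groups in two dimensions.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assignment, centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(assignment)
print(centers)
```

Because the result depends on the initial centers, practical implementations usually repeat the procedure from several initializations.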
Hierarchical clustering
Hierarchical clustering (HCA) groups data through a hierarchical chart [16]. In the initial step, hierarchical clustering identifies the data points with the closest distance and merges them into one cluster. It then calculates the distances between the resulting clusters [17]. There are seven hierarchical clustering methods: single link, complete link, group average link, McQuitty’s method, median, centroid, and Ward’s method. These seven methods each define a mapping from data sets to hierarchies by using different coefficients in the Lance-Williams dissimilarity update formula. The iteration itself is the same for all seven methods: merging continues until all data points are connected. Mathematically, if points x and y are agglomerated into cluster \(x \cup y\), then the Lance-Williams dissimilarity update formula is expressed as below:
$$ d\left( {x \cup y,k} \right) = \alpha_{x} d\left( {x,k} \right) + \alpha_{y} d\left( {y,k} \right) + \beta d\left( {x,y} \right) + \gamma \left| {d\left( {x,k} \right) - d\left( {y,k} \right)} \right| $$(12)
where the coefficients \(\alpha_{x} ,\alpha_{y} ,\beta ,\gamma\) define the agglomerative criterion, and \(\alpha_{y}\) is defined in the same way as \(\alpha_{x}\), with index y in place of index x. In subscript notation, the Lance-Williams dissimilarity update in (12) can be written as follows
$$ d_{x \cup y,k} = \alpha_{x} d_{xk} + \alpha_{y} d_{yk} + \beta d_{xy} + \gamma \left| {d_{xk} - d_{yk} } \right| $$(13)
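As a small worked example of the update formula, the Python sketch below evaluates it with the standard coefficient choices for single link and complete link, for which the update reduces to the minimum and the maximum of the two distances, respectively. The function name `lance_williams` and the distance values are illustrative assumptions.

```python
def lance_williams(d_xk, d_yk, d_xy, alpha_x, alpha_y, beta, gamma):
    """Dissimilarity between the merged cluster (x U y) and cluster k."""
    return (alpha_x * d_xk + alpha_y * d_yk
            + beta * d_xy + gamma * abs(d_xk - d_yk))


# Standard coefficient choices for two of the seven methods.
SINGLE_LINK = dict(alpha_x=0.5, alpha_y=0.5, beta=0.0, gamma=-0.5)
COMPLETE_LINK = dict(alpha_x=0.5, alpha_y=0.5, beta=0.0, gamma=0.5)

# Toy distances (hypothetical): d(x,k) = 2, d(y,k) = 5, d(x,y) = 1.
d_single = lance_williams(2.0, 5.0, 1.0, **SINGLE_LINK)
d_complete = lance_williams(2.0, 5.0, 1.0, **COMPLETE_LINK)

print(d_single)    # equals min(2.0, 5.0)
print(d_complete)  # equals max(2.0, 5.0)
```

The other methods differ only in the coefficients, some of which depend on the sizes of the merged clusters.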
Density-Based Spatial Clustering of Application with Noise
Density-Based Spatial Clustering of Application with Noise (DBSCAN) is a density-based clustering algorithm. DBSCAN separates high-density regions, which form the clusters, from low-density regions. Working with data in d dimensions, the algorithm iteratively counts the number of data points that lie close to each other [18]. DBSCAN relies on two parameters, called MinPts and Epsilon (\(\varepsilon - neighborhood\)). The \(\varepsilon - neighborhood\) is a distance measure used to locate the points, or to check the density, in the neighbourhood of any point x, formulated as
$$ N_{\varepsilon } \left( x \right) = \left\{ {y \in D|\left\| {y - x} \right\| \le \varepsilon } \right\} $$(14)
where D denotes the data set. A point x lies inside a cluster (a core point) if the number of points in its \(\varepsilon - neighborhood\) \(N_{\varepsilon } (x)\) is greater than or equal to the minimum number of neighbors v (MinPts), denoted as \(\left| { \, N_{\varepsilon } (x) \, } \right| \ge v\). A point x is directly density-reachable from a point y with respect to the \(\varepsilon - neighborhood\) and the minimum number of points required to form a dense region if \(x \in N_{\varepsilon } \left( y \right)\) and \(\left| { \, N_{\varepsilon } (y) \, } \right| \ge {\text{MinPts}}\).
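The neighborhood in (14) and the two conditions that follow it translate almost literally into code. The Python sketch below covers only these definitions, not the full DBSCAN cluster expansion; the function names, the toy points, and the parameter values are assumptions made for illustration (they are not the Epsilon and MinPts used in the experiments), and the Euclidean norm is assumed.

```python
import math  # math.dist requires Python 3.8 or later


def eps_neighborhood(points, x, eps):
    """Indices of all points within distance eps of x (x itself included)."""
    return [i for i, y in enumerate(points) if math.dist(x, y) <= eps]


def is_core(points, x, eps, min_pts):
    """True if the eps-neighborhood of x holds at least min_pts points."""
    return len(eps_neighborhood(points, x, eps)) >= min_pts


def directly_density_reachable(points, x, y, eps, min_pts):
    """True if x lies in the eps-neighborhood of y and y is a core point."""
    return math.dist(x, y) <= eps and is_core(points, y, eps, min_pts)


# Toy data (hypothetical): four nearby points and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
eps, min_pts = 1.5, 4

print(is_core(points, (0.0, 0.0), eps, min_pts))
print(is_core(points, (5.0, 5.0), eps, min_pts))
print(directly_density_reachable(points, (1.0, 1.0), (0.0, 0.0), eps, min_pts))
```

In the full algorithm, points that are neither core points nor reachable from one are reported as noise.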
3 Data Set
This research uses a case study to identify a model of how the people of Papua and West Papua understand the National Program for Community Empowerment, Strategic Plan of “Kampung” Development (PNPM RESPEK) [7]. The dataset was obtained from the PNPM RESPEK Evaluation Survey, conducted by PNPM RESPEK in collaboration with BPS-Statistics Indonesia and integrated into the National Socio-Economic Survey (Susenas) in July 2009. The source dataset is openly accessible at https://microdata.worldbank.org/index.php/catalog/1801/study-description.
The data initially contain 3937 records of Papua and West Papua people who received support from PNPM RESPEK. Because 2041 records have missing values, the data used in this research contain only the remaining 1896 samples of Papua and West Papua people who have benefited from the PNPM RESPEK project, with 31 attributes. Table 1 shows the data type for each attribute. As our goal is to identify how accurately the classifiers predict which people are likely to reach the level of understanding and perception towards the PNPM RESPEK program, we measured performance using a single evaluation metric, the accuracy rate. The accuracy rate summarizes in one number how well the predicted classes or clusters agree with the true classes. Furthermore, we consider a high accuracy rate more important than the computational resources required. The accuracy rate is the complement of the percentage of error and is computed from \(n(c_{k} )\), the number of data points that obtain the correct classification/clustering. All the procedures for classification and clustering, including pre-processing and preparation of the final input data, are carried out using the Waikato Environment for Knowledge Analysis (WEKA).
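A minimal Python sketch of an accuracy rate of this kind follows. It assumes that the rate is the share of data points whose predicted label matches the true label, that is, the sum of \(n(c_{k} )\) over all classes divided by the total number of points n; this exact form, the function names, and the toy labels are assumptions and are not taken from the survey or from WEKA.

```python
def accuracy_rate(true_labels, predicted_labels):
    """Fraction of points whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


def error_rate(true_labels, predicted_labels):
    """Complement of the accuracy rate."""
    return 1.0 - accuracy_rate(true_labels, predicted_labels)


# Toy labels (hypothetical): 8 of the 10 predictions are correct.
true_labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

print(accuracy_rate(true_labels, predicted))
print(error_rate(true_labels, predicted))
```

Under this reading, the accuracy and error rates always sum to 100%.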
4 Result and Discussion
4.1 Classification (Supervised Learning)
We use the PNPM RESPEK 2009 data as a case study for Decision Tree J48 and Logistic regression. Our target is to predict whether Papua and West Papua people who received direct community assistance meet or do not meet the level of understanding and perception of the purposes of the funds. The results of classification using Logistic regression and Decision Tree J48 are shown in Table 2.
As can be seen in Table 2, the original distribution of Papua and West Papua people who met and who did not meet the level of understanding and perception towards the PNPM RESPEK Program is 1835:61. Here “Class 0” is defined as people who met the level of understanding and perception towards the PNPM RESPEK Program, while “Class 1” is defined as people who did not meet it. Table 2 shows that neither Logistic Regression nor Decision Tree J48 produced perfect predictions: both classifiers yielded some false negatives (FN) and false positives (FP). FN and FP refer to people who were incorrectly classified as “Class 0” or “Class 1”. Specifically, some people who originally met the level of understanding and perception were predicted as not meeting it, and vice versa.
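The way false negatives and false positives arise can be illustrated by tallying a confusion matrix. The Python sketch below uses made-up toy labels rather than the values in Table 2, and it treats “Class 1” as the positive class; that choice and the function name `confusion_counts` are assumptions for illustration.

```python
def confusion_counts(true_labels, predicted_labels, positive=1):
    """Count TP, FP, FN and TN for a two-class problem."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for t, p in zip(true_labels, predicted_labels):
        if p == positive and t == positive:
            counts["TP"] += 1
        elif p == positive:
            counts["FP"] += 1   # predicted positive, actually negative
        elif t == positive:
            counts["FN"] += 1   # predicted negative, actually positive
        else:
            counts["TN"] += 1
    return counts


# Toy labels (hypothetical): 0 = met the level, 1 = did not meet it.
true_labels = [0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 1, 1, 0]

counts = confusion_counts(true_labels, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(true_labels)
print(counts, accuracy)
```

The accuracy is the sum of the diagonal entries (TP and TN) divided by the number of samples, so each FN or FP lowers it by the same amount.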
The accuracy and error rates of Logistic regression and Decision Tree J48 are shown in Table 3. Table 3 demonstrates that Decision Tree J48 is superior to Logistic Regression, with 97.31% accuracy (a 2.69% error rate). In contrast, the accuracy of Logistic Regression is 96.78%, with a 3.22% error rate.
The model produced by the J48 Decision Tree method is shown in Fig. 1. Figure 1 indicates that the number of meetings is the most important variable in building people’s understanding of the program. Information from sub-district employees also helps Papua and West Papua people to understand the program. Other variables that affect Papua and West Papua people’s understanding and perception towards the PNPM RESPEK Program are information from announcements, having been a PNPM actor, having become an SPP member, having received information from Dusun or RT meetings, information from the community of mothers, and having previously worked in the PNPM program. By contrast, people’s understanding is not influenced by age or gender.
4.2 Clustering (Unsupervised Learning)
Because it is not immediately apparent how a decision tree such as J48 can be used for unsupervised learning, we label the Papua and West Papua people who met the level of understanding and perception towards the PNPM RESPEK Program as “Class 1” and the people who did not meet it as “Class 2,” so that the clustering output can be compared with these classes. We used the data on Papua and West Papua people who received direct community assistance to demonstrate three unsupervised learning techniques: K-Means, single-linkage hierarchical clustering, and DBSCAN clustering. Since DBSCAN clustering requires two input parameters, we set Epsilon = 2.0 and minPts = 35. The results of these three clustering techniques are presented in Table 4.
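One simple way to score a clustering against known classes is sketched below in Python: each cluster is given the class label held by the majority of its members, and the share of points that agree with their cluster’s label is reported. This majority-vote scoring is an assumption made for illustration and is not necessarily the evaluation that WEKA performs; the function name `cluster_accuracy` and the toy labels are hypothetical.

```python
from collections import Counter


def cluster_accuracy(true_labels, cluster_ids):
    """Share of points that carry the majority class of their cluster."""
    members = {}
    for t, c in zip(true_labels, cluster_ids):
        members.setdefault(c, []).append(t)
    # For every cluster, count the points belonging to its majority class.
    correct = sum(Counter(m).most_common(1)[0][1] for m in members.values())
    return correct / len(true_labels)


# Toy data (hypothetical): classes 1 and 2, clusters 0 and 1.
true_labels = [1, 1, 1, 1, 2, 2]
cluster_ids = [0, 0, 0, 1, 1, 1]

score = cluster_accuracy(true_labels, cluster_ids)
print(score)
```

With strongly imbalanced classes, a score of this kind can be high even when one cluster absorbs almost all points, so it is best read together with the cluster sizes.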
Table 4 shows that the Hierarchical clustering (HCA) technique is superior to the K-means and DBSCAN clustering techniques, with 96.73% accuracy. Although misclassifying a minority class instance is usually more severe than misclassifying a majority class one, class imbalance does not appear to affect the performance of hierarchical clustering (HCA). Figure 2 shows that HCA assigned 1895 of the Papua and West Papua people who received direct community assistance to the cluster that met the level of understanding and perception towards the PNPM RESPEK. In contrast, HCA assigned 1 person to the cluster that did not meet the level of understanding and perception towards the PNPM RESPEK.
Figure 3 visualizes the distribution, generated by the K-Means technique, of Papua and West Papua people who received direct community assistance and who met or did not meet the level of understanding and perception towards the PNPM RESPEK Program. K-means clustering obtained 62.45% accuracy, assigning 1197 people to the cluster that met the level of understanding and perception towards the PNPM RESPEK and 699 people to the cluster that did not meet it.
Meanwhile, the distribution of instances obtained with the DBSCAN technique is shown in Fig. 4. DBSCAN assigned 1270 out of 1896 Papua and West Papua people who received direct community assistance to the cluster that met the level of understanding and perception towards the PNPM RESPEK Program. A closer look at Fig. 4 shows that the points in red are well separated and can be divided into two different groups: 533 un-clustered people, and 93 Papua and West Papua people who received direct community assistance and did not meet the level of understanding and perception towards the PNPM RESPEK Program. Nevertheless, DBSCAN obtained a competitive accuracy of 92.93%, as shown in Table 4.
5 Conclusion
Conclusions are drawn based on the output of supervised learning (classification) and unsupervised learning (clustering). Among the classification techniques, the best performance is achieved by the J48 Decision Tree method, with 97.31% accuracy. Among the clustering techniques, the best performance is achieved by hierarchical clustering (HCA), with 96.73% accuracy. These results are relatively high. In this sense, all 30 variables work satisfactorily in measuring whether the 1896 Papua and West Papua people who received direct community assistance met or did not meet the level of understanding and perception towards the PNPM RESPEK Program. The K-means and DBSCAN outputs can nevertheless be taken into consideration when addressing poverty issues in Papua and West Papua. As DBSCAN left 533 of the Papua and West Papua people who received direct community assistance un-clustered, it is recommended to evaluate this phenomenon to improve decision-making in the future. Because more complete data can improve performance, we encourage the government to investigate the 2041 of the 3937 original records that contain missing values. These 2041 partial records are essential in providing the right insights to drive better strategic, scenario, and situational decisions.
Overall, we have applied machine learning techniques to data on Papua and West Papua people who received support from PNPM RESPEK, using two supervised and three unsupervised learning methods on data collected through the National Socio-Economic Survey (Susenas) in July 2009. Future work will use updated data so that better results can be obtained. We will also consider further analysis based on additional supervised learning approaches to generate more comprehensive results.
References
BPS, B.P.S. (2013): Indonesia - Survei Evaluasi Program Nasional Pemberdayaan Masyarakat Rencana Strategis Pembangunan Kampung (2009)
Anderson, B.: Papua’s Insecurity: State Failure in the Indonesian Periphery. East-West Center, Honolulu (2015)
Diani, H.: Health, a specter for Irian Jaya. The Jakarta Post 2000, 21 Aug 5. http://www.library.ohiou.edu/indopubs/2000/08/20/0022.html. Accessed Nov 2008
World Bank: Poverty and Health (2014). https://www.worldbank.org/en/topic/health/brief/poverty-health
Akatiga: A technical evaluation of PNPM-RESPEK infrastructure built by the barefoot engineers technical facilitator training program in Papua (2015). https://www.akatiga.org/wp-content/uploads/2018/05/Barefoot-Technical-Evaluation-Final-Report-2015.pdf
Susilo, A., Trisnanto, A.: The Indonesian national program for community empowerment (PNPM)–Rural: decentralization in the context of neoliberalism and world bank policies. International Institute of Social Studies, 2(1) (2012)
World Bank: Indonesia: Evaluation of the Urban Community Driven Development Program: Program Nasional Pemberdayaan Masyarakat Mandiri Perkotaan (PNPM-Urban) (2013)
Rodrigues, I.: CRISP-DM methodology leader in data mining and big data (2020). https://towardsdatascience.com/crisp-dm-methodology-leader-in-data-mining-and-big-data-467efd3d3781. Accessed 13 Feb 2021
Irwansyah, E.: Clustering. https://socs.binus.ac.id/2017/03/09/clustering/. Accessed 6 Mar 2021
Quinlan, J.R.: C4.5: Programs for Machine Learning. Elsevier (2014)
Saravanan, N., Gayathri, V.: Performance and classification evaluation of J48 algorithm and Kendall’s Based J48 algorithm (KNJ48). Int. J. Comput. Trends Technol. 59(2), 73–80 (2018). https://doi.org/10.14445/22312803/ijctt-v59p112
Cabrera, A.F.: Logistic regression analysis in higher education: an applied perspective. High. Educ. Handb. Theory Res. 10, 225–256 (1994)
Hidayat, A.: Regresi Logistik (2015). https://www.statistikian.com/2015/02/regresi-logistik.html
Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Inc. (1988)
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis, vol. 344. Wiley, Hoboken (2009)
MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14, pp. 281–297 (1967)
Johnson, S.: Hierarchical clustering schemes. Psychometrika 32(3), 241–254 (1967). https://doi.org/10.1007/BF02289588
Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, vol. 96, no. 34, pp. 226–231 (1996)
© 2021 Springer Nature Singapore Pte Ltd.
Cite this paper
Yuniati, D., Sinaga, K.P. (2021). Analytics-Based on Classification and Clustering Methods for Local Community Empowerment in Indonesia. In: Mohamed, A., Yap, B.W., Zain, J.M., Berry, M.W. (eds) Soft Computing in Data Science. SCDS 2021. Communications in Computer and Information Science, vol 1489. Springer, Singapore. https://doi.org/10.1007/978-981-16-7334-4_10
DOI: https://doi.org/10.1007/978-981-16-7334-4_10
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-7333-7
Online ISBN: 978-981-16-7334-4
eBook Packages: Computer Science, Computer Science (R0)