1 Introduction

Today, treating customers as an organization's most valuable resource is becoming increasingly important. Companies are investing heavily in better customer acquisition, retention, and development. Business intelligence plays a critical role here, as it lets organizations use technical expertise to gain better insight for capacity-building programs. In this context, the idea of customer segmentation analysis attracts much attention: it is a comprehensive process of acquiring and retaining consumers, making use of business intelligence, with the purpose of increasing a customer's value to the enterprise. One of the two most important aims of customer segmentation is customer development through knowledge. This aim requires a systematic approach to accurately evaluating customer data and exploring the worth of consumers, for deeper customer insight. To keep pace with changing times, organizations are adapting their business models by deploying systems engineering, change management, and information technology solutions that help them gain new customers, retain and maintain the existing customer base, and enhance customers' long-term value.
Given the variety of products and services readily available in the market, and the fierce competition among diverse groups of organizations, customer relationship management now plays an important role in identifying and studying a company's best clients, and in acquiring the best marketing methodologies to obtain and preserve a competitive advantage [1]. Unsupervised learning is used for a variety of purposes, including customer segmentation. Customer segmentation enables organizations to find distinct categories of clients who think and behave differently and who follow diverse patterns in their spending and purchasing habits. Using the K-means clustering technique, companies can discover numerous customer segments and target the potential user base by splitting it by gender, age, interests, and other buying habits. These techniques reveal groups that are internally homogeneous and externally heterogeneous. Consumers differ in their characteristics and needs, and the primary objective of all clustering techniques is to identify consumer sets and divide the base into classes of similar profiles, so that target marketing can be carried out far more efficiently. Data visualization is performed through varied visual elements such as plots and graphs, providing accessible ways to infer trends, outliers, and patterns in the data [2]. K-means clustering, a vector quantization method, partitions the observations into clusters, with each observation belonging to the cluster with the nearest mean. We determine the optimal number of clusters using three methods: the elbow method, the average silhouette method, and the gap statistic.
Finally, the optimized clustering results are visualized using principal component analysis [3].

The paper is organized as follows. Section 2 contains the related work, i.e., a literature survey of works corresponding to this paper's topic. Section 3 covers data preparation and visualization. Section 4 elaborates the proposed method. Section 5 summarizes the results and discussion. Section 6 presents the conclusions, and finally Sect. 7 lists the references.

2 Related Work

Numerous works have been published in the field of customer segmentation. Different approaches yield different results, each with its own advantages and disadvantages. For instance, Kansal et al. [3] used three distinct clustering techniques to segment customers: K-means, bottom-up hierarchical agglomerative clustering (represented by a dendrogram), and the mean shift algorithm. Their work used random, unlabeled data, which explains the use of internal clustering validation to pick the technique that most accurately assigns the input data to its respective cluster. Syakur et al. [5] used the K-means clustering technique in combination with the elbow method. The elbow method was chosen to supplement the K-means run when refining the large quantity of data used, by determining the best number of clusters from those produced by K-means. Ezenkwu et al. [6] applied the K-means clustering method, implemented in MATLAB, to data gathered from a MegaCorp clothing business. The paper concludes with the business devising specific market schemes appropriate for each of its customer segments. Kashwan et al. [7] used the K-means clustering algorithm and the Statistical Package for the Social Sciences (SPSS) tool to forecast the sales of a specific market across yearly recurring cycles. An analysis of variance was also carried out to test the stability of the clusters. The resulting computing setup proved intuitive and provided managers with outcomes for making swift decisions. Aryuni et al. [8] fitted K-means and K-medoids methods based on recency, frequency, and monetary (RFM) analysis scores to Internet banking system data.
Their output showed that the former method outperformed the latter in terms of intra-cluster, i.e., average within-cluster (AWC), distance. Datta et al. [9] used the K-Nearest Neighbors (KNN) algorithm. KNN is a supervised machine learning algorithm that may be used for both classification and regression problems. It is a straightforward notion that classifies any input by examining the surrounding data points. If K = 1, only one neighbor is checked, and the input is classified based on that neighbor's class. When K = 4, four neighbors are examined, and the majority class is assigned. If an appropriate value of K can be estimated, this is a sound approach. Rizki et al. [10] used two methods for customer loyalty segmentation on a point-of-sale system: the recency-frequency-monetary (RFM) model and the K-means algorithm.

Besides K-means clustering and related approaches, other segmentation techniques have also been used in several research works. For instance, Song et al. [11] considered feature selection to be an important aspect of customer segmentation. They performed customer segmentation with an advanced technique called hydrological cycling optimization (HCO), based on a meta-heuristic approach. The proposed method was able to evolve a set of non-dominated solutions with a smaller number of features, which yielded highly accurate results. Wu et al. [12] integrated churn prediction with customer segmentation. Their focus lies in the telco industry, where churn management is crucial. They implemented multiple ML classification algorithms to perform the churn prediction, including K-means clustering. Further, Bayesian logistic regression was used to conduct factor analysis and identify key features for churned-customer segmentation. Manjunath et al. [13] proposed a multi-layer hierarchical super-peer P2P network architecture for the distributed clustering problem of customer segmentation. Their methodology contrasts with the centralized clustering approach, and it allows the flexibility to use datasets of varying sizes. López et al. [14] aim to provide electric utilities, which hold bulk customer data, with a customer classification that enables them to set up distinct types of tariffs. They used the following methods to classify electricity clients: hierarchical clustering, a modified follow-the-leader algorithm, and K-means. Their proposal eliminates the volatility of the initial solution and advances towards the global optimum. Maree et al. [15] discovered that the clusters created were not discriminating enough for micro-segmentation.
As a result, they concentrated on extracting temporal features with continuous values from the hidden states of neural networks in order to forecast client spending behavior in terms of transactions. They created micro-segments and coarse segments using Long Short-Term Memory (LSTM) and feed-forward neural networks, respectively.

The motivation for this research work includes the following:

  1.

    Major emphasis on data visualization of the attributes required for customer segmentation.

  2.

    Re-defining the use of the K-means clustering algorithm with optimization, so that the results are comparable with recent approaches to customer segmentation. If the K-means clustering algorithm is used entirely on its own, as seen in some of the related works, it is highly likely to produce large errors and mediocre cluster results, since it is a local optimization method.

  3.

    We have some rudimentary details about the customers, such as age, customer ID, gender, yearly earnings, and spending score. We need to understand who the target customers are, so that this insight can be provided to the marketing team, who can then sketch their strategy accordingly to maximize profits.

3 Data Preparation and Visualization

In this section, the dataset collected for performing customer segmentation is explored. Further, the visualizations of some fundamental attributes related to customers at a typical shopping complex are carried out using insightful and varied forms of graphs.

3.1 Data Exploration

The dataset is collected from a typical mall. There were 200 customers in total, who were analyzed based on their gender, age, and annual income (in thousands of dollars, k$); based on their spending behavior, each customer was given a spending score out of 100. The dataset does not contain any redundant or null values. The complete dataset consists of 200 rows, excluding the header row, and 5 columns in total. The first 6 records of the dataset are shown in Table 1 for reference.

Table 1 Mall Customer Dataset
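The exploration above can be sketched in code. The paper's analysis is carried out in R; the following is a minimal Python equivalent, where the column names and the sample rows are assumptions modeled on the publicly available mall customer dataset, not taken from Table 1 itself.

```python
import csv
import io
import statistics

# Illustrative sample in the shape of the mall customer dataset
# (column names assumed; the real file has 200 rows).
sample = """CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
4,Female,23,16,77
5,Female,31,17,40
6,Female,22,17,76
"""

rows = list(csv.DictReader(io.StringIO(sample)))
ages = [int(r["Age"]) for r in rows]
incomes = [int(r["Annual Income (k$)"]) for r in rows]

print(len(rows))                           # number of records in the sample
print(min(ages), max(ages))                # age range of the sample
print(round(statistics.mean(incomes), 2))  # mean annual income (k$)
```

On the full 200-row file, the same summaries would yield the figures quoted later in the paper (e.g., a mean annual income of 60.56 k$).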

3.2 Visualization of Attributes

In this section, we visualize each attribute of our dataset: gender, age, annual income (in k$), and spending score. There are four subsections, each devoted to one attribute.

3.2.1 Data Visualization Using Customer Gender

A bar graph and a pie chart are created to illustrate the gender distribution in the mall customer dataset. For the bar graph (as shown in Fig. 1), we use red for female and green for male, which makes the gender comparison easy. Then, for the pie chart (as shown in Fig. 2), we use red to show the female percentage and sky blue for the male percentage.

Fig. 1
A bar graph represents the gender comparison. It denotes the count for the male and female as 80 and 110, respectively. Values are approximate.

Bar plot to display gender comparison

Fig. 2
A 3-D pie chart denotes the ratio of the male and female as 44% and 56%, respectively.

Pie chart to display ratio of male and female

Now, from the bar plot (as shown in Fig. 1), we learn that female customers outnumber male customers. From the pie chart (as shown in Fig. 2), we can say that females make up 56% of the mall customer dataset and males 44%.
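The counts behind Figs. 1 and 2 can be reproduced with a short sketch. The plots themselves are drawn in R in the paper; here we only derive the counts and percentages, with the 112/88 split following from the stated 56%/44% ratio of 200 customers.

```python
from collections import Counter

# Gender column reconstructed from the stated 56% / 44% split of 200 customers.
genders = ["Female"] * 112 + ["Male"] * 88

counts = Counter(genders)
total = sum(counts.values())
shares = {g: round(100 * c / total) for g, c in counts.items()}

print(counts["Female"], counts["Male"])  # 112 88
print(shares)                            # {'Female': 56, 'Male': 44}
```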

3.2.2 Data Visualization Using Customer Age

A histogram (as shown in Fig. 3) and a box plot are plotted using the age attribute, to learn which age groups usually come to the mall and the minimum and maximum ages of visiting customers. We use pink for the histogram and deep pink for the box plot. To find the hex value of deep pink, we use an online color picker tool, which gives "#ff00ee"; we use this value to obtain the deep pink color in the box plot.

Fig. 3
A histogram denotes the frequency of the age class ranging from 20 to 70. The frequency has a higher value from 20 to 40 and is maximum at 30. It decreases gradually towards the end.

Histogram plot to display count of age class

From the above two plots (as shown in Fig. 3 and Fig. 4), i.e., the histogram and the box plot, we can deduce that most customers are aged between 30 and 35, while the minimum and maximum client ages are 18 and 70.

Fig. 4
A box plot represents the analysis of age. The whisker ranges from a minimum value of 20 to a maximum value of 70. The lower and upper quartile ranges from 25 to 45. The median lies at 35. Values are approximate.

Box plot for displaying a detailed analysis of age

3.2.3 Data Visualization Using Customer Annual Income (k$)

In addition, we construct visualizations to analyze the annual income of our customers. First, we make a histogram plot, and then we move on to a density plot. Red is used in the histogram (as shown in Fig. 5); we obtain its hex value, "#ff002f", in the same way as for the age histogram (as shown in Fig. 3). For the density graph (as shown in Fig. 6), we use a light yellow color with hex value "#f2ff00".

Fig. 5
A histogram of frequency versus annual income class denotes an increasing trend in frequency up to the income class of 80 and a decrease thereafter. The frequency reaches the highest value of 30 for the income class of 70 to 80.

Histogram plot to display count for annual income

Fig. 6
An area graph denotes the density of the annual income class ranging from 0 to 150. The density is higher between 50 to 100 and reaches the maximum density of around 0.015, and decreases thereafter.

Density plot displaying a detailed analysis of annual income

From the above analysis (as shown in Fig. 5 and Fig. 6), we can say that the minimum and maximum annual incomes of the customers are 15 k$ and 137 k$. Customers with an annual income of around 70 k$ are the most frequent in the histogram. We can also state that the average annual income of the clients is 60.56 k$. From the density plot, we can see that the annual income of the consumers approximately follows a normal distribution.

3.2.4 Data Visualization Using Customer Spending Score (1–100)

A histogram and a box plot are plotted using the spending score attribute, to find the minimum and maximum spending scores of the customers. We use orange for the box plot (as shown in Fig. 7) and blue for the histogram (as shown in Fig. 8). Using a color picker, we obtain the hex values "#ffa200" for orange and "#0099ff" for blue.

Fig. 7
A box plot for descriptive analysis of spending score. The lower and upper whiskers range from 0 to 100. The median lies around 50. The lower and upper quartile ranges from 30 to 70. Values are approximate.

Box plot for a detailed analysis of spending score

Fig. 8
A histogram for spending score. It denotes the frequency of spending score class ranging from 0 to 100. The highest frequency of 40 lies between the spending score of 40 to 50. The least frequency of 8 is between 60 to 70.

Histogram plot showing a detailed analysis of spending score

Now, from the above analysis (Fig. 7 and Fig. 8), we can say that the minimum, maximum, and average spending scores are 1, 99, and 50.20, respectively. We also learn from the histogram plot that the 40-50 spending score class has the highest frequency of all the classes, namely 40.

4 The Proposed Method

First, after collecting and preparing the dataset, we perform data exploration. We then visualize and analyze the data in R to gain some necessary insights. Next, we use the K-means clustering algorithm to determine the optimal clusters representing different customer bases. Finally, the clustering outputs are visualized using principal component analysis. The proposed methodology is presented as a flow chart in Fig. 9.

Fig. 9
A flow chart starts with collecting the dataset and moves through data preparation, identifying the potential customer base, implementing clustering algorithms, determining optimal clusters, and selling products to the identified customer base.

Flow chart representing the proposed method

4.1 K-means Clustering Algorithm

While utilizing the K-means clustering algorithm, the first step is to specify the number of clusters we intend to produce in the final output. The algorithm begins by randomly selecting 'n' objects from the dataset to serve as initial cluster centers. These picked objects are the cluster means, also known as centroids. In the next stage, cluster assignment, each object is assigned to its closest centroid. The Euclidean distance, i.e., the length of the line segment between two points in Euclidean space, determines the closest centroid. When this step is completed, the algorithm computes the new mean value for each cluster in the data. After this re-computation of the centers' values, the observations are checked to see whether they are now closer to a different cluster, and the items are reassigned once the cluster means have been updated. These steps are repeated until the cluster assignments stop changing, i.e., the clusters in the present iteration are identical to those obtained in the previous iteration.

The mathematical expressions used, corresponding to the above elaborated K-Means clustering algorithm are as follows:

$$Euclidean\, Distance, d\left(p, q\right)= \sqrt{{({x}_{2}- {x}_{1})}^{2}+{({y}_{2}-{y}_{1})}^{2}}$$

where p = (x1, y1) and q = (x2, y2) are the two points in Euclidean space.

Also, each cluster mean (centroid) is the average of all the observations currently assigned to that cluster:

$$\overline{x }=\frac{1}{\left|C\right|}\sum_{{x}_{i}\in C}{x}_{i}$$

where C is the set of observations in the cluster.
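The steps and formulas above can be combined into a compact sketch. The paper's implementation is in R; the following is a minimal pure-Python version, with the sample points being hypothetical (income, spending score) pairs.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means: random initial centroids, Euclidean assignment,
    mean update, stop when the assignments no longer change."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initial cluster centers
    assignment = None
    for _ in range(max_iter):
        # Cluster-assignment step: nearest centroid by Euclidean distance.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:   # converged
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, assignment

# Two well-separated groups of hypothetical (income, spending score) points.
data = [(15, 20), (16, 22), (14, 18), (80, 85), (82, 88), (79, 90)]
centroids, labels = kmeans(data, k=2)
print(sorted(labels))  # three points land in each cluster
```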

4.2 Cluster Optimization

One needs to specify the number of clusters in advance; for this, we determine the optimal number of clusters. The major goal of clustering techniques is to define clusters in such a way that the total intra-cluster variation is kept to a bare minimum:

$$\mathrm{minimize}\sum_{n=1}^{k}C\left({W}_{n}\right)$$

where Wn is the nth cluster and C(Wn) is its intra-cluster variation.

For the stated-out purpose we would be making use of the following three prominent methods:

  • Average Silhouette method

  • Gap Statistic method

  • Elbow method

4.2.1 Average Silhouette Method

Using this method, we assess the quality of the clustering: a high average silhouette width indicates good clustering. For different values of k, this approach computes the average silhouette of all observations, and we maximize the average silhouette over a noteworthy range of k values. We use the silhouette function together with the k-means function to compute the average silhouette width.
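The average silhouette width described above can be sketched as follows. This is a plain-Python illustration of the formula s = (b - a) / max(a, b), not the R silhouette function the paper uses, and the sample points are hypothetical.

```python
import math

def silhouette_width(points, labels):
    """Average silhouette width: for each point, a = mean distance to its
    own cluster, b = smallest mean distance to any other cluster, and
    s = (b - a) / max(a, b)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    widths = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q is not p]
        if not own:              # singleton cluster: silhouette taken as 0
            widths.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / len(own)
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for l2, members in clusters.items() if l2 != l
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Two tight, well-separated clusters give a width close to 1.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
print(round(silhouette_width(points, labels), 3))
```

Repeating this for each candidate k and picking the k with the largest average width is exactly the selection rule used in Sect. 5.2.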

4.2.2 Gap Statistic Method

Using this method, we compare the total intra-cluster variation for different values of k against its expected value under a null reference distribution of the data. The reference datasets are generated using Monte Carlo simulations: we compute the range between the minimum and maximum of each variable in the dataset, and generate values uniformly from the lower bound up to the upper bound. We use the "clusGap" function to produce the gap statistic, along with its standard error, for a given output.
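A simplified version of the gap statistic can be sketched as follows. This is a plain-Python illustration, not the R clusGap function the paper uses; the single-restart k-means helper and the choice of 10 reference datasets are simplifying assumptions.

```python
import math
import random

def _kmeans_wss(points, k, rng, iters=50):
    """Run plain k-means and return the within-cluster sum of squares."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k),
                       key=lambda j: math.dist(p, centroids[j]))].append(p)
        centroids = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
                     for j, g in enumerate(groups)]
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)

def gap_statistic(points, k, n_ref=10, seed=0):
    """Gap(k) = mean_b log(W*_kb) - log(W_k), where the W*_kb come from
    reference datasets drawn uniformly over the data's bounding box."""
    rng = random.Random(seed)
    log_w = math.log(_kmeans_wss(points, k, rng))
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    log_w_ref = []
    for _ in range(n_ref):   # Monte Carlo reference datasets
        ref = [tuple(rng.uniform(lo[d], hi[d]) for d in dims)
               for _ in range(len(points))]
        log_w_ref.append(math.log(_kmeans_wss(ref, k, rng)))
    return sum(log_w_ref) / n_ref - log_w

# Two well-separated blobs: the gap should be larger at k = 2 than at k = 1.
data = ([(x, y) for x in (0, 1, 2) for y in (0, 1, 2)]
        + [(x, y) for x in (20, 21, 22) for y in (20, 21, 22)])
print(gap_statistic(data, 1), gap_statistic(data, 2))
```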

4.2.3 Elbow Method

First and foremost, we compute the clustering for a range of values of k, by varying the number of clusters from 1 to 10. The total intra-cluster sum of squares (WSS) is then computed. Plotting WSS against the number of clusters k indicates the suitable number of clusters for our representation: the optimum number of clusters is designated by the location of a bend, the "elbow".
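The elbow computation can be sketched as follows; again a plain-Python stand-in for the R workflow, using hypothetical blob data and a small multi-restart k-means so that the WSS curve is well-behaved.

```python
import math
import random

def wss(points, k, restarts=20, iters=50):
    """Best (lowest) within-cluster sum of squares over several k-means runs."""
    best = float("inf")
    for seed in range(restarts):
        rng = random.Random(seed)
        centroids = rng.sample(points, k)
        for _ in range(iters):
            groups = [[] for _ in range(k)]
            for p in points:
                groups[min(range(k),
                           key=lambda j: math.dist(p, centroids[j]))].append(p)
            centroids = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
                         for j, g in enumerate(groups)]
        best = min(best, sum(min(math.dist(p, c) for c in centroids) ** 2
                             for p in points))
    return best

# Three well-separated, hypothetical blobs of nine points each.
blobs = [(bx + dx, by + dy)
         for (bx, by) in [(0, 0), (20, 0), (10, 20)]
         for dx in (0, 1, 2) for dy in (0, 1, 2)]

# WSS falls steeply up to k = 3 (the true number of blobs), then flattens.
curve = {k: round(wss(blobs, k), 1) for k in range(1, 7)}
print(curve)
```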

5 Results and Discussion

As discussed in the previous section, we use three methods for determining the optimal clustering. In this section, we discuss the results of those methods in three subsections, with another subsection for visualizing the clusters.

5.1 Elbow Method

In this section, we show the result of the elbow method. We take the number of clusters on the x-axis and the total intra-cluster sum of squares on the y-axis, and then plot. The resultant graph is shown in Fig. 10.

Fig. 10
A line graph of the total intra-cluster sum of squares versus the number of clusters k shows a decreasing trend. Some of the coordinate points are (2, 200000), (4, 100000), (6, 60000), and (10, 40000). Values are approximate.

Line graph using Elbow method

Now, based on Fig. 10, we may deduce that k = 6 is the correct number of clusters, because it appears at the elbow bend of the graph.

5.2 Average Silhouette Method

In this section, we show the result of the average silhouette method. First, we need to draw silhouette plots (as shown in Fig. 11 and Fig. 12) for different numbers of clusters to find the maximum average. We go from k = 2 to k = 10, since we also used up to 10 clusters in the elbow graph above, and record the average values.

Fig. 11
A silhouette plot for 2 clusters. The silhouette width ranges from 0 to 1, while the average silhouette width is denoted as 0.29. Cluster 1 denotes an average width of 0.31 while cluster 2 denotes 0.28.

For k = 2 clusters, the average silhouette plot is shown

Fig. 12
A silhouette plot for 10 clusters. The silhouette width ranges from 0 to 1, while the average silhouette width is denoted as 0.38. The average widths from clusters 1 to 10 are 0.50, 0.37, 0.28, 0.30, 0.31, 0.36, 0.56, 0.32, 0.38, and 0.28, respectively.

For k = 10 clusters, the average silhouette plot is shown

Here, the averages we obtained are 0.29, 0.38, 0.41, 0.44, 0.45, 0.44, 0.43, 0.42, and 0.38, respectively, for k = 2 to k = 10.

Now, using these values, we visualize the optimal number of clusters: we take the number of clusters (k) on the x-axis and the average silhouette width on the y-axis. The resultant graph is illustrated in Fig. 13.

Fig. 13.
A line graph of average silhouette width versus number of clusters K denotes the optimal number of clusters. The line has an increasing trend up to the point (6, 0.4), which dips and fluctuates thereafter. The value is approximated.

Optimal clusters graph using Average Silhouette method

From the above graph (as shown in Fig. 13), we observe that k = 6 seems to be the correct number of clusters, because it has the highest average silhouette width.

5.3 Gap Statistic Method

In this section, we display the results of the gap statistic method. For plotting, we take the number of clusters on the x-axis and the gap statistic Gap(k) on the y-axis. We then obtain the graph shown in Fig. 14.

Fig. 14
A line graph with error bars denotes the plots for the gap statistic versus the number of clusters. The line has an increasing trend. It highlights the optimal number of clusters at (1, 0.44), with the error bar ranging from 0.42 to 0.46. Values are approximated.

Optimal clusters graph using the Gap Statistic method

From Fig. 14, we can conclude that there are 6 optimal clusters. The reference datasets are generated using Monte Carlo simulations.

5.4 Visualization and Analysis of the Clustering Results

From the above results, we know that the optimal number of clusters is 6. So, we plot the 6 clusters in scatter plots using the "ggplot2" package in RStudio. We take annual income on the x-axis and spending score on the y-axis (as shown in Fig. 15). To draw the segmentation of the mall customers using K-means clustering, we take annual income on the x-axis and age on the y-axis (as shown in Fig. 16). Finally, we take classes on the x-axis and K-means on the y-axis to show the final K-means clusters, visualized as a scatter plot in Fig. 17.

Fig. 15.
A scatterplot of spending score versus annual income denotes different clusters of dots labeled cluster 1 to cluster 6. A few dots of clusters 3 and 4 overlap in the middle. A text at the top reads segments of mall customers.

6-optimal clusters using spending score and annual income

Fig. 16.
A scatterplot of age versus spending score denotes different clusters of dots labeled cluster 1 to cluster 6. The dots of clusters 1 and 2 are closer to each other. A text at the top reads segments of mall customers.

6-optimal clusters using age and spending score

Fig. 17
A scatter plot of classes versus K-means denotes the distribution of the cluster of dots labeled 1 to 6. Clusters 3 and 4 overlap at some points.

Final 6-optimal clusters using k-means and classes

From Fig. 15, we observe that:

  1)

    Cluster 1 - Customers with a high annual income and high annual spending make up this cluster.

  2)

    Cluster 2 - This cluster is characterized by a high annual income and low annual spending.

  3)

    Cluster 3 - Customers with a low annual income and low annual spending are represented in this cluster.

  4)

    Cluster 5 - People in this cluster have a low annual income but high annual spending.

  5)

    Customers in Clusters 6 and 4 have a medium annual income as well as medium annual spending.

Further, in Fig. 16, we see that the data is scattered randomly all over the scatter plot. Strong clustering is therefore difficult to achieve, as the points are not close together. We can also conclude that people of different ages (both young and old) and of varying economic status come to shop at malls.

From Fig. 17, we can analyze the clusters in terms of principal component analysis (PCA) scores. There are two PCA scores: PCA1 and PCA2. PCA1 is the linear combination of the attributes with the largest possible explained variation, and PCA2 captures the best of the remaining variation.

  1)

    Customers with a medium PCA1 and medium PCA2 score make up Clusters 4 and

  2)

    Clients in Cluster 2 have a high PCA2 score but low annual spending.

  3)

    Cluster 3 is made up of customers with high PCA1 and PCA2 scores.

  4)

    Customers in Cluster 5 have a medium PCA1 score but a low PCA2 score.

  5)

    Cluster 6 - This cluster represents customers with a high PCA2 score and a low PCA1 score.
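The PCA scores discussed above can be computed with a short sketch. The paper performs this step in R; the following numpy version uses synthetic, correlated data (an assumption standing in for the scaled customer attributes) to show how PCA1 and PCA2 arise from the eigen-decomposition of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for 200 customers with 3 correlated attributes,
# so that the first component carries most of the variance.
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)),
               2 * base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

# PCA via eigen-decomposition of the covariance matrix.
Xc = X - X.mean(axis=0)                   # center the data
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)    # ascending eigenvalues
order = np.argsort(eigvals)[::-1]         # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs[:, :2]              # PCA1 and PCA2 scores per customer
explained = eigvals / eigvals.sum()       # explained-variance ratios
print(explained.round(3))                 # the first component dominates
```

Plotting the two columns of `scores`, colored by the K-means labels, reproduces the kind of view shown in Fig. 17.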

6 Conclusion and Future Work

In this research work, we successfully performed customer segmentation on a group of 200 customers. We gave our work a visualization flavor by incorporating colorful and insightful data visualization graphs, which helped us gain insights into the dataset at a glance. Further, the principal approach used for segmenting the mall customers was K-means clustering. The clusters were visualized as scatter plots, and we were thus able to group the customers into six groups based on certain attributes of the dataset. To improve the clustering process, we also implemented a few optimization techniques to achieve better results.

For future work, we can work on a more complex dataset consisting of thousands of customers. Further, the dataset could also include several other realistic attributes, such as customer survey data and the time customers spend in malls.