1 Introduction

Marketing focuses on establishing, developing, and maintaining continuous relationships between client and seller as a source of mutual benefit for both parties [1]. For marketing policies to be effective in a highly competitive market, the literature proposes considering relational benefits and customer segmentation [2]. By defining consumer segments that value the benefits of the relationship to varying degrees, a company can design marketing strategies suited to the characteristics of each type of customer [3]. Accordingly, the purpose of this research is to segment customers according to their level of loyalty, using a sample of companies from the SME sector in Colombia and applying Data Mining techniques.

2 Theoretical Review

2.1 RFM Analysis

RFM (Recency, Frequency, Monetary) analysis is a marketing technique for analyzing customer behavior [4]. It examines what the customer has purchased using three factors: (R) Recency of the last purchase, (F) Frequency of purchase, and (M) amount of purchase in Monetary terms. According to the literature, customers who spend more money or buy more frequently from a company tend to be more responsive to the information and messages that the company transmits.
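The three RFM factors can be derived directly from invoice records. The following is a minimal Python sketch, not the authors' implementation; the record layout `(customer_id, invoice_date, amount)`, the reference date, and the toy values are assumptions made for illustration.

```python
from datetime import date

def rfm_table(invoices, reference_date):
    """Compute Recency (days since last purchase), Frequency (number of
    invoices) and Monetary (total amount) per customer.

    invoices: iterable of (customer_id, invoice_date, amount) tuples
              (hypothetical layout, not the schema used in the study).
    """
    table = {}
    for customer_id, invoice_date, amount in invoices:
        days_ago = (reference_date - invoice_date).days
        row = table.setdefault(
            customer_id,
            {"recency": days_ago, "frequency": 0, "monetary": 0.0},
        )
        # Recency keeps the most recent purchase, i.e. the smallest gap.
        row["recency"] = min(row["recency"], days_ago)
        row["frequency"] += 1
        row["monetary"] += amount
    return table

# Hypothetical toy invoices within the 2015-2018 study period.
invoices = [
    ("C1", date(2018, 12, 1), 120.0),
    ("C1", date(2018, 6, 15), 80.0),
    ("C2", date(2016, 3, 10), 40.0),
]
rfm = rfm_table(invoices, reference_date=date(2018, 12, 31))
```

In this toy example, customer C1 bought recently, more often, and for a larger total than C2, so C1 would rank higher on all three factors.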

2.2 Data Mining Methodologies

2.2.1 SEMMA

The SAS Institute, developer of this methodology, defines SEMMA as the process of selecting, exploring, and modeling large amounts of data to discover unknown business patterns. The name of the methodology is an acronym of the five basic phases of the process: Sample, Explore, Modify, Model, and Assess [5].

2.2.2 CRISP-DM

CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is a proven method for guiding Data Mining work. It was created in 2000 by a group of companies comprising SPSS, NCR, and DaimlerChrysler, and is currently the most widely used reference guide for developing Data Mining projects [6]. The method structures the process in six phases: Business understanding, Data understanding, Data preparation, Modeling, Evaluation, and Deployment. The sequence of phases is not rigid, and each phase is broken down into several general tasks at a second level.

3 Materials and Methods

The data were provided by the Chamber of Commerce of Barranquilla, Colombia, and correspond to customer and sales records from 2015 to 2018 for a group of companies belonging to the SME sector in the Caribbean region of Colombia [7].

The data collected were categorized as follows:

  • Clients: covers the personal data of clients and provides geographical and demographic descriptors such as ID, RUC, address, age, gender, marital status, telephone, e-mail, workplace, profession, etc.

  • Sales: contains the daily sales billing records, which describe each purchase made by customers during the period 2015–2018.

The databases that contain the data of interest for the analysis are the following [8]:

  • Clients: contains personal information of the company’s clients. It has a total of 44,800 customer records.

  • Client Type: contains three records representing end customers, distributors, and franchisees.

  • Institution: defines if the client belongs to a public institution/company, to a private company, or a natural person.

  • Invoice: all the billing information recorded by the company during the study time period. It has a total of 136,278 invoice records.

  • Invoice Details: the products that have been purchased in each invoice. It has a total of 403,159 invoice detail records.

  • Products: contains the records of all the products that the company sells. It has a total of 11,127 product records.

  • Product Groups: the groups or categories to which the products belong. It has a total of 58 product categories.

  • Brands: the brands of the products marketed by the company. It has a total of 396 brand records.

The Data Mining process followed the CRISP-DM method ([9] and [10]), which consists of six phases: Business understanding, Data understanding, Data preparation, Modeling, Evaluation, and Deployment. Each phase covers a set of activities that must be followed to carry out a mining process with high-quality results.

For the segmentation of Master PC customers based on their purchasing behavior, three normalized variables were taken into account: Recency, Frequency, and Monetary amount. Because there is a wide range of clustering algorithms, those most commonly used in this type of case were analyzed first. This step led to the selection of the segmentation algorithms applied in the present research: Kohonen self-organizing maps (SOM), K-means, and the CLARA algorithm (Clustering Large Applications), which is an extension of the k-medoids algorithm [11].
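The idea behind CLARA is to run k-medoids on several random samples of the data and keep the medoids that give the lowest total cost over the full data set. The following is a minimal Python sketch under stated assumptions: the inputs are already normalized RFM vectors, distance is Euclidean, the k-medoids step is a simple greedy swap search, and the sample sizes and toy points are hypothetical. It is not the implementation used in the study.

```python
import random

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k, rng):
    """Greedy swap search: replace a medoid whenever that lowers the cost."""
    medoids = rng.sample(points, k)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids

def clara(points, k, n_samples=5, sample_size=20, seed=0):
    """Run k-medoids on random samples; score each result on ALL points."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = k_medoids(sample, k, rng)
        cost = total_cost(points, medoids)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

# Hypothetical normalized (recency, frequency, monetary) vectors.
points = [
    (0.10, 0.10, 0.10), (0.12, 0.10, 0.08), (0.09, 0.11, 0.10),
    (0.90, 0.90, 0.90), (0.88, 0.92, 0.90), (0.91, 0.90, 0.89),
]
medoids, cost = clara(points, k=2, sample_size=6)
```

Because the medoids are actual customers rather than averaged centroids, each segment can be summarized by a real, representative RFM profile.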

To select the number of groups, two internal evaluation techniques were applied: the sum of squared errors and the silhouette index. After applying the segmentation algorithms, the one providing the best results was determined using the cascade evaluation method described in [12].
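Both internal measures can be computed from the points and their cluster labels alone. The sketch below is a plain Python illustration with hypothetical toy data; it assumes Euclidean distance and at least two clusters, and it is not the evaluation code used in the study.

```python
def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def group_by_label(points, labels):
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters

def sse(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for members in group_by_label(points, labels).values():
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(dist(p, centroid) ** 2 for p in members)
    return total

def silhouette(points, labels):
    """Mean silhouette coefficient (requires at least two clusters).

    For each point: a = mean distance to its own cluster,
    b = mean distance to the nearest other cluster, s = (b - a) / max(a, b).
    Points in singleton clusters are scored 0.
    """
    clusters = group_by_label(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: two obvious groups, labelled well and badly.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 1, 0, 1]
good = (sse(points, good_labels), silhouette(points, good_labels))
bad = (sse(points, bad_labels), silhouette(points, bad_labels))
```

A good partition shows a low sum of squared errors and a silhouette close to 1. In practice both measures are computed for several candidate numbers of groups and compared.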

4 Results

Based on the results, the CLARA algorithm, which belongs to the group of k-medoids methods, was determined to be the most appropriate method for segmenting the customers in the study sample on the RFM attributes. The following loyalty levels were identified: Group 1 High, Group 2 Low, Group 3 Medium, and Group 4 Very Low (see Table 1).

Table 1. Loyalty groups profile

4.1 Generating Association Rules (Product-Product) by the Apriori Algorithm

An association rule is a rule of the form \( {\text{X}} \Rightarrow {\text{Y}} \), where X and Y are sets of elements. The rule states that transactions containing X tend to contain Y as well. X and Y are called the antecedent and the consequent of the rule, respectively [1, 4, 7].
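The quality of a rule is usually expressed through three measures: support, confidence, and lift. Their standard definitions are given here for reference; the symbols \( T \) (the set of transactions) and \( N \) (the number of transactions) are notation introduced for this summary.

```latex
\[ \operatorname{supp}(X) = \frac{|\{\, t \in T : X \subseteq t \,\}|}{N} \]
\[ \operatorname{supp}(X \Rightarrow Y) = \operatorname{supp}(X \cup Y) \]
\[ \operatorname{conf}(X \Rightarrow Y) = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} \]
\[ \operatorname{lift}(X \Rightarrow Y) = \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
   = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)\,\operatorname{supp}(Y)} \]
```

A lift greater than 1 indicates that X and Y occur together more often than would be expected if they were independent.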

To generate the association rules, the Apriori algorithm was applied, as it is the most commonly used algorithm for this purpose. Apriori discovers frequent itemsets and generates association rules from a set of transaction data [8]. It first identifies the frequent individual items across the transactions and then extends them to progressively larger itemsets, keeping only those that satisfy a specified minimum frequency threshold (support) [13]. The algorithm is implemented in the arules package [9] of R.
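The level-wise search described above can be sketched in a few lines. The following Python code is an illustrative sketch of frequent-itemset mining only; it is not the R implementation used in the study, and the transactions and the 0.4 support threshold are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1 candidates: every individual item.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that reach the minimum support.
        level = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical transactions over product categories.
transactions = [
    {"PROCESSORS", "MOTHERBOARDS", "MEMORIES"},
    {"PROCESSORS", "MOTHERBOARDS"},
    {"PROCESSORS", "MOTHERBOARDS", "MONITORS"},
    {"MONITORS"},
    {"MEMORIES", "MONITORS"},
]
frequent = apriori(transactions, min_support=0.4)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most large candidates are discarded without counting them.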

To build the transaction data set, the following tables were used: Invoice, Detail_Invoice, Product, Product Group, and Clients with their respective loyalty groups. Each record of the resulting data set consists of a transaction identifier and the name of a product category purchased in that transaction. Finally, a separate data set was prepared for each customer loyalty group.
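The joins described above amount to mapping each invoice line to a product category and each invoice to the loyalty group of its customer. The following Python sketch shows one way to do this; all field layouts, identifiers, and values are hypothetical and do not reproduce the actual database schema.

```python
from collections import defaultdict

def build_transactions(invoices, invoice_details, product_category,
                       customer_loyalty):
    """Return {loyalty_group: {invoice_id: set of category names}}.

    invoices:         iterable of (invoice_id, customer_id)
    invoice_details:  iterable of (invoice_id, product_id)
    product_category: dict product_id -> category name
    customer_loyalty: dict customer_id -> loyalty group
    (All layouts are assumptions made for this example.)
    """
    invoice_customer = dict(invoices)
    baskets = defaultdict(lambda: defaultdict(set))
    for invoice_id, product_id in invoice_details:
        group = customer_loyalty[invoice_customer[invoice_id]]
        # A set keeps each category once per transaction.
        baskets[group][invoice_id].add(product_category[product_id])
    return {group: dict(by_invoice) for group, by_invoice in baskets.items()}

# Hypothetical toy records.
invoices = [(1, "C1"), (2, "C2"), (3, "C1")]
invoice_details = [(1, "P10"), (1, "P20"), (2, "P30"), (3, "P10"), (3, "P11")]
product_category = {
    "P10": "PROCESSORS",
    "P11": "PROCESSORS",
    "P20": "MOTHERBOARDS",
    "P30": "MONITORS",
}
customer_loyalty = {"C1": "High", "C2": "Low"}

transactions = build_transactions(
    invoices, invoice_details, product_category, customer_loyalty
)
```

Working at the category level rather than the product level reduces the number of distinct items, which makes co-occurrence patterns easier to detect.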

In an initial run, association rules were generated from 57 product categories, yielding more than 20,000 rules. Inspection of the rules showed that almost all of them were made up of the following categories: CASES and CHASSIS, HARD DRIVES, MEMORIES, PROCESSORS, MOTHERBOARDS, MONITORS, INTE-EXTE MEMORY READERS, DVD WRITERS AND DVD PLAYERS, SUPPRESSOR REGULATORS.

4.1.1 Generation of Association Rules for Product Recommendation to High Loyalty Customers

To generate rules for High loyalty customers, an acceptable level of support was selected according to the distribution of products in the set of transactions, together with a fairly high confidence level. The values of these parameters are given in Table 2.

Table 2. Parameters to generate high loyalty customer association rules

After applying the Apriori algorithm, a set of 84 rules was obtained. To guarantee their quality, only the rules with a lift value greater than 3 were selected, leaving a total of 70 rules. These association rules served as the basis for making recommendations to any High loyalty customer. Figure 1 shows the 10 rules with the highest confidence, and Table 3 presents the interpretation of some of them.
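The selection step (keep rules above a lift threshold, then rank by confidence) can be illustrated as follows. This Python sketch uses hypothetical transactions and candidate rules; only the lift threshold of 3 and the top-10 ranking come from the text.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent => consequent."""
    n = len(transactions)
    x, y = set(antecedent), set(consequent)
    n_x = sum(x <= t for t in transactions)
    n_y = sum(y <= t for t in transactions)
    n_xy = sum((x | y) <= t for t in transactions)
    confidence = n_xy / n_x if n_x else 0.0
    lift = confidence / (n_y / n) if n_y else 0.0
    return {
        "rule": (tuple(sorted(x)), tuple(sorted(y))),
        "support": n_xy / n,
        "confidence": confidence,
        "lift": lift,
    }

def select_rules(rules, min_lift, top=10):
    """Keep rules with lift above min_lift, ranked by confidence."""
    kept = [r for r in rules if r["lift"] > min_lift]
    return sorted(kept, key=lambda r: r["confidence"], reverse=True)[:top]

# Hypothetical transactions: 10 baskets of product categories.
transactions = (
    [{"PROCESSORS", "MOTHERBOARDS"}] * 2
    + [{"MOTHERBOARDS", "MONITORS"}]
    + [{"MONITORS"}] * 7
)
# Hypothetical candidate rules to evaluate.
candidates = [
    rule_metrics(transactions, ["PROCESSORS"], ["MOTHERBOARDS"]),
    rule_metrics(transactions, ["MOTHERBOARDS"], ["MONITORS"]),
]
selected = select_rules(candidates, min_lift=3)
```

In this toy example only PROCESSORS ⇒ MOTHERBOARDS survives the filter: MONITORS appears in most baskets, so a rule predicting it has a lift below 1 and adds no information for recommendation.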

Fig. 1. Main association rules for product recommendation to High loyalty customers (original screen in Spanish).

Table 3. Interpretation of the main association rules for product recommendation to High loyalty customers

4.1.2 Generation of Association Rules for Product Recommendation to Average Loyalty Customers

To generate rules for Average loyalty customers (the Medium group in Table 1), an acceptable level of support was selected according to the distribution of the product categories in the transaction set, together with a fairly high confidence level. The values of these parameters are given in Table 4. Initially, 952 rules were generated. From this total, the rules of highest quality were selected, i.e. those with the highest lift values.

Table 4. Parameters to generate association rules for average loyalty customers

The minimum lift was first set to 3, but the number of rules remained high, which would hinder their interpretation by the marketing area of the company. Therefore, only the rules with a lift value greater than or equal to 5 were finally selected, leaving a total of 125 rules. These association rules will serve as a basis for making recommendations to any Average loyalty customer. Figure 2 shows the 10 rules with the highest confidence for Average loyalty customers, and Table 5 presents the interpretation of some of them.

Fig. 2. Main association rules for product recommendation to Average loyalty customers (original screen in Spanish).

Table 5. Interpretation of the main association rules for product recommendation to Average loyalty customers.

4.1.3 Generation of Association Rules for Product Recommendation to Low Loyalty Customers

To generate rules for Low loyalty customers, the values of 0.01 for support and 0.8 for confidence were selected as initial parameters, but under these conditions no association rule was found. The reason is that, although there are several transactions for this group of customers, they rarely buy several products together, and the products that are bought together do not reach high confidence levels.

Therefore, new parameters were established, maintaining the support level but halving the confidence level, which is still an acceptable value, although not as good as in the previous experiments.

Table 6 shows the final values of the parameters used to generate the rules.

Table 6. Parameters to generate Low loyalty customer association rules.

Despite the lower acceptance thresholds, only 3 rules were generated, one of which was discarded because its lift value was too low, leaving only 2 association rules. These rules will serve as a basis for making recommendations to any Low loyalty customer. The generated rules are shown in Fig. 3 and their interpretation in Table 7.

Fig. 3. Main association rules for product recommendation for Low loyalty customers.

Table 7. Interpretation of the main association rules for product recommendation for Low loyalty customers

4.1.4 Generation of Association Rules for Product Recommendation to Very Low Loyalty Customers

To generate rules for Very Low loyalty customers, the parameters were initially set at 0.01 for support and 0.8 for confidence but, as with the previous group, no rule was found under these conditions. The confidence parameter was therefore reduced to 0.4, which is an acceptable value (see Table 8).

Table 8. Parameters to generate association rules for Very Low loyalty customers

The final result was a total of 3 rules with a lift value greater than 2. These rules are presented in Fig. 4 and their interpretation in Table 9.

Fig. 4. Main association rules generated for product recommendation for Very Low loyalty customers.

Table 9. Interpretation of the main association rules for product recommendation for Very Low loyalty customers.

5 Conclusions

To assess the accuracy of the algorithms used (k-means, k-medoids, and self-organizing maps (SOM)), classification rules were generated taking as the decision attribute the groups created by each algorithm. Based on the prediction level, the results suggest that the groups generated by the CLARA k-medoids algorithm are classified with higher accuracy. Data mining on the customers of the sampled companies revealed four levels of loyalty: High, Medium, Low, and Very Low. These results will allow the company to develop retention strategies for its customers. Applying the Apriori association algorithm to the transactions of each customer group produced important association rules with quite high confidence levels, especially for customers in the highest loyalty groups, because these customers buy more products in the same transaction.