1 Introduction

Data preprocessing is a broad umbrella that covers a wide range of interrelated strategies and techniques [1]. Extracting the essence of any dataset depends largely on cleaning and manipulating the collected data. Since no collection process is perfect, real-world data inevitably carries defects. Before performing preprocessing activities such as aggregation, dimensionality reduction, or feature extraction, data quality must be the central concern [2, 3]: the results of any analytics task depend directly on the quality of its training dataset.

In real-world data collection, a number of prerequisite steps must be followed before feeding data into statistical algorithms such as classification, clustering, or mining [4] if accurate and trustworthy results are to be obtained. Extracting new features from a given set of attributes is now common practice. Although selecting the right attributes from a feature set of hundreds requires expertise, keeping the required attributes and eliminating irrelevant or redundant ones not only maintains data fitness and reduces dimensionality, but also helps decision-making algorithms run faster and more efficiently.

This paper proposes an algorithm for effective knowledge discovery that covers additional methods for mitigating data quality issues. Our focus is the incorporation of new steps into data collection and cleaning, which have a direct impact on the quality of the results of knowledge discovery.

The remainder of this paper is organized as follows. Section 2 discusses the steps to be performed at data collection time, so that an advance refinement of the collected data can be achieved at the first level, together with data analysis and cleaning activities, and covers a few additional checks and treatments to be performed during data preprocessing. The proposed algorithm and its implementation strategy are explained in Sect. 3, and Sect. 4 showcases the impact of the algorithm on effective and more truthful knowledge discovery. Concluding remarks and future work are given in Sect. 5, showing how stepwise implementation of the algorithm can yield more reliable results.

2 Related Work

2.1 Data Collection

A classical definition of data collection is the systematic gathering of information. This definition has evolved considerably over time: modern data collection tools and techniques go well beyond simply fetching and loading data, and a complete ETL (extract, transform, load) process is now expected from a collection tool. With this in view, we analyzed a few datasets and found several abnormalities beyond the commonly handled ones [5]. First, because data may be fetched into a system from more than one source, column names can be inconsistent, i.e., the same attribute may carry different names in different sources. The second irregularity we found was the presence of several formats for a single attribute; the most common example is the timestamp. The third anomaly, which proved a primary cause of erroneous results in the knowledge discovery phase, was an incorrect column type, i.e., the attribute data type not matching its values. These anomalies must be removed at collection time, and in any case before analysis.
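As one possible implementation, the following minimal R sketch shows these three collection-time checks; the two source frames, their column names, and the date layouts are illustrative assumptions rather than details from the paper.

```r
# Minimal sketch of the three collection-time checks; src_a/src_b and
# their column names are hypothetical examples.
library(dplyr)

src_a <- data.frame(LicenseIssuedDate = c("2016-01-05", "2016-02-10"),
                    AnimalName        = c("REX", "BELLA"))
src_b <- data.frame(issue_date  = "03/15/2016",
                    animal_name = "MAX")

# 1. Reconcile column names so the same attribute carries one label
names(src_b) <- c("LicenseIssuedDate", "AnimalName")

# 2. Bring the timestamp attribute into a single format (converted per
#    source, since each source may use a different layout)
src_a$LicenseIssuedDate <- as.Date(src_a$LicenseIssuedDate, format = "%Y-%m-%d")
src_b$LicenseIssuedDate <- as.Date(src_b$LicenseIssuedDate, format = "%m/%d/%Y")
combined <- bind_rows(src_a, src_b)

# 3. Verify that the column type now matches the column values
stopifnot(inherits(combined$LicenseIssuedDate, "Date"))
str(combined)
```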

2.2 Data Analysis and Cleaning

Data analysis is the process of organizing data so that helpful conclusions can be drawn, and it acts as the base for various data cleaning activities [4]. The main aim of any data cleaning method is to identify and remove inconsistent and imprecise values present in a crude dataset. Noise and outlier detection algorithms, such as clustering and other unsupervised machine learning methods, work efficiently in finding and removing abnormal records [6]; here, abnormal refers to outliers, i.e., data items showing serious deviation from the other items in the dataset. Because unsupervised algorithms use no labels, no predefined boundaries constrain the data items, which makes them helpful in finding anomalies, and this is classically how analysis work is performed [7]. We propose an improvement to this legacy analysis and cleaning approach. A single dataset may contain hundreds or thousands of features [8], so interconnections and interrelations may exist between them. These relationships can be used to treat missing and NULL values present in the dataset, and they play a crucial role in data manipulation. Machine learning algorithms such as the apriori algorithm and k-nearest neighbors (kNN) can use these relationships to predict the missing values. Redundant records should be removed after all data manipulation is complete.
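As a hedged sketch of these cleaning ideas, the R snippet below flags outliers with the boxplot (IQR) rule and imputes the resulting missing values with the kNN implementation from the VIM package; the toy frame and its columns are hypothetical, and VIM is only one possible choice of kNN imputer.

```r
# Toy example: flag an outlier, treat it as missing, impute it via kNN,
# and finally drop redundant records. Columns are illustrative only.
library(VIM)   # provides kNN() imputation

toy <- data.frame(
  zipcode = c(10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003),
  age     = c(3, 4, 5, 2, 4, 3, 250, NA))   # 250 is an implausible age

# Unsupervised outlier flagging via the boxplot (IQR) rule
out <- boxplot.stats(toy$age)$out
toy$age[toy$age %in% out] <- NA            # treat detected outliers as missing

# kNN imputation: missing ages are predicted from their nearest neighbours,
# exploiting the relationship with the remaining attributes
toy_imputed <- kNN(toy, variable = "age", k = 2, imp_var = FALSE)

# Redundant (duplicate) records are removed after all manipulation
toy_clean <- unique(toy_imputed)
```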

2.3 Data Preprocessing

To get data ready for knowledge discovery, it must pass through a data preprocessing phase. This generally includes the integration of attributes, the creation of new attributes by aggregating or segregating existing ones, and the selection of the required, primary features while dropping irrelevant ones [5]. If scaling is required, normalization can be performed at the end of the preprocessing unit, typically in one of two ways: min-max normalization or z-score normalization [9].
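Both scaling options can be written directly in R; the vector x below is a hypothetical numeric attribute.

```r
# Min-max and z-score normalization of a numeric attribute x
x <- c(2, 5, 9, 14, 20)                      # hypothetical values

x_minmax <- (x - min(x)) / (max(x) - min(x)) # rescaled to [0, 1]
x_zscore <- (x - mean(x)) / sd(x)            # zero mean, unit variance; cf. scale(x)
```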

3 Proposed Algorithm

The following algorithm is to be executed stepwise to obtain maximum data quality and effective knowledge discovery (Fig. 1); a minimal R sketch of the full pipeline is given after the list:

Fig. 1 Proposed data mining algorithm

1. Ingest the datasets (Ki), where 'i' indexes the sources, 'n' is the number of columns, and 'r' is the number of rows.

2. Check and rectify column name mismatches between similar attributes of different sources, and combine all the datasets into one unit (K).

3. Check the number of columns (n) and their formats, i.e., whether all values of each column are present in a single format. If not, correct them and proceed to step 4.

4. Detect the data type of each column and check whether it matches the respective column values. If not, correct it and proceed to step 5.

5. Start the data analysis phase by detecting and removing noise and outliers present in the dataset.

6. Since there is now only one dataset, detect the relationships between the different attributes. These relationships are useful in data manipulation, i.e., for treating missing and NULL values; rectify and correct such values in this step.

7. Check for duplicate rows 'dR' in the dataset (K); if any exist, go to step 8, else go to step 9.

8. Remove the duplicates and count the unique number of rows 'UR', such that

    $$ r = {\text{dR}} + {\text{UR}} $$
9. Proceed with further data preprocessing activities such as feature selection and the construction of new attributes from the given attributes by aggregation, segregation, etc.

10. Finally, the knowledge discovery procedure can begin, based on the use cases.
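The minimal R sketch below walks through steps 1-9 of the pipeline. It assumes two hypothetical source files (K1.csv, K2.csv) and illustrative column names (LicenseIssuedDate, ZipCode, AnimalBirthYear); these names, and the use of dplyr and VIM, are our own assumptions rather than part of the original algorithm.

```r
# Sketch of steps 1-9 on hypothetical sources; column names are assumptions.
library(dplyr)
library(VIM)

# Steps 1-2: ingest the sources, reconcile column names, combine into K
k1 <- read.csv("K1.csv", stringsAsFactors = FALSE)
k2 <- read.csv("K2.csv", stringsAsFactors = FALSE)
names(k2)[names(k2) == "issue_date"] <- "LicenseIssuedDate"
names(k2)[names(k2) == "zip"]        <- "ZipCode"

# Step 3: one format per column (dates converted per source, since each
#         source may use a different layout)
k1$LicenseIssuedDate <- as.Date(k1$LicenseIssuedDate, format = "%Y-%m-%d")
k2$LicenseIssuedDate <- as.Date(k2$LicenseIssuedDate, format = "%m/%d/%Y")
K <- bind_rows(k1, k2)

# Step 4: make the declared data type match the column values
K$ZipCode <- as.integer(K$ZipCode)

# Step 5: detect and drop outliers with the boxplot (IQR) rule
out <- boxplot.stats(K$AnimalBirthYear)$out
K <- K[!(K$AnimalBirthYear %in% out), ]

# Step 6: exploit attribute relationships to treat missing/NULL values,
#         here with kNN imputation over all columns
K <- kNN(K, imp_var = FALSE)

# Steps 7-8: duplicate check and removal; r = dR + UR must hold
r  <- nrow(K)
K  <- distinct(K)
UR <- nrow(K)
dR <- r - UR
stopifnot(r == dR + UR)

# Step 9: feature construction before knowledge discovery, e.g. a derived
#         attribute built from an existing one
K$IssueYear <- as.integer(format(K$LicenseIssuedDate, "%Y"))
```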

4 Knowledge Discovery

Knowledge discovery depends entirely on the quality of the data passed into the discovery systems [10]. It involves the application of various data preprocessing methods aimed at facilitating the data mining algorithms, and it often also requires post-processing to refine and improve the discovered knowledge [11]. To validate the stated algorithm, we took a dataset from the NYC Open Data website [12]. The data contains information about dog owners living in New York City; by law, all residents of New York City are required to license their dogs right after adoption. In our dataset, each record represents a unique dog license with its issue date and expiry date, i.e., each tuple stands for a unique license period for a dog over a year-long time span. The dataset has 15 columns and 51,861 rows saved in CSV format. After analyzing the dataset attributes, we found major data quality issues beyond null and missing values; refer to Table 1.
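A first inspection of the downloaded extract takes only a few lines of R; the file name below is a placeholder for the CSV obtained from NYC Open Data.

```r
# Load and inspect the dog licensing extract (file name is a placeholder)
dogs <- read.csv("NYC_Dog_Licensing_Dataset.csv", stringsAsFactors = FALSE)

dim(dogs)                            # expected: 51861 rows, 15 columns
str(dogs)                            # column types detected at load time
colSums(is.na(dogs) | dogs == "")    # missing or empty values per column
```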

Table 1 Data quality issues and suggested solution

For illustration, Fig. 2 shows a screenshot of our sample grubby dataset; its inconsistencies are discussed in the quality issues column of Table 1.

Fig. 2 Screenshot of grubby dataset opened in Excel

The main aim of this work is to showcase the importance of the new methods explored and implemented at the data collection and cleaning level. The results show how the new algorithm impacts the overall knowledge discovery procedure for the given dataset. We used the open-source statistical language R and analyzed the results using exploratory data analysis techniques [15].

In Fig. 3, the distribution across the NYC boroughs is incorrect due to the presence of imprecise borough values. After establishing the relationship between zip code and borough (city name), we were able to correct the missing and incorrect boroughs, which in turn corrected the distribution of dogs per town; refer to Fig. 4.
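A hedged sketch of this correction is shown below: for each zip code, the modal borough observed in the clean records is used to fill missing or empty borough values. The column names ZipCode and Borough are assumptions about the dataset.

```r
# Repair Borough using its relationship with ZipCode (column names assumed)
library(dplyr)

lookup <- dogs %>%
  filter(!is.na(Borough) & Borough != "") %>%
  count(ZipCode, Borough, name = "freq") %>%
  group_by(ZipCode) %>%
  slice_max(freq, n = 1, with_ties = FALSE) %>%   # modal borough per zip code
  ungroup() %>%
  select(ZipCode, BoroughFromZip = Borough)

dogs <- dogs %>%
  left_join(lookup, by = "ZipCode") %>%
  mutate(Borough = if_else(is.na(Borough) | Borough == "",
                           BoroughFromZip, Borough)) %>%
  select(-BoroughFromZip)
```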

Fig. 3 Number of dogs/town without cleaning

Fig. 4 Number of dogs/town after cleaning

It was also not possible to plot a comparison between birth dates and license issue dates, as both columns were stored as strings with different date formats. After correcting the data types at data collection time, we were able to plot a contrast chart between the two columns; the result is shown in Fig. 5.
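One way to make the two columns comparable is to parse each string column with its own date layout before plotting; the column names AnimalBirthDate and LicenseIssuedDate, and the two formats, are illustrative assumptions.

```r
# Parse both date columns from strings, each with its own layout (assumed)
dogs$AnimalBirthDate   <- as.Date(dogs$AnimalBirthDate,   format = "%m/%d/%Y")
dogs$LicenseIssuedDate <- as.Date(dogs$LicenseIssuedDate, format = "%Y-%m-%d")

# With both columns typed as Date, a contrast chart such as Fig. 5 can be drawn
plot(dogs$AnimalBirthDate, dogs$LicenseIssuedDate,
     xlab = "Dog birth date", ylab = "License issue date", pch = 20)
```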

Fig. 5 Dog birth dates versus license issue dates

Because several boroughs were present, it was not feasible to plot a gender distribution chart for each borough individually. After data preprocessing, we were able to plot a contrast chart across the boroughs, shown in Fig. 6.

Fig. 6 Gender distribution with respect to boroughs after cleaning

Using Fig. 6, we can also discover the number of male and female dogs present in a particular borough, calculate the male-to-female ratio in a particular town, and so on.
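Counts and ratios of this kind follow directly from a contingency table of borough against gender; the column names Borough and AnimalGender, and the gender codes "M"/"F", are assumptions about the cleaned dataset.

```r
# Gender counts per borough and male-to-female ratio (names/codes assumed)
gender_by_borough <- table(dogs$Borough, dogs$AnimalGender)
gender_by_borough                                   # counts behind Fig. 6

ratio_mf <- gender_by_borough[, "M"] / gender_by_borough[, "F"]
round(ratio_mf, 2)                                  # males per female, by borough
```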

Lastly, we wished to find the most popular dog breed in New York City, which again was not possible with the raw data because of the enormous number of null values. After treating the null values by exploiting the relationships between the columns, dropping all unknown breeds, and cleaning and preprocessing the transformed dataset, we obtained the following word cloud based on breed counts: the higher the count, the more central the position of the value.
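A word cloud of this kind can be produced with the wordcloud package after aggregating breed counts; the column name BreedName and the "Unknown" label are assumptions about the dataset.

```r
# Word cloud of breed frequencies (BreedName column and 'Unknown' label assumed)
library(dplyr)
library(wordcloud)

breed_counts <- dogs %>%
  filter(!is.na(BreedName), BreedName != "", BreedName != "Unknown") %>%
  count(BreedName, name = "freq") %>%
  arrange(desc(freq))

wordcloud(words = breed_counts$BreedName, freq = breed_counts$freq,
          max.words = 100, random.order = FALSE)  # most frequent breed at center
```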

Therefore, the pug is the most popular dog breed in the whole NYC region of the United States (Fig. 7).

Fig. 7 Favorite dog breeds after data cleaning

5 Conclusion and Future Work

Data mining is the process of discovering useful information in a large data repository [11]. This single phase requires a number of prerequisite activities to be followed in sequence. In our work, we have covered all levels of data mining activity and included new activities to improve the knowledge discovery procedure. We carefully analyzed the data collection and preprocessing units and suggested an algorithm for effective mining of textual datasets. The algorithm can also help enhance the overall data quality of any analytics system. The results section demonstrates each step of the proposed algorithm and shows its fruitful impact on dataset quality and knowledge discovery. The algorithm can be extended if any new abnormality is found in the future. Further exploration can be done on the cleaning requirements of textual datasets using a fusion of machine learning algorithms. Ideally, a single sequence of this data mining algorithm would give the best performance for every type of dataset.