1 Introduction

Document classification is a fundamental learning problem underlying a range of information management and retrieval tasks. It is used in many applications, especially for organizing, classifying, searching and concisely representing large volumes of information, and it is a critical component in building contextual information portals. Document classification for information retrieval is an age-old, well-studied problem, and there is a large body of research literature on classifying web page content. In a web context, the various approaches leverage differing information source types from a page, such as text, links, URL, hypertext and tags (Power et al. 2010).

Two factors make document classification challenging: (a) feature extraction and (b) topic ambiguity. First, extracting the right feature set plays a critical role in how accurately a classification algorithm performs. Second, many topics are ambiguous or broad: related terms take on different meanings in different contexts, so comparing and classifying documents with standard text-based similarity often yields poor accuracy. Yessenalina et al. (2010) address a related problem with a two-level document classification model that makes predictions at both the document and sentence levels; document-level performance is trained and evaluated for different feature structures and inference methods.

Advantages of using semantics in document classification include the following (Barbara 2000): (1) True (latent) dimensions: latent semantic indexing (LSI) and principal component analysis (PCA), both forms of dimensionality reduction, assume that documents and queries are better represented in a new set of dimensions. The word "latent" is a metaphor for the idea that the new dimensions capture the true underlying representation; LSI analysis recovers the original semantic structure of the space and its original dimensions. (2) Synonymy: the same underlying concept can be described with different terms. Conventional retrieval strategies therefore have problems discovering documents on the same topic that use a different vocabulary. (3) Polysemy: a common property of language in which one word has more than one meaning. Polysemous words in a query can reduce search precision significantly; the LSI representation suppresses this data "noise" by detecting and down-weighting the less important uses of such terms. (4) Term dependence: the traditional vector space model assumes term independence, with terms serving as orthogonal basis vectors of the space. Term independence is at best a reasonable first-order approximation of the data; exploiting term associations can improve the performance of the retrieval function.
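As a hedged illustration of the latent-dimension idea in point (1), the sketch below applies LSI-style dimensionality reduction (truncated SVD over a tf-idf matrix) to a toy corpus and compares documents in the reduced space; the corpus, the number of components and the scikit-learn calls are illustrative assumptions, not part of the cited study.

```python
# Minimal LSI sketch: project tf-idf vectors onto a small number of
# latent dimensions, then compare documents in that reduced space.
# The toy corpus and parameter choices are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "bank approves equity offering",
    "bank delays stock offering approval",
    "grain harvest exceeds forecast",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
lsi = TruncatedSVD(n_components=2, random_state=0)   # 2 latent dimensions
latent = lsi.fit_transform(tfidf)                    # documents in LSI space

# Documents sharing a topic should be closer in the latent space,
# even when they use different surface terms (synonymy).
print(cosine_similarity(latent))
```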

XML’s advantages in document classification include:

  1. It is an effective and economical way to publish data on the web.

  2. It is easy to handle and control.

  3. Large flat files can be created in this format easily.

  4. It is a useful web-publishing format because it is supported across platforms.

  5. It supports non-printable characters.

  6. It reduces storage space, as the format needs less space to store information.

  7. The format is flexible.

  8. It is adaptable.

  9. It represents information according to standards and helps improve data access efficiency.

  10. No programming is required, as it is a schema-driven language.

  11. It is easy to archive and retrieve.
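To make these points concrete, the sketch below shows one way an XML news item could be parsed so that its text and labels become available to a classifier; the element names and the sample document are hypothetical and only illustrate reading structured content.

```python
# Hypothetical XML news item; element names are illustrative only.
import xml.etree.ElementTree as ET

sample = """
<newsitem id="42">
  <title>BankAmerica files equity offering</title>
  <topics><topic>acquisition</topic></topics>
  <body>Some analysts recommended delaying the offering.</body>
</newsitem>
"""

root = ET.fromstring(sample)
title = root.findtext("title", default="")
body = root.findtext("body", default="")
topics = [t.text for t in root.findall("./topics/topic")]

# The concatenated text can be handed to the pre-processing and
# tf-idf steps described later, with the topics as class labels.
print(topics, "->", (title + " " + body)[:60])
```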

The classification tree method is a test design method used in different areas of software development. It was developed by Grimm and Grochtmann in 1993. The classification trees of this method should not be confused with decision trees.

The classification tree method has two steps:

  1. The test-relevant aspects (so-called classifications) and their corresponding values (classes) are identified.

  2. Test cases are formed by combining classes from different classifications, as sketched below.
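As a hedged illustration of these two steps, the sketch below picks a few test-relevant classifications with their classes and enumerates the class combinations as candidate test cases; the aspects and values are invented for illustration and are not taken from the original method.

```python
# Classification tree method, step by step (illustrative aspects/classes):
# 1. identify test-relevant aspects ("classifications") and their classes,
# 2. combine classes from different classifications into test cases.
from itertools import product

classifications = {
    "document_length": ["short", "long"],
    "markup": ["plain_text", "xml"],
    "language": ["english", "other"],
}

test_cases = [dict(zip(classifications, combo))
              for combo in product(*classifications.values())]

for case in test_cases:
    print(case)   # each combination is one candidate test case
```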

The simplicity and interpretability of tree-based classification can be characterized as follows (Buja and Lee 2001): (1) in contrast to conventional statistics, data mining is not concerned with modelling the data globally; it involves searching for interesting parts of the data. (2) Superior performance is usually judged by global measures such as residual sums of squares, misclassification rates and their out-of-sample versions. (3) Simplicity and interpretability are aspects of trees that are not easily quantifiable, although experiments have been tried; finding quantifications of these intuitive notions is left as an open problem, and simplicity is not identical to tree size, as the example discussed there as a corollary shows. (4) The splitting criteria proposed there tend to generate unbalanced trees; it can be argued that more balanced trees are perceived as more interpretable, and that interpretability is an independent concern.

This work proposes an optimized tree-based association rule (TAR) approach to improve XML document classification. A tree pruning technique to prune weak/infrequent rules and a binary particle swarm optimization (BPSO) method to optimize tree construction are proposed. The method is evaluated on the Reuters dataset. The rest of the paper is organized as follows. The literature survey on document classification is discussed in Sect. 2. The methods used and their workflow are presented in Sect. 3, the results are discussed in Sect. 4, and the conclusion of the paper is given in Sect. 5.

2 Literature survey

De Vries and Geva (2009) presented, discussed and analysed an approach to the XML mining track. A new link representation was introduced, extended and analysed, and combined with text to improve classification performance. The K-tree algorithm was applied to document clustering for the first time; results show that it suits the task, producing quality clustering solutions with excellent performance.

Hofmann et al. (2009) analysed key phrase extraction from scientific documents using features based on document structure. The experiments were performed on a new corpus of scientific documents that is larger than corpora previously used for the same task. The features were modelled probabilistically and evaluated with ranking-based evaluation measures. A generic section structure was derived from the XML markup and used as a feature, and current candidate selection approaches were evaluated on the new corpus.

Lan and Qiao-Mei (2009) introduced and expatiated on the conception and characteristics of web-based data mining. Because most websites are still built with HTML, web mining struggles to achieve high efficiency and accuracy. The appearance of XML has brought convenience: it turns the semi-structured data underlying web mining research into well-structured information. The basic data mining functions and a model of a web mining system facing multiple data sources, together with the associated data mining problems, were analysed.

Marks et al. (2010) proposed an XML document partitioning approach whose partitions create an index optimized for XPath's hierarchical axes. The major difference from approaches that determine partition sizes in advance is that the document does not have to be analysed beforehand. Algorithms create partitions dynamically based on document characteristics, e.g. structure/node layout, which ensures a fully automated process for creating the partition index. A compacted partition index is used for further optimization of the classification process. Because all partitions of equivalent structure generate identical results in query processing, only a representative partition (branch class) is needed. Experiments then demonstrated the overall optimization gains.

Bächle et al. (2009) explained the realization of fine-grained concurrency control for XML. Tailor-made lock protocols enable fine-grained transactions on XML document trees, with the emphasis on suitable isolation of XML documents. The protocols rely on prefix-based node labelling schemes and lock management information. They are integrated into a layered architecture so that different storage models and indexes can be incorporated. The resulting performance is evaluated at later stages such as implementation and classification. The runtime behaviour of the lock protocols is explained, the lock manager architecture is optimized to the needs, and various ways of adapting to the data are presented.

Chagheri and Dumoulin (2009) proposed a model for the conceptual indexing of structured documents based on a semantic resource, where each document is separated into logical elements forming a tree. Candidate terms identified in every logical element through basic text analysis are projected onto WordNet. A weight computation formula was also proposed that takes into account the structure in which the terms appear. The concepts that enrich the document index are obtained from a semantic resource.

Nyberg et al. (2010) proposed extending the traditional bag-of-words classifier model with relational information types, improving accuracy from 70 to 74% on a Finnish National Archive classification task. The general Finnish Ontology YSO was used as background information; in the experimental data, the documents themselves have no explicit relations to each other. The two major relational information types utilized in automatic document classification are 1. relations between terms, such as ontologies, and 2. relations between documents, such as web links or citations in articles.

Huang et al. (2009) presented a new approach that uses Wikipedia concepts to measure semantic relatedness between documents and extends document clustering with this similarity measure. Results on two datasets prove the effectiveness of the bag-of-concepts (BOC) model and the enriched document similarity measure. The BOC model is compared with LSI and independent component analysis (ICA), which encode the feature space for the clustering approach. Clustering is performed on the semantically enriched documents instead of the original term representation, and the role of the latent semantic structures is investigated on the clusters derived from the dataset.

Phan et al. (2011) presented a general framework for building classification and matching/ranking models for short and sparse text/web data using hidden topics discovered from large-scale external data collections. The framework focuses on the major problems in processing such data: data sparseness and synonym/homonym problems. The former leads to a lack of shared words and contexts among documents and thus lowers classification accuracy, while the latter poses linguistic obstacles in natural language processing (NLP) and information retrieval (IR). Universal datasets that include external knowledge make the short documents less sparse and more topic-oriented.

Giunchiglia et al. (2009) introduced the analytico-synthetic approach established in Library Science, whose methodology has been used successfully for the classification of books. Faceted lightweight ontologies with a well-defined structure are easy to create and share among users, and they provide more organized inputs to semantics-based applications such as semantic search and navigation.

Hu et al. (2009) presented new text clustering methods that address both issues by enriching the document representation with Wikipedia concept and category information. Two approaches, exact match and relatedness match, are used to map the text documents to Wikipedia concepts and Wikipedia categories. The text documents are then clustered using a similarity metric that combines document content information, concept information and category information. Experiments with this clustering framework on three datasets (20-newsgroup, TDT2 and LA Times) reveal that the clustering approaches are improved significantly by enriching the document representation with Wikipedia concepts and categories.

Yessenalina et al. (2010) presented latent variable structured models for document sentiment classification that do not rely on sentence-level annotations but instead directly optimize document-level accuracy, treating the sentence level as latent. Experiments on two standard sentiment analysis datasets reveal improvements in performance over earlier results. The experiments also show that proper training requires a good initial guess, and that feature smoothing of the extraction subtask is one way to mitigate the risk of suppressing information.

Gabrilovich and Markovitch (2009) proposed explicit semantic analysis, a semantic interpretation method for natural language processing in which computers are given access to the global knowledge contained in Wikipedia. The meaning of a text is elaborated in a high-dimensional space of knowledge-based concepts that correspond to Wikipedia articles. The semantic interpreter is built in a fully automatic way and taps into the collective knowledge of tens and hundreds of thousands of people. Because the concept-based representation carries information that cannot be deduced from the input text alone, it supersedes the conventional bag-of-words representation.

Sokolova and Lapalme (2009) presented a systematic analysis of 24 performance measures covering the complete spectrum of machine learning classification tasks: binary, multi-class, multi-labelled and hierarchical. The analysis concentrates on how each measure behaves when the confusion matrix of a classification task changes in ways related to specific data characteristics. A measure is invariant when the classifier's evaluation does not change under such a change of the confusion matrix. The resulting taxonomy of measure invariance relates the measures to the relevant label distribution changes in classification problems.

Salles et al. (2010) discussed the impact of temporal effects on automatic document classification (ADC) and proposed two instance weighting approaches leading to new strategies for more accurate classification. The temporal weighting function (TWF) methodology captures changes in term-class relationships over a given period of time. On a real dataset, the TWF was shown to follow a lognormal distribution whose parameters were tuned using statistical methods. The TWF was incorporated in two ways, on documents and on scores, into three conventional classifiers: Rocchio, KNN and Naive Bayes.

Power et al. (2010) proposed a simple feature extraction algorithm that achieves high document classification accuracy in the context of development-centric topics. The algorithm exploits two aspects of such topics: features specific to several regions, owing to their local language and cultural underpinnings, and authoritative pages. To describe a topic, the algorithm combines two separate metrics for extracting features. Although each feature subset on its own is insufficient for high classification accuracy, a simple joint classifier over the combined features achieves it.

Vila et al. (2011) studied the behaviour of different Tsallis entropy-based similarity measures applied to document processing. Three categories of mutual information generalization were tested: the difference between entropy and conditional entropy, the Jensen–Tsallis divergence, and the ratio with the Tsallis joint entropy, the last being based on the Kullback–Leibler distance. The overall performance of these measures was evaluated in document classification and registration contexts.

Michalopoulos and Mavridis (2011) proposed a document classification approach that creates patterns from data drawn from real dialogues. Depending on the classification results, a decision-making method generates the appropriate warning signal. A comparative evaluation was conducted on seven document classification algorithms, and the best ranked algorithm was applied in the decision-making method.

Bekkerman and Gavish (2011) proposed phrase-based classification (PBC) for large-scale text collections of over 100 M instances. The PBC method achieved high precision (95%) with reasonable coverage (80%, improvable). Once crowdsourcing was used to characterize the data classes, PBC proved feasible on data of virtually any size, despite the large annotation task imposed by natural language. The successful evaluation of the deployed PBC system showed that development and maintenance costs are low, and the classification framework can be directly applied to other tasks.

3 Materials and methods

An optimized tree-based association rule (TAR) method is proposed to improve XML document classification, based on binary particle swarm optimization (BPSO) and a tree pruning technique that prunes weak/infrequent rules. The method was evaluated on the Reuters dataset (Fig. 1).

Fig. 1 Flowchart of the proposed system

3.1 Reuters dataset

The evaluation uses the Reuters-21578 dataset, whose news documents were manually labelled by Reuters personnel; four class labels are used, drawn from the dataset's category classes such as "people", "places" and "topics". Some instances from the dataset are as follows:

Some analysts have recommended that BankAmerica delay its equity offering of up to one billion dollars, which is yet to be approved by the Securities and Exchange Commission.

BankAmerica stock fell this week on the news that Brazil suspended interest payments on a large portion of its foreign debt, along with other banking issues.

On January 26, BankAmerica filed the offer. On February 9, First Interstate withdrew its takeover bid, and the offering was seen as a major factor leading to the withdrawal.

3.2 Pre-processing

In the pre-processing stage of the framework, a word is an uninterrupted sequence of characters in which (a..z) denotes letters, (0..9) denotes digits and @ and _ denote special characters. A phrase may contain many words; for example, the phrase "la machine IBM-360" counts as four words. Words are delimited by spaces and punctuation symbols. Stop words are non-significant words and are removed from the document before indexing; such words serve no purpose in information retrieval and are collected in a general stop word list. There are two reasons to use a stop word list (Savoy 1999).

First, matching queries and documents should be based on good indexing terms for best information retrieval. For example, retrieving documents using words such as "the" or "be" is not a good strategy: such non-significant words add noise and reduce the number of relevant documents, degrading retrieval performance.

Second, removing stop words reduces the file size by 30 to 50%. Most search engines do not account for stop words during search, in order to increase speed and reduce space.

In many European languages, including Italian, it is quite normal to form variant words by adding suffixes to the end of a root word. Stemming removes inflectional/derivational suffixes to conflate variant word forms to the same stem/root. For example, the words "thinks", "thinkers" and "thinking" are all reduced to the word "think" for the indexing process. Mapping words with the same root to a single index term improves document retrieval and is a success factor for classification.
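The following sketch strings the pre-processing steps together: tokenization into words as defined above, stop word removal and Porter stemming. NLTK's English stop word list and stemmer are assumptions standing in for whichever resources the original system used.

```python
# Pre-processing sketch: tokenize, drop stop words, stem to root forms.
# NLTK's English stop word list and Porter stemmer are assumptions;
# any equivalent list/stemmer could be substituted.
import re
from nltk.corpus import stopwords      # requires: nltk.download("stopwords")
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    # words are runs of letters, digits, '@' and '_' (see Sect. 3.2)
    tokens = re.findall(r"[a-z0-9@_]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP]

print(preprocess("Thinkers are thinking about the thinks of retrieval."))
# -> ['thinker', 'think', 'think', 'retriev']
```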

3.3 Term frequency

Let D denote the set of documents and T the set of terms. Terms are chosen after removing stop words and completing the stemming process. The importance of a term ti in a document d of the document list D is captured by the term frequency-inverse document frequency, denoted tf-idf (Salton 1989):

$$ \begin{aligned} & tf\text{-}idf(i,d) = tf(i,d) \cdot idf(i) \\ & idf(i) = \log \frac{N}{df(i)} \\ \end{aligned} $$

where \( tf(i,d) \) is the frequency of term ti in document d; N is the total number of documents; and \( df(i) \) is the number of documents containing term ti.

Thus, tf-idf evaluates the importance of a term to a document within a collection of documents.
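As a minimal sketch of the weighting just defined, the following code computes tf-idf directly from the formula; the toy token lists are invented for illustration.

```python
# Direct implementation of tf-idf(i, d) = tf(i, d) * log(N / df(i)).
import math
from collections import Counter

docs = [
    ["bank", "offer", "bank", "equity"],
    ["grain", "harvest", "offer"],
    ["bank", "debt", "brazil"],
]                                   # already pre-processed token lists
N = len(docs)
df = Counter(term for d in docs for term in set(d))   # document frequency

def tf_idf(term, doc):
    tf = doc.count(term)
    return tf * math.log(N / df[term]) if df[term] else 0.0

print(round(tf_idf("bank", docs[0]), 3))    # frequent term, low idf
print(round(tf_idf("equity", docs[0]), 3))  # rare term, higher idf
```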

3.4 Tree pruning

Semantic graphs are integrated into TAR mining to obtain a semantic tree-based association rule (TAR). The new work has the following steps:

  • Step 1: Collect the Reuters dataset.

  • Step 2: Remove stop words and perform stemming.

  • Step 3: Compute the TF-IDF.

  • Step 4: Construct the TAR from frequent terms based on semantic appropriateness and minimum support.

  • Step 5: Measure the performance of the method using different queries.

Given a rule set R, a rule in R is said to be (maximally) strong if there is no other rule in R that is stronger; otherwise, the rule is weak. The set of all strong rules is called the optimal class association rule set (Li et al. 2002). Weak and infrequent rules are pruned, as in the sketch following the algorithm figure.

figure a
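The pruning algorithm itself is only available here as a figure, so the following is a hedged sketch of the rule filtering just described: rules below a minimum support threshold, or dominated by a stronger rule (same class, more general antecedent, at least equal confidence), are discarded. The rule representation and thresholds are simplifying assumptions, not the exact procedure of the paper.

```python
# Hedged sketch of pruning weak/infrequent class association rules.
# A rule is kept only if it meets minimum support and no other rule with
# the same class, a more general antecedent and at least equal confidence
# exists (that other rule is "stronger").  Representation is an assumption.
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    antecedent: frozenset   # set of terms
    label: str              # predicted class
    support: float
    confidence: float

def is_stronger(a: Rule, b: Rule) -> bool:
    """True if rule a is stronger than rule b."""
    return (a.label == b.label
            and a.antecedent < b.antecedent       # more general condition
            and a.confidence >= b.confidence)

def prune(rules, min_support=0.01):
    frequent = [r for r in rules if r.support >= min_support]
    return [r for r in frequent
            if not any(is_stronger(other, r) for other in frequent)]

rules = [
    Rule(frozenset({"bank"}), "acq", 0.05, 0.80),
    Rule(frozenset({"bank", "equity"}), "acq", 0.02, 0.75),  # dominated
    Rule(frozenset({"grain"}), "grain", 0.005, 0.90),        # infrequent
]
print(prune(rules))   # only the first rule survives
```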

3.4.1 Binary particle swarm optimization (BPSO)

Kennedy and Eberhart proposed the discrete binary PSO model for binary problems, in which a particle decides between "yes" or "no", "true" or "false", "include" or "not include", etc. Real-valued velocities are mapped into this binary search space. PSO is an evolutionary technique in the family of evolutionary programming, other evolutionary methods and genetic programming. It is sociologically inspired, since the algorithm is based on the sociological behaviour associated with bird flocking. It is a population-based evolutionary algorithm and, much like the other population-based evolutionary algorithms, PSO attains a solution from among a population of random individuals. It is assumed that, for all i and j, the individual likelihood Kij that the ith operation is carried out when the jth service is assigned to it is known. The swarm appears as a disorganized community of moving particles that tend to cluster together, while each particle seems to move in an arbitrary direction. In binary PSO, the particle's personal best and global best are updated as in the continuous version (Khanesar et al. 2007). The difference between binary PSO and the continuous version is that the particle velocities are instead defined in terms of the probability that a bit will change to one. Under this definition, the velocity must be restricted to the range [0, 1], so a normalization function maps all real velocity values into [0, 1]. The sigmoid-based normalization function used here is as follows:

$$ v_{ij}^{'} (t) = {\text{sig}}(v_{ij} (t)) = \frac{1}{{1 + e^{{ - v_{ij} (t)}} }}. $$

The particle's velocity vector is updated by the usual velocity update equation, and the particle's new position is obtained using the following equation:

$$ x_{ij} (t + 1) = \left\{ {\begin{array}{*{20}l} 1 \hfill & {{\text{if}}\;r_{ij} < {\text{sig}}(v_{ij} (t + 1))} \hfill \\ 0 \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. $$

where rij is a uniform random number in the range [0, 1]. After the BPSO process is over, the examples in the test set are classified as follows. Each example is pushed down the decision tree until it reaches a leaf node (Carvalho and Freitas 2004). If the leaf node is a large disjunct, the example is assigned to the class predicted by that leaf node; otherwise (a small disjunct), the example is assigned the class of one of the small-disjunct rules discovered by BPSO. Thus, the best fitness function value is attained from the optimization process, and the classification of the semi-structured data is carried out using this best optimal value.
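As a compact, hedged sketch of the binary PSO update described above (and of the sigmoid mapping in the preceding equations), the following code updates velocities as in continuous PSO, squashes them through the sigmoid and resamples each bit against that probability. The swarm size, inertia weight, acceleration constants and the placeholder fitness function are illustrative assumptions rather than the actual settings of the proposed system.

```python
# Binary PSO sketch: velocity update as in continuous PSO, positions
# resampled as bits with probability sigmoid(velocity).
# Fitness here is a placeholder (count of selected bits); in the paper's
# setting it would score a candidate rule/feature selection.
import numpy as np

rng = np.random.default_rng(0)
n_particles, n_bits, iters = 10, 16, 50
w, c1, c2 = 0.7, 1.5, 1.5            # inertia and acceleration (assumed)

def fitness(bits):                    # placeholder objective
    return bits.sum()

x = rng.integers(0, 2, (n_particles, n_bits))
v = np.zeros((n_particles, n_bits))
pbest = x.copy()
pbest_fit = np.array([fitness(p) for p in x])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(iters):
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    prob = 1.0 / (1.0 + np.exp(-v))               # sigmoid normalization
    x = (rng.random(x.shape) < prob).astype(int)  # bit flips by probability
    fit = np.array([fitness(p) for p in x])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = x[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print("best fitness:", pbest_fit.max(), "best bits:", gbest)
```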

4 Experimental results

The experiments were performed by querying extensionally and extracting intensional answers from the Reuters and synthetic datasets. Precision and recall were calculated for four different queries: Q1 is acquisition, Q2 is grain, Q3 is crude oil and Q4 is earnings. The results of these four queries on the Reuters and synthetic datasets are shown in the following figures.
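For reference, the precision and recall values reported below can be computed per query as in the following minimal sketch; the retrieved and relevant document sets are invented solely to show the calculation.

```python
# Precision and recall per query: fraction of retrieved documents that are
# relevant, and fraction of relevant documents that were retrieved.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example for one query (document ids are invented):
retrieved = [1, 2, 3, 5, 8]
relevant = [2, 3, 4, 8, 9, 10]
print(precision_recall(retrieved, relevant))   # (0.6, 0.5)
```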

4.1 Reuters dataset

Figure 2 shows that, for Q1, BPSO with pruning improves average precision by 4.08% compared with the method without pruning.

Fig. 2 Precision for Q1

Figure 3 shows that, for Q2, BPSO with pruning improves average precision by 2.46% compared with the method without pruning.

Fig. 3 Precision for Q2

Figure 4 shows that, for Q3, BPSO with pruning improves precision by 3.11% compared with the method without pruning.

Fig. 4 Precision for Q3

Figure 5 illustrates that, for Q4, BPSO with pruning improves average precision by 3.45% compared with the method without pruning.

Fig. 5 Precision for Q4

Figure 6 shows that, for Q1, BPSO with pruning improves average recall by 3.07% compared with the method without pruning.

Fig. 6 Recall for Q1

Figure 7 shows that, for Q2, BPSO with pruning improves average recall by 3.75% compared with the method without pruning.

Fig. 7 Recall for Q2

Figure 8 shows that, for Q3, BPSO with pruning improves average recall by 3.1% compared with the method without pruning.

Fig. 8 Recall for Q3

Figure 9 shows that, for Q4, BPSO with pruning improves average recall by 3.85% compared with the method without pruning.

Fig. 9 Recall for Q4

4.2 Synthetic dataset

Figure 10 shows that, for Q1, BPSO with pruning improves average precision by 3.1% compared with the method without pruning.

Fig. 10 Precision for Q1

Figure 11 shows that, for Q2, BPSO with pruning improves average precision by 2.62% compared with the method without pruning.

Fig. 11 Precision for Q2

Figure 12 shows that, for Q3, BPSO with pruning improves average precision by 3.23% compared with the method without pruning.

Fig. 12 Precision for Q3

Figure 13 shows that, for Q4, BPSO with pruning improves average precision by 3.71% compared with the method without pruning.

Fig. 13 Precision for Q4

Figure 14 shows that, for Q1, the proposed BPSO with pruning improves average recall by 8.98% compared with the method without pruning.

Fig. 14 Recall for Q1

Figure 15 shows that, for Q2, the proposed BPSO with pruning improves average recall by 7.43% compared with the method without pruning.

Fig. 15 Recall for Q2

Figure 16 shows that, for Q3, the proposed BPSO with pruning improves average recall by 8.56% compared with the method without pruning.

Fig. 16 Recall for Q3

Figure 17 shows that, for Q4, the proposed BPSO with pruning improves average recall by 10.34% compared with the method without pruning.

Fig. 17 Recall for Q4

5 Conclusion

Tree-based classification is an attractive way to apply document classification to the rich and readily growing volumes of data found in various applications. In the proposed process, the pre-processing stage removes unwanted stop words and applies stemming to replace each word with its root. For TAR construction, features are extracted using the term frequency measure. The new BPSO method optimizes the TAR; as in the continuous version, each particle's personal best and global best are updated, the difference being that in BPSO the particle velocities are defined in terms of the probability that a bit will change to one. The improved XML query structure tree then helps to find the related documents. The method was validated on the Reuters dataset and a synthetic dataset, and the proposed method outperforms the others in terms of the precision and recall measurements.