Abstract
Many text classification methods rely on statistical term measures for document representation. Such representations ignore the lexical semantic content of terms and the distilled mutual information, which leads to text classification errors. This work proposes a document representation method, the WordNet-based lexical semantic VSM, to address the problem. Using WordNet, the method constructs a data structure of semantic-element information to characterize lexical semantic content, and adapts EM modeling to disambiguate word stems. Then, in the lexical-semantic space of the corpus, a lexical-semantic eigenvector representing each document is built by calculating the weight of each synset, and is applied to the widely recognized NWKNN algorithm. On the Reuters-21578 text corpus and a lexically replaced version of it, the experimental results show that the lexical-semantic eigenvector outperforms the TF-IDF-based term-statistic eigenvector in both F1 measure and scale of dimension. Because it produces document representation eigenvectors, the method has broad prospects for classification applications in text corpus analysis.
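The contrast between a term-statistic representation and a synset-based one can be illustrated with a minimal sketch. This is not the method of the paper: the lexicon `TOY_SYNSETS`, the synset identifiers, and the helper functions below are hypothetical, the weights are raw counts rather than the synset weights the abstract refers to, and word stems are assumed to be already disambiguated (the EM step is omitted).

```python
import math
from collections import Counter

# Hypothetical toy lexicon mapping a word stem to one synset identifier.
# A real system would look these up in WordNet and disambiguate them.
TOY_SYNSETS = {
    "car": "auto.n.01",
    "automobile": "auto.n.01",
    "buy": "purchase.v.01",
    "purchase": "purchase.v.01",
}


def term_vector(tokens):
    """Bag-of-words vector: one dimension per surface term."""
    return Counter(tokens)


def synset_vector(tokens, lexicon=TOY_SYNSETS):
    """Bag-of-synsets vector: synonyms share one dimension.

    Tokens missing from the lexicon keep their surface form as a dimension.
    """
    return Counter(lexicon.get(token, token) for token in tokens)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


doc_a = ["buy", "car"]
doc_b = ["purchase", "automobile"]

# No shared surface terms, so the term-based similarity is zero.
term_sim = cosine(term_vector(doc_a), term_vector(doc_b))
# Both documents map to the same two synsets, so similarity is close to one.
synset_sim = cosine(synset_vector(doc_a), synset_vector(doc_b))

# Number of dimensions needed to represent both documents in each space.
term_dims = len(set(term_vector(doc_a)) | set(term_vector(doc_b)))
synset_dims = len(set(synset_vector(doc_a)) | set(synset_vector(doc_b)))
```

In this toy case the two documents share no terms but the same synsets, and the synset space needs fewer dimensions than the term space, which is the kind of effect the lexical-replacement experiment and the dimension comparison are meant to measure.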
Additional information
Foundation item: Project(2012AA011205) supported by National High-Tech Research and Development Program (863 Program) of China; Projects(61272150, 61379109, M1321007, 61301136, 61103034) supported by the National Natural Science Foundation of China; Project(20120162110077) supported by Research Fund for the Doctoral Program of Higher Education of China; Project(11JJ1012) supported by Excellent Youth Foundation of Hunan Scientific Committee, China
About this article
Cite this article
Long, J., Wang, Ld., Li, Zd. et al. WordNet-based lexical semantic classification for text corpus analysis. J. Cent. South Univ. 22, 1833–1840 (2015). https://doi.org/10.1007/s11771-015-2702-8
DOI: https://doi.org/10.1007/s11771-015-2702-8