Attribute Overlap Minimization and Outlier Elimination as Dimensionality Reduction Techniques for Text Classification Algorithms
Simon Fong1 and Antonio Cerone2
1. Department of Computer and Information Science, University of Macau, Macau SAR
2. International Institute for Software Technology, United Nations University, Macau SAR
Abstract—Text classification is the task of assigning free-text documents to predefined groups. Many algorithms have been proposed for it; in particular, dimensionality reduction (DR), an important data pre-processing step, has been studied extensively. DR reduces the feature representation space, which in turn improves the efficiency of text classification. In this paper, two DR methods, Attribute Overlap Minimization (AOM) and Outlier Elimination (OE), are applied to downsize the feature representation space, in the number of attributes and the number of instances respectively, prior to training a decision model for text classification. AOM works by reassigning each overlapped attribute (attributes are also known as features or keywords) to the group in which it occurs more frequently. Dimensionality is lowered because each group is then described only by significant attributes that are unique to it. OE eliminates instances that describe infrequent attributes. These two DR techniques can be used together with conventional feature selection to further enhance their effectiveness. Two datasets, one on classifying languages and one on categorizing online news into six emotion groups, are tested with combinations of AOM, OE and a wide range of classification algorithms. Significant improvements in prediction accuracy, tree size and speed are observed.
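The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `(tokens, group)` input format, the alphabetical tie-breaking rule, and the `min_count` threshold are all assumptions made for illustration. In particular, the abstract does not say how "infrequent" is defined for OE, so the sketch drops an instance only when every attribute it contains occurs fewer than `min_count` times in the whole collection.

```python
from collections import Counter, defaultdict


def attribute_overlap_minimization(docs):
    """Assign every attribute to exactly one group (hypothetical helper).

    docs: list of (tokens, group) pairs.
    Returns a dict mapping each group to the set of attributes it keeps.
    An attribute seen in several groups goes to the group where it
    occurs most often; ties go to the alphabetically first group
    (an assumption, the abstract does not specify tie handling).
    """
    freq = defaultdict(Counter)  # attribute -> Counter of group -> occurrences
    for tokens, group in docs:
        for tok in tokens:
            freq[tok][group] += 1

    assignment = defaultdict(set)
    for tok, counts in freq.items():
        # max() returns the first maximal item, so sorting fixes the tie order
        best_group = max(sorted(counts), key=lambda g: counts[g])
        assignment[best_group].add(tok)
    return dict(assignment)


def outlier_elimination(docs, min_count=2):
    """Drop instances made up only of infrequent attributes (hypothetical helper).

    min_count is an assumed threshold: an attribute is 'infrequent' when it
    occurs fewer than min_count times across all documents.
    """
    total = Counter(tok for tokens, _ in docs for tok in tokens)
    return [
        (tokens, group)
        for tokens, group in docs
        if any(total[tok] >= min_count for tok in tokens)
    ]
```

With a toy collection in which "chat" appears once in an English document and twice in French documents, `attribute_overlap_minimization` keeps "chat" only under the French group, and `outlier_elimination` removes a document whose only token appears nowhere else. A classifier would then be trained on the reduced attribute sets and the remaining instances.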
Index Terms—data stream mining, optimized very fast decision tree, incremental optimization
Cite: Simon Fong and Antonio Cerone, "Attribute Overlap Minimization and Outlier Elimination as Dimensionality Reduction Techniques for Text Classification Algorithms," Journal of Emerging Technologies in Web Intelligence, Vol. 4, No. 3, pp. 259-263, August 2012. doi:10.4304/jetwi.4.3.259-263