A Novel Text Classification Method Using Comprehensive Feature Weight

Because the categorical distribution of short text corpora is unbalanced, it is difficult to obtain accurate results for short text classification. To address this problem, this paper proposes a novel short text classification method using comprehensive feature weights. The method takes into account the distribution of samples over the positive and negative categories, as well as the category correlation of words, in order to improve the existing feature weight calculation methods and obtain a new comprehensive feature weight. Experimental results show that the proposed method achieves significantly higher micro- and macro-average values than other feature weight methods, indicating that it can greatly improve the precision and recall of short text classification.

Most existing evaluations measure the overall effectiveness of short text classification, which cannot demonstrate the effectiveness of the feature extraction method on its own. Furthermore, we take into account the distribution of samples over the positive and negative categories, as well as the category correlation of words, improving the existing feature weight calculation methods to obtain a new comprehensive feature weight.

Feature Weight of Text
To convert text into a form a computer can process, each text is represented as a vector that can be analyzed and calculated; vectorization is the basis of text processing. Each word in a text is given a weight that reflects its importance to characterizing the text: the greater the weight, the more important the word is to the text.
In the information retrieval field, term weight calculation methods fall into two categories: unsupervised methods such as tf (term frequency) and tf-idf (term frequency-inverse document frequency), and supervised methods such as information gain and odds ratio. Supervised term weighting mainly borrows from feature selection algorithms: terms are given different values that measure their contribution to text classification. However, these term weighting methods were designed mainly for long text and cannot be applied directly to short text.
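As a reference point for the unsupervised methods mentioned above, a minimal tf-idf sketch may be helpful (this is the standard baseline, not the method proposed in this paper; the tokenized toy documents are illustrative):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights for each tokenized document.

    docs: list of token lists. Returns one {term: weight} dict per document.
    """
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["short", "text", "text"], ["short", "corpus"]]
w = tf_idf(docs)
# "short" occurs in every document, so its idf (hence its weight) is 0;
# "text" occurs only in the first document, so it gets a positive weight.
```

This illustrates why tf-idf alone is blind to category membership: it only measures spread over documents, not spread over the positive and negative categories.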
This is due to a prominent feature of short text corpora: the sample distribution is very uneven, that is, the number of samples in some categories is much larger than in others, so that small-category texts are submerged in a large number of documents of other types and are difficult to identify. Yet in the massive text data to be processed, the data the system really cares about is often only a small part; for example, in network public opinion analysis and in the detection and tracking of hot or sensitive topics, the valuable data is only a small proportion of the practical environment. We refer to the category with few samples as the positive category and the category with many samples as the negative category.
However, the existing feature weight calculation methods treat all categories in the same way. When the traditional long text feature weighting methods are applied to short text, the classification result is biased toward the negative category while the positive category is ignored. The problem is particularly evident in short text classification performance: precision and recall remain too low to meet the needs of practical applications.

Comprehensive Category Feature Selection Method
In this paper, the current category of the short text dataset is regarded as the PC (Positive Category), and all other categories together are regarded as the NC (Negative Category). The related elements are shown in Table 1. The method rests on three observations: 1) for a given term s_i, if it occurs in many texts, that is, if its document frequency is high, it has strong expressive ability; 2) when the frequency of s_i in the positive category is higher than that in the negative category, s_i has good classification ability; 3) when the ratio between the frequency of s_i in the positive category and its frequency in the negative category is high, the relevance frequency between s_i and the text category is high.
According to the above analysis, this paper proposes a feature selection method for short text. Since the method takes into account the distribution of samples over the positive and negative categories, it is named IC (Integrate Category) and is calculated as

IC(s_i, C_j) = df(s_i) × rf(s_i) × icf(s_i) (1)

where df is the document frequency: for a given term s_i, df = tP, and tP is the number of samples in the positive category that contain s_i. rf is the relevance frequency, rf = log(2 + tP / max(1, tN)), where tN is the number of samples in the negative category that contain s_i; rf grows with the frequency of s_i in the positive category and shrinks with its frequency in the negative category, so the bigger the rf value, the stronger the correlation between the term and the category. icf is the inverse category frequency, icf = log(|C| / c_f), where |C| is the total number of categories and c_f is the number of categories that contain the term s_i.
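The components df, rf and icf described above can be sketched in code. The multiplicative combination and the exact forms of rf and icf used here are assumptions modeled on standard supervised term weighting, not necessarily the paper's exact formula:

```python
import math

def ic_score(tP, tN, total_categories, cf):
    """Sketch of an Integrate Category (IC) score for one term and one
    positive category.

    tP: number of positive-category samples containing the term
    tN: number of negative-category samples containing the term
    total_categories: |C|, total number of categories in the corpus
    cf: number of categories that contain the term
    Assumes df * rf * icf as the combination (an assumption, see lead-in).
    """
    df = tP                                # document frequency in the positive category
    rf = math.log(2 + tP / max(1, tN))     # relevance frequency
    icf = math.log(total_categories / cf)  # inverse category frequency
    return df * rf * icf

# A term concentrated in the positive category scores much higher than a
# term spread evenly between positive and negative samples.
focused = ic_score(tP=30, tN=2, total_categories=20, cf=3)
diffuse = ic_score(tP=30, tN=30, total_categories=20, cf=15)
```

Note how both rf and icf shrink toward zero for terms that are spread evenly, which is exactly the behavior the analysis above asks for.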
Given a certain number of positive samples, the sample distribution of a dataset can be roughly divided into three conditions: 1) the number of negative samples is larger than the number of positive samples and is distributed fairly evenly across categories; 2) the number of negative samples is smaller than the number of positive samples and is distributed fairly evenly across categories; 3) the number of negative samples is smaller than the number of positive samples and the distribution is unbalanced, i.e. concentrated in a small number of categories. To illustrate these relations, Table 2 simulates the distribution of positive and negative samples in a short text dataset with 20 categories, 200 short texts and 3 simple example terms. As can be seen from Table 2, the samples containing s_2 and the samples containing s_3 are distributed very differently over the positive and negative categories: the samples containing s_3 lie mostly in the negative category, while the proportion of samples containing s_2 in the negative category is small, yet the two terms receive the same value. After taking the relevance frequency between the positive and negative categories into consideration, the value of s_2 increases significantly while the value of s_3 increases only slightly, so the gap between s_2 and s_3 widens. The frequency of s_1 in the positive category is similar to that of s_2 and s_3, but its distribution range in the negative category is small, being concentrated in one category; therefore the value of s_1 is slightly larger than that of s_3 and lies between those of s_2 and s_3, which is consistent with the previous analysis.
The global feature weight of a feature s for the entire corpus is shown in formula (2):

IC_max(s) = max_{1 ≤ j ≤ |C|} IC(s, C_j) (2)

The advantage of this method is that it takes into account both the distribution of terms over individual samples and the category information of the text. Evaluating the relevance of terms in the positive and negative categories makes the feature words more discriminative on imbalanced datasets, while selecting the maximum value over all categories yields the feature with the highest category discrimination.
The algorithm is as follows: 1) pre-process the documents in the training corpus (word segmentation, stop-word removal); 2) calculate IC(s, C_j) for each feature term and each category; 3) calculate IC_max(s) over all categories based on the results of step 2; 4) sort the IC_max values in descending order and keep the first M terms as feature words, where M is the dimension of the feature space. In this paper, M is 820.

Results and Analysis
To verify the performance of the short text feature weighting algorithm, the K-Nearest Neighbor (KNN) algorithm is adopted to classify the short texts. KNN is a traditional pattern recognition algorithm: it is simple and intuitive, its classification accuracy is high, and new training texts require no retraining, which reduces training time, so it is widely used in automatic text classification. K is the number of nearest neighbor samples considered. We test odd values of K in the range [3, 35] and use the results of the optimal K value for comparison.
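The setup above (majority-vote KNN plus an odd-K search over [3, 35]) might be sketched as follows; the Euclidean distance and the toy vectors are illustrative assumptions, since the paper does not specify a distance measure:

```python
from collections import Counter

def knn_predict(train, labels, x, k):
    """Classify vector x by majority vote among its k nearest training
    vectors (squared Euclidean distance); a minimal KNN sketch."""
    nearest = sorted(range(len(train)),
                     key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], x)))
    votes = Counter(labels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

# Search odd K in [3, 35] and keep the value with the best validation
# accuracy, as in the experimental setup (toy 2-D data for illustration).
train = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]
val = [((0.05, 0.05), "neg"), ((1.05, 1.0), "pos")]
best_k = max(range(3, 36, 2),
             key=lambda k: sum(knn_predict(train, labels, x, k) == y for x, y in val))
```

In a real run the vectors would be the IC-weighted feature vectors of the micro-blogs rather than toy points.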

Experimental Design
A multi-class classification problem can be decomposed into two-class problems, so this paper focuses on improving the classification performance for the positive and negative categories. Since there is currently no general short text dataset, this paper uses the Sina micro-blog API to crawl micro-blog data as the text corpus. We select 8478 micro-blogs, each with more than 6 words and an average text length of 42. We take about 100 micro-blogs about the 2015 Chinese Super League as the positive category and 4236 micro-blogs about traffic jams as the negative category.
Since the samples of the dataset are very unevenly distributed, we use 5-fold cross-validation to compare the classification results: the dataset is split into 5 folds, the classifier is trained on 4 folds and evaluated on the remaining fold, each fold is used as the validation fold in turn, and the performance is averaged. This avoids the contingency of experimental results and ensures that the training and testing data have no intersection, so the results obtained can be considered credible.
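The 5-fold procedure described above can be sketched as follows (the interleaved fold assignment and the dummy scorer are illustrative choices):

```python
def five_fold_cv(samples, labels, train_and_score, folds=5):
    """Split the data into `folds` parts; train on folds-1, evaluate on the
    held-out fold, rotate so every fold is used once for validation, and
    average the scores. Training and test data never intersect."""
    scores = []
    for i in range(folds):
        test_idx = set(range(i, len(samples), folds))  # every folds-th sample
        train = [(samples[j], labels[j]) for j in range(len(samples)) if j not in test_idx]
        test = [(samples[j], labels[j]) for j in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / folds

# Illustrative use with a dummy scorer that just reports the test fold size:
samples = list(range(10))
labels = [0] * 10
avg = five_fold_cv(samples, labels, lambda train, test: len(test))
# With 10 samples and 5 folds, every held-out fold has 2 samples.
```

With a severely imbalanced corpus, a stratified split (keeping the positive/negative ratio in every fold) would be a natural refinement; the paper does not say which variant was used.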

Evaluation Method
Three evaluation indices are commonly used for classification performance: precision, recall, and their average. For short text classification with an uneven sample distribution, plain precision and recall ignore the influence of the small positive category PC. Micro-averaging and macro-averaging are two methods for global evaluation of classification results: the micro-average pools the performance indicator over all instance documents, while the macro-average first computes the classification result for each category and then averages over all categories. With the data elements related to the evaluation indices shown in Table 3, the definitions are:

MicroP = Σ_i TP_i / Σ_i (TP_i + FP_i), MicroR = Σ_i TP_i / Σ_i (TP_i + FN_i),
MacroP = (1/|C|) Σ_i P_i, MacroR = (1/|C|) Σ_i R_i,

where MicroP and MicroR are the micro-average precision and recall, and MacroP and MacroR are the macro-average precision and recall.
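Assuming the Table 3 elements are the usual per-category TP/FP/FN counts, the micro- and macro-averaged precision and recall can be sketched as:

```python
def micro_macro(per_class):
    """per_class: {category: (TP, FP, FN)}.
    Returns (MicroP, MicroR, MacroP, MacroR)."""
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    micro_p = tp / (tp + fp)   # pooled over all documents
    micro_r = tp / (tp + fn)
    macro_p = sum(c[0] / (c[0] + c[1]) for c in per_class.values()) / len(per_class)
    macro_r = sum(c[0] / (c[0] + c[2]) for c in per_class.values()) / len(per_class)
    return micro_p, micro_r, macro_p, macro_r

# Hypothetical counts: a small positive category classified poorly drags
# the macro values down much more than the micro values.
stats = {"PC": (5, 5, 5), "NC": (90, 5, 5)}
mp, mr, MP, MR = micro_macro(stats)
```

This is why the macro-average is the more sensitive indicator for the small positive category that this paper cares about.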

Experimental Results and Analysis
In practical applications, the proportion of positive category texts is very small. To compare the effect of the amount of negative category text on the different feature weighting methods, the initial corpus contains 100 positive and 100 negative micro-blog texts, and the number of negative texts is gradually increased until the total reaches 4300. Under the KNN classifier, the MicroA and MacroA values of 6 different feature weight calculation methods are shown in Figure 1 and Figure 2 respectively. As can be seen, as the data size increases, the performance of all the feature weighting methods rises. The reason may be that, for a fixed number of positive samples, when the data size is small the texts of each category are not sufficient to characterize that category, resulting in lower classification accuracy; this improves as the data size grows. The experiments in Figure 1 and Figure 2 also show that, since MacroA is the average over categories and is strongly influenced by small categories, the MacroA values of the different feature weighting methods are generally greater than the MicroA values. Figure 3 shows the average of the best classification results over 5 repeated trials for the 6 feature weight calculation methods. From the figure we can see that supervised term weighting is not always better than unsupervised term weighting: some methods, such as tf, perform less well, and a method that is very popular in the field of text classification is not so excellent in this experiment. The proposed method, however, measures the relationship between terms and categories in terms of whether a term appears in the positive or the negative category, and maintains good classification performance at different data scales.
This is because, in traditional text classification, training sets generally have the following characteristics: the category distribution is balanced, each document represents its category well, and documents of the same category are concentrated in the feature space. In practical applications, however, real text corpora and usage environments often do not satisfy these characteristics, so the existing feature weight calculation methods are not effective on short text.

Conclusion
In current studies of short text classification, feature weights are still computed with traditional long text methods, but the category distribution of short text is often unbalanced, so the traditional methods cannot obtain good classification results. Aiming at this problem, this paper improves the existing feature weight calculation methods by taking the category correlation of terms into consideration when calculating feature weights, and tests the performance of the method on a real corpus. Experimental results show that the method can improve the classification effect to a certain extent, although the classification performance on short text is still not improved greatly. The reason is that the real texts collected from the network include a large number of internet expressions and non-standard usage that the system cannot accurately identify. The next step is to introduce semantic analysis to reduce the effect of word variation on the classification system.