Document classification using term frequency-inverse document frequency and K-means clustering

Al-Obaydy, Wasseem N. Ibrahem; Hashim, Hala A.; Najm, Yassen AbdulKhaleq; Jalal, Ahmed Adeeb

doi:10.11591/ijeecs.v27.i3.pp1517-1524

Published September 1, 2022 | Version v1

Journal article Open

1. Department of Computer Engineering, College of Engineering, Al-Iraqia University, Baghdad, Iraq
2. Department of Dentistry, Dijlah University College, Baghdad, Iraq
3. Department of English, College of Arts, Al-Iraqia University, Baghdad, Iraq

Advances across many fields of study and in information technology have increased the number of published research articles. As a result, researchers face difficulties and spend a significant amount of time locating scientific publications relevant to their domain of expertise. In this article, a document classification approach is presented that clusters the text of research articles into meaningful groups, each covering a similar scientific field. The proposed method was designed around the main focus and scope of the target groups; each group includes several topics. Word tokens were extracted separately from the topics related to each group. The repeated appearance of word tokens in a document affects the document's weight, which is computed using the term frequency-inverse document frequency (TF-IDF) numerical statistic. To perform the categorization, the proposed approach uses the paper's title, abstract, and keywords, as well as the categories' topics. We used the K-means clustering algorithm to cluster the documents into primary categories. The K-means algorithm uses category weights to initialize the cluster centers (centroids). Experimental results show that the proposed technique outperforms the k-nearest neighbors algorithm in terms of information retrieval accuracy.
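The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a smoothed IDF variant, L2-normalised vectors, Euclidean distance, and that each centroid is seeded with the TF-IDF vector of a category's topic text. The function names (`tokenize`, `tfidf_vectors`, `kmeans_seeded`) and the sample category topics and paper titles are hypothetical and were made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphabetic runs only (a deliberate simplification).
    return "".join(c.lower() if c.isalpha() else " " for c in text).split()

def tfidf_vectors(texts):
    """Return (vocab, vectors): one L2-normalised TF-IDF vector per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    # Assumed smoothed IDF: log((1 + n) / (1 + df)) + 1, always positive.
    idf = {w: math.log((1 + n) / (1 + sum(1 for d in docs if w in d))) + 1
           for w in vocab}
    vectors = []
    for d in docs:
        total = sum(d.values()) or 1
        v = [d[w] / total * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vocab, vectors

def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_seeded(points, seeds, max_iter=50):
    """Lloyd's K-means whose initial centroids are the given seed vectors."""
    centroids = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)),
                          key=lambda k: sq_dist(p, centroids[k]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical category topic lists and paper texts (illustrative only).
category_topics = [
    "computer networks routing wireless protocols security",
    "dental implants oral surgery orthodontics teeth",
]
papers = [
    "Secure routing protocols for wireless sensor networks",
    "Orthodontics outcomes after oral surgery and dental implants",
    "Wireless networks security analysis",
]

# Vectorise categories and papers over one shared vocabulary.
_, vecs = tfidf_vectors(category_topics + papers)
seeds, points = vecs[:len(category_topics)], vecs[len(category_topics):]
labels, _ = kmeans_seeded(points, seeds)
print(labels)  # [0, 1, 0]
```

Seeding the centroids from category text ties each cluster index to a known category, so the resulting labels can be read directly as category assignments. With random initialisation, clusters would have to be matched to categories after the fact.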

Files

41 27717.pdf (436.3 kB), md5:6f38d26c642dadf5649430bc3b2a2701
