K-Means Based Algorithm For Islamic Document Clustering

Authors

  • Majid Hameed Ahmed UNIVERSITI KEBANGSAAN MALAYSIA
  • Sabrina Tiun UNIVERSITI KEBANGSAAN MALAYSIA
  • Mohammed Albared UNIVERSITI KEBANGSAAN MALAYSIA

Keywords:

Islamic document clustering, Information retrieval (IR), K-means algorithm, light stemmer, similarity/distance measures

Abstract

Document clustering is an unsupervised learning task: a form of data analysis that aims to group a set of objects into subsets, or clusters. In this paper, the target domain of the clustered documents is the Islamic religious domain. Islamic document clustering is an important task for obtaining more effective results in traditional information retrieval (IR) systems, web text organization, and text mining. Fast, high-quality document clustering can greatly help users navigate document collections, particularly on the Internet, where the number of available online documents increases rapidly every day. The religious domain has thus become an interesting and challenging area for Natural Language Processing (NLP). The aim of this paper is to evaluate the efficiency and accuracy of Arabic Islamic document clustering based on the K-means algorithm with three similarity/distance measures: Cosine similarity, Jaccard similarity, and Euclidean distance. Before the algorithms are applied, the documents must be pre-processed. The pre-processing steps are necessary to eliminate noise and keep only useful information, which improves the performance of document clustering. Additionally, this research investigates the effect of stemming versus no stemming on the accuracy of Arabic Islamic text clustering. Our experiments show that clustering with stemming gives better results than clustering without stemming, and that K-means with the Cosine similarity measure achieves the highest performance score.
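As an illustration only, the three measures named in the abstract can be sketched over term-weight vectors as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, documents are assumed to already be pre-processed into equal-length numeric vectors, and the Jaccard measure is assumed to be the generalized (Tanimoto) form, which reduces to the set-based Jaccard coefficient for binary vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight vectors (higher = more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Generalized (Tanimoto) Jaccard coefficient; an assumption, see lead-in."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom != 0 else 0.0


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (lower = more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_cluster(doc, centroids, score):
    """K-means assignment step: index of the centroid with the highest score.

    `score` must be higher-is-better, so a distance is passed in negated.
    """
    return max(range(len(centroids)), key=lambda i: score(doc, centroids[i]))
```

A possible use of the assignment step with either a similarity or a distance, on made-up vectors:

```python
centroids = [[1, 0, 0], [0, 1, 1]]
doc = [0, 2, 1]
by_cosine = assign_to_cluster(doc, centroids, cosine_similarity)
by_euclid = assign_to_cluster(doc, centroids, lambda d, c: -euclidean_distance(d, c))
```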

Published

2014-09-15