Investigate the Performance of Document Clustering Approach Based on Association Rules Mining
Journal Title: International Journal of Advanced Computer Science & Applications - Year 2013, Vol 4, Issue 8
Abstract
The limitations of standard clustering methods and the weaknesses of the Apriori algorithm in frequent termset clustering motivate this research. We have devised an efficient approach for Web Document Clustering based on Association Rules Mining (ARWDC). An efficient Multi-Tire Hashing Frequent Termsets algorithm (MTHFT) is used to speed up association rule mining by improving the mining of frequent termsets. The documents are then initially partitioned based on the association rules. Since a document usually contains more than one frequent termset, the same document may appear in multiple initial partitions, i.e., the initial partitions overlap. After the partitions are made disjoint, the documents within each partition are grouped using descriptive keywords to obtain the resultant clusters. In this paper, we present an extensive analysis of the ARWDC approach for different sizes of Reuters datasets. Furthermore, the performance of our approach is evaluated with measures such as Precision, Recall, and F-measure, and compared with existing clustering algorithms such as Bisecting K-means and FIHC. The experimental results show that the efficiency, scalability, and accuracy of the ARWDC approach are significantly improved on the Reuters datasets.
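As an illustration of the evaluation measures named in the abstract, the following minimal sketch computes Precision, Recall, and F-measure for one cluster against one reference class. It assumes the common set-overlap definitions used in document clustering; the function name `precision_recall_f` and the sample document IDs are hypothetical, and the paper's exact formulation may differ.

```python
def precision_recall_f(cluster, reference_class):
    """Return (precision, recall, F-measure) for one cluster vs. one class.

    Assumed definitions (not taken from the paper):
      precision = |cluster ∩ class| / |cluster|
      recall    = |cluster ∩ class| / |class|
      F         = 2 * precision * recall / (precision + recall)
    """
    cluster = set(cluster)
    reference_class = set(reference_class)
    overlap = len(cluster & reference_class)

    # Guard against empty inputs to avoid division by zero.
    precision = overlap / len(cluster) if cluster else 0.0
    recall = overlap / len(reference_class) if reference_class else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical document IDs: a 4-document cluster and a 6-document class
# that share two documents.
p, r, f = precision_recall_f({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

An overall score for a clustering is often obtained by taking, for each reference class, the best F-measure over all clusters and weighting it by class size; whether the paper aggregates this way is not stated in the abstract.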
Authors and Affiliations
Noha Negm, Mohamed Amin, Passent Elkafrawy, Abdel M. Salem
A Genetic Algorithm for Solving Travelling Salesman Problem
In this paper, we present a Genetic Algorithm for solving the Travelling Salesman Problem (TSP). The Genetic Algorithm, which is a very good local search algorithm, is employed to solve the TSP by generating a preset number of...
Integrated Information System for reserving rooms in Hotels
It is very important to build new, modern, flexible, dynamic, effective, compatible, and reusable information systems, including databases, to help manipulate different processes and deal with the many parts around them. One of these i...
THYROID DIAGNOSIS BASED TECHNIQUE ON ROUGH SETS WITH MODIFIED SIMILARITY RELATION
Because patient data are inconsistent, the Thyroid Disease dataset used in the learning process is uncertain: it contains irrelevant, redundant, missing, and numerous features. In this paper, Rough sets theory is used in data discr...
The Criteria for Software Quality in Information System: Rasch Analysis
Most organizations use information systems to manage information and support better decision making in order to deliver high-quality services. Because of that, the information system must be reliable and fulfill the...
Boosted Constrained K-Means Algorithm for Social Networks Circles Analysis
The volume of information generated by the huge number of social network users is increasing every day. Social network analysis has gained intensive attention in the data mining research community to identify circles of...