Vectorization of Text Documents for Identifying Unifiable News Articles

Abstract

Vectorization is essential for processing textual data in natural language processing applications: it converts text into numerical representations that machines can operate on. The proposed work aims to identify unifiable news articles for multi-document summarization. A framework is introduced that identifies news articles related to top trending topics/hashtags and performs multi-document summarization of the unifiable news articles for each trending topic, in order to capture the diversity of opinion on those topics. Text clustering is applied to the corpus of news articles related to each trending topic to obtain smaller unifiable groups. The effectiveness of several text vectorization methods, namely bag-of-words representations with tf-idf scores, word embeddings, and document embeddings, is investigated for clustering news articles with k-means. The paper presents a comparative analysis, in terms of purity, of the different vectorization methods on documents from the DUC 2004 benchmark dataset.
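
The abstract does not give implementation details, so the following is only a minimal sketch of the tf-idf, k-means, and purity steps it names, written with the Python standard library. The function names, the whitespace tokenizer, the smoothed idf variant, the random centroid initialization, and the toy documents and labels are all assumptions made for illustration; they are not the authors' code or data.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs):
    """Return one L2-normalized tf-idf row per document (dense lists)."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Smoothed idf, one common variant: log((1 + N) / (1 + df)) + 1.
        row = [
            term_freq[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t in vocab
        ]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return rows


def kmeans(rows, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means with squared Euclidean distance."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]  # random initial centroids
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every row.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
            )
            for row in rows
        ]
        # Update step: mean of the members; empty clusters keep their centroid.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def purity(cluster_labels, true_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    majority_total = 0
    for c in set(cluster_labels):
        classes = [t for cl, t in zip(cluster_labels, true_labels) if cl == c]
        majority_total += Counter(classes).most_common(1)[0][1]
    return majority_total / len(true_labels)


# Toy corpus and gold topics, invented purely for illustration.
toy_docs = [
    "election vote parliament minister",
    "minister vote election campaign",
    "match goal striker league",
    "league striker goal referee",
]
toy_topics = ["politics", "politics", "sport", "sport"]

toy_clusters = kmeans(tfidf_matrix(toy_docs), k=2)
print("purity:", purity(toy_clusters, toy_topics))
```

Purity rewards clusters dominated by a single gold topic, so it always lies between the share of the largest topic and 1.0. Swapping `tfidf_matrix` for averaged word embeddings or document embeddings would leave the clustering and evaluation steps unchanged, which is what makes a comparison of vectorization methods possible.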

Authors and Affiliations

Anita Kumari Singh, Mogalla Shashi

  • EP ID EP611302
  • DOI 10.14569/IJACSA.2019.0100742

How To Cite

Anita Kumari Singh, Mogalla Shashi (2019). Vectorization of Text Documents for Identifying Unifiable News Articles. International Journal of Advanced Computer Science & Applications, 10(7), 305-310. https://europub.co.uk/articles/-A-611302