Feature Selection And Vectorization In Legal Case Documents Using Chi-Square Statistical Analysis And Naïve Bayes Approaches

Journal Title: IOSR Journals (IOSR Journal of Computer Engineering) - Year 2015, Vol 17, Issue 2

Abstract

Most machine learning techniques used in text classification require that document features be selected effectively, owing to the large volume of data encountered in the classification process, and that term weights be built from document vectors for proper input to the respective classifier algorithms. Effective selection of the most important features from the raw documents is achieved by implementing more extensive pre-processing techniques, and the features obtained are ranked using the chi-square statistical approach to eliminate irrelevant features and select the more relevant features in the entire corpus. The most relevant ranked features are then converted to word vectors, which are based on the number of occurrences of words in the documents or categories concerned, using the probabilistic characteristics of Naïve Bayes as a vectorizer for machine learning classifiers. This hybrid vector space model was tested on legal text categories. The study revealed better discovered features using the pre-processing and ranking technique, and better term weights were successfully built from the documents for the machine learning classifiers used in the text classification process.
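The pipeline outlined in the abstract (chi-square ranking of terms, followed by count-based Naïve Bayes term weights) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the function names, the choice of keeping the maximum chi-square score over classes, and the use of Laplace-smoothed likelihoods P(term | class) as term weights are all assumptions made here for illustration.

```python
from collections import Counter

# Hypothetical toy corpus; the labels and texts are not from the paper.
DOCS = [
    ("the court dismissed the appeal", "civil"),
    ("the appeal was allowed by the court", "civil"),
    ("the accused was convicted of murder", "criminal"),
    ("the court convicted the accused", "criminal"),
]


def tokenize(text):
    """Very light pre-processing: lowercase and split on whitespace."""
    return text.lower().split()


def chi_square_scores(docs):
    """Score each term by its maximum chi-square statistic over all classes.

    For a term t and class c, with document counts
      a = docs in c containing t,      b = docs not in c containing t,
      m = docs in c without t,         d = docs not in c without t,
    chi2 = N * (a*d - m*b)^2 / ((a+m) * (b+d) * (a+b) * (m+d)).
    """
    n = len(docs)
    classes = sorted({label for _, label in docs})
    doc_tokens = [(set(tokenize(text)), label) for text, label in docs]
    vocab = sorted(set().union(*(toks for toks, _ in doc_tokens)))
    scores = {}
    for term in vocab:
        best = 0.0
        for cls in classes:
            a = sum(1 for toks, lab in doc_tokens if term in toks and lab == cls)
            b = sum(1 for toks, lab in doc_tokens if term in toks and lab != cls)
            m = sum(1 for toks, lab in doc_tokens if term not in toks and lab == cls)
            d = sum(1 for toks, lab in doc_tokens if term not in toks and lab != cls)
            denom = (a + m) * (b + d) * (a + b) * (m + d)
            chi2 = n * (a * d - m * b) ** 2 / denom if denom else 0.0
            best = max(best, chi2)
        scores[term] = best
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def nb_term_weights(docs, features):
    """Laplace-smoothed P(term | class) over the selected features only."""
    counts = {}
    for text, label in docs:
        counts.setdefault(label, Counter()).update(
            t for t in tokenize(text) if t in features
        )
    weights = {}
    for label, ctr in counts.items():
        total = sum(ctr.values()) + len(features)
        weights[label] = {t: (ctr[t] + 1) / total for t in features}
    return weights


scores = chi_square_scores(DOCS)
features = select_top_k(scores, 3)
weights = nb_term_weights(DOCS, features)
print(features)
print(weights)
```

In this sketch, terms that occur in every document (such as "the") or equally often in both classes score zero and are discarded, while class-specific terms rank highest. The resulting per-class weights could then be passed to a downstream classifier as term weights.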

Authors and Affiliations

Obasi, Chinedu Kingsley, Ugwu, Chidiebere

How To Cite

Obasi, Chinedu Kingsley, Ugwu, Chidiebere (2015). Feature Selection And Vectorization In Legal Case Documents Using Chi-Square Statistical Analysis And Naïve Bayes Approaches. IOSR Journals (IOSR Journal of Computer Engineering), 17(2), 42-50. https://europub.co.uk/articles/-A-137568