Classification and analysis for Focused Crawled Textual Dataset for retrieving Indian origin scientists

Journal Title: International Journal of Experimental Research and Review - Year 2023, Vol 34, Issue 5

Abstract

Text classification (also called text categorization or text tagging) is a crucial and widely used approach in Natural Language Processing (NLP) for assigning unseen documents to predefined categories. In this paper, we examine dataset construction and evaluation as components of text classification. First, we built a new text classification dataset of Indian-origin scientists, collected using focused crawling and web scraping techniques. We then present an extensive evaluation of several supervised models on this newly constructed dataset. Our evaluation shows that the Random Forest model outperforms the other supervised models. These results provide a good starting point for further research on text classification for Indian-origin scientists. In experiments alongside K-Nearest Neighbor, Logistic Regression, and Support Vector Machine, Random Forest performed considerably better when combined with SMOTE and k-fold cross-validation. We use the Area Under the ROC Curve (AUC) to measure the effectiveness of the chosen models. Overall, the Random Forest classifier gave the best results, with a micro-average AUC of 90%.
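The abstract reports a micro-average AUC but does not say how it was computed. The following is a minimal sketch of one common definition, not the authors' implementation: every (document, class) pair is treated as a single binary decision, the pairs are pooled, and the AUC is taken from the rank-sum (Mann-Whitney) statistic. The function name `micro_average_auc` and the toy scores are hypothetical and are not drawn from the paper's dataset.

```python
from typing import Sequence


def micro_average_auc(
    y_true: Sequence[int],
    scores: Sequence[Sequence[float]],
    n_classes: int,
) -> float:
    """Micro-averaged one-vs-rest ROC AUC (hypothetical helper).

    y_true[i] is the class index of document i; scores[i][k] is the
    classifier's score for class k on document i.
    """
    # Pool every (document, class) pair into one binary problem:
    # the label is 1 when k is the true class, 0 otherwise.
    pairs = []
    for label, row in zip(y_true, scores):
        for k in range(n_classes):
            pairs.append((row[k], 1 if label == k else 0))

    n_pos = sum(lab for _, lab in pairs)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative pair")

    # Rank-sum AUC: sort by score and give tied scores their average rank.
    pairs.sort(key=lambda p: p[0])
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(lab for _, lab in pairs[i:j])
        i = j

    # Mann-Whitney U for the positives, normalised to [0, 1].
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


if __name__ == "__main__":
    # Toy three-class example with made-up scores.
    toy_labels = [0, 1, 2, 1]
    toy_scores = [
        [0.7, 0.2, 0.1],
        [0.3, 0.5, 0.2],
        [0.2, 0.3, 0.5],
        [0.4, 0.35, 0.25],
    ]
    print(micro_average_auc(toy_labels, toy_scores, n_classes=3))
```

Because the pooling weights every document equally, frequent classes dominate the micro-average. That is worth keeping in mind when reading a single summary figure for a dataset that needed SMOTE to correct class imbalance.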

Authors and Affiliations

Shivani Gautam, Rajesh Bhatia, Shaily Jain

  • EP ID EP722922
  • DOI https://doi.org/10.52756/ijerr.2023.v34spl.008

How To Cite

Shivani Gautam, Rajesh Bhatia, Shaily Jain (2023). Classification and analysis for Focused Crawled Textual Dataset for retrieving Indian origin scientists. International Journal of Experimental Research and Review, 34(5), -. https://europub.co.uk/articles/-A-722922