Classification and analysis for Focused Crawled Textual Dataset for retrieving Indian origin scientists

Journal Title: International Journal of Experimental Research and Review - Year 2023, Vol 34, Issue 5

Abstract

Text classification (also called text categorization or text tagging) is a crucial and extensively used approach in Natural Language Processing (NLP) for assigning unseen documents to predefined categories. In this paper, we examine dataset construction and evaluation as components of text classification. First, we built a new text classification dataset on Indian-origin scientists, collected using focused crawling and web scraping techniques. We then present an extensive evaluation of several supervised models on this newly constructed dataset: K Nearest Neighbor, Logistic Regression, Support Vector Machine, and Random Forest. The Random Forest model outperforms the other supervised models, and performs considerably better when combined with SMOTE and k-fold cross-validation. We use the Area Under the ROC Curve (AUC) to measure the effectiveness of the chosen models. Overall, the Random Forest classifier achieved the best result, with a 90% micro-average AUC. Our results provide a good starting point for further research on text classification for Indian-origin scientists.
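The abstract reports a micro-average AUC but does not say how it was computed. The sketch below illustrates the usual definition under stated assumptions: every (document, class) pair is pooled into one binary problem, and the AUC is taken as the fraction of positive/negative pairs ranked correctly. The function names and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def micro_average_auc(
    y_true: Sequence[int],
    y_score: Sequence[Sequence[float]],
    n_classes: int,
) -> float:
    """Pool all one-vs-rest decisions into one binary problem, then score it."""
    flat_labels = []
    flat_scores = []
    for true_class, row in zip(y_true, y_score):
        for c in range(n_classes):
            flat_labels.append(1 if c == true_class else 0)
            flat_scores.append(row[c])
    return binary_auc(flat_labels, flat_scores)


# Toy data (invented for illustration): 4 documents, 3 classes.
y_true = [0, 1, 2, 1]
y_score = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],  # misclassified: the true class 1 is not the top score
]
print(micro_average_auc(y_true, y_score, n_classes=3))
```

Because micro-averaging pools decisions before scoring, frequent classes weigh more heavily than rare ones. That matters when reading a single headline figure on a dataset that needed SMOTE to rebalance it.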

Authors and Affiliations

Shivani Gautam, Rajesh Bhatia, Shaily Jain

  • EP ID EP722922
  • DOI https://doi.org/10.52756/ijerr.2023.v34spl.008

How To Cite

Shivani Gautam, Rajesh Bhatia, Shaily Jain (2023). Classification and analysis for Focused Crawled Textual Dataset for retrieving Indian origin scientists. International Journal of Experimental Research and Review, 34(5), -. https://europub.co.uk/articles/-A-722922