Classification and analysis for Focused Crawled Textual Dataset for retrieving Indian origin scientists

Journal Title: International Journal of Experimental Research and Review - Year 2023, Vol 34, Issue 5

Abstract

Text classification (also called text categorization or text tagging) is a crucial and widely used approach in Natural Language Processing (NLP) for assigning unseen documents to predefined categories. In this paper, we examine dataset construction and evaluation as components of text classification. First, we built a new text classification dataset on Indian-origin scientists, collected using focused crawling and web scraping techniques. We then present an extensive evaluation of several supervised models on this newly constructed dataset. Our evaluations show that the Random Forest model outperforms the other supervised models, which provides a starting point for further research on text classification for Indian-origin scientists. In experiments that also covered K-Nearest Neighbor, Logistic Regression, and Support Vector Machine, Random Forest performed considerably better when combined with SMOTE and k-fold cross-validation. We use the area under the ROC curve (AUC) to measure the effectiveness of the chosen models. Overall, the Random Forest classifier gave the best results, with a micro-average AUC of 90%.
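
The abstract reports a micro-average AUC but does not say how it was computed. As a rough illustration only, the sketch below follows one common definition: each (document, class) pair is treated as a separate binary decision, all pairs are pooled, and a single AUC is computed as the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair. The function names, class indices, and scores are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def micro_average_auc(
    y_true: Sequence[int],
    y_score: Sequence[Sequence[float]],
    n_classes: int,
) -> float:
    """Pool every (sample, class) pair into one binary problem, then take AUC."""
    flat_labels: List[int] = []
    flat_scores: List[float] = []
    for label, row in zip(y_true, y_score):
        for c in range(n_classes):
            # One-vs-rest indicator: 1 if this class is the true class.
            flat_labels.append(1 if label == c else 0)
            flat_scores.append(row[c])
    return binary_auc(flat_labels, flat_scores)


# Made-up example: 4 documents, 3 classes, per-class scores from some classifier.
example_true = [0, 1, 2, 1]
example_scores = [
    [0.70, 0.20, 0.10],
    [0.30, 0.50, 0.20],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
]
example_auc = micro_average_auc(example_true, example_scores, n_classes=3)
print(f"micro-average AUC: {example_auc:.4f}")
```

Because all classes are pooled before the AUC is taken, frequent classes weigh more heavily in a micro-average than in a macro-average, which is worth keeping in mind for an imbalanced dataset of the kind SMOTE is meant to address.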

Authors and Affiliations

Shivani Gautam, Rajesh Bhatia, Shaily Jain

  • EP ID EP722922
  • DOI https://doi.org/10.52756/ijerr.2023.v34spl.008

How To Cite

Shivani Gautam, Rajesh Bhatia, Shaily Jain (2023). Classification and analysis for Focused Crawled Textual Dataset for retrieving Indian origin scientists. International Journal of Experimental Research and Review, 34(5), -. https://europub.co.uk/articles/-A-722922