Exploring Character-Based Stylometry Features Using Machine Learning for Intrinsic Plagiarism Detection in Urdu

Abstract

Plagiarism detection is an important natural language processing (NLP) task for maintaining textual integrity across domains, and it is especially challenging for low-resource languages such as Urdu. This study addresses intrinsic plagiarism detection in Urdu, an area with little prior research owing to the scarcity of datasets and models. To bridge this gap, we investigate character-based stylometric features in combination with machine learning (ML) and deep learning (DL) models for Urdu text analysis. We conducted a series of experiments to evaluate several classifiers: Random Forest, AdaBoost, K-Nearest Neighbor (KNN), Decision Tree, Gaussian Naive Bayes, and Long Short-Term Memory (LSTM) networks. KNN and LSTM achieved the highest accuracy at 74%, and KNN also achieved the highest F1-score (64.3%), indicating the best balance between precision and recall. AdaBoost followed closely with an accuracy of 73% and a precision of 77.5%, although its F1-score was slightly lower at 63.6%. These findings emphasize the need for specialized NLP approaches for Urdu and suggest that tailored ML and DL techniques can improve intrinsic plagiarism detection in low-resource languages.
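
As a rough illustration of the approach described in the abstract, the sketch below computes a few character-level style features for a text segment and classifies a segment with a small hand-written K-Nearest Neighbor vote. The feature set, the names `char_features` and `knn_predict`, the punctuation list, and the choice of k are all assumptions made for illustration; the abstract does not specify the features or settings the authors actually used.

```python
import math
import unicodedata
from collections import Counter

# Assumed set of Urdu punctuation marks: full stop, comma, question mark, semicolon.
URDU_PUNCTUATION = "\u06D4\u060C\u061F\u061B"


def char_features(text):
    """Return a small vector of character-level style features for one segment.

    The five features are illustrative only: letter ratio, digit ratio,
    whitespace ratio, Urdu punctuation ratio, and average word length.
    """
    total = len(text) or 1  # avoid division by zero on an empty segment
    letters = sum(1 for ch in text if unicodedata.category(ch).startswith("L"))
    digits = sum(1 for ch in text if unicodedata.category(ch) == "Nd")
    spaces = sum(1 for ch in text if ch.isspace())
    punct = sum(1 for ch in text if ch in URDU_PUNCTUATION)
    words = text.split()
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return [letters / total, digits / total, spaces / total,
            punct / total, avg_word_len]


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query vector by majority vote among its k nearest neighbors."""
    nearest = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

In an intrinsic setting, each document would be split into segments, every segment converted with `char_features`, and a classifier trained on segments labeled as original or plagiarized. In practice a library implementation of KNN would normally replace the hand-written vote shown here, and the features would usually be scaled before computing distances.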

Authors and Affiliations

Muhammad Faraz Manzoor, Muhammad Shoaib Farooq, Muntazir Mehdi, Adnan Abid

How To Cite

Muhammad Faraz Manzoor, Muhammad Shoaib Farooq, Muntazir Mehdi, Adnan Abid (2024). Exploring Character-Based Stylometry Features Using Machine Learning for Intrinsic Plagiarism Detection in Urdu. International Journal of Innovations in Science and Technology, 6(7), -. https://europub.co.uk/articles/-A-761765