Establishing an Optimal Online Phishing Detection Method: Evaluating Topological NLP Transformers on Text Message Data

Journal Title: Journal of Data Science and Intelligent Systems - Year 2024, Vol 2, Issue 1

Abstract

This research establishes an optimal classification model for online SMS spam detection using topological sentence transformer methodologies. The study responds to the increasingly sophisticated and disruptive activities of malicious actors. We present a viable, lightweight integration of pre-trained NLP repository models with sklearn functionality. The study design mirrors the spaCy pipeline component architecture in a downstream sklearn pipeline implementation and introduces a user-extensible SMS spam solution. We leverage large-text data models from HuggingFace (RoBERTa-base) via spaCy and apply linguistic NLP transformer methods to short-sentence NLP datasets. We compare the models' F1-scores and iteratively retest the models using a standard sklearn pipeline architecture. Applying spaCy transformer modelling achieves an optimal F1-score of 0.938, a result comparable to existing research output from contemporary BERT/SBERT/'black box' predictive models. This research introduces a lightweight, user-interpretable, standardized, predictive SMS spam detection model that uses semantically similar paraphrase/sentence transformer methodologies and generates optimal F1-scores for an SMS dataset. Significant F1-scores are also generated for a Twitter evaluation set, indicating potential real-world suitability.
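
Because the abstract compares models by F1-score, a minimal sketch of how that metric is computed for a binary spam/ham task may help. The function name `f1_score_binary`, the label strings, and the toy predictions below are illustrative assumptions; they are not data or code from the study.

```python
def f1_score_binary(y_true, y_pred, positive="spam"):
    """Return the F1-score for the `positive` class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive predictions: precision and recall are both 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration only (not from the SMS dataset).
y_true = ["spam", "ham", "spam", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]
score = f1_score_binary(y_true, y_pred)
print(f"F1 = {score:.3f}")
```

In an sklearn-based pipeline such as the one the abstract describes, the same quantity would typically come from `sklearn.metrics.f1_score` rather than a hand-written function.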

Authors and Affiliations

Helen Milner, Michael Baron

EP ID EP752176
DOI 10.47852/bonviewJDSIS32021131