Establishing an Optimal Online Phishing Detection Method: Evaluating Topological NLP Transformers on Text Message Data

Journal Title: Journal of Data Science and Intelligent Systems - Year 2024, Vol 2, Issue 1

Abstract

This research establishes an optimal classification model for online SMS spam detection using topological sentence transformer methodologies. The study responds to the increasingly sophisticated and disruptive activities of malicious actors. We present a viable, lightweight integration of pre-trained NLP repository models with sklearn functionality. The study design mirrors the spaCy pipeline component architecture in a downstream sklearn pipeline implementation and introduces a user-extensible SMS spam solution. We leverage large-text data models from HuggingFace (RoBERTa-base) via spaCy and apply linguistic NLP transformer methods to short-sentence NLP datasets. We compare the F1-scores of the models and iteratively retest them using a standard sklearn pipeline architecture. Applying spaCy transformer modelling achieves an optimal F1-score of 0.938, a result comparable to existing research output from contemporary BERT/SBERT/‘black box’ predictive models. This research introduces a lightweight, user-interpretable, standardized, predictive SMS spam detection model that uses semantically similar paraphrase/sentence transformer methodologies and generates optimal F1-scores for an SMS dataset. Significant F1-scores are also generated for a Twitter evaluation set, indicating potential real-world suitability.
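Because the abstract compares models by F1-score, a small worked example of that metric may be useful. The following is a minimal sketch, not the authors' evaluation code: it assumes a binary F1 computed on the spam class (the abstract does not say whether the reported scores are binary, macro, or weighted), and the function name `f1_score_binary` and the labels are hypothetical, invented purely for illustration.

```python
from typing import Sequence


def f1_score_binary(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str = "spam"
) -> float:
    """Return F1 for the positive class: the harmonic mean of precision and recall."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives, and false negatives for the positive class.
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    # With no true positives, precision and recall are zero or undefined; report 0.0.
    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels (assumption: not taken from the study's datasets).
y_true = ["spam", "ham", "spam", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]
score = f1_score_binary(y_true, y_pred)
print(f"F1 (spam class): {score:.3f}")
```

In this toy case there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and so is F1. In an sklearn pipeline such as the one the abstract describes, the equivalent figure would normally come from the library's own metric functions rather than a hand-written helper.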

Authors and Affiliations

Helen Milner, Michael Baron

  • EP ID EP752176
  • DOI 10.47852/bonviewJDSIS32021131

How To Cite

Helen Milner, Michael Baron (2024). Establishing an Optimal Online Phishing Detection Method: Evaluating Topological NLP Transformers on Text Message Data. Journal of Data Science and Intelligent Systems, 2(1), -. https://europub.co.uk/articles/-A-752176