IRS for Computer Character Sequences Filtration: a new software tool and algorithm to support the IRS at tokenization process

Abstract

Tokenization is the task of chopping a character sequence up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A new software tool and algorithm to support the information retrieval system (IRS) during the tokenization process are presented. The proposed tool filters out four kinds of computer character sequences: IP addresses, web URLs, dates, and email addresses. It relies on pattern matching algorithms and filtration methods. After this step, the IRS can start a new tokenization process on the filtered text, which is free of these sequences.
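
The filtration step described above can be illustrated with a minimal sketch. This is not the authors' tool: the abstract does not give their patterns, so the regular expressions, the `PATTERNS` table, and the `filter_sequences` function below are hypothetical and deliberately simplified (for example, a dotted version number such as 1.2.3.4 would be treated as an IP address).

```python
import re

# Hypothetical, simplified patterns (assumptions for illustration only).
# Order matters: URLs are removed first because they can contain
# substrings that look like IP addresses or email addresses.
PATTERNS = {
    "url": re.compile(r"\b(?:https?://|www\.)[^\s<>\"']*[^\s<>\"'.,;:!?)]", re.IGNORECASE),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ip": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),
}


def filter_sequences(text):
    """Return (cleaned_text, extracted), where extracted maps each
    sequence kind to the list of matches removed from the text."""
    extracted = {name: [] for name in PATTERNS}
    cleaned = text
    for name, pattern in PATTERNS.items():
        # All groups are non-capturing, so findall returns whole matches.
        extracted[name].extend(pattern.findall(cleaned))
        cleaned = pattern.sub(" ", cleaned)
    # Collapse the whitespace left behind by the removed sequences.
    cleaned = " ".join(cleaned.split())
    return cleaned, extracted


if __name__ == "__main__":
    sample = (
        "Server 192.168.1.10 was contacted on 12/03/2013 via "
        "http://example.com/status and admin@example.org replied."
    )
    cleaned, found = filter_sequences(sample)
    print(found)
    # A naive whitespace tokenizer can now run on the filtered text.
    print(cleaned.split())
```

Removing the sequences before tokenization keeps a tokenizer from splitting an address such as 192.168.1.10 at its dots; the extracted matches are returned separately so they could still be indexed as whole units if desired.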

Authors and Affiliations

Ahmad Badawi, Qasem Al-Haija

  • EP ID EP146223
  • DOI 10.14569/IJACSA.2013.040212

How To Cite

Ahmad Badawi, Qasem Al-Haija (2013). IRS for Computer Character Sequences Filtration: a new software tool and algorithm to support the IRS at tokenization process. International Journal of Advanced Computer Science & Applications, 4(2), 81-855. https://europub.co.uk/articles/-A-146223