IRS for Computer Character Sequences Filtration: a new software tool and algorithm to support the IRS at tokenization process

Abstract

Tokenization is the task of chopping a character sequence up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A new software tool and algorithm to support the information retrieval system (IRS) during the tokenization process are presented. The proposed tool filters out the following computer character sequences: IP addresses, web URLs, dates, and email addresses. It uses pattern matching algorithms and filtration methods. After this step, the IRS can start a new tokenization process on the resulting text, which will be free of these sequences.
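The filtration step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' patterns or algorithm, so everything below is an assumption made for illustration: the regular expressions, the name `filter_sequences`, and the choice to replace each match with a space are hypothetical, and the patterns cover only common forms (IPv4 addresses, `http`/`https`/`www` URLs, and numeric dates such as `2013-02-14` or `14/02/2013`).

```python
import re

# Illustrative patterns only; the paper's actual patterns are not given in the abstract.
URL = r"(?:https?://|www\.)[^\s<>\"]+"
EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4 = r"\b(?:" + OCTET + r"\.){3}" + OCTET + r"\b"
DATE = r"\b(?:\d{4}-\d{1,2}-\d{1,2}|\d{1,2}/\d{1,2}/\d{2,4})\b"

# URLs come first so that an IP address embedded in a URL is removed
# together with the URL rather than on its own.
SEQUENCES = re.compile("|".join([URL, EMAIL, IPV4, DATE]))


def filter_sequences(text):
    """Return (filtered_text, removed_sequences) for the given text.

    Hypothetical helper: every match is replaced by a space and the
    remaining whitespace is collapsed, so a tokenizer can run on the result.
    """
    removed = SEQUENCES.findall(text)  # all groups are non-capturing, so full matches
    filtered = SEQUENCES.sub(" ", text)
    return " ".join(filtered.split()), removed


if __name__ == "__main__":
    sample = "Mail bob@example.com or visit http://example.com/docs from 192.168.0.1 on 2013-02-14 today"
    filtered, removed = filter_sequences(sample)
    print(removed)           # the sequences that were filtered out
    print(filtered.split())  # a naive whitespace tokenization of what remains
```

Combining the patterns into a single alternation lets the text be scanned once. A real system would likely need stricter URL handling (for example, trailing punctuation) and more date formats than this sketch covers.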

Authors and Affiliations

Ahmad Badawi, Qasem Al-Haija

  • EP ID EP146223
  • DOI 10.14569/IJACSA.2013.040212

How To Cite

Ahmad Badawi, Qasem Al-Haija (2013). IRS for Computer Character Sequences Filtration: a new software tool and algorithm to support the IRS at tokenization process. International Journal of Advanced Computer Science & Applications, 4(2), 81-855. https://europub.co.uk/articles/-A-146223