Focused Web Crawler Development Challenges: EcCrawler

Journal Title: International Journal of Computer Science and Engineering - Year 2017, Vol 6, Issue 1

Abstract

Focused web crawlers are more important now than ever. As the web has become massive and spammy, it is essential to have focused web crawlers that crawl only the targeted websites and obtain the necessary information. Rather than relying on publicly available general-purpose web crawlers, developing a focused web crawler for the targeted web pages is now preferred because it increases the success of information retrieval. This paper presents the challenges encountered, and the solutions proposed for them, while developing EcCrawler, an original, hand-crafted, full-scale, robust, and effective focused web crawler for e-commerce sites. EcCrawler is developed in the C# programming language using the .NET Framework 4.5 and the MS SQL Server 2014 database management system. Most of these crawling challenges have been discussed in the literature; this paper, however, presents practical implementations and .NET Framework-based solutions covering thread pool initialization, exception handling, task parallelism, HTTP compression, duplicate web page resolution, the number of concurrent connections to the same host, database communication, and resource sharing between threads, among others, and evaluates the proposed solutions empirically. The experimental evaluation shows that applying the proposed solutions improves EcCrawler’s crawling speed by over 400% and UI responsiveness by over 100%. The proposed solutions may be applicable to any software developed using the .NET Framework.
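
Three of the challenges named in the abstract (duplicate web page resolution, the number of concurrent connections to the same host, and resource sharing between threads) can be illustrated with a small sketch. This is not EcCrawler's implementation: it is written in Python rather than C#, and the names `normalize`, `Frontier`, `crawl`, and the cap `MAX_PER_HOST = 2` are hypothetical choices made for illustration only. The abstract does not state which normalization rules or connection limits the authors use.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit, urlunsplit

MAX_PER_HOST = 2  # assumed cap; the abstract does not give a value


def normalize(url):
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Lower-case the host, drop the fragment, keep the query string.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class Frontier:
    """Shared crawl state: a visited set and one semaphore per host."""

    def __init__(self, max_per_host=MAX_PER_HOST):
        self._lock = threading.Lock()  # guards both structures below
        self._seen = set()
        self._host_slots = {}
        self._max_per_host = max_per_host

    def claim(self, url):
        """Return the normalized URL if it has not been seen, else None."""
        key = normalize(url)
        with self._lock:
            if key in self._seen:
                return None
            self._seen.add(key)
            return key

    def slot_for(self, url):
        """Return the semaphore that limits concurrent fetches for this host."""
        host = urlsplit(url).netloc
        with self._lock:
            if host not in self._host_slots:
                self._host_slots[host] = threading.BoundedSemaphore(self._max_per_host)
            return self._host_slots[host]


def crawl(urls, fetch, workers=8):
    """Fetch each distinct URL once, never exceeding the per-host cap."""
    frontier = Frontier()
    results = {}
    results_lock = threading.Lock()

    def work(url):
        key = frontier.claim(url)
        if key is None:  # duplicate of a URL already claimed
            return
        with frontier.slot_for(key):  # blocks while the host is at its cap
            body = fetch(key)
        with results_lock:
            results[key] = body

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, urls))  # drain the iterator to surface exceptions
    return results
```

Here `fetch` is any caller-supplied function that takes a URL and returns the page body, so the sketch runs without network access when given a stub. In the .NET Framework, the per-host connection cap is commonly governed by `ServicePointManager.DefaultConnectionLimit`; whether EcCrawler relies on that setting is not stated in the abstract.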

Authors and Affiliations

Furkan Gözükara, Selma Ayşe Özel

How To Cite

Furkan Gözükara, Selma Ayşe Özel (2017). Focused Web Crawler Development Challenges: EcCrawler. International Journal of Computer Science and Engineering, 6(1), 1-34. https://europub.co.uk/articles/-A-249764