Improving Data Collection on Article Clustering by Using Distributed Focused Crawler

Journal Title: Data Science: Journal of Computing and Applied Informatics - Year 2017, Vol 1, Issue 1

Abstract

Data is often collected, or harvested, from the Internet using a web crawler. A general web crawler can be specialized to focus on a certain topic; this type of web crawler is called a focused crawler. A focused crawler makes efficient use of network bandwidth and storage capacity, but building a focused crawler alone is not enough to improve data collection performance. This research proposes a distributed focused crawler that improves crawling performance while remaining efficient in network bandwidth and storage capacity. The distributed focused crawler implements crawl scheduling, site ordering to determine the URL queue, and focused crawling using Naïve Bayes. This research also tests crawling performance by running the crawler with multiple threads and observing CPU and memory utilization. The conclusion is that crawling performance decreases when too many threads are used: CPU and memory utilization become very high, while the performance of the distributed focused crawler becomes low.
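
The combination of a Naïve Bayes relevance filter and an ordered URL queue might look roughly like the following minimal sketch. It is an illustration only, not the authors' implementation: the abstract does not specify the features, the ordering criterion, or the scheduling policy. The class `NaiveBayes`, the function `order_frontier`, the `"relevant"` label, and the use of anchor text as the only feature are all assumptions made for this example.

```python
import heapq
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # every word seen during training

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_score(self, text, label):
        """Log prior plus smoothed log likelihood of the text under a label."""
        counts = self.word_counts[label]
        total = sum(counts.values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total + len(self.vocab)))
        return score

    def classify(self, text):
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


def order_frontier(candidates, classifier, topic="relevant"):
    """Keep URLs whose anchor text is classified as on-topic, best score first.

    `candidates` is a list of (url, anchor_text) pairs (a hypothetical input
    format chosen for this sketch).
    """
    heap = []
    for url, anchor_text in candidates:
        if classifier.classify(anchor_text) == topic:
            # heapq is a min-heap, so negate the score to pop the best URL first.
            heapq.heappush(heap, (-classifier.log_score(anchor_text, topic), url))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

In a distributed setting, each worker could run a filter like this on the links it extracts and hand the ordered URLs to a shared scheduler; how the paper divides that work among nodes is not stated in the abstract.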

Authors and Affiliations

Dani Gunawan, Amalia Amalia, Atras Najwan

  • EP ID EP435197
  • DOI 10.32734/jocai.v1.i1-82

How To Cite

Dani Gunawan, Amalia Amalia, Atras Najwan (2017). Improving Data Collection on Article Clustering by Using Distributed Focused Crawler. Data Science: Journal of Computing and Applied Informatics, 1(1), 1-12. https://europub.co.uk/articles/-A-435197