Crawler Using Inverted WAH Bitmap Index and Searching User Defined Document Fields
Journal Title: International Journal of P2P Network Trends and Technology (IJPTT) - Year 2012, Vol 2, Issue 3
Abstract
Crawler is a focused web crawler that aims to search the World Wide Web and retrieve pages related to a specific topic. It is based on algorithms that select web pages relevant to a pre-defined set of topics. Its main features are a user interest specification module, which mediates between users and search engines to identify the target examples and keywords that together specify the topic of interest, and a URL ordering strategy that combines features of several previous approaches and achieves significant improvement. It also provides a graphical user interface through which users can evaluate and visualize the crawling results, and this evaluation can be fed back to reconfigure the crawler. Such a web crawler may interact with millions of hosts over a period of weeks or months, so robustness, flexibility, and manageability are of major importance. The crawler retrieves the web pages at the queued URLs, parses the HTML files, and adds newly discovered URLs to its queue. The user then provides feedback that helps a baseline classifier to be progressively induced using active learning techniques. Once the classifier is in place, the crawler can be started on its task of resource discovery.
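The fetch–parse–enqueue loop with classifier-guided URL ordering described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the keyword-overlap `relevance` function stands in for the induced classifier, and `fetch` is assumed to be a caller-supplied function that returns a page's HTML as a string.

```python
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin

def relevance(text, keywords):
    """Hypothetical relevance scorer standing in for the paper's
    classifier: fraction of topic keywords occurring in the page."""
    text = text.lower()
    return sum(1 for kw in keywords if kw in text) / len(keywords)

class LinkParser(HTMLParser):
    """Collects href targets from <a> tags while parsing HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, fetch, keywords, max_pages=10):
    """Best-first focused crawl: the URL frontier is a priority
    queue ordered by the relevance score of the linking page."""
    frontier = [(-1.0, url) for url in seed_urls]  # max-heap via negation
    heapq.heapify(frontier)
    seen = set(seed_urls)
    results = []
    while frontier and len(results) < max_pages:
        _, url = heapq.heappop(frontier)
        html = fetch(url)
        score = relevance(html, keywords)
        results.append((url, score))
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            link = urljoin(url, link)
            if link not in seen:
                seen.add(link)
                # New URLs inherit the parent page's score as the
                # ordering heuristic for the frontier.
                heapq.heappush(frontier, (-score, link))
    return results
```

A real deployment would replace `relevance` with the actively trained classifier and `fetch` with a polite HTTP client, but the queue discipline — score a page, then rank its outlinks by that score — is the core of the URL ordering strategy the abstract describes.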
Authors and Affiliations
Mr. Sanjay Kumar Singh