The World Wide Web (WWW), the largest and most frequently accessed public repository of information ever developed, contains a large number of web pages interconnected through hyperlinks. The WWW can be divided into two parts: the Surface Web and the Deep Web. The Surface Web, also termed the Publicly Indexable Web (PIW), refers to the static Web pages that can be crawled and indexed by popular search engines. The Deep Web, on the other hand, refers to the contents stored in Web databases and published through dynamic Web pages, which people access through specified query interfaces.

In fact, there are more than 300,000 Deep Web databases and 450,000 query interfaces available in the hidden Web, and both figures are still increasing quickly. Besides their scale, the contents of Web databases span all topics, ranging from agriculture to the nuclear domain. Some Deep Web portal services provide Deep Web directories that classify Web databases under taxonomies, and these databases contain a large amount of high-quality information. However, these sites, hidden behind search interfaces, cannot be crawled by traditional crawlers. Crawling the hidden Web is a very challenging problem, for the following two fundamental reasons:

·        Access to these databases is provided only through restricted search interfaces, intended to be filled manually.

·        The sheer size of the hidden Web is very large, about 400 to 500 times the size of the Surface Web. As a result, it is not prudent to attempt comprehensive coverage of the hidden Web, and there is therefore a need to develop a domain-specific crawler for the hidden Web.

In this thesis, the design and development of a novel framework for an Extensible and Scalable Domain-Specific (Task-Oriented) Hidden Web Crawler (DSHWC) is reported. The crawler is not only capable of crawling the hidden Web but can also deal efficiently with databases hidden behind search interfaces that contain either a single attribute or multiple attributes.
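
To illustrate what handling single-attribute and multi-attribute search interfaces can involve, the following is a minimal sketch and not the DSHWC implementation. It assumes a search form submitted via HTTP GET; the function name build_queries, the URL http://example.com/search, and the field names and values are all hypothetical. It enumerates one query URL per combination of candidate attribute values, so a single-attribute form is simply the case with one field.

```python
from itertools import product
from urllib.parse import urlencode


def build_queries(action_url, attribute_values):
    """Yield one GET query URL per combination of attribute values.

    attribute_values maps each form field name to a list of candidate
    values (hypothetical, domain-specific terms chosen by the crawler).
    """
    names = list(attribute_values)
    # Cartesian product over the candidate values of every field.
    for combo in product(*(attribute_values[n] for n in names)):
        yield action_url + "?" + urlencode(dict(zip(names, combo)))


# Single-attribute interface: one field, two candidate values.
single = list(build_queries(
    "http://example.com/search",
    {"title": ["wheat", "rice"]},
))

# Multi-attribute interface: two fields combined.
multi = list(build_queries(
    "http://example.com/search",
    {"author": ["smith"], "year": ["2001", "2002"]},
))
```

Because the number of combinations grows multiplicatively with the number of attributes, a practical crawler would presumably restrict the candidate values to those relevant to its target domain.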

hidden web, crawler, search engine

