Sunday, December 6, 2009

Deep and invisible web

The goal of a search engine is to index as much of the web as possible. But could a highly powerful search engine with unlimited processing power index the whole web? It could not, for various reasons. The part of the web that search engines can index is called the surface web; the part that is not indexed is called the deep web or invisible web. The deep web is estimated to be much larger than the surface web (in fact more than 10 times larger, though estimates vary). There are several reasons why the whole web is not indexable by search engines.
According to Wikipedia, deep Web resources may be classified into one or more of the following categories:
  • Dynamic content: pages that are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge.
  • Unlinked content: pages which are not linked to by other pages, which may prevent web crawling programs from accessing the content. This content is referred to as pages without backlinks (or inlinks).
  • Private Web: sites that require registration and login (password-protected resources).
  • Contextual Web: pages with content varying for different access contexts (e.g., ranges of client IP addresses or previous navigation sequence).
  • Limited access content: sites that limit access to their pages in a technical way (e.g., using CAPTCHAs, or using no-cache Pragma HTTP headers that prohibit search engines from browsing them and creating cached copies).
  • Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from web servers via Flash or Ajax solutions.
  • Non-HTML/text content: textual content encoded in multimedia (image or video) files or in specific file formats not handled by search engines.
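The "unlinked content" category above is easy to see in code: a crawler discovers pages only by following links from pages it has already seen, so a page with no inlinks is simply never reached. Here is a minimal sketch over a hypothetical in-memory "web" (the URLs are made up for illustration):

```python
from collections import deque

# A toy web: each URL maps to the list of URLs it links to.
# "/orphan" exists but no page links to it.
web = {
    "/home": ["/about", "/blog"],
    "/about": ["/home"],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": [],
    "/orphan": [],
}

def crawl(seed):
    """Breadth-first crawl: pages are discovered only via links."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

indexed = crawl("/home")
print(sorted(indexed))  # "/orphan" is never visited
```

However powerful the crawler, `/orphan` stays outside its index; that is exactly why pages without backlinks fall into the deep web.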
Search engines take various approaches to index the deep web. For example, Google's approach is to find HTML forms, submit input to those forms, and index the resulting HTML pages. Yahoo! made a small part of the deep web searchable by releasing Yahoo! Subscriptions, a search engine that searches through a few subscription-only web sites and asks the user to log in to access the content. Kosmix, on the other hand, taps into HTML forms in real time through API calls for any given search query, evaluates the results, and organizes them into a topic page. Research into different approaches to tap the deep web is ongoing, but it is certain that a large part of the web will remain invisible to search engines.
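The form-surfacing idea can be sketched in a few lines: find a form in a fetched page, fill its open-ended text field with a probe term, and build the URL whose response page would then be indexed. The form below is hypothetical, and this skips the hard parts (choosing good probe terms, handling POST forms), but it shows the mechanics:

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# A hypothetical search form, as a crawler might find it in a page.
PAGE = """
<form action="/search" method="get">
  <input type="text" name="q">
  <input type="hidden" name="lang" value="en">
  <input type="submit" value="Go">
</form>
"""

class FormFinder(HTMLParser):
    """Collects the form's action URL and its input fields."""
    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("type") != "submit":
            self.fields[attrs["name"]] = attrs.get("value", "")

finder = FormFinder()
finder.feed(PAGE)

# Fill the open-domain text field with a probe term; the resulting
# URL's response page is what would get crawled and indexed.
finder.fields["q"] = "jaguar"
url = finder.action + "?" + urlencode(finder.fields)
print(url)  # /search?q=jaguar&lang=en
```

A real surfacing system repeats this with many probe terms per form, keeping only result pages that contain distinct content.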

