Web Crawlers | How Do Web Crawlers Work? | Web Hosting India

Web Crawlers.

A Web crawler is a program that browses the Web in an automated and organized manner. Web crawlers are also called ants, automatic indexers, bots, worms and spiders. The process is referred to as Web crawling. A crawler is intended to crawl over the Internet and collect the desired information. Crawlers are generally used by search engines, for which they collect the links visited and much other important information that the search engines use in their algorithms.

Crawler-based search engines perform three steps:

1. Crawling: It recursively follows the hyperlinks it finds to discover further documents.

2. Indexing: It helps to find information faster. The index is essentially a catalog; every change in a web page is recorded here.

It consists of two steps:

Parsing: It extracts the links for further crawling and removes JavaScript, tags, comments etc.

Hashing: After parsing is done, the content is encoded into numbers.

3. Searching: From the millions of documents, only the most relevant pages are to be displayed. It involves the following steps:

  • Parse the query.
  • Convert words to WordIDs using hash functions
  • Compute a rank for every document
  • Sort the documents
  • List the top documents
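
The indexing and searching steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three pages in PAGES are made up, word_id uses MD5 purely as a convenient stable hash, and the "rank" is a plain count of matching words, which is far simpler than what a real search engine computes.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical documents standing in for crawled pages.
PAGES = {
    "a.html": "<html><script>var crawler = 1;</script><p>Web crawler basics</p></html>",
    "b.html": "<p>A crawler feeds the index. The crawler revisits pages.</p><!-- crawler -->",
    "c.html": "<p>Cooking recipes</p>",
}

def word_id(word):
    """Hashing: map a word to a numeric WordID (MD5 is an illustrative choice)."""
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    return int(digest[:8], 16)

def parse(text):
    """Parsing: drop script blocks, comments and tags, then split into words."""
    text = re.sub(r"<script.*?</script>", " ", text, flags=re.S | re.I)
    text = re.sub(r"<!--.*?-->", " ", text, flags=re.S)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Build an inverted index: WordID -> {url: number of occurrences}."""
    index = defaultdict(lambda: defaultdict(int))
    for url, html in pages.items():
        for word in parse(html):
            index[word_id(word)][url] += 1
    return index

def search(index, query, top_k=3):
    """Parse the query, hash its words, rank documents, sort, list the top ones."""
    scores = defaultdict(int)
    for word in parse(query):
        for url, count in index.get(word_id(word), {}).items():
            scores[url] += count  # rank = total occurrences of the query words
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked[:top_k]]

index = build_index(PAGES)
print(search(index, "crawler"))
```

Here the script block and the HTML comment are discarded during parsing, so only visible text contributes to the counts.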

[Figure: how-this-works diagram]

Though this process seems very simple, it is not. The Web itself makes crawling difficult:

Large volume of the Web

Extremely fast change in the Web

Dynamic page generation

These characteristics of the Web produce a wide variety of crawlable URLs.
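
One common way to cope with this variety is to normalize URLs before deciding whether a page has already been seen. The sketch below is an assumption about how that might look, not a description of any particular crawler: the function name normalize and the IGNORED_PARAMS set are hypothetical, and which query parameters are safe to drop depends on the site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters assumed not to change the page content.
IGNORED_PARAMS = {"sessionid", "utm_source", "utm_medium"}

def normalize(url):
    """Return a canonical form so that equivalent dynamic URLs compare equal."""
    parts = urlsplit(url)
    # Drop ignorable parameters and sort the rest for a stable order.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key.lower() not in IGNORED_PARAMS
    )
    path = parts.path or "/"
    # Lower-case scheme and host, and discard the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )

print(normalize("HTTP://Example.com/list?sort=asc&page=2&sessionid=abc#top"))
```

Two URLs that differ only in parameter order, letter case of the host, session identifier or fragment then map to the same string.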

Web crawlers work according to their predefined policies.

Selection policy: Given the large volume of the Web, it is nearly impossible to download and crawl all of it, so a crawler downloads a portion of the Web and works on that. The policy prioritizes web pages: the importance of each page is determined, and pages are then crawled in order of importance.
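
A selection policy can be pictured as a priority queue of pages waiting to be crawled. In the minimal sketch below, the link graph LINKS is invented, the page budget stands for "the portion of the Web" that gets downloaded, and importance is assumed to be the number of in-links discovered so far; real crawlers may use quite different importance measures.

```python
import heapq

# Hypothetical link graph standing in for the Web: page -> pages it links to.
LINKS = {
    "home": ["news", "about", "blog"],
    "news": ["blog", "home"],
    "about": ["blog"],
    "blog": ["home"],
    "archive": [],
}

def crawl(seed, links, budget):
    """Visit at most `budget` pages, most-linked-so-far first."""
    inlinks = {seed: 0}
    frontier = [(0, seed)]  # entries are (negative in-link count, url)
    done = set()
    visited = []
    while frontier and len(visited) < budget:
        neg_score, url = heapq.heappop(frontier)
        if url in done or -neg_score != inlinks[url]:
            continue  # skip pages already crawled and outdated queue entries
        done.add(url)
        visited.append(url)
        for target in links.get(url, []):
            if target in done:
                continue
            inlinks[target] = inlinks.get(target, 0) + 1
            heapq.heappush(frontier, (-inlinks[target], target))
    return visited

print(crawl("home", LINKS, budget=3))
```

Ties are broken alphabetically by the heap, and a page that nothing links to (here "archive") is never reached.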

Re-visit policy: The Web is very dynamic; by the time a crawl of a site is finished, many events may have occurred, including creations, updates and deletions. Several re-visit policies have been implemented, including the uniform policy, the proportional policy and the optimal policy.
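
The difference between the uniform and the proportional policy can be shown with a small calculation. The change rates and the visit budget below are invented numbers, and the function names are hypothetical; the optimal policy is not sketched here. The code also assumes at least one page has a non-zero change rate.

```python
def uniform_policy(change_rates, budget):
    """Revisit every page equally often, regardless of how often it changes."""
    share = budget / len(change_rates)
    return {url: share for url in change_rates}

def proportional_policy(change_rates, budget):
    """Revisit each page in proportion to its estimated change rate."""
    total = sum(change_rates.values())
    return {url: budget * rate / total for url, rate in change_rates.items()}

# Hypothetical estimates: changes per week for three pages.
rates = {"news": 6.0, "blog": 3.0, "about": 1.0}

# Hypothetical budget: 30 revisits per week in total.
print(uniform_policy(rates, 30))
print(proportional_policy(rates, 30))
```

With these numbers the uniform policy gives each page 10 visits, while the proportional policy gives the frequently changing "news" page 18 and the rarely changing "about" page 3.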

Politeness policy: It governs how to avoid overloading websites, since a Web crawler consumes many of a site's resources.
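
One simple way to be polite is to enforce a minimum delay between two requests to the same host. The sketch below assumes that approach; the class name PolitenessScheduler and the two-second delay are hypothetical, and the current time is passed in explicitly so the example runs without actually waiting.

```python
from urllib.parse import urlsplit

class PolitenessScheduler:
    """Decide when a URL may be fetched so that no host is hit too often."""

    def __init__(self, min_delay):
        self.min_delay = min_delay  # seconds between requests to one host
        self._next_free = {}        # host -> earliest allowed start time

    def schedule(self, url, now):
        """Return the time at which `url` may be fetched, given the time `now`."""
        host = urlsplit(url).netloc
        start = max(now, self._next_free.get(host, now))
        self._next_free[host] = start + self.min_delay
        return start

scheduler = PolitenessScheduler(min_delay=2.0)
print(scheduler.schedule("http://a.example/1", now=0.0))  # first request to this host
print(scheduler.schedule("http://a.example/2", now=0.5))  # same host, must wait
print(scheduler.schedule("http://b.example/1", now=0.5))  # different host, no wait
```

Requests to different hosts do not delay each other, so the crawler stays busy while each individual site sees only a slow trickle of requests.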

Parallelization policy: This states that a crawler runs multiple processes in parallel. It maximizes the download rate and minimizes the overhead. In short, it coordinates distributed Web crawlers.
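
A common way to coordinate parallel crawlers without downloading the same page twice is to assign every host to exactly one worker. The sketch below assumes that scheme; worker_for and partition are hypothetical names, and SHA-1 is used only because it gives the same result on every machine.

```python
import hashlib
from urllib.parse import urlsplit

def worker_for(url, num_workers):
    """Pick the worker responsible for a URL based on a stable hash of its host."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_workers

def partition(urls, num_workers):
    """Split a list of URLs into one bucket per worker."""
    buckets = [[] for _ in range(num_workers)]
    for url in urls:
        buckets[worker_for(url, num_workers)].append(url)
    return buckets

# Hypothetical URLs to distribute over three workers.
urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://c.example/1",
]
for number, bucket in enumerate(partition(urls, 3)):
    print(number, bucket)
```

Because all URLs of one host land in the same bucket, a per-host politeness delay can be enforced by a single worker without any communication between workers.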
