eSiteSecrets.com - http://esitesecrets.com
In Praise of the Web Crawler
http://esitesecrets.com/articles/702/1/In-Praise-of-the-Web-Crawler/Page1.html
By SEO Sapien
Published on 12/30/2008
 
When you hear the term “web crawler,” you could be forgiven for thinking it is something nightmarish out of the latest Stephen King novel. However, for people whose livelihoods depend on a web site's search engine optimization results, the web crawler is a very important, albeit tiny, friend to have on your side. Also known as a spider, ant, worm, web bot or automatic indexer, a crawler is a sophisticated little program that crawls through, or scans, Internet pages looking for data to index. Information changes constantly, with vast numbers of web pages added every day. Web crawlers help keep up with this expansion and allow search engines and other users to keep their databases current.

Web crawlers are mostly associated with search engines and search engine optimization. Search engines like Google, Yahoo, Live Search and Alexa use crawlers to collect data on public web pages so that when an Internet surfer types a search term on their site, for instance “rare books,” the engine can quickly provide a list of relevant web sites. This matters to a webmaster or marketer because the information “crawled” also determines how high or low the web site will rank in the results pages and how popular it is with the search engines.

A search engine assigns the web crawler a list of URLs, which it then visits systematically to index and analyze the content, including the HTML title, visible text, hyperlinks and keyword-rich or key-phrase-rich meta tags. The crawler then stores what it finds in a central database. The search engine uses this information to ascertain what the site is about and how relevant it is to a search query. That is why it is important to have well-designed web pages with information that is constantly kept up to date. Crawlers have limits on how much they can index on each site and must prioritize. A very large web site with lots of pages will probably not be indexed in its entirety, and may therefore lose out on valuable searches.
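
As a rough illustration of the indexing step described above, here is a minimal Python sketch that pulls the title, visible text, hyperlinks and meta tags out of a single page. It uses only the standard library's html.parser module. The names PageIndexer and index_page are made up for this example, and a real search engine's crawler is certainly far more elaborate than this.

```python
from html.parser import HTMLParser


class PageIndexer(HTMLParser):
    """Collects the fields a crawler might index from one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []        # href values of <a> tags, in page order
        self.meta = {}         # meta name (lowercased) -> content
        self.text_parts = []   # visible text fragments
        self._in_title = False
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def index_page(html):
    """Return a small record of what was found on the page."""
    parser = PageIndexer()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "text": " ".join(parser.text_parts),
        "links": parser.links,
        "meta": parser.meta,
    }
```

Calling index_page on a page's HTML returns a dictionary that could be stored in a database. How a search engine then weighs those fields against a query is outside the scope of this sketch.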

Web crawlers are not used only by search engines. Market researchers can also use crawlers to identify trends in a given market. A linguist might use a web crawler to search the Internet for a list of today’s most commonly used words. But web crawlers can also be used by the bad guys to collect private information available on the Internet.

Web crawlers can run into problems, often because they are overwhelmed by the vast amounts of data they must index. Sometimes a web crawler can accidentally fall into a “crawler trap” or overload a web server with requests. For that reason there are policies governing the behavior of a web crawler, such as identifying itself to the web site administrator and announcing which pages have been selected for download and when it will return to check for changes. This polite identification helps the web site’s owner stop the crawler if necessary.
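
The ideas in the paragraph above can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions. The USER_AGENT string and its URL are hypothetical. The fetch argument stands in for whatever code actually downloads a page and returns its links. The limits (max_pages, max_depth, delay) are arbitrary values chosen for the example. The crawler identifies itself on every request, waits between hits to the same host so it does not overload the server, and stops following links beyond a fixed depth so that an endless “crawler trap” cannot hold it forever.

```python
import time
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

# Hypothetical identification string; a real crawler would point to its own info page.
USER_AGENT = "ExampleCrawler/0.1 (+http://example.com/crawler-info)"


def polite_crawl(seeds, fetch, max_pages=50, max_depth=3, delay=1.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Breadth-first crawl with a page budget, a depth limit and a per-host delay.

    fetch(url, user_agent) is supplied by the caller and must return the
    list of links found on that page.
    """
    seen = set()       # every URL ever queued, so no page is visited twice
    queue = deque()
    for url in seeds:
        url = urldefrag(url).url          # ignore "#fragment" differences
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))

    last_hit = {}      # host -> time of the most recent request
    visited = []

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        host = urlparse(url).netloc

        # Politeness: leave at least `delay` seconds between hits to one host.
        wait = last_hit.get(host, float("-inf")) + delay - clock()
        if wait > 0:
            sleep(wait)

        links = fetch(url, USER_AGENT)    # the crawler identifies itself here
        last_hit[host] = clock()
        visited.append(url)

        if depth >= max_depth:
            continue                      # trap guard: do not go any deeper

        for link in links:
            absolute = urldefrag(urljoin(url, link)).url
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))

    return visited
```

Because the download step is passed in as fetch, the scheduling logic can be tried out on a made-up site without touching the network. The page budget also mirrors the earlier point that a crawler can only index so much of a very large site.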