Thanks to modern technology and the many innovations that come with it, you can now easily gather important information online by making proper use of current online data mining services. This process is called web crawling or “spidering”.

Also known as “web spiders”, web crawlers are programs or scripts that run automated tasks over the World Wide Web, typically for the purpose of web indexing. Many legitimate sites, in particular search engines such as Google and Microsoft Bing, use spidering as a means of providing up-to-date data. Web crawlers visit sites, take a copy of the pages they visit and then index them to provide fast searches. Crawlers offer several advantages, such as:

  • Automated maintenance tasks on a web site, such as checking links or validating HTML code.
  • User satisfaction from search-directed access to resources and easier browsability (via maintenance and advancements of the Web resulting from analyses).
  • Reduced network traffic in document space resulting from search-directed access.
  • The possibility for new websites to be found easily and for free.
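
To make the visit, copy and index cycle described above more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `SITE` dictionary is a hypothetical in-memory stand-in for a real website (a real crawler would fetch pages over HTTP), and the names `PageParser` and `crawl` are invented for this example. As a side effect, it also reports links that point to pages that do not exist, which is the kind of link-checking maintenance task listed above.

```python
import re
from collections import defaultdict, deque
from html.parser import HTMLParser

# Hypothetical in-memory "site": path -> HTML source.
# A real crawler would download each page over HTTP instead.
SITE = {
    "/": '<a href="/about">About</a> <a href="/missing">Old</a> Welcome home',
    "/about": '<a href="/">Home</a> About our crawler',
}


class PageParser(HTMLParser):
    """Collects the link targets and the text of a single page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def crawl(site, start="/"):
    """Visit every reachable page, index its words, and note broken links."""
    index = defaultdict(set)  # word -> set of pages containing it
    broken = []               # (page, link) pairs whose target is missing
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        parser = PageParser()
        parser.feed(site[url])
        for word in re.findall(r"[a-z]+", " ".join(parser.text).lower()):
            index[word].add(url)
        for link in parser.links:
            if link not in site:
                broken.append((url, link))
            elif link not in seen:
                seen.add(link)
                queue.append(link)
    return index, broken


index, broken = crawl(SITE)
print(sorted(index["about"]))
print(broken)
```

Note that the word "about" ends up indexed for both pages, even though on the home page it only appears as the text of a link.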

Unfortunately, there are also crawlers with more sinister intentions, such as harvesting email addresses from web pages for spamming purposes or submitting spam comments to your website forms or blogs. The main disadvantage, though, is that web pages are designed for humans, not crawlers. This means there is a lot of extra information for presentation purposes, such as navigation menus, information messages, headers, footers and so on. All of this makes for a more pleasant experience for the user and makes the page easier to navigate. The crawler, on the other hand, has no use for this information when retrieving pages; it actually reduces the quality of the information in the index. For example, a navigation menu is displayed on every page, so the crawler will index the navigation content for all pages.

There are ways to get around this, but they require either altering the generated HTML or making adjustments in the search engine. Also, if the design of the site changes, you have to make these adjustments again.
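
As an illustration of what an adjustment on the indexing side might look like, the following Python sketch skips presentation-only parts of a page before its text is indexed. It rests on a strong assumption that is not guaranteed for any given site: that navigation, headers and footers are marked up with the HTML elements `nav`, `header` and `footer`. The names `SKIP_TAGS`, `ContentExtractor` and `extract_content` are invented for this example.

```python
from html.parser import HTMLParser

# Assumption: presentation-only content lives inside these elements.
SKIP_TAGS = {"nav", "header", "footer"}


class ContentExtractor(HTMLParser):
    """Keeps only the text found outside the skipped elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_content(html):
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<header>My Site</header>"
    "<nav><a href='/'>Home</a> <a href='/blog'>Blog</a></nav>"
    "<p>Crawlers index this text.</p>"
    "<footer>Contact us</footer>"
)
print(extract_content(page))  # prints: Crawlers index this text.
```

This also shows why such adjustments are fragile: if a redesign moves the menu into a plain `div`, the list of skipped elements has to be revisited.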

The bottom line is that if you really want your website to take full advantage of the World Wide Web and obtain a good ranking, you should definitely let at least Google and other recognized search engines crawl it.
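
One common way to tell crawlers what they may visit is a robots.txt file at the root of the site. The sketch below uses Python's standard `urllib.robotparser` module to check how a well-behaved crawler would interpret a set of rules. The rules, the `example.com` URL and the crawler name "SomeOtherBot" are made up for illustration, and keep in mind that robots.txt is only advisory: the malicious crawlers mentioned earlier can simply ignore it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl everything,
# every other crawler is asked to stay away.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/blog/post.html"  # placeholder URL
google_allowed = rules.can_fetch("Googlebot", url)
other_allowed = rules.can_fetch("SomeOtherBot", url)

print(google_allowed)  # True
print(other_allowed)   # False
```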

PS: Web crawlers are not to be confused with WebCrawler, a metasearch engine that blends the top search results from Google Search and Yahoo! Search, and which was also the first Web search engine to provide full-text search.