Invitation to test a Web crawler

Edel Garcia's picture

Search engine robots (search bots) traverse the Web, looking for servers hosting Web documents. Once a document is found, its URL is visited and the URLs it contains are collected. The first, the last, or any other of the discovered URLs is then visited, and the process is repeated. This is what is known as crawling or spidering the Web.

If the earliest discovered URL is always visited next, the process is called breadth-first crawling. However, if the most recently discovered URL is always visited next, the process is referred to as depth-first crawling. Visiting a randomly chosen URL could therefore be called random crawling. While these are nice ways of discovering resources, they lack something: human intuition.
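The visiting orders described above differ only in which discovered URL is taken from the frontier next. The following is a minimal Python sketch of that idea, not the code of any real search bot. The link graph `LINKS`, the function name `crawl`, and the strategy labels are hypothetical and exist only for illustration; the sketch uses the standard names breadth-first (oldest discovered URL first) and depth-first (newest discovered URL first).

```python
import random
from collections import deque

# Hypothetical toy link graph: page -> links found on it, in document order.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}


def crawl(links, start, strategy="breadth", seed=0):
    """Return the pages in the order they are visited."""
    rng = random.Random(seed)   # seeded so the random order is repeatable
    frontier = deque([start])   # discovered but not yet visited
    seen = {start}              # prevents queueing a page twice
    order = []
    while frontier:
        if strategy == "breadth":
            page = frontier.popleft()   # oldest discovered URL first
        elif strategy == "depth":
            page = frontier.pop()       # newest discovered URL first
        elif strategy == "random":
            index = rng.randrange(len(frontier))
            page = frontier[index]
            del frontier[index]         # any discovered URL
        else:
            raise ValueError("unknown strategy: " + strategy)
        order.append(page)
        for url in links.get(page, []):
            if url not in seen:
                seen.add(url)
                frontier.append(url)
    return order


print(crawl(LINKS, "A", "breadth"))  # ['A', 'B', 'C', 'D', 'E', 'F']
print(crawl(LINKS, "A", "depth"))    # ['A', 'C', 'F', 'B', 'E', 'D']
```

A human-driven crawl replaces the fixed rule with a person choosing which entry of the frontier to visit next.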

At minerazzi, we thought: "Wouldn't it be nice to have a tool that allows users, in particular researchers, to behave as Web crawlers, with the added advantage that the crawling process is driven by human intuition?"

We developed our online spider precisely to answer that question. It is available at the following address:

http://www.minerazzi.com/tools/crawlinker.php

At the present time, the tool crawls only absolute URLs. This means that relative URLs and anchor tags without fully qualified URLs are ignored. At minerazzi, we also use the tool for indexing purposes.
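The absolute-only rule can be illustrated with a short Python sketch. This is not the tool's actual code, which is not shown in this post; the class `AbsoluteLinkParser`, the function `absolute_links`, and the choice to accept only `http` and `https` schemes are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AbsoluteLinkParser(HTMLParser):
    """Collect href values of anchor tags that are fully qualified URLs."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return  # anchor without a usable href, e.g. <a name="top">
        parts = urlparse(href)
        # Assumption: "absolute" means an http(s) scheme plus a host name.
        if parts.scheme in ("http", "https") and parts.netloc:
            self.links.append(href)


def absolute_links(html):
    """Return the absolute URLs found in anchor tags, in document order."""
    parser = AbsoluteLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<a href="http://www.upr.edu">UPR</a> '
    '<a href="/about">About</a> '
    '<a name="top">Top</a> '
    '<a href="https://www.cienciapr.org/en">CienciaPR</a>'
)
print(absolute_links(sample))
# ['http://www.upr.edu', 'https://www.cienciapr.org/en']
```

The relative link `/about` and the anchor without an `href` are skipped, which matches the behavior described above.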

The tool also works as a link checker by examining the status of links as found in the source code of documents. Because of this dual functionality, we believe that it is a great tool for teaching Web mining and for conducting link analysis and business intelligence (who links to whom).
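A link checker of this kind can be sketched in Python by sending a HEAD request and reading the HTTP status code. This is only an illustration under stated assumptions, not the tool's implementation: the function names `classify_status` and `check_link` are hypothetical, and the three labels (active, moved, broken) and their mapping to status code ranges are our own simplification.

```python
import http.client
from urllib.parse import urlsplit


def classify_status(code):
    """Map an HTTP status code to a coarse link state (assumed mapping)."""
    if 200 <= code < 300:
        return "active"
    if 300 <= code < 400:
        return "moved"   # redirects such as 301 or 302
    return "broken"      # anything else, including 4xx and 5xx


def check_link(url, timeout=10):
    """Send one HEAD request and classify the response.

    http.client does not follow redirects, so a moved link is reported
    as "moved" instead of being silently resolved.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        return classify_status(conn.getresponse().status)
    except (OSError, http.client.HTTPException):
        return "broken"  # DNS failure, refused connection, timeout, etc.
    finally:
        conn.close()
```

Calling `check_link` on each URL extracted from a page gives a simple per-link report. Some servers reject HEAD requests, so a more careful checker might retry with GET.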

While we recognize that the tool is far from perfect, we believe that you might like it. Expect changes and improvements in the near future. In the meantime, enjoy it.

Possible Applications

Users can use our tool for free to discover database URLs. URLs with name=value pairs delimited by ampersands (&) often, though not always, suggest a query mechanism used to search a database, and as such can be taken as addresses pointing to a database. For instance, the following address belongs to the Google database, where the query is 'computers':

http://www.google.com/search?sourceid=navclient&ie=UTF-8&rlz=1T4ACAW_enP...

In this case, 'q=computers' is the name=value pair responsible for searching the Google database.
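The name=value heuristic can be sketched with Python's standard `urllib.parse` module. The address used below is a shortened, hypothetical stand-in, not the truncated Google URL quoted above, and the function names `query_pairs` and `looks_like_database_url` are ours. As noted, the heuristic is only a hint: a URL with a query string does not always point to a database.

```python
from urllib.parse import parse_qsl, urlsplit


def query_pairs(url):
    """Return the name=value pairs of a URL's query string, in order."""
    return parse_qsl(urlsplit(url).query, keep_blank_values=True)


def looks_like_database_url(url):
    """Heuristic: a URL carrying name=value pairs may front a database."""
    return len(query_pairs(url)) > 0


# Hypothetical example address, shortened for illustration.
example = "http://www.google.com/search?ie=UTF-8&q=computers"
print(query_pairs(example))
# [('ie', 'UTF-8'), ('q', 'computers')]
print(looks_like_database_url(example))               # True
print(looks_like_database_url("http://www.upr.edu"))  # False
```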

The tool can also be used to extract links from scholarly journals, forums, blogs, or from school, government, or electronic newspaper sites. For instance, try a search for

http://www.cienciapr.org
http://www.upr.edu
http://www.pridco.com
http://www.vocero.com
etc...

Users can also use the tool to build their own database of links about a particular research topic, to grab reference links from a scientific article, or to test for active, broken, or moved links.

It is simply another Web tool in the toolbox of modern researchers and data miners.

Edel Garcia's picture

Just an update to announce the following additions to the Web crawler: DNS and MX record reporting; source code reporting; copy changes; relative URL resolution; and hypertext wrapping, IP, and header reporting. With these features, data miners and researchers are able to gather intelligence in real time. We plan to add new features for Web developers in the near future, and we are excited to report that universities around the world are asking to be part of the beta test effort of the minerazzi search engine and its several tools. Thank you, Cienciapr.org, for allowing us to share the news. Keep up the hard work, team. Gracias.
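
Relative URL resolution, one of the additions mentioned above, can be sketched with `urljoin` from Python's standard library. This only illustrates the general technique and is not the crawler's own code; the base address and the paths below are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical page on which the links were found.
base = "http://www.upr.edu/research/index.html"


def resolve(base_url, href):
    """Turn a link as written in a page into a fully qualified URL."""
    return urljoin(base_url, href)


print(resolve(base, "papers/a.html"))
# http://www.upr.edu/research/papers/a.html
print(resolve(base, "../about.html"))
# http://www.upr.edu/about.html
print(resolve(base, "/contact"))
# http://www.upr.edu/contact
print(resolve(base, "http://www.cienciapr.org/"))
# http://www.cienciapr.org/
```

Links that are already absolute pass through unchanged, so the same function can be applied to every link found on a page.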