If you are writing a robot, please fill in the form to be added to this list.
If anyone knows of any that aren't on this list, please let me know. Note that the descriptions of the robots do not necessarily represent my views, and that a listing here doesn't constitute a recommendation.
Its purpose is to generate a Resource Discovery database.
The HTTP User-agent field is set to JumpStation-Robot,
and the From field is also set.
It's usually run from *.stir.ac.uk.
The Proposed Standard for Robot Exclusion is supported.
It is a set of standalone programs and written in Perl 4, C, and C++.
It originated as a weekend project in 1993.
This information was last updated on Tue May 16 00:57:42 1995.
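Many entries in this list state whether the Proposed Standard for Robot Exclusion is supported. As a rough sketch of what such a check involves, the following uses Python's standard `urllib.robotparser`. The robots.txt content, the example.com URLs, and the helper name `allowed` are made up for illustration; none of the listed robots is claimed to be implemented this way.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that a server might publish.
ROBOTS_TXT = """\
User-agent: JumpStation-Robot
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("JumpStation-Robot", "http://example.com/private/x.html"))
    print(allowed("JumpStation-Robot", "http://example.com/index.html"))
```

A robot that follows the standard would make this check before each request and skip any URL for which it returns False.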
Its purpose is to generate a Resource Discovery database, and generate statistics.
The HTTP User-agent field is set to
RBSE Spider v. 1.0,
and the From field is also set.
It's usually run from rbse.jsc.nasa.gov (192.88.42.10).
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in C, Oracle, and WAIS.
Developed and operated as part of the NASA-funded Repository Based Software Engineering Program at the Research Institute for Computing and Information Systems, University of Houston - Clear Lake.
This information was last updated on Thu May 18 04:47:02 1995.
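Most entries record whether the robot sets the HTTP User-agent and From fields. A minimal sketch of how a client could attach both headers, using Python's standard `urllib.request`, is shown below. The robot name, the operator address, and the function name `build_request` are placeholders, not values from any listed robot.

```python
from urllib.request import Request


def build_request(url, agent, operator_email):
    """Build (but do not send) a GET request that identifies the robot."""
    return Request(url, headers={"User-Agent": agent, "From": operator_email})


if __name__ == "__main__":
    req = build_request(
        "http://example.com/",
        "ExampleRobot/1.0",       # hypothetical robot name
        "operator@example.com",   # hypothetical operator address
    )
    # urllib stores header names capitalized, e.g. "User-agent".
    print(req.header_items())
```

Setting both fields lets a server operator tell which robot made a request and whom to contact about it.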
Its purpose is to generate a Resource Discovery database, and generate statistics.
The HTTP User-agent field is set to
WebCrawler/2.0 libwww/3.0,
and the From field is also set.
It's usually run from spidey.webcrawler.com.
The Proposed Standard for Robot Exclusion is not yet supported.
It is a standalone program and written in C.
The WebCrawler originated as an experiment in Internet resource discovery at the University of Washington in 1994. Today, it is operated by America Online as a service to the Internet. robots.txt support is coming soon!
This information was last updated on Mon Jun 26 15:58:09 1995.
More information including a search interface is available on the NorthStar Database. Recent runs (26 April 94) will concentrate on textual analysis of the Web versus GopherSpace (from the Veronica data) as well as indexing.
Run from frognot.utdallas.edu, possibly other sites
in utdallas.edu, and from cnidir.org.
Now uses HTTP From fields, and sets User-agent to NorthStar
Run initially in June 1993, its aim is to measure the growth in the web. See details.
User-agent: WWWWanderer v3.0 by Matthew Gray <mkgray@mit.edu>
Its purpose is to discover resources on the fly.
The HTTP User-agent field is set to
'Fish-Search-Robot', but the From field isn't set.
It's usually run from www.win.tue.nl.
The Proposed Standard for Robot Exclusion is not supported because of the incurred overhead.
It is a standalone program and written in C, but a version exists that is integrated into the Tübingen Mosaic 2.4.2 browser (also written in C).
Originated as an addition to Mosaic for X. Available as a standalone program from ftp://ftp.win.tue.nl/pub/infosystems/www/fish-search.tar.gz
This information was last updated on Mon May 8 09:31:19 1995.
Written in Python.
Its aim is to check validity of Web servers. I'm not sure if it has ever been run remotely.
Its purpose is to validate links, and generate statistics.
The HTTP User-agent field is set to
MOMspider/1.00 libwww-perl/0.40,
and the From field is also set.
It's usually run from anywhere.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in Perl 4.
Originated as a research project at the University of California, Irvine, in 1993. Presented at the First International WWW Conference in Geneva, 1994.
This information was last updated on Sat May 6 08:11:58 1995.
A mirroring robot. Configured to stay within a directory, sleeps between requests, and the next version will use HEAD to check if the entire document needs to be retrieved.
Identification: Uses User-Agent: HTMLgobble v2.2,
and it sets the From field. Usually run by the
author, from tp70.rz.uni-karlsruhe.de.
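The HTMLgobble description mentions using HEAD to decide whether a whole document needs to be retrieved. One way that decision could look is sketched below: send a HEAD request, then compare the Last-Modified header with the time of the local copy. The function names and the comparison rule are assumptions for illustration, not HTMLgobble's actual logic.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request


def head_request(url):
    """Build a HEAD request, which asks for headers only, not the body."""
    return Request(url, method="HEAD")


def needs_refetch(last_modified, mirrored_at):
    """Decide whether to fetch the full document.

    last_modified: value of the Last-Modified response header, or None.
    mirrored_at: timezone-aware datetime of the local copy.
    """
    if last_modified is None:
        # Without a timestamp we cannot tell, so fetch to be safe.
        return True
    return parsedate_to_datetime(last_modified) > mirrored_at


if __name__ == "__main__":
    local_copy = datetime(1995, 5, 1, tzinfo=timezone.utc)
    print(needs_refetch("Tue, 16 May 1995 00:57:42 GMT", local_copy))
```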
Another indexing robot, for which more information is available. Actually has quite flexible search options.
Run from piper.cs.colorado.edu?
Its purpose is to generate a Resource Discovery database, validate links, validate HTML, and generate statistics.
The HTTP User-agent field is set to
W3M2/x.xxx,
and the From field is also set.
It's usually run from anyhost.lri.fr.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in Perl 4, Perl 5, and C++.
This information was last updated on Fri May 5 17:48:48 1995.
No longer running.
First spotted in mid-February 1994.
Identification: It runs from phoenix.doc.ic.ac.uk
Further information unavailable.
This is a research program in providing information retrieval and discovery in the WWW, using a finite memory model of the web to guide intelligent, directed searches for specific information needs.
More information is available on its home page.
Identification: User-agent "Lycos/x.x", run from
fuzine.mt.cs.cmu.edu. Lycos also
complies with the latest robot exclusion standard.
Currently under construction, this spider is a CGI script that searches the web for keywords given by the user through a form.
Identification: User-Agent: "ASpider/0.09", with a From field "fredj@nova.pvv.unit.no".
Run since 27 June 1994, for an internal XEROX research project, with some information being made available on SG-Scout's home page
Does a "server-oriented" breadth-first search in a round-robin fashion, with multiple processes.
Identification: User-Agent: "SG-Scout", with a From field set to the operator. Complies with standard Robot Exclusion. Run from beta.xerox.com.
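The SG-Scout entry describes a "server-oriented" breadth-first search in round-robin fashion. The sketch below shows one possible reading of the scheduling part only: URLs are queued per host, and the robot takes one URL from each host in turn so that no single server receives a burst of requests. It is single-process, and the function name and example hosts are invented; the real robot's multi-process design is not reproduced here.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


def round_robin_order(urls):
    """Return urls reordered so that hosts are visited in rotation."""
    queues = OrderedDict()
    for url in urls:
        # One FIFO queue per host, in order of first appearance.
        queues.setdefault(urlsplit(url).netloc, deque()).append(url)

    order = []
    while queues:
        # Copy the key list because exhausted hosts are removed on the way.
        for host in list(queues):
            order.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]
    return order


if __name__ == "__main__":
    for url in round_robin_order([
        "http://a.example/1", "http://a.example/2",
        "http://b.example/1", "http://a.example/3",
    ]):
        print(url)
```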
Announced on 12 July 1994, see their page.
Combination of an HTML form and a CGI script that verifies links from a given starting point (with some controls to prevent it from going off-site or running without limit).
Seems to run at full speed...
Identification: version 0.1 sets no User-Agent or From field. From version 0.2 up the User-Agent is set to "EIT-Link-Verifier-Robot/0.2". Can be run by anyone from anywhere.
Its purpose is to generate a Resource Discovery database.
The HTTP User-agent field is set to
NHSEWalker/3.0, and the From field is also set.
It's usually run from *.mcs.anl.gov.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in Perl 5.
This information was last updated on Fri May 5 15:47:55 1995.
It is a tool called 'WebLinker' which traverses a section of web, doing URN->URL conversion. It will be used as a post-processing tool on documents created by automatic converters such as LaTeX2HTML or WebMaker. More information is on its home page.
At the moment it works at full speed, but is restricted to local sites. External GETs will be added, but these will run slowly.
WebLinker is meant to be run locally, so if you see it elsewhere let the author know!
Identification: User-agent is set to 'WebLinker/0.0 libwww-perl/0.1'.
Its purpose is to generate a Resource Discovery database.
The HTTP User-agent field is set to
Emacs-w3/v[0-9\.]+,
and the From field is also set.
It's usually run from a variety of machines.
The Proposed Standard for Robot Exclusion is not supported.
It is integrated in a browser and written in Lisp.
This code has not been looked at in a while, but will be spruced up for the Emacs-w3 2.2.0 release sometime this month. It will honor the /robots.txt file at that time.
This information was last updated on Fri May 5 16:09:18 1995.
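The User-agent value in this entry is given as a pattern rather than a fixed string. A server operator who wanted to recognize it in request logs could match it roughly as follows; the sample strings and the function name are invented for illustration.

```python
import re

# Pattern copied from the entry above; it matches a prefix of the header value.
EMACS_W3_AGENT = re.compile(r"Emacs-w3/v[0-9\.]+")


def is_emacs_w3(user_agent):
    """Return True if the User-agent value starts with an Emacs-w3 version tag."""
    return EMACS_W3_AGENT.match(user_agent) is not None


if __name__ == "__main__":
    print(is_emacs_w3("Emacs-w3/v2.1.105"))
    print(is_emacs_w3("Mosaic/2.4"))
```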
The purpose (undertaken by HaL Software) of this run was to collect approximately 10k HTML documents for testing automatic abstract generation. This program will honor the robot exclusion standard and wait 1 minute between requests to a given server.
Identification: Sets User-agent to 'Arachnophilia', runs from halsoft.com.
This is a French Keyword-searching robot for the Mac, written in HyperCard. The author has decided not to release this robot to the public.
Awaiting identification details.
A URL checking robot, which stays within one step of the local server, see further information.
Awaiting identification details.
A mirroring robot.
Sets User-Agent to "tarspider <version>", and From to "chakl@fu-berlin.de".
This robot, in Perl 4, commenced operation in August 1994 and is being used to generate an index called MathSearch of documents on Web sites connected with mathematics and statistics. It ignores off-site links, so does not stray from a list of servers specified initially.
Identification: The current version sets User-Agent to
Peregrinator-Mathematics/0.7.
It also sets the From field.
The robot follows the exclusion standard, and accesses any given server no more often than once every several minutes.
A description of the robot is available.
Its purpose is to validate links.
The HTTP User-agent field is set to
Checkbot/x.xx,
and the From field is also set.
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program and written in Perl 5.
Checkbot checks links in a given set of pages on one or more servers. It reports links which returned an error code.
This information was last updated on Tue Mar 12 09:16:24 1996.
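Checkbot is described as checking the links in a set of pages and reporting those that return an error code. A simplified sketch of that idea follows: collect the `href` targets of anchor tags, resolve them against the page URL, and keep those whose status is 400 or higher. The `fetch_status` callable is a hypothetical stand-in for a real HTTP request, and the threshold is an assumption; Checkbot itself is written in Perl 5 and may behave differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def broken_links(html, base_url, fetch_status):
    """Return (url, status) pairs for links whose status is an error code."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    report = []
    for url in collector.links:
        status = fetch_status(url)  # hypothetical: returns an HTTP status code
        if status >= 400:
            report.append((url, status))
    return report
```

In practice `fetch_status` would issue a request for each URL; keeping it as a parameter makes the link extraction easy to try out without network access.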
The HTTP User-agent field is set to
webwalk,
and the From field is also set.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in C.
Webwalk is easily extensible to perform virtually any maintenance function which involves web traversal, in a way much like the '-exec' option of the find(1) command. Webwalk is usually used behind the HP firewall.
This information was last updated on Wed Nov 15 09:51:59 PST 1995.
A Resource Discovery Robot, part of the Harvest Project.
Runs from bruno.cs.colorado.edu,
sets User-agent and From fields.
Pauses 1 second between requests (by default).
Note that Harvest's motivation is to index community- or topic-specific collections, rather than to locate and index all HTML objects that can be found. Also, Harvest allows users to control the enumeration in several ways, including stop lists and depth and count limits. Therefore, Harvest provides a much more controlled way of indexing the Web than is typical of robots.
The HTTP User-agent field is set to
Katipo/1.0,
and the From field is also set.
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program and written in C.
A Macintosh robot that periodically (typically, once per day) walks through the global history files provided by some browsers (Mosaic, NetScape), looking for pages that have changed since last visited.
This information was last updated on Sat May 6 10:37:33 1995.
Its purpose is to generate a Resource Discovery database.
The HTTP User-agent field is set to
InfoSeek Robot 1.0,
and the From field is also set.
It's usually run from corp-gw.infoseek.com.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in Python.
Collects WWW pages for both InfoSeek's free WWW search and commercial search. Uses a unique proprietary algorithm to identify the most popular and interesting WWW pages. Very fast, but never has more than one request per site outstanding at any given time. Has been refined for more than a year.
This information was last updated on Sun May 28 01:35:48 1995.
Its purpose is to validate links, perform mirroring, and copy document trees.
The HTTP User-agent field is set to
'GetURL.rexx v1.05 by James@Snark.apana.org.au', but the
From field is not set.
It's usually run from wherever it's run from :-).
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program and written in ARexx (Amiga REXX).
Designed as a tool for retrieving web pages in batch mode without the encumbrance of a browser. Can be used to describe a set of pages to fetch, and to maintain an archive or mirror. Is not run by a central site and accessed by clients; it is run by the end user or archive maintainer.
This information was last updated on Tue May 9 15:13:12 1995.
Sets User-agent to 'OMW/0.1 libwww/217'
Follows robot exclusion rules, and shouldn't visit any host more than once in 5 minutes.
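Several robots in this list limit how often they visit a host (for example, once in 5 minutes here, or a 1-second pause by default for Harvest). A small sketch of a per-host throttle along those lines is given below. The class name and its methods are invented for illustration, and the clock value is passed in explicitly so the logic can be tried without actually sleeping.

```python
class HostThrottle:
    """Track the last visit per host and report how long to wait."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self._last_visit = {}

    def wait_time(self, host, now):
        """Seconds to wait before host may be visited again at time now."""
        last = self._last_visit.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_interval_s - (now - last))

    def record(self, host, now):
        """Note that host was visited at time now (seconds)."""
        self._last_visit[host] = now


if __name__ == "__main__":
    throttle = HostThrottle(min_interval_s=300)  # 5 minutes, as in the entry above
    throttle.record("www.example.com", now=0)
    print(throttle.wait_time("www.example.com", now=100))
```

A crawler loop would call `wait_time` before each request, sleep for the returned duration (or move on to another host), and call `record` after the request is made.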
The TkWWW Robot is described in a paper presented at the WWW94 Conference in Chicago. It is designed to search Web neighborhoods to find pages that may be logically related. The Robot returns a list of links that looks like a hot list. The search can be by key word or all links at a distance of one or two hops may be returned.
For more information see The TkWWW Home Page.
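The TkWWW description mentions returning all links at a distance of one or two hops. On an in-memory link graph, that kind of bounded search can be sketched as a breadth-first traversal with a depth limit. The graph, the page names, and the function name below are invented; the robot's real implementation is not shown in the source.

```python
from collections import deque


def links_within(graph, start, max_hops):
    """Return {page: hops} for pages reachable from start in <= max_hops links.

    graph maps each page to the list of pages it links to.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if distance[page] == max_hops:
            continue  # do not expand beyond the hop limit
        for target in graph.get(page, ()):
            if target not in distance:
                distance[target] = distance[page] + 1
                queue.append(target)
    del distance[start]
    return distance


if __name__ == "__main__":
    web = {"A": ["B", "C"], "B": ["D"], "C": ["A"], "D": ["E"]}
    print(links_within(web, "A", 2))
```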
Its purpose is to validate links, and generate statistics.
The HTTP User-agent field is set to
dlw3robot/x.y (in TclX by http://hplyot.obspm.fr/~dl/),
and the From field is also set.
It's usually run from hplyot.obspm.fr.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in TCL.
This information was last updated on Tue May 23 17:51:39 1995.
Its purpose is to generate a Resource Discovery database, and copy document trees. Our primary goal is to develop an advanced method for indexing the WWW documents.
The HTTP User-agent field is set to
TITAN/0.1,
and the From field is also set.
It's usually run from nttnly.isl.ntt.jp.
By using libwww-perl, the Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in Perl 4.
This information was last updated on Tue Jun 13 05:21:24 1995.
Its purpose is to generate a Resource Discovery database, and validate HTML.
The HTTP User-agent field is set to
CS-HKUST-IndexServer/1.0,
and the From field is also set.
It's usually run from dbx.cs.ust.hk.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in C.
Part of an on-going research project on Internet Resource Discovery at the Department of Computer Science, Hong Kong University of Science and Technology (CS-HKUST).
This information was last updated on Tue Jun 20 02:39:16 1995.
Its purpose is to generate a Resource Discovery database.
Unfortunately neither User-agent
nor From HTTP fields are set.
It's usually run from wizard.spry.com or tiger.spry.com.
Spry is refusing to give any comments about this robot.
This information was last updated on Tue Jul 11 09:29:45 GMT 1995.
Its purpose is to validate, cache and maintain links.
The HTTP User-agent field is set to 'weblayers/0.0'.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program written in Perl 5.
It is designed to maintain the cache generated by the emacs w3 mode (N*tscape replacement) and to support annotated documents (keep them in sync with the original document via diff/patch).
This information was last updated on Fri Jun 23 16:30:42 FRE 1995.
Its purpose is to perform mirroring.
The HTTP User-agent field is set to
'WebCopy/(version)', but the From field isn't set.
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program and written in Perl 4 or 5.
WebCopy can retrieve files recursively using HTTP protocol. It can be used as a delayed browser or as a mirroring tool. It cannot jump from one site to another. It can be used by anyone from anywhere... sorry!
This information was last updated on Sun Jul 2 15:27:04 1995.
Its purpose is to generate a Resource Discovery database, and generate statistics.
The HTTP User-agent field is set to
Scooter/1.0,
and the From field is also set.
It's usually run from scooter.pa-x.dec.com.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in C.
Generates the data for the Alta Vista Internet search service.
This information was last updated on Thu Jul 6 19:31:12 1995.
A crude robot built on top of Netscape and Userland Frontier, a scripting system for Macs.
Its purpose is to validate HTML, and generate statistics.
The HTTP User-agent field is set to
'WebWatch', but the From field isn't set.
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program and written in C++.
Checks URLs modified since a given date. Shareware.
This information was last updated on Wed Jul 26 13:36:32 1995.
Its purpose is to generate a Resource Discovery database, and to generate statistics.
The HTTP User-agent field is set to
ArchitextSpider,
and the From field is also set.
It's usually run from *.atext.com.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in Perl 5 and C.
The ArchitextSpider collects information for Excite, Architext's internet navigation service.
This information was last updated on Tue Oct 3 01:10:26 1995.
Its purpose is to generate a Resource Discovery database.
The HTTP User-agent field is set to
AITCSRobot/1.1,
and the From field is also set.
It's usually run from cs6.cs.ait.ac.th.
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program and written in Perl 5.
This robot traverses the net and creates a searchable database of Web pages. It stores the title string of the HTML document and the absolute URL. A search engine provides the boolean AND and OR query models, with or without filtering the stop list of words. A feature is provided for Web page owners to add their URL to the searchable database.
This information was last updated on Wed Oct 4 06:54:31 1995.
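The entry above describes a database of page titles searched with boolean AND and OR models, optionally filtering a stop list. A toy sketch of that query model is shown below. The stop list, the sample titles, and the function name are assumptions made for illustration and do not come from the robot's actual search engine.

```python
# Hypothetical stop list; the real one is not given in the source.
STOP_WORDS = {"a", "and", "of", "the"}


def search(titles, query, mode="AND", filter_stop_words=True):
    """Return URLs whose title matches the query.

    titles maps URL -> title string. mode is "AND" (all terms must
    appear) or "OR" (any term may appear).
    """
    terms = [
        term for term in query.lower().split()
        if not (filter_stop_words and term in STOP_WORDS)
    ]
    if not terms:
        return []
    hits = []
    for url, title in titles.items():
        words = set(title.lower().split())
        matched = [term in words for term in terms]
        if all(matched) if mode == "AND" else any(matched):
            hits.append(url)
    return hits


if __name__ == "__main__":
    db = {
        "http://a.example/": "The History of Robots",
        "http://b.example/": "Robots and Spiders",
    }
    print(search(db, "robots spiders", mode="OR"))
```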
Its purpose is to generate a Resource Discovery database from the Finnish (top-level domain .fi) www servers. The resulting database is used by the search engine at http://www.fi/search.html.
The HTTP User-agent field is set to
"Hämähäkki/0.2" (or to a later version), and the
From field is also set. It is run from *.www.fi. (The
name Hämähäkki is just Finnish for spider.)
The Proposed Standard for Robot Exclusion is supported.
Its purpose is to generate a Resource Discovery database, and generate statistics.
The HTTP User-agent field is set to
'explorersearch', but the From field isn't set.
It's usually run from bitz.co.nz.
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program and written in C++.
Primarily designed to create a searchable keyword database of HTML pages in a particular domain or at a particular site.
This information was last updated on Wed Nov 1 20:45:10 1995.
This robot currently fetches HTML documents only from the .jp domain. Searching in Japanese is available.
The HTTP User-agent field is set to
Senrigan/xxxxxx,
and the From field is also set.
It's usually run from ns.info.waseda.ac.jp.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in C.
It has been running since Dec 1994.
This information was last updated on Thu Nov 9 10:28:25 PST 1995.
Its purpose is to generate a Resource Discovery database, and generate statistics.
The HTTP User-agent field is set to
FunnelWeb-1.0,
and the From field is also set.
It's usually run from earth.planets.com.au.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in C, and C++.
Localised South Pacific Discovery and Search Engine, plus distributed operation under development.
This information was last updated on Mon Nov 27 21:30:11 1995.
Its purpose is to generate a Resource Discovery database, and validate links.
The HTTP User-agent field is set to
JubiiRobot/version#, and the From field is also set. It's
usually run from any host in the cybernet.dk domain.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in Visual Basic 4.0.
Used for indexing the .dk top-level domain as well as other Danish sites for a Danish web database, as well as link validation. Will be in constant operation from Spring 1996.
This information was last updated on Sat Jan 6 20:58:44 1996.
Its purpose is to generate a Resource Discovery database.
The HTTP User-agent field is set to
Jobot/0.1alpha libwww-perl/4.0,
and the From field is also set.
It's usually run from supernova.micrognosis.com.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in Perl 4.
Intended to seek out sites of potential "career interest". Hence - Job Robot.
This information was last updated on Tue Jan 9 18:55:55 1996.
Its purpose is to generate a Resource Discovery database, perform mirroring, and generate statistics.
The HTTP User-agent field is set to
Deweb/1.01,
and the From field is also set.
It's usually run from deweb.orbit.de.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in Perl 4.
Uses a combination of an Informix(tm) database and WN 1.11 server software for indexing/resource discovery, full-text search, and text excerpts.
This information was last updated on Wed Jan 10 08:23:00 1996.
Its purpose is to generate a Resource Discovery database, and validate links.
The HTTP User-agent field is set to
'roots/0.1', but the From field isn't set.
It's usually run from shiva.di.uminho.pt or from www.di.uminho.pt.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in Perl 5.
Parallel robot developed at Minho University in Portugal to catalog relations among URLs and to support a special navigation aid. First versions since October 1995.
This information was last updated on Wed Jan 10 23:19:08 1996.
Its purpose is to generate a Resource Discovery database, copy document trees, and generate statistics.
The HTTP User-agent field is set to
Robot du CRIM 1.0a,
and the From field is also set.
It's usually run from zorro.crim.ca.
The Proposed Standard for Robot Exclusion is supported.
It is integrated in a browser and written in Perl 5, and Sql plus.
Part of the RISQ's Francoroute project for researching francophone URLs.
Uses the Accept-Language tag and reduces demand accordingly.
This information was last updated on Wed Jan 10 23:56:22 1996.
Its purpose is to generate search indexes.
The HTTP User-agent field is set to
Duppies,
and the From field is also set.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program.
Designed to allow webmasters to provide a searchable index of their own site as well as to other sites, perhaps with similar content. Duppies is currently available for the Mac OS with an NT port planned.
This information was last updated on Fri Jan 19 05:08:15 1996.
The HTTP User-agent field is set to
IncyWincy/1.0b1,
and the From field is also set.
It's usually run from osiris.sunderland.ac.uk.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in C++.
Various Research projects at the University of Sunderland
This information was last updated on Fri Jan 19 21:50:32 1996.
Its purpose is to generate a Resource Discovery database, validate links, validate HTML, perform mirroring, and generate statistics.
The HTTP User-agent field is set to IBM_Planetwide,
and the From field is also set. It's usually run from
www.ibm.com or www2.ibm.com.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in Perl 5.
Restricted to IBM owned or related domains.
This information was last updated on Mon Jan 22 22:09:19 1996.
Its purpose is to generate a Resource Discovery database.
The HTTP User-agent field is set to
'Nomad-V2.x', but the From field isn't set.
It's usually run from *.cs.colostate.edu.
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program and written in Perl 4.
Developed in 1995 at Colorado State University.
This information was last updated on Sat Jan 27 21:02:20 1996.
Its purpose is to generate a Resource Discovery database, and generate statistics.
The HTTP User-agent field is set to
UCSD-Crawler,
and the From field is also set.
It's usually run from nuthaus.mib.org or scilib.ucsd.edu.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in Perl 4.
Should hit ONLY within UC San Diego - trying to count servers here.
This information was last updated on Sat Jan 27 09:21:40 1996.
Its purpose is to perform mirroring.
The HTTP User-agent field is set to
WebFetcher/0.8,
and the From field is also set.
It's usually run from your own host.
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program and written in C++.
Don't wait! OnTV's WebFetcher mirrors whole sites down to your hard disk on a TV-like schedule. Catch w3 documentation. Catch discovery.com without waiting! A fully operational web robot for NT/95 today, most unix soon, MAC tomorrow.
This information was last updated on Sat Jan 27 10:31:43 1996.
Its purpose is to generate a Resource Discovery database.
The HTTP User-agent field is set to
Libertech-Rover,
and the From field is also set.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in C++.
Originated as part of a suite of Internet Products to organize, search & navigate Intranet sites and to validate links in HTML documents.
This information was last updated on Mon Feb 19 16:06:56 1996.
Its purpose is to generate a Resource Discovery database.
The HTTP User-agent field is set to
'htdig/3.0b3', but the From field isn't set.
It's usually run from teamball.sdsu.edu.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in C++.
Not an Internet-wide search system. Used for indexing several WWW servers on a LAN.
This information was last updated on Thu Feb 8 23:56:34 1996.
Its purpose is to generate a Resource Discovery database, and generate statistics.
The HTTP User-agent field is set to
BlackWidow,
and the From field is also set.
It's usually run from 140.190.65.*.
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program and written in C, and C++.
Started as a research project and is now used to find links for a random link generator. It is also used to research the growth of specific sites.
This information was last updated on Fri Feb 9 00:11:22 1996.
Its purpose is to generate a Resource Discovery database, and generate statistics.
The HTTP User-agent field is set to
Pioneer,
and the From field is also set.
It's usually run from *.uncfsu.edu or flyer.ncsc.org.
The Proposed Standard for Robot Exclusion is supported.
It is a stand-alone program and written in C.
Pioneer is part of an undergraduate research project.
This information was last updated on Mon Feb 5 02:49:32 1996.
Its purpose is to generate a Resource Discovery database, validate links, perform mirroring, copy document trees, and generate statistics.
The HTTP User-agent field is set to
NetCarta CyberPilot Pro,
and the From field is also set.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in C++.
The NetCarta WebMap Engine is a general purpose, commercial spider. Packaged with a full GUI in the CyberPilot Pro product, it acts as a personal spider to work with a browser to facilitate context-based navigation. The WebMapper product uses the robot to manage a site (site copy, site diff, and extensive link management facilities). All versions can create publishable NetCarta WebMaps, which capture the crawled information. If the robot sees a published map, it will return the published map rather than continuing its crawl.
Since this is a personal spider, it will be launched from multiple domains. This robot tends to focus on a particular site. No instance of the robot should have more than one outstanding request out to any given site at a time. The User-agent field contains a coded ID identifying the instance of the spider; specific users can be blocked via robots.txt using this ID.
This information was last updated on Sun Feb 18 02:02:49 1996.
Its purpose is to generate a Resource Discovery database, validate links, validate HTML, and generate statistics.
The HTTP User-agent field is set to
Hazel's Ferret Web hopper,
and the From field is also set.
The Proposed Standard for Robot Exclusion
Date: Fri, 17 May 1996 07:55:08 +0100
From: Rudolf Zuberbauer
It is a standalone program and written in C++, and Visual Basic / Java.
The wild ferret web hoppers are designed as specific agents to retrieve data from all available sources on the internet. They work in an onion format, hopping from spot to spot one level at a time over the internet. The information is gathered into different relational databases, known as "Hazel's Horde". The information is publicly available and will be free for the browsing at www.greenearth.com. Effective date of the data posting is to be announced.
This information was last updated on Mon Feb 19 00:28:37 1996.
Its purpose is to generate a Resource Discovery database, and generate statistics.
The HTTP
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in Java.
This information was last updated on Wed Feb 21 02:57:42 1996.
Its purpose is to perform mirroring, and copy document trees.
The HTTP
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program and written in C.
Created to make working snapshots of remote sites and map links on web sites.
Currently in Beta, will support robot exclusion (robots.txt).
Currently only one licensed beta-test site.
This information was last updated on Wed Feb 21 14:45:18 1996.
Its purpose is to generate a Resource Discovery database, and generate statistics.
The HTTP
The Proposed Standard for Robot Exclusion is not supported.
The Wombat robot is part of a suite of search engine programs written in IBM Rexx/VisualAge C++ under OS/2.
The robot is the basis of the Web Wombat search engine (Australian/New Zealand content ONLY).
This information was last updated on Thu Feb 29 00:39:49 1996.
Its purpose is to generate a Resource Discovery database, and generate statistics.
The HTTP
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in C.
A fast, parallel, scalable, friendly spider that obeys robots.txt, and collects web pages for the Inktomi search engine.
This information was last updated on Sun Mar 3 19:07:17 1996.
Its purpose is to generate a Resource Discovery database.
The HTTP
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program and written in Perl 5, C, and Java.
HKU Octopus is an ongoing project for resource discovery in the Hong Kong and China WWW domain. It is a research project conducted by three undergraduates at the University of Hong Kong.
This information was last updated on Thu Mar 7 14:21:55 1996.
Its purpose is to generate a Resource Discovery database.
The HTTP
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in Perl 5.
Intended to be an index of computer vision pages, containing all pages within n links (for some small n) of the Computer Vision Home Page.
This information was last updated on Fri Mar 8 16:03:04 1996.
Its purpose is to generate a Resource Discovery database.
The HTTP
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program and written in C++.
This information was last updated on Tue Mar 12 15:52:25 1996.
Its purpose is to perform mirroring.
The HTTP
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program and written in Perl 4, and Perl 5.
W3mir uses the If-Modified-Since HTTP header and recurses only the directory and subdirectories of its start document. Known to work on U*ixes and Windows NT.
This information was last updated on Wed Apr 24 13:23:42 1996.
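The W3mir entry names two techniques: sending an If-Modified-Since header, and staying within the directory (and subdirectories) of the start document. Both can be sketched in a few lines of Python. The function names and example URLs are invented, and the scope rule shown is only one plausible interpretation; W3mir itself is a Perl program and may differ in detail.

```python
import posixpath
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.parse import urlsplit


def in_scope(start_url, candidate_url):
    """True if candidate_url lies in the start document's directory tree."""
    start, cand = urlsplit(start_url), urlsplit(candidate_url)
    if (start.scheme, start.netloc) != (cand.scheme, cand.netloc):
        return False
    # Directory of the start document, e.g. "/docs/" for "/docs/index.html".
    base_dir = start.path.rsplit("/", 1)[0] + "/"
    # Normalize so that "/docs/../other/" is not mistaken for "/docs/".
    cand_path = posixpath.normpath(cand.path or "/")
    return (cand_path + "/").startswith(base_dir)


def conditional_headers(last_fetched):
    """Headers asking the server to send the body only if it has changed.

    last_fetched must be a timezone-aware datetime in UTC.
    """
    return {"If-Modified-Since": format_datetime(last_fetched, usegmt=True)}


if __name__ == "__main__":
    start = "http://example.com/docs/index.html"
    print(in_scope(start, "http://example.com/docs/sub/a.html"))
    print(conditional_headers(datetime(1996, 4, 24, 13, 23, 42, tzinfo=timezone.utc)))
```

A server that honors the header answers 304 Not Modified for unchanged documents, so the mirror can skip the download.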
Its purpose is to generate a Resource Discovery database.
The HTTP
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program and written in Perl 5.
Finds URLs for K-12 content management.
This information was last updated on Sat Mar 23 20:12:39 1996.
Its purpose is to validate HTML.
The HTTP
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program and written in Shockwave/Director.
GetBot's purpose is to index all the sites it can find that contain Shockwave movies. It is the first bot or spider written in Shockwave. The bot was originally written at Macromedia on a hungover Sunday as a proof of concept. - Alex Zavatone 3/29/96
This information was last updated on Fri Mar 29 20:06:12 1996.
Its purpose is to generate a Resource Discovery database.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program written in TCL and C.
Locates chemical structures in Chemical MIME formats on
WWW and FTP servers and downloads them into a database
searchable with structure queries (substructure,
full structure, formula, properties, etc.).
This information was last updated on
Sat Mar 30 00:55:40 1996.
Its purpose is to generate a Resource Discovery database, and validate links.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program written in C++.
Gathers information about travel services, activities, and/or destinations
for use by the Travel-Finder service. Semi-automated; requires operator
intervention to follow links. Results will be publicly available starting
in May.
This information was last updated on
Fri Apr 5 03:06:43 1996.
Its purpose is to generate a Resource Discovery database.
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program written in C.
Originated as a "fun" project at the HSE at Eindhoven.
This information was last updated on
Tue Apr 16 18:44:55 1996.
Its purpose is to generate a Resource Discovery database.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program written in Perl 4.
Undergraduate Project at Trinity College.
Obeys Robot Exclusion Protocol, and usually runs during
hours when traffic is light.
May not be active much longer.
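What obeying the exclusion protocol involves can be sketched with Python's standard urllib.robotparser module. This is an illustration only, not this robot's Perl 4 code. The robots.txt content, the helper name allowed and the example.org URLs are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a server might publish at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the rules in `robots_txt` permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A robot would make this check before every request and skip any URL for which it returns False. In practice the rules are fetched once per server and reused, rather than being requested again for each page.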
This information was last updated on
Wed Apr 17 18:42:40 1996.
Its purpose is to generate a Resource Discovery database.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program written in C.
A complete software package that collects information using a distributed
workload and supports context queries.
Intended to be a complete, up-to-date resource for Israeli sites and
information related to Israel or Israeli society.
This information was last updated on
Tue Apr 23 19:23:55 1996.
Its purpose is to generate a Resource Discovery database.
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program written in Perl 5.
This program searches for the PGP public key of the specified user.
Originated as a research project at Salerno University in 1995.
This information was last updated on
Sun Apr 14 13:38:50 1996.
Its purpose is to generate a Resource Discovery database.
The Proposed Standard for Robot Exclusion is supported.
It is a standalone program written in C.
Collects WWW pages for both InfoSeek's free WWW search services.
Uses a unique, incremental, very fast proprietary
algorithm to find WWW pages.
This information was last updated on
Sat Apr 27 01:20:15 1996.
Its purpose is to perform mirroring, and copy document trees.
The Proposed Standard for Robot Exclusion is not supported.
It is a standalone program written in C++.
It downloads web pages to the hard drive for off-line browsing.
This information was last updated on
Mon Apr 29 08:52:25 1996.
BackRub
BackRub is maintained by Larry Page <page@leland.stanford.edu>.
The HTTP User-agent field is set to BackRub/*.*,
and the From field is also set.
It's usually run from *.stanford.edu.
Templeton
Templeton is maintained by Neal Krawetz <nealk@tamu.edu>.
The HTTP User-agent field is set to Templeton,
and the From field is also set.
It's usually run from the cs.tamu.edu domain.
The Web Wombat
The Web Wombat is maintained by Internet Communications.
The HTTP User-agent field isn't set, and the From field isn't set either.
It's usually run from qwerty.intercom.com.au.
Inktomi's Slurp
Inktomi's Slurp is maintained by Paul Gauthier <gauthier@cs.berkeley.edu>.
The HTTP User-agent field is set to BSE/Slurp,
and the From field is also set.
It's usually run from *.cs.berkeley.edu.
HKU WWW Octopus
HKU WWW Octopus is maintained by Law Kwok Tung, Lee Tak Yeung, and Lo Chun Wing <jax@cs.hku.hk>.
The HTTP User-agent field is set to HKU WWW Robot,
and the From field is also set.
It's usually run from phoenix.cs.hku.hk.
vision-search
vision-search is maintained by Henry A. Rowley <har@cs.cmu.edu>.
The HTTP User-agent field is set to 'vision-search/3.0', but the From field isn't set.
It's usually run from dylan.ius.cs.cmu.edu.
Resume Robot
Resume Robot is maintained by James Stakelum <proquest@onramp.net>.
The HTTP User-agent field is set to Resume Robot,
and the From field is also set.
w3mir
w3mir is maintained by Nicolai Langfeldt and others <w3mir-core@usit.uio.no>.
The HTTP User-agent field is set to w3mir,
and the From field is also set.
SafetyNet Robot
SafetyNet Robot is maintained by Michael L. Nelson <m.l.nelson@urlabs.com>.
The HTTP User-agent field is set to SafetyNet Robot 0.1,
and the From field is also set.
It's usually run from *.urlabs.com.
GetBot
GetBot is maintained by Alex Zavatone <zav@macromedia.com>.
The HTTP User-agent field is set to '???', but the From field isn't set.
CACTVS Chemistry Spider
CACTVS Chemistry Spider is maintained by W. D. Ihlenfeldt <wdi@eros.ccc.uni-erlangen.de>.
The HTTP User-agent field is set to 'CACTVS Chemistry Spider', but the From field isn't set.
It's usually run from utamaro.organik.uni-erlangen.de.
Travel-Finder Spider
Travel-Finder Spider is maintained by Ken Wadland <ken@travel-finder.com>.
The HTTP User-agent field is set to travelfinder,
and the From field is also set.
It's usually run from travel-finder.com.
ILSE
ILSE is maintained by Wiebe Weikamp <wiebe@il.ft.hse.nl>.
Unfortunately, neither the User-agent nor the From HTTP field is set.
It's usually run from charm.il.ft.hse.nl.
Personal Times
Personal Times is maintained by James McCabe <jjmccabe@tcd.ie>.
The HTTP User-agent field is set to Personal Times,
and the From field is also set.
It's usually run from scott.cs.tcd.ie.
Israeli-search
Israeli-search is maintained by Etamar Laron <etamar@xpert.com>.
The HTTP User-agent field is set to 'IsraeliSearch/1.0', but the From field isn't set.
It's usually run from dylan.ius.cs.cmu.edu.
PKA
pka is maintained by Massimiliano Pucciarelli <puma@comm2000.it>.
The HTTP User-agent field is set to PGP-KA/1.2,
and the From field is also set.
It's usually run from salerno.starnet.it.
Infoseek Sidewinder
Infoseek Sidewinder is maintained by Mike Agostino <mna@infoseek.com>.
The HTTP User-agent field is set to Infoseek Sidewinder,
and the From field is also set.
WebMirror
WebMirror is maintained by Siu Fung Chan <sfchan@mailhost.net>.
Unfortunately, neither the User-agent nor the From HTTP field is set.
Looking for more info on
Services with no information
These services must use robots,
but haven't replied to requests for an entry...
User-agent field: Wobot/1.00
From: mckinley.mckinley.com (206.214.202.2) and galileo.mckinley.com (206.214.202.45)
Honors "robots.txt": yes
Contact: cedeno@mckinley.mckinley.com (or possibly: spider@mckinley.mckinley.com)
Purpose: Resource discovery for Magellan (http://www.mckinley.com/)
User Agents
These look like new robots, but have no contact info...
CaliforniaBrownSpider
EI*Net/0.1 libwww/0.1
Ibot/1.0 libwww-perl/0.40
Merritt/1.0
StatFetcher/1.0
TeacherSoft/1.0 libwww/2.17
WWW Collector
processor/0.0ALPHA libwww-perl/0.20
wobot/1.0 from 206.214.202.45
Hosts
These have no known user-agent, but have requested
/robots.txt repeatedly:
205.252.60.71
194.20.32.131
198.5.209.201
acke.dc.luth.se
dallas.mt.cs.cmu.edu
darkwing.cadvision.com
waldec.com
www2000.ogsm.vanderbilt.edu
unet.ca
murph.cais.net (rapid fire... sigh)
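Hosts like these can be spotted by counting /robots.txt requests per client in a server access log. The sketch below assumes the log is in Common Log Format. The function name, the threshold and the sample lines are hypothetical, and the hosts shown are not taken from the list above.

```python
from collections import Counter


def robots_txt_requesters(log_lines, minimum=2):
    """Return {host: count} for hosts that requested /robots.txt at least `minimum` times.

    Each line is assumed to be in Common Log Format:
    host ident authuser [date] "METHOD path protocol" status bytes
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 3:
            continue  # no quoted request field; skip malformed lines
        host = line.split(" ", 1)[0]
        request = parts[1].split()
        if len(request) >= 2 and request[1] == "/robots.txt":
            counts[host] += 1
    return {host: n for host, n in counts.items() if n >= minimum}


# Hypothetical log excerpt, for illustration only.
sample_log = [
    'crawler.example.org - - [29/Apr/1996:08:52:25 +0000] "GET /robots.txt HTTP/1.0" 200 68',
    'crawler.example.org - - [29/Apr/1996:08:52:31 +0000] "GET /robots.txt HTTP/1.0" 200 68',
    'reader.example.net - - [29/Apr/1996:09:01:02 +0000] "GET /index.html HTTP/1.0" 200 2048',
    'reader.example.net - - [29/Apr/1996:09:01:05 +0000] "GET /robots.txt HTTP/1.0" 200 68',
]
```

With the sample lines, only crawler.example.org is reported, because it asked for /robots.txt twice. A single request proves little, so the threshold is left adjustable.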
Some other robots are mentioned in a list of Japanese Search Engines.
Martijn Koster