
List of Robots

The original of this document can be found here. This is a list of Web Wanderers. See also the World-Wide Web Wanderers, Spiders and Robots page.

If you are writing a robot, please fill in the form to be added to this list.

If anyone knows of any that aren't on this list, please let me know. Note that the descriptions of the robots do not necessarily represent my views, and that a listing here doesn't constitute a recommendation.


Overview


Detailed Information

JumpStation

JumpStation used to be maintained by Jonathon Fletcher <j.fletcher@stirling.ac.uk>, but is no longer in operation.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to JumpStation-Robot, and the From field is also set. It's usually run from *.stir.ac.uk.

The Proposed Standard for Robot Exclusion is supported.

It is a set of standalone programs and written in Perl 4, C, and C++.

It originated as a weekend project in 1993.

This information was last updated on Tue May 16 00:57:42 1995.
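Several entries in this list say whether the Proposed Standard for Robot Exclusion is supported. As a rough illustration of what support means for a robot, the sketch below checks URLs against a robots.txt file using Python's standard urllib.robotparser module. The robots.txt content, the example.org URLs and the helper name `allowed` are made up for this example; only the JumpStation-Robot User-agent string comes from the entry above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """\
User-agent: JumpStation-Robot
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("JumpStation-Robot", "http://example.org/private/x.html"),
        ("JumpStation-Robot", "http://example.org/index.html"),
        ("SomeOtherRobot", "http://example.org/cgi-bin/search"),
    ]:
        print(agent, url, allowed(agent, url))
```

A robot that supports the standard would run a check like this before each request and skip any URL for which it returns False.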


RBSE Spider

RBSE Spider is maintained by David Eichmann <eichmann@rbse.jsc.nasa.gov>.

Its purpose is to generate a Resource Discovery database, and generate statistics.

The HTTP User-agent field is set to RBSE Spider v. 1.0, and the From field is also set. It's usually run from rbse.jsc.nasa.gov (192.88.42.10).

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program written in C, using Oracle and WAIS.

Developed and operated as part of the NASA-funded Repository Based Software Engineering Program at the Research Institute for Computing and Information Systems, University of Houston - Clear Lake.

This information was last updated on Thu May 18 04:47:02 1995.


WebCrawler

WebCrawler is maintained by Brian Pinkerton <bp@webcrawler.com>.

Its purpose is to generate a Resource Discovery database, and generate statistics.

The HTTP User-agent field is set to WebCrawler/2.0 libwww/3.0, and the From field is also set. It's usually run from spidey.webcrawler.com.

The Proposed Standard for Robot Exclusion is not yet supported.

It is a standalone program and written in C.

The WebCrawler originated as an experiment in Internet resource discovery at the University of Washington in 1994. Today, it is operated by America Online as a service to the Internet. robots.txt support is coming soon!

This information was last updated on Mon Jun 26 15:58:09 1995.


The NorthStar Robot

Run by Fred Barrie <barrie@unr.edu> and Billy Barron.

More information including a search interface is available on the NorthStar Database. Recent runs (26 April 94) will concentrate on textual analysis of the Web versus GopherSpace (from the Veronica data) as well as indexing.

Run from frognot.utdallas.edu, possibly other sites in utdallas.edu, and from cnidir.org. Now uses HTTP From fields, and sets User-agent to NorthStar.


W4 (the World Wide Web Wanderer)

Run by Matthew Gray <mkgray@mit.edu>

Run initially in June 1993, its aim is to measure the growth in the web. See details.

User-agent: WWWWanderer v3.0 by Matthew Gray <mkgray@mit.edu>


Fish search

Fish search is maintained by Paul De Bra <debra@win.tue.nl>.

Its purpose is to discover resources on the fly.

The HTTP User-agent field is set to 'Fish-Search-Robot', but the From field isn't set. It's usually run from www.win.tue.nl.

The Proposed Standard for Robot Exclusion is not supported because of the incurred overhead.

It is a standalone program and written in C, but a version exists that is integrated into the Tübingen Mosaic 2.4.2 browser (also written in C).

Originated as an addition to Mosaic for X. Available as a standalone program from ftp://ftp.win.tue.nl/pub/infosystems/www/fish-search.tar.gz

This information was last updated on Mon May 8 09:31:19 1995.


The Python Robot

The Python Robot was written by Guido van Rossum <Guido.van.Rossum@cwi.nl> but is no longer active or available.

Written in Python.


html_analyzer

Run by James E. Pitkow <pitkow@aries.colorado.edu>

Its aim is to check validity of Web servers. I'm not sure if it has ever been run remotely.


MOMspider

MOMspider is maintained by Roy T. Fielding <fielding@ics.uci.edu>.

Its purpose is to validate links, and generate statistics.

The HTTP User-agent field is set to MOMspider/1.00 libwww-perl/0.40, and the From field is also set. It's usually run from anywhere.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in Perl 4.

Originated as a research project at the University of California, Irvine, in 1993. Presented at the First International WWW Conference in Geneva, 1994.

This information was last updated on Sat May 6 08:11:58 1995.


HTMLgobble

Maintained by Andreas Ley <ley@rz.uni-karlsruhe.de>

A mirroring robot. Configured to stay within a directory, sleeps between requests, and the next version will use HEAD to check if the entire document needs to be retrieved.

Identification: Uses User-Agent: HTMLgobble v2.2, and it sets the From field. Usually run by the author, from tp70.rz.uni-karlsruhe.de.
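The HEAD check planned for HTMLgobble could work along the following lines. This is only a sketch under assumptions: the entry does not say how the robot decides, so the comparison of Last-Modified dates and the function name `needs_refetch` are hypothetical. The function takes the headers of a HEAD response as a plain dictionary, so it can be tried without any network access.

```python
from email.utils import parsedate_to_datetime


def needs_refetch(cached_last_modified, head_headers):
    """Decide whether the full document should be retrieved again.

    cached_last_modified: Last-Modified value stored with the mirrored
        copy, or None if nothing has been mirrored yet.
    head_headers: headers returned by a HEAD request, as a dict.
    """
    remote = head_headers.get("Last-Modified")
    if cached_last_modified is None or remote is None:
        # No basis for comparison, so fetch to be safe.
        return True
    return parsedate_to_datetime(remote) > parsedate_to_datetime(cached_last_modified)


if __name__ == "__main__":
    cached = "Tue, 16 May 1995 00:57:42 GMT"
    print(needs_refetch(cached, {"Last-Modified": "Thu, 18 May 1995 04:47:02 GMT"}))
    print(needs_refetch(cached, {"Last-Modified": cached}))
```

A mirroring robot would issue the cheaper HEAD request first and send a full GET only when this check returns True.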


WWWW - the WORLD WIDE WEB WORM

Maintained by Oliver McBryan <mcbryan@piper.cs.colorado.edu>.

Another indexing robot, for which more information is available. Actually has quite flexible search options.

Run from piper.cs.colorado.edu?


W3M2

W3M2 is maintained by Christophe Tronche <tronche@lri.fr>.

Its purpose is to generate a Resource Discovery database, validate links, validate HTML, and generate statistics.

The HTTP User-agent field is set to W3M2/x.xxx, and the From field is also set. It's usually run from anyhost.lri.fr.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in Perl 4, Perl 5, and C++.

This information was last updated on Fri May 5 17:48:48 1995.


Websnarf

Developed by Charlie Stross <charles@fma.com>

No longer running.


The Webfoot Robot

Run by Lee McLoughlin <L.McLoughlin@doc.ic.ac.uk>

First spotted in mid-February 1994.

Identification: It runs from phoenix.doc.ic.ac.uk

Further information unavailable.


Lycos

Owned by Dr. Michael L. Mauldin <fuzzy@cmu.edu> at Carnegie Mellon University.

This is a research program in providing information retrieval and discovery in the WWW, using a finite memory model of the web to guide intelligent, directed searches for specific information needs.

More information is available on its home page.

Identification: User-agent "Lycos/x.x", run from fuzine.mt.cs.cmu.edu. Lycos also complies with the latest robot exclusion standard.


ASpider (Associative Spider)

Written and run by Fred Johansen <fred@nvg.unit.no>

Currently under construction, this spider is a CGI script that searches the web for keywords given by the user through a form.

Identification: User-Agent: "ASpider/0.09", with a From field "fredj@nova.pvv.unit.no".


SG-Scout

Introduced by Peter Beebee <ptbb@ai.mit.edu, beebee@parc.xerox.com>

Run since 27 June 1994, for an internal XEROX research project, with some information being made available on SG-Scout's home page.

Does a "server-oriented" breadth-first search in a round-robin fashion, with multiple processes.

Identification: User-Agent: "SG-Scout", with a From field set to the operator. Complies with standard Robot Exclusion. Run from beta.xerox.com.
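The "server-oriented" round-robin ordering described for SG-Scout can be pictured with a small sketch. It is not SG-Scout's actual code: it covers only the scheduling idea (one queue per server, visited in turn), leaves out the multiple processes, and the function name and example URLs are invented for illustration.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


def round_robin_order(urls):
    """Order urls so that consecutive requests rotate across servers."""
    # One FIFO queue per server, kept in order of first appearance.
    queues = OrderedDict()
    for url in urls:
        host = urlsplit(url).netloc
        queues.setdefault(host, deque()).append(url)

    order = []
    while queues:
        # Take one URL from each server in turn.
        for host in list(queues):
            order.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]
    return order


if __name__ == "__main__":
    pending = [
        "http://a.example/1",
        "http://a.example/2",
        "http://b.example/1",
        "http://a.example/3",
        "http://c.example/1",
    ]
    for url in round_robin_order(pending):
        print(url)
```

Rotating across servers in this way spreads the load, so that no single server receives a long burst of consecutive requests.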


EIT Link Verifier Robot

Written by Jim McGuire <mcguire@eit.COM>

Announced on 12 July 1994, see their page.

Combination of an HTML form and a CGI script that verifies links from a given starting point (with some controls to prevent it going off-site or limitless).

Seems to run at full speed...

Identification: version 0.1 sets no User-Agent or From field. From version 0.2 up the User-Agent is set to "EIT-Link-Verifier-Robot/0.2". Can be run by anyone from anywhere.


NHSE Web Forager

NHSE Web Forager is maintained by Robert Olson <olson@mcs.anl.gov>.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to NHSEWalker/3.0, and the From field is also set. It's usually run from *.mcs.anl.gov.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in Perl 5.

This information was last updated on Fri May 5 15:47:55 1995.


WebLinker

Written and run by James Casey <jcasey@maths.tcd.ie>

It is a tool called 'WebLinker' which traverses a section of web, doing URN->URL conversion. It will be used as a post-processing tool on documents created by automatic converters such as LaTeX2HTML or WebMaker. More information is on its home page.

At the moment it works at full speed, but is restricted to local sites. External GETs will be added, but these will be running slowly.

WebLinker is meant to be run locally, so if you see it elsewhere let the author know!

Identification: User-agent is set to 'WebLinker/0.0 libwww-perl/0.1'.


Emacs-w3 Search Engine

Emacs-w3 Search Engine is maintained by William M. Perry <wmperry@spry.com>.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to Emacs-w3/v[0-9\.]+, and the From field is also set. It's usually run from a variety of machines.

The Proposed Standard for Robot Exclusion is not supported.

It is integrated in a browser and written in Lisp.

This code has not been looked at in a while, but will be spruced up for the Emacs-w3 2.2.0 release sometime this month. It will honor the /robots.txt file at that time.

This information was last updated on Fri May 5 16:09:18 1995.
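The User-agent value in the entry above is given as a pattern rather than a fixed string. A server administrator who wants to spot this robot in request logs could test User-agent values against that pattern, for instance as sketched below. The function name and the sample strings are assumptions made for the example; only the pattern itself is taken from the entry.

```python
import re

# Pattern as given in the Emacs-w3 entry.
EMACS_W3 = re.compile(r"Emacs-w3/v[0-9\.]+")


def is_emacs_w3(user_agent):
    """Return True if the whole User-agent value matches the pattern."""
    return EMACS_W3.fullmatch(user_agent) is not None


if __name__ == "__main__":
    for value in ["Emacs-w3/v2.2.0", "Emacs-w3/v", "WebCrawler/2.0 libwww/3.0"]:
        print(value, is_emacs_w3(value))
```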


Arachnophilia

Run by Vince Taluskie <taluskie@utpapa.ph.utexas.edu>

The purpose (undertaken by HaL Software) of this run was to collect approximately 10k html documents for testing automatic abstract generation. This program will honor the robot exclusion standard and wait 1 minute in between requests to a given server.

Identification: Sets User-agent to 'Arachnophilia', runs from halsoft.com.


Mac WWWWorm

Date: Fri, 17 May 1996 07:55:08 +0100 From: Rudolf Zuberbauer Mime-Version: 1.0 To: chirschle@zutt.ch Subject: Robot list X-Url: http://info.webcrawler.com/mak/projects/robots/active.html

Written by Sebastien Lemieux <lemieuse@ERE.UMontreal.CA>

This is a French Keyword-searching robot for the Mac, written in HyperCard. The author has decided not to release this robot to the public.

Awaiting identification details.


churl

Maintained by Justin Yunke <yunke@umich.edu>

A URL checking robot, which stays within one step of the local server, see further information.

Awaiting identification details.


tarspider

Run by Olaf Schreck <chakl@fu-berlin.de>

A mirroring robot.

Sets User-Agent to "tarspider <version>", and From to "chakl@fu-berlin.de".


The Peregrinator

Run by Jim Richardson <jimr@maths.su.oz.au>.

This robot, in Perl 4, commenced operation in August 1994 and is being used to generate an index called MathSearch of documents on Web sites connected with mathematics and statistics. It ignores off-site links, so does not stray from a list of servers specified initially.

Identification: The current version sets User-Agent to Peregrinator-Mathematics/0.7. It also sets the From field.

The robot follows the exclusion standard, and accesses any given server no more often than once every several minutes.

A description of the robot is available.
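The way the Peregrinator ignores off-site links can be sketched as a simple filter over the links found on a page. The code below is an illustration under assumptions, not the robot's own Perl code: the function name, the host names and the links are invented.

```python
from urllib.parse import urljoin, urlsplit


def on_site_links(base_url, hrefs, allowed_hosts):
    """Resolve hrefs against base_url and keep only allowed servers."""
    kept = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if urlsplit(absolute).hostname in allowed_hosts:
            kept.append(absolute)
    return kept


if __name__ == "__main__":
    links = ["a.html", "http://other.example/x", "/stats/"]
    print(on_site_links("http://maths.example/index.html", links, {"maths.example"}))
```

Because links are only followed when their host is in the initial list, the robot cannot stray to servers that were not specified at the start.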



Checkbot

Checkbot is maintained by Hans de Graaff <J.J.deGraaff@twi.tudelft.nl>.

Its purpose is to validate links.

The HTTP User-agent field is set to Checkbot/x.xx, and the From field is also set.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in Perl 5.

Checkbot checks links in a given set of pages on one or more servers. It reports links which returned an error code.

This information was last updated on Tue Mar 12 09:16:24 1996.


webwalk

webwalk was maintained by Rich Testardi at HP, but is no longer active or available. Its purpose is to generate a Resource Discovery database, validate links, validate HTML, perform mirroring, copy document trees, and generate statistics.

The HTTP User-agent field is set to webwalk, and the From field is also set.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in C.

Webwalk is easily extensible to perform virtually any maintenance function which involves web traversal, in a way much like the '-exec' option of the find(1) command. Webwalk is usually used behind the HP firewall.

This information was last updated on Wed Nov 15 09:51:59 PST 1995.


Harvest

Run by hardy@bruno.cs.colorado.edu

A Resource Discovery Robot, part of the Harvest Project.

Runs from bruno.cs.colorado.edu, sets User-agent and From fields.

Pauses 1 second between requests (by default).

Note that Harvest's motivation is to index community- or topic- specific collections, rather than to locate and index all HTML objects that can be found. Also, Harvest allows users to control the enumeration several ways, including stop lists and depth and count limits. Therefore, Harvest provides a much more controlled way of indexing the Web than is typical of robots.
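The kind of controlled enumeration described for Harvest (stop lists, depth limits and count limits) can be illustrated with a breadth-first walk over an in-memory link table. This is a sketch only and does not use Harvest's own configuration or code. The link table, the function name and the treatment of stop-list entries as substrings are all assumptions.

```python
from collections import deque


def enumerate_limited(start, links, max_depth, max_count, stop_list=()):
    """Breadth-first enumeration bounded by depth, count and a stop list.

    links maps each URL to the URLs it links to (a stand-in for fetching).
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue and len(order) < max_count:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # depth limit reached, do not follow links further
        for nxt in links.get(url, []):
            # Skip URLs already queued and URLs matching the stop list.
            if nxt in seen or any(stop in nxt for stop in stop_list):
                continue
            seen.add(nxt)
            queue.append((nxt, depth + 1))
    return order


if __name__ == "__main__":
    table = {"a": ["b", "c"], "b": ["d"], "c": ["cgi-bin/x"], "d": ["e"]}
    print(enumerate_limited("a", table, max_depth=2, max_count=10, stop_list=("cgi-bin",)))
```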


Katipo

Katipo is maintained by Michael Newbery <Michael.Newbery@vuw.ac.nz>.

The HTTP User-agent field is set to Katipo/1.0, and the From field is also set.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in C.

A Macintosh robot that periodically (typically, once per day) walks through the global history files provided by some browsers (Mosaic, NetScape), looking for pages that have changed since last visited.

This information was last updated on Sat May 6 10:37:33 1995.


InfoSeek Robot 1.0

InfoSeek Robot 1.0 is maintained by Steve Kirsch <stk@infoseek.com>.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to InfoSeek Robot 1.0, and the From field is also set. It's usually run from corp-gw.infoseek.com.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in Python.

Collects WWW pages for both InfoSeek's free WWW search and commercial search. Uses a unique proprietary algorithm to identify the most popular and interesting WWW pages. Very fast, but never has more than one request per site outstanding at any given time. Has been refined for more than a year.

This information was last updated on Sun May 28 01:35:48 1995.


GetURL

GetURL is maintained by James Burton <James@Snark.apana.org.au>.

Its purpose is to validate links, perform mirroring, and copy document trees.

The HTTP User-agent field is set to 'GetURL.rexx v1.05 by James@Snark.apana.org.au', but the From field is not set. It's usually run from wherever it's run from :-).

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in ARexx (Amiga REXX).

Designed as a tool for retrieving web pages in batch mode without the encumbrance of a browser. Can be used to describe a set of pages to fetch, and to maintain an archive or mirror. It is not run by a central site and accessed by clients; it is run by the end user or archive maintainer.

This information was last updated on Tue May 9 15:13:12 1995.


Open Text Corporation Robot

Run by Tim Bray <tbray@opentext.com>

Sets User-agent to 'OMW/0.1 libwww/217'

Follows robot exclusion rules, and shouldn't visit any host more than once in 5 minutes.
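A limit such as "no more than once in 5 minutes per host" can be enforced by remembering when each host was last visited. The class below is a minimal sketch of that idea with invented names. It is not taken from the Open Text robot, and the caller supplies the current time so that the logic can be tried without actually waiting.

```python
from urllib.parse import urlsplit


class HostThrottle:
    """Track the last visit per host and report how long to wait."""

    def __init__(self, min_interval=300.0):
        self.min_interval = min_interval  # seconds between visits to one host
        self.last_visit = {}

    def seconds_until_allowed(self, url, now):
        """Return how many seconds must pass before url may be requested."""
        last = self.last_visit.get(urlsplit(url).hostname)
        if last is None:
            return 0.0
        return max(0.0, self.min_interval - (now - last))

    def record_visit(self, url, now):
        """Note that the host of url was visited at time now."""
        self.last_visit[urlsplit(url).hostname] = now


if __name__ == "__main__":
    throttle = HostThrottle(min_interval=300.0)
    throttle.record_visit("http://example.org/a", now=0.0)
    print(throttle.seconds_until_allowed("http://example.org/b", now=100.0))
    print(throttle.seconds_until_allowed("http://example.net/", now=100.0))
```

The same structure would cover the other delays mentioned in this list, such as one minute or one second between requests, by changing `min_interval`.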


The TkWWW Robot

Implemented by Scott Spetka <scott@cs.sunyit.edu>

The TkWWW Robot is described in a paper presented at the WWW94 Conference in Chicago. It is designed to search Web neighborhoods to find pages that may be logically related. The Robot returns a list of links that looks like a hot list. The search can be by key word or all links at a distance of one or two hops may be returned.

For more information see The TkWWW Home Page.


Tcl W3 Robot

A Tcl W3 Robot is maintained by Laurent Demailly <dl@hplyot.obspm.fr>.

Its purpose is to validate links, and generate statistics.

The HTTP User-agent field is set to dlw3robot/x.y (in TclX by http://hplyot.obspm.fr/~dl/), and the From field is also set. It's usually run from hplyot.obspm.fr.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in TCL.

This information was last updated on Tue May 23 17:51:39 1995.


TITAN

TITAN is maintained by Yoshihiko HAYASHI <hayashi@nttnly.isl.ntt.jp>.

Its purpose is to generate a Resource Discovery database, and copy document trees. Our primary goal is to develop an advanced method for indexing the WWW documents.

The HTTP User-agent field is set to TITAN/0.1, and the From field is also set. It's usually run from nttnly.isl.ntt.jp.

By using libwww-perl, the Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in Perl 4.

This information was last updated on Tue Jun 13 05:21:24 1995.


CS-HKUST WWW Index Server

CS-HKUST WWW Index Server is maintained by Budi Yuwono <yuwono-b@cs.ust.hk>.

Its purpose is to generate a Resource Discovery database, and validate HTML.

The HTTP User-agent field is set to CS-HKUST-IndexServer/1.0, and the From field is also set. It's usually run from dbx.cs.ust.hk.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in C.

Part of an on-going research project on Internet Resource Discovery at Department of Computer Science, Hong Kong University of Science and Technology (CS-HKUST).

This information was last updated on Tue Jun 20 02:39:16 1995.


Spry Wizard Robot

Spry Wizard Robot is maintained by Spry <info@spry.com>.

Its purpose is to generate a Resource Discovery database.

Unfortunately neither User-agent nor From HTTP fields are set. It's usually run from wizard.spry.com or tiger.spry.com.

Spry is refusing to give any comments about this robot.

This information was last updated on Tue Jul 11 09:29:45 GMT 1995.


weblayers

weblayers is maintained by Loic Dachary <loic@afp.com>.

Its purpose is to validate, cache and maintain links.

The HTTP User-agent field is set to 'weblayers/0.0'.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program written in Perl 5.

It is designed to maintain the cache generated by the emacs w3 mode (N*tscape replacement) and to support annotated documents (keep them in sync with the original document via diff/patch).

This information was last updated on Fri Jun 23 16:30:42 FRE 1995.


WebCopy

WebCopy is maintained by Victor Parada <vparada@inf.utfsm.cl>.

Its purpose is to perform mirroring.

The HTTP User-agent field is set to 'WebCopy/(version)', but the From field isn't set.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in Perl 4 or 5.

WebCopy can retrieve files recursively using HTTP protocol. It can be used as a delayed browser or as a mirroring tool. It cannot jump from one site to another. It can be used by anyone from anywhere... sorry!

This information was last updated on Sun Jul 2 15:27:04 1995.


Scooter

Scooter is maintained by Louis Monier <monier@pa.dec.com>.

Its purpose is to generate a Resource Discovery database, and generate statistics.

The HTTP User-agent field is set to Scooter/1.0, and the From field is also set. It's usually run from scooter.pa-x.dec.com.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in C.

Generates the data for the Alta Vista Internet search service.

This information was last updated on Thu Jul 6 19:31:12 1995.


Aretha

Aretha is maintained by Dave Weiner <davew@well.com>.

A crude robot built on top of Netscape and Userland Frontier, a scripting system for Macs.


WebWatch

WebWatch is maintained by Joseph Janos <janos@specter.com>.

Its purpose is to validate HTML, and generate statistics.

The HTTP User-agent field is set to 'WebWatch', but the From field isn't set.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in C++.

Check URLs modified since a given date. Shareware.

This information was last updated on Wed Jul 26 13:36:32 1995.


ArchitextSpider

ArchitextSpider is maintained by Architext Software <spider@atext.com>.

Its purpose is to generate a Resource Discovery database, and to generate statistics.

The HTTP User-agent field is set to ArchitextSpider, and the From field is also set. It's usually run from *.atext.com.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in Perl 5 and C.

The ArchitextSpider collects information for Excite, Architext's internet navigation service.

This information was last updated on Tue Oct 3 01:10:26 1995.


HI (HTML Index) Search

HI (HTML Index) Search is maintained by Razzakul Haider Chowdhury <a94385@cs.ait.ac.th>.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to AITCSRobot/1.1, and the From field is also set. It's usually run from cs6.cs.ait.ac.th.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in Perl 5.

This robot traverses the net and creates a searchable database of Web pages. It stores the title string of each HTML document and its absolute URL. A search engine provides Boolean AND and OR query models, with or without stop-word filtering. Web page owners can also add their own URL to the searchable database.

This information was last updated on Wed Oct 4 06:54:31 1995.


Hämähäkki

Hämähäkki is maintained by Jaakko Hyvätti <Jaakko.Hyvatti@www.fi>.

Its purpose is to generate a Resource Discovery database from the Finnish (top-level domain .fi) www servers. The resulting database is used by the search engine at http://www.fi/search.html.

The HTTP User-agent field is set to "Hämähäkki/0.2" (or to a later version), and the From field is also set. It is run from *.www.fi. (The name Hämähäkki is just Finnish for spider.)

The Proposed Standard for Robot Exclusion is supported.


explorer

explorer is maintained by Paul Bourke <pd.bourke@auckland.ac.nz>.

Its purpose is to generate a Resource Discovery database, and generate statistics.

The HTTP User-agent field is set to 'explorersearch', but the From field isn't set. It's usually run from bitz.co.nz.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in C++.

Primarily designed to create a searchable keyword database of HTML pages in a particular domain or at a particular site.

This information was last updated on Wed Nov 1 20:45:10 1995.


Senrigan

Senrigan (Japanese page is here.) is maintained by TAMURA Kent <kent@muraoka.info.waseda.ac.jp>.

This robot currently retrieves HTML documents only from the .jp domain. Searching in Japanese is available.

The HTTP User-agent field is set to Senrigan/xxxxxx and the From field is also set. It's usually run from ns.info.waseda.ac.jp.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in C.

It has been running since Dec 1994.

This information was last updated on Thu Nov 9 10:28:25 PST 1995.


FunnelWeb

FunnelWeb is maintained by David Eagles <eaglesd@pc.com.au>.

Its purpose is to generate a Resource Discovery database, and generate statistics.

The HTTP User-agent field is set to FunnelWeb-1.0, and the From field is also set. It's usually run from earth.planets.com.au.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in C, and C++.

Localised South Pacific Discovery and Search Engine, plus distributed operation under development.

This information was last updated on Mon Nov 27 21:30:11 1995.


The Jubii Indexing Robot

The Jubii Indexing Robot is maintained by Jakob Faarvang <jakob@jubii.dk>.

Its purpose is to generate a Resource Discovery database, and validate links.

The HTTP User-agent field is set to JubiiRobot/version#, and the From field is also set. It's usually run from any host in the cybernet.dk domain.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in Visual Basic 4.0.

Used for indexing the .dk top-level domain as well as other Danish sites for a Danish web database, as well as link validation. Will be in constant operation from Spring 1996.

This information was last updated on Sat Jan 6 20:58:44 1996.


Jobot

Jobot is maintained by Adam Jack <ajack@corp.micrognosis.com>.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to Jobot/0.1alpha libwww-perl/4.0, and the From field is also set. It's usually run from supernova.micrognosis.com.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in Perl 4.

Intended to seek out sites of potential "career interest". Hence - Job Robot.

This information was last updated on Tue Jan 9 18:55:55 1996.


DeWeb(c) Katalog/Index

DeWeb(c) Katalog/Index is maintained by Marc Mielke <dewebmaster@orbit.de>.

Its purpose is to generate a Resource Discovery database, perform mirroring, and generate statistics.

The HTTP User-agent field is set to Deweb/1.01, and the From field is also set. It's usually run from deweb.orbit.de.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in Perl 4.

Uses a combination of an Informix(tm) database and WN 1.11 server software for indexing/resource discovery, full-text search, and text excerpts.

This information was last updated on Wed Jan 10 08:23:00 1996.


Web Core / Roots

Web Core / Roots is maintained by Carlos Baquero and Jorge Portugal Andrade <wc@di.uminho.pt>.

Its purpose is to generate a Resource Discovery database, and validate links.

The HTTP User-agent field is set to 'roots/0.1', but the From field isn't set. It's usually run from shiva.di.uminho.pt or from www.di.uminho.pt.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in Perl 5.

Parallel robot developed at Minho University in Portugal to catalog relations among URLs and to support a special navigation aid. First versions since October 1995.

This information was last updated on Wed Jan 10 23:19:08 1996.


Robot Francoroute

Robot Francoroute is maintained by Marc-Antoine Parent <maparent@crim.ca>.

Its purpose is to generate a Resource Discovery database, copy document trees, and generate statistics.

The HTTP User-agent field is set to Robot du CRIM 1.0a, and the From field is also set. It's usually run from zorro.crim.ca.

The Proposed Standard for Robot Exclusion is supported.

It is integrated in a browser and written in Perl 5, and Sql plus.

Part of the RISQ's Francoroute project for researching francophone URLs.

Uses the Accept-Language tag and reduces demand accordingly.

This information was last updated on Wed Jan 10 23:56:22 1996.


Duppies

Duppies is maintained by Larry Burke <lburke@aktiv.com>.

Its purpose is to generate search indexes.

The HTTP User-agent field is set to Duppies, and the From field is also set.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program.

Designed to allow webmasters to provide a searchable index of their own site as well as to other sites, perhaps with similar content. Duppies is currently available for the Mac OS with an NT port planned.

This information was last updated on Fri Jan 19 05:08:15 1996.


IncyWincy

IncyWincy is maintained by Simon Stobart <simon.stobart@sunderland.ac.uk>.

The HTTP User-agent field is set to IncyWincy/1.0b1, and the From field is also set. It's usually run from osiris.sunderland.ac.uk.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in C++.

Used for various research projects at the University of Sunderland.

This information was last updated on Fri Jan 19 21:50:32 1996.


IBM_Planetwide

IBM_Planetwide is maintained by Ed Costello <epc@www.ibm.com>.

Its purpose is to generate a Resource Discovery database, validate links, validate HTML, perform mirroring, and generate statistics.

The HTTP User-agent field is set to IBM_Planetwide, and the From field is also set. It's usually run from www.ibm.com and www2.ibm.com.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in Perl 5.

Restricted to IBM owned or related domains.

This information was last updated on Mon Jan 22 22:09:19 1996.


Nomad

Nomad is maintained by Richard Sonnen <sonnen@cs.colostate.edu>.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to 'Nomad-V2.x', but the From field isn't set. It's usually run from *.cs.colostate.edu.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in Perl 4.

Developed in 1995 at Colorado State University.

This information was last updated on Sat Jan 27 21:02:20 1996.


UCSD Crawl

UCSD Crawl is maintained by Adam Tilghman <atilghma@mib.org>.

Its purpose is to generate a Resource Discovery database, and generate statistics.

The HTTP User-agent field is set to UCSD-Crawler, and the From field is also set. It's usually run from nuthaus.mib.org and scilib.ucsd.edu.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in Perl 4.

Should hit ONLY within UC San Diego - trying to count servers here.

This information was last updated on Sat Jan 27 09:21:40 1996.


webfetcher

webfetcher is maintained by ontv pittsburgh, l.p. <webfetch@ontv.com>.

Its purpose is to perform mirroring.

The HTTP User-agent field is set to WebFetcher/0.8, and the From field is also set. It's usually run from your own host.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in C++.

Don't wait! OnTV's WebFetcher mirrors whole sites down to your hard disk on a TV-like schedule. Catch w3 documentation. Catch discovery.com without waiting! A fully operational web robot for NT/95 today, most unix soon, MAC tomorrow.

This information was last updated on Sat Jan 27 10:31:43 1996.


Libertech-Rover

Libertech-Rover is maintained by Anil Peres-da-Silva <adasilva@libertech.com>.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to Libertech-Rover, and the From field is also set.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in C++.

Originated as part of a suite of Internet Products to organize, search & navigate Intranet sites and to validate links in HTML documents.

This information was last updated on Mon Feb 19 16:06:56 1996.


HTDig

HTDig is maintained by Andrew Scherpbier <andrew@sdsu.edu>.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to 'htdig/3.0b3', but the From field isn't set. It's usually run from teamball.sdsu.edu.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in C++.

Not an Internet-wide search system. Used for indexing several WWW servers on a LAN.

This information was last updated on Thu Feb 8 23:56:34 1996.


BlackWidow

BlackWidow is maintained by Kevin Hoogheem <khooghee@marys.smumn.edu>.

Its purpose is to generate a Resource Discovery database, and generate statistics.

The HTTP User-agent field is set to BlackWidow, and the From field is also set. It's usually run from 140.190.65.*.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in C, and C++.

Started as a research project and now is used to find links for a random link generator. Also is used to research the growth of specific sites.

This information was last updated on Fri Feb 9 00:11:22 1996.


Pioneer

Pioneer is maintained by Micah A. Williams <micah@sequent.uncfsu.edu>.

Its purpose is to generate a Resource Discovery database, and generate statistics.

The HTTP User-agent field is set to Pioneer, and the From field is also set. It's usually run from *.uncfsu.edu or flyer.ncsc.org.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in C.

Pioneer is part of an undergraduate research project.

This information was last updated on Mon Feb 5 02:49:32 1996.


NetCarta WebMap Engine

NetCarta WebMap Engine is maintained by NetCarta Corp. <info@netcarta.com>.

Its purpose is to generate a Resource Discovery database, validate links, perform mirroring, copy document trees, and generate statistics.

The HTTP User-agent field is set to NetCarta CyberPilot Pro, and the From field is also set.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in C++.

The NetCarta WebMap Engine is a general-purpose, commercial spider. Packaged with a full GUI in the CyberPilot Pro product, it acts as a personal spider that works with a browser to facilitate context-based navigation. The WebMapper product uses the robot to manage a site (site copy, site diff, and extensive link management facilities). All versions can create publishable NetCarta WebMaps, which capture the crawled information. If the robot sees a published map, it will return the published map rather than continuing its crawl.

Since this is a personal spider, it will be launched from multiple domains. This robot tends to focus on a particular site. No instance of the robot should have more than one outstanding request out to any given site at a time. The User-agent field contains a coded ID identifying the instance of the spider; specific users can be blocked via robots.txt using this ID.
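As a rough illustration of blocking one instance by its ID, the sketch below feeds a robots.txt file to Python's standard urllib.robotparser and checks two agents against it. The agent names ExampleSpider-0042 and ExampleSpider-0007 and the example.org URL are invented for this sketch; the entry above does not describe the actual format of the coded ID.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts out a single spider instance.
# "ExampleSpider-0042" is an invented ID, not a real NetCarta agent string.
ROBOTS_TXT = """\
User-agent: ExampleSpider-0042
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.example.org/index.html"
    for agent in ("ExampleSpider-0042", "ExampleSpider-0007"):
        print(agent, "allowed" if is_allowed(ROBOTS_TXT, agent, page) else "blocked")
```

Only the instance named in the file is refused; every other instance falls through to the default, which is to allow the fetch.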

This information was last updated on Sun Feb 18 02:02:49 1996.


Wild Ferret Web Hopper #1, #2, #3

Wild Ferret Web Hopper #1, #2, #3 is maintained by Greg Boswell <ghbpets@fishnet.net>.

Its purpose is to generate a Resource Discovery database, validate links, validate HTML, and generate statistics.

The HTTP User-agent field is set to Hazel's Ferret Web hopper, and the From field is also set.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in C++, and Visual Basic / Java.

The Wild Ferret Web Hoppers are designed as specific agents to retrieve data from all available sources on the Internet. They work in an onion format, hopping from spot to spot one level at a time over the Internet. The information is gathered into different relational databases, known as "Hazel's Horde". The information is publicly available and will be free for browsing at www.greenearth.com. The effective date of the data posting is to be announced.

This information was last updated on Mon Feb 19 00:28:37 1996.


BackRub

BackRub is maintained by Larry Page <page@leland.stanford.edu>.

Its purpose is to generate a Resource Discovery database, and generate statistics.

The HTTP User-agent field is set to BackRub/*.*, and the From field is also set. It's usually run from *.stanford.edu.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in Java.

This information was last updated on Wed Feb 21 02:57:42 1996.


Templeton

Templeton is maintained by Neal Krawetz <nealk@tamu.edu>.

Its purpose is to perform mirroring, and copy document trees.

The HTTP User-agent field is set to Templeton, and the From field is also set. It's usually run from cs.tamu.edu.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in C.

Created to make working snapshots of remote sites and map links on web sites. Currently in beta; support for robot exclusion (robots.txt) is planned. There is currently only one licensed beta-test site.

This information was last updated on Wed Feb 21 14:45:18 1996.


The Web Wombat

The Web Wombat is maintained by Internet Communications.

Its purpose is to generate a Resource Discovery database, and generate statistics.

The HTTP User-agent field isn't set, and the From field isn't set either. It's usually run from qwerty.intercom.com.au.

The Proposed Standard for Robot Exclusion is not supported.

The Wombat robot is part of a suite of search engine programs written in IBM Rexx/VisualAge C++ under OS/2.

The robot is the basis of the Web Wombat search engine (Australian/New Zealand content ONLY).

This information was last updated on Thu Feb 29 00:39:49 1996.


Inktomi's Slurp

Inktomi's Slurp is maintained by Paul Gauthier <gauthier@cs.berkeley.edu>.

Its purpose is to generate a Resource Discovery database, and generate statistics.

The HTTP User-agent field is set to BSE/Slurp, and the From field is also set. It's usually run from *.cs.berkeley.edu.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in C.

A fast, parallel, scalable, friendly spider that obeys robots.txt, and collects web pages for the Inktomi search engine.

This information was last updated on Sun Mar 3 19:07:17 1996.


HKU WWW Octopus

HKU WWW Octopus is maintained by Law Kwok Tung, Lee Tak Yeung, Lo Chun Wing <jax@cs.hku.hk>.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to HKU WWW Robot, and the From field is also set. It's usually run from phoenix.cs.hku.hk.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in Perl 5, C, and Java.

HKU Octopus is an ongoing project for resource discovery in the Hong Kong and China WWW domain. It is a research project conducted by three undergraduates at the University of Hong Kong.

This information was last updated on Thu Mar 7 14:21:55 1996.


vision-search

vision-search is maintained by Henry A. Rowley <har@cs.cmu.edu>.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to 'vision-search/3.0', but the From field isn't set. It's usually run from dylan.ius.cs.cmu.edu.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in Perl 5.

Intended to be an index of computer vision pages, containing all pages within n links (for some small n) of the Computer Vision Home Page.
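The "within n links" rule amounts to a depth-limited breadth-first search over the link graph. The following is a minimal sketch of that idea, not vision-search's own code (which is written in Perl 5): the link graph is an in-memory dictionary invented for the example, and pages_within is a hypothetical helper name.

```python
from collections import deque


def pages_within(links, start, n):
    """Return {page: distance} for every page at most n links from start.

    links maps each page to the list of pages it links to.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if distance[page] == n:
            continue  # already n links away, so do not follow its links
        for target in links.get(page, ()):
            if target not in distance:  # first visit is the shortest path
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance


if __name__ == "__main__":
    # Toy link graph; the page names are made up.
    toy_links = {"home": ["a", "b"], "a": ["c"], "b": ["home"], "c": ["d"]}
    print(pages_within(toy_links, "home", 2))
```

With n = 2 the toy graph yields home, a, b and c; page d is three links away and is left out.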

This information was last updated on Fri Mar 8 16:03:04 1996.


Resume Robot

Resume Robot is maintained by James Stakelum <proquest@onramp.net>.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to Resume Robot, and the From field is also set.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in C++.

This information was last updated on Tue Mar 12 15:52:25 1996.


w3mir

w3mir is maintained by Nicolai Langfeldt and Others <w3mir-core@usit.uio.no>.

Its purpose is to perform mirroring.

The HTTP User-agent field is set to w3mir, and the From field is also set.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in Perl 4, and Perl 5.

W3mir uses the If-Modified-Since HTTP header and recurses only into the directory and subdirectories of its start document. Known to work on U*ixes and Windows NT.
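The two behaviours described for w3mir can be sketched in a few lines of Python. This is an illustration of the general technique only, not w3mir's code (which is Perl): in_scope and conditional_request are hypothetical helper names, the example.org URLs are placeholders, and the scope check does not normalise paths containing "..".

```python
import posixpath
from email.utils import formatdate
from urllib.parse import urlsplit
from urllib.request import Request


def in_scope(start_url, candidate_url):
    """True if candidate_url is on the same server and lies in, or below,
    the directory that holds the start document."""
    start, cand = urlsplit(start_url), urlsplit(candidate_url)
    if (start.scheme, start.netloc) != (cand.scheme, cand.netloc):
        return False
    base_dir = posixpath.dirname(start.path or "/")
    if not base_dir.endswith("/"):
        base_dir += "/"
    return (cand.path or "/").startswith(base_dir)


def conditional_request(url, last_fetched):
    """Build a GET request carrying If-Modified-Since.

    last_fetched is the time of the previous copy in seconds since the
    epoch. A server that honors the header answers 304 Not Modified
    instead of resending an unchanged document.
    """
    stamp = formatdate(last_fetched, usegmt=True)  # HTTP date in GMT
    return Request(url, headers={"If-Modified-Since": stamp})


if __name__ == "__main__":
    start = "http://www.example.org/docs/index.html"
    print(in_scope(start, "http://www.example.org/docs/api/intro.html"))
    print(in_scope(start, "http://www.example.org/news/today.html"))
    print(conditional_request(start, 0).header_items())
```

Restricting the crawl to the start directory keeps a mirror from wandering over a whole server, and the conditional request keeps repeat runs cheap for both ends.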

This information was last updated on Wed Apr 24 13:23:42 1996.


SafetyNet Robot

SafetyNet Robot is maintained by Michael L. Nelson <m.l.nelson@urlabs.com>.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to SafetyNet Robot 0.1, and the From field is also set. It's usually run from *.urlabs.com.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in Perl 5.

Finds URLs for K-12 content management.

This information was last updated on Sat Mar 23 20:12:39 1996.


GetBot

GetBot is maintained by Alex Zavatone <zav@macromedia.com>.

Its purpose is to validate HTML.

The HTTP User-agent field is set to '???', but the From field isn't set.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in Shockwave/Director.

GetBot's purpose is to index all the sites it can find that contain Shockwave movies. It is the first bot or spider written in Shockwave. The bot was originally written at Macromedia on a hungover Sunday as a proof of concept. - Alex Zavatone 3/29/96

This information was last updated on Fri Mar 29 20:06:12 1996.


CACTVS Chemistry Spider

CACTVS Chemistry Spider is maintained by W. D. Ihlenfeldt <wdi@eros.ccc.uni-erlangen.de>.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to 'CACTVS Chemistry Spider', but the From field isn't set. It's usually run from utamaro.organik.uni-erlangen.de.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in TCL, and C.

Locates chemical structures in Chemical MIME formats on WWW and FTP servers and downloads them into a database searchable with structure queries (substructure, full structure, formula, properties, etc.).

This information was last updated on Sat Mar 30 00:55:40 1996.


Travel-Finder Spider

Travel-Finder Spider is maintained by Ken Wadland <ken@travel-finder.com>.

Its purpose is to generate a Resource Discovery database, and validate links.

The HTTP User-agent field is set to travelfinder, and the From field is also set. It's usually run from travel-finder.com.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in C++.

Gathers information about travel services, activities, and/or destinations for use by the Travel-Finder service. Semi-automated; requires operator intervention to follow links. Results will be publicly available starting in May.

This information was last updated on Fri Apr 5 03:06:43 1996.


ILSE

ILSE is maintained by Wiebe Weikamp <wiebe@il.ft.hse.nl>.

Its purpose is to generate a Resource Discovery database.

Unfortunately neither User-agent nor From HTTP fields are set. It's usually run from charm.il.ft.hse.nl.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in C.

Originated as a "fun" project at the HSE at Eindhoven.

This information was last updated on Tue Apr 16 18:44:55 1996.


Personal Times

Personal Times is maintained by James McCabe <jjmccabe@tcd.ie>.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to Personal Times, and the From field is also set. It's usually run from scott.cs.tcd.ie.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in Perl 4.

Undergraduate Project at Trinity College. Obeys Robot Exclusion Protocol, and usually runs during hours when traffic is light. May not be active much longer.

This information was last updated on Wed Apr 17 18:42:40 1996.


Israeli-search

Israeli-search is maintained by Etamar Laron <etamar@xpert.com>.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to 'IsraeliSearch/1.0', but the From field isn't set. It's usually run from dylan.ius.cs.cmu.edu.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in C.

A complete software package designed to collect information using a distributed workload and to support context queries. Intended to be a complete, up-to-date resource for Israeli sites and information related to Israel or Israeli society.

This information was last updated on Tue Apr 23 19:23:55 1996.


PKA

PKA is maintained by Massimiliano Pucciarelli <puma@comm2000.it>.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to PGP-KA/1.2, and the From field is also set. It's usually run from salerno.starnet.it.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in Perl 5.

This program searches for the PGP public key of the specified user. Originated as a research project at Salerno University in 1995.

This information was last updated on Sun Apr 14 13:38:50 1996.


Infoseek Sidewinder

Infoseek Sidewinder is maintained by Mike Agostino <mna@infoseek.com>.

Its purpose is to generate a Resource Discovery database.

The HTTP User-agent field is set to Infoseek Sidewinder, and the From field is also set.

The Proposed Standard for Robot Exclusion is supported.

It is a standalone program and written in C.

Collects WWW pages for both InfoSeek's free WWW search services. Uses a unique, incremental, very fast proprietary algorithm to find WWW pages.

This information was last updated on Sat Apr 27 01:20:15 1996.


WebMirror

WebMirror is maintained by Siu Fung Chan <sfchan@mailhost.net>.

Its purpose is to perform mirroring, and copy document trees.

Unfortunately neither User-agent nor From HTTP fields are set.

The Proposed Standard for Robot Exclusion is not supported.

It is a standalone program and written in C++.

It downloads web pages to the hard drive for off-line browsing.

This information was last updated on Mon Apr 29 08:52:25 1996.


Looking for more info on

Services with no information

These services must use robots, but haven't replied to requests for an entry...
Magellan
User-agent field: Wobot/1.00
From: mckinley.mckinley.com (206.214.202.2) and galileo.mckinley.com (206.214.202.45)
Honors "robots.txt": yes
Contact: cedeno@mckinley.mckinley.com (or possibly: spider@mckinley.mckinley.com)
Purpose: Resource discovery for Magellan (http://www.mckinley.com/)

User Agents

These look like new robots, but have no contact info...
CaliforniaBrownSpider
EI*Net/0.1  libwww/0.1
Ibot/1.0 libwww-perl/0.40
Merritt/1.0
StatFetcher/1.0
TeacherSoft/1.0  libwww/2.17
WWW Collector
processor/0.0ALPHA libwww-perl/0.20
wobot/1.0 from 206.214.202.45

Hosts

These have no known user-agent, but have requested /robots.txt repeatedly:
205.252.60.71
194.20.32.131
198.5.209.201
acke.dc.luth.se
dallas.mt.cs.cmu.edu
darkwing.cadvision.com
waldec.com
www2000.ogsm.vanderbilt.edu
unet.ca
murph.cais.net (rapid fire... sigh)
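A list like the one above could be drawn from a server's access log. The sketch below is one possible way to do that, under loudly stated assumptions: it assumes the log is in the Combined Log Format (which records the user-agent as the last quoted field), the host names in the sample are invented, and anonymous_robot_hosts is a hypothetical helper, not the method used to compile this page.

```python
import re
from collections import Counter

# Assumed Combined Log Format:
# host ident user [time] "METHOD path protocol" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def anonymous_robot_hosts(lines, min_requests=2):
    """Count /robots.txt requests per host that sent no user-agent and
    return the hosts seen at least min_requests times."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the assumed format
        if match["path"] == "/robots.txt" and match["agent"] in ("", "-"):
            counts[match["host"]] += 1
    return {host: n for host, n in counts.items() if n >= min_requests}


if __name__ == "__main__":
    # Invented sample entries for illustration only.
    sample = [
        'crawler.example - - [01/May/1996:10:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 120 "-" "-"',
        'crawler.example - - [01/May/1996:10:00:05 +0000] "GET /robots.txt HTTP/1.0" 200 120 "-" "-"',
        'reader.example - - [01/May/1996:10:01:00 +0000] "GET /index.html HTTP/1.0" 200 900 "-" "Mozilla/2.0"',
    ]
    print(anonymous_robot_hosts(sample))
```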
Some other robots are mentioned in a list of Japanese Search Engines.
Martijn Koster