Indexing the Web (was Re: What's wrong with <FONT>?)

T. Joseph W. Lazio (lazio@spacenet.tn.cornell.edu)
Fri, 10 May 1996 14:05:02 -0400


From: "T. Joseph W. Lazio" <lazio@spacenet.tn.cornell.edu>
Date: Fri, 10 May 1996 14:05:02 -0400
Message-Id: <199605101805.OAA05855@ism.tn.cornell.edu>
To: preece@predator.urbana.mcd.mot.com
Cc: mudws@mail.olemiss.edu, www-html@w3.org
In-Reply-To: <199605101600.LAA05482@predator.urbana.mcd.mot.com> (preece@predator.urbana.mcd.mot.com)
Subject: Indexing the Web (was Re: What's wrong with <FONT>?)

>>>>> "SEP" == Scott E Preece <preece@predator.urbana.mcd.mot.com> writes:

SEP> First, the danger of FONT lies not in what it does, but in how it
SEP> is used. [...]

 Among the uses of FONT is the following (to pick a random example):

<H1><FONT SIZE="+1">F</FONT>ortran</H1>

It's quite legit and passes the KGV test.

 How is it supposed to be indexed?  I ask out of ignorance.  I just
poked around Lycos and AltaVista (to pick just two search engines);
both proclaim that they index the Web, but I found no real description
of how they do it.

 I suppose it just means adding an additional conditional to one's
indexer, something like 

 if next character after </FONT> == whitespace
 then 
    ignore <FONT> and </FONT>, index as normal
 else
    append all non-whitespace characters following </FONT> to last
       character of stuff between <FONT> and </FONT>
    now index
 endif
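
 For what it's worth, here is a minimal Python sketch of the same
idea.  It is only an illustration, not how Lycos, AltaVista, or any
other engine actually works.  The names FONT_TAG, OTHER_TAG, WORD, and
index_terms are made up for this example.  It assumes that both
branches of the conditional collapse into one rule: delete FONT tags
without inserting a space, and treat every other tag as a word
boundary.  It also assumes that no attribute value contains a ">"
character.

```python
import re

# Hypothetical names for illustration; not taken from any real indexer.
FONT_TAG = re.compile(r"</?FONT\b[^>]*>", re.IGNORECASE)
OTHER_TAG = re.compile(r"<[^>]+>")
WORD = re.compile(r"[A-Za-z0-9]+")


def index_terms(html):
    """Return lowercased index terms from an HTML fragment."""
    # Delete FONT tags without inserting a space, so that the pieces on
    # either side of the tag ("F" and "ortran") rejoin into one word.
    text = FONT_TAG.sub("", html)
    # Replace every remaining tag with a space (a word boundary).
    text = OTHER_TAG.sub(" ", text)
    return [word.lower() for word in WORD.findall(text)]


if __name__ == "__main__":
    print(index_terms('<H1><FONT SIZE="+1">F</FONT>ortran</H1>'))
```

 If whitespace follows </FONT>, it survives the deletion and still
separates the words, so no explicit test on the next character is
needed in this version.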


 Anybody working on a search engine?  How do you plan to handle stuff
like this?

-- Joseph