Re: Parsing HTML: Easiest way?

Daniel W. Connolly (connolly@beach.w3.org)
Tue, 17 Oct 1995 14:25:49 -0400


Message-Id: <199510171825.OAA17111@beach.w3.org>
To: Bowden Wise <wiseb@cs.rpi.edu>
Cc: www-html@www0.cern.ch
Cc: frystyk@w3.org
Subject: Re: Parsing HTML: Easiest way? 
In-Reply-To: Your message of "Tue, 17 Oct 1995 13:53:07 EDT."
             <199510171753.AA10064@cs.rpi.edu> 
Date: Tue, 17 Oct 1995 14:25:49 -0400
From: "Daniel W. Connolly" <connolly@beach.w3.org>

In message <199510171753.AA10064@cs.rpi.edu>, Bowden Wise writes:
>
>What I would like to do is parse an HTML file into some structure that
>I can use in my app to base my presentation

>I do not have a Web browser to base my browser on, so my question is
>what is the best way to parse HTML for my purposes?  I am using a
>Windows 3.x platform (16-bit).
>
>Some ideas I have thought of doing include:
>
>- using sgmls

This will work, but it may not be convenient.

>- using the W3C Reference Library

The HTML parsing code in the W3C reference library has gotten
kinda crufty. Henrik has been concentrating on protocols
for quite some time, and the SGML/HTML stuff hasn't been
revised much, even though we've found some bugs and changed
our minds about the best way to do some things.

I've been working on some code to update the library. I have
it working, but I haven't done much to integrate it with the
rest of the library.

A tech report describing my work is in progress at:

"A Lexical Analyzer for HTML and Basic SGML"
$Id: sgml-lex.html,v 1.8 1995/10/11 21:47:30 connolly Exp $
http://www.w3.org/pub/WWW/MarkUp/SGML/sgml-lex/sgml-lex.html

It includes a lex spec. You probably can't run lex on a 16-bit
platform, but you should be able to use the code that lex
spits out when I run it.
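
To give a feel for what a lexical analyzer of this kind does, here
is a rough sketch in Python. It is not the lex spec from the tech
report: the names TOKEN_RE, ATTR_RE and tokenize are made up for
illustration, and the patterns only approximate HTML's lexical
rules (SGML comment and declaration syntax in particular is
simplified).

```python
import re

# Hypothetical token patterns: an approximation for illustration,
# not the rules from the sgml-lex tech report.
TOKEN_RE = re.compile(
    r"""
      (?P<comment><!--.*?-->)                          # <!-- comment -->
    | (?P<decl><![A-Za-z][^>]*>)                       # <!DOCTYPE ...>
    | (?P<end></(?P<end_name>[A-Za-z][A-Za-z0-9.-]*)\s*>)
    | (?P<start><(?P<start_name>[A-Za-z][A-Za-z0-9.-]*)
          (?P<attrs>(?:"[^"]*"|'[^']*'|[^>"'])*)>)     # quoted values may hold '>'
    | (?P<data>[^<]+|<)                                # text, or a stray '<'
    """,
    re.VERBOSE | re.DOTALL,
)

# An attribute name with an optional quoted or unquoted value.
ATTR_RE = re.compile(
    r"""([A-Za-z][A-Za-z0-9.-]*)
        (?:\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+))?
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Yield (kind, value, attrs) tuples; attrs is empty except for start tags."""
    for m in TOKEN_RE.finditer(text):
        if m.group("comment") is not None:
            yield ("comment", m.group("comment")[4:-3], {})
        elif m.group("decl") is not None:
            yield ("decl", m.group("decl")[2:-1], {})
        elif m.group("end") is not None:
            yield ("end", m.group("end_name").lower(), {})
        elif m.group("start") is not None:
            attrs = {}
            for name, value in ATTR_RE.findall(m.group("attrs")):
                if value[:1] in ("'", '"'):
                    value = value[1:-1]  # strip the surrounding quotes
                attrs[name.lower()] = value
            yield ("start", m.group("start_name").lower(), attrs)
        else:
            yield ("data", m.group("data"), {})


if __name__ == "__main__":
    for token in tokenize('<H1 Align=center>Hi <A HREF="a.html">there</A></H1>'):
        print(token)
```

A parser can then consume these tokens and build whatever
structure the application needs for presentation.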

Let me know if you want to be an alpha tester. I don't have
a public distribution ready.

Dan