Re: Amaya HTML parser.

On Wed, 30 Jun 2004 22:43:14 -0500
"Thaddeus L. Olczyk" <olczyk@interaccess.com> wrote:

> 
> On Wed, 30 Jun 2004 15:16:50 +0200, Laurent Carcone <laurent@w3.org>
> wrote:
> 
> >
> >Hello Thaddeus,
> >
> >In fact, Amaya uses 2 different parsers, expat for XHTML documents (and for 
> >XML documents in general) 
> Which is of minor interest because this is something I can already do
> quite easily.
> 
> >and an ad hoc parser for other HTML documents.
> >This parser is specific to Amaya and has no well-defined API. Nevertheless, 
> >you can have a look at it in the module 'amaya/html2thot.c', and particularly 
> >at the definition of the automaton.
> >
> Ok. So you've basically answered my last question, but
> the first two are still left unanswered. 
> 
> Is the parser relatively bulletproof? I find the combination of
> Tidy+expat simply unusable. Tidy chokes on some rather 
> simple input. From previous experience, if there are problems
> with simple input, a system is going to have a lot more problems 
> when the input scales up. I don't want to be dealing with tons of
> special cases that Tidy can't handle. That's the way to disaster.

The problem is that many HTML documents don't respect the HTML specifications.
Strictly speaking, documents should be parsed with an SGML parser and rejected
as soon as an error occurs, but that is not how HTML is handled in practice.
Tidy, like Amaya and Mozilla, tries to cope with errors as well as possible.
With this approach, it's difficult to have a clear and well-designed HTML parser.

> Is the parser code in Amaya easily extractable? I once tried

Yes and no. The main function is StartParser in html2thot.c.
It uses an automaton (see InitAutomaton and the automaton defined above it).
On each state change it calls functions such as StartOf... and EndOf...
Today these functions call the Thotlib API.
To adapt the parser, you must remove the current contents of these functions
and replace them with your own processing.
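
To give a rough idea of that structure, here is a minimal sketch in C. It is
not Amaya's code: Handlers, run_automaton, EventLog and the log_* functions
are hypothetical names invented for this illustration, and the automaton is
far simpler than the one in html2thot.c. It only shows the general shape: a
state machine that calls start/end functions on state changes, with the
bodies of those functions being the part you would replace.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback table, standing in for the StartOf.../EndOf... functions. */
typedef struct {
    void (*start_of)(void *ctx, const char *name);
    void (*end_of)(void *ctx, const char *name);
    void (*text)(void *ctx, const char *chars, size_t len);
} Handlers;

typedef enum { IN_TEXT, IN_TAG_NAME, IN_TAG_REST } State;

/* Scan `input` and report tags and text runs through `h`.
   Lenient by design: attributes are skipped, an unterminated tag is dropped. */
void run_automaton(const char *input, const Handlers *h, void *ctx)
{
    char name[32];
    size_t name_len = 0;
    int is_end = 0;
    State state = IN_TEXT;
    const char *text_start = input;
    const char *p;

    assert(input != NULL && h != NULL);
    for (p = input; *p != '\0'; p++) {
        char c = *p;
        if (state == IN_TEXT) {
            if (c == '<') {                    /* text -> tag */
                if (p > text_start)
                    h->text(ctx, text_start, (size_t)(p - text_start));
                name_len = 0;
                is_end = 0;
                state = IN_TAG_NAME;
            }
        } else if (c == '>') {                 /* tag -> text */
            name[name_len] = '\0';
            if (is_end)
                h->end_of(ctx, name);
            else
                h->start_of(ctx, name);
            text_start = p + 1;
            state = IN_TEXT;
        } else if (state == IN_TAG_NAME) {
            if (c == '/' && name_len == 0)
                is_end = 1;                    /* "</" opens an end tag */
            else if (c == '/' || isspace((unsigned char)c))
                state = IN_TAG_REST;           /* name is complete */
            else if (name_len < sizeof name - 1)
                name[name_len++] = c;
        }
        /* IN_TAG_REST: ignore attribute characters until '>' */
    }
    if (state == IN_TEXT && p > text_start)    /* flush trailing text */
        h->text(ctx, text_start, (size_t)(p - text_start));
}

/* Example replacement "treatment": record events in a fixed-size log. */
typedef struct { char buf[256]; size_t len; } EventLog;

void log_event(EventLog *log, const char *kind, const char *s, size_t n)
{
    int w = snprintf(log->buf + log->len, sizeof log->buf - log->len,
                     "%s:%.*s ", kind, (int)n, s);
    if (w > 0)
        log->len += (size_t)w;
    if (log->len >= sizeof log->buf)
        log->len = sizeof log->buf - 1;        /* output was truncated */
}

void log_start(void *ctx, const char *name) { log_event(ctx, "S", name, strlen(name)); }
void log_end(void *ctx, const char *name)   { log_event(ctx, "E", name, strlen(name)); }
void log_text(void *ctx, const char *s, size_t n) { log_event(ctx, "T", s, n); }
```

With a Handlers value initialised as { log_start, log_end, log_text } and an
empty EventLog, calling run_automaton on "<p>Hi<b>x</b></p>" should leave
"S:p T:Hi S:b T:x E:b E:p " in the log buffer. Swapping in a different set of
handlers changes what the parser produces without touching the automaton.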

> to do the same thing with the pile of -- they call mozilla, and it was
> a disaster. Now that was mozilla, and using anything from there
> is asking for trouble. The question is what about Amaya?
> 
> 
> Thaddeus L. Olczyk
> -----------------------
> Think twice, code once.


     Irene.
-----
Irène Vatton                     INRIA Rhône-Alpes
INRIA                               ZIRST
e-mail: Irene.Vatton@inria.fr       655 avenue de l'Europe
Tel.: +33 4 76 61 53 61             Montbonnot
Fax:  +33 4 76 61 52 07             38334 Saint Ismier Cedex - France

Received on Thursday, 1 July 2004 05:23:25 UTC