Re: Amaya HTML parser. from Thaddeus L. Olczyk on 2004-07-01 (www-amaya-dev@w3.org from July 2004)

From: Thaddeus L. Olczyk <olczyk@interaccess.com>
Date: Wed, 30 Jun 2004 22:43:14 -0500
To: www-amaya-dev@w3.org
Cc: www-amaya-dev@w3.org
Message-id: <hde6e0tflrv9r473r0qbvaq7oa0cpgugdd@4ax.com>

On Wed, 30 Jun 2004 15:16:50 +0200, Laurent Carcone <laurent@w3.org>
wrote:

>
>Hello Thaddeus,
>
>In fact, Amaya uses 2 different parsers, expat for XHTML documents (and for 
>XML documents in general) 
Which is of minor interest because this is something I can already do
quite easily.

>and an ad hoc parser for other HTML documents.
>This parser is specific to Amaya and has no well-defined API. Nevertheless, 
>you can have a look on it in the module 'amaya/html2thot.c', and particularly 
>on the definition of the automaton.
>
Ok. So you've basically answered my last question, but
the first two are still left unanswered. 

Is the parser relatively bulletproof? I find the combination of
Tidy+expat simply unusable. Tidy chokes on some rather
simple input. From previous experience, if there are problems
with simple input, a system is going to have lots more problems
when the input scales up. I don't want to be dealing with tons of
special cases that Tidy can't handle. That's the way to disaster.

Is the parser code in Amaya easily extractable? I once tried
to do the same thing with the pile of -- they call Mozilla, and it was
a disaster. Now that was Mozilla, and using anything from there
is asking for trouble. The question is, what about Amaya?

Thaddeus L. Olczyk
-----------------------
Think twice, code once.

Received on Thursday, 1 July 2004 00:32:02 UTC