- From: Irene Vatton <irene.vatton@inrialpes.fr>
- Date: Thu, 1 Jul 2004 11:23:10 +0200
- To: olczyk@interaccess.com
- Cc: www-amaya-dev@w3.org
On Wed, 30 Jun 2004 22:43:14 -0500
"Thaddeus L. Olczyk" <olczyk@interaccess.com> wrote:

> On Wed, 30 Jun 2004 15:16:50 +0200, Laurent Carcone <laurent@w3.org>
> wrote:
>
> >Hello Thaddeus,
> >
> >In fact, Amaya uses 2 different parsers, expat for XHTML documents (and for
> >XML documents in general)
> Which is of minor interest because this is something I can already do
> quite easily.
>
> >and an ad hoc parser for other HTML documents.
> >This parser is specific to Amaya and has no well-defined API. Nevertheless,
> >you can have a look at it in the module 'amaya/html2thot.c', and particularly
> >at the definition of the automaton.
> >
> Ok. So you've basically answered my last question, but
> the first two are still left unanswered.
>
> Is the parser relatively bullet proof? I find the combination of
> Tidy+expat simply unusable. Tidy chokes on some rather
> simple From previous experience if there are problems
> with simple input a system is going to have lots more problems
> when the input scales up. I don't want to be dealing with tons of
> special cases that Tidy can't handle. That's the way to disaster.

The problem is that many HTML documents don't respect the HTML
specifications. Strictly speaking, documents should be parsed with an SGML
parser and rejected as soon as an error occurs, but that is not how HTML is
handled in practice. Tidy, like Amaya and Mozilla, tries to cope with
errors as well as possible. With this approach, it's difficult to have a
clean and well-designed HTML parser.

> Is the parser code in Amaya easily extractable? I once tried

Yes and no.
The main function is StartParser in html2thot.c.
It uses an automaton (see InitAutomaton and the automaton above).
On each state change it calls a function such as StartOf... or EndOf...
Today these functions call the Thotlib API.
To adapt the parser, you must remove the current bodies of these functions
and replace them with your own processing.

> to do the same thing with the pile of -- they call mozilla, and it was
> a disaster. Now that was mozilla, and using anything from there
> is asking for trouble. The question is what about Amaya?
>
>
> Thaddeus L. Olczyk
> -----------------------
> Think twice, code once.

     Irene.
-----
Irène Vatton                     INRIA Rhône-Alpes
INRIA                            ZIRST
e-mail: Irene.Vatton@inria.fr    655 avenue de l'Europe
Tel.: +33 4 76 61 53 61          Montbonnot
Fax:  +33 4 76 61 52 07          38334 Saint Ismier Cedex - France
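
A minimal sketch of the structure described in the reply, assuming a much simplified automaton. This is not the code in html2thot.c: the names `Handlers`, `parse_tags`, `Trace`, `trace_start` and `trace_end` are hypothetical, and the state set is only large enough for illustration. It shows the general idea: the automaton calls a start or end function on each relevant state change, tolerates malformed input instead of rejecting it, and can be adapted by swapping the handler bodies.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the StartOf.../EndOf... functions.
   To adapt the parser, point these at your own treatment. */
typedef struct {
    void (*start_of_element)(void *ctx, const char *name);
    void (*end_of_element)(void *ctx, const char *name);
} Handlers;

typedef enum { S_TEXT, S_TAG_OPEN, S_START_NAME, S_END_NAME, S_SKIP } State;

/* Run a NUL-terminated buffer through the automaton. Errors such as
   unclosed elements or a stray '<' are tolerated, never rejected. */
void parse_tags(const char *input, const Handlers *h, void *ctx)
{
    State state = S_TEXT;
    char name[32];
    size_t len = 0;

    assert(input != NULL && h != NULL);
    for (const char *p = input; *p != '\0'; ++p) {
        unsigned char c = (unsigned char)*p;
        switch (state) {
        case S_TEXT:
            if (c == '<') state = S_TAG_OPEN;
            break;
        case S_TAG_OPEN:
            len = 0;
            if (c == '/') state = S_END_NAME;
            else if (isalpha(c)) { name[len++] = (char)c; state = S_START_NAME; }
            else state = S_TEXT;              /* stray '<': carry on */
            break;
        case S_START_NAME:
        case S_END_NAME:
            if (isalnum(c)) {
                if (len < sizeof name - 1) name[len++] = (char)c;
            } else {
                name[len] = '\0';
                if (state == S_START_NAME && h->start_of_element)
                    h->start_of_element(ctx, name);
                else if (state == S_END_NAME && len > 0 && h->end_of_element)
                    h->end_of_element(ctx, name);
                state = (c == '>') ? S_TEXT : S_SKIP;
            }
            break;
        case S_SKIP:                          /* ignore attributes up to '>' */
            if (c == '>') state = S_TEXT;
            break;
        }
    }
}

/* Example replacement treatment: record "+name " / "-name " in a buffer. */
typedef struct { char buf[128]; size_t len; } Trace;

void trace_append(Trace *t, char sign, const char *name)
{
    size_t n = strlen(name);
    if (t->len + n + 2 < sizeof t->buf) {     /* sign + name + space + NUL */
        t->buf[t->len++] = sign;
        memcpy(t->buf + t->len, name, n);
        t->len += n;
        t->buf[t->len++] = ' ';
        t->buf[t->len] = '\0';
    }
}

void trace_start(void *ctx, const char *name) { trace_append(ctx, '+', name); }
void trace_end(void *ctx, const char *name)   { trace_append(ctx, '-', name); }
```

Passing `trace_start` and `trace_end` in a `Handlers` value, together with a zero-initialised `Trace` as `ctx`, records the element events in order. Keeping the handlers behind function pointers means the automaton itself never has to change when the treatment does, which is the adaptation path the reply describes for the real parser.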
Received on Thursday, 1 July 2004 05:23:25 UTC