Re: Amaya HTML parser. from Laurent Carcone on 2004-06-30 (www-amaya-dev@w3.org from June 2004)

From: Laurent Carcone <laurent@w3.org>
Date: Wed, 30 Jun 2004 15:16:50 +0200
To: olczyk@interaccess.com
Cc: www-amaya-dev@w3.org
Message-Id: <20040630131650.AC33E17164@tux.inrialpes.fr>

Hello Thaddeus,

In fact, Amaya uses two different parsers: expat for XHTML documents (and for 
XML documents in general) and an ad hoc parser for other HTML documents.
The ad hoc parser is specific to Amaya and has no well-defined API. Nevertheless, 
you can have a look at it in the module 'amaya/html2thot.c', particularly 
at the definition of the automaton.
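
To give an idea of what an automaton-driven HTML scanner looks like in general, 
here is a minimal sketch in C. It is not taken from 'amaya/html2thot.c'; the 
state names, the Counts structure and the scan_html function are all 
hypothetical and only illustrate the technique. The scanner walks the input 
one character at a time, and it treats a '<' that does not begin a tag as 
ordinary text, which is the kind of leniency "real world" HTML needs.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical states for illustration; not taken from html2thot.c. */
typedef enum { S_TEXT, S_LT, S_START_TAG, S_END_TAG, S_QUOTED } State;

typedef struct {
    int start_tags;  /* number of "<name ...>" constructs seen */
    int end_tags;    /* number of "</...>" constructs seen */
    int text_chars;  /* characters outside of any tag */
} Counts;

/* Scan `html` one character at a time, driving a small state machine. */
Counts scan_html(const char *html)
{
    Counts c = {0, 0, 0};
    State state = S_TEXT;
    char quote = 0;  /* quote character that opened the current attribute value */

    assert(html != NULL);
    for (const char *p = html; *p != '\0'; p++) {
        unsigned char ch = (unsigned char)*p;
        switch (state) {
        case S_TEXT:
            if (ch == '<') state = S_LT;
            else c.text_chars++;
            break;
        case S_LT:  /* just saw '<'; decide what it starts */
            if (isalpha(ch)) { c.start_tags++; state = S_START_TAG; }
            else if (ch == '/') { c.end_tags++; state = S_END_TAG; }
            else if (ch == '<') c.text_chars++;  /* previous '<' was text */
            else { c.text_chars += 2; state = S_TEXT; }  /* stray '<' plus ch */
            break;
        case S_START_TAG:
            if (ch == '"' || ch == '\'') { quote = (char)ch; state = S_QUOTED; }
            else if (ch == '>') state = S_TEXT;
            break;
        case S_QUOTED:  /* inside an attribute value: '>' does not end the tag */
            if (ch == (unsigned char)quote) state = S_START_TAG;
            break;
        case S_END_TAG:
            if (ch == '>') state = S_TEXT;
            break;
        }
    }
    if (state == S_LT) c.text_chars++;  /* input ended on a stray '<' */
    return c;
}
```

A real parser would build a document tree and handle comments, entities and 
implied tags; this sketch only counts what it sees, to keep the transitions 
visible.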

Hope this will help you,

Laurent Carcone

> 
> Hi.
> I've been going nuts looking for a non-perl HTML parser
> which handles "real world" HTML. On the libwww page,
> it says that their parser is primitive and if you are looking
> for a robust HTML parser, look at Amaya.
> 
> So I've gotten Amaya. I've skimmed through the documentation.
> It seems rather vague on where the parser is and what its API
> is.
> 
> So three questions.
> For a person for whom expat, libxml and libwww used with (or without)
> HTML Tidy are not good enough, will the parser in Amaya be sufficient?
> 
> Is the Amaya code modularised enough to extract the parser?
> 
> In terms of the code, where would I start with the procedure?
> 
> Thank You
> --
> Thaddeus L. Olczyk
> -----------------------
> Think twice, code once.
> 
>

Received on Wednesday, 30 June 2004 09:17:05 UTC