- From: <uid#15033@dxal18.cern.ch>
- Date: Fri, 24 Mar 1995 15:13:32 +0900
- To: hallam@dxal18.cern.ch (USENET), gtn@ebt.com (Gavin Nicol ), www-talk@w3.org
In article <AA76@cernvm.cern.ch> you write:

|>>|> I was wondering if there exists a specification of HTML in yacc
|>>|>(or BNF) form. It has probably been done, as constructing such a parser is
|>>|>much easier this way than with a traditional C subroutine.
|>>
|>>Don't think about it. HTML is not an LR(1) grammar and so trying to use yacc
|>>is only going to cause pain. The best way of parsing SGML is with a top down
|>>recursive descent parser. Try to use yacc and you will end up in all sorts of
|>>troubles, especially with error reporting.
|>
|>Phill is technically correct (that one cannot parse SGML and hence
|>HTML using YACC et al).
|>
|>If one limits oneself to a subset of SGML, it is quite possible to
|>produce a YACC grammar. Dan Connolly has produced such a grammar for
|>HTML by hacking DTD2HTML, and the TEI folks have produced an
|>*excellent* and very *useful* subset of SGML, and the grammar is
|>available at:
|>
|>    ftp://ftp-tei.uic.edu/pub/TEI
|>
|>While these can accept some documents that are not quite legal SGML,
|>99.9% of documents I've seen would be both legal within the TEI
|>grammar, and within SGML.

But why bother? Parsing SGML with a top down recursive descent parser based
on an FSR is by far the simplest approach to implement and also produces
correct code. Why would anyone want to use an inappropriate tool which does
the job less well and is more difficult to use?

Yacc is OK if you actually have an LR(1) grammar, but it's best to steer well
clear of it otherwise. In addition, error handling was never really thought
out properly for yacc. I've never seen anyone successfully use the error
productions without coming a cropper.

HTML 2.0 is just about parsable with yacc, but HTML 3 is pretty awful,
especially the maths extensions, since they use some of the character set
shifting functions. This part is distinctly non-LR(1), and the best, most
compact definition of the grammar is produced using a push-down automaton.
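To make the top down approach concrete, here is a minimal sketch in Python of a recursive descent parser for a toy tag language. It is an illustration only, not a real SGML or HTML parser: it assumes every element has an explicit end tag and ignores attributes, entities, tag omission and the DTD. All names (tokenize, parse_content, parse_element, ParseError) are made up for the example. The point is that the call stack plays the part of the push-down store, and that each error message can be written at the exact spot where the parser knows what it expected.

```python
import re

# One token is either a start/end tag such as <p> or </p>, or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)")


class ParseError(Exception):
    """Raised with a message naming what was expected and where."""


def tokenize(text):
    """Return a list of (kind, value, offset) with kind in open/close/text."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ParseError(f"unexpected character at offset {pos}")
        if m.group(3) is not None:
            tokens.append(("text", m.group(3), pos))
        elif m.group(1):
            tokens.append(("close", m.group(2), pos))
        else:
            tokens.append(("open", m.group(2), pos))
        pos = m.end()
    return tokens


def parse_content(tokens, i):
    """content := (text | element)*   Stops at an end tag or end of input."""
    nodes = []
    while i < len(tokens) and tokens[i][0] != "close":
        if tokens[i][0] == "text":
            nodes.append(tokens[i][1])
            i += 1
        else:
            node, i = parse_element(tokens, i)
            nodes.append(node)
    return nodes, i


def parse_element(tokens, i):
    """element := <name> content </name>   The recursion is the stack."""
    _, name, pos = tokens[i]
    children, i = parse_content(tokens, i + 1)
    if i == len(tokens):
        raise ParseError(f"<{name}> opened at offset {pos} is never closed")
    _, close_name, close_pos = tokens[i]
    if close_name != name:
        raise ParseError(
            f"expected </{name}> but found </{close_name}> at offset {close_pos}"
        )
    return (name, children), i + 1


def parse(text):
    """Parse a whole document into a list of text strings and (name, children)."""
    tokens = tokenize(text)
    nodes, i = parse_content(tokens, 0)
    if i != len(tokens):
        _, name, pos = tokens[i]
        raise ParseError(f"unexpected </{name}> at offset {pos}")
    return nodes
```

For example, parse("<p>Hi <b>there</b></p>") returns a one-element list holding the tuple ("p", ["Hi ", ("b", ["there"])]), while parse("<p><b>x</p>") raises ParseError complaining that it expected </b>. Getting a diagnostic of that kind out of yacc error productions is the part that tends to go wrong.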
I think the problem lies in comp sci classes being taught that bottom up
parsing is `better' and the students not asking why.

Goldfarb would not know an LR(1) grammar if one bit him on the nose. If he
had, SGML might not fall into the "much wailing and gnashing of teeth"
category, as it does.

PS: I have discovered that the correct pronunciation of "ASN.1" is
"assassin 1".

--
Phillip M. Hallam-Baker

Not Speaking for anyone else.
Received on Friday, 24 March 1995 10:07:08 UTC