- From: Maciej Stachowiak <mjs@apple.com>
- Date: Wed, 17 Mar 2010 03:16:24 -0700
- To: Graham Klyne <GK-lists@ninebynine.org>
- Cc: Larry Masinter <LMM@acm.org>, 'Dan Connolly' <connolly@w3.org>, "'Michael(tm) Smith'" <mike@w3.org>, noah_mendelsohn@us.ibm.com, 'Paul Cotton' <paul.cotton@microsoft.com>, 'Philippe Le Hegaret' <plh@w3.org>, 'Sam Ruby' <rubys@intertwingly.net>, www-tag@w3.org
On Mar 17, 2010, at 2:37 AM, Graham Klyne wrote:

> Maciej Stachowiak wrote:
>> On Mar 16, 2010, at 3:25 PM, Larry Masinter wrote:
>>>> none of the available schema languages is expressive enough to
>>>> represent all of the HTML5 document conformance requirements.
>>>
>>> This seems like an odd requirement.
>>>
>>> Can you think of any non-trivial computer language for which there is
>>> a formalism such as a schema language or BNF or whatever that
>>> completely describes ALL of the conformance requirements for
>>> instances of that language? In the history of computer languages?
>>>
>>> I can't.
>>
>> Most programming languages are not specified in terms of a schema.
>> They do often provide a grammar in BNF form, but this is generally
>> seen as an aid to implementors in determining how to parse the
>> language, not a tool for conformance checking. To use an example I am
>> familiar with, C has many mandatory diagnostics which do not comprise
>> part of the grammar, and I do not think it is common to check the
>> correctness of C programs with a tool that solely checks against the
>> grammar.
>
> I find this a surprising view of the history of programming languages.
> Many of the languages I've worked with have formal grammars that
> are/were commonly used to define the basis of parsing programs in that
> language.

Parsing, yes. Checking the correctness of the program, not so much. In the case of HTML, though, a schema is useless as an aid to parsing.

> Much of the work on programming language theory through the 1970s was
> exactly about developing systems to check the syntax of programming
> languages against formal specifications, and was largely successful in
> achieving those goals. Later work on semantic conformance checking was
> harder, but not without some limited success.
>
> And if parsing a programming language isn't conformance checking, I
> don't know what is.

Going with an example I am familiar with, a C compiler emits many diagnostics other than parse errors. I don't know of any tool for C that will report *only* the parse errors, based solely on the formal grammar. Likewise for any other language I have coded in. Most compiled languages have a concept of declaring identifiers and will complain if you use one without doing so, for instance; likewise with type errors, for languages with static typing. In the case of C++, I think it may not even be possible to parse correctly without doing some semantic analysis. [An illustrative C sketch of this point is appended after the message.]

> And if a formal grammar isn't a kind of schema, then what is it?

It is a kind of schema (or perhaps it's more accurate to say a schema is a kind of grammar?).

> What continues to amaze me is that later work on markup languages and
> other network protocol syntaxes seems to have completely ignored the
> earlier work on programming language parsing. (XML is a case in point:
> it defies parsing according to established programming language
> compilation techniques, in part because its lexing is
> parse-context-dependent. HTML even more so, I think, though I've never
> tried to write a parser for that.)

Classically, HTML was assumed to be SGML, and I expect that defies parsing even more than XML does. However, HTML5 drops the pretense of SGMLness and defines parsing in terms of more traditional lexing and parsing phases. I am pretty certain the HTML5 tokenizer could be expressed as a finite state machine and the HTML5 tree builder as a pushdown automaton, though they are expressed in prose rather than formal grammars. [A toy state-machine sketch is appended after the message.]

Regards,
Maciej
Received on Wednesday, 17 March 2010 10:17:00 UTC
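
As a minimal sketch of the point about C above (an editorial illustration, not part of the original exchange): the translation unit below is syntactically well-formed per the C grammar, so a checker driven only by the BNF would accept it, yet a C compiler is still required to complain, because the violated rules are semantic constraints rather than grammar productions.

```c
/* Illustrative only: this file parses cleanly against the C grammar,
 * but a conforming compiler must still diagnose it, because the
 * problems below are constraint violations, not syntax errors. */

int main(void)
{
    int x = "hello";   /* initializing an int with a pointer to char:
                          a type error the grammar alone cannot express */
    return y;          /* use of an undeclared identifier: grammatically
                          a valid expression, rejected by every C compiler
                          I am aware of */
}
```

Compiling this with any C compiler (for example `gcc -std=c99 -c example.c`) produces diagnostics for both lines, and neither of them is a parse error.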
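
And, for the closing point about HTML5, a toy illustration (again an editorial sketch under heavy simplification: the real HTML5 tokenizer has dozens of states, plus attributes, character references, and error-recovery rules, all omitted here) of how a tag tokenizer can be written as a small finite state machine, loosely shaped like the spec's data, tag open, and tag name states.

```c
#include <ctype.h>
#include <stdio.h>

/* A deliberately tiny finite-state tokenizer: it handles only plain
 * text and simple start/end tags with no attributes. */

enum state { DATA, TAG_OPEN, TAG_NAME };

static void tokenize(const char *input)
{
    enum state s = DATA;
    char name[64];
    size_t n = 0;
    int end_tag = 0;

    for (const char *p = input; *p; p++) {
        switch (s) {
        case DATA:
            if (*p == '<') {
                s = TAG_OPEN;
            } else {
                printf("char token: '%c'\n", *p);
            }
            break;
        case TAG_OPEN:
            end_tag = (*p == '/');
            if (end_tag) {
                s = TAG_NAME;
                n = 0;
            } else if (isalpha((unsigned char)*p)) {
                s = TAG_NAME;
                n = 0;
                name[n++] = (char)tolower((unsigned char)*p);
            } else {
                /* the real spec reprocesses the character as text;
                 * here we just emit "<" plus the character and return
                 * to the data state */
                printf("char token: '<'\n");
                printf("char token: '%c'\n", *p);
                s = DATA;
            }
            break;
        case TAG_NAME:
            if (*p == '>') {
                name[n] = '\0';
                printf("%s tag token: <%s>\n",
                       end_tag ? "end" : "start", name);
                s = DATA;
            } else if (n + 1 < sizeof name) {
                name[n++] = (char)tolower((unsigned char)*p);
            }
            break;
        }
    }
}

int main(void)
{
    tokenize("<p>hi</p>");
    return 0;
}
```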