
Re: including a schema with "HTML: The Markup Language" Clarifying TAG Re: Courtesy notification

From: Graham Klyne <GK-lists@ninebynine.org>
Date: Wed, 17 Mar 2010 10:53:22 +0000
Message-ID: <4BA0B4A2.8060708@ninebynine.org>
To: Maciej Stachowiak <mjs@apple.com>
CC: Larry Masinter <LMM@acm.org>, 'Dan Connolly' <connolly@w3.org>, "'Michael(tm) Smith'" <mike@w3.org>, noah_mendelsohn@us.ibm.com, 'Paul Cotton' <paul.cotton@microsoft.com>, 'Philippe Le Hegaret' <plh@w3.org>, 'Sam Ruby' <rubys@intertwingly.net>, www-tag@w3.org
OK, now I understand better where you are coming from.

All of which I guess underscores Larry's point: it's hard (if not generally 
impossible) to use a grammar/schema/other-formal-description to check *all* 
aspects of program/input correctness, but that doesn't take away from the value 
of using one to validate those aspects that are amenable to such validation.

In my experience, it is often the process of expressing and reviewing a language in 
some formalism that is of greatest value, for understanding the implications of, 
and problems in, its design.  I believe Dan Connolly reported some similar 
experiences w.r.t. XQuery a few years ago (Amsterdam WWW conference, developer 
day, IIRC).

(I'm not sure about the HTML lexer, but an XML lexer can't easily be described 
in terms of a finite state machine because of the context sensitivity of the 
tokenization process, something I learned while trying to fix up an XML parser 
written in Haskell; Haskell itself might in some ways be regarded as pretty 
close to a general-purpose, machine-processable formal specification language.)
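As a toy illustration of that mode dependence (hypothetical code, not taken from the Haskell parser mentioned above): the same input character yields different tokens depending on which mode the lexer is in. A real XML lexer needs many more modes than the two shown here (CDATA sections, comments, attribute values, the DTD internal subset), and it is the interaction of those modes with entity expansion that makes a simple finite-state description awkward.

```python
# Toy lexer with two modes, CONTENT and TAG. Note that '<' is a token
# boundary only in CONTENT mode, and '>' only in TAG mode: the meaning
# of a character depends on the tokenization context.

def tokenize(xml):
    tokens, mode, buf = [], "CONTENT", ""
    for c in xml:
        if mode == "CONTENT":
            if c == "<":                 # '<' opens a tag only here
                if buf:
                    tokens.append(("TEXT", buf))
                buf, mode = "", "TAG"
            else:
                buf += c
        else:                            # TAG mode: '>' closes the tag
            if c == ">":
                tokens.append(("TAG", buf))
                buf, mode = "", "CONTENT"
            else:
                buf += c
    if buf:
        tokens.append(("TEXT", buf))
    return tokens

print(tokenize("<p>a &amp; b</p>"))
# [('TAG', 'p'), ('TEXT', 'a &amp; b'), ('TAG', '/p')]
```

(Two modes alone are still a finite-state device, of course; the point is that each additional XML construct adds more context on top of this shape.)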


Maciej Stachowiak wrote:
> On Mar 17, 2010, at 2:37 AM, Graham Klyne wrote:
>> Maciej Stachowiak wrote:
>>> On Mar 16, 2010, at 3:25 PM, Larry Masinter wrote:
>>>>> none of the available schema languages is
>>>>> expressive enough to represent all of the HTML5 document conformance
>>>>> requirements.
>>>> This seems like an odd requirement.
>>>> Can you think of any non-trivial computer language for which a
>>>> formalism such as a schema language or BNF or whatever completely
>>>> describes ALL of the conformance requirements for instances of
>>>> that language? In the history of computer languages?
>>>> I can't.
>>> Most programming languages are not specified in terms of a schema. 
>>> They do often provide a grammar in BNF form, but this is generally 
>>> seen as an aid to implementors in determining how to parse the 
>>> language, not a tool for conformance checking. To use an example I am 
>>> familiar with, C has many mandatory diagnostics which do not comprise 
>>> part of the grammar, and I do not think it is common to check 
>>> correctness of C programs with a tool that solely checks against the 
>>> grammar.
>> I find this a surprising view of the history of programming 
>> languages.  Many of the languages I've worked with have formal 
>> grammars that are/were commonly used to define the basis of parsing 
>> programs in that language.
> Parsing - yes. Checking correctness of the program - not so much. In the 
> case of HTML, though, a schema is useless as an aid to parsing.
>>  Much of the work on programming language theory through the 1970s was 
>> exactly about developing systems to check the syntax of programming 
>> languages against formal specifications, and was largely successful in 
>> achieving those goals.  Later work on semantics conformance checking 
>> was harder, but not without some limited success.
>> And if parsing a programming language isn't conformance checking, I 
>> don't know what is.
> Going with an example I am familiar with, a C compiler emits many 
> diagnostics other than parse errors. I don't know of any tool for C that 
> will report *only* the parse errors, based solely on the formal grammar. 
> Likewise for any other language I have coded in. Most compiled languages 
> have a concept of declaring identifiers and will complain if you use one 
> without doing so, for instance. Likewise with type errors, for languages 
> with static typing.
> In the case of C++, I think it may not even be possible to parse 
> correctly without doing some semantic analysis.
>>  And if a formal grammar isn't a kind of schema, then what is it?
> It is a kind of schema (or perhaps it's more accurate to say a schema is 
> a kind of grammar?)
>> What continues to amaze me is that later work on markup languages and 
>> other network protocol syntaxes seems to have completely ignored the 
>> earlier work on programming language parsing.  (XML is a case in 
>> point: it defies parsing according to established programming language 
>> compilation techniques, in part because its lexing is 
>> parse-context-dependent.  HTML even more so, I think, though I've 
>> never tried to write a parser for that.)
> Classically, HTML was assumed to be SGML, and I expect that defies 
> parsing even more than XML does. However, HTML5 drops the pretense of 
> SGMLness, and defines parsing in terms of more traditional lexing and 
> parsing phases. I am pretty certain the HTML5 tokenizer could be 
> expressed as a finite state machine and the HTML5 tree builder as a 
> pushdown automaton, though they are expressed in prose rather than 
> formal grammars.
> Regards,
> Maciej
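Maciej's characterization of the tree builder as a pushdown automaton can be sketched in a few lines. In this minimal, hypothetical example (not the actual HTML5 algorithm, which layers extensive error-recovery rules on top of this shape), the stack of open elements is the pushdown store:

```python
# Hypothetical sketch of a tree builder as a pushdown automaton:
# start tags push onto the stack of open elements, end tags pop,
# and text is appended to whatever element is currently open.

def build_tree(tokens):
    root = {"tag": "#root", "children": []}
    stack = [root]                       # the pushdown store
    for kind, data in tokens:
        if kind == "start":
            node = {"tag": data, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)           # push on start tag
        elif kind == "end" and len(stack) > 1:
            stack.pop()                  # pop on end tag
        elif kind == "text":
            stack[-1]["children"].append(data)
    return root

tree = build_tree([("start", "ul"), ("start", "li"),
                   ("text", "one"), ("end", "li"), ("end", "ul")])
print(tree)
```

What distinguishes real HTML5 tree construction from this sketch is not the automaton shape but the recovery rules: mis-nested and unclosed tags never abort, they just trigger prescribed stack manipulations.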
Received on Wednesday, 17 March 2010 10:58:25 UTC
