Re: including a schema with "HTML: The Markup Language" Clarifying TAG Re: Courtesy notification

On Mar 17, 2010, at 2:37 AM, Graham Klyne wrote:

> Maciej Stachowiak wrote:
>> On Mar 16, 2010, at 3:25 PM, Larry Masinter wrote:
>>>> none of the available schema languages is
>>>> expressive enough to represent all of the HTML5 document
>>>> conformance requirements.
>>>
>>> This seems like an odd requirement.
>>>
>>> Can you think of any non-trivial computer language for which
>>> a formalism such as a schema language or BNF or whatever completely
>>> described ALL of the conformance requirements for instances of
>>> that language? In the history of computer languages?
>>>
>>> I can't.
>> Most programming languages are not specified in terms of a schema.  
>> They do often provide a grammar in BNF form, but this is generally  
>> seen as an aid to implementors in determining how to parse the  
>> language, not a tool for conformance checking. To use an example I  
>> am familiar with, C has many mandatory diagnostics which do not  
>> comprise part of the grammar, and I do not think it is common to  
>> check correctness of C programs with a tool that solely checks  
>> against the grammar.
>
> I find this a surprising view of the history of programming  
> languages.  Many of the languages I've worked with have formal  
> grammars that are/were commonly used to define the basis of parsing  
> programs in that language.

Parsing - yes. Checking correctness of the program - not so much. In  
the case of HTML, though, a schema is useless as an aid to parsing.

>  Much of the work on programming language theory through the 1970s  
> was exactly about developing systems to check the syntax of  
> programming languages against formal specifications, and was largely  
> successful in achieving those goals.  Later work on semantics  
> conformance checking was harder, but not without some limited success.
>
> And if parsing a programming language isn't conformance checking, I  
> don't know what is.

Going with an example I am familiar with, a C compiler emits many  
diagnostics other than parse errors. I don't know of any tool for C  
that will report *only* the parse errors, based solely on the formal  
grammar. Likewise for any other language I have coded in. Most  
compiled languages have a concept of declaring identifiers and will  
complain if you use one without doing so, for instance. Likewise with  
type errors, for languages with static typing.
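
To make the grammar-versus-diagnostics distinction concrete, here is a minimal sketch. It uses Python rather than C, only because Python's standard library exposes a grammar-only parse (ast.parse) directly. The helper names grammar_only_check and undefined_names are made up for this illustration, and the second pass is deliberately crude (flat snippets only, no scoping), so treat it as a sketch rather than a real checker.

```python
import ast
import builtins


def grammar_only_check(source):
    """Return True if the source matches the grammar; nothing more is checked."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def undefined_names(source):
    """Crude extra pass: names that are read but never assigned in a flat snippet.

    Assumption: no functions, classes, imports, or nested scopes are handled.
    """
    tree = ast.parse(source)
    names = [node for node in ast.walk(tree) if isinstance(node, ast.Name)]
    assigned = {n.id for n in names if isinstance(n.ctx, ast.Store)}
    used = {n.id for n in names if isinstance(n.ctx, ast.Load)}
    return used - assigned - set(dir(builtins))


# Grammatically fine, but 'count' is never defined anywhere.
SNIPPET = "total = count + 1\nprint(total)\n"

if __name__ == "__main__":
    print(grammar_only_check(SNIPPET))  # the grammar alone accepts it
    print(undefined_names(SNIPPET))     # the extra pass flags 'count'
```

The grammar-only check accepts the snippet, and it takes a separate pass beyond the grammar to notice the undeclared identifier, which is the kind of diagnostic a C compiler is required to produce.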

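In the case of C++, I think it may not even be possible to parse
correctly without doing some semantic analysis. For example, the
statement "a * b;" declares a pointer named b if a names a type, and
is a multiplication expression otherwise.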

>  And if a formal grammar isn't a kind of schema, then what is it?

It is a kind of schema (or perhaps it's more accurate to say a schema  
is a kind of grammar?)

>
> What continues to amaze me is that later work on markup languages  
> and other network protocol syntaxes seems to have completely ignored  
> the earlier work on programming language parsing.  (XML is a case in  
> point: it defies parsing according to established programming  
> language compilation techniques, in part because its lexing is
> parse-context-dependent.  HTML even more so, I think, though I've never  
> tried to write a parser for that.)

Classically, HTML was assumed to be SGML, and I expect that defies  
parsing even more than XML does. However, HTML5 drops the pretense of  
SGMLness, and defines parsing in terms of more traditional lexing and  
parsing phases. I am pretty certain the HTML5 tokenizer could be  
expressed as a finite state machine and the HTML5 tree builder as a  
pushdown automaton, though they are expressed in prose rather than  
formal grammars.
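
As a rough illustration of the finite-state-machine claim, here is a toy tokenizer sketch in Python. The four states are loosely modelled on the HTML5 "data", "tag open", "end tag open" and "tag name" states, but everything else is an assumption made for brevity: it ignores attributes, comments, character references, self-closing tags and almost all of the error handling the HTML5 draft describes, so it is nowhere near a conforming tokenizer.

```python
from enum import Enum, auto


class State(Enum):
    DATA = auto()
    TAG_OPEN = auto()
    END_TAG_OPEN = auto()
    TAG_NAME = auto()


def tokenize(text):
    """Toy state machine: one state variable, one transition per character."""
    state = State.DATA
    tokens = []
    buf = []
    is_end = False
    for ch in text:
        if state is State.DATA:
            if ch == "<":
                if buf:  # flush pending character data
                    tokens.append(("text", "".join(buf)))
                    buf = []
                state = State.TAG_OPEN
            else:
                buf.append(ch)
        elif state is State.TAG_OPEN:
            if ch == "/":
                is_end = True
                state = State.END_TAG_OPEN
            elif ch.isalpha():
                is_end = False
                buf = [ch.lower()]
                state = State.TAG_NAME
            else:
                # Not a tag after all: keep "<" as literal text (toy behaviour).
                buf = ["<", ch]
                state = State.DATA
        elif state is State.END_TAG_OPEN:
            if ch.isalpha():
                buf = [ch.lower()]
                state = State.TAG_NAME
            else:
                # Toy behaviour: silently drop a malformed end tag.
                is_end = False
                state = State.DATA
        elif state is State.TAG_NAME:
            if ch == ">":
                kind = "end_tag" if is_end else "start_tag"
                tokens.append((kind, "".join(buf)))
                buf = []
                is_end = False
                state = State.DATA
            else:
                buf.append(ch.lower())
    if state is State.DATA and buf:
        tokens.append(("text", "".join(buf)))
    return tokens


if __name__ == "__main__":
    print(tokenize("<p>Hi</p>"))
```

The point is only that the tokenizer needs no memory beyond the current state and a small buffer. Tracking which elements are open is what needs a stack, and that would be the tree builder's job.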

Regards,
Maciej

Received on Wednesday, 17 March 2010 10:17:00 UTC