Rules for Pointer usage, Episode 2: Attack of the Parsers

Hi group,

Here is a second (long) take on the rules for Pointer usage, this time from a parser perspective. (Warning: huge mail. You may want to jump directly to the conclusions at the bottom.)

When analysing formal languages we can typically distinguish three levels of analysis:


#1 - Lexical analysis: looks for character sequences that can form lexemes (aka tokens)

Some examples of lexical problems:

- <p id="42">Paragraph</p>

'42' is not a valid token for an id because it starts with a number

- <p>Me &amp you</p>

'&amp' is not a valid token for an entity due to the missing ';' at the end

- <p my*Attribute="myValue">Another paragraph</p>

'my*Attribute' is not a valid token for an attribute name due to the '*' character


#2 - Syntactic analysis (aka parsing): analyses the token sequence to determine whether it conforms to the grammar rules

Some examples of syntactic problems:

- </p>Just another paragraph</p>

'</p>' is a valid token but this is not the place for it

- <p>One paragraph <p>inside other</p></p>

All of them are valid tokens, but <p> is not allowed to nest inside itself


#3 - Semantic analysis: once the lexical and syntactic rules are followed, you can extract the semantics according to your own interests.

Some examples of semantic problems:

- <myP>Personalized paragraph</myP>

'<myP>' is a beautiful element token that also follows the grammar rules; unfortunately, it is not semantically valid

- <p myAttribute="myValue">The last paragraph so far</p>

Again, all the tokens are valid, but once more 'myAttribute' and 'myValue' are not semantically valid.

There are also several other, higher-level kinds of semantic analysis; for example, is the alt text of an image appropriate?


Once we have this categorization, we can think about the implications of reporting lexical, syntactic or semantic problems, and specifically how each affects pointer usage:


- Lexical analysis is typically driven by a finite state automaton and based on regular expressions. In brief, we can say that the automaton transitions between states until it reaches a final state, which may be either a valid (accepting) state or an error.

The analysis at this level is done character by character, so an error can appear at any time. The automaton stops as soon as it finds an error and returns a character reference (the first invalid character of the token under construction).

For example:

<p id="42">Paragraph</p>

Returns an error at the position of the '4' character

<p>Me &amp you</p>

Returns an error at the position of the blank character after '&amp'

<p my*Attribute="myValue">Another paragraph</p>

Returns an error at the position of the '*' character


In this case the pointer returned by the parser (the problematic character) matches the best practice from a human evaluator's perspective.
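
To make this concrete, here is a minimal sketch in Python of one regex-driven lexing step that reports a CharOffset on failure. The token pattern and the set of characters allowed after a name are assumptions made up for the example, not any specific parser's rules:

    import re

    # Toy name-token pattern: a letter, then letters, digits or hyphens.
    NAME = re.compile(r'[A-Za-z][A-Za-z0-9-]*')

    def lex_name(source, start):
        """Read one name token at 'start'. On failure, return None plus the
        offset of the first invalid character, i.e. a CharOffset pointer."""
        match = NAME.match(source, start)
        if match is None:
            return None, start          # error at the very first character
        end = match.end()
        if end < len(source) and source[end] not in ' \t\n=/>"':
            return None, end            # first character that can neither
                                        # extend nor legally follow the token
        return match.group(), end

    # The '*' stops the automaton inside the attribute name:
    print(lex_name('my*Attribute', 0))  # (None, 2) -> points at the '*'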


- Syntactic analysis is typically driven by a stack machine and based on Backus-Naur Form grammars (for context-free grammars). In brief, we can say the machine pushes and pops elements until it reaches the end of the input.

The analysis at this level is done token by token, so an error can appear any time a new token is analysed. The machine stops as soon as it finds an error and returns a token reference (generally the end of the token).

For example:

</p>Just another paragraph</p>

Returns an error at the first '>', i.e. the end of the first '</p>' token

<p>One paragraph <p>inside other</p></p>

Returns an error at the end of the second '<p>' token

In this case the pointer returned by the parser (the end of the problematic token) doesn't match the best practice from a human evaluator's perspective (pointing to the beginning of the problematic token, or to the whole token itself).
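
As a sketch of how a tool could point to the whole token instead, here is a toy stack machine in Python that returns both the start and end offsets of the problematic token. The tag pattern and the no-self-nesting rule are simplifications invented for the example:

    import re

    TAG = re.compile(r'</?([A-Za-z]+)>')

    def check_nesting(source, no_self_nest=('p',)):
        """Push start tags, pop end tags. On error, return the
        (start, end) offsets of the whole problematic token."""
        stack = []
        for m in TAG.finditer(source):
            name = m.group(1)
            if m.group(0).startswith('</'):
                if not stack or stack[-1] != name:
                    return (m.start(), m.end())   # end tag out of place
                stack.pop()
            else:
                if name in no_self_nest and name in stack:
                    return (m.start(), m.end())   # <p> inside itself
                stack.append(name)
        return None

    print(check_nesting('<p>One paragraph <p>inside other</p></p>'))
    # (17, 20) -> the whole second '<p>' token, not just its end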


- Semantic analysis is typically done over some kind of data structure, in our case typically a tree, although some parts of this analysis could be merged with the syntactic analysis.

The analysis is then typically driven by DOM (or similar) parsers that provide access to the different tree nodes and their properties at any time, plus some ad-hoc transformations for more elaborate checks (e.g. analysing the content of a Text node). In brief, we can say we look around the tree in search of what we want.
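
As an illustration of this level, here is a minimal semantic check in Python over a parsed tree, using xml.etree.ElementTree as a stand-in for a DOM parser. The allowed vocabulary is a made-up fragment, not a real schema:

    import xml.etree.ElementTree as ET

    ALLOWED_ELEMENTS = {'body', 'p'}
    ALLOWED_ATTRIBUTES = {'p': {'id', 'class'}}

    def check_vocabulary(markup):
        """Walk the tree and flag elements/attributes outside the
        (made-up) allowed vocabulary."""
        problems = []
        for node in ET.fromstring(markup).iter():
            if node.tag not in ALLOWED_ELEMENTS:
                problems.append(('unknown element', node.tag))
            for attr in node.attrib:
                if attr not in ALLOWED_ATTRIBUTES.get(node.tag, set()):
                    problems.append(('unknown attribute', attr))
        return problems

    print(check_vocabulary(
        '<body><myP>Personalized paragraph</myP>'
        '<p myAttribute="myValue">The last paragraph so far</p></body>'))
    # [('unknown element', 'myP'), ('unknown attribute', 'myAttribute')]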


CONCLUSIONS

Based on the previous categorization we can infer the following:

- Lexical analysis

* Lexical parsers are always going to return problems in the form of pointers to a specific point (character). This coincides with the expected best practice.

* This is the only way of identifying lexical problems, as at this level we don't yet have any higher-level structure (e.g. tokens).

* Therefore we can only use pointers that are able to point to a specific point (see the conversion sketch after this list), that is:

- CharOffset
- ByteOffset?
- LineChar
- XPointer?
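
These point-style pointers are easy to convert between. For instance, here is a small Python sketch that turns a CharOffset into a LineChar pointer (the 1-based line and character numbers are an assumption; conventions vary between tools):

    def char_offset_to_line_char(source, offset):
        """Map a CharOffset pointer to a (line, char) LineChar pointer,
        both 1-based (an assumption; conventions vary between tools)."""
        line = source.count('\n', 0, offset) + 1
        last_newline = source.rfind('\n', 0, offset)
        return line, offset - last_newline   # rfind gives -1 on line 1,
                                             # which still works out

    text = '<html>\n<p id="42">Paragraph</p>\n</html>'
    print(char_offset_to_line_char(text, text.index('4')))   # (2, 8)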

- Syntactical analysis

* Syntactic parsers are usually going to return problems in the form of pointers to the problematic token.

* Current implementations usually return a pointer to the end point of the token, but the best practice would be to point to the whole token.

* Trying to impose the best practice might contradict current implementations, so the best option might be to follow the de-facto standard.

* Then, again, we can only use pointers that are able to point to a specific point (see the list above). CSS selectors might also play a role here.

- Semantic analysis

* Semantic parsers are usually going to return tree-like structures, but they could be much more sophisticated and return a wide range of things.

* Current implementations of generic tree-like parsers usually return whatever part of the tree you are interested in.

* XPath and XPointer pointers are going to be really useful here but, given the open nature of these parsers, any kind of pointer might be useful.


Obviously, all of the above refers only to single pointers. If compound pointers are needed for a specific use case, you can freely combine any of those proposed in each section to create the compound one, as sketched below.
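
For instance, purely as an illustration and with made-up names, a whole-token pointer could be expressed as a compound of two single CharOffset pointers:

    from dataclasses import dataclass

    @dataclass
    class CharOffsetPointer:
        offset: int

    @dataclass
    class CompoundPointer:      # made-up shape, not a proposed vocabulary
        start: CharOffsetPointer
        end: CharOffsetPointer

    # The whole second '<p>' token from the nesting example above:
    second_p = CompoundPointer(CharOffsetPointer(17), CharOffsetPointer(20))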

A final open point is that there are not only parsers out there. There are also human evaluators (a clear use case is the TSD TF, and probably Web Accessibility in general). As we have seen, the requirements for human legibility may differ from those for parser legibility. So the questions are:

- How can we create rules that accommodate both cases? Is this possible?
- Should we differentiate between the two cases when creating rules?

Looking forward to your opinions.
Regards,
 CI.

____________________

Carlos Iglesias

Fundación CTIC
Parque Científico-Tecnológico de Gijón
33203 - Gijón, Asturias, Spain
phone: +34 984291212
email: carlos.iglesias@fundacionctic.org
URL: http://www.fundacionctic.org 
