Re: Fuzzy Pointers: Normalisation

I sent this mail earlier to Gabriele, but it was meant to go to the list.

--

Hi Gabriele,

At 12:27 5/05/2005, you wrote:
>Hi Nick,
>
>(...)
>   Normalisation "somehow" changes the original content of an SGML/XML document. I don't want to state the obvious, but normalisation is a one-way process and going back from a normalised document to its original is very hard (unless all the changes are stored).

I agree.


>   This process could therefore affect the localisation of the subject of an assertion. Especially when we assert something like "an element is missing".
>
>   I want to clarify this with an example, and I hope we can discuss it.
>
>   Let's suppose my aim is to compile statistics on the usage of the "tbody" element in a collection of HTML documents on the net. I want to use EARL to write a report with assertions about all the documents that have been fetched and checked, together with the results (and maybe repeat it every quarter).

In fact, every HTML table has a "tbody" element: tbody is one of those elements that have optional start and end tags (like "html", "head" and "body"; an SGML parser is expected to infer their presence). I guess I'm a nitpicker, but I would rephrase your statement as "statistics regarding the use of the 'tbody' start and end tags".


>   If original documents do specify "tbody", I guess the normalisation process produces a structure which would not affect the localisation of the subject of my assertion.
>
>   On the other hand, if we consider this document portion:
>
>[...]
><table>
><tr>
><th>Country</th>
><th>Population</th>
></tr>
><tr>
><td>Italy</td>
><td>57 millions (?)</td>
></tr>
>[...]
>
>   my question is: would the normalisation process introduce the following change or not?
>
>[...]
><tbody>
><tr>
><td>Italy</td>
><td>57 millions (?)</td>
></tr>
>[...]
></tbody>

The content models for "tbody", "thead" and "tfoot" are identical, namely "(tr)+", so an SGML parser may actually generate a hierarchy that corresponds to this:

[...]
<table>
<tbody>
<tr>
<th>Country</th>
<th>Population</th>
</tr>
<tr>
<td>Italy</td>
<td>57 millions (?)</td>
</tr>
[...]
</tbody>
</table>

More intelligent software might produce this:
[...]
<table>
<thead>
<tr>
<th>Country</th>
<th>Population</th>
</tr>
</thead>
<tbody>
<tr>
<td>Italy</td>
<td>57 millions (?)</td>
</tr>
[...]
</tbody>
</table>

(I am not aware of software that does this.)
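
To illustrate the first of the two hierarchies above, here is a minimal sketch in Python of how an omitted "tbody" start and end tag can be made explicit. It is not a real SGML parser: it only knows the single rule "a tr directly inside a table implies a tbody", it drops attributes, and the names TbodyExplicitParser, normalise and implied are my own inventions for this example. It uses only the standard html.parser module.

```python
from html.parser import HTMLParser


class TbodyExplicitParser(HTMLParser):
    """Toy normaliser: makes an omitted tbody start/end tag explicit.

    Assumption: only the table/tbody/tr rule is handled; attributes are dropped.
    """

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open elements, innermost last
        self.out = []      # fragments of the normalised markup
        self.implied = 0   # number of tbody elements whose tags were inferred

    def handle_starttag(self, tag, attrs):
        # A tr whose parent is table means the tbody start tag was omitted.
        if tag == "tr" and self.stack and self.stack[-1] == "table":
            self.stack.append("tbody")
            self.out.append("<tbody>")
            self.implied += 1
        self.stack.append(tag)
        self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        # </table> while tbody is still open means the tbody end tag was omitted.
        if tag == "table" and self.stack and self.stack[-1] == "tbody":
            self.stack.pop()
            self.out.append("</tbody>")
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)


def normalise(html):
    """Return (normalised markup, number of inferred tbody elements)."""
    parser = TbodyExplicitParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out), parser.implied


source = ("<table><tr><th>Country</th><th>Population</th></tr>"
          "<tr><td>Italy</td><td>57 millions (?)</td></tr></table>")
markup, implied = normalise(source)
print(markup)
print("tbody tags inferred:", implied)
```

A document that already writes out its "tbody" tags passes through unchanged with a count of zero, so a tool built along these lines could still tell whether the tags were present in the original.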


>   If it does, I think, there could be problems when trying to locate the missing tbody on a document that's been normalised: indeed, the tbody actually exists, as it has been artificially added.

Not "added" but made explicit.

Regards,

Christophe Strobbe



>   My question is: how would you locate this kind of problem using the normalised document? Are you still able to refer to the problem in the original document using a fuzzy pointer or Xpath expression (which are related to the normalised document)?
>
>   Thank you for your attention.
>
>Ciao,
>-Gabriele
>
>Nick Kew wrote:
>>Jim has given us very briefly his take on the normalisation problem.
>>FWIW, there's a piece on the subject by Joe English at
>>http://groups-beta.google.com/group/comp.text.sgml/msg/70ec0496587b03bb
>>taken from an SGML viewpoint.  He doesn't make any reference to HTML
>>as such, but puts forward general rules.  His analysis supports the
>>view that <tbody> elements (along with the usual suspects <html>,
>><head>, <body>) should be inserted into the document tree where there
>>is ambiguity.
>
>-- 
>Gabriele Bartolini: Web Programmer, IWA/HWG Member
>ht://Check, ht://Miner and Wuhkag maintainer
>Current Location: Prato, Toscana, Italia
>me@gabrielebartolini.it | www.gabrielebartolini.it
>> "Lasciate ogne speranza, voi ch'intrate", Dante Alighieri, Divina Commedia, Inferno

-- 
Christophe Strobbe
K.U.Leuven - Departement of Electrical Engineering - Research Group on  Document Architectures
Kasteelpark Arenberg 10 - 3001 Leuven-Heverlee - BELGIUM
tel: +32 16 32 85 51 
http://www.docarch.be/ 

Received on Thursday, 5 May 2005 13:58:44 UTC