Re: How do we point into a web page [was minutes from 30 October 2000]

I like Al's sugggestion regarding how we represent documents we want 
accessibility statements to point to:

>Usually this winds up with some intermediate choice; you don't go all the way
>back to flat text but agree to a scheme that works if a certain repair
>strategy
>works, and wash your hands of the documents that don't clean up by the
>application of this repair method; they are just beyond what you can exchange
>references about under the convention so defined.


We've been talking about HTML and XHTML, but I think we also need to talk 
about representing

- style sheets
- javascript (well, ecmascipt).  There's some innocuous javascipt and some 
that can hurt accessibility.  We need to be able to point to what we're 
talking about.
- scripting in server side pages.  These are scripts that appear in pages 
that the server interprets and that the user never directly sees.  We may 
want to point to those scripts as well,  These scripts can be in various 
languages, e.g. visual basic, perl, php, etc.

Now, one way to point into documents is XPath and XPointer, which, as we 
discused last time, only works for XML.  So what to do?

If the document is already XHTML, we can point with XPointer and no further 
work.  However,
- there can be many different XPointers that point to the same place. E.g. 
you can point to the third image, or the image with src="foo.gif".  This is 
a slight implementation complication if we use two different tools and they 
use different representations, two Xpointer specs that look very different 
can point to the same thing.   But it's straightforward to match them up in 
principle so I'm not going to worry too much about this for now...

If the document is HTML, we can process it to create XHTML.  But there's 
another complication: even if the HTML is correct, there's more than one 
valid XHTML document into which it can be converted.
For example, in the following valid HTML

<P> <BR> <P>

The BR can be be inside or outside the scope of the preceding P

Could be transformed to

<p> </p> <br/> <p> </p>
or
<p> <br/> </p> <p> </p>

So there's no one to one correspondence of HTML to XHTML.  Similarly, a 
given XHTML can correspond to more than one HTML.  This leads to the question:

If we point to a location in an XHTML document that was derived from a 
known HTML document, does it uniquely correspond define the point in the 
HTML document?  Off hand I'd say yes, but given the many-to-many 
correspondence between HTML and XHTML, it's not obvious to me.  I'm 
especially concerned if the XHTML came from the HTML via heuristics.

This concern escalates if the HTML was not valid to begin with.

-----------
So getting back to Al's suggestion: we need some intermediate 
representation that applies to reasonably valid, if not completely valid, 
documents. Here's a couple of proposals.

The first proposal is "Chunk Markup Language" or CML.
Basically, we parse the input into a series of chunks that describes the 
code.  It is NOT an attempt to do a real parse tree.  Just a way to 
describe what's there; and a way that will remain valid even if there's 
some (though not all) errors in the original.

For example,
<A href=quib.html> <img src=foo.gif alt=bar> </A>

would be represented as
<tag name="A"
    <attr name="href" value="quib.html" />
</tag>
<tag name="img">
   <attr name="src"  value="foo.gif" />
   <attr name="alt"  value="bar" />
</tag>
<endtag name="A" />

Note that this applies to documents that messed up nesting, e.g. <B><I> 
</B> </I>  No need to apply Tidy type heuristics.  And there's a simple one 
to one correspondence between chunks in the XML and things in the 
original... even if the original is invalid.

It's straightforward to recover the original, at least without whitespace 
and other details. But those can be added. see below.
---------------
Another way would be to create XHTML, but to mark the closing tag that were 
added in the conversion.  This makes it straightforward to go back to the 
oiginal HTML.  e.g.

<P>  Hello <BR> World <P>

might be transformed into

<p>  Hello <br/>  <added/> </p> World <p>  <added/> </p>

(we can't wrap the closing tag in <added></p></added> because that would be 
invalid XML)

Now to recover the original HTML (at least without the whitespace and some 
other details) we just remove the closing tags marked by <added/> and 
change <tag/> to <tag>

This gets hairier if the original wasn't syntactically correct and you're 
e.g. switching around closing tags...

------
In either case,

If we want to recover the exact appearance of the original code, we can add 
further tags, e.g.

<whitespace value="sssstttssssn" />    <!-- s is space, t is tab, n is 
newline  -->

and further tags for upper lower case.

Awkward to put in markings for use of quotes in variable names though... 
the first method's better for that.

------

Now, which of these notations is more convenient for software that uses 
XPointer?
------
also there's the whole other issue of how we talk about documents that are 
not HTML, e.g. css style sheets, or parts of pges like javascipt.  Xpointer 
can deal with those as character strings, but it would be good to do 
something higher level...

Anyway that's my thoughts so far...

--
Leonard R. Kasday, Ph.D.
Institute on Disabilities/UAP and Dept. of Electrical Engineering at Temple 
University
(215) 204-2247 (voice)                 (800) 750-7428 (TTY)
http://astro.temple.edu/~kasday         mailto:kasday@acm.org

Chair, W3C Web Accessibility Initiative Evaluation and Repair Tools Group
http://www.w3.org/WAI/ER/IG/

The WAVE web page accessibility evaluation assistant: 
http://www.temple.edu/inst_disabilities/piat/wave/

Received on Friday, 3 November 2000 15:20:34 UTC