Re: How do we point into a web page [was minutes from 30 October 2000]

At 09:24 AM 2000-10-31 -0500, Leonard R. Kasday wrote:
>At 02:14 PM 10/30/00 -0500, Al Gilman wrote:
>> >I think you can actually write Xpointer / Xpath expressions to select
>> >bits of text (I think I saw that it could be used in XSLT). But you
>> >should ask an expert.
>
>I think ability to point into text is critical for additional reasons, 
>besides things like language, misspellings, and missing tags in the text.
>
>- an accessibility assertion could apply to a part of CSS or Javascript 
>inside a web page.  These are typically (and preferably) inside comments...
>
>so we need to point to text inside comments!  Can we do this with Xpointer?
>
>- If in the future we deal with server-side scripting that follows e.g. asp 
>or php files, and we want to test the asp or php file before it gets 
>processed into HTML by the server, we are again pointing inside a 
>programming language whose parse tree would not be exposed to XPATH
>constructs.
>
>- what if the original page is illegal HTML?  

This is the crux of the matter.  The community of interoperating tools [as
resolved in last telecon] needs to have an agreement or convention as to what
to do, here.

Using 'tidy' to define the canonicalization that you will employ for
references into imperfect HTML is in a sense a hack; but it gets you a normal
form that you can then manipulate without fear for those pages it copes with;
and it copes with a lot more pages than those that will pass validation before
processing.  But 'tidy' is not the generic issue; the issue is how bad the
HTML can be while you still try to understand one another in discussing it.

>  How can we point to the 
>illegal bits in a tidyfied version?  Seems to me we'd have to cast the 
>whole page, or at least a portion of the page, into CDATA to talk about it.

No; that is not necessary.

>
>These are leaning me in the direction of just considering the page one big 
>text string against which we make XML or RDF statements.   If we do that, 
>then whether we point by interspersing the comments in the string or 
>pointing into the string is largely an implementation detail, it seems to 
>me, since it would be straightforward to convert from one to another.

Yes; the way you exchange mutually understood pointers into something that is
not valid HTML is to drop back to some class of text string that you trust all
the documents you want to talk about conform to in fact, and use an indexing
scheme that works there.
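
To make the flat-text fallback concrete, here is a minimal sketch in Python.
It is only an illustration, not an agreed format: the TextPointer class, the
point_at helper and the sample page are all hypothetical names made up for
this example.  The pointer is a pair of character offsets plus the quoted
text, so that a tool holding a different copy of the page can at least detect
that the pointer no longer lands where it should.  Note that it reaches
inside an HTML comment without needing any parse tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextPointer:
    """Hypothetical pointer into a page treated as one big text string."""
    start: int  # offset of the first character pointed at
    end: int    # offset one past the last character pointed at
    quote: str  # the text expected at [start, end), used as a sanity check

    def resolve(self, page: str) -> str:
        """Return the pointed-at text, or fail if this copy of the page differs."""
        found = page[self.start:self.end]
        if found != self.quote:
            raise ValueError(f"expected {self.quote!r}, found {found!r}")
        return found


def point_at(page: str, snippet: str) -> TextPointer:
    """Build a pointer to the first occurrence of `snippet` in `page`."""
    start = page.index(snippet)
    return TextPointer(start, start + len(snippet), snippet)


# Assumed sample page: a style rule hidden inside a comment, unclosed <p>.
page = "<style><!-- p { color: red } --></style><p>Hi"
ptr = point_at(page, "color: red")
print(ptr.start, ptr.end, ptr.resolve(page))
```

The quoted text is redundant with the offsets on purpose; it is what lets two
tools notice that they are not talking about the same string.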

Usually this winds up with some intermediate choice; you don't go all the way
back to flat text but agree to a scheme that works if a certain repair
strategy works, and wash your hands of the documents that don't clean up by
the application of this repair method; they are just beyond what you can
exchange references about under the convention so defined.

There are i18n issues about character counting in text.  I would have to send
you back to the i18n people to get an explanation of what they are.  Other
pitfalls to character counting deal with the fact that character string changes
are permissible as a result of transport.  Communicating repair tools, having
recovered the same URI-reference, are not necessarily holding bit-for-bit
replicas in their local storage.  Not even character-for-character.  This is
why signature (q.v.) involves a canonicalization transformation, IIRC.
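
A small Python sketch of the counting pitfall, under stated assumptions: the
two "copies" below are invented, and the canonicalize function is a
hypothetical canonical form (LF line endings plus Unicode NFC), not the one
any specification prescribes.  It shows two copies of the same text that
differ in line endings and in how an accented letter is encoded, so that the
same word sits at different character offsets until both copies are put
through the same canonicalization.

```python
import unicodedata


def canonicalize(text: str) -> str:
    """Hypothetical canonical form: LF line endings, then Unicode NFC."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return unicodedata.normalize("NFC", text)


# Two copies of "the same" text as two tools might hold it after transport.
copy_a = "caf\u00e9\nmenu"        # precomposed e-acute, LF line ending
copy_b = "cafe\u0301\r\nmenu"     # e + combining acute accent, CRLF line ending

# Raw character offsets of "menu" disagree between the copies.
print(copy_a.index("menu"), copy_b.index("menu"))  # 5 7

# After canonicalization the copies are identical and the offsets agree.
canon_a, canon_b = canonicalize(copy_a), canonicalize(copy_b)
print(canon_a == canon_b, canon_a.index("menu"), canon_b.index("menu"))
```

Whatever canonical form is chosen, the point is that offsets only mean
something relative to it, so the form has to be part of the convention.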

One point is that the XML Working Group is firmly committed to the idea that
something that is not well-formed XML is pure garbage and they will not define
processing methods that step into that void.  It is not necessarily practical
for a community of HTML repair tools to be quite so strictly orthodox in what
they will eat.  So what I am saying is that you may wish to re-engineer
XPointer just a bit to add a bit of repair or robustness to unorthodoxy
somehow.  You have a task requirement that invites a more permissive standard
than what the XML community would agree to define.  But the mission statement
for XPath and/or XPointer (oops, I should have been saying both) is very
close to the same problem as what you need.

Part of the problem is my bias regarding the implementation of software in this
area.  I tend to be biased in the direction that this group should be coming up
with rapid-prototype reference implementations of techniques that we can then
sell off into the format specifications because we have working models.  The
commercial implementers of the formats can come up with the efficient
implementations.

So we don't make 'tidy' the definition of our pointer scheme.  What we say is
that you can point into a repaired image of a document if you document the
repair with a standard diff.  Then 'tidy' can be replaced within the agreed
architecture; the standard is for the diff, not the repair.  This is costly in
compute cycles, but efficient in programming calendar months.  This lets us
build working systems with standard DIFF and standard XML tools and
libraries.  And capitalize on all the work Raggett has poured into 'tidy'
over the years.
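
As a rough sketch of the diff idea, in Python: a pointer is expressed against
the repaired image, and the recorded diff is what carries it back to the
original.  Everything here is an assumption for illustration.  Python's
difflib stands in for "standard DIFF", the repaired string is written by hand
rather than produced by 'tidy', and map_offset is a hypothetical helper name.

```python
from difflib import SequenceMatcher
from typing import Optional


def map_offset(repaired: str, original: str, offset: int) -> Optional[int]:
    """Map a character offset in `repaired` back to `original`.

    Returns None when the character at `offset` was introduced or changed by
    the repair and so has no counterpart in the original.
    """
    matcher = SequenceMatcher(None, repaired, original, autojunk=False)
    # Each opcode relates repaired[r1:r2] to original[o1:o2].
    for tag, r1, r2, o1, o2 in matcher.get_opcodes():
        if r1 <= offset < r2:
            if tag == "equal":
                return o1 + (offset - r1)
            return None
    return None


# Hand-written stand-ins for an invalid page and its repaired image.
original = "<p>Hello <b>world"
repaired = "<p>Hello <b>world</b></p>"

in_repaired = repaired.index("world")
in_original = map_offset(repaired, original, in_repaired)
print(in_repaired, in_original)

# A pointer at the end tag the repair added has nothing to map back to.
print(map_offset(repaired, original, repaired.index("</b>")))
```

The repair tool is interchangeable in this arrangement; only the diff record
between the two texts has to be understood by both parties.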

>
>Please take the above (possibly inflammatory) statements as just MHO for now.

Temperate and amiable as ever, Lenny.  And well framed, to boot.

Al

>
>Len
>--
>Leonard R. Kasday, Ph.D.
>Institute on Disabilities/UAP and Dept. of Electrical Engineering at Temple 
>University
>(215) 204-2247 (voice)                 (800) 750-7428 (TTY)
>http://astro.temple.edu/~kasday        mailto:kasday@acm.org
>
>Chair, W3C Web Accessibility Initiative Evaluation and Repair Tools Group
>http://www.w3.org/WAI/ER/IG/
>
>The WAVE web page accessibility evaluation assistant: 
>http://www.temple.edu/inst_disabilities/piat/wave/
>  

Received on Tuesday, 31 October 2000 10:29:00 UTC