Re: SALT vs VoiceXML

[Disclosure and disclaimer.  This is from a discussion on data formats for
defining multimedia interactive experiences on the WAI-PF list.  There may
be inessential references to Member-confidential information in the prior
conversation, but it is believed that this post is free of taint of
confidentiality.  Please address To: <asgilman@iamdigex.net> Cc:
<pjenkins@us.ibm.com> if you wish to clarify that point.  Al

Quotable in its entirety with attribution (use www-archive URL), in public.
This message is claimed by its author to be free from confidential
information.
However, nothing in this message shall be construed to represent the
position of the W3C, the WAI, or the PF working group.  The opinions
expressed are those of the author and all errors and injustices are his
doing.]

There are actually two questions raised by Phill's good post.

a) What standards should be applied in reviewing VoiceXML (or the Voice
Browsing Profile with VoiceXML in place within it)?

b) How do our expectations for VoiceXML and SALT compare and contrast?

My response is, roughly:

a)  The question "what scenarios is Voice Browsing technology required to
support?" cannot be answered definitively at the outset.

We need to separately evaluate multiple questions:

- What [related] scenarios are beneficial for people with disabilities?
- What service-equivalence-classes are required in similar pre-existing
scenarios?
- What beneficial scenarios could this technology possibly support with
tweaking?
- How readily achievable is the required extension or modification to
accommodate that scenario?

We need to make progress on all these sub-questions concurrently; we can't
expect a definitive answer to any before working on the rest.

b)  We have to look at the nominal usage scenario of each and identify
reasonably nearby usage scenarios that matter for people with disabilities.
 We don't have the unified theory of universal interfaces or content to
enable deriving the true requirements for each of these from a common root.
 VoiceXML presumes an audio-only service-delivery context; SALT presumes
that audio plays a minority or supporting role in conjunction with a lot of
other display, notably a large visual screen, and probably full-size
keyboard input.

Details follow.

At 03:18 PM 2001-11-02, Phill Jenkins wrote:
>
>The question I ask the p f working group is: "What is our objective in
>reviewing VoiceXML 2.0?
>- is it to review the spec and insure a user agent could be developed so a
>deaf or hard of hearing person could interact with application?
>- is it to review the spec to insure that a user agent (assistive
>technology) could be developed for a person with a mobility impairment
>(limited hand use) so that he would be able to interact with the
>application?
>- is it to make VoiceXML more multimodal so various input and output
>techniques can be used in a desktop environment?
>- or what?
>
>Before I begin the review, I think the VoiceXML group should agree with (or
>at least know) our objective.
>

AG:  Very good questions.

Charles pointed out the "at least there should be a non-final-form option"
clause that we agreed on with XSL FO.  I don't disagree with this as an "at
least."  But I don't think that our review should in principle be limited
to this.


Let me first try some line by line responses to the above questions.

>
>The question I ask the p f working group is: "What is our objective in
>reviewing VoiceXML 2.0?


>- is it to review the spec and insure a user agent could be developed so a
>deaf or hard of hearing person could interact with application?

AG:: Not just that a user agent could be developed.

The policy that we should point out as prevalent, if not universal, is: "if
you offer a service over the Public Switched Telephone Network (PSTN), then
you SHOULD provide an equivalent service that is usable by text telephone
(TTY) [wherever readily achievable?]."  In some domains, that is to say in
some political jurisdictions and with regard to some services, I believe
this SHOULD is raised to a legal MUST.

Just what the standards should be in this area is a matter of current
debate.  See the IVR Forum, for example, for an industry view, I believe:

http://www.atis.org/atis/ivr/documents.htm

One of the facts of life is that Voice Browsing technology competes with
technology that addresses just the IVR market.  We cannot assume that
society at large will, legally or otherwise, require Voice Browsing
applications that are used functionally like IVR systems to comply with
access policies that are radically different from the policies that IVR
systems are required to meet.  We don't want our demands on Voice Browsing
technology to put a ball and chain on it that makes it lose in the
marketplace.  On the other hand, we don't want it to fail to be at least
competitive in cross-modal compliance, either.

My scenario for TTY compliance is that the telephone gateway detects TTY on
the main number, or there is a TTY-listed number parallel to the main number
for the Voice Browsing server's connection to the PSTN, hopefully migrating
to autodetection over time.  For entry of phone numbers and similar codes,
key entry is usually more usable than voice recognition, so the Voice
Browsing grammar contains provisions, but not requirements, for a DTMF mode.
In either case, the Voice Browsing system has call control capabilities with
which it can auto-forward the call to a server, which may be hosted by a
specialty TTY option house, that implements the TTY-compatible parallel
version of the Voice Browsing service.
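
As a concrete sketch of the hooks I have in mind -- the grammar file names,
the destination number, and the platform flag are all hypothetical, and this
is not lifted from the spec -- the VoiceXML 2.0 markup might look roughly
like this:

  <?xml version="1.0" encoding="UTF-8"?>
  <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
    <form id="account">
      <!-- hypothetical flag; in practice the gateway or platform would set
           this when TTY is detected or the TTY-listed number was dialed -->
      <var name="ttydetected" expr="false"/>
      <!-- call control: blind-forward TTY callers to the parallel,
           TTY-compatible service hosted elsewhere -->
      <transfer name="ttyxfer" cond="ttydetected"
                dest="tel:+15555550123" bridge="false"/>
      <!-- key entry provided for, but not required: parallel voice and DTMF
           grammars on the same field -->
      <field name="acctnum">
        <prompt>Say or key in your ten digit account number.</prompt>
        <grammar mode="voice" src="acctnum-voice.grxml"/>
        <grammar mode="dtmf" src="acctnum-dtmf.grxml"/>
      </field>
    </form>
  </vxml>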

What we should consider suggesting as the policy ancestor or prior
requirement is that the Voice Browsing profile, that is to say the bundle of
specifications, a) at minimum takes all readily achievable steps to make
this parallel service delivery chain trivial to implement, or optionally b)
builds the transform to the parallel service into its test suite or even
into the specification.  How much should be built in is reasonably a matter
of discussion between them and us.  Maybe we should not take a policy
position, maybe we should.  But at the least we should clearly articulate
the evidence that we feel they should accept as objective: the layout of the
failure-mode or success-mode scenarios, and why we believe from parallel
examples that a given scenario will succeed or fail.

[ASIDE:  The interface to the policy market -- the rules under which we
allocate MUST, SHOULD, and MAY choices to clauses -- is a matter of
irresolution across the WAI.  At least I claim there is no clear and agreed
metapolicy.]


In any case, it should be part of our job to check that there is nothing
stupid in the specification that keeps it from being readily achievable to
provide a TTY-compatible service parallel to any voice-in, voice-out service
communicated over the PSTN.  The media and the dialog management are damn
near identical.  You just have to have text for all recorded audio that is
part of the dialog and understand a few TTY rules of procedure, I suspect.
The rest is just to follow the emerging standard for tiers of service.  The
service that talks to a TTY or a voice-grade circuit connection is a front
tier, and there is a backoffice WebService which represents the business
rules and data of the transactions.  The TTY serving house and the voice
serving house both talk with the backoffice server over the same SOAP-borne
WebService.
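
To make the "same WebService" point concrete, here is a minimal sketch of
the sort of SOAP 1.1 request either front tier -- the TTY serving house or
the voice serving house -- would post to the backoffice server.  The host,
operation name, namespace, and order number are invented for illustration;
the point is that the payload is modality-neutral and only the front-tier
presentation differs:

  POST /backoffice HTTP/1.1
  Host: backoffice.example.com
  Content-Type: text/xml; charset="utf-8"
  SOAPAction: "urn:example:backoffice#CheckOrderStatus"

  <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
    <soap:Body>
      <!-- identical whether the caller reached us by TTY or by voice -->
      <CheckOrderStatus xmlns="urn:example:backoffice">
        <orderId>12345</orderId>
      </CheckOrderStatus>
    </soap:Body>
  </soap:Envelope>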


>- is it to review the spec to insure that a user agent (assistive
>technology) could be developed for a person with a mobility impairment
>(limited hand use) so that he would be able to interact with the
>application?

AG::  Here my gut reaction is a 'readily achievable' test.

This is compatibility with AT.  Yes, absolutely voice technology is very
important in building AT for people who have motor impairments and usable
voice.

On the other hand, we need to look at the scenarios for what sort of
technology one would actually use, here.  This sounds more like a candidate
for the SALT technology than the VoiceXML technology.

In fact, desktop access to the backoffice interface of the Voice Browsing
service is of very real interest for people with visual, not so much motor,
disabilities.

Not that the access should not be available for all if it is available for
one.  There are some exceptions to that, but our first assumption is that
all modes are available to all.

But the person with a motor impairment wants to take maximal advantage of
everything they have left.  A well-presented visual/voice interface will be
much more efficient than a pure audio dialog, and this is a high-priority
performance factor for them.

In the "Escaped Web" piece (Google for that phrase) I argue that
single-source authoring will first be achieved across modalities which are
more similar than across modalities that are more diverse.  Not that our
ultimate goal is not to have all served information registered into a model
which is both scalable and laterally mobile across display, command, and
cognitive modalities.

[Aside:  I can't give you an exhaustive list of cognitive modalities.  But
we can treat it as a given that "both show and tell" describes a diversity
of cognitive modality that is well known to be frequently beneficial in
upping the universality of the usability of an interface or of an
information transaction such as a document.]

>- is it to make VoiceXML more multimodal so various input and output
>techniques can be used in a desktop environment?

There is a valid question on the floor, which is "to what extent should W3C
be minimizing its overall constraint set so that it is Device Independent
by construction?"


Certainly, the VoxML declarative dialog and recovery control features that
got left on the cutting-room floor are of interest in making the abstract
resource base more multimodal and device independent.  But we can't just
make non-negotiable claims here.  Dave and Raman and Nils have a point that
should be considered, but weighed.  The W3C manner of working, filling in
bricks before shaping the arch, doesn't always lead to good results when the
Cathedral is done.

The Cathedral vs. Bazaar debate is not a closed issue.  The more Cathedral
one thinks, the easier it is to deliver accessible-by-construction
technologies.  Roger Gimson [private communication] pointed out that
"Enterprise thinkers will naturally gravitate to single-source methods; UI
prototypers may not warm up to the idea so fast."

The point is, I have been through all this in CAD.  You have to give people
deconstruction tools, because very often what gets achieved first is one
prototype that you know works, but that is not engineered for robustness or
extensibility, and that only scratches the surface of the potential market
you could reach if you got out of the box and viewed the problem and
solution in the right light.


>- or what?
>
>Before I begin the review, I think the VoiceXML group should agree with (or
>at least know) our objective.
>

AG:: Here, unfortunately, there is no way to get our criteria communicated
to them at 'the right time,' which was when they were writing their
requirements document (which is usually written at too low a level anyway).
We are continually purifying our understanding of policy principles by
deconstructing our gut reactions to concrete scenarios, just as they are
creating a technological abstraction by fitting it to a range of scenarios.

At a minimum, we should ourselves try to segregate:

- Laws of mathematics and logic, or otherwise unavoidable constraints.
- Precedents in extant social policy, whether in the W3C charters or in UN
or governmental utterances.
- Summaries of human performance that are well known in the HCI and
usability engineering field.
- Conclusions as to design constraints on technology.

The example here comes from the following sequence:

Only the user knows for sure what works for them and what doesn't.
-- This can be established by mathematical analysis of the information
available to the author or software designer vs. that available to the user.

Even the user doesn't usually know for sure until they have tried [some of]
the options
-- This is consolidated HCI knowledge.  It is descriptive of the
demographic facts.

The above two 'principles' are applied as our argument for a conclusion, or
_derived design principle_, which is "author proposes, user disposes."
Service delivery chains should exhibit this protocol.  That is a matter of
judgement; the evidence that should be agreed to be objective is the prior
two points.

There are three policy tests in common use in disability access: "readily
achievable," "undue burden," and "reasonable accommodation."  The first is
used more with casual contacts with the public, and the last with ongoing
relationships, such as employment, with a known individual.  What is
considered "undue burden" is probably different in these two cases.


I have argued in the GL policy debates that "readily achievable" makes a
prima facie case for something to be required, but that "undue burden"
depends on looking at scenarios to determine the nature of the burden and
the price factors that should be applied to that burden in scenario context.
Here we have to include in the decision factors the relative ease or
difficulty of alternatives from both the user and the service offeror sides.

The ultimate in accessible-by-construction is not readily achievable until
we can give service designers a reference model that is a) scalable both in
part/whole scale and in generic/specific generality, and b) demonstrated
effective by multiple bindings to diverse concrete user interfaces.  We
don't have this yet, so under the "reasonableness cuts both ways" clause we
have to be negotiable on incrementalism.  But we should still be looking
for readily achievable further increments that belong in the spec and get
us closer to the "everything is device independent and therefore accessible
by dint of specification compliance" asymptotic goal.

>At first glance, SALT seems more about additional markup for mutimodal
>access (not necessarily accessibility), while VoiceXML seems more about
>controlling the phone interaction and conversation model.

AG::  Yes; the disability scenarios that are most clearly linked to SALT
are the reading-assist modes used by those with reading-related
disabilities, and the inverse, the provision of visual equivalents for
audible stimuli in the operation of computer systems.
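
For concreteness, here is the kind of thing the SALT announcement seems to
describe: speech annotations riding alongside an ordinary visual HTML form.
Take this as an illustrative sketch only -- the element names, namespace,
and grammar file are my guesses at the style of the markup, not quotations
from a published SALT specification:

  <html xmlns:salt="http://www.saltforum.org/2002/SALT">
    <body>
      <form>
        City: <input type="text" name="city" id="city"/>
        <input type="button" value="Speak"
               onclick="askCity.Start(); hearCity.Start();"/>
      </form>
      <!-- audio plays a supporting role next to the visual page -->
      <salt:prompt id="askCity">Which city?</salt:prompt>
      <salt:listen id="hearCity">
        <salt:grammar src="city.grxml"/>
        <!-- copy the recognized value back into the visual field -->
        <salt:bind targetelement="city" value="//city"/>
      </salt:listen>
    </body>
  </html>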

There are Universal Design tests to be applied to SALT, for sure; but we
are not necessarily at a level of maturity, in putting content and dialog
under one modeling umbrella, that would let us explain their requirements as
derived from a common set of reference rules.  We need to look more locally
into the
disability scenarios that are related to the most common use of these
technologies and try at least to make sure that the local variations
transform as gracefully as is readily achievable.

So the question "what scenarios is Voice Browsing technology required to
support?" cannot be answered definitively at the outset.

We need to separately evaluate multiple questions:

- What [related] scenarios are beneficial for people with disabilities?
- What service-equivalence-classes are required in similar pre-existing
scenarios?
- What beneficial scenarios could this technology possibly support with
tweaking?
- How readily achievable is the required extension or modification to
accommodate that scenario?

We need to make progress on all these sub-questions concurrently; we can't
expect a definitive answer to any before working on the rest.

Al

>
>A quote from the SALT announcement [1]:
>
         
>
         
>
         
>
         
>                               SALT is a lightweight set of XML elements
that       

>                               enhance existing markup languages with a
speech      
>                               interface. SALT will thus extend existing
markup     
>                               languages such as HTML, xHTML and XML.
Multimodal    
>                               access will enable users to interact with
an         
>                               application in a variety of ways: They will
be able  
>                               to input data using speech and/or a
keyboard,        
>                               keypad, mouse or stylus, and produce data
as         
>                               synthesized speech, audio, plain text,
motion video  
>                               and/or graphics. Each of these modes could
be used   
>                               independently or concurrently.
         
>
         
>
         
>
>
>
>A quote from the VoiceXML 2.0 tutorial [2]:
>
>VoiceXML isn't HTML. HTML was designed for visual Web pages and lacks the
>control over the user-application interaction that is needed for a
>speech-based interface. With speech you can only hear one thing at a time
>(kind of like looking at a newspaper with a times 10 magnifying glass).
>VoiceXML has been carefully designed to give authors full control over the
>spoken dialog between the user and the application. The application and
>user take it in turns to speak: the application prompts the user, and the
>user in turn responds.
>
>[1] SALT http://xml.coverpages.org/ni2001-10-24-a.html
>[2] VoiceXML 2.0 Turtorial http://www.w3.org/Voice/Guide/
>
>
>The question I ask the p f working group is: "What is our objective in
>reviewing VoiceXML 2.0?
>- is it to review the spec and insure a user agent could be developed so a
>deaf or hard of hearing person could interact with application?
>- is it to review the spec to insure that a user agent (assistive
>technology) could be developed for a person with a mobility impairment
>(limited hand use) so that he would be able to interact with the
>application?
>- is it to make VoiceXML more multimodal so various input and output
>techniques can be used in a desktop environment?
>- or what?
>
>Before I begin the review, I think the VoiceXML group should agree with (or
>at least know) our objective.
>
>Regards,
>Phill Jenkins
> 

Received on Sunday, 31 March 2002 11:01:57 UTC