RE: how are orthographic equivalents for tokens documented? from Al Gilman on 2001-09-08 (www-voice@w3.org from July to September 2001)

From: Al Gilman <asgilman@iamdigex.net>
Date: Fri, 07 Sep 2001 23:21:05 -0400
To: <andrew.hunt@speechworks.com>, <www-voice@w3.org>
Message-Id: <200109080258.WAA9474090@smtp2.mail.iamworld.net>
At 07:30 PM 2001-09-03 , Andrew Hunt wrote:
>Al,
>
>The WG has visited pronunciation and lexicon issues several times 
>in the context of preparing the grammar spec.  The following is my 
>best effort at a summary of the overall group position:
>
>1. Version 1.0 of the Grammar Spec will not permit within-grammar
>   specification of pronunciations

AG::

Well first off, this is very good to hear.

I had it the other way around.  The way I read the grammar specification, it
_only_ included "sounds like" tokens as terminals, without regard for
dictionary spellings.

If your concept of grammar definition is that the token terminals are words
spelled as they would be found in the dictionary, and how each is expected to
sound is clarified in the optional referenced lexicon, then that is much
better.

>2. The most recent draft (August 20) adds a <lexicon> element that
>   permits reference to an external pronunciation lexicon
>3. Review of RUBY found that it was not suited to the needs of the
>   grammar pronunciation problem.  I cannot comment on the details
>   as other WG members did the review
>4. Pronunciation lookup should be outside the scope of the Semantic
>   Interpretation spec.  Put differently, the grammar spec should be
>   complete and stand-alone on the topic of pronunciations

AG::

As may be clear by now, our concern is with whether the definition of the
input to which the system responds is captured in a form [orthodox spelling]
usable with visual or tactile media (such as Braille).  I believe that
standard spelling is "semantic enough" for the desired "repurposability" in
this respect.

>5. The specification, as currently drafted, is sufficient to produce
>   interoperable implementations for a majority of use cases (but
>   certainly not all)
>
>FYI, the Voice Browser WG has begun work on a Pronunciation Lexicon
>specification with a Requirements document being released earlier 
>this year.  I do not anticipate a first working draft of that before
>the end of this year.
>
>> It is not clear that WAI can approve this document going to PR until that
>> mechanism is defined and it is clear it works.
>
>Could you please expand upon this comment so that we can have a 
>sense for the benchmark that the spec will be reviewed against.

Let me offer a bit of disclaimer first.  The WAI is a very shirtsleeve
operation, at least as you are encountering it here.  We have some formal
Recommendations out and some in the pipe, but I would have to say there is
nothing of record that would qualify as "The benchmark against which the
specification will be measured."  So rather let me try to frame the following
as "the performance axis along which the specification will be measured"
rather than "the threshold of acceptability it must meet."  I hope this makes
some sense.

The rough idea is that the model from which the voice-accepting dialog is
generated should contain sufficient information so that a dialog using
other-than-speech input [screen forms, command line text, ...] could be
derived [by a program, not a programmer -- but it could be a large and
compute-intensive program] as an alternate view or access mode of the service
for individuals with no vocal capability or a voice that the recognizer won't
recognize.

If the standard way of tooling up a voice dialog in W3C technology does not
create a repurposable resource base under it, then this could be considered a
serious problem.

Voice recognition is regarded as a user interface device, in the sense of
striving for access to enough repurposability or alternate media morphs to
avoid single-point failure modes induced by dependency on a specific device. 
Some people can't use a mouse, some can't use a visual display, some can't use
a voice recognizer.  Can we make the W3C technologies so that they don't shut
any of these people out?  That's what we are trying to do.

For Deaf customers, their speech may or may not be recognizable, but the voice
prompts are not usable.  Is the logical content there to deliver the same
service over text telephone, with an auto-switch at the call control level to
an alternate port serving the text-telephone dialog flavor of the service?

So what we are shooting for is that any device-specific realization of a
service-delivery channel be backed with more robust resources capable of being
the base for delivering equivalent service in other interaction modes.

This is a desideratum, untried in the court of a 'to the mat' struggle at the
Director's desk.  Partly we have some inclination to listen to reasonable
arguments as to what is readily achievable.  And partly we are all discovering
how to do it as we do it.

One is always in the awkward position of hindsight.  One of the things we find
ourselves doing is trying to accelerate the adoption of a device-independent
perspective on all sides, even as the more device-specific technologies are,
as we speak, just being cooked.  And it's not clear what the
device-independence architecture is until there are separate voice browsing
and XForms proposals on the table to review.

But as I noted above, I had read the grammar to say the terminal tokens were
by definition phonetic and of uncertain relationship to anything with a
recognizable relationship to dictionary meaning or Braille transcription.  If
they are standard text and the linkage to phonetic hints to the recognizer is
more distant or dubious, then the issue is not of the magnitude that I
thought when I wrote before.

Al

>
>Regards,
>  Andrew Hunt
>
>> -----Original Message-----
>> From: www-voice-request@w3.org
>> [mailto:www-voice-request@w3.org]On
>> Behalf Of Al Gilman
>> Sent: Wednesday, August 29, 2001 4:47 PM
>> To: www-voice@w3.org
>> Subject: how are orthographic equivalents for tokens documented?
>> 
>> 
>> 
>> For speech recognition, one may wish to use as a token a phonetic
>> equivalent of a word.  This may be done by phonetic spelling in the usual
>> writing system of the current language, or by external-production-reference
>> to a speech synthesis grammar production for a more precise phonetic form,
>> if I understand the architecture.
>> 
>> In any case, it is important to unambiguously capture the relationship to
>> the standard spelling of a word, if the token represents a word.  "The
>> standard spelling" is meant to indicate "as you should look it up in a
>> dictionary."
>> 
>> There is an XML precedent in RUBY that could be followed for providing this
>> capability.  
>> 
>> Is this planned to be covered in the 'semantic processing' volume?  This is
>> an important function.   Please clarify.
>> 
>> Al
>> 
>> ["personal opinion" disclaimer]
>> 
>>   
>> 
>> It is not clear that WAI can approve this document going to PR until that
>> mechanism is defined and it is clear it works.
>> 
>
Received on Friday, 7 September 2001 22:58:17 UTC