Comments on Last Call Working Draft of Pronunciation Lexicon Specification (PLS) from BAGSHAW Paul RD-TECH-REN on 2006-02-03 (www-voice@w3.org from January to March 2006)

From: BAGSHAW Paul RD-TECH-REN <paul.bagshaw@francetelecom.com>
Date: Fri, 3 Feb 2006 17:53:33 +0100
To: <www-voice@w3.org>
Message-ID: <941BA0BF46DB8F4983FF7C8AFE800BC203B98F66@ftrdmel3.rd.francetelecom.fr>
Comments made below refer to the 31 January 2006 publication of the PLS last call working draft:

http://www.w3.org/TR/2006/WD-pronunciation-lexicon-20060131/

 

They are presented here for your consideration and I hope that they initiate constructive discussion as we attempt to address the issues exposed.

 

Paul Bagshaw

France Telecom R&D

 

 

 

1. The homograph (heterophone) problem.

 

PLS 1.0 aims to address only the most important aspects of the requirements document (http://www.w3.org/TR/lexicon-reqs/).

 

* Section 4.9.2 of the LCWD stipulates:

 

If more than one <lexeme> contains the same <grapheme>, all their pronunciations will be collected in document order and a TTS processor must use the first one in document order that has the prefer attribute set to "true". If none of the pronunciations has prefer set to "true", the TTS processor must use the first one in document order.
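
As a rough illustration of that rule, here is a minimal Python sketch of how I read section 4.9.2. The function name select_pronunciation and the sample lexicon are hypothetical, and the phoneme strings are the same inexact placeholders used in the example further down; this is not taken from the specification or from any implementation.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/2005/01/pronunciation-lexicon}"

# Hypothetical lexicon: two <lexeme> entries share the grapheme "does".
# The phoneme strings are placeholders, not exact IPA.
LEXICON = """
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>does</grapheme>
    <phoneme>dez</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>does</grapheme>
    <phoneme prefer="true">dowz</phoneme>
  </lexeme>
</lexicon>
"""


def select_pronunciation(root, word):
    """Apply the section 4.9.2 rule; return None if `word` is not listed."""
    candidates = []
    for lexeme in root.iter(NS + "lexeme"):
        graphemes = [(g.text or "").strip() for g in lexeme.findall(NS + "grapheme")]
        if word in graphemes:
            # Collect <phoneme> and <alias> children in document order.
            candidates += [c for c in lexeme if c.tag in (NS + "phoneme", NS + "alias")]
    # First pronunciation with prefer="true", otherwise the first one overall.
    preferred = [c for c in candidates if c.get("prefer") == "true"]
    chosen = preferred or candidates
    return chosen[0].text if chosen else None


root = ET.fromstring(LEXICON)
print(select_pronunciation(root, "does"))
```

The only key the caller can supply is the word itself, so whichever pronunciation the rule selects, the other one cannot be reached.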

 

Requirement 4.2 classes the handling of homophones (heterographs) as MUST HAVE (for ASR); in contrast, requirement 4.4, on handling homographs (heterophones), is classed only as NICE TO HAVE (for TTS) and has thus not been treated as essential in the LCWD. It’s a shame that handling homographs is not also classed as MUST HAVE. In its current state, PLS simply won’t be used for TTS applications in which homographs can occur. Many, if not all, applications in many languages depend on homograph disambiguation. An application MUST HAVE a means of unambiguously indexing every pronunciation in the lexicon. That is not possible in the current version of the PLS proposal.

 

It must be possible to associate some additional information (other than the lexeme orthography, <grapheme>) with each pronunciation.

 

For example, in a simple case, associating a grammatical category with a particular pronunciation in a lexeme is sufficient to distinguish ‘does’ (verb, to do) from ‘does’ (noun, plural of doe). Consider the more complex case of reading an address book full of proper nouns (place and people names) in which the pronunciation of a person’s name depends upon the area from which they come, within the same country and the same language. (Yes, it happens, at least in French, where final consonants may be pronounced in names originating from the west and south of France but not elsewhere in the country.) The application may know the origin of the request for information and instruct the TTS to reply with the corresponding pronunciation. Note that this second example is independent of part-of-speech tags (or grammatical categories) and sentence semantics.

 

The nature of the additional information is open-ended and subject to (too) much discussion (semantics, part-of-speech tags) since there is no standard representation (there’s no universal set of multilingual grammatical categories, for example, and there never will be since there is no universal grammar). The information required can also be application-dependent (as illustrated above).

 

Proposition 1: add an interpret-as attribute to the <phoneme> and <alias> elements.

 

The problem of having multiple interpretations for a given orthography is also addressed by the SSML <say-as> element. The proposition here is therefore to add the ‘interpret-as’ attribute with the same values as those in the SSML <say-as> element. <say-as interpret-as="noun">does</say-as> could thus be used to index the lexeme in:

 

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>does</grapheme>
    <phoneme interpret-as="verb">dez</phoneme>
    <example>He does not like it.</example>
    <phoneme interpret-as="noun">dowz</phoneme>
    <example>The does hide behind the trees.</example>
  </lexeme>
</lexicon>

 

(sorry if the IPA phonemes are inexact)

 

The value of the ‘interpret-as’ attribute in the PLS element must exactly match that of the SSML <say-as> ‘interpret-as’ attribute when it is to be rendered by a TTS system.
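
To make the intended matching concrete, here is a minimal Python sketch of a lookup that uses the proposed attribute. Everything in it is hypothetical: interpret-as is not part of the LCWD, the function name lookup is invented for illustration, and falling back to the ordinary section 4.9.2 rule when no value matches is my own assumption, since the proposition does not say what should happen in that case.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/2005/01/pronunciation-lexicon}"

# Hypothetical lexicon using the PROPOSED interpret-as attribute.
# The phoneme strings are placeholders, not exact IPA.
LEXICON = """
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>does</grapheme>
    <phoneme interpret-as="verb">dez</phoneme>
    <phoneme interpret-as="noun">dowz</phoneme>
  </lexeme>
</lexicon>
"""


def lookup(root, word, interpret_as=None):
    """Return the pronunciation of `word`, narrowed by an exact interpret-as match."""
    candidates = []
    for lexeme in root.iter(NS + "lexeme"):
        graphemes = [(g.text or "").strip() for g in lexeme.findall(NS + "grapheme")]
        if word in graphemes:
            candidates += [c for c in lexeme if c.tag in (NS + "phoneme", NS + "alias")]
    if interpret_as is not None:
        # Exact, case-sensitive comparison with the SSML <say-as> value.
        matching = [c for c in candidates if c.get("interpret-as") == interpret_as]
        # Assumption: if nothing matches, fall back to all candidates.
        candidates = matching or candidates
    # Ordinary rule: first prefer="true", otherwise first in document order.
    preferred = [c for c in candidates if c.get("prefer") == "true"]
    chosen = preferred or candidates
    return chosen[0].text if chosen else None


root = ET.fromstring(LEXICON)
print(lookup(root, "does", "noun"))
print(lookup(root, "does", "verb"))
```

With the value passed down from <say-as>, both pronunciations of ‘does’ become reachable, which is not the case when the word alone is the key.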

 

The secondary consequences of this proposition are: (1) the editor of the SSML and PLS files controls the content of the interpret-as values, and (2) any future standardisation of SSML interpret-as values can be tied in with PLS.

 

There is an analogy to this proposed attribute in the <grapheme> element; the ‘orthography’ attribute associates additional information with the <grapheme> content.

 

2. The homophone (heterograph) problem.

 

* Section 5.4 of the requirements document refers to “pronunciation preference” and has been successfully accommodated in PLS by the ‘prefer’ attribute of the <phoneme> and <alias> elements. However, ASR currently has no means of indexing a unique orthography from a particular pronunciation. The following requirement is surprisingly absent:

The pronunciation lexicon markup must enable indication of which orthography is the preferred form for use by speech recognition where there are multiple orthographies for a lexicon entry. The pronunciation lexicon markup must define the default selection behaviour for the situations where there are multiple orthographies but no indicated preference.

If PLS is to be used equitably in ASR and TTS environments, then functionality available for grapheme to phoneme mapping should equally be made available for phoneme to grapheme mapping (and vice versa).

 

Proposition 2: add a prefer attribute to the <grapheme> element.

 

For example, spelling variations could thus be marked with a preference for dictation applications.

 

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme prefer="true">theater</grapheme>
    <grapheme>theatre</grapheme>
    <phoneme>'&#x03B8;&#x026A;&#x0259;t&#x0259;r</phoneme>
    <!-- IPA string is: "'θɪətər" -->
  </lexeme>
</lexicon>
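
For what it is worth, the phoneme to grapheme direction could be sketched in Python as follows. The function name preferred_spelling is hypothetical, prefer on <grapheme> is the attribute proposed here and not part of the LCWD, and the default of taking the first spelling in document order is my assumption, chosen to mirror the section 4.9.2 rule for pronunciations.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/2005/01/pronunciation-lexicon}"

# Hypothetical lexicon using the PROPOSED prefer attribute on <grapheme>.
# The preferred spelling is deliberately listed second.
LEXICON = """
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>theatre</grapheme>
    <grapheme prefer="true">theater</grapheme>
    <phoneme>'&#x03B8;&#x026A;&#x0259;t&#x0259;r</phoneme>
  </lexeme>
</lexicon>
"""


def preferred_spelling(root, pronunciation):
    """Return the spelling an ASR engine would output for `pronunciation`."""
    spellings = []
    for lexeme in root.iter(NS + "lexeme"):
        phonemes = [(p.text or "").strip() for p in lexeme.findall(NS + "phoneme")]
        if pronunciation in phonemes:
            # Collect <grapheme> elements in document order.
            spellings += lexeme.findall(NS + "grapheme")
    # First grapheme with prefer="true"; otherwise (assumed default) the first one.
    preferred = [g for g in spellings if g.get("prefer") == "true"]
    chosen = preferred or spellings
    return chosen[0].text if chosen else None


root = ET.fromstring(LEXICON)
# The escapes spell the IPA string from the example above.
print(preferred_spelling(root, "'\u03b8\u026a\u0259t\u0259r"))
```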

 

3. Specification ambiguity

 

* Section 4.4 of PLS stipulates:

The <lexeme> element contains one or more <grapheme> elements, one or more of either <phoneme> or <alias> elements, and zero or more <example> elements.

 

However, it appears to be possible to have BOTH <phoneme> AND <alias> elements in <lexeme>, as illustrated in example 4 and more clearly described in section 4.9.2:

. . . either by <phoneme> elements or <alias> elements or a combination of both . . .

 

The either/or of section 4.4 needs correction (Proposition 3: add “or a combination of both”).
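
The corrected reading can be stated as a small check. This Python sketch is only my interpretation of the content model with “or a combination of both” added; the function name lexeme_is_valid and the sample lexeme, including its placeholder phoneme string, are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/2005/01/pronunciation-lexicon}"

# Hypothetical lexeme mixing <phoneme> and <alias>; the phoneme text is a placeholder.
LEXEME = """
<lexeme xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
  <grapheme>W3C</grapheme>
  <phoneme>placeholder</phoneme>
  <alias>World Wide Web Consortium</alias>
</lexeme>
"""


def lexeme_is_valid(lexeme):
    """One or more <grapheme>, and one or more <phoneme>/<alias> in any combination."""
    tags = [child.tag for child in lexeme]
    n_graphemes = tags.count(NS + "grapheme")
    n_pronunciations = tags.count(NS + "phoneme") + tags.count(NS + "alias")
    # <example> elements are optional, so they are not counted.
    return n_graphemes >= 1 and n_pronunciations >= 1


print(lexeme_is_valid(ET.fromstring(LEXEME)))
```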

 

4. Terminology

 

A final, relatively minor comment concerns section 4.5:

 

A <grapheme> may optionally contain an orthography attribute which identifies the script code used for writing the orthography.

 

The term ‘orthography’ is used in two ways: as a glossary term and as an attribute name. Only the font distinguishes the two in the specification. Rewording the glossary term should be considered.

 

 

All comments on the above remarks are more than welcome.
Received on Friday, 3 February 2006 22:31:00 UTC