markup in non-PLS page source affecting PLS lexeme

Section 4.5 of the Pronunciation Lexicon Specification (PLS) prohibits child elements from any namespace in the content of a grapheme element. I don't know the purpose of the ban, but it creates a problem.
A given word may have multiple pronunciations within a single page, and a website author may not trust a visitor's TTS to parse the page's syntax correctly. For instance, "read" may be pronounced /reed/ or /red/. Given how many of us don't rely on grammar checkers such as Microsoft Word's, and given that TTS software comes from many vendors and in many versions, we might reasonably distrust any TTS to apply correct (or any) syntax rules, especially dialectal rules, to choose the proper pronunciation of "read" for each context on one page. The pronunciations might therefore have to be distinguished by context.
One way to apply PLS is to distinguish between contexts by writing a different grapheme element for each. However, the longer the grapheme string, the more likely the string is to include non-PLS inline markup, such as an HTML span element, where the element's tags are not rendered but its content is. If some TTS systems take the page source directly as input while others take the visual rendering (perhaps by extracting it from the page source through interpretation), we should consider how to write the *.pls file's lexeme element to be compatible with both.
When the only way to distinguish one string from another for grapheme elements requires including an inline element, the ban prevents writing a grapheme element for that string and therefore prevents supplying a pronunciation for it. If the ban is not needed, we should add the capability to cope with inline markup.
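To illustrate the mismatch between page source and rendering, here is a minimal Python sketch. The sample sentence, the class name "term", and the helper names TextOnly and rendered_text are my own assumptions for illustration. Plain substring matching stands in for whatever matching a real TTS performs, which may differ.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects character data and drops tags, roughly what a renderer exposes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def rendered_text(source: str) -> str:
    """Return the text of an HTML fragment with all tags removed."""
    parser = TextOnly()
    parser.feed(source)
    parser.close()
    return "".join(parser.parts)


# Hypothetical page fragment: the span is inline markup inside the phrase.
page_source = 'I have <span class="term">read</span> it.'

# A grapheme long enough to pin down the /red/ reading by context.
grapheme = "have read it"

matches_source = grapheme in page_source                   # False
matches_rendered = grapheme in rendered_text(page_source)  # True
```

A TTS that matches against the rendering finds the grapheme, while one that matches against the raw source does not, and under the current rule the grapheme cannot be written with the span included.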
I propose replacing "must not" with "may" in this sentence: "The <grapheme> element must not contain 'element' child information items from any namespace, i.e. PLS or foreign namespace." The result would be this: "The <grapheme> element may contain 'element' child information items from any namespace, i.e. PLS or foreign namespace."
I propose explicitly acknowledging the effect of this in the specification, for the sake of making the PLS behavioral change visible. Add: "Any non-PLS element in a grapheme element may or may not alter how PLS interprets content in the grapheme element."
I propose adding the following:
--- The nontag content of a grapheme element includes character child information items, diacritics, accents, nonprinting characters, characters with no width, and white space but does not include a tag of an element regardless of the namespace from which the element comes. Nontag content in the grapheme element includes the nontag content in a child element of the grapheme element regardless of whether the child element is an immediate child element or not.
--- If a tag of an element from any namespace is to be rendered as part of nontag content, the tag must be indirectly represented, such as with the HTML character entity defined for each angle bracket that begins or ends such a tag.
--- If a child element in a grapheme element, in accordance with a specification specifying such child element, could alter how PLS interprets nontag content in the grapheme element and if how such nontag content would be so altered is known, such alteration must be applied in interpreting such nontag content.
--- If a child element in a grapheme element, in accordance with a specification specifying such child element, could not alter how PLS interprets nontag content in the grapheme element, an alteration due to such child element must not be applied in interpreting such nontag content.
--- If how a child element in a grapheme element could alter how PLS interprets nontag content in the grapheme element is unknown, even by reference to a specification specifying such child element, such an alteration must not be applied in interpreting such nontag content.
--- Interpretation of nontag content shall be applied in the order starting with the child elements at the greatest level of descent and then with the child elements at each higher level of descent and ending with the nontag content that is not in any child element.
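As a rough sketch of how the proposed definition of nontag content could be computed, the following Python uses the standard library's xml.etree.ElementTree. It covers only the first two items above (text gathered from descendants at any depth, and tags written with character entities). It does not attempt the interpretation rules, since what a child element "alters" depends on that element's own specification. The helper names, the sample graphemes, and the choice of XHTML as the foreign namespace are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XHTML_NS = "http://www.w3.org/1999/xhtml"  # assumed foreign namespace for the example


def nontag_content(grapheme: ET.Element) -> str:
    """Join all character data in the element and its descendants, in document order."""
    return "".join(grapheme.itertext())


def parse_grapheme(inner: str) -> ET.Element:
    """Hypothetical helper: wrap a fragment in a grapheme element and parse it."""
    xml = f'<grapheme xmlns="{PLS_NS}" xmlns:h="{XHTML_NS}">{inner}</grapheme>'
    return ET.fromstring(xml)


# Child elements at two levels of descent; only their text is kept.
nested = parse_grapheme('have <h:span class="term">re<h:b>a</h:b>d</h:span> it')

# A tag that is meant to be rendered is written with character entities.
escaped = parse_grapheme("the &lt;b&gt; tag")

nested_text = nontag_content(nested)    # "have read it"
escaped_text = nontag_content(escaped)  # "the <b> tag"
```

Under the current section 4.5 wording, the first grapheme would be nonconforming because it has element children. Under the proposed wording, its nontag content would be the same string a grapheme without markup would carry.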
-- 
Nick

Received on Monday, 5 November 2018 22:45:53 UTC