RE: builtin grammars from Michael Bodell on 2011-10-20 (public-xg-htmlspeech@w3.org from October 2011)

From: Michael Bodell <mbodell@microsoft.com>
Date: Thu, 20 Oct 2011 08:40:37 +0000
To: Robert Brown <Robert.Brown@microsoft.com>, "Olli@pettay.fi" <Olli@pettay.fi>
CC: "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
Message-ID: <22CD592CCD76414085591204EB19F4E82394EE46@TK5EX14MBXC263.redmond.corp.microsoft.>
There is precedent for this kind of modification and parametrization of builtin types in the speech standards already, with VXML's builtin grammars (http://www.w3.org/TR/voicexml20/#dmlABuiltins).  Note URLs like:

builtin:dtmf/boolean?y=7;n=9
builtin:digits?minlength=3;maxlength=5
builtin:grammar/boolean?y=7
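
As a rough illustration of how such a URI splits into a builtin type and its parameters, here is a minimal Python sketch. The helper name parse_builtin_uri is hypothetical, and the parsing assumes only the simple `name=value` pairs separated by `;` seen in the three examples above; it is not a full implementation of the VoiceXML rules.

```python
def parse_builtin_uri(uri):
    """Split a 'builtin:' URI into (type_path, params).

    Hypothetical helper: assumes ';'-separated name=value pairs,
    as in the examples quoted in this message.
    """
    scheme, _, rest = uri.partition(":")
    if scheme != "builtin":
        raise ValueError("not a builtin URI: " + uri)
    path, _, query = rest.partition("?")
    params = {}
    for pair in query.split(";"):
        if not pair:
            continue  # no parameters, or a trailing ';'
        name, _, value = pair.partition("=")
        params[name] = value
    return path, params


for example in ("builtin:dtmf/boolean?y=7;n=9",
                "builtin:digits?minlength=3;maxlength=5",
                "builtin:grammar/boolean?y=7"):
    print(parse_builtin_uri(example))
```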

Now no doubt the HTML combinations are more extensive, but that's what you'd expect when you have http://www.w3.org/TR/html5/the-input-element.html complete with the giant table listing which content attributes go with which types.

To answer some of Olli's earlier questions:
>And how do they handle patterns.
>If pattern is "Hello", is that interpret as word 'Hello', or
>separate characters 'H', 'e' 'l', 'l','o' ?

If I were implementing a speech service I'd probably want to support both, assuming my text grammar supports both saying words and spelling them.

>Again, how would speech engines support this? What kind of input is
>expected?
>User saying 0.2 would be no-match but 0.00784313726 would be ok?

Well, in visual HTML, if I choose a range and have a step, I get a slider. To take a simpler example than the one I stole from HTML: if the range had min=0, max=1, step=0.5, then I'd expect that visually I'd have a slider with 3 places I could select.  But likely, if I visually dragged to 45% along the range, it might "snap" to 0.5.  That maps, in my mind, to recognizing a bunch of numbers that the user could say (like 0.2) but "snapping" to the nearest step in the semantics.  That is one of the advantages of having both the utterance (0.2) and the interpretation (0.196078432 in the example with a step of 0.0392156863, as that is the closest step to 0.2).
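
The "snapping" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names snap_to_step and interpret are hypothetical, ties are rounded upward, and 0.0392156863 is treated as the step value for the second call. A real recognizer would produce the utterance itself; here it is passed in as a string.

```python
import math


def snap_to_step(value, minimum, maximum, step):
    """Return the allowed value (minimum + n * step) nearest to `value`.

    Hypothetical helper; ties round upward, out-of-range input is clamped.
    """
    # Highest step index that still fits inside the range (small epsilon
    # guards against floating-point error in the division).
    last_index = math.floor((maximum - minimum) / step + 1e-9)
    index = math.floor((value - minimum) / step + 0.5)
    index = max(0, min(last_index, index))
    return minimum + index * step


def interpret(utterance, minimum, maximum, step):
    """Keep both what was said and the snapped semantic value."""
    return {
        "utterance": utterance,
        "interpretation": snap_to_step(float(utterance), minimum, maximum, step),
    }


print(interpret("0.45", 0, 1, 0.5))
print(interpret("0.2", 0, 1, 0.0392156863))
```

Returning both fields keeps the raw utterance available to the application even when the semantic value has been moved to a step boundary.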

I agree there is a question of how well each speech service will handle this collection of different attributes.  I imagine some will be handled quite well by everyone (patterns are relatively easy to turn into SRGS grammars, and certain "high value" regexps may be tuned by some platforms with a non-trivial transform), some will be handled well by some services and not by others, and some might not be handled by any.  I think that, while not ideal, that is OK, as it will be an evolving process, and different services will handle "normal" SRGS grammars and speech with different qualities too.  I don't believe it is fundamentally possible to make any aspect of speech recognition fully identical on all platforms.  Even the very simplest builtin we've discussed (say, just a text-based dictation grammar) is powerful because it is so complicated, and those complexities will not be implemented identically on every platform.  Some platforms may be able to recognize "'Twas brillig, and the slithy toves, did gyre and gimble in the wabe: all mimsy were the borogoves, and the mome raths outgrabe." and others may not, as many of those 'words' might not be in the grammar.

Similarly, when we discussed the filterOffensiveWords flag from Glen's proposal, the group generally liked the idea but rightly didn't want to standardize what is and is not an offensive word.

So I think it is OK if some of the information communicated to the recognition service is used slightly differently by different platforms.  For instance, some platforms may take advantage of the placeholder value to weight and tune the exact dictation models used (because, after all, a hint to aid the user with data entry is also a hint that can aid the speech platform with recognition).  Others may just ignore it.  Given that HTML5 has defined these content attributes, it seems to me a shame to ignore them if they could provide a better speech experience for end users.

________________________________________
From: Robert Brown
Sent: Wednesday, October 19, 2011 1:12 PM
To: Olli@pettay.fi; Michael Bodell
Cc: public-xg-htmlspeech@w3.org
Subject: RE: builtin grammars

Michael>> builtin:input?type=text&pattern=%5B0-9%5D%5BA-Z%5D%7B3%7D
Olli> So, is this really something speech engines can support?

I'm wondering that too. I mean, yes it's clearly possible to use a regex as a grammar. But given that SRGS (or proprietary equivalents) and ARPA are the prevalent ways of modeling speech input, I'd be surprised if anybody actually uses regex as a grammar. Still, it should be possible. Whether or not it would be widely implemented is another question.

However, when we see something like "builtin:input?type=text&pattern=%5B0-9%5D%5BA-Z%5D%7B3%7D" I read that as the intersection of two grammars: 1) an ngram model for text input; and 2) a regex to constrain the text to a certain pattern. Either one is okay as a grammar on its own. But what does it mean to combine them? I imagine that the ideal solution is to use the regex to re-weight the ngram model. Making that perform well on demand seems non-trivial. Alternatively, we could perform the ngram recognition first using default weights, then walk the lattice until we find something that fits the regex. That might be easier, but it will be less accurate because we'd be including too much chaff and potentially discarding valid results in the first pass. We'd probably have to use a combination of both techniques. I have no idea how well understood this problem is outside of labs. If this sort of thing is widely implemented, then sure, let's propose a standard way of expressing it in HTML like Michael suggests.
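
To make the second approach concrete (recognize first, then filter by the regex), here is a minimal Python sketch. It is an illustration only: a scored n-best list stands in for a real recognition lattice, the hypotheses and scores are made up, the helper names are hypothetical, and the normalization step (dropping spaces, upper-casing) is an assumption about how spoken characters would be mapped to text.

```python
import re
from urllib.parse import parse_qs


def pattern_from_builtin(uri):
    """Extract and percent-decode the 'pattern' parameter of the URI."""
    _, _, query = uri.partition("?")
    return parse_qs(query)["pattern"][0]


def first_matching(hypotheses, pattern):
    """Return the best-scoring hypothesis whose text fits the regex.

    `hypotheses` is a list of (text, score) pairs, standing in for
    a walk over a recognition lattice. Returns None if nothing fits.
    """
    regex = re.compile(pattern)
    for text, _score in sorted(hypotheses, key=lambda h: h[1], reverse=True):
        compact = text.replace(" ", "").upper()  # assumed normalization
        if regex.fullmatch(compact):
            return compact
    return None


uri = "builtin:input?type=text&pattern=%5B0-9%5D%5BA-Z%5D%7B3%7D"
pattern = pattern_from_builtin(uri)
print(pattern)

# Made-up n-best list: the top hypothesis does not fit the pattern.
nbest = [("for a b c", 0.9), ("4 a b c", 0.7), ("4 a b", 0.5)]
print(first_matching(nbest, pattern))
```

The example also shows the weakness described above: the correct reading survives only because it happened to be in the list, whereas a first pass with default weights could have pruned it before the regex was ever applied.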

But if not, then I think it's premature to include it. In fact, we might be better off by proposing an extension to the pattern attribute that lets it use an SRGS (XML or ARPA) as an alternative to regex.
Received on Thursday, 20 October 2011 08:41:09 UTC