- From: Benjamin Hawkes-Lewis <bhawkeslewis@googlemail.com>
- Date: Wed, 2 May 2012 22:29:10 +0100
On Wed, May 2, 2012 at 8:01 PM, Charles Pritchard <chuck at jumis.com> wrote:
>>> 1. New features won't fix Google Translate bugs with existing
>>> features, and it's more efficient for Google to fix Translate than for
>>> the community to design, specify, and implement new features.
>
> New features do allow services to coalesce around standards. That's what the
> standards are here for.

Existing mechanisms for embedding custom data are (being) standardized
and can make use of standardized vocabularies.

> HTML5 just added a translate attribute.

That doesn't describe a drawback with using the existing mechanisms.

> Span does not in and of itself signify any semantic meaning.

Doesn't that mean that Google Translate is operating correctly? No?
Moving text in or out of an element that "mean[s] something on its
own" (as the spec puts it) has potential to break things. But that's
also true, if less so, for an element that "doesn't mean anything on
its own". There might be code (client-side JS, CSS selectors, XPointer
URIs, automation scripts, whatever) that depends on that text being
inside or outside that element at that position in the DOM.

That's not to say that Google Translate is operating incorrectly.
Translation inevitably changes the DOM. Text node contents change, of
course. Because different languages may express the same ideas in
different orders, DOM nodes may need to be reordered. Because
different languages have different practices around compounding or
implying ideas with different numbers of words, what might be a
separate word in a separate element in one language might need to be
merged into another word outside the element, or vice versa.

It's not obvious that there is a correct behavior here, and I struggle
to see how the markup examples you proposed would help. (Perhaps you
could elaborate?)
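
To make the point about dependent code concrete, here is a minimal
sketch in Python (standard library only). The markup strings, the
"material" class name, and the two "translated" variants are all
invented for illustration; they are not real Google Translate output.
The probe stands in for any script that expects certain text to sit
inside a particular span.

```python
from html.parser import HTMLParser

# Hypothetical source markup and two hypothetical translations of it.
ENGLISH = '<p>a <span class="material">wooden</span> house</p>'
REORDERED = '<p>una casa de <span class="material">madera</span></p>'
MERGED = '<p>ein Holzhaus</p>'  # the separate word is compounded away


class SpanProbe(HTMLParser):
    """Record the text inside the first matching span and the text before it.

    Assumes flat markup (no elements nested inside the span).
    """

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.in_span = False   # True while inside the target span
        self.seen = False      # True once the target span has opened
        self.before = []       # text that precedes the span
        self.inside = []       # text contained in the span

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == self.css_class:
            self.in_span = True
            self.seen = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.inside.append(data)
        elif not self.seen:
            self.before.append(data)


def probe(html, css_class="material"):
    """Return (text before the span, text inside the span)."""
    parser = SpanProbe(css_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.before).strip(), "".join(parser.inside).strip()


for markup in (ENGLISH, REORDERED, MERGED):
    print(probe(markup))
```

A script written against the English markup would find the span after
one word; in the reordered variant it sits at the end of the sentence,
and in the merged variant it is gone altogether. None of the three
outputs is obviously the wrong one for a translator to produce.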

Researching and recommending authoring practices that make translation
less likely to break code might be a more immediately fruitful line of
enquiry, and might help inform the ultimate creation of a vocabulary
fit for purpose.

But more importantly, assuming such a vocabulary could be created,
this is not a reason why it could not be embedded using the existing
mechanisms. The HTML specification is not the only source of
standardized vocabulary on the web.

>>> 2, 3, and 4: Given an appropriate vocabulary, existing mechanisms can
>>> encode unambiguous meanings, information about how text should be
>>> spoken, and phrase and sentence boundaries. Unicode describes
>>> character boundaries.
>
> Boris brought up that the concept of letter could use some attention:
> http://lists.w3.org/Archives/Public/www-style/2011Nov/0055.html

It's not clear to me that Boris has raised something not addressed by
Unicode, but in any case an appropriate vocabulary could be used for
letters too.

> Yes, we have existing XML mechanisms for text should be spoken.
>
> What existing mechanism do we have for disambiguation?

Any vocabulary you want to use with microdata, microformats, RDFa,
etc. If the vocabulary doesn't exist yet, create it and publish it as
a spec.

>>> 5. Tab isn't talking about "data-" here, but about all the various
>>> mechanisms available to provide custom data for services to consume
>>> (e.g. microdata, microformats, RDFa).
>
> Tab asked directly why data- does not work

He had two questions:

1. If you're only using the data yourself, why not data-?
2. If you want other people to use the data, why not the other
   mechanisms for custom data embedding?

Your 5 points appeared to be in answer to his second question, because
you placed them as a list in response to it. But never mind.

> Yes, we have a lot of microformats, it's true. And RDFa.
>
> They don't seem to be taking flight for these issues

I suspect that's because these are new mechanisms and markup is a
doomed solution to these problems at web scale. Anyhow, given you're
one of the few people asking to be able to encode these details in
markup, offering lack of usage as a reason for not being able to use
these mechanisms is circular.

> and language translation seems like a high level issue appropriate for HTML.

That's not a reason why you could not use the existing mechanisms.

Aside: just because a problem is important does not mean that
introducing more markup features is an approach that will scale to
solve the problem across the web. More work on NLP would probably be a
better investment in this case.

> Again, look at the translate and lang attributes; those are baked into HTML.

That's not a reason why you could not use the existing mechanisms.

> I am approaching the "lang-" proposal as language agnostic, much as "aria-"
> is language agnostic.
>
> This seems to be where we are currently:
> <img lang="es" translate="no" alt="No" />
>
> With alt having ARIA counterparts.
>
> I'm suggesting a "lang-" with counterparts to translate, language code, and
> a vastly enhanced vocabulary, much as ARIA vastly enhanced the UI
> vocabulary. I think it could help in the long run.

That's just you choosing to use something _other_ than the existing
mechanisms; it's not a reason why you could not use them.

I'm baffled why you think defining an RDF vocabulary then requiring
host languages to closely couple their specs to your spec with a set
of arbitrary and confusing syntactical and behavioural requirements is
preferable to just defining a vocabulary and letting host languages
embed it however they like.

I would certainly caution against further integrations with HTML along
the ARIA model, having seen the pain it's caused.
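
As a rough illustration of embedding a vocabulary with an existing
mechanism, the sketch below (Python, standard library only) marks up
an ambiguous word with microdata and reads the properties back. The
vocabulary URL http://example.org/vocab/Term and the property names
"text", "sense", and "respelling" are hypothetical, made up for this
example; no such vocabulary has been published. The reader handles
only flat, non-nested items and is not a conforming microdata parser.

```python
from html.parser import HTMLParser

# Hypothetical vocabulary URL and property names, invented for this sketch.
MARKUP = (
    '<p>The pipe is made of '
    '<span itemscope itemtype="http://example.org/vocab/Term">'
    '<span itemprop="text">lead</span>'
    '<meta itemprop="sense" content="metal">'
    '<meta itemprop="respelling" content="led">'
    '</span>.</p>'
)


class ItemReader(HTMLParser):
    """Collect itemprop name/value pairs from flat (non-nested) items."""

    def __init__(self):
        super().__init__()
        self.items = []        # one dict per itemscope element
        self.open_prop = None  # property whose text content is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.items.append({"type": attrs.get("itemtype")})
        prop = attrs.get("itemprop")
        if prop and self.items:
            if tag == "meta":
                # For meta elements the value is the content attribute.
                self.items[-1][prop] = attrs.get("content", "")
            else:
                # Otherwise the value is the element's text content.
                self.open_prop = prop
                self.items[-1][prop] = ""

    def handle_endtag(self, tag):
        self.open_prop = None

    def handle_data(self, data):
        if self.open_prop:
            self.items[-1][self.open_prop] += data


def read_items(html):
    """Return the list of items found in the markup."""
    reader = ItemReader()
    reader.feed(html)
    reader.close()
    return reader.items


print(read_items(MARKUP))
```

Nothing here needs a new HTML feature: the host language carries the
data through attributes it already has, and the meaning lives in
whatever vocabulary the authors agree to publish.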

I'd suggest instead that the small number of authors interested in
this markup get together and use and develop vocabularies that can be
embedded in HTML or XML using microdata or RDFa. You will probably
make lots of mistakes and learn a lot along the way. If, at the end of
the day, you've got robust vocabularies that solve problems for more
authors and see non-microscopic levels of adoption, then they could be
pulled into the mainstream language just as class="nav" got pulled in
as <nav> and class="datetime" got pulled in as <time>.

Proposing that we conjure such a vocabulary out of the air to solve a
wide set of mostly unanalysed problems in the absence of documented
workarounds and then reify that vocabulary in a load of specific
features seems to me to put the cart way before the horse.

--
Benjamin Hawkes-Lewis
Received on Wednesday, 2 May 2012 14:29:10 UTC