
Re: [whatwg] Sentence structure

From: Vipul S. Chawathe <Engineer@VipulSChawathe.ind.in>
Date: Sat, 12 Jan 2013 02:23:18 +0530
To: "'Ian Hickson'" <ian@hixie.ch>
Message-ID: <001b01cdf03d$b3ccb0e0$1b6612a0$@VipulSChawathe.ind.in>
Cc: whatwg@lists.whatwg.org
>From: Ian Hickson [mailto:ian@hixie.ch] 
>
>On Thu, 10 Jan 2013, Thomas A. Fine wrote:
>> 
>> Use Cases:
>>   4. Clarifying sentence boundaries would be an aid in machine
>>      translation software.

>Do you have any evidence supporting this? I've spoken with engineers who
>work on machine translation software and while they've certainly had
>requests (whence the "translate" attribute), they've never asked for a way
>to mark up sentences.


I'm doing some related work that requires machine translation along the
lines of exporting and importing HTML snippets. At the sentence level, the
boundaries of human-language content are determined directly by the author's
punctuation skills. HTML is concerned above all with GUI web browsers, so
machine translation, screen readers, and so forth are supported through
other "living" standards (GRDDL, XSLT, RDFa) that work with HTML as one of
several possible host languages. However, their dependency on the XML
serialization for proper functioning might lead browser engine makers to
promote sticking to microdata, unless someday we get a Google
SilverFlash.java Safari plug-in so that one size fits all. As HTML is a host
language in widespread use (my apologies for lacking statistics, which I
compensate for by deriving statements from common sense), perhaps this is a
starting point for raising concerns that may be redirected into other specs
too. It's the only opening for those rare use cases, as in the story of the
Emperor's New Clothes.
Getting back to business: for larger content fragments there's the p
element. An immediate example is the abruptly cut-off fragments that search
results show in their content previews. To improve on such fragment indices
they've come up with the schema.org vocabulary, which I just had to mention
here. It has a provision for specializing its general predefined types, so
Thing>WebPageElement can be extended to get
Thing>WebPageElement>Paragraph>Sentence. This can be expressed using the
HTML5 microdata itemtype attribute as:
<span itemscope="itemscope"
itemtype="http://www.schema.org/thing/webpage/webpageelement/paragraph/sentence">One whole sentence!</span>
In HTML5 without the XML serialization, the ="itemscope" value can be
skipped too! That saves 12 characters, a saving comparable to those
recommended by minifying. :-)
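
For illustration, here is a sketch of that shorter form as it would be
written in the HTML (non-XML) syntax. The itemtype URL is the same
hypothetical Sentence specialization used above, not a type I know
schema.org to publish:

```html
<!-- itemscope is a boolean attribute, so in the HTML syntax it needs no value. -->
<!-- The itemtype URL is an assumed extension, shown only as an example. -->
<span itemscope
      itemtype="http://www.schema.org/thing/webpage/webpageelement/paragraph/sentence">One whole sentence!</span>
```

In the XML serialization the attribute still needs a value, so the longer
itemscope="itemscope" form stays.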
Received on Friday, 11 January 2013 20:52:19 GMT
