
Re: HTML5 Paragraphs, Sentences and Phrases

From: Arthur Clifford <art@artspad.net>
Date: Thu, 12 Apr 2012 23:11:29 -0700
Message-Id: <AE4308C6-F3E3-4DB6-87EF-9B3210182434@artspad.net>
To: public-html-comments@w3.org
At what point does one reach a point of absurdity?

Should there be a <punct char="." name="period" role="full-stop" /> and a <punct char="," name="comma" role="pause" />?

Should there be a word tag? <word role="verb"  tense="past">ran</word> 

I am sure these things would be great to have, but ultimately, if somebody wants to make content available with that level of detail, they should work on a conversion program that generates tagged content in XML. It would probably be something like an NLML: a Natural Language Markup Language.

If HTML is for marking up presentation content in browsers or similar user agents, then div and span are adequate for the job. You could namespace your divs and spans to accomplish what you want, along the lines of <span id="word_verb_past">ran</span>, and have a reading technology know how to process the span ids to determine how to present, read, or interact with the user.
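
A minimal sketch of how a reading tool might consume such ids, assuming a made-up convention of word_<role>_<tense>. The class WordSpanReader and the function read_words are hypothetical names for illustration; only Python's standard html.parser module is used:

```python
from html.parser import HTMLParser


class WordSpanReader(HTMLParser):
    """Collect (text, labels) pairs from spans whose id starts with 'word_'.

    The 'word_<role>_<tense>' id convention is an assumption for this
    sketch, not anything defined by HTML.
    """

    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, [role, tense, ...])
        self._stack = []   # one entry per open span: a label list, or None

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        span_id = dict(attrs).get("id") or ""
        parts = span_id.split("_")
        is_word = parts[0] == "word" and len(parts) > 1
        self._stack.append(parts[1:] if is_word else None)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Only keep text that sits directly inside an annotated span.
        if self._stack and self._stack[-1] is not None and data.strip():
            self.words.append((data.strip(), self._stack[-1]))


def read_words(html):
    reader = WordSpanReader()
    reader.feed(html)
    reader.close()
    return reader.words


print(read_words('<p>She <span id="word_verb_past" >ran</span> home.</p>'))
# prints [('ran', ['verb', 'past'])]
```

Note that HTML requires id values to be unique within a document, so a page with more than one past-tense verb would have to carry the same convention in class or a data-* attribute instead; the parsing logic would stay the same.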

If HTML is supposed to be semantic, then the arguments in favor of sentence, sentence_fragment, phrase, and word tags are not unreasonable, because they do after all explain what you are seeing, at least for English speakers. Then again, I know it's semantics, but a div with a specially formatted id, name, or perhaps a role attribute (if you really needed to add something) would semantically suggest what you are looking at. As would span. They suggest you are looking at either a block of content (div) or a fragment or subset of a block (span); the only thing missing is the role of the div or span. While styles do imply role, style semantically suggests visualization.

I don't think the problem here is one of reading as much as writing. Nobody in their right mind wants to sit down and mark up their sentences unless they are working on something to teach someone about sentence structure. In which case they are better off learning XML, XSLT, and HTML, how to really use them, and how to work with dedicated/controlled content. Frankly, the majority of content creators are not interested in teaching anybody how to read, but rather want to sell a product, blog about something, tweet their brainfart of the moment, or even share research, as was originally the purpose of the web. However, if Word or other programs can tell me my grammar is wrong, then they should be able to export my document in an XML format that marks up my content with grammatical markup. XSLT could transform that for use in a browser or translate it for use in other technologies. This request needs to start at the places where we produce content. Honestly, most of us still don't even use Word correctly (do you bold or italicize individual words, or do you apply a style?).
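
As a rough illustration of that pipeline, assume an editor exported a sentence using the hypothetical word and punct elements from the top of this message, wrapped in an assumed sentence element. XSLT would be the natural tool for the transform; the sketch below uses Python's standard xml.etree.ElementTree as a stand-in, and the function name to_html and the output class names are invented for the example:

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical export format; no real editor is known to produce this.
SOURCE = """<sentence>
  <word role="pronoun">She</word>
  <word role="verb" tense="past">ran</word>
  <punct name="period">.</punct>
</sentence>"""


def to_html(xml_text):
    """Turn one grammatically marked-up sentence into nested HTML spans."""
    sentence = ET.fromstring(xml_text)
    out = []
    for node in sentence:
        text = escape(node.text or "")
        if node.tag == "word":
            # Carry the grammatical labels over as class names.
            labels = " ".join(
                v for v in (node.get("role"), node.get("tense")) if v
            )
            if out:
                out.append(" ")  # words are separated by a single space
            out.append(f'<span class="{escape(labels)}">{text}</span>')
        else:
            # Punctuation attaches directly to the preceding word.
            out.append(text)
    return '<span class="sentence">' + "".join(out) + "</span>"


print(to_html(SOURCE))
# prints <span class="sentence"><span class="pronoun">She</span> <span class="verb past">ran</span>.</span>
```

The point of the sketch is only that the author never types a tag: the grammatical detail lives in the exported XML, and the browser-facing markup is generated from it.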

Based on how I've seen folks respond here, the HTML standard is based on what people are doing. So, rather than asking for something which may help something possibly do something, I think the key is to ask the right sector in the industry to actually build something that produces a dedicated markup language that HTML 6 can incorporate later. While I don't always agree with decisions made by folks here, I can understand their perspective that this is a fringe use case and not compelling enough to warrant new tags, especially when you can do that yourself with XHTML.


Art C.



On Apr 12, 2012, at 1:48 PM, Thomas A. Fine wrote:

> 
> This is in response to Benjamin Hawkes-Lewis' response to
> Adam Sobieski's proposal for sentence and phrase tags.
> 
> Speaking to the "necessity" of these tags, while I'm not sure really
> any tag, or HTML or the web or even a good slice of pizza can be
> described as necessary, these tags can definitely be useful, and
> most likely they can be important.  Sentence and phrase markings can
> be very useful to:
>  People relying on audio conversion to access the web.
>  People relying on automated translation.
>  People who are just learning to read.
>  People who are reading an article not in their native language.
>  People who are interested in inter-sentence spacing or inter-phrase spacing.
>  People with commercial interests, looking to maximize their reach.
> 
> Of course, simply adding tags won't really help any of these people.
> The real point is that such tags can facilitate tools that help
> these people.
> 
> The problem with using span tags is that they won't facilitate tool
> development.  In the absence of a real standard, no one is going
> to develop software to process sentences by searching for spans
> that might be labeled "sentence" or "sent" or "stc" or who knows
> what else.  Only in the presence of a standard tag can developers use
> these tags to improve translation, or emphasize phrasing and sentence
> structure for improved readability.
> 
> Mr. Hawkes-Lewis wrote:
>> The web corpus is not going to get marked up with phrases and
>> sentences in the absence of NLP advances that would make such markup
>> mostly redundant.
> 
> Natural Language Processing is riddled with problems, and there is
> nothing to suggest that this will change in the near future.  On
> the other hand, someone who is authoring content is in the perfect
> situation to accurately identify sentences or phrases.  NLP can be
> an aid to that user, and can provide hints to help them select
> sentence structure.  But as I said above, no such software would
> ever be developed to use NLP to aid users in marking sentence
> structure unless there were already dedicated sentence and phrase
> tags.  So in essence, you are correct, but only because your
> argument is a self-fulfilling prophecy.
> 
> You also suggest simply using a CSS pseudo-tag, and relying on the
> Unicode sentence breaking conventions.  However, looking at these
> conventions, they are just another attempt at some sort of automated
> processing, and they acknowledge that this will not work for all cases.
> This is just one more argument in favor of giving content providers
> the ability to accurately mark up sentence structure.
> 
> I'll further note that any form of automated NLP is wholly inadequate
> when it comes to users interested simply in formatting control issues.
> Giving them a mechanism that does not provide control over where and
> when content will be formatted (other than some outside algorithm they
> don't control) is not providing any real control over formatting.
> 
> If you are saying that you don't think most people will bother, that is
> probably true.  But that doesn't mean that there aren't people with
> a legitimate and important interest.
> 
> So back to the original question, are these tags necessary?  I would
> now say yes, these tags are necessary to the development of software
> tools to aid users in marking sentence structure, and they are
> necessary to the development of tools that allow content providers
> to improve readability of their web pages for several classes of
> web users.
> 
>     tom
> 
> 
Received on Friday, 13 April 2012 06:12:01 GMT
