- From: Thomas A. Fine <fine@head.cfa.harvard.edu>
- Date: Thu, 10 Jan 2013 14:32:34 -0500 (EST)
- To: whatwg@whatwg.org
[Apologies to those who read public-html or www-style at w3.org where I've raised these issues (although this is more comprehensive). The process for modifying HTML is way more complicated than it use to be, and I'm still trying to figure out all the parts and the best approach.] HTML needs support for identifying sentence structure. Use Cases: 1. Formatting sentence spacing to approximate the look of almost all books in English from 1650-1950. 2. Formatting sentence spacing because it is very likely an aid to scanning text, and there are some indications that it is helpful for new readers, readers learning a new language, and readers with visual scanning issues and other learning disabilities. 3. Formatting sentence spacing because I like it that way. 4. Clarifying sentence boundaries would be an aid in machine translation software. 5. Clarifying sentence boundaries would be an aid to screen readers to help provide correct inflection. When it comes to the formatting use cases, there are a huge number of people who currently use two spaces between sentences already, even in web content where it currently is wasted. Some significant portion of these people are likely to be interested in sentences formatting if such a feature was available in a practical form. As for machine-parsed uses, it's not actually my field, so I'm not sure how helpful it would be, only that it would be helpful to some degree. In my limited experience with text-to-speech, sentence inflection errors are usually a noticeable problem. Existing practices, with some obvious pluses(+) and minuses(-): * The most popular recommendation on the web is to use . + Many people are familiar with it. - Not so fun to type. - Only the non-collapsing aspect is needed, the non-breaking aspect interrupts line breaks and creates uneven justification (left and right). - No fine-grained or dynamic control that CSS could provide. - Not really so useful for machine translation of screen readers as it doesn't eliminate ambiguity. * Use other space entities. + Doesn't have the justification problem of . + Allows some degree of fine control with different space sizes. - There exists no space entity which is the same size as a space and which breaks but doesn't collapse. - Many content creators are not aware of these entities. - Not really so useful for machine translation of screen readers as it doesn't eliminate ambiguity. - Still not as fine-grained as a CSS solution and no dynamic control. * Use spans to wrap sentences (not commonly used). - Very tedious. + Allows fine-grained and dynamic control through CSS. - Clean CSS for formatting is not obvious (e.g. some recommendations say to use the box model, which disrupts line breaking and creates uneven margins). * Set white-space to pre-wrap (not commonly used). + Very simple for content creators. - Doesn't provide unambiguous sentences to machine parsers. - Pre-wrap honors new lines which may be undesirable to some authors [why isn't there a white-space option that preservers spaces but not newlines?]. - No fine-grained or dynamic control. Possible improvements, with some obvious pluses(+) and minuses(-): * Detect sentences from text with an off-the-shelf algorithm. + Works on all existing content. - Available algorithms are some combination of unreliable and expensive. - Content creator doesn't have any control over what the algorithm will decide is or is not a sentence. Some sort of tag or entity could be used only for exceptions but again, the content creator wouldn't know where the exceptions might occur without a specified algorithm. * CSS setting that tells the parser that two spaces after terminal punctuation can be trusted as a reliable method of detecting sentences without ambiguity. + Would work immediately for some existing content. + By far the simplest solution for content creators. + Gives content creator full control. * Explicit sentence tag that surrounds each sentence (and some associated CSS to format it). + The most "traditional" solution. + The only solution here that fully marks the entire sentence, not just the end or gap so there is no extra processing to find the beginning of a sentence. (Consider this a minus on all the other approaches, even though I didn't list it.) - Very tedious to do by hand + A dedicated tag could spur implementation of HTML editors that mark the text for you. * Dedicated tag to mark the gap between sentences. - Somehow this just seems weird to me, a tag that's only purpose is to contain a space. + An easy substitution to make in an editor or post-processor. * New entity that marks the ends of sentences, or the gap between two adjacent sentences (and associated CSS to manipulate it). + An easy substitution to make in an editor or post-processor. * New Unicode character that provides an unambiguous full stop + The language could really need it. - There are also ambiguous cases for the question mark and the exclamation point, notably when used in a quote within a sentence, but also other odd balls too. - Don't hold your breath getting anything into Unicode. Thanks for your careful consideration in this matter, tom
Received on Thursday, 10 January 2013 19:33:01 UTC