- From: Karin Verspoor <Karin.Verspoor@nicta.com.au>
- Date: Thu, 21 Feb 2013 12:03:07 +0000
- To: David Cuenca <dacuetu@gmail.com>
- CC: Robert Sanderson <azaroth42@gmail.com>, "<public-openannotation@w3.org>" <public-openannotation@w3.org>
Some other issues off the top of my head:

* It's hard to determine paragraphs, sentences and words.
-- Paragraphs could be <p>, or <div>, but they might not be. Perhaps just <br/><br/> is used to separate the paragraphs. And that's just HTML, let alone other textual resources.
-- Sentences: Mr. J. Smith of the U.S.A. took $1.45 from his pocket ... and spent it. 1 sentence or 10?
-- Words: Word splitting is extremely hard in eastern languages.

In the first case it is a matter of having accurate definitions of what a paragraph means - I admit there are some loose ends there. The issue with sentences could be averted by defining a sentence as a group of at least 2 words and handling numerals properly. Nevertheless, in an extreme situation like that it would be much more sensible to use word counting instead. What do you mean by word splitting in eastern languages? The concept of using the unit "word"?

I just want to reiterate Robert's concerns here. Paragraphs can be tricky -- many text formats don't treat a carriage return as a paragraph break, but some do; some formats use whitespace (e.g. two carriage returns) to indicate a paragraph break, etc. But that's nothing compared to sentences (for which use of punctuation can vary; punctuation which 'usually' ends a sentence doesn't always) and, especially, "words".

In many domains there are words that contain punctuation -- even in general English we have contractions and hyphens, e.g. "I'm coming" -- is "I'm" one word or two? (Many natural language processing systems will treat that as two, equivalent to "I am", and possessives like "David's book" would be parsed as "David <possessive> book".) Is "state-of-the-art" one word or four? "co-occurring" one or two? Simple heuristics about whitespace and punctuation will not always work correctly.
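The word-boundary ambiguity described above can be made concrete with a small sketch. The three splitting rules below are illustrative assumptions, not the behaviour of any particular NLP toolkit, and the function names are hypothetical; the point is only that equally plausible heuristics give different word counts for the same text.

```python
import re


def split_whitespace(text):
    """Heuristic 1 (assumed): a word is any run of non-whitespace characters."""
    return text.split()


def split_alnum_runs(text):
    """Heuristic 2 (assumed): a word is a run of ASCII letters/digits;
    every punctuation mark, including ' and -, is a boundary."""
    return re.findall(r"[A-Za-z0-9]+", text)


def split_internal_punct(text):
    """Heuristic 3 (assumed): like heuristic 2, but an apostrophe or hyphen
    between two alphanumeric runs stays inside the word."""
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)


def word_counts(text):
    """Return the number of 'words' each heuristic finds in the same text."""
    return {
        "whitespace": len(split_whitespace(text)),
        "alnum_runs": len(split_alnum_runs(text)),
        "internal_punct": len(split_internal_punct(text)),
    }


# Examples taken from the message itself.
for sample in ["I'm coming", "state-of-the-art", "co-occurring", "David's book"]:
    print(sample, word_counts(sample))
```

Under these assumed rules "state-of-the-art" is one word or four depending on the heuristic, so a selector such as "the 12th word" would resolve to different text for different consumers.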
In the chemistry domain, for instance, hyphens are used to indicate bonds between molecules; in biology the apostrophe is used to indicate which end of the DNA is being referenced (3'ACTAGCTA … 5'ACTAGCTA etc.), gene names can contain numbers ('BRCA1'), among myriad other cases where punctuation and numerals can be ambiguous about whether or not they are part of the "word".

I would strongly recommend avoiding using selectors/citations that depend on syntactic or semantic units that are open to various interpretations. Stick with what is the least sensitive to interpretation, the characters in the file.

* We stuck with character counting, but even then it's tricky with normalization routines. &amp; -- 1 character or 5? You have the same issue with length as well.

My aim was only to address the topic of parsed text, would that be an issue in that case?

I think that as long as you have (a) a standard version of the text that you are pointing to, and (b) an indication of the character encoding you assume (e.g. UTF-8), then using character counts is the most straightforward way to have consistent references. If you're dealing with the raw HTML then "&amp;" is 5 characters.

Karin Verspoor
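A minimal sketch of the character-counting point in the message above, assuming Python 3 and its standard `html` and `unicodedata` modules. The sample strings and the helper name `offsets_of` are made up for illustration, and treating "normalization routines" as Unicode normalization is an assumption about one kind of normalization, not something the thread specifies.

```python
import html
import unicodedata

# Raw HTML source versus the text a parser would expose.
raw = "Fish &amp; chips"
parsed = html.unescape(raw)  # the entity becomes a single "&"


def offsets_of(needle, text):
    """Return (start, end) character offsets of needle in text."""
    start = text.index(needle)
    return start, start + len(needle)


print(len(raw), len(parsed))        # lengths differ by 4
print(offsets_of("chips", raw))     # offset in the raw HTML
print(offsets_of("chips", parsed))  # offset in the parsed text

# The unit being counted matters too: code points versus encoded bytes,
# and composed versus decomposed forms of the same visible text.
composed = "caf\u00e9"                                # e-acute as one code point
decomposed = unicodedata.normalize("NFD", composed)  # "e" plus combining accent
print(len(composed), len(decomposed))
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

The same visible span gets different offsets depending on whether it is counted against the raw source or the parsed text, which is why a reference needs to state both the version of the text and the encoding it assumes.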
Received on Thursday, 21 February 2013 12:05:18 UTC