- From: Thomas Breuel <tmbdev@gmail.com>
- Date: Wed, 28 Mar 2007 14:38:34 -0700
- To: "Danny Ayers" <danny.ayers@gmail.com>
- Cc: SW-forum <semantic-web@w3.org>
- Message-ID: <7e51d15d0703281438i42eb0fd0m1905007879f0d61b@mail.gmail.com>
On 3/28/07, Danny Ayers <danny.ayers@gmail.com> wrote:
>
> On 28/03/07, Thomas Breuel <tmbdev@gmail.com> wrote:
> > I should mention that the motivation for our work is not the semantic web;
> > that is, while it may be a nice side effect of hOCR and hBIB that OCR and
> > bibliographic information can be extracted from on-line web pages and
> > processed further, our primary use of HTML is as a markup language, not a
> > hypertext language.
>
> I'm a little confused by the need for anything like microformats, as
> HTML is already a markup language, microformats augment that by
> allowing a way of embedding explicit data.

We are embedding explicit data (and a lot of it, often a lot more than the text itself), just not the kind of data that the semantic web community usually considers. Furthermore, almost all the data we embed is automatically generated.

> Hmm, my feeling would be that if the additional information was worth
> including in the documents, then it would very likely be useful to
> semantic web applications. Things like identification of separate
> authors isn't necessarily a problem, RDF and related technologies are
> designed so they can still be useful with partial information. But I
> take your point about statistical analysis - that's something I've
> heard little about around the semweb community (except maybe on the
> fringes of search).

Well, to illustrate the difference, in our applications, in addition to being well-formed, perfectly structured text, the "author" markup might alternatively be an image (a snippet of the original document), a probabilistic graph structure representing possible interpretations of the source document, or even just a set of image coordinates referring to some page image that isn't part of the document. As a consequence, while the kind of markup we produce will often be useful for semantic web applications, we can't guarantee it.

There are explicit indications in both hOCR and hBIB as to what kinds of analysis has and hasn't been carried out on the input, so you should always be able to tell. hOCR also has ways of indicating confidence in different interpretations (within limits set by the grammar of HTML/XML markup).

But, not to be misunderstood, we have defined hOCR and hBIB not as alternatives to existing microformats, but because we needed to represent this information and didn't have a good way of doing that with existing formats at all. I think both formats will also be useful for traditional semantic web applications, it's just that that hasn't been driving their development.

Tom
Received on Thursday, 29 March 2007 00:59:10 UTC