Re: [uf-new] announcing the hOCR and hBIB microformats from Thomas Breuel on 2007-03-28 (semantic-web@w3.org from March 2007)

From: Thomas Breuel <tmbdev@gmail.com>
Date: Wed, 28 Mar 2007 14:38:34 -0700
To: "Danny Ayers" <danny.ayers@gmail.com>
Cc: SW-forum <semantic-web@w3.org>
Message-ID: <7e51d15d0703281438i42eb0fd0m1905007879f0d61b@mail.gmail.com>

On 3/28/07, Danny Ayers <danny.ayers@gmail.com> wrote:
>
> On 28/03/07, Thomas Breuel <tmbdev@gmail.com> wrote:
> > I should mention that the motivation for our work is not the semantic
> > web; that is, while it may be a nice side effect of hOCR and hBIB that
> > OCR and bibliographic information can be extracted from on-line web
> > pages and processed further, our primary use of HTML is as a markup
> > language, not a hypertext language.
>
> I'm a little confused by the need for anything like microformats, as
> HTML is already a markup language, microformats augment that by
> allowing a way of embedding explicit data.

We are embedding explicit data (and a lot of it, often a lot more than the
text itself), just not the kind of data that the semantic web community
usually considers.  Furthermore, almost all the data we embed is
automatically generated.

> Hmm, my feeling would be that if the additional information was worth
> including in the documents, then it would very likely be useful to
> semantic web applications. Things like identification of separate
> authors isn't necessarily a problem, RDF and related technologies are
> designed so they can still be useful with partial information. But I
> take your point about statistical analysis - that's something I've
> heard little about around the semweb community (except maybe on the
> fringes of search).

Well, to illustrate the difference: in our applications, the content of the
"author" markup need not be well-formed, perfectly structured text.  It
might instead be an image (a snippet of the original document), a
probabilistic graph structure representing possible interpretations of the
source document, or even just a set of image coordinates referring to some
page image that isn't part of the document.

As a consequence, while the kind of markup we produce will often be useful
for semantic web applications, we can't guarantee it. There are explicit
indications in both hOCR and hBIB as to what kinds of analysis have and
haven't been carried out on the input, so you should always be able to tell.
hOCR also has ways of indicating confidence in different interpretations
(within limits set by the grammar of HTML/XML markup).
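
As a rough illustration of the coordinate and confidence markup described
above, here is a minimal Python sketch. It is not taken from this message:
the class name "ocrx_word" and the "bbox" and "x_wconf" properties in the
title attribute are assumed from the published hOCR specification, and the
sample markup, parse_title, WordCollector and extract_words are hypothetical
names made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical hOCR-style fragment; property names are assumed from the
# published hOCR specification, not from the message above.
SAMPLE = (
    '<span class="ocr_line" title="bbox 100 200 420 240">'
    '<span class="ocrx_word" title="bbox 100 200 250 240; x_wconf 93">Thomas</span> '
    '<span class="ocrx_word" title="bbox 260 200 420 240; x_wconf 41">Breuel</span>'
    '</span>'
)


def parse_title(title):
    """Split a 'name v1 v2; name v1' title string into {name: [values]}."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:
            props[fields[0]] = fields[1:]
    return props


class WordCollector(HTMLParser):
    """Collect text, bounding box and confidence for each word-level span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word span currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            props = parse_title(attrs.get("title") or "")
            self._current = {
                "text": "",
                "bbox": tuple(int(v) for v in props.get("bbox", [])),
                "conf": float(props["x_wconf"][0]) if "x_wconf" in props else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Word spans are assumed not to nest other spans.
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def extract_words(html):
    collector = WordCollector()
    collector.feed(html)
    collector.close()
    return collector.words


if __name__ == "__main__":
    for word in extract_words(SAMPLE):
        print(word["text"], word["bbox"], word["conf"])
```

A consumer that only wants reliable text could then filter on the "conf"
value, while one interested in layout could use the "bbox" coordinates and
ignore the text entirely.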

But, not to be misunderstood, we have defined hOCR and hBIB not as
alternatives to existing microformats, but because we needed to represent
this information and didn't have a good way of doing that with existing
formats at all.  I think both formats will also be useful for traditional
semantic web applications; it's just that this hasn't been driving their
development.

Tom

Received on Thursday, 29 March 2007 00:59:10 UTC