Re: [uf-new] announcing the hOCR and hBIB microformats

On 28/03/07, Thomas Breuel <tmbdev@gmail.com> wrote:
> I should mention that the motivation for our work is not the semantic web;
> that is, while it may be a nice side effect of hOCR and hBIB that OCR and
> bibliographic information can be extracted from on-line web pages and
> processed further, our primary use of HTML is as a markup language, not a
> hypertext language.

I'm a little confused by the need for anything like microformats here.
HTML is already a markup language; microformats augment it by
providing a way to embed explicit data. But I suspect that treating
the ability to extract data as a useful side effect, rather than as an
end in itself, is the right way around in situations like this.

> In particular, hBIB is intended primarily to encode bibliographic entries
> for the purposes of publishing, rendering, and statistical content analysis,
> not for the purpose of non-statistical automated processing usually
> envisioned by the semantic web community.  For example, hBIB does not force
> identification of separate authors (although it can represent it when
> available), simply because the tools generating hBIB may not be able to do
> so for every document.

Hmm, my feeling would be that if the additional information is worth
including in the documents, then it would very likely be useful to
semantic web applications. Not being able to identify separate
authors isn't necessarily a problem: RDF and related technologies are
designed so that they can still be useful with partial information.
But I take your point about statistical analysis - that's something
I've heard little about around the semweb community (except maybe on
the fringes of search).
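
To illustrate the partial-information point, here is a minimal Python
sketch. It is not hBIB or any real RDF toolkit: the entry dictionaries,
the `to_triples` and `titles_by_creator` helpers, and the
`ex:authorText` property are all made up for the example, and
`dc:title` / `dc:creator` merely stand in for Dublin Core-style
properties. The idea is only that statements are emitted for whatever
is known, and a consumer works with what it finds.

```python
# Hypothetical example data: one entry whose authors were separated by
# the generating tool, and one where only an unsplit string is available.
entries = [
    {"id": "ref1", "title": "A Paper", "authors": ["A. Author", "B. Writer"]},
    {"id": "ref2", "title": "Another Paper",
     "author_text": "C. Person and D. Someone"},
]


def to_triples(entry):
    """Emit (subject, predicate, object) triples only for fields present."""
    subject = entry["id"]
    triples = []
    if "title" in entry:
        triples.append((subject, "dc:title", entry["title"]))
    # Separate authors, when the tool could identify them.
    for name in entry.get("authors", []):
        triples.append((subject, "dc:creator", name))
    # Otherwise keep the unsplit author string instead of dropping it.
    # "ex:authorText" is an invented property for this sketch.
    if "authors" not in entry and "author_text" in entry:
        triples.append((subject, "ex:authorText", entry["author_text"]))
    return triples


def titles_by_creator(triples, name):
    """Return titles of entries that list `name` as a separate creator."""
    subjects = {s for s, p, o in triples if p == "dc:creator" and o == name}
    return sorted(o for s, p, o in triples if p == "dc:title" and s in subjects)


all_triples = [t for e in entries for t in to_triples(e)]
```

A query such as `titles_by_creator(all_triples, "A. Author")` still
works for the entry with separated authors, while the other entry is
simply not matched; its title and raw author string remain available
for rendering or statistical processing.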

Cheers,
Danny.

-- 

http://dannyayers.com

Received on Wednesday, 28 March 2007 19:50:34 UTC