- From: Danny Ayers <danny.ayers@gmail.com>
- Date: Wed, 28 Mar 2007 21:50:30 +0200
- To: "Thomas Breuel" <tmbdev@gmail.com>
- Cc: SW-forum <semantic-web@w3.org>
On 28/03/07, Thomas Breuel <tmbdev@gmail.com> wrote:

> I should mention that the motivation for our work is not the semantic web;
> that is, while it may be a nice side effect of hOCR and hBIB that OCR and
> bibliographic information can be extracted from on-line web pages and
> processed further, our primary use of HTML is as a markup language, not a
> hypertext language.

I'm a little confused by the need for anything like microformats if that's the case: HTML is already a markup language, and what microformats add is a way of embedding explicit data. But I suspect treating the ability to extract data as a useful side effect, rather than as an end in itself, is the right way around in situations like this.

> In particular, hBIB is intended primarily to encode bibliographic entries
> for the purposes of publishing, rendering, and statistical content analysis,
> not for the purpose of non-statistical automated processing usually
> envisioned by the semantic web community. For example, hBIB does not force
> identification of separate authors (although it can represent it when
> available), simply because the tools generating hBIB may not be able to do
> so for every document.

Hmm, my feeling would be that if the additional information is worth including in the documents, then it would very likely be useful to semantic web applications. Missing details such as the identification of separate authors aren't necessarily a problem; RDF and related technologies are designed so they can still be useful with partial information.

But I take your point about statistical analysis. That's something I've heard little about around the semweb community (except maybe on the fringes of search).

Cheers,
Danny.

-- 
http://dannyayers.com
Received on Wednesday, 28 March 2007 19:50:34 UTC