
Fwd: [uf-new] announcing the hOCR and hBIB microformats

From: Danny Ayers <danny.ayers@gmail.com>
Date: Wed, 28 Mar 2007 11:49:24 +0200
Message-ID: <1f2ed5cd0703280249u34065d5ai463efeeff9bf6c68@mail.gmail.com>
To: SW-forum <semantic-web@w3.org>
Cc: tmbdev@gmail.com

fyi, it may be useful to map the formats described below to RDF domain
models, ideally so that the GRDDL [1] mechanisms can provide automatic
extraction of the data. I believe bibliographic metadata is already
pretty thoroughly covered by existing vocabularies (there's one
specifically for BibTeX, right?), but haven't heard of anything for
OCR.

Thomas, if you're interested, all this would need at your end is the
allocation of a profile URI for each of the formats, which is in line
with accepted practice for microformats [2].
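
As a rough illustration of what a profile URI gives a consumer, the
sketch below reads the space-separated profile attribute that HTML 4
allows on the head element. The URI in SAMPLE is hypothetical, and the
names ProfileCollector and profile_uris are made up for this example;
no profile URI has actually been allocated for either format here.

```python
from html.parser import HTMLParser


class ProfileCollector(HTMLParser):
    """Collect the URIs listed in the profile attribute of <head>."""

    def __init__(self):
        super().__init__()
        self.profiles = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            # HTML 4 permits a space-separated list of profile URIs.
            value = dict(attrs).get("profile") or ""
            self.profiles.extend(value.split())


def profile_uris(html_text):
    """Return the profile URIs declared by an HTML document, in order."""
    collector = ProfileCollector()
    collector.feed(html_text)
    return collector.profiles


# Hypothetical profile URI, for illustration only.
SAMPLE = (
    '<html><head profile="http://example.org/profiles/hocr">'
    "<title>sample</title></head><body></body></html>"
)

if __name__ == "__main__":
    print(profile_uris(SAMPLE))
```

A GRDDL-aware agent could then dereference each URI it finds to look
for a transformation, which is the step that depends on the profile
document being published.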

(Note that, strictly speaking, these aren't microformats, as they
haven't been through the process described at microformats.org; the
accepted phrase there in such a case is 'semantic HTML'.)

Cheers,
Danny.

[1] http://www.w3.org/TR/grddl/
[2] http://microformats.org/wiki/profile-uris

---------- Forwarded message ----------
From: Thomas Breuel <tmbdev@gmail.com>
Date: 28-Mar-2007 09:25
Subject: [uf-new] announcing the hOCR and hBIB microformats
To: microformats-new@microformats.org


We're currently developing a new open source OCR system, with a focus
on digital library applications (www.ocropus.org).  As part of this,
we needed formats for representing both OCR output and bibliographic
metadata, and we have defined two new microformats for this purpose:
hOCR and hBIB.

hOCR is a format for representing OCR output, including layout
information, character confidences, bounding boxes, and style
information. It embeds this information invisibly in standard HTML. By
building on standard HTML, it automatically inherits well-defined
support for most scripts, languages, and common layout options.
Furthermore, unlike previous OCR formats, the recognized text and
OCR-related information co-exist in the same file and survive editing
and manipulation. hOCR markup is independent of the presentation.
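
To make the idea of embedding OCR information in HTML concrete, here
is a minimal sketch that pulls bounding boxes out of hOCR-style
markup. It assumes the convention, not shown in this announcement,
that OCR elements carry a class starting with "ocr" and a title
attribute of the form "bbox x0 y0 x1 y1"; the sample markup and the
names BBoxCollector and hocr_boxes are illustrative only.

```python
import re
from html.parser import HTMLParser

# Assumed convention: title="bbox x0 y0 x1 y1" with integer pixel coordinates.
BBOX = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class BBoxCollector(HTMLParser):
    """Collect (class name, bounding box) pairs from hOCR-style elements."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        ocr_classes = [c for c in classes if c.startswith("ocr")]
        match = BBOX.search(attributes.get("title") or "")
        if ocr_classes and match:
            box = tuple(int(group) for group in match.groups())
            self.boxes.append((ocr_classes[0], box))


def hocr_boxes(html_text):
    """Return the bounding boxes found in the document, in document order."""
    collector = BBoxCollector()
    collector.feed(html_text)
    return collector.boxes


# Illustrative markup, not taken from the hOCR samples.
SAMPLE = (
    '<div class="ocr_page" title="bbox 0 0 800 1200">'
    '<span class="ocr_line" title="bbox 10 20 400 50">Hello world</span>'
    "</div>"
)

if __name__ == "__main__":
    for name, box in hocr_boxes(SAMPLE):
        print(name, box)
```

Because the boxes live in ordinary attributes, a browser still renders
the sample as plain text, which is the sense in which the OCR data is
invisible.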

The hBIB format is a microformat that makes it easy to indicate both
where a document has been published and which references are
stored within the document (e.g., for reference lists).  It is a
straightforward embedding of BibTeX into HTML and should also be
useful for making available reference lists and embedding citation
information into the output of tools like latex2html.

We're starting to make available tools and samples for both formats at:

http://code.google.com/p/hocr-tools

http://code.google.com/p/hbib-tools

Cheers,
Thomas.




_______________________________________________
microformats-new mailing list
microformats-new@microformats.org
http://microformats.org/mailman/listinfo/microformats-new



-- 

http://dannyayers.com
Received on Wednesday, 28 March 2007 09:49:28 UTC
