- From: Danny Ayers <danny.ayers@gmail.com>
- Date: Wed, 28 Mar 2007 11:49:24 +0200
- To: SW-forum <semantic-web@w3.org>
- Cc: tmbdev@gmail.com
FYI, it may be useful to map the formats described below to RDF domain models, ideally so that the GRDDL [1] mechanisms can provide automatic extraction of the data. I believe bibliographic metadata is already pretty thoroughly covered by existing vocabularies (there's one specifically for BibTeX, right?), but I haven't heard of anything for OCR.

Thomas, if you're interested, all this would need at your end is the allocation of a profile URI for each of the formats, which is in line with accepted practice for microformats [2].

(Note that strictly speaking these aren't microformats, as they haven't been through the process described at microformats.org; the accepted phrase there in such a case is 'semantic HTML'.)

Cheers,
Danny.

[1] http://www.w3.org/TR/grddl/
[2] http://microformats.org/wiki/profile-uris

---------- Forwarded message ----------
From: Thomas Breuel <tmbdev@gmail.com>
Date: 28-Mar-2007 09:25
Subject: [uf-new] announcing the hOCR and hBIB microformats
To: microformats-new@microformats.org

We're currently developing a new open source OCR system, with a focus on digital library applications (www.ocropus.org). As part of this, we needed formats for representing both OCR output and bibliographic metadata, and we have defined two new microformats for this purpose: hOCR and hBIB.

hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML. By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and OCR-related information co-exist in the same file and survive editing and manipulation. hOCR markup is independent of the presentation.
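
To make "embeds this information invisibly in standard HTML" concrete, here is a minimal sketch of pulling bounding boxes back out of such a page. The announcement itself does not spell out the markup, so this assumes the convention used in the published hOCR specification: elements carry a class such as ocr_line and a title attribute holding properties like "bbox x0 y0 x1 y1", separated by semicolons. The names HocrBoxParser, extract_boxes and SAMPLE are made up for illustration and are not part of hocr-tools.

```python
from html.parser import HTMLParser


class HocrBoxParser(HTMLParser):
    """Collect (class, bbox) pairs from elements whose title holds a bbox."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        title = attr.get("title") or ""
        # Assumption: properties inside the title are separated by ';'.
        for prop in title.split(";"):
            parts = prop.split()
            if len(parts) == 5 and parts[0] == "bbox":
                coords = tuple(int(p) for p in parts[1:])
                self.boxes.append((attr.get("class"), coords))


def extract_boxes(html_text):
    """Return a list of (class, (x0, y0, x1, y1)) found in html_text."""
    parser = HocrBoxParser()
    parser.feed(html_text)
    return parser.boxes


# Hypothetical one-line sample, not taken from the hocr-tools distribution.
SAMPLE = '<span class="ocr_line" title="bbox 10 20 300 45">Hello world</span>'
print(extract_boxes(SAMPLE))  # [('ocr_line', (10, 20, 300, 45))]
```

Because the OCR data lives in ordinary attributes, a browser still renders the page as plain text, while a tool like the one above can recover the layout.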

The hBIB format is a microformat that makes it easy to indicate both where a document has been published and which references are stored within the document (e.g., for reference lists). It is a straightforward embedding of BibTeX into HTML and should also be useful for making reference lists available and for embedding citation information into the output of tools like latex2html.

We're starting to make available tools and samples for both formats at:

http://code.google.com/p/hocr-tools
http://code.google.com/p/hbib-tools

Cheers,
Thomas.

_______________________________________________
microformats-new mailing list
microformats-new@microformats.org
http://microformats.org/mailman/listinfo/microformats-new

--
http://dannyayers.com
Received on Wednesday, 28 March 2007 09:49:28 UTC