
Re: Publishing in PDF

From: Chris Beer <chris-beer@grapevine.net.au>
Date: Wed, 17 Mar 2010 20:40:36 +1100
Message-ID: <4BA0A394.3060108@grapevine.net.au>
To: Gannon Dick <gannon_dick@yahoo.com>, W3C e-Gov IG <public-egov-ig@w3.org>
CC: Rachel Flagg <rachel.flagg@gsa.gov>, Sandro Hawke <sandro@w3.org>
Hi Gannon

Sorry for my tardy reply - making up for it here. :)

I've copied this to the list as well, since it may be of interest to
others. I'm also re-including your earlier links on redaction:

http://www.fas.org/sgp/othergov/dod/nsa-redact.pdf

and

http://www.rustprivacy.org/RedactionOfPII.zip

I'm looking forward to your post. Redaction as a way to sanitise or de-identify document drafts and classified material is of obvious importance to anyone working in the government sphere, but I suspect it could also be useful for creating metadata-rich government publications that are free of unwanted or messy markup and vendor-specific information (such as MSO tags). Coincidentally, we were chewing on the issue of converting PDF to HTML on the fly at work today, so I for one would be very keen to explore any options you intend to present in that regard.
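
As a rough illustration of the kind of markup cleanup meant here (this is not Gannon's method, which has not been posted yet), a minimal Python sketch could re-emit an HTML page while dropping `<meta>` elements, comments, namespaced Office elements such as `<o:p>`, and `Mso`/`mso-` classes and style declarations. The names `Scrubber` and `scrub`, and the rules for what counts as vendor-specific markup, are assumptions made for this example only.

```python
from html import escape
from html.parser import HTMLParser


class Scrubber(HTMLParser):
    """Re-emit HTML, dropping <meta> tags and Office-specific markup.

    Hypothetical helper: the filtering rules below are assumptions,
    not a published redaction method.
    """

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    @staticmethod
    def _dropped(tag):
        # <meta> carries document metadata; tags such as <o:p> are Office-specific.
        return tag == "meta" or ":" in tag

    @staticmethod
    def _clean_attrs(attrs):
        kept = []
        for name, value in attrs:
            if name.startswith("xmlns:"):
                continue  # namespace declarations for the dropped elements
            if value is not None and name == "class":
                value = " ".join(c for c in value.split() if not c.startswith("Mso"))
                if not value:
                    continue
            elif value is not None and name == "style":
                decls = [d.strip() for d in value.split(";")]
                value = "; ".join(d for d in decls if d and not d.lower().startswith("mso-"))
                if not value:
                    continue
            kept.append((name, value))
        return kept

    def _emit_open(self, tag, attrs, suffix):
        if self._dropped(tag):
            return
        parts = [tag]
        for name, value in self._clean_attrs(attrs):
            parts.append(name if value is None else f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + suffix)

    def handle_starttag(self, tag, attrs):
        self._emit_open(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_open(tag, attrs, " />")

    def handle_endtag(self, tag):
        if not self._dropped(tag):
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    # Comments and processing instructions use the default no-op handlers,
    # so they (including Office conditional comments) are dropped.


def scrub(html_text):
    """Return html_text with metadata and Office-specific markup removed."""
    parser = Scrubber()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


sample = (
    '<html><head><meta name="author" content="Jane Doe"><title>Report</title></head>'
    '<body><p class="MsoNormal" style="mso-line-height:1; color:red">Hello<o:p></o:p></p>'
    "</body></html>"
)
print(scrub(sample))
```

A parser-based pass is used here rather than regular expressions so that attribute values and nested elements are handled consistently. It does not touch visible text in the body, so any PII there would still need a separate redaction step.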

However, I'd also like to highlight this in the broader context of PII redaction in PSI (public sector information) for the Web Tech project, since it is clearly a necessary part of Best Practice within the project's scope. I would be keen to see you expand on that aspect for us.

Thank you for your contributions - keep them coming!

Cheers

Chris


On 17/03/2010 9:15 AM, Gannon Dick wrote:
> Chris,
>
> Regarding your project for publishing PDFs:
>
> I'll be posting, hopefully before the end of the month, a method to redact metadata from documents authored in OpenOffice.  This is for the case where you would like a PDF of the source, but a 1) (throwaway source) HTML (web page) and 2) (redacted) HTML (web page).  One "problem" is that in the default HTML structure some metadata is "hidden" (html/head/meta) and some is "visible" (html/body), yet all of it should be "exposed".  When you are redacting, you don't want to mess with this browser-specific feature. One has to fiddle with the HTML flow model, but I have an XSD Schema based on XHTML 1.0 Strict which will validate the extra attributes marking the various elements (cite, abbr, acronym, etc.) in the Personally Identifiable Information Namespace <http://purl.org/pii/terms/>.  It's a nice way to produce rich text exemplars for datasets you are thinking of putting on-line.
>
> Let me know if you want some samples
>
>   --Gannon
>    
Received on Wednesday, 17 March 2010 09:40:50 GMT
