
RE: PDF alternative using HTML (proposal)

From: Crispin Cowan <crispin@microsoft.com>
Date: Sun, 17 Jan 2016 10:18:57 +0000
To: Craig Francis <craig@craigfrancis.co.uk>
CC: Adrian Hope-Bailie <adrian@hopebailie.com>, "public-webappsec@w3.org" <public-webappsec@w3.org>
Message-ID: <BN3PR0301MB12203E20711C59997629970CBDCF0@BN3PR0301MB1220.namprd03.prod.outlook.com>
And ... what? You cited 2 solutions to your problem space (DOCX, and EPUB3, which I am unfamiliar with) and complain that both are more complex than you want them to be. Maybe that is the inherent complexity of the space? If not, then come back with a proposal that is demonstrably simpler than the existing solutions. But be clear about the problems you are solving and the problems you are excluding. In particular, DOCX and PDF are *precise* about document layout, and HTML is not. Are you seeking e-mailable marked-up information? Or e-mailable formatted pages that I can print and get exactly the same layout that someone else does?

-----Original Message-----
From: Craig Francis [mailto:craig@craigfrancis.co.uk] 
Sent: Sunday, January 17, 2016 2:07 AM
To: Crispin Cowan <crispin@microsoft.com>
Cc: Adrian Hope-Bailie <adrian@hopebailie.com>; public-webappsec@w3.org
Subject: Re: PDF alternative using HTML (proposal)

Thanks Crispin, but if you have looked at the docx standard, it really is very difficult to work with.

I was hoping to take the HTML/CSS that we all know and love, and package it into a single file using a technology that we also already know and love, and get the browsers to display it in a way we are all familiar with, in a nice secure way (where the security part of this is the bit that would need most discussion).

That said, Ivan at the Digital Publishing IG believes the EPUB3 standard is the answer, which I need to look at again, but I feel that's falling into the same trap of just being an overly complicated solution for what most developers want (good for ebooks though).

Craig



> On 17 Jan 2016, at 06:33, Crispin Cowan <crispin@microsoft.com> wrote:
> 
> Just FYI, Microsoft .docx is a standard called Office Open XML 
> https://en.wikipedia.org/wiki/Office_Open_XML
> 
> So if you want to take the approach that Office did, then done! 
> 
> -----Original Message-----
> From: Craig Francis [mailto:craig@craigfrancis.co.uk]
> Sent: Thursday, January 14, 2016 2:40 AM
> To: Wendy Seltzer <wseltzer@w3.org>
> Cc: Adrian Hope-Bailie <adrian@hopebailie.com>; 
> public-webappsec@w3.org
> Subject: Re: PDF alternative using HTML (proposal)
> 
> Thanks Wendy,
> 
> I must confess I didn't look at the other Groups, but have just posted (after trying to get used to the volume of emails in that group).
> 
> The reason I started the post here was because the current alternatives (HTML with inline resources, or MHTML) already exist, and fail completely at security, so I'm hoping this solution will focus on that.
> 
> Craig
> 
> 
> https://lists.w3.org/Archives/Public/public-digipub-ig/2016Jan/0089.html
> 
> 
> 
> 
> 
>> On 12 Jan 2016, at 14:14, Wendy Seltzer <wseltzer@w3.org> wrote:
>> 
>> Hi Craig and Adrian,
>> 
>> You may want to bring this discussion to the Digital Publishing IG, 
>> https://www.w3.org/dpub/IG/wiki/Main_Page
>> 
>> While the security considerations of packaged documents could be 
>> in-scope for WebAppSec, the PDF alternative use cases are probably 
>> best developed elsewhere.
>> 
>> --Wendy
>> 
>>> On 01/12/2016 07:06 AM, Craig Francis wrote:
>>> From a web developer's point of view, my replies are below...
>>> 
>>> 
>>> 
>>>> On 12 Jan 2016, at 11:33, Adrian Hope-Bailie <adrian@hopebailie.com> wrote:
>>>> 
>>>> +1 - seems like something worth standardizing if browsers will standardize the security model that is applied to this browsing context.
>>>> 
>>>> Assumptions: 
>>>> ALL embedded resources would be packaged in the archive.
>>>> The script execution capabilities of this app would be severely limited (no network requests, for example).
>>> 
>>> 
>>> Yes to both, I think security/privacy is very important here.
>>> 
>>> If we start having documents that start reporting on when they are being opened (e.g. via JS or remote image), then people will probably avoid these documents (it needs to be better than PDF in this regard).
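The "remote image" concern above can be checked mechanically before a document is opened. What follows is only a rough sketch, not part of the proposal: `find_remote_references` and the attribute list are hypothetical, and it treats any URL with a scheme or host (including `data:` and `mailto:`) as leaving the archive. It does not look inside CSS `url(...)`, `srcset`, or script code, which a real viewer would also have to cover.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Attributes that commonly carry a URL (an illustrative, incomplete list).
URL_ATTRIBUTES = {"src", "href", "poster", "action"}


class RemoteReferenceFinder(HTMLParser):
    """Collect (tag, attribute, value) triples that point outside the archive."""

    def __init__(self):
        super().__init__()
        self.remote = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRIBUTES and value:
                parts = urlsplit(value)
                # A scheme ("https:") or a host ("//cdn.example") means the
                # reference is not a plain path inside the archive.
                if parts.scheme or parts.netloc:
                    self.remote.append((tag, name, value))


def find_remote_references(html: str) -> list:
    """Return every URL-bearing attribute that is not an archive-local path."""
    finder = RemoteReferenceFinder()
    finder.feed(html)
    finder.close()
    return finder.remote
```

A viewer (or a mail gateway) could refuse or warn when the returned list is non-empty, so a document cannot report that it has been opened.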
>>> 
>>> 
>>>> Observations:
>>>> 
>>>> "ability to change layout depending on screen size" means embedding resources for all supported screen sizes in the archive - how big could this archive get? Would be useful to try a few examples and see.
>>> 
>>> 
>>> If you are providing images (or dare I say videos), then this may increase the file size a bit, but it's an extra feature that can be used (and probably only in rare cases, like a badly imported image into a PDF).
>>> 
>>> Generally the strength of HTML/CSS is that it's text, so if anything the file size will probably be very good for the typical document.
>>> 
>>> 
>>>> I can see the tooling for this becoming quite powerful and ultimately allowing you to produce documents and slide decks that are far superior to those from existing proprietary formats.
>>> 
>>> 
>>> I think building of these documents would be excellent.
>>> 
>>> Developers could create a folder with index.html and style.css files, maybe some images, test locally, then zip up the folder and change the extension (the manual approach, but it works).
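The manual "zip up the folder and change the extension" step could also be scripted. This is a minimal sketch under assumptions: `package_document` is a hypothetical helper, the `.hta` extension is only the placeholder used in this thread, and the one requirement it enforces (an index.html at the top level) is the one suggested here rather than anything standardized.

```python
import zipfile
from pathlib import Path


def package_document(folder, output):
    """Zip every file under `folder` into `output` and return the stored names.

    `output` should live outside `folder`, otherwise the archive would
    try to include itself.
    """
    folder = Path(folder)
    if not (folder / "index.html").is_file():
        raise ValueError("folder must contain an index.html at its top level")
    names = []
    with zipfile.ZipFile(output, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                # Store paths relative to the folder, with forward slashes.
                name = path.relative_to(folder).as_posix()
                archive.write(path, arcname=name)
                names.append(name)
    return names
```

For example, `package_document("report", "report.hta")` would bundle a local `report/` folder that has already been tested in the browser.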
>>> 
>>> Users could also visit a website and do a "save page as" and not have to worry about missing images/resources (either because they only saved the HTML, or because the resources are typically put into a separate folder).
>>> 
>>> And systems that create documents, well they often use HTML to PDF generators already, and they are all pretty bad from my experience.
>>> 
>>> 
>>>> I would imagine that if I opened the file /tmp/html-document.hta it would open in my browser and the address bar would show file:///tmp/html-document.hta. Can I browse to other HTML files in the archive? And if so, what is their URL?
>>>> E.g. would the file example/otherfile.html inside the archive be at the URL file:///tmp/html-document.hta/example/otherfile.html?
>>> 
>>> 
>>> Personally I wouldn't be using multiple HTML files (I'm currently creating reports that are exported as PDFs, which don't have this ability)... but I don't see why that feature couldn't be included.
>>> 
>>> I like the idea of just appending onto the base path.
>>> 
>>> The HTML files themselves can then just do a <a href="../../example/otherfile.html"> to help during development/testing, or just use <a href="/example/otherfile.html">.
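To make the "append onto the base path" idea concrete, here is a small sketch of how a link could be resolved to a file inside the archive. Everything in it is an assumption for illustration: `resolve_member` and `document_url` are hypothetical names, a leading "/" is taken to mean the archive root, and "../" segments that would climb above the root are clamped to it. Query strings and fragments are ignored.

```python
import posixpath


def resolve_member(current: str, href: str) -> str:
    """Resolve `href`, found in archive member `current`, to a member name."""
    if href.startswith("/"):
        joined = href  # root-relative: start from the archive root
    else:
        joined = posixpath.join(posixpath.dirname(current), href)
    # Normalising against "/" collapses "." and "..", and cannot climb
    # above the archive root.
    normalized = posixpath.normpath("/" + joined.lstrip("/"))
    return normalized.lstrip("/")


def document_url(archive_url: str, member: str) -> str:
    """Append the member name onto the archive's own URL."""
    return f"{archive_url}/{member}"
```

With this, both link styles mentioned above lead to the same member: from `reports/2016/index.html`, `../../example/otherfile.html` and `/example/otherfile.html` each resolve to `example/otherfile.html`.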
>>> 
>>> 
>>>> I stole the .hta extension from Microsoft's HTML Applications (https://en.wikipedia.org/wiki/HTML_Application).
>>>> Similar idea with the opposite security principles and very little success as far as I know.
>>> 
>>> I found that someone else was proposing a "hdoc" extension:
>>> 
>>> http://hdoc.crzt.fr/www/co/hdoc.html
>>> 
>>> Although I think their proposal went a bit far including several meta files which I don't think are needed (just have the requirement of one index.html file).
>>> 
>>> Personally I don't think it matters which extension we choose :-)
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> 
>>>> On 12 January 2016 at 12:54, Craig Francis <craig@craigfrancis.co.uk> wrote:
>>>> Hi,
>>>> 
>>>> Recently I've been thinking of some of the problems with PDFs, which are useful for creating a document that can be archived, emailed, printed, etc.
>>>> 
>>>> HTML has solutions for many of PDF's problems though, for example structured text (accessibility), ability to change layout depending on screen size (no need for small screen devices to zoom into a fixed A4 layout), can change font size, better indexing support (searching for documents), etc.
>>>> 
>>>> Unfortunately you can't just email an HTML document to someone, as this causes a range of security problems, and including resources can be difficult (you can inline them, or use MHTML, but these are tricky to create).
>>>> 
>>>> So I was wondering if we could take the approach that Microsoft Word did with the docx format, Java with JAR, PHP with PHAR, etc...
>>>> 
>>>> Have a new file format, associated with the browser, which is just a ZIP/GZIP file that contains an index.html file, and everything else needed for the document.
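On the reading side, the only structural rule proposed so far is "a ZIP file that contains an index.html". A hedged sketch of that check follows; `is_html_document_archive` is a hypothetical name, and it only covers the ZIP case, not GZIP.

```python
import zipfile


def is_html_document_archive(path) -> bool:
    """Return True if `path` is a ZIP archive with a top-level index.html."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return "index.html" in archive.namelist()
```

A browser-side implementation would need far more than this (the sandboxing discussed below), but it shows how little the container format itself would have to specify.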
>>>> 
>>>> Then from a security point of view, it can be locked down to its own little box, so no access to other files on the file system, probably no access to cookies/localstorage, no ability to connect to another host (maybe).
>>>> 
>>>> And from the user's point of view, the document could be protected with a password (a feature that ZIP/GZIP provides already, and the browser can prompt for when opening).
>>>> 
>>>> So would this help with the security aspects of emailing HTML files to people (e.g. reports), and be better than PDFs?
>>>> 
>>>> Craig
>>>> 
>>>> 
>>>> https://code.google.com/p/chromium/issues/detail?id=575677
>>>> 
>>>> https://bugzilla.mozilla.org/show_bug.cgi?id=1237990
>> 
>> 
>> --
>> Wendy Seltzer -- wseltzer@w3.org +1.617.715.4883 (office) Policy 
>> Counsel and Domain Lead, World Wide Web Consortium (W3C)
>> http://wendy.seltzer.org/        +1.617.863.0613 (mobile)
> 
> 
Received on Sunday, 17 January 2016 10:19:31 UTC
