Re: PDF alternative using HTML (proposal)

Hi Crispin,

Fair points... I hope this reply answers them without being too long.

---

I agree that DOCX/PDF really focuses on precise document layout, which can be useful in some cases (printing), but not always desirable (e.g. trying to read an A4 report on a small/mobile screen, resulting in horizontal and vertical scrolling).

HTML/CSS can do precise(ish) layout, but it now really focuses on layouts that adapt to the viewport / user preferences (i.e. media queries, em font sizes, font colours/contrasts, etc).

So I want the flexibility that HTML/CSS can already provide, with the ability to send that document to someone, so they can view and store it (this is a requirement from most of my clients, so I don't think this is uncommon).

It will need to have the ability to be printed (like webpages can be), but I believe in most cases these documents are now viewed on computing devices (of varying screen sizes).

All of this is already covered by HTML/CSS... the problem is that you cannot email someone an HTML document, as that causes far too many security problems (see second email to Anders, and why I started posting in WebAppSec).

---

"...both are more complex than you want them to be"

There is a large number of developers that already create websites in HTML/CSS.

There are many tools to help with the development of HTML/CSS pages.

Many developers build their websites by starting off with an index.html and styles.css file in a folder, along with some images. This allows very quick/easy development.

All of this is quite simple, the only extra step for this proposed solution is to zip the folder, and change the file extension :-)
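
As a rough illustration of that manual step, here is a minimal Python sketch. It assumes nothing beyond what is described above: a folder with an index.html at its root, zipped into a single file. The function name `package_document` and the `.htmldoc` extension in the usage note are placeholders of mine, not part of any proposal.

```python
import zipfile
from pathlib import Path


def package_document(folder, output):
    """Zip a folder of HTML/CSS/images into one file (hypothetical helper)."""
    folder = Path(folder)
    output = Path(output)
    if not (folder / "index.html").is_file():
        raise ValueError("folder must contain an index.html")
    # Collect the files first, skipping the output file itself in case it
    # is being written inside the same folder.
    files = sorted(
        path for path in folder.rglob("*")
        if path.is_file() and path.resolve() != output.resolve()
    )
    with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Store names relative to the folder, so index.html sits at
            # the root of the archive.
            archive.write(path, path.relative_to(folder).as_posix())
    return output
```

For example, `package_document("report", "report.htmldoc")` would produce a single file that can be attached to an email; the extension is only a stand-in until one is agreed.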


DOCX on the other hand is a completely different document format that is heavily influenced by the history of MS Word. If you open one of these files, you will find several XML files, which follow a spec that is thousands of pages, and requires a lot of meta data that is not really needed.

PDF is also a difficult standard, with an odd document tree, encoded resources that become difficult to keep track of, sizes based on inches (points, 1/72 of an inch; rarely used outside of the US), co-ordinates from the bottom left of the page (so font size needs to be factored into positioning), and is typically created with automated systems that do ridiculous things like position every character individually (requiring accessibility tools to use heuristics to guess at the identification of words, lines, paragraphs, and columns).

EPUB3 is a solution that Ivan has only just mentioned to me, and I need to do some more digging, but it seems to be only loosely based on HTML, and currently requires the recipient to open it in an ebook reader (not ideal if you are sending a report to someone).

I should also add, there is quite a big demand from web developers to create PDF and DOCX files from HTML, because they already know and understand HTML/CSS.

---

The problems I am solving...

Many people want to send reports that are read-only (well, as much as these documents can be).

They do need to be more presentable than an email.

Email has a tendency to be edited, forwarded, cut, commented on within the text, etc... whereas an attached document is an atomic unit.

These documents can also be downloaded from websites for future reference (i.e. archiving purposes).

And must work when the user is offline, or the website is no longer available.

They need to be semantically marked up, both for the OS / Email client to index (i.e. searching), and for assistive technology to work with.

They do need to be printed, but rarely to pixel/pt perfect quality (more so for someone to take to a meeting to discuss).

They need to be easy to create by developers / systems (see above).

And most importantly, very secure, so that IT departments can feel very comfortable allowing them onto their network (I would argue that this document format would be preferred over things like PDF).
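
To make the security point a little more concrete, here is a small Python sketch of one check a mail gateway or viewer could run: list anything in index.html that would be fetched from outside the archive (e.g. a remote tracking image). This is only an assumption about what such a check might look like; the name `remote_resources` is hypothetical, it only looks at `src` attributes and `<link href>`, and it ignores CSS `url()` references and any other HTML files.

```python
import zipfile
from html.parser import HTMLParser
from urllib.parse import urlsplit


class _ResourceCollector(HTMLParser):
    """Collects URLs a page loads automatically (not ordinary <a> links)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name == "src" or (tag == "link" and name == "href"):
                self.urls.append(value)


def remote_resources(archive_path):
    """Return resource URLs in index.html that point outside the archive."""
    with zipfile.ZipFile(archive_path) as archive:
        html = archive.read("index.html").decode("utf-8", errors="replace")
    collector = _ResourceCollector()
    collector.feed(html)
    collector.close()
    remote = []
    for url in collector.urls:
        parts = urlsplit(url)
        # A host, or an http(s) scheme, means a network request.
        if parts.netloc or parts.scheme in ("http", "https"):
            remote.append(url)
    return remote
```

A document that only references files inside its own archive returns an empty list; anything else could be blocked or simply not loaded.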

---

The problems I am excluding...

Absolutely perfect rendering for printing (CSS is getting there, but it's not ready yet).

Setting a specific document size (e.g. A4).

Documents that can be edited without understanding the HTML/CSS... DOCX and OpenDocument kind of address that already.

---

I hope I've covered everything.

And just as an aside, I believe you work on the Edge browser, which I'd just like to say thank you for doing such a good job on (both for the security improvements and for the general approach the team is taking).

Craig







> On 17 Jan 2016, at 10:18, Crispin Cowan <crispin@microsoft.com> wrote:
> 
> And ... what? You cited 2 solutions to your problem space (DOCX, and EPUB3 which I am unfamiliar with) and complain that both are more complex than you want them to be. Maybe that is the inherent complexity of the space? If not, then come back with a proposal that is demonstrably simpler than the existing solutions. But be clear about the problems you are solving and the problems you are excluding. In particular, DOCX and PDF are *precise* about document layout, and HTML is not. Are you seeking e-mailable marked up information? Or e-mailable formatted pages that I can print and get exactly the same layout that someone else does?
> 
> -----Original Message-----
> From: Craig Francis [mailto:craig@craigfrancis.co.uk] 
> Sent: Sunday, January 17, 2016 2:07 AM
> To: Crispin Cowan <crispin@microsoft.com>
> Cc: Adrian Hope-Bailie <adrian@hopebailie.com>; public-webappsec@w3.org
> Subject: Re: PDF alternative using HTML (proposal)
> 
> Thanks Crispin, but if you have looked at the docx standard, it really is very difficult to work with.
> 
> I was hoping to take the HTML/CSS that we all know and love, and package it into a single file using a technology that we also already know and love, and get the browsers to display it in a way we are all familiar with, in a nice secure way (where the security part of this is the bit that would need most discussion).
> 
> That said, Ivan at the Digital Publishing IG believes the EPUB3 standard is the answer, which I need to look at again, but I feel that's falling into the same trap of just being an overly complicated solution for what most developers want (good for ebooks though).
> 
> Craig
> 
> 
> 
>> On 17 Jan 2016, at 06:33, Crispin Cowan <crispin@microsoft.com> wrote:
>> 
>> Just FYI, Microsoft .docx is a standard called Open XML 
>> https://en.wikipedia.org/wiki/Office_Open_XML
>> 
>> So if you want to take the approach that Office did, then done! 
>> 
>> -----Original Message-----
>> From: Craig Francis [mailto:craig@craigfrancis.co.uk]
>> Sent: Thursday, January 14, 2016 2:40 AM
>> To: Wendy Seltzer <wseltzer@w3.org>
>> Cc: Adrian Hope-Bailie <adrian@hopebailie.com>; 
>> public-webappsec@w3.org
>> Subject: Re: PDF alternative using HTML (proposal)
>> 
>> Thanks Wendy,
>> 
>> I must confess I didn't look at the other Groups, but have just posted (after trying to get used to the volume of emails in that group).
>> 
>> The reason I started the post here was because the current alternatives (HTML with inline resources, or MHTML) already exist, and fail completely at security, so I'm hoping this solution will focus on that.
>> 
>> Craig
>> 
>> 
>> https://lists.w3.org/Archives/Public/public-digipub-ig/2016Jan/0089.html
>> 
>> 
>> 
>> 
>> 
>>> On 12 Jan 2016, at 14:14, Wendy Seltzer <wseltzer@w3.org> wrote:
>>> 
>>> Hi Craig and Adrian,
>>> 
>>> You may want to bring this discussion to the Digital Publishing IG, 
>>> https://www.w3.org/dpub/IG/wiki/Main_Page
>>> 
>>> While the security considerations of packaged documents could be 
>>> in-scope for WebAppSec, the PDF alternative use cases are probably 
>>> best developed elsewhere.
>>> 
>>> --Wendy
>>> 
>>>> On 01/12/2016 07:06 AM, Craig Francis wrote:
>>>> From a web developers point of view, my replies are below...
>>>> 
>>>> 
>>>> 
>>>>> On 12 Jan 2016, at 11:33, Adrian Hope-Bailie <adrian@hopebailie.com> wrote:
>>>>> 
>>>>> +1 - seems like something worth standardizing if browsers will standardize the security model that is applied to this browsing context.
>>>>> 
>>>>> Assumptions:
>>>>> ALL embedded resources would be packaged in the archive.
>>>>> The script execution capabilities of this app would be severely limited (no network requests for example).
>>>> 
>>>> 
>>>> Yes to both, I think security/privacy is very important here.
>>>> 
>>>> If we start having documents that start reporting on when they are being opened (e.g. via JS or remote image), then people will probably avoid these documents (it needs to be better than PDF in this regard).
>>>> 
>>>> 
>>>>> Observations:
>>>>> 
>>>>> "ability to change layout depending on screen size" means embedding resources for all supported screen sizes in the archive - how big could this archive get? Would be useful to try a few examples and see.
>>>> 
>>>> 
>>>> If you are providing images (or dare I say videos), then this may increase the file size a bit, but it's an extra feature that can be used (and probably only in rare cases, like a badly imported image into a PDF).
>>>> 
>>>> Generally the strength of HTML/CSS is that it's text, so if anything the file size will probably be very good for the typical document.
>>>> 
>>>> 
>>>>> I can see the tooling for this becoming quite powerful and ultimately allowing you to produce documents and slide decks that are far superior to those from existing proprietary formats.
>>>> 
>>>> 
>>>> I think building of these documents would be excellent.
>>>> 
>>>> Developers could create a folder with index.html and style.css files, maybe some images, test locally, then zip up the folder and change the extension (the manual approach, but it works).
>>>> 
>>>> Users could also visit a website and do a "save page as" and not have to worry about missing images/resources (either because they only saved the HTML, or because the resources are typically put into a separate folder).
>>>> 
>>>> And systems that create documents, well they often use HTML to PDF generators already, and they are all pretty bad from my experience.
>>>> 
>>>> 
>>>>> I would imagine that if I opened the file /tmp/html-document.hta it 
>>>>> would open in my browser and the address bar would show file:///temp/html-document.hta Can I browse to other HTML files in the archive? And if so what is their URL?
>>>>> E.g. Would the file example/otherfile.html inside the archive be at the URL file:///temp/html-document.hta/example/otherfile.html ?
>>>> 
>>>> 
>>>> Personally I wouldn't be using multiple HTML files (I'm currently creating reports that are exported as PDF's, which don't have this ability)... but I don't see why that feature couldn't be included.
>>>> 
>>>> I like the idea of just appending onto the base path.
>>>> 
>>>> The HTML files themselves can then just do a <a href="../../example/otherfile.html"> to help during development/testing, or just use <a href="/example/otherfile.html">.
>>>> 
>>>> 
>>>>> I stole the .hta extension from Microsoft's HTML Applications (https://en.wikipedia.org/wiki/HTML_Application <https://en.wikipedia.org/wiki/HTML_Application>).
>>>>> Similar idea with the opposite security principles and very little 
>>>>> success as far as I know
>>>> 
>>>> I found that someone else was proposing a "hdoc" extension:
>>>> 
>>>> http://hdoc.crzt.fr/www/co/hdoc.html
>>>> 
>>>> Although I think their proposal went a bit far including several meta files which I don't think are needed (just have the requirement of one index.html file).
>>>> 
>>>> Personally I don't think it matters which extension we choose :-)
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>> On 12 January 2016 at 12:54, Craig Francis <craig@craigfrancis.co.uk <mailto:craig@craigfrancis.co.uk>> wrote:
>>>>> Hi,
>>>>> 
>>>>> Recently I've been thinking of some of the problems with PDF's, which are useful for creating a document that can be archived, emailed, printed, etc.
>>>>> 
>>>>> HTML has solutions for many of PDF's problems though, for example structured text (accessibility), ability to change layout depending on screen size (no need for small screen devices to zoom into a fixed A4 layout), can change font size, better indexing support (searching for documents), etc.
>>>>> 
>>>>> Unfortunately you can't just email a HTML document to someone, as this causes a range of security problems, and including resources can be difficult (you can inline them, or use MHTML, but these are tricky to create).
>>>>> 
>>>>> So I was wondering if we could take the approach that Microsoft Word did with the docx format, Java with JAR, PHP with PHAR, etc...
>>>>> 
>>>>> Have a new file format, associated with the browser, which is just a ZIP/GZIP file that contains an index.html file, and everything else needed for the document.
>>>>> 
>>>>> Then from a security point of view, it can be locked down to its own little box, so no access to other files on the file system, probably no access to cookies/localstorage, no ability to connect to another host (maybe).
>>>>> 
>>>>> And from the users point of view, the document could be protected with a password (a feature that ZIP/GZIP provides already, and the browser can prompt for when opening).
>>>>> 
>>>>> So would this help with the security aspects of emailing HTML files to people (e.g. reports), and be better than PDFs?
>>>>> 
>>>>> Craig
>>>>> 
>>>>> 
>>>>> https://code.google.com/p/chromium/issues/detail?id=575677
>>>>> 
>>>>> https://bugzilla.mozilla.org/show_bug.cgi?id=1237990
>>> 
>>> 
>>> --
>>> Wendy Seltzer -- wseltzer@w3.org +1.617.715.4883 (office) Policy 
>>> Counsel and Domain Lead, World Wide Web Consortium (W3C)
>>> http://wendy.seltzer.org/        +1.617.863.0613 (mobile)
>> 
>> 

Received on Monday, 18 January 2016 11:40:37 UTC