Re: [w3c/clipboard-apis] Make async clipboard APIs (read/write) to sanitize interoperably with setData/getData for text/html (#150)

We want to standardize the clipboard sanitization procedure because it has caused [issues](https://bugs.chromium.org/p/chromium/issues/detail?id=121163#c31) for many popular apps that use async clipboard HTML read/write. The [spec says](https://w3c.github.io/clipboard-apis/#dom-clipboard-write) that during write we should only write the sanitized payload of the item, but it doesn't say what the sanitization process looks like or what kind of data is expected in the clipboard item. For example, when writing an HTML payload using the async clipboard write, what should the format of the HTML markup string be? Should it be a document fragment or a complete HTML document?
The [read operation](https://w3c.github.io/clipboard-apis/#dom-clipboard-read) doesn't specify anything about sanitization, and I found that Chromium performs strict sanitization for both HTML read & write operations, which breaks sites like Excel online as mentioned in this [bug](https://bugs.chromium.org/p/chromium/issues/detail?id=121163#c31). I agree that we should at least strip out tags such as `script` and elements that have `javascript:` protocols associated with them, as they are harmful, but we should state explicitly which tags are stripped out in this process so that it sets expectations for the web developers & native apps that read these formats.

Also, FWIW, currently at least in Chromium & Firefox we write *unsanitized* HTML content to the clipboard when web developers use the `setData` method, and AFAIK there are no known security threats, as we still use strict sanitization when the user pastes the data into an editable region of a website. Web developers have a proper sanitization process in place when they query the HTML markup using `getData`[4].
Currently, websites like Office online use the `setData` & `getData` methods, which don't do any sanitization, to read/write HTML content and support high-fidelity copy/paste. Native apps (including Office) are already exposed to these types of HTML content from the clipboard. In the browser, we can perhaps do some level of sanitization during read (as it might affect the DOM when sites insert the fragment) if that is a concern, but we shouldn't aggressively sanitize content during write (to be in parity with the DataTransfer APIs).
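
For reference, here is a minimal sketch of the DataTransfer path described above. The `ClipboardEventLike` and `DataTransferLike` interfaces and both function names are made up for illustration; they only mirror the parts of `ClipboardEvent` and `DataTransfer` that the sketch touches, so it is not how any particular site implements this.

```typescript
// Hypothetical structural stand-ins for the browser's ClipboardEvent / DataTransfer,
// declared here so the sketch also compiles outside a browser.
interface DataTransferLike {
  setData(format: string, data: string): void;
  getData(format: string): string;
}
interface ClipboardEventLike {
  clipboardData: DataTransferLike | null;
  preventDefault(): void;
}

// Copy handler: put the app's own markup on the clipboard as text/html.
// The string is handed to setData as-is; no sanitization happens in this function.
// Returns true if the markup was handed to the clipboard.
function writeHtmlOnCopy(event: ClipboardEventLike, html: string): boolean {
  if (!event.clipboardData) return false;
  event.clipboardData.setData("text/html", html);
  // Without preventDefault() the browser's default copy of the selection wins.
  event.preventDefault();
  return true;
}

// Paste handler: read the raw text/html string from the clipboard.
// The caller is expected to sanitize the result before inserting it into the DOM.
function readHtmlOnPaste(event: ClipboardEventLike): string {
  if (!event.clipboardData) return "";
  event.preventDefault();
  return event.clipboardData.getData("text/html");
}
```

In a page these would be wired up with something like `document.addEventListener("copy", (e) => writeHtmlOnCopy(e, markup))`, where `markup` is whatever HTML string the site wants to expose.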

To better describe the sanitization & clipboard write process for an HTML payload, we propose the following:
1. Currently we process the HTML string as a document fragment, which basically ignores the content inside the `<head>` element. This leads to loss of styles[1], meta tags, etc. while parsing the document fragment. Instead, we propose to parse the string provided to the async write API as an HTML document, insert the start & end fragment comment tags within the `body` element, and then create a well-formed HTML document.
2. Use the Sanitizer API's default configuration[2] to sanitize the HTML document. This makes the sanitization process consistent with what is being proposed in the Sanitizer API[3].
3. Create the HTML header info (corresponding to each platform) and then write the HTML format to the clipboard.
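
To make the fragment markers in step 1 and the header in step 3 concrete, here is a rough sketch for the Windows case only, assuming the "HTML Format" (CF_HTML) layout documented by Microsoft. Parsing and sanitizing (steps 1 and 2) are assumed to have already happened; `buildCfHtml`, `utf8Length`, and the other names are hypothetical helpers, not an existing API, and other platforms use different header info.

```typescript
const START_MARK = "<!--StartFragment-->";
const END_MARK = "<!--EndFragment-->";

// CF_HTML offsets are UTF-8 byte offsets, so count bytes rather than UTF-16 code units.
function utf8Length(s: string): number {
  let bytes = 0;
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c < 0x80) {
      bytes += 1;
    } else if (c < 0x800) {
      bytes += 2;
    } else if (c >= 0xd800 && c <= 0xdbff) {
      const next = s.charCodeAt(i + 1); // NaN at end of string
      if (next >= 0xdc00 && next <= 0xdfff) {
        bytes += 4; // surrogate pair: one 4-byte code point
        i++;
      } else {
        bytes += 3; // lone surrogate, encoded as U+FFFD
      }
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

// Fixed-width offsets keep the header length independent of the offset values.
function pad10(n: number): string {
  let s = String(n);
  while (s.length < 10) s = "0" + s;
  return s;
}

function cfHtmlHeader(startHtml: number, endHtml: number, startFragment: number, endFragment: number): string {
  return (
    "Version:0.9\r\n" +
    "StartHTML:" + pad10(startHtml) + "\r\n" +
    "EndHTML:" + pad10(endHtml) + "\r\n" +
    "StartFragment:" + pad10(startFragment) + "\r\n" +
    "EndFragment:" + pad10(endFragment) + "\r\n"
  );
}

// headMarkup: (already sanitized) content of <head>, e.g. <style> and <meta> tags.
// bodyMarkup: (already sanitized) content of <body>, i.e. the copied fragment.
function buildCfHtml(headMarkup: string, bodyMarkup: string): string {
  const prefix = "<html><head>" + headMarkup + "</head><body>" + START_MARK;
  const suffix = END_MARK + "</body></html>";
  // The header has a constant length, so measure it once with placeholder offsets.
  const startHtml = utf8Length(cfHtmlHeader(0, 0, 0, 0));
  const startFragment = startHtml + utf8Length(prefix);
  const endFragment = startFragment + utf8Length(bodyMarkup);
  const endHtml = endFragment + utf8Length(suffix);
  return cfHtmlHeader(startHtml, endHtml, startFragment, endFragment) + prefix + bodyMarkup + suffix;
}
```

The point of the sketch is that the `<head>` content (styles, meta tags) survives in the clipboard payload, while the fragment markers still tell a consumer which part of the `body` was actually copied.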

[1] Here is a GIF that shows the style loss when a user copies and pastes from Excel online to the win32 Excel app: https://drive.google.com/file/d/1Nsyp1rUKc_NF4l0n-O05snAKabHAKeiG/view
[2] https://wicg.github.io/sanitizer-api/#default-configuration
[3] https://wicg.github.io/sanitizer-api/#sanitizer-algorithms
[4] https://github.com/ckeditor/ckeditor5/blob/9ad13ec94d77e10e4ddf678e86577acb107db3fb/packages/ckeditor5-clipboard/docs/framework/guides/deep-dive/clipboard.md

https://github.com/w3c/clipboard-apis/issues/150#issuecomment-909405090

Received on Tuesday, 31 August 2021 16:45:20 UTC