[whatwg] HTML-to-plaintext conversion (innerText and Selection.toString())

On Thu, Feb 3, 2011 at 3:15 PM, Boris Zbarsky <bzbarsky at mit.edu> wrote:
> OK...  See, that's the sort of behavior change for a DOM API that I don't
> think we should have.  Why do we want a DOM API which looks like a way to
> serialize the DOM but actually works totally differently in disconnected
> subtrees and a displayed document?

Well, so maybe if we were writing it now we'd call it something else,
but it already exists.  Some standard method of allowing authors to
programmatically serialize HTML to plaintext (based on how it appears,
not just the DOM) would be reasonable to have, right?

>> and Gecko returns the empty string when
>> you stringify a Selection that's not displayed.  This seems
>> unreasonable from an author perspective
>
> Well, what exactly would "reasonable" be?

Well.  If the algorithm is sophisticated enough, you're right, it's
not easy to say what would be reasonable.  Like if you accounted for
positioning and z-indexes and so on, you could conceive of examples
where the only way to figure out what's visible would be to actually
paint it.  If the algorithm is fairly simple, along the lines of the
one I've written so far, then it's straightforward to just apply it to
the nodes like always.  The latter is what IE seems to do.  Although
IE's algorithm is so simple that it doesn't even handle display: none.
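
To illustrate the "fairly simple" end of that spectrum, here is a toy
sketch.  It is not the draft algorithm mentioned above and not any
browser's implementation.  It assumes a made-up node shape (plain objects
with `text`, `display` and `children`) instead of the real DOM, and the
names toPlainText and htmlToPlainText are hypothetical.  It only shows
that a rule like "skip display: none, break lines around blocks" can be
applied to nodes without painting anything.

```javascript
// Toy node shape (an assumption for this sketch, not a real DOM):
//   text node:    { text: "..." }
//   element node: { display: "block" | "inline" | "none", children: [...] }
function toPlainText(node) {
  if (typeof node.text === "string") {
    return node.text;
  }
  if (node.display === "none") {
    return ""; // hidden subtrees contribute nothing
  }
  const inner = (node.children || []).map(toPlainText).join("");
  // Surround block-level content with line breaks; inline content runs together.
  return node.display === "block" ? "\n" + inner + "\n" : inner;
}

function htmlToPlainText(root) {
  // Collapse runs of newlines, then trim one leading and one trailing newline.
  return toPlainText(root).replace(/\n+/g, "\n").replace(/^\n|\n$/g, "");
}
```

Floats, positioning and z-index are deliberately ignored here, which is
the same simplification discussed further down.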

> Yes, I understand that's what Webkit does.  I just think it's a terrible
> idea.

Because innerText is a property on HTMLElement rather than a method on
some other interface, or for some other reason?

> Floating and absolutely positioned (in the CSS spec sense) elements.

I'm just not paying attention to whether elements float or are
positioned, which seems to be what everyone does right now for both
innerText and Selection.toString().  At least the differences seem
pretty trivial, like a matter of what leading/trailing whitespace is
emitted.

I mean, basically there's no way you're going to get anything close to
the complexity of what CSS can render reflected in plaintext.  At
least given that we're talking about a stream of text and not ASCII
art.  Yes, you could develop ever more refined heuristics if you
really wanted to, but I think there's a use for a standardized
HTML-to-plaintext algorithm that's accessible from JavaScript and that
handles 90% of the cases right without being too complicated.  I don't
think it's more useful to have a non-standardized algorithm, which is
the status quo.

> Whyever not?  I think browsers should be allowed to try to handle it in
> their selection implementations if they want to try!

If you have markup like a right float or right-aligned absolute
positioning, you can't handle it in a text stream because you don't
know the width of the output, so you have no way to figure out where
it should go.  (Assuming LTR, obviously.)  You could only do that kind
of thing if you were emitting ASCII art of known fixed width, like a
text browser.  Yes, of course, you could theoretically handle some
special cases of floats and absolute positioning nicely.

> Note that the "UI" you're looking at there is basically an accident.  ;)

It still works.  :)

> Why wouldn't you, if you can select it at all?

It would be nice to partially select generated content, but there's no
way to do it from a DOM-based API, is what I was trying to say.  (Of
course, this is irrelevant to innerText.)

> I should note that Gecko doesn't support innerText, and we haven't had a
> single bug report about it not working or request to implement it in the
> last 4 years.  So I question how widely used it is...  Maybe it's useful,
> but I'd need to understand the use cases first.  What are they?

It's pretty widely used, but the sites that use it mostly either do
some kind of crude browser detection (like !document.all . . .
whoops), or just ignore Firefox (mostly East Asian sites).  That said,
the overwhelming majority of uses seem like textContent would work
about as well, and the sites that use feature detection mostly seem to
substitute textContent.  Most cases seem to just set it, in fact.  It
could be the feature will mostly die now that IE9 supports
textContent.

(E.g., jQuery uses innerText in at least one place, but only if
textContent isn't present: elem.textContent || elem.innerText ||
getText([ elem ]) || "")
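
A minimal sketch of that feature-detection pattern, for illustration
only.  The helper name readText is hypothetical, and this is not jQuery's
code.  It checks the property type rather than using ||, so that an empty
textContent does not fall through to innerText.

```javascript
// Hypothetical helper: read an element's text, preferring textContent
// and falling back to innerText only when textContent is not supported.
function readText(elem) {
  if (typeof elem.textContent === "string") {
    return elem.textContent;
  }
  if (typeof elem.innerText === "string") {
    return elem.innerText;
  }
  return "";
}
```

In a page this would be called as readText(document.getElementById("x"));
the function only looks at the two properties, so any object that has
them works.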

I did find one interesting use case in the 150ish pages I looked at
containing innerText.  Two different pages used innerText to convert
an HTML comment to plaintext, in one case for an "add quote" button
for a Wordpress plugin and in the other case to create a tooltip of a
previous comment when you hover over the "in response to" marker.  (I
posted this at <https://bugzilla.mozilla.org/show_bug.cgi?id=264412#c14>.)
In principle, textContent could produce bad results in these cases,
if there were nontrivial markup present -- although in those two
specific cases it didn't.

> At least assuming anyone actually cares about the details of the values
> Selection.toString() produces. ?And if no one does, then we shouldn't be
> standardizing them, imo.

I know of at least one case where an author complained about inconsistency here:

http://www.mediawiki.org/wiki/User:Catrope/W3C_Range_feature_requests#Newline_handling_in_stringification_of_getSelection.28.29

He can't remember his exact use-case, unfortunately.  But for
author-visible JS APIs, consistency is almost always more valuable
than correctness.  Better to have all browsers do an okay job of
stringifying selections and do it the same, than to have some browsers
do a really good job (which I'm not convinced any of them will do
anyway) but all of them do it differently.

Received on Thursday, 3 February 2011 13:18:40 UTC