[whatwg] HTML-to-plaintext conversion (innerText and Selection.toString())

On 2/3/11 4:18 PM, Aryeh Gregor wrote:
> Well, so maybe if we were writing it now we'd call it something else,
> but it already exists.  Some standard method of allowing authors to
> programmatically serialize HTML to plaintext (based on how it appears,
> not just the DOM) would be reasonable to have, right?

Perhaps, yes.  Assuming it does an ok job of it.  And assuming the 
behavior when the content doesn't "appear" is sane.  (What's sane in 
that case?  Probably depends on whom you ask.)

> Well.  If the algorithm is sophisticated enough, you're right, it's
> not easy to say what would be reasonable.  Like if you accounted for
> positioning and z-indexes and so on, you could conceive of examples
> where the only way to figure out what's visible would be to actually
> paint it.  If the algorithm is fairly simple, along the lines of the
> one I've written so far, then it's straightforward to just apply it to
> the nodes like always.  The latter is what IE seems to do.  Although
> IE's algorithm is so simple that it doesn't even handle display: none.

Right, so for your purposes it's broken anyway.
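
As an illustration of the difference being discussed, here is a minimal sketch that contrasts a serializer that ignores style entirely with one that at least drops display: none subtrees. The node shape ({ text } for text nodes, { display, children } for elements) and both function names are hypothetical stand-ins for this example, not a real DOM API or any browser's actual algorithm.

```javascript
// Hypothetical node shape (an assumption for this sketch):
//   text node: { text: "..." }
//   element:   { display: "block" | "none" | ..., children: [...] }

// Concatenates all text in tree order and never looks at style.
function serializeIgnoringStyle(node) {
  if (typeof node.text === "string") return node.text;
  return node.children.map(serializeIgnoringStyle).join("");
}

// Same walk, but drops any subtree whose root has display: none.
function serializeSkippingHidden(node) {
  if (typeof node.text === "string") return node.text;
  if (node.display === "none") return "";
  return node.children.map(serializeSkippingHidden).join("");
}

const tree = {
  display: "block",
  children: [
    { text: "Visible. " },
    { display: "none", children: [{ text: "Hidden. " }] },
    { text: "Also visible." },
  ],
};

console.log(serializeIgnoringStyle(tree));  // "Visible. Hidden. Also visible."
console.log(serializeSkippingHidden(tree)); // "Visible. Also visible."
```

In a real browser the display value would have to come from computed style, which is where the cost of flushing style changes comes in.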

>> Yes, I understand that's what Webkit does.  I just think it's a terrible
>> idea.
>
> Because innerText is a property on HTMLElement rather than a method on
> some other interface

Mostly, yes.  As we're trying to define it here it's just not a concept 
that makes sense for all HTML elements.  Could we make it throw in the 
cases when it doesn't make sense?

(Also note that detecting whether it makes sense requires flushing out 
at least style changes, if not layout, depending on algorithm.)

> I'm just not paying attention to whether elements float or are
> positioned, which seems to be what everyone does right now for both
> innerText and Selection.toString().

And all I'm saying is that there are at least three pieces of data here:

1)  innerText return value
2)  Selection.toString() return value
3)  What the browser actually copies

My point is that browsers must be free to modify #3 as desired. 
Dictating it in a web spec is not acceptable, imo.

That implies that if we freeze #1 and #2, then they will in the 
future no longer match #3, even if they happen to now.  Which will force 
browsers to cart around two separate serializers; maybe that's ok. 
It'll also lead to calls for a spec for the new, better, behaviors and a 
way to get at their return value, I suspect.

> At least the differences seem
> pretty trivial, like a matter of what leading/trailing whitespace is
> emitted.

For now.  This stuff is being actively worked on, at least for Gecko, 
last I checked.

> I mean, basically there's no way you're going to get anything close to
> the complexity of what CSS can render reflected in plaintext.

Sure.  But right now we're more somewhere between "do something really 
dumb" and "do something mostly dumb" in terms of browser handling of 
this stuff.  You can get way better without being anywhere close to a 
faithful reproduction of the web page in plaintext.

> but I think there's a use for a standardized
> HTML-to-plaintext algorithm that's accessible from JavaScript and that
> handles 90% of the cases right without being too complicated.

Agreed, I think; but should that be Selection.toString() or some other 
API?  That is, are we hijacking Selection.toString() because it's 
convenient, or because it's the right way to expose such an algorithm?

> I don't think it's more useful to have a non-standardized algorithm, which is
> the status quo.

I think it's useful (more than that; required) to not standardize what 
browsers actually do in their _user_ visible behavior for this 
situation.  I also think that script-visible behavior that purports to 
produce "the same" results as user-visible behavior is a bad idea, since 
script-visible behavior does need to be standardized.

What I'm not sure about is how best to proceed given those opinions.

> If you have markup like a right float or right-aligned absolute
> positioning, you can't handle it in a text stream because you don't
> know the width of the output, so you have no way to figure out where
> it should go.

Perhaps the right answer is to leave it out entirely?  Quite often 
that's what I want to happen with floats when I copy a paragraph that 
includes them.
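
The "leave it out entirely" option can be sketched as a tree walk that skips floated subtrees. This is only an illustration under stated assumptions: the node shape ({ text } for text nodes, { cssFloat, children } for elements) and the function name serializeWithoutFloats are hypothetical, and nothing here reflects what any browser actually does when copying.

```javascript
// Hypothetical node shape (an assumption for this sketch):
//   text node: { text: "..." }
//   element:   { cssFloat: "none" | "left" | "right", children: [...] }

// Serializes in tree order, omitting any floated subtree entirely.
function serializeWithoutFloats(node) {
  if (typeof node.text === "string") return node.text;
  if (node.cssFloat && node.cssFloat !== "none") return "";
  return node.children.map(serializeWithoutFloats).join("");
}

const paragraph = {
  cssFloat: "none",
  children: [
    { text: "The quick " },
    { cssFloat: "right", children: [{ text: "[pull quote]" }] },
    { text: "brown fox." },
  ],
};

console.log(serializeWithoutFloats(paragraph)); // "The quick brown fox."
```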

>> Note that the "UI" you're looking at there is basically an accident.  ;)
>
> It still works.  :)

My point is that it working is an accident; it could break tomorrow and 
we would consider that ok.

> (E.g., jQuery uses innerText in at least one place, but only if
> textContent isn't present: elem.textContent || elem.innerText ||
> getText([ elem ]) || "")

Sure; I'm aware of uses like that, but they're irrelevant to 
non-quirks-IE browsers.
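
To make the fallback order concrete, here is a small sketch of the quoted expression run against plain mock objects rather than real DOM elements. The getText helper below is a hypothetical stand-in for the one named in the quote (its real implementation is not shown in this thread), and plainText is a name invented for this example.

```javascript
// Hypothetical stand-in for the getText helper in the quoted snippet:
// concatenates nodeValue of text-like nodes, recursing into childNodes.
function getText(nodes) {
  let out = "";
  for (const node of nodes) {
    if (typeof node.nodeValue === "string") out += node.nodeValue;
    else if (node.childNodes) out += getText(node.childNodes);
  }
  return out;
}

// The fallback chain from the quoted expression.
function plainText(elem) {
  return elem.textContent || elem.innerText || getText([elem]) || "";
}

// textContent wins whenever it is a non-empty string.
console.log(plainText({ textContent: "a", innerText: "b" })); // "a"
// innerText is consulted only when textContent is missing or empty.
console.log(plainText({ innerText: "b" }));                   // "b"
// With neither property, the manual walk is used.
console.log(plainText({ childNodes: [{ nodeValue: "c" }] })); // "c"
```

With this chain, a browser that supports textContent reaches innerText only when textContent is the empty string.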

> I did find one interesting use case in the 150ish pages I looked at
> containing innerText.  Two different pages used innerText to convert
> an HTML comment to plaintext, in one case for an "add quote" button
> for a Wordpress plugin and in the other case to create a tooltip of a
> previous comment when you hover over the "in response to" marker.

That does seem like a good use case, yes.

> But for author-visible JS APIs, consistency is almost always more valuable
> than correctness.

Depending on how you define "consistency" and "correctness".  We could 
have Selection.toString() consistently throw; would be simple to 
implement in UAs.  I think it would be less valuable to authors than 
what we have now, warts and all.

> Better to have all browsers do an okay job of
> stringifying selections and do it the same

Depending on your definition of "okay", yes.  I mean... we have an 
"okay" way that's interoperable now (I hope): Range.toString.  Except 
you don't think it does an okay job, clearly.  I agree on that; I don't 
necessarily agree that current browser Selection.toString does an "okay" 
job.

-Boris

Received on Thursday, 3 February 2011 13:41:01 UTC