Re: spec verbosity and algorithms from Maciej Stachowiak on 2010-01-23 (public-html@w3.org from January 2010)

From: Maciej Stachowiak <mjs@apple.com>
Date: Sat, 23 Jan 2010 13:17:35 -0800
To: "Tab Atkins Jr." <jackalmage@gmail.com>
Cc: Julian Reschke <julian.reschke@gmx.de>, Karl Dubost <karl@la-grange.net>, "public-html@w3.org" <public-html@w3.org>
Message-id: <FABE3510-B1CC-4293-B503-A5C1117E4B23@apple.com>
On Jan 23, 2010, at 6:58 AM, Tab Atkins Jr. wrote:

> On Sat, Jan 23, 2010 at 1:47 AM, Julian Reschke <julian.reschke@gmx.de> wrote:
>> Offlist I was asked why verbosity is a problem. The answer is really simple.
>> 
>> As a reader, I want to know when two origins are equal. If this is defined
>> in an 8-step algorithm, I'll have to work my way through this algorithm step
>> by step and understand it.
>> 
>> In particular, given the sheer amount of instructions, I *will* be confused
>> because it really doesn't contain any surprises.
>> 
>> This is a disservice to the reader.
> 
> It's a service to implementors, who have a trivial algorithm to follow
> that is very difficult to get wrong and not really subject to
> misinterpretation.

It's only at all a service when you want to implement the algorithm literally. In cases where you want to do an alternate implementation, this style of definition can be hard to follow. It can also be hard to review an overly verbose algorithm for correctness, or to tell if other definitions relying on it are correct.

In this case, it's a little mysterious what "additional data" is about. As far as I can tell this is meant to be an extension point to further restrict origin sameness. document.domain is also handled in an oddly subtle way. You can't tell just from reading the same origin algorithm and the definitions of origin above that it is handled at all. (It works by setting the port field to a special "manual override" value. If I saw someone code it that way for real I'd think it was kind of a hack and probably reject the patch because security code should not be tricky.)

I think one of the reasons it's a little hard to follow is that the numbered list algorithm format makes it awkward to do nested ifs or else clauses.

If it were up to me, I'd write it like this:

Two origins A and B are said to be the same origin if the following algorithm returns true:
    If A and B are both opaque identifiers and their value is equal, then:
        Return true.
    else:
        If either A or B or both are opaque identifiers, then:
            Return false.
    If both A and B have the domain override flag set, then:
        If A and B have identical scheme components, host components, and additional data (if any) then:
            Return true.
    else if neither A nor B have the domain override flag set, then:
        If A and B have identical scheme components, host components, port components,  and additional data (if any) then:
            Return true.
    else:
         Return false.

That's 14 lines instead of 9, but I think the control flow and case analysis nature of this algorithm is more clear. Collapsing single-line ifs without an else you could have 11 lines, but I'm not sure this is better:

Two origins A and B are said to be the same origin if the following algorithm returns true:
    If A and B are both opaque identifiers and their value is equal, then:
        Return true.
    else:
        If either A or B or both are opaque identifiers, then: return false.
    If both A and B have the domain override flag set, then:
        If A and B have identical scheme components, host components, and additional data (if any) then: return true.
    else if neither A nor B have the domain override flag set, then:
        If A and B have identical scheme components, host components, port components,  and additional data (if any) then: return true.
    else:
         Return false.

You could try to write this all as one big compound boolean expression, but that would be hard to understand and hard to check for correctness.

Tab's two-sentence version is very concise:

> 
> On the other hand, I see the value of *pairing* the verbose
> explanation with a concise one that explains it in short, simple
> language.  This algorithm could be paired with "If either A or B or
> both are opaque identifiers, return true if they're equal and false
> otherwise.  If neither are opaque identifiers, return true if their
> host, scheme, port, and any additional data they may have if it
> exists, are all equal; otherwise return false.

But it has the same problem as the original of hiding the ball on document.domain, and the binding of conditions for unique origins is ambiguous (which if does the "otherwise" pair with?). Going back to being tricky about how document.domain works

Two origins A and B are said to be the same origin if the following algorithm returns true:
    If A and B are both opaque identifiers and their value is equal, then:
        Return true.
    else:
        If either A or B or both are opaque identifiers, then: return false.
    If A and B have identical scheme components, host components, port components,  and additional data (if any) then:
        Return true.
    Return false.

That's almost as concise as Tab's short version. But I think relying on a special value of port to implement both the rule that no origin with document.domain set explicitly compares equal to one where it is not set explicitly, and the rule that origins with document.domain set compare equal regardless of port, is too subtle for security code, and likewise too subtle for security spec language.

Regards,
Maciej
Received on Saturday, 23 January 2010 21:18:10 UTC