Re: [whatwg] URL: spec review - basic_parser from Sam Ruby on 2014-10-12 (public-whatwg-archive@w3.org from October 2014)

From: Sam Ruby <rubys@intertwingly.net>
Date: Sun, 12 Oct 2014 09:01:27 -0400
To: Anne van Kesteren <annevk@annevk.nl>
Cc: whatWG <whatwg@whatwg.org>
Message-ID: <543A7BA7.8060209@intertwingly.net>
On 10/12/2014 04:18 AM, Anne van Kesteren wrote:
> On Sat, Oct 11, 2014 at 7:24 PM, Sam Ruby <rubys@intertwingly.net> wrote:
>> On 10/10/2014 08:19 PM, Sam Ruby wrote:
>>> 2) https://url.spec.whatwg.org/#concept-basic-url-parser
>>>      I'm interpreting "terminate this algorithm" and "return failure" to
>>>      mean the same thing, and I'm interpreting "parse error" as "set
>>>      parse error flag and continue".
>>
>> I'm inclined to submit a pull request standardizing on "terminate this
>> algorithm" and "set parse error".
>
> I'm not sure what you mean here. Returning failure is important.
> However, in override mode returning failure is not important so the
> algorithm is simply terminated.

Can you explain in JavaScript terms what the difference is between 
return failure and terminate?

In any case, this difference wasn't clear to me, and mixed in with not 
defining what should be done with parse errors, and returning failure 
without setting parse errors (as mentioned below); all combined to make 
it more difficult (for me at least) to determine what was desired.

> And parse error error would be more
> like flag a parse error, or append a parse error to a list of parse
> errors. It depends a bit on whether the parser decides to halt on the
> first one or not.

I don't see anything in the prose that indicates that halting on the 
first parse error is an option.

>>>      b) Step 1.3.3 seems problematic.  I interpret this prose to mean "if
>>>         any character in buffer is a "%" and the first two characters
>>>         after the pointer position in input aren't hex characters".
>>>         Specifically, it appears to be comparing a possibly
>>>         non-contiguous set of characters.
>
> Ah yes. It needs to check the two code points after code point in
> buffer. That seems like a bug.

I'll look into submitting a pull request after I complete this pass.

>>> 4) https://url.spec.whatwg.org/#file-host-state
>>>      Step 1.3.2 returns failure without setting parse_error.  Is this
>>>      correct?
>>>
>>> 5) https://url.spec.whatwg.org/#host-state
>>>      Step 1.2.2 also returns failure without setting parse_error.
>
> This is indeed inconsistent. I must at some point have thought that
> returning failure without reporting a parse error was fine (as failure
> was indicated) or the other way around. Reporting a parse error before
> returning is probably best.
>
>> I'm inclined to submit a pull request changing these to to set parse error
>> before failing.
>
> Thanks.

Will do.

>>> 6) https://url.spec.whatwg.org/#relative-path-state
>>>      If input contains a path but no query or fragment, the last part of
>>>      the path will be accumulated into buffer, but that buffer will never
>>>      be added to the path
>>
>> Looks like I got confused by the prose in the spec.  I've submitted a pull
>> request that makes this point clearer:
>>
>>    https://github.com/whatwg/url/pull/4
>>
>> There were a number of places where things weren't clear to me; after I
>> complete my technical review and testing verification, I'll go back and
>> identify more.  Meanwhile, here is an example:
>>
>>    https://github.com/whatwg/url/pull/5
>
> Feel free to only modify url.src.html in PRs.

Ack.

> Did you have a look at the open bugs by the way? There's a chance the
> parsing algorithm will get rewritten at some point to be a bit more
> functional and less state driven.

Not yet.  I'm still seeing a large set of differences between what I am 
producing and what is in urltestdata.txt and need to track down whether 
the problems are in my implementation, the spec, or in the test results.

Once those three are in sync; I'll try to look at the bigger picture.

- Sam Ruby
Received on Sunday, 12 October 2014 13:01:55 UTC