Re: [rtcweb] Data on travel times

From: Eric Rescorla <ekr@rtfm.com> · Date: Mon, 9 Apr 2012 09:02:41 -0700

On Mon, Apr 9, 2012 at 8:35 AM, Marshall Eubanks
<marshall.eubanks@gmail.com> wrote:
> I really like this analysis. Some questions.
>
> 2012/4/9 Eric Rescorla <ekr@rtfm.com>:
>> Hi folks,
>>
>> Since it seems like we're going to be having a large number of
>> interims, I thought it might be instructive to try to analyze a bunch
>> of different locations to figure out the best strategy. My first cut
>> analysis is below.
>>
>> Note that I'm not trying to make any claims about what the best set of
>> venues is. It's obviously easy to figure out any statistic we want
>> about each proposed venue, but how you map that data to "best" is up
>> to you. In particular, there's some tradeoff between minimal total
>> travel time and a "fair" distribution of travel times (not that I
>> claim to know what that means).
>>
>>
>> METHODOLOGY
>> The data below is derived by treating both people and venues as
>> airport locations and using travel time as our primary instrument.
>>
>> 1. For each responder for the current Doodle poll, assign a home
>>   airport based on their draft publication history.  We're missing a
>>   few people but basically it should be pretty complete. Since
>>   these people responded before the venue is known, it's at
>>   least somewhat unbiased.
>>
>> 2. Compute the shortest advertised flight between each home airport
>>   and the locations for each venue by looking at the shortest
>>   advertised Kayak flights around one of the proposed interim
>>   dates (6/10 - 6/13), ignoring price, but excluding "Hacker fares".
>>   [Thanks to Martin Thomson for helping me gather these.]
>>
>
> 1.) Why are some fields doubled? I.e.,
>
> ARN SFO 14 13
>
> Are these counted twice? That would, of course, give more weight to
> those records.

Laziness. When I started recording flight times, I used the total time
and then later realized that what I wanted was to break them out into
outbound and return legs, but I was too lazy to go back and fix the
earlier ones.
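
A minimal sketch of how a mixed-format record could be folded into a
single round-trip figure. It assumes each record is whitespace-separated
as "ORIGIN DEST TOTAL" or "ORIGIN DEST OUT BACK"; that layout and the
function name round_trip_hours are assumptions for illustration, and
this is not the code in pairings.py.

```python
def round_trip_hours(line):
    """Return (origin, destination, round-trip hours) for one record.

    Assumed layout (hypothetical, inferred from the example in this thread):
      ORIGIN DEST TOTAL      -> one number, already the round-trip total
      ORIGIN DEST OUT BACK   -> two numbers, outbound and return legs
    """
    fields = line.split()
    if len(fields) not in (3, 4):
        raise ValueError("expected 3 or 4 fields, got %d: %r" % (len(fields), line))
    origin, dest = fields[0], fields[1]
    hours = [float(x) for x in fields[2:]]
    # Either way the record counts once: a single total is used as-is,
    # and two legs are summed into one round-trip figure.
    return origin, dest, sum(hours)


print(round_trip_hours("ARN SFO 14 13"))  # ('ARN', 'SFO', 27.0)
```

Read this way, a doubled record contributes one round-trip value rather
than two separate samples.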

> 2.) At any rate, I couldn't quite match your numbers. For SFO, for
> example, I got
>
> # SFO
>
>  Records            29  |
>  Mean            12.52  |
>  RMS             15.34  |
>  Std Dev          8.55  |
>  Minimum          1.00  |
>  Maximum         34.00  |
>
> This assumes that each doubled entry counts as 2 separate entries. If
> the second entries are ignored, I get

I'm not sure what procedure you are following here, but if it's taking the
SD of the data in durations.txt, that's not what I did. That's just
the input data.

The summary data that I am showing is produced by weighting each home
airport by the number of participants flying from it. The script to
generate that is
pairings.py and the results are found in doodle-out.txt. Of course,
it could still all be wrong.

FWIW, I'm using R's sd() which uses n-1.
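
To illustrate the difference, here is a rough sketch of per-participant
weighting with an n-1 standard deviation. The airport codes, hours, and
head counts are made up, and venue_stats is a hypothetical helper, not
the function in pairings.py. Python's statistics.stdev uses the n-1
divisor, like R's sd().

```python
import statistics


def venue_stats(hours_by_airport, attendees_by_airport):
    """Mean, median, and sample SD of travel time over attendees.

    Each home airport's round-trip time is repeated once per attendee
    flying from it, so busy airports carry more weight.
    """
    per_attendee = []
    for airport, count in attendees_by_airport.items():
        per_attendee.extend([hours_by_airport[airport]] * count)
    return (
        statistics.mean(per_attendee),
        statistics.median(per_attendee),
        statistics.stdev(per_attendee),  # n-1 divisor
    )


# Made-up inputs for one venue (not taken from durations.txt or doodle.txt).
hours = {"AAA": 2.0, "BBB": 12.0, "CCC": 24.0}
attendees = {"AAA": 3, "BBB": 1, "CCC": 2}

mean, median, sd = venue_stats(hours, attendees)
unweighted_sd = statistics.stdev(hours.values())
print(mean, median, round(sd, 2), round(unweighted_sd, 2))
```

The SD over attendees and the SD over the raw airport rows generally
differ, which would be one reason statistics computed directly from
durations.txt don't match the summary table.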

-Ekr

> # SFO
>
>  Records            21  |
>  Mean            14.05  |
>  RMS             17.05  |
>  Std Dev          9.14  |
>  Minimum          1.00  |
>  Maximum         34.00  |
>
> If two entries are averaged together (when present)
>
> # SFO
>  Records            21  |
>  Mean            13.93  |
>  RMS             16.97  |
>  Std Dev          9.18  |
>  Minimum          1.00  |
>  Maximum         34.00  |
>
> None of these 3 options match your
>
> Venue         Mean         Median           SD
> ----------------------------------------------
> SFO           13.5             11         12.2
>
> In particular, your SD value seems high.
>
> (Note, I use the n-1 rather than the n divisor when computing the
> SD, but that won't explain the difference.)
>
> Regards
> Marshall
>
>
>> This lets us compute statistics for any venue and/or combination
>> of venues, based on the candidate attendee list.
>>
>> The three proposed venues:
>>
>> - San Francisco (SFO)
>> - Boston (BOS)
>> - Stockholm (ARN)
>>
>> Three hubs not too distant from the proposed venues:
>>
>> - London (LHR)
>> - Frankfurt (FRA)
>> - New York (NYC) [0]
>>
>> Also, Calgary (YYC), since the other two chair locations (BOS and SFO)
>> were already proposed as venues, and I didn't want Cullen to feel
>> left out.
>>
>>
>> RESULTS
>> Here are the results for each of the above venues, measured in total
>> hours of travel (i.e., round trip).
>>
>> Venue         Mean         Median           SD
>> ----------------------------------------------
>> SFO           13.5             11         12.2
>> BOS           12.3             11          7.5
>> ARN           17.0             21         10.7
>> FRA           14.8             17          7.3
>> LHR           13.3             14          7.5
>> NYC           11.5             11          5.8
>> YYC           14.9             13         10.2
>> SFO/BOS/ARN   14.3             13          3.6
>> SFO/NYC/LHR   12.7             11.3        3.7
>>
>> XXX/YYY/ZZZ denotes a three-way rotation of XXX, YYY, and ZZZ. Obviously, mean
>> and median are intended to be some sort of aggregate measure of travel
>> time. I don't have any way to measure "fairness", but SD is intended
>> as some metric of the variation in travel time between attendees.
>>
>> The raw data and software are attached. The files are:
>>
>>  home-airports     -- the list of people's home airports
>>  durations.txt     -- the list of airport-airport durations
>>  doodle.txt        -- the attendees list
>>  pairings.py       -- the software to compute travel times
>>  doodle-out.txt    -- the computed travel times for each attendee
>>
>> Obviously, there could be an error in the raw data or the software.
>> Please feel free to send corrections, especially if you find
>> something material.
>>
>>
>> OBSERVATIONS
>> Obviously, it's hard to know what the optimal solution is without
>> some model for optimality, but we can still make some observations
>> based on this data:
>>
>> 1. If we're just concerned with minimizing total travel time, then we
>> would always meet in New York, since it has both the shortest mean travel
>> time and the shortest median travel time, but as I said above, this
>> arguably isn't fair to people who live either in Europe or California,
>> since they always have to travel.
>>
>> 2. Combining West Coast, East Coast, and European venues gives
>> mean/median values comparable to (or at least not much worse than)
>> NYC's, with much lower SDs. So, arguably that kind of mix is more fair.
>>
>> 3. There's a pretty substantial difference between hub and non-hub
>> venues. In particular, LHR has a median travel time 7 hours less than
>> ARN, and the SFO/NYC/LHR combination has a median/mean travel time
>> about 2 hours less than SFO/BOS/ARN (primarily accounted for by the
>> LHR/ARN difference). [Full disclosure, I've favored Star Alliance hubs
>> here, but you'd probably get similar results if, for instance, you
>> used AMS instead of LHR.]
>>
>>
>> Obviously, your mileage may vary based on your location and feelings
>> about what's fair, but based on this data, it looks to me like a
>> three-way rotation between West Coast, East Coast, and European hubs
>> offers a good compromise between minimum cost and a flat distribution
>> of travel times.
>>
>> Personally, whatever we decide to do I'd ask that the WG settle now on
>> a pattern going forward so that we can predictably budget our travel
>> time and dollars.
>>
>>
>> [0] Treating all three NYC airports as a single location.
>>