SILO Melbourne walk anomaly #175

Looking into my copies of the health indicator and traffic flow files created as part of HealthExposureModelMEL, I was struck by how much larger the walk files are than those for other modes.

This was for the Melbourne test example: a small 231-person (100-household) subset of the actual Greater Melbourne synthetic population (~4 million persons), which I have been struggling to process due to long processing times (e.g. see #159). So the overall file sizes here are small, but the relative difference is the point: walk takes longer to process, and it's interesting that the walk files here also contain more data.

I wonder: are more people being routed further than they should be, specifically in walk mode, and is that the cause of the longer processing times?

Let's look into it, first with the test data. Note that these files have not yet been generated for the full Melbourne population, because at that sample size 'walk' takes more than a week to process for a single day:

## File sizes and number of records by mode for health indicators and traffic flows

| type | mode | n_files | avg_size_kb | median_size_kb | avg_nrows | median_nrows | total_rows |
|:---|:---|---:|:---|:---|:---|:---|:---|
| healthIndicators | autoDriver | 7 | 10.6 | 11.2 | 60.3 | 63.0 | 422 |
| healthIndicators | autoPassenger | 7 | 3.9 | 4.3 | 21.1 | 24.0 | 148 |
| healthIndicators | bicycle | 7 | 0.5 | 0.4 | 1.4 | 1.0 | 10 |
| healthIndicators | pt | 7 | 1.5 | 1.5 | 10.1 | 11.0 | 71 |
| healthIndicators | walk | 7 | 28.8 | 30.5 | 189.1 | 197.0 | 1324 |
| traffic_flows | bike | 7 | 6.8 | 3.4 | 498.6 | 249.0 | 3490 |
| traffic_flows | car | 7 | 522.3 | 561.9 | 36962.1 | 39749.0 | 258735 |
| traffic_flows | walk | 7 | 15651.0 | 15991.4 | 1116650.1 | 1141197.0 | 7816551 |
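
For anyone wanting to reproduce a summary like the table above, here is a minimal sketch. It assumes the outputs are CSV files with a header row and a file name of the form `<type>_<mode>_<day>.csv`; that naming convention and the `parse_name` helper are hypothetical, so adjust them to the actual output names.

```python
import csv
import statistics
from collections import defaultdict
from pathlib import Path


def count_rows(path):
    """Number of data rows in a CSV file (header excluded)."""
    with open(path, newline="") as handle:
        return max(sum(1 for _ in csv.reader(handle)) - 1, 0)


def parse_name(name):
    """Hypothetical convention: '<type>_<mode>_<day>.csv' -> (type, mode)."""
    parts = Path(name).stem.rsplit("_", 2)
    return (parts[0], parts[1]) if len(parts) == 3 else None


def summarise(folder, parse_name=parse_name):
    """Group CSV files by (type, mode) and summarise size and row count."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob("*.csv")):
        key = parse_name(path.name)
        if key is not None:
            groups[key].append((path.stat().st_size / 1024, count_rows(path)))
    summary = {}
    for key, items in groups.items():
        sizes = [size for size, _ in items]
        rows = [n for _, n in items]
        summary[key] = {
            "n_files": len(items),
            "avg_size_kb": round(statistics.mean(sizes), 1),
            "median_size_kb": round(statistics.median(sizes), 1),
            "avg_nrows": round(statistics.mean(rows), 1),
            "median_nrows": statistics.median(rows),
            "total_rows": sum(rows),
        }
    return summary
```

Calling `summarise("path/to/outputs")` returns one dictionary per (type, mode) pair with the same columns as the table.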

Without even looking at the content of these files, it is clear something is going wrong. It suggests that
- there are more walking trips than trips by all other modes combined, and
- those 231 persons are making walking trips over many more roads than trips by other modes, which would be expected to take you further.

That is, the file sizes and record counts suggest that more people are walking than expected, and that they are walking too far. Such mis-routing could explain the symptom we are experiencing: impossibly long walk processing times for health indicators.
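
To put numbers on those two points, the ratios can be worked out directly from the `total_rows` column of the table. This is just arithmetic on the reported figures; the variable names are mine.

```python
# total_rows values copied from the table above
health_rows = {"autoDriver": 422, "autoPassenger": 148, "bicycle": 10, "pt": 71, "walk": 1324}
flow_rows = {"bike": 3490, "car": 258735, "walk": 7816551}

# Health indicator records for every mode except walk
other_health = sum(n for mode, n in health_rows.items() if mode != "walk")
walk_vs_others = health_rows["walk"] / other_health
walk_vs_car_flows = flow_rows["walk"] / flow_rows["car"]

print(other_health)                 # 651
print(round(walk_vs_others, 2))     # 2.03
print(round(walk_vs_car_flows, 1))  # 30.2
```

So walk has about twice as many health indicator records as all other modes combined, and about 30 times as many traffic flow records as car.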

Let's look further.

## Number of links and activity duration for trips by mode (from health indicators)

### Number of links (n)

mode | min | p25 | p50 | p75 | max
-- | -- | -- | -- | -- | --
autoDriver | 10 | 248.75 | 414 | 571.75 | 1335
autoPassenger | 60 | 366.5 | 582.5 | 898 | 1672
pt | 0 | 0 | 0 | 0 | 0
bicycle | 46 | 232 | 261.5 | 516 | 801
walk | 0 | 576.25 | 6340 | 10215.25 | 22630

### Activity duration (mins)

mode | min | p25 | p50 | p75 | max
-- | -- | -- | -- | -- | --
autoDriver | -1 | 5 | 43 | 284.75 | 726
autoPassenger | -1 | 7.5 | 50 | 246 | 670
pt | -1 | 34.5 | 118 | 409 | 583
bicycle | -1 | -1 | -1 | -1 | 545
walk | -1 | -1 | 35 | 142 | 785

As inferred from the file sizes of healthIndicators and traffic flows, the health indicator data show that walk has the longest maximum activity duration (785 mins), and that the number of links traversed is roughly an order of magnitude higher than for other modes, including driving (a median of 6,340 links for walk versus 414 for autoDriver).

That's not plausible, so we need to look into what is driving it.
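
For reference, a sketch of how a min/p25/p50/p75/max summary by mode can be computed. The record layout (dictionaries with `mode` and `links` keys) is hypothetical, and the use of linear-interpolation quartiles is an assumption on my part; fractional values such as 248.75 in the tables above are consistent with that method.

```python
import statistics
from collections import defaultdict


def five_number_summary(values):
    """Return (min, p25, p50, p75, max) using linear-interpolation quartiles."""
    values = sorted(values)
    if len(values) == 1:
        return (values[0],) * 5
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (values[0], q1, q2, q3, values[-1])


def summarise_by_mode(records, field):
    """Group records by their 'mode' key and summarise the given field."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["mode"]].append(record[field])
    return {mode: five_number_summary(vals) for mode, vals in grouped.items()}


# Made-up records, purely to show the call
records = [
    {"mode": "walk", "links": 10},
    {"mode": "walk", "links": 20},
    {"mode": "walk", "links": 30},
    {"mode": "walk", "links": 40},
]
print(summarise_by_mode(records, "links"))
```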

The Brunswick test example (20,000 people, but in an area of 56 SA1s instead of 10,000 SA1s) is a small, delimited study region and processes much more quickly; however, the sizes of these files display the same pattern.

So the Brunswick test example, which runs SiloMEL and RunHealthExposureOffline in ~9 mins each, can be used to explore what might be giving rise to this, instead of the Melbourne 100-household test data, for which the latter takes 7.5 hours. If we can fix it for Brunswick, we can then run the 100-household test data, which should then process much more quickly, giving confidence of reasonable processing times for Greater Melbourne.

## matsimPlans

I'm not sure how best to approach analysing these XML documents, so I have taken a crude approach: I opened the matsimPlan_Friday.xml that exists in the same folder for Brunswick and counted occurrences of the different modes:

mode | Melbourne - 100hh (n) | Brunswick (n)
-- | -- | --
bike | 0 | 3,009
car | 154 | 19,126
walk | 564 | 44,904

So walk is an outlier there too.
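
A plain text search can over-count, since it also matches modes in unselected plans or in other attributes. A slightly less crude option is to count `<leg mode="...">` elements with a streaming XML parser. The sketch below assumes the plans file follows the usual MATSim population layout (`person` > `plan` > `leg`, with a `selected` attribute on each plan); I have not confirmed that against these particular files.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_leg_modes(source, selected_only=True):
    """Count legs by mode; source is a file path or a binary file object."""
    counts = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "plan":
            if not selected_only or elem.get("selected") == "yes":
                counts.update(leg.get("mode") for leg in elem.iter("leg"))
            elem.clear()  # free the plan's children once counted
        elif elem.tag == "person":
            elem.clear()
    return counts


# Tiny made-up document, purely to show the call
sample = b"""<population>
  <person id="1">
    <plan selected="yes">
      <activity type="home"/><leg mode="walk"/>
      <activity type="work"/><leg mode="car"/>
      <activity type="home"/>
    </plan>
    <plan selected="no">
      <activity type="home"/><leg mode="walk"/><activity type="home"/>
    </plan>
  </person>
</population>"""
print(count_leg_modes(io.BytesIO(sample)))
```

For the real data, pass the path instead, e.g. `count_leg_modes("matsimPlan_Friday.xml")`.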

## SiloMEL MATSim outputs

Going further upstream in the process, consider the outputs of SiloMEL.java that these are based on. The file size of 2018.output_events.xml.gz on a Thursday for Brunswick is 6.6MB for carTruck, compared with 14.4MB for bikePed.

For the 100hh data across all Melbourne, it's 53.1MB for carTruck and 1.1MB for bikePed.

It's a bit confusing, but it probably makes sense: the all-Melbourne dataset contains all interstate freight, so quite a few truck trips, whereas Brunswick restricted this to just Brunswick (very few). Brunswick has 20,000 people doing bikePed trips and the 100hh dataset has 231 people. So maybe this looks like it makes sense.
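
File size is only a rough proxy, so it may be worth counting departures by mode in the events files directly. The sketch below assumes the standard MATSim events layout, where each `<event>` has a `type` attribute and departure events carry a `legMode` attribute; treat those attribute names as assumptions to check against the actual files.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def count_departures(stream):
    """Count departure events by legMode from an uncompressed XML stream."""
    counts = Counter()
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "event" and elem.get("type") == "departure":
            counts[elem.get("legMode")] += 1
        elem.clear()  # drop attributes once the element has been read
    return counts


def count_departures_in_file(path):
    """Same count, reading a gzip-compressed events file from disk."""
    with gzip.open(path, "rb") as stream:
        return count_departures(stream)
```

Running `count_departures_in_file` on the carTruck and bikePed `2018.output_events.xml.gz` files for both study regions would show whether the size differences reflect the number of trips or the number of events per trip.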

I think more attention needs to be paid to how the RunExposureOfflineHealth events files are generated...

I'll look into this now, but so I don't accidentally lose the above, I'll add this comment now.

