Looking into my copies of the health indicator and traffic flow files created as part of HealthExposureModelMEL, I was struck by the outlying size of the walk files.
This was for the Melbourne test example: a small 231 person (100 household) subset of the actual Greater Melbourne (~4 million person) synthetic population, which I have been struggling to process due to long processing times (e.g. see #159). So the overall file sizes here are small, but the relative difference is the point: walk takes longer to process, and it's interesting that the walk files here have more data.
I wonder: are more people being routed further than they should be, specifically by walk mode, and is that the cause of the longer processing times?
Let's look into it, first with the test data. Note that these files have not yet been generated for the full Melbourne population, because at that sample size 'walk' takes more than a week to process for a single day:
File sizes and number of records by mode for health indicators and traffic flows

| type | mode | n_files | avg_size_kb | median_size_kb | avg_nrows | median_nrows | total_rows |
|---|---|---|---|---|---|---|---|
| healthIndicators | autoDriver | 7 | 10.6 | 11.2 | 60.3 | 63.0 | 422 |
| healthIndicators | autoPassenger | 7 | 3.9 | 4.3 | 21.1 | 24.0 | 148 |
| healthIndicators | bicycle | 7 | 0.5 | 0.4 | 1.4 | 1.0 | 10 |
| healthIndicators | pt | 7 | 1.5 | 1.5 | 10.1 | 11.0 | 71 |
| healthIndicators | walk | 7 | 28.8 | 30.5 | 189.1 | 197.0 | 1324 |
| traffic_flows | bike | 7 | 6.8 | 3.4 | 498.6 | 249.0 | 3490 |
| traffic_flows | car | 7 | 522.3 | 561.9 | 36962.1 | 39749.0 | 258735 |
| traffic_flows | walk | 7 | 15651.0 | 15991.4 | 1116650.1 | 1141197.0 | 7816551 |
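
For anyone wanting to reproduce a summary like the one above on their own outputs, something along these lines should work. This is only a minimal sketch, not a record of how the table was made: `count_rows` and `summarise_files` are hypothetical helper names, the caller is assumed to supply the (type, mode, path) triples (I make no assumption here about how the files are named), and each file is assumed to be plain text with a single header line.

```python
import statistics
from collections import defaultdict
from pathlib import Path


def count_rows(path):
    """Count data rows in a text file, assuming exactly one header line."""
    with open(path, "rb") as fh:
        n_lines = sum(1 for _ in fh)
    return max(n_lines - 1, 0)


def summarise_files(files):
    """Summarise an iterable of (file_type, mode, path) triples.

    Returns a dict keyed by (file_type, mode) with the same columns as the
    table above. How files map to a type and mode is left to the caller.
    """
    groups = defaultdict(list)
    for file_type, mode, path in files:
        path = Path(path)
        size_kb = path.stat().st_size / 1024
        groups[(file_type, mode)].append((size_kb, count_rows(path)))

    summary = {}
    for key, stats in sorted(groups.items()):
        sizes = [size for size, _ in stats]
        rows = [n for _, n in stats]
        summary[key] = {
            "n_files": len(stats),
            "avg_size_kb": round(statistics.mean(sizes), 1),
            "median_size_kb": round(statistics.median(sizes), 1),
            "avg_nrows": round(statistics.mean(rows), 1),
            "median_nrows": statistics.median(rows),
            "total_rows": sum(rows),
        }
    return summary
```

Counting lines in binary mode avoids decoding the larger walk files just to count rows.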
Without even looking at the content of these files, it is clear something is going wrong. It suggests that
- there are more walking trips than trips by all other modes combined, and
- those 231 persons are walking over many more roads than they travel by other modes, even though the other modes would be expected to take you further
That is, the file sizes and record counts for health indicators and traffic flows suggest that more people are walking than expected, and that they are walking too far. If walk trips are being mis-routed, the impossibly long walk processing times we are seeing for health indicators may be a side effect of that.
Let's look further.
Number of links and activity duration for trips by mode (from health indicators)
Number of links (n)

| mode | min | p25 | p50 | p75 | max |
|---|---|---|---|---|---|
| autoDriver | 10 | 248.75 | 414 | 571.75 | 1335 |
| autoPassenger | 60 | 366.5 | 582.5 | 898 | 1672 |
| pt | 0 | 0 | 0 | 0 | 0 |
| bicycle | 46 | 232 | 261.5 | 516 | 801 |
| walk | 0 | 576.25 | 6340 | 10215.25 | 22630 |
Activity duration (mins)

| mode | min | p25 | p50 | p75 | max |
|---|---|---|---|---|---|
| autoDriver | -1 | 5 | 43 | 284.75 | 726 |
| autoPassenger | -1 | 7.5 | 50 | 246 | 670 |
| pt | -1 | 34.5 | 118 | 409 | 583 |
| bicycle | -1 | -1 | -1 | -1 | 545 |
| walk | -1 | -1 | 35 | 142 | 785 |
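
The per-mode summaries above (min, quartiles, max) can be computed along the following lines. Again this is a sketch under assumptions: `five_number_summary` and `summarise_by_mode` are hypothetical names, the records are assumed to be dicts (for example from `csv.DictReader`) with a `mode` key, and the name of the numeric column is passed in rather than assumed. It uses `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points and so can give fractional quartiles like those in the tables.

```python
import statistics


def five_number_summary(values):
    """Return min, p25, p50, p75 and max for a non-empty list of numbers."""
    data = sorted(values)
    if not data:
        raise ValueError("no values to summarise")
    if len(data) == 1:
        # statistics.quantiles needs at least two points on older Pythons.
        p25 = p50 = p75 = float(data[0])
    else:
        p25, p50, p75 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "p25": p25, "p50": p50, "p75": p75, "max": data[-1]}


def summarise_by_mode(records, field):
    """Group records by their 'mode' key and summarise the given field."""
    by_mode = {}
    for record in records:
        by_mode.setdefault(record["mode"], []).append(float(record[field]))
    return {mode: five_number_summary(vals) for mode, vals in by_mode.items()}
```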
As inferred from the file sizes of healthIndicators and traffic flows, the health indicator data show that walk travel times appear longer than for other modes, and that the median number of links traversed by walk trips (6,340) is more than ten times that of any other mode, including driving (414).
That's not plausible, so we need to look into what is driving it.
The Brunswick test example (20,000 people, but in an area of 56 SA1s instead of 10,000 SA1s) is a small, delimited study region and processes much quicker. However, the sizes of its files display the same pattern.
So the Brunswick test example, which runs SiloMEL and RunHealthExposureOffline in ~9 mins each, can be used to explore what might be giving rise to this, instead of the Melbourne 100 household test data, which takes 7.5 hours for the latter. If we can fix it for Brunswick, we can then re-run the 100 household test data, which should process much quicker and give confidence of reasonable processing times for Greater Melbourne.
matsimPlans
I'm not sure how best to approach analysing these XML documents, so I have taken a crude approach: I opened the matsimPlan_Friday.xml file in the same folder for Brunswick and counted occurrences of each mode:

| mode | Melbourne - 100hh (n) | Brunswick (n) |
|---|---|---|
| bike | 0 | 3,009 |
| car | 154 | 19,126 |
| walk | 564 | 44,904 |
So walk is an outlier there too.
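
A slightly less crude way to get these counts would be to stream the plans file and tally the `mode` attribute of each leg. The sketch below assumes the file follows the usual MATSim population layout, with trips stored as `<leg mode="...">` elements; I have not confirmed that against these particular files, and `count_leg_modes` is a hypothetical helper name. It counts legs in every plan, not just selected ones, and its counts need not match a plain text search for the mode names.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_leg_modes(source):
    """Count <leg> elements by their 'mode' attribute.

    `source` is a file path or a binary file object. The file is streamed
    with iterparse so that a large plans file is not loaded in one go.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "leg":
            counts[elem.get("mode")] += 1
        # Drop the element's children and attributes once it has been seen.
        elem.clear()
    return counts
```

Usage would be something like `count_leg_modes("matsimPlan_Friday.xml")`, which returns a `Counter` keyed by mode.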
SiloMEL MATSim outputs
Going further upstream in the process, consider the outputs of SiloMEL.java that these files are based on. For Brunswick, the file size of 2018.output_events.xml.gz on a Thursday is 6.6MB for carTruck, compared with 14.4MB for bikePed.
For the 100hh data across all of Melbourne, it's 53.1MB for carTruck and 1.1MB for bikePed.
It's a bit confusing, but it probably makes sense: the all-Melbourne dataset contains all interstate freight, so quite a few truck trips, whereas Brunswick restricts this to just Brunswick (very few). Brunswick has 20,000 people doing bikePed trips, and the 100hh dataset has 231 people. So maybe these sizes are as expected.
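
File size alone is a blunt measure, so it may be worth counting what is actually in the events files. The sketch below assumes the standard MATSim events layout, with `<event type="...">` elements, and assumes that `departure` events carry a `legMode` attribute; whether the bikePed outputs here contain those events is something I have not checked. `summarise_events` is a hypothetical helper name.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def summarise_events(path):
    """Count events by type, and departure events by legMode.

    `path` points to a gzipped events XML file, which is streamed
    rather than read into memory.
    """
    event_types = Counter()
    departures = Counter()
    with gzip.open(path, "rb") as fh:
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag == "event":
                event_type = elem.get("type")
                event_types[event_type] += 1
                if event_type == "departure":
                    departures[elem.get("legMode")] += 1
            # Drop the element's attributes once it has been counted.
            elem.clear()
    return event_types, departures
```

Dividing the number of `entered link` events by the number of departures would give a rough links-per-trip figure to compare between carTruck and bikePed.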
I think more attention needs to be paid to how the RunExposureOfflineHealth events files are generated...
I'll look into this now, but so I don't accidentally lose the above, I'll add this comment now.