speedup for gsi_obs_space #247
Making this a draft, as I have some thoughts on how to change it. Fundamentally, the set of variables I have there doesn't make sense. It came about because I was combining a sensor that the algorithm said had the same lat/lons for every channel with a sensor that supposedly had independent lat/lons per channel. I have some ideas on how to do this properly (check QC, check for lat/lon repeats, then reshape and thin as appropriate). The alternative is to be safe and reshape everything: everything is flat in the GSI, and it makes sense to me to reshape everything to preserve the alignment of all fields.
Is this related to what @rtodling mentioned today in terms of calculating stats and sorting between different channels? Or is this even the exact same problem?
No, this is completely different, I think. The problem here is the slowness of gsi_obs_space, which reshapes or thins every variable even if the user doesn't request it. A secondary problem is that the method used to decide whether to reshape or thin seems like it could easily do the wrong thing.
…tch gsi_obs_space_reshape_all the user can turn on.
What I ended up doing here is adding a switch that a user can set. The flag is false by default (or if unspecified), so behavior remains the same for all tests.
Description
Reading GSI diag files can be pretty slow, especially for hyperspectral sounders (IASI being the worst at the moment). Several things contribute to this, but the big one is reshaping or thinning every dataset in the diag. Another contributing factor is that each variable is read for a count of nchan, and the decision to either thin or reshape it is based on whether the first nchan values are all equal.
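To make the two operations concrete, here is a minimal sketch of thinning, reshaping, and the first-nchan equality check on a flat array. It is not the code in gsi_obs_space: the function names are hypothetical, and it assumes the flat array is ordered by location with the channel index varying fastest.

```python
import numpy as np


def reshape_flat(values, nchan):
    """View a flat (nlocs * nchan) array as a 2-D (nlocs, nchan) array."""
    return np.asarray(values).reshape(-1, nchan)


def thin_flat(values, nchan):
    """Keep one value per location by taking every nchan-th element."""
    return np.asarray(values)[::nchan]


def first_block_is_constant(values, nchan):
    """Heuristic: are the first nchan values all equal?"""
    first = np.asarray(values)[:nchan]
    return bool(np.all(first == first[0]))


# Illustrative data only (assumed layout: channel varies fastest).
nchan = 3
latitude = np.repeat([10.0, 20.0], nchan)            # repeated per channel
obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])       # differs per channel

thinned_lat = thin_flat(latitude, nchan)             # one value per location
reshaped_obs = reshape_flat(obs, nchan)              # rows are locations
```

Under these assumptions, a variable repeated per channel (such as latitude) passes the equality check and is thinned, while a per-channel variable fails it and is reshaped.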
This seems like a bad idea. It's not hard to imagine a case where a sensor goes crazy or has fill values across all channels for an observation. If that happened on the first observation, the current scheme would choose to thin the variable rather than reshape it. Instead, I broke the variables out into two lists: those to be thinned and those to be reshaped. Variables in neither list fall back to the old scheme.
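The failure mode and the list-based alternative can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the variable names in the two sets, the fill value, and `choose_action` are all hypothetical.

```python
import numpy as np

FILL = 1.0e10  # hypothetical fill value

# Hypothetical variable lists; the real lists live in the PR's code.
THIN_VARS = {"latitude", "longitude"}
RESHAPE_VARS = {"observation", "omf"}


def choose_action(name, values, nchan):
    """Return 'thin' or 'reshape' for a flat variable.

    Listed variables are decided by name. Anything else falls back to
    checking whether the first nchan values are all equal.
    """
    if name in THIN_VARS:
        return "thin"
    if name in RESHAPE_VARS:
        return "reshape"
    first = np.asarray(values)[:nchan]
    return "thin" if np.all(first == first[0]) else "reshape"


# First observation is all fill values; the second is valid.
nchan = 3
obs = np.array([FILL, FILL, FILL, 280.1, 281.4, 279.9])

listed = choose_action("observation", obs, nchan)          # decided by list
unlisted = choose_action("unlisted_variable", obs, nchan)  # fallback check
```

Here the listed variable is reshaped regardless of its contents, while the same data under an unlisted name is thinned by the fallback, which would discard the valid per-channel values in the second observation.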
For a channel summary comparison between 3 runs that I have right now, the current scheme takes ~45 minutes; with the changes made here, this drops to ~22 minutes on a dedicated node.
Timing before:
Timing after:
Dependencies
None
Impact
Speeds up gsi_obs_space and makes it more consistent/safe among sensors.