ANALYSIS:
~The actual write (data[data_row_offset + col + threadIdx.x]) takes 2340 to 2352.
~In theory, this write should be the bottleneck. If it were, and the nested for-loops scaled linearly, the total time would be 3,744,000 (2340 * 5 * 5 * 64, for col, row, and channel).
~However, the code currently takes around 4278832-4302892, roughly 14-15% above that estimate, so something other than the write itself is costing time.
~Timing each for-loop separately confirms the roughly linear pattern. The second for-loop (col) takes 11276 (about 2340 * 5 = 11700). The third for-loop (row) takes about 55700 (about 2340 * 5 * 5 = 58500).
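
The linear estimate above can be checked with a small sketch. The helper names are hypothetical, and the numbers are simply the measurements quoted in these notes, in whatever unit the timer reports.

```cpp
#include <cstdint>

// Hypothetical helper: expected total time if every innermost write costs
// `write_cost` and the nested loops scale linearly (no cache effects).
constexpr std::int64_t linear_estimate(std::int64_t write_cost,
                                       std::int64_t cols,
                                       std::int64_t rows,
                                       std::int64_t channels) {
    return write_cost * cols * rows * channels;
}

// Ratio of a measured time to the linear estimate (1.0 means perfectly linear).
constexpr double overhead_ratio(std::int64_t measured, std::int64_t estimate) {
    return static_cast<double>(measured) / static_cast<double>(estimate);
}
```

For example, `linear_estimate(2340, 5, 5, 64)` gives the 3,744,000 figure, and `overhead_ratio(4278832, 3744000)` shows how far the measured time sits above it.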
OPTIMIZATION SOLUTION:
~I suspect that the biggest problem with the current implementation is super-alignment when caching. That is, non-sequential accesses whose strides are large powers of two keep landing on the same cache locations, which causes seemingly random slow-downs.
~To address this problem:
~With this optimization, the inner two for-loops take around 48441, so the whole code should execute in about 3,100,000 at a minimum (48441 * 64). It still takes 3578698, which suggests there is much more cache optimization to be done.
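
To picture the suspected power-of-two effect, here is a minimal sketch of a toy set-indexed cache. The geometry (128-byte lines, 64 sets) is an assumption for illustration only, not a measurement of any real GPU, and the function name is hypothetical. It counts how many distinct cache sets a strided access pattern touches.

```cpp
#include <cstddef>
#include <set>

// Assumed cache geometry, for illustration only.
constexpr std::size_t kLineBytes = 128;
constexpr std::size_t kNumSets = 64;

// Hypothetical helper: number of distinct cache sets touched by `accesses`
// loads at byte addresses 0, stride, 2 * stride, ...
std::size_t distinct_sets_touched(std::size_t stride_bytes, std::size_t accesses) {
    std::set<std::size_t> sets;
    for (std::size_t i = 0; i < accesses; ++i) {
        const std::size_t addr = i * stride_bytes;
        sets.insert((addr / kLineBytes) % kNumSets);  // set index of this address
    }
    return sets.size();
}
```

In this toy model a stride of 8192 bytes (128 * 64) sends every access to the same set, while padding the stride by one line to 8320 bytes spreads 64 accesses over all 64 sets. Whether padding helps in this code depends on the real cache geometry and would need to be measured.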
+++++++++++++++++++++++++++++++++++
~The major remaining task is to optimize memory access. The code now takes 3476050 to 3507880.
~The biggest problem is that the outer for-loop jumps in increments of is_1 * is_2 * is_3, which is far too large. I suspect that in some cases every access in the next iteration of the for-loops is a cache miss, resulting in huge latency.
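
The cost of a large outer stride can be sketched by counting how many distinct cache lines a loop touches. This is an illustration under stated assumptions: 32 floats per cache line (a 128-byte line with 4-byte floats), hypothetical helper names, and made-up example values for is_1, is_2, and is_3.

```cpp
#include <cstddef>
#include <set>

// Assumed: 128-byte cache line holding 4-byte floats.
constexpr std::size_t kFloatsPerLine = 32;

// Stride of the outer loop, in elements, as described in the notes.
constexpr std::size_t outer_stride(std::size_t is_1, std::size_t is_2, std::size_t is_3) {
    return is_1 * is_2 * is_3;
}

// Hypothetical helper: distinct cache lines touched when reading elements
// 0, stride, 2 * stride, ... for `iterations` steps.
std::size_t lines_touched(std::size_t stride_elems, std::size_t iterations) {
    std::set<std::size_t> lines;
    for (std::size_t i = 0; i < iterations; ++i) {
        lines.insert((i * stride_elems) / kFloatsPerLine);
    }
    return lines.size();
}
```

With example dimensions 3, 32, and 32 the stride is 3072 elements, so 64 iterations touch 64 different lines, whereas 64 contiguous reads touch only 2. That gap is the kind of miss pattern suspected here.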
++++++++++++
~Down to 3426024 to 3463012.
~At this point I think I've pretty much exhausted the temporal locality benefits, along with smaller efficiency tricks such as declaring variables less often and storing repeated values in variables.
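
As a sketch of the "storing values into variables" trick, the two functions below compute the same sum, but the second hoists the row offset out of the inner loop. The function names and the 2D layout are hypothetical and are not taken from the code in this PR. Compilers often perform this hoisting on their own, so any gain should be measured.

```cpp
#include <cstddef>
#include <vector>

// Naive: recomputes the full flat index on every inner iteration.
float sum_naive(const std::vector<float>& data, std::size_t rows, std::size_t cols) {
    float total = 0.0f;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            total += data[r * cols + c];
        }
    }
    return total;
}

// Hoisted: the row offset is computed once per row and reused.
float sum_hoisted(const std::vector<float>& data, std::size_t rows, std::size_t cols) {
    float total = 0.0f;
    for (std::size_t r = 0; r < rows; ++r) {
        const std::size_t row_offset = r * cols;  // loop-invariant for the inner loop
        for (std::size_t c = 0; c < cols; ++c) {
            total += data[row_offset + c];
        }
    }
    return total;
}
```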
~The inherent problem with the current implementation is the one I mentioned above (cache misses), and it seems unavoidable unless we reorder our data structure.
QUESTION: Why does the outer for-loop jump in increments of is_1 * is_2 * is_3? What data sits between these huge jumps that is being skipped? Is there a way to pack the data so that the accessed elements are closer together?
~^Solving that problem will definitely improve the speed: I found that decreasing the increment size improved the speed by roughly 2x.
At this point, I've optimized it as far as I can without restructuring the data itself. It is possible that I made a mistake in the math during the refactoring. I've checked, but there might still be errors.