increasing approx kernel faster Speed by 16liw · Pull Request #2 · marsupialtail/SVS

16liw · 2018-08-27T00:33:33Z

ANALYSIS:

~The actual write (data[data_row_offset + col + threadIdx.x]) takes 2340 to 2352.
~In theory, this write should be the bottleneck. If it were, and the nested for-loops scaled linearly, the total time would be 3,744,000 (2340 * 5 * 5 * 64, for col, row, and channel).
~However, it currently takes around 4278832-4302892, which suggests something else is adding overhead.

~Timing each for-loop separately shows the same roughly linear pattern. The second for-loop (col) takes 11276 (about 2340 * 5 = 11700). The third for-loop (row) takes about 55700 (about 2340 * 5 * 5 = 58500).
~OPTIMIZATION SOLUTION:
~I suspect that the biggest problem with the current implementation is cache alignment: non-sequential accesses at large power-of-two strides cause seemingly random slow-downs.
~To address this problem:
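The arithmetic behind these estimates can be checked with a small host-side sketch. The function names are hypothetical and not part of this PR, and the numbers are in whatever timer unit the measurements above use (the unit is not stated).

```cpp
#include <cassert>
#include <cstdint>

// Predicted cost if every loop level scaled linearly with the innermost write.
// Arguments are the per-write time and the trip counts of the col, row,
// and channel loops quoted in the analysis.
constexpr std::int64_t linear_estimate(std::int64_t per_write,
                                       int cols, int rows, int channels) {
    return per_write * cols * rows * channels;
}

// Fraction by which a measured time exceeds the linear estimate.
inline double overhead_ratio(std::int64_t measured, std::int64_t estimate) {
    assert(estimate > 0);
    return static_cast<double>(measured - estimate) /
           static_cast<double>(estimate);
}
```

For example, `linear_estimate(2340, 5, 5, 64)` gives the 3,744,000 figure, and `overhead_ratio(4278832, 3744000)` shows the lower measured time is roughly 14% above it.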

Expanding (unrolling) the inner two for-loops: I found that declaring variables takes more time than updating them, which would explain why unrolling the for-loops decreases the time.
Optimizing caching by doing all reads from the first array first and then doing all writes to the second array.
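As an illustration of these two changes, here is a minimal host-side sketch, not the kernel code from this PR. It assumes a 5-element window (matching the 5 in the estimates above) and uses hypothetical function names and offsets. It unrolls only one 5-wide loop, whereas the PR unrolls both inner loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline: each read from src is immediately followed by its write to dst.
void copy_window_loop(const std::vector<float>& src, std::size_t src_off,
                      std::vector<float>& dst, std::size_t dst_off) {
    assert(src_off + 5 <= src.size() && dst_off + 5 <= dst.size());
    for (std::size_t col = 0; col < 5; ++col) {
        dst[dst_off + col] = src[src_off + col];
    }
}

// Unrolled: all five reads are issued first, then all five writes,
// so accesses to each array are grouped together.
void copy_window_unrolled(const std::vector<float>& src, std::size_t src_off,
                          std::vector<float>& dst, std::size_t dst_off) {
    assert(src_off + 5 <= src.size() && dst_off + 5 <= dst.size());
    const float v0 = src[src_off + 0];
    const float v1 = src[src_off + 1];
    const float v2 = src[src_off + 2];
    const float v3 = src[src_off + 3];
    const float v4 = src[src_off + 4];
    dst[dst_off + 0] = v0;
    dst[dst_off + 1] = v1;
    dst[dst_off + 2] = v2;
    dst[dst_off + 3] = v3;
    dst[dst_off + 4] = v4;
}
```

Both functions must produce identical output; comparing them on the same input is a cheap way to catch indexing mistakes introduced while unrolling.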

~With these optimizations, the inner two for-loops now take around 48441, meaning the whole code should execute in ~3,100,000 at the very least (48441 * 64). Nevertheless, it is taking 3578698 now, suggesting that there is much more cache optimization to be done.

+++++++++++++++++++++++++++++++++++
~Now the major thing is to optimize memory access. It's now taking 3476050 to 3507880.
~The biggest problem is that the outer for-loop jumps in increments of is_1 * is_2 * is_3, which is WAY too large. I suspect that in some cases every access in the next iteration of the for-loop is a cache miss, resulting in huge latency.
++++++++++++
~Down to 3426024 to 3463012.
~I think that at this point I've pretty much exploited the temporal locality benefits, along with other hacky efficiency tricks like declaring variables less often and storing values in variables.
~The inherent problem with the current implementation is what I mentioned above (cache misses), and that seems unavoidable unless we reorder our data structure.
QUESTION: Why does the outer for-loop jump in increments of is_1 * is_2 * is_3? What data sits in between these huge jumps that you are skipping? Is there a way to squash the data down so the accessed elements are closer together?
~^Solving that problem will definitely improve the speed: I found that decreasing the increment size improved the speed by about 2x.
At this point, I've optimized it as best I can without further restructuring of the data itself. It is possible that I made a mistake in the math while refactoring. I've checked, but there might still be errors.
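One way to "squash the data down" is to gather the elements that are actually read into a dense buffer before the hot loop. The following is a host-side sketch under assumptions: `pack_strided`, `count`, `stride`, and `run_len` are hypothetical names, with `stride` standing in for is_1 * is_2 * is_3. Whether this applies depends on what the skipped data is used for, which is the open question above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gather `count` runs of `run_len` consecutive elements, whose starts are
// `stride` elements apart in src, into one contiguous buffer.
std::vector<float> pack_strided(const std::vector<float>& src,
                                std::size_t count, std::size_t stride,
                                std::size_t run_len) {
    assert(run_len <= stride);
    assert(count == 0 || (count - 1) * stride + run_len <= src.size());
    std::vector<float> packed;
    packed.reserve(count * run_len);
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t base = i * stride;  // large jump in the original layout
        for (std::size_t j = 0; j < run_len; ++j) {
            packed.push_back(src[base + j]);
        }
    }
    return packed;
}
```

After packing, run `i` starts at `i * run_len` instead of `i * stride`, so consecutive outer iterations touch neighbouring memory. The packing pass has its own cost, so it only pays off if the packed buffer is reused.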

16liw added 4 commits August 26, 2018 17:16

increase speed

9f3902c

update

9e7a150

optimize

3b5283a

update

f488666

16liw wants to merge 4 commits into master from wendy
