increasing approx kernel faster Speed #2

Open
16liw wants to merge 4 commits into master from wendy

16liw (Collaborator) commented Aug 27, 2018

ANALYSIS:

~The actual write (data[data_row_offset + col + threadIdx.x]) takes 2340 to 2352.
~In theory, this write should be the bottleneck. If it were, and the cost scaled linearly with the number of iterations of the nested for-loops, the total time should be about 3,744,000 (2340 * 5 * 5 * 64, for col, row, and channel).
~However, it actually takes around 4278832 to 4302892, which suggests there is overhead beyond the write itself.

~Timing each for-loop separately does show such a linear pattern. The second for-loop (col) takes 11276 (about 2340 * 5 = 11700). The third for-loop (row) takes about 55700 (about 2340 * 5 * 5 = 58500).
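
The linear estimate above can be checked with a few lines of arithmetic. This is only a host-side C++ sketch, not code from the kernel: `linear_cost` and `overhead_ratio` are hypothetical helper names, and the write cost uses the lower bound (2340) quoted above.

```cpp
#include <cstdint>

// Measured cost of one write, taken from the lower bound quoted above.
constexpr std::int64_t kWriteCost = 2340;

// Loop extents implied by the estimate: 5 cols, 5 rows, 64 channels.
constexpr std::int64_t kCols = 5, kRows = 5, kChannels = 64;

// Predicted cost if every iteration costs exactly one write.
constexpr std::int64_t linear_cost(std::int64_t iterations) {
    return kWriteCost * iterations;
}

// Ratio of a measured time to the linear prediction (1.0 means no overhead).
constexpr double overhead_ratio(std::int64_t measured, std::int64_t predicted) {
    return static_cast<double>(measured) / static_cast<double>(predicted);
}

static_assert(linear_cost(kCols) == 11700, "col loop");
static_assert(linear_cost(kCols * kRows) == 58500, "row loop");
static_assert(linear_cost(kCols * kRows * kChannels) == 3744000, "all loops");
```

With the measured totals of 4278832 to 4302892, `overhead_ratio` comes out between 1.14 and 1.15, i.e. the full run is roughly 14 to 15 percent slower than the linear prediction, while the col and row loops on their own land slightly under it.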
OPTIMIZATION SOLUTION:
~I suspect that the biggest problem with the current implementation is cache super-alignment. That is, non-sequential accesses whose strides are large powers of two cause erratic slowdowns.
~To address this problem:

  1. Unrolling the inner two for-loops: I found that declaring variables takes more time than updating existing ones, which would explain why unrolling the for-loops reduces the time.
  2. Optimizing caching by doing all the reads from the first array first and then writing to the second array.

~With these optimizations, the inner two for-loops now take around 48441, meaning all of the code should execute in ~3,100,000 at the very least (about 48441 * 64). Nevertheless, it is taking 3578698 now, suggesting that there is much more cache optimization to be done.

+++++++++++++++++++++++++++++++++++
~The major remaining task is to optimize memory access. It is now taking 3476050 to 3507880.
~The biggest problem is that the outer for-loop jumps in increments of is_1 * is_2 * is_3, which is WAY too large. I suspect that in some cases every access in the next iteration of the for-loop is a cache miss, resulting in huge latency.
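
To make the stride concern concrete, here is a small host-side C++ sketch that counts how many distinct cache lines a strided access pattern touches. It is a simplified model, not a measurement of this kernel: `outer_stride` and `lines_touched` are hypothetical helpers, and the element size, line size, and values of is_1, is_2, and is_3 used in the note below are assumptions.

```cpp
#include <cstddef>
#include <set>

// Element stride of the outer loop, as described above.
constexpr std::size_t outer_stride(std::size_t is_1, std::size_t is_2,
                                   std::size_t is_3) {
    return is_1 * is_2 * is_3;
}

// Counts the distinct cache lines touched by `iterations` accesses that start
// at element 0 and advance by `stride_elems` elements each time.
std::size_t lines_touched(std::size_t stride_elems, std::size_t iterations,
                          std::size_t elem_bytes, std::size_t line_bytes) {
    std::set<std::size_t> lines;
    for (std::size_t i = 0; i < iterations; ++i) {
        // Byte offset of the i-th access, mapped to its cache-line index.
        lines.insert(i * stride_elems * elem_bytes / line_bytes);
    }
    return lines.size();
}
```

Assuming 4-byte elements and 128-byte lines, 64 accesses at stride 1 touch only 2 lines, while 64 accesses at a stride of `outer_stride(8, 8, 8)` (512 elements) touch 64 different lines, one per access. In this model, every iteration of the outer loop lands on a line that no earlier iteration brought in.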
++++++++++++
~It is now down to 3426024 to 3463012.
~At this point, I think I've pretty much exhausted the temporal locality benefits, along with other hacky efficiency tricks like declaring variables less often and storing values in variables.
~The inherent problem with the current implementation is what I mentioned above (cache misses), and that seems unavoidable unless we reorder our data structure.
QUESTION: Why does the outer for-loop jump in increments of is_1 * is_2 * is_3? What data lies between these huge jumps that you are skipping? Is there a way to pack the data so that the accessed elements are closer together?
~^Solving that problem will definitely improve the speed: I found that decreasing the increment size improved the speed by roughly 2x.
At this point, I've optimized it as best I can without further restructuring of the data itself. It is possible that I made a mistake in the math during my refactoring. I've checked, but there might still be errors.
