Enable (more than) 1 TiB RAM. #78
| Test result (on b9): success, and these messages (Note: 1 TiB = 0x10000000000 B):  | 
| What is the impact, if any, on the resulting memory channel configuration? Are we changing how we're interleaving as a result of changing these tokens? In particular, the main concern, and I think the thing we need to understand before we move forward, is: 
 We can get at some of this data through the zen_umc driver work that I did and the big theory statement has some background there. | 
| Comparing the version before enabling 1 TiB to after enabling 1 TiB, we get no change in interleaving parameters on gimlet b9:  | 
| So in that output, I don't see any change in  I expect that we'll need to go investigate changing the interleaving to do so. | 
| The change that's necessary to make this effective is not directly changing the interleaving settings.
The critical guidance from AMD here is this tidbit from pub 55483 rev 1.70, 4.1.2.9
This goes on to mention the example of interleaving set to 'die' where each socket (sic) has 1TB of DRAM, in which the remapping won't work.
Let's examine then the tokens we might expect to be of interest.
The first is
Let's start with the only setting of this token that can possibly be of use:
The astute reader will notice that this is exactly the same interleaving configuration we get by default.
We shouldn't really be altogether surprised, because the documentation for this parameter tells us it's valid on
If this obvious setting doesn't do anything, what might?
Let's consider instead a pair of tokens
While we can probably get the ABL to choose slightly different bits for interleaving this way (and that might indeed be useful if we want to explore the overall configuration space while observing performance changes), this isn't going to get us what we really need which is for the rule containing the IOMMU hole to begin at some address other than 0. The key observation here is that from Naples to Rome, the essential control knob for this changed from these tokens to the "NUMA" configuration, Nodes per Socket or
Armed with this essential conclusion, we can set
and with
With the full 1 TiB population, each of these rules would be twice as large but the basic organisation will remain the same. In other words, setting this parameter to
That leaves us with two fundamental questions, one much more important than the other: 
 Obviously, if the answer to the first part is no, then the second question is largely moot on this 1S platform. In any event, we now know what is possible in terms of interleaving configuration so at least we can measure what is measurable. It would seem that the obvious path forward here, with lowest impact, is to configure NPS2 and leave the mode selection at the default (i.e., COD-4). In the limit it may be advantageous to go to NPS4 and take advantage of our knowledge of this to improve locality for VMs that fit in a quadrant, though the effect (if any) on globally-bound PCIe devices doing DMA access may not be conducive to great results. | 
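The requirement stated in the comment above, that the rule containing the IOMMU hole begin at some address other than 0, can be illustrated with a toy model. This is a minimal sketch, not the ABL's or the data fabric's actual behaviour: it assumes DRAM is split into equal contiguous rules, one per NUMA node, ignores the below-4 GiB MMIO hole and any hoisting, and assumes the hole starts at 0xFD00000000 (12 GiB below the 1 TiB boundary). The function names are hypothetical.

```python
TIB = 1 << 40               # 0x10000000000 bytes, matching the note in the test result
HOLE_BASE = 0xFD_0000_0000  # assumed start of the IOMMU hole (1012 GiB)


def dram_rules(total_bytes, nodes_per_socket):
    """Split DRAM into equal contiguous [base, limit) rules (simplifying assumption)."""
    size = total_bytes // nodes_per_socket
    return [(i * size, (i + 1) * size) for i in range(nodes_per_socket)]


def rule_containing(rules, addr):
    """Return the (base, limit) rule covering addr, or None if no rule covers it."""
    for base, limit in rules:
        if base <= addr < limit:
            return (base, limit)
    return None


for nps in (1, 2, 4):
    base, limit = rule_containing(dram_rules(TIB, nps), HOLE_BASE)
    print(f"NPS{nps}: rule containing the hole starts at {base:#x}")
```

In this model only the NPS1 layout places the hole inside a rule that starts at 0; with two or four nodes per socket the hole falls in the last rule, whose base is nonzero.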
See my notes on interleaving; in order to make this work on Milan, one must set DfDramNumaPerSocket to Two, or else identify some convincing alternative.
Since this change includes the fixes for additional bugs, please include them in the synopsis.
Then there's the big part: how will you test performance of the alternate interleaving configuration in a way that will be representative of how the machine is being used in the field? This would mean that a pattern of linear accesses to a physical address range would be interleaved over 4 channels rather than 8. How common is that access pattern? Does the kmem allocator end up mostly papering over this anyway? If not, are guest performance or other specific tasks measurably slowed?
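The question above about linear access patterns can be made concrete with a toy model. This sketch assumes a plain modulo interleave at a fixed 256-byte granularity and 64-byte accesses; the real data fabric hashes address bits, so the granularity, the function name, and the mapping are all illustrative assumptions rather than Milan's actual behaviour.

```python
from collections import Counter


def channel_histogram(start, length, channels, granule=256, step=64):
    """Count accesses per channel for a linear scan of [start, start + length).

    Assumes channel = (addr // granule) % channels, a deliberately simple
    stand-in for the hardware's hashed interleave.
    """
    hist = Counter()
    for addr in range(start, start + length, step):
        hist[(addr // granule) % channels] += 1
    return hist


for channels in (8, 4):
    hist = channel_histogram(0, 1 << 20, channels)
    print(f"{channels} channels: {len(hist)} used, "
          f"{max(hist.values())} accesses on the busiest channel")
```

In this model the same linear scan puts twice as many accesses on each channel under 4-way interleaving as under 8-way; whether that difference is visible in guest workloads is what a real measurement would need to show.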
See also #51
Note: This PR is on top of issue-64 and issue-73 as well, and that is how it was tested.