-
Notifications
You must be signed in to change notification settings - Fork 15
Hardware Inventory EVE API Messages #127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
3095edc to
9760dbd
Compare
|
Can you squash the commits? F.e. I don't think we need |
Sure i will update it. |
proto/info/info.proto
Outdated
|
|
||
| string kernel_flavor = 17; // e.g. "pc", "rt", "virt" | ||
| string kernel_version = 18; // e.g. "6.6.45-eve" | ||
| string eve_platform = 19; // board/platform name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rucoder will this platform string be sufficient to identify that it is an evaulation image and which partition of the evaulation image?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after the discussion with @rucoder and @christoph-zededa we will add a field for the partition and the kernel cmdline!
proto/info/info.proto
Outdated
| uint32 cpu_count = 13; // logical CPU count | ||
|
|
||
| bool watchdog_present = 14; | ||
| bool tpm_present = 15; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the golang library make the hardware and firmware versions of the TPM device available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eriknordmark Yes, the google/go-tpm/tpm2 library that we use on EVE, can read both hardware and firmware version info from the TPM. You can use the GetCapability command with TPMCapProperty and the relevant TPM_PT_* constants, such as TPM_PT_FIRMWARE_VERSION_1/2 for firmware and TPM_PT_MANUFACTURER / TPM_PT_VENDOR_STRING_* for vendor and other hardware properties. I will update the PR accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we will add a string for TPM version, if it can be recognized by the API.
b8c7a79 to
6e75bdd
Compare
@christoph-zededa i squashed two of your commits, please check again and tell me if you want me to do anything else. |
6e75bdd to
07a9f9d
Compare
proto/info/info.proto
Outdated
| string vendor = 2; // e.g. "GenuineIntel" | ||
| uint32 id = 3; // logical CPU id | ||
| uint64 freq = 4; // nominal frequency | ||
| repeated CPUCache caches = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caches can be shared between processor cores. IMO it makes therefore more sense to have a list of CPUCache and have it a member repeated cpu_id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You’re right that caches can be shared between cores, but this proto "CPU struct" represents logical CPUs, not core groups. For now we keep Cache per CPU, and if we ever need to model shared cache accross different CPUs we can extend NUMA with a shared-cache structure later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we ever need to model shared cache accross different CPUs we can extend NUMA with a shared-cache structure later
I think it is likely that we need to add it, as it is advantageous when using CPU pinning.
If we extend NUMA with a shared-cache structure, then we have the cache in the CPU message and in the NUMA message and it does not look nice when information is scattered or duplicated over two messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this structure allow you to specify that core 1 and core 2 share an L2 cache and core 3 and core 4 share another L2 cache, and all four share an L3 cache? To specify that don't you need to be able to name the two shared L2 caches so that you can tell it apart from core 1 and core 3 sharing an L2 cache (and 2 + 4 sharing the other).
This cache hiearchy seems premature and can be added later.
And NUMA is more that cache sharing - NUMA also needs to specify the placement of the memory controllers and which memory hangs off of each memory controller.
When we get to add that we should look at what the various generic + intel/amd/arm-specific tools report in terms of memory, controllers, caches, and cores.
No need to spend time on getting this correct now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion would be to do it similar to the ghw library.
This is part of the output of ghwc topology -f json | jq .topology.nodes[0].caches on my laptop:
{
"level": 2,
"type": "unified",
"size_bytes": 1310720,
"logical_processors": [
6,
7
]
},
{
"level": 2,
"type": "unified",
"size_bytes": 2097152,
"logical_processors": [
8,
9,
10,
11
]
},
{
"level": 2,
"type": "unified",
"size_bytes": 2097152,
"logical_processors": [
12,
13,
14,
15
]
},
{
"level": 3,
"type": "unified",
"size_bytes": 18874368,
"logical_processors": [
0,
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15
]
}
The NUMA node has a list of caches and each cache has a list of cpu cores.
Yes, no need to make it correct now, but we should not put obstacles on the way there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, but for NUMA we need something more than just caches to be able to handle the AMD interconnect between cores/chiplets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eriknordmark @christoph-zededa Do we agree to remove caches for the time beeing and rethink about it in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is okay to do that later as we know have an idea how it can be extended.
The controller does not use this information yet, so we're not in an immediate hurry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the conversation we had,we decided to leave it as is for the time beeing and rethink about that in the future!
One commit adds |
add inventory message for EVE to upload to controller Signed-off-by: Christoph Ostarek <[email protected]>
- Extend ZInfoHardware with node summary, NUMA and BIOS attributes - Add node-level summary and device inventories to ZInfoHardware - Refine hardware inventory proto with CPU/NUMA, BIOS and misc attrs - Update hardware inventory API with summary fields and misc attributes Signed-off-by: Konstantinos Perakis <[email protected]>
07a9f9d to
fa862a5
Compare
|
Another thing: Can you change the PR title? |
In general (and to make it easier to merge with any other PRs) the best is to put only and all the proto file changes in one commit and the derived files in a separate commit. (If there is a conflict you can then discard the second commit and regenerate it.) |
proto/info/hardware.proto
Outdated
| repeated SmartAttr smart_attr = 8; // smartctl output | ||
| } | ||
|
|
||
| message USBBusDevnum { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would seem more consistent with the PCI case to name this message USBAddress.
proto/info/hardware.proto
Outdated
| optional PCIAddress pci_parent = 1; | ||
| optional USBBusDevnum usb_parent = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we ever have a different type of parent we have to modify this.
If you add a layer of indirection (e.g., called BusParent) which contains
optional PCIAddress pci_parent = 1;
optional USBBusDevnum usb_parent = 2;
we can later add different types of parents without changing the children.
proto/info/hardware.proto
Outdated
| // https://elixir.bootlin.com/linux/v6.16.9/source/include/linux/usb.h#L984 | ||
| uint32 vendor_id = 3; | ||
| uint32 product_id = 4; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
USB devices also have a class which would be useful to report.
Also, lsusb -t shows the driver which would be helpful to report (to see if drivers are missing).
proto/info/hardware.proto
Outdated
| optional PCIAddress pci_parent = 1; | ||
| optional USBBusDevnum usb_parent = 2; | ||
|
|
||
| string ioports = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the plural "ports" imply that this is a list? If so, how are the different items separated (comma? space?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eriknordmark Good point, the name is confusing. This field is a single I/O port range string as reported by the kernel (for example "03f8-03ff" for a serial port), not a list of multiple values.
I can rename it to ioport_range (or similar) and update the comment to make the format explicit. Would that work better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Such a rename and comment would help.
proto/info/hardware.proto
Outdated
| optional USBBusDevnum usb_parent = 2; | ||
|
|
||
| string ifname = 3; | ||
| bool wireless = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we currently separate WiFi and Cellular? (And then there is BT as well which spec.sh doesn't look for AFAIK.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eriknordmark i agree, this is what is described in the document.
I suggest the following change:
enum NetworkDeviceType {
NETWORK_DEVICE_TYPE_UNKNOWN = 0;
NETWORK_DEVICE_TYPE_ETHERNET = 1;
NETWORK_DEVICE_TYPE_WIFI = 2;
NETWORK_DEVICE_TYPE_CELLULAR = 3;
NETWORK_DEVICE_TYPE_BT = 4;
}
message NetworkDevice {
BusParent bus_parent = 1;
string ifname = 3;
NetworkDeviceType type = 4;
string mac_address = 5; // e.g. "00:11:22:33:44:AA"
uint32 speed_mbps = 6; // link speed if known
}
To add enum with some known device types. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have some existing definitions for the config side of the API but these are intermixed with other I/O adapters:
proto/evecommon/devmodelcommon.proto: PhyIoNetEth = 1;
proto/evecommon/devmodelcommon.proto: PhyIoNetWLAN = 5;
proto/evecommon/devmodelcommon.proto: PhyIoNetWWAN = 6;
proto/evecommon/devmodelcommon.proto: PhyIoNetEthPF = 11;
proto/evecommon/devmodelcommon.proto: PhyIoNetEthVF = 12;
Note that we haven't defined anything for BT yet.
How about renaming NETWORK_DEVICE_TYPE_CELLULAR to NETWORK_DEVICE_TYPE_WWAN and leaving out BT for now?
proto/info/hardware.proto
Outdated
| repeated CPUCache caches = 5; | ||
| } | ||
|
|
||
| message NUMA { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean? Isn't it just a list of the CPUs? I see no memory-specific info in here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eriknordmark You’re right, with the current definition NUMA is effectively just a list of CPUs. I followed the plan in the hardware inventory document: https://docs.google.com/document/d/1AqeDQNHneVpCvtnBDU3gIzwKI6BA_S_Bdq3MKuPW2vQ/edit?tab=t.0,
where the NUMA message only defines the CPUs, and the memory is exposed at the node level via ZInfoHardware.total_memory_bytes. As I understand it, the CPUs in a given NUMA node share the same memory.
Do you think we should also include memory information inside the NUMA message or keep it as is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is confusing to call the list of CPUs "NUMA" so I would argue to just rename it (to CPUS or CPU_list). But check with @christoph-zededa and others in the team.
Later we can add NUMA information to capture the CPU to memory relationships.
proto/info/hardware.proto
Outdated
| uint32 cpu_count = 12; // logical CPU count | ||
|
|
||
| bool watchdog_present = 13; | ||
| bool tpm_present = 14; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spec.sh reports only a boolean. But it probably makes sense making this a message/struct so that we can also report manufacturer and firmware versions. @shjala can provide details of what is useful, but we can fill in more details later if you define a optional message for now.
proto/info/hardware.proto
Outdated
| } | ||
|
|
||
| // Information about the system that is sent once when the system boots | ||
| message ZInfoHardware { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The structure of this message doesn't make it clear what we consider being the hardware inventory. And a consumer probably wants to be able to compare the inventory without being confused by e.g., the eve_release string. That string is very useful, but it is not part of the hardware inventory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eriknordmark according to the plan, we should include the hardware inventory in the existing ZInfoHardware message that we already send to the controller when the system boots.
If we rename it, we’ll likely break compatibility with previous versions. Do you agree that we should instead add a HardwareInventory message under ZInfoHardware?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, some HardwareInventory inside it would make it clear what is inventory and what is not. Currently I think there is only the smart and the eve release which should be outside. I don't know if @rucoder will want some additional field for the evaluation partition names to track them. But if he doesn't respond we can worry about that later.
eriknordmark
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, please update the https://github.com/lf-edge/eve-api/blob/main/APIv2.md#hardware-health-info as a result of the move of the smart counters, and add a bullet item for hardware info in https://github.com/lf-edge/eve-api/blob/main/APIv2.md#info (it appears to not have been added before).
@eriknordmark sure, lets finalize all the changes and i will update the docs. |
e2ccc2c to
89cb091
Compare
|
@christoph-zededa @europaul @eriknordmark @rucoder can you check the PR again please? I did the changes we discussed yesterday! |
| string kernel_flavor = 16; // e.g. "pc", "rt", "virt" | ||
| string kernel_version = 17; // e.g. "6.6.45-eve" | ||
| string eve_platform = 18; // board/platform name | ||
| string partition = 19; // e.g. "IMGA", "IMGB" | ||
| string kernel_cmdline = 20; // current kernel command line |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is useful information, but it doesn't change the hardware itself. Thus a change to one of these five shouldn't be an indication that the hardware inventory changed. Thus I think it makes sense to place them one level up in the ZInfoHardware message and not in the inventory message.
eriknordmark
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment about moving up some fields to the same place as string eve_release
Also. please put all the proto filechanges in one commit and all the other changes in a separate commit.
Once those are done go ahead and merge. (I'll be offline for a few hours).
Move hardware inventory API in a new hardware.proto file. Signed-off-by: Konstantinos Perakis <[email protected]>
89cb091 to
4f5cb37
Compare
This PR extends ZInfoHardware and related messages to support the hardware inventory work for 4 steps to Nirvana. It adds structured PCI/USB/serial/network/CAN, BIOS, CPU/NUMA fields and basic node-level summary (memory, storage, CPUs, kernel, platform, TPM/watchdog/LED presence).