From cef6a0df5b9c24e3f70f5ba4699073babbf9e16b Mon Sep 17 00:00:00 2001 From: Adrian Reber Date: Thu, 3 Jul 2025 13:26:07 +0000 Subject: [PATCH] Introduce WG Checkpoint Restore MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Viktória Spišaková Co-authored-by: Antonio Ojea Co-authored-by: Sergey Kanzhelev Signed-off-by: Adrian Reber --- OWNERS_ALIASES | 5 ++ liaisons.md | 1 + sig-apps/README.md | 1 + sig-auth/README.md | 1 + sig-list.md | 1 + sig-node/README.md | 1 + sig-scheduling/README.md | 1 + sigs.yaml | 37 ++++++++++++++ wg-checkpoint-restore/README.md | 37 ++++++++++++++ wg-checkpoint-restore/charter.md | 87 ++++++++++++++++++++++++++++++++ 10 files changed, 172 insertions(+) create mode 100644 wg-checkpoint-restore/README.md create mode 100644 wg-checkpoint-restore/charter.md diff --git a/OWNERS_ALIASES b/OWNERS_ALIASES index 12ad082f53a..4a3dc88eeb7 100644 --- a/OWNERS_ALIASES +++ b/OWNERS_ALIASES @@ -146,6 +146,11 @@ aliases: - kannon92 - mwielgus - tenzen-y + wg-checkpoint-restore-leads: + - adrianreber + - haircommander + - rst0git + - viktoriaas wg-data-protection-leads: - xing-yang - yuxiangqian diff --git a/liaisons.md b/liaisons.md index 5640e615b20..cd56458fd61 100644 --- a/liaisons.md +++ b/liaisons.md @@ -58,6 +58,7 @@ members will assume one of the departing members groups. 
| [WG AI Gateway](wg-ai-gateway/README.md) | Stephen Augustus (**[@justaugustus](https://github.com/justaugustus)**) | | [WG AI Integration](wg-ai-integration/README.md) | Paco Xu 徐俊杰 (**[@pacoxu](https://github.com/pacoxu)**) | | [WG Batch](wg-batch/README.md) | Antonio Ojea (**[@aojea](https://github.com/aojea)**) | +| [WG Checkpoint Restore](wg-checkpoint-restore/README.md) | Benjamin Elder (**[@BenTheElder](https://github.com/BenTheElder)**) | | [WG Data Protection](wg-data-protection/README.md) | Patrick Ohly (**[@pohly](https://github.com/pohly)**) | | [WG Device Management](wg-device-management/README.md) | Benjamin Elder (**[@BenTheElder](https://github.com/BenTheElder)**) | | [WG etcd Operator](wg-etcd-operator/README.md) | Maciej Szulik (**[@soltysh](https://github.com/soltysh)**) | diff --git a/sig-apps/README.md b/sig-apps/README.md index ce44ac07645..02d0754e209 100644 --- a/sig-apps/README.md +++ b/sig-apps/README.md @@ -59,6 +59,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. The following [working groups][working-group-definition] are sponsored by sig-apps: * [WG AI Integration](/wg-ai-integration) * [WG Batch](/wg-batch) +* [WG Checkpoint Restore](/wg-checkpoint-restore) * [WG Data Protection](/wg-data-protection) * [WG Node Lifecycle](/wg-node-lifecycle) * [WG Serving](/wg-serving) diff --git a/sig-auth/README.md b/sig-auth/README.md index 39110757e10..2615e7b2088 100644 --- a/sig-auth/README.md +++ b/sig-auth/README.md @@ -66,6 +66,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. 
The following [working groups][working-group-definition] are sponsored by sig-auth: * [WG AI Integration](/wg-ai-integration) +* [WG Checkpoint Restore](/wg-checkpoint-restore) ## Subprojects diff --git a/sig-list.md b/sig-list.md index d64a2dd6c4e..0d4b091c646 100644 --- a/sig-list.md +++ b/sig-list.md @@ -65,6 +65,7 @@ When the need arises, a [new SIG can be created](sig-wg-lifecycle.md) |[AI Gateway](wg-ai-gateway/README.md)|[ai-gateway](https://github.com/kubernetes/kubernetes/labels/wg%2Fai-gateway)|* Multicluster
* Network
|* [Keith Mattix](https://github.com/keithmattix), Microsoft
* [Flynn](https://github.com/kflynn), Buoyant
* [Kellen Swain](https://github.com/kfswain), Google
* [Nir Rozenbaum](https://github.com/nirrozenbaum), IBM
* [Shane Utt](https://github.com/shaneutt), Red Hat
* [Xunzhuo](https://github.com/xunzhuo), Tencent
|* [Slack](https://kubernetes.slack.com/messages/wg-ai-gateway)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-ai-gateway)|* WG AI Gateway Bi-Weekly Meeting (Earlier Option): [Mondays at 12PM UTC (bi-weekly)]()
* WG AI Gateway Bi-Weekly Meeting (Later Option): [Thursdays at 6PM UTC (bi-weekly)]()
|[AI Integration](wg-ai-integration/README.md)|[ai-integration](https://github.com/kubernetes/kubernetes/labels/wg%2Fai-integration)|* API Machinery
* Apps
* Architecture
* Auth
* CLI
|* [Arda Guclu](https://github.com/ardaguclu), Red Hat
* [Arush Sharma](https://github.com/rushmash91), Amazon
* [Zvonko Kaiser](https://github.com/zvonkok), NVIDIA
|* [Slack](https://kubernetes.slack.com/messages/wg-ai-integration)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-ai-integration)|* WG AI Integration Weekly Meeting ([calendar](https://calendar.google.com/calendar/embed?src=71ef14cc0995618018b12614c63ca482d667e2922ff5b94d9fb0cfd32d4efada%40group.calendar.google.com)): [Wednesdays at 10:00 PT (Pacific Time) (biweekly)](https://zoom.us/j/95637970280?pwd=3Ys5MQF5hKoeWDazUsMdgt5FiRxbSs.1)
|[Batch](wg-batch/README.md)|[batch](https://github.com/kubernetes/kubernetes/labels/wg%2Fbatch)|* Apps
* Autoscaling
* Node
* Scheduling
|* [Kevin Hannon](https://github.com/kannon92), Red Hat
* [Marcin Wielgus](https://github.com/mwielgus), Google
* [Yuki Iwai](https://github.com/tenzen-y), CyberAgent, Inc.
|* [Slack](https://kubernetes.slack.com/messages/wg-batch)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-batch)|* Regular Meeting ([calendar](https://calendar.google.com/calendar/embed?src=8ulop9k0jfpuo0t7kp8d9ubtj4%40group.calendar.google.com)): [Thursdays (starting February 15th 2024)s at 3PM CET (Central European Time) (monthly)](https://zoom.us/j/98329676612?pwd=c0N2bVV1aTh2VzltckdXSitaZXBKQT09)
+|[Checkpoint Restore](wg-checkpoint-restore/README.md)|[checkpoint-restore](https://github.com/kubernetes/kubernetes/labels/wg%2Fcheckpoint-restore)|* Apps
* Auth
* Node
* Scheduling
|* [Adrian Reber](https://github.com/adrianreber), Red Hat
* [Peter Hunt](https://github.com/haircommander), Red Hat
* [Radostin Stoyanov](https://github.com/rst0git), University of Oxford
* [Viktória Spišaková](https://github.com/viktoriaas), Masaryk University
|* [Slack](https://kubernetes.slack.com/messages/wg-checkpoint-restore)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-checkpoint-restore)| |[Data Protection](wg-data-protection/README.md)|[data-protection](https://github.com/kubernetes/kubernetes/labels/wg%2Fdata-protection)|* Apps
* Storage
|* [Xing Yang](https://github.com/xing-yang), VMware
* [Xiangqian Yu](https://github.com/yuxiangqian), Google
|* [Slack](https://kubernetes.slack.com/messages/wg-data-protection)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-data-protection)|* Regular WG Meeting: [Wednesdays at 9:00 PT (Pacific Time) (bi-weekly)](https://zoom.us/j/6933410772)
|[Device Management](wg-device-management/README.md)|[device-management](https://github.com/kubernetes/kubernetes/labels/wg%2Fdevice-management)|* Architecture
* Autoscaling
* Network
* Node
* Scheduling
|* [John Belamaric](https://github.com/johnbelamaric), Google
* [Kevin Klues](https://github.com/klueska), NVIDIA
* [Patrick Ohly](https://github.com/pohly), Intel
|* [Slack](https://kubernetes.slack.com/messages/wg-device-management)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-device-management)|* Regular WG Meeting (Asia/Europe): [Wednesdays at 9:00 CET (Central European Time) (biweekly)](https://zoom.us/j/97238699195?pwd=cy9IMm1ZeERtRlJ3VS8yWUxHUWIrQT09)
* Regular WG Meeting (Europe/America): [Tuesdays at 8:30 PT (Pacific Time) (biweekly)](https://zoom.us/j/97238699195?pwd=cy9IMm1ZeERtRlJ3VS8yWUxHUWIrQT09)
|[etcd Operator](wg-etcd-operator/README.md)|[etcd-operator](https://github.com/kubernetes/kubernetes/labels/wg%2Fetcd-operator)|* Cluster Lifecycle
* etcd
|* [Benjamin Wang](https://github.com/ahrtr), VMware
* [Ciprian Hacman](https://github.com/hakman), Microsoft
* [Josh Berkus](https://github.com/jberkus), Red Hat
* [James Blair](https://github.com/jmhbnz), Red Hat
* [Justin Santa Barbara](https://github.com/justinsb), Google
|* [Slack](https://kubernetes.slack.com/messages/wg-etcd-operator)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-etcd-operator)|* Regular WG Meeting: [Tuesdays at 11:00 PT (Pacific Time) (bi-weekly)](https://zoom.us/my/cncfetcdproject)
diff --git a/sig-node/README.md b/sig-node/README.md index 1ef3742dbb3..1826db21a11 100644 --- a/sig-node/README.md +++ b/sig-node/README.md @@ -54,6 +54,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. The following [working groups][working-group-definition] are sponsored by sig-node: * [WG Batch](/wg-batch) +* [WG Checkpoint Restore](/wg-checkpoint-restore) * [WG Device Management](/wg-device-management) * [WG Node Lifecycle](/wg-node-lifecycle) * [WG Serving](/wg-serving) diff --git a/sig-scheduling/README.md b/sig-scheduling/README.md index 1d6b8c590f0..f0365b30bc6 100644 --- a/sig-scheduling/README.md +++ b/sig-scheduling/README.md @@ -63,6 +63,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. The following [working groups][working-group-definition] are sponsored by sig-scheduling: * [WG Batch](/wg-batch) +* [WG Checkpoint Restore](/wg-checkpoint-restore) * [WG Device Management](/wg-device-management) * [WG Node Lifecycle](/wg-node-lifecycle) * [WG Serving](/wg-serving) diff --git a/sigs.yaml b/sigs.yaml index 22e273db496..9d44368e56e 100644 --- a/sigs.yaml +++ b/sigs.yaml @@ -3648,6 +3648,43 @@ workinggroups: liaison: github: aojea name: Antonio Ojea + - dir: wg-checkpoint-restore + name: Checkpoint Restore + mission_statement: > + This working group aims to provide a central location for the community to discuss the integration of Checkpoint/Restore functionality into Kubernetes. 
+ + charter_link: charter.md + stakeholder_sigs: + - Apps + - Auth + - Node + - Scheduling + label: checkpoint-restore + leadership: + chairs: + - github: adrianreber + name: Adrian Reber + company: Red Hat + email: areber@redhat.com + - github: haircommander + name: Peter Hunt + company: Red Hat + email: pehunt@redhat.com + - github: rst0git + name: Radostin Stoyanov + company: University of Oxford + email: radostin.stoyanov@eng.ox.ac.uk + - github: viktoriaas + name: Viktória Spišaková + company: Masaryk University + email: spisakova@ics.muni.cz + meetings: [] + contact: + slack: wg-checkpoint-restore + mailing_list: https://groups.google.com/a/kubernetes.io/g/wg-checkpoint-restore + liaison: + github: BenTheElder + name: Benjamin Elder - dir: wg-data-protection name: Data Protection mission_statement: > diff --git a/wg-checkpoint-restore/README.md b/wg-checkpoint-restore/README.md new file mode 100644 index 00000000000..2a6134b2ac0 --- /dev/null +++ b/wg-checkpoint-restore/README.md @@ -0,0 +1,37 @@ + +# Checkpoint Restore Working Group + +This working group aims to provide a central location for the community to discuss the integration of Checkpoint/Restore functionality into Kubernetes. + +The [charter](charter.md) defines the scope and governance of the Checkpoint Restore Working Group. 
+ +## Stakeholder SIGs +* [SIG Apps](/sig-apps) +* [SIG Auth](/sig-auth) +* [SIG Node](/sig-node) +* [SIG Scheduling](/sig-scheduling) + + + +## Organizers + +* Adrian Reber (**[@adrianreber](https://github.com/adrianreber)**), Red Hat +* Peter Hunt (**[@haircommander](https://github.com/haircommander)**), Red Hat +* Radostin Stoyanov (**[@rst0git](https://github.com/rst0git)**), University of Oxford +* Viktória Spišaková (**[@viktoriaas](https://github.com/viktoriaas)**), Masaryk University + +## Contact +- Slack: [#wg-checkpoint-restore](https://kubernetes.slack.com/messages/wg-checkpoint-restore) +- [Mailing list](https://groups.google.com/a/kubernetes.io/g/wg-checkpoint-restore) +- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/wg%2Fcheckpoint-restore) +- Steering Committee Liaison: Benjamin Elder (**[@BenTheElder](https://github.com/BenTheElder)**) + + + diff --git a/wg-checkpoint-restore/charter.md b/wg-checkpoint-restore/charter.md new file mode 100644 index 00000000000..04647d6332a --- /dev/null +++ b/wg-checkpoint-restore/charter.md @@ -0,0 +1,87 @@ + +# WG Checkpoint Restore Charter + +This charter adheres to the conventions described in the [Kubernetes Charter README] and uses +the Roles and Organization Management outlined in [sig-governance]. + +## Scope + +The Checkpoint/Restore Working Group aims to solve the problem of transparently +checkpointing and restoring workloads in Kubernetes, a [functionality discussed +for over five years][kep2008]. The group will deliver the design and +implementation of Checkpoint/Restore functionality in Kubernetes, serving as a +central hub for community information and discussion. This initiative addresses +a wide range of problems, including fault tolerance, improved resource +utilization, and accelerated application startup times. 
+ +### In scope + +- Identify core Kubernetes checkpoint/restore use cases (e.g., live migration, + fault tolerance, debugging, snapshotting) and gather stakeholder requirements. +- Investigate and propose Kubernetes APIs for checkpoint/restore operations. +- Work with SIGs for the best integration of checkpoint/restore functionality + and APIs. +- Provide guidance for developers on checkpoint-friendly app design and + recommendations for operators on feature management. +- Work closely with relevant upstream projects (CRI-O, containerd, CRIU, gVisor) + for alignment and integration. +- Revisit the existing implementations to find and remedy possible inefficiencies. + One example is the existing checkpoint archive format, which has already been + identified as a major source of slowdown. + +### Out of scope + +- Not focused on general OS-level checkpointing outside Kubernetes + pods/containers. +- Will not dictate internal application checkpointing logic; focuses on + Kubernetes platform orchestration of container/pod state. + +## Stakeholders + +Stakeholders in this working group span multiple SIGs that own parts of the +code in core Kubernetes components and addons. + +- SIG Node +- SIG Scheduling +- SIG Auth +- SIG Apps + +## Deliverables + +The list of deliverables includes the following high-level features: + +- In the early stage, we mainly want to offer a well-defined location for the + community to find information, ask questions, and discuss the next steps of + enabling checkpoint and restore in Kubernetes. + +Later: + +- Ability to checkpoint and restore a container using kubectl +- Ability to checkpoint and restore a pod using kubectl +- Integration of container/pod checkpointing in scheduling decisions + +## Roles and Organization Management + +This WG adheres to the Roles and Organization Management outlined in [wg-governance] +and opts in to updates and modifications to [wg-governance]. 
+ +[wg-governance]: /committee-steering/governance/wg-governance.md + +Additionally, the WG commits to: + +- maintain a solid communication line between the Kubernetes groups and the + wider CNCF community + +## Timelines and Disbanding + +As a first mandate, the WG will propose a draft roadmap and identify key tasks in the first quarter of operation. + +After that, the WG will facilitate collaboration among community members to explore possible APIs and draft proposals for their integration into Kubernetes, which will then be presented to the relevant SIGs. + +Achieving the aforementioned deliverables, also mentioned in the `In Scope` +section, will allow us to decide when to disband this WG. There is no +expectation that the Working Group will be converted into a SIG long term. + +[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md +[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md +[kep2008]: https://github.com/kubernetes/enhancements/issues/2008