Fix Kyuubi OOM bug when multiple batch jobs are submitted at once in large amount #7227
Conversation
…cordingly once engine submit timeout is reached - prevent subsequent kyuubi OOM
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff            @@
##           master    #7227     +/-  ##
========================================
  Coverage    0.00%    0.00%
========================================
  Files         696      696
  Lines       43530    43543      +13
  Branches     5883     5884       +1
========================================
- Misses     43530    43543      +13
Pull Request Overview
This PR addresses issue #7226 by preventing Kyuubi OOM errors when multiple batch jobs time out waiting for Spark driver engines. When a batch job reaches the engine submit timeout, the metadata store is now properly updated with TIMEOUT state and NOT_FOUND engine state, preventing the restarted Kyuubi server from repeatedly polling these timed-out jobs.
Key Changes:
- Updated timeout handling to persist batch job state when engine submission times out
- Added metadata store update with proper error state and message on timeout
- Added integration test to verify timeout behavior updates metadata correctly
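The key change above can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual code: the names `TimeoutSketch`, `onEngineCheck`, and the in-memory store are all assumptions for demonstration; see `KubernetesApplicationOperation.scala` in the diff for the real implementation.

```scala
// Hypothetical sketch of the timeout path: once the engine submit timeout
// elapses with no driver pod found, persist a terminal TIMEOUT/NOT_FOUND
// state so a restarted Kyuubi server stops polling this batch job.
case class Metadata(
    identifier: String,
    state: String,
    engineState: String,
    engineError: Option[String])

object TimeoutSketch {
  // Stand-in for the relational-DB-backed metadata store.
  private val store = scala.collection.mutable.Map.empty[String, Metadata]

  def updateMetadata(m: Metadata): Unit = store(m.identifier) = m

  def onEngineCheck(
      batchId: String,
      elapsedMs: Long,
      submitTimeoutMs: Long,
      driverPodFound: Boolean): Unit = {
    if (elapsedMs > submitTimeoutMs && !driverPodFound) {
      updateMetadata(Metadata(
        batchId,
        "TIMEOUT",
        "NOT_FOUND",
        Some(s"Driver pod not found within ${submitTimeoutMs}ms")))
    }
  }

  def stateOf(batchId: String): Option[String] = store.get(batchId).map(_.state)
}
```

Because the persisted state is terminal, a restarted server can skip these entries instead of re-polling each one, which is what previously accumulated into OOM.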
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| KubernetesApplicationOperation.scala | Added metadata store update logic when driver pod is not found after submit timeout |
| SparkOnKubernetesTestsSuite.scala | Added integration test verifying timeout state is properly persisted to metadata store |
kyuubi-server/src/main/scala/org/apache/kyuubi/engine/KubernetesApplicationOperation.scala
(Outdated review thread, resolved)
    assert(!failKillResponse._1)
  }

  test(
    "If spark batch reach timeout, it should have associated Kyuubi Application Operation be " +
Copilot
AI
Oct 24, 2025
Grammatical error in test description. Should be 'reaches timeout' instead of 'reach timeout', and 'should have the associated' instead of 'should have associated'.
- "If spark batch reach timeout, it should have associated Kyuubi Application Operation be " +
+ "If spark batch reaches timeout, it should have the associated Kyuubi Application Operation be " +
Co-authored-by: Copilot <[email protected]>
Hi @JoonPark1 For this issue, is there a chance to update the metadata in BatchJobSubmission instead? kyuubi/kyuubi-server/src/main/scala/org/apache/kyuubi/operation/BatchJobSubmission.scala Lines 169 to 189 in e8bbf52
Hey @turboFei. I believe the spark driver engine state and spark app state will be updated in the metadata store...
Hi @JoonPark1 Could you provide more details?
@turboFei Sure! Once a Kyuubi batch job times out because the elapsed time exceeds the configured submitTimeout value (no Spark driver was instantiated and reached the running state to handle the submitted batch job), the metadata about the Spark application and the Spark driver engine state is updated via the updateMetadata method of org.apache.kyuubi.server.metadata.MetadataManager, which takes the up-to-date Metadata object (an instance of org.apache.kyuubi.server.metadata.api.Metadata). Internally, the manager calls the updateMetadata method of org.apache.kyuubi.server.metadata.MetadataStore, which keeps the state of each submitted Kyuubi batch job in sync with Kyuubi's metadata store in the relational DB. As you can see, this flow does not need to invoke BatchJobSubmission::updateBatchMetadata to update Kyuubi's metadata store instance.
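The delegation described above can be illustrated with a simplified sketch. The trait and in-memory store below are stand-ins invented for illustration (`BatchMetadata`, `InMemoryMetadataStore` are not Kyuubi classes); the real classes under org.apache.kyuubi.server.metadata have much richer APIs.

```scala
// Simplified sketch of the MetadataManager -> MetadataStore delegation:
// the manager forwards the up-to-date metadata to the store, which keeps
// the relational DB in sync, so BatchJobSubmission.updateBatchMetadata
// need not be involved.
case class BatchMetadata(identifier: String, state: String, engineState: String)

trait MetadataStore {
  def updateMetadata(metadata: BatchMetadata): Unit
}

// Stand-in for the JDBC-backed store.
class InMemoryMetadataStore extends MetadataStore {
  val rows = scala.collection.mutable.Map.empty[String, BatchMetadata]
  override def updateMetadata(metadata: BatchMetadata): Unit =
    rows(metadata.identifier) = metadata
}

class MetadataManager(store: MetadataStore) {
  // Pure delegation: persist whatever up-to-date metadata the caller built.
  def updateMetadata(metadata: BatchMetadata): Unit =
    store.updateMetadata(metadata)
}
```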
Why are the changes needed?
This PR addresses bug #7226. It updates the metadata store accordingly for batch jobs that have timed out while waiting for an available Spark driver engine. This prevents a subsequently restarted Kyuubi server from repeatedly polling the Spark application status of every such batch job, which can cause consecutive OOM errors when Kyuubi is deployed in Kubernetes cluster mode.
How was this patch tested?
This patch was tested through an integration test added to the SparkOnKubernetesTestsSuite.scala test suite.
Was this patch authored or co-authored using generative AI tooling?
No!