
Commit ea04238

jiateoh authored and HeartSaVioR committed
[SPARK-53870][PYTHON][SS][4.0] Fix partial read bug for large proto messages in TransformWithStateInPySparkStateServer
### What changes were proposed in this pull request?

This is a branch-4.0 PR for #52539. The description is copied and updated below (4.0 has a slightly different test setup and only provides pandas tests).

Fix the TransformWithState StateServer's `parseProtoMessage` method to fully read the desired message using the correct [`readFully` DataInputStream API](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/DataInput.html#readFully(byte%5B%5D)) rather than `read` (InputStream/FilterInputStream), which may return after reading only the currently available bytes rather than the full message. [`readFully` (DataInputStream)](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/DataInput.html#readFully(byte%5B%5D)) keeps fetching until it fills the provided buffer. In addition to the linked API docs, this StackOverflow post also illustrates the difference between the two APIs: https://stackoverflow.com/a/25900095

### Why are the changes needed?

For large state values used in the TransformWithState API, `inputStream.read` is not guaranteed to read `messageLen` bytes of data, per the InputStream API contract. For large values, `read` can return prematurely, leaving `messageBytes` only partially filled and yielding an incorrect, likely unparseable proto message. This is not a common scenario; testing indicated that the proto messages had to be fairly large to consistently trigger this error. The added test case uses 512KB strings in the state value updates.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

(Note: compared to the original PR, this 4.0 branch organizes tests differently and only supports the pandas tests.)

Added a new test case using 512KB strings:
- Value state update
- List state update with 3 (different) values (note: list state provides a multi-value update API, so this message is even larger than the other two)
- Map state update with a single key/value

```
build/sbt -Phive -Phive-thriftserver -DskipTests package
python/run-tests --testnames 'pyspark.sql.tests.pandas.test_pandas_transform_with_state TransformWithStateInPandasTests'
```

The configured data size (512KB) triggers an incomplete read while still completing in a reasonable time (within 30s on my laptop). I separately tested a larger input size of 4MB, which took 30 minutes and was too expensive to include in the test.

Below are sample test results from using `read` only (i.e., without the fix) and adding a check on message length vs. bytes read ([test code is included in this commit](b68cfd7) but reverted later for the PR); a sketch of that check appears after the Scala diff below. The check is no longer required after the `readFully` fix, since the provided API handles it.

```
TransformWithStateInPandasTests
pyspark.errors.exceptions.base.PySparkRuntimeError: Error updating map state value: TESTING: Failed to read message bytes: expected 524369 bytes, but only read 261312 bytes
```

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-sonnet-4-5-20250929)

Closes #52596 from jiateoh/tws_readFully_fix-4.0.

Authored-by: Jason Teoh <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
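As an illustration of the difference (a minimal sketch, not Spark code; `ReadFullySketch`, `readLoop`, and `readMessage` are hypothetical names): a single `InputStream.read` call may fill only part of the buffer, while `DataInputStream.readFully` loops internally until the buffer is full or the stream ends.

```scala
import java.io.{DataInputStream, EOFException, InputStream}

object ReadFullySketch {
  // What readFully does internally: keep calling read() until the buffer is full.
  def readLoop(in: InputStream, buf: Array[Byte]): Unit = {
    var offset = 0
    while (offset < buf.length) {
      // read() may return after fetching only the bytes currently available
      val n = in.read(buf, offset, buf.length - offset)
      if (n < 0) throw new EOFException(s"stream ended after $offset of ${buf.length} bytes")
      offset += n
    }
  }

  // Length-prefixed framing as in parseProtoMessage: an int header, then the payload.
  def readMessage(in: DataInputStream): Array[Byte] = {
    val messageLen = in.readInt()
    val buf = new Array[Byte](messageLen)
    in.readFully(buf) // blocks until all messageLen bytes are read, or throws EOFException
    buf
  }
}
```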
1 parent 1a9051a · commit ea04238

File tree

2 files changed (+84, -1 lines)


python/pyspark/sql/tests/pandas/test_pandas_transform_with_state.py

Lines changed: 83 additions & 0 deletions
```diff
@@ -1601,6 +1601,49 @@ def check_exception(error):
             check_exception=check_exception,
         )
 
+    def test_transform_with_state_in_pandas_large_values(self):
+        """Test large state values (512KB) to validate readFully fix for SPARK-53870"""
+
+        def check_results(batch_df, batch_id):
+            batch_df.collect()
+            target_size_bytes = 512 * 1024
+            large_string = "a" * target_size_bytes
+            expected_list_elements = ",".join(
+                [large_string, large_string + "b", large_string + "c"]
+            )
+            expected_map_result = f"large_string_key:{large_string}"
+
+            assert set(batch_df.sort("id").collect()) == {
+                Row(
+                    id="0",
+                    valueStateResult=large_string,
+                    listStateResult=expected_list_elements,
+                    mapStateResult=expected_map_result,
+                ),
+                Row(
+                    id="1",
+                    valueStateResult=large_string,
+                    listStateResult=expected_list_elements,
+                    mapStateResult=expected_map_result,
+                ),
+            }
+
+        output_schema = StructType(
+            [
+                StructField("id", StringType(), True),
+                StructField("valueStateResult", StringType(), True),
+                StructField("listStateResult", StringType(), True),
+                StructField("mapStateResult", StringType(), True),
+            ]
+        )
+
+        self._test_transform_with_state_in_pandas_basic(
+            PandasLargeValueStatefulProcessor(),
+            check_results,
+            single_batch=True,
+            output_schema=output_schema,
+        )
+
 
 class SimpleStatefulProcessorWithInitialState(StatefulProcessor):
     # this dict is the same as input initial state dataframe
@@ -2374,6 +2417,46 @@ def close(self) -> None:
         pass
 
 
+class PandasLargeValueStatefulProcessor(StatefulProcessor):
+    """Test processor for large state values (512KB) to validate readFully fix"""
+
+    def init(self, handle: StatefulProcessorHandle):
+        value_state_schema = StructType([StructField("value", StringType(), True)])
+        self.value_state = handle.getValueState("valueState", value_state_schema)
+
+        list_state_schema = StructType([StructField("value", StringType(), True)])
+        self.list_state = handle.getListState("listState", list_state_schema)
+
+        self.map_state = handle.getMapState("mapState", "key string", "value string")
+
+    def handleInputRows(self, key, rows, timerValues) -> Iterator[pd.DataFrame]:
+        target_size_bytes = 512 * 1024
+        large_string = "a" * target_size_bytes
+
+        self.value_state.update((large_string,))
+        value_retrieved = self.value_state.get()[0]
+
+        self.list_state.put([(large_string,), (large_string + "b",), (large_string + "c",)])
+        list_retrieved = list(self.list_state.get())
+        list_elements = ",".join([elem[0] for elem in list_retrieved])
+
+        map_key = ("large_string_key",)
+        self.map_state.updateValue(map_key, (large_string,))
+        map_retrieved = f"{map_key[0]}:{self.map_state.getValue(map_key)[0]}"
+
+        yield pd.DataFrame(
+            {
+                "id": key,
+                "valueStateResult": [value_retrieved],
+                "listStateResult": [list_elements],
+                "mapStateResult": [map_retrieved],
+            }
+        )
+
+    def close(self) -> None:
+        pass
+
+
 class TransformWithStateInPandasTests(TransformWithStateInPandasTestsMixin, ReusedSQLTestCase):
     pass
 
```
sql/core/src/main/scala/org/apache/spark/sql/execution/python/streaming/TransformWithStateInPandasStateServer.scala

Lines changed: 1 addition & 1 deletion
```diff
@@ -190,7 +190,7 @@ class TransformWithStateInPandasStateServer(
   private def parseProtoMessage(): StateRequest = {
     val messageLen = inputStream.readInt()
     val messageBytes = new Array[Byte](messageLen)
-    inputStream.read(messageBytes)
+    inputStream.readFully(messageBytes)
     StateRequest.parseFrom(ByteString.copyFrom(messageBytes))
   }
 
```
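For reference, the temporary diagnostic mentioned in the test notes above (included in commit b68cfd7 but reverted for the PR) can be approximated as below; this is a sketch, and the exact reverted code may differ:

```scala
// Sketch of the reverted diagnostic: with the plain read() call, compare the
// number of bytes actually read against the expected message length.
val bytesRead = inputStream.read(messageBytes)
if (bytesRead != messageLen) {
  throw new RuntimeException(
    s"TESTING: Failed to read message bytes: expected $messageLen bytes, " +
      s"but only read $bytesRead bytes")
}
```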