
Releases: pathwaycom/pathway

v0.26.3

03 Oct 09:26

Added

  • New parser pathway.xpacks.llm.parsers.PaddleOCRParser that supports parsing PDF, PPTX, and image files.
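
A minimal usage sketch of the new parser, assuming it follows the same callable-UDF interface as the other parsers in pathway.xpacks.llm.parsers (applied to a column of raw file bytes); constructor options are omitted here:

    import pathway as pw
    from pathway.xpacks.llm.parsers import PaddleOCRParser

    parser = PaddleOCRParser()  # constructor options omitted in this sketch

    # Read PDF/PPTX/image files as raw bytes and apply the OCR-based parser.
    files = pw.io.fs.read("./documents", format="binary", with_metadata=True)
    parsed = files.select(parsed=parser(pw.this.data))

    pw.run()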

v0.26.2

01 Oct 13:48

Added

  • pw.io.gdrive.read now supports the "only_metadata" format. When this format is used, the table will contain only metadata updates for the tracked directory, without reading object contents (see the sketch after this list).
  • Detailed metrics can now be exported to SQLite. Enable this feature using the environment variable PATHWAY_DETAILED_METRICS_DIR or via pw.set_monitoring_config().
  • pw.io.kinesis.read and pw.io.kinesis.write methods for reading from and writing to AWS Kinesis.
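
A minimal sketch of the "only_metadata" mode mentioned above; the folder id and credentials path are placeholders:

    import pathway as pw

    # Receive only metadata updates (additions, deletions, modifications)
    # for the tracked Google Drive folder, without downloading file contents.
    metadata = pw.io.gdrive.read(
        object_id="FOLDER_ID",                             # placeholder
        service_user_credentials_file="credentials.json",  # placeholder
        format="only_metadata",
    )

    pw.io.jsonlines.write(metadata, "drive_metadata.jsonl")
    pw.run()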

Fixed

  • A bug in the Table.forget and Table.sort operators that could lead to unbounded memory consumption during multi-worker runs has been fixed.
  • Improved memory efficiency during cold starts by compacting intermediary structures and reducing retained memory after backfilling.

Changed

  • The frequency of background operator snapshot compression in data persistence is now limited to the greater of the user-defined snapshot_interval and 30 minutes when S3 or Azure is used as the backend, to avoid frequent calls to potentially expensive operations.
  • Performance of the Google Drive input connector has been improved, especially when handling directories with many nested subdirectories.
  • The MCP server tool method now allows passing the optional title, output_schema, annotations, and meta fields to inform the LLM client.
  • Relaxed boto3 dependency to <2.0.0.

v0.26.1

28 Aug 07:58

Added

  • pw.Table.forget to remove old (in terms of event time) entries from the pipeline.
  • pw.Table.buffer, a stateful buffering operator that delays entries until the condition time_column <= max(time_column) - threshold is met.
  • pw.Table.ignore_late to filter out old (in terms of event time) entries.
  • Row batching for async UDFs. It can be enabled with the max_batch_size parameter (see the sketch after this list).
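
A minimal sketch of row batching for an async UDF; it assumes max_batch_size is passed to pw.udf in the same way as for synchronous batched UDFs, with the function then receiving and returning lists:

    import asyncio
    import pathway as pw

    # Assumption: with max_batch_size set, the async UDF is called with a list
    # of values (one per row in the batch) and returns a list of results.
    @pw.udf(max_batch_size=16)
    async def enrich(values: list[str]) -> list[str]:
        await asyncio.sleep(0.1)  # stand-in for an external API call
        return [v.upper() for v in values]

    table = pw.debug.table_from_markdown(
        """
        value
        foo
        bar
        """
    )
    result = table.select(value=enrich(pw.this.value))
    pw.debug.compute_and_print(result)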

Changed

  • pw.io.subscribe and pw.io.python.write now work with async callbacks (see the sketch after this list).
  • The diff column in tables automatically created by pw.io.postgres.write and pw.io.postgres.write_snapshot in replace and create_if_not_exists initialization modes now uses the smallint type.
  • optimize_transaction_log option has been removed from pw.io.deltalake.TableOptimizer.
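
A minimal sketch of an async callback passed to pw.io.subscribe; the callback keeps the usual on_change signature:

    import pathway as pw

    table = pw.debug.table_from_markdown(
        """
        value
        1
        2
        """
    )

    # The callback can now be a coroutine function, e.g. to await an HTTP
    # request or a database write for every change.
    async def on_change(key, row, time, is_addition):
        print(row, time, is_addition)

    pw.io.subscribe(table, on_change)
    pw.run()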

Fixed

  • pw.io.postgres.write and pw.io.postgres.write_snapshot now respect the type optionality defined in the Pathway table schema when creating a new PostgreSQL table. This applies to the replace and create_if_not_exists initialization modes.

v0.26.0

14 Aug 08:20

Added

  • path_filter parameter in pw.io.s3.read and pw.io.minio.read functions. It enables post-filtering of object paths using a wildcard pattern (*, ?), allowing exclusion of paths that pass the main path filter but do not match path_filter.
  • Input connectors now support backpressure control via max_backlog_size, allowing you to limit the number of read events being processed per connector. This is useful when the data source emits a large initial burst followed by smaller, incremental updates.
  • pw.reducers.count_distinct and pw.reducers.count_distinct_approximate to count the number of distinct elements in a table. pw.reducers.count_distinct_approximate saves memory at the cost of accuracy; this tradeoff can be controlled with the precision parameter.
  • pw.Table.join (and its variants) now has two additional parameters: left_exactly_once and right_exactly_once. If the entries from one side of a join should be joined exactly once, the corresponding *_exactly_once parameter can be set to True; once an entry gets a match, it is then removed from the join state, reducing memory consumption (see the sketch after this list).
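
A minimal sketch of the *_exactly_once parameters; assuming each order matches exactly one payment, the matched order can be dropped from the join state:

    import pathway as pw

    orders = pw.debug.table_from_markdown(
        """
        order_id | amount
        1        | 10
        2        | 25
        """
    )
    payments = pw.debug.table_from_markdown(
        """
        order_id | status
        1        | paid
        2        | paid
        """
    )

    # Each order is joined exactly once, so its entry is removed from the
    # join state after the first match, reducing memory consumption.
    matched = orders.join(
        payments,
        orders.order_id == payments.order_id,
        left_exactly_once=True,
    ).select(orders.order_id, orders.amount, payments.status)

    pw.debug.compute_and_print(matched)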

Changed

  • Delta table compression logging has been improved: logs now include table names, and verbose messages have been streamlined while preserving details of important processing steps.
  • Improved initialization speed of pw.io.s3.read and pw.io.minio.read.
  • pw.io.s3.read and pw.io.minio.read now limit the number and the total size of objects to be predownloaded.
  • BREAKING optimized the implementation of pw.reducers.min, pw.reducers.max, pw.reducers.argmin, pw.reducers.argmax, pw.reducers.any reducers for append-only tables. It is a breaking change for programs using operator persistence. The persisted state will have to be recomputed.
  • BREAKING optimized the implementation of pw.reducers.sum reducer on float and np.ndarray columns. It is a breaking change for programs using operator persistence. The persisted state will have to be recomputed.
  • BREAKING the implementation of data persistence has been optimized for the case of many small objects in filesystem and S3 connectors. It is a breaking change for programs using data persistence. The persisted state will have to be recomputed.
  • BREAKING the data snapshot logic in persistence has been optimized for the case of big input snapshots. It is a breaking change for programs using data persistence. The persisted state will have to be recomputed.
  • Improved precision of pw.reducers.sum on float columns by introducing Neumaier summation.

v0.25.1

24 Jul 12:09

Added

  • pw.xpacks.llm.mcp_server.PathwayMcp that allows serving pw.xpacks.llm.document_store.DocumentStore and pw.xpacks.llm.question_answering endpoints as MCP (Model Context Protocol) tools.
  • pw.io.dynamodb.write method for writing to DynamoDB.
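
A minimal sketch of the new DynamoDB output connector; the parameter names used below (table_name, partition_key, sort_key) are assumptions made for illustration, not confirmed API:

    import pathway as pw

    class InputSchema(pw.Schema):
        user_id: str
        event_time: int
        payload: str

    events = pw.io.jsonlines.read("./events", schema=InputSchema)

    # Hypothetical parameter names; check the connector documentation
    # for the actual signature.
    pw.io.dynamodb.write(
        events,
        table_name="events",
        partition_key="user_id",
        sort_key="event_time",
    )
    pw.run()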

v0.25.0

17 Jul 17:44

Added

  • pw.io.questdb.write method for writing to QuestDB.
  • pw.io.fs.read now supports the "only_metadata" format. When this format is used, the table will contain only metadata updates for the tracked directory, without reading file contents.
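
A minimal sketch of the "only_metadata" mode for the filesystem connector; the resulting table carries only the metadata updates emitted by the connector:

    import pathway as pw

    # Watch a directory and receive only metadata updates (no file contents).
    metadata = pw.io.fs.read("./tracked_dir", format="only_metadata")

    pw.io.jsonlines.write(metadata, "metadata_updates.jsonl")
    pw.run()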

Changed

  • BREAKING The Elasticsearch and BigQuery connectors have been moved to the Scale license tier. You can obtain the Scale tier license for free at https://pathway.com/get-license.
  • BREAKING pw.io.fs.read no longer accepts format="raw". Use format="binary" to read binary objects, format="plaintext_by_file" to read plaintext objects per file, or format="plaintext" to read plaintext objects split into lines (see the sketch after this list).
  • BREAKING The pw.io.s3_csv.read connector has been removed. Please use pw.io.s3.read with format="csv" instead.
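
A minimal migration sketch for the removed format="raw" option:

    import pathway as pw

    # Previously: pw.io.fs.read("./data", format="raw")
    as_bytes = pw.io.fs.read("./data", format="binary")             # one row per file, raw bytes
    as_texts = pw.io.fs.read("./data", format="plaintext_by_file")  # one row per file, decoded text
    as_lines = pw.io.fs.read("./data", format="plaintext")          # one row per line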

Fixed

  • pw.io.s3.read and pw.io.s3.write now also check the AWS_PROFILE environment variable for AWS credentials if none are explicitly provided.

v0.24.1

17 Jul 17:44

Added

  • Confluent Schema Registry support in Kafka and Redpanda input and output connectors.

Changed

  • pw.io.airbyte.read will now retry the pip install command if it fails during the installation of a connector. It only applies when using the PyPI version of the connector, not the Docker one.

v0.24.0

17 Jul 17:44

Added

  • pw.io.mqtt.read and pw.io.mqtt.write methods for reading from and writing to MQTT.

Changed

  • pw.xpacks.llm.embedders.SentenceTransformerEmbedder and pw.xpacks.llm.llms.HFPipelineChat are now computed in batches. The maximum size of a single batch can be set in the constructor with the argument max_batch_size.
  • BREAKING Arguments api_key and base_url for pw.xpacks.llm.llms.OpenAIChat can no longer be set in the __call__ method, and instead, if needed, should be set in the constructor.
  • BREAKING Argument api_key for pw.xpacks.llm.llms.OpenAIEmbedder can no longer be set in the __call__ method, and instead, if needed, should be set in the constructor.
  • pw.io.postgres.write now accepts arbitrary types for the values of the postgres_settings dict. If a value is not a string, it is converted using Python's str().
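
A minimal sketch of the relaxed postgres_settings values; non-string entries such as the port number are converted with str():

    import pathway as pw

    class InputSchema(pw.Schema):
        name: str
        count: int

    stats = pw.io.jsonlines.read("./stats", schema=InputSchema)

    pw.io.postgres.write(
        stats,
        postgres_settings={
            "host": "localhost",
            "port": 5432,  # an int is accepted now and converted with str()
            "dbname": "analytics",
            "user": "pathway",
            "password": "secret",
        },
        table_name="stats",
        init_mode="create_if_not_exists",
    )
    pw.run()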

Removed

  • pw.io.kafka.read_from_upstash has been removed, as the managed Kafka service in Upstash has been deprecated.

v0.23.0

12 Jun 08:22

Changed

  • BREAKING: To use pw.sql you now have to install pathway[sql].
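
After installing the extra (pip install "pathway[sql]"), pw.sql is used as before; a minimal sketch:

    import pathway as pw

    t = pw.debug.table_from_markdown(
        """
        name  | age
        Alice | 30
        Bob   | 25
        """
    )

    # Tables are passed as keyword arguments and referenced by name in the query.
    result = pw.sql("SELECT name FROM tab WHERE age > 26", tab=t)
    pw.debug.compute_and_print(result)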

Fixed

  • pw.io.deltalake.read now correctly reads data from partitioned tables in all cases.
  • Added retries for all cloud-based persistence backend operations to improve reliability.

v0.22.0

05 Jun 10:48

Added

  • Data persistence can now be configured to use Azure Blob Storage as a backend. An Azure backend instance can be created using pw.persistence.Backend.azure and included in the persistence config.
  • Added batching to UDFs. It is now possible to make UDFs operate on batches of data instead of single rows. To do so, the max_batch_size argument has to be set (see the sketch below).
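
A minimal sketch of a batched UDF; with max_batch_size set, the function receives a list of values (one per row in the batch) and returns a list of results of the same length:

    import pathway as pw

    @pw.udf(max_batch_size=64)
    def square(values: list[int]) -> list[int]:
        # Called once per batch of up to 64 rows; one result per input row.
        return [v * v for v in values]

    table = pw.debug.table_from_markdown(
        """
        value
        1
        3
        """
    )
    result = table.select(squared=square(pw.this.value))
    pw.debug.compute_and_print(result)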

Changed

  • BREAKING: when creating pw.DateTimeUtc it is now obligatory to pass the time zone information.
  • BREAKING: when creating pw.DateTimeNaive passing time zone information is not allowed.
  • BREAKING: expressions are now evaluated in batches. Generally, it speeds up the computations but might increase the memory usage if the intermediate state in the expressions is large.

Fixed

  • Synchronization groups now correctly handle cases where the source file-like object is updated during the reading process.