Pandas on AWS
Please pin the version you are using in your environment.
AWS Data Wrangler is completing 1 year, and the team is working to collect feedback and feature requests to shape our 1.0.0 version. So far we have 3 major changes listed:
- API redesign
- Nested data types support
- Deprecation of PySpark support
  - PySpark support takes up a considerable part of the development time, and this has not been reflected in user adoption. Only 2 of our 66 issues on GitHub are related to Spark.
  - In addition, the integration between PySpark and PyArrow/Pandas remains in an experimental stage, and we have had a hard time keeping it stable.
| FROM | TO | Features |
|---|---|---|
| PySpark DataFrame | Amazon Redshift | Blazing fast, using parallel Parquet on S3 behind the scenes. Append/Overwrite/Upsert modes |
| PySpark DataFrame | Glue Catalog | Register Parquet or CSV DataFrame on Glue Catalog |
| Nested PySpark DataFrame | Flat PySpark DataFrames | Flatten structs and break up arrays in child tables |
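
As a minimal sketch of the flatten feature above, assuming it is exposed as `wr.spark.flatten` in the 0.x Spark module; exact function names and parameters may differ between versions, and the sample data is a placeholder.

```python
import awswrangler as wr
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("flatten-example").getOrCreate()

# A small nested DataFrame: one struct column plus one array column
df_nested = spark.createDataFrame([
    Row(id=1, address=Row(city="Seattle", zip="98101"), tags=["a", "b"]),
    Row(id=2, address=Row(city="Boston", zip="02101"), tags=["c"]),
])

# Flatten the struct columns and break up the array into child tables
flattened = wr.spark.flatten(dataframe=df_nested)
```
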
| FROM | TO | Features |
|---|---|---|
| Pandas DataFrame | Amazon S3 | Parquet, CSV, Partitions, Parallelism, Overwrite/Append/Partitions-Upsert modes, KMS Encryption, Glue Metadata (Athena, Spectrum, Spark, Hive, Presto) |
| Amazon S3 | Pandas DataFrame | Parquet (Pushdown filters), CSV, Fixed-width formatted, Partitions, Parallelism, KMS Encryption, Multiple files |
| Amazon Athena | Pandas DataFrame | Workgroups, S3 output path, Encryption, and two different engines: ctas_approach=False -> batching for restricted memory environments; ctas_approach=True -> blazing fast, with parallelism and enhanced data types |
| Pandas DataFrame | Amazon Redshift | Blazing fast, using parallel Parquet on S3 behind the scenes. Append/Overwrite/Upsert modes |
| Amazon Redshift | Pandas DataFrame | Blazing fast, using parallel Parquet on S3 behind the scenes |
| Pandas DataFrame | Amazon Aurora | Supported engines: MySQL, PostgreSQL. Blazing fast, using parallel CSV on S3 behind the scenes. Append/Overwrite modes |
| Amazon Aurora | Pandas DataFrame | Supported engines: MySQL. Blazing fast, using parallel CSV on S3 behind the scenes |
| CloudWatch Logs Insights | Pandas DataFrame | Query results |
| Glue Catalog | Pandas DataFrame | List tables and get table details. A good fit for Jupyter Notebooks. |
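
Below is a minimal sketch of the Pandas round trip described above, assuming the 0.x Pandas module (`wr.pandas.to_parquet` and `wr.pandas.read_sql_athena`); parameter names may differ between versions, and the bucket, database and table names are placeholders.

```python
import pandas as pd
import awswrangler as wr

df = pd.DataFrame({"value": [1, 2, 3], "year": [2019, 2019, 2020]})

# Write as partitioned Parquet on S3 and register the metadata on the Glue Catalog
wr.pandas.to_parquet(
    dataframe=df,
    database="my_database",           # hypothetical Glue database
    path="s3://my-bucket/my_table/",  # hypothetical S3 prefix
    partition_cols=["year"],
)

# Read it back through Amazon Athena
df2 = wr.pandas.read_sql_athena(
    sql="SELECT * FROM my_table",
    database="my_database",
)
```
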
| Feature | Details |
|---|---|
| List S3 objects | e.g. `wr.s3.list_objects("s3://...")` |
| Delete S3 objects | Parallel |
| Delete listed S3 objects | Parallel |
| Delete S3 objects not in a given list | Parallel |
| Copy listed S3 objects | Parallel |
| Get the size of S3 objects | Parallel |
| Get CloudWatch Logs Insights query results | |
| Load partitions on Athena/Glue table | Through "MSCK REPAIR TABLE" |
| Create EMR cluster | "For humans" |
| Terminate EMR cluster | "For humans" |
| Get EMR cluster state | "For humans" |
| Submit EMR step(s) | "For humans" |
| Get EMR step state | "For humans" |
| Query Athena to receive Python primitives | Returns `Iterable[Dict[str, Any]]` |
| Load and unzip SageMaker job outputs | |
| Dump Amazon Redshift as Parquet files on S3 | |
| Dump Amazon Aurora as CSV files on S3 | Only for MySQL engine |
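
As a minimal sketch of the S3 helpers above: `wr.s3.list_objects` is the call shown in the table, and `wr.s3.delete_objects` is assumed to follow the same pattern; the paths are placeholders.

```python
import awswrangler as wr

# List every object under a prefix
keys = wr.s3.list_objects("s3://my-bucket/my-prefix/")

# Delete everything under the prefix (runs in parallel)
wr.s3.delete_objects(path="s3://my-bucket/my-prefix/")
```
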
