add default max batch size and batchSize parameter for SF updates #55
unintellisense wants to merge 45 commits into
Add batch size update
Nulls not empty
Running into the same issue, this would be very helpful @springml :)
"com.springml" % "salesforce-wave-api" % "1.0.10",
"com.github.loanpal-engineering" % "salesforce-wave-api" % "eb71436",
@unintellisense Wondering, is this related to this change? Why do you need to change to a fork of the api client?
Looking into it a bit further, it might be better to improve
val partitionCnt = (1 + csvRDD.count() / batchSize).toInt
val partitionedRDD = csvRDD.repartition(partitionCnt)
This doesn't seem to be the right place to repartition, as it just leads to a second round of shuffling the data :/ Partitioning to control the size of ingest batches is already done in Utils.repartition, so the per-batch record limit should be handled there:
https://github.com/springml/spark-salesforce/pull/59/files#diff-b359f3e710dff2341dbedadb012b9ff4R62-R73
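For illustration, here is a minimal sketch of the batch-size arithmetic being discussed, written against plain Scala collections rather than Spark so it runs on its own. The object and method names (`BatchSizing`, `batchCount`, `toBatches`) are hypothetical and are not taken from this PR or from #59. It shows two things: ceiling division avoids the extra partition that `1 + count / batchSize` produces when the row count is an exact multiple of the batch size, and chunking an iterator with `grouped` gives a hard upper bound on rows per batch, which a count-based repartition alone may not guarantee if rows are only spread approximately evenly.

```scala
object BatchSizing {
  // Ceiling division: the smallest number of batches such that
  // no batch needs to hold more than batchSize rows.
  // Returns at least 1 so an empty input still maps to one (empty) partition.
  def batchCount(rowCount: Long, batchSize: Int): Int = {
    require(batchSize > 0, "batchSize must be positive")
    math.max(1L, (rowCount + batchSize - 1) / batchSize).toInt
  }

  // Splits rows into consecutive chunks of at most batchSize rows each.
  // Only the last chunk can be smaller than batchSize.
  def toBatches[A](rows: Iterator[A], batchSize: Int): Iterator[Seq[A]] = {
    require(batchSize > 0, "batchSize must be positive")
    rows.grouped(batchSize).map(_.toSeq)
  }
}
```

In a Spark job the same chunking could presumably be applied per partition (for example inside `mapPartitions`), which would cap batch size without an additional shuffle; that is an assumption about how it would fit here, not something taken from either PR.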
A PR for the alternative approach is here: #59
spark 3.0/scala 2.12 compatibility
swap out java-sizeof version and remove scala version from the artifact-id
update gitignore a
use salesforce magic null string value
* add support for max column width to csv parser
* pom changes
shade dependency
support pkChunks with filtering (handling empty batches)
bulk api 2.0
change force api to 53
fix max columns
Update to use pooled connection for polling events
I recently started using the project to perform bulk updates to Salesforce and found that it was not creating multiple batches when the row count exceeded the maximum of 10,000 rows allowed per batch.
This PR sets a default batch size of 5000 (there is also a maximum data size per batch, so I preferred not to start at 10k) and adds a batchSize parameter for overriding that default. The SF API will often process multiple smaller batches faster than one large batch, though your API limit is measured in number of batches.
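As a rough sketch of how such a parameter could be resolved, the snippet below reads an optional `batchSize` entry from an options map, falls back to 5000 when it is absent, and caps the result at 10,000. The object name `BatchSizeOption`, the constants, and the choice to clamp rather than reject oversized values are all assumptions made for illustration; the actual handling in this PR may differ.

```scala
import scala.util.Try

object BatchSizeOption {
  val DefaultBatchSize = 5000  // default described in this PR
  val MaxBatchSize = 10000     // maximum rows per batch mentioned above

  // Resolves the effective batch size from user-supplied options.
  // Hypothetical helper: clamps oversized values to MaxBatchSize.
  def resolve(parameters: Map[String, String]): Int =
    parameters.get("batchSize") match {
      case None => DefaultBatchSize
      case Some(raw) =>
        val requested = Try(raw.trim.toInt).getOrElse(
          throw new IllegalArgumentException(
            s"batchSize must be an integer, got '$raw'"))
        require(requested > 0, "batchSize must be positive")
        math.min(requested, MaxBatchSize)
    }
}
```

With this sketch, `BatchSizeOption.resolve(Map("batchSize" -> "2000"))` yields 2000, and an empty map yields the default. Failing fast on a value above the limit would be an equally reasonable design; clamping is used here only to keep the example short.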