Skip to content

add default max batch size and batchSize parameter for SF updates#55

Open
unintellisense wants to merge 45 commits into
egen:masterfrom
loanpal-engineering:master
Open

add default max batch size and batchSize parameter for SF updates#55
unintellisense wants to merge 45 commits into
egen:masterfrom
loanpal-engineering:master

Conversation

@unintellisense

Copy link
Copy Markdown

I recently started using the project to perform bulk updates to Salesforce, and found it was not creating multiple batches when the row count exceeded the maximum 10,000 rows allowed.

This PR sets a default batch size of 5000 (since there is also a max size per batch I preferred to not start at 10k) and allows a batchSize parameter to be set if you prefer to change the batch size (often, the SF API will process multiple batches faster than one large batch, though your API limit is measured by number of batches).

@mosche

mosche commented Sep 10, 2020

Copy link
Copy Markdown

Running into the same issue, this would be very helpful @springml :)

Comment thread build.sbt Outdated
Comment on lines +14 to +15
"com.springml" % "salesforce-wave-api" % "1.0.10",
"com.github.loanpal-engineering" % "salesforce-wave-api" % "eb71436",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@unintellisense Wondering, is this related to this change? Why do you need to change to a fork of the api client?

@mosche

mosche commented Sep 10, 2020

Copy link
Copy Markdown

Looking into it a bit further, it might be better to improve Utils.repartition to also consider the number of rows when calculating the target number of partitions rather than just basing the number of partitions on the estimated size.

Comment on lines +35 to +36
val partitionCnt = (1 + csvRDD.count() / batchSize).toInt
val partitionedRDD = csvRDD.repartition(partitionCnt)

@mosche mosche Sep 15, 2020

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be the right place to repartition as it's just leading to a 2nd round of shuffling the data around :/ Partitioning to control the size of ingest batches is already done in Utils.repartition, so the limit of records per batch should be considered there:
https://github.com/springml/spark-salesforce/pull/59/files#diff-b359f3e710dff2341dbedadb012b9ff4R62-R73

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A PR for the alternative approach is here #59

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants