
First draft of Reporting Source of Truth™ #496

Draft

wrridgeway wants to merge 103 commits into master from 387-reporting-sot

Conversation

@wrridgeway (Member) commented Jun 6, 2024

Notes

  • Ignoring av_quintile as a grouping for now since it doesn't apply to some tables and interacts oddly with class groupings.
  • PySpark is brutal about column types. Some of these columns are probably not the best types (primarily doubles that should be ints), but rebuilding everything got painful because PySpark refused to cast a np.int64 to a bigint (at least partly due to NA/nan/None handling). Nullable booleans are also currently an issue for reassessment_year. A sketch of one possible workaround follows these notes.
  • Not sure what the best way is to do delta columns for tables with stages; right now we compare BOR 2020 to BOR 2019, etc. (a lag-based sketch is also included below).
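For the np.int64 / nullable boolean problem, here's a minimal sketch of one possible workaround (not what the models currently do): coerce the pandas extension values down to plain Python objects and let an explicit Spark schema decide the final types. It assumes an existing SparkSession named `spark`, and the column values are illustrative.

```python
import pandas as pd
from pyspark.sql import types as T

pdf = pd.DataFrame({
    "pin": ["01234567890123", "98765432109876"],
    "av_tot": pd.array([150000, None], dtype="Int64"),              # nullable int
    "reassessment_year": pd.array([True, pd.NA], dtype="boolean"),  # nullable bool
})

# Downcast numpy/pandas scalar types to plain Python objects so Spark's type
# verification doesn't choke on np.int64 or pd.NA
pdf["av_tot"] = [int(x) if pd.notna(x) else None for x in pdf["av_tot"]]
pdf["reassessment_year"] = [
    bool(x) if pd.notna(x) else None for x in pdf["reassessment_year"]
]

# An explicit schema pins the Spark-side types (bigint/boolean in Athena)
schema = T.StructType([
    T.StructField("pin", T.StringType(), True),
    T.StructField("av_tot", T.LongType(), True),
    T.StructField("reassessment_year", T.BooleanType(), True),
])

sdf = spark.createDataFrame(pdf, schema=schema)  # `spark` assumed to exist
```

For the delta columns, one hedged option for staged tables is a lag over year within each stage, so BOR 2020 compares against BOR 2019, and so on. Column names here are illustrative as well.

```python
from pyspark.sql import DataFrame, Window, functions as F


def add_stage_deltas(df: DataFrame) -> DataFrame:
    # Compare each stage's value to the same stage in the prior year,
    # e.g. BOR 2020 vs. BOR 2019
    w = Window.partitionBy("township_code", "class", "stage").orderBy("year")
    return df.withColumn("av_tot_delta", F.col("av_tot") - F.lag("av_tot").over(w))
```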

Sales

  • Prior AVs seem a little weird here since this table isn't sale-level?
  • I'm not sure how to classify sales as "Valid" or "Invalid" based on our current "Outlier" schema in vw_pin_sale.

Ratios

  • Need to hammer out exactly what our SOPs should be, and what minimum sample sizes should be in their absence.

Priorities moving forward

  1. Sort out runner memory issues (or figure out a way around them, possibly by looping through data by year, though this will take a long time; a rough sketch of the year-looping idea follows this list)
  2. Improve column types
  3. Code cleanup. The code isn't awful, but some specific portions could absolutely be consolidated into loops or other more efficient constructs. I set those improvements aside for the sake of delivering an MVP, but they're low-hanging fruit.
  4. Performance improvements
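For reference, a rough sketch of the year-looping idea from priority 1 (not the table's actual code): run one small Spark job per year, pull only the already-aggregated result back across the driver, then stitch the pieces together in pandas. The table and column names are illustrative.

```python
import pandas as pd
from pyspark.sql import functions as F


def aggregate_by_year(spark, source, years):
    pieces = []
    for year in years:
        agg = (
            spark.table(source)
            .filter(F.col("year") == year)   # each Spark job scans a single year
            .groupBy("year", "township_code", "class")
            .agg(
                F.count("*").alias("n"),
                F.percentile_approx("av_tot", 0.5).alias("av_tot_median"),
            )
            .toPandas()                      # only small, pre-aggregated output
        )
        pieces.append(agg)
    return pd.concat(pieces, ignore_index=True)
```

A dbt Python model could return the concatenated pandas frame directly, since dbt accepts either Spark or pandas dataframes.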

@wrridgeway wrridgeway linked an issue Jun 6, 2024 that may be closed by this pull request
@wrridgeway wrridgeway marked this pull request as draft June 28, 2025 18:45
@wrridgeway (Member, Author) commented Jul 1, 2025

I tried everything I could in the Athena notebook editor, but at least one of these tables is pretty difficult to build given how it's scoped and the memory constraints we have to work with. I haven't pushed all of that work since it never reached the finish line, but here are some of the things I learned while trying:

  • There is a constant battle for memory between the size of the input data (~170 million rows for the table above), the operations Spark needs to perform, what can actually be passed through the Spark driver back into Python, what is held in memory for output, and the amount of time everything takes
  • You can return either Spark or pandas dataframes to dbt (a Spark dataframe still has to be pulled back into Python, but dbt handles that automatically)
  • Iterating over several different column combinations for aggregation seems to erase much of the performance gain from using Spark in the first place
  • You can feed a ton of data into Spark, but the more you do, the more likely the driver is to error out while returning it to Python. Sending dbt a Spark dataframe does not avoid this
  • The limits on what you can feed to Spark are heavily influenced by two things, at least when using applyInPandas (see the sketch after this list):
    • Only pass in the columns that are absolutely necessary for calculations and grouping; extraneous columns drain memory.
    • Very uneven grouping will absolutely lead to memory failures. Spark can't handle 2 million rows of even a few columns in one group if there are dozens of other groups that also need to be kept in memory (e.g., grouping by class doesn't cause errors, grouping by res vs. other does).
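To make the applyInPandas points concrete, here's a minimal sketch (not the actual model code) that applies both lessons: select only the columns the UDF needs before grouping, group on a key with many reasonably even partitions, and hand the Spark dataframe straight back to dbt. The ref, column names, and statistics are illustrative, not the table's real schema.

```python
import pandas as pd


def ratio_stats(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs once per group as a plain pandas frame; keep it small and cheap
    ratio = pdf["av_tot"] / pdf["sale_price"]
    return pd.DataFrame({
        "year": [pdf["year"].iloc[0]],
        "class": [pdf["class"].iloc[0]],
        "n": [len(pdf)],
        "median_ratio": [ratio.median()],
    })


def model(dbt, session):
    df = dbt.ref("vw_pin_sale")  # hypothetical upstream ref

    out = (
        # Drop everything the UDF doesn't need before grouping
        df.select("year", "class", "av_tot", "sale_price")
        # year + class yields many reasonably even groups, unlike res vs. other
        .groupBy("year", "class")
        .applyInPandas(
            ratio_stats,
            schema="year string, class string, n long, median_ratio double",
        )
    )
    # dbt accepts the Spark dataframe directly; no manual collect needed
    return out
```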
