
First draft of Reporting Source of Truth™ #496

Draft

wrridgeway wants to merge 103 commits into master from 387-reporting-sot

Conversation

@wrridgeway (Member) commented Jun 6, 2024

Notes

  • Ignoring av_quintile as a grouping for now since it doesn't apply to some tables and interacts oddly with class groupings.
  • PySpark is brutal about column types. Some of these columns are probably not the best types (primarily doubles that should be ints), but rebuilding everything got painful because PySpark refused to cast a np.int64 to a bigint (at least partly due to NA/nan/None handling). Nullable booleans are also currently an issue for reassessment_year. A sketch of one possible workaround follows these notes.
  • Not sure what the best way is to do delta columns for tables with stages; right now we compare BOR 2020 to BOR 2019, etc. (a lag-based sketch is also included below).
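For the np.int64 / nullable boolean problem, here's a minimal sketch of one possible workaround (not what the models currently do): coerce the pandas extension values down to plain Python objects and let an explicit Spark schema decide the final types. It assumes an existing SparkSession named `spark`, and the column values are illustrative.

```python
import pandas as pd
from pyspark.sql import types as T

pdf = pd.DataFrame({
    "pin": ["01234567890123", "98765432109876"],
    "av_tot": pd.array([150000, None], dtype="Int64"),              # nullable int
    "reassessment_year": pd.array([True, pd.NA], dtype="boolean"),  # nullable bool
})

# Downcast numpy/pandas scalar types to plain Python objects so Spark's type
# verification doesn't choke on np.int64 or pd.NA
pdf["av_tot"] = [int(x) if pd.notna(x) else None for x in pdf["av_tot"]]
pdf["reassessment_year"] = [
    bool(x) if pd.notna(x) else None for x in pdf["reassessment_year"]
]

# An explicit schema pins the Spark-side types (bigint/boolean in Athena)
schema = T.StructType([
    T.StructField("pin", T.StringType(), True),
    T.StructField("av_tot", T.LongType(), True),
    T.StructField("reassessment_year", T.BooleanType(), True),
])

sdf = spark.createDataFrame(pdf, schema=schema)  # `spark` assumed to exist
```

For the delta columns, one hedged option for staged tables is a lag over year within each stage, so BOR 2020 compares against BOR 2019, and so on. Column names here are illustrative as well.

```python
from pyspark.sql import DataFrame, Window, functions as F


def add_stage_deltas(df: DataFrame) -> DataFrame:
    # Compare each stage's value to the same stage in the prior year,
    # e.g. BOR 2020 vs. BOR 2019
    w = Window.partitionBy("township_code", "class", "stage").orderBy("year")
    return df.withColumn("av_tot_delta", F.col("av_tot") - F.lag("av_tot").over(w))
```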

Sales

  • Prior AVs seem a little weird here since this table isn't sale-level?
  • I'm not sure how to classify sales as "Valid" or "Invalid" based on our current "Outlier" schema in vw_pin_sale.

Ratios

  • Need to hammer out exactly what our SOPs should be, and what minimum sample sizes should be in their absence.

Priorities moving forward

  1. Sort out runner memory issues (or figure out a way around them, possibly by looping through data by year, though this will take a long time; a rough sketch of the year-looping idea follows this list)
  2. Improve column types
  3. Code cleanup. The code isn't awful, but some specific portions could absolutely be consolidated into loops or other more efficient constructs. I set those improvements aside for the sake of delivering an MVP, but they're low-hanging fruit.
  4. Performance improvements
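For reference, a rough sketch of the year-looping idea from priority 1 (not the table's actual code): run one small Spark job per year, pull only the already-aggregated result back across the driver, then stitch the pieces together in pandas. The table and column names are illustrative.

```python
import pandas as pd
from pyspark.sql import functions as F


def aggregate_by_year(spark, source, years):
    pieces = []
    for year in years:
        agg = (
            spark.table(source)
            .filter(F.col("year") == year)   # each Spark job scans a single year
            .groupBy("year", "township_code", "class")
            .agg(
                F.count("*").alias("n"),
                F.percentile_approx("av_tot", 0.5).alias("av_tot_median"),
            )
            .toPandas()                      # only small, pre-aggregated output
        )
        pieces.append(agg)
    return pd.concat(pieces, ignore_index=True)
```

A dbt Python model could return the concatenated pandas frame directly, since dbt accepts either Spark or pandas dataframes.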

@wrridgeway wrridgeway linked an issue Jun 6, 2024 that may be closed by this pull request
@wrridgeway wrridgeway marked this pull request as draft June 28, 2025 18:45
@wrridgeway (Member, Author) commented Jul 1, 2025

I tried everything I could in the Athena notebook editor, but at least one of these tables is pretty difficult to build given how it's scoped and the memory constraints we have to work with. I haven't pushed all of that work since it never reached the finish line, but here are some of the things I learned while trying:

  • There is a constant battle for memory between the size of the input data (~170 million rows for the table above), the operations Spark needs to perform, what can actually be passed through the Spark driver back into Python, what is held in memory for output, and the amount of time everything takes
  • You can return either Spark or pandas dataframes to dbt (a Spark dataframe still has to be pulled back into Python, but dbt handles that automatically)
  • Iterating over several different column combinations for aggregation seems to erase much of the performance gain from using Spark in the first place
  • You can feed a ton of data into Spark, but the more you do, the more likely the driver is to error out while returning it to Python. Sending dbt a Spark dataframe does not avoid this
  • The limits on what you can feed to Spark are heavily influenced by two things, at least when using applyInPandas (see the sketch after this list):
    • Only pass in the columns that are absolutely necessary for calculations and grouping; extraneous columns drain memory.
    • Very uneven grouping will absolutely lead to memory failures. Spark can't handle 2 million rows of even a few columns in one group if there are dozens of other groups that also need to be kept in memory (e.g., grouping by class doesn't cause errors, grouping by res vs. other does).
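To make the applyInPandas points concrete, here's a minimal sketch (not the actual model code) that applies both lessons: select only the columns the UDF needs before grouping, group on a key with many reasonably even partitions, and hand the Spark dataframe straight back to dbt. The ref, column names, and statistics are illustrative, not the table's real schema.

```python
import pandas as pd


def ratio_stats(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs once per group as a plain pandas frame; keep it small and cheap
    ratio = pdf["av_tot"] / pdf["sale_price"]
    return pd.DataFrame({
        "year": [pdf["year"].iloc[0]],
        "class": [pdf["class"].iloc[0]],
        "n": [len(pdf)],
        "median_ratio": [ratio.median()],
    })


def model(dbt, session):
    df = dbt.ref("vw_pin_sale")  # hypothetical upstream ref

    out = (
        # Drop everything the UDF doesn't need before grouping
        df.select("year", "class", "av_tot", "sale_price")
        # year + class yields many reasonably even groups, unlike res vs. other
        .groupBy("year", "class")
        .applyInPandas(
            ratio_stats,
            schema="year string, class string, n long, median_ratio double",
        )
    )
    # dbt accepts the Spark dataframe directly; no manual collect needed
    return out
```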
