@john-p-ryan
Contributor

@john-p-ryan john-p-ryan commented Apr 7, 2025

We want to estimate the distribution of income $z$ by combining a kernel density estimate (KDE) with a Pareto distribution fitted to the upper tail. The difficulty is that the density estimate must be twice continuously differentiable. The final density estimate is as follows:

$$ f(z) = \begin{cases} A f_{KDE}(z), & z < t_1 \\ (1-s(z)) A f_{KDE}(z) + s(z) B f_{Pareto}(z), & t_1 \leq z \leq t_2 \\ B f_{Pareto}(z), & z > t_2 \end{cases} $$

where $A$ and $B$ are scaling factors and $s$ is a smoothing function that satisfies $s(t_1) = 0$, $s(t_2) = 1$, $s'(t_1) = s'(t_2) = s''(t_1) = s''(t_2) = 0$, and $s'(z) \geq 0$ for $z \in [t_1, t_2]$. For example, $s(z) = 6y^5 - 15y^4 + 10y^3$ with $y = \frac{z - t_1}{t_2-t_1}$. The estimation algorithm is as follows:

  1. Estimate the KDE on all of the data, $f_{KDE}$.
  2. Choose $t_1$ and $t_2$ to match certain percentiles of the distribution. $t_1$ is where the distribution begins to resemble a Pareto distribution, and $t_2$ is chosen so that the transition is wide enough to be smooth but does not interfere with the Pareto tail. For example, we set $t_1$ to the 90th percentile and $t_2$ to the 95th percentile.
  3. Estimate the Pareto distribution on all data with $z \geq t_1$.
  4. Choose $A$ and $B$ to ensure continuity and that $f$ integrates to 1. That is,

$$ \frac{A}{B} = \frac{f_{Pareto}(t_1)}{f_{KDE}(t_1)}$$

$$ \int f(z) dz = 1 $$

One way to do step 4 computationally is to first set $A=1$ and compute $B = \frac{f_{KDE}(t_1)}{f_{Pareto}(t_1)}$, then divide the resulting $f$ by its integral over the entire support.
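The four steps above can be sketched as follows. This is a minimal illustration, not the code in this PR: the names `smooth` and `fit_blended_density` are hypothetical, the Pareto shape is assumed to be fitted by maximum likelihood with the scale fixed at $t_1$, and the normalising integral is assumed to be computed with a trapezoid rule up to $t_2$ plus the closed-form Pareto mass beyond it.

```python
import numpy as np
from scipy.stats import gaussian_kde


def smooth(y):
    """Quintic smoothstep: 0 at y=0, 1 at y=1, zero 1st/2nd derivatives at both ends."""
    y = np.clip(y, 0.0, 1.0)
    return 6 * y**5 - 15 * y**4 + 10 * y**3


def fit_blended_density(z, p1=90, p2=95, kde_bw=None):
    """Return (f, params) for the KDE/Pareto blend (hypothetical helper)."""
    z = np.asarray(z, dtype=float)

    # Step 1: KDE on all of the data.
    kde = gaussian_kde(z, bw_method=kde_bw)

    # Step 2: thresholds at the chosen percentiles.
    t1, t2 = np.percentile(z, [p1, p2])

    # Step 3 (assumption): Pareto MLE for the shape, with the scale fixed at t1.
    tail = z[z >= t1]
    alpha = tail.size / np.sum(np.log(tail / t1))

    def f_pareto(x):
        # Clamped below t1, where it is multiplied by s = 0 anyway.
        xs = np.maximum(x, t1)
        return alpha / xs * (t1 / xs) ** alpha

    # Step 4a: with A = 1, match levels at t1: B = f_KDE(t1) / f_Pareto(t1).
    B = kde(np.array([t1]))[0] / f_pareto(t1)

    def f_unnorm(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        s = smooth((x - t1) / (t2 - t1))
        return (1.0 - s) * kde(x) + s * B * f_pareto(x)

    # Step 4b: normalise. Trapezoid rule up to t2, closed-form Pareto mass beyond it.
    h = np.sqrt(kde.covariance[0, 0])  # kernel standard deviation
    grid = np.linspace(z.min() - 5.0 * h, t2, 4001)
    vals = f_unnorm(grid)
    body = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid))
    total = body + B * (t1 / t2) ** alpha

    params = {"t1": t1, "t2": t2, "alpha": alpha, "A": 1.0 / total, "B": B / total}
    return (lambda x: f_unnorm(x) / total), params


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    income = rng.lognormal(mean=10.5, sigma=0.7, size=2000)  # synthetic data
    f, params = fit_blended_density(income)
    print(params)
```

Integrating the tail analytically avoids truncating the grid at an arbitrary upper income, which matters because the Pareto tail decays slowly.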

This is what we get for the distribution:

image

Here is $f'$:

image

The transition is faintly visible in $f'$, but the curve looks fairly smooth. However, the small dip in $f'$ causes a large dip in the resulting weights:

image

I tried varying the cutoffs as well as the KDE bandwidth, but the issue persists. I am trying to figure out what causes it and whether there is another way to smooth the transition, perhaps more forcefully.
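As a possible starting point for the diagnosis (a sketch that follows only from the definition of $f$ above, not a confirmed explanation), differentiating the middle branch on $t_1 \leq z \leq t_2$ gives:

$$ f'(z) = (1-s(z)) A f'_{KDE}(z) + s(z) B f'_{Pareto}(z) + s'(z) \left[ B f_{Pareto}(z) - A f_{KDE}(z) \right] $$

$$ f''(z) = (1-s(z)) A f''_{KDE}(z) + s(z) B f''_{Pareto}(z) + 2 s'(z) \left[ B f'_{Pareto}(z) - A f'_{KDE}(z) \right] + s''(z) \left[ B f_{Pareto}(z) - A f_{KDE}(z) \right] $$

The bracketed terms vanish at $t_1$ and $t_2$ because $s' = s'' = 0$ there, so $f'$ and $f''$ match the outer branches at both endpoints. Inside the interval, however, $s'(z) > 0$, and the level gap $B f_{Pareto}(z) - A f_{KDE}(z)$ is only pinned to zero at $t_1$. Any gap in levels or slopes between the two pieces therefore enters $f'$ and $f''$ directly, and its size scales with $1/(t_2 - t_1)$ and $1/(t_2 - t_1)^2$ through $s'$ and $s''$. If the weights depend on $f'$, this would be one place where a small mismatch could be amplified.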

@codecov-commenter

Codecov Report

Attention: Patch coverage is 0% with 43 lines in your changes missing coverage. Please review.

Project coverage is 38.53%. Comparing base (10985c1) to head (499cce2).
Report is 57 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| iot/inverse_optimal_tax.py | 0.00% | 43 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (10985c1) and HEAD (499cce2).

HEAD has 2 uploads less than BASE

| Flag | BASE (10985c1) | HEAD (499cce2) |
|---|---|---|
| unittests | 3 | 1 |
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #38       +/-   ##
===========================================
- Coverage   72.81%   38.53%   -34.28%     
===========================================
  Files           3        3               
  Lines         103      205      +102     
===========================================
+ Hits           75       79        +4     
- Misses         28      126       +98     
| Flag | Coverage Δ |
|---|---|
| unittests | 38.53% <0.00%> (-34.28%) ⬇️ |


@jdebacker
Member

@john-p-ryan What was the kde_bw value that you used for the income distribution plotted above?

@john-p-ryan
Contributor Author

> @john-p-ryan What was the kde_bw value that you used for the income distribution plotted above?

I left it at `None`, which falls back to scipy's default, Scott's rule (`'scott'`), a common rule of thumb.
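For reference, a small check of that default, assuming the value is passed straight through to `scipy.stats.gaussian_kde` as `bw_method` (the data here are synthetic, and the doubled bandwidth is only an illustration of how a wider kernel could be tried):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.5, sigma=0.7, size=2000)  # synthetic data

kde_default = gaussian_kde(income)                   # bw_method=None
kde_scott = gaussian_kde(income, bw_method="scott")  # explicit Scott's rule

# Both use the same bandwidth factor, n**(-1/5) for one-dimensional data.
print(kde_default.factor, kde_scott.factor, income.size ** (-0.2))

# A scalar bw_method is used directly as the factor, e.g. twice Scott's value.
kde_wide = gaussian_kde(income, bw_method=2.0 * kde_default.factor)
print(kde_wide.factor)
```

The factor multiplies the sample standard deviation, so a larger value gives a smoother KDE and a smoother $f'_{KDE}$ going into the transition region.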
