| 
 | 1 | +# Semi-Join Selectivity Forward Plan  | 
 | 2 | + | 
 | 3 | +## Problem Statement  | 
 | 4 | + | 
 | 5 | +The current planner implementation conflates two fundamentally different concepts when estimating costs for semi-joins (EXISTS clauses with LIMIT):  | 
 | 6 | + | 
 | 7 | +1. **Filter Selectivity**: The fraction of child rows that pass filter predicates  | 
 | 8 | +2. **Semi-Join Selectivity**: The fraction of parent rows that have at least one matching child  | 
 | 9 | + | 
 | 10 | +This conflation leads to incorrect cost estimates for queries with LIMIT clauses on the outer table.  | 
 | 11 | + | 
 | 12 | +## Background  | 
 | 13 | + | 
 | 14 | +### Example Scenario  | 
 | 15 | + | 
 | 16 | +Consider the query:  | 
 | 17 | + | 
 | 18 | +```sql  | 
 | 19 | +SELECT * FROM users  | 
 | 20 | +WHERE EXISTS (  | 
 | 21 | +  SELECT 1 FROM posts  | 
 | 22 | +  WHERE posts.userId = users.id  | 
 | 23 | +  AND posts.published = true  | 
 | 24 | +)  | 
 | 25 | +LIMIT 10  | 
 | 26 | +```  | 
 | 27 | + | 
 | 28 | +The planner needs to estimate: **How many users must we scan to find 10 users that have at least one published post?**  | 
 | 29 | + | 
 | 30 | +### Current Implementation (Incorrect)  | 
 | 31 | + | 
 | 32 | +```typescript  | 
 | 33 | +// In planner-connection.ts:129-139  | 
 | 34 | +const costWithFilters = model(table, sort, filters, undefined);  | 
 | 35 | +const costWithoutFilters = model(table, sort, undefined, undefined);  | 
 | 36 | +this.selectivity = costWithFilters / costWithoutFilters;  | 
 | 37 | + | 
 | 38 | +// In planner-join.ts:210-214  | 
 | 39 | +scanEst = Math.min(scanEst, parentCost.limit / childCost.selectivity);  | 
 | 40 | +```  | 
 | 41 | + | 
 | 42 | +**What it calculates**: If 50% of posts are published → selectivity = 0.5  | 
 | 43 | +**What it estimates**: Need to scan 10 / 0.5 = 20 users  | 
 | 44 | + | 
 | 45 | +**Problem**: This assumes 50% of users have published posts, but the actual probability depends on how posts are distributed across users (the "fan-out").  | 
 | 46 | + | 
 | 47 | +### The Two Selectivities Compared  | 
 | 48 | + | 
 | 49 | +| Scenario                                                                                      | Filter Selectivity                 | Semi-Join Selectivity                             | Reality                                 |  | 
 | 50 | +| --------------------------------------------------------------------------------------------- | ---------------------------------- | ------------------------------------------------- | --------------------------------------- |  | 
 | 51 | +| Even distribution<br/>1000 users, 10 posts each<br/>50% published                             | 0.5<br/>(50% of posts pass filter) | 0.999<br/>(P(user has ≥1 published) = 1 - 0.5^10) | Almost all users have ≥1 published post |  | 
 | 52 | +| Skewed distribution<br/>900 users with 0 posts<br/>100 users with 100 posts<br/>50% published | 0.5<br/>(still 50% of posts pass)  | 0.1<br/>(only 100/1000 users have any posts)      | Only 10% of users have posts at all     |  | 
 | 53 | + | 
 | 54 | +**Key insight**: Filter selectivity tells us about child rows; semi-join selectivity tells us about parent rows. They're only equal when fan-out = 1 (foreign key from parent to child).  | 
 | 55 | + | 
 | 56 | +## The Fan-Out Formula  | 
 | 57 | + | 
 | 58 | +The relationship between filter and semi-join selectivity is:  | 
 | 59 | + | 
 | 60 | +```  | 
 | 61 | +semiJoinSelectivity = 1 - (1 - filterSelectivity)^fanOut  | 
 | 62 | +```  | 
 | 63 | + | 
 | 64 | +Where **fan-out** = average number of child rows per parent row.  | 
 | 65 | + | 
 | 66 | +### Proof Intuition  | 
 | 67 | + | 
 | 68 | +For a parent with N children, each with independent probability P of matching:  | 
 | 69 | + | 
 | 70 | +- Probability NO children match: (1 - P)^N  | 
 | 71 | +- Probability ≥1 child matches: 1 - (1 - P)^N  | 
 | 72 | + | 
 | 73 | +### Edge Cases  | 
 | 74 | + | 
 | 75 | +**When fan-out = 1** (foreign key from parent → child):  | 
 | 76 | + | 
 | 77 | +```  | 
 | 78 | +semiJoinSelectivity = 1 - (1 - filterSelectivity)^1  | 
 | 79 | +                    = filterSelectivity  | 
 | 80 | +```  | 
 | 81 | + | 
 | 82 | +Current implementation is correct! ✅  | 
 | 83 | + | 
 | 84 | +**When fan-out = 10** (typical one-to-many):  | 
 | 85 | + | 
 | 86 | +- filterSelectivity = 0.5 → semiJoinSelectivity = 0.999  | 
 | 87 | +- filterSelectivity = 0.1 → semiJoinSelectivity = 0.651  | 
 | 88 | +- filterSelectivity = 0.01 → semiJoinSelectivity = 0.096  | 
 | 89 | + | 
 | 90 | +**When fan-out = 100** (high cardinality):  | 
 | 91 | + | 
 | 92 | +- filterSelectivity = 0.5 → semiJoinSelectivity ≈ 1.0  | 
 | 93 | +- filterSelectivity = 0.1 → semiJoinSelectivity ≈ 1.0  | 
 | 94 | +- filterSelectivity = 0.01 → semiJoinSelectivity = 0.634  | 
 | 95 | + | 
 | 96 | +## Solution: Query sqlite_stat1 for Fan-Out  | 
 | 97 | + | 
 | 98 | +### What sqlite_stat1 Provides  | 
 | 99 | + | 
 | 100 | +After running `ANALYZE`, SQLite populates `sqlite_stat1` with index statistics:  | 
 | 101 | + | 
 | 102 | +```sql  | 
 | 103 | +SELECT tbl, idx, stat FROM sqlite_stat1;  | 
 | 104 | + | 
 | 105 | +-- Example result:  | 
 | 106 | +-- tbl='posts', idx='idx_posts_userId', stat='10000 100'  | 
 | 107 | +--                                             ^     ^  | 
 | 108 | +--                                             |     |  | 
 | 109 | +--                                    total rows   avg rows per distinct userId  | 
 | 110 | +```  | 
 | 111 | + | 
 | 112 | +The `stat` column format: `"totalRows avgRowsPerDistinct [avgRowsPerDistinct2 ...]"`  | 
 | 113 | + | 
 | 114 | +For an index on `posts.userId`:  | 
 | 115 | + | 
 | 116 | +- First number: Total rows in the index (10,000 posts)  | 
 | 117 | +- Second number: Average rows per distinct value of userId (100 posts per user)  | 
 | 118 | +- **The second number IS the fan-out!**  | 
 | 119 | + | 
 | 120 | +### Implementation Strategy  | 
 | 121 | + | 
 | 122 | +```typescript  | 
 | 123 | +function getFanOutFromStats(  | 
 | 124 | +  db: Database,  | 
 | 125 | +  tableName: string,  | 
 | 126 | +  columnName: string,  | 
 | 127 | +): number | undefined {  | 
 | 128 | +  // 1. Find indexes containing the column  | 
 | 129 | +  const indexes = db  | 
 | 130 | +    .prepare(  | 
 | 131 | +      `  | 
 | 132 | +    SELECT name FROM sqlite_master  | 
 | 133 | +    WHERE type='index'  | 
 | 134 | +    AND tbl_name=?  | 
 | 135 | +    AND sql LIKE '%' || ? || '%'  | 
 | 136 | +  `,  | 
 | 137 | +    )  | 
 | 138 | +    .all(tableName, columnName);  | 
 | 139 | + | 
 | 140 | +  if (indexes.length === 0) {  | 
 | 141 | +    return undefined; // No index, can't get stats  | 
 | 142 | +  }  | 
 | 143 | + | 
 | 144 | +  // 2. Query sqlite_stat1 for the first matching index  | 
 | 145 | +  for (const {name: indexName} of indexes) {  | 
 | 146 | +    const result = db  | 
 | 147 | +      .prepare(  | 
 | 148 | +        `  | 
 | 149 | +      SELECT stat FROM sqlite_stat1  | 
 | 150 | +      WHERE tbl=? AND idx=?  | 
 | 151 | +    `,  | 
 | 152 | +      )  | 
 | 153 | +      .get(tableName, indexName);  | 
 | 154 | + | 
 | 155 | +    if (result && result.stat) {  | 
 | 156 | +      const parts = result.stat.split(' ');  | 
 | 157 | +      if (parts.length >= 2) {  | 
 | 158 | +        return parseInt(parts[1], 10); // Average rows per distinct value  | 
 | 159 | +      }  | 
 | 160 | +    }  | 
 | 161 | +  }  | 
 | 162 | + | 
 | 163 | +  return undefined; // Stats not available (ANALYZE not run)  | 
 | 164 | +}  | 
 | 165 | +```  | 
 | 166 | + | 
 | 167 | +### Fallback Strategy  | 
 | 168 | + | 
 | 169 | +```typescript  | 
 | 170 | +const DEFAULT_FANOUT = 3; // Conservative middle ground  | 
 | 171 | + | 
 | 172 | +const fanOut = getFanOutFromStats(db, table, joinColumn) ?? DEFAULT_FANOUT;  | 
 | 173 | +const semiJoinSelectivity = 1 - Math.pow(1 - filterSelectivity, fanOut);  | 
 | 174 | +```  | 
 | 175 | + | 
 | 176 | +**Why 3?**  | 
 | 177 | + | 
 | 178 | +- SQLite's default is 10 (might be too optimistic)  | 
 | 179 | +- fan-out = 1 is common for FK relationships  | 
 | 180 | +- fan-out = 3 is a conservative middle ground  | 
 | 181 | +- Overestimating selectivity → slightly less efficient but still correct  | 
 | 182 | +- Underestimating selectivity → risk of bad plans  | 
 | 183 | + | 
 | 184 | +## Design Decisions  | 
 | 185 | + | 
 | 186 | +### Default Fan-Out Value  | 
 | 187 | + | 
 | 188 | +**Options**:  | 
 | 189 | + | 
 | 190 | +- 1: Conservative, assumes FK relationships (current behavior)  | 
 | 191 | +- 3: Moderate, safe middle ground  | 
 | 192 | +- 10: SQLite's default, optimistic  | 
 | 193 | + | 
 | 194 | +**Recommendation**: Use 3  | 
 | 195 | + | 
 | 196 | +- Safer than 10 for sparse relationships  | 
 | 197 | +- More accurate than 1 for typical one-to-many  | 
 | 198 | +- Easy to tune based on production data  | 
 | 199 | + | 
 | 200 | +## Edge Cases and Considerations  | 
 | 201 | + | 
 | 202 | +### 1. No Index on Join Column  | 
 | 203 | + | 
 | 204 | +If the child table has no index on the join column:  | 
 | 205 | + | 
 | 206 | +- `getFanOutFromStats()` returns `undefined`  | 
 | 207 | +- Fall back to `DEFAULT_FANOUT`  | 
 | 208 | +- Cost estimates may be less accurate but still safe  | 
 | 209 | + | 
 | 210 | +### 2. ANALYZE Not Run  | 
 | 211 | + | 
 | 212 | +If `sqlite_stat1` is empty:  | 
 | 213 | + | 
 | 214 | +- `getFanOutFromStats()` returns `undefined`  | 
 | 215 | +- Fall back to `DEFAULT_FANOUT`  | 
 | 216 | +- Document recommendation to run ANALYZE periodically  | 
 | 217 | + | 
 | 218 | +### 3. Composite Indexes  | 
 | 219 | + | 
 | 220 | +If join column is part of a multi-column index:  | 
 | 221 | + | 
 | 222 | +- `sqlite_stat1` provides stats at each position  | 
 | 223 | +- Use the stat corresponding to the join column's position  | 
 | 224 | +- May require more sophisticated parsing  | 
 | 225 | + | 
 | 226 | +### 4. Multiple Indexes on Same Column  | 
 | 227 | + | 
 | 228 | +If multiple indexes contain the column:  | 
 | 229 | + | 
 | 230 | +- Use first index found (arbitrary but consistent)  | 
 | 231 | +- Could enhance to prefer indexes where column is leftmost  | 
 | 232 | + | 
 | 233 | +### 5. Zero Selectivity  | 
 | 234 | + | 
 | 235 | +Current code has a bug where `semiJoinSelectivity = 0` (EXISTS can never succeed):  | 
 | 236 | + | 
 | 237 | +```typescript  | 
 | 238 | +if (childCost.semiJoinSelectivity === 0) {  | 
 | 239 | +  // EXISTS can never succeed - result set is empty  | 
 | 240 | +  return {  | 
 | 241 | +    baseCardinality: 0,  | 
 | 242 | +    runningCost: parentCost.runningCost, // Cost to discover emptiness  | 
 | 243 | +    filterSelectivity: 0,  | 
 | 244 | +    semiJoinSelectivity: 0,  | 
 | 245 | +    limit: parentCost.limit,  | 
 | 246 | +  };  | 
 | 247 | +}  | 
 | 248 | +```  | 
 | 249 | + | 
 | 250 | +This should be fixed in the same PR.  | 
 | 251 | + | 
 | 252 | +### 6. NOT EXISTS  | 
 | 253 | + | 
 | 254 | +For `NOT EXISTS`, we don't short-circuit on first match:  | 
 | 255 | + | 
 | 256 | +- Must scan ALL children to verify none match  | 
 | 257 | +- Don't apply limit or use semi-join selectivity  | 
 | 258 | +- Current code correctly doesn't set limit for NOT EXISTS  | 
 | 259 | + | 
 | 260 | +## Future Enhancements  | 
 | 261 | + | 
 | 262 | +### Enhancement 1: Use sqlite_stat4 for Distribution  | 
 | 263 | + | 
 | 264 | +**Problem**: `sqlite_stat1` provides average fan-out, but distributions can be highly skewed.  | 
 | 265 | + | 
 | 266 | +**Solution**: Sample actual distribution from `sqlite_stat4`:  | 
 | 267 | + | 
 | 268 | +```typescript  | 
 | 269 | +function getSemiJoinSelectivityWithStat4(  | 
 | 270 | +  db: Database,  | 
 | 271 | +  childTable: string,  | 
 | 272 | +  childColumn: string,  | 
 | 273 | +  filterSelectivity: number,  | 
 | 274 | +): number {  | 
 | 275 | +  // Query sqlite_stat4 for distribution samples  | 
 | 276 | +  const samples = db  | 
 | 277 | +    .prepare(  | 
 | 278 | +      `  | 
 | 279 | +    SELECT CAST(neq AS INTEGER) as fanOut  | 
 | 280 | +    FROM sqlite_stat4  | 
 | 281 | +    WHERE tbl=? AND idx LIKE '%' || ? || '%'  | 
 | 282 | +  `,  | 
 | 283 | +    )  | 
 | 284 | +    .all(childTable, childColumn);  | 
 | 285 | + | 
 | 286 | +  if (samples.length === 0) {  | 
 | 287 | +    return fallbackToStat1();  | 
 | 288 | +  }  | 
 | 289 | + | 
 | 290 | +  // Calculate selectivity for each sample's fan-out  | 
 | 291 | +  const selectivities = samples.map(  | 
 | 292 | +    s => 1 - Math.pow(1 - filterSelectivity, s.fanOut),  | 
 | 293 | +  );  | 
 | 294 | + | 
 | 295 | +  // Average across the distribution  | 
 | 296 | +  return selectivities.reduce((a, b) => a + b, 0) / selectivities.length;  | 
 | 297 | +}  | 
 | 298 | +```  | 
 | 299 | + | 
 | 300 | +**Challenge**: `stat4` only samples rows that exist in the index. Parents with 0 children don't appear. Need to adjust for missing parents.  | 
 | 301 | + | 
 | 302 | +## References  | 
 | 303 | + | 
 | 304 | +### PostgreSQL Documentation  | 
 | 305 | + | 
 | 306 | +- [Semi-Join Planning](https://postgrespro.com/blog/pgsql/5969618)  | 
 | 307 | +- [compute_semi_anti_join_factors](https://doxygen.postgresql.org/costsize_8c.html)  | 
 | 308 | +  - Calculates `outer_match_frac`: fraction of outer rows that have matches  | 
 | 309 | +  - Uses NDV (number of distinct values) from statistics  | 
 | 310 | + | 
 | 311 | +### SQLite Documentation  | 
 | 312 | + | 
 | 313 | +- [sqlite_stat1 Format](https://sqlite.org/fileformat2.html#stat1tab)  | 
 | 314 | +- [Query Planning Overview](https://sqlite.org/optoverview.html)  | 
 | 315 | +- [ANALYZE Command](https://sqlite.org/lang_analyze.html)  | 
 | 316 | + | 
 | 317 | +### Academic References  | 
 | 318 | + | 
 | 319 | +- "Access Path Selection in a Relational Database Management System" (Selinger et al., 1979)  | 
 | 320 | +  - Original paper on cost-based optimization  | 
 | 321 | +  - Discusses selectivity estimation for joins  | 
 | 322 | + | 
 | 323 | +## Open Questions  | 
 | 324 | + | 
 | 325 | +1. **Should we fix zero selectivity edge case in same PR?**  | 
 | 326 | +   - Pro: Related to selectivity handling  | 
 | 327 | +   - Con: Separate concern, could be separate PR  | 
 | 328 | + | 
 | 329 | +2. **What's the right default fan-out?**  | 
 | 330 | +   - 1: Current behavior (conservative for FK, wrong for one-to-many)  | 
 | 331 | +   - 3: Moderate (recommended)  | 
 | 332 | +   - 10: SQLite's default (optimistic)  | 
 | 333 | + | 
 | 334 | +3. **Should we add warnings/logging when falling back to default?**  | 
 | 335 | +   - Could help users identify missing ANALYZE or indexes  | 
 | 336 | +   - Might be noisy for intentionally simple cost models  | 
0 commit comments