Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
161 changes: 135 additions & 26 deletions experimental/databricks-aibi-dashboards/SKILL.md

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,8 @@ Advanced visualization types for AI/BI dashboards. For core widgets (text, count
- `widgetType`: "area"
- Same structure as line chart - useful for showing cumulative values or emphasizing volume

> Time-series area charts benefit from a `vertical-line` annotation marking a meaningful event (launch, incident, holiday) — turns a generic trend into a readable story. See [Annotations in 1-widget-specifications.md](1-widget-specifications.md#annotations-event-markers).

```json
"spec": {
"version": 3,
Expand Down Expand Up @@ -61,6 +63,8 @@ Combines bar and line visualizations on the same chart - useful for showing rela
- `y.primary`: bar chart fields
- `y.secondary`: line chart fields

> Mark meaningful events with a `vertical-line` annotation when the x-axis is temporal. See [Annotations in 1-widget-specifications.md](1-widget-specifications.md#annotations-event-markers).

```json
{
"widget": {
Expand Down Expand Up @@ -159,19 +163,326 @@ Displays geographic regions colored by aggregate values. Requires a field with g

---

## Other Visualization Types
## Forecast Line (with `AI_FORECAST`)

Overlays a model prediction on top of historical data — historical line continues into a future band with upper/lower confidence bounds.

- `version`: **1**
- `widgetType`: "forecast-line"
- The dataset SQL produces the original series **plus** three forecast columns: a point forecast, upper band, lower band. Spark's built-in `AI_FORECAST` table function generates them.

### Dataset SQL pattern

> **Always exclude the current (in-progress) bucket from the historical series.** If you aggregate weekly and today is Tuesday, the current week's bucket is only 2 days of data — the line drops off a cliff right before the forecast starts. Filter with `WHERE bucket_start < DATE_TRUNC('<grain>', current_date())` using the **same grain as the aggregation**.

```sql
WITH actuals AS (
SELECT DATE_TRUNC('WEEK', opened_at) AS opened_at, COUNT(*) AS count
FROM support_cases
-- Drop the partial-elapsed bucket. Grain MUST match the DATE_TRUNC above —
-- weekly aggregation → exclude current week; monthly → exclude current month.
WHERE DATE_TRUNC('WEEK', opened_at) < DATE_TRUNC('WEEK', current_date())
GROUP BY 1
),
dates AS (
SELECT MAX(opened_at) AS max_d, MIN(opened_at) AS min_d FROM actuals
),
forecast AS (
SELECT opened_at, count_forecast, count_upper, count_lower, CAST(NULL AS BIGINT) AS count
FROM AI_FORECAST(
TABLE(actuals),
horizon => (SELECT max_d + MAKE_DT_INTERVAL(
CAST(FLOOR(DATEDIFF(max_d, min_d) * 0.5) AS INT), 0, 0, 0) FROM dates),
time_col => 'opened_at',
value_col => 'count'
)
),
bridge AS (
-- One-row "seam" that carries the last actual value into the forecast columns
-- so the historical line and the forecast band visually connect instead of breaking
-- with a gap at the boundary.
SELECT a.opened_at,
a.count AS count_forecast,
a.count AS count_upper,
a.count AS count_lower,
a.count
FROM actuals a
JOIN dates d ON a.opened_at = d.max_d
)
SELECT opened_at, CAST(NULL AS BIGINT) AS count_forecast, CAST(NULL AS BIGINT) AS count_upper, CAST(NULL AS BIGINT) AS count_lower, count FROM actuals
UNION ALL SELECT opened_at, count_forecast, count_upper, count_lower, count FROM bridge
UNION ALL SELECT opened_at, count_forecast, count_upper, count_lower, count FROM forecast
```

Three CTEs:
- **`actuals`** — historical series (`count` populated, forecast columns NULL).
- **`forecast`** — `AI_FORECAST` output (forecast columns populated, `count` NULL).
- **`bridge`** — a **single row at the last actual timestamp** with the actual value duplicated into all three forecast columns. Without it, the historical line and the forecast band have a visible gap at the boundary; with it, they connect smoothly.

> **The final `SELECT`s must list columns explicitly, in the same order, in every branch.** `SELECT * FROM actuals` (2 cols) `UNION ALL SELECT * FROM forecast` (5 cols) errors out with `NUM_COLUMNS_MISMATCH`. Project the same 5-column shape from every CTE — fill NULLs where a branch doesn't have a value (and `CAST(NULL AS <type>)` so the types align).

The `horizon` expression projects forward 50% of the historical range. Tune the multiplier (0.5 → 1.0 for "predict as far as we've seen") to taste.

**If you switch the aggregation grain, update both `DATE_TRUNC` calls.** They must match — a daily x-axis with a weekly cutoff filter would still show the cliff. Common pairings:

| Aggregation `DATE_TRUNC` | Cutoff filter |
|---|---|
| `'DAY'` | `WHERE event_ts < DATE_TRUNC('DAY', current_timestamp())` — drops today |
| `'WEEK'` | `WHERE DATE_TRUNC('WEEK', event_ts) < DATE_TRUNC('WEEK', current_date())` — drops current week |
| `'MONTH'` | `WHERE DATE_TRUNC('MONTH', event_ts) < DATE_TRUNC('MONTH', current_date())` — drops current month |
| `'QUARTER'` / `'YEAR'` | same shape with that grain |

### Widget spec

```json
"spec": {
"version": 1,
"widgetType": "forecast-line",
"encodings": {
"x": {"fieldName": "opened_at", "scale": {"type": "temporal"}},
"y": {
"scale": {"type": "quantitative"},
"original": {"fieldName": "count", "displayName": "Cases"},
"prediction": {"fieldName": "count_forecast", "displayName": "Forecast"},
"predictionUpper": {"fieldName": "count_upper"},
"predictionLower": {"fieldName": "count_lower"}
}
},
"annotations": [ /* vertical-line for known events — same shape as in 1-widget */ ],
"frame": {"showTitle": true, "title": "Case Volume Forecast"}
}
```

> Annotations (`vertical-line`) work on forecast-line — useful for marking known seasonal events (holidays, releases) inside both the historical window and the prediction band. Shape documented in [1-widget-specifications.md](1-widget-specifications.md#annotations-event-markers).

---

## Pivot

A cross-tab — dimensions on rows AND columns, measures in cells. Supports per-cell conditional styling (heat-map-style).

- `version`: **3**
- `widgetType`: "pivot"
- For multi-dimensional aggregations like "category × priority", supports drill-down totals, and is the right widget for cohort retention (see end of section).

```json
"spec": {
"version": 3,
"widgetType": "pivot",
"encodings": {
"rows": [
{"fieldName": "industry"},
{"fieldName": "customer_segment", "total": {"show": true}}
],
"columns": [
{"fieldName": "category"},
{"fieldName": "Priority Level", "total": {"show": true}}
],
"cell": {
"type": "multi-cell",
"fields": [
{
"fieldName": "count(*)",
"cellType": "text",
"style": {
"type": "basic",
"rules": [
{"condition": {"operand": {"type": "data-value", "value": "30"}, "operator": ">="},
"backgroundColor": {"hex": "#FF7E5C"}},
{"condition": {"operand": {"type": "data-value", "value": "20"}, "operator": ">="},
"backgroundColor": {"themeColorType": "visualizationColors", "position": 0}},
{"condition": {"operand": {"type": "data-value", "value": "15"}, "operator": ">="},
"backgroundColor": {"themeColorType": "visualizationColors", "position": 6}}
]
}
}
]
}
},
"frame": {"showTitle": true, "title": "Cases by Category × Priority"}
}
```

For a **continuous color gradient** instead of explicit thresholds, set `cell.fields[].cellType: "color-scale"` and drop the `style.rules`. The gradient auto-fits to min/max in the cell values.

**Sort by cell values** (e.g., put the highest-volume column first) — useful for cohort tables:
```json
"columns": [{
"fieldName": "category",
"scale": {"type": "categorical", "sort": {"by": "cell", "field": {"index": 0}}}
}]
```
`"by": "cell-reversed"` flips the order. `field.index` picks which value field to sort by when there are multiple.

> **Cohort retention charts** are built as a `pivot`. Rows = cohort date, columns = period offset (`0`, `1 year`, `2 years`…), cell = retention ratio with `cellType: "color-scale"`. There is no separate `cohort` widget type.

---

## Histogram

Frequency distribution. The bin width is set in the **widget's field expression**, not in the dataset SQL.

- `version`: **3**
- `widgetType`: "histogram"

```json
"queries": [{
"name": "main_query",
"query": {
"datasetName": "ds_cases",
"fields": [
{"name": "bin(time_to_resolution_hours, binWidth=2)",
"expression": "BIN_FLOOR(`time_to_resolution_hours`, 2)"},
{"name": "count(*)", "expression": "COUNT(`*`)"}
],
"disaggregated": false
}
}],
"spec": {
"version": 3,
"widgetType": "histogram",
"encodings": {
"x": {"fieldName": "bin(time_to_resolution_hours, binWidth=2)",
"scale": {"type": "quantitative"}},
"y": {"fieldName": "count(*)", "scale": {"type": "quantitative"}}
},
"frame": {"showTitle": true, "title": "Resolution Time Distribution"}
}
```

The field `name` (and the widget's `fieldName`) is the readable `bin(col, binWidth=N)` label; the underlying `expression` uses `BIN_FLOOR(\`col\`, N)` — a Lakeview field-expression, not raw SQL.

---

## Sankey

The following visualization types are available in Databricks AI/BI dashboards but are less commonly used. Refer to [Databricks documentation](https://docs.databricks.com/visualizations/visualization-types) for details:
Flow between two or more stages. Each stage is a categorical field; the value is a quantitative aggregate.

| Widget Type | Description |
- `version`: **1**
- `widgetType`: "sankey"

```json
"spec": {
"version": 1,
"widgetType": "sankey",
"encodings": {
"value": {"fieldName": "count(*)"},
"stages": [
{"fieldName": "channel"},
{"fieldName": "reopened_flag", "displayName": "Reopened"}
]
},
"frame": {"showTitle": true, "title": "Channel → Reopen flow"}
}
```

Add more `stages` entries for multi-step flows (e.g., funnel-with-attribution: `source → channel → outcome`).

---

## Heatmap

Color-intensity grid: x-axis categorical, y-axis categorical, color = numeric aggregate. Useful for "X by Y" matrices.

- `version`: **3**
- `widgetType`: "heatmap"

```json
"spec": {
"version": 3,
"widgetType": "heatmap",
"encodings": {
"x": {"fieldName": "priority", "scale": {"type": "categorical"}, "axis": {"hideTitle": true}},
"y": {"fieldName": "ship_mode", "scale": {"type": "categorical"}, "axis": {"hideTitle": true}},
"color": {"fieldName": "sum(order_count)",
"scale": {"type": "quantitative",
"colorRamp": {"mode": "scheme", "scheme": "viridis"}}}
},
"frame": {"showTitle": true, "title": "Order count by priority × ship mode"}
}
```

Heatmap limit: 64K rows / 10MB. For larger data, pre-aggregate to a smaller grid.

`axis.hideTitle: true` (shown above) drops the redundant "priority" / "ship_mode" axis labels — the row/column headers already tell you what they are. Same trick works on any x/y axis encoding (line, bar, heatmap, pivot) when the column name is obvious from context.

---

## Funnel

Stage-by-stage conversion: how many users / records make it from step 1 to step N.

- `version`: **1**
- `widgetType`: "funnel"

```json
"spec": {
"version": 1,
"widgetType": "funnel",
"encodings": {
"x": {"fieldName": "stage"},
"y": {"fieldName": "count", "scale": {"type": "quantitative"}},
"color": {"fieldName": "count"}
},
"frame": {"showTitle": true, "title": "Signup funnel"}
}
```

Dataset SQL typically returns one row per stage, with an ordering column. Use `ORDER BY stage_order` in the SQL to guarantee top-to-bottom visualization.

---

## Box

Distribution summary (median, quartiles, whiskers, outliers). Compare distributions across categories.

- `version`: **1**
- `widgetType`: "box"

```json
"spec": {
"version": 1,
"widgetType": "box",
"encodings": {
"x": {"fieldName": "return_flag", "displayName": "Return flag"},
"y": {"fieldName": "l_extendedprice", "displayName": "Extended price",
"scale": {"type": "quantitative"}}
},
"frame": {"showTitle": true, "title": "Price distribution by return flag"}
}
```

---

## Waterfall

Cumulative effect of positive/negative deltas — useful for P&L bridges, MoM revenue walks, factor decomposition.

- `version`: **1**
- `widgetType`: "waterfall"

```json
"spec": {
"version": 1,
"widgetType": "waterfall",
"encodings": {
"x": {"fieldName": "monthly(date_col)",
"expression": "DATE_TRUNC(\"MONTH\", `date_col`)"},
"y": {"fieldName": "sum(amount)", "scale": {"type": "quantitative"}}
},
"frame": {"showTitle": true, "title": "Monthly P&L"}
}
```

Dataset typically returns one row per period with signed values (positive contributions, negative deductions).

---

## Other (less common)

| Widget Type | When to use |
|-------------|-------------|
| heatmap | Color intensity grid for numerical data |
| histogram | Frequency distribution with configurable bins |
| funnel | Stage-based metric analysis |
| sankey | Flow visualization between value sets |
| box | Distribution summary with quartiles |
| marker-map | Latitude/longitude point markers |
| pivot | Drag-and-drop aggregation table |
| word-cloud | Word frequency visualization |
| sunburst | Hierarchical data in concentric circles |
| cohort | Group outcome analysis over time |
| `word-cloud` | Word/category frequency from a text field. |
| `sunburst` | Hierarchical data in nested rings (org chart, taxonomy). |

These follow the same `version`/`widgetType`/`encodings` pattern — see the [official docs](https://docs.databricks.com/dashboards/manage/visualizations/types) for spec details.
Loading
Loading