Skip to content

Reduce log volume#6568

Open
PSeitz-dd wants to merge 7 commits into
quickwit-oss:mainfrom
PSeitz:reduce_logs
Open

Reduce log volume#6568
PSeitz-dd wants to merge 7 commits into
quickwit-oss:mainfrom
PSeitz:reduce_logs

Conversation

@PSeitz-dd

@PSeitz-dd PSeitz-dd commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

rate_limited_tracing macro (rate_limited_info etc.): the suppressed count is now folded into the emitted line as a num_suppressed=N field, instead of a separate preceding line.
In combination with #6549, this lets us aggregate on the field and recover the true pre-suppression log count.

3 Types of Log Changes:

  1. INFODEBUG (routine per-operation chatter): node pool-add ×4 (the ~907M dominant source), send-to-index-serializer, spawning pipeline, merge schedule/download, leaf split-finished/offsets, offload-to-lambda, resetting pipeline, adding-shards-assignment.

  2. INFO → rate-limited INFO (1/min) (worth a heartbeat): new-split, actor-exit (success), assigning shards, env-var defaults, truncated-shard.

  3. ERROR → rate-limited ERROR (1/min): the recurring lambda invocation failure (kept visible, not flooding).

publish-new-splits now carries num_splits, num_docs, and total on-disk split_size.

PSeitz added 3 commits June 30, 2026 18:55
The rate-limited tracing macros emitted a separate "suppressed N similar
log messages" line before the next allowed line. Attach the count to that
line instead, as a `suppressed_in_last_min` field. This makes the
suppression rate visible inline on the message it belongs to and removes a
distinct log pattern, slightly lowering volume.
…EM-759)

A fleet-wide extract showed ~1.07B log lines, dominated by a single
per-gossip-tick INFO (~907M, ~85%). Reclassify high-frequency operational
logs to cut default-level volume by ~20x while preserving actionable signal:

- INFO -> DEBUG for routine per-operation chatter with no liveness value:
  node pool-add (x4), send-to-index-serializer, spawning pipeline, merge
  schedule/download, leaf split-finished/offsets, offload-to-lambda,
  resetting pipeline, adding shards assignment.
- INFO -> rate-limited INFO (1/min) for logs worth a heartbeat: new-split,
  actor-exit (success), assigning shards, env-var defaults, truncated-shard.
- ERROR -> rate-limited ERROR (1/min) for the recurring lambda invocation
  failure, keeping it visible without flooding.

Error/failure branches are untouched (e.g. actor-exit failure stays at
ERROR). Stage/publish, merge-completion, and cluster lifecycle stay at INFO.
The publish-new-splits log carried no fields. Add num_splits, num_docs,
and total on-disk split size to give operators visibility into publish
throughput and split sizing without adding a new log line.

num_splits is >1 only for partitioned sources (a single commit produces
one split per partition); merges publish a single output split, with the
merged inputs recorded in replaced_split_ids.
@PSeitz-dd PSeitz-dd requested a review from a team as a code owner June 30, 2026 17:27
PSeitz added 3 commits June 30, 2026 20:24
This is a low-volume (~80K), operationally meaningful event: the searcher
spilling search work to Lambda. It carries capacity/cost signal and pairs
with the lambda invocation error we deliberately keep visible. Demoting it
to DEBUG bought no real volume reduction, so revert it to INFO.
Keep visibility into ingester, searcher, and generic-service pool membership
at INFO but cap each to 1/min. These are far lower volume than the indexer
pool-add (~907M), which stays at DEBUG.
…uppressed

Rate-limit the indexer pool-add log (~907M, the dominant pattern) at INFO
1/min instead of demoting to DEBUG. At 1/min this collapses to ~1.4K/day --
the same volume win as DEBUG -- while keeping pool membership visible at INFO,
consistent with the other three pool-add logs.

Rename the rate-limit suppressed-count field from `suppressed_in_last_min`
to `num_suppressed`: the count is messages suppressed since the call site
last emitted, and since the window only resets on the next call it can span
more than a minute, so the old name was misleading.
@PSeitz PSeitz changed the title Reduce default log volume by reclassifying verbose INFO logs (CLOUDPREM-759) Reduce log volume Jun 30, 2026
Comment thread quickwit/quickwit-common/src/lib.rs Outdated
flag_value
} else {
info!(default_value=%default_value, "using environment variable `{key}` default value");
crate::rate_limited_info!(limit_per_min = 1, default_value=%default_value, "using environment variable `{key}` default value");

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

???

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have ~2M logs of that in the last week on the hooray cluster. I'll try to find the call sites and cache them instead of rate-limiting

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added caching to the callsites (with some macro to avoid misuse in the future)

| ActorExitStatus::DownstreamClosed
| ActorExitStatus::Killed => {
info!(actor_id, phase = ?exit_phase, exit_status = ?after_process_exit_status, "actor-exit");
quickwit_common::rate_limited_info!(limit_per_min = 1, actor_id, phase = ?exit_phase, exit_status = ?after_process_exit_status, "actor-exit");

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to rate limit this one

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove it, but we had ~4M log lines of that in the last week

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the rate limiting

@fulmicoton

Copy link
Copy Markdown
Collaborator

I am all for reducing logging but this is a terrible PR!
Why do you want to rate limit the thing that displays the parsed cli options?

The "using environment variable ... default value" INFO lines flooded
because callers like PostgresqlMetastore::new read the same variables on
every (re)construction, and the metastore is rebuilt frequently. Rate-
limiting the shared log statement (previous approach) funneled every key
through one call-site counter, so within a burst only the first key was
ever logged and the rest were deterministically dropped.

Instead, add per-call-site caching macros in quickwit-common:
- get_from_env_cached!(ty, key, default, sensitive)
- get_bool_from_env_cached!(key, default)

Each expands to a block-local `static LazyLock<T>`, so the read + log
fire exactly once per call site per process, with no cross-key
contention (distinct call sites get distinct statics). Revert lib.rs env
logging back to plain info!.

Convert the non-lazy hot readers (metastore QW_POSTGRES_*, node_config
OTLP/JAEGER, CORS debug) and fold the existing hand-rolled LazyLock env
caches (ingest v2 enable/disable, batch bytes, per-index metrics, doc
validation, load estimation, field list limit, default load per shard)
onto the same macro for consistency.

Also drop the rate-limit on the actor-exit success log (back to info!).
@PSeitz-dd

Copy link
Copy Markdown
Contributor Author

I am all for reducing logging but this is a terrible PR! Why do you want to rate limit the thing that displays the parsed cli options?

It's not parsed CLI options, it's reading environment variables, which we logged ~2M times in the last week on the hooray cluster.
I changed it to cache the env vars instead of rate limiting (since this covered only default values)

@PSeitz-dd PSeitz-dd requested a review from fulmicoton July 1, 2026 13:27
@fulmicoton

Copy link
Copy Markdown
Collaborator

I changed it to cache the env vars instead of rate limiting (since this covered only default values)

Yes sorry I meant env var.
And yes that's what I meant. If you see we log a lot of these, then the right move is to check why, and address it the way all of the codebase does.

PIPELINE_FULL_CAPACITY.cpu_millis() / 4,
false
);
NonZeroU32::new(default_load_per_shard).unwrap()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: I don't see the benefit of a macro here. Are we trying to same one LOC?

Less nitpick: if introducing a macro is something we want to do, it should be a separate PR.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main benefit is to avoid this issue in the future by adding a central way that includes caching.
An different approach would be to have a central env hashmap in a oncelock, with all env variables in the repo. Which one do you think is better?

I can put it in a different PR

}
}
info!(
quickwit_common::rate_limited_info!(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rate limited cannot be the right move here!

Generally speaking, rate limiting is rarely a good move for info statement.
If we don't need it (because it is useless or redundant) let's remove it.

Those info statement are good to investigate, when something goes wrong, what chain of events led to the problem.
If you rate limit, we cannot draw any conclusions about the absence of the log statement. We cannot no that indexing did not reach a given step for a given split for instance.

) -> Change<NodeId, IngesterPoolEntry> {
let chitchat_id = node.chitchat_id();
info!(
quickwit_common::rate_limited_info!(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No rate limiting on that no. If we have a lot of churn we need to feel its pain and not ignore it.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is by far the biggest block with 700M logs in the last week (90% of all logs), so it may cause some cost for the customer. We did some experiments, so I'm not sure if this may happen on the customer side.

We could also increase the limit_per_min, to have some safeguard against it and still get a better signal. What do you think?

BTW. In combination with #6549 we can easily recover the total sum of these logs.

ClusterChange::Add(node) if node.is_searcher() => {
let chitchat_id = node.chitchat_id();
info!(
quickwit_common::rate_limited_info!(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above.

}
}

/// Reads and parses an environment variable exactly once, caching the result for the lifetime of

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, we should have a env.rs file in quickwit-common to gather all of these helpers.

Also not by implementing your macro only for a two calls, you creating a fracture in the way we deal with env variable.

Some stuff use a macro, some not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants