Eager search space materialization #796

@AdrianSosic

Problem

SubspaceDiscrete eagerly materializes the full candidate space at construction time. The classmethod constructors (from_product, from_simplex, from_dataframe) immediately build both exp_rep (experimental representation) and comp_rep (computational/encoded representation) as pandas DataFrames, storing the complete set of candidates in RAM.

This means:

  • Large product spaces exhaust memory. Calling from_product on even a modest mixture or formulation space produces millions of rows that are held in RAM immediately, even when the recommender only ever scores a tiny subset.
  • Infinite spaces cannot be represented at all. Countably infinite spaces (e.g., a SubstanceParameter defined by an alphabet and length range, or hierarchical parameter spaces) cannot be constructed because they would require infinite memory.
  • No deferred computation. Even when a recommender needs only 50 candidates, the entire space of 10^6 or more rows must be materialized first.
  • Parameters must predefine all possible values. Because comp_rep is built eagerly at construction time for the entire space, every encoded parameter must declare all its possible values upfront — the computational representation of an unknown value simply cannot be produced later. This forces the use of active_values as a workaround, which is problematic in campaigns where the full set of values a user might provide is not known in advance. For parameters with a fixed encoder (e.g., Mordred fingerprints for substances), there is no inherent reason the encoding can't happen on-the-fly for any valid input — but the eager architecture prevents this.
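To make the contrast with eager materialization concrete, here is a minimal sketch of deferred candidate access for a product space. It is not BayBE code: the helpers product_size, candidate_at, and sample_candidates are hypothetical names chosen for illustration, and it assumes every parameter has a finite list of values. Each candidate is decoded from its flat row index on demand, so only the requested rows ever exist in memory.

```python
import math
import random
from typing import Sequence


def product_size(levels: Sequence[Sequence]) -> int:
    """Number of rows a fully materialized Cartesian product would contain."""
    return math.prod(len(values) for values in levels)


def candidate_at(levels: Sequence[Sequence], index: int) -> tuple:
    """Decode a flat row index into one candidate (mixed-radix decoding).

    Row order matches itertools.product: the last parameter varies fastest.
    """
    if not 0 <= index < product_size(levels):
        raise IndexError(index)
    picked = []
    for values in reversed(levels):
        index, position = divmod(index, len(values))
        picked.append(values[position])
    return tuple(reversed(picked))


def sample_candidates(levels: Sequence[Sequence], n: int, seed: int = 0) -> list:
    """Draw up to n distinct candidates without building the full product."""
    rng = random.Random(seed)
    size = product_size(levels)
    # range() is lazy, so sampling indices does not allocate `size` items.
    indices = rng.sample(range(size), k=min(n, size))
    return [candidate_at(levels, i) for i in indices]


# Hypothetical example: six parameters with 21 levels each.
levels = [tuple(range(21))] * 6
print(product_size(levels))               # 21**6 rows if materialized eagerly
print(len(sample_candidates(levels, 50)))  # only 50 rows are ever created
```

This only covers unconstrained finite products. Constraints, simplex spaces, and countably infinite domains would need a different enumeration or sampling scheme.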

Why it matters

  • Memory exhaustion on realistic combinatorial/formulation spaces (10^6 to 10^9 candidates).
  • Entire classes of parameter types are blocked — any parameter with a countably infinite domain cannot participate in a product space.
  • Wasted computation — encoding millions of rows when only dozens are used downstream.
  • Artificial requirement to enumerate values upfront — parameters with fixed encoders (Mordred, ECFP, etc.) could encode any valid value on demand, but the eager comp_rep construction forces all values to be known at space definition time. This creates friction in campaign workflows where new substances/values are discovered over time.
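The last point can be illustrated with a small sketch of on-demand encoding. Everything here is an assumption made for illustration: encode is a toy stand-in for a fixed encoder such as Mordred or ECFP (it just counts characters), and ALPHABET and encode_batch are hypothetical names, not part of BayBE. The point is only that a deterministic encoder can produce the computational representation of a value the first time it is seen, with a cache avoiding repeated work.

```python
from functools import lru_cache

# Hypothetical alphabet for the toy encoder below.
ALPHABET = "CNO"


@lru_cache(maxsize=None)
def encode(value: str) -> tuple[float, ...]:
    """Toy stand-in for a fixed encoder: per-character counts plus length."""
    counts = tuple(float(value.count(ch)) for ch in ALPHABET)
    return counts + (float(len(value)),)


def encode_batch(values: list[str]) -> list[tuple[float, ...]]:
    """Encode only the values that are actually requested."""
    return [encode(v) for v in values]


# Values can appear at any time; nothing has to be declared upfront.
print(encode_batch(["CCO", "CCN"]))
print(encode_batch(["CCO", "NCO"]))  # "CCO" is served from the cache
print(encode.cache_info())
```

A real encoder would also have to keep the feature columns consistent across calls (for example, any scaling or decorrelation fitted on an initial set of values), which this sketch does not address.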
