Implement Request and Response Policy Based Routing in Cluster Mode #3422
base: load-balance-search-commands-to-shards
Conversation
feat(routing): add internal request/response policy enums
* feat: load the policy table in cluster client
* Remove comments

…or osscluster.go (#6)
* centralize cluster command routing in osscluster_router.go and refactor osscluster.go
* enable ci on all branches
* Add debug prints
* Add debug prints
* FIX: deal with nil policy
* FIX: fixing clusterClient process
* chore(osscluster): simplify switch case
* wip(command): ai generated clone method for commands
* feat: implement response aggregator for Redis cluster commands
* feat: implement response aggregator for Redis cluster commands
* fix: solve concurrency errors
* fix: solve concurrency errors
* return MaxRedirects settings
* remove locks from getCommandPolicy
* Handle MOVED errors more robustly, remove cluster reloading at executions, ensure better routing
* Fix: supports Process hook test
* Fix: remove response aggregation for single shard commands
* Add more performant type conversion for Cmd type
* Add router logic into processPipeline

Co-authored-by: Nedyalko Dyakov <[email protected]>
…ce-search-commands-to-shards
}
if result.cmd != nil && result.err == nil {
	// For MGET, extract individual values from the array result
	if strings.ToLower(cmd.Name()) == "mget" {
Do we actually need this special case?
}

// getCommandPolicy retrieves the routing policy for a command
func (c *ClusterClient) getCommandPolicy(ctx context.Context, cmd Cmder) *routing.CommandPolicy {
It seems like this will introduce a big overhead for each command execution. We should fetch all policies during the connection handshake.
Note: for the first stage we should use a hard-coded policy manager that can be extended in the future to take into account the COMMAND command output.
@htemelski-redis 💡 Consider implementing a PolicyResolverConfig type that users can override via the client options. This config should map module__command_name to metadata (policies, key requirements, etc.). Set hardcoded defaults in the client options, but allow users to override policies per command as needed.
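A rough sketch of what such a config could look like. The PolicyResolverConfig and CommandMetadata names, the Resolve method, and the field layout are illustrative assumptions that follow the suggestion above; only routing.CommandPolicy comes from this PR.

// Hypothetical user-overridable policy resolver (names are illustrative, not part of the PR).
type CommandMetadata struct {
	Policy      *CommandPolicy // request/response policy type from this PR
	FirstKeyPos int            // position of the first key argument, 0 if keyless
}

// PolicyResolverConfig maps "module__command_name" (e.g. "ft__search") to metadata.
type PolicyResolverConfig struct {
	Overrides map[string]CommandMetadata
}

// Resolve returns a user override when present, otherwise the hard-coded default.
func (c *PolicyResolverConfig) Resolve(name string, defaults map[string]CommandMetadata) (CommandMetadata, bool) {
	if c != nil {
		if md, ok := c.Overrides[name]; ok {
			return md, true
		}
	}
	md, ok := defaults[name]
	return md, ok
}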
Compare: 6e3b627 to 1b2eaa6
Submitting partial review for the aggregators.
// For MGET without policy, use keyed aggregator
if cmdName == "mget" {
	return routing.NewDefaultAggregator(true)
}
Since we are passing cmd.Name() to routing.NewResponseAggregator, this can be handled by it. If policy is nil for mget, maybe NewResponseAggregator can accept a policy and check for nil as well.
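A minimal sketch of that idea, assuming NewResponseAggregator could take the (possibly nil) policy alongside the command name. The signature and the ResponseAggregator interface name are assumptions; NewDefaultAggregator is the constructor already used in the diff above.

func NewResponseAggregator(policy *CommandPolicy, cmdName string) ResponseAggregator {
	// No COMMAND metadata available: keyed multi-shard commands such as MGET
	// still need per-key ordering, everything else gets the plain default.
	if policy == nil {
		return NewDefaultAggregator(strings.ToLower(cmdName) == "mget")
	}
	// With a policy present, dispatch on the declared response_policy as the
	// existing constructor already does (dispatch omitted in this sketch).
	return NewDefaultAggregator(false)
}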
a.mu.Lock()
defer a.mu.Unlock()

if err != nil && a.firstErr == nil {
	a.firstErr = err
	return nil
}
if err == nil && !a.hasResult {
	a.result = result
	a.hasResult = true
}
return nil
Couple of questions here:
1. Should we return only the first observed error?
2. Why are we overwriting the result?
3. Can't we just have an atomic boolean hasError?
4. Same, if we can have an atomic hasResult we can drop the mutex.

The idea behind these questions: if we are going to return on the first error, we can do this with atomics and skip the CPU cycles spent on the mutex.
For the all_succeeded policy, we either return one of the replies if there is no error, or one of the errors if there is at least one. So:
1. Yes, returning only the first error is sufficient.
2. We are setting it only once.
3/4. I feel that using atomics will overcomplicate the aggregators, plus there are some caveats to using them. I think we should try to maximize the compatibility of the library.
We already use atomics in the library, so using them here is not a problem in terms of compatibility. As for increasing the complexity: that is rarely the case for such small units as the Aggregators. Let's try it out; I agree it will be different from the mutex approach, but in this specific case I see atomics as the correct approach for leveraging the CPU as much as possible without spending the additional cycles on the mutex.
For most aggregators we are either going to sum numbers, calculate min/max or check flags. This can be covered by atomics. For the cases where we are combining more complex response, it makes sense to lock and append to a slice.
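To make the discussion concrete, here is a sketch of the atomics-based variant (field names and the Result method shape are assumptions; it needs "sync/atomic" and Go 1.19+ for atomic.Bool). The CompareAndSwap acts as a first-writer-wins gate, so the winning goroutine can store its value without a mutex; it relies on Result only being called after all Add calls have returned, which is how the aggregators are used here.

type AllSucceededAggregator struct {
	hasErr    atomic.Bool
	hasResult atomic.Bool
	firstErr  atomic.Value // written once, by the goroutine that wins the CAS
	result    atomic.Value // written once, by the goroutine that wins the CAS
}

func (a *AllSucceededAggregator) Add(result interface{}, err error) error {
	if err != nil {
		if a.hasErr.CompareAndSwap(false, true) {
			a.firstErr.Store(err)
		}
		return nil
	}
	if a.hasResult.CompareAndSwap(false, true) && result != nil {
		a.result.Store(result)
	}
	return nil
}

// Result must only be called after all Add calls have completed.
func (a *AllSucceededAggregator) Result() (interface{}, error) {
	if a.hasErr.Load() {
		return nil, a.firstErr.Load().(error)
	}
	return a.result.Load(), nil
}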
func (a *OneSucceededAggregator) Add(result interface{}, err error) error {
	a.mu.Lock()
	defer a.mu.Unlock()

	if err != nil && a.firstErr == nil {
		a.firstErr = err
		return nil
	}
	if err == nil && !a.hasResult {
		a.result = result
		a.hasResult = true
	}
	return nil
}
Same as with AllSucceededAggregator. Maybe we can use atomics here.
Same as above
	return nil
}
if err == nil {
	a.sum += val
Again, maybe we can use atomic.Int64
-||-
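For reference, a sketch of the atomic.Int64 variant suggested above, stripped down to the summing part (the type name and the simplified numeric conversion are assumptions; first-error handling would follow the same CAS pattern as the earlier sketch).

type AggSumAggregator struct {
	sum atomic.Int64 // updated lock-free by concurrent Add calls
}

func (a *AggSumAggregator) Add(result interface{}, err error) error {
	if err != nil {
		// first-error handling as in the AllSucceededAggregator sketch above
		return nil
	}
	if val, ok := result.(int64); ok { // numeric conversion simplified for the sketch
		a.sum.Add(val)
	}
	return nil
}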
}

// AggMinAggregator returns the minimum numeric value from all shards.
type AggMinAggregator struct {
Looking at https://github.com/haraldrudell/parl , there is atomic min and atomic max implementations that we can also use.
P.S. I suggest copying only the needed implementation or using it as a reference to reimplement, rather than including the whole dependency. Of course, credit the original author in the code.
-||-
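If the dependency is dropped, a CAS-loop reimplementation is small. A sketch, with illustrative names, using math.MaxInt64 as the "no value observed yet" sentinel (needs "math" and "sync/atomic"):

// AtomicMinInt64 tracks the smallest observed value without a mutex.
type AtomicMinInt64 struct {
	min atomic.Int64
}

func NewAtomicMinInt64() *AtomicMinInt64 {
	m := &AtomicMinInt64{}
	m.min.Store(math.MaxInt64) // sentinel: no value observed yet
	return m
}

// Observe records val, keeping the minimum seen so far.
func (m *AtomicMinInt64) Observe(val int64) {
	for {
		cur := m.min.Load()
		if val >= cur {
			return
		}
		if m.min.CompareAndSwap(cur, val) {
			return
		}
	}
}

// Min reports the minimum and whether any value was observed.
func (m *AtomicMinInt64) Min() (int64, bool) {
	v := m.min.Load()
	return v, v != math.MaxInt64
}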
	return nil, a.firstErr
}
if !a.hasResult {
	return nil, fmt.Errorf("redis: no valid results to aggregate for min operation")
Can we extract such errors in a separate file?
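Something along these lines, perhaps in a dedicated errors file inside the routing package (variable names are illustrative; needs the "errors" import):

var (
	// ErrNoResultsToAggregate is returned when no shard produced a usable reply.
	ErrNoResultsToAggregate = errors.New("redis: no valid results to aggregate")
	// Operation-specific variants keep the wording used in the diff above.
	ErrNoMinResult = errors.New("redis: no valid results to aggregate for min operation")
	ErrNoMaxResult = errors.New("redis: no valid results to aggregate for max operation")
)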
// SetAggregatorFunc allows setting custom aggregation logic for special commands.
func (a *SpecialAggregator) SetAggregatorFunc(fn func([]interface{}, []error) (interface{}, error)) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.aggregatorFunc = fn
}

// SpecialAggregatorRegistry holds custom aggregation functions for specific commands.
var SpecialAggregatorRegistry = make(map[string]func([]interface{}, []error) (interface{}, error))

// RegisterSpecialAggregator registers a custom aggregation function for a command.
func RegisterSpecialAggregator(cmdName string, fn func([]interface{}, []error) (interface{}, error)) {
	SpecialAggregatorRegistry[cmdName] = fn
}

// NewSpecialAggregator creates a special aggregator with command-specific logic if available.
func NewSpecialAggregator(cmdName string) *SpecialAggregator {
	agg := &SpecialAggregator{}
	if fn, exists := SpecialAggregatorRegistry[cmdName]; exists {
		agg.SetAggregatorFunc(fn)
	}
	return agg
SetAggregatorFunc is only used internally in this package, so I assume it can be private, if it is needed at all; see the next comment.
func NewSpecialAggregator(cmdName string) *SpecialAggregator {
	agg := &SpecialAggregator{}
	if fn, exists := SpecialAggregatorRegistry[cmdName]; exists {
		agg.SetAggregatorFunc(fn)
	}
	return agg
Suggested change:
func NewSpecialAggregator(cmdName string) *SpecialAggregator {
	fn := SpecialAggregatorRegistry[cmdName]
	return &SpecialAggregator{aggregatorFunc: fn}
}
I do think this should be doable, and it removes the need for SetAggregatorFunc and therefore the locking and unlocking of the mutex.
Submitting another partial review.
}

func (p *CommandPolicy) CanBeUsedInPipeline() bool {
	return p.Request != ReqAllNodes && p.Request != ReqAllShards && p.Request != ReqMultiShard
What about special? Can it be used in a pipeline?
My understanding is that special should be handled on a case-by-case basis.
// ShardPicker chooses “one arbitrary shard” when the request_policy is
// ReqDefault and the command has no keys.
type ShardPicker interface {
	Next(total int) int // returns an index in [0,total)
}
Those are great. Can we implement a StaticShardPicker or StickyShardPicker that will always return the same shard? I do think this can be helpful for testing. This is not a blocker by any means.
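A possible StaticShardPicker for tests, written against the ShardPicker interface quoted above (the type itself is a suggestion, not part of the PR):

// StaticShardPicker always returns the same shard index, which makes routing
// of keyless ReqDefault commands deterministic in tests.
type StaticShardPicker struct {
	index int
}

func NewStaticShardPicker(index int) *StaticShardPicker {
	return &StaticShardPicker{index: index}
}

// Next returns the configured index, clamped to 0 when out of range.
func (p *StaticShardPicker) Next(total int) int {
	if p.index >= 0 && p.index < total {
		return p.index
	}
	return 0
}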
return strconv.ParseBool(cmd.val)   // before
return strconv.ParseBool(cmd.Val()) // after
why was this change needed?
if commandInfoTips != nil {
	if v, ok := commandInfoTips[requestPolicy]; ok {
		if p, err := routing.ParseRequestPolicy(v); err == nil {
			req = p
		}
	}
	if v, ok := commandInfoTips[responsePolicy]; ok {
		if p, err := routing.ParseResponsePolicy(v); err == nil {
			resp = p
		}
	}
}
tips := make(map[string]string, len(commandInfoTips))
for k, v := range commandInfoTips {
	if k == requestPolicy || k == responsePolicy {
		continue
	}
	tips[k] = v
}
Can't we do both of those in a single range over commandInfoTips?
Not sure that I completely understand the question
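For reference, a single-pass version might look roughly like this (same variables as in the diff above; behavior preserved, since the two policy keys are parsed and excluded from tips):

tips := make(map[string]string, len(commandInfoTips))
for k, v := range commandInfoTips {
	switch k {
	case requestPolicy:
		if p, err := routing.ParseRequestPolicy(v); err == nil {
			req = p
		}
	case responsePolicy:
		if p, err := routing.ParseResponsePolicy(v); err == nil {
			resp = p
		}
	default:
		tips[k] = v
	}
}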
	return nil
}

func (cmd *MonitorCmd) Clone() Cmder {
Let's move this above the ExtractCommandValue function.
	return nil
}

func (cmd *IntPointerSliceCmd) Clone() Cmder {
It's tricky here. Do we need to return the same pointer, or do we only want the value when cloning?
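If the decision is to copy the values rather than share the pointers, a deep copy of the underlying slice is cheap. A sketch of that option (the helper name is illustrative):

// cloneIntPointerSlice allocates fresh pointers so the clone does not alias
// the original command's values; nil entries stay nil.
func cloneIntPointerSlice(src []*int64) []*int64 {
	if src == nil {
		return nil
	}
	dst := make([]*int64, len(src))
	for i, p := range src {
		if p != nil {
			v := *p
			dst[i] = &v
		}
	}
	return dst
}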
// cmdInfo will fetch and cache the command policies after the first execution
func (c *ClusterClient) cmdInfo(ctx context.Context, name string) *CommandInfo {
	cmdsInfo, err := c.cmdsInfoCache.Get(ctx)
	// Use a separate context that won't be canceled to ensure command info lookup
	// doesn't fail due to original context cancellation
	cmdInfoCtx := context.Background()
Most of the time the cmdInfo should be cached already. Why don't we just use c.context(ctx) to determine whether the original context (with its timeout) should be used, with a background context used only when c.opt.ContextTimeoutEnabled is false?
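For context, the existing helper behaves roughly like this, so routing the lookup through it would only substitute a background context when ContextTimeoutEnabled is off (a sketch of the idea, not the exact source):

func (c *ClusterClient) context(ctx context.Context) context.Context {
	if c.opt.ContextTimeoutEnabled {
		return ctx
	}
	return context.Background()
}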
Final part of initial review
Overview:
- Let's use atomics when possible.
- Left questions related to the node selection and setting of values.
Overall the design of the solution looks good, would have to do an additional pass over the test files once this review is addressed.
Thank you both @ofekshenawa and @htemelski-redis!
if c.hasKeys(cmd) {
	// execute on key based shard
	return node.Client.Process(ctx, cmd)
}
Do we know that this node serves the slot for the key?
Yes, the node should've been selected based on the slot; see func (c *ClusterClient) cmdNode at osscluster.go:L1906.
	// execute on key based shard
	return node.Client.Process(ctx, cmd)
}
return c.executeOnArbitraryShard(ctx, cmd)
Since it doesn't matter and there is already some node selected, why not use it?
We have two different ways of picking an arbitrary shard, either round robin or a random one
Yes, I understand that, but for some reason there is already a node selected here, which may have been selected because of a MOVED redirect or normal key-based selection. Why do we have to reselect the node? Shouldn't this selection of an arbitrary node be done outside, so we do the node selection only once and the node on line #52 is the one used for this command?
// Command executed successfully but value extraction failed
// This is common for complex commands like CLUSTER SLOTS
// The command already has its result set correctly, so just return
I do not understand the comment here. Why did the value extraction return nil? Can we make sure the cmd has a value set, at least? If it doesn't, we may return a cmd with a nil value and a nil error, which doesn't make sense.
if c, ok := cmd.(*KeyValuesCmd); ok {
	// KeyValuesCmd needs a key string and values slice
	if key, ok := value.(string); ok {
		c.SetVal(key, []string{}) // Default empty values
	}
}
case CmdTypeZSliceWithKey:
	if c, ok := cmd.(*ZSliceWithKeyCmd); ok {
		// ZSliceWithKeyCmd needs a key string and Z slice
		if key, ok := value.(string); ok {
			c.SetVal(key, []Z{}) // Default empty Z slice
		}
Why are we setting empty values here?
No idea tbh, will look into it
This PR introduces support for Redis COMMAND-based request_policy and response_policy routing for commands issued through the OSS cluster client.
Key Additions:
Command Policy Loader: Parses and caches COMMAND metadata with routing/aggregation tips on first use.
Routing Engine Enhancements:
Implements support for all request policies: default(keyless), default(hashslot), all_shards, all_nodes, multi_shard, and special.
Response Aggregator: Combines multi-shard replies based on response_policy:
all_succeeded, one_succeeded, agg_sum, special, etc.
Includes custom handling for special commands like FT.CURSOR.
Raw Command Support: Policies are enforced on Client.Do(ctx, args...).
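A hedged usage sketch of what this enables from the caller's side; the addresses and the FT.SEARCH index name are placeholders, and the aggregation behavior described in the comments reflects the intent of this PR rather than a guaranteed API.

package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"127.0.0.1:7000", "127.0.0.1:7001", "127.0.0.1:7002"},
	})

	// DBSIZE advertises response_policy:agg_sum, so the per-shard counts
	// come back as a single summed total.
	total, err := rdb.DBSize(ctx).Result()
	fmt.Println("keys in cluster:", total, err)

	// Raw commands issued via Do go through the same policy table.
	res, err := rdb.Do(ctx, "FT.SEARCH", "idx:example", "*").Result()
	fmt.Println(res, err)
}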