Conversation

Contributor

@Steboss Steboss commented Nov 24, 2025

No description provided.

Contributor

@yhtang yhtang left a comment

Thanks for making this! Made some comments. Let me know what you think.


# Configuration
NAMESPACE="${NAMESPACE:-default}"
JOBSET_NAME="jax-vllm-multinode"
Contributor

These variables are defined twice, in separate files. Is there any way we can provide a single source of truth to avoid unintentional errors (e.g. a value edited in one place but not the other)?
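One possible way to get a single source of truth is a small shared settings file that every script sources. This is only a sketch: the file name `jobset.env` and the helper `load_config` are hypothetical and do not appear in the PR.

```shell
#!/usr/bin/env bash
# Hypothetical shared file (e.g. jobset.env) kept next to the scripts:
#   NAMESPACE="${NAMESPACE:-default}"
#   JOBSET_NAME="jax-vllm-multinode"

# Load shared settings from one file; fail loudly if the file is missing.
load_config() {
  local config_file="$1"
  if [ ! -f "$config_file" ]; then
    echo "config file not found: $config_file" >&2
    return 1
  fi
  # Variables assigned in the sourced file become visible to the caller.
  . "$config_file"
}

# Hypothetical usage at the top of each script:
#   load_config "$(dirname "$0")/jobset.env" || exit 1
#   echo "Using ${JOBSET_NAME} in namespace ${NAMESPACE}"
```

With this layout, a change to `JOBSET_NAME` happens in one file, and both scripts pick it up on the next run.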

  effect: NoSchedule
containers:
- name: gateway-container
  image: 941377147396.dkr.ecr.us-east-1.amazonaws.com/sbosisio/jio:jax-k8s
Contributor

Just a note that we need to turn this hard-coded image reference into a placeholder and set it dynamically for each job in the final production workflow.
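A minimal sketch of how such a placeholder could be filled in at launch time. The placeholder token `__GATEWAY_IMAGE__`, the function `render_manifest`, and the template file name in the usage comment are all assumptions for illustration. The sketch also assumes the image reference contains no `|` or `&` characters, since they are special to `sed` here.

```shell
#!/usr/bin/env bash
# Read a manifest template on stdin and replace the hypothetical
# __GATEWAY_IMAGE__ placeholder with the image passed as the first argument.
render_manifest() {
  local image="$1"
  if [ -z "$image" ]; then
    echo "render_manifest: image must not be empty" >&2
    return 1
  fi
  # '|' is used as the sed delimiter because image references contain '/'.
  sed "s|__GATEWAY_IMAGE__|${image}|g"
}

# Hypothetical usage:
#   render_manifest "$GATEWAY_IMAGE" < gateway.yaml.tpl | kubectl apply -f -
```

Rejecting an empty image up front avoids applying a manifest with a blank `image:` field.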

echo "Gateway URL: ${GATEWAY_URL}"
echo "Ray Head IP: ${RAY_HEAD_IP}"

# 1. Wait for gateway to be ready
Contributor

Noting that in the long run this could be integrated into the bridge.

@Steboss how long could the gap be between the start of the gateway pod and the start of the application (jax/vLLM) pods? Can we make the launch of the application pods dependent on the gateway pod?
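For reference, the "wait for gateway" step can be sketched as a bounded retry loop, so that an unexpectedly long gap fails the job instead of hanging it. The helper name `wait_for` and the `/health` path in the usage comment are assumptions, not taken from the PR.

```shell
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: wait_for MAX_ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
wait_for() {
  local max_attempts="$1" interval="$2" attempt=1
  shift 2
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after ${attempt} attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
}

# Hypothetical usage (the /health endpoint is an assumption):
#   wait_for 60 5 curl -fsS -o /dev/null "${GATEWAY_URL}/health" || exit 1
```

The same loop could in principle run in an init container of the application pods, which would make their main containers start only after the gateway responds.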

#NCCL
# - name: NCCL_DEBUG
# value: "INFO" # Change to WARN after debugging
- name: NCCL_PROTO
Contributor

NCCL's official guidelines are to avoid setting this variable explicitly whenever possible. Is this mandated by AWS?
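If the override does stay, a small startup check could at least make it visible in the logs. The sketch below is an illustration only: the function name `warn_nccl_overrides` is hypothetical, and the list of variables it checks (`NCCL_PROTO`, `NCCL_ALGO`) is an assumption about which overrides are worth flagging.

```shell
#!/usr/bin/env bash
# Print a warning on stderr for each explicitly exported NCCL tuning override.
warn_nccl_overrides() {
  local overrides
  # 'env' lists exported variables only; '|| true' keeps grep's
  # "no match" status from being treated as an error.
  overrides="$(env | grep -E '^(NCCL_PROTO|NCCL_ALGO)=' || true)"
  if [ -n "$overrides" ]; then
    echo "warning: explicit NCCL tuning overrides are set:" >&2
    echo "$overrides" >&2
  fi
}

# Hypothetical usage near the top of the container entrypoint:
#   warn_nccl_overrides
```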


3 participants