172 changes: 172 additions & 0 deletions .github/workflows/build-deploy-changes.yaml
@@ -0,0 +1,172 @@
name: Build & Deploy Changed Services

permissions:
packages: write
contents: read

on:
push:
branches: [main, dev, 'release/*']
pull_request:
branches: [main, dev, 'release/*']

env:
TAG: ${{ github.run_number }}

jobs:
build:
name: Build and Deploy
runs-on: [self-hosted, paicicd]
timeout-minutes: 120
environment: auto-test
container:
image: ubuntu:latest
volumes:
- /var/run/docker.sock:/var/run/docker.sock
steps:
- name: Install git
run: |
DEBIAN_FRONTEND=noninteractive apt update
DEBIAN_FRONTEND=noninteractive apt install -y git

- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 0
submodules: false
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.ref_name }}

- name: Get Changed Folders (Services)
id: changes
run: |
git config --global --add safe.directory "$GITHUB_WORKSPACE"
if [ "${{ github.event_name }}" = "pull_request" ]; then
echo "Pull request detected"
# Fetch the merge base to get only PR changes
git fetch origin ${{ github.event.pull_request.base.ref }} --depth=50
base_sha=$(git merge-base origin/${{ github.event.pull_request.base.ref }} ${{ github.event.pull_request.head.sha }})
head_sha="${{ github.event.pull_request.head.sha }}"
else
base_sha="${{ github.event.before }}"
head_sha="${{ github.sha }}"
fi

echo "Comparing $base_sha...$head_sha"
changed_files=$(git diff --name-only "$base_sha" "$head_sha")
echo "Changed files: $changed_files"

# extract service folders under src/, skip alert-manager
folders=$(echo "$changed_files" | grep '^src/' \
| grep -v 'alert-manager' \
| awk -F'/' '{print $2}' \
| sort -u | tr '\n' ' ')
echo "Changed folders: $folders"

# export as output for next steps
echo "folders=$folders" >> $GITHUB_OUTPUT

- name: Check if folders are empty
id: check
run: |
if [ -z "${{ steps.changes.outputs.folders }}" ]; then
echo "has_changed=false" >> $GITHUB_OUTPUT
else
echo "has_changed=true" >> $GITHUB_OUTPUT
fi

- name: Install Package
if: steps.check.outputs.has_changed == 'true'
run: |
DEBIAN_FRONTEND=noninteractive apt install -y python3 python-is-python3 pip git unzip docker-cli ca-certificates curl apt-transport-https lsb-release gnupg parallel
curl -sL https://aka.ms/InstallAzureCLIDeb | bash

- name: Install python libs
if: steps.check.outputs.has_changed == 'true'
run: python -m pip install --break-system-packages pyyaml jinja2 paramiko etcd3 protobuf==3.20.3 kubernetes gitpython

- name: Decode and unzip config file
if: steps.check.outputs.has_changed == 'true'
run: |
echo "${{ secrets.CONFIG_FILE_B64 }}" | base64 -d > config.zip
mkdir -p $GITHUB_WORKSPACE/config
unzip -o config.zip -d $GITHUB_WORKSPACE/config
ls -l $GITHUB_WORKSPACE/config

- name: Arrange Config Files
if: steps.check.outputs.has_changed == 'true'
run: |
rm -rf /tmp/auth-configuration
mv $GITHUB_WORKSPACE/config/auth-configuration /tmp/
ls -l /tmp/auth-configuration

- name: Build Images of Changed Services
if: steps.check.outputs.has_changed == 'true'
run: |
changed_services="${{ steps.changes.outputs.folders }}"
echo "Building: $changed_services"
$GITHUB_WORKSPACE/build/pai_build.py build \
-c $GITHUB_WORKSPACE/config/cluster-configuration \
-s $changed_services

- name: Push Images of Changed Services to ACR
if: steps.check.outputs.has_changed == 'true'
run: |
changed_services="${{ steps.changes.outputs.folders }}"
echo "Pushing: $changed_services"
$GITHUB_WORKSPACE/build/pai_build.py push \
-c $GITHUB_WORKSPACE/config/cluster-configuration \
-s $changed_services

- name: Push Images of Changed Services to GHCR
if: steps.check.outputs.has_changed == 'true'
run: |
changed_services="${{ steps.changes.outputs.folders }}"
echo "Pushing: $changed_services"
$GITHUB_WORKSPACE/build/pai_build.py push \
-c $GITHUB_WORKSPACE/config/cluster-configuration \
-s $changed_services \
--docker-registry ghcr.io \
--docker-namespace ${GITHUB_REPOSITORY_OWNER} \
--docker-username ${{ github.actor }} \
--docker-password ${{ secrets.GITHUB_TOKEN }}

- name: Azure CLI get credentials and deploy
if: steps.check.outputs.has_changed == 'true'
run: |
az version
az login --identity --client-id ${{ secrets.AZURE_MANAGED_IDENTITY_CLIENT_ID }}
az aks install-cli
az aks get-credentials \
--resource-group ${{ secrets.AZURE_RESOURCE_GROUP }} \
--name ${{ secrets.KUBERNETES_CLUSTER }} \
--overwrite-existing
kubelogin convert-kubeconfig -l azurecli
kubectl config use-context ${{ secrets.KUBERNETES_CLUSTER }}
echo "${{ secrets.PAI_CLUSTER_NAME }}" > cluster_id
echo "Stopping changed pai services \"${{ steps.changes.outputs.folders }}\" on ${{ secrets.PAI_CLUSTER_NAME }} ..."
$GITHUB_WORKSPACE/paictl.py service stop -n ${{ steps.changes.outputs.folders }} < cluster_id
echo "Pushing config to cluster \"${{ secrets.PAI_CLUSTER_NAME }}\" ..."
$GITHUB_WORKSPACE/paictl.py config push -m service -p $GITHUB_WORKSPACE/config/cluster-configuration < cluster_id
echo "Starting to update \"${{ steps.changes.outputs.folders }}\" on ${{ secrets.PAI_CLUSTER_NAME }} ..."
$GITHUB_WORKSPACE/paictl.py service start -n ${{ steps.changes.outputs.folders }} < cluster_id
kubectl get pod
kubectl get service

test:
name: Test rest-server
needs: build
runs-on: [self-hosted, paicicd]
environment: auto-test
steps:
- name: Test rest-server
run: |
echo "Testing rest-server ${{ secrets.PAI_WEB_URL }}/rest-server/api/v2/info"
curl ${{ secrets.PAI_WEB_URL }}/rest-server/api/v2/info
echo "Checking virtual cluster status..."
vc_info=$(curl -H "Authorization: Bearer ${{ secrets.PAI_WEB_TOKEN }}" -s ${{ secrets.PAI_WEB_URL }}/rest-server/api/v2/virtual-clusters)
if [ $? -ne 0 ]; then
echo "Failed to access virtual cluster API"
exit 1
fi
echo "Virtual cluster info: $vc_info"

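As an aside, the folder-extraction logic in the `Get Changed Folders (Services)` step can be dry-run locally with the same pipeline; a minimal sketch, assuming `origin/main` as the comparison base (the commit range is a placeholder):

```bash
# Hypothetical local dry run: list service folders under src/ touched since origin/main,
# excluding alert-manager, using the same pipeline as the workflow step.
changed_files=$(git diff --name-only origin/main HEAD)
folders=$(echo "$changed_files" | grep '^src/' \
  | grep -v 'alert-manager' \
  | awk -F'/' '{print $2}' \
  | sort -u | tr '\n' ' ')
echo "Would build: $folders"
```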
36 changes: 36 additions & 0 deletions build/pai_build.py
@@ -120,10 +120,46 @@ def main():
nargs='+',
help="The service list that contains corresponding images you want to push"
)
push_parser.add_argument(
'--docker-registry',
type=str,
help="The docker registry you want to push to, which will override the config file"
)
push_parser.add_argument(
"--docker-namespace",
type=str,
help="The docker namespace you want to push to, which will override the config file if '--docker-registry' is also set"
)
push_parser.add_argument(
'--docker-username',
type=str,
help="The docker username you want to use for authentication, which will override the config file if '--docker-registry' is also set"
)
push_parser.add_argument(
'--docker-password',
type=str,
help="The docker password you want to use for authentication, which will override the config file if '--docker-registry' is also set"
)
push_parser.add_argument(
"--docker-tag",
type=str,
help="The docker tag you want to push to, which will override the config file if '--docker-registry' is also set"
)
push_parser.set_defaults(func=push_image)

args = parser.parse_args()
config_model = load_build_config(args.config)
if hasattr(args, 'docker_registry') and args.docker_registry is not None:
config_model['dockerRegistryInfo']['dockerRegistryDomain'] = args.docker_registry
if args.docker_namespace is not None:
config_model['dockerRegistryInfo']['dockerNameSpace'] = args.docker_namespace
if args.docker_username is not None:
config_model['dockerRegistryInfo']['dockerUserName'] = args.docker_username
if args.docker_password is not None:
config_model['dockerRegistryInfo']['dockerPassword'] = args.docker_password
if args.docker_tag is not None:
config_model['dockerRegistryInfo']['dockerTag'] = args.docker_tag

args.func(args, config_model)

endtime = datetime.datetime.now()
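For context, a minimal invocation sketch of the new push overrides, assuming placeholder paths, service names, and registry values (none of these are taken from this PR):

```bash
# Hypothetical example: push two services to GHCR, overriding the registry
# settings from the cluster configuration via the new flags.
./build/pai_build.py push \
  -c /path/to/cluster-configuration \
  -s rest-server webportal \
  --docker-registry ghcr.io \
  --docker-namespace my-org \
  --docker-username my-user \
  --docker-password "$GHCR_TOKEN" \
  --docker-tag latest
```

Because `--docker-registry` is set, the namespace, username, password, and tag overrides also take effect, matching the nested `is not None` checks added above.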
24 changes: 23 additions & 1 deletion contrib/aks/scripts/install-fuse.sh
@@ -6,4 +6,26 @@
set -xe

DEBIAN_FRONTEND=noninteractive apt-get update -y
DEBIAN_FRONTEND=noninteractive apt-get install libfuse3-dev fuse3 blobfuse2 -y || echo "Failed to install fuse"

# Check if blobfuse2 is installed
if ! command -v blobfuse2 >/dev/null 2>&1; then
echo "blobfuse2 is not installed. Exiting."
exit 1
fi

INSTALLED_VERSION=$(blobfuse2 --version | grep -oP '\d+\.\d+\.\d+')
REQUIRED_VERSION="2.5.0"

# Check if version extraction succeeded
if [ -z "$INSTALLED_VERSION" ]; then
echo "Failed to extract blobfuse2 version. Exiting."
exit 1
fi

if dpkg --compare-versions "$INSTALLED_VERSION" "lt" "$REQUIRED_VERSION"; then
echo "Updating blobfuse2 to a version newer than $REQUIRED_VERSION"
DEBIAN_FRONTEND=noninteractive apt-get install --only-upgrade blobfuse2 -y || echo "Failed to update blobfuse2"
else
echo "blobfuse2 is already up-to-date (version $INSTALLED_VERSION)"
fi
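As a side note, the `dpkg --compare-versions ... lt ...` test above is a strict less-than comparison; a small sketch with made-up version strings:

```bash
# Illustration only; the version strings are hypothetical.
dpkg --compare-versions "2.4.1" lt "2.5.0" && echo "older -> upgrade branch runs"
dpkg --compare-versions "2.5.0" lt "2.5.0" || echo "not older -> reported as up-to-date"
```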
32 changes: 32 additions & 0 deletions docs/LuciaTrainingPlatform/blog/2025-04-30-release-1-0.md
@@ -0,0 +1,32 @@
---
slug: release-ltp-v1.0
title: Releasing Lucia Training Platform v1.0
author: Lucia Training Platform Team
tags: [ltp, announcement, release]
---

We are pleased to announce the official release of **Lucia Training Platform v1.0.0**!

## Lucia Training Platform v1.0.0 Release Notes

This inaugural release establishes Lucia Training Platform as a comprehensive AI platform solution, built on the foundation of OpenPAI with significant enhancements and customizations for enterprise AI workloads.

## Platform Features & Stability
- Updated Virtual Machine Scale Set deployment scripts with MI300 GPU support and kubelet bug fixes
- Fixed launch order issues between AMD device plugin and AMDGPU module loading
- Fixed local disk mounting into containers for high-speed data loading
- Implemented priority restrictions for production jobs to ensure resource allocation
- Automated daily backup of user logs to blob storage with cordon trigger functionality
- Updated OpenPAI-runtime image to resolve SSH crashes in large-scale training jobs
- Added refresh API to clean storage cache when new Persistent Volumes (PV) or Persistent Volume Claims (PVC) are added
- Implemented automated email notifications for production jobs to specific user groups

## Job Reliability & Monitoring
- Implemented automatic detection metrics and rules for AMD GPU issues during runtime
- Enabled job execution on specific cordoned nodes for admin management
- Automated node cordoning and uncordoning with single node validation
- Added support for monitoring the count of available and used nodes per VC in Prometheus

## User Experience
- Completely revised the homepage, acknowledging OpenPAI's contributions
- Updated all titles and references from OpenPAI to Lucia Training Platform (LTP) throughout the web portal
30 changes: 30 additions & 0 deletions docs/LuciaTrainingPlatform/blog/2025-06-20-release-1-1.md
@@ -0,0 +1,30 @@
---
slug: release-ltp-v1.1
title: Releasing Lucia Training Platform v1.1
author: Lucia Training Platform Team
tags: [ltp, announcement, release]
---

We are pleased to announce the official release of **Lucia Training Platform v1.1.0**!

## Lucia Training Platform v1.1.0 Release Notes

This release introduces new inference capabilities, enhanced stability improvements, comprehensive monitoring systems, and significant security enhancements.

## Platform Features & Stability
- Added support for inference job submission
- Added prototype user interface with webportal plugin

## Job Reliability & Monitoring
- Automated Azure VM recycling and validation processing workflows
- Automated pipeline for submitting ICM tickets for unhealthy Azure VMs
- Implemented a Kusto database for action status tracking, node status monitoring, and job status analytics

## User Experience
- Enhanced dashboard with comprehensive platform performance metrics

## Security
- Forced upgrades of operating system, Linux, and Python packages to address security vulnerabilities
- Updated Golang and Node.js packages to latest secure versions
- Disabled and replaced unapproved registries (non-ACR/MCR) on the LTP platform
- Disabled SSH access for all users to enhance security posture
40 changes: 40 additions & 0 deletions docs/LuciaTrainingPlatform/blog/2025-08-11-release-1-2.md
@@ -0,0 +1,40 @@
---
slug: release-ltp-v1.2
title: Releasing Lucia Training Platform v1.2
author: Lucia Training Platform Team
tags: [ltp, announcement, release]
---

We are pleased to announce the official release of **Lucia Training Platform v1.2.0**!

## Lucia Training Platform v1.2.0 Release Notes

This release introduces significant new features, enhanced reliability monitoring, improved user experience, and strengthened security measures.

## Platform Features & Stability
- Virtual Cluster administrators can now stop jobs in their own VC
- Enhanced inference job interface with external IP gateway support
- Portal displays only active clusters for improved user experience
- Enhanced job execution capabilities with Docker support within jobs
- Resolved CUDA version mismatch issues causing job-exporter crashes
- Fixed configuration refresh issues when updating user settings
- Resolved blob mount failures and Azure copy token issues

## Local Storage
- Implemented a local storage service with a user-facing API
- Integrated the local storage service with node recycling processes

## Job Reliability & Monitoring
- Initial automatic node failure detection module design and implementation
- Enhanced the job monitoring Kusto data pipeline with summary and reaction-time tracking
- Proactive alerting email for certificate expiration management

## User Experience
- Added webportal plugin integration for Copilot functionality
- Initial backend support for Copilot features
- Enhanced dashboard with comprehensive platform metrics
- Added Mean Time Between Incidents (MTBI) tracking for virtual machines and nodes in dashboard

## Security
- Updates to address security vulnerabilities in container images
- Kubernetes version upgrade for enhanced security and performance