TorchServe is a flexible and easy-to-use tool for serving and scaling PyTorch models in production.
Requires Python >= 3.8.
Example inference request against a running TorchServe instance:
curl http://127.0.0.1:8080/predictions/bert -T input.txt

Install with pip:
# Install dependencies
# cuda is optional
python ./ts_scripts/install_dependencies.py --cuda=cu121
# Latest release
pip install torchserve torch-model-archiver torch-workflow-archiver
# Nightly build
pip install torchserve-nightly torch-model-archiver-nightly torch-workflow-archiver-nightly
Install with conda:
# Install dependencies
# cuda is optional
python ./ts_scripts/install_dependencies.py --cuda=cu121
# Latest release
conda install -c pytorch torchserve torch-model-archiver torch-workflow-archiver
# Nightly build
conda install -c pytorch-nightly torchserve torch-model-archiver torch-workflow-archiver
Install with Docker:
# Latest release
docker pull pytorch/torchserve
# Nightly build
docker pull pytorch/torchserve-nightly
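After pulling the image, the server can be started directly from the container; a minimal sketch (the published ports are TorchServe's defaults, and the model-store path assumes the official image layout):

# Start TorchServe, exposing the inference (8080), management (8081) and metrics (8082) APIs,
# and mount a local model store into the container
docker run --rm -it -p 8080:8080 -p 8081:8081 -p 8082:8082 \
    -v $(pwd)/model_store:/home/model-server/model-store pytorch/torchserve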
Refer to the TorchServe Docker documentation for details.
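Once installed, a typical workflow is to package a model into a .mar archive and serve it from a model store. A minimal sketch, assuming a TorchScript model file model.pt and the built-in image_classifier handler (all file and model names here are illustrative):

# Package a TorchScript model into a .mar archive
mkdir -p model_store
torch-model-archiver --model-name my_classifier --version 1.0 \
    --serialized-file model.pt --handler image_classifier \
    --export-path model_store

# Start TorchServe and load the archived model
torchserve --start --ncs --model-store model_store --models my_classifier=my_classifier.mar

# Send an inference request, then stop the server
curl http://127.0.0.1:8080/predictions/my_classifier -T kitten.jpg
torchserve --stop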
- Write once, run anywhere: on-prem or in the cloud, with inference on CPUs, GPUs, AWS Inf1/Inf2/Trn1, Google Cloud TPUs, and NVIDIA MPS
- Model Management API: multi-model management with optimized worker-to-model allocation
- Inference API: REST and gRPC support for batched inference (see the REST sketch after this list)
- TorchServe Workflows: deploy complex DAGs with multiple interdependent models
- Default way to serve PyTorch models in:
  - SageMaker
  - Vertex AI
  - Kubernetes, with support for autoscaling, session affinity, and monitoring using Grafana; works on-prem and on AWS EKS, Google GKE, and Azure AKS
  - KServe: supports both v1 and v2 APIs, autoscaling, and canary deployments for A/B testing
  - Kubeflow
  - MLflow
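A minimal sketch of the Management and Inference APIs over REST (8081 and 8080 are TorchServe's default management and inference ports; bert.mar is a placeholder archive assumed to be in the configured model store):

# Management API: register a model, scale its workers, and list loaded models
curl -X POST "http://127.0.0.1:8081/models?url=bert.mar&initial_workers=1"
curl -X PUT "http://127.0.0.1:8081/models/bert?min_worker=2"
curl http://127.0.0.1:8081/models

# Inference API: send a prediction request to the registered model
curl http://127.0.0.1:8080/predictions/bert -T input.txt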
 
- Export your model for optimized inference: TorchScript out of the box, PyTorch Compiler preview, ORT and ONNX, IPEX, TensorRT, FasterTransformer, FlashAttention (Better Transformers)
- Performance Guide: built-in support to optimize, benchmark, and profile PyTorch and TorchServe performance
- Expressive handlers: an expressive handler architecture that makes it trivial to support inference for your use case, with many handlers supported out of the box
- Metrics API: out-of-the-box support for system-level metrics with Prometheus exports and custom metrics (see the sketch after this list)
- Large Model Inference Guide: support for GenAI and LLMs, including:
  - Fast kernels with FlashAttention v2, continuous batching, and streaming responses
  - PyTorch Tensor Parallel preview, Pipeline Parallel
  - Microsoft DeepSpeed, DeepSpeed-MII
  - Hugging Face Accelerate, Diffusers
  - Running large models on AWS SageMaker and Inferentia2
  - Running a Llama 2 chatbot locally on Mac
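A minimal sketch of pulling metrics from a running server (8082 is TorchServe's default metrics port; output is in Prometheus text format by default, and the metric name shown is one of the default frontend metrics):

# Scrape all system and custom metrics in Prometheus format
curl http://127.0.0.1:8082/metrics

# Filter the output to a specific metric by name
curl "http://127.0.0.1:8082/metrics?name[]=ts_inference_latency_microseconds"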
 
- Monitoring using Grafana and Datadog
 
- Model Server for PyTorch Documentation: Full documentation
- TorchServe internals: How TorchServe was built
- Contributing guide: How to contribute to TorchServe
 
- Serving Llama 2 with TorchServe
- Chatbot with Llama 2 on Mac 🦙💬
- 🤗 HuggingFace Transformers with a Better Transformer Integration / Flash Attention & Xformer Memory Efficient
- Stable Diffusion
- Model parallel inference
- MultiModal models with MMF combining text, audio and video
- Dual Neural Machine Translation for a complex workflow DAG
- TorchServe Integrations
- TorchServe Internals
- TorchServe UseCases
 
For more examples, see the examples folder in this repository.
We welcome all contributions!
To learn more about how to contribute, see the contributor guide here.
- High performance Llama 2 deployments with AWS Inferentia2 using TorchServe
- Naver Case Study: Transition From High-Cost GPUs to Intel CPUs and oneAPI powered Software with performance
- Run multiple generative AI models on GPU using Amazon SageMaker multi-model endpoints with TorchServe and save up to 75% in inference costs
- Deploying your Generative AI model in only four steps with Vertex AI and PyTorch
- PyTorch Model Serving on Google Cloud TPU v5
- Monitoring using Datadog
- Torchserve Performance Tuning, Animated Drawings Case-Study
- Walmart Search: Serving Models at a Scale on TorchServe
- 🎥 Scaling inference on CPU with TorchServe
- 🎥 TorchServe C++ backend
- Grokking Intel CPU PyTorch performance from first principles: a TorchServe case study
- Grokking Intel CPU PyTorch performance from first principles (Part 2): a TorchServe case study
- Case Study: Amazon Ads Uses PyTorch and AWS Inferentia to Scale Models for Ads Processing
- Optimize your inference jobs using dynamic batch inference with TorchServe on Amazon SageMaker
- Using AI to bring children's drawings to life
- 🎥 Model Serving in PyTorch
- Evolution of Cresta's machine learning architecture: Migration to AWS and PyTorch
- 🎥 Explain Like I'm 5: TorchServe
- 🎥 How to Serve PyTorch Models with TorchServe
- How to deploy PyTorch models on Vertex AI
- Quantitative Comparison of Serving Platforms
- Efficient Serverless deployment of PyTorch models on Azure
- Deploy PyTorch models with TorchServe in Azure Machine Learning online endpoints
- Dynaboard moving beyond accuracy to holistic model evaluation in NLP
- A MLOps Tale about operationalising MLFlow and PyTorch
- Operationalize, Scale and Infuse Trust in AI Models using KFServing
- How Wadhwani AI Uses PyTorch To Empower Cotton Farmers
- TorchServe Streamlit Integration
- Dynabench aims to make AI models more robust through distributed human workers
- Announcing TorchServe
 
Made with contrib.rocks.
This repository is jointly operated and maintained by Amazon, Meta, and a number of individual contributors listed in the CONTRIBUTORS file. For questions directed at Meta, please send an email to [email protected]. For questions directed at Amazon, please send an email to [email protected]. For all other questions, please open an issue in this repository.
TorchServe acknowledges the Multi Model Server (MMS) project, from which it was derived.