Quick setup for Text Embeddings Inference on a t3.micro instance.
- Name: tei-server
- AMI: Amazon Linux 2023 (default)
- Instance type: t3.micro
- Key pair: Create new key pair (save the .pem file)
- Storage: 8 GB gp3 (default)
- Network settings → Edit → Add security group rule:
  - Type: Custom TCP
  - Port range: 8080
  - Source type: Anywhere
- Advanced details → User data:

```bash
#!/bin/bash
set -euxo pipefail
dnf install -y ca-certificates procps docker
systemctl enable --now docker
mkdir -p /opt/tei/lib
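# Pull the prebuilt router binary and its OpenMP runtime out of the official CPU image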
TEI_IMAGE="ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.2"
CID=$(docker create ${TEI_IMAGE})
docker cp ${CID}:/usr/local/bin/text-embeddings-router /opt/tei/
docker cp ${CID}:/usr/lib/llvm-14/lib/libomp.so.5 /opt/tei/lib/libomp.so.5
docker rm ${CID}
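# The router looks for libiomp5.so at runtime; the bundled libomp build stands in for it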
ln -s libomp.so.5 /opt/tei/lib/libiomp5.so
chmod +x /opt/tei/text-embeddings-router
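# 2 GB of swap gives the 1 GB t3.micro enough headroom to load the model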
fallocate -l 2G /swapfile || dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile swap swap defaults 0 0' >> /etc/fstab
cat >/etc/systemd/system/tei.service <<'EOF'
[Unit]
Description=Text Embeddings Inference
After=network.target
[Service]
Type=simple
Environment=LD_LIBRARY_PATH=/opt/tei/lib
Environment=OMP_NUM_THREADS=1
Environment=TOKENIZERS_PARALLELISM=false
ExecStart=/opt/tei/text-embeddings-router --model-id nomic-ai/nomic-embed-text-v1.5 --port 8080 --max-batch-tokens 1024
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now tei
```

- Click Launch instance
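If you'd rather script the launch than click through the console, a rough AWS CLI equivalent is sketched below. The AMI, key pair, security group, and subnet values are placeholders for your own account and region, and the user-data script above is assumed to be saved locally as user-data.sh.

```bash
# Hedged CLI equivalent of the console launch (substitute real IDs before running)
aws ec2 run-instances \
  --image-id <al2023-ami-id> \
  --instance-type t3.micro \
  --key-name <your-key-pair> \
  --security-group-ids <sg-allowing-8080> \
  --subnet-id <subnet-id> \
  --user-data file://user-data.sh \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=tei-server}]' \
  --count 1
```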
The instance needs ~3-5 minutes to:
- Install dependencies
- Download the model
- Start the service
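You can tell when it is ready by polling the health endpoint from your machine; a small wait loop like this (with the instance's public IP substituted) does the job:

```bash
# Poll until TEI reports healthy (the model download can take a few minutes)
until curl -sf http://<public-ip>:8080/health > /dev/null; do
  echo "TEI not ready yet, retrying in 15s..."
  sleep 15
done
echo "TEI is up"
```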
Test the embeddings endpoint from your machine:

```bash
curl http://<public-ip>:8080/embed \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Hello world"}'
```

To debug on the instance itself, SSH in and check the service:

```bash
chmod 400 your-key.pem
ssh -i your-key.pem ec2-user@<public-ip>

# Check service status
sudo systemctl status tei

# View logs
sudo journalctl -u tei -f
```

To make this setup horizontally scalable and highly available, you can place the TEI instances behind an Application Load Balancer (ALB) and run them in an Auto Scaling Group (ASG).
This turns your single free-tier instance into a distributed embeddings service that can scale out automatically under load.
```
Clients
   │
   ▼
Application Load Balancer (HTTP :80)
   │
   ▼
Target Group (port 8080)
   │
   ├── t3.micro (TEI)
   ├── t3.micro (TEI)
   └── t3.micro (TEI)
       (Auto Scaling Group)
```
Each instance runs the same systemd-based TEI service you already set up.
- Go to EC2 → Target Groups → Create target group
- Target type: Instance
- Protocol: HTTP
- Port: 8080
- VPC: same VPC as your instances
- Health check:
  - Protocol: HTTP
  - Path: /health
  - Healthy threshold: 2
  - Unhealthy threshold: 2

TEI exposes /health automatically, so no extra configuration is needed.

Create the target group but do not register instances yet.
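If you prefer to script this step, roughly the same settings via the AWS CLI look like the sketch below; the tei-targets name is arbitrary and the VPC ID is a placeholder. The Auto Scaling Group will register instances automatically later, which is why none are registered here.

```bash
# Target group mirroring the console settings above (VPC ID is a placeholder)
aws elbv2 create-target-group \
  --name tei-targets \
  --target-type instance \
  --protocol HTTP \
  --port 8080 \
  --vpc-id <vpc-id> \
  --health-check-protocol HTTP \
  --health-check-path /health \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 2
```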
- Go to EC2 → Load Balancers → Create load balancer
- Choose Application Load Balancer
- Name: tei-alb
- Scheme: Internet-facing
- IP address type: IPv4
- Listeners:
  - HTTP :80
- Availability Zones:
  - Select at least 2 AZs
- Security group:
  - Allow inbound HTTP (port 80) from your desired sources

Attach the target group you created earlier.
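A hedged CLI sketch of the same ALB, plus the HTTP :80 listener that actually wires it to the target group (subnet, security group, and ARN values are placeholders):

```bash
# Internet-facing ALB spanning two AZs
aws elbv2 create-load-balancer \
  --name tei-alb \
  --type application \
  --scheme internet-facing \
  --subnets <subnet-az1> <subnet-az2> \
  --security-groups <alb-sg-id>

# Listener on :80 forwarding every request to the TEI target group
aws elbv2 create-listener \
  --load-balancer-arn <alb-arn> \
  --protocol HTTP \
  --port 80 \
  --default-actions Type=forward,TargetGroupArn=<target-group-arn>
```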
- Go to EC2 → Launch Templates → Create launch template
- Base it on your working instance:
  - AMI: Amazon Linux 2023
  - Instance type: t3.micro
  - Key pair: optional (for debugging)
  - Security group:
    - Allow inbound 8080 from the ALB security group
- Advanced details → User data:
  - Paste the exact same user-data script from your single-instance setup

This ensures every new instance automatically installs and starts TEI.
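Scripted, the launch template looks roughly like this. Note that user data passed inside launch-template-data must be base64-encoded, unlike with run-instances; the AMI and security group IDs are placeholders and user-data.sh is the same script from the single-instance setup.

```bash
# Launch template carrying the same user-data script (base64-encoded here)
aws ec2 create-launch-template \
  --launch-template-name tei-template \
  --launch-template-data "{
    \"ImageId\": \"<al2023-ami-id>\",
    \"InstanceType\": \"t3.micro\",
    \"SecurityGroupIds\": [\"<tei-instance-sg-id>\"],
    \"UserData\": \"$(base64 -w0 user-data.sh)\"
  }"
```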
- Go to EC2 → Auto Scaling Groups → Create
- Use the launch template you just created
- VPC: same VPC
- Subnets: at least 2 (different AZs)
- Attach to existing load balancer:
  - Choose your ALB target group
- Scaling configuration:
  - Min: 1
  - Desired: 1
  - Max: N (e.g. 5 or 10)
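A CLI sketch of the same ASG, assuming the tei-template launch template and the target group from the earlier sketches (subnet IDs and the ARN are placeholders). The health-check grace period gives new instances time to download the model before ELB health checks count against them.

```bash
# ASG built from the launch template, spread over two subnets and
# registered with the ALB target group
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name tei-asg \
  --launch-template 'LaunchTemplateName=tei-template,Version=$Latest' \
  --min-size 1 \
  --desired-capacity 1 \
  --max-size 5 \
  --vpc-zone-identifier "<subnet-az1>,<subnet-az2>" \
  --target-group-arns <target-group-arn> \
  --health-check-type ELB \
  --health-check-grace-period 300
```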
A simple and effective policy: target tracking scaling

- Metric: Average CPU Utilization
- Target value: 60%

Optional alternatives:

- ALB RequestCountPerTarget
- ALB TargetResponseTime

CPU works well because TEI is CPU-bound.
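A sketch of that policy via the CLI, assuming the tei-asg name from the earlier sketch:

```bash
# Target-tracking policy that keeps average CPU across the group near 60%
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name tei-asg \
  --policy-name tei-cpu-target-60 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": { "PredefinedMetricType": "ASGAverageCPUUtilization" },
    "TargetValue": 60.0
  }'
```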
Instead of calling the instance IP:
```bash
curl http://<alb-dns-name>/embed \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Hello world"}'
```

The ALB will automatically distribute requests across instances.
Embedding inference suits this pattern well:

- Stateless requests → even load balancing
- CPU-bound inference → clean horizontal scaling
- Latency-tolerant use cases → cold starts are acceptable
- Cheap nodes → individual failures barely matter
You get:
- High availability
- Automatic recovery
- Near-linear throughput scaling
- Predictable cost
All without GPUs.
- Warm-up time: new instances may take a few minutes to download the model; ALB health checks handle this safely.
- Security:
  - Restrict instance port 8080 to the ALB security group only (see the sketch after this list)
  - Optionally add auth or IP allowlists at the ALB level
- HTTPS:
  - Add an ACM certificate to the ALB listener for TLS
- Spot instances:
  - Great for batch embedding jobs
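Locking the instance port down to the ALB might look like the sketch below; both security group IDs are placeholders, and the revoke call removes the open 0.0.0.0/0 rule from the single-instance setup if you kept it.

```bash
# Allow 8080 only from the ALB's security group
aws ec2 authorize-security-group-ingress \
  --group-id <tei-instance-sg-id> \
  --protocol tcp \
  --port 8080 \
  --source-group <alb-sg-id>

# Drop the public rule left over from the single-instance setup
aws ec2 revoke-security-group-ingress \
  --group-id <tei-instance-sg-id> \
  --protocol tcp \
  --port 8080 \
  --cidr 0.0.0.0/0
```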
Adding an ALB and Auto Scaling turns this from a clever free-tier hack into a production-grade, horizontally scalable embeddings service built entirely on CPUs.