Deploying AI Models with Docker: From Model to API Service

Author: CaoZH
Date: 2026-05-15
This is an original tutorial.


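In 2026, deploying AI models has become a core skill for backend developers. Whether you are serving an open-source model such as LLaMA or Stable Diffusion, or a fine-tuned custom model, Docker is one of the most common and reproducible ways to ship it.

This article uses a text classification model as the example and walks through the full "model → API → Docker → deployment" workflow.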

1. Preparation

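# Install the NVIDIA Container Toolkit (GPU hosts only).
# Assumes Docker is already installed and NVIDIA's apt repository is configured.
sudo apt install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify that containers can see the GPU
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi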
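
The same GPU check can be scripted, for example as a pre-flight step before starting the service. The following is a minimal sketch using only the Python standard library; `gpu_visible` is a hypothetical helper name, and it only reports whether `nvidia-smi` is on the PATH and exits successfully.

```python
import shutil
import subprocess


def gpu_visible(timeout_s: float = 10.0) -> bool:
    """Return True if nvidia-smi exists on PATH and exits with status 0."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # no NVIDIA driver utilities visible in this environment
    try:
        result = subprocess.run(
            [exe], capture_output=True, timeout=timeout_s, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("GPU visible:", gpu_visible())
```

Run inside a container started with `--gpus all`, this should print `GPU visible: True` when the toolkit is set up correctly; on a CPU-only host it prints `False`.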

2. Project Structure

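ai-model-server/
├── Dockerfile
├── requirements.txt
├── app.py              # FastAPI service
├── model/              # Model directory in Hugging Face format (example file names)
│   ├── config.json
│   ├── model.safetensors
│   └── tokenizer.json
├── docker-compose.yml
├── nginx.conf          # Reverse proxy config mounted by Compose
└── .env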

3. Writing the API Service

# app.py
import logging
import time

import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(title="AI Model API", version="1.0.0")

# Load the model once at import time instead of on every request
MODEL_PATH = "/app/model"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
    model.to(device)
    model.eval()
    logger.info(f"Model loaded, using device: {device}")
except Exception as e:
    # Keep the process alive so /health can report the failure
    logger.error(f"Model failed to load: {e}")
    model = None
    tokenizer = None


class PredictRequest(BaseModel):
    text: str
    max_length: int = 128


class PredictResponse(BaseModel):
    label: str
    confidence: float
    processing_time_ms: float


@app.get("/health")
async def health():
    return {
        "status": "ok",
        "model_loaded": model is not None,
        "device": str(device),
    }


# Declared with plain `def` so FastAPI runs the blocking inference in its
# thread pool instead of stalling the event loop
@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest):
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    start = time.perf_counter()

    inputs = tokenizer(
        request.text,
        return_tensors="pt",
        truncation=True,
        max_length=request.max_length,
        padding=True,
    ).to(device)

    with torch.no_grad():
        outputs = model(**inputs)
        probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
        confidence, predicted = torch.max(probabilities, dim=-1)

    # Label mapping (adjust to match your model's output classes)
    labels = ["negative", "neutral", "positive"]

    processing_time = (time.perf_counter() - start) * 1000

    return PredictResponse(
        label=labels[predicted.item()],
        confidence=confidence.item(),
        processing_time_ms=round(processing_time, 2),
    )
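
The last step of `/predict` turns raw logits into a label and a confidence score. To make that step concrete without loading a model, here is a minimal pure-Python sketch of the same softmax and argmax logic. The logits are made-up values for illustration, and `softmax` and `classify` are hypothetical helper names, not part of the service.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits: list[float], labels: list[str]) -> tuple[str, float]:
    """Return the most probable label and its probability."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


if __name__ == "__main__":
    # Made-up logits for a three-class sentiment model
    label, confidence = classify(
        [-1.2, 0.3, 2.5], ["negative", "neutral", "positive"]
    )
    print(label, round(confidence, 3))
```

The length check matters in practice: a hard-coded label list that disagrees with the model's number of output classes would otherwise raise an `IndexError` or silently mislabel results. Hugging Face model configs usually carry an `id2label` mapping, which may be a safer source for the labels when it is populated.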

The Python dependencies are listed in requirements.txt:

# requirements.txt
fastapi==0.110.0
uvicorn[standard]==0.27.0
torch>=2.0.0
transformers>=4.35.0
pydantic>=2.0.0

4. Writing the Dockerfile

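# Multi-stage build

# Stage 1: install dependencies
FROM python:3.11-slim AS builder

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: runtime
FROM python:3.11-slim

WORKDIR /app

# Install runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
        libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Copy the Python packages from the build stage
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin

# Copy the application code
COPY app.py .
COPY model/ ./model/

# Create a non-root user that owns the app and its cache directory
RUN useradd -m -u 1000 appuser \
    && mkdir -p /home/appuser/.cache \
    && chown -R appuser:appuser /app /home/appuser/.cache
USER appuser

# Health check using only the standard library (urlopen raises on 4xx/5xx responses)
HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=3)"

EXPOSE 8000

# GPU version: a single worker, so only one copy of the model is loaded onto the GPU
# CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

# CPU version (recommended for production): two workers, each loading its own copy of the model
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]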
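
The container health check boils down to "request `/health` and fail on any error". If the one-liner grows too long, the probe could live in its own file. The following is a minimal sketch using only the standard library; the function names and the idea of a separate `healthcheck.py` file are assumptions for illustration, not part of the original project layout.

```python
import sys
import urllib.error
import urllib.request


def is_healthy(url: str, timeout_s: float = 3.0) -> bool:
    """Return True only if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, 4xx/5xx status, or a malformed URL
        return False


def main(argv: list[str]) -> int:
    """Return a process exit code: 0 when healthy, 1 otherwise."""
    target = argv[1] if len(argv) > 1 else "http://localhost:8000/health"
    return 0 if is_healthy(target) else 1
```

To use it as a Docker health check, a hypothetical `healthcheck.py` would end with `sys.exit(main(sys.argv))`, and the `HEALTHCHECK` instruction would run `python healthcheck.py`. Docker treats exit code 0 as healthy and 1 as unhealthy.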

5. Docker Compose Configuration

# docker-compose.yml
# (the top-level `version` key is obsolete in Compose v2 and is omitted)

services:
  # AI model service
  ai-model:
    build:
      context: .
      dockerfile: Dockerfile
    image: ai-model-server:latest
    container_name: ai-model
    ports:
      - "8000:8000"
    volumes:
      - ./model:/app/model:ro
      # The container runs as appuser, so mount the cache under that user's home
      - model-cache:/home/appuser/.cache
    environment:
      - PYTHONUNBUFFERED=1
      - CUDA_VISIBLE_DEVICES=0  # GPU index
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      # python:3.11-slim ships without curl, so probe with the standard library
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 60s

  # Nginx reverse proxy
  nginx:
    image: nginx:alpine
    container_name: ai-nginx
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      ai-model:
        condition: service_healthy
    restart: unless-stopped

volumes:
  model-cache:

The Nginx configuration mounted by the Compose file lives in nginx.conf:

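# nginx.conf
# Shared-memory zone for rate limiting, keyed by client IP.
# The rate of 10 requests/second is an assumed example value; tune it for your traffic.
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

server {
    listen 80;
    server_name _;

    client_max_body_size 10m;

    location / {
        proxy_pass http://ai-model:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 60s;
    }

    # Rate-limit the inference endpoint
    location /predict {
        limit_req zone=api burst=10 nodelay;
        proxy_pass http://ai-model:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 60s;
    }
}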

6. Build and Deploy

# Build the image
docker compose build

# Start the services
docker compose up -d

# Follow the logs
docker compose logs -f

# Test
curl http://localhost:8000/health
curl -X POST http://localhost:8000/predict \
    -H "Content-Type: application/json" \
    -d '{"text": "This product works really well and I am very satisfied!"}'
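
For scripted smoke tests, the same `/predict` call can be made from Python. This is a minimal client sketch using only the standard library. The function name `predict` and its defaults are assumptions for illustration; the request and response fields follow the `PredictRequest` and `PredictResponse` models from app.py.

```python
import json
import urllib.request


def predict(text: str, base_url: str = "http://localhost:8000",
            max_length: int = 128, timeout_s: float = 30.0) -> dict:
    """POST one text to /predict and return the decoded JSON response."""
    payload = json.dumps({"text": text, "max_length": max_length}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses (e.g. 503
    # while the model is not loaded), so callers can catch that explicitly
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the stack running, `predict("I am very satisfied with this product")` should return a dictionary with the `label`, `confidence`, and `processing_time_ms` keys. Pointing `base_url` at `http://localhost` instead sends the request through the Nginx proxy.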

7. Performance Optimization

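# Batch inference for higher throughput
# (reuses tokenizer, model, and device from app.py)
import torch
from fastapi.concurrency import run_in_threadpool


class PredictionService:
    def __init__(self, batch_size: int = 8):
        self.batch_size = batch_size

    def _infer(self, texts: list[str]) -> torch.Tensor:
        # One forward pass per chunk of batch_size texts keeps GPU utilization high
        logits = []
        for i in range(0, len(texts), self.batch_size):
            inputs = tokenizer(texts[i:i + self.batch_size], return_tensors="pt",
                               padding=True, truncation=True).to(device)
            with torch.no_grad():
                logits.append(model(**inputs).logits)
        return torch.cat(logits)

    async def predict_batch(self, texts: list[str]) -> torch.Tensor:
        # Run the blocking forward passes in a worker thread so the event loop stays free
        return await run_in_threadpool(self._infer, texts)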
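
Batching only helps if requests actually arrive as a batch. One common way to get there is dynamic micro-batching: concurrent requests are queued for a few milliseconds and then run through the model together. The following is a minimal sketch with `asyncio`, under stated assumptions: `MicroBatcher` and `fake_infer` are hypothetical names, and `fake_infer` is a stand-in for the real model call so the example runs without PyTorch.

```python
import asyncio
from typing import Callable


class MicroBatcher:
    """Collect concurrent requests and run them through the model in batches."""

    def __init__(self, infer: Callable[[list[str]], list[str]],
                 max_batch: int = 8, max_wait_s: float = 0.01) -> None:
        self.infer = infer            # blocking function: texts -> results
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, text: str) -> str:
        """Enqueue one text and wait for its result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((text, future))
        return await future

    async def run(self) -> None:
        """Worker loop: gather up to max_batch items, then infer once."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(
                        await asyncio.wait_for(self.queue.get(), remaining)
                    )
                except asyncio.TimeoutError:
                    break
            texts = [text for text, _ in batch]
            try:
                # Run the blocking call in a thread to keep the loop responsive
                results = await asyncio.to_thread(self.infer, texts)
            except Exception as exc:
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), result in zip(batch, results):
                if not future.done():
                    future.set_result(result)


def fake_infer(texts: list[str]) -> list[str]:
    """Stand-in for the real model call (hypothetical): upper-cases each text."""
    return [t.upper() for t in texts]


async def demo() -> list[str]:
    batcher = MicroBatcher(fake_infer, max_batch=4)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(
        *(batcher.submit(f"text {i}") for i in range(10))
    )
    worker.cancel()
    return list(results)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off is latency versus throughput: a larger `max_wait_s` fills batches more often but adds waiting time to every request. In a real service the worker task would be started once at application startup, and `infer` would wrap the tokenizer and model call.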

8. Summary

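## Key points for deploying AI models

✅ Multi-stage Docker build (smaller image)
✅ Run as a non-root user
✅ Health checks
✅ GPU support (nvidia-container-toolkit)
✅ Batching for higher throughput
✅ Reverse proxy + rate limiting
✅ Model cache volume

## Recommended image optimizations
- Base image: python:3.11-slim (~150MB)
- Use pip --no-cache-dir
- Multi-stage build to separate dependencies from code
- Final image: size is dominated by PyTorch (the ~500MB figure is only realistic with a CPU-only torch build; the default CUDA wheels are several GB)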

First published on CaoZH's Notes.