Deploying AI Models with Docker: From Model to API Service
Author: CaoZH | Date: 2026-05-15 | This is an original tutorial.
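By 2026, deploying AI models has become a core skill for backend developers. Whether the model is an open-source one such as LLaMA or Stable Diffusion, or a fine-tuned custom model, Docker is a standard, reproducible way to package and deploy it.
This article uses a text classification model as the example and walks through the full "model → API → Docker → deployment" workflow.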
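1. Prerequisites

```bash
# Install the NVIDIA Container Toolkit (assumes NVIDIA's apt repository is already configured)
sudo apt install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify that containers can see the GPU
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
```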
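Before running the setup commands, it can help to confirm which of the required command-line tools are already on `PATH`. The following is a minimal sketch, not part of the original tutorial: the tool list is an assumption (`nvidia-ctk` is the CLI shipped with the NVIDIA Container Toolkit), and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Assumed tool list: Docker, the NVIDIA driver utility, and the toolkit CLI.
REQUIRED_TOOLS = ("docker", "nvidia-smi", "nvidia-ctk")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is enough.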
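2. Project Structure

```text
ai-model-server/
├── Dockerfile
├── requirements.txt
├── app.py              # FastAPI service
├── model/              # model files
│   └── model.pt
├── nginx.conf          # reverse-proxy config mounted by docker-compose.yml
├── docker-compose.yml
└── .env
```

Note that `app.py` loads the model with `from_pretrained`, which expects a Hugging Face-format directory (`config.json`, tokenizer files, and weights) under `model/` rather than a bare `model.pt`.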
3. Writing the API Service

```python
import logging
import time

import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(title="AI Model API", version="1.0.0")

MODEL_PATH = "/app/model"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model once at import time. On failure the service still starts
# and reports the problem through /health and a 503 from /predict.
try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
    model.to(device)
    model.eval()
    logger.info(f"Model loaded, using device: {device}")
except Exception as e:
    logger.error(f"Failed to load model: {e}")
    model = None
    tokenizer = None

# Label order must match the class indices the model was trained with.
LABELS = ["negative", "neutral", "positive"]


class PredictRequest(BaseModel):
    text: str
    max_length: int = 128


class PredictResponse(BaseModel):
    label: str
    confidence: float
    processing_time_ms: float


@app.get("/health")
async def health():
    return {
        "status": "ok",
        "model_loaded": model is not None,
        "device": str(device),
    }


# Declared with plain `def` so FastAPI runs it in a thread pool and the
# blocking forward pass does not stall the event loop.
@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest):
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    start = time.perf_counter()

    inputs = tokenizer(
        request.text,
        return_tensors="pt",
        truncation=True,
        max_length=request.max_length,
        padding=True,
    ).to(device)

    with torch.no_grad():
        outputs = model(**inputs)
        probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
        confidence, predicted = torch.max(probabilities, dim=-1)

    processing_time = (time.perf_counter() - start) * 1000

    return PredictResponse(
        label=LABELS[predicted.item()],
        confidence=confidence.item(),
        processing_time_ms=round(processing_time, 2),
    )
```
```text
# requirements.txt
fastapi==0.110.0
uvicorn[standard]==0.27.0
torch>=2.0.0
transformers>=4.35.0
pydantic>=2.0.0
```
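The `/predict` handler turns the model's raw logits into a label and a confidence value by applying softmax and taking the largest probability. The pure-Python sketch below illustrates that step without PyTorch. The logits are made-up numbers, and `softmax` and `classify` are illustrative helpers rather than code from the service.

```python
import math

# Same label order as the service's label list.
LABELS = ["negative", "neutral", "positive"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels=LABELS):
    """Return (label, confidence) for the highest-probability class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits for a single input text.
label, confidence = classify([-1.2, 0.3, 2.5])
print(label, round(confidence, 3))
```

If the model was trained with a different number of classes or a different class order, the label list has to change accordingly, otherwise the index lookup returns the wrong label or fails.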
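4. Writing the Dockerfile

```dockerfile
# ---- Build stage: install Python dependencies ----
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# ---- Runtime stage ----
FROM python:3.11-slim
WORKDIR /app

# libgomp1 is the GNU OpenMP runtime library
RUN apt-get update && apt-get install -y --no-install-recommends \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Copy installed packages and console scripts (such as uvicorn) from the build stage
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin

COPY app.py .
COPY model/ ./model/

# Run as a non-root user, with a cache directory that user can write to
RUN useradd -m -u 1000 appuser \
    && mkdir -p /home/appuser/.cache \
    && chown -R appuser:appuser /app /home/appuser/.cache
USER appuser

# urlopen raises on connection errors and HTTP error responses,
# so the probe exits non-zero when the service is unhealthy
HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=3)"

EXPOSE 8000

# Each uvicorn worker is a separate process and loads its own copy of the model
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]
```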
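The container health check boils down to one rule: request `/health` and treat any connection failure or HTTP error as unhealthy. The sketch below spells that rule out as a standalone function using only the standard library. `is_healthy` is a hypothetical helper name, and the default URL assumes the service is reachable on port 8000 as in the Dockerfile.

```python
import http.client
import urllib.request


def is_healthy(url: str = "http://localhost:8000/health", timeout: float = 5.0) -> bool:
    """Return True only if `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, timeouts, and refused connections are all OSError subclasses.
        return False


if __name__ == "__main__":
    print("healthy" if is_healthy() else "unhealthy")
```

A plain "did the request return" check is not enough for a probe: a client that does not raise on 5xx responses would report a failing service as healthy, which is why the status code is checked explicitly here.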
5. Docker Compose Configuration

```yaml
# The top-level `version` key is obsolete in Docker Compose v2 and is omitted.
services:
  ai-model:
    build:
      context: .
      dockerfile: Dockerfile
    image: ai-model-server:latest
    container_name: ai-model
    ports:
      - "8000:8000"
    volumes:
      - ./model:/app/model:ro
      # Cache volume under the home directory of appuser, the user the container runs as
      - model-cache:/home/appuser/.cache
    environment:
      - PYTHONUNBUFFERED=1
      - CUDA_VISIBLE_DEVICES=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      # python:3.11-slim does not ship curl, so probe with the standard library instead
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)"]
      interval: 30s
      timeout: 10s
      retries: 5

  nginx:
    image: nginx:alpine
    container_name: ai-nginx
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      ai-model:
        condition: service_healthy
    restart: unless-stopped

volumes:
  model-cache:
```
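nginx.conf:

```nginx
# Shared-memory zone referenced by limit_req below. This file is included inside
# nginx's http {} block, so the directive is valid at the top of the file.
# The rate of 10 requests/second per client IP is an illustrative assumption.
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

server {
    listen 80;
    server_name _;
    client_max_body_size 10m;

    location / {
        proxy_pass http://ai-model:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 60s;
    }

    location /predict {
        proxy_pass http://ai-model:8000;
        # proxy_set_header is not inherited from the sibling location, so repeat it here
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        limit_req zone=api burst=10 nodelay;
    }
}
```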
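The `limit_req ... burst=10 nodelay` directive on `/predict` follows a leaky-bucket scheme: each request adds one unit to a counter that drains at the configured rate, and a request is rejected once the counter would exceed `burst`. The sketch below is a simplified Python model of that behavior for a single client, not nginx's actual implementation. The rate of 10 requests per second is an assumed value, since the source does not state one.

```python
class LeakyBucketLimiter:
    """Simplified single-client model of a leaky-bucket limiter with burst and no delay."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec  # how fast the excess counter drains
        self.burst = burst        # how much excess is tolerated
        self.excess = 0.0         # requests currently "above" the steady rate
        self.last = None          # timestamp of the last accepted request

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) is accepted."""
        if self.last is None:
            self.last = now
            return True
        drained = self.rate * (now - self.last)
        excess = max(0.0, self.excess - drained + 1.0)
        if excess > self.burst:
            return False  # rejected requests do not change the state
        self.excess = excess
        self.last = now
        return True


# 15 requests arriving at the same instant, assumed rate 10 r/s, burst 10.
limiter = LeakyBucketLimiter(rate_per_sec=10, burst=10)
accepted = sum(limiter.allow(0.0) for _ in range(15))
print(accepted)  # 11: the first request plus a burst of 10
```

In this model a client can send short spikes without being throttled, while a sustained rate above the limit is rejected until the counter has drained.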
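6. Build and Deploy

```bash
# Build the image
docker compose build

# Start the services in the background
docker compose up -d

# Follow the logs
docker compose logs -f

# Check the health endpoint
curl http://localhost:8000/health

# Send a test prediction
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "This product is really easy to use, and I am very satisfied!"}'
```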
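The same test prediction can be sent from Python, which is handy for scripted smoke tests after a deploy. This is a minimal standard-library sketch: `build_predict_request` and `predict` are hypothetical helper names, and the base URL assumes the service is published on `localhost:8000` as in the `curl` commands.

```python
import json
import urllib.request


def build_predict_request(text: str, max_length: int = 128,
                          base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request for /predict without sending it."""
    payload = json.dumps({"text": text, "max_length": max_length}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, **kwargs) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the service to be running.
    print(predict("This product is really easy to use, and I am very satisfied!"))
```

Splitting request construction from sending keeps the payload format checkable without a running container.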
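7. Performance Optimization

```python
import torch
from fastapi.concurrency import run_in_threadpool

# Assumes `tokenizer`, `model`, and `device` are the module-level objects from app.py.


class PredictionService:
    def __init__(self, batch_size: int = 8):
        self.batch_size = batch_size
        self.queue = []  # pending requests; the queueing logic is not shown here

    def _infer(self, texts: list[str]):
        """Tokenize and run one batch through the model (blocking)."""
        inputs = tokenizer(
            texts, return_tensors="pt", padding=True, truncation=True
        ).to(device)
        with torch.no_grad():
            outputs = model(**inputs)
        return outputs

    async def predict_batch(self, texts: list[str]):
        # Run the blocking forward pass in a worker thread so the event loop stays responsive.
        return await run_in_threadpool(self._infer, texts)
```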
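The batching code leaves open how individual requests end up in one batch. One common approach is micro-batching: concurrent requests are queued, and a background worker groups them until the batch is full or a short wait expires. The sketch below shows one way this could look with `asyncio`. `MicroBatcher`, `max_wait_s`, and `fake_infer` are hypothetical names, and the stand-in inference function replaces the real model call.

```python
import asyncio
from typing import Callable, List, Sequence


class MicroBatcher:
    """Collects concurrent requests and runs them through `infer_fn` in batches."""

    def __init__(self, infer_fn: Callable[[List[str]], Sequence],
                 batch_size: int = 8, max_wait_s: float = 0.01):
        self.infer_fn = infer_fn      # maps a list of texts to one result per text
        self.batch_size = batch_size
        self.max_wait_s = max_wait_s  # how long to wait for a batch to fill up
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, text: str):
        """Enqueue one text and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((text, future))
        return await future

    async def run(self) -> None:
        """Worker loop: take one item, then top up the batch until full or timed out."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                results = self.infer_fn([text for text, _ in batch])
            except Exception as exc:
                # Propagate the failure to every caller waiting on this batch.
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), result in zip(batch, results):
                if not future.done():
                    future.set_result(result)


async def demo():
    batch_sizes = []

    def fake_infer(texts):  # stand-in for the real model call
        batch_sizes.append(len(texts))
        return [t.upper() for t in texts]

    batcher = MicroBatcher(fake_infer, batch_size=8, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(t) for t in ["a", "b", "c"]))
    worker.cancel()
    return results, batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch `infer_fn` runs directly on the event loop. With a real model the call would block, so it would likely be dispatched to a worker thread, for example with `run_in_threadpool`. The `max_wait_s` value trades latency for batch size: a longer wait fills batches better but delays the first request in each batch.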
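8. Summary

## Key points for deploying AI models

✅ Multi-stage Docker build (smaller image)
✅ Run as a non-root user
✅ Health checks
✅ GPU support (nvidia-container-toolkit)
✅ Batching for higher throughput
✅ Reverse proxy + rate limiting
✅ Model cache volume

## Recommended image optimizations

- Base image: python:3.11-slim (~150MB)
- Use pip --no-cache-dir
- Multi-stage build to separate dependencies from code
- Final image: ~500MB including PyTorch is a rough figure; the default CUDA-enabled PyTorch wheels are substantially larger, so the actual size depends on which PyTorch build is installed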
First published on CaoZH's Notes (CaoZH 的笔记).