Deploying Private DeepSeek R1 on AWS Spot Instances (Save ~70% on GPU Costs)

Stop Paying for API Keys: Run DeepSeek-R1 on AWS Spot Instances for $0.15/Hour

If you are a DevOps engineer or developer, you know the pain: You want to run private, uncensored LLMs (like DeepSeek-R1 or Llama 3) for your internal tools, but the costs are prohibitive.

  • OpenAI/Claude APIs: Fast but expensive at scale, and you send your private data to them.
  • AWS On-Demand GPUs: A single g4dn.xlarge instance costs **~$0.52/hour** ($380/month).

There is a third option that most “AI tutorials” ignore: AWS Spot Instances combined with “stateless” deployment.


In this guide, I will show you how to deploy a private, inference-ready DeepSeek-R1 (Distill-Llama-8B) API on AWS for roughly $0.15 – $0.18 per hour (about a 70% saving over On-Demand) using Terraform and Ollama.

The Hardware: Why g4dn.xlarge?

For running 7B to 14B parameter models, you don’t need an H100. You just need enough VRAM to hold the model weights: as a rule of thumb, a 4-bit quantized model needs about half a byte per parameter, so an 8B model takes roughly 4–5 GB plus context overhead, which fits comfortably in a 16 GB T4.

| Instance Type | GPU | VRAM | On-Demand Price | Spot Price (Est.) | Good For |
| --- | --- | --- | --- | --- | --- |
| g4dn.xlarge | NVIDIA T4 | 16 GB | $0.526/hr | **~$0.16/hr** | DeepSeek-R1 8B & 14B (perfect fit) |
| g5.xlarge | NVIDIA A10G | 24 GB | $1.006/hr | **~$0.35/hr** | DeepSeek-R1 32B |
| p3.2xlarge | NVIDIA V100 | 16 GB | $3.06/hr | ~$0.90/hr | Avoid (too expensive/old) |

Technical Note: The full DeepSeek-R1 is 671B parameters and requires a cluster. We will use the Distilled 8B or 14B versions, which retain most of the reasoning capability but fit on a single T4 GPU.
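
If you opt for the 14B distill instead of the 8B, the only change in the rest of this guide should be the Ollama model tag (the deepseek-r1 library on Ollama carries both sizes):

Bash

ollama pull deepseek-r1:14b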

The Architecture: Handling Spot Interruptions

Spot instances are cheap because AWS can reclaim them with 2 minutes’ notice. AWS publishes that notice through the instance metadata service, so you can watch for it from inside the instance (see the sketch after the list below).

Do not treat this like a pet server. Treat it like a container.

  • No Persistent Root Volume: We don’t care if the OS is deleted.
  • Startup Script (User Data): Every time the server boots, it auto-installs Ollama and pulls the model (the Deep Learning AMI already ships the GPU drivers).
  • Terraform: To request the instance and handle the “Spot Request” logic.
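
If you want a graceful drain instead of a hard kill, here is a minimal sketch of an interruption watcher you could launch in the background from user data. The IMDSv2 token flow and the spot/instance-action metadata path are standard EC2; the drain action itself is a hypothetical placeholder:

Bash

#!/bin/bash
# Poll the EC2 instance metadata service for a Spot interruption notice.
# The spot/instance-action path returns 404 until an interruption is scheduled.
while true; do
  # IMDSv2: fetch a short-lived session token first
  TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  if curl -sf -H "X-aws-ec2-metadata-token: $TOKEN" \
    "http://169.254.169.254/latest/meta-data/spot/instance-action" >/dev/null; then
    echo "Spot interruption notice received, draining..."
    # Hypothetical drain step: stop accepting requests, flush logs, etc.
    break
  fi
  sleep 5
done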

Step 1: The Terraform Script

Save this as main.tf. It requests a g4dn.xlarge Spot instance in your region with a price cap, plus a security group that opens SSH and the Ollama API.

Terraform

provider "aws" {
  region = "us-east-1" # Change to your region
}

resource "aws_security_group" "llm_sg" {
  name        = "llm-security-group"
  description = "Allow SSH and Ollama API"

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # Lock this down to your IP in production!
  }

  ingress {
    from_port   = 11434
    to_port     = 11434
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # Ollama API Port
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_spot_instance_request" "llm_worker" {
  ami           = "ami-0a0e5d9c7acc336f1" # Ubuntu 22.04 Deep Learning AMI (Drivers pre-installed)
  instance_type = "g4dn.xlarge"
  spot_price    = "0.20" # Max price you are willing to pay
  key_name      = "your-ssh-key-name" # Create this in AWS Console first!

  security_groups = [aws_security_group.llm_sg.name]
  wait_for_fulfillment = true
  spot_type = "one-time" 

  # The Magic: This script runs on boot
  user_data = file("install_ollama.sh")

  tags = {
    Name = "DeepSeek-Spot-Instance"
  }
}

output "instance_ip" {
  value = aws_spot_instance_request.llm_worker.public_ip
}

Critical Tip: Instead of using a blank Ubuntu AMI and struggling with NVIDIA drivers, use the AWS Deep Learning AMI (Ubuntu 22.04). It comes with CUDA pre-installed, saving you 20 minutes of boot time.
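
To find the current Deep Learning AMI ID for your region, you can query the AWS CLI. The name filter below is an assumption about the AMI naming pattern; adjust it if nothing comes back:

Bash

# Newest Amazon-owned Deep Learning AMI for Ubuntu 22.04 in the configured region
aws ec2 describe-images \
  --owners amazon \
  --filters "Name=name,Values=Deep Learning*Ubuntu 22.04*" \
  --query 'sort_by(Images, &CreationDate)[-1].[ImageId,Name]' \
  --output text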

Step 2: The “User Data” Automation

Create a file named install_ollama.sh. This script runs automatically when the instance launches.

#!/bin/bash

# 1. Update and install basic tools
apt-get update
apt-get install -y curl

# 2. Install Ollama (The easiest way to run LLMs)
curl -fsSL https://ollama.com/install.sh | sh

# 3. Configure Ollama to listen on all interfaces (so you can access it remotely)
mkdir -p /etc/systemd/system/ollama.service.d
echo '[Service]
Environment="OLLAMA_HOST=0.0.0.0"' > /etc/systemd/system/ollama.service.d/override.conf

systemctl daemon-reload
systemctl restart ollama

# 4. Wait for the Ollama API to come up, then pull the model in the background.
# nohup keeps the download running after cloud-init finishes; progress is
# logged so we can tail it after SSH-ing in.
nohup bash -c 'until curl -sf http://localhost:11434/ >/dev/null; do sleep 2; done; ollama pull deepseek-r1:8b' \
  > /var/log/ollama-pull.log 2>&1 &

Step 3: Deploy & Verify

  1. Initialize Terraform: terraform init
  2. Apply: terraform apply
  3. Wait ~2 minutes. Terraform will output your instance_ip (you can re-print it later, as shown below).
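
If you lose that output, Terraform can re-print it at any time:

Bash

terraform output instance_ip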

SSH into the box to check the progress:

Bash

ssh -i your-key.pem ubuntu@<INSTANCE_IP>
tail -f /var/log/cloud-init-output.log  # boot-time setup
tail -f /var/log/ollama-pull.log        # model download progress

Once the pull log ends with “success”, test the API from your local laptop:

curl http://<INSTANCE_IP>:11434/api/generate -d '{
  "model": "deepseek-r1:8b",
  "prompt": "Write a python script to parse AWS CloudTrail logs",
  "stream": false
}'

If you get a JSON response with code, congratulations. You now have your own private DeepSeek API running for pennies per hour.
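
To extract just the generated text, pipe the response through jq (assuming you have it installed locally); the non-streaming /api/generate reply puts the model output in the response field:

Bash

curl -s http://<INSTANCE_IP>:11434/api/generate -d '{
  "model": "deepseek-r1:8b",
  "prompt": "Write a python script to parse AWS CloudTrail logs",
  "stream": false
}' | jq -r '.response'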

Optimization: How to Stop “Re-Downloading” the Model

The script above downloads the 5GB model every time the Spot instance starts. This wastes bandwidth and takes ~3 minutes.

The “Pro” Fix:

  1. Create an AWS EBS volume (100 GB) manually.
  2. Update your Terraform to attach this volume to the Spot instance and mount it at /var/lib/ollama (a user-data sketch follows this list).
  3. Now the model files persist. Even if AWS kills your Spot instance, your data is safe on the EBS volume; when you launch a new one, Ollama detects the existing files and is ready almost instantly.
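
A minimal user-data sketch for step 2, assuming the attached volume appears as /dev/nvme1n1 (on Nitro instances like g4dn, EBS volumes show up as NVMe devices; verify the exact name with lsblk). OLLAMA_MODELS is the environment variable Ollama reads for its model directory:

Bash

# Format the volume only on first use (blkid fails if no filesystem exists yet)
DEV=/dev/nvme1n1  # assumption: confirm the device name with lsblk
blkid "$DEV" || mkfs -t ext4 "$DEV"

mkdir -p /var/lib/ollama
mount "$DEV" /var/lib/ollama
chown -R ollama:ollama /var/lib/ollama  # the installer creates the ollama user

# Point Ollama's model store at the persistent volume
echo 'Environment="OLLAMA_MODELS=/var/lib/ollama"' \
  >> /etc/systemd/system/ollama.service.d/override.conf
systemctl daemon-reload
systemctl restart ollama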

Conclusion

By moving from On-Demand to Spot, you reduce your AI lab costs from ~$380/month to roughly $110–$120/month if running 24/7 (about $0.16/hr x 730 hours), or just ~$0.15 for a quick one-hour experiment. For DevOps engineers, this is the most cost-effective way to experiment with private AI agents, log analysis, and code generation without leaking company data to public APIs.
