I've been running production workloads on cloud providers for years. AWS, GCP, you name it. But there's something deeply satisfying about running your own metal—especially when you apply the same enterprise-grade practices you'd use in a corporate environment.
When I picked up two ASUS NUC 15 Pro+ units — each with 96GB RAM and a 4TB NVMe drive — I didn't just want another homelab. I wanted to build something that could rival any production infrastructure, just at a smaller scale.
Together they form a Proxmox cluster with 192GB of RAM and 8TB of NVMe storage. Here's how I turned these consumer-grade NUCs into a fully automated, monitored, and secured homelab infrastructure using 100% Infrastructure as Code.
The Foundation: Why Proxmox?
I evaluated several hypervisors before settling on Proxmox VE. VMware ESXi felt too enterprise (and expensive), while bare Docker felt too limiting. Proxmox hit the sweet spot:
- Native LXC support: Near-metal performance without VM overhead
- ZFS integration: Snapshots, compression, and data integrity out of the box
- API-first design: Everything is automatable via REST API
- Active community: Solutions to problems are always a search away
But here's the kicker—I wasn't going to click through any web UIs. Everything had to be code.
The Architecture: LXC Containers Over Kubernetes
You might be wondering: "Why not Kubernetes?" I ran k8s clusters professionally for years. For a homelab, it's overkill.
Instead, I chose LXC containers for service isolation:
LXC 106: Media Server (Jellyfin) LXC 107: Media Stack (Sonarr, Radarr, Prowlarr, Bazarr, qBittorrent) LXC 108: Home Apps (Mealie, Grocy, Paperless-ngx, Glance) LXC 109: Service Monitoring (Uptime Kuma) LXC 110: Cloud Storage (Nextcloud + MariaDB + Redis) LXC 400: Docker Host (monitoring infrastructure - Prometheus, Grafana, Loki)
Each container gets dedicated resources, isolated networking, and can be individually snapshotted. Boot times? Under 5 seconds. Try that with a VM.
## Infrastructure as Code: The Holy Trinity
### Terraform: Declarative Infrastructure
Every VM and container starts as Terraform code:
```hcl
resource "proxmox_lxc" "media_server" {
target_node = "nuc"
hostname = "jellyfin"
ostemplate = "local:vztmpl/debian-12-standard_12.7-1_amd64.tar.zst"
cores = 4
memory = 4096
swap = 0
rootfs {
storage = "local-lvm"
size = "20G"
}
network {
name = "eth0"
bridge = "vmbr0"
ip = "192.168.4.106/24"
gw = "192.168.4.1"
}
features {
nesting = true
fuse = true
}
}
Packer: Hardened VM Templates
I built four base templates, each hardened according to CIS benchmarks:
- Ubuntu 22.04 LTS: For services needing broad compatibility
- Debian 12: Minimal footprint for lightweight services
- Rocky Linux 9: When I need RHEL compatibility
- Alpine 3.19: Ultra-lightweight container hosts
Each template includes:
- Security hardening scripts
- Node Exporter pre-installed
- Cloud-init ready
- Automatic security updates configured
Ansible: Configuration Management
Once Terraform provisions the infrastructure, Ansible takes over:
- name: Deploy Monitoring Stack
hosts: monitoring
roles:
- prometheus
- grafana
- loki
- alertmanager
vars:
prometheus_retention: 30d
grafana_auth_anonymous_enabled: false
alertmanager_slack_webhook: !vault |
$ANSIBLE_VAULT;1.1;AES256
## The Monitoring Stack: Observability First
You can't manage what you can't measure. My monitoring stack rivals what I've built for companies with millions in ARR:
### Metrics Collection
- **Prometheus**: 30+ scrape targets across all services
- **Node Exporter**: System metrics on every host
- **Proxmox Exporter**: Hypervisor-level metrics
- **Custom exporters**: For application-specific metrics
### Visualization & Alerting
- **Grafana**: 15 pre-built dashboards covering everything from CPU usage to backup status
- **Alertmanager**: Intelligent routing with PagerDuty integration (yes, for a homelab)
- **Loki**: Centralized logging with 30-day retention
### Real Alerts That Matter
```yaml
groups:
- name: homelab
rules:
- alert: BackupFailed
expr: time() - backup_last_success_timestamp > 86400
annotations:
summary: "Backup hasn't succeeded in 24 hours"
- alert: ServiceDown
expr: up{job=~"jellyfin|nextcloud|vault"} == 0
for: 5m
annotations:
summary: "Critical service {{ $labels.job }} is down"
## Security: Production-Grade Hardening
### Network Segmentation
Five VLANs isolate different security zones:
- Management (Proxmox, monitoring)
- DMZ (internet-facing services)
- Internal (trusted services)
- IoT (untrusted devices)
- Lab (experimental workloads)
### Authentication & Authorization
- **Authelia**: SSO for all web services with 2FA enforcement
- **Vault**: Centralized secrets management with automatic rotation
- **SSH**: Key-based only, with fail2ban and rate limiting
### Automated Security Updates
```bash
#!/bin/bash
## Runs nightly via systemd timer
for container in $(pct list | awk 'NR>1 {print $1}'); do
pct exec $container -- bash -c "apt update && apt upgrade -y"
pct snapshot $container pre-update-$(date +%Y%m%d)
done
## Backup Strategy: Assume Everything Will Fail
Three-tier backup strategy because I've learned the hard way:
1. **Local ZFS snapshots**: Every 4 hours, retained for 7 days
2. **Proxmox Backup Server**: Nightly incrementals to dedicated LXC container
3. **Cloud backup**: Weekly encrypted Restic backups to Backblaze B2
Restoration? Tested monthly. I can rebuild the entire infrastructure from scratch in under 2 hours.
## The DevOps Pipeline: GitOps All The Way
Every change goes through Git:
```yaml
name: Deploy Infrastructure
on:
push:
branches: [main]
jobs:
terraform:
runs-on: self-hosted
steps:
- uses: actions/checkout@v3
- name: Terraform Apply
run: |
cd terraform
terraform init
terraform apply -auto-approve
ansible:
needs: terraform
runs-on: self-hosted
steps:
- name: Run Ansible Playbooks
run: |
cd ansible
ansible-playbook site.yml
The GitHub runner? It runs on the NUC itself, in a dedicated LXC container.
Challenges and Lessons Learned
LXC Networking Quirks
LXC containers with Tailscale hit weird DNS resolution issues. The fix? Custom /etc/resolv.conf
management:
echo "nameserver 100.100.100.100" > /etc/resolv.conf
chattr +i /etc/resolv.conf # Make it immutable
Resource Constraints
32GB RAM seems like a lot until you're running 15+ services. Resource limits are crucial:
## Ansible group_vars/all/main.yml
container_defaults:
memory: 2048
swap: 0
cores: 2
container_overrides:
jellyfin:
memory: 4096
cores: 4
nextcloud:
memory: 3072
Backup Window Management
Staggering backups prevents IO storms:
2:00 AM - VMs backup
3:00 AM - Container snapshots
4:00 AM - Restic cloud sync
5:00 AM - Cleanup old snapshots
Performance Metrics
After 6 months of running a two-node cluster:
- Uptime: 99.94% (one planned maintenance window)
- Resource usage: 65% RAM (about 125GB of 192GB), 15% CPU average
- Power consumption: ~70W idle, ~170W under load
- Monthly cost: ~3 for Backblaze B2
Compare that to equivalent cloud infrastructure: easily $200+/month.
Future Expansions
With the second NUC in place, scaling is still trivial:
- Ceph storage: Distributed storage across nodes
- BGP routing: Full network automation with Cilium
- GitLab CI: Move from GitHub to self-hosted CI/CD
Key Takeaways
Building a production-grade homelab taught me several things:
Infrastructure as Code isn't just for work. The same practices that make cloud infrastructure manageable work brilliantly at home scale.
LXC containers are underrated. For service isolation without Kubernetes complexity, they're perfect.
Monitoring isn't optional. You'll catch issues before they become problems.
Automation compounds. Every script you write saves hours down the road.
Documentation matters. Future you will thank present you.
The Real-World Impact
The best infrastructure is the one that gets out of your way. This setup has been running my home services—media streaming with Jellyfin, recipe management with Mealie, document archival with Paperless-ngx, and cloud storage with Nextcloud—without me thinking about it. That's when you know you've built something solid.
The migration from a monolithic docker-host to dedicated LXC containers brought tangible benefits:
- Service isolation: Jellyfin transcoding no longer impacts Nextcloud performance
- Granular backups: Individual container snapshots instead of backing up everything
- Easier debugging: Logs and metrics scoped to specific services
- Resource efficiency: Better CPU and memory allocation per service type
What's your homelab running? I'd love to hear how others are approaching infrastructure at home scale.