What is Longhorn?

Longhorn is a lightweight, reliable and easy-to-use distributed block storage system for Kubernetes.

Longhorn is free, open source software. Originally developed by Rancher Labs, it is now developed as an incubating project of the Cloud Native Computing Foundation.

Why choose Longhorn?

Simple to deploy & operate: Install via Helm/manifests, clean web UI, great Rancher integration.

Kubernetes-native: Everything is CRDs/CSI; snapshots, backups, restores, and automation are all inside the cluster.

High availability & self-healing: Each volume keeps multiple replicas across nodes; disk/node failures trigger automatic rebuilds.

Built-in backup/DR: Incremental backups to S3/NFS, recurring jobs, Disaster Recovery volumes, and cross-cluster restore.

Infra flexibility & cost: Runs on local disks (SSD/HDD) on almost any hardware (on-prem/edge/ARM64); no vendor lock-in.

Useful features: Thin-provisioning, soft anti-affinity, RWX via NFS provisioner, fast local snapshots, online volume expansion.

Observability: Health checks, metrics, and a UI that shows data paths and replicas.

When is it a good fit?

Small to mid-size clusters, lean DevOps teams, and edge/branch sites.

General stateful workloads: app services, CI runners (Jenkins/GitLab), MinIO, light analytics—anything not ultra-latency-sensitive.

Straightforward, low-cost DR with S3/NFS backups and quick cross-cluster restore.

When Ceph feels too heavy but you still need HA block storage.

Limitations / performance notes

Network/CPU overhead from block-level replication; not ideal for ultra-IOPS, low-latency OLTP databases.

Throughput is typically lower, and latency higher, than with direct local PVs or a well-tuned Ceph cluster under very heavy workloads.

Needs adequate bandwidth and capacity between nodes (10 GbE recommended for serious loads).

Prefer SSDs/NVMe; pure HDD setups rebuild slowly and add latency.

Best practices (quick hits)

Run ≥3 nodes for true HA; set 2–3 replicas per volume.

Separate OS and data disks; use Maintenance Mode before draining a node.

Configure an S3/NFS BackupStore and recurring snapshot/backup jobs; test restores regularly.

Provide multiple StorageClasses (e.g., fast-ssd with replica=2; standard with replica=3).

Monitor node/filesystem health, free space, and rebuild progress with alerts.

Installation prerequisites (minimal yet sufficient)

Kubernetes v1.25+ with ≥3 worker nodes for true HA.

Nodes: x86_64 (SSE4.2) or ARM64; ≥4 GiB RAM (8 GiB+ recommended), up-to-date Linux.

Dedicated data disk for Longhorn (prefer SSD/NVMe); avoid placing data on the OS/root disk.

Stable, low-latency network between nodes; firewalls must not block intra-cluster storage traffic.

Container runtime: containerd or Docker (compatible versions).

iSCSI (classic data engine): install open-iscsi and keep iscsid running on every node.

NVMe/TCP (optional newer engine): ensure kernel modules nvme, nvme-core, nvme-tcp are available and loaded at boot.
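
The NVMe/TCP module requirement above can be checked on a node before enabling the newer engine. The following is a minimal sketch and not part of Longhorn: missing_modules is a hypothetical helper that reads a file in /proc/modules format and prints every required module that is not listed. It assumes the drivers are built as loadable modules; drivers compiled into the kernel do not appear in /proc/modules and would be reported as missing.

```shell
#!/bin/sh
# missing_modules <modules-file> <module>...
# Prints each requested module that is NOT listed in a /proc/modules-style file.
# (Hypothetical helper for illustration; not a Longhorn tool.)
missing_modules() {
  file="$1"
  shift
  for mod in "$@"; do
    # /proc/modules lists names with underscores (nvme_tcp), not hyphens (nvme-tcp)
    name=$(printf '%s' "$mod" | tr '-' '_')
    # Each line starts with the module name followed by a space
    if ! grep -q "^${name} " "$file"; then
      echo "$mod"
    fi
  done
}
```

On a node, `missing_modules /proc/modules nvme nvme-core nvme-tcp` prints nothing when all three modules are loaded.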

Notes

multipathd can interfere with attaches—disable it or blacklist Longhorn devices.

Keep swap off (or properly configured) on Kubernetes nodes.

Use NTP/chrony for clock sync across nodes.

Quick install (Ubuntu/Debian)

# On every node
sudo apt-get update
sudo apt-get install -y open-iscsi nfs-common cryptsetup
sudo systemctl enable --now iscsid

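# (Optional) NVMe/TCP: load the modules at boot and right now
printf '%s\n' nvme nvme-core nvme-tcp | sudo tee /etc/modules-load.d/nvme-tcp.conf
# -a is needed to load several modules; without it, the extra names are passed as module parameters
sudo modprobe -a nvme nvme-core nvme-tcp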

Quick install (RHEL/Rocky/Alma)

# On every node
sudo dnf install -y iscsi-initiator-utils nfs-utils cryptsetup
sudo systemctl enable --now iscsid

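# (Optional) NVMe/TCP: load the modules at boot and right now
printf '%s\n' nvme nvme-core nvme-tcp | sudo tee /etc/modules-load.d/nvme-tcp.conf
# -a is needed to load several modules; without it, the extra names are passed as module parameters
sudo modprobe -a nvme nvme-core nvme-tcp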

Deploy Longhorn with Helm

helm repo add longhorn https://charts.longhorn.io
helm repo update

kubectl create namespace longhorn-system

helm install longhorn longhorn/longhorn -n longhorn-system

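# Locate the UI service (ClusterIP by default in the Helm chart)
kubectl -n longhorn-system get svc longhorn-frontend
# Reach it locally, e.g. via port-forward, then open http://localhost:8080
kubectl -n longhorn-system port-forward svc/longhorn-frontend 8080:80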

Example StorageClasses

# fast-ssd (replica=2) — good for common workloads with moderate risk tolerance
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
  fsType: "ext4"
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
---
# standard-longhorn (replica=3) — higher resiliency for more critical data
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-longhorn
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
  fsType: "xfs"
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

BackupStore configuration

S3-compatible: Longhorn UI → Settings → Backup Target, e.g. s3://my-bucket@us-east-1/longhorn. Also provide a Backup Target Credentials Secret containing the access key, the secret key, and a custom endpoint if the store is not AWS.

NFS: e.g. nfs://10.0.0.20:/export/longhorn-backups (must be reachable from Longhorn manager pods on all nodes).
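
A malformed target URL is an easy mistake to make here, so it can help to sanity-check the string before pasting it into the settings. The sketch below is only a loose shape check based on the two example URLs above; backup_target_kind is a hypothetical helper name, and it does not verify that the bucket or export actually exists or is reachable.

```shell
#!/bin/sh
# backup_target_kind <url>
# Prints "s3", "nfs", or "invalid" based on the general shape of the URL.
# (Hypothetical helper for illustration; Longhorn does its own validation.)
backup_target_kind() {
  case "$1" in
    # s3://<bucket>@<region>/<path>
    s3://?*@?*/*) echo s3 ;;
    # nfs://<host>:/<export-path>
    nfs://?*:/?*) echo nfs ;;
    *) echo invalid ;;
  esac
}
```

For example, `backup_target_kind s3://my-bucket/longhorn` prints invalid because the @region part is missing.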

Recurring jobs

Per-volume, schedule hourly/daily snapshots and daily/weekly backups; keep sensible retention (e.g., 7/30 versions).
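
Recurring jobs can also be declared as Kubernetes objects instead of being clicked together in the UI. The sketch below assumes a Longhorn release that serves the longhorn.io/v1beta2 RecurringJob resource; render_recurring_job is a hypothetical helper that only prints a manifest, and the job name, schedule, and retention used with it are illustrative values. Check the field names against the documentation for your Longhorn version before applying.

```shell
#!/bin/sh
# render_recurring_job <name> <task> <cron> <retain>
# Prints a RecurringJob manifest to stdout; nothing is applied to the cluster.
# <task> is "snapshot" or "backup"; <cron> is a standard five-field schedule.
# (Hypothetical helper; assumes the longhorn.io/v1beta2 API is available.)
render_recurring_job() {
  cat <<EOF
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: $1
  namespace: longhorn-system
spec:
  task: "$2"
  cron: "$3"
  retain: $4
  concurrency: 1
EOF
}
```

A daily backup at 02:00 that keeps 30 versions could then be created with `render_recurring_job daily-backup backup "0 2 * * *" 30 | kubectl apply -f -` and assigned to individual volumes afterwards.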

Test restore (recommended routine)

Restore a recent backup into a new volume.

Bind to a temporary PVC and run read/write checks.

Record timings and results in your runbook.
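
The read/write check in the second step can be as simple as writing random data to the restored volume and comparing checksums. This is a minimal smoke-test sketch, assuming the temporary PVC is mounted at a directory you pass in and that head, tee, and sha256sum are available in the container; rw_check is a hypothetical helper name. Because the read-back may be served from the page cache, it proves the filesystem is writable and consistent, not that every block on disk is healthy.

```shell
#!/bin/sh
# rw_check <dir> [size_mb]
# Writes size_mb MiB of random data (default 8) into <dir>, reads it back,
# and returns 0 only if the checksums match. The test file is removed afterwards.
# (Hypothetical helper for illustration.)
rw_check() {
  dir="$1"
  size_mb="${2:-8}"
  file="$dir/.rw-check.$$"
  # Checksum the data stream while it is being written to the volume
  expected=$(head -c "$((size_mb * 1024 * 1024))" /dev/urandom | tee "$file" | sha256sum | cut -d' ' -f1)
  sync
  # Checksum what was actually stored; this is empty if the write failed
  actual=$(sha256sum "$file" | cut -d' ' -f1)
  rm -f "$file"
  [ -n "$actual" ] && [ "$expected" = "$actual" ]
}
```

Wrapping the call as `time rw_check /mnt/restore-test 64` (the mount path is an example) gives a timing to record in the runbook.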

Operations & maintenance

Volume expansion: supported online via UI or by increasing spec.resources.requests.storage on the PVC.

Node offboarding: enable Maintenance Mode, drain, return node, then monitor rebuild to completion.

Thin provisioning: enabled by default, so watch actual disk consumption and alert around 80–85% capacity.
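
For the PVC route to volume expansion, the change is a small patch to spec.resources.requests.storage. The sketch below only builds the patch document; pvc_resize_patch is a hypothetical helper, and the PVC name, namespace, and size in the usage line are placeholders. It assumes the StorageClass has allowVolumeExpansion: true, as in the examples earlier.

```shell
#!/bin/sh
# pvc_resize_patch <new-size>
# Prints a JSON merge patch that sets the requested storage of a PVC,
# e.g. pvc_resize_patch 20Gi. Nothing is sent to the cluster.
# (Hypothetical helper for illustration.)
pvc_resize_patch() {
  printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}\n' "$1"
}
```

It would typically be used as `kubectl -n my-namespace patch pvc my-data -p "$(pvc_resize_patch 20Gi)"`; Kubernetes rejects a new size that is smaller than the current one.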

Monitoring & alerts

Track read/write latency, IOPS, rebuild duration/progress, free space, and replica health.

Use Prometheus/Grafana (or your stack) and alert on disk pressure and abnormal latency.
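
Where a full monitoring stack is not yet in place, the 80–85% capacity guideline can be approximated with a small script on each node. This is a rough sketch rather than a replacement for proper alerting: usage_percent and disk_alert_level are hypothetical helpers, and the thresholds of 80 (warn) and 85 (critical) are simply the band mentioned above.

```shell
#!/bin/sh
# usage_percent <used_bytes> <total_bytes>
# Prints the integer usage percentage, rounded down.
usage_percent() {
  if [ "$2" -le 0 ]; then
    echo "total must be positive" >&2
    return 2
  fi
  echo $(( $1 * 100 / $2 ))
}

# disk_alert_level <used_bytes> <total_bytes>
# Prints "ok" below 80%, "warn" from 80%, and "critical" from 85%.
disk_alert_level() {
  pct=$(usage_percent "$1" "$2") || return 2
  if [ "$pct" -ge 85 ]; then
    echo critical
  elif [ "$pct" -ge 80 ]; then
    echo warn
  else
    echo ok
  fi
}

# Example wiring on a node (assumes GNU df and Longhorn's data under /var/lib/longhorn):
#   set -- $(df -B1 --output=used,size /var/lib/longhorn | tail -n 1)
#   disk_alert_level "$1" "$2"
```

Note that with thin provisioning the figure that matters is the actual usage of the underlying disk, which is what df reports, not the sum of the volumes' nominal sizes.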

Troubleshooting (common)

Attach/Mount failures: verify iscsid (or NVMe/TCP) is running, multipathd disabled/blacklisted, firewalls open, and K8s/Longhorn versions compatible.

Slow rebuilds: network bottlenecks or HDD-only pools—migrate to SSD/NVMe and improve bandwidth.

High latency: too many replicas, pod contention on the same node, or disk contention—tune replica counts and placement.

NVMe/TCP vs iSCSI — quick note

iSCSI (classic, mature): wide compatibility, simple setup on most distros.

NVMe/TCP (newer): potential for lower latency & better throughput; requires newer kernels and loaded modules.

Security & compliance

Keep RBAC enabled; scope Longhorn access to the longhorn-system namespace.

Pod Security: add the minimal required permissions (Privileged/HostPath) per Longhorn docs.

Upgrades — safe practice

Validate backups/DR and perform a test restore before upgrading.

Choose a chart version compatible with your Longhorn target; upgrade in stages and watch volume health/latency.
