Troubleshooting - Kubernetes Problem Solving

Overview

Troubleshooting Kubernetes requires a systematic approach. Common issues include pod failures, networking problems, and resource constraints.

Pod Troubleshooting

Pod States

# Check pod status
kubectl get pods -o wide

# Describe pod for events
kubectl describe pod <pod-name>

# Check pod logs
kubectl logs <pod-name> -c <container-name>

# Previous container logs
kubectl logs <pod-name> --previous

Common Pod Issues

Image Pull Errors

# Error: ImagePullBackOff
kubectl describe pod <pod-name>

# Common causes:
# - Wrong image name/tag
# - Missing image pull secrets
# - Registry authentication issues
# - Network connectivity to registry
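
To narrow down the "missing image pull secrets" cause, it can help to confirm that every secret the pod references actually exists in its namespace. The following is a minimal sketch, not an official tool: the function name check_pull_secrets is hypothetical, and it assumes kubectl is already pointed at the right cluster.

```shell
# Report whether each imagePullSecret referenced by a pod exists.
# Usage: check_pull_secrets <pod-name> [namespace]
check_pull_secrets() {
  pod="$1"
  ns="${2:-default}"
  # Names of the pull secrets listed in the pod spec (space separated).
  secrets=$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.imagePullSecrets[*].name}')
  if [ -z "$secrets" ]; then
    echo "pod references no imagePullSecrets"
    return 0
  fi
  for s in $secrets; do
    if kubectl get secret "$s" -n "$ns" >/dev/null 2>&1; then
      echo "secret $s: present"
    else
      echo "secret $s: MISSING"
    fi
  done
}
```

A present secret can still hold wrong credentials, so this only rules out one cause; the events from kubectl describe pod remain the primary source.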

CrashLoopBackOff

# Check exit codes and restart counts
kubectl describe pod <pod-name>

# Check logs for errors
kubectl logs <pod-name> --previous

# Common causes:
# - Application startup failures
# - Missing dependencies
# - Resource limits too low
# - Configuration errors
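
The exit code of the last terminated container often points at one of the causes above. The sketch below reads that code and maps a few common values to a likely explanation. The function names are hypothetical, the mapping is a heuristic rather than an exhaustive table, and the jsonpath expression only looks at the first container in the pod.

```shell
# Map a container exit code to a likely cause (heuristic, not exhaustive).
explain_exit_code() {
  case "$1" in
    0)   echo "exited normally; the main process may finish too early for a long-running pod" ;;
    1)   echo "generic application error; read the previous container logs" ;;
    126) echo "command found but not executable; check file permissions" ;;
    127) echo "command not found; check the image entrypoint and args" ;;
    137) echo "killed by SIGKILL (128+9); often OOMKilled, so check memory limits" ;;
    143) echo "terminated by SIGTERM (128+15); usually a shutdown request" ;;
    *)   echo "no built-in hint for this code; check the application documentation" ;;
  esac
}

# Print the exit code of the last terminated instance of the first container.
# Usage: last_exit_code <pod-name>
last_exit_code() {
  kubectl get pod "$1" -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}
```

A typical call would be explain_exit_code "$(last_exit_code <pod-name>)".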

Pod Debugging

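# Open a shell in a running container (fall back to /bin/sh if the image has no bash)
kubectl exec -it <pod-name> -- /bin/bash

# Debug with an ephemeral container that shares the target container's process namespace
kubectl debug <pod-name> -it --image=busybox --target=<container-name>

# Copy files from pod (requires tar in the container image)
kubectl cp <pod-name>:/path/to/file ./local-file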

Node Troubleshooting

Node Status

# Check node conditions
kubectl describe node <node-name>

# Check node resources
kubectl top node <node-name>

# Check kubelet status
sudo systemctl status kubelet

# Check kubelet logs
sudo journalctl -u kubelet -f

Common Node Issues

Node NotReady

# Check node conditions
kubectl get nodes
kubectl describe node <node-name>

# Common causes:
# - kubelet not running
# - Network connectivity issues
# - Disk pressure
# - Memory pressure
# - Container runtime issues
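
The causes above map onto the node's status conditions: Ready should be True, while the pressure and network conditions should be False. As a rough sketch (both function names are hypothetical), the conditions can be printed as Type=Status pairs and filtered down to the ones in an unexpected state:

```shell
# Print each node condition as Type=Status, one per line.
# Usage: node_conditions <node-name>
node_conditions() {
  kubectl get node "$1" -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Read Type=Status lines on stdin and print only the unhealthy ones.
# Ready is healthy when True; every other condition is healthy when False.
flag_node_conditions() {
  while IFS='=' read -r type status; do
    case "$type" in
      "")    continue ;;
      Ready) [ "$status" = "True" ] || echo "PROBLEM: Ready=$status" ;;
      *)     [ "$status" = "False" ] || echo "PROBLEM: $type=$status" ;;
    esac
  done
}
```

Usage: node_conditions <node-name> | flag_node_conditions. Empty output means no condition is in an unexpected state.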

Resource Pressure

# Check disk usage
df -h

# Check memory usage
free -m

# Check running processes
ps aux | head -20

# Clean up unused images (containerd or CRI-O nodes; on Docker Engine nodes use `docker system prune -a`)
sudo crictl rmi --prune
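
When a node has many mounts, scanning df output by eye is error-prone. The sketch below filters POSIX-format df output down to filesystems at or above a usage threshold. The function name is hypothetical, and the default of 85 percent is an arbitrary illustration, not a kubelet setting.

```shell
# Read `df -P` output on stdin and print filesystems at or above a usage threshold.
# Usage: df -P | flag_full_filesystems [threshold-percent]
flag_full_filesystems() {
  threshold="${1:-85}"
  awk -v t="$threshold" '
    NR > 1 {                      # skip the header line
      use = $5                    # Capacity column, e.g. "91%"
      sub(/%/, "", use)
      if (use + 0 >= t + 0) print $6 " is " $5 " full"
    }'
}
```

Mount points that contain spaces are not handled by this simple column split.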

Networking Troubleshooting

Service Issues

# Check service endpoints
kubectl get endpoints <service-name>

# Test service connectivity
kubectl run test-pod --image=busybox -it --rm -- wget -qO- <service-name>:<port>

# Check DNS resolution
kubectl run test-pod --image=busybox -it --rm -- nslookup <service-name>
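
A Service that resolves in DNS but refuses connections frequently has no ready endpoints behind it. The following sketch wraps the endpoints check so it can be used in a script. The function name service_has_endpoints is hypothetical, and the hint it prints covers only the most common causes.

```shell
# Succeed if the Service has at least one ready endpoint address.
# Usage: service_has_endpoints <service-name> [namespace]
service_has_endpoints() {
  ips=$(kubectl get endpoints "$1" -n "${2:-default}" -o jsonpath='{.subsets[*].addresses[*].ip}')
  if [ -n "$ips" ]; then
    echo "ready endpoints: $ips"
  else
    echo "no ready endpoints; compare the Service selector with pod labels and check readiness probes"
    return 1
  fi
}
```

Pods that exist but fail their readiness probe are listed under notReadyAddresses rather than addresses, which is why they do not count here.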

Network Policies

# Check network policies
kubectl get networkpolicies

# Test pod-to-pod connectivity
kubectl exec -it <pod-1> -- ping <pod-2-ip>

# Check iptables rules
sudo iptables -L -n -v

CNI Issues

# Check CNI plugin pods
kubectl get pods -n kube-system | grep -E "(calico|flannel|weave|cilium)"

# Check CNI configuration
cat /etc/cni/net.d/*

# Restart CNI pods (Calico example; the label differs for other plugins)
kubectl delete pods -n kube-system -l k8s-app=calico-node

Storage Troubleshooting

PVC Issues

# Check PVC status
kubectl get pvc

# Describe PVC for events
kubectl describe pvc <pvc-name>

# Check available PVs
kubectl get pv

# Check storage class
kubectl get storageclass
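
In a busy namespace, the claims worth describing are the ones that are not Bound. A small sketch that lists only those, assuming the default column layout of kubectl get pvc (name first, status second); the function name is hypothetical:

```shell
# List PVCs in a namespace whose status is anything other than Bound.
# Usage: list_unbound_pvcs [namespace]
list_unbound_pvcs() {
  kubectl get pvc -n "${1:-default}" --no-headers 2>/dev/null |
    awk '$2 != "Bound" { print $1 ": " $2 }'
}
```

A Pending claim is not always a fault: with a StorageClass that uses the WaitForFirstConsumer binding mode, the claim stays Pending until a pod that uses it is scheduled.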

Volume Mount Issues

# Check pod events
kubectl describe pod <pod-name>

# Check filesystem permissions
kubectl exec -it <pod-name> -- ls -la /mount/path

# Common issues:
# - Incorrect mount path
# - Permission denied
# - Volume not available
# - Storage class issues

Performance Troubleshooting

Resource Monitoring

# Check resource usage
kubectl top pods --all-namespaces
kubectl top nodes

# Check resource requests/limits
kubectl describe pod <pod-name> | grep -A 10 "Limits\|Requests"

# Check resource quotas
kubectl describe quota -n <namespace>

Application Performance

# Check application metrics
curl <pod-ip>:8080/metrics

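# Check active sessions and long-running queries (for databases)
kubectl exec -it <postgres-pod> -- psql -U <db-user> -c "SELECT pid, state, now() - query_start AS duration, query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY duration DESC;"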

# Profile application (assumes a pprof binary exists at this path in the image; substitute your own profiler)
kubectl exec -it <pod-name> -- /usr/bin/pprof

DNS Troubleshooting

DNS Resolution

# Test DNS from pod (busybox:1.28 is pinned because nslookup in some newer busybox builds gives misleading results)
kubectl run test-pod --image=busybox:1.28 -it --rm --restart=Never -- nslookup kubernetes.default

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS configuration
kubectl get configmap coredns -n kube-system -o yaml

# Test external DNS
kubectl run test-pod --image=busybox:1.28 -it --rm --restart=Never -- nslookup google.com
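
Short names such as kubernetes.default depend on the search domains in the pod's /etc/resolv.conf, so testing the fully qualified name helps separate a search-path problem from a CoreDNS problem. The helper below builds that name. It is a sketch with a hypothetical function name, and it assumes the cluster domain is cluster.local unless a third argument overrides it.

```shell
# Build the fully qualified DNS name of a Service.
# Usage: service_fqdn <service-name> [namespace] [cluster-domain]
service_fqdn() {
  svc="$1"
  ns="${2:-default}"
  domain="${3:-cluster.local}"   # assumption: default cluster domain
  echo "$svc.$ns.svc.$domain"
}
```

For example, service_fqdn kubernetes yields kubernetes.default.svc.cluster.local, which can be passed to nslookup from the test pod.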

CoreDNS Issues

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Restart CoreDNS
kubectl rollout restart deployment/coredns -n kube-system

# Check DNS endpoints
kubectl get endpoints kube-dns -n kube-system

Security Troubleshooting

RBAC Issues

# Check user permissions
kubectl auth can-i <verb> <resource> --as=<user>

# Check service account permissions
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<namespace>:<sa-name>

# Check role bindings
kubectl get rolebindings -A
kubectl describe rolebinding <binding-name>
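
Checking one verb at a time gets tedious when a workload needs several permissions. This sketch loops kubectl auth can-i over a list of verbs for one service account. The function name and argument order are hypothetical, and the caller needs impersonation rights for --as to work.

```shell
# Check several verbs on one resource for a service account.
# Usage: check_sa_permissions <namespace> <sa-name> <resource> <verb>...
check_sa_permissions() {
  ns="$1"
  sa="$2"
  resource="$3"
  shift 3
  for verb in "$@"; do
    # can-i prints yes/no and exits non-zero on "no", so ignore the exit status.
    answer=$(kubectl auth can-i "$verb" "$resource" -n "$ns" \
      --as="system:serviceaccount:$ns:$sa" 2>/dev/null) || true
    echo "$verb $resource: ${answer:-unknown}"
  done
}
```

Example: check_sa_permissions prod builder pods get list delete.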

Pod Security

# Check security context
kubectl describe pod <pod-name> | grep -A 10 "Security Context"

# Check admission controller logs
kubectl logs -n kube-system <admission-controller-pod>

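# Check Pod Security Admission labels on the namespace (Kubernetes 1.25+)
kubectl get namespace <namespace> --show-labels

# Check pod security policies (only on clusters older than 1.25; PodSecurityPolicy was removed in 1.25)
kubectl get psp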

Cluster Troubleshooting

Control Plane Issues

# Check system pods
kubectl get pods -n kube-system

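# Check API server logs (kubeadm runs it as a static pod)
kubectl logs -n kube-system kube-apiserver-<node-name>

# If the API server runs as a systemd unit instead
sudo journalctl -u kube-apiserver

# Check etcd health (certificate paths are the kubeadm defaults; adjust for other installers)
kubectl exec -n kube-system etcd-<node-name> -- etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint health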

# Check scheduler logs
kubectl logs -n kube-system <scheduler-pod>

Certificate Issues

# Check certificate expiration
sudo kubeadm certs check-expiration

# Renew certificates (then restart the control plane static pods so they load the new certificates)
sudo kubeadm certs renew all

# Check certificate details
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text -noout
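
On clusters not managed by kubeadm, or for certificates outside /etc/kubernetes/pki, openssl can answer the expiry question directly through its -checkend option. The following is a minimal sketch; the function name and the 30-day default are illustrative choices.

```shell
# Succeed if the certificate stays valid for more than the given number of days.
# Usage: cert_expires_within_days <cert-file> [days]
cert_expires_within_days() {
  cert="$1"
  days="${2:-30}"
  # -checkend exits 0 when the certificate will NOT expire within N seconds.
  if openssl x509 -in "$cert" -noout -checkend "$((days * 86400))" >/dev/null 2>&1; then
    echo "$cert: valid for more than $days days"
  else
    echo "$cert: expires within $days days (or could not be read)"
    return 1
  fi
}
```

Because the function returns non-zero for a certificate that is about to expire, it can be dropped into a cron job or a monitoring check.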

Debugging Tools

kubectl Commands

# Get events sorted by time
kubectl get events --sort-by=.metadata.creationTimestamp

# Get common workload resources in namespace (`get all` omits ConfigMaps, Secrets, Ingresses, PVCs, and custom resources)
kubectl get all -n <namespace>

# Patch a deployment for debugging: override the container command so a crashing container stays up for inspection
kubectl patch deployment <name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","command":["sleep","3600"]}]}}}}'

Third-party Tools

# stern - multi-pod log tailing
stern <pod-pattern>

# kubectx/kubens - context switching
kubectx <context>
kubens <namespace>

# k9s - terminal UI
k9s

# netshoot - network debugging image, attached with the built-in kubectl debug command
kubectl debug <pod> -it --image=nicolaka/netshoot

Java Debugging in Kubernetes

When debugging Java applications in Kubernetes, you can use the following techniques:

1. Remote Debugging: Add JVM arguments to your container to enable remote debugging. You can then connect from an IDE (for example, IntelliJ IDEA or Eclipse).

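apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-java-app
spec:
  selector:
    matchLabels:
      app: my-java-app # Required by apps/v1; must match the template labels
  template:
    metadata:
      labels:
        app: my-java-app
    spec:
      containers:
      - name: my-app
        image: my-java-app:latest
        env:
        - name: JAVA_TOOL_OPTIONS
          # JDWP agent; the address=*:5005 form needs JDK 9 or later. Avoid enabling this in production.
          value: "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005"
        ports:
        - containerPort: 8080
        - containerPort: 5005 # Expose debug port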

Then forward the debug port to your local machine with kubectl port-forward:

kubectl port-forward <pod-name> 5005:5005

2. Thread Dumps and Heap Dumps: Use kubectl exec to run jstack or jmap inside the Java container (the image must include the JDK tools, not only a JRE).

# Find the PID of the Java application in the container
kubectl exec <pod-name> -- sh -c 'ps -ef | grep [j]ava'

# Capture a thread dump (no -t flag, so the redirected output has no terminal control characters)
kubectl exec <pod-name> -- jstack <java-pid> > thread-dump.txt

# Capture a heap dump
kubectl exec <pod-name> -- jmap -dump:format=b,file=/tmp/heap-dump.hprof <java-pid>

# Copy the heap dump out of the pod
kubectl cp <pod-name>:/tmp/heap-dump.hprof ./heap-dump.hprof
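
A single thread dump is a snapshot; comparing a few dumps taken seconds apart makes stuck or spinning threads easier to spot. The sketch below repeats the jstack call at a fixed interval. The function name, the output file names, and the defaults of three dumps five seconds apart are all illustrative assumptions.

```shell
# Capture several thread dumps from a Java process in a pod.
# Usage: capture_thread_dumps <pod-name> <java-pid> [count] [interval-seconds]
capture_thread_dumps() {
  pod="$1"
  pid="$2"
  count="${3:-3}"
  interval="${4:-5}"
  i=1
  while [ "$i" -le "$count" ]; do
    # Each dump goes to its own file in the current directory.
    kubectl exec "$pod" -- jstack "$pid" > "thread-dump-$i.txt"
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}
```

Threads that show the same stack trace in every dump are the first candidates to investigate.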

3. Logging and Metrics: Make sure your Java application emits enough logs and metrics to make its behavior and performance easy to follow.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class DebuggingExample {

    private static final Logger logger = LoggerFactory.getLogger(DebuggingExample.class);
    private final MeterRegistry meterRegistry; // Injected via Spring/CDI

    public DebuggingExample(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    public void processData(String data) {
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            logger.info("Processing data: {}", data);
            // Simulate some complex logic
            if (data.contains("error")) {
                logger.error("Error condition met for data: {}", data);
                throw new RuntimeException("Simulated error");
            }
        } catch (Exception e) {
            logger.error("Exception during data processing", e);
            meterRegistry.counter("data_processing_errors_total").increment();
        } finally {
            sample.stop(meterRegistry.timer("data_processing_duration"));
        }
    }
}

Troubleshooting Checklist

Pod Issues

  • [ ] Check pod status and events
  • [ ] Verify image availability
  • [ ] Check resource requests/limits
  • [ ] Verify configuration (ConfigMaps, Secrets)
  • [ ] Check logs for errors
  • [ ] Verify service account permissions

Network Issues

  • [ ] Check service endpoints
  • [ ] Test DNS resolution
  • [ ] Verify network policies
  • [ ] Check CNI plugin status
  • [ ] Test connectivity between pods

Storage Issues

  • [ ] Check PVC status
  • [ ] Verify storage class
  • [ ] Check available PVs
  • [ ] Verify mount permissions
  • [ ] Check storage provisioner logs

Best Practices

  • Always start with kubectl get and kubectl describe
  • Check events for recent activity
  • Use logs to understand application behavior
  • Test connectivity systematically
  • Monitor resource usage
  • Keep troubleshooting runbooks
  • Document common issues and solutions

Next Steps

  1. 📚 Learn about Performance Tuning
  2. 🎯 Practice troubleshooting scenarios
  3. 🏗️ Build monitoring alerts
  4. 💻 Create debugging tooling
