Troubleshooting - Kubernetes Problem Solving
Overview
Troubleshooting Kubernetes requires a systematic approach. Common issues include pod failures, networking problems, and resource constraints.
Pod Troubleshooting
Pod States
# Check pod status
kubectl get pods -o wide
# Describe pod for events
kubectl describe pod <pod-name>
# Check pod logs
kubectl logs <pod-name> -c <container-name>
# Previous container logs
kubectl logs <pod-name> --previous
Common Pod Issues
Image Pull Errors
# Error: ImagePullBackOff
kubectl describe pod <pod-name>
# Common causes:
# - Wrong image name/tag
# - Missing image pull secrets
# - Registry authentication issues
# - Network connectivity to registry
CrashLoopBackOff
# Check exit codes and restart counts
kubectl describe pod <pod-name>
# Check logs for errors
kubectl logs <pod-name> --previous
# Common causes:
# - Application startup failures
# - Missing dependencies
# - Resource limits too low
# - Configuration errors
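The exit code reported by kubectl describe pod often narrows down which of the causes above applies. The following is a minimal sketch, not an exhaustive mapping: the function names explain_exit_code and last_exit_code are hypothetical helpers introduced here, and the interpretations are common heuristics (for example, codes above 128 usually mean the process was killed by signal code minus 128).

```shell
#!/bin/sh
# Map a container exit code to a likely cause (heuristic, not exhaustive).
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "exited normally; check why the process is not long-running" ;;
    1)   echo "application error; check the previous container logs" ;;
    126) echo "command found but not executable; check permissions" ;;
    127) echo "command not found; check the image entrypoint" ;;
    137) echo "killed by SIGKILL; often OOMKilled, check memory limits" ;;
    139) echo "segmentation fault (SIGSEGV)" ;;
    143) echo "terminated by SIGTERM" ;;
    *)
      # Codes above 128 conventionally encode 128 + signal number.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application-specific exit code $code"
      fi
      ;;
  esac
}

# Hypothetical helper: read the last exit code of the first container in a pod.
# Assumes the container has terminated at least once.
last_exit_code() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}
```

With a cluster available, the two can be combined as explain_exit_code "$(last_exit_code <pod-name>)".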
Pod Debugging
# Open a shell in a running container (use /bin/sh if the image has no bash)
kubectl exec -it <pod-name> -- /bin/bash
# Debug with ephemeral containers
kubectl debug <pod-name> -it --image=busybox
# Copy files from pod
kubectl cp <pod-name>:/path/to/file ./local-file
Node Troubleshooting
Node Status
# Check node conditions
kubectl describe node <node-name>
# Check node resources
kubectl top node <node-name>
# Check kubelet status
sudo systemctl status kubelet
# Check kubelet logs
sudo journalctl -u kubelet -f
Common Node Issues
Node NotReady
# Check node conditions
kubectl get nodes
kubectl describe node <node-name>
# Common causes:
# - kubelet not running
# - Network connectivity issues
# - Disk pressure
# - Memory pressure
# - Container runtime issues
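On larger clusters it helps to list only the nodes that need attention before describing each one. The sketch below assumes the default column layout of kubectl get nodes --no-headers (name first, status second); not_ready_nodes is a hypothetical helper name.

```shell
#!/bin/sh
# Print the names of nodes whose STATUS column does not start with "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# "Ready,SchedulingDisabled" (a cordoned node) still counts as Ready here.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}
```

Typical usage would be kubectl get nodes --no-headers | not_ready_nodes, followed by kubectl describe node on each name it prints.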
Resource Pressure
# Check disk usage
df -h
# Check memory usage
free -m
# Check running processes
ps aux | head -20
# Clean up unused images (Docker runtime only; on containerd or CRI-O use: sudo crictl rmi --prune)
docker system prune -a
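Disk pressure is easier to spot when the df output is filtered against a threshold. This is a minimal sketch that assumes POSIX df -P output and mount points without spaces; mounts_over_threshold is a hypothetical helper, and the 85% default is an illustrative choice rather than a kubelet setting (the kubelet's own eviction thresholds are configured separately).

```shell
#!/bin/sh
# Print mount points whose usage is at or above a threshold (default 85%).
# Reads `df -P` output on stdin; the header line is skipped.
mounts_over_threshold() {
  threshold="${1:-85}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # strip the percent sign from the Capacity column
    if (use + 0 >= t + 0) print $6, $5
  }'
}
```

On a node this could be run as df -P | mounts_over_threshold 90.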
Networking Troubleshooting
Service Issues
# Check service endpoints
kubectl get endpoints <service-name>
# Test service connectivity
kubectl run test-pod --image=busybox -it --rm -- wget -qO- <service-name>:<port>
# Check DNS resolution
kubectl run test-pod --image=busybox -it --rm -- nslookup <service-name>
Network Policies
# Check network policies
kubectl get networkpolicies
# Test pod-to-pod connectivity
kubectl exec -it <pod-1> -- ping <pod-2-ip>
# Check iptables rules
sudo iptables -L -n -v
CNI Issues
# Check CNI plugin pods
kubectl get pods -n kube-system | grep -E "(calico|flannel|weave|cilium)"
# Check CNI configuration
cat /etc/cni/net.d/*
# Restart CNI pods (Calico example; the label selector differs for other plugins)
kubectl delete pods -n kube-system -l k8s-app=calico-node
Storage Troubleshooting
PVC Issues
# Check PVC status
kubectl get pvc
# Describe PVC for events
kubectl describe pvc <pvc-name>
# Check available PVs
kubectl get pv
# Check storage class
kubectl get storageclass
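When a namespace holds many claims, filtering for the ones that are not Bound shows where to look first. The sketch assumes the default kubectl get pvc --no-headers layout (name first, status second); unbound_pvcs is a hypothetical helper name.

```shell
#!/bin/sh
# Print "name: status" for every PVC that is not in the Bound state.
# Reads `kubectl get pvc --no-headers` output on stdin.
unbound_pvcs() {
  awk '$2 != "Bound" { print $1 ": " $2 }'
}
```

For example, kubectl get pvc --no-headers | unbound_pvcs lists Pending or Lost claims, which can then be inspected with kubectl describe pvc.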
Volume Mount Issues
# Check pod events
kubectl describe pod <pod-name>
# Check filesystem permissions
kubectl exec -it <pod-name> -- ls -la /mount/path
# Common issues:
# - Incorrect mount path
# - Permission denied
# - Volume not available
# - Storage class issues
Performance Troubleshooting
Resource Monitoring
# Check resource usage
kubectl top pods --all-namespaces
kubectl top nodes
# Check resource requests/limits
kubectl describe pod <pod-name> | grep -A 10 "Limits\|Requests"
# Check resource quotas
kubectl describe quota -n <namespace>
Application Performance
# Check application metrics
curl <pod-ip>:8080/metrics
# Check active sessions and running queries (for databases; pg_stat_activity shows current activity)
kubectl exec -it <postgres-pod> -- psql -c "SELECT * FROM pg_stat_activity;"
# Profile application
kubectl exec -it <pod-name> -- /usr/bin/pprof
DNS Troubleshooting
DNS Resolution
# Test DNS from pod
kubectl run test-pod --image=busybox -it --rm -- nslookup kubernetes.default
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS configuration
kubectl get configmap coredns -n kube-system -o yaml
# Test external DNS
kubectl run test-pod --image=busybox -it --rm -- nslookup google.com
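Short names such as kubernetes.default depend on the search path in the pod's /etc/resolv.conf, so it can help to also test the fully qualified name. The sketch below builds that name; service_fqdn is a hypothetical helper, and cluster.local is assumed as the cluster domain because it is the common default (your cluster may use a different one).

```shell
#!/bin/sh
# Build the fully qualified DNS name of a Service.
# Arguments: service name, namespace (default: default),
# cluster domain (default: cluster.local, an assumption).
service_fqdn() {
  service="$1"
  namespace="${2:-default}"
  domain="${3:-cluster.local}"
  echo "${service}.${namespace}.svc.${domain}"
}
```

If nslookup "$(service_fqdn <service-name> <namespace>)" succeeds from a test pod while the short name fails, the search path is a more likely culprit than CoreDNS itself.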
CoreDNS Issues
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Restart CoreDNS
kubectl rollout restart deployment/coredns -n kube-system
# Check DNS endpoints
kubectl get endpoints kube-dns -n kube-system
Security Troubleshooting
RBAC Issues
# Check user permissions
kubectl auth can-i <verb> <resource> --as=<user>
# Check service account permissions
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<namespace>:<sa-name>
# Check role bindings
kubectl get rolebindings -A
kubectl describe rolebinding <binding-name>
Pod Security
# Check security context (read it from the pod spec)
kubectl get pod <pod-name> -o yaml | grep -A 10 securityContext
# Check admission controller logs
kubectl logs -n kube-system <admission-controller-pod>
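# Check pod security policies (PodSecurityPolicy was removed in Kubernetes 1.25)
kubectl get psp
# On 1.25+, check the Pod Security Admission labels on the namespace
kubectl get namespace <namespace> --show-labels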
Cluster Troubleshooting
Control Plane Issues
# Check system pods
kubectl get pods -n kube-system
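# Check API server logs (when the API server runs as a systemd unit)
sudo journalctl -u kube-apiserver
# On kubeadm clusters the API server is a static pod; read its logs with kubectl
kubectl logs -n kube-system kube-apiserver-<node-name>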
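# Check etcd health (the pod is named etcd-<node-name>; certificate paths below are the kubeadm defaults)
kubectl exec -n kube-system etcd-master -- etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint health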
# Check scheduler logs
kubectl logs -n kube-system <scheduler-pod>
Certificate Issues
# Check certificate expiration
sudo kubeadm certs check-expiration
# Renew certificates (restart the control plane components afterwards so they load the new certificates)
sudo kubeadm certs renew all
# Check certificate details
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text -noout
Debugging Tools
kubectl Commands
# Get events sorted by time
kubectl get events --sort-by=.metadata.creationTimestamp
# Get all resources in namespace
kubectl get all -n <namespace>
# Patch resources for debugging
kubectl patch deployment <name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","command":["sleep","3600"]}]}}}}'
Third-party Tools
# stern - multi-pod log tailing
stern <pod-pattern>
# kubectx/kubens - context switching
kubectx <context>
kubens <namespace>
# k9s - terminal UI
k9s
# netshoot - network debugging image, attached as an ephemeral container with kubectl debug
kubectl debug <pod> -it --image=nicolaka/netshoot
Java Debugging in Kubernetes
When debugging Java applications in Kubernetes, you can use the following techniques:
1. Remote Debugging: Add JVM arguments to your container to enable remote debugging. You can then connect from an IDE (for example, IntelliJ IDEA or Eclipse).
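apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-java-app
spec:
  selector:
    matchLabels:
      app: my-java-app   # illustrative label; a Deployment requires a selector
  template:
    metadata:
      labels:
        app: my-java-app
    spec:
      containers:
        - name: my-app
          image: my-java-app:latest
          env:
            - name: JAVA_TOOL_OPTIONS
              # address=*:5005 requires JDK 9+; use address=5005 on JDK 8
              value: "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005"
          ports:
            - containerPort: 8080
            - containerPort: 5005 # Expose debug port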
Then forward the debug port to your machine with kubectl port-forward:
kubectl port-forward <pod-name> 5005:5005
2. Thread Dumps and Heap Dumps:
Use kubectl exec to run jstack or jmap inside the Java container.
# Get the PID of the Java application in the container
kubectl exec <pod-name> -- ps -ef | grep java
# Capture a thread dump (no -it, so the redirected output has no TTY carriage returns)
kubectl exec <pod-name> -- jstack <java-pid> > thread-dump.txt
# Capture a heap dump
kubectl exec <pod-name> -- jmap -dump:format=b,file=/tmp/heap-dump.hprof <java-pid>
# Copy the heap dump out of the pod
kubectl cp <pod-name>:/tmp/heap-dump.hprof ./heap-dump.hprof
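Picking the PID out of the ps listing by eye is error-prone when the container also runs wrapper scripts. The following sketch extracts it automatically; it assumes the procps ps -ef layout (UID PID PPID C STIME TTY TIME CMD), which busybox-based images do not follow, and java_pid is a hypothetical helper name.

```shell
#!/bin/sh
# Print the PID of the first process whose command is `java` or ends in `/java`.
# Reads `ps -ef` output on stdin; the header line is skipped.
java_pid() {
  awk 'NR > 1 && $8 ~ /(^|\/)java$/ { print $2; exit }'
}
```

A possible use is pid="$(kubectl exec <pod-name> -- ps -ef | java_pid)", after which "$pid" can be passed to jstack or jmap. In many images the JVM is simply PID 1, in which case this step is unnecessary.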
3. Logging and Metrics: Make sure your Java application emits enough logs and metrics that its behavior and performance are easy to track.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class DebuggingExample {
    private static final Logger logger = LoggerFactory.getLogger(DebuggingExample.class);
    private final MeterRegistry meterRegistry; // Injected via Spring/CDI

    public DebuggingExample(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    public void processData(String data) {
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            logger.info("Processing data: {}", data);
            // Simulate some complex logic
            if (data.contains("error")) {
                logger.error("Error condition met for data: {}", data);
                throw new RuntimeException("Simulated error");
            }
        } catch (Exception e) {
            logger.error("Exception during data processing", e);
            meterRegistry.counter("data_processing_errors_total").increment();
        } finally {
            sample.stop(meterRegistry.timer("data_processing_duration"));
        }
    }
}
Troubleshooting Checklist
Pod Issues
- [ ] Check pod status and events
- [ ] Verify image availability
- [ ] Check resource requests/limits
- [ ] Verify configuration (ConfigMaps, Secrets)
- [ ] Check logs for errors
- [ ] Verify service account permissions
Network Issues
- [ ] Check service endpoints
- [ ] Test DNS resolution
- [ ] Verify network policies
- [ ] Check CNI plugin status
- [ ] Test connectivity between pods
Storage Issues
- [ ] Check PVC status
- [ ] Verify storage class
- [ ] Check available PVs
- [ ] Verify mount permissions
- [ ] Check storage provisioner logs
Best Practices
- Always start with kubectl get and kubectl describe
- Check events for recent activities
- Use logs to understand application behavior
- Test connectivity systematically
- Monitor resource usage
- Keep troubleshooting runbooks
- Document common issues and solutions
Next Steps
- 📚 Learn about Performance Tuning
- 🎯 Practice troubleshooting scenarios
- 🏗️ Build monitoring alerts
- 💻 Create debugging tooling