Fargate에서 예약 실패로 보류 상태에있는 컨테이너를 디버깅하는 방법은 무엇입니까?
객관적인
제 목표는 Fargate를 사용하여 AWS EKS에 배포 할 수있는 것입니다. 배포 작업을 node_group
. 그러나 Fargate를 사용하도록 전환했을 때 포드가 모두 보류 상태에있는 것 같습니다.
내 현재 코드의 모습
Terraform을 사용하여 프로비저닝하고 있습니다 (반드시 Terraform 답변을 찾지는 않음). 이것은 내 EKS 클러스터를 만드는 방법입니다.
module "eks_cluster" {
source = "terraform-aws-modules/eks/aws"
version = "13.2.1"
cluster_name = "${var.project_name}-${var.env_name}"
cluster_version = var.cluster_version
vpc_id = var.vpc_id
cluster_enabled_log_types = ["api", "audit", "authenticator", "controllerManager", "scheduler"]
enable_irsa = true
subnets = concat(var.private_subnet_ids, var.public_subnet_ids)
create_fargate_pod_execution_role = false
node_groups = {
my_nodes = {
desired_capacity = 1
max_capacity = 2
min_capacity = 1
instance_type = var.nodes_instance_type
subnets = var.private_subnet_ids
}
}
}
다음은 Fargate 프로필을 프로비저닝하는 방법입니다.
resource "aws_eks_fargate_profile" "airflow" {
cluster_name = module.eks_cluster.cluster_id
fargate_profile_name = "${var.project_name}-fargate-${var.env_name}"
pod_execution_role_arn = aws_iam_role.fargate_iam_role.arn
subnet_ids = var.private_subnet_ids
selector {
namespace = "airflow"
}
}
그리고 이것이 내가 필요한 정책을 만들고 연결하는 방법입니다.
resource "aws_iam_role" "fargate_iam_role" {
name = "${var.project_name}-fargate-${var.env_name}"
force_detach_policies = true
assume_role_policy = jsonencode({
Statement = [{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = ["eks-fargate-pods.amazonaws.com", "eks.amazonaws.com"]
}
}]
Version = "2012-10-17"
})
}
# Attach IAM Policy for Fargate
resource "aws_iam_role_policy_attachment" "fargate_pod_execution" {
role = aws_iam_role.fargate_iam_role.name
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSFargatePodExecutionRolePolicy"
}
내가 시도했지만 작동하지 않는 것
Fargate Profile
Exists 가있는 동일한 네임 스페이스에 포드 (헬름 차트를 사용하고 있음 )를 배포 해 보았습니다 . 실행하면 다음 kubectl get pods -n airflow
과 같이 보류중인 모든 포드가 표시됩니다.
NAME READY STATUS RESTARTS AGE
airflow-flower-79b5948677-vww5d 0/1 Pending 0 40s
airflow-redis-master-0 0/1 Pending 0 40s
airflow-scheduler-6b6bd4b6f6-j9qzg 0/2 Pending 0 41s
airflow-web-567b55fbbf-z8dsg 0/2 Pending 0 41s
airflow-worker-0 0/2 Pending 0 40s
airflow-worker-1 0/2 Pending 0 40s
그런 다음의 이벤트를 봅니다 kubectl get events -n airflow
.
LAST SEEN TYPE REASON OBJECT MESSAGE
2m15s Normal LoggingEnabled pod/airflow-flower-79b5948677-vww5d Successfully enabled logging for pod
2m16s Normal SuccessfulCreate replicaset/airflow-flower-79b5948677 Created pod: airflow-flower-79b5948677-vww5d
2m17s Normal ScalingReplicaSet deployment/airflow-flower Scaled up replica set airflow-flower-79b5948677 to 1
2m15s Normal LoggingEnabled pod/airflow-redis-master-0 Successfully enabled logging for pod
2m16s Normal SuccessfulCreate statefulset/airflow-redis-master create Pod airflow-redis-master-0 in StatefulSet airflow-redis-master successful
2m15s Normal LoggingEnabled pod/airflow-scheduler-6b6bd4b6f6-j9qzg Successfully enabled logging for pod
2m16s Normal SuccessfulCreate replicaset/airflow-scheduler-6b6bd4b6f6 Created pod: airflow-scheduler-6b6bd4b6f6-j9qzg
2m17s Normal NoPods poddisruptionbudget/airflow-scheduler No matching pods found
2m17s Normal ScalingReplicaSet deployment/airflow-scheduler Scaled up replica set airflow-scheduler-6b6bd4b6f6 to 1
2m15s Normal LoggingEnabled pod/airflow-web-567b55fbbf-z8dsg Successfully enabled logging for pod
2m16s Normal SuccessfulCreate replicaset/airflow-web-567b55fbbf Created pod: airflow-web-567b55fbbf-z8dsg
2m17s Normal ScalingReplicaSet deployment/airflow-web Scaled up replica set airflow-web-567b55fbbf to 1
2m15s Normal LoggingEnabled pod/airflow-worker-0 Successfully enabled logging for pod
2m15s Normal LoggingEnabled pod/airflow-worker-1 Successfully enabled logging for pod
2m16s Normal SuccessfulCreate statefulset/airflow-worker create Pod airflow-worker-0 in StatefulSet airflow-worker successful
2m16s Normal SuccessfulCreate statefulset/airflow-worker create Pod airflow-worker-1 in StatefulSet airflow-worker successful
그런 다음을 통해 포드 중 하나를 설명하려고하면 다음 kubectl describe pod
과 같은 결과가 나타납니다.
Name: airflow-redis-master-0
Namespace: airflow
Priority: 2000001000
Priority Class Name: system-node-critical
Node: <none>
Labels: app=redis
chart=redis-10.5.7
controller-revision-hash=airflow-redis-master-588d57785d
eks.amazonaws.com/fargate-profile=airflow-fargate-airflow-dev
release=airflow
role=master
statefulset.kubernetes.io/pod-name=airflow-redis-master-0
Annotations: CapacityProvisioned: 0.25vCPU 0.5GB
Logging: LoggingEnabled
checksum/configmap: 2b82c78fd9186045e6e2b44cfbb38460310697cf2f2f175c9d8618dd4d42e1ca
checksum/health: a5073935c8eb985cf8f3128ba7abbc4121cef628a9a1b0924c95cf97d33323bf
checksum/secret: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
cluster-autoscaler.kubernetes.io/safe-to-evict: true
kubernetes.io/psp: eks.privileged
Status: Pending
IP:
IPs: <none>
Controlled By: StatefulSet/airflow-redis-master
NominatedNodeName: 6f344dfd11-000a9c54e4e240a2a8b3dfceb5f8227e
Containers:
airflow-redis:
Image: docker.io/bitnami/redis:5.0.7-debian-10-r32
Port: 6379/TCP
Host Port: 0/TCP
Command:
/bin/bash
-c
if [[ -n $REDIS_PASSWORD_FILE ]]; then password_aux=`cat ${REDIS_PASSWORD_FILE}`
export REDIS_PASSWORD=$password_aux fi if [[ ! -f /opt/bitnami/redis/etc/master.conf ]];then cp /opt/bitnami/redis/mounted-etc/master.conf /opt/bitnami/redis/etc/master.conf fi if [[ ! -f /opt/bitnami/redis/etc/redis.conf ]];then cp /opt/bitnami/redis/mounted-etc/redis.conf /opt/bitnami/redis/etc/redis.conf fi ARGS=("--port" "${REDIS_PORT}")
ARGS+=("--requirepass" "${REDIS_PASSWORD}") ARGS+=("--masterauth" "${REDIS_PASSWORD}")
ARGS+=("--include" "/opt/bitnami/redis/etc/redis.conf")
ARGS+=("--include" "/opt/bitnami/redis/etc/master.conf")
/run.sh ${ARGS[@]}
Liveness: exec [sh -c /health/ping_liveness_local.sh 5] delay=5s timeout=5s period=5s #success=1 #failure=5
Readiness: exec [sh -c /health/ping_readiness_local.sh 5] delay=5s timeout=1s period=5s #success=1 #failure=5
Environment:
REDIS_REPLICATION_MODE: master
REDIS_PASSWORD: <set to the key 'redis-password' in secret 'my-creds'> Optional: false
REDIS_PORT: 6379
Mounts:
/data from redis-data (rw)
/health from health (rw)
/opt/bitnami/redis/etc/ from redis-tmp-conf (rw)
/opt/bitnami/redis/mounted-etc from config (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-dmwvn (ro)
Volumes:
health:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: airflow-redis-health
Optional: false
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: airflow-redis
Optional: false
redis-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
redis-tmp-conf:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
default-token-dmwvn:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-dmwvn
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal LoggingEnabled 3m12s fargate-scheduler Successfully enabled logging for pod
Warning FailedScheduling 12s fargate-scheduler Pod provisioning timed out (will retry) for pod: airflow/airflow-redis-master-0
내가 시도한 다른 것들
- 적절한 태그로 내 서브넷에 태그 지정 (퍼블릭 / 프라이빗 서브넷에 따라 조건부) :
kubernetes_tags = map(
"kubernetes.io/role/${var.type == "Public" ? "elb" : "internal-elb"}", 1,
"kubernetes.io/cluster/${var.kubernetes_cluster_name}", "shared"
)
- 내 포드에 Fargate 프로필 주석 달기 (예 : 인프라 : fargate)
- VPC 설정을 디버그합니다. 내 이해를 돕기 위해 Fargate에 대해 다음 설정을 설명해야합니다 (출처는 여기 ).
single_nat_gateway = true # needed for fargate (https://docs.aws.amazon.com/eks/latest/userguide/eks-ug.pdf#page=135&zoom=100,96,764)
enable_nat_gateway = true # needed for fargate (https://docs.aws.amazon.com/eks/latest/userguide/eks-ug.pdf#page=135&zoom=100,96,764)
enable_vpn_gateway = false
enable_dns_hostnames = true # needed for fargate (https://docs.aws.amazon.com/eks/latest/userguide/eks-ug.pdf#page=135&zoom=100,96,764)
enable_dns_support = true # needed for fargate (https://docs.aws.amazon.com/eks/latest/userguide/eks-ug.pdf#page=135&zoom=100,96,764)
그러나 즉시 생성 된 VPC가 제공되었으며 이러한 설정이 이미 켜져 있는지 / 꺼 졌는지 확인하는 방법을 잘 모르겠습니다.
이 문제를 디버그하기 위해 수행해야하는 단계는 무엇입니까?
답변
테스트 목적으로 NAT 게이트웨이를 사용하여 vpc 프라이빗 서브넷에서 외부 세계로의 연결을 활성화해야한다고 생각합니다. 따라서 퍼블릭에서 NAT 게이트웨이를 생성하고 다음과 같은 관련 라우팅 테이블의 추가 항목을 프라이빗 서브넷에 추가 할 수 있습니다.
0.0.0.0/0 nat-xxxxxxxx
이것이 효과가 있고 더 안전한 방화벽 인스턴스를 통해 아웃 바운드를 제한하려면 방화벽 공급자 지원에 문의하여 멀리 나가는 트래픽을 허용 목록에 추가하는 방법을 문의해야합니다.