
Network-Aware Dynamic Pod Scaling with Custom Metrics
Using native Kubernetes components with custom metrics for intelligent, resource-efficient scaling of network-intensive workloads.
Karpenter
- Purpose: Automatically provisions new nodes in response to unschedulable pods
- Installation: Typically installed via Helm (as discussed below) or Kubernetes manifests
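If you run the open source Karpenter controller yourself (EKS Auto Mode clusters already ship it), a minimal Helm-based install looks roughly like the sketch below. The release name, namespace, and cluster name are placeholders, and production setups also need the Karpenter IAM role and interruption queue configured.
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter --create-namespace \
  --set "settings.clusterName=<your-cluster-name>" \
  --wait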
Karpenter uses a NodePool to define how it should provision nodes. Here's an example configuration that targets instances with higher network bandwidth:
cat <<EoF> basic-networking-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: higher-bandwidth-usage
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 30s
    consolidationPolicy: WhenEmptyOrUnderutilized
  template:
    metadata:
      labels:
        workload-type: network-intensive
        network-tier: high
    spec:
      expireAfter: 336h
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
      # Specifically targeting network-optimized instance families
      - key: eks.amazonaws.com/instance-network-bandwidth
        operator: Gt
        values: ["10000"]
      - key: eks.amazonaws.com/instance-generation
        operator: Gt
        values:
        - "4"
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
        - arm64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      taints:
      - key: workload-type
        value: network-intensive
        effect: NoSchedule
      terminationGracePeriod: 24h0m0s
EoF
kubectl apply -f basic-networking-nodepool.yaml
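After applying the manifest, confirm the NodePool was registered before moving on:
kubectl get nodepool higher-bandwidth-usage
kubectl describe nodepool higher-bandwidth-usage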
cat <<EoF> basic-networking-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: network-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: network-app
  template:
    metadata:
      labels:
        app: network-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: network-app
        image: your-registry/network-app:v6
        ports:
        - containerPort: 8080
          name: http
        - containerPort: 8000
          name: metrics
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
EoF
kubectl apply -f basic-networking-deployment.yaml
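Note that the NodePool above taints its nodes with workload-type=network-intensive:NoSchedule and labels them workload-type: network-intensive. For these pods to actually land on those nodes, the Deployment's pod template needs a matching toleration and node selector; a sketch of the fields to add under template.spec:
      nodeSelector:
        workload-type: network-intensive
      tolerations:
      - key: workload-type
        operator: Equal
        value: network-intensive
        effect: NoSchedule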
cat <<EoF> basic-networking-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: network-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: network-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: network_bandwidth_usage
        selector:
          matchLabels:
            namespace: default
      target:
        type: AverageValue
        averageValue: "8388608" # 8MB/s
EoF
kubectl apply -f basic-networking-hpa.yaml
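You can confirm the HPA exists and watch its metric column; it will report <unknown> until the custom metrics pipeline described later in this post is in place:
kubectl get hpa network-app-hpa
kubectl describe hpa network-app-hpa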
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace prometheus
cat <<EoF> prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 1Gi
alertmanager:
  enabled: true
  config:
    global:
      resolve_timeout: 5m
    route:
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'null'
      routes:
      - match:
          alertname: Watchdog
        receiver: 'null'
    receivers:
    - name: 'null'
EoF
helm install prometheus prometheus-community/kube-prometheus-stack --namespace prometheus --create-namespace --values prometheus-values.yaml
# Check pods
kubectl get pods -n prometheus
# Check services
kubectl get svc -n prometheus
# Port forward Prometheus
kubectl port-forward -n prometheus svc/prometheus-kube-prometheus-prometheus 9090:9090
# Port forward Grafana
kubectl port-forward -n prometheus svc/prometheus-grafana 3000:80
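With the Prometheus port-forward active, you can check that the application metrics are arriving (assuming the network-app pods are running and Prometheus is configured to scrape them):
curl -s 'http://localhost:9090/api/v1/query?query=network_bytes_total' | jq .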
from flask import Flask, Response
from prometheus_client import generate_latest, Counter, Gauge, CONTENT_TYPE_LATEST, start_http_server

app = Flask(__name__)

# Define Prometheus metrics
BYTES_TRANSFERRED = Counter('network_bytes_total', 'Total bytes transferred')
ACTIVE_REQUESTS = Gauge('network_active_requests', 'Number of active requests')

@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

@app.route('/download')  # route path assumed for this example; adjust to your app
def download_file():
    # Simulated file download logic
    ACTIVE_REQUESTS.inc()
    file_size = 100 * 1024 * 1024  # 100MB
    BYTES_TRANSFERRED.inc(file_size)
    # ... (download logic)
    ACTIVE_REQUESTS.dec()
    return "Download complete"

if __name__ == '__main__':
    start_http_server(8000)  # Prometheus metrics endpoint
    app.run(host='0.0.0.0', port=8080)
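Run the app locally to sanity-check both endpoints; the /download path matches the sketch above and the metrics endpoint is served by the Prometheus client on port 8000:
python app.py &
curl http://localhost:8080/download
curl -s http://localhost:8000/metrics | grep network_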
your-registry/network-app:v6
# Use an official Python runtime as a parent image
FROM public.ecr.aws/docker/library/python:3.13-slim
# Set the working directory in the container
WORKDIR /usr/src/app
# Copy the current directory contents into the container at /usr/src/app
COPY . /usr/src/app
# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
# Make ports 8080 and 8000 available to the world outside this container
EXPOSE 8080 8000
# Run app.py when the container launches
CMD ["python", "app.py"]
flask == 3.1.*
prometheus_client == 0.21.*
docker buildx build -t "your-registry/network-app:v6" --platform linux/amd64,linux/arm64 --push .
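Multi-platform builds need a buildx builder that uses the docker-container driver; if you don't have one yet, create and bootstrap it first:
docker buildx create --name multiarch --driver docker-container --use
docker buildx inspect --bootstrap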
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: network-metrics
  namespace: prometheus
  labels:
    release: prometheus # matches the kube-prometheus-stack release so its Prometheus picks up this rule
spec:
  groups:
  - name: network
    rules:
    - record: network_bandwidth_usage
      # Derived from the network_bytes_total counter exported by the sample app
      expr: |
        sum(
          rate(network_bytes_total[5m])
        ) by (pod)
    - record: network_active_requests
      expr: network_active_requests
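Apply the rule so Prometheus starts producing the recorded series:
kubectl apply -f prometheus-rules.yaml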
# prometheus-adapter-values.yaml
prometheus:
  url: http://prometheus-kube-prometheus-prometheus.prometheus.svc.cluster.local
  port: 9090
rules:
  default: false
  custom:
  - seriesQuery: 'network_bandwidth_usage{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace:
          resource: namespace
        pod:
          resource: pod
    name:
      matches: "^(.*)$"
      as: "${1}"
    metricsQuery: <<.Series>>{<<.LabelMatchers>>}
  - seriesQuery: 'network_bytes_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace:
          resource: namespace
        pod:
          resource: pod
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: rate(<<.Series>>{<<.LabelMatchers>>}[1m])
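Prometheus Adapter itself can be installed from the prometheus-community repo added earlier; the release name and namespace below are choices, not requirements. Once it is running, the custom metrics API should list the new metrics:
helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace prometheus \
  --values prometheus-adapter-values.yaml
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'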
The HPA below scales on two custom metrics; a scale-up is triggered when either threshold is exceeded:
- Network bandwidth usage: scales when average usage exceeds 5MB/s per pod.
- Active requests: scales when average active requests exceed 3 per pod.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: network-app-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: network-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: network_bandwidth_usage
        selector:
          matchLabels:
            namespace: default
      target:
        type: AverageValue
        averageValue: "5242880" # 5MB/s (adjust as needed)
  - type: Pods
    pods:
      metric:
        name: network_active_requests
        selector:
          matchLabels:
            namespace: default
      target:
        type: AverageValue
        averageValue: "3" # 3 active requests per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30 # Reduced to be more responsive
      policies:
      - type: Pods
        value: 2 # Scale up by 2 pods at a time
        periodSeconds: 30
      - type: Percent
        value: 50 # Or by 50% of current replicas
        periodSeconds: 30
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300 # 5 minutes to prevent rapid scale down
      policies:
      - type: Pods
        value: 1 # Scale down by 1 pod at a time
        periodSeconds: 60
      - type: Percent
        value: 20 # Or by 20% of current replicas
        periodSeconds: 60
      selectPolicy: Min # Use the more conservative scaling option
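Save the manifest (for example as network-app-hpa.yaml), apply it to replace the earlier basic HPA, and watch scaling decisions as load arrives:
kubectl apply -f network-app-hpa.yaml
kubectl get hpa network-app-hpa -w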
- Exporting custom metrics from the application.
- Using Prometheus to collect and process these metrics.
- Configuring Prometheus Adapter to make the metrics available to Kubernetes.
- Scaling pods onto network-intensive nodes, as defined in the Karpenter NodePool.
- Setting up an HPA to use these custom metrics for scaling decisions.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.