
Methodology for Containerized Openstack Monitoring

Abstract

This document describes one of the possible Containerized Openstack monitoring solutions: a scalable, comprehensive architecture that captures all crucial performance metrics on each layer of the stack.

Containerized Openstack Monitoring Architecture

This part of the documentation describes the required performance metrics in each distinct Containerized Openstack layer.

Containerized Openstack comprises three layers at which the monitoring system should be able to query all necessary counters:
  • OS layer
  • Kubernetes layer
  • Openstack layer

Monitoring instruments must be logically divided into two groups:
  • Monitoring Server Side
  • Node Client Side

Operating System Layer

We used Ubuntu Xenial on bare-metal servers on both the server and the node side.

Baremetal hardware description

We deployed everything in a 200-server environment with the following hardware characteristics:

server:
  vendor,model: HP, DL380 Gen9

CPU:
  vendor,model: Intel, E5-2680 v3
  processor_count: 2
  core_count: 12
  frequency_MHz: 2500

RAM:
  vendor,model: HP, 752369-081
  amount_MB: 262144

NETWORK:
  interface_name: p1p1
  vendor,model: Intel, X710 Dual Port
  bandwidth: 10G

STORAGE:
  dev_name: /dev/sda
  vendor,model: raid10 - HP P840, 12 disks EH0600JEDHE
  SSD/HDD: HDD
  size: 3.6TB

Operating system configuration

Baremetal nodes were provisioned with Cobbler using our in-house preseed scripts. OS versions we used:

Software   Version
Ubuntu     Ubuntu 16.04.1 LTS
Kernel     4.4.0-47-generic

You can find the /etc folder contents from one of the typical systems we used:

etc_tarball <configs/node1.tar.gz>

Required system metrics

At this layer we must track the following list of processes:

  • Mariadb
  • Rabbitmq
  • Keystone
  • Glance
  • Cinder
  • Nova
  • Neutron
  • Openvswitch
  • Kubernetes
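
A quick way to check that these processes are alive on a node is a pgrep loop emitting InfluxDB line protocol, the format Telegraf's exec input consumes. This is a hypothetical sketch (the service-to-process mapping, e.g. beam.smp for Rabbitmq and mysqld for Mariadb, is an assumption); the real polling is done by the list_openstack_processes.sh script referenced later:

#!/bin/bash
# Sketch: count running processes per service (the mapping is an assumption).
for svc in mysqld beam.smp keystone glance cinder nova neutron ovs-vswitchd kubelet; do
  count=$(pgrep -fc "$svc")   # -f matches the full command line, -c counts matches
  echo "os_process,service=${svc} running=${count}i"
done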

And the following list of metrics:

  • Node load average: 1min, 5min, 15min
  • Global process stats: Running, Stopped, Waiting
  • Global CPU usage: Steal, Wait, User, System, Interrupt, Nice, Idle
  • Per-CPU usage: User, System
  • Global memory usage: Bandwidth, Cached, Buffered, Free, Used, Total
  • NUMA monitoring, for each node: Numa_hit, Numa_miss, Numa_foreign, Local_node, Other_node
  • NUMA monitoring, for each PID: Huge, Heap, Stack, Private
  • Global and per-device IOSTAT: Merged reads/s, Merged writes/s, Reads/s, Writes/s, Read transfer, Write transfer, Read latency, Write latency, Queue size, Await
  • Network, per interface: Octets/s (in, out), Packets/s (in, out), Dropped/s
  • Other system metrics: Entropy, DF per device
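
The per-node NUMA counters above can be read directly from sysfs, and the per-PID breakdown matches what numastat reports; a minimal sketch, assuming the numactl package is installed:

# Per-node counters: numa_hit, numa_miss, numa_foreign, local_node, other_node
cat /sys/devices/system/node/node*/numastat
# Per-PID breakdown (Huge, Heap, Stack, Private) for a given process
numastat -p <pid>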

Kubernetes Layer

Kargo from Fuel-CCP-Installer was our main tool to deploy K8S on top of the provisioned systems (monitored nodes).

Kargo sets up Kubernetes in the following way:

  • masters: Calico, Kubernetes API services
  • nodes: Calico, Kubernetes minion services
  • etcd: etcd service

Kargo deployment parameters

You can find the Kargo deployment script in the Kargo deployment script section.

docker_options: "--insecure-registry 172.20.8.35:5000 -D"
upstream_dns_servers: [172.20.8.34, 8.8.4.4]
nameservers: [172.20.8.34, 8.8.4.4]
kube_service_addresses: 10.224.0.0/12
kube_pods_subnet: 10.240.0.0/12
kube_network_node_prefix: 22
kube_apiserver_insecure_bind_address: "0.0.0.0"
dns_replicas: 3
dns_cpu_limit: "100m"
dns_memory_limit: "512Mi"
dns_cpu_requests: "70m"
dns_memory_requests: "70Mi"
deploy_netchecker: false

Software             Version
Fuel-CCP-Installer   6fd81252cb2d2c804f388337aa67d4403700f094
Kargo                2c23027794d7851ee31363c5b6594180741ee923

Required K8S metrics

Here we should get K8S health metrics and ETCD performance metrics:

ETCD performance metrics:
  • Members count / states
  • Number of keys in a cluster
  • Size of data set
  • Avg. latency from leader to followers
  • Bandwidth rate, send/receive
  • Create store success/fail
  • Get success/fail
  • Set success/fail
  • Package rate, send/receive
  • Expire count
  • Update success/fail
  • Compare-and-swap success/fail
  • Watchers
  • Delete success/fail
  • Compare-and-delete success/fail
  • Append requests, send/receive

K8S health metrics:
  • Number of nodes in each state
  • Total number of namespaces
  • Total number of PODs per cluster, node, namespace
  • Total number of services
  • Endpoints in each service
  • Number of API service instances
  • Number of controller instances
  • Number of scheduler instances
  • Cluster resources, scheduler view

K8S API log analysis:
  • Number of responses (per each HTTP code)
  • Response time

For the last two metrics we utilize a log collector to store and parse all log records within the K8S environment.
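
Most of the ETCD counters above, by contrast, are available directly from etcd's v2 stats endpoints; a minimal sketch, assuming the default client port 2379:

# Raft state and leader-to-follower latency (queried on the leader)
curl -s http://127.0.0.1:2379/v2/stats/leader
# This member's send/receive bandwidth and package rates
curl -s http://127.0.0.1:2379/v2/stats/self
# Store statistics: gets/sets/deletes, compare-and-swap, expire count, watchers
curl -s http://127.0.0.1:2379/v2/stats/store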

Openstack Layer

CCP stands for "Containerized Control Plane". CCP aims to build, run and manage production-ready OpenStack containers on top of a Kubernetes cluster.

Software Version
Fuel-CCP 8570d0e0e512bd16f8449f0a10b1e3900fd09b2d

CCP configuration

CCP was deployed on top of a 200-node K8S cluster in the following configuration:

node[1-3]: Kubernetes
node([4-6])$: # 4-6
  roles:
    - controller
    - openvswitch
node[7-9]$: # 7-9
  roles:
    - rabbitmq
node10$: # 10
  roles:
    - galera
node11$: # 11
  roles:
    - heat
node(1[2-9])$: # 12-19
  roles:
    - compute
    - openvswitch
node[2-9][0-9]$: # 20-99
  roles:
    - compute
    - openvswitch
node(1[0-9][0-9])$: # 100-199
  roles:
    - compute
    - openvswitch
node200$:
  roles:
    - backup

CCP Openstack services list (versions.yaml):

openstack/cinder:
  git_ref: stable/newton
  git_url: https://github.com/openstack/cinder.git
openstack/glance:
  git_ref: stable/newton
  git_url: https://github.com/openstack/glance.git
openstack/heat:
  git_ref: stable/newton
  git_url: https://github.com/openstack/heat.git
openstack/horizon:
  git_ref: stable/newton
  git_url: https://github.com/openstack/horizon.git
openstack/keystone:
  git_ref: stable/newton
  git_url: https://github.com/openstack/keystone.git
openstack/neutron:
  git_ref: stable/newton
  git_url: https://github.com/openstack/neutron.git
openstack/nova:
  git_ref: stable/newton
  git_url: https://github.com/openstack/nova.git
openstack/requirements:
  git_ref: stable/newton
  git_url: https://git.openstack.org/openstack/requirements.git
openstack/sahara-dashboard:
  git_ref: stable/newton
  git_url: https://git.openstack.org/openstack/sahara-dashboard.git

K8S Ingress Resources rules were enabled during CCP deployment to expose Openstack service endpoints to the external routable network.

See CCP deployment script and configuration files in the CCP deployment and configuration files section.

At this layer we should collect Openstack environment metrics, API metrics, and resource utilization metrics.

Openstack metrics:
  • Total number of controller nodes
  • Total number of services
  • Total number of compute nodes
  • Total number of nodes
  • Total number of VMs
  • Number of VMs per tenant, per node
  • Resource utilization per project, per service
  • Total number of tenants
  • API request time
  • Mean time to spawn VM

Implementation

This part of the documentation describes the Monitoring System implementation. Here is the software we chose to fulfil the required tasks:

Monitoring Node (Server Side):
  • Metrics server: Prometheus + Grafana
  • Log storage: ElasticSearch + Kibana

Monitored Node (Client Side):
  • Metrics agent: Telegraf
  • Log collector: Heka

Server Side Software

Prometheus

Software Version
Prometheus GitHub 7e369b9318a4d5d97a004586a99f10fa51a46b26

Due to the high load rate we faced an issue with Prometheus performance as the metrics count grew to 15 million, so we split the Prometheus setup into two standalone nodes. The first node polls API metrics from K8S-related services, which are natively available at the /metrics URI and exposed by the K8S API and ETCD API by default. The second node stores all other metrics, which are collected and calculated locally on the environment servers via Telegraf.
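
A quick way to confirm that the natively exposed endpoints answer before pointing Prometheus at them (a sketch; the insecure API port 8080 follows from the kube_apiserver_insecure_bind_address setting above and is an assumption):

# Kubernetes API server metrics, scraped by the first Prometheus node
curl -s http://<master-ip>:8080/metrics | head
# ETCD metrics, exposed on the client port
curl -s http://<etcd-ip>:2379/metrics | head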

Prometheus node deployment scripts and configuration files can be found in the Prometheus deployment and configuration files section.

Grafana

Software Version
Grafana v4.0.1

Grafana was used as the metrics visualizer. A separate dashboard was built for each group of metrics:

  • System nodes metrics
  • Kubernetes metrics
  • ETCD metrics
  • Openstack metrics

You can find their settings in the Grafana dashboards configuration section.

Grafana server deployment script:

#!/bin/bash
ansible-playbook -i ./hosts ./deploy-graf-prom.yaml --tags "grafana"

It uses the same YAML configuration file deploy-graf-prom.yaml from the Prometheus deployment and configuration files section.

ElasticSearch

Software Version
ElasticSearch 2.4.2

ElasticSearch is a well-known, proven log storage, and we used it on a standalone node for collecting Kubernetes API logs and all other container logs across the environment. For appropriate performance in the 200-node lab we increased ES_HEAP_SIZE from the default 1G to 10G in the /etc/default/elasticsearch configuration file.
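
The resulting change amounts to a single line in the environment file:

# /etc/default/elasticsearch (fragment)
ES_HEAP_SIZE=10g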

ElasticSearch and the Kibana dashboard were installed with the deploy_elasticsearch_kibana.sh deployment script.

Kibana

Software Version
Kibana 4.5.4

We used Kibana as the main visualization tool for ElasticSearch; it let us build charts based on K8S API log analysis. Kibana was installed on a separate node with a single dashboard representing the K8S API response time graph.

Dashboard settings:

Kibana_dashboard.json <configs/dashboards/Kibana_dashboard.json>

Client Side Software

Telegraf

Software Version
Telegraf v1.0.0-beta2-235-gbc14ac5 git: openstack_stats bc14ac5b9475a59504b463ad8f82ed810feed3ec

Telegraf was chosen as the client-side metrics agent. It provides multiple ways to poll and calculate metrics from a variety of different sources. Thanks to its plugin-driven nature, it takes data from different inputs and exposes calculated metrics in Prometheus format. We used a forked version of Telegraf with custom patches to be able to use the custom Openstack input plugin.

The following automation scripts and configuration files were used to start the Telegraf agent across the environment nodes:

Telegraf deployment and configuration files

Below you can see which plugins were used to obtain metrics.

Standard plugins:
  • inputs.cpu
  • inputs.disk
  • inputs.diskio
  • inputs.kernel
  • inputs.mem
  • inputs.processes
  • inputs.swap
  • inputs.system
  • inputs.kernel_vmstat
  • inputs.net
  • inputs.netstat
  • inputs.exec
Openstack input plugin

The custom inputs.openstack plugin was used to gather most of the required Openstack-related metrics.

settings:

interval = '40s'
identity_endpoint = "http://keystone.ccp.svc.cluster.local:5000/v3"
domain = "default"
project = "admin"
username = "admin"
password = "password"
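
Before enabling the plugin it is worth sanity-checking the identity endpoint from a monitored node; a hypothetical check (Keystone should answer with its version document):

curl -s http://keystone.ccp.svc.cluster.local:5000/v3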
System.exec plugin

The system.exec plugin was used to trigger scripts that poll and calculate all non-standard metrics; a stripped-down sketch of such a script is shown at the end of this subsection.

common settings:

interval = "15s"
timeout = "30s"
data_format = "influx"

commands:

"/opt/telegraf/bin/list_openstack_processes.sh"
"/opt/telegraf/bin/per_process_cpu_usage.sh"
"/opt/telegraf/bin/numa_stat_per_pid.sh"
"/opt/telegraf/bin/iostat_per_device.sh"
"/opt/telegraf/bin/memory_bandwidth.sh"
"/opt/telegraf/bin/network_tcp_queue.sh"
"/opt/telegraf/bin/etcd_get_metrics.sh"
"/opt/telegraf/bin/k8s_get_metrics.sh"
"/opt/telegraf/bin/vmtime.sh"
"/opt/telegraf/bin/osapitime.sh"

You can see the full Telegraf configuration file and its custom input scripts in the section Telegraf deployment and configuration files.
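
For illustration, here is a stripped-down, hypothetical sketch of what such an exec script can look like. It emits raw cumulative read/write counters from /proc/diskstats in InfluxDB line protocol; the real iostat_per_device.sh listed above also derives the per-second rates and latencies described earlier:

#!/bin/bash
# Sketch of a Telegraf exec-input script (consumed with data_format = "influx").
# Fields 4 and 8 of /proc/diskstats are reads and writes completed.
while read -r _ _ dev reads _ _ _ writes _; do
  # Keep whole disks (sda, vdb, ...); skip partitions and virtual devices.
  [[ $dev =~ ^(sd|vd)[a-z]+$ ]] || continue
  echo "diskstats,device=${dev} reads=${reads}i,writes=${writes}i"
done < /proc/diskstats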

Heka

Software Version
Heka 0.10.0

We chose Heka as the log collecting agent for its wide variety of inputs (including the ability to feed data from the Docker socket), its filters (custom shorthand sandbox filters written in Lua), and its ability to encode data for ElasticSearch.

With the Heka agent running across the environment servers we were able to ship container logs to the ElasticSearch server. With a custom Lua filter we extracted K8S API data and converted it into an appropriate format to visualize API timing counters (average response time).

Heka deployment scripts and the configuration file with the custom Lua filter are in the Heka deployment and configuration section.

Applications

Kargo deployment script

deploy_k8s_using_kargo.sh

configs/deploy_k8s_using_kargo.sh

CCP deployment and configuration files

deploy-ccp.sh

configs/ccp/deploy-ccp.sh

ccp.yaml

configs/ccp/ccp.yaml

configs.yaml

configs/ccp/configs.yaml

topology.yaml

configs/ccp/topology.yaml

repos.yaml

configs/ccp/repos.yaml

versions.yaml

configs/ccp/versions.yaml

Prometheus deployment and configuration files

Deployment scripts

deploy_prometheus.sh

configs/prometheus-grafana-telegraf/deploy_prometheus.sh

deploy-graf-prom.yaml

configs/prometheus-grafana-telegraf/deploy-graf-prom.yaml

docker_prometheus.yaml

configs/prometheus-grafana-telegraf/docker_prometheus.yaml

deploy_etcd_collect.sh

configs/prometheus-grafana-telegraf/deploy_etcd_collect.sh

Configuration files

prometheus-kuber.yml.j2

configs/prometheus-grafana-telegraf/prometheus/prometheus-kuber.yml.j2

prometheus-system.yml.j2

configs/prometheus-grafana-telegraf/prometheus/prometheus-system.yml.j2

targets.yml.j2

configs/prometheus-grafana-telegraf/prometheus/targets.yml.j2

Grafana dashboards configuration

Systems_nodes_statistics.json <configs/dashboards/Systems_nodes_statistics.json>

Kubernetes_statistics.json <configs/dashboards/Kubernetes_statistics.json>

ETCD.json <configs/dashboards/ETCD.json>

OpenStack.json <configs/dashboards/OpenStack.json>

ElasticSearch deployment script

deploy_elasticsearch_kibana.sh

configs/elasticsearch-heka/deploy_elasticsearch_kibana.sh

Telegraf deployment and configuration files

deploy_telegraf.sh

configs/prometheus-grafana-telegraf/deploy_telegraf.sh

deploy-telegraf.yaml

configs/prometheus-grafana-telegraf/deploy-telegraf.yaml

Telegraf system

telegraf-sys.conf

configs/prometheus-grafana-telegraf/telegraf/telegraf-sys.conf

Telegraf openstack

telegraf-openstack.conf.j2

configs/prometheus-grafana-telegraf/telegraf/telegraf-openstack.conf.j2

Telegraf inputs scripts

list_openstack_processes.sh

configs/prometheus-grafana-telegraf/telegraf/scripts/list_openstack_processes.sh

per_process_cpu_usage.sh

configs/prometheus-grafana-telegraf/telegraf/scripts/per_process_cpu_usage.sh

numa_stat_per_pid.sh

configs/prometheus-grafana-telegraf/telegraf/scripts/numa_stat_per_pid.sh

iostat_per_device.sh

configs/prometheus-grafana-telegraf/telegraf/scripts/iostat_per_device.sh

memory_bandwidth.sh

configs/prometheus-grafana-telegraf/telegraf/scripts/memory_bandwidth.sh

network_tcp_queue.sh

configs/prometheus-grafana-telegraf/telegraf/scripts/network_tcp_queue.sh

etcd_get_metrics.sh

configs/prometheus-grafana-telegraf/telegraf/scripts/etcd_get_metrics.sh

k8s_get_metrics.sh

configs/prometheus-grafana-telegraf/telegraf/scripts/k8s_get_metrics.sh

vmtime.sh

configs/prometheus-grafana-telegraf/telegraf/scripts/vmtime.sh

osapitime.sh

configs/prometheus-grafana-telegraf/telegraf/scripts/osapitime.sh

Heka deployment and configuration

Deployment

deploy_heka.sh

configs/elasticsearch-heka/deploy_heka.sh

deploy-heka.yaml

configs/elasticsearch-heka/deploy-heka.yaml

Configuration

00-hekad.toml.j2

configs/elasticsearch-heka/heka/00-hekad.toml.j2

kubeapi_to_int.lua.j2

configs/elasticsearch-heka/heka/kubeapi_to_int.lua.j2