Document the cluster policies

This change also adds the ability to insert SVG drawings in the RST
documentation.

Change-Id: I45127fad6832c81208135af2246dbbaab9257180
(cherry picked from commit 619e700a24)
Simon Pasquier 2015-11-16 16:47:55 +01:00
parent a53053b570
commit c064c17c71
4 changed files with 1226 additions and 5 deletions

1
doc/.gitignore vendored

@ -1 +1,2 @@
build/
images/*.pdf


@ -18,6 +18,12 @@ PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
# SVG to PDF conversion
SVG2PDF = inkscape
SVG2PDF_FLAGS =
# Build a list of SVG files to convert to PDF
PDF_FILES := $(foreach dir, images, $(patsubst %.svg,%.pdf,$(wildcard $(dir)/*.svg)))
.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext
@ -48,6 +54,7 @@ help:
clean:
rm -rf $(BUILDDIR)/*
rm -f $(PDF_FILES)
html:
$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
@ -103,14 +110,14 @@ epub:
@echo
@echo "Build finished. The epub file is in $(BUILDDIR)/epub."
latex:
latex: $(PDF_FILES)
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo
@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
@echo "Run \`make' in that directory to run these through (pdf)latex" \
"(use \`make latexpdf' here to do that automatically)."
latexpdf:
latexpdf: $(PDF_FILES)
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through pdflatex..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf
@ -175,3 +182,10 @@ pseudoxml:
$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
@echo
@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."
# Rule for building the PDF files only
images: $(PDF_FILES)
# Pattern rule for converting SVG to PDF
%.pdf : %.svg
$(SVG2PDF) -f $< -A $@

File diff suppressed because one or more lines are too long



@ -34,6 +34,14 @@ and the *GSE plugins* for Global Status Evaluation plugins.
Both the AFD and GSE plugins in turn create metrics called the *AFD metrics*
and the *GSE metrics* respectively.
.. figure:: ../../images/AFD_and_GSE_message_flow.*
:width: 800
:alt: Message flow for the AFD and GSE metrics
:align: center
Message flow for the AFD and GSE metrics
The *AFD metrics* contain information about the health status of a
resource like a device, a system component like a filesystem, or service
like an API endpoint, at the node level.
@ -49,7 +57,29 @@ The health status of a cluster is inferred by the GSE plugins using
aggregation and correlation rules and facts contained in the
*AFD metrics* it receives from the Collectors.
The *AFD and GSE metrics* are consumed by other groups
In the current version of the LMA Toolchain, three GSE plugins are configured:
* The Service Cluster GSE receives metrics from the AFD plugins monitoring the services and emits health status for the clusters of services (nova-api, nova-scheduler and so on).
* The Node Cluster GSE receives metrics from the AFD plugins monitoring the system and emits health status for the clusters of nodes (controllers, computes and so on).
* The Global Cluster GSE receives metrics from the two other GSE plugins and emits health status for the top-level clusters (Nova, MySQL and so on).
The meaning associated with a health status is the following:
* **Down**: One or several primary functions of a cluster have failed. For example,
the API service for Nova or Cinder isn't accessible.
* **Critical**: One or several primary functions of a
cluster are severely degraded. The quality
of service delivered to the end-user is expected to be severely
impacted.
* **Warning**: One or several primary functions of a
cluster are slightly degraded. The quality
of service delivered to the end-user is expected to be slightly
impacted.
* **Unknown**: There is not enough data to infer the actual
health state of the cluster.
* **Okay**: None of the above was found to be true.
The *AFD and GSE metrics* are also consumed by other groups
of Heka plugins we call the *Persisters*.
* There is a *Persister* for InfluxDB which turns the *GSE metric*
@ -162,7 +192,7 @@ Where:
system mount point. If value is specified as an empty string (""), then the rule
is applied to all the aggregated values for the specified field name like for example
the file system mount point.
If value is specified as the * wildcard character,
If value is specified as the '*' wildcard character,
then the rule is applied to each of the metrics matching the metric name and field name.
For example, the alarm definition sample given above would run the rule
for each of the file system mount points associated with the *fs_space_percent_free* metric.
@ -177,7 +207,7 @@ Where:
| not implemented yet)
| function
| Type: enum(last | min | max | sum | count | avg | median | mode | roc | mww | mww_nonparametric)
| Type: enum('last' | 'min' | 'max' | 'sum' | 'count' | 'avg' | 'median' | 'mode' | 'roc' | 'mww' | 'mww_nonparametric')
| Where:
| last:
| returns the last value of all the values
@ -331,3 +361,123 @@ need to re-apply the Puppet module::
/etc/fuel/plugins/lma_collector-0.8/puppet/manifests/configure_afd_filters.pp
This will restart the LMA Collector with your change.
Cluster policies
----------------
GSE plugins are driven by policies that describe how plugins determine the
cluster's health status.
By default, two policies are defined:
* *highest_severity*: the cluster's status is that of the member with the
highest severity. This is typically used for a cluster of services.
* *majority_of_members*: the cluster is healthy as long as
(N+1)/2 members of the cluster are healthy. This is typically used for
clusters managed by Pacemaker.
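As a rough illustration of the (N+1)/2 figure used by *majority_of_members*, the sketch below computes the number of healthy members it corresponds to for a cluster of N members. The text does not say how the fraction is rounded when N is even, so the helper (a hypothetical name, not part of the LMA Toolchain) exposes both readings rather than picking one.

```python
def majority_threshold(n_members: int, round_up: bool = False) -> int:
    """Return the healthy-member count corresponding to (N+1)/2.

    Assumption: the rounding for even N is not documented, so the caller
    chooses between rounding down (default) and rounding up.
    """
    if n_members < 1:
        raise ValueError("a cluster needs at least one member")
    if round_up:
        # Integer form of ceil((N+1)/2).
        return (n_members + 2) // 2
    # Integer form of floor((N+1)/2).
    return (n_members + 1) // 2


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, majority_threshold(n), majority_threshold(n, round_up=True))
```

For odd N both readings agree (3 members give 2, 5 members give 3); they differ only for even N.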
The GSE policies are defined declaratively in the */etc/hiera/override/gse_filters.yaml*
file at the *gse_policies* entry.
A policy consists of a list of rules which are evaluated against the
current status of the cluster's members. When one of the rules matches, the
cluster's status gets the value associated with the rule and the evaluation
stops here. The last rule of the list is usually a catch-all rule that
defines the default status in case none of the previous rules could be matched.
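The first-match evaluation described above can be sketched in a few lines of Python. This is only an illustration of the ordering semantics, not the GSE plugin code: the function name, the representation of a rule as a ``(status, condition)`` pair, and the use of ``None`` for a catch-all condition are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical representation: a rule is (status, condition), where the
# condition is a predicate over the members' statuses, or None for a
# catch-all rule that always matches.
Condition = Optional[Callable[[Sequence[str]], bool]]
Rule = Tuple[str, Condition]


def evaluate_policy(rules: List[Rule], member_statuses: Sequence[str]) -> Optional[str]:
    """Return the status of the first matching rule, or None if none match."""
    for status, condition in rules:
        if condition is None or condition(member_statuses):
            # Evaluation stops at the first rule that matches.
            return status
    return None


if __name__ == "__main__":
    rules: List[Rule] = [
        ("down", lambda members: "down" in members),
        ("okay", None),  # catch-all rule giving the default status
    ]
    print(evaluate_policy(rules, ["okay", "down"]))
    print(evaluate_policy(rules, ["okay", "okay"]))
```

Because evaluation stops at the first match, the order of the rules in the list is significant: the more severe statuses need to come first.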
A policy rule is defined as shown in the example below::
    # The following rule definition reads as: "the cluster's status is critical if more than 50% of its members are either down or critical"
    - status: critical
      trigger:
        logical_operator: or
        rules:
          - function: percent
            arguments: [ down, critical ]
            relational_operator: '>'
            threshold: 50
Where
| status:
| Type: Enum('down' | 'critical' | 'warning' | 'okay' | 'unknown')
| The cluster's status if the condition is met
| logical_operator
| Type: Enum('and' | '&&' | 'or' | '||')
| The logical operator used to combine the condition rules
| rules
| Type: list
| List of condition rules to execute
| function
| Type: enum('count' | 'percent')
| Where:
| count:
| returns the *number of members* that match the passed value(s).
| percent:
| returns the *percentage of members* that match the passed value(s).
| arguments:
| Type: list of status values
| List of status values passed to the function
| relational_operator:
| Type: Enum('lt' | '<' | 'gt' | '>' | 'lte' | '<=' | 'gte' | '>=')
| The comparison against the threshold
| threshold
| Type: float
| The threshold value
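To make the fields above concrete, here is a minimal Python sketch of how a single *trigger* could be evaluated against the statuses of the cluster's members. It is not the actual GSE implementation: the function names are hypothetical, the trigger is assumed to be the already-parsed YAML structure (a dict), and the behaviour of *percent* for an empty cluster is an assumption.

```python
import operator
from typing import Any, Dict, Sequence

# Both spellings of each relational operator listed above.
RELATIONAL = {
    "lt": operator.lt, "<": operator.lt,
    "gt": operator.gt, ">": operator.gt,
    "lte": operator.le, "<=": operator.le,
    "gte": operator.ge, ">=": operator.ge,
}


def apply_function(name: str, arguments: Sequence[str], members: Sequence[str]) -> float:
    """Compute 'count' or 'percent' of members whose status is in arguments."""
    matching = sum(1 for status in members if status in arguments)
    if name == "count":
        return float(matching)
    if name == "percent":
        # Assumption: an empty cluster yields 0% rather than an error.
        return 100.0 * matching / len(members) if members else 0.0
    raise ValueError("unknown function: %s" % name)


def trigger_matches(trigger: Dict[str, Any], members: Sequence[str]) -> bool:
    """Evaluate every condition rule, then combine with the logical operator."""
    results = [
        RELATIONAL[rule["relational_operator"]](
            apply_function(rule["function"], rule["arguments"], members),
            rule["threshold"],
        )
        for rule in trigger["rules"]
    ]
    if trigger["logical_operator"] in ("and", "&&"):
        return all(results)
    if trigger["logical_operator"] in ("or", "||"):
        return any(results)
    raise ValueError("unknown logical operator: %s" % trigger["logical_operator"])


if __name__ == "__main__":
    # The example rule from this section: more than 50% down or critical.
    trigger = {
        "logical_operator": "or",
        "rules": [{
            "function": "percent",
            "arguments": ["down", "critical"],
            "relational_operator": ">",
            "threshold": 50,
        }],
    }
    print(trigger_matches(trigger, ["down", "critical", "okay"]))
    print(trigger_matches(trigger, ["down", "okay"]))
```

With three members of which two are down or critical, the percentage is about 66.7 and the rule matches; with exactly half of the members down, 50 is not greater than 50 and the rule does not match.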
Let's now take a more detailed look at the policy called *highest_severity*::
    gse_policies:
      highest_severity:
        - status: down
          trigger:
            logical_operator: or
            rules:
              - function: count
                arguments: [ down ]
                relational_operator: '>'
                threshold: 0
        - status: critical
          trigger:
            logical_operator: or
            rules:
              - function: count
                arguments: [ critical ]
                relational_operator: '>'
                threshold: 0
        - status: warning
          trigger:
            logical_operator: or
            rules:
              - function: count
                arguments: [ warning ]
                relational_operator: '>'
                threshold: 0
        - status: okay
          trigger:
            logical_operator: or
            rules:
              - function: count
                arguments: [ okay ]
                relational_operator: '>'
                threshold: 0
        - status: unknown
The policy definition reads as:
* The status of the cluster is *Down* if the status of at least one member of the cluster is *Down*.
* Otherwise the status of the cluster is *Critical* if the status of at least one member of the cluster is *Critical*.
* Otherwise the status of the cluster is *Warning* if the status of at least one member of the cluster is *Warning*.
* Otherwise the status of the cluster is *Okay* if the status of at least one member of the cluster is *Okay*.
* Otherwise the status of the cluster is *Unknown*.
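The reading of *highest_severity* given above amounts to returning the most severe status present among the members. The following Python sketch expresses that reading directly; the function name and the severity list are introduced only for illustration and are not taken from the LMA Collector code.

```python
from typing import Sequence

# Statuses from most to least severe, in the order the policy checks them.
SEVERITY_ORDER = ["down", "critical", "warning", "okay"]


def highest_severity(member_statuses: Sequence[str]) -> str:
    """Return the most severe status present among the members."""
    for status in SEVERITY_ORDER:
        if status in member_statuses:
            # Equivalent to a "count > 0" rule for this status.
            return status
    # Catch-all rule: no member reported a known status.
    return "unknown"


if __name__ == "__main__":
    print(highest_severity(["okay", "warning", "okay"]))
    print(highest_severity(["okay", "down", "critical"]))
    print(highest_severity([]))
```

A cluster with one *warning* member among otherwise *okay* members is reported as *warning*, and a cluster with no members, or with only *unknown* members, falls through to the final catch-all rule.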