Document the cluster policies
This change also adds the ability to insert SVG drawings in the RST
documentation.
Change-Id: I45127fad6832c81208135af2246dbbaab9257180
(cherry picked from commit 619e700a24)
parent
a53053b570
commit
c064c17c71
|
@ -1 +1,2 @@
|
|||
build/
|
||||
images/*.pdf
|
||||
|
|
18
doc/Makefile
|
@ -18,6 +18,12 @@ PAPEROPT_letter = -D latex_paper_size=letter
|
|||
ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
|
||||
# the i18n builder cannot share the environment and doctrees with the others
|
||||
I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
|
||||
# SVG to PDF conversion
|
||||
SVG2PDF = inkscape
|
||||
SVG2PDF_FLAGS =
|
||||
# Build a list of SVG files to convert to PDF
|
||||
PDF_FILES := $(foreach dir, images, $(patsubst %.svg,%.pdf,$(wildcard $(dir)/*.svg)))
|
||||
|
||||
|
||||
.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext
|
||||
|
||||
|
@ -48,6 +54,7 @@ help:
|
|||
|
||||
clean:
|
||||
rm -rf $(BUILDDIR)/*
|
||||
rm -f $(PDF_FILES)
|
||||
|
||||
html:
|
||||
$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
|
||||
|
@ -103,14 +110,14 @@ epub:
|
|||
@echo
|
||||
@echo "Build finished. The epub file is in $(BUILDDIR)/epub."
|
||||
|
||||
latex:
|
||||
latex: $(PDF_FILES)
|
||||
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
|
||||
@echo
|
||||
@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
|
||||
@echo "Run \`make' in that directory to run these through (pdf)latex" \
|
||||
"(use \`make latexpdf' here to do that automatically)."
|
||||
|
||||
latexpdf:
|
||||
latexpdf: $(PDF_FILES)
|
||||
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
|
||||
@echo "Running LaTeX files through pdflatex..."
|
||||
$(MAKE) -C $(BUILDDIR)/latex all-pdf
|
||||
|
@ -175,3 +182,10 @@ pseudoxml:
|
|||
$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
|
||||
@echo
|
||||
@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."
|
||||
|
||||
# Rule for building the PDF files only
|
||||
images: $(PDF_FILES)
|
||||
|
||||
# Pattern rule for converting SVG to PDF
|
||||
%.pdf : %.svg
|
||||
$(SVG2PDF) -f $< -A $@
|
||||
|
|
File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 166 KiB |
|
@ -34,6 +34,14 @@ and the *GSE plugins* for Global Status Evaluation plugins.
|
|||
Both the AFD and GSE plugins in turn create metrics called the *AFD metrics*
|
||||
and the *GSE metrics* respectively.
|
||||
|
||||
|
||||
.. figure:: ../../images/AFD_and_GSE_message_flow.*
|
||||
:width: 800
|
||||
:alt: Message flow for the AFD and GSE metrics
|
||||
:align: center
|
||||
|
||||
Message flow for the AFD and GSE metrics
|
||||
|
||||
The *AFD metrics* contain information about the health status of a
|
||||
resource like a device, a system component like a filesystem, or service
|
||||
like an API endpoint, at the node level.
|
||||
|
@ -49,7 +57,29 @@ The health status of a cluster is inferred by the GSE plugins using
|
|||
aggregation and correlation rules and facts contained in the
|
||||
*AFD metrics* it receives from the Collectors.
|
||||
|
||||
The *AFD and GSE metrics* are consumed by other groups
|
||||
In the current version of the LMA Toolchain, three GSE plugins are configured:
|
||||
|
||||
* The Service Cluster GSE which receives metrics from the AFD plugins monitoring the services and emits health status for the clusters of services (nova-api, nova-scheduler and so on).
|
||||
* The Node Cluster GSE which receives metrics from the AFD plugins monitoring the system and emits health status for the clusters of nodes (controllers, computes and so on).
|
||||
* The Global Cluster GSE which receives metrics from the two other GSE plugins and emits health status for the top-level clusters (Nova, MySQL and so on).
|
||||
|
||||
The meaning associated with a health status is the following:
|
||||
|
||||
* **Down**: One or several primary functions of a cluster have failed. For example,
  the API service for Nova or Cinder isn't accessible.
|
||||
* **Critical**: One or several primary functions of a
|
||||
cluster are severely degraded. The quality
|
||||
of service delivered to the end-user should be severely
|
||||
impacted.
|
||||
* **Warning**: One or several primary functions of the
|
||||
cluster are slightly degraded. The quality
|
||||
of service delivered to the end-user should be slightly
|
||||
impacted.
|
||||
* **Unknown**: There is not enough data to infer the actual
|
||||
health state of the cluster.
|
||||
* **Okay**: None of the above was found to be true.
|
||||
|
||||
The *AFD and GSE metrics* are also consumed by other groups
|
||||
of Heka plugins we call the *Persisters*.
|
||||
|
||||
* There is a *Persister* for InfluxDB which turns the *GSE metric*
|
||||
|
@ -162,7 +192,7 @@ Where:
|
|||
system mount point. If value is specified as an empty string (""), then the rule
|
||||
is applied to all the aggregated values for the specified field name, for example
|
||||
the file system mount point.
|
||||
If value is specified as the ‘*’ wildcard character,
|
||||
If value is specified as the '*' wildcard character,
|
||||
then the rule is applied to each of the metrics matching the metric name and field name.
|
||||
For example, the alarm definition sample given above would run the rule
|
||||
for each of the file system mount points associated with the *fs_space_percent_free* metric.
|
||||
|
@ -177,7 +207,7 @@ Where:
|
|||
| not implemented yet)
|
||||
|
||||
| function
|
||||
| Type: enum(‘last’ | ‘min’ | ‘max’ | ‘sum’ | ‘count’ | ‘avg’ | ‘median’ | ‘mode’ | ‘roc’ | ‘mww’ | ‘mww_nonparametric’)
|
||||
| Type: enum('last' | 'min' | 'max' | 'sum' | 'count' | 'avg' | 'median' | 'mode' | 'roc' | 'mww' | 'mww_nonparametric')
|
||||
| Where:
|
||||
| last:
|
||||
| returns the last value of all the values
|
||||
|
@ -331,3 +361,123 @@ need to re-apply the Puppet module::
|
|||
/etc/fuel/plugins/lma_collector-0.8/puppet/manifests/configure_afd_filters.pp
|
||||
|
||||
This will restart the LMA Collector with your change.
|
||||
|
||||
Cluster policies
|
||||
----------------
|
||||
|
||||
GSE plugins are driven by policies that describe how plugins determine the
|
||||
cluster's health status.
|
||||
|
||||
By default, two policies are defined:
|
||||
|
||||
* *highest_severity*: the cluster's status is that of the member with the
  highest severity. This is typically used for a cluster of services.
* *majority_of_members*: the cluster is healthy as long as at least
  (N+1)/2 members of the cluster are healthy. This is typically used for
  clusters managed by Pacemaker.
|
||||
|
||||
The GSE policies are defined declaratively in the */etc/hiera/override/gse_filters.yaml*
|
||||
file at the *gse_policies* entry.
|
||||
|
||||
A policy consists of a list of rules which are evaluated against the
|
||||
current status of the cluster's members. When one of the rules matches, the
|
||||
cluster's status gets the value associated with the rule and the evaluation
|
||||
stops here. The last rule of the list is usually a catch-all rule that
|
||||
defines the default status in case none of the previous rules could be matched.
|
||||
|
||||
A policy rule is defined as shown in the example below::
|
||||
|
||||
    # The following rule definition reads as: "the cluster's status is critical if more than 50% of its members are either down or critical"
    - status: critical
      trigger:
        logical_operator: or
        rules:
          - function: percent
            arguments: [ down, critical ]
            relational_operator: '>'
            threshold: 50
|
||||
|
||||
Where:
|
||||
|
||||
| status:
|
||||
| Type: Enum(down, critical, warning, okay, unknown)
|
||||
| The cluster's status if the condition is met
|
||||
|
||||
| logical_operator
|
||||
| Type: Enum('and' | '&&' | 'or' | '||')
|
||||
| The conjunction relation for the condition rules
|
||||
|
||||
| rules
|
||||
| Type: list
|
||||
| List of condition rules to execute
|
||||
|
||||
| function
|
||||
| Type: enum('count' | 'percent')
|
||||
| Where:
|
||||
| count:
|
||||
| returns the *number of members* that match the passed value(s).
|
||||
| percent:
|
||||
| returns the *percentage of members* that match the passed value(s).
|
||||
|
||||
| arguments:
|
||||
| Type: list of status values
|
||||
| List of status values passed to the function
|
||||
|
||||
| relational_operator:
|
||||
| Type: Enum('lt' | '<' | 'gt' | '>' | 'lte' | '<=' | 'gte' | '>=')
|
||||
| The comparison against the threshold
|
||||
|
||||
| threshold
|
||||
| Type: float
|
||||
| The threshold value
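
As an illustration only, the fields described above could be evaluated along the following lines. This is a minimal Python sketch and not the GSE plugins' actual implementation: the names `RELATIONAL`, `apply_function` and `trigger_matches` are hypothetical, and treating an empty member list as 0% is an assumption.

```python
import operator

# Hypothetical lookup table mirroring the documented relational_operator values.
RELATIONAL = {
    "lt": operator.lt, "<": operator.lt,
    "gt": operator.gt, ">": operator.gt,
    "lte": operator.le, "<=": operator.le,
    "gte": operator.ge, ">=": operator.ge,
}


def apply_function(name, arguments, member_statuses):
    """Return the number or percentage of members whose status is in arguments."""
    matching = sum(1 for status in member_statuses if status in arguments)
    if name == "count":
        return matching
    if name == "percent":
        # Assumption: an empty cluster yields 0% rather than an error.
        if not member_statuses:
            return 0.0
        return 100.0 * matching / len(member_statuses)
    raise ValueError("unsupported function: %s" % name)


def trigger_matches(trigger, member_statuses):
    """Evaluate every condition rule and combine them with the logical operator."""
    results = [
        RELATIONAL[rule["relational_operator"]](
            apply_function(rule["function"], rule["arguments"], member_statuses),
            rule["threshold"],
        )
        for rule in trigger["rules"]
    ]
    if trigger["logical_operator"] in ("and", "&&"):
        return all(results)
    return any(results)


# The trigger of the sample rule: more than 50% of members down or critical.
trigger = {
    "logical_operator": "or",
    "rules": [
        {
            "function": "percent",
            "arguments": ["down", "critical"],
            "relational_operator": ">",
            "threshold": 50,
        }
    ],
}

print(trigger_matches(trigger, ["down", "critical", "okay"]))  # 2 of 3 members
print(trigger_matches(trigger, ["down", "okay"]))  # exactly 50%, not above it
```

With the sample rule, two failing members out of three exceed the threshold, whereas one out of two does not, because the operator is a strict `>`.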
|
||||
|
||||
Let's now take a more detailed look at the policy called *highest_severity*::
|
||||
|
||||
    gse_policies:

      highest_severity:
        - status: down
          trigger:
            logical_operator: or
            rules:
              - function: count
                arguments: [ down ]
                relational_operator: '>'
                threshold: 0
        - status: critical
          trigger:
            logical_operator: or
            rules:
              - function: count
                arguments: [ critical ]
                relational_operator: '>'
                threshold: 0
        - status: warning
          trigger:
            logical_operator: or
            rules:
              - function: count
                arguments: [ warning ]
                relational_operator: '>'
                threshold: 0
        - status: okay
          trigger:
            logical_operator: or
            rules:
              - function: count
                arguments: [ okay ]
                relational_operator: '>'
                threshold: 0
        - status: unknown
|
||||
|
||||
The policy definition reads as:
|
||||
|
||||
* The status of the cluster is *Down* if the status of at least one of the cluster's members is *Down*.

* Otherwise the status of the cluster is *Critical* if the status of at least one of the cluster's members is *Critical*.

* Otherwise the status of the cluster is *Warning* if the status of at least one of the cluster's members is *Warning*.

* Otherwise the status of the cluster is *Okay* if the status of at least one of the cluster's members is *Okay*.
|
||||
|
||||
* Otherwise the status of the cluster is *Unknown*.
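
The first-match evaluation described in this section can be sketched in Python as follows. This is a hedged illustration, not the code the GSE plugins run: `OPS`, `rule_holds`, `evaluate_policy`, `severity_entry` and `HIGHEST_SEVERITY` are hypothetical names, only a subset of the relational operators is listed, and treating an entry without a `trigger` as a catch-all that always matches is an assumption based on the final `- status: unknown` entry.

```python
import operator

# Hypothetical subset of the documented relational operators.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def rule_holds(rule, statuses):
    """Apply 'count' or 'percent' to the member statuses and compare to the threshold."""
    matching = sum(1 for status in statuses if status in rule["arguments"])
    if rule["function"] == "percent":
        value = 100.0 * matching / len(statuses) if statuses else 0.0
    else:
        value = matching
    return OPS[rule["relational_operator"]](value, rule["threshold"])


def evaluate_policy(policy, statuses):
    """Return the status of the first entry whose trigger matches."""
    for entry in policy:
        trigger = entry.get("trigger")
        if trigger is None:
            # Assumption: an entry without a trigger is the catch-all default.
            return entry["status"]
        combine = all if trigger["logical_operator"] in ("and", "&&") else any
        if combine(rule_holds(rule, statuses) for rule in trigger["rules"]):
            return entry["status"]
    return "unknown"


def severity_entry(status):
    """Build one 'at least one member has this status' entry of the policy."""
    return {
        "status": status,
        "trigger": {
            "logical_operator": "or",
            "rules": [
                {
                    "function": "count",
                    "arguments": [status],
                    "relational_operator": ">",
                    "threshold": 0,
                }
            ],
        },
    }


# In-memory equivalent of the highest_severity policy shown above.
HIGHEST_SEVERITY = [
    severity_entry(s) for s in ("down", "critical", "warning", "okay")
] + [{"status": "unknown"}]

print(evaluate_policy(HIGHEST_SEVERITY, ["okay", "warning", "okay"]))   # warning
print(evaluate_policy(HIGHEST_SEVERITY, ["okay", "critical", "down"]))  # down
print(evaluate_policy(HIGHEST_SEVERITY, ["unknown", "unknown"]))        # unknown
```

Because the entries are ordered from most to least severe and evaluation stops at the first match, the most severe member status present in the cluster determines the result.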
|
||||
|
|