Merge "Foundation for LMA docs"

This commit is contained in:
Zuul 2019-01-14 08:10:03 +00:00 committed by Gerrit Code Review
commit 1509383894
10 changed files with 1369 additions and 1 deletions


@ -8,7 +8,9 @@ Contents:
install/index
testing/index
monitoring/index
logging/index
readme
Indices and Tables
==================


@ -0,0 +1,196 @@
Elasticsearch
=============
The Elasticsearch chart in openstack-helm-infra provides a distributed data
store to index and analyze logs generated from the OpenStack-Helm services.
The chart contains templates for:
- Elasticsearch client nodes
- Elasticsearch data nodes
- Elasticsearch master nodes
- An Elasticsearch exporter for providing cluster metrics to Prometheus
- A cronjob for Elastic Curator to manage data indices
Authentication
--------------
The Elasticsearch deployment includes a sidecar container that runs an Apache
reverse proxy to add authentication capabilities for Elasticsearch. The
username and password are configured under the Elasticsearch entry in the
endpoints section of the chart's values.yaml.
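As a rough illustration, an override for these credentials might look like the
following; the admin sub-key and the values shown are assumptions based on the
common OpenStack-Helm endpoints layout and should be checked against the
chart's values.yaml:
::
endpoints:
  elasticsearch:
    auth:
      admin:
        # Illustrative credentials; replace with site-specific values
        username: admin
        password: changeme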
The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.
Elasticsearch Service Configuration
-----------------------------------
The Elasticsearch service configuration file can be modified with a combination
of pod environment variables and entries in the values.yaml file. Elasticsearch
does not require much configuration out of the box, and the default values for
these settings are meant to provide a highly available cluster.
The vital entries in this configuration file are:
- path.data: The path at which to store the indexed data
- path.repo: The location of any snapshot repositories to backup indexes
- bootstrap.memory_lock: Ensures none of the JVM is swapped to disk
- discovery.zen.minimum_master_nodes: Minimum required masters for the cluster
The bootstrap.memory_lock entry ensures none of the JVM will be swapped to disk
during execution, and setting this value to false will negatively affect the
health of your Elasticsearch nodes. The discovery.zen.minimum_master_nodes flag
sets the minimum number of masters required for your Elasticsearch cluster to
be considered healthy and functional.
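As a rough sketch, the rendered elasticsearch.yml might contain entries along
these lines; the values shown are placeholders for illustration, not the
chart's defaults:
::
# Illustrative elasticsearch.yml entries; the values are placeholders
path.data: /usr/share/elasticsearch/data
path.repo: /usr/share/elasticsearch/repo
bootstrap.memory_lock: true
discovery.zen.minimum_master_nodes: 2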
To read more about Elasticsearch's configuration file, please see the official
documentation_.
.. _documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/important-settings.html
Elastic Curator
---------------
The Elasticsearch chart contains a cronjob to run Elastic Curator at specified
intervals to manage the lifecycle of your indices. Curator can:
- Take and send a snapshot of your indexes to a specified snapshot repository
- Delete indexes older than a specified length of time
- Restore indexes with previous index snapshots
- Reindex an index into a new or preexisting index
The full list of supported Curator actions can be found in the actions_ section of
the official Curator documentation. The list of options available for those
actions can be found in the options_ section of the Curator documentation.
.. _actions: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/actions.html
.. _options: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/options.html
Curator's configuration is handled via entries in Elasticsearch's values.yaml
file and must be overridden to achieve your index lifecycle management
needs. Please note that any unused field should be left blank: an entry of
"None" will result in an exception, as Curator will read it as a Python
NoneType instead of a value of None.
The section for Curator's service configuration can be found at:
::
conf:
  curator:
    config:
      client:
        hosts:
          - elasticsearch-logging
        port: 9200
        url_prefix:
        use_ssl: False
        certificate:
        client_cert:
        client_key:
        ssl_no_validate: False
        http_auth:
        timeout: 30
        master_only: False
      logging:
        loglevel: INFO
        logfile:
        logformat: default
        blacklist: ['elasticsearch', 'urllib3']
Curator's actions are configured in the following section:
::
conf:
  curator:
    action_file:
      actions:
        1:
          action: delete_indices
          description: "Clean up ES by deleting old indices"
          options:
            timeout_override:
            continue_if_exception: False
            ignore_empty_list: True
            disable_action: True
          filters:
            - filtertype: age
              source: name
              direction: older
              timestring: '%Y.%m.%d'
              unit: days
              unit_count: 30
              field:
              stats_result:
              epoch:
              exclude: False
The Elasticsearch chart contains commented example actions for deleting and
snapshotting indexes older than 30 days. Please note these actions are provided
as a reference and are disabled by default to avoid any unexpected behavior
against your indexes.
Elasticsearch Exporter
----------------------
The Elasticsearch chart contains templates for an exporter to provide metrics
for Prometheus. These metrics provide insight into the performance and overall
health of your Elasticsearch cluster. Please note monitoring for Elasticsearch
is disabled by default, and must be enabled with the following override:
::
monitoring:
  prometheus:
    enabled: true
The Elasticsearch exporter uses the same service annotations as the other
exporters, and no additional configuration is required for Prometheus to target
the Elasticsearch exporter for scraping. The Elasticsearch exporter is
configured with command line flags, and the flags' default values can be found
under the following key in the values.yaml file:
::
conf:
  prometheus_elasticsearch_exporter:
    es:
      all: true
      timeout: 20s
The configuration keys configure the following behaviors:
- es.all: Gather information from all nodes, not just the connecting node
- es.timeout: Timeout for metrics queries
More information about the Elasticsearch exporter can be found on the exporter's
GitHub_ page.
.. _GitHub: https://github.com/justwatchcom/elasticsearch_exporter
Snapshot Repositories
---------------------
Before Curator can store snapshots in a specified repository, Elasticsearch must
register the configured repository. To achieve this, the Elasticsearch chart
contains a job for registering an S3 snapshot repository backed by RADOS Gateway.
This job is disabled by default, as the Curator actions for snapshots are
disabled by default. To enable the snapshot job, the
conf.elasticsearch.snapshots.enabled flag must be set to true. The following
configuration keys are relevant:
- conf.elasticsearch.snapshots.enabled: Enable snapshot repositories
- conf.elasticsearch.snapshots.bucket: Name of the RGW s3 bucket to use
- conf.elasticsearch.snapshots.repositories: Name of repositories to create
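A hedged example override enabling the snapshot job is shown below; the bucket
and repository names are illustrative only, and the exact shape of the
repositories entry should be confirmed against the chart's values.yaml:
::
conf:
  elasticsearch:
    snapshots:
      enabled: true
      # Illustrative names only
      bucket: elasticsearch-snapshots
      repositories:
        - logstash_snapshots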
More information about Elasticsearch repositories can be found in the official
Elasticsearch snapshot_ documentation:
.. _snapshot: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html#_repositories


@ -0,0 +1,279 @@
Fluent-logging
===============
The fluent-logging chart in openstack-helm-infra provides the base for a
centralized logging platform for OpenStack-Helm. The chart combines two
services, Fluentbit and Fluentd, to gather logs generated by the services,
filter on or add metadata to logged events, then forward them to Elasticsearch
for indexing.
Fluentbit
---------
Fluentbit runs as a log-collecting component on each host in the cluster, and
can be configured to target specific log locations on the host. The Fluentbit_
configuration schema can be found on the official Fluentbit website.
.. _Fluentbit: http://fluentbit.io/documentation/0.12/configuration/schema.html
Fluentbit provides a set of plug-ins for ingesting and filtering various log
types. These plug-ins include:
- Tail: Tails a defined file for logged events
- Kube: Adds Kubernetes metadata to a logged event
- Systemd: Provides ability to collect logs from the journald daemon
- Syslog: Provides the ability to collect logs from a Unix socket (TCP or UDP)
The complete list of plugins can be found in the configuration_ section of the
Fluentbit documentation.
.. _configuration: http://fluentbit.io/documentation/current/configuration/
Fluentbit uses parsers to turn unstructured log entries into structured entries
to make processing and filtering events easier. The two formats supported are
JSON maps and regular expressions. More information about Fluentbit's parsing
abilities can be found in the parsers_ section of Fluentbit's documentation.
.. _parsers: http://fluentbit.io/documentation/current/parser/
Fluentbit's service and parser configurations are defined via the values.yaml
file, which allows for custom definitions of inputs, filters and outputs for
your logging needs.
Fluentbit's configuration can be found under the following key:
::
conf:
  fluentbit:
    - service:
        header: service
        Flush: 1
        Daemon: Off
        Log_Level: info
        Parsers_File: parsers.conf
    - containers_tail:
        header: input
        Name: tail
        Tag: kube.*
        Path: /var/log/containers/*.log
        Parser: docker
        DB: /var/log/flb_kube.db
        Mem_Buf_Limit: 5MB
    - kube_filter:
        header: filter
        Name: kubernetes
        Match: kube.*
        Merge_JSON_Log: On
    - fluentd_output:
        header: output
        Name: forward
        Match: "*"
        Host: ${FLUENTD_HOST}
        Port: ${FLUENTD_PORT}
Fluentbit is configured by default to capture logs at the info log level. To
change this, override the Log_Level key with the appropriate levels, which are
documented in Fluentbit's configuration_.
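For example, an override raising the verbosity to debug could restate the
service entry as follows; only the Log_Level value differs from the defaults
shown above, and because the fluentbit value is a list, the remaining list
items should be carried over unchanged in the override:
::
conf:
  fluentbit:
    - service:
        header: service
        Flush: 1
        Daemon: Off
        # Changed from the default of info
        Log_Level: debug
        Parsers_File: parsers.conf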
Fluentbit's parser configuration can be found under the following key:
::
conf:
  parsers:
    - docker:
        header: parser
        Name: docker
        Format: json
        Time_Key: time
        Time_Format: "%Y-%m-%dT%H:%M:%S.%L"
        Time_Keep: On
The values for the fluentbit and parsers keys are consumed by a fluent-logging
helper template that produces the appropriate configurations for the relevant
sections. Each list item (keys prefixed with a '-') represents a section in the
configuration files, and the arbitrary name of the list item should represent a
logical description of the section defined. The header key represents the type
of definition (filter, input, output, service or parser), and the remaining
entries will be rendered as space delimited configuration keys and values. For
example, the definitions above would result in the following:
::
[SERVICE]
Daemon false
Flush 1
Log_Level info
Parsers_File parsers.conf
[INPUT]
DB /var/log/flb_kube.db
Mem_Buf_Limit 5MB
Name tail
Parser docker
Path /var/log/containers/*.log
Tag kube.*
[FILTER]
Match kube.*
Merge_JSON_Log true
Name kubernetes
[OUTPUT]
Host ${FLUENTD_HOST}
Match *
Name forward
Port ${FLUENTD_PORT}
[PARSER]
Format json
Name docker
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep true
Time_Key time
Fluentd
-------
Fluentd runs as a forwarding service that receives event entries from Fluentbit
and routes them to the appropriate destination. By default, Fluentd will route
all entries received from Fluentbit to Elasticsearch for indexing. The
Fluentd_ configuration schema can be found at the official Fluentd website.
.. _Fluentd: https://docs.fluentd.org/v0.12/articles/config-file
Fluentd's configuration is handled in the values.yaml file in fluent-logging.
Similar to Fluentbit, configuration overrides provide flexibility in defining
custom routes for tagged log events. The configuration can be found under the
following key:
::
conf:
  fluentd:
    - fluentbit_forward:
        header: source
        type: forward
        port: "#{ENV['FLUENTD_PORT']}"
        bind: 0.0.0.0
    - elasticsearch:
        header: match
        type: elasticsearch
        expression: "**"
        include_tag_key: true
        host: "#{ENV['ELASTICSEARCH_HOST']}"
        port: "#{ENV['ELASTICSEARCH_PORT']}"
        logstash_format: true
        buffer_chunk_limit: 10M
        buffer_queue_limit: 32
        flush_interval: "20"
        max_retry_wait: 300
        disable_retry_limit: ""
The values for the fluentd keys are consumed by a fluent-logging helper template
that produces appropriate configurations for each directive desired. The list
items (keys prefixed with a '-') represent sections in the configuration file,
and the name of each list item should represent a logical description of the
section defined. The header key represents the type of definition (name of the
fluentd plug-in used), and the expression key is used when the plug-in requires
a pattern to match against (example: matches on certain input patterns). The
remaining entries will be rendered as space delimited configuration keys and
values. For example, the definition above would result in the following:
::
<source>
bind 0.0.0.0
port "#{ENV['FLUENTD_PORT']}"
@type forward
</source>
<match **>
buffer_chunk_limit 10M
buffer_queue_limit 32
disable_retry_limit
flush_interval 20s
host "#{ENV['ELASTICSEARCH_HOST']}"
include_tag_key true
logstash_format true
max_retry_wait 300
port "#{ENV['ELASTICSEARCH_PORT']}"
@type elasticsearch
</match>
Some fluentd plug-ins require nested definitions. The fluentd helper template
can handle these definitions with the following structure:
::
conf:
  td_agent:
    - fluentbit_forward:
        header: source
        type: forward
        port: "#{ENV['FLUENTD_PORT']}"
        bind: 0.0.0.0
    - log_transformer:
        header: filter
        type: record_transformer
        expression: "foo.bar"
        inner_def:
          - record_transformer:
              header: record
              hostname: my_host
              tag: my_tag
In this example, the inner_def list will generate a nested configuration entry
in the log_transformer section. Nested definitions are handled by supplying a
list as the value for an arbitrary key, which indicates the entry should be
rendered as a nested definition. The helper template will render the above
example key/value pairs as the following:
::
<source>
bind 0.0.0.0
port "#{ENV['FLUENTD_PORT']}"
@type forward
</source>
<filter foo.bar>
<record>
hostname my_host
tag my_tag
</record>
@type record_transformer
</filter>
Fluentd Exporter
----------------------
The fluent-logging chart contains templates for an exporter to provide metrics
for Fluentd. These metrics provide insight into Fluentd's performance. Please
note monitoring for Fluentd is disabled by default, and must be enabled with the
following override:
::
monitoring:
  prometheus:
    enabled: true
The Fluentd exporter uses the same service annotations as the other exporters,
and no additional configuration is required for Prometheus to target the
Fluentd exporter for scraping. The Fluentd exporter is configured with command
line flags, and the flags' default values can be found under the following key
in the values.yaml file:
::
conf:
  fluentd_exporter:
    log:
      format: "logger:stdout?json=true"
      level: "info"
The configuration keys configure the following behaviors:
- log.format: Define the logger used and format of the output
- log.level: Log level for the exporter to use
More information about the Fluentd exporter can be found on the exporter's
GitHub_ page.
.. _GitHub: https://github.com/V3ckt0r/fluentd_exporter


@ -0,0 +1,11 @@
OpenStack-Helm Logging
======================
Contents:
.. toctree::
:maxdepth: 2
elasticsearch
fluent-logging
kibana


@ -0,0 +1,76 @@
Kibana
======
The Kibana chart in OpenStack-Helm Infra provides visualization for logs indexed
into Elasticsearch. These visualizations provide the means to view logs captured
from services deployed in the cluster and targeted for collection by Fluentbit.
Authentication
--------------
The Kibana deployment includes a sidecar container that runs an Apache reverse
proxy to add authentication capabilities for Kibana. The username and password
are configured under the Kibana entry in the endpoints section of the chart's
values.yaml.
The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.
Configuration
-------------
Kibana's configuration is driven by the chart's values.yaml file. The configuration
options are found under the following keys:
::
conf:
  elasticsearch:
    pingTimeout: 1500
    preserveHost: true
    requestTimeout: 30000
    shardTimeout: 0
    startupTimeout: 5000
  il8n:
    defaultLocale: en
  kibana:
    defaultAppId: discover
    index: .kibana
  logging:
    quiet: false
    silent: false
    verbose: false
  ops:
    interval: 5000
  server:
    host: localhost
    maxPayloadBytes: 1048576
    port: 5601
    ssl:
      enabled: false
The case of the sub-keys is important as these values are injected into
Kibana's configuration configmap with the toYaml function. More information on
the configuration options and available settings can be found in the official
Kibana documentation_.
.. _documentation: https://www.elastic.co/guide/en/kibana/current/settings.html
Installation
------------
.. code-block:: bash
helm install --namespace=<namespace> local/kibana --name=kibana
Setting Time Field
------------------
For Kibana to successfully read the logs from Elasticsearch's indexes, the time
field will need to be manually set after Kibana has successfully deployed. Upon
visiting the Kibana dashboard for the first time, a prompt will appear to choose
the time field with a drop-down menu. The default time field for Elasticsearch
indexes is '@timestamp'. Once this field is selected, the default view for
querying log entries can be found by selecting the "Discover" tab.


@ -0,0 +1,89 @@
Grafana
=======
The Grafana chart in OpenStack-Helm Infra provides default dashboards for the
metrics gathered with Prometheus. The default dashboards include visualizations
for metrics on: Ceph, Kubernetes, nodes, containers, MySQL, RabbitMQ, and
OpenStack.
Configuration
-------------
Grafana
~~~~~~~
Grafana's configuration is driven with the chart's values.yaml file, and the
relevant configuration entries are under the following key:
::
conf:
  grafana:
    paths:
    server:
    database:
    session:
    security:
    users:
    log:
    log.console:
    dashboards.json:
    grafana_net:
These keys correspond to sections in the grafana.ini configuration file, and the
to_ini helm-toolkit function will render these values into the appropriate
format in grafana.ini. The list of options for these keys can be found in the
official Grafana configuration_ documentation.
.. _configuration: http://docs.grafana.org/installation/configuration/
Prometheus Data Source
~~~~~~~~~~~~~~~~~~~~~~
Grafana requires configured data sources for gathering metrics for display in
its dashboards. The configuration options for datasources are found under the
following key in Grafana's values.yaml file:
::
conf:
  provisioning:
    datasources:
      monitoring:
        name: prometheus
        type: prometheus
        access: proxy
        orgId: 1
        editable: true
        basicAuth: true
The Grafana chart will use the keys under each entry beneath
.conf.provisioning.datasources as inputs to a helper template that will render
the appropriate configuration for the data source. The key for each data source
(monitoring in the above example) should map to an entry in the endpoints
section in the chart's values.yaml, as the data source's URL and authentication
credentials will be populated by the values defined in the corresponding
endpoint entry. More information about Grafana data sources can be found in the
official sources_ documentation.
.. _sources: http://docs.grafana.org/features/datasources/
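As a rough sketch, the matching endpoint entry might resemble the following;
the keys shown follow the common OpenStack-Helm endpoints layout and are
assumptions that should be verified against the chart's values.yaml:
::
endpoints:
  monitoring:
    name: prometheus
    auth:
      admin:
        # Illustrative credentials
        username: admin
        password: changeme
    hosts:
      default: prom-metrics
    scheme:
      default: http
    port:
      api:
        default: 9090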
Dashboards
~~~~~~~~~~
Grafana adds dashboards during installation with dashboards defined in YAML under
the following key:
::
conf:
  dashboards:
These YAML definitions are transformed to JSON and added to Grafana's
configuration configmap, then mounted to the Grafana pods dynamically, allowing
for flexibility in defining and adding custom dashboards to Grafana. Dashboards
can be added by inserting a new key along with a YAML dashboard definition as
the value. Additional dashboards can be found by searching on Grafana's
dashboards_ page, or you can define your own. A JSON-to-YAML tool, such as
json2yaml_, will help transform any custom or new dashboards from JSON to YAML.
.. _dashboards: https://grafana.com/dashboards
.. _json2yaml: https://www.json2yaml.com/
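For instance, a minimal custom dashboard could be added with an override of the
following shape; the my_custom_dashboard key and the dashboard body are
hypothetical, and a real dashboard exported from Grafana will contain many more
fields:
::
conf:
  dashboards:
    # Hypothetical dashboard key and minimal definition
    my_custom_dashboard:
      title: My Custom Dashboard
      rows:
        - title: Example Row
          panels:
            - title: Node Load Average
              type: graph
              targets:
                - expr: node_load1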


@ -0,0 +1,11 @@
OpenStack-Helm Monitoring
=========================
Contents:
.. toctree::
:maxdepth: 2
grafana
prometheus
nagios


@ -0,0 +1,365 @@
Nagios
======
The Nagios chart in openstack-helm-infra can be used to provide an alerting
service that is tightly coupled to an OpenStack-Helm deployment. The Nagios
chart uses a custom Nagios core image that includes plugins developed to query
Prometheus directly for scraped metrics and triggered alarms, to query the Ceph
manager endpoints directly to determine the health of a Ceph cluster, and to
query Elasticsearch for logged events that meet certain criteria (experimental).
Authentication
--------------
The Nagios deployment includes a sidecar container that runs an Apache reverse
proxy to add authentication capabilities for Nagios. The username and password
are configured under the nagios entry in the endpoints section of the chart's
values.yaml.
The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.
Image Plugins
-------------
The Nagios image used contains custom plugins that can be used for the defined
service check commands. These plugins include:
- check_prometheus_metric.py: Query Prometheus for a specific metric and value
- check_exporter_health_metric.sh: Nagios plugin to query prometheus exporter
- check_rest_get_api.py: Check REST API status
- check_update_prometheus_hosts.py: Queries Prometheus, updates Nagios config
- query_prometheus_alerts.py: Nagios plugin to query prometheus ALERTS metric
More information about the Nagios image and plugins can be found here_.
.. _here: https://github.com/att-comdev/nagios
Nagios Service Configuration
----------------------------
The Nagios service is configured via the following section in the chart's
values file:
::
conf:
nagios:
nagios:
log_file: /opt/nagios/var/log/nagios.log
cfg_file:
- /opt/nagios/etc/nagios_objects.cfg
- /opt/nagios/etc/objects/commands.cfg
- /opt/nagios/etc/objects/contacts.cfg
- /opt/nagios/etc/objects/timeperiods.cfg
- /opt/nagios/etc/objects/templates.cfg
- /opt/nagios/etc/objects/prometheus_discovery_objects.cfg
object_cache_file: /opt/nagios/var/objects.cache
precached_object_file: /opt/nagios/var/objects.precache
resource_file: /opt/nagios/etc/resource.cfg
status_file: /opt/nagios/var/status.dat
status_update_interval: 10
nagios_user: nagios
nagios_group: nagios
check_external_commands: 1
command_file: /opt/nagios/var/rw/nagios.cmd
lock_file: /var/run/nagios.lock
temp_file: /opt/nagios/var/nagios.tmp
temp_path: /tmp
event_broker_options: -1
log_rotation_method: d
log_archive_path: /opt/nagios/var/log/archives
use_syslog: 1
log_service_retries: 1
log_host_retries: 1
log_event_handlers: 1
log_initial_states: 0
log_current_states: 1
log_external_commands: 1
log_passive_checks: 1
service_inter_check_delay_method: s
max_service_check_spread: 30
service_interleave_factor: s
host_inter_check_delay_method: s
max_host_check_spread: 30
max_concurrent_checks: 60
check_result_reaper_frequency: 10
max_check_result_reaper_time: 30
check_result_path: /opt/nagios/var/spool/checkresults
max_check_result_file_age: 3600
cached_host_check_horizon: 15
cached_service_check_horizon: 15
enable_predictive_host_dependency_checks: 1
enable_predictive_service_dependency_checks: 1
soft_state_dependencies: 0
auto_reschedule_checks: 0
auto_rescheduling_interval: 30
auto_rescheduling_window: 180
service_check_timeout: 60
host_check_timeout: 60
event_handler_timeout: 60
notification_timeout: 60
ocsp_timeout: 5
perfdata_timeout: 5
retain_state_information: 1
state_retention_file: /opt/nagios/var/retention.dat
retention_update_interval: 60
use_retained_program_state: 1
use_retained_scheduling_info: 1
retained_host_attribute_mask: 0
retained_service_attribute_mask: 0
retained_process_host_attribute_mask: 0
retained_process_service_attribute_mask: 0
retained_contact_host_attribute_mask: 0
retained_contact_service_attribute_mask: 0
interval_length: 1
check_workers: 4
check_for_updates: 1
bare_update_check: 0
use_aggressive_host_checking: 0
execute_service_checks: 1
accept_passive_service_checks: 1
execute_host_checks: 1
accept_passive_host_checks: 1
enable_notifications: 1
enable_event_handlers: 1
process_performance_data: 0
obsess_over_services: 0
obsess_over_hosts: 0
translate_passive_host_checks: 0
passive_host_checks_are_soft: 0
check_for_orphaned_services: 1
check_for_orphaned_hosts: 1
check_service_freshness: 1
service_freshness_check_interval: 60
check_host_freshness: 0
host_freshness_check_interval: 60
additional_freshness_latency: 15
enable_flap_detection: 1
low_service_flap_threshold: 5.0
high_service_flap_threshold: 20.0
low_host_flap_threshold: 5.0
high_host_flap_threshold: 20.0
date_format: us
use_regexp_matching: 1
use_true_regexp_matching: 0
daemon_dumps_core: 0
use_large_installation_tweaks: 0
enable_environment_macros: 0
debug_level: 0
debug_verbosity: 1
debug_file: /opt/nagios/var/nagios.debug
max_debug_file_size: 1000000
allow_empty_hostgroup_assignment: 1
illegal_macro_output_chars: "`~$&|'<>\""
Nagios CGI Configuration
------------------------
The Nagios CGI configuration is defined via the following section in the chart's
values file:
::
conf:
nagios:
cgi:
main_config_file: /opt/nagios/etc/nagios.cfg
physical_html_path: /opt/nagios/share
url_html_path: /nagios
show_context_help: 0
use_pending_states: 1
use_authentication: 0
use_ssl_authentication: 0
authorized_for_system_information: "*"
authorized_for_configuration_information: "*"
authorized_for_system_commands: nagiosadmin
authorized_for_all_services: "*"
authorized_for_all_hosts: "*"
authorized_for_all_service_commands: "*"
authorized_for_all_host_commands: "*"
default_statuswrl_layout: 4
ping_syntax: /bin/ping -n -U -c 5 $HOSTADDRESS$
refresh_rate: 90
result_limit: 100
escape_html_tags: 1
action_url_target: _blank
notes_url_target: _blank
lock_author_names: 1
navbar_search_for_addresses: 1
navbar_search_for_aliases: 1
Nagios Host Configuration
-------------------------
The Nagios chart includes a single host definition for the Prometheus instance
queried for metrics. The host definition can be found under the following
values key:
::
conf:
  nagios:
    hosts:
      - prometheus:
          use: linux-server
          host_name: prometheus
          alias: "Prometheus Monitoring"
          address: 127.0.0.1
          hostgroups: prometheus-hosts
          check_command: check-prometheus-host-alive
The address for the Prometheus host is defined by the PROMETHEUS_SERVICE
environment variable in the deployment template, which is determined by the
monitoring entry in the Nagios chart's endpoints section. The endpoint is then
available as a macro for Nagios to use in all Prometheus based queries. For
example:
::
- check_prometheus_host_alive:
command_name: check-prometheus-host-alive
command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10"
The $USER2$ macro above corresponds to the Prometheus endpoint defined in the
PROMETHEUS_SERVICE environment variable. All checks that use the
prometheus-hosts hostgroup will map back to the Prometheus host defined by this
endpoint.
Nagios HostGroup Configuration
------------------------------
The Nagios chart includes configuration values for defined host groups under the
following values key:
::
conf:
  nagios:
    host_groups:
      - prometheus-hosts:
          hostgroup_name: prometheus-hosts
          alias: "Prometheus Virtual Host"
      - base-os:
          hostgroup_name: base-os
          alias: "base-os"
These hostgroups are used to define which group of hosts should be targeted by
a particular Nagios check. An example of a check that targets Prometheus for a
specific metric query would be:
::
- check_ceph_monitor_quorum:
use: notifying_service
hostgroup_name: prometheus-hosts
service_description: "CEPH_quorum"
check_command: check_prom_alert!ceph_monitor_quorum_low!CRITICAL- ceph monitor quorum does not exist!OK- ceph monitor quorum exists
check_interval: 60
An example of a check that targets all hosts for a base-os type check (memory
usage, latency, etc) would be:
::
- check_memory_usage:
use: notifying_service
service_description: Memory_usage
check_command: check_memory_usage
hostgroup_name: base-os
These two host groups allow for a wide range of targeted checks for determining
the status of all components of an OpenStack-Helm deployment.
Nagios Command Configuration
----------------------------
The Nagios chart includes configuration values for the command definitions Nagios
will use when executing service checks. These values are found under the
following key:
::
conf:
nagios:
commands:
- send_service_snmp_trap:
command_name: send_service_snmp_trap
command_line: "$USER1$/send_service_trap.sh '$USER8$' '$HOSTNAME$' '$SERVICEDESC$' $SERVICESTATEID$ '$SERVICEOUTPUT$' '$USER4$' '$USER5$'"
- send_host_snmp_trap:
command_name: send_host_snmp_trap
command_line: "$USER1$/send_host_trap.sh '$USER8$' '$HOSTNAME$' $HOSTSTATEID$ '$HOSTOUTPUT$' '$USER4$' '$USER5$'"
- send_service_http_post:
command_name: send_service_http_post
command_line: "$USER1$/send_http_post_event.py --type service --hostname '$HOSTNAME$' --servicedesc '$SERVICEDESC$' --state_id $SERVICESTATEID$ --output '$SERVICEOUTPUT$' --monitoring_hostname '$HOSTNAME$' --primary_url '$USER6$' --secondary_url '$USER7$'"
- send_host_http_post:
command_name: send_host_http_post
command_line: "$USER1$/send_http_post_event.py --type host --hostname '$HOSTNAME$' --state_id $HOSTSTATEID$ --output '$HOSTOUTPUT$' --monitoring_hostname '$HOSTNAME$' --primary_url '$USER6$' --secondary_url '$USER7$'"
- check_prometheus_host_alive:
command_name: check-prometheus-host-alive
command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10"
The list of defined commands can be modified with configuration overrides, which
allows for the definition of commands specific to an infrastructure deployment.
These commands can include querying Prometheus for metrics on dependencies for a
service to determine whether an alert should be raised, executing checks on each
host to determine network latency or file system usage, or checking each node
for issues with NTP clock skew.
Note: Since the conf.nagios.commands key contains a list of the defined commands,
the entire contents of conf.nagios.commands will need to be overridden if
additional commands are desired (due to the immutable nature of lists).
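A sketch of such an override is shown below; the check_ntp_clock_skew command
and its command line are hypothetical, and the existing command definitions from
the chart's values.yaml would need to be repeated in full alongside it:
::
conf:
  nagios:
    commands:
      # All default commands must be restated here, for example:
      - check_prometheus_host_alive:
          command_name: check-prometheus-host-alive
          command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10"
      # Hypothetical additional command for checking NTP clock skew
      - check_ntp_clock_skew:
          command_name: check_ntp_clock_skew
          command_line: "$USER1$/check_ntp_time -H $HOSTADDRESS$ -w 0.5 -c 1"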
Nagios Service Check Configuration
----------------------------------
The Nagios chart includes configuration values for the service checks Nagios
will execute. These service check commands can be found under the following
key:
::
conf:
nagios:
services:
- notifying_service:
name: notifying_service
use: generic-service
flap_detection_enabled: 0
process_perf_data: 0
contact_groups: snmp_and_http_notifying_contact_group
check_interval: 60
notification_interval: 120
retry_interval: 30
register: 0
- check_ceph_health:
use: notifying_service
hostgroup_name: base-os
service_description: "CEPH_health"
check_command: check_ceph_health
check_interval: 300
- check_hosts_health:
use: generic-service
hostgroup_name: prometheus-hosts
service_description: "Nodes_health"
check_command: check_prom_alert!K8SNodesNotReady!CRITICAL- One or more nodes are not ready.
check_interval: 60
- check_prometheus_replicas:
use: notifying_service
hostgroup_name: prometheus-hosts
service_description: "Prometheus_replica-count"
check_command: check_prom_alert_with_labels!replicas_unavailable_statefulset!statefulset="prometheus"!statefulset {statefulset} has lesser than configured replicas
check_interval: 60
The Nagios service configurations define the checks Nagios will perform. These
checks contain keys for defining: the service type to use, the host group to
target, the description of the service check, the command the check should use,
and the interval at which to trigger the service check. These services can also
be extended to provide additional insight into the overall status of a
particular service, or to define advanced checks for determining the overall
health and liveness of a service. For example, a service check could trigger an
alarm for the OpenStack services when Nagios detects that the relevant database
and message queue have become unresponsive.


@ -0,0 +1,338 @@
Prometheus
==========
The Prometheus chart in openstack-helm-infra provides a time series database and
a strong querying language for monitoring various components of OpenStack-Helm.
Prometheus gathers metrics by scraping defined service endpoints or pods at
specified intervals and indexing them in the underlying time series database.
Authentication
--------------
The Prometheus deployment includes a sidecar container that runs an Apache
reverse proxy to add authentication capabilities for Prometheus. The
username and password are configured under the monitoring entry in the endpoints
section of the chart's values.yaml.
The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.
Prometheus Service configuration
--------------------------------
The Prometheus service is configured via command line flags set during runtime.
These flags include: setting the configuration file, setting log levels, setting
characteristics of the time series database, and enabling the web admin API for
snapshot support. These settings can be configured via the values tree at:
::
conf:
  prometheus:
    command_line_flags:
      log.level: info
      query.max_concurrency: 20
      query.timeout: 2m
      storage.tsdb.path: /var/lib/prometheus/data
      storage.tsdb.retention: 7d
      web.enable_admin_api: false
      web.enable_lifecycle: false
web.enable_lifecycle: false
The Prometheus configuration file contains the definitions for scrape targets
and the location of the rules files for triggering alerts on scraped metrics.
The configuration file is defined in the values file, and can be found at:
::
conf:
  prometheus:
    scrape_configs: |
By defining the configuration via the values file, an operator can override all
configuration components of the Prometheus deployment at runtime.
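A minimal sketch of such an override is shown below, assuming the
conf.prometheus.scrape_configs key holds the Prometheus configuration as a raw
block string; the static scrape job is illustrative only, and a real deployment
would typically carry over the chart's Kubernetes service discovery jobs:
::
conf:
  prometheus:
    scrape_configs: |
      # Illustrative static scrape job; nesting assumes the key contains
      # the full configuration file content as a string
      scrape_configs:
        - job_name: example-static-target
          static_configs:
            - targets:
                - example-exporter.example-namespace.svc.cluster.local:9100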
Kubernetes Endpoint Configuration
---------------------------------
The Prometheus chart in openstack-helm-infra uses the built-in service discovery
mechanisms for Kubernetes endpoints and pods to automatically configure scrape
targets. Functions added to helm-toolkit allow configuration of these targets
via annotations that can be applied to any service or pod that exposes metrics
for Prometheus, whether a service for an application-specific exporter or an
application that provides a metrics endpoint via its service. The values in
these functions correspond to entries in the monitoring tree under the
prometheus key in a chart's values.yaml file.
The function definitions are below:
::
{{- define "helm-toolkit.snippets.prometheus_service_annotations" -}}
{{- $config := index . 0 -}}
{{- if $config.scrape }}
prometheus.io/scrape: {{ $config.scrape | quote }}
{{- end }}
{{- if $config.scheme }}
prometheus.io/scheme: {{ $config.scheme | quote }}
{{- end }}
{{- if $config.path }}
prometheus.io/path: {{ $config.path | quote }}
{{- end }}
{{- if $config.port }}
prometheus.io/port: {{ $config.port | quote }}
{{- end }}
{{- end -}}
::
{{- define "helm-toolkit.snippets.prometheus_pod_annotations" -}}
{{- $config := index . 0 -}}
{{- if $config.scrape }}
prometheus.io/scrape: {{ $config.scrape | quote }}
{{- end }}
{{- if $config.path }}
prometheus.io/path: {{ $config.path | quote }}
{{- end }}
{{- if $config.port }}
prometheus.io/port: {{ $config.port | quote }}
{{- end }}
{{- end -}}
These functions render the following annotations:
- prometheus.io/scrape: Must be set to true for Prometheus to scrape target
- prometheus.io/scheme: Overrides scheme used to scrape target if not http
- prometheus.io/path: Overrides path used to scrape target metrics if not /metrics
- prometheus.io/port: Overrides port to scrape metrics on if not service's default port
Each chart that can be targeted for monitoring by Prometheus has a prometheus
section under a monitoring tree in the chart's values.yaml, and Prometheus
monitoring is disabled by default for those services. Example values for the
required entries can be found in the following monitoring configuration for the
prometheus-node-exporter chart:
::
monitoring:
  prometheus:
    enabled: false
    node_exporter:
      scrape: true
If the prometheus.enabled key is set to true, the annotations are set on the
targeted service or pod as the condition for applying the annotations evaluates
to true. For example:
::
{{- $prometheus_annotations := $envAll.Values.monitoring.prometheus.node_exporter }}
---
apiVersion: v1
kind: Service
metadata:
name: {{ tuple "node_metrics" "internal" . | include "helm-toolkit.endpoints.hostname_short_endpoint_lookup" }}
labels:
{{ tuple $envAll "node_exporter" "metrics" | include "helm-toolkit.snippets.kubernetes_metadata_labels" | indent 4 }}
annotations:
{{- if .Values.monitoring.prometheus.enabled }}
{{ tuple $prometheus_annotations | include "helm-toolkit.snippets.prometheus_service_annotations" | indent 4 }}
{{- end }}
Kubelet, API Server, and cAdvisor
---------------------------------
The Prometheus chart includes scrape target configurations for the kubelet, the
Kubernetes API servers, and cAdvisor. These targets are configured based on
a kubeadm deployed Kubernetes cluster, as OpenStack-Helm uses kubeadm to deploy
Kubernetes in the gates. These configurations may need to change based on your
chosen method of deployment. Please note the cAdvisor metrics will not be
captured if the kubelet was started with the following flag:
::
--cadvisor-port=0
To enable the gathering of the kubelet's custom metrics, the following flag must
be set:
::
--enable-custom-metrics
Installation
------------
The Prometheus chart can be installed with the following command:
.. code-block:: bash
helm install --namespace=openstack local/prometheus --name=prometheus
The above command results in a Prometheus deployment configured to automatically
discover services with the necessary annotations for scraping, and to gather
metrics on the kubelet, the Kubernetes API servers, and cAdvisor.
Extending Prometheus
--------------------
Prometheus can target various exporters to gather metrics related to specific
applications to extend visibility into an OpenStack-Helm deployment. Currently,
openstack-helm-infra contains charts for:
- prometheus-kube-state-metrics: Provides additional Kubernetes metrics
- prometheus-node-exporter: Provides metrics for nodes and the Linux kernel
- prometheus-openstack-metrics-exporter: Provides metrics for OpenStack services
Kube-State-Metrics
~~~~~~~~~~~~~~~~~~
The prometheus-kube-state-metrics chart provides metrics for Kubernetes objects
as well as metrics for kube-scheduler and kube-controller-manager. Information
on the specific metrics available via the kube-state-metrics service can be
found in the kube-state-metrics_ documentation.
The prometheus-kube-state-metrics chart can be installed with the following:
.. code-block:: bash
helm install --namespace=kube-system local/prometheus-kube-state-metrics --name=prometheus-kube-state-metrics
.. _kube-state-metrics: https://github.com/kubernetes/kube-state-metrics/tree/master/Documentation
Node Exporter
~~~~~~~~~~~~~
The prometheus-node-exporter chart provides hardware and operating system metrics
exposed by the Linux kernel. Information on the specific metrics available via
the Node exporter can be found on the Node_exporter_ GitHub page.
The prometheus-node-exporter chart can be installed with the following:
.. code-block:: bash
helm install --namespace=kube-system local/prometheus-node-exporter --name=prometheus-node-exporter
.. _Node_exporter: https://github.com/prometheus/node_exporter
OpenStack Exporter
~~~~~~~~~~~~~~~~~~
The prometheus-openstack-exporter chart provides metrics specific to the
OpenStack services. The exporter's source code can be found here_. While the
metrics provided are by no means comprehensive, they will be expanded upon.
Please note the OpenStack exporter requires the creation of a Keystone user to
successfully gather metrics. To create the required user, the chart uses the
same keystone user management job the OpenStack service charts use.
The prometheus-openstack-exporter chart can be installed with the following:
.. code-block:: bash
helm install --namespace=openstack local/prometheus-openstack-exporter --name=prometheus-openstack-exporter
.. _here: https://github.com/att-comdev/openstack-metrics-collector
Other exporters
~~~~~~~~~~~~~~~
Certain charts in OpenStack-Helm include templates for application-specific
Prometheus exporters, which keeps the monitoring of those services tightly coupled
to the chart. The templates for these exporters can be found in the monitoring
subdirectory in the chart. These exporters are disabled by default, and can be
enabled by setting the appropriate flag in the monitoring.prometheus key of the
chart's values.yaml file. The charts containing exporters include:
- Elasticsearch_
- RabbitMQ_
- MariaDB_
- Memcached_
- Fluentd_
- Postgres_
.. _Elasticsearch: https://github.com/justwatchcom/elasticsearch_exporter
.. _RabbitMQ: https://github.com/kbudde/rabbitmq_exporter
.. _MariaDB: https://github.com/prometheus/mysqld_exporter
.. _Memcached: https://github.com/prometheus/memcached_exporter
.. _Fluentd: https://github.com/V3ckt0r/fluentd_exporter
.. _Postgres: https://github.com/wrouesnel/postgres_exporter
Ceph
~~~~
Starting with Luminous, Ceph can export metrics with the ceph-mgr prometheus module.
This module can be enabled in Ceph's values.yaml under the ceph_mgr_enabled_plugins
key by appending prometheus to the list of enabled modules. After enabling the
prometheus module, metrics can be scraped on the ceph-mgr service endpoint. This
relies on the Prometheus annotations attached to the ceph-mgr service template, and
these annotations can be modified in the endpoints section of Ceph's values.yaml
file. Information on the specific metrics available via the prometheus module
can be found in the Ceph prometheus_ module documentation.
.. _prometheus: http://docs.ceph.com/docs/master/mgr/prometheus/
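For example, an override enabling the module might look like the following; the
status entry is shown only as a stand-in for whatever plugins are already
enabled in Ceph's values.yaml:
::
ceph_mgr_enabled_plugins:
  # Keep any plugins already enabled by default, for example:
  - status
  # Append the prometheus module to expose metrics on the ceph-mgr endpoint
  - prometheus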
Prometheus Dashboard
--------------------
Prometheus includes a dashboard that can be accessed via the exposed
Prometheus endpoint (NodePort or otherwise). This dashboard will give you a
view of your scrape targets' state, the configuration values for Prometheus's
scrape jobs and command line flags, a view of any alerts triggered based on the
defined rules, and a means for using PromQL to query scraped metrics. The
Prometheus dashboard is a useful tool for verifying Prometheus is configured
appropriately and to verify the status of any services targeted for scraping via
the Prometheus service discovery annotations.
Rules Configuration
-------------------
Prometheus provides a querying language that can operate on defined rules which
allow for the generation of alerts on specific metrics. The Prometheus chart in
openstack-helm-infra defines these rules via the values.yaml file. By defining
these in the values file, it allows operators flexibility to provide specific
rules via overrides at installation. The following rules keys are provided:
::
values:
  conf:
    rules:
      alertmanager:
      etcd3:
      kube_apiserver:
      kube_controller_manager:
      kubelet:
      kubernetes:
      rabbitmq:
      mysql:
      ceph:
      openstack:
      custom:
These keys provide recording and alert rules for all infrastructure
components of an OpenStack-Helm deployment. If you wish to exclude rules for a
component, leave the tree empty in an overrides file. To read more
about Prometheus recording and alert rules definitions, please see the official
Prometheus recording_ and alert_ rules documentation.
.. _recording: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
.. _alert: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
Note: Prometheus releases prior to 2.0 used gotpl to define rules. Prometheus
2.0 changed the rules format to YAML, making them much easier to read. The
Prometheus chart in openstack-helm-infra uses Prometheus 2.0 by default to take
advantage of changes to the underlying storage layer and the handling of stale
data. The chart will not support overrides for Prometheus versions below 2.0,
as the command line flags for the service changed between versions.
The wide range of exporters included in OpenStack-Helm, coupled with the ability
to define rules with configuration overrides, allows for the addition of custom
alerting and recording rules to fit an operator's monitoring needs. Adding new
rules or modifying existing rules requires overrides for either an existing key
under conf.rules or the addition of a new key under conf.rules. The addition
of custom rules can be used to define complex checks that can be extended for
determining the liveness or health of infrastructure components.
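As an illustration, a hypothetical alerting rule could be supplied under the
custom key using the Prometheus 2.0 YAML rules format, assuming each key under
conf.rules is rendered as a separate rules file:
::
conf:
  rules:
    custom:
      groups:
        - name: custom.rules
          rules:
            # Hypothetical alert: fires when any scrape target is down for 10 minutes
            - alert: TargetDown
              expr: up == 0
              for: 10m
              labels:
                severity: warning
              annotations:
                description: "{{ $labels.instance }} has been unreachable for 10 minutes"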

doc/source/readme.rst Normal file

@ -0,0 +1 @@
.. include:: ../../README.rst