To make the jobs easier to maintain, all experimental jobs
(those that are not run in the check and gate pipelines)
are moved to a separate file. They will be revised later
to use the same deploy-env role.
Also, many charts use OpenStack images for testing. This
PR adds 2023.1 Ubuntu Focal overrides for all of these charts.
Change-Id: I4a6fb998c7eb1026b3c05ddd69f62531137b6e51
This change adds a feature to launch the Prometheus process using a custom script, which should be stored in override values. Because the known issue https://github.com/prometheus/prometheus/issues/6934 has remained open for many years, we are going to deal with growing WAL files using our custom downstream wrapper script, which stops the Prometheus process and deletes the WAL files.
This solution may not suit all users because it completely discards the data cached in the WAL, but it is acceptable for our purposes. I therefore added the option to launch Prometheus with a custom script and kept the original behaviour as the default. The default/custom mode is selected in 'values.yaml', which also holds the body of the custom launcher script.
Change-Id: Ie02ea1d6a7de5c676e2e96f3dcd6aca172af4afb
Based on the spec in the openstack-helm repo,
support-OCI-image-registry-with-authentication-turned-on.rst
Each Helm chart can configure an OCI image registry and the
credentials to use. A Kubernetes Secret is then created with this
information. ServiceAccounts then specify an imagePullSecret that
references the Secret holding the registry credentials. Any pod using
one of these ServiceAccounts can then pull images from an authenticated
container registry.
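
As a rough illustration of the wiring described above (not the chart's
actual templates), the rendered objects might look like the following.
The names example-oci-image-registry and example-prometheus are
hypothetical; the Secret type and the imagePullSecrets field are
standard Kubernetes.

```yaml
# Secret holding the registry credentials (name is hypothetical).
apiVersion: v1
kind: Secret
metadata:
  name: example-oci-image-registry
type: kubernetes.io/dockerconfigjson
data:
  # Placeholder: base64-encoded Docker config JSON with the registry auth.
  .dockerconfigjson: <base64-encoded-docker-config>
---
# ServiceAccount that references the Secret, so pods running under it
# can pull from the authenticated registry.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: example-prometheus
imagePullSecrets:
  - name: example-oci-image-registry
```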
Change-Id: Iebda4c7a861aa13db921328776b20c14ba346269
This change updates the default image value in the prometheus
chart from newton to wallaby for the helm_test image.
Change-Id: I0f70734a8455661f7705baeed3cafbaf529c56a8
This change updates the helm-toolkit path in each chart as part
of the move to helm v3. This is needed because helm v3 lacks helm serve.
Change-Id: I011e282616bf0b5a5c72c1db185c70d8c721695e
This will ease mirroring capabilities for the docker official images.
Signed-off-by: Thiago Brito <thiago.brito@windriver.com>
Change-Id: I0f9177b0b83e4fad599ae0c3f3820202bf1d450d
Prometheus documentation shows that /-/ready can be used to check that
it is ready to service traffic (i.e. respond to queries) [0]. I've
witnessed cases where Prometheus's readiness probe is passing during
initial deployment using /status, which in turn triggers its helm test
to start. Said helm test then fails because /status is not a
reliable indicator that Prometheus is actually ready to serve traffic,
and the helm test performs actions that require it to be properly
up and ready.
[0]: https://prometheus.io/docs/prometheus/latest/management_api/
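
A minimal sketch of a readiness probe that uses this endpoint, assuming
Prometheus listens on its default port 9090. The timing values are
illustrative and are not taken from the chart.

```yaml
# Container-level probe fragment (values below are assumptions).
readinessProbe:
  httpGet:
    path: /-/ready      # management API endpoint documented in [0]
    port: 9090          # Prometheus default port, assumed here
  initialDelaySeconds: 30
  timeoutSeconds: 30
```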
Change-Id: Iab22d0c986d680663fbe8e84d6c0d89b03dc6428
This patchset enables the TLS path for Prometheus when it acts as
a server. Note that TLS is not terminated directly at Prometheus.
TLS is terminated at the Apache proxy, which in turn routes requests
to Prometheus.
Change-Id: I0db366b6237a34da2e9a31345d96ae8f63815fa2
The flag storage.tsdb.retention is deprecated and generates warnings
on startup; storage.tsdb.retention.time is the new flag.
storage.tsdb.wal-compression is enabled by default in v2.20
and above, so it no longer needs to be set.
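
For illustration, the flag change could look like this in a container's
argument list. This is a sketch, not the chart's actual template, and
the 7d retention value is an assumption.

```yaml
# Prometheus container arguments (illustrative fragment).
args:
  # Deprecated spelling, which triggers a startup warning:
  #   --storage.tsdb.retention=7d
  - --storage.tsdb.retention.time=7d
  # --storage.tsdb.wal-compression is left out on purpose:
  # WAL compression is already on by default in v2.20 and above.
```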
Change-Id: I66f861a354a3cdde69a712ca5fd8a1d1a1eca60a
This reverts commit fb7fc87d23.
I first submitted that as a way to add dynamic capability to the
prometheus rules (they infamously don't support ENV variable
substitution there). However, this can be done easily with another
solution, and the revert cleans up the prometheus chart values
significantly.
Change-Id: Ibec512d92490798ae5522468b915b49e7746806a
Since we introduced the chart version check in the gates, requirements
are no longer satisfied by a strict check for version 0.1.0.
Change-Id: I15950b735b4f8566bc0018fe4f4ea9ba729235fc
Signed-off-by: Andrii Ostapenko <andrii.ostapenko@att.com>
Added chart linting to Zuul CI to improve the stability of the charts.
Fixed some lint errors in the current charts.
Change-Id: I9df4024c7ccf8b3510e665fc07ba0f38871fcbdb
This change allows us to substitute values into our rules files.
Example:

  - alert: my_region_is_down
    expr: up{region="{{ $my_region }}"} == 0

To support this change, rule annotations that used the expansion
{{ $labels.foo }} had to be surrounded with "{{` ... `}}" to render
correctly.
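
A small sketch of how the two kinds of expansion can sit side by side
in one rule. The summary annotation is a made-up example, and
$my_region is assumed to be defined elsewhere in the template.

```yaml
# Illustrative rule as it might appear in a templated rules file.
- alert: my_region_is_down
  # Substituted by Helm when the chart is rendered.
  expr: up{region="{{ $my_region }}"} == 0
  annotations:
    # The backtick-quoted raw string is printed verbatim by Helm, so the
    # label expansion survives rendering and is evaluated by Prometheus.
    summary: "{{`{{ $labels.instance }}`}} is down"
```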
Change-Id: Ia7ac891de8261acca62105a3e2636bd747a5fbea
Some scrape targets require the use of TLS client certificates, which
are specified as filenames as part of the tls_config.
This change allows these client certs and keys to be provided, stores
them in a secret, and mounts them in the pod under /tls_configs.
Example:

  tls_configs:
    kubernetes-etcd:
      ca.pem: |
        -----BEGIN CERTIFICATE-----
        -----END CERTIFICATE-----
      crt.pem: |
        -----BEGIN CERTIFICATE-----
        -----END CERTIFICATE-----
      key.pem: |
        -----BEGIN RSA PRIVATE KEY-----
        -----END RSA PRIVATE KEY-----

  conf:
    prometheus:
      scrape_configs:
        template: |
          scrape_configs:
          - job_name: kubernetes-etcd
            scheme: https
            tls_config:
              ca_file: /tls_configs/kubernetes-etcd.ca.pem
              cert_file: /tls_configs/kubernetes-etcd.crt.pem
              key_file: /tls_configs/kubernetes-etcd.key.pem

Change-Id: I963c65dc39f1b5110b091296b93e2de9cdd980a4
- Update alertmanager and prometheus discovery port from 6783 to 9094
- Update to support fqdn for discovery hostname
- Add one test alert to Prometheus to test alert pipeline
- Update container name from alertmanager to prometheus-alertmanager
Change-Id: Iec5e758e4b576dff01e84591a2440d030d5ff3c4
This updates the chart to include the pod security context
on the pod template.
This also adds the container security context to set the
readOnlyRootFilesystem flag to true.
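
A rough sketch of where the two security contexts sit in a pod
template. The user ID and container name are assumptions for
illustration, not the chart's values.

```yaml
# Fragment of a pod template spec.
spec:
  securityContext:          # pod-level security context
    runAsUser: 65534        # assumed non-root UID
  containers:
    - name: prometheus      # hypothetical container name
      securityContext:      # container-level security context
        readOnlyRootFilesystem: true
```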
Change-Id: Icb7a9de4d98bac1f0bcf6181b6e88695f4b09709
Relax the octal values rule, since the readability benefit for file
modes outweighs possible issues with YAML 1.2 adoption in future k8s
versions. These issues will be addressed when/if they occur.
Also ensure osh-infra is a required project for the lint job, which
matters when running the job against another project.
Change-Id: Ic5e327cf40c4b09c90738baff56419a6cef132da
Signed-off-by: Andrii Ostapenko <andrii.ostapenko@att.com>
This commit rewrites the lint job to make template linting available.
Currently yamllint is run in warning mode against all templates
rendered with default values. Detected duplicates and other issues
will be addressed in subsequent commits.
Also, all y*ml files are added for linting and the corresponding code
changes are made. For non-templates, warning rules are disabled to
improve readability. Chart and requirements yamls are also modified for
consistency.
Change-Id: Ife6727c5721a00c65902340d95b7edb0a9c77365
The current copyright refers to a non-existent group
"openstack helm authors" with often out-of-date references that
are confusing when adding a new file to the repo.
This change removes all references to this copyright by the
non-existent group and any blank lines underneath.
Change-Id: I1882738cf9757c5350a8533876fd37b5920b5235
This change converts alert expressions which relied on instant vectors
to use range aggregate functions instead, for the 'basic_linux'
rules only.
Change-Id: I30d6ab71d747b297f522bbeb12b8f4dbfce1eefe
Co-Authored-By: Meghan Heisler <mkheisler93@gmail.com>
This change updates the prometheus alerting rules to use range vectors
in their expressions, to avoid situations where missed scrapes would
cause scalar metrics to "go stale", resetting the alert timer.
Only the ceph alerts are affected by this change.
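
To illustrate the kind of conversion described above, here is a
hypothetical rule that is not copied from the chart. The alert name,
the 5m window and the choice of min_over_time are assumptions.

```yaml
# Hypothetical alerting rule, before and after the conversion.
- alert: example_ceph_health_error
  # Before: instant vector. After a missed scrape there is no current
  # sample, the expression returns nothing, and the 'for' timer restarts.
  #   expr: ceph_health_status == 2
  # After: aggregate over a range, so a single missed scrape is
  # tolerated as long as other samples exist in the window.
  expr: min_over_time(ceph_health_status[5m]) == 2
  for: 5m
```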
Change-Id: Ib47866d12616aaa808e6a09c58aa4352e338a152
Co-Authored-By: Meghan Heisler <mkheisler93@gmail.com>
This change converts alert expressions which relied on instant vectors
to use range aggregate functions instead.
Change-Id: I4df757f961524bed23b6a6ad361779c1749ca2c5
Co-Authored-By: Meghan Heisler <mkheisler93@gmail.com>
This change adds a means of introducing new storage classes
and local persistent volumes.
Change-Id: I340c75f3d0a1678f3149f3cf62e4ab104823cc49
Co-Authored-By: Steven Fitzpatrick <steven.fitzpatrick@att.com>
This patch set updates and tests the apiVersion for rbac.authorization.k8s.io
from v1beta1 to v1 in preparation for its removal in k8s 1.20.
Change-Id: I4e68db1f75ff72eee55ecec93bd59c68c179c627
Signed-off-by: Tin Lam <tin@irrational.io>
This updates the deployment scripts for Prometheus to leverage the
feature gate functionality rather than bash generation of the list
of override files to use for alerting rules.
Change-Id: Ie497ae930f7cc4db690a4ddc812a92e4491cde93
Signed-off-by: Steve Wilkerson <sw5822@att.com>
I noticed that some nagios service checks were checking prometheus
alerts which did not exist in our default prometheus configuration.
In one case a prometheus alert did not match the naming convention
of similar alerts.
One nagios service check, ceph_monitor_clock_skew_high, does not
have a corresponding alert at all, so I've changed it to check the
node_ntp_clock_skew_high alert, where a node has the label
ceph-mon="enabled".
Change-Id: I2ebf9a4954190b8e2caefc8a61270e28bf24d9fa
This updates the Prometheus chart to support federation. This
moves to defining the Prometheus configuration file via a template
in the values.yaml file instead of through raw yaml. This allows
for overriding the chart's default configuration wholesale, as
this would be required for a hierarchical federated setup. This
also strips out all of the default rules defined in the chart for
the same reason. There are example rules defined for the various
aspects of OSH's infrastructure in the prometheus/values_overrides
directory that are executed as part of the normal CI jobs. This
also adds a nonvoting federated-monitoring job that vets the
ability to federate prometheus in a hierarchical fashion with
extremely basic overrides.
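
For context, a hierarchical setup usually has the upper-level
Prometheus scrape the /federate endpoint of the lower-level ones. The
sketch below follows the pattern from the Prometheus federation
documentation. The job name, match expression and target address are
assumptions, and in this chart such a block would live inside the
templated configuration in values.yaml.

```yaml
# Scrape configuration for a federating (upper-level) Prometheus.
scrape_configs:
  - job_name: federate
    honor_labels: true          # keep labels from the source server
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".+"}'         # assumed selector: pull every job
    static_configs:
      - targets:
          - 'example-prometheus.example.svc:9090'  # hypothetical address
```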
Change-Id: I0f121ad5e4f80be4c790dc869955c6b299ca9f26
Signed-off-by: Steve Wilkerson <sw5822@att.com>
This updates the podManagementPolicy to 'Parallel' for Prometheus
and Alertmanager, as there's no need to deploy these two services
sequentially.
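
As a sketch, the setting sits at the top level of the StatefulSet spec.
The object name and replica count below are hypothetical, and the rest
of the spec is omitted.

```yaml
# StatefulSet fragment; only the relevant fields are shown.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-prometheus      # hypothetical name
spec:
  replicas: 2                   # assumed value
  # Start and stop all pods at once instead of one at a time
  # (the Kubernetes default is OrderedReady).
  podManagementPolicy: Parallel
```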
Change-Id: I2f33b9651bed20c4cb2e0c477ae2227cbf9310cf
Signed-off-by: Steve Wilkerson <sw5822@att.com>