This update introduces a new maintenance group alarm: 200.003.
This new alarm is minor and management affecting if asserted.
It is considered management affecting for the upgrades case because
the pxeboot network is needed to upgrade a node.
The alarm represents a communication/messaging failure between the
active controller mtcAgent process and the mtcClient that runs
on each node.
Test Plan:
PASS: Verify alarm attributes
PASS: - code of 200.003
PASS: - assertion cause text
PASS: - proposed repair action text
PASS: - suppression option
PASS: - does not inhibit other alarms
PASS: - effect of assertion on upgrade healthcheck
PASS: Verify ability to assert and clear
PASS: Verify fm logging for the above assertion and clear
Story: 2010940
Task: 49789
Change-Id: I507d30213674c5b1e24fcfebe15c6a87bad74358
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This new alarm is raised when the controllers' deploy
state is out of sync during the deployment.
Test Plan:
PASS: alarm raised when the deploy state is out of sync
PASS: alarm cleared when the deploy state is in sync
Task: 49732
Story: 2010676
Change-Id: Ibdcc54f02c9e156b2b78313b527cd273a62425f1
Signed-off-by: junfeng-li <junfeng.li@windriver.com>
This change added alarm 250.004, "IPsec certificates renewal failed".
This alarm will be raised by the ipsec-cert-renew cron job when the
renewal fails, and will be cleared when the cron job script is re-run,
either manually or by cron, after the error is fixed.
Test Plan:
PASS: Simulate a failure condition (e.g., ipsec-client returns non-zero),
run the cron job script, verify the IPsec renewal fails, and
alarm 250.004 is raised.
PASS: Run the script when the IPsec cert is not about to expire, verify
the script finishes successfully and alarm 250.004 is cleared.
Story: 2010940
Task: 49706
Change-Id: Ie4d3970ca32173939c1df55a2e59241ac214b2ae
Signed-off-by: Andy Ning <andy.ning@windriver.com>
Configure new alarm 850.002, which will be used when the K8s
periodic audit runs and any of the endpoint health checks
fails.
Test Plan:
PASS: Performed tox test locally and packages were built
successfully
PASS: Verify that a k8s orchestrated upgrade is blocked
if it is attempted while the alarm is set, and that a k8s
orchestrated upgrade completes if it is attempted when the
alarm is cleared.
Story: 2011037
Task: 49535
Change-Id: I335179ea98ef63d7c35c89d82328a52ab2391f5c
Signed-off-by: rakshith mr <rakshith.mr@windriver.com>
Currently there is no alarm for node taint.
This new alarm 900.701 describes the attributes
of the node taint.
Test Plan:
PASSED: Verified the details of the alarm
using fm alarm-list.
Partial-Bug: 2046273
Change-Id: I929ddb45b75f1e4b097b84919f703d458d8fa39e
Signed-off-by: Vanathi.Selvaraju <vanathi.selvaraju@windriver.com>
Maintenance service raises an alarm with ID 200.016 if the luks-fs-mgr
service is inactive. This change adds the description of the 200.016 alarm.
Test Plan:
PASS: build-pkgs -c -p fm-doc
PASS: build-image
PASS: AIO-SX bootstrap with LUKS service status inactive. A critical
alarm with ID 200.016 should be displayed while listing the alarm
using 'fm alarm-list'
Story: 2010872
Task: 49125
Depends-On: https://review.opendev.org/c/starlingx/metal/+/901455
Change-Id: Iadee64bffbb37cfd94aa735f7eeb12ba0fa86fbd
Signed-off-by: Jagatguru Prasad Mishra <jagatguruprasad.mishra@windriver.com>
Even when event logs have the 'proposed_repair_action' attribute
available, it is not logged in the /var/log/fm-event.log file. This
could leave external tools that use the file without that information.
Test Plan:
PASS: Build fm-common package and install it. Then, trigger an alarm
with 'proposed_repair_action'. Verify its presence in the
/var/log/fm-event.log
PASS: Raise an alarm without 'proposed_repair_action'. Check that
the empty attribute is present.
PASS: Raise an alarm with a long 'proposed_repair_action'. Verify that
this message does not affect other attributes.
Closes-Bug: 2042579
Change-Id: Ic27b840041872c3afd0be28e11556acf42a3d5a9
Signed-off-by: fperez <fabrizio.perez@windriver.com>
When fm manager is restarted, there is no mechanism to detect it
from the fm api client side. As a result, when a subcloud delete clear
alarm request is sent after fm manager is restarted, the fm api client
reports a broken pipe, the clear alarm request is not received, and
the alarm stays.
This fix checks the socket fd state before send/receive in the
fm api client. If a broken pipe is detected, the client tries to
reconnect to fm manager.
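The check-then-reconnect pattern can be sketched as follows. This is only an illustration in Python, not the fmSocket.cpp/fmAPI.cpp code touched by this change: the class name, the caller-supplied `connect` function, and the single-retry policy are all assumptions.

```python
import socket


class ReconnectingClient:
    """Hypothetical client that validates its socket before each send."""

    def __init__(self, connect):
        # `connect` is a caller-supplied function returning a connected socket.
        self._connect = connect
        self._sock = connect()

    def _is_valid(self):
        # A closed Python socket reports a file descriptor of -1.
        return self._sock is not None and self._sock.fileno() >= 0

    def _reconnect(self):
        if self._sock is not None:
            self._sock.close()
        self._sock = self._connect()

    def send(self, payload):
        # Check the descriptor state first, as the fix describes.
        if not self._is_valid():
            self._reconnect()
        try:
            self._sock.sendall(payload)
        except (BrokenPipeError, ConnectionResetError):
            # The peer went away (for example, the manager restarted):
            # reconnect once and retry the request.
            self._reconnect()
            self._sock.sendall(payload)
```

Retrying once keeps a restarted peer from silently dropping a request, while a second failure still propagates to the caller.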
Closes-bug: 2039684
Test Plan:
PASS: Restart fm manager and confirm that the broken pipe detection
and reconnect messages appear in /var/log. For example,
-----
sm: err fmSocket.cpp(270): A broken pipe error occurred
sm: warning fmAPI.cpp(116): Invalid file descriptor. Attempting to reconnect...
sm: info fmAPI.cpp(149): Connected to FM Manager.
-----
PASS: Delete offline subcloud and confirm the alarm is
removed.
Change-Id: Ibc0f4d96b5c0a385d8fedbc1acd23898f1cbea46
Signed-off-by: Takamasa Takenaka <takamasa.takenaka@windriver.com>
The alarm 280.004 is added and will be raised when the system peer
connection failure has been detected, and will be cleared when the
system peer connection has been restored.
The alarm 280.005 is added and will be raised when a subcloud peer
group is being managed by a remote system with a lower priority, and
will be cleared when the subclouds belonging to the subcloud peer group
have been migrated back to the current system.
Test Plan:
PASS - Verify successful tox test and package build
Story: 2010852
Task: 48492
Change-Id: I3068676933c0446a88bd4290277456cd0962f941
Signed-off-by: Zhang Rong(Jon) <rong.zhang@windriver.com>
wrsEventMessage traps are being managed as wrsAlarmMessages.
Events do not contain wrsEventProposedRepairAction and
wrsEventSuppressionAllowed fields, so they need to generate a default
value in FM.
This commit adds those fields in order to create the event traps
with the same format as the alarm traps.
Depends-on: https://review.opendev.org/c/starlingx/snmp-armada-app/+/892624
Partial-bug: 2032844
Signed-off-by: Agustin Carranza <agustin.carranza@windriver.com>
Change-Id: I58577406cc75c597f6f430015ddd51d0029d4539
With the Sphinx version update, the 'language = None' configuration
raises a warning (treated as an error). The default value is
'language = en', which has the same behavior.
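As a sketch, the setting described above would look like this in a Sphinx configuration file. The file location within the repo is an assumption; this commit message does not name it.

```python
# Sphinx conf.py fragment (file location in the repo is an assumption).
# 'language = None' triggers a warning on newer Sphinx releases; 'en'
# is the default and yields the same behavior.
language = 'en'
```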
Test plan
PASS: Run tox and check it ends successfully.
Closes-bug: 2033412
Related-Bug: #1976377
Change-Id: Ie003c0a02fcfc6f237ae5b3efb259de6748077ad
Signed-off-by: Agustin Carranza <agustin.carranza@windriver.com>
400.001 and 400.002 alarms are tagged for openstack but should be
starlingx.
This change tags them to starlingx so the documentation scripts are
able to classify them correctly.
Test plan
PASS: Check the parsing scripts end successfully.
Closes-bug: 2028379
Signed-off-by: Agustin Carranza <agustin.carranza@windriver.com>
Change-Id: I8f5966d5b0a7b82198e4bc2e735fa4536a4cdd0a
This story updates the README files of a few of the most used
StarlingX repos.
Test Plan: N/A
Story: 2010814
Task: 48379
Change-Id: I98483245931c5d764c662f5283c59da0b2d69efe
Signed-off-by: Roger Ferraz <rogerio.ferraz@encora.com>
Since the alarm documentation has been automated and the events.yaml
file is taken as the source of truth for it, the alarms' proposed
repair actions need to link directly to the documentation for the
users.
This change turns the mentions of documentation into proper links,
using Sphinx placeholders that are interpreted by the documentation
language.
Test plan
PASS:
* Build fm-doc package. Check that all parsing checks were run and
package was built successfully.
Closes-bug: 2022104
Change-Id: Iccb34e42ed80634d73cf7549e9230976579deef7
Signed-off-by: Agustin Carranza <agustin.carranza@windriver.com>
The 250.002 alarm was deprecated a long time ago.
This change deletes it from the alarm list.
Test plan:
PASS:
* Build fm-doc and fm-api packages.
* Check that all parsing checks were run and package was built
successfully.
Depends-on: https://review.opendev.org/c/starlingx/distcloud/+/886001
Closes-bug: 2024010
Signed-off-by: Agustin Carranza <agustin.carranza@windriver.com>
Change-Id: I0180bd5addc1feae6e3e45edd24c1f50d6622e2c
Some alarms refer to the "System Administration Manual", but this
document does not exist. The reference was changed to a generic
documentation reference.
The 800.103 alarm has been deprecated so it is deleted from the
events.yaml file.
Test plan
Pass:
* Build fm-doc package. Check that all parsing checks were run and
package was built successfully.
Closes-bug: 2022104
Signed-off-by: Agustin Carranza <agustin.carranza@windriver.com>
Change-Id: I4723d05e77983796a0f64c7242f5c2bcf4699763
yaml.load will report a warning in pyyaml 5 and an error
in pyyaml 6 if it is called without a Loader argument.
The no-member pylint error was being suppressed due to
legacy http code, so now that is un-suppressed globally
and the yaml.load is replaced with yaml.safe_load
Test Plan:
PASS: tox
PASS: yaml.load('events.yaml') returns the same content
as yaml.safe_load('events.yaml')
Story: 2010642
Task: 48157
Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: Ibac118cd9555f3334251b10a6b3e0a5986285854
This change adds a parsing check that reports an ERROR if the Context
field is empty.
Until now there was no requirement for non-empty fields, so a
collection is created in case this is needed in the future for other
keys/values.
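A minimal sketch of such a check is shown below. It is not the checkEventYaml code: the names `NON_EMPTY_KEYS`, `EMPTY_VALUES`, and `find_empty_fields` are hypothetical, and the events are assumed to be already loaded into a dict keyed by event ID.

```python
# Keys whose values must not be empty; kept as a collection so that
# further keys can be added later (assumption: only Context for now).
NON_EMPTY_KEYS = ("Context",)

# Values treated as "empty" in this sketch (assumed markers).
EMPTY_VALUES = (None, "", "<Empty>")


def find_empty_fields(events):
    """Return (event_id, key) pairs for required keys that are empty."""
    errors = []
    for event_id, fields in events.items():
        for key in NON_EMPTY_KEYS:
            value = fields.get(key)
            if isinstance(value, str):
                value = value.strip()
            if value in EMPTY_VALUES:
                errors.append((event_id, key))
    return errors
```

A caller could print each returned pair as an ERROR and exit non-zero when the list is not empty.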
Test plan
PASS: * Add/modify an alarm/log in events.yaml file with Context field
set to <Empty>.
* Run the checkEventYaml script and check it fails.
PASS: * Check that all the events in events.yaml file have the Context
field set to a non empty value.
* Run the checkEventYaml script and check it ends successfully.
Closes-bug: 2020381
Signed-off-by: Agustin Carranza <agustin.carranza@windriver.com>
Change-Id: Ia267886dd49099525751165975fb5d291c0c6f82
System config update alarms are 900.6xx series.
The new alarms originate from a new type of vim strategy
orchestrating configuration update.
The new alarms are similar in numbering and wording to the
kube upgrade auto apply 900.4xx series alarms and logs.
System config update in-progress alarm is 900.010.
System config update aborted alarm is 900.011.
Story: 2010719
Task: 47947
Change-Id: Ieb6e68adf359ac7b0489d15bb33cb4b4a9f3ef3f
Signed-off-by: Yuxing Jiang <Yuxing.Jiang@windriver.com>
Product Documentation is missing the alarm 900.007 'Kubernetes upgrade
in progress.'
That alarm has the Context field set to none. In order to be included
in stx documentation, it has to be set to Context: starlingx.
Test plan:
PASS: Run documentation generating scripts and check the alarm is now
included.
Closes-bug: 2019146
Signed-off-by: Agustin Carranza <agustin.carranza@windriver.com>
Change-Id: I4d4867e5299e3fb1eb37c9bcd3e53447d4f08ba5
This commit is intended to update the 260.002 alarm. As the 'severity'
is set to 'minor', it is desired to classify it as
non-management-affecting by adjusting its Management_Affecting_Severity
value to 'none'.
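The relationship between severity and Management_Affecting_Severity can be sketched as below. This is an illustration only: it assumes the field names the lowest severity at which an alarm counts as management affecting, with 'none' meaning never. The function name and severity ordering are hypothetical and are not taken from the FM code.

```python
# Severities ordered from least to most severe (assumed ordering).
SEVERITY_ORDER = ["warning", "minor", "major", "critical"]


def is_management_affecting(severity, threshold):
    """Return True if an alarm at `severity` meets the `threshold`.

    A threshold of 'none' means the alarm is never management affecting.
    """
    if threshold == "none":
        return False
    return SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(threshold)
```

Under this assumption, a minor alarm with the threshold set to 'none' is never management affecting, which is the outcome this commit intends for 260.002.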
Test Plan:
PASS: Build and install Debian package.
Story: 2010719
Task: 47938
Change-Id: Ie228191ebdda5f2651dab1309b929ae06bc1f7f6
Signed-off-by: Enzo Candotti <enzo.candotti@windriver.com>
This commit adds a new alarm id and definition for resources
that have INSYNC=False.
The alarm will be raised when a resource is not
synchronized during an update process. It will be cleared when
the resource is synchronized again.
Test Plan:
- Verify successful tox test and package build
- Verify the alarm can be raised using FmClientCli
Story: 2010719
Task: 47910
Change-Id: I24a976ed4beaa8248df25fd97eeee27f5754b969
Signed-off-by: Enzo Candotti <enzo.candotti@windriver.com>
Updating the rsa ssh host key based on:
https://github.blog/2023-03-23-we-updated-our-rsa-ssh-host-key/
Note: In the future, StarlingX should have a zuul job and
secret setup for all repos so we do not need to do this
for every repo.
Needed to rename the secret, because zuul fails if like-named
secrets have different values in different branches of the same
repo.
Partial-Bug: #2015246
Change-Id: Id0caa3ad6efbaed9fff904c6fab8ba35472ee6f5
Signed-off-by: Davlet Panech <davlet.panech@windriver.com>
Some documentation generating scripts were introduced in order to avoid
manual intervention every time an alarm/log is changed/added/removed.
Those scripts required a way to know where the alarm/log belongs to.
For that requirement, the field Context was introduced in previous
commits. During that development, the classification in the docs at
that time was taken as the source of truth, but it was outdated.
This commit modifies the values that were detected as wrong/outdated.
The scripts also require the value 'none' in the Context field for when
an alarm/log should not be included in the documentation but still be
defined in the events.yaml file. So the Context value is updated for
that case too.
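How a documentation script might use the Context field can be sketched as follows. This is a hypothetical illustration, not the actual generating scripts: the function name is invented, and the events are assumed to be already loaded into a dict keyed by event ID.

```python
from collections import defaultdict


def group_by_context(events):
    """Group event IDs by Context, skipping events tagged 'none'."""
    groups = defaultdict(list)
    for event_id, fields in sorted(events.items()):
        context = fields.get("Context")
        if context == "none":
            # Defined in events.yaml but left out of the documentation.
            continue
        groups[context].append(event_id)
    return dict(groups)
```

Each resulting group could then be rendered into its own documentation section, while events tagged 'none' remain defined but undocumented.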
Context incorrectly tagged as openstack and changed to starlingx:
* 900.006
Context incorrectly tagged as starlingx and changed to openstack:
* 100.105
* 100.112
* 100.113
* 300.001
* 300.002
Closes-bug: 2012981
Test plan
PASS: Since the Context field does not have impact in functionality,
build and install fm-doc package successfully.
Check the file in the filesystem contains this change.
PASS: Trigger random alarms and check FM functionality.
Signed-off-by: Agustin Carranza <agustin.carranza@windriver.com>
Change-Id: I16f858bbb712349f08b2ceca33152e365b0ed733
Currently, there is no alarm for Restore in progress.
Because of this, the system is shown as healthy,
before restore has been completed.
This new alarm will prevent the system from being healthy
until restore has properly been completed.
TEST PLAN
PASS: On any available system, the following commands can
be triggered at any time:
* Run "system restore-start" to trigger alarm
* Run "system restore-complete" to clear alarm
Story: 2010117
Task: 47689
Signed-off-by: Joshua Kraitberg <joshua.kraitberg@windriver.com>
Change-Id: I292b5c8083c08b68ac757fe5a650989178eb819f
When an 800-series alarm occurs, users refer to the documentation to
know what kind of error is shown. But sometimes that is not enough
information.
The output of some commands can be useful information and could
save time when solving issues related to the storage alarms.
Closes-bug: 2004601
Test plan
PASS: * Build fm packages and deploy an ISO containing new fm
packages.
* Trigger alarms that were modified by this commit,
(e.g. shutdown a controller).
* Run fm alarm-list --uuid and copy the uuid of an 800-series
alarm.
* Run fm alarm-show <uuid> and check that the field
has changed.
Signed-off-by: Agustin Carranza <agustin.carranza@windriver.com>
Change-Id: I94e2719b55b4fc14b692439526b5b47204460ac7
Added the following tox targets for fm-rest-api:
- bandit
- flake8 / pep8
- pylint (suppressing most of the codes)
All the tox targets run on python3
The test-requirements.txt have been updated
The StarlingX Debian upper constraints are utilized.
The spec-lint (rpm) job is removed from Zuul.
Zuul runs pylint for sub directories
Bandit exclusions are updated.
Included a change to a .py file to trigger
the bandit zuul job.
Test Plan (for fm-rest-api)
PASS: tox -e bandit
PASS: tox -e coverage
PASS: tox -e flake8
PASS: tox -e pylint
Story: 2010531
Task: 47575
Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I7ecaf1c90495b283c26e02e3b481bfe4c77c3939
The checkEventYaml script verifies if all contents
are properly populated for the events.yaml file.
This change ensures that check is done by zuul, rather
than during the build.
yaml.load after version 5.1 requires a Loader argument.
The yaml.load calls in fm-doc are now updated to use safe_load
instead.
Test Plan:
PASS: tox -e linters
PASS: remove 'context' field from an alarm and observe
that tox -e linters reports a failure.
PASS: build-pkgs -p fm-doc
Story: 2010531
Task: 47549
Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I369ffe4c74fcaf5fe4a916822fed18a78ead8ff8
Remove stale alarm 270.001 (Host compute service failure), which
is raised by the vim. This might be an old reference to nova.
It’s likely not in use since stx.
Test Plan:
PASS: Verify with a load without the changes (removal of alarm)
and the event log in platform.log shows an entry for 270.001 alarm.
PASS: Verify with a load with changes of alarm removal and
the event log in platform.log does not show an entry for 270.001 alarm.
Closes-Bug: 2004744
Change-Id: I47a9f5cede2cfade4a16c63a2dc1bcfd563e88cf
Signed-off-by: Vanathi.Selvaraju <vanathi.selvaraju@windriver.com>
The Debian packaging has been changed to reflect all the
git commits under the directory, and not just the commits
to the metadata folder.
This ensures that any new code submissions under those
directories will increment the versions.
All packages have a higher version than before the change.
Test Plan:
PASS: build-pkgs -c -p fm-api
PASS: build-pkgs -c -p fm-common
PASS: build-pkgs -c -p fm-doc
PASS: build-pkgs -c -p fm-mgr
PASS: build-pkgs -c -p fm-rest-api
PASS: build-pkgs -c -p python-fmclient
Story: 2010550
Task: 47226
Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I65e881ba96512d2eaba25c44332d5ae82efea502
The python2.7 jobs will no longer be executed as part
of the zuul check and gate.
This also removed the unused devstack job for stx/fault
Story: 2010531
Task: 47304
Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I308a067e6ca23e45b7f5539853d7bb28f31bb7f5
For dynamic bash completion, instead of using the legacy
/etc/bash_completion.d, the current bash-completion can use a
dynamic mechanism in which the customized completion is called
upon completion activation.
The new location, which is already pointed to by the .bashrc file
and also used by /etc/bash_completion, is
/usr/share/bash-completion/completions.
However, the bash file was placed under a subfolder with the
name of the command which is not necessary since the file already
contains the command name.
Also, the proper file name shall contain .bash extension.
Closes-Bug: 2001553
Test Plan:
PASS: Build python-fmclient package.
PASS: Build Debian image and install it successfully.
Verify fm.bash is installed under /usr/share/bash-completion/completions
PASS: Verify bash completion is working as expected:
Signed-off-by: Enzo Candotti <enzo.candotti@windriver.com>
Change-Id: I3b796d26633459b98d7555e48e0bf5ea01c630d3
This change will allow this repo to pass zuul now
that this has merged:
https://review.opendev.org/c/zuul/zuul-jobs/+/866943
Tox 4 deprecated whitelist_externals.
Replace whitelist_externals with allowlist_externals
Removed the 'build' target from zuul which just invokes
the devstack script which is un-supported.
Partial-Bug: #2000399
Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I59bd7c82c297e12969e31b5de9ac02d2a47834a6