diff --git a/README.md b/README.md
deleted file mode 100644
index d17ce85..0000000
--- a/README.md
+++ /dev/null
@@ -1,109 +0,0 @@
-Team and repository tags
-========================
-
-[![Team and repository tags](https://governance.openstack.org/tc/badges/monasca-notification.svg)](https://governance.openstack.org/tc/reference/tags/index.html)
-
-
-
-# Notification Engine
-
-This engine reads alarms from Kafka and then notifies the customer using their configured notification method.
-Multiple notification and retry engines can run in parallel up to one per available Kafka partition. Zookeeper
-is used to negotiate access to the Kafka partitions whenever a new process joins or leaves the working set.
-
-# Architecture
-The notification engine generates notifications using the following steps:
-1. Reads Alarms from Kafka, with no auto commit. - KafkaConsumer class
-2. Determine notification type for an alarm. Done by reading from mysql. - AlarmProcessor class
-3. Send Notification. - NotificationProcessor class
-4. Successful notifications are added to a sent notification topic. - NotificationEngine class
-5. Failed notifications are added to a retry topic. - NotificationEngine class
-6. Commit offset to Kafka - KafkaConsumer class
-
-The notification engine uses three Kafka topics:
-1. alarm_topic: Alarms inbound to the notification engine.
-2. notification_topic: Successfully sent notifications.
-3. notification_retry_topic: Unsuccessful notifications.
-
-A retry engine runs in parallel with the notification engine and gives any
-failed notification a configurable number of extra chances at succeess.
-
-The retry engine generates notifications using the following steps:
-1. Reads Notification json data from Kafka, with no auto commit. - KafkaConsumer class
-2. Rebuild the notification that failed. - RetryEngine class
-3. Send Notification. - NotificationProcessor class
-4. Successful notifictions are added to a sent notification topic. - RetryEngine class
-5. Failed notifications that have not hit the retry limit are added back to the retry topic. - RetryEngine class
-6. Failed notifications that have hit the retry limit are discarded. - RetryEngine class
-6. Commit offset to Kafka - KafkaConsumer class
-
-The retry engine uses two Kafka topics:
-1. notification_retry_topic: Notifications that need to be retried.
-2. notification_topic: Successfully sent notifications.
-
-## Fault Tolerance
-When reading from the alarm topic no committing is done. The committing is done only after processing. This allows
-the processing to continue even though some notifications can be slow. In the event of a catastrophic failure some
-notifications could be sent but the alarms not yet acknowledged. This is an acceptable failure mode, better to send a
-notification twice than not at all.
-
-The general process when a major error is encountered is to exit the daemon which should allow the other processes to
-renegotiate access to the Kafka partitions. It is also assumed the notification engine will be run by a process
-supervisor which will restart it in case of a failure. This way any errors which are not easy to recover from are
-automatically handled by the service restarting and the active daemon switching to another instance.
-
-Though this should cover all errors there is risk that an alarm or set of alarms can be processed and notifications
-sent out multiple times. To minimize this risk a number of techniques are used:
-
-- Timeouts are implemented with all notification types.
-- An alarm TTL is utilized. Any alarm older than the TTL is not processed.
-
-# Operation
-Yaml config file by default is in '/etc/monasca/notification.yaml', a sample is in this project.
-
-## Monitoring
-statsd is incorporated into the daemon and will send all stats to statsd server launched by monasca-agent.
-Default host and port points at **localhost:8125**.
-
-- Counters
-    - ConsumedFromKafka
-    - AlarmsFailedParse
-    - AlarmsNoNotification
-    - NotificationsCreated
-    - NotificationsSentSMTP
-    - NotificationsSentWebhook
-    - NotificationsSentPagerduty
-    - NotificationsSentFailed
-    - NotificationsInvalidType
-    - AlarmsFinished
-    - PublishedToKafka
-- Timers
-    - ConfigDBTime
-    - SendNotificationTime
-
-# Future Considerations
-- More extensive load testing is needed
-    - How fast is the mysql db? How much load do we put on it. Initially I think it makes most sense to read notification
-      details for each alarm but eventually I may want to cache that info.
-    - How expensive are commits to Kafka for every message we read? Should we commit every N messages?
-    - How efficient is the default Kafka consumer batch size?
-    - Currently we can get ~200 notifications per second per NotificationEngine instance using webhooks to a local
-      http server. Is that fast enough?
-    - Are we putting too much load on Kafka at ~200 commits per second?
-
-# License
-
-Copyright (c) 2014 Hewlett-Packard Development Company, L.P.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
-implied.
-See the License for the specific language governing permissions and
-limitations under the License.
diff --git a/README.rst b/README.rst
new file mode 100644
index 0000000..a3aeec3
--- /dev/null
+++ b/README.rst
@@ -0,0 +1,139 @@
+Team and repository tags
+========================
+
+|Team and repository tags|
+
+.. raw:: html
+
+
+
+Notification Engine
+===================
+
+This engine reads alarms from Kafka and then notifies the customer using
+the configured notification method. Multiple notification and retry
+engines can run in parallel, up to one per available Kafka partition.
+Zookeeper is used to negotiate access to the Kafka partitions whenever a
+new process joins or leaves the working set.
+
+Architecture
+============
+
+The notification engine generates notifications using the following
+steps:
+
+1. Read Alarms from Kafka, with no auto commit. -
+   monasca\_common.kafka.KafkaConsumer class
+2. Determine notification type for an alarm. Done by reading from mysql. - AlarmProcessor class
+3. Send notification. - NotificationProcessor class
+4. Add successful notifications to a sent notification topic. - NotificationEngine class
+5. Add failed notifications to a retry topic. - NotificationEngine class
+6. Commit offset to Kafka - KafkaConsumer class
+
+The notification engine uses three Kafka topics:
+
+1. alarm\_topic: Alarms inbound to the notification engine.
+2. notification\_topic: Successfully sent notifications.
+3. notification\_retry\_topic: Failed notifications.
+
+A retry engine runs in parallel with the notification engine and gives
+any failed notification a configurable number of extra chances at
+success.
+
+The retry engine generates notifications using the following steps:
+
+1. Read notification json data from Kafka, with no auto commit. - KafkaConsumer class
+2. Rebuild the notification that failed. - RetryEngine class
+3. Send notification. - NotificationProcessor class
+4. Add successful notifications to a sent notification topic. - RetryEngine class
+5. Add failed notifications that have not hit the retry limit back to the retry topic. -
+   RetryEngine class
+6. Discard failed notifications that have hit the retry limit. - RetryEngine class
+7. Commit offset to Kafka. - KafkaConsumer class
+
+The retry engine uses two Kafka topics:
+
+1. notification\_retry\_topic: Notifications that need to be retried.
+2. notification\_topic: Successfully sent notifications.
+
+Fault Tolerance
+---------------
+
+When reading from the alarm topic, no committing is done. The committing
+is done only after processing. This allows the processing to continue
+even though some notifications can be slow. In the event of a
+catastrophic failure some notifications could be sent but the alarms
+have not yet been acknowledged. This is an acceptable failure mode,
+better to send a notification twice than not at all.
+
+The general process when a major error is encountered is to exit the
+daemon which should allow the other processes to renegotiate access to
+the Kafka partitions. It is also assumed that the notification engine
+will be run by a process supervisor which will restart it in case of a
+failure. In this way, any errors which are not easy to recover from are
+automatically handled by the service restarting and the active daemon
+switching to another instance.
+
+Though this should cover all errors, there is the risk that an alarm or
+a set of alarms can be processed and notifications are sent out multiple
+times. To minimize this risk a number of techniques are used:
+
+- Timeouts are implemented for all notification types.
+- An alarm TTL is utilized. Any alarm older than the TTL is not
+  processed.
+
+Operation
+=========
+
+``oslo.config`` is used for handling configuration options. A sample
+configuration file ``etc/monasca/notification.conf.sample`` can be
+generated by running:
+
+::
+
+    tox -e genconfig
+
+Monitoring
+----------
+
+StatsD is incorporated into the daemon and will send all stats to the
+StatsD server launched by monasca-agent. Default host and port points to
+**localhost:8125**.
+
+- Counters
+
+  - ConsumedFromKafka
+  - AlarmsFailedParse
+  - AlarmsNoNotification
+  - NotificationsCreated
+  - NotificationsSentSMTP
+  - NotificationsSentWebhook
+  - NotificationsSentPagerduty
+  - NotificationsSentFailed
+  - NotificationsInvalidType
+  - AlarmsFinished
+  - PublishedToKafka
+
+- Timers
+
+  - ConfigDBTime
+  - SendNotificationTime
+
+Future Considerations
+=====================
+
+- More extensive load testing is needed:
+
+  - How fast is the mysql db? How much load do we put on it. Initially I
+    think it makes most sense to read notification details for each alarm
+    but eventually I may want to cache that info.
+  - How expensive are commits to Kafka for every message we read? Should
+    we commit every N messages?
+  - How efficient is the default Kafka consumer batch size?
+  - Currently we can get ~200 notifications per second per
+    NotificationEngine instance using webhooks to a local http server. Is
+    that fast enough?
+  - Are we putting too much load on Kafka at ~200 commits per second?
+
+.. |Team and repository tags| image:: https://governance.openstack.org/tc/badges/monasca-notification.svg
+   :target: https://governance.openstack.org/tc/reference/tags/index.html
diff --git a/lower-constraints.txt b/lower-constraints.txt
index 2065e59..6509c4b 100644
--- a/lower-constraints.txt
+++ b/lower-constraints.txt
@@ -4,6 +4,7 @@ bandit==1.4.0
 configparser==3.5.0
 coverage==4.0
 debtcollector==1.2.0
+docutils==0.11
 extras==1.0.0
 fixtures==3.0.0
 flake8==2.5.5
diff --git a/setup.cfg b/setup.cfg
index f1fcccd..d850df6 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -8,7 +8,7 @@ classifier=
     License :: OSI Approved :: Apache Software License
     Topic :: System :: Monitoring
 keywords = openstack monitoring email
-description-file = README.md
+description-file = README.rst
 home-page = https://github.com/stackforge/monasca-notification
 license = Apache
 
@@ -35,5 +35,5 @@ universal = 1
 
 [extras]
 jira_plugin =
-  jira
+  jira>=1.0.3
   Jinja2>=2.10 # BSD License (3 clause)
diff --git a/test-requirements.txt b/test-requirements.txt
index 73c420e..c0cf4e9 100644
--- a/test-requirements.txt
+++ b/test-requirements.txt
@@ -15,3 +15,4 @@ testrepository>=0.0.18 # Apache-2.0/BSD
 SQLAlchemy!=1.1.5,!=1.1.6,!=1.1.7,!=1.1.8,>=1.0.10 # MIT
 PyMySQL>=0.7.6 # MIT License
 psycopg2>=2.6.2 # LGPL/ZPL
+docutils>=0.11 # OSI-Approved Open Source, Public Domain
diff --git a/tox.ini b/tox.ini
index b763daf..65716aa 100644
--- a/tox.ini
+++ b/tox.ini
@@ -43,6 +43,7 @@ basepython = python3
 commands =
   {[testenv:flake8]commands}
   {[testenv:bandit]commands}
+  python setup.py check --restructuredtext --strict
 
 [testenv:venv]
 basepython = python3
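
Note on the README content carried over by this change: the Architecture and Fault Tolerance sections describe a read, send, publish, then commit flow in which the Kafka offset is committed only after an alarm has been handled. The sketch below illustrates that ordering and the alarm TTL check with an in-memory stand-in for a Kafka partition. It is not the monasca-notification code: `InMemoryLog`, `run_engine`, the dict-shaped alarms, and the `timestamp` field are all hypothetical names chosen for illustration.

```python
import time


class InMemoryLog:
    """Hypothetical stand-in for one Kafka partition with a committed offset."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # offset of the next message to read after a restart

    def read_uncommitted(self):
        # Yield (offset, message) pairs starting at the last committed offset.
        for offset in range(self.committed, len(self.messages)):
            yield offset, self.messages[offset]

    def commit(self, offset):
        # Mark everything up to and including `offset` as processed.
        self.committed = offset + 1


def run_engine(log, send, sent_topic, retry_topic, alarm_ttl, now=time.time):
    """Handle every uncommitted alarm, committing only after it is handled.

    `send` is any callable returning True on success and False on failure.
    Alarms older than `alarm_ttl` seconds are skipped but still committed.
    """
    for offset, alarm in log.read_uncommitted():
        if now() - alarm["timestamp"] <= alarm_ttl:
            # Route the alarm to the sent topic or the retry topic.
            target = sent_topic if send(alarm) else retry_topic
            target.append(alarm)
        # The commit comes last: if send() raises, the offset is not advanced
        # and the alarm is read again after a restart (at-least-once delivery).
        log.commit(offset)
```

With this ordering, a crash between sending and committing causes the same alarm to be processed again after restart, which is the duplicate-notification case the README accepts in preference to dropping a notification.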