Added some ideas for better Fault Tolerance

2014-03-18 14:47:14 -06:00 · 2014-03-18 14:47:14 -06:00 · 789ae36e12
parent 2b5ca0140f
commit 789ae36e12
2 changed files with 22 additions and 7 deletions
--- a/README.md
+++ b/README.md
@ -25,15 +25,30 @@ There are 4 internal queues:
 4. finished - alarms that are done with processing, either the notification is sent or there was none.

 ## High Availability
-HA is handled by utilizing multiple partitions withing kafka. When multiple notification engines are running the partitions
-are spread out among them, as engines die/restart things reshuffle.
+HA is handled by running multiple notification engines. Only one at a time is active if it dies another can take
+over and continue from where it left. A zookeeper lock file is used to ensure only one running daemon. If needed
+the code can be modified to use kafka partitions to have multiple active engines working on different alarms.

+## Fault Tolerance
 When reading from the alarm topic no committing is done. The committing is done in sent_notification processor. This allows
 the processing to continue even though some notifications can be slow. In the event of a catastrophic failure some
 notifications could be sent but the alarms not yet acknowledged. This is an acceptable failure mode, better to send a
 notification twice than not at all.

-It is assumed the notification engine will be run by a process supervisor which will restart it in case of a failure.
+The general process when a major error is encountered is to exit the daemon which should allow another daemon to take
+over according to the HA strategy. It is also assumed the notification engine will be run by a process supervisor which
+will restart it in case of a failure. This way any errors which are not easy to recover from are automatically handled
+by the service restarting and the active daemon switching to another instance.
+
+Though this should cover all errors there is risk that an alarm or set of alarms can be processed and notifications
+sent out multiple times. To minimize this risk a number of techniques are used:
+
+- Timeouts are implemented with all notification types.
+- On a clean shutdown each process finishes active work.
+- An alarm TTL is utilized. Any alarm older than the TTL is not processed.
+- A maximum offset lag time is set. The offset is normally only updated if there is a continuous chain of finished
+  alarms. If there is a new offset that arrives yet still a gap it is normally held in reserve. If the maximum lag
+  time has been set and exceeded when a new finished alarm comes in the offset is updated regardless of gaps.

 # Operation
 Yaml config file by default is in '/etc/mon/notification.yaml', a sample is in this project.
--- a/notification.yaml
+++ b/notification.yaml
@ -25,10 +25,10 @@ processors:
        number: 4

 queues:
-    alarms_size: 1024
-    finished_size: 1024
-    notifications_size: 1024
-    sent_notifications_size: 1024
+    alarms_size: 256
+    finished_size: 256
+    notifications_size: 256
+    sent_notifications_size: 50  # limiting this size reduces potential # of re-sent notifications after a failure

 zookeeper:
    url: 192.168.10.10:2181