Copyright(c) 2013-2017, Wind River Systems, Inc.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

  * Redistributions of source code must retain the above copyright notice,
    this list of conditions and the following disclaimer.
  * Redistributions in binary form must reproduce the above copyright notice,
    this list of conditions and the following disclaimer in the documentation
    and/or other materials provided with the distribution.
  * Neither the name of Wind River Systems nor the names of its contributors
    may be used to endorse or promote products derived from this software
    without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

-----------------------------------------------------------------------

This file contains instructions for using the Titanium Cloud Guest-Client.

Titanium Cloud Setup
=====================

The following steps are required to set up the Titanium Cloud to heartbeat
a VM.

1. Create and modify a Flavor for your VM.

   A flavor extraspec, 'Guest Heartbeat', is used to indicate that VMs of
   this flavor support Titanium Cloud Guest Heartbeat. The default value is
   'False'.
   If support is indicated, then as soon as the VM's Titanium Cloud
   Guest-Client daemon registers with the Titanium Cloud Compute Services on
   the compute node host, heartbeating will be enabled.

   a) Create a new flavor:

      via dashboard ...
         - Select 'Admin->Flavors' to bring up the list of flavors
         - Select '+ Create Flavor' in the upper right.
         - Fill in the fields as desired
         - Select 'Create Flavor'

      via command line ...
         - nova flavor-create ...

   b) Modify the newly created flavor or an existing flavor:

      via dashboard ...
         - Select 'Admin->Flavors' to bring up the list of flavors
         - Choose a flavor to modify.
         - Select the flavor name to go to the Flavor Detail page
         - Select the Extra Specs TAB
         - Select '+ Create'
         - Select 'Guest Heartbeat' from pull-down Extra Spec menu
         - Check the 'Guest Heartbeat' checkbox
         - Select 'Create'

      via command line ...
         - nova flavor-key <flavor-name> set sw:wrs:guest:heartbeat=True

   Note: already running instances that were launched with this flavor are
   NOT affected.

2. Launch a new instance of your VM.

3. Verify your VM is running with Guest Heartbeat enabled.

   Log into the VM. Guest-Client logs are written to syslog's 'daemon'
   facility, which the syslog service typically writes to
   /var/log/daemon.log. Please refer to the syslog documentation for details
   on log settings in order to determine the location of logged Guest-Client
   messages.

   Guest-Client logs are easy to identify. The logs always contain the string
   'Guest-Client'. A recursive grep of /var/log is one way to determine where
   your syslog is sending the Guest-Client logs.

      LOG=`grep -r -l 'Guest-Client' /var/log`
      echo $LOG
      /var/log/daemon.log

   A successful connection can be verified by looking for the following log.
      grep "Guest-Client" $LOG | grep "heartbeat state change"
      Guest-Client heartbeat state change from enabling to enabled


VM Setup
========

Configuring Guest-Client Initialization/Start Scripts
-----------------------------------------------------

The Titanium Cloud communicates with the Guest-Client through a character
device. The packaged initialization/startup scripts need to be updated to
specify the character device exposed by QEMU to the VM.

                                                 +-- Virtual Machine ---+
                                                 |                      |
                                                 |                      |
  Titanium Cloud <-------------------> QEMU <------------> Guest-Client |
                  unix-stream-socket         char-device                |
                                                 |                      |
                                                 +----------------------+

The variable that needs updating in the initialization/start scripts is
called GUEST_CLIENT_DEVICE.

Also the location of the Guest-Client binary needs to be updated in the
initialization/start scripts. The variable that needs updating is called
GUEST_CLIENT.

Configuring Guest Heartbeat & Application Health Check
------------------------------------------------------

The Guest-Client within your VM will register with the Titanium Cloud Compute
Services on the compute node host. Part of that registration process is the
specification of a heartbeat interval and a corrective action for a
failed/unhealthy VM. The values of heartbeat interval and corrective action
come from the guest_heartbeat.conf file, which is located in the
/etc/guest-client/heartbeat directory by default.

Guest heartbeat works on a challenge response model. The Titanium Cloud
Compute Services on the compute node host will challenge the Guest-Client
daemon with a message each interval. The Guest-Client must respond prior to
the next interval with a message indicating good health. If the Titanium
Cloud Compute Services does not receive a valid response, or if the response
specifies that the VM is in ill health, then corrective action is taken.

The mechanism can be extended by allowing additional VM resident application
specific scripts and processes to register for heartbeating.
Each script or process can specify its own heartbeat interval, and its own
corrective action to be taken against the VM as a whole. On ill health the
Guest-Client reports ill health to the Titanium Cloud Compute Services on the
compute node host on the next challenge, and provokes the corrective action.

This mechanism allows for detection of a failed or hung QEMU/KVM instance, or
a failure of the OS within the VM to schedule the Guest-Client process or to
route basic IO, or an application level error/failure.

Configuring the Guest-Client Heartbeat & Application Health Check ...

The heartbeat interval defaults to one second (1000 milliseconds) and can be
overridden by the VM in the guest_heartbeat.conf.

   /etc/guest-client/heartbeat/guest_heartbeat.conf:
      ## This specifies the interval between heartbeats in milliseconds between the
      ## guest-client heartbeat and the Titanium Cloud Compute Services on the
      ## compute node host.
      HB_INTERVAL=1000

The corrective action defaults to 'reboot' and can be overridden by the VM in
the guest_heartbeat.conf.

   /etc/guest-client/heartbeat/guest_heartbeat.conf:
      ## This specifies the corrective action against the VM in the case of a
      ## heartbeat failure between the guest-client and Titanium Cloud Compute
      ## Services on the compute node host and also when the health script
      ## configured below fails.
      ##
      ## Your options are:
      ##   "log"     Only a log is issued.
      ##   "reboot"  Issue a reboot against this VM.
      ##   "stop"    Issue a stop against this VM.
      ##
      CORRECTIVE_ACTION="reboot"

A health check script can be registered to run periodically to verify the
health of the VM. This is specified in the guest_heartbeat.conf.

   /etc/guest-client/heartbeat/guest_heartbeat.conf:
      ## The Path to the health check script. This is optional.
      ## The script will be called periodically to check for the health of the VM.
      ## The health check interval is specified in seconds.
      HEALTH_CHECK_INTERVAL=30
      HEALTH_CHECK_SCRIPT="/etc/guest-client/heartbeat/sample_health_check_script"

Configuring Guest Notifications and Voting
------------------------------------------

The Guest-Client running in the VM can be used as a conduit for notifications
of VM lifecycle events being taken by the Titanium Cloud that will impact this
VM. Reboots, pause/resume and migrations are examples of the types of events
your VM can be notified of. Depending on the event, a vote on the event may be
required before a notification is sent. Notifications may precede the event,
follow it or both. The full table of events and notifications is found below.

Titanium Action  Event Name          Vote*  Pre-notification  Post-notification  Timeout
---------------  ------------------  -----  ----------------  -----------------  -------
stop             stop                yes    yes               no                 shutdown
reboot           reboot              yes    yes               no                 shutdown
pause            pause               yes    yes               no                 suspend
unpause          unpause             no     no                yes                resume
suspend          suspend             yes    yes               no                 suspend
resume           resume              no     no                yes                resume
resize           resize_begin        yes    yes               no                 suspend
                 resize_end          no     no                yes                resume
live-migrate     live_migrate_begin  yes    yes               no                 suspend
                 live_migrate_end    no     no                yes                resume
cold-migrate     cold_migrate_begin  yes    yes               no                 suspend
                 cold_migrate_end    no     no                yes                resume**

 *  voting has its own timeout called 'vote' that is event independent.
 ** after VM reboot and reconnection which is subject to the 'restart' timeout.

Notifications are an opportunity for the VM to take preparatory actions in
anticipation of the forthcoming event, or recovery actions after the event has
completed. A few examples

  - A reboot or stop notification might allow the application to stop
    accepting transactions and cleanly wrap up existing transactions.
  - A 'resume' notification after a suspend might trigger a time adjustment.
  - Pre and post migrate notifications might trigger the application to
    de-register and then re-register with a network load balancer.

If you register a notification handler, it will receive all events.
If an event is not of interest, the handler should return immediately with a
successful return code.

A process may only register a single notification handler. However multiple
processes may independently register handlers. Also a script based handler
may be registered via the guest_heartbeat.conf. When multiple processes and
scripts register notification handlers, they will be run in parallel.

Notifications are subject to configurable timeouts. Timeouts are specified by
each registered process and in the guest_heartbeat.conf. The timeouts in the
guest_heartbeat.conf govern the maximum time all registered notification
handlers have to complete.

While pre-notification handlers are running, the event will be delayed. If
the timeout is reached, the event will be allowed to proceed.

While post-notification handlers are running, or waiting to be run, the
Titanium Cloud will not be able to declare the action complete. Keep in mind
that many events that offer a post notification will require the VM's
Guest-Client to reconnect to the compute host, and that may be further delayed
while the VM is rebooted as in a cold migration. When post-notification is
finally triggered, it is subject to a timeout as well. If the timeout is
reached, the event will be declared complete.

NOTE: A post-event notification that follows a reboot, as in the
cold_migrate_end event, is a special case. It will be triggered as soon as
the local heartbeat server reconnects with the compute host, and likely before
any processes have a chance to register a handler. The only handler
guaranteed to see such a notification is a script directly registered by the
Guest-Client itself via guest_heartbeat.conf.

In addition to notifications, there is also an opportunity for the VM to vote
on any proposed event. Voting precedes all notifications, and offers the VM a
chance to reject the event the Titanium Cloud wishes to initiate. If multiple
handlers are registered, it only takes one rejection to abort the event.

The same handler that handles notifications also handles voting.

Voting is subject to a configurable timeout. The same timeout applies
regardless of the event. The timeout is taken from the guest_heartbeat.conf
file and is specified when the Guest-Client registers with compute services on
the host. This timeout governs the maximum time all registered voting
handlers have to complete the vote. Any voters that fail to vote within the
timeout are assumed to have agreed with the proposed action.

Rejecting an event should be the exception, not the rule, reserved for cases
when the VM is handling an exceptionally sensitive operation, as well as a
slow operation that can't complete in the notification timeout. An example -
an active-standby application deployment (1:1), where the active rejects a
shutdown or pause or ... because its peer standby is not ready or
synchronized.

A vote handler should generally not take any action beyond returning its
vote. Just because you vote to accept doesn't mean all your peers will also
accept (i.e. the event might not happen). Taking an action against an event
that never happens is almost certainly NOT what you want. Instead save your
actions for the notification that follows if no one rejects. The one
exception might be to temporarily block the initiation of any new task that
would cause you to vote to reject an event in the near future. The theory
being that the requester of the event may retry in the near future.

The Titanium Cloud is not required to offer a vote. Voting may be bypassed on
certain recovery scenarios.

Configuring Guest-Client Notification and Voting ...

      ## The overall time to vote in seconds regardless of the event being voted
      ## upon. It should reflect the slowest of all expected voters when in a sane
      ## and healthy condition, plus some allowance for scheduling and messaging.
      VOTE=8

      ## The overall time to handle a stop or reboot notification in seconds.
      ## It should reflect the slowest of all expected notification handlers
      ## when in a sane and healthy condition, plus some allowance for scheduling
      ## and messaging.
      SHUTDOWN_NOTICE=8

      ## The overall time to handle a pause, suspend or migrate-begin notification
      ## in seconds. It should reflect the slowest of all expected notification
      ## handlers when in a sane and healthy condition, plus some allowance for
      ## scheduling and messaging.
      SUSPEND_NOTICE=8

      ## The overall time to handle an unpause, resume or migrate-end notification
      ## in seconds. It should reflect the slowest of all expected notification
      ## handlers when in a sane and healthy condition, plus some allowance for
      ## scheduling and messaging. It does not include reboot time.
      RESUME_NOTICE=13

      ## The overall time to reboot, up to the point the guest-client heartbeat
      ## starts in seconds. Allow for some I/O contention.
      RESTART=300

      ## The Path to the event notification script. This is optional.
      ## The script will be called when an action is initiated that will impact
      ## the VM.
      ##
      ## The event handling script is invoked with two parameters:
      ##
      ##    event_handling_script <MSG_TYPE> <EVENT>
      ##
      ## MSG_TYPE is one of:
      ##    'revocable'    Indicating a vote is called for. Return zero to accept,
      ##                   non-zero to reject. For a rejection, the first line of
      ##                   stdout emitted by the script will be captured and
      ##                   logged indicating why the event was rejected.
      ##
      ##    'irrevocable'  Indicating this is a notification only. Take preparatory
      ##                   actions and return zero if successful, or non-zero on
      ##                   failure. For a failure, the first line of stdout
      ##                   emitted by the script will be captured and logged
      ##                   indicating the cause of the failure.
      ##
      ## EVENT is one of: ( 'stop', 'reboot', 'pause', 'unpause', 'suspend',
      ##                    'resume', 'live_migrate_begin',
      ##                    'live_migrate_end', 'cold_migrate_begin',
      ##                    'cold_migrate_end' )
      ##
      EVENT_NOTIFICATION_SCRIPT="/etc/guest-client/heartbeat/sample_event_handling_script"


VM Application Setup
====================

An application running in the VM may wish to register directly for voting and
notifications. See the guest_heartbeat_api.h for more details. A working
example can be found in the guest_client_api source directory in the
sample_guest_app.c file.

To compile the sample-guest-app run ...

   cd wrs-guest-heartbeat-3.0.0
   make sample

This will create an executable called sample-guest-app in the 'build'
directory.

When compiling the guest application ...

   include headers:
      #include
      #include
      #include

   link with:
      -lguest_common_api -lguest_heartbeat_api
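
Example Event Handling Script (sketch)
--------------------------------------

The event notification script interface described in the guest_heartbeat.conf
comments can be illustrated with a minimal sketch. It assumes the script
receives MSG_TYPE as its first argument and EVENT as its second. The
handle_event function name and the BUSY_FLAG marker file are hypothetical and
exist only for illustration; the packaged sample_event_handling_script may
look quite different.

```shell
#!/bin/sh
# Sketch of a script that could be named in EVENT_NOTIFICATION_SCRIPT.
# Assumption: $1 is MSG_TYPE ('revocable' or 'irrevocable'), $2 is EVENT.

# Hypothetical marker file the application creates while it is in a state
# where a stop, reboot, pause or suspend should be rejected.
BUSY_FLAG="${BUSY_FLAG:-/var/run/my_app.busy}"

handle_event() {
    msg_type="$1"
    event="$2"

    case "$msg_type" in
        revocable)
            # Vote only: take no action, just accept (0) or reject (non-zero).
            case "$event" in
                stop|reboot|pause|suspend)
                    if [ -e "$BUSY_FLAG" ]; then
                        # The first line of stdout is logged as the reason.
                        echo "application busy, rejecting $event"
                        return 1
                    fi
                    ;;
            esac
            return 0
            ;;
        irrevocable)
            # Notification only: preparatory or recovery work goes here.
            case "$event" in
                stop|reboot)
                    : # e.g. stop accepting new transactions
                    ;;
                resume|unpause|*_end)
                    : # e.g. trigger a time adjustment
                    ;;
                *)
                    : # event not of interest: succeed immediately
                    ;;
            esac
            return 0
            ;;
        *)
            echo "unknown message type: $msg_type"
            return 1
            ;;
    esac
}

# Act as a script only when arguments are supplied.
if [ "$#" -gt 0 ]; then
    handle_event "$@"
    exit $?
fi
```

The vote branch deliberately does nothing except return its answer, since an
accepted vote does not guarantee that the event will happen. Real work is
left to the 'irrevocable' branch.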