.. This work is licensed under a Creative Commons Attribution 3.0 Unported License. http://creativecommons.org/licenses/by/3.0/legalcode ========================== Deployment Steps Framework ========================== https://storyboard.openstack.org/#!/story/1753128 There is a desire for ironic to support customizable and extendable deployment steps, which would provide the ability to prepare bare metal nodes (servers) that better match the needs of the users who will be using the nodes. In order to support that, we propose refactoring the existing deployment code in ironic into a deployment steps framework, similar to the cleaning steps framework. Problem description =================== Presently, ironic provides a way to prepare nodes prior to them being made available for deployment (see `state diagram`_). This is done via `cleaning`_. However, it is not always possible, efficient, or effective to perform some of these preparations without knowing the requirements of the users of the nodes. In addition, there may be operations that should only be done once the users' requirements are known. For example, during `cleaning`_, a node could be configured for RAID. However, this might not be the desired RAID configuration that the user of the node wants. Since the user's desires are only known at deployment time, a mechanism that allows for custom RAID configuration during deployment is preferred. Features like custom RAID configuration, BIOS configuration, and custom kernel boot parameters are a few use cases that would benefit from a way of defining deployment steps at deploy time, in ironic. It makes sense to provide support for this via deployment steps. This would be conceptually similar to the cleaning steps supported by ironic already. Proposed change =============== This proposal is the first step in providing support for performing different deployment operations based on the user's desires. (The `RFE to reconfigure nodes on deploy using traits`_ is an example of a feature that depends on this work.) The proposed change is to implement a deployment steps (or ``deploy steps``) framework that is very similar to the existing framework for automated and manual `cleaning`_. (This was discussed and agreed upon in principle, at the `OpenStack Dublin PTG`_.) This change is internal to ironic. Users will not be able to affect the deployment process any more than they can do today. Conceptually, the clean steps model is a simple idea and operators are familiar with it. Having similar deploy steps provides consistency and it will be easier for operators to adopt, due to their familiarity with clean steps. It is also powerful in that, at the end of the day (or year or two), a particular step could be a clean step, a deploy step, or both. This includes re-factoring of code to be used by both clean and deploy steps. The existing deployment process will be implemented as a list of one (or more) deploy steps. What is a deploy step? ---------------------- Similar to clean steps, functions that are deploy steps will be decorated with ``@deploy_step``, defined in ironic/drivers/base.py as follows:: def deploy_step(priority, argsinfo=None): """Decorator for deployment steps. :param priority: an integer priority; used for determining the order in which the step is run in the the deployment process. (See below, "When are deploy steps executed" for more details.) :param argsinfo: a dictionary of keyword arguments where key is the name of the argument and value is a dictionary as follows: ‘description’: . Required. This should include possible values. ‘required’: Boolean. Optional; default is False. True if this argument is required. An alternative is to have one decorator that allows specifying a function to be a clean step and/or a deploy step, e.g.:: @step(clean_priority=0, deploy_priority=0, argsinfo=None) However, clean steps are abortable and deploy steps aren't (yet, see below), and it is unclear whether other arguments might be added for the deploy step decorator. Thus, it seems safer and simpler to have a separate decorator for deploy steps. (Having one decorator for both types of steps is left as a future exercise.) Although ironic allows cleaning to be aborted, ironic doesn't allow the deployment to be aborted (although there is an `RFE to support abort in deploy_wait`_). So it is outside the scope of this specification. A deploy step can be implemented by any Interface, not just DeployInterface. When are deploy steps executed? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Each deploy step has a priority; a non-negative integer. In this first phase, the priorities will be hard-coded. There will be no way to turn off or change these priorities. The steps are executed from highest priority to lowest priority. Steps with priorities of zero (0) are not executed. A step has to be finished, before the next one is started. Alternatives ------------ There may be other ways to provide support for customizable deployment steps per user/instance, but there doesn't seem to be good reasons for having a different design from that used for clean steps. We could choose not to provide support for customized deploy steps on a per user/instance basis. In that case, some of the current workarounds to overcome this problem include: * have groups of nodes configured in advance (using clean steps) for each required combination of configurations. This could lead to strange capacity planning issues. * executing the desired configuration steps after each node is deployed. As these configuration steps are executed post-deploy, most of them need a reboot of the node, orchestration is needed to do these reboots properly, and this causes performance issues that are not acceptable in a production environment. This approach won't work for pre-deploy steps though, such as RAID for the boot disk. * users can create their own images for each use case. But the limitation is that the number of images can grow exponentially, and that there is no ability to match a specific type of hardware with a specific image. * use a customizable DeployInterface like the `ansible`_ deploy interface (although the `ansible`_ deploy interface is not recommended for production use). This may not be able to achieve the same level of access to the hardware or settings, to have the same effect. Data model impact ----------------- Similar to clean steps, a Node object will be updated with: * a new ``deploy_step`` field: this is the current deploy step that is being executed or None if no steps have been executed yet. This will require an update to the DB. * ``driver_internal_info['deploy_steps']``: the list of deploy steps to be executed. * ``driver_internal_info['deploy_step_index']``: the index into the list of deploy steps (or None if no steps have been executed yet); this corresponds to node.deploy_step. State Machine Impact -------------------- No new state or transition will be added. The state of the node will alternate from states.DEPLOYING (``deploying``) to states.DEPLOYWAIT (``wait call-back``) for each asynchronous deploy step. REST API impact --------------- There will not be any new API methods. GET /v1/nodes/* ~~~~~~~~~~~~~~~ The GET /v1/nodes/* requests that return information about nodes will be modified to also return the node's ``deploy_step`` field and the deploy-related information in the node's ``driver_internal_info`` field. Similar to the ``clean_step`` field, the ``deploy_step`` field will be the current deploy step being executed, or None if there is no deployment in progress (or hasn't started yet). If the deployment fails, the ``deploy_step`` field will show which step caused the deployment to fail. This change requires a new API version. For nodes that have not yet been deployed using the deploy steps, the ``deploy_step`` field will be None, and there won't be any deploy-related entries in the ``driver_internal_info`` field. For older API versions, this ``deploy_step`` field will not be available, although any deploy-related entries in the ``driver_internal_info`` field will be shown. Client (CLI) impact ------------------- The only change (when the new API version is specified), is that the response for a Node will include the new ``deploy_step`` field and during deployment, the new deploy-step-related entries in the node's ``driver_internal_info`` field. "ironic" CLI ~~~~~~~~~~~~ Even though this has been deprecated, responses will include the change described above. "openstack baremetal" CLI ~~~~~~~~~~~~~~~~~~~~~~~~~ Responses will inclde the change described above. RPC API impact -------------- None. Driver API impact ----------------- Similar to cleaning, these methods will be added to the drivers.base.BaseInterface class:: def get_deploy_steps(self, task): """Get a list of deploy steps this interface can perform on a node. :param task: a TaskManager object, useful for interfaces overriding this method :returns: a list of deploy step dictionaries """ def execute_deploy_step(self, task, step): """Execute the deploy step on task.node. :param task: a TaskManager object :param step: The dictionary representing the step to execute :raises DeployStepFailed: if the step fails :returns: None if this method has completed synchronously, or states.DEPLOYWAIT if the step will continue to execute asynchronously. """ The actual deploy steps will be determined in the coding phase; we will start with one big deploy step (to get the framework in) and then break that step up into more steps -- determined by what makes sense given the existing code, and the constraints (e.g. support for out-of-tree drivers, backwards compatibility when a deploy step in release N is split into several steps in release N+1). (This specification will be updated with the actual deploy steps, once that is determined.) Out-of-tree Interfaces ~~~~~~~~~~~~~~~~~~~~~~ Although the conductor will still support deployment the old way (without deploy steps), this support will be deprecated and removed based on the `standard deprecation policy `_. (The deprecation period may be extended if there is a strong desire to do so by the vendors; we're flexible.) For out-of-tree interfaces that don't have deploy steps, the conductor will emit (log) a deprecation warning, that the out-of-tree interface should be updated to use deploy steps, and that all nodes that are being deployed using the old way, need to be finished deploying, before an upgrade to the release where there is no longer any more support for the old way. Nova driver impact ------------------ None Ramdisk impact -------------- There should be no impact to the ramdisk (IPA). In the future, when we allow configuration and specification of deploy steps per node, we might provide support for collecting deploy steps from the ramdisk, but that is out of scope for this first phase. Security impact --------------- None Other end user impact --------------------- None. Scalability impact ------------------ None. Performance Impact ------------------ None. Other deployer impact --------------------- None. Developer impact ---------------- DeployInterfaces (and any other interfaces involved in the deployment process) will need to be written with deploy steps in mind. Implementation ============== Assignee(s) ----------- Primary assignee: * rloo (Ruby Loo) Work Items ---------- Ironic: * Add support for deploy steps to base driver * rework the existing code into one or more deploy steps * Update the conductor to get the deploy steps and execute them ``python-ironicclient``: * Add support for node.deploy_step Dependencies ============ None. Testing ======= * unit tests for all new code and changed behaviour * CI jobs already test the deployment process; they should continue to work with these changes Upgrades and Backwards Compatibility ==================================== * Old Interfaces will work with the new BaseInterface class because the code will cleanly fall back when an Interface does not support ``get_deploy_steps()``. A deprecation warning will be logged, and we will remove support for the old way according to the OpenStack policy for deprecations & removals. * Likewise, an Interface implementation with ``get_deploy_steps()`` will work in an older version of Ironic. * In a cold upgrade: * if the agent heartbeats and driver_internal_info['deploy_steps'] is empty, proceed the old way. * if a deployment is started by a conductor using deploy steps (new code), it means all the conductors are using the new code, so the deployment can continue on any conductor that supports the node * In a rolling upgrade: * if the agent heartbeats and driver_internal_info['deploy_steps'] is empty, proceed the old way (similar to cold upgrade) * a new conductor will not use the deploy steps mechanism if it is pinned to the old release (via `pin_release_version` configuration option). if a deployment is started by a conductor using deploy steps (new code), it means that it is unpinned, and all the conductors are using the new code, so the deployment can continue on any conductor that supports the node. Documentation Impact ==================== * api-ref: https://developer.openstack.org/api-ref/baremetal/ will be updated to include the new node.deploy_step field References ========== * `cleaning`_ * `OpenStack Dublin PTG`_ etherpad * `RFE to reconfigure nodes on deploy using traits`_ * `RFE to support abort in deploy_wait`_ * `state diagram`_ .. _`cleaning`: https://docs.openstack.org/ironic/latest/admin/cleaning.html .. _`OpenStack Dublin PTG`: https://etherpad.openstack.org/p/ironic-rocky-ptg-deploy-steps .. _`RFE to reconfigure nodes on deploy using traits`: https://bugs.launchpad.net/ironic/+bug/1722275 .. _`RFE to support abort in deploy_wait`: https://bugs.launchpad.net/ironic/+bug/1498251 .. _`state diagram`: https://docs.openstack.org/ironic/latest/contributor/states.html .. _`ansible`: https://docs.openstack.org/ironic/latest/admin/drivers/ansible.html