From 38e37b9717212b232fe2d247bbc92bab5502cc12 Mon Sep 17 00:00:00 2001
From: Ruby Loo
Date: Sat, 3 Mar 2018 18:29:24 +0000
Subject: [PATCH] Deployment steps framework

This proposes refactoring the existing deployment code/process so that
we have a deploy steps framework that is similar to our clean steps
framework. This would be the first step towards supporting customizable
deployment steps based on the user's requirements that are only known
at deploy time.

Change-Id: I2f68170e6741b0dbb6d5fbd5315a3be9fd7b28a7
Story: 1753128
Task: 10665
---
 specs/approved/deployment-steps-framework.rst | 395 ++++++++++++++++++
 .../deployment-steps-framework.rst | 1 +
 2 files changed, 396 insertions(+)
 create mode 100644 specs/approved/deployment-steps-framework.rst
 create mode 120000 specs/not-implemented/deployment-steps-framework.rst

diff --git a/specs/approved/deployment-steps-framework.rst b/specs/approved/deployment-steps-framework.rst
new file mode 100644
index 00000000..dd8bf578
--- /dev/null
+++ b/specs/approved/deployment-steps-framework.rst
@@ -0,0 +1,395 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==========================
+Deployment Steps Framework
+==========================
+
+https://storyboard.openstack.org/#!/story/1753128
+
+There is a desire for ironic to support customizable and extendable deployment
+steps, which would provide the ability to prepare bare metal nodes (servers)
+that better match the needs of the users who will be using the nodes.
+
+In order to support that, we propose refactoring the existing deployment
+code in ironic into a deployment steps framework, similar to the cleaning
+steps framework.
+
+Problem description
+===================
+
+Presently, ironic provides a way to prepare nodes prior to them being made
+available for deployment (see `state diagram`_). This is done via `cleaning`_.
+However, it is not always possible, efficient, or effective to perform some of
+these preparations without knowing the requirements of the users of the
+nodes. In addition, there may be operations that should only be done once the
+users' requirements are known.
+
+For example, during `cleaning`_, a node could be configured for RAID.
+However, this might not be the desired RAID configuration that the user of the
+node wants. Since the user's desires are only known at deployment time, a
+mechanism that allows for custom RAID configuration during deployment is
+preferred.
+
+Features like custom RAID configuration, BIOS configuration, and custom
+kernel boot parameters are a few use cases that would benefit from a way
+of defining deployment steps at deploy time, in ironic.
+
+It makes sense to provide support for this via deployment steps. This would
+be conceptually similar to the cleaning steps supported by ironic already.
+
+Proposed change
+===============
+
+This proposal is the first step in providing support for performing different
+deployment operations based on the user's desires. (The `RFE to reconfigure
+nodes on deploy using traits`_ is an example of a feature that depends on
+this work.)
+
+The proposed change is to implement a deployment steps (or ``deploy steps``)
+framework that is very similar to the existing framework for automated and
+manual `cleaning`_. (This was discussed and agreed upon in principle, at the
+`OpenStack Dublin PTG`_.)
+
+This change is internal to ironic. Users will not be able to affect the
+deployment process any more than they can do today.
+
+Conceptually, the clean steps model is a simple idea and operators are familiar
+with it. Having similar deploy steps provides consistency and it will be easier
+for operators to adopt, due to their familiarity with clean steps. It is also
+powerful in that, at the end of the day (or year or two), a particular step
+could be a clean step, a deploy step, or both.
+
+This includes re-factoring of code to be used by both clean and deploy steps.
+
+The existing deployment process will be implemented as a list of one (or more)
+deploy steps.
+
+What is a deploy step?
+----------------------
+Similar to clean steps, functions that are deploy steps will be decorated
+with ``@deploy_step``, defined in ironic/drivers/base.py as follows::
+
+    def deploy_step(priority, argsinfo=None):
+        """Decorator for deployment steps.
+
+        :param priority: an integer priority; used for determining the order in
+            which the step is run in the deployment process. (See below,
+            "When are deploy steps executed" for more details.)
+        :param argsinfo: a dictionary of keyword arguments where key is the name of
+            the argument and value is a dictionary as follows:
+
+            ‘description’: . Required. This should include
+                           possible values.
+            ‘required’: Boolean. Optional; default is False. True if this
+                        argument is required.
+
+An alternative is to have one decorator that allows specifying a function
+to be a clean step and/or a deploy step, e.g.::
+
+    @step(clean_priority=0, deploy_priority=0, argsinfo=None)
+
+However, clean steps are abortable and deploy steps aren't (yet, see below),
+and it is unclear whether other arguments might be added for the deploy step
+decorator. Thus, it seems safer and simpler to have a separate decorator for
+deploy steps. (Having one decorator for both types of steps is left as a
+future exercise.)
+
+Although ironic allows cleaning to be aborted, ironic doesn't allow the
+deployment to be aborted (although there is an `RFE to support abort in
+deploy_wait`_). So it is outside the scope of this specification.
+
+A deploy step can be implemented by any Interface, not just DeployInterface.
+
+When are deploy steps executed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Each deploy step has a priority; a non-negative integer. In this first phase,
+the priorities will be hard-coded. There will be no way to turn off or change
+these priorities.
+
+The steps are executed from highest priority to lowest priority. Steps with
+priorities of zero (0) are not executed. A step has to be finished, before the
+next one is started.
+
+Alternatives
+------------
+
+There may be other ways to provide support for customizable deployment
+steps per user/instance, but there doesn't seem to be good reasons for
+having a different design from that used for clean steps.
+
+We could choose not to provide support for customized deploy steps on a per
+user/instance basis. In that case, some of the current workarounds to overcome
+this problem include:
+
+* have groups of nodes configured in advance (using clean steps) for each
+  required combination of configurations. This could lead to strange capacity
+  planning issues.
+
+* executing the desired configuration steps after each node is deployed.
+  As these configuration steps are executed post-deploy, most of them need a
+  reboot of the node, orchestration is needed to do these reboots properly,
+  and this causes performance issues that are not acceptable in a production
+  environment. This approach won't work for pre-deploy steps though, such as
+  RAID for the boot disk.
+
+* users can create their own images for each use case. But the limitation
+  is that the number of images can grow exponentially, and that there is no
+  ability to match a specific type of hardware with a specific image.
+
+* use a customizable DeployInterface like the `ansible`_ deploy interface
+  (although the `ansible`_ deploy interface is not recommended for production
+  use). This may not be able to achieve the same level of access to the
+  hardware or settings, to have the same effect.
+
+Data model impact
+-----------------
+
+Similar to clean steps, a Node object will be updated with:
+
+* a new ``deploy_step`` field: this is the current deploy step that is being
+  executed or None if no steps have been executed yet. This will require an
+  update to the DB.
+* ``driver_internal_info['deploy_steps']``: the list of deploy steps to be
+  executed.
+* ``driver_internal_info['deploy_step_index']``: the index into the list of
+  deploy steps (or None if no steps have been executed yet); this corresponds
+  to node.deploy_step.
+
+State Machine Impact
+--------------------
+
+No new state or transition will be added.
+
+The state of the node will alternate from states.DEPLOYING (``deploying``) to
+states.DEPLOYWAIT (``wait call-back``) for each asynchronous deploy step.
+
+REST API impact
+---------------
+
+There will not be any new API methods.
+
+GET /v1/nodes/*
+~~~~~~~~~~~~~~~
+The GET /v1/nodes/* requests that return information about nodes will
+be modified to also return the node's ``deploy_step`` field and the
+deploy-related information in the node's ``driver_internal_info`` field.
+
+Similar to the ``clean_step`` field, the ``deploy_step`` field will be the
+current deploy step being executed, or None if there is no deployment in
+progress (or hasn't started yet).
+
+If the deployment fails, the ``deploy_step`` field will show which step caused
+the deployment to fail.
+
+This change requires a new API version. For nodes that have not yet been
+deployed using the deploy steps, the ``deploy_step`` field will be None, and
+there won't be any deploy-related entries in the ``driver_internal_info``
+field.
+
+For older API versions, this ``deploy_step`` field will not be available,
+although any deploy-related entries in the ``driver_internal_info`` field will
+be shown.
+
+Client (CLI) impact
+-------------------
+The only change (when the new API version is specified), is that the response
+for a Node will include the new ``deploy_step`` field and during deployment,
+the new deploy-step-related entries in the node's ``driver_internal_info``
+field.
+
+"ironic" CLI
+~~~~~~~~~~~~
+Even though this has been deprecated, responses will include the change
+described above.
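
The version-dependent visibility of ``deploy_step`` described under "REST API impact" could be sketched roughly as follows. This is only an illustration, not ironic's API code: the helper ``node_view`` and the version tuple ``(1, 99)`` are hypothetical placeholders, since the spec does not state which API version introduces the field.

```python
# Hypothetical placeholder: the spec does not name the API version that
# introduces the deploy_step field.
DEPLOY_STEP_MIN_VERSION = (1, 99)


def node_view(node, requested_version):
    """Return the node fields visible at the requested API version.

    Older versions hide deploy_step, but still show whatever is stored in
    driver_internal_info, as described in the spec.
    """
    view = dict(node)
    if requested_version < DEPLOY_STEP_MIN_VERSION:
        view.pop("deploy_step", None)
    return view


# Invented sample node, for illustration only.
node = {
    "uuid": "example-node",
    "deploy_step": None,
    "driver_internal_info": {"deploy_steps": [], "deploy_step_index": None},
}

old_view = node_view(node, (1, 0))
new_view = node_view(node, DEPLOY_STEP_MIN_VERSION)
```

In this sketch both views keep ``driver_internal_info``; only the top-level ``deploy_step`` field depends on the requested version.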
+
+"openstack baremetal" CLI
+~~~~~~~~~~~~~~~~~~~~~~~~~
+Responses will include the change described above.
+
+RPC API impact
+--------------
+
+None.
+
+Driver API impact
+-----------------
+
+Similar to cleaning, these methods will be added to the
+drivers.base.BaseInterface class::
+
+    def get_deploy_steps(self, task):
+        """Get a list of deploy steps this interface can perform on a node.
+
+        :param task: a TaskManager object, useful for interfaces overriding this method
+        :returns: a list of deploy step dictionaries
+        """
+
+    def execute_deploy_step(self, task, step):
+        """Execute the deploy step on task.node.
+
+        :param task: a TaskManager object
+        :param step: The dictionary representing the step to execute
+        :raises DeployStepFailed: if the step fails
+        :returns: None if this method has completed synchronously, or
+            states.DEPLOYWAIT if the step will continue to execute
+            asynchronously.
+        """
+
+The actual deploy steps will be determined in the coding phase; we will start
+with one big deploy step (to get the framework in) and then break that step up
+into more steps -- determined by what makes sense given the existing code, and
+the constraints (e.g. support for out-of-tree drivers, backwards compatibility
+when a deploy step in release N is split into several steps in release N+1).
+
+(This specification will be updated with the actual deploy steps, once that
+is determined.)
+
+Out-of-tree Interfaces
+~~~~~~~~~~~~~~~~~~~~~~
+Although the conductor will still support deployment the old way (without
+deploy steps), this support will be deprecated and removed based on the
+`standard deprecation policy
+`_.
+(The deprecation period may be extended if there is a strong desire to do so
+by the vendors; we're flexible.)
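
To make the ``get_deploy_steps()`` / ``execute_deploy_step()`` pair from "Driver API impact" and the ordering rule from "When are deploy steps executed?" more concrete, here is a minimal, self-contained sketch. It is not the ironic implementation: the class ``ExampleDeploy``, its step names, the step dictionary keys, and the helpers ``order_steps`` and ``run_steps`` are all hypothetical, the node is a plain dictionary, and asynchronous steps (``states.DEPLOYWAIT``) are left out.

```python
def deploy_step(priority, argsinfo=None):
    """Simplified stand-in for the decorator described in this spec."""
    def decorator(func):
        func._is_deploy_step = True
        func._deploy_step_priority = priority
        func._deploy_step_argsinfo = argsinfo
        return func
    return decorator


class ExampleDeploy:
    """Hypothetical interface; the step names below are invented."""

    interface_type = "deploy"

    @deploy_step(priority=100)
    def deploy(self, task):
        return None  # None: the step completed synchronously

    @deploy_step(priority=50)
    def prepare_disk(self, task):
        return None

    @deploy_step(priority=0)
    def disabled_step(self, task):
        raise AssertionError("priority 0 steps must never run")

    def get_deploy_steps(self, task):
        """Collect every decorated method as a step dictionary."""
        steps = []
        for name in dir(self):
            method = getattr(self, name)
            if getattr(method, "_is_deploy_step", False):
                steps.append({
                    "interface": self.interface_type,
                    "step": name,
                    "priority": method._deploy_step_priority,
                })
        return steps

    def execute_deploy_step(self, task, step):
        """Run one step by name and pass its return value through."""
        return getattr(self, step["step"])(task)


def order_steps(steps):
    """Drop priority 0 steps, then sort from highest to lowest priority."""
    enabled = [s for s in steps if s["priority"] > 0]
    return sorted(enabled, key=lambda s: s["priority"], reverse=True)


def run_steps(interface, node, task=None):
    """Run the steps one after another, recording progress on the node."""
    steps = order_steps(interface.get_deploy_steps(task))
    node["driver_internal_info"]["deploy_steps"] = steps
    executed = []
    for index, step in enumerate(steps):
        node["deploy_step"] = step
        node["driver_internal_info"]["deploy_step_index"] = index
        interface.execute_deploy_step(task, step)
        executed.append(step["step"])
    node["deploy_step"] = None  # assumption: cleared once nothing is running
    return executed


node = {"deploy_step": None, "driver_internal_info": {}}
executed = run_steps(ExampleDeploy(), node)
```

Here ``executed`` lists the step names in the order they ran; the priority 0 step is filtered out before execution, so its body is never reached.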
+
+For out-of-tree interfaces that don't have deploy steps, the conductor will
+emit (log) a deprecation warning, that the out-of-tree interface should be
+updated to use deploy steps, and that all nodes that are being deployed
+using the old way, need to be finished deploying, before an upgrade to the
+release where there is no longer any more support for the old way.
+
+Nova driver impact
+------------------
+
+None
+
+Ramdisk impact
+--------------
+
+There should be no impact to the ramdisk (IPA).
+
+In the future, when we allow configuration and specification of deploy steps
+per node, we might provide support for collecting deploy steps from the
+ramdisk, but that is out of scope for this first phase.
+
+Security impact
+---------------
+
+None
+
+Other end user impact
+---------------------
+
+None.
+
+Scalability impact
+------------------
+
+None.
+
+Performance Impact
+------------------
+
+None.
+
+Other deployer impact
+---------------------
+
+None.
+
+Developer impact
+----------------
+
+DeployInterfaces (and any other interfaces involved in the deployment process)
+will need to be written with deploy steps in mind.
+
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  * rloo (Ruby Loo)
+
+Work Items
+----------
+
+Ironic:
+  * Add support for deploy steps to base driver
+  * rework the existing code into one or more deploy steps
+  * Update the conductor to get the deploy steps and execute them
+
+``python-ironicclient``:
+  * Add support for node.deploy_step
+
+Dependencies
+============
+None.
+
+Testing
+=======
+
+* unit tests for all new code and changed behaviour
+* CI jobs already test the deployment process; they should continue to work
+  with these changes
+
+Upgrades and Backwards Compatibility
+====================================
+
+* Old Interfaces will work with the new BaseInterface class because
+  the code will cleanly fall back when an Interface does not support
+  ``get_deploy_steps()``. A deprecation warning will be logged, and we will
+  remove support for the old way according to the OpenStack policy for
+  deprecations & removals.
+
+* Likewise, an Interface implementation with ``get_deploy_steps()`` will work
+  in an older version of Ironic.
+
+* In a cold upgrade:
+
+  * if the agent heartbeats and driver_internal_info['deploy_steps'] is empty,
+    proceed the old way.
+  * if a deployment is started by a conductor using deploy steps (new code),
+    it means all the conductors are using the new code, so the deployment
+    can continue on any conductor that supports the node
+
+* In a rolling upgrade:
+
+  * if the agent heartbeats and driver_internal_info['deploy_steps'] is empty,
+    proceed the old way (similar to cold upgrade)
+  * a new conductor will not use the deploy steps mechanism if it is pinned to
+    the old release (via `pin_release_version` configuration option).
+    if a deployment is started by a conductor using deploy steps (new code),
+    it means that it is unpinned, and all the conductors are using the new
+    code, so the deployment can continue on any conductor that supports the
+    node.
+
+Documentation Impact
+====================
+
+* api-ref: https://developer.openstack.org/api-ref/baremetal/ will be updated
+  to include the new node.deploy_step field
+
+References
+==========
+
+* `cleaning`_
+* `OpenStack Dublin PTG`_ etherpad
+* `RFE to reconfigure nodes on deploy using traits`_
+* `RFE to support abort in deploy_wait`_
+* `state diagram`_
+
+.. _`cleaning`: https://docs.openstack.org/ironic/latest/admin/cleaning.html
+.. _`OpenStack Dublin PTG`: https://etherpad.openstack.org/p/ironic-rocky-ptg-deploy-steps
+.. _`RFE to reconfigure nodes on deploy using traits`: https://bugs.launchpad.net/ironic/+bug/1722275
+.. _`RFE to support abort in deploy_wait`: https://bugs.launchpad.net/ironic/+bug/1498251
+.. _`state diagram`: https://docs.openstack.org/ironic/latest/contributor/states.html
+.. _`ansible`: https://docs.openstack.org/ironic/latest/admin/drivers/ansible.html
diff --git a/specs/not-implemented/deployment-steps-framework.rst b/specs/not-implemented/deployment-steps-framework.rst
new file mode 120000
index 00000000..69fe1d96
--- /dev/null
+++ b/specs/not-implemented/deployment-steps-framework.rst
@@ -0,0 +1 @@
+../approved/deployment-steps-framework.rst
\ No newline at end of file