diff --git a/specs/approved/agent-api.rst b/specs/approved/agent-api.rst new file mode 100644 index 00000000..973da9a1 --- /dev/null +++ b/specs/approved/agent-api.rst @@ -0,0 +1,312 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +========================================= +Promote agent vendor passthru to core API +========================================= + +https://bugs.launchpad.net/ironic/+bug/1570841 + +This spec suggests making the current agent vendor passthru API (lookup and +heartbeat) first class API endpoints and deprecate the agent vendor passthru +interface. + +Problem description +=================== + +The vendor passthru was designed for vendors to place their specific features +before they get wider adoption in Ironic and get promoted to the core API. + +However, when IPA is used (which is the only ramdisk available in-tree), these +two API endpoints play the critical role in both deployment and cleaning +processes. Thus every IPA-based driver must mix agent vendor passthru into +its vendor passthru. + +There was a bug in the drac driver when it was not done, and the driver +did not work with IPA. + +This proposal also tries to reduce the amount of data sent to and from +unauthenticated endpoints. The current vendor passthru API accepts the whole +inventory and returns the whole node record, including IPMI credentials. + +Proposed change +=============== + +* Create new API endpoints for lookup and heartbeat - see `REST API impact`_ + for details. + +* Extend the deploy interface with a heartbeat method - see + `Driver API impact`_ for details. + +Alternatives +------------ + +* Continue doing what we do now. + +* Make the lookup call driver-dependent (as the passthru used to be). + + This looks like unnecessary complication (e.g. we have to pass a driver to + IPA from the conductor nowadays). + +* We could use this change to move the unauthenticated endpoints away from the + main API completely. This could be done by introducing a new API service, + say ``ironic-agent-api``, serving only these two endpoints. Then we will + recommend operators to make this service only listen on the provisioning + network, but not on the network accessible to users. + + Arguably the same result can be achieved by configuring a WSGI container + (Apache mod_wsgi or similar), so it might not be worth complication. + +Data model impact +----------------- + +None + +State Machine Impact +-------------------- + +None + +REST API impact +--------------- + +Two new endpoints are added. Both endpoints are **NOT** authenticated. + +* ``GET /v1/lookup?addresses=MAC1,MAC2&node_uuid=UUID`` + + Look up node details for further use in the ramdisk. + + No body; at least one of the following URL parameters must be present: + + * ``addresses`` comma-separated list of hardware addresses (e.g. MAC) from + the node for lookup; + + * ``node_uuid`` node UUID, if known (e.g. by inspection). + + If ``node_uuid`` is present, ``addresses`` are ignored. + + By default only return a node if it's in one of transient states: + ``deploying``, ``deploy wait``, ``cleaning``, ``clean wait``, + ``inspecting``, ``inspect wait``. Deployers who need lookup to always work + will be able to set a new option ``[api]restrict_lookup`` to ``False``. + + .. note:: + In theory, we don't need ``-ing`` states here either. But when we reboot + during cleaning, we don't currently reset the state to ``clean wait``. + The other ``-ing`` states are supported in case 3rd party drivers have + similar restrictions. + + Response: HTTP 200 with JSON body containing keys: + + * ``config`` dictionary for passing configuration options from conductor + to the ramdisk. For the IPA ramdisk only one is currently used: + + * ``heartbeat_timeout`` timeout (in seconds) between heart beats from + the ramdisk, expected by Ironic. + + * ``node`` partial node representation as a JSON object, with the following + fields sent: + + * ``properties`` for root device hints, + * ``instance_info`` for disk sizing details, + * ``uuid`` node UUID, + * ``driver_internal_info`` for passing other runtime information. + + More fields can be exposed with time with an appropriate API version bump. + + Error codes: + + * 400 - bad request, + + * 404 - a node was not found. + +* ``POST /v1/heartbeat/`` + + Record a heartbeat message from the ramdisk. + + Body is a JSON with fields: + + * ``callback_url`` - the IPA URL to call back. Note that for potential + non-IPA-based drivers it might have a different meaning (e.g. if we agree + on the ansible driver, this can be an SSH "URL" for it). + + Response: HTTP 202 with no body. + + Error codes: + + * 400 - bad request, + + * 404 - a node was not found, + + * 409 - node is locked (should be retried by the ramdisk). + +A new API version will be introduced to cover both endpoints. + +Client (CLI) impact +------------------- + +Both endpoints will be exposed in the Python API for the ironic client as:: + + ironic.node.lookup(addresses, node_uuid=None) + ironic.node.heartbeat(node_uuid, callback_url) + +However, as they are not intended for end users, they will not be exposed in +both CLI. + +"ironic" CLI +~~~~~~~~~~~~ + +None + +"openstack baremetal" CLI +~~~~~~~~~~~~~~~~~~~~~~~~~ + +None + +RPC API impact +-------------- + +A new RPC call is created to connect the heart beat API endpoint and +the new deploy driver method: ``heartbeat`` (async). + +Driver API impact +----------------- + +A new method is added to the deploy driver interface: + +:: + + def heartbeat(self, task, callback_url): + """Record a heart beat for the node. + + :param task: a task manager task + :param callback_url: a URL to use to call to the ramdisk + :return: None + """ + LOG.warning('Got hearbeat message from node %(node)s, but the driver ' + '%(driver)s does not support heartbeating', + {'node': task.node.uuid, 'driver': task.node.driver}) + +The heartbeat method from ``BaseAgentVendor`` will be refactored to a separate +mix-in class for reusing in both ``AgentDeploy`` and ``BaseAgentVendor``. + +The new method will not be abstract to allow drivers that use a different +approach (e.g. which do not have a ramdisk at all). The default implementation +will do nothing to account for deploy drivers which do not need heart beats. + +The new method will receive a shared node lock. It is up to the implementation +to upgrade the lock to exclusive, if required. + +Nova driver impact +------------------ + +None + +Security impact +--------------- + +* This change will expose unauthenticated API to lookup a node by its MAC + addresses. It does not have any impact on most deployments, as both in-tree + deploy methods (iscsi and http) already expose such API. + +* After the complete switch to the new API endpoints is finished, it will no + longer be possible to fetch the whole node knowing its MAC address without + authentication. Only limited fields will be available. Notably, the power + credentials are not sent in the new API endpoints. + +* We should clearly note that any deploy implementation should treat the + incoming data in the new ``heartbeat`` call with care. Particularly, no + sensitive information should be ever sent to the endpoint designated by the + ``callback_url`` parameter. + +Other end user impact +--------------------- + +None + +Scalability impact +------------------ + +None + +Performance Impact +------------------ + +* Unlike the old lookup passthru, the new lookup endpoint will not use RPC, + lowering load on the message queue and the conductor. + +Other deployer impact +--------------------- + +* An update of the IPA image will be recommended to make it use the new API. + +* New option ``restrict_lookup`` in the ``api`` section (boolean, defaults to + ``True``) - whether to restrict the new lookup API to only certain states in + which lookup is expected. + +Developer impact +---------------- + +3rd party driver developers should stop using the ``BaseAgentVendor`` class +in their drivers and just use the ``AgentDeploy`` class. + +3rd party drivers should document whether they require the ``restrict_lookup`` +option to be ``False`` for correct functioning. + +Implementation +============== + +Assignee(s) +----------- + +* Dmitry Tantsur (lp: divius, irc: dtanstur) +* Jim Rollenhagen (irc: jroll) + +Work Items +---------- + +* Create new deploy interface methods + +* Implement them in the AgentDeploy + +* Create new RPC calls and API endpoints + +* Switch IPA to use the new endpoints, and fall back to old ones on failure + +Dependencies +============ + +None + +Testing +======= + +Testing will be conducted as part of the current gate tests. + +Upgrades and Backwards Compatibility +==================================== + +Full backward compatibility will be guaranteed independent of upgrade order +between IPA and ironic itself. + +The ``BaseAgentVendor`` class will be deprecated, but stay for some time, +following the usual deprecation policy. Old IPA images will be able to run by +using the old passthru API. + +The new IPA image will try to hit the new endpoints first, and will fall back +to the old ones on getting HTTP 406 Not Acceptable (meaning, the API version +is not supported). + +Documentation Impact +==================== + +* Document how to implement new deploy drivers with the new heartbeat method. + +* Document the potential security issues with both endpoints. + +References +========== + diff --git a/specs/not-implemented/agent-api.rst b/specs/not-implemented/agent-api.rst new file mode 120000 index 00000000..a8d672cb --- /dev/null +++ b/specs/not-implemented/agent-api.rst @@ -0,0 +1 @@ +../approved/agent-api.rst \ No newline at end of file