No IPA to conductor communication

Allowing a target node to contact the Ironic API as part of the boot process carries the implication that an attacker could then launch an attack on the control plane via the Ironic API. In this spec we provide a means to push the node ID to a target node, and poll the target node for completed build state. Thus all network traffic egresses from conductor to IPA, and never the other way around.

Story: 1526486
Task: 10377

Co-Authored-By: Mario Villaplana <mario.villaplana@gmail.com>
Co-Authored-By: Josh Gachnang <josh@pcsforeducation.com>
Co-Authored-By: Jim Rollenhagen <jim@jimrollenhagen.com>
Change-Id: Ief9fb2ecdf8dd7c071328ca73baad87ff711f22f
2015-08-12 21:11:26 +00:00 · 2015-08-12 21:11:26 +00:00 · 1e04db5b42
parent ca5f9d4e84
commit 1e04db5b42
2 changed files with 353 additions and 0 deletions
--- a/specs/approved/no-ipa-to-conductor.rst
+++ b/specs/approved/no-ipa-to-conductor.rst
@@ -0,0 +1,352 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+=================================
+No IPA to conductor communication
+=================================
+
+https://storyboard.openstack.org/#!/story/1526486
+
+This spec intends to make agent->ironic communication optional, instead
+using polling to make all communication inbound to the agent.
+
+Problem description
+===================
+
+As part of the boot process, IPA must call the ironic API to query its node
+ID and to notify ironic when it has completed the boot process.
+
+This implies that a target node has network access to the ironic API, which
+means that a malicious party could in theory attack the control plane from an
+instance. If the same control plane holds Keystone, Neutron, or other such
+services, the attacker can now DoS or compromise those. This grants them
+significant control over the infrastructure.
+
+A deployer could mitigate this security flaw by using two networks for hosts:
+
+* A provisioning network, which has access to the ironic API, but no ability
+  to communicate with other nodes in the datacenter.
+
+* A tenant network, which can communicate with the outside world, and other
+  hosts, but cannot contact the ironic API.
+
+However, this doesn't scale for medium- to mega-scale deployers who leverage
+layer 3 network topologies. In L3 networks a subnet is constrained to a single
+rack. This means that, to use two networks to image hosts, one would need
+to provision a second subnet for every single rack.
+
+Compounding the issue further, running networks with fundamentally different
+security policies means that each of those policies must be enforced.
+Thus, for each provisioning network that requires access to the ironic
+API, there must be access controls configured and enforced in the firewall.
+
+In the context of hundreds or thousands of racks, this does not scale.
+
+An example of the (potential) security problem:
+
+* Bob boots a host with ironic; this host is publicly routable to the
+  internet. Mallory finds and compromises this host. She then attacks the
+  ironic API from this host. Once she compromises the ironic API, she then
+  starts booting other hosts in the datacenter with a compromised disk image.
+  If Bob uses ironic to manage every host in the datacenter, then Mallory
+  has now effectively owned an entire datacenter.
+
+To remediate this, we need to reduce the attack surface of the control plane
+by removing the need for the data plane to be able to send traffic to the
+API host. To do this, we need to be able to tell the agent that it should not
+call ironic, and make ironic poll the agent instead.
+
+If the deployer runs DHCPD and serves PXE/iPXE from the control plane, then
+there is still logical network access between the target node and the control
+plane. However, this is easily fixed by running those services on dedicated
+intermediary hosts which do not have network access to the rest of the
+control plane.
+
+In this context the network flow is::
+
+  +-----------+     +-----------+     +-------------+
+  |           |     |           +---> |             |
+  | Conductor +---> | iPXE host |     | Target node |
+  |           |     |           | <---+             |
+  +-----------+     +-----------+     +-------------+
+  |                                     ^
+  +-------------------------------------+
+
+
+And a simple example of the provisioning process would be:
+
+1. Conductor receives call to boot node.
+2. Conductor creates boot data, without the ipa-api-url parameter.
+3. Conductor sends OOB call to power target node on.
+4. Target node boots, uses DHCP to get IP.
+5. Target node runs PXE/iPXE, pulls data from iPXE host.
+6. Target boots and runs IPA.
+7. Conductor polls for IPA until it is alive.
+8. Conductor calls IPA's ``get_hardware_info`` command to get information
+   about the node's hardware. This is used to validate the MAC addresses
+   to ensure this is the node we are expecting.
+9. Conductor calls IPA's ``node_info`` command to give it the data it needs
+   to do its job, including the config data returned by the lookup API.
+10. Conductor calls IPA commands on target node to walk it through the
+    provisioning process.
+11. Conductor polls instance at a configurable interval to check on state,
+    gather information, etc.
+12. Target is complete. Conductor reboots target node.
+13. Conductor cleans up boot data.
+14. Complete.
+
+An example of this as a fix to the security problem:
+
+* Bob boots a host with ironic; this host is publicly routable to the
+  internet. Mallory finds and compromises this host. She attempts to attack
+  the ironic API. The connection times out. She gives up and attacks the iPXE
+  host. She succeeds and compromises the iPXE host. She then attempts to
+  attack the ironic API. The connection fails. Mallory, disappointed, gives
+  up and puts her life of crime behind her.
+
+In this case, even though Mallory has compromised the target node, there is no
+intrinsic network access between the target node and the control plane. Thus
+her only route of attack against the provisioning infrastructure would be DoS,
+or to impact hosts which are in the process of booting. But she has no ability
+to attack the queue, conductor, api, db, etc. She cannot gain control over the
+infrastructure, and her attack has been limited.
+
+Proposed change
+===============
+
+We will add two options to the [agent] category:
+
+* ``poll_only``: BoolOpt to enable passive mode. Defaults to False.
+* ``poll_interval``: IntOpt, poll interval in seconds. Defaults to the
+  current ``[api]/ramdisk_heartbeat_timeout`` setting.
+
+And one option to the [api] category:
+
+* ``disable_agent_api``: BoolOpt which disables the agent lookup and heartbeat
+  APIs. Defaults to False.
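+
+As a non-authoritative illustration, a deployment enabling this mode might
+carry the following in ``ironic.conf``. The option names are the ones proposed
+above; the ``poll_interval`` value of 30 is an arbitrary example, not a
+recommended setting::
+
+  [agent]
+  # Do not pass ipa-api-url to the ramdisk; the conductor polls instead.
+  poll_only = True
+  # Seconds between conductor polls of each agent (example value).
+  poll_interval = 30
+
+  [api]
+  # Return 403 from the agent lookup and heartbeat APIs.
+  disable_agent_api = True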
+
+If ``poll_only`` is enabled, we do not pass the ``ipa-api-url`` kernel command
+line parameter to IPA, which will disable the node lookup and heartbeat
+mechanisms.
+
+If ``poll_only`` is enabled, the conductor will use a periodic task to query
+each agent at the interval defined by ``poll_interval``, instead of querying
+the agent after a heartbeat is received. This periodic task will only query
+nodes in states in which IPA would normally be heartbeating (DEPLOY*, RESCUE*,
+CLEAN*).
+
+It is assumed that the deployer will disallow communication between the
+target node and the ironic API. However, if an API call does come through
+when ``disable_agent_api`` is True, then ironic should return a 403.
+
+For this mode, ironic must also stop passing ``ipa-api-url`` as a kernel
+parameter to the agent.
+
+We will also add a ``node_info`` command to IPA, described below, which the
+conductor will use to pass the "lookup data" to a node.
+
+Last, we will add a ``get_hardware_info`` command to IPA, which will return
+hardware info we can use to ensure the node is the node we are expecting.
+
+Notes:
+
+* This spec depends on the assumption that ironic can look up the node IP in
+  Neutron. Deployments without Neutron are not supported with
+  ``poll_only=True``. Support for such deployments may be added in the future.
+
+* ironic-inspector is out of scope for this feature, as it does not use
+  Neutron.
+
+* There may be a use case to set ``poll_only`` per node, rather than globally.
+  However, this is outside the scope of this spec.
+
+Alternatives
+------------
+
+None.
+
+Data model impact
+-----------------
+
+None
+
+State Machine Impact
+--------------------
+
+None
+
+REST API impact
+---------------
+
+The lookup and heartbeat APIs used by agents will now return a 403 when
+``disable_agent_api`` is set to True.
+
+Client (CLI) impact
+-------------------
+
+None
+
+"ironic" CLI
+~~~~~~~~~~~~
+
+None
+
+"openstack baremetal" CLI
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+None
+
+RPC API impact
+--------------
+
+None
+
+Driver API impact
+-----------------
+
+Deploy drivers will need to ensure that any logic that calls into an agent
+can also be triggered by a periodic task, not only by a heartbeat.
+
+Nova driver impact
+------------------
+
+None
+
+Ramdisk impact
+--------------
+
+ironic-python-agent mostly supports this already, as it will run just fine
+without an API URL.
+
+Some steps may also require other data returned by the lookup endpoint. We'll
+add a new synchronous command ``node_info``, which will take this data as a
+single ``node_info`` argument and store it in memory for later use. Ironic will
+call this command when it first notices that IPA is up.
+
+To validate the node is the node we expect, we'll add another synchronous
+command ``get_hardware_info``. This will return the MAC addresses at first,
+but could be evolved later to include things like serial numbers, etc.
+
+Security impact
+---------------
+
+This change will prevent a malicious actor from using IPA as a vector of attack
+against the ironic API.
+
+Note that TLS on the agent API is still important to completely secure the
+interactions between IPA and Ironic; however, this is outside the scope of
+this spec.
+
+Other end user impact
+---------------------
+
+None
+
+Scalability impact
+------------------
+
+Polling target nodes for state from the conductor could have scale issues
+when managing many thousands of nodes. However, polling will be done in a
+thread pool, and so there should be limited impact.
+
+Performance Impact
+------------------
+
+Polling many nodes in parallel will introduce additional CPU load on the
+conductor nodes. Deployers may need to scale out their conductor nodes to
+handle the additional load.
+
+Other deployer impact
+---------------------
+
+Recap of the configuration options added:
+
+``[agent]``
+
+* ``poll_only`` (type=BoolOpt, default=False)
+* ``poll_interval`` (type=IntOpt, default=<[api]/ramdisk_heartbeat_timeout>)
+
+``[api]``
+
+* ``disable_agent_api`` (type=BoolOpt, default=False)
+
+We should document where each of these needs to be set (API vs conductor
+hosts).
+
+Developer impact
+----------------
+
+None
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  jroll
+
+Other contributors:
+  penick
+
+Work Items
+----------
+
+* Enable IPA to skip the lookup process when ironic does not pass the
+  ``ipa-api-url`` kernel parameter.
+
+* Create the ``get_hardware_info`` IPA command.
+
+* Create the ``node_info`` IPA command.
+
+* Add the new options to ironic.
+
+* Enable Ironic to use polling for agent actions/status rather than using
+  the heartbeat as a trigger.
+
+* Make ironic call the ``node_info`` command after IPA boots, when in polling
+  mode.
+
+* Disable heartbeating in the agent in polling mode.
+
+* Test scale and performance impact on periodic tasks.
+
+* Lots of documentation, especially in admin guides. It may also be worth a
+  large blurb in the reference architecture guide.
+
+
+Dependencies
+============
+
+None.
+
+Testing
+=======
+
+We should configure one of the existing tempest jobs to use this feature.
+
+Upgrades and Backwards Compatibility
+====================================
+
+The deployer must update IPA in their images to support passive mode prior to
+upgrading Ironic and enabling the feature. If they do not, all imaging
+attempts will fail.
+
+Documentation Impact
+====================
+
+This feature needs to be documented as a deployment option.
+
+The ironic-inspector docs need to be updated to capture that inspector won't
+work with ``poll_only=True``.
+
+Admin docs should be updated to note that firewall rules need to be
+implemented to actually close network access between the target node and
+the ironic API.
+
+References
+==========
+
+None
--- a/specs/not-implemented/no-ipa-to-conductor.rst
+++ b/specs/not-implemented/no-ipa-to-conductor.rst
@@ -0,0 +1 @@
+../approved/no-ipa-to-conductor.rst