Re-re-propose ironic-multiple-compute-hosts

This re-writes the ironic-multiple-compute-hosts spec to use a hash ring,
rather than messing around with how we schedule.

Change-Id: I51de94e3fbe301aeed35a6456ed0b7350aefa317

@@ -41,71 +41,71 @@ be able to scale to 10^5 nodes.

Proposed change
===============

We'll lift some hash ring code from ironic (to be put into oslo
soon), to be used to do consistent hashing of ironic nodes among
multiple nova-compute services. The hash ring is used within the
driver itself, and is refreshed at each resource tracker run.
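
Conceptually, the ring maps each ironic node UUID onto exactly one
compute service hostname. A minimal sketch of the idea (not the actual
code being lifted from ironic) might look like::

  import bisect
  import hashlib

  class HashRing(object):
      """Toy consistent hash ring: hosts own ranges of the hash space."""

      def __init__(self, hosts, replicas=32):
          self._ring = {}
          for host in hosts:
              for i in range(replicas):
                  key = self._hash('%s-%d' % (host, i))
                  self._ring[key] = host
          self._sorted_keys = sorted(self._ring)

      @staticmethod
      def _hash(key):
          return int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16)

      def get_host(self, node_uuid):
          """Return the compute host responsible for an ironic node."""
          position = bisect.bisect(self._sorted_keys, self._hash(node_uuid))
          # Past the last key we wrap around to the start of the ring.
          position = position % len(self._sorted_keys)
          return self._ring[self._sorted_keys[position]]

Because only keys near a host's ring positions move when membership
changes, adding or removing a compute service re-maps a small fraction
of the nodes rather than reshuffling everything.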

get_available_nodes() will now return a subset of nodes,
determined by the following rules (a sketch of this filtering follows
the list):

* any node with an instance managed by the compute service
* any node that is mapped to the compute service on the hash ring
* no nodes with instances managed by another compute service
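
As an illustration only (the function, its arguments, and the node
attributes here are hypothetical, not the final driver code), the
filtering might be applied roughly like this::

  def select_available_nodes(nodes, my_host, hash_ring):
      """Apply the three rules above (illustrative sketch).

      ``nodes`` is an iterable of objects with ``uuid`` and
      ``instance_host`` attributes, where ``instance_host`` is the
      compute service managing the node's instance, or None if the
      node has no instance.
      """
      selected = []
      for node in nodes:
          if node.instance_host == my_host:
              # Rule 1: we manage the instance, so we keep the node.
              selected.append(node.uuid)
          elif node.instance_host is not None:
              # Rule 3: another compute service manages this instance.
              continue
          elif hash_ring.get_host(node.uuid) == my_host:
              # Rule 2: the ring maps this unoccupied node to us.
              selected.append(node.uuid)
      return selected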

The virt driver finds all compute services that are running the
ironic driver by joining the services table and the compute_nodes
table. Since there won't be any records in the compute_nodes table
for a service that is starting for the first time, the virt driver
also adds its own compute service into this list. The hostnames in
this list are used to instantiate the hash ring.
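
A rough sketch of that membership calculation, with the query results
passed in as plain lists (the real driver would go through nova's DB
layer; nothing below is the actual implementation)::

  def ring_hosts(service_rows, compute_node_rows, my_host):
      """Hostnames the hash ring should be built from.

      service_rows:      (host, binary) tuples for nova services
      compute_node_rows: (host, hypervisor_type) tuples from compute_nodes
      my_host:           the host running this driver instance
      """
      ironic_hosts = {host for host, hv_type in compute_node_rows
                      if hv_type == 'ironic'}
      hosts = {host for host, binary in service_rows
               if binary == 'nova-compute' and host in ironic_hosts}
      # A service starting up for the first time has no compute_nodes
      # records yet, so it always includes itself.
      hosts.add(my_host)
      return sorted(hosts)

The driver would then rebuild the ring on each resource tracker run,
e.g. ``self.hash_ring = HashRing(ring_hosts(...))``.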

As nova-compute services are brought up or down, the ring will
re-balance. It's important to note that this re-balance does not
occur at the same time on all compute services, so for some amount
of time, an ironic node may be managed by more than one compute
service. In other words, there may be two compute_nodes records
for a single ironic node, each with a different host value. For
scheduling purposes, this is okay, because either compute service
is capable of actually spawning an instance on the node (the
ironic service doesn't know about this hashing). This will cause
capacity reporting (e.g. nova hypervisor-stats) to over-report
capacity during this window. Once all compute services in the cluster
have done a resource tracker run and re-balanced the hash ring,
this will be back to normal.
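
Continuing the sketch above, the transient dual ownership can be seen
by comparing rings before and after a membership change (hostnames and
the UUID are made up for illustration)::

  old_ring = HashRing(['compute-1', 'compute-2', 'compute-3'])
  new_ring = HashRing(['compute-1', 'compute-2'])

  node = '6f7d4c1a-0000-4000-8000-000000000001'
  # Until every service has refreshed its ring, a service still holding
  # the old mapping and the node's new owner may both report the node.
  print(old_ring.get_host(node), new_ring.get_host(node))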

It's also important to note that, due to the way nodes with instances
are handled, if an instance is deleted while the compute service is
down, that node will be removed from the compute_nodes table when
the service comes back up (as each service will see an instance on
the node object, and assume another compute service manages that
instance). The ironic node will remain active and orphaned. Once
the periodic task to reap deleted instances runs, the ironic node
will be torn down and the node will again be reported in the
compute_nodes table.

It's all very eventually consistent, with a potentially long time
to eventual.

There's no configuration to enable this mode; it's always running. For
deployments that continue to use only one compute service, this has the
same behavior as today.
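
With a single compute service the ring degenerates to the current
behavior, since every node maps to the only member::

  ring = HashRing(['compute-1'])
  assert ring.get_host('any-node-uuid') == 'compute-1'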

Alternatives
------------

Do what we do today, with active/passive failover. Doing active/passive
failover well is not an easy task, and doesn't account for all possible
failures. This also does not follow Nova's prescribed model for compute
failure. Furthermore, the resource tracker initialization is slow with many
Ironic nodes, and so a cold failover could take minutes.

Resource providers[1] may be another viable alternative, but we shouldn't
have a hard dependency on that.

Another alternative is to make nova's scheduler only choose a compute service
running the ironic driver (essentially at random) and let the scheduling to
a given node be determined between the virt driver and ironic. The downsides
here are that operators no longer have a pluggable scheduler (unless we build
one in ironic), and we'll have to do lots of work to ensure there aren't
scheduling races between the compute services.

Data model impact
-----------------

@@ -142,13 +142,7 @@ smaller and improve the performance of the resource tracker loop.

Other deployer impact
---------------------

None.

Developer impact
----------------

@@ -166,43 +160,42 @@ Primary assignee:

  jim-rollenhagen (jroll)

Other contributors:
  devananda
  dansmith
  jaypipes

Work Items
----------

* Import the hash ring code into Nova.
* Use the hash ring in the virt driver to shard nodes among compute daemons.

Dependencies
============

None.

Testing
=======

This code will run in the default devstack configuration.

We also plan to add a CI job that runs the ironic driver with multiple
compute hosts, but this likely won't happen until Ocata.

Documentation Impact
====================

Maybe an ops guide update; however, I'd like to leave that for next cycle
until we're pretty sure this is stable.

References
==========

[0] https://review.openstack.org/#/c/204641/

[1] https://review.openstack.org/#/c/225546/

History
=======

@@ -216,3 +209,4 @@ History

  - Introduced but no changes merged.
* - Newton
  - Re-proposed.
    Completely re-written to use a hash ring.