Merge "Add a shard key"

2022-12-14 16:22:33 +00:00 · 2022-12-14 16:22:33 +00:00 · 86277449b6
parent 08a53c95b9 7051899ef0
commit 86277449b6
1 changed files with 516 additions and 0 deletions
--- a/specs/approved/shard-key.rst
+++ b/specs/approved/shard-key.rst
@ -0,0 +1,516 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+======================
+Shard Key Introduction
+======================
+
+https://storyboard.openstack.org/#!/story/2010378
+
+After much discussion and attempts to remedy the scalability issues with
+``nova-compute`` and its connection to Ironic in large scale deployments,
+and upon newly discovered indicators of ``networking-baremetal`` having a
+similar scaling issue, the community has started to reach an agreement on
+a path forward. Specifically, to introduce a sharding model which would
+allow API consumers to map and lock on to specific sets of baremetal nodes,
+regardless of if the relationship is semi-permanant or entirely situational.
+Only the consumer of the information performing processing can make that
+determination, and it is up to Ironic to try and provide the substrate
+capabilities to efficiently operate against its API.
+
+Problem description
+===================
+
+The reality is Ironic can be used at some absurd scales in the hundreds of
+of thousands of baremetal nodes, and while *most* operators of Ironic either
+run multiple smaller distinct Ironic deployments with less than 500 physical
+machines, some need a single deployment with thousands or tens of thousands
+of physical nodes. At increased scales, external operations polling ironic,
+generally struggle to scale at these levels. It is also easy for
+misconfigurations to be made where performance can become degraded,
+which is because the scaling model and limits are difficult to understand.
+
+This is observable with the operation of Nova's Compute process when running
+the ``nova.virt.ironic`` driver. It is operationally easy to get into
+situations where one is attempting to support thousands of baremetal nodes,
+with too few ``nova-compute`` processes. This specific situation leads to
+the process attempting to take on more work than it was designed to handle.
+
+Recently we discovered a case, while rooted in misconfiguration, where the
+same basic scaling issue exists with ``networking-baremetal`` where it is
+responsible for polling and updating physical network mappings in Neutron.
+The same basic case, a huge amount of work, and multiple processes.
+In this specific case, multiple (3) Neutron services were stressing the Ironic
+API retrieving all of the nodes, and attempting to update all of the related
+physical network mapping records in neutron, resulting in the same record
+being updated 3 times, once from each service.
+
+The root issue is the software consuming Ironic's data needs to be able to
+self-delineate the overall node set and determine the local separation points
+for sharding the nodes. The delineation is required because the processes
+executed are far more processor intensive, which can introduce latency and
+lag which can lead towards race conditions.
+
+The challenge, from what has been done previously, is the previous model
+required downloading the entire data set to build a hash ring from.
+
+Where things are also complicated, is Ironic has an operational model of
+a ``conductor_group``, which is intended to help model a physical grouping
+or operational constraint. The challenge here is that conductor groups are
+not automatic in any way, shape, or form. As a result, conductor groups
+is not the solution we need here.
+
+Proposed change
+===============
+
+Overall the idea, is to introduce a ``shard`` field on the node object,
+which an API user (Service), can utilize to retrieve a subset of nodes.
+
+This new field on the node object would be inline with existing API
+field behavior constraints and can be set via the API.
+
+We *can* provide a means to pre-set the shard, but ultimately it is
+still optional for Ironic, and the shard *exists* for the API
+consumer's benefit.
+
+In order to facilitate the usage by an API client, ``/v1/nodes`` and
+``/v1/ports`` would be updated to accept a ``shard`` parameter
+(i.e. GET /v1/nodes?shard=foo, GET /v1/ports?shard=foo,
+GET /v1/portgroups?shard=foo) in the query to allow for API consumers
+to automatically scope limit their data set and self determine how to
+reduce the workset. For example, ``networking-baremetal`` may not care
+about assignment, it just needs to reduce the localized workset.
+Whereas, ``nova-compute`` needs the shard field to remain static, that is
+unless ``nova-compute`` or some other API consumer were to request the
+``shard`` to be updated on a node.
+
+.. NOTE::
+   The overall process consumers use today is to retreive everything and
+   then limit the scope of work based upon contents of the result set.
+   This results in a large overhead of work and increased looping latency
+   which also can encourage race conditions. Both ``nova-compute``
+   and the ``networking-baremetal`` ML2 plugin operate in this way with
+   different patterns of use. The advantage of the the proposed solution
+   is to enabel the scope limiting/grouping into managable chunks.
+
+In terms of access controls, we would also add a new RBAC policy to
+restrict changes such that the system itself or a appropriately scoped
+(i.e. administrative) user can change the field.
+
+In this model, conductors do not care about the shard key. It is only
+a data storage field on the node. Lookups for contents of the overall
+shard composition/layout, for GET /v1/shards, is to be performed
+directly against the nodes table using a SQL query.
+
+Alternatives
+------------
+
+This is a complex solution to allow simplified yet delineated usage,
+and there are numerous other options for specific details.
+
+Ultimately, each item should be discussed, and considered.
+
+One key aspect, which has been recognized thus far, is that existing
+mechanisms can be inefficiently leveraged to achieve this. An example
+of this is that ``conductor_group``, ``owner``, ``lessee`` all allow for
+filtering of the node result set. A ``conductor_group`` being an explicit
+aspect an API client can request, where as ``owner`` and ``lessee`` are
+access control based filters tied to the API client's submitted Project
+ID used for Authentication. More information on why ``conductor_group``
+is problematic is further on in this document.
+
+Consensus in discussion with the Nova teams seems to be that usage of
+the other fields, while in part may be useful, and possibly even preferred
+in some limited and specific cases, doesn't solve the general need
+to be able to allow clients to self delineate *without* first downloading
+the *entire* node list first. Which in itself, the act of retrieving
+a complete list of nodes is a known scaling challenge, and creates increased
+processing latency.
+
+In the ``conductor_group`` case, there is no current way to discover
+the conductor groups. Where as for ``owner`` and ``lessee``, these are
+specific project ID value fields.
+
+Why not Conductor Group?
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is important to stress similiarity wise, this *is* similar to conductor
+groups, however conductor groups were primarily purposed to model the physical
+constraints and structure of the baremetal infrastructure.
+
+For example, if you have a set of conductors in Europe, and a set of
+conductors in New York, you don't want to try and run a deploy for servers
+in New York, from Europe. Part of the attractiveness for this to be exposed
+or used in Nova, was *also* to align the physical structure. The immediately
+recognized bonus to operators was the list of nodes was limited to the running
+``nova-compute`` process, if so configured. It is known to the Ironic community
+that some infrastructure operators *have* utilized this setting and field to
+facilitate scaling of their ``nova-compute`` infrastructure, however these
+operators have also encountered issues with this use pattern as well that
+we hope to avoid with a shard key implementation.
+
+Where the needs are different with this effort and the pre-existing
+conductor groups, is that conductor groups are part of the hash ring modeling
+behavior where as in the shards model conductors will operate without
+consideration of the shard key value. We need disjointed modeling to support
+API consumer centric usage so they can operate in logical units with distinct
+selections of work.
+Consumers *may* also care about the ``conductor_group`` in addition to the
+shard because needing to geographically delineate is separate from needing
+smaller "chunks" of work, or in this case "groups of baremetal nodes" for
+which a running process is responsible for.
+
+In this specific case, ``conductor_group`` is entirely a manually managed
+aspect, which nova has a separate setting name due to name perception reasons,
+and our hope ultimately is something that is both simple and smart.
+
+.. NOTE::
+   The Nova project has agreed during Project Teams Gathering meetings to
+   deprecate the ``peer_list`` parameter they forced use of previously to
+   support conductor groups with the hash ring logic.
+
+On top of this, Today's ``conductor_group`` functionality is reliant upon
+the hash ring model of use, which is something the Nova team wants to see
+removed from the Nova codebase in the next several development cycles.
+Where as, Ironic will continue to use the hash ring functionality
+for managing our conductor's operating state as it is also modeled for
+conductors to manage thousands of nodes. These thousands of nodes just
+does not scale well into ``nova-compute`` services.
+
+Why not owner or lessee?
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+With the RBAC model improvements which have taken place over the last few
+years, it *is* entirely possible to manage separate projects and credentials
+for a ``nova-compute`` to exist and operate within. The challenge here is
+management of additional credentials and the mappings/interactions.
+
+It might be "feasible" to do the same for scaling ``networking-baremetal``
+interactions with Ironic's API, but the overhead and self management of
+node groupings seems onerous and error prone.
+
+Also, if this was a path taken, it would also be administratively prohibitive
+for nova-computes nodes, and they would be locked to the manual settings.
+
+What if we just let the API consumer figure it out?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This could be an option, but it would lead to worse performance and the
+user experience being worse.
+
+The base conundrum is to orderly and efficiently enumerate through, and then
+acting upon each and every node API client is responsible for interacting
+with.
+
+Today, Nova's Compute service enumerates through every node, using a list
+generated upon one query, and it gets *most* of the data it needs to
+track/interact with a node, keeping the more costly single node requests to a
+minimum. If that client had to track things, it would still have to pull
+a full list, and then it would have to reconcile, track, and map individual
+nodes. We've already seen this as not working using a Hashring today.
+
+Similarly, ``networking-baremetal`` lists all ports. That is all it needs,
+but it has no concept of smaller chunking, blocks, or even enough information
+*to* really make a hashring which would represent existing models. To just
+expect the client to "figure it out" and to "deal with that complexity",
+also means logic far away from a database. And for performance, the closer
+we can keep logic and decisions to an indexed column, the better and more
+performant, which is why the proposed solution has come forth.
+
+Data model impact
+-----------------
+
+Node: Addition of a ``shard`` column/value string field, indexed,
+      with a default value of None. This field is considered to be
+      case sensitive, which is inline with the DB storage type.
+      API queries would seek exact field value matches.
+
+.. NOTE:: We will need to confer with the Nova team and the nova.virt.ironic
+          driver query pattern, to ensure we cover any compound indexes,
+          if needed.
+
+To facilitate this, database migrations, and data model sanity checking
+will need to be added to ``ironic-status`` as part of the upgrade checks.
+
+State Machine Impact
+--------------------
+
+None
+
+REST API impact
+---------------
+
+PATCH /v1/nodes/<node>
+
+In order to set a shard value, a user will need to patch the field.
+This is canned functionality of the existing nodes controller, and will
+be API version and RBAC policy guarded in order to prevent inappropriate
+changes to the field once set. Like all other fields, this operation
+takes the shape of a JSON Patch.
+
+GET /v1/nodes?shard=VALUE,VALUE2,VALUE3
+
+Returns a subset of nodes limited by shard key. In this specific case
+we will also allow a string value of "none", "None" or "null" to
+be utilized to retrieve a list of nodes which do *not* have a shard
+key set. Logic to handle that would be in the DB API layer.
+
+GET /v1/ports?shard=VALUE,VALUE2,VALUEZ
+GET /v1/portgroupss?shard=VALUE,VALUE2,VALUEZ
+
+Returns a subset of ports, limited by the shard key, or list of keys
+provided by the caller. Specifically would utilize a joined query
+to the database to facilitate it.
+
+GET /v1/shards
+
+Returns a JSON representing the shard keys and counts of nodes
+utilizing the shard.
+
+    {{"Name": "Shard-10", "Count": 352},
+    {"Name": "Shard-11", "Count": 351},
+    {"Name": "Shard-12", "Count": 35},
+    {"Name": null, "Count": 921}}
+
+Visibility wise, the new capabilities will be restricted by API
+micro-version. Access wise this field would be restricted in use to
+``system-reader``, ``project-admin``, and future ``service`` roles
+by default. A specific RBAC policy would be added for access to
+this endpoint.
+
+.. NOTE::
+   The /v1/shards endpoint will be read only.
+
+Client (CLI) impact
+-------------------
+Typically, but not always, if there are any REST API changes, there are
+corresponding changes to python-ironicclient. If so, what does the user
+interface look like. If not, describe why there are REST API changes but
+no changes to the client.
+
+"openstack baremetal" CLI
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A ``baremetal shard list`` command would be added.
+
+A ``baremetal node list --shard <shard>`` capability would be
+added to list all nodes in a shard.
+
+A ``--shard`` node level parameter for ``baremetal node set``
+would also be added.
+
+A ``baremetal port list --shard <shard>`` capability would be
+added to limit the related ports to nodes in a shard. Similarly,
+the ``baremetal portgroup list --shard <shard>`` would be updated
+as well.
+
+"openstacksdk"
+~~~~~~~~~~~~~~
+
+A SDK method would be added to get a shard list, and existing list methods
+would be checked to ensure we can query by shard.
+
+RPC API impact
+--------------
+
+None anticipated at this time.
+
+Driver API impact
+-----------------
+
+None
+
+Nova driver impact
+------------------
+
+A separate specification document is being proposed for the Nova
+project to help identify *and* navigate the overall change.
+
+That being said, no direct negative impact is anticipated.
+
+The overall discussion revolving with Nova is to both facilitate a
+minimal impact migration, and not force invasive and breaking changes,
+which may not be realistically needed by the operators.
+
+.. NOTE:: An overall migration path is envisioned, but what is
+          noted here is only a suggestion and how we perceive the
+          overall process.
+
+Anticipated initial Nova migration steps:
+
+Ironic itself will not be providing an explicit process for setting the
+shard value on each node, aside from ``baremetal node set``. Below is what
+*we, Ironic* anticipate as the migration steps overall to move towards this
+model.
+
+1) Complete the Ironic migration. Upon completion, executing the database
+   status check (i.e. ``ironic-status upgrade check``) should detect and warn
+   *if* a ``shard`` key is present on nodes in the database, but nodes
+   exist without a ``shard`` value are present in the database.
+2) The nova-compute service being upgraded is shut down.
+3) A nova-manage command would be executed to reassign nodes to a user
+   supplied ``shard`` value to match.
+   Example: nova-manage ironic-reassign <shard-key> <compute-hostname>
+
+   Programattically, this would retrieve a list of nodes matching the key from
+   Ironic, and then change the associated ComputeNode and Instance tables host
+   fields to be the supplied compute hostname, to match an existing nova
+   compute service.
+
+   .. NOTE:: The command likely needs to match/validate that this is/was a
+             compute hostname.
+
+   .. TODO:: As a final step before the nova-manage command exits, ideally it
+             would double check the state of records in those tables to
+             indicate if there are other nodes the named Compute hostname is
+             responsible for. The last compute hostname in the environment
+             should not generate any warning, any warning would be indicitive
+             of a lost ComputeNode, Instance, or Baremetal node record.
+
+4) The nova-compute.conf file for the upgraded ``nova-compute`` service is
+   restarted with a ``my_shard`` (or other appropriate parameter) which
+   signals to the ``nova.virt.ironic`` driver code to not utilize the hash
+   ring, and to utilize the blend of what it thinks it is responsible for
+   from the database *and* what matches the Ironic baremetal node inventory
+   when queried for matching the configured shard key value.
+5) As additional compute nodes are migrated to using the new shard key setup,
+   existing compute node imbalance should be settled in terms of the
+   internal compute-node logic to retrieve what each node it thinks it is
+   responsible for, and would eventually match the shard key.
+
+This would facilitate an ability to perform a rolling, yet isolated outage
+impact as the new nova-compute configuraiton is coming online, and also allows
+for a flow which should be able to be automated for larger operators.
+
+The manageability, say if one needs to change a ``shard`` or rebalance
+shards, is not yet clear. The current discussion in the Nova project is that
+rebalance/reassociation will only be permitted *IF* the compute service
+has been "forced down" which is an irreversable action
+
+Ramdisk impact
+--------------
+
+None
+
+Security impact
+---------------
+
+The ``shard`` key would be API user settable, as long as sufficient
+API access exists in the RBAC model.
+
+The ``/v/shards`` endpoint would also be restricted based upon the RBAC
+model.
+
+No other security impacts are anticipated.
+
+Other end user impact
+---------------------
+
+None Anticipated
+
+Scalability impact
+------------------
+
+This model is anticipated to allow users of data stored in Ironic to be more
+scalable. No impacts to Ironic's scalability are generally anticipated.
+
+Performance Impact
+------------------
+
+No realistic impact is anticipated. While another field is being added,
+initial prototyping benchmarks have yielded highly performant response
+times for large sets (10,000) baremetal nodes.
+
+Other deployer impact
+---------------------
+
+It *is* recognized that operators *may* wish to auto-assign or auto-shard
+the node set programatically. The agreed upon limitation amongst Ironic
+contributors is that we (Ironic) would not automatically create *new*
+shards in the future. Creation of new shards would be driven by the operator
+by setting a new shard key on any given node.
+
+This may require a new configuration option to control this logic, but
+the logic overall is not viewed as a blocking aspect to the more critical
+need of being able to "assign" a node to a shard. This logic may be added
+later on, we will just endeveour to have updated documentation to explain
+the appropriate usage and options.
+
+Developer impact
+----------------
+
+None anticipated
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  Jay Faulkner (JayF)
+
+Other contributors:
+  Julia Kreger (TheJulia)
+
+Work Items
+----------
+
+* Propose nova spec for the use of the keys
+  (https://review.opendev.org/c/openstack/nova-specs/+/862833)
+* Create database schema/upgrades/models.
+* Update Object layer for the ``Node`` and ``Port`` objects in order to
+  permit both objects to be queried by ``shard``.
+* Add query by shard capability to the Nodes and Ports database tables.
+* Expose ``shard`` on the node API, with an incremented microversion
+  *and* implement a new RBAC policy which restricts the ability to change
+  the ``shard`` value
+* Add pre-upgrade status check to warn if there are fields which are
+  not consistently populated. i.e. ``shard`` is not populated on
+  all nodes. This will provide visibility into the mixed and possibly
+  misconfigured operational state for future upgrader.
+* Update OpenStack SDK and python-ironicclient
+
+Dependencies
+============
+
+This specification is loosely dependent upon Nova accepting
+a plan for use of the sharding model of data. At present, it is the
+Ironic team's understanding that it is acceptable to Nova, and Ironic
+needs to merge this spec and related code to support this feature before
+Nova will permit the Nova spec to be merged.
+
+Testing
+=======
+
+Unit testing is expected for all the basic components and operations
+added ot Ironic to support this funcitonality.
+
+We may be able to add some tempest testing for the API field and access
+interactions.
+
+Upgrades and Backwards Compatibility
+====================================
+
+To be determined. We anticipate that the standard upgrade process would apply
+and that there would not realistically be an explicit downgrade compatability
+process, but this capability and functionality is largely for external
+consumption, and details there are yet to be determined.
+
+Documentation Impact
+====================
+
+Admin documentation would need to include an document covering sharding,
+internal mechanics, and usage.
+
+References
+==========
+
+PTG Notes: https://etherpad.opendev.org/p/nova-antelope-ptg
+Bug: https://launchpad.net/bugs/1730834
+Bug: https://launchpad.net/bugs/1825876
+Related Bug: https://launchpad.net/bugs/1853009
+