..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

======================
Shard Key Introduction
======================

https://storyboard.openstack.org/#!/story/2010378

After much discussion and several attempts to remedy the scalability issues
with ``nova-compute`` and its connection to Ironic in large scale
deployments, and upon newly discovered indicators of
``networking-baremetal`` having a similar scaling issue, the community has
started to reach an agreement on a path forward.

Specifically, the plan is to introduce a sharding model which would allow
API consumers to map and lock on to specific sets of baremetal nodes,
regardless of whether the relationship is semi-permanent or entirely
situational. Only the consumer performing the processing can make that
determination, and it is up to Ironic to provide the substrate capabilities
needed to operate efficiently against its API.

Problem description
===================

The reality is that Ironic can be used at some absurd scales, in the
hundreds of thousands of baremetal nodes. While *most* operators of Ironic
run multiple smaller, distinct Ironic deployments with fewer than 500
physical machines each, some need a single deployment with thousands or tens
of thousands of physical nodes. At these scales, external services polling
Ironic generally struggle to keep up.

It is also easy to misconfigure a deployment in ways that degrade
performance, because the scaling model and limits are difficult to
understand.

This is observable in the operation of Nova's Compute process when running
the ``nova.virt.ironic`` driver. It is operationally easy to get into
situations where one is attempting to support thousands of baremetal nodes
with too few ``nova-compute`` processes. This specific situation leads to
the process attempting to take on more work than it was designed to handle.

Recently we discovered a case, albeit one rooted in misconfiguration, where
the same basic scaling issue exists with ``networking-baremetal``, which is
responsible for polling and updating physical network mappings in Neutron.
It is the same basic pattern: a huge amount of work spread across multiple
processes. In this specific case, multiple (3) Neutron services were
stressing the Ironic API by retrieving all of the nodes and attempting to
update all of the related physical network mapping records in Neutron,
resulting in the same record being updated 3 times, once from each service.

The root issue is that the software consuming Ironic's data needs to be able
to self-delineate the overall node set and determine the local separation
points for sharding the nodes. The delineation is required because the
processing these consumers perform is processor intensive, which introduces
latency and lag that can lead to race conditions.

The challenge with what has been done previously is that the prior model
required downloading the entire data set in order to build a hash ring.

A further complication is that Ironic has an operational model of a
``conductor_group``, which is intended to help model a physical grouping or
operational constraint. The challenge here is that conductor groups are not
automatic in any way, shape, or form. As a result, conductor groups are not
the solution we need here.

Proposed change
===============

The overall idea is to introduce a ``shard`` field on the node object, which
an API user (service) can utilize to retrieve a subset of nodes.

This new field on the node object would be in line with existing API field
behavior constraints and can be set via the API.

We *can* provide a means to pre-set the shard, but ultimately it is still
optional for Ironic, and the shard *exists* for the API consumer's benefit.

In order to facilitate usage by an API client, ``/v1/nodes``, ``/v1/ports``,
and ``/v1/portgroups`` would be updated to accept a ``shard`` parameter
(i.e.
``GET /v1/nodes?shard=foo``, ``GET /v1/ports?shard=foo``,
``GET /v1/portgroups?shard=foo``) in the query, allowing API consumers to
automatically limit the scope of their data set and determine for themselves
how to reduce the work set. For example, ``networking-baremetal`` may not
care about assignment; it just needs to reduce the localized work set.
``nova-compute``, by contrast, needs the shard field to remain static,
unless ``nova-compute`` or some other API consumer were to request that the
``shard`` be updated on a node.

.. NOTE::
   The overall process consumers use today is to retrieve everything and
   then limit the scope of work based upon the contents of the result set.
   This results in a large overhead of work and increased looping latency,
   which can also encourage race conditions. Both ``nova-compute`` and the
   ``networking-baremetal`` ML2 plugin operate in this way, with different
   patterns of use. The advantage of the proposed solution is that it
   enables limiting the scope and grouping nodes into manageable chunks.

In terms of access controls, we would also add a new RBAC policy to restrict
changes such that only the system itself or an appropriately scoped (i.e.
administrative) user can change the field.

In this model, conductors do not care about the shard key. It is only a data
storage field on the node. Lookups of the overall shard composition/layout,
for ``GET /v1/shards``, are to be performed directly against the nodes table
using a SQL query.

Alternatives
------------

This is a complex solution to allow simplified yet delineated usage, and
there are numerous other options for specific details. Ultimately, each item
should be discussed and considered.

One key aspect, which has been recognized thus far, is that existing
mechanisms can be leveraged, albeit inefficiently, to achieve this. For
example, ``conductor_group``, ``owner``, and ``lessee`` all allow for
filtering of the node result set.
A ``conductor_group`` is an explicit aspect an API client can request,
whereas ``owner`` and ``lessee`` are access control based filters tied to
the API client's submitted Project ID used for authentication. More
information on why ``conductor_group`` is problematic appears further on in
this document.

Consensus in discussion with the Nova teams seems to be that usage of the
other fields, while potentially useful and possibly even preferred in some
limited and specific cases, doesn't solve the general need to allow clients
to self-delineate *without* first downloading the *entire* node list. The
act of retrieving a complete list of nodes is in itself a known scaling
challenge, and creates increased processing latency.

In the ``conductor_group`` case, there is currently no way to discover the
conductor groups, whereas ``owner`` and ``lessee`` are specific project ID
value fields.

Why not Conductor Group?
~~~~~~~~~~~~~~~~~~~~~~~~

It is important to stress that this *is* similar to conductor groups.
However, conductor groups were primarily intended to model the physical
constraints and structure of the baremetal infrastructure.

For example, if you have a set of conductors in Europe, and a set of
conductors in New York, you don't want to try and run a deploy for servers
in New York from Europe. Part of the attractiveness of exposing or using
this in Nova was *also* to align with the physical structure. The
immediately recognized bonus to operators was that the list of nodes was
limited to the running ``nova-compute`` process, if so configured. It is
known to the Ironic community that some infrastructure operators *have*
utilized this setting and field to facilitate scaling of their
``nova-compute`` infrastructure. However, these operators have also
encountered issues with this use pattern, which we hope to avoid with a
shard key implementation.

Where the needs of this effort differ from the pre-existing conductor groups
is that conductor groups are part of the hash ring modeling behavior,
whereas in the shards model conductors will operate without consideration of
the shard key value. We need disjointed modeling to support API consumer
centric usage, so consumers can operate in logical units with distinct
selections of work. Consumers *may* also care about the ``conductor_group``
in addition to the shard, because needing to geographically delineate is
separate from needing smaller "chunks" of work, or in this case "groups of
baremetal nodes" for which a running process is responsible.

In this specific case, ``conductor_group`` is an entirely manually managed
aspect, for which Nova uses a separately named setting due to name
perception reasons, and our hope ultimately is for something that is both
simple and smart.

.. NOTE::
   The Nova project has agreed during Project Teams Gathering meetings to
   deprecate the ``peer_list`` parameter they forced use of previously to
   support conductor groups with the hash ring logic.

On top of this, today's ``conductor_group`` functionality is reliant upon
the hash ring model of use, which is something the Nova team wants to see
removed from the Nova codebase in the next several development cycles.
Ironic, by contrast, will continue to use the hash ring functionality for
managing our conductors' operating state, as it is also modeled for
conductors to manage thousands of nodes. Those thousands of nodes just do
not scale well into ``nova-compute`` services.

Why not owner or lessee?
~~~~~~~~~~~~~~~~~~~~~~~~

With the RBAC model improvements which have taken place over the last few
years, it *is* entirely possible to manage separate projects and credentials
for a ``nova-compute`` to exist and operate within. The challenge here is
management of additional credentials and the mappings/interactions.

It might be "feasible" to do the same for scaling ``networking-baremetal``
interactions with Ironic's API, but the overhead and self-management of node
groupings seems onerous and error prone.

Also, if this path were taken, it would be administratively prohibitive for
``nova-compute`` nodes, and they would be locked to the manual settings.

What if we just let the API consumer figure it out?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This could be an option, but it would lead to worse performance and a worse
user experience.

The base conundrum is how to enumerate through, in an orderly and efficient
manner, and then act upon each and every node an API client is responsible
for interacting with.

Today, Nova's Compute service enumerates through every node, using a list
generated by a single query, and it gets *most* of the data it needs to
track/interact with a node, keeping the more costly single node requests to
a minimum. If that client had to track things itself, it would still have to
pull a full list, and then it would have to reconcile, track, and map
individual nodes. We have already seen that this does not work well with a
hash ring today. Similarly, ``networking-baremetal`` lists all ports. That
is all it needs, but it has no concept of smaller chunking, blocks, or even
enough information *to* really make a hash ring which would represent
existing models.

To just expect the client to "figure it out" and to "deal with that
complexity" also means placing logic far away from the database. For
performance, the closer we can keep logic and decisions to an indexed
column, the better, which is why the proposed solution has come forth.

Data model impact
-----------------

Node: Addition of a ``shard`` column/value string field, indexed, with a
default value of None. This field is considered to be case sensitive, which
is in line with the DB storage type. API queries would seek exact field
value matches.

..
NOTE:: We will need to confer with the Nova team about the nova.virt.ironic driver query pattern, to ensure we cover any compound indexes, if needed.

To facilitate this, database migrations and data model sanity checking will
need to be added to ``ironic-status`` as part of the upgrade checks.

State Machine Impact
--------------------

None

REST API impact
---------------

PATCH /v1/nodes/

In order to set a shard value, a user will need to patch the field. This is
canned functionality of the existing nodes controller, and will be API
version and RBAC policy guarded in order to prevent inappropriate changes to
the field once set. Like all other fields, this operation takes the shape of
a JSON Patch.

GET /v1/nodes?shard=VALUE,VALUE2,VALUE3

Returns a subset of nodes limited by shard key. In this specific case we
will also allow a string value of "none", "None" or "null" to be utilized to
retrieve a list of nodes which do *not* have a shard key set. Logic to
handle that would be in the DB API layer.

GET /v1/ports?shard=VALUE,VALUE2,VALUE3

GET /v1/portgroups?shard=VALUE,VALUE2,VALUE3

Returns a subset of ports or port groups, limited by the shard key or list
of keys provided by the caller. Specifically, this would utilize a joined
query to the database.

GET /v1/shards

Returns a JSON document representing the shard keys and the counts of nodes
utilizing each shard::

  [{"Name": "Shard-10", "Count": 352},
   {"Name": "Shard-11", "Count": 351},
   {"Name": "Shard-12", "Count": 35},
   {"Name": null, "Count": 921}]

Visibility wise, the new capabilities will be restricted by API
micro-version. Access wise, this field would be restricted in use to
``system-reader``, ``project-admin``, and future ``service`` roles by
default. A specific RBAC policy would be added for access to this endpoint.

.. NOTE::
   The /v1/shards endpoint will be read only.

Client (CLI) impact
-------------------

Typically, but not always, if there are any REST API changes, there are
corresponding changes to python-ironicclient.
If so, what does the user interface look like? If not, describe why there
are REST API changes but no changes to the client.

"openstack baremetal" CLI
~~~~~~~~~~~~~~~~~~~~~~~~~

A ``baremetal shard list`` command would be added.

A ``baremetal node list --shard`` capability would be added to list all
nodes in a shard.

A ``--shard`` node level parameter for ``baremetal node set`` would also be
added.

A ``baremetal port list --shard`` capability would be added to limit the
related ports to nodes in a shard. Similarly, ``baremetal portgroup list
--shard`` would be added as well.

"openstacksdk"
~~~~~~~~~~~~~~

An SDK method would be added to get a shard list, and existing list methods
would be checked to ensure we can query by shard.

RPC API impact
--------------

None anticipated at this time.

Driver API impact
-----------------

None

Nova driver impact
------------------

A separate specification document is being proposed for the Nova project to
help identify *and* navigate the overall change.

That being said, no direct negative impact is anticipated.

The overall discussion with Nova revolves around facilitating a minimal
impact migration, without forcing invasive and breaking changes which may
not realistically be needed by operators.

.. NOTE::
   An overall migration path is envisioned, but what is noted here is only a
   suggestion and how we perceive the overall process.

Anticipated initial Nova migration steps:

Ironic itself will not be providing an explicit process for setting the
shard value on each node, aside from ``baremetal node set``. Below is what
*we, Ironic* anticipate as the overall migration steps to move towards this
model.

1) Complete the Ironic migration. Upon completion, executing the database
   status check (i.e. ``ironic-status upgrade check``) should detect and
   warn *if* a ``shard`` key is present on some nodes in the database while
   other nodes exist without a ``shard`` value.

2) The ``nova-compute`` service being upgraded is shut down.

3) A nova-manage command would be executed to reassign nodes with a user
   supplied ``shard`` value to a matching compute service.

   Example: nova-manage ironic-reassign

   Programmatically, this would retrieve a list of nodes matching the key
   from Ironic, and then change the host fields in the associated
   ComputeNode and Instance tables to be the supplied compute hostname, to
   match an existing nova compute service.

   .. NOTE::
      The command likely needs to match/validate that this is/was a compute
      hostname.

   .. TODO::
      As a final step before the nova-manage command exits, ideally it would
      double check the state of records in those tables to indicate if there
      are other nodes the named compute hostname is responsible for. The
      last compute hostname in the environment should not generate any
      warning; any warning would be indicative of a lost ComputeNode,
      Instance, or baremetal node record.

4) The upgraded ``nova-compute`` service is restarted with a ``my_shard``
   (or other appropriately named) parameter set in its nova-compute.conf
   file. This signals to the ``nova.virt.ironic`` driver code to not utilize
   the hash ring, and instead to utilize the blend of what it thinks it is
   responsible for from the database *and* what matches the Ironic baremetal
   node inventory when queried for the configured shard key value.

5) As additional compute nodes are migrated to the new shard key setup, any
   existing compute node imbalance should settle, in terms of the internal
   compute-node logic used to retrieve what each node thinks it is
   responsible for, and would eventually match the shard key.

This would facilitate a rolling upgrade with an isolated outage impact as
the new nova-compute configuration comes online, and also allows for a flow
which larger operators should be able to automate.

The manageability, say if one needs to change a ``shard`` or rebalance
shards, is not yet clear.
The current discussion in the Nova project is that rebalance/reassociation
will only be permitted *IF* the compute service has been "forced down",
which is an irreversible action.

Ramdisk impact
--------------

None

Security impact
---------------

The ``shard`` key would be API user settable, as long as sufficient API
access exists in the RBAC model.

The ``/v1/shards`` endpoint would also be restricted based upon the RBAC
model.

No other security impacts are anticipated.

Other end user impact
---------------------

None Anticipated

Scalability impact
------------------

This model is anticipated to allow users of data stored in Ironic to be more
scalable. No impacts to Ironic's own scalability are generally anticipated.

Performance Impact
------------------

No realistic impact is anticipated. While another field is being added,
initial prototyping benchmarks have yielded highly performant response times
for large sets (10,000) of baremetal nodes.

Other deployer impact
---------------------

It *is* recognized that operators *may* wish to auto-assign or auto-shard
the node set programmatically. The agreed upon limitation amongst Ironic
contributors is that we (Ironic) would not automatically create *new* shards
in the future. Creation of new shards would be driven by the operator by
setting a new shard key on any given node.

This may require a new configuration option to control this logic, but the
logic overall is not viewed as a blocking aspect to the more critical need
of being able to "assign" a node to a shard. This logic may be added later
on; we will endeavour to have updated documentation to explain the
appropriate usage and options.
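
As an illustration of operator-driven assignment, the sketch below spreads
unassigned nodes across *existing* shard keys only, and builds the JSON Patch
bodies an operator might then submit for each node. It is a sketch under
stated assumptions, not Ironic functionality: the ``plan_shard_assignments``
and ``shard_patch`` helpers, the ``/shard`` patch path, and the sample node
and shard names are all hypothetical.

```python
import heapq
import json


def plan_shard_assignments(unassigned_nodes, shard_counts):
    """Assign each unsharded node to the currently least-populated shard.

    Hypothetical helper, not part of Ironic. ``shard_counts`` maps existing
    shard keys to node counts; no new shard keys are created here.
    """
    if not shard_counts:
        raise ValueError("at least one existing shard key is required")
    # Min-heap of (count, shard) so the smallest shard is popped first.
    heap = [(count, shard) for shard, count in shard_counts.items()]
    heapq.heapify(heap)
    plan = {}
    for node in unassigned_nodes:
        count, shard = heapq.heappop(heap)
        plan[node] = shard
        heapq.heappush(heap, (count + 1, shard))
    return plan


def shard_patch(shard):
    """Build a JSON Patch body; the "/shard" path is an assumption."""
    return [{"op": "add", "path": "/shard", "value": shard}]


plan = plan_shard_assignments(
    ["node-a", "node-b", "node-c"],
    {"shard-1": 2, "shard-2": 0},
)
for node, shard in plan.items():
    print(node, json.dumps(shard_patch(shard)))
```

Keeping the planning step separate from the patch construction means the
operator can review the plan before any node is changed.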

Developer impact
----------------

None anticipated

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Jay Faulkner (JayF)

Other contributors:
  Julia Kreger (TheJulia)

Work Items
----------

* Propose nova spec for the use of the keys
  (https://review.opendev.org/c/openstack/nova-specs/+/862833)
* Create database schema/upgrades/models.
* Update the object layer for the ``Node`` and ``Port`` objects in order to
  permit both objects to be queried by ``shard``.
* Add query by shard capability to the Nodes and Ports database tables.
* Expose ``shard`` on the node API, with an incremented microversion, *and*
  implement a new RBAC policy which restricts the ability to change the
  ``shard`` value.
* Add a pre-upgrade status check to warn if there are fields which are not
  consistently populated, i.e. ``shard`` is not populated on all nodes. This
  will provide visibility into a mixed and possibly misconfigured
  operational state for those performing future upgrades.
* Update OpenStack SDK and python-ironicclient.

Dependencies
============

This specification is loosely dependent upon Nova accepting a plan for use
of the sharding model of data. At present, it is the Ironic team's
understanding that it is acceptable to Nova, and that Ironic needs to merge
this spec and related code to support this feature before Nova will permit
the Nova spec to be merged.

Testing
=======

Unit testing is expected for all the basic components and operations added
to Ironic to support this functionality.

We may be able to add some tempest testing for the API field and access
interactions.

Upgrades and Backwards Compatibility
====================================

To be determined. We anticipate that the standard upgrade process would
apply and that there would not realistically be an explicit downgrade
compatibility process, but this capability and functionality is largely for
external consumption, and details there are yet to be determined.
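
The shard summary returned by ``GET /v1/shards`` and the pre-upgrade
consistency check from the work items can both be derived from a single
grouped query against the nodes table. The following is a minimal sketch
using an in-memory SQLite database. The table layout, the ``shard_summary``
and ``check_shard_consistency`` helpers, and the warning text are assumptions
for illustration; they do not reflect Ironic's actual schema or the
``ironic-status`` code.

```python
import sqlite3


def shard_summary(conn):
    """Return [{"Name": shard_or_None, "Count": n}, ...] from one query."""
    rows = conn.execute(
        "SELECT shard, COUNT(*) FROM nodes GROUP BY shard ORDER BY shard"
    ).fetchall()
    return [{"Name": name, "Count": count} for name, count in rows]


def check_shard_consistency(summary):
    """Warn only when sharded and unsharded nodes are mixed."""
    unsharded = sum(s["Count"] for s in summary if s["Name"] is None)
    sharded = sum(s["Count"] for s in summary if s["Name"] is not None)
    if sharded and unsharded:
        return "WARNING: %d node(s) have no shard set" % unsharded
    return "OK"


# Hypothetical, simplified nodes table with an indexed shard column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (uuid TEXT PRIMARY KEY, shard TEXT)")
conn.execute("CREATE INDEX nodes_shard_idx ON nodes (shard)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [("n1", "Shard-10"), ("n2", "Shard-10"), ("n3", "Shard-11"), ("n4", None)],
)
summary = shard_summary(conn)
print(summary)
print(check_shard_consistency(summary))
```

A deployment with no shard keys at all, or with every node assigned, reports
``OK`` in this sketch; only the mixed state produces a warning.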

Documentation Impact
====================

Admin documentation would need to include a document covering sharding,
internal mechanics, and usage.

References
==========

PTG Notes: https://etherpad.opendev.org/p/nova-antelope-ptg

Bug: https://launchpad.net/bugs/1730834

Bug: https://launchpad.net/bugs/1825876

Related Bug: https://launchpad.net/bugs/1853009