..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

======================
Shard Key Introduction
======================

https://storyboard.openstack.org/#!/story/2010378

After much discussion and attempts to remedy the scalability issues with
``nova-compute`` and its connection to Ironic in large scale deployments,
and upon newly discovered indicators of ``networking-baremetal`` having a
similar scaling issue, the community has started to reach an agreement on
a path forward. Specifically, to introduce a sharding model which would
allow API consumers to map and lock on to specific sets of baremetal nodes,
regardless of if the relationship is semi-permanant or entirely situational.
Only the consumer of the information performing processing can make that
determination, and it is up to Ironic to try and provide the substrate
capabilities to efficiently operate against its API.

Problem description
===================

The reality is Ironic can be used at some absurd scales in the hundreds of
of thousands of baremetal nodes, and while *most* operators of Ironic either
run multiple smaller distinct Ironic deployments with less than 500 physical
machines, some need a single deployment with thousands or tens of thousands
of physical nodes. At increased scales, external operations polling ironic,
generally struggle to scale at these levels. It is also easy for
misconfigurations to be made where performance can become degraded,
which is because the scaling model and limits are difficult to understand.

This is observable with the operation of Nova's Compute process when running
the ``nova.virt.ironic`` driver. It is operationally easy to get into
situations where one is attempting to support thousands of baremetal nodes,
with too few ``nova-compute`` processes. This specific situation leads to
the process attempting to take on more work than it was designed to handle.

Recently we discovered a case, while rooted in misconfiguration, where the
same basic scaling issue exists with ``networking-baremetal`` where it is
responsible for polling and updating physical network mappings in Neutron.
The same basic case, a huge amount of work, and multiple processes.
In this specific case, multiple (3) Neutron services were stressing the Ironic
API retrieving all of the nodes, and attempting to update all of the related
physical network mapping records in neutron, resulting in the same record
being updated 3 times, once from each service.

The root issue is the software consuming Ironic's data needs to be able to
self-delineate the overall node set and determine the local separation points
for sharding the nodes. The delineation is required because the processes
executed are far more processor intensive, which can introduce latency and
lag which can lead towards race conditions.

The challenge, from what has been done previously, is the previous model
required downloading the entire data set to build a hash ring from.

Where things are also complicated, is Ironic has an operational model of
a ``conductor_group``, which is intended to help model a physical grouping
or operational constraint. The challenge here is that conductor groups are
not automatic in any way, shape, or form. As a result, conductor groups
is not the solution we need here.

Proposed change
===============

Overall the idea, is to introduce a ``shard`` field on the node object,
which an API user (Service), can utilize to retrieve a subset of nodes.

This new field on the node object would be inline with existing API
field behavior constraints and can be set via the API.

We *can* provide a means to pre-set the shard, but ultimately it is
still optional for Ironic, and the shard *exists* for the API
consumer's benefit.

In order to facilitate the usage by an API client, ``/v1/nodes`` and
``/v1/ports`` would be updated to accept a ``shard`` parameter
(i.e. GET /v1/nodes?shard=foo, GET /v1/ports?shard=foo,
GET /v1/portgroups?shard=foo) in the query to allow for API consumers
to automatically scope limit their data set and self determine how to
reduce the workset. For example, ``networking-baremetal`` may not care
about assignment, it just needs to reduce the localized workset.
Whereas, ``nova-compute`` needs the shard field to remain static, that is
unless ``nova-compute`` or some other API consumer were to request the
``shard`` to be updated on a node.

.. NOTE::
   The overall process consumers use today is to retreive everything and
   then limit the scope of work based upon contents of the result set.
   This results in a large overhead of work and increased looping latency
   which also can encourage race conditions. Both ``nova-compute``
   and the ``networking-baremetal`` ML2 plugin operate in this way with
   different patterns of use. The advantage of the the proposed solution
   is to enabel the scope limiting/grouping into managable chunks.

In terms of access controls, we would also add a new RBAC policy to
restrict changes such that the system itself or a appropriately scoped
(i.e. administrative) user can change the field.

In this model, conductors do not care about the shard key. It is only
a data storage field on the node. Lookups for contents of the overall
shard composition/layout, for GET /v1/shards, is to be performed
directly against the nodes table using a SQL query.

Alternatives
------------

This is a complex solution to allow simplified yet delineated usage,
and there are numerous other options for specific details.

Ultimately, each item should be discussed, and considered.

One key aspect, which has been recognized thus far, is that existing
mechanisms can be inefficiently leveraged to achieve this. An example
of this is that ``conductor_group``, ``owner``, ``lessee`` all allow for
filtering of the node result set. A ``conductor_group`` being an explicit
aspect an API client can request, where as ``owner`` and ``lessee`` are
access control based filters tied to the API client's submitted Project
ID used for Authentication. More information on why ``conductor_group``
is problematic is further on in this document.

Consensus in discussion with the Nova teams seems to be that usage of
the other fields, while in part may be useful, and possibly even preferred
in some limited and specific cases, doesn't solve the general need
to be able to allow clients to self delineate *without* first downloading
the *entire* node list first. Which in itself, the act of retrieving
a complete list of nodes is a known scaling challenge, and creates increased
processing latency.

In the ``conductor_group`` case, there is no current way to discover
the conductor groups. Where as for ``owner`` and ``lessee``, these are
specific project ID value fields.

Why not Conductor Group?
~~~~~~~~~~~~~~~~~~~~~~~~

It is important to stress similiarity wise, this *is* similar to conductor
groups, however conductor groups were primarily purposed to model the physical
constraints and structure of the baremetal infrastructure.

For example, if you have a set of conductors in Europe, and a set of
conductors in New York, you don't want to try and run a deploy for servers
in New York, from Europe. Part of the attractiveness for this to be exposed
or used in Nova, was *also* to align the physical structure. The immediately
recognized bonus to operators was the list of nodes was limited to the running
``nova-compute`` process, if so configured. It is known to the Ironic community
that some infrastructure operators *have* utilized this setting and field to
facilitate scaling of their ``nova-compute`` infrastructure, however these
operators have also encountered issues with this use pattern as well that
we hope to avoid with a shard key implementation.

Where the needs are different with this effort and the pre-existing
conductor groups, is that conductor groups are part of the hash ring modeling
behavior where as in the shards model conductors will operate without
consideration of the shard key value. We need disjointed modeling to support
API consumer centric usage so they can operate in logical units with distinct
selections of work.
Consumers *may* also care about the ``conductor_group`` in addition to the
shard because needing to geographically delineate is separate from needing
smaller "chunks" of work, or in this case "groups of baremetal nodes" for
which a running process is responsible for.

In this specific case, ``conductor_group`` is entirely a manually managed
aspect, which nova has a separate setting name due to name perception reasons,
and our hope ultimately is something that is both simple and smart.

.. NOTE::
   The Nova project has agreed during Project Teams Gathering meetings to
   deprecate the ``peer_list`` parameter they forced use of previously to
   support conductor groups with the hash ring logic.

On top of this, Today's ``conductor_group`` functionality is reliant upon
the hash ring model of use, which is something the Nova team wants to see
removed from the Nova codebase in the next several development cycles.
Where as, Ironic will continue to use the hash ring functionality
for managing our conductor's operating state as it is also modeled for
conductors to manage thousands of nodes. These thousands of nodes just
does not scale well into ``nova-compute`` services.

Why not owner or lessee?
~~~~~~~~~~~~~~~~~~~~~~~~

With the RBAC model improvements which have taken place over the last few
years, it *is* entirely possible to manage separate projects and credentials
for a ``nova-compute`` to exist and operate within. The challenge here is
management of additional credentials and the mappings/interactions.

It might be "feasible" to do the same for scaling ``networking-baremetal``
interactions with Ironic's API, but the overhead and self management of
node groupings seems onerous and error prone.

Also, if this was a path taken, it would also be administratively prohibitive
for nova-computes nodes, and they would be locked to the manual settings.

What if we just let the API consumer figure it out?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This could be an option, but it would lead to worse performance and the
user experience being worse.

The base conundrum is to orderly and efficiently enumerate through, and then
acting upon each and every node API client is responsible for interacting
with.

Today, Nova's Compute service enumerates through every node, using a list
generated upon one query, and it gets *most* of the data it needs to
track/interact with a node, keeping the more costly single node requests to a
minimum. If that client had to track things, it would still have to pull
a full list, and then it would have to reconcile, track, and map individual
nodes. We've already seen this as not working using a Hashring today.

Similarly, ``networking-baremetal`` lists all ports. That is all it needs,
but it has no concept of smaller chunking, blocks, or even enough information
*to* really make a hashring which would represent existing models. To just
expect the client to "figure it out" and to "deal with that complexity",
also means logic far away from a database. And for performance, the closer
we can keep logic and decisions to an indexed column, the better and more
performant, which is why the proposed solution has come forth.

Data model impact
-----------------

Node: Addition of a ``shard`` column/value string field, indexed,
      with a default value of None. This field is considered to be
      case sensitive, which is inline with the DB storage type.
      API queries would seek exact field value matches.

.. NOTE:: We will need to confer with the Nova team and the nova.virt.ironic
          driver query pattern, to ensure we cover any compound indexes,
          if needed.

To facilitate this, database migrations, and data model sanity checking
will need to be added to ``ironic-status`` as part of the upgrade checks.

State Machine Impact
--------------------

None

REST API impact
---------------

PATCH /v1/nodes/<node>

In order to set a shard value, a user will need to patch the field.
This is canned functionality of the existing nodes controller, and will
be API version and RBAC policy guarded in order to prevent inappropriate
changes to the field once set. Like all other fields, this operation
takes the shape of a JSON Patch.

GET /v1/nodes?shard=VALUE,VALUE2,VALUE3

Returns a subset of nodes limited by shard key. In this specific case
we will also allow a string value of "none", "None" or "null" to
be utilized to retrieve a list of nodes which do *not* have a shard
key set. Logic to handle that would be in the DB API layer.

GET /v1/ports?shard=VALUE,VALUE2,VALUEZ
GET /v1/portgroupss?shard=VALUE,VALUE2,VALUEZ

Returns a subset of ports, limited by the shard key, or list of keys
provided by the caller. Specifically would utilize a joined query
to the database to facilitate it.

GET /v1/shards

Returns a JSON representing the shard keys and counts of nodes
utilizing the shard.

    {{"Name": "Shard-10", "Count": 352},
    {"Name": "Shard-11", "Count": 351},
    {"Name": "Shard-12", "Count": 35},
    {"Name": null, "Count": 921}}

Visibility wise, the new capabilities will be restricted by API
micro-version. Access wise this field would be restricted in use to
``system-reader``, ``project-admin``, and future ``service`` roles
by default. A specific RBAC policy would be added for access to
this endpoint.

.. NOTE::
   The /v1/shards endpoint will be read only.

Client (CLI) impact
-------------------
Typically, but not always, if there are any REST API changes, there are
corresponding changes to python-ironicclient. If so, what does the user
interface look like. If not, describe why there are REST API changes but
no changes to the client.

"openstack baremetal" CLI
~~~~~~~~~~~~~~~~~~~~~~~~~

A ``baremetal shard list`` command would be added.

A ``baremetal node list --shard <shard>`` capability would be
added to list all nodes in a shard.

A ``--shard`` node level parameter for ``baremetal node set``
would also be added.

A ``baremetal port list --shard <shard>`` capability would be
added to limit the related ports to nodes in a shard. Similarly,
the ``baremetal portgroup list --shard <shard>`` would be updated
as well.

"openstacksdk"
~~~~~~~~~~~~~~

A SDK method would be added to get a shard list, and existing list methods
would be checked to ensure we can query by shard.

RPC API impact
--------------

None anticipated at this time.

Driver API impact
-----------------

None

Nova driver impact
------------------

A separate specification document is being proposed for the Nova
project to help identify *and* navigate the overall change.

That being said, no direct negative impact is anticipated.

The overall discussion revolving with Nova is to both facilitate a
minimal impact migration, and not force invasive and breaking changes,
which may not be realistically needed by the operators.

.. NOTE:: An overall migration path is envisioned, but what is
          noted here is only a suggestion and how we perceive the
          overall process.

Anticipated initial Nova migration steps:

Ironic itself will not be providing an explicit process for setting the
shard value on each node, aside from ``baremetal node set``. Below is what
*we, Ironic* anticipate as the migration steps overall to move towards this
model.

1) Complete the Ironic migration. Upon completion, executing the database
   status check (i.e. ``ironic-status upgrade check``) should detect and warn
   *if* a ``shard`` key is present on nodes in the database, but nodes
   exist without a ``shard`` value are present in the database.
2) The nova-compute service being upgraded is shut down.
3) A nova-manage command would be executed to reassign nodes to a user
   supplied ``shard`` value to match.
   Example: nova-manage ironic-reassign <shard-key> <compute-hostname>

   Programattically, this would retrieve a list of nodes matching the key from
   Ironic, and then change the associated ComputeNode and Instance tables host
   fields to be the supplied compute hostname, to match an existing nova
   compute service.

   .. NOTE:: The command likely needs to match/validate that this is/was a
             compute hostname.

   .. TODO:: As a final step before the nova-manage command exits, ideally it
             would double check the state of records in those tables to
             indicate if there are other nodes the named Compute hostname is
             responsible for. The last compute hostname in the environment
             should not generate any warning, any warning would be indicitive
             of a lost ComputeNode, Instance, or Baremetal node record.

4) The nova-compute.conf file for the upgraded ``nova-compute`` service is
   restarted with a ``my_shard`` (or other appropriate parameter) which
   signals to the ``nova.virt.ironic`` driver code to not utilize the hash
   ring, and to utilize the blend of what it thinks it is responsible for
   from the database *and* what matches the Ironic baremetal node inventory
   when queried for matching the configured shard key value.
5) As additional compute nodes are migrated to using the new shard key setup,
   existing compute node imbalance should be settled in terms of the
   internal compute-node logic to retrieve what each node it thinks it is
   responsible for, and would eventually match the shard key.

This would facilitate an ability to perform a rolling, yet isolated outage
impact as the new nova-compute configuraiton is coming online, and also allows
for a flow which should be able to be automated for larger operators.

The manageability, say if one needs to change a ``shard`` or rebalance
shards, is not yet clear. The current discussion in the Nova project is that
rebalance/reassociation will only be permitted *IF* the compute service
has been "forced down" which is an irreversable action

Ramdisk impact
--------------

None

Security impact
---------------

The ``shard`` key would be API user settable, as long as sufficient
API access exists in the RBAC model.

The ``/v/shards`` endpoint would also be restricted based upon the RBAC
model.

No other security impacts are anticipated.

Other end user impact
---------------------

None Anticipated

Scalability impact
------------------

This model is anticipated to allow users of data stored in Ironic to be more
scalable. No impacts to Ironic's scalability are generally anticipated.

Performance Impact
------------------

No realistic impact is anticipated. While another field is being added,
initial prototyping benchmarks have yielded highly performant response
times for large sets (10,000) baremetal nodes.

Other deployer impact
---------------------

It *is* recognized that operators *may* wish to auto-assign or auto-shard
the node set programatically. The agreed upon limitation amongst Ironic
contributors is that we (Ironic) would not automatically create *new*
shards in the future. Creation of new shards would be driven by the operator
by setting a new shard key on any given node.

This may require a new configuration option to control this logic, but
the logic overall is not viewed as a blocking aspect to the more critical
need of being able to "assign" a node to a shard. This logic may be added
later on, we will just endeveour to have updated documentation to explain
the appropriate usage and options.

Developer impact
----------------

None anticipated

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Jay Faulkner (JayF)

Other contributors:
  Julia Kreger (TheJulia)

Work Items
----------

* Propose nova spec for the use of the keys
  (https://review.opendev.org/c/openstack/nova-specs/+/862833)
* Create database schema/upgrades/models.
* Update Object layer for the ``Node`` and ``Port`` objects in order to
  permit both objects to be queried by ``shard``.
* Add query by shard capability to the Nodes and Ports database tables.
* Expose ``shard`` on the node API, with an incremented microversion
  *and* implement a new RBAC policy which restricts the ability to change
  the ``shard`` value
* Add pre-upgrade status check to warn if there are fields which are
  not consistently populated. i.e. ``shard`` is not populated on
  all nodes. This will provide visibility into the mixed and possibly
  misconfigured operational state for future upgrader.
* Update OpenStack SDK and python-ironicclient

Dependencies
============

This specification is loosely dependent upon Nova accepting
a plan for use of the sharding model of data. At present, it is the
Ironic team's understanding that it is acceptable to Nova, and Ironic
needs to merge this spec and related code to support this feature before
Nova will permit the Nova spec to be merged.

Testing
=======

Unit testing is expected for all the basic components and operations
added ot Ironic to support this funcitonality.

We may be able to add some tempest testing for the API field and access
interactions.

Upgrades and Backwards Compatibility
====================================

To be determined. We anticipate that the standard upgrade process would apply
and that there would not realistically be an explicit downgrade compatability
process, but this capability and functionality is largely for external
consumption, and details there are yet to be determined.

Documentation Impact
====================

Admin documentation would need to include an document covering sharding,
internal mechanics, and usage.

References
==========

PTG Notes: https://etherpad.opendev.org/p/nova-antelope-ptg
Bug: https://launchpad.net/bugs/1730834
Bug: https://launchpad.net/bugs/1825876
Related Bug: https://launchpad.net/bugs/1853009