Merge "Add a shard key" (commit 86277449b6)

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

======================
Shard Key Introduction
======================

https://storyboard.openstack.org/#!/story/2010378

After much discussion and attempts to remedy the scalability issues with
``nova-compute`` and its connection to Ironic in large scale deployments,
and upon newly discovered indicators of ``networking-baremetal`` having a
similar scaling issue, the community has started to reach an agreement on
a path forward. Specifically, to introduce a sharding model which would
allow API consumers to map and lock on to specific sets of baremetal nodes,
regardless of whether the relationship is semi-permanent or entirely
situational. Only the consumer of the information performing processing can
make that determination, and it is up to Ironic to try and provide the
substrate capabilities to efficiently operate against its API.

Problem description
===================

The reality is Ironic can be used at some absurd scales, in the hundreds of
thousands of baremetal nodes. While *most* operators of Ironic run multiple
smaller distinct Ironic deployments with fewer than 500 physical machines,
some need a single deployment with thousands or tens of thousands of
physical nodes. External operations polling Ironic generally struggle to
scale at these levels. It is also easy for misconfigurations to be made
which degrade performance, because the scaling model and limits are
difficult to understand.

This is observable with the operation of Nova's Compute process when running
the ``nova.virt.ironic`` driver. It is operationally easy to get into
situations where one is attempting to support thousands of baremetal nodes,
with too few ``nova-compute`` processes. This specific situation leads to
the process attempting to take on more work than it was designed to handle.

Recently we discovered a case, while rooted in misconfiguration, where the
same basic scaling issue exists with ``networking-baremetal``, which is
responsible for polling and updating physical network mappings in Neutron.
It is the same basic case: a huge amount of work and multiple processes.
In this specific case, multiple (3) Neutron services were stressing the Ironic
API retrieving all of the nodes, and attempting to update all of the related
physical network mapping records in Neutron, resulting in the same record
being updated 3 times, once from each service.

The root issue is the software consuming Ironic's data needs to be able to
self-delineate the overall node set and determine the local separation points
for sharding the nodes. The delineation is required because the processes
executed are far more processor intensive, which can introduce latency and
lag, which in turn can lead to race conditions.

The challenge with what has been done previously is that the previous model
required downloading the entire data set in order to build a hash ring.

A further complication is that Ironic has an operational model of
a ``conductor_group``, which is intended to help model a physical grouping
or operational constraint. The challenge here is that conductor groups are
not automatic in any way, shape, or form. As a result, conductor groups
are not the solution we need here.

Proposed change
===============

Overall, the idea is to introduce a ``shard`` field on the node object,
which an API user (Service) can utilize to retrieve a subset of nodes.

This new field on the node object would be in line with existing API
field behavior constraints and can be set via the API.

We *can* provide a means to pre-set the shard, but ultimately it is
still optional for Ironic, and the shard *exists* for the API
consumer's benefit.

In order to facilitate the usage by an API client, ``/v1/nodes``,
``/v1/ports``, and ``/v1/portgroups`` would be updated to accept a ``shard``
parameter (e.g. ``GET /v1/nodes?shard=foo``, ``GET /v1/ports?shard=foo``,
``GET /v1/portgroups?shard=foo``) in the query to allow API consumers
to automatically scope limit their data set and self determine how to
reduce the workset. For example, ``networking-baremetal`` may not care
about assignment; it just needs to reduce the localized workset.
``nova-compute``, on the other hand, needs the shard field to remain static
unless ``nova-compute`` or some other API consumer requests the
``shard`` to be updated on a node.
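
As a minimal sketch of how a consumer might build such scoped requests, the
following uses only the Python standard library. The helper name
``shard_list_url`` and the example base URL are hypothetical, and the
comma-separated multi-value form follows the REST API impact section of this
document rather than any released API.

```python
from urllib.parse import urlencode

# Resources this spec proposes to make filterable by shard.
SHARDABLE = ("nodes", "ports", "portgroups")


def shard_list_url(base_url, resource, shards):
    """Build a list URL scoped to one or more shard keys.

    Hypothetical helper for illustration; not part of any client library.
    """
    if resource not in SHARDABLE:
        raise ValueError(f"unsupported resource: {resource}")
    # Keep commas literal so several keys travel in a single parameter.
    query = urlencode({"shard": ",".join(shards)}, safe=",")
    return f"{base_url.rstrip('/')}/v1/{resource}?{query}"


print(shard_list_url("http://ironic.example.com:6385", "nodes", ["foo"]))
```

A consumer such as ``networking-baremetal`` could call this once per shard it
chooses to work on, instead of listing every port in the deployment.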

.. NOTE::
   The overall process consumers use today is to retrieve everything and
   then limit the scope of work based upon contents of the result set.
   This results in a large overhead of work and increased looping latency
   which also can encourage race conditions. Both ``nova-compute``
   and the ``networking-baremetal`` ML2 plugin operate in this way with
   different patterns of use. The advantage of the proposed solution
   is to enable the scope limiting/grouping into manageable chunks.

In terms of access controls, we would also add a new RBAC policy to
restrict changes such that only the system itself or an appropriately scoped
(i.e. administrative) user can change the field.

In this model, conductors do not care about the shard key. It is only
a data storage field on the node. Lookups for contents of the overall
shard composition/layout, for ``GET /v1/shards``, are to be performed
directly against the nodes table using a SQL query.
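
To illustrate the kind of query this implies, here is a small sketch against
an in-memory SQLite database. The two-column ``nodes`` table and the index
name are stand-ins assumed for the example, not Ironic's actual schema, and
the ``Name``/``Count`` keys mirror the response example given later in this
document.

```python
import sqlite3


def shard_summary(conn):
    """Return one entry per shard key with the number of nodes using it."""
    rows = conn.execute(
        "SELECT shard, COUNT(*) FROM nodes GROUP BY shard ORDER BY shard"
    ).fetchall()
    return [{"Name": name, "Count": count} for name, count in rows]


# Stand-in schema: only the columns the query needs (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (uuid TEXT PRIMARY KEY, shard TEXT)")
conn.execute("CREATE INDEX nodes_shard_idx ON nodes (shard)")
conn.executemany(
    "INSERT INTO nodes (uuid, shard) VALUES (?, ?)",
    [("n1", "Shard-10"), ("n2", "Shard-10"), ("n3", "Shard-11"), ("n4", None)],
)
summary = shard_summary(conn)
print(summary)
```

Nodes without a shard are grouped together under a ``None`` name, so no
conductor involvement or per-node lookups are needed to produce the summary.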

Alternatives
------------

This is a complex solution to allow simplified yet delineated usage,
and there are numerous other options for specific details.

Ultimately, each item should be discussed and considered.

One key aspect, which has been recognized thus far, is that existing
mechanisms can be inefficiently leveraged to achieve this. An example
of this is that ``conductor_group``, ``owner``, and ``lessee`` all allow for
filtering of the node result set. A ``conductor_group`` is an explicit
aspect an API client can request, whereas ``owner`` and ``lessee`` are
access control based filters tied to the API client's submitted Project
ID used for authentication. More information on why ``conductor_group``
is problematic is further on in this document.

Consensus in discussion with the Nova teams seems to be that usage of
the other fields, while it may be useful in part, and possibly even preferred
in some limited and specific cases, doesn't solve the general need
to allow clients to self delineate *without* first downloading
the *entire* node list. The act of retrieving a complete list of nodes is
itself a known scaling challenge, and creates increased processing latency.

In the ``conductor_group`` case, there is no current way to discover
the conductor groups. The ``owner`` and ``lessee`` fields, in contrast, are
specific project ID value fields.

Why not Conductor Group?
~~~~~~~~~~~~~~~~~~~~~~~~

It is important to stress that this *is* similar to conductor groups;
however, conductor groups were primarily intended to model the physical
constraints and structure of the baremetal infrastructure.

For example, if you have a set of conductors in Europe, and a set of
conductors in New York, you don't want to try and run a deploy for servers
in New York, from Europe. Part of the attractiveness for this to be exposed
or used in Nova, was *also* to align the physical structure. The immediately
recognized bonus to operators was the list of nodes was limited to the running
``nova-compute`` process, if so configured. It is known to the Ironic community
that some infrastructure operators *have* utilized this setting and field to
facilitate scaling of their ``nova-compute`` infrastructure; however, these
operators have also encountered issues with this use pattern that
we hope to avoid with a shard key implementation.

Where the needs differ between this effort and the pre-existing
conductor groups is that conductor groups are part of the hash ring modeling
behavior, whereas in the shards model conductors will operate without
consideration of the shard key value. We need disjointed modeling to support
API consumer centric usage so they can operate in logical units with distinct
selections of work.
Consumers *may* also care about the ``conductor_group`` in addition to the
shard because needing to geographically delineate is separate from needing
smaller "chunks" of work, or in this case "groups of baremetal nodes" for
which a running process is responsible.

In this specific case, ``conductor_group`` is entirely a manually managed
aspect, for which Nova has a separately named setting due to name perception
reasons, and our hope ultimately is for something that is both simple and
smart.

.. NOTE::
   The Nova project has agreed during Project Teams Gathering meetings to
   deprecate the ``peer_list`` parameter they forced use of previously to
   support conductor groups with the hash ring logic.

On top of this, today's ``conductor_group`` functionality is reliant upon
the hash ring model of use, which is something the Nova team wants to see
removed from the Nova codebase in the next several development cycles.
Ironic, in contrast, will continue to use the hash ring functionality
for managing our conductors' operating state, as it is also modeled for
conductors to manage thousands of nodes. Those thousands of nodes just
do not scale well into ``nova-compute`` services.

Why not owner or lessee?
~~~~~~~~~~~~~~~~~~~~~~~~

With the RBAC model improvements which have taken place over the last few
years, it *is* entirely possible to manage separate projects and credentials
for a ``nova-compute`` to exist and operate within. The challenge here is
management of additional credentials and the mappings/interactions.

It might be "feasible" to do the same for scaling ``networking-baremetal``
interactions with Ironic's API, but the overhead and self management of
node groupings seems onerous and error prone.

Also, if this were the path taken, it would be administratively prohibitive
for ``nova-compute`` nodes, and they would be locked to the manual settings.

What if we just let the API consumer figure it out?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This could be an option, but it would lead to worse performance and a
worse user experience.

The base conundrum is how to enumerate, in an orderly and efficient manner,
and then act upon each and every node an API client is responsible for
interacting with.

Today, Nova's Compute service enumerates through every node, using a list
generated upon one query, and it gets *most* of the data it needs to
track/interact with a node, keeping the more costly single node requests to a
minimum. If that client had to track things, it would still have to pull
a full list, and then it would have to reconcile, track, and map individual
nodes. We've already seen this not working with the hash ring today.

Similarly, ``networking-baremetal`` lists all ports. That is all it needs,
but it has no concept of smaller chunking, blocks, or even enough information
*to* really make a hash ring which would represent existing models. To just
expect the client to "figure it out" and to "deal with that complexity"
also means logic far away from a database. For performance, the closer
we can keep logic and decisions to an indexed column, the better,
which is why the proposed solution has come forth.

Data model impact
-----------------

Node: Addition of a ``shard`` column/value string field, indexed,
with a default value of None. This field is considered to be
case sensitive, which is in line with the DB storage type.
API queries would seek exact field value matches.

.. NOTE:: We will need to confer with the Nova team and the nova.virt.ironic
   driver query pattern, to ensure we cover any compound indexes,
   if needed.

To facilitate this, database migrations and data model sanity checking
will need to be added to ``ironic-status`` as part of the upgrade checks.

State Machine Impact
--------------------

None

REST API impact
---------------

PATCH /v1/nodes/<node>

In order to set a shard value, a user will need to patch the field.
This is canned functionality of the existing nodes controller, and will
be API version and RBAC policy guarded in order to prevent inappropriate
changes to the field once set. Like all other fields, this operation
takes the shape of a JSON Patch.
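
For illustration, a request body for this operation could be assembled as
below. This is a sketch following the generic JSON Patch format (RFC 6902);
the helper name ``shard_patch`` is hypothetical, and using a ``remove``
operation to clear the field is an assumption rather than something this
document specifies.

```python
import json


def shard_patch(shard):
    """Return a JSON Patch (RFC 6902) document that sets or clears ``shard``."""
    if shard is None:
        # Assumption: clearing is expressed as a "remove" operation.
        return [{"op": "remove", "path": "/shard"}]
    return [{"op": "add", "path": "/shard", "value": shard}]


# Body a client would send with PATCH /v1/nodes/<node>.
body = json.dumps(shard_patch("Shard-10"))
print(body)
```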

GET /v1/nodes?shard=VALUE,VALUE2,VALUE3

Returns a subset of nodes limited by shard key. In this specific case
we will also allow a string value of "none", "None" or "null" to
be utilized to retrieve a list of nodes which do *not* have a shard
key set. Logic to handle that would be in the DB API layer.
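
One way the parameter handling described above could look is sketched here.
The function name ``parse_shard_filter`` and its return shape are
hypothetical; the three reserved tokens and the exact, case sensitive
matching come from this document.

```python
# Reserved strings which select nodes without a shard key set.
UNSET_TOKENS = {"none", "None", "null"}


def parse_shard_filter(raw):
    """Split a ``shard`` query value into explicit keys and an unset flag.

    Returns ``(keys, include_unset)``, where ``keys`` keeps request order
    without duplicates and ``include_unset`` is True when any reserved
    token was supplied.
    """
    keys = []
    include_unset = False
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        if item in UNSET_TOKENS:
            include_unset = True
        elif item not in keys:
            keys.append(item)
    return keys, include_unset


print(parse_shard_filter("Shard-10,Shard-11,none"))
```

Because shard keys are matched exactly, a value such as ``NONE`` is treated
in this sketch as an ordinary key and not as a request for unsharded nodes.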

GET /v1/ports?shard=VALUE,VALUE2,VALUE3
GET /v1/portgroups?shard=VALUE,VALUE2,VALUE3

Returns a subset of ports or portgroups, limited by the shard key, or list
of keys, provided by the caller. Specifically, this would utilize a joined
query to the database to facilitate it.
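
The joined query could take roughly the following shape, shown here against
an in-memory SQLite database. The table layout (``ports.node_id`` referencing
``nodes.id``) and the function name are assumptions made for the example and
are not taken from Ironic's schema.

```python
import sqlite3


def ports_in_shards(conn, shards):
    """Return UUIDs of ports whose parent node is in one of ``shards``."""
    if not shards:
        return []
    # One bound parameter per requested shard key.
    placeholders = ", ".join("?" for _ in shards)
    sql = (
        "SELECT ports.uuid FROM ports "
        "JOIN nodes ON ports.node_id = nodes.id "
        f"WHERE nodes.shard IN ({placeholders}) "
        "ORDER BY ports.uuid"
    )
    return [row[0] for row in conn.execute(sql, list(shards))]


# Stand-in tables holding only the columns the join needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, shard TEXT)")
conn.execute("CREATE TABLE ports (uuid TEXT PRIMARY KEY, node_id INTEGER)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)", [(1, "a"), (2, "b"), (3, None)]
)
conn.executemany(
    "INSERT INTO ports VALUES (?, ?)",
    [("p1", 1), ("p2", 2), ("p3", 3), ("p4", 1)],
)
print(ports_in_shards(conn, ["a"]))
```

The shard value lives only on the node, so ports never need their own copy
of it; the join resolves membership at query time.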

GET /v1/shards

Returns a JSON document representing the shard keys and counts of nodes
utilizing each shard.

[{"Name": "Shard-10", "Count": 352},
{"Name": "Shard-11", "Count": 351},
{"Name": "Shard-12", "Count": 35},
{"Name": null, "Count": 921}]

Visibility-wise, the new capabilities will be restricted by API
micro-version. Access-wise, use of this field would be restricted to
``system-reader``, ``project-admin``, and future ``service`` roles
by default. A specific RBAC policy would be added for access to
this endpoint.

.. NOTE::
   The /v1/shards endpoint will be read only.

Client (CLI) impact
-------------------
Typically, but not always, if there are any REST API changes, there are
corresponding changes to python-ironicclient. If so, what does the user
interface look like? If not, describe why there are REST API changes but
no changes to the client.

"openstack baremetal" CLI
~~~~~~~~~~~~~~~~~~~~~~~~~

A ``baremetal shard list`` command would be added.

A ``baremetal node list --shard <shard>`` capability would be
added to list all nodes in a shard.

A ``--shard`` node level parameter for ``baremetal node set``
would also be added.

A ``baremetal port list --shard <shard>`` capability would be
added to limit the related ports to nodes in a shard. Similarly,
the ``baremetal portgroup list --shard <shard>`` would be updated
as well.

"openstacksdk"
~~~~~~~~~~~~~~

An SDK method would be added to get a shard list, and existing list methods
would be checked to ensure we can query by shard.

RPC API impact
--------------

None anticipated at this time.

Driver API impact
-----------------

None

Nova driver impact
------------------

A separate specification document is being proposed for the Nova
project to help identify *and* navigate the overall change.

That being said, no direct negative impact is anticipated.

The overall discussion with Nova revolves around facilitating a
minimal impact migration and not forcing invasive and breaking changes
which may not be realistically needed by the operators.

.. NOTE:: An overall migration path is envisioned, but what is
   noted here is only a suggestion and how we perceive the
   overall process.

Anticipated initial Nova migration steps:

Ironic itself will not be providing an explicit process for setting the
shard value on each node, aside from ``baremetal node set``. Below is what
*we, Ironic* anticipate as the migration steps overall to move towards this
model.

1) Complete the Ironic migration. Upon completion, executing the database
   status check (i.e. ``ironic-status upgrade check``) should detect and warn
   *if* a ``shard`` key is present on some nodes in the database while other
   nodes exist without a ``shard`` value.
2) The nova-compute service being upgraded is shut down.
3) A nova-manage command would be executed to reassign nodes to a user
   supplied ``shard`` value to match.
   Example: nova-manage ironic-reassign <shard-key> <compute-hostname>

   Programmatically, this would retrieve a list of nodes matching the key from
   Ironic, and then change the associated ComputeNode and Instance tables host
   fields to be the supplied compute hostname, to match an existing nova
   compute service.

   .. NOTE:: The command likely needs to match/validate that this is/was a
      compute hostname.

   .. TODO:: As a final step before the nova-manage command exits, ideally it
      would double check the state of records in those tables to
      indicate if there are other nodes the named Compute hostname is
      responsible for. The last compute hostname in the environment
      should not generate any warning; any warning would be indicative
      of a lost ComputeNode, Instance, or Baremetal node record.

4) The upgraded ``nova-compute`` service is restarted with a ``my_shard``
   (or other appropriate parameter) set in its nova-compute.conf file, which
   signals to the ``nova.virt.ironic`` driver code to not utilize the hash
   ring, and to utilize the blend of what it thinks it is responsible for
   from the database *and* what matches the Ironic baremetal node inventory
   when queried for matching the configured shard key value.
5) As additional compute nodes are migrated to using the new shard key setup,
   any existing compute node imbalance should settle as the internal
   compute-node logic retrieves what each node thinks it is responsible for,
   which would eventually match the shard key.

This would facilitate a rolling upgrade with an isolated outage
impact as each new nova-compute configuration comes online, and also allows
for a flow which should be able to be automated for larger operators.

The manageability, say if one needs to change a ``shard`` or rebalance
shards, is not yet clear. The current discussion in the Nova project is that
rebalance/reassociation will only be permitted *IF* the compute service
has been "forced down", which is an irreversible action.

Ramdisk impact
--------------

None

Security impact
---------------

The ``shard`` key would be API user settable, as long as sufficient
API access exists in the RBAC model.

The ``/v1/shards`` endpoint would also be restricted based upon the RBAC
model.

No other security impacts are anticipated.

Other end user impact
---------------------

None Anticipated

Scalability impact
------------------

This model is anticipated to allow users of data stored in Ironic to be more
scalable. No impacts to Ironic's scalability are generally anticipated.

Performance Impact
------------------

No realistic impact is anticipated. While another field is being added,
initial prototyping benchmarks have yielded highly performant response
times for large sets (10,000) of baremetal nodes.

Other deployer impact
---------------------

It *is* recognized that operators *may* wish to auto-assign or auto-shard
the node set programmatically. The agreed upon limitation amongst Ironic
contributors is that we (Ironic) would not automatically create *new*
shards in the future. Creation of new shards would be driven by the operator
by setting a new shard key on any given node.

This may require a new configuration option to control this logic, but
the logic overall is not viewed as a blocking aspect to the more critical
need of being able to "assign" a node to a shard. This logic may be added
later on; we will just endeavour to have updated documentation to explain
the appropriate usage and options.

Developer impact
----------------

None anticipated

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Jay Faulkner (JayF)

Other contributors:
  Julia Kreger (TheJulia)

Work Items
----------

* Propose nova spec for the use of the keys
  (https://review.opendev.org/c/openstack/nova-specs/+/862833)
* Create database schema/upgrades/models.
* Update Object layer for the ``Node`` and ``Port`` objects in order to
  permit both objects to be queried by ``shard``.
* Add query by shard capability to the Nodes and Ports database tables.
* Expose ``shard`` on the node API, with an incremented microversion
  *and* implement a new RBAC policy which restricts the ability to change
  the ``shard`` value.
* Add pre-upgrade status check to warn if there are fields which are
  not consistently populated, e.g. ``shard`` is not populated on
  all nodes. This will provide visibility into the mixed and possibly
  misconfigured operational state for future upgraders.
* Update OpenStack SDK and python-ironicclient
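
The pre-upgrade status check in the list above could apply logic along these
lines. This is only a sketch of the decision rule described in this document
(warn when some nodes have a shard and others do not); the function name and
message wording are hypothetical and do not reflect the ``ironic-status``
implementation.

```python
def shard_population_warning(shard_values):
    """Return a warning string for a mixed shard state, else None.

    ``shard_values`` holds one entry per node: the shard key, or None
    when the node has no shard set.
    """
    total = len(shard_values)
    unset = sum(1 for value in shard_values if value is None)
    # Only a mix of sharded and unsharded nodes is worth flagging.
    if 0 < unset < total:
        return f"{unset} of {total} nodes have no shard set while others do"
    return None


print(shard_population_warning(["Shard-10", None, "Shard-11"]))
```

A deployment where no node has a shard, or where every node has one, passes
silently in this sketch, since neither state is mixed.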

Dependencies
============

This specification is loosely dependent upon Nova accepting
a plan for use of the sharding model of data. At present, it is the
Ironic team's understanding that it is acceptable to Nova, and Ironic
needs to merge this spec and related code to support this feature before
Nova will permit the Nova spec to be merged.

Testing
=======

Unit testing is expected for all the basic components and operations
added to Ironic to support this functionality.

We may be able to add some tempest testing for the API field and access
interactions.

Upgrades and Backwards Compatibility
====================================

To be determined. We anticipate that the standard upgrade process would apply
and that there would not realistically be an explicit downgrade compatibility
process, but this capability and functionality is largely for external
consumption, and details there are yet to be determined.

Documentation Impact
====================

Admin documentation would need to include a document covering sharding,
internal mechanics, and usage.

References
==========

* PTG Notes: https://etherpad.opendev.org/p/nova-antelope-ptg
* Bug: https://launchpad.net/bugs/1730834
* Bug: https://launchpad.net/bugs/1825876
* Related Bug: https://launchpad.net/bugs/1853009