Commit Graph

137 Commits

Tim Burke 7f5576d3e6 ring-builder: require part_power>=0, give better error messages
Closes-Bug: #1697860
Change-Id: I500a86de390b24b9d08a478d695a7d62c447e779
2023-02-17 13:47:53 -08:00
Zuul 6d600bef0c Merge "ring: Have RingBuilder.debug return to old state" 2022-12-01 19:28:08 +00:00
Tim Burke 52a4fe37aa Various doc formatting cleanups
* Get rid of a bunch of accidental blockquote formatting
* Always declare a lexer to use for ``.. code::`` blocks

Change-Id: I8940e75b094843e542e815dde6b6be4740751813
2022-08-02 14:28:36 -07:00
Tim Burke 85d063b20a ring: Have RingBuilder.debug return to old state
Previously, if the logger was enabled before entering the context
manager, it would be disabled upon exiting.
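
A minimal sketch of the fixed behaviour (illustrative, not the actual
Swift code; the logger name is an assumption): remember the logger's
state on entry and restore it on exit instead of unconditionally
disabling it.

    import contextlib
    import logging

    @contextlib.contextmanager
    def debug(logger=logging.getLogger('swift.ring.builder')):
        # Save the previous state rather than assuming the logger
        # started out disabled.
        old_disabled = logger.disabled
        logger.disabled = False
        try:
            yield
        finally:
            logger.disabled = old_disabled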

Change-Id: I19e03309c2c1e3ddafb1eee6f3a4ead0267c26ca
2022-07-28 10:26:13 -07:00
Tim Burke 314347a3cb Update SAIO & docker image to use 62xx ports
Note that existing SAIOs with 60xx ports should still work fine.

Change-Id: If5dd79f926fa51a58b3a732b212b484a7e9f00db
Related-Change: Ie1c778b159792c8e259e2a54cb86051686ac9d18
2020-07-20 15:17:12 -07:00
Tim Burke 821b964166 ring: Flag region, zone, and device as required in add_dev
They effectively already *were*, but if you used the RingBuilder API
directly (rather than the CLI) you could previously write down builders
that would hit KeyErrors on load.
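
A sketch of the kind of check involved (check_dev and the message are
hypothetical, not the actual Swift code):

    REQUIRED_KEYS = ('region', 'zone', 'device')

    def check_dev(dev):
        # Fail fast at add_dev() time rather than writing down a
        # builder that raises KeyError when it is next loaded.
        missing = [key for key in REQUIRED_KEYS if key not in dev]
        if missing:
            raise ValueError('device %r is missing required key(s): %s'
                             % (dev, ', '.join(missing)))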

Change-Id: I1de895d4571f7464be920345881789d47659729f
2020-03-06 22:25:21 -06:00
Pete Zaitcev 0f13db30b6 Import directly from the inside of the ring module
An odd thing happened: when my code did only
"from swift.common.ring import Ring", and nothing else,
the interpreter looped on the CPU.

The import from the top of the ring module is our standard
protocol for Ring. It causes no problem in places like the
container updater or list_endpoints.py. It is a big
mystery why it causes Python 3.7.3 to loop, and only
in specific circumstances.

But we noticed that the recursive import is unnecessary
in this case, so an obviously harmless fix exists.
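
The fix amounts to bypassing the package-level re-export and importing
from the module that defines Ring:

    # before: goes through the ring package's __init__
    # from swift.common.ring import Ring

    # after: import directly from the inside of the ring module
    from swift.common.ring.ring import Ring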

Change-Id: I7373bbb0a50d090d6becf141e5832f8ae98381a4
2020-01-03 18:34:11 -06:00
Christian Schwede 98f9781096 Add commands to ring-builder to change region / zone
Currently one has to remove and re-add devices when the region or zone needs to
be changed. Adding the subcommands set_region and set_zone simplifies this; now
it is possible to change the region and/or zone easily. Note that there is no
change to the required rebalancing; it is likely that data still needs to be
moved within the cluster.
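
Usage mirrors the existing set_weight subcommand, along the lines of:

    swift-ring-builder <builder_file> set_region <search-value> <region>
    swift-ring-builder <builder_file> set_zone <search-value> <zone>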

This is mostly a copy-and-paste of the existing set_weight subcommand,
adapting tests accordingly. Some duplicated code in the tests has been
aggregated as well.

Change-Id: I37dd8e8ac24e2b0bb196758449a1d91a769b2e96
2019-10-04 21:56:12 -07:00
Tim Burke 6853616aea ring: Track more properties of the ring
Plumb the version from the ringbuilder through to the metadata at the
start of the ring. Recover this (if available) when running

    swift-ring-builder <ring> write_builder

When we load the ring, track the count and MD5 of the bytes off disk, as
well as the number of uncompressed bytes.

Expose all this new information as properties on the Ring, along with

  - device_count (number of non-None entries in self._devs),
  - weighted_device_count (number of devices that have weight), and
  - assigned_device_count (number of devices that actually have
    partition assignments).
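
A usage sketch for the three counts listed above (assuming an object
ring under /etc/swift):

    from swift.common.ring import Ring

    ring = Ring('/etc/swift', ring_name='object')
    print('devices: %d, weighted: %d, assigned: %d' % (
        ring.device_count,
        ring.weighted_device_count,
        ring.assigned_device_count))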

Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Change-Id: I73deaf6f1d9c1d37630c37c02c597b8812592351
2019-07-12 17:26:28 -07:00
Samuel Merritt a6f7600545 Remove our reimplemented logging.NullHandler
Python 2.6 didn't have one, so we'd try to find logging.NullHandler
but fall back to our own. Since 2.7+ has logging.NullHandler, we can
just use it.

Change-Id: Ie2c27407efc2882e698abe6e4379a00a1d3f4301
2018-05-24 11:06:19 -07:00
Tim Burke 642f79965a py3: port common/ring/ and common/utils.py
I can't imagine us *not* having a py3 proxy server at some point, and
that proxy server is going to need a ring.

While we're at it (and since they were so close anyway), port

* cli/ringbuilder.py,
* common/linkat.py, and
* common/daemon.py

Change-Id: Iec8d97e0ce925614a86b516c4c6ed82809d0ba9b
2018-02-12 06:42:24 +00:00
malei f0f3de3462 fix typos in swift
Change-Id: I14dd433ed7c7fe789d5d04defbf9eec5bffc301a
2018-02-09 22:47:02 +08:00
Zuul 2f64e31b99 Merge "Further extract a builder method to a util function" 2018-01-18 23:13:53 +00:00
Zuul 909e371515 Merge "Merge repeat code for rebalance" 2018-01-15 15:05:01 +00:00
Clay Gerrard 68906dac43 Further extract a builder method to a util function
... to make testing more targeted and obvious

Related-Change-Id: I89439286b211f2c5ef19deffa77c202f48f07cf8
Change-Id: I93b99128a4fb35395e8e9caf11649e216f824fdf
2018-01-15 11:45:30 +00:00
vxlinux a1ae142d5b Merge repeat code for rebalance
There are three similar code segments in the rebalance process as follows:

    tiers = ['cluster', 'regions', 'zones', 'servers', 'devices']
    for i, tier_name in enumerate(tiers):
        replicas_at_tier = sum(weighted_replicas_by_tier[t] for t in
                               weighted_replicas_by_tier if len(t) == i)
        if abs(self.replicas - replicas_at_tier) > 1e-10:
            raise exceptions.RingValidationError(
                '%s != %s at tier %s' % (
                    replicas_at_tier, self.replicas, tier_name))

I think we can encapsulate this code segment into a private function and
replace those code segments with a function call.
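
A sketch of such a helper (the name is hypothetical; the body is the
shared loop quoted above):

    from swift.common import exceptions

    def _validate_replica_counts_at_tiers(self, weighted_replicas_by_tier):
        # One shared loop replacing the three near-identical copies.
        tiers = ['cluster', 'regions', 'zones', 'servers', 'devices']
        for i, tier_name in enumerate(tiers):
            replicas_at_tier = sum(
                weighted_replicas_by_tier[t]
                for t in weighted_replicas_by_tier if len(t) == i)
            if abs(self.replicas - replicas_at_tier) > 1e-10:
                raise exceptions.RingValidationError(
                    '%s != %s at tier %s' % (
                        replicas_at_tier, self.replicas, tier_name))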

Change-Id: I89439286b211f2c5ef19deffa77c202f48f07cf8
2018-01-15 11:45:00 +00:00
Tim Burke e343452394 Support existing builders with None _last_part_moves
These were likely written before the first related change, or created
from an existing ring file.

Also, tolerate missing dispersion when rebalancing -- that may not exist
in the builder file.

Change-Id: I26e3b4429c747c23206e4671f7c86543bb182a15
Related-Change: Ib165cf974c865d47c2d9e8f7b3641971d2e9f404
Related-Change: Ie239b958fc7e0547ffda2bebf61546bd4ef3d829
Related-Change: I551fcaf274876861feb12848749590f220842d68
2018-01-08 13:19:13 -08:00
Clay Gerrard 49de7db532 add swift-ring-builder option to recalculate dispersion
Since dispersion info is cached, it can easily go stale if we make
changes to how dispersion info is calculated or stored (e.g. the
related change extends the dispersion calculation to consider
dispersion of all part-replicas).

Related-Change-Id: Ifefff0260deac0c3e8b369a1e158686c89936686

Change-Id: I714deb9e349cd114a21ec591216a9496aaf9e0d1
2018-01-02 16:20:28 +00:00
Clay Gerrard 7013e70ca6 Represent dispersion worse than one replicanth
With a sufficiently undispersed ring it's possible to move an entire
replica's worth of parts and yet the value of dispersion may not get any
better (even though in reality dispersion has dramatically improved).
The problem is dispersion will currently only represent up to one whole
replica's worth of parts being undispersed.

However, with EC rings it's possible for more than one whole replica's
worth of partitions to be undispersed; in these cases the builder will
require multiple rebalance operations to fully disperse replicas - but
the dispersion value should improve with every rebalance.

N.B. with this change it's possible for rings with a bad dispersion
value to measure as having a significantly smaller dispersion value
after a rebalance (even though they may not have had their dispersion
change) because the total amount of bad dispersion we can measure has
been increased but we're normalizing within a similar range.

Closes-Bug: #1697543

Change-Id: Ifefff0260deac0c3e8b369a1e158686c89936686
2017-12-28 11:16:17 -08:00
Alistair Coles 348bd83b7e Respect co-builder partition moves when min_part_hours is zero
Repeated calls to each co-builder's _update_last_part_moves() are
unnecessary and have the unfortunate side effect of resetting the
_last_part_moved bitmap. When a component builder has zero
min_part_hours this results in it not preventing its co-builders from
moving parts that it has already moved.

This patch changes the CompositeRingBuilder to call each component
builder's _update_last_part_moves() *once* before rebalancing any
component builder. CooperativeRingBuilders no longer forward calls to
their _update_last_part_moves() method. Each component's
_last_part_moved bitmap is therefore preserved for the duration of the
composite rebalance.
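
A sketch of the changed control flow (class shape and attribute names
are illustrative, not the actual Swift code):

    class CompositeRingBuilder(object):
        def __init__(self, builders):
            self._builders = builders

        def rebalance(self):
            # Advance every component's part-move clock exactly once,
            # up front, so no later call resets a sibling's
            # _last_part_moved bitmap mid-rebalance.
            for builder in self._builders:
                builder._update_last_part_moves()
            for builder in self._builders:
                builder.rebalance()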

The initialisation of the RingBuilder _last_part_moves array is moved
to the RingBuilder __init__ method, so that calls to
_update_last_part_moves() are effective even when rebalance() has
never been called on that builder. Otherwise, during a composite
rebalance, a component that has not previously been rebalanced will
not have its _last_part_moves_epoch updated during rebalance and as a
result may report erroneous min_part_seconds_left after its first
rebalance.

Related-Change: I1b30cb3d776be441346a4131007d2487a5440a81
Closes-Bug: #1714274
Change-Id: Ib165cf974c865d47c2d9e8f7b3641971d2e9f404
2017-09-27 15:05:13 +01:00
Kota Tsuyuzaki 2321966456 Accept a trade off of dispersion for balance
... but only if we *have* to!

During the initial gather for balance we prefer to avoid replicas on
over-weight devices that are already under-represented in any of their
tiers (i.e. if a zone has to have at least one, but may have as many as
two, don't take the only replica).  Instead we hope that by going for
replicas on over-weight devices that are at the limits of their
dispersion we have a better-than-even chance of finding a better place
for them during placement!

This normally works out - and especially so for rings which can
disperse and balance.  But for existing rings where we'd have to
sacrifice dispersion to improve balance, the existing optimistic gather
will end up refusing to trade dispersion for balance - and instead get
stuck without solving either!

You should always be able to solve for *either* dispersion or balance.
But if you can't solve *both*, we bail out on our optimistic gather
much more quickly and instead just focus on improving balance.  With
this change, the ring can get into balanced (and un-dispersed) states
much more quickly!

Change-Id: I17ac627f94f64211afaccad15596a9fcab2fada2
Related-Change-Id: Ie6e2d116b65938edac29efa6171e2470bb3e8e12
Closes-Bug: 1699636
Closes-Bug: 1701472
2017-09-25 12:37:49 -07:00
Alistair Coles 7fa9d6e5b4 Unset random seed after rebalancing ring
Unit tests use the random module in places to randomise
test inputs, but once the tests in test_builder.py or
test_ring_builder_analyzer.py have been run the random
seed is left in a repeatable state because calls are made
to RingBuilder.rebalance with a seed value. Consequently,
subsequent calls to random in other tests get repeatable
results from one test run to another.

This patch resets the state of the random module before
returning from RingBuilder.rebalance.
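
In miniature (a sketch, not the actual Swift code):

    import random

    def rebalance(builder, seed=None):
        if seed is not None:
            random.seed(seed)
        try:
            pass  # ... gather and place part-replicas ...
        finally:
            # Re-seed from system entropy so subsequent callers don't
            # see a repeatable sequence.
            random.seed()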

Closes-Bug: #1639755
Change-Id: I4b74030afc654e60452e65b3e0f1b45a189c16e3
2017-08-08 22:34:04 +00:00
Tim Burke c211141c27 Add ever_rebalanced property to RingBuilder
...to formalize an otherwise-unwritten contract and get
CooperativeRingBuilder using more "public" interfaces.
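
Presumably along these lines (the underlying attribute check is an
assumption about the internal contract):

    @property
    def ever_rebalanced(self):
        # A builder only has a replica-to-part-to-device table once
        # rebalance() has run at least once.
        return self._replica2part2dev is not None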

Change-Id: Ib6666728eabeff948bb53dff054a69bada47556e
2017-07-21 17:21:49 -07:00
Alistair Coles 71ec83f414 Ring rebalance respects co-builders' last_part_moves
- Add a CooperativeRingBuilder subclass of RingBuilder. The subclass takes
  a reference to a parent CompositeRingBuilder which is consulted about
  whether a part can be moved during rebalance. The parent builder in turn
  consults all component CooperativeRingBuilders to decide if a part can
  be moved.

- Make CompositeRingBuilder load CooperativeRingBuilder instances.

- Add rebalance() method to CompositeRingBuilder class.

- Add a load_components() method to CompositeRingBuilder class.

- Change the CompositeRingBuilder compose() method to NOT by default
  raise a ValueError if component builders have not been modified since
  last loaded. With the load_components method being added, it makes
  less sense to insist by default on loaded components being modified, and
  it is desirable to have the same semantic for all methods that load
  components. Previously it has been necessary to use the 'force' flag
  with compose() to prevent these errors being raised, which has the
  unfortunate side effect of also disabling all other checks on
  component builders. A new 'require_modified' parameter is added to
  compose() which defaults to False but can be set to True if the
  previous default behaviour is required.
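
A sketch of the cooperative hook (simplified; the method names on the
parent builder are assumptions):

    from swift.common.ring.builder import RingBuilder

    class CooperativeRingBuilder(RingBuilder):
        def __init__(self, part_power, replicas, min_part_hours,
                     parent_builder):
            super(CooperativeRingBuilder, self).__init__(
                part_power, replicas, min_part_hours)
            self._parent_builder = parent_builder

        def _can_part_move(self, part):
            # Defer to the composite, which asks every component
            # whether any of them has already moved this part.
            return self._parent_builder.can_part_move(part)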

Change-Id: I1b30cb3d776be441346a4131007d2487a5440a81
2017-07-21 22:17:48 +01:00
Jenkins e94b383655 Merge "Add support to increase object ring partition power" 2017-07-05 14:40:42 +00:00
Clay Gerrard 806b18c6f5 Extend device tier pretty print normalization
* use pretty device tier normalization in additional debug message
* refactor pretty device tier normalization for consistency
* unittest asserting consistency

Change-Id: Ide32e35ff9387f1cc1e4997eb133314d06b794f3
2017-06-28 15:40:34 -07:00
Samuel Merritt 62509cc8f4 Improve debug logging from ring builder.
The gather/place debug logs used to just contain device IDs; now they
include region, zone, and IP. This makes it easier to see what's going
on when debugging rebalance operations.

Change-Id: I6314e327973c57a34b88ebbb4d3b1594dbacd357
2017-06-28 15:09:07 -07:00
Christian Schwede e1140666d6 Add support to increase object ring partition power
This patch adds methods to increase the partition power of an existing
object ring, without downtime for users, using a 3-step process. Data
won't be moved to other nodes; objects using the new increased partition
power will be located on the same device and are hardlinked to avoid
data movement.

1. A new setting "next_part_power" will be added to the rings, and once
the proxy server has reloaded the rings it will send this value to the
object servers on any write operation. Object servers will now create a
hard-link in the new location to the original DiskFile object. Already
existing data will be relinked into the new locations by a new tool,
using hardlinks.

2. The actual partition power itself will be increased. Servers will now
use the new partition power to read from and write to. Hard links in the
old object locations that are no longer required now have to be removed
by the relinker tool; it reads the next_part_power setting to find
object locations that need to be cleaned up.

3. The "next_part_power" flag will be removed.

This mostly implements the spec in [1]; however it's not using an
"epoch" as described there. The idea of the epoch was to store data
using different partition powers in their own namespace to avoid
conflicts with auditors and replicators as well as being able to abort
such an operation and just remove the new tree.  This would require
heavy changes to the on-disk data layout, and other object-server
implementations would be required to adopt this scheme too.

Instead the object-replicator is now aware that there is a partition
power increase in progress and will skip replication of data in that
storage policy; the relinker tool should simply be run, and afterwards
the partition power will be increased. This shouldn't take much
time (it's only walking the filesystem and hardlinking), so the impact
should be low. The relinker should be run on all storage nodes at the
same time in parallel to decrease the required time (though this is not
mandatory). Failures during relinking should not affect cluster
operations - relinking can even be aborted manually and restarted later.

Auditors do not quarantine objects written to a path with a different
partition power and therefore work as before (though in the worst case
they read each object twice before the no-longer-needed hard links are
removed).

Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>

[1] https://specs.openstack.org/openstack/swift-specs/specs/in_progress/
increasing_partition_power.html

Change-Id: I7d6371a04f5c1c4adbb8733a71f3c177ee5448bb
2017-06-15 15:08:48 -07:00
Jenkins 9089e44c0b Merge "Add Composite Ring Functionality" 2017-05-18 10:18:31 +00:00
Kota Tsuyuzaki d40031b46f Add Composite Ring Functionality
* Adds a composite_builder module which provides the functionality to
  build a composite ring from a number of component ring builders.

* Add an id to RingBuilder to differentiate rings in a composite.
  A RingBuilder now gets a UUID when it is saved to file if
  it does not already have one. A RingBuilder loaded from
  file does NOT get a UUID assigned unless it was previously persisted in
  the file. This forces users to explicitly assign an id to
  existing ring builders by saving the state back to file.

  The UUID is included in the first line of the output from:

    swift-ring-builder <builder-file>

Background:

This is another implementation of Composite Ring [1]
to enable better dispersion for globally erasure-coded clusters.

The most significant difference from the related-change [1] is that this
solution attempts to solve the problem as an offline tool rather than
dynamic compositing on the running servers. With this change, we gain
advantages such as:

- Less code and a simpler design
- No complex state validation on the running server
- Easy deployments with an offline tool

This patch does not provide a command line utility for managing
composite rings. The interface for such a tool is still under
discussion; this patch provides the enabling functionality first.

Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>

[1] Related-Change: I80ef36d3ac4d4b7c97a1d034b7fc8e0dc2214d16
Change-Id: I0d8928b55020592f8e75321d1f7678688301d797
2017-05-15 16:42:00 -07:00
Clay Gerrard 6be5196fbe Make add_dev complain louder about missing keys
... and remove some cruft that couldn't possibly work

Change-Id: I560f0a29f0a881c63ec3cb910dbf5476fe2a915a
Related-Change-Id: I0d8928b55020592f8e75321d1f7678688301d797
2017-04-25 19:29:57 -07:00
Gábor Antal 5817f804fd Handle log message interpolation by the logger
According to the OpenStack guidelines [1], logged string messages
should be interpolated by the logger.

[1]: http://docs.openstack.org/developer/oslo.i18n/guidelines.html#adding-variables-to-log-messages

Change-Id: I49952348c06e340c6cfc708028f2b06e02a16dfc
Closes-Bug: #1661262
2017-02-02 15:27:47 +01:00
Jenkins a97ed0b96a Merge "Fix unnecessary for-loop and mis docs" 2017-01-16 16:36:00 +00:00
Kota Tsuyuzaki bbefaca2ef Fix unnecessary for-loop and misleading docs
This is a follow-up for https://review.openstack.org/#/c/419107
to address:

- Remove an unnecessary for-loop block that always breaks at the end of
  the loop
- Correct a docstring that says "two more times" when the code actually
  does only "one more time"

Change-Id: I3d76275afc6448709a4b3588259e085bce7fa21d
2017-01-16 06:12:23 -08:00
Jenkins 707e2558b9 Merge "Tighten the move-one-replica test" 2017-01-16 10:40:04 +00:00
Tim Burke 244e640005 Tighten the move-one-replica test
We only need one additional rebalance to get something well-balanced
and well-dispersed.

Drive-by to break a little earlier in the builder after we move a part.

Change-Id: I78b86292c98be5e247a694b9db71d9267e1a0a22
2017-01-11 21:04:25 +00:00
Jenkins 7f6d316dd6 Merge "For any part, only one replica can move in a rebalance" 2017-01-11 20:34:58 +00:00
kellerbr aa17ae1b04 Allow Hacking H401, H403 check
Currently swift ignores a lot of the Hacking style guide. This patch
enables the H401 and H403 checks and fixes any violations. With this
we can get a little closer to following the OpenStack style guidelines.

Change-Id: I5109a052f2ceb2e6a9a174cded62f4231156d39b
2017-01-04 17:23:46 +00:00
Tim Burke 2e7a7347fc Avoid infinite loop while placing parts
Previously, we could over-assign how many parts should be in a tier.
This would cause the local `parts` variable to go negative, which meant
that our `while parts` loop would never terminate.
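
The failure mode in miniature:

    parts = -1  # an earlier over-assignment drove the count negative

    # Buggy form: `while parts:` treats any non-zero value as true, so
    # a negative count loops forever, decrementing without bound:
    #
    #     while parts:
    #         parts -= 1
    #
    # A robust condition terminates immediately for zero or negative:
    while parts > 0:
        parts -= 1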

Change-Id: Id7e7889742ca37cf1a9c0d55fba78d967e90e8d0
Closes-Bug: 1642538
2016-11-17 15:32:47 -08:00
cheng ce26e78902 For any part, only one replica can move in a rebalance
With a min_part_hours of zero, it's possible to move more than one
replica of the same part in a single rebalance.

This change in behavior only affects min_part_hours-zero rings, which
are understood to be uncommon in production, mostly because of this
very specific and strange behavior of such rings.

With this change, no matter how small your min_part_hours, it will
always require at least N rebalances to move N part-replicas of the
same part.

To supplement the existing persisted _last_part_moves structure that
enforces min_part_hours, this change adds a _part_moved_bitmap that
exists only for the life of the rebalance, to track when rebalance
moves a part in order to prevent other replicas of the same part
from being moved in the same rebalance.
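
A sketch of such a bitmap (sizes assume part_power >= 3):

    PART_POWER = 10

    # One bit per partition; exists only for the life of one rebalance.
    part_moved_bitmap = bytearray(2 ** (PART_POWER - 3))

    def set_part_moved(part):
        part_moved_bitmap[part // 8] |= 128 >> (part % 8)

    def has_part_moved(part):
        return bool(part_moved_bitmap[part // 8] & (128 >> (part % 8)))

    set_part_moved(5)
    assert has_part_moved(5) and not has_part_moved(6)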

Add authors: Clay Gerrard, clay.gerrard@gmail.com
             Christian Schwede, cschwede@redhat.com

Closes-bug: #1586167

Change-Id: Ia1629abd5ce6e1b3acc2e94f818ed8223eed993a
2016-09-27 15:01:41 +00:00
Mohit Motiani 89388bf232 Fix typos and grammar in builder.py
Change-Id: Ib87f4df8f741809840e92db9bacf2af847a5f77f
Closes-Bug: #1600403
2016-08-22 12:46:23 -07:00
cheng c1c18da82c check _last_part_moves when pretend_min_part_hours_passed
pretend_min_part_hours_passed does things like this:

    self._last_part_moves[part] = 0xff

which will throw an exception if self._last_part_moves is None.

This patch checks self._last_part_moves to prevent that exception.
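
A sketch of the guarded method:

    def pretend_min_part_hours_passed(self):
        if self._last_part_moves is None:
            # This builder has never tracked part moves; nothing to age.
            return
        for part in range(self.parts):
            self._last_part_moves[part] = 0xff
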
Closes-bug: #1578835

Change-Id: Ic83c7a338b45bfcf61f5ab6100e6db335c3fa81a
2016-07-26 00:58:06 +00:00
Clay Gerrard c70abba529 Adjust replica count before pulling parts from failed devices
When your device count falls below your replica count you can either add
devices or reduce the replica count.

Trying to reduce your replica count fails about half the time because
removing parts from failed devices temporarily invalidates your
_replica2part2dev table with NONE_DEV, which can result in an IndexError
in _adjust_replica2part2dev_size.

If you adjust the replica count first you won't have to worry about
tracking unassigned parts from failed devices.

Closes-Bug: #1558751

Change-Id: I99dc776fd260a2ba68ca77d7b5ed5120d10b06de
2016-03-17 19:27:24 +00:00
Christian Schwede 3ff94cb785 Add internal method to increase ring partition power
This method increases the partition power of an existing ring by one. It does
not move any data, nor does it expose a CLI command yet; it is only intended to
be used by a future version to do the actual ring modification itself.

An existing object that is currently located on partition X will be placed
either on partition 2*X or 2*X+1 after the partition power has been increased.
The reason for this is the Ring.get_part() method, which does a bitwise shift
to the right.

To avoid actual data movement to different disks or even nodes, the allocation
of partitions to nodes needs to be changed. The allocation is pairwise due to
the above-mentioned new partition scheme. Therefore devices are allocated like
this, with the partition being the index and the value being the device id:

OLD: 0,    3,    7,    5,    2,    1,    ...
NEW: 0, 0, 3, 3, 7, 7, 5, 5, 2, 2, 1, 1, ...
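
The pairwise doubling is easy to express as a standalone sketch of the
table transformation:

    from array import array
    from itertools import chain

    old_row = array('H', [0, 3, 7, 5, 2, 1])
    # Partition X's device now also serves partitions 2*X and 2*X+1.
    new_row = array('H', chain.from_iterable(zip(old_row, old_row)))
    assert list(new_row) == [0, 0, 3, 3, 7, 7, 5, 5, 2, 2, 1, 1]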

If an operator stops the cluster, increases the partition power, and renames &
hardlinks the existing data, it is possible to do a power shift without
actually moving data. Please see the partition power spec for further details
on this.

Change-Id: I063fd8077497ee8c14d9065f07b4ec0fb5cbe180
Partially-Implements: spec increasing_partition_power
2016-03-04 08:01:47 +00:00
Clay Gerrard b19dc1ddec Always fix devices with multiple part-replica assignments
I've found that given a sufficiently bad replica2part2dev table we can
accidentally not entirely fix placement when more than two replicas of a
part are assigned to duplicate devices.

It shows up most on > 3 replica rings when you have two *different*
devices both holding two replicas.  But you can see it on a three
replica ring when all three replicas are assigned to the same device.

Change-Id: Ieb213c1a259815a2ed657291242919cda568c7b5
2016-02-01 16:19:08 -08:00
Jenkins 6bb3325f31 Merge "Look for device holes that can be reused when adding new device." 2016-01-21 09:10:28 +00:00
Ben Martin 1f3304c515 Print min_part_hours lockout time remaining
swift-ring-builder currently only displays min_part_hours and
not the amount of time remaining before a rebalance can occur.
This information is readily available and is now displayed as a
quality-of-life improvement.

Additionally, this fixes a bug where the time since the last rebalance
was always updated when rebalance was called, regardless of whether
any partitions were reassigned. That could leave partitions unable to
be reassigned, as they never age according to the time since the last
rebalance.

Change-Id: Ie0e2b5e25140cbac7465f31a26a4998beb3892e9
Closes-Bug: #1526017
2016-01-11 10:58:38 -06:00
Paul Dardeau fb6751d8ba Look for device holes that can be reused when adding new device.
Change-Id: I1980ebdd9dc89848173d8ca2fe2afb74029dcfa2
Closes-Bug: 1532276
2016-01-09 00:13:06 +00:00
Clay Gerrard 7035639dfd Put part-replicas where they go
It's harder than it sounds.  There were really three challenges.

Challenge #1 Initial Assignment
===============================

Before starting to assign parts on this new shiny ring you've
constructed, maybe we'll pause for a moment up front and consider the
lay of the land.  This process is called the replica_plan.

The replica_plan approach separates part assignment failures into
two modes:

 1) we considered the cluster topology and its weights and came up with
    the wrong plan

 2) we failed to execute on the plan

I failed at both parts plenty of times before I got it this close.  I'm
sure a counter example still exists, but when we find it the new helper
methods will let us reason about where things went wrong.

Challenge #2 Fixing Placement
=============================

With a sound plan in hand, it's much easier to fail to execute on it the
less material you have to execute with - so we gather up as many parts
as we can - as long as we think we can find them a better home.

Picking the right parts for gather is a black art - when you notice a
balance is slow it's because it's spending so much time iterating over
replica2part2dev trying to decide just the right parts to gather.

The replica plan can help at least in the gross dispersion collection to
gather up the worst offenders first before considering balance.  I think
trying to avoid picking up parts that are stuck to the tier before
falling into a forced grab on anything over parts_wanted helps with
stability generally - but depending on where the parts_wanted are in
relation to the full devices it's pretty easy to pick up something that'll
end up really close to where it started.

I tried to break the gather methods into smaller pieces so it looked
like I knew what I was doing.

Going with a MAXIMUM gather iteration instead of balance (which doesn't
reflect the replica_plan) doesn't seem to be costing me anything - most
of the time the exit condition is either solved or all the parts are
overly aggressively locked up on min_part_hours.  So far, it mostly
seems that if the thing is going to balance this round it'll get it in
the first couple of shakes.

Challenge #3 Crazy replica2part2dev tables
==========================================

I think there's lots of ways "scars" can build up in a ring which can
result in very particular replica2part2dev tables that are physically
difficult to dig out of.  It's repairing these scars that will take
multiple rebalances to resolve.

... but at this point ...

... lacking a counter example ...

I've been able to close up all the edge cases I was able to find.  It
may not be quick, but progress will be made.

Basically my strategy just required a better understanding of how
previous algorithms were able to *mostly* keep things moving by brute
forcing the whole mess with a bunch of randomness.  Then when we detect
our "elegant" careful part selection isn't making progress - we can fall
back to the same old tricks.

Validation
==========

We validate against duplicate part replica assignment after rebalance
and raise an ERROR if we detect more than one replica of a part assigned
to the same device.

In order to meet that requirement we have to have as many devices as
replicas, so attempting to rebalance with too few devices w/o changing
your replica_count is also an ERROR not a warning.
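
The duplicate check amounts to something like this sketch (the names
are illustrative):

    from collections import Counter

    def duplicate_assignments(replica2part2dev, parts):
        # Yield (part, dev_id) wherever more than one replica of `part`
        # landed on the same device.
        for part in range(parts):
            devs = [row[part] for row in replica2part2dev]
            for dev_id, count in Counter(devs).items():
                if count > 1:
                    yield part, dev_id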

Random Thoughts
===============

As usual with rings, the test diff can be hard to reason about -
hopefully I've added enough comments to assure future me that these
assertions make sense.

Despite being a large rewrite of a lot of important code, the existing
code is known to have failed us.  This change fixes a critical bug that's
trivial to reproduce in a critical component of the system.

There's probably a bunch of error messages and exit status stuff that's
not as helpful as it could be considering the new behaviors.

Change-Id: I1bbe7be38806fc1c8b9181a722933c18a6c76e05
Closes-Bug: #1452431
2015-12-07 16:06:42 -08:00
Kota Tsuyuzaki c81f202f71 Add missing docs for ring.builder.rebalance
commit 71993d84e8 added
a new 'remove_dev' column to the
swift.common.ring.builder.rebalance return value.

This patch adds the docs for that and cleans up the variable
name a bit to make it easier to read.

Change-Id: Idfd46e47b9f6894cbafc8b7701a4c7414212f79f
2015-11-05 17:38:08 -08:00