* Get rid of a bunch of accidental blockquote formatting
* Always declare a lexer to use for ``.. code::`` blocks
Change-Id: I8940e75b094843e542e815dde6b6be4740751813
Previously, if the logger was enabled before entering the context
manager, it would be disabled upon exiting.
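The fix can be sketched as a state-restoring context manager (a minimal illustration, not the actual Swift utility; names are made up):

```python
import contextlib
import logging

# Remember the logger's prior state so exiting the context restores it,
# instead of unconditionally re-enabling (the old bug).
@contextlib.contextmanager
def quiet_logger(logger):
    was_disabled = logger.disabled
    logger.disabled = True
    try:
        yield
    finally:
        logger.disabled = was_disabled

log = logging.getLogger('demo')
log.disabled = True            # disabled *before* entering the context
with quiet_logger(log):
    pass
assert log.disabled            # still disabled afterwards, not reset
```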
Change-Id: I19e03309c2c1e3ddafb1eee6f3a4ead0267c26ca
Note that existing SAIOs with 60xx ports should still work fine.
Change-Id: If5dd79f926fa51a58b3a732b212b484a7e9f00db
Related-Change: Ie1c778b159792c8e259e2a54cb86051686ac9d18
They effectively already *were*, but if you used the RingBuilder API
directly (rather than the CLI) you could previously write down builders
that would hit KeyErrors on load.
Change-Id: I1de895d4571f7464be920345881789d47659729f
An odd thing happened: when my code did only
"from swift.common.ring import Ring", and nothing else,
the interpreter looped on the CPU.
The import from the top of the ring module is our standard
protocol for Ring. It causes no problem in places like
container updater or list_endpoints.py. It is a big
mystery why it causes Python 3.7.3 to loop, and only
in specific circumstances.
But we noticed that the recursive import is unnecessary
in this case, so an obviously harmless fix exists.
Change-Id: I7373bbb0a50d090d6becf141e5832f8ae98381a4
Currently one has to remove and re-add devices when the region or zone needs to
be changed. Adding the subcommands set_region and set_zone simplifies this:
now it is possible to change the region and/or zone easily. Note that there is no
change to the required rebalancing; it is likely that data still needs to be
moved within the cluster.
This is mostly a copy-and-paste of the existing set_weight subcommand, with
tests adapted accordingly. Some duplicated code in the tests has been aggregated as
well.
Change-Id: I37dd8e8ac24e2b0bb196758449a1d91a769b2e96
Plumb the version from the ringbuilder through to the metadata at the
start of the ring. Recover this (if available) when running
swift-ring-builder <ring> write_builder
When we load the ring, track the count and MD5 of the bytes off disk, as
well as the number of uncompressed bytes.
Expose all this new information as properties on the Ring, along with
- device_count (number of non-None entries in self._devs),
- weighted_device_count (number of devices that have weight), and
- assigned_device_count (number of devices that actually have
partition assignments).
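The three counts might be derived roughly like this (a standalone sketch with made-up device data; the real properties live on swift.common.ring.Ring):

```python
# Hypothetical device list: removed devices leave None holes, and a
# device may exist with zero weight or with no partition assignments.
devs = [
    {'id': 0, 'weight': 100.0},
    {'id': 1, 'weight': 0.0},
    None,                        # removed device
    {'id': 3, 'weight': 50.0},
]
# parts assigned per device id, e.g. from the replica2part2dev table
assignments = {0: 512, 3: 512}

device_count = sum(1 for d in devs if d is not None)
weighted_device_count = sum(
    1 for d in devs if d is not None and d['weight'] > 0)
assigned_device_count = sum(
    1 for d in devs if d is not None and assignments.get(d['id'], 0) > 0)

assert (device_count, weighted_device_count, assigned_device_count) == (3, 2, 2)
```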
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Change-Id: I73deaf6f1d9c1d37630c37c02c597b8812592351
Python 2.6 didn't have one, so we'd try to find logging.NullHandler
but fall back to our own. Since 2.7+ has logging.NullHandler, we can
just use it.
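The change boils down to dropping a py26-era fallback (sketch of the two patterns, not the exact Swift code):

```python
import logging

# Old pattern, needed when Python 2.6 lacked logging.NullHandler:
try:
    NullHandler = logging.NullHandler
except AttributeError:           # only reachable on py26
    class NullHandler(logging.Handler):
        def emit(self, record):
            pass

# New pattern on Python 2.7+: use the stdlib class directly.
handler = logging.NullHandler()
logging.getLogger('swift.demo').addHandler(handler)
assert isinstance(handler, logging.NullHandler)
```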
Change-Id: Ie2c27407efc2882e698abe6e4379a00a1d3f4301
I can't imagine us *not* having a py3 proxy server at some point, and
that proxy server is going to need a ring.
While we're at it (and since they were so close anyway), port
* cli/ringbuilder.py
* common/linkat.py
* common/daemon.py
Change-Id: Iec8d97e0ce925614a86b516c4c6ed82809d0ba9b
... to make testing more targeted and obvious
Related-Change-Id: I89439286b211f2c5ef19deffa77c202f48f07cf8
Change-Id: I93b99128a4fb35395e8e9caf11649e216f824fdf
There are three similar code segments in the rebalance process, as follows:

    tiers = ['cluster', 'regions', 'zones', 'servers', 'devices']
    for i, tier_name in enumerate(tiers):
        replicas_at_tier = sum(weighted_replicas_by_tier[t] for t in
                               weighted_replicas_by_tier if len(t) == i)
        if abs(self.replicas - replicas_at_tier) > 1e-10:
            raise exceptions.RingValidationError(
                '%s != %s at tier %s' % (
                    replicas_at_tier, self.replicas, tier_name))
I think we can encapsulate this code segment in a private function and
replace those code segments with a function call.
Change-Id: I89439286b211f2c5ef19deffa77c202f48f07cf8
These were likely written before the first related change, or created
from an existing ring file.
Also, tolerate missing dispersion when rebalancing -- that may not exist
in the builder file.
Change-Id: I26e3b4429c747c23206e4671f7c86543bb182a15
Related-Change: Ib165cf974c865d47c2d9e8f7b3641971d2e9f404
Related-Change: Ie239b958fc7e0547ffda2bebf61546bd4ef3d829
Related-Change: I551fcaf274876861feb12848749590f220842d68
Since dispersion info is cached, this can easily happen if we make
changes to how dispersion info is calculated or stored (e.g. we extend
the dispersion calculation to consider dispersion of all part-replicas
in the related change).
Related-Change-Id: Ifefff0260deac0c3e8b369a1e158686c89936686
Change-Id: I714deb9e349cd114a21ec591216a9496aaf9e0d1
With a sufficiently undispersed ring it's possible to move an entire
replica's worth of parts and yet the value of dispersion may not get any
better (even though in reality dispersion has dramatically improved).
The problem is that dispersion currently only represents up to one whole
replica's worth of parts being undispersed.
However, with EC rings it's possible for more than one whole replica's
worth of partitions to be undispersed; in these cases the builder will
require multiple rebalance operations to fully disperse replicas - but
the dispersion value should improve with every rebalance.
N.B. with this change it's possible for rings with a bad dispersion
value to measure as having a significantly smaller dispersion value
after a rebalance (even though they may not have had their dispersion
change) because the total amount of bad dispersion we can measure has
been increased but we're normalizing within a similar range.
Closes-Bug: #1697543
Change-Id: Ifefff0260deac0c3e8b369a1e158686c89936686
Repeated calls to each co-builder's _update_last_part_moves() are
unnecessary and have the unfortunate side effect of resetting the
_last_part_moved bitmap. When a component builder has zero
min_part_hours this results in it not preventing its co-builders from
moving parts that it has already moved.
This patch changes the CompositeRingBuilder to call each component
builder's _update_last_part_moves() *once* before rebalancing any
component builder. CooperativeRingBuilders no longer forward calls to
their _update_last_part_moves() method. Each component's
_last_part_moved bitmap is therefore preserved for the duration
of the composite rebalance.
The initialisation of the RingBuilder _last_part_moves array is moved
to the RingBuilder __init__ method, so that calls to
_update_last_part_moves() are effective even when rebalance() has
never been called on that builder. Otherwise, during a composite
rebalance, a component that has not previously been rebalanced will
not have its _last_part_moves_epoch updated during rebalance and as a
result may report erroneous min_part_seconds_left after its first
rebalance.
Related-Change: I1b30cb3d776be441346a4131007d2487a5440a81
Closes-Bug: #1714274
Change-Id: Ib165cf974c865d47c2d9e8f7b3641971d2e9f404
... but only if we *have* to!
During the initial gather for balance we prefer to avoid replicas on
over-weight devices that are already under-represented in any of their
tiers (i.e. if a zone has to have at least one, but may have as many as
two, don't take the only replica). Instead we hope by going for
replicas on over-weight devices that are at the limits of their
dispersion we might have a better than even chance we find a better
place for them during placement!
This normally works out - and especially so for rings which can
disperse and balance. But for existing rings where we'd have to
sacrifice dispersion to improve balance the existing optimistic gather
will end up refusing to trade dispersion for balance - and instead get
stuck without solving either!
You should always be able to solve for *either* dispersion or balance.
But if you can't solve *both* - we bail out on our optimistic gather
much more quickly and instead just focus on improving balance. With
this change, the ring can get into balanced (and un-dispersed) states
much more quickly!
Change-Id: I17ac627f94f64211afaccad15596a9fcab2fada2
Related-Change-Id: Ie6e2d116b65938edac29efa6171e2470bb3e8e12
Closes-Bug: 1699636
Closes-Bug: 1701472
Unit tests use the random module in places to randomise
test inputs, but once the tests in test_builder.py or
test_ring_builder_analyzer.py have been run the random
seed is left in a repeatable state because calls are made
to RingBuilder.balance with a seed value. Consequently,
subsequent calls to random in other tests get repeatable
results from one test run to another.
This patch resets the state of the random module before
returning from RingBuilder.rebalance.
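The fix can be sketched as snapshotting and restoring the random module's global state around seeded work (a standalone stand-in, not the actual RingBuilder.rebalance code):

```python
import random

# Hypothetical seeded operation: restore the module state on the way
# out so other tests see an unchanged (non-repeatable) state.
def seeded_shuffle(items, seed):
    state = random.getstate()
    try:
        random.seed(seed)
        random.shuffle(items)
    finally:
        random.setstate(state)

items = list(range(10))
before = random.getstate()
seeded_shuffle(items, seed=1639755)
assert random.getstate() == before   # no seeding side effect leaks out
assert sorted(items) == list(range(10))
```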
Closes-Bug: #1639755
Change-Id: I4b74030afc654e60452e65b3e0f1b45a189c16e3
...to formalize an otherwise-unwritten contract and get
CooperativeRingBuilder using more "public" interfaces.
Change-Id: Ib6666728eabeff948bb53dff054a69bada47556e
- Add a CooperativeRingBuilder subclass of RingBuilder. The subclass takes
a reference to a parent CompositeRingBuilder which is consulted about
whether a part can be moved during rebalance. The parent builder in turn
consults all component CooperativeRingBuilders to decide if a part can
be moved.
- Make CompositeRingBuilder load CooperativeRingBuilder instances.
- Add rebalance() method to CompositeRingBuilder class.
- Add a load_components() method to CompositeRingBuilder class.
- Change the CompositeRingBuilder compose() method to NOT by default
raise a ValueError if component builders have not been modified since
last loaded. With the load_components method being added it makes
less sense to insist by default on loaded components being modified, and
it is desirable to have the same semantic for all methods that load
components. Previously it has been necessary to use the 'force' flag
with compose() to prevent these errors being raised, which has the
unfortunate side effect of also disabling all other checks on
component builders. A new 'require_modified' parameter is added to
compose() which defaults to False but can be set to True if the
previous default behaviour is required.
Change-Id: I1b30cb3d776be441346a4131007d2487a5440a81
The gather/place debug logs used to just contain device IDs; now they
include region, zone, and IP. This makes it easier to see what's going
on when debugging rebalance operations.
Change-Id: I6314e327973c57a34b88ebbb4d3b1594dbacd357
This patch adds methods to increase the partition power of an existing
object ring without downtime for the users using a 3-step process. Data
won't be moved to other nodes; objects using the new increased partition
power will be located on the same device and are hardlinked to avoid
data movement.
1. A new setting "next_part_power" will be added to the rings, and once
the proxy server has reloaded the rings it will send this value to the
object servers on any write operation. Object servers will then create a
hard-link in the new location to the original DiskFile object. Already
existing data will be relinked to the new locations via hardlinks
using a new tool.
2. The actual partition power itself will be increased. Servers will now
use the new partition power for reads and writes. Hard links in the old
object locations that are no longer required are then removed by the
relinker tool; the relinker tool reads the next_part_power setting
to find object locations that need to be cleaned up.
3. The "next_part_power" flag will be removed.
This mostly implements the spec in [1]; however it's not using an
"epoch" as described there. The idea of the epoch was to store data
using different partition powers in their own namespace to avoid
conflicts with auditors and replicators as well as being able to abort
such an operation and just remove the new tree. This would require some
heavy change of the on-disk data layout, and other object-server
implementations would be required to adopt this scheme too.
Instead the object-replicator is now aware that there is a partition
power increase in progress and will skip replication of data in that
storage policy; the relinker tool should be simply run and afterwards
the partition power will be increased. This shouldn't take that much
time (it's only walking the filesystem and hardlinking); impact should
be low therefore. The relinker should be run on all storage nodes at the
same time in parallel to decrease the required time (though this is not
mandatory). Failures during relinking should not affect cluster
operations - relinking can be even aborted manually and restarted later.
Auditors do not quarantine objects written to a path with a different
partition power, and therefore work as before (though in the worst case
they read each object twice before the no-longer-needed hard
links are removed).
Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
[1] https://specs.openstack.org/openstack/swift-specs/specs/in_progress/
increasing_partition_power.html
Change-Id: I7d6371a04f5c1c4adbb8733a71f3c177ee5448bb
* Adds a composite_builder module which provides the functionality to
build a composite ring from a number of component ring builders.
* Add id to RingBuilder to differentiate rings in composite.
A RingBuilder now gets a UUID when it is saved to file if
it does not already have one. A RingBuilder loaded from
file does NOT get a UUID assigned unless it was previously persisted in
the file. This forces users to explicitly assign an id to
existing ring builders by saving the state back to file.
The UUID is included in the first line of the output from:
swift-ring-builder <builder-file>
Background:
This is another implementation of Composite Ring [1]
to enable better dispersion for a global erasure-coded cluster.
The most significant difference from the related-change [1] is that this
solution attempts to solve the problem as an offline tool rather than
dynamic compositing on the running servers. Due to the change, we gain
advantages such as:
- Less code and being simple
- No complex state validation on the running server
- Easy deployments with an offline tool
This patch does not provide a command line utility for managing
composite rings. The interface for such a tool is still under
discussion; this patch provides the enabling functionality first.
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
[1] Related-Change: I80ef36d3ac4d4b7c97a1d034b7fc8e0dc2214d16
Change-Id: I0d8928b55020592f8e75321d1f7678688301d797
... and remove some cruft that couldn't possibly work
Change-Id: I560f0a29f0a881c63ec3cb910dbf5476fe2a915a
Related-Change-Id: I0d8928b55020592f8e75321d1f7678688301d797
This is follow-up for https://review.openstack.org/#/c/419107
to address:
- Remove an unnecessary for-loop block that always breaks at the end of
the loop
- Correct a mistaken docstring that says "two more times" when the code
actually does only "one more time"
Change-Id: I3d76275afc6448709a4b3588259e085bce7fa21d
We only need one additional rebalance to get something well-balanced
and well-dispersed.
Drive-by to break a little earlier in the builder after we move a part.
Change-Id: I78b86292c98be5e247a694b9db71d9267e1a0a22
Currently swift ignores a lot of the Hacking style guide. This patch
enables the H401 and H403 checks and fixes any violations. With this
we can get a little closer to following the OpenStack style guidelines.
Change-Id: I5109a052f2ceb2e6a9a174cded62f4231156d39b
Previously, we could over-assign how many parts should be in a tier.
This would cause the local `parts` variable to go negative, which meant
that our `while parts` loop would never terminate.
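A toy reproduction of the failure mode (a hypothetical helper, not the builder's actual placement loop):

```python
# With `while parts:` a negative value is truthy, so over-assignment
# (subtracting more than remains) makes the loop spin forever.
# Testing `parts > 0` instead guarantees termination.
def place(parts, chunk):
    loops = 0
    while parts > 0:           # buggy version tested `while parts:`
        parts -= chunk         # may over-assign, pushing parts below zero
        loops += 1
    return loops

assert place(5, 2) == 3        # 5 -> 3 -> 1 -> -1, loop still exits
```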
Change-Id: Id7e7889742ca37cf1a9c0d55fba78d967e90e8d0
Closes-Bug: 1642538
With a min_part_hours of zero, it's possible to move more than one
replica of the same part in a single rebalance.
This change in behavior only affects min_part_hours-zero rings, which
are understood to be uncommon in production, mostly because of this
very specific and strange behavior of min_part_hours-zero rings.
With this change, no matter how small your min_part_hours it will
always require at least N rebalances to move N part-replicas of the
same part.
To supplement the existing persisted _last_part_moves structure to
enforce min_part_hours, this change adds a _part_moved_bitmap that
exists only during the life of the rebalance, to track when rebalance
moves a part, in order to prevent other replicas of the same part
from being moved in the same rebalance.
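The per-rebalance bitmap might look something like this (an illustrative stand-in with hypothetical names, not the actual Swift implementation):

```python
# One bit per partition; allocated fresh at the start of a rebalance and
# discarded afterwards, unlike the persisted _last_part_moves timestamps.
class PartMovedBitmap:
    def __init__(self, part_count):
        self._bits = bytearray((part_count + 7) // 8)

    def mark_moved(self, part):
        self._bits[part // 8] |= 1 << (part % 8)

    def has_moved(self, part):
        return bool(self._bits[part // 8] & (1 << (part % 8)))

moved = PartMovedBitmap(256)
moved.mark_moved(42)
assert moved.has_moved(42)      # second replica of part 42 stays put
assert not moved.has_moved(43)
```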
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Christian Schwede <cschwede@redhat.com>
Closes-bug: #1586167
Change-Id: Ia1629abd5ce6e1b3acc2e94f818ed8223eed993a
pretend_min_part_hours_passed does things like this:

    self._last_part_moves[part] = 0xff

This will throw an exception if self._last_part_moves is None.
This patch checks self._last_part_moves to prevent the exception.
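The guard can be sketched as follows (a hypothetical standalone version; the real method lives on RingBuilder):

```python
# Age all parts past min_part_hours, but only if move tracking exists.
def pretend_min_part_hours_passed(last_part_moves, part_count):
    if last_part_moves is None:
        # Nothing to age: the builder has never tracked part moves yet.
        return None
    for part in range(part_count):
        last_part_moves[part] = 0xff
    return last_part_moves

assert pretend_min_part_hours_passed(None, 8) is None   # no exception
aged = pretend_min_part_hours_passed(bytearray(8), 8)
assert all(b == 0xff for b in aged)
```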
Closes-bug: #1578835
Change-Id: Ic83c7a338b45bfcf61f5ab6100e6db335c3fa81a
When your device count falls below your replica count you can either add
devices or reduce the replica count.
Trying to reduce your replica count fails about half the time because
removing parts from failed devices temporarily invalidates your
_replica2part2dev table with NONE_DEV, which can result in an IndexError
in _adjust_replica2part2dev_size.
If you adjust the replica count first you won't have to worry about
tracking unassigned parts from failed devices.
Closes-Bug: #1558751
Change-Id: I99dc776fd260a2ba68ca77d7b5ed5120d10b06de
This method increases the partition power of an existing ring by one. It does
not move any data, nor does it expose a CLI command yet; it is only intended to
be used in a future version to do the actual ring modification itself.
An existing object that is currently located on partition X will be placed
either on partition 2*X or 2*X+1 after the partition power has been increased.
The reason for this is the Ring.get_part() method, which does a bitwise shift
to the right.
To avoid actual data movement to different disks or even nodes, the allocation
of partitions to nodes needs to be changed. The allocation is pairwise due to
the above mentioned new partition scheme. Therefore devices are allocated like
this, with the partition being the index and the value being the device id:
    OLD: 0, 3, 7, 5, 2, 1, ...
    NEW: 0, 0, 3, 3, 7, 7, 5, 5, 2, 2, 1, 1, ...
If an operator stops the cluster, increases the partition power and renames &
hardlinks the existing data it is possible to do a power shift without actually
moving data. Please see the partition power spec for further details on this.
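The 2*X / 2*X+1 mapping and the pairwise allocation can be illustrated like this (a sketch assuming a get_part()-style lookup that keeps the top part_power bits of a 32-bit hash; not the actual Ring code):

```python
# Raising part_power by one keeps one extra hash bit, so the old
# partition X becomes 2*X or 2*X+1.
def part_for(hash32, part_power):
    return hash32 >> (32 - part_power)

h = 0x9C2A4E10                  # arbitrary example hash value
old = part_for(h, 10)
new = part_for(h, 11)
assert new in (2 * old, 2 * old + 1)

# Pairwise device allocation keeps each object on the same device:
old_table = [0, 3, 7, 5, 2, 1]
new_table = [dev for dev in old_table for _ in (0, 1)]
assert new_table == [0, 0, 3, 3, 7, 7, 5, 5, 2, 2, 1, 1]
```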
Change-Id: I063fd8077497ee8c14d9065f07b4ec0fb5cbe180
Partially-Implements: spec increasing_partition_power
I've found that, given a sufficiently bad replica2part2dev table, we can
accidentally fail to entirely fix placement when more than two replicas of a
part are assigned to duplicate devices.
It shows up most on > 3 replica rings when you have two *different*
devices both holding two replicas. But you can see it on a three
replica ring when all three replicas are assigned to the same device.
Change-Id: Ieb213c1a259815a2ed657291242919cda568c7b5
swift-ring-builder currently only displays min_part_hours and
not the amount of time remaining before a rebalance can occur.
This information is readily available and is now displayed
as a quality-of-life improvement.
Additionally, this fixes a bug where the time since the last rebalance
was always updated when rebalance was called, regardless of
whether any partitions were reassigned. This could lead to partitions
being unable to be reassigned, as they never age according to
the time since the last rebalance.
Change-Id: Ie0e2b5e25140cbac7465f31a26a4998beb3892e9
Closes-Bug: #1526017
It's harder than it sounds. There were really three challenges.
Challenge #1 Initial Assignment
===============================
Before starting to assign parts on this new shiny ring you've
constructed, maybe we'll pause for a moment up front and consider the
lay of the land. This process is called the replica_plan.
The replica_plan approach is separating part assignment failures into
two modes:
1) we considered the cluster topology and its weights and came up with
the wrong plan
2) we failed to execute on the plan
I failed at both parts plenty of times before I got it this close. I'm
sure a counter example still exists, but when we find it the new helper
methods will let us reason about where things went wrong.
Challenge #2 Fixing Placement
=============================
With a sound plan in hand, it's much easier to fail to execute on it the
less material you have to execute with - so we gather up as many parts
as we can - as long as we think we can find them a better home.
Picking the right parts for gather is a black art - when you notice a
balance is slow it's because it's spending so much time iterating over
replica2part2dev trying to decide just the right parts to gather.
The replica plan can help at least in the gross dispersion collection to
gather up the worst offenders first before considering balance. I think
trying to avoid picking up parts that are stuck to the tier before
falling into a forced grab on anything over parts_wanted helps with
stability generally - but depending on where the parts_wanted are in
relation to the full devices it's pretty easy to pick up something that'll
end up really close to where it started.
I tried to break the gather methods into smaller pieces so it looked
like I knew what I was doing.
Going with a MAXIMUM gather iteration instead of balance (which doesn't
reflect the replica_plan) doesn't seem to be costing me anything - most
of the time the exit condition is either solved or all the parts overly
aggressively locked up on min_part_hours. So far, it mostly seems that if
the thing is going to balance this round it'll get it in the first
couple of shakes.
Challenge #3 Crazy replica2part2dev tables
==========================================
I think there's lots of ways "scars" can build up a ring which can
result in very particular replica2part2dev tables that are physically
difficult to dig out of. It's repairing these scars that will take
multiple rebalances to resolve.
... but at this point ...
... lacking a counter example ...
I've been able to close up all the edge cases I was able to find. It
may not be quick, but progress will be made.
Basically my strategy just required a better understanding of how
previous algorithms were able to *mostly* keep things moving by brute
forcing the whole mess with a bunch of randomness. Then when we detect
our "elegant" careful part selection isn't making progress - we can fall
back to same old tricks.
Validation
==========
We validate against duplicate part replica assignment after rebalance
and raise an ERROR if we detect more than one replica of a part assigned
to the same device.
In order to meet that requirement we have to have as many devices as
replicas, so attempting to rebalance with too few devices w/o changing
your replica_count is also an ERROR not a warning.
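The post-rebalance check described above can be sketched as scanning the assignment table for partitions with two replicas on one device (illustrative, not Swift's actual validation code):

```python
# replica2part2dev is a list of per-replica arrays mapping part -> dev id.
def find_duplicate_assignments(replica2part2dev):
    dupes = []
    part_count = len(replica2part2dev[0])
    for part in range(part_count):
        devs = [r2p2d[part] for r2p2d in replica2part2dev]
        if len(devs) != len(set(devs)):
            dupes.append(part)
    return dupes

table = [[0, 1], [1, 2], [2, 2]]    # part 1 has two replicas on dev 2
assert find_duplicate_assignments(table) == [1]
```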
Random Thoughts
===============
As usual with rings, the test diff can be hard to reason about -
hopefully I've added enough comments to assure future me that these
assertions make sense.
Despite being a large rewrite of a lot of important code, the existing
code is known to have failed us. This change fixes a critical bug that's
trivial to reproduce in a critical component of the system.
There's probably a bunch of error messages and exit status stuff that's
not as helpful as it could be considering the new behaviors.
Change-Id: I1bbe7be38806fc1c8b9181a722933c18a6c76e05
Closes-Bug: #1452431
Commit 71993d84e8 added
a new 'remove_dev' column to the
swift.common.ring.builder.rebalance return value.
This patch adds the docs for that and cleans up the
variable name a bit to make it easier to read.
Change-Id: Idfd46e47b9f6894cbafc8b7701a4c7414212f79f