Commit Graph

122 Commits

Author SHA1 Message Date
Matthew Oliver 00bfc425ce Add FakeStatsdClient to unit tests
Currently we simply mock statsd calls in the FakeLogger, and there are
also some helper methods for counting and collating the metrics that were
called. This FakeLogger is overloaded and doesn't simulate the real world.
In real life we use a StatsdClient that is attached to the logger.

We've been in the situation where unit tests pass but the statsd client
stacktraces, because we don't actually fake the StatsdClient based on the
real one and let it use its internal logic.

This patch creates a new FakeStatsdClient that is based on the real one;
it can then be used (like the real statsd client) and attached to the
FakeLogger.
There is quite a bit of churn in tests to make this work, because we now
have to look into the fake statsd client to check the faked calls made.
The FakeStatsdClient does everything the real one does, except that it
overrides the _send method and socket creation so that no actual statsd
metrics are emitted.

Change-Id: I9cdf395e85ab559c2b67b0617f898ad2d6a870d4
2023-08-07 10:10:45 +01:00
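
A minimal sketch of the FakeStatsdClient approach described above, in Python.
The StatsdClient import path and the _open_socket/_send names and signatures
are assumptions and may differ from Swift's actual code:

    from swift.common.utils import StatsdClient  # import path assumed

    class FakeStatsdClient(StatsdClient):
        """Behaves like the real client but never touches the network."""

        def __init__(self, *args, **kwargs):
            super(FakeStatsdClient, self).__init__(*args, **kwargs)
            self.sent = []  # recorded metrics, for assertions in tests

        def _open_socket(self):
            # the real client would create a UDP socket here
            return None

        def _send(self, m_name, m_value, m_type, sample_rate):
            # record the metric instead of emitting it over the socket
            self.sent.append((m_name, m_value, m_type, sample_rate))

A FakeLogger can then hold an instance of this class in place of a real
statsd client, so tests exercise the client's own prefixing and sampling
logic instead of a mock.
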
Takashi Natsume 6fd523947a Fix misuse of assertTrue
Fix misuse of assertTrue in
test/unit/obj/test_reconstructor.py.

Change-Id: I9c55bb16421ec85a20d3d4a0e6be43ce20c08b3c
Closes-Bug: 1986776
Signed-off-by: Takashi Natsume <takanattie@gmail.com>
2022-08-17 18:05:26 +09:00
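
The specific assertions fixed live in the test file above; the general
failure mode is sketched below with illustrative values:

    import unittest

    class Example(unittest.TestCase):
        def test_compare(self):
            expected = actual = 'frag-7'
            # Misuse: assertTrue(actual, expected) only checks that 'actual'
            # is truthy and treats 'expected' as the failure message, so it
            # would pass even if the two values differed.
            # Correct: compare the values explicitly.
            self.assertEqual(expected, actual)
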
Clay Gerrard 12bc79bf01 Add ring_ip option to object services
This will be used by the object services when finding their own devices in
rings, defaulting to the bind_ip.

Notably, this allows services to be containerized while servers_per_port
is enabled:

* For the object-server, the ring_ip should be set to the host ip and
  will be used to discover which ports need binding. Sockets will still
  be bound to the bind_ip (likely 0.0.0.0), with the assumption that the
  host will publish ports 1:1.

* For the replicator and reconstructor, the ring_ip will be used to
  discover which devices should be replicated. While bind_ip could
  previously be used for this, it would have required a separate config
  from the object-server.

Also rename the object daemons' bind_ip attribute to ring_ip so that it's
more obvious wherever we're using the IP for ring lookups instead of
socket binding.

Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Change-Id: I1c9bb8086994f7930acd8cda8f56e766938c2218
2022-06-02 16:31:29 -05:00
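
A sketch of the fallback described above, assuming the daemon's options
arrive as a plain conf dict; the helper name is illustrative, not Swift's
exact code:

    def resolve_ring_ip(conf):
        bind_ip = conf.get('bind_ip', '0.0.0.0')
        # ring_ip is only consulted for ring lookups; sockets still bind to
        # bind_ip, so a containerized service can publish its ports 1:1
        return conf.get('ring_ip', bind_ip)
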
Tim Burke 1907594bd8 reconstructor: Abort just the changed policies
We've already walked the disks looking for work, so we may as well continue
with the work that's definitely still valid.

Change-Id: I4c33ed5f5a66d89d259761b5ce12fb6652b28c40
2022-01-10 09:05:28 -08:00
Alistair Coles 1b3879e0da reconstructor: include partially reverted handoffs in handoffs_remaining
For a reconstructor revert job, if sync'd to sufficient other nodes,
the handoff partition is considered done and handoffs_remaining is not
incremented. With the new max_objects_per_revert option [1], an ssync
job may appear to be complete even though not all objects have yet been
reverted, so handoffs_remaining should be incremented.

[1] Related-Change: If81760c80a4692212e3774e73af5ce37c02e8aff
Change-Id: I59572f75b9b0ba331369eb7358932943b7935ff0
2021-12-03 14:37:59 +00:00
Alistair Coles 8ee631ccee reconstructor: restrict max objects per revert job
Previously the ssync Sender would attempt to revert all objects in a
partition within a single SSYNC request. With this change the
reconstructor daemon option max_objects_per_revert can be used to limit
the number of objects reverted inside a single SSYNC request for revert
type jobs i.e. when reverting handoff partitions.

If more than max_objects_per_revert are available, the remaining objects
will remain in the sender partition and will not be reverted until the
next call to ssync.Sender, which would currently be the next time the
reconstructor visits that handoff partition.

Note that the option only applies to handoff revert jobs, not to sync
jobs.

Change-Id: If81760c80a4692212e3774e73af5ce37c02e8aff
2021-12-03 12:43:23 +00:00
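
An illustrative way to apply such a cap, assuming the revert job's objects
arrive as an iterator and that a value of 0 means "no limit" (both are
assumptions; Swift plumbs the option through ssync.Sender):

    import itertools

    def limit_revert_objects(object_iter, max_objects_per_revert=0):
        if max_objects_per_revert <= 0:
            return object_iter  # assumed: 0 disables the limit
        # anything beyond the cap stays in the sender partition until the
        # next call to ssync.Sender
        return itertools.islice(object_iter, max_objects_per_revert)
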
Alistair Coles ada9f0eeb0 reconstructor: purge meta files in pure handoffs
Previously, after reverting handoff files, the reconstructor would
only purge tombstones and data files for the reverted fragment
index. Any meta files were not purged because the partition might
also be on a primary node for a different fragment index.

For example, if, before the reconstructor visits, the object hash dir
contained:

  t1#1#d.data
  t1#2#d.data
  t2.meta

where frag index 1 is a handoff and gets reverted, then, after the
reconstructor has visited, the hash dir should still contain:

  t1#2#d.data
  t2.meta

If, before the reconstructor visits, the object hash dir contained:

  t1#1#d.data
  t2.meta

then, after the reconstructor has visited, the hash dir would still
contain:

  t2.meta

The retention of meta files is undesirable when the partition is a
"pure handoff" i.e. the node is not a primary for the partition for
any fragment index. With this patch the meta files are purged after
being reverted if the reconstructor has no sync job for the partition
(i.e. the partition is a "pure handoff") and there are no more
fragments to revert.

Change-Id: I107af3bc2d62768e063ef3176645d60ef22fa6d4
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
2021-11-24 12:20:52 +00:00
Alistair Coles 092d409c4b reconstructor: silence traceback when purging
Catch DiskFileNotExist exceptions when attempting to purge files. The
file may have passed its reclaim age since being reverted and will be
cleaned up when the reconstructor opens it for purging, raising a
DiskFileNotExist. The exception is OK - the diskfile was about to be
purged.

Change-Id: I5dfdf5950c6bd7fb130ab557347fbe959270c6e9
2021-11-24 12:11:50 +00:00
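
The pattern in miniature: DiskFileNotExist is Swift's real exception class,
but the helper and the purge() signature shown here are simplified
assumptions:

    from swift.common.exceptions import DiskFileNotExist

    def purge_quietly(df, timestamp, frag_index):
        try:
            df.purge(timestamp, frag_index)
        except DiskFileNotExist:
            # the file passed its reclaim age and was cleaned up when the
            # diskfile was opened; it is gone, which is what we wanted
            pass
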
Alistair Coles e3069e6f7e reconstructor: remove non-durable files on handoffs
When a non-durable EC data fragment has been reverted from a handoff
node, it should be removed if its mtime is older than the
commit_window. The test on the mtime was broken [1]: an incomplete
file path was given and the test always returned False i.e. the file
was never considered old enough to remove. As a result, non-durable
files would remain on handoff nodes until their reclaim age had
passed.

[1] Related-Change: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
Change-Id: I7f6458af3ed753ef8700a456d5a977b847f17ee8
Closes-Bug: 1951598
2021-11-19 14:39:24 +00:00
Tim Burke a5fbe6ca41 ec: Use replication network to get frags for reconstruction
Closes-Bug: #1946267
Change-Id: Idb4fe7478275f71b4032024d6116181766ac6759
2021-10-06 15:17:01 -07:00
Alistair Coles 2696a79f09 reconstructor: retire nondurable_purge_delay option
The nondurable_purge_delay option was introduced in [1] to prevent the
reconstructor removing non-durable data files on handoffs that were
about to be made durable. The DiskFileManager commit_window option has
since been introduced [2] which specifies a similar time window during
which non-durable data files should not be removed. The commit_window
option can be re-used by the reconstructor, making the
nondurable_purge_delay option redundant.

The nondurable_purge_delay option has not been available in any tagged
release and is therefore removed with no backwards compatibility.

[1] Related-Change: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
[2] Related-Change: I5f3318a44af64b77a63713e6ff8d0fd3b6144f13
Change-Id: I1589a7517b7375fcc21472e2d514f26986bf5079
2021-07-19 21:18:06 +01:00
Alistair Coles bbaed18e9b diskfile: don't remove recently written non-durables
DiskFileManager will remove any stale files during
cleanup_ondisk_files(): these include tombstones and nondurable EC
data fragments whose timestamps are older than reclaim_age. It can
usually be safely assumed that a non-durable data fragment older than
reclaim_age is not going to become durable. However, if an agent PUTs
objects with specified older X-Timestamps (for example the reconciler
or container-sync) then there is a window of time during which the
object server has written an old non-durable data file but has not yet
committed it to make it durable.

Previously, if another process (for example the reconstructor) called
cleanup_ondisk_files during this window then the non-durable data file
would be removed. The subsequent attempt to commit the data file would
then result in a traceback due to there no longer being a data file to
rename, and of course the data file is lost.

This patch modifies cleanup_ondisk_files to not remove old, otherwise
stale, non-durable data files that were only written to disk in the
preceding 'commit_window' seconds. 'commit_window' is configurable for
the object server and defaults to 60.0 seconds.

Closes-Bug: #1936508
Related-Change: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
Change-Id: I5f3318a44af64b77a63713e6ff8d0fd3b6144f13
2021-07-19 21:18:02 +01:00
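
A sketch of the guard this adds, assuming the check is a straightforward
comparison of the non-durable .data file's mtime against commit_window
(the helper name is illustrative):

    import os
    import time

    def old_enough_to_reclaim(data_file_path, commit_window=60.0):
        try:
            mtime = os.stat(data_file_path).st_mtime
        except OSError:
            return False  # already gone; nothing to reclaim
        # a stale non-durable .data file is only reclaimable once it has
        # been on disk for longer than commit_window seconds
        return (time.time() - mtime) > commit_window
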
Alistair Coles 2fd5b87dc5 reconstructor: make quarantine delay configurable
Previously the reconstructor would quarantine isolated durable
fragments that were more than reclaim_age old. This patch adds a
quarantine_age option for the reconstructor which defaults to
reclaim_age but can be used to configure the age that a fragment must
reach before quarantining.

Change-Id: I867f3ea0cf60620c576da0c1f2c65cec2cf19aa0
2021-07-06 16:41:08 +01:00
Alistair Coles 2934818d60 reconstructor: Delay purging reverted non-durable datafiles
The reconstructor may revert a non-durable datafile on a handoff
concurrently with an object server PUT that is about to make the
datafile durable.  This could previously lead to the reconstructor
deleting the recently written datafile before the object-server
attempts to rename it to a durable datafile, and consequently a
traceback in the object server.

The reconstructor will now only remove reverted nondurable datafiles
that are older (according to mtime) than a period set by a new
nondurable_purge_delay option (defaults to 60 seconds). More recent
nondurable datafiles may be made durable or will remain on the handoff
until a subsequent reconstructor cycle.

Change-Id: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
2021-06-24 09:33:06 +01:00
Alistair Coles 46ea3aeae8 Quarantine stale EC fragments after checking handoffs
If the reconstructor finds a fragment that appears to be stale then it
will now quarantine the fragment.  Fragments are considered stale if
insufficient fragments at the same timestamp can be found to rebuild
missing fragments, and the number found is less than or equal to a new
reconstructor 'quarantine_threshold' config option.

Before quarantining a fragment the reconstructor will attempt to fetch
fragments from handoff nodes in addition to the usual primary nodes.
The handoff requests are limited by a new 'request_node_count'
config option.

'quarantine_threshold' defaults to zero, i.e. no fragments will be
quarantined. 'request_node_count' defaults to '2 * replicas'.

Closes-Bug: 1655608

Change-Id: I08e1200291833dea3deba32cdb364baa99dc2816
2021-05-10 20:45:17 +01:00
Alistair Coles eeaac713fd reconstructor: gather rebuild fragments by x-data-timestamp
Fix the reconstructor fragment rebuild to gather other fragments in
buckets keyed by x-backend-data-timestamp rather than
x-backend-timestamp. The former is the actual .data file timestamp;
the latter can vary when .meta files have been written to some but not
all fragment hash dirs, causing rebuild to fail.

Change-Id: I8bbed8cb80b2796907492a39cd5b2d7069e1ca55
Closes-Bug: 1927720
2021-05-07 13:39:03 +01:00
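
The bucketing in miniature, assuming each candidate response exposes the
backend headers named above (the helper is illustrative):

    from collections import defaultdict

    def bucket_frag_responses(responses):
        buckets = defaultdict(list)
        for resp in responses:
            # key on the .data file timestamp rather than X-Backend-Timestamp,
            # which moves whenever a .meta file lands in a frag hash dir
            data_ts = resp.headers.get('X-Backend-Data-Timestamp')
            buckets[data_ts].append(resp)
        return buckets
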
Zuul 020a13ed3c Merge "reconstructor: log more details when rebuild fails" 2021-04-28 23:07:28 +00:00
Clay Gerrard 2a312d1cd5 Cleanup tests' import of debug_logger
Change-Id: I19ca860deaa6dbf388bdcd1f0b0f77f72ff19689
2021-04-27 12:04:41 +01:00
Alistair Coles 7960097f02 reconstructor: log more details when rebuild fails
When the reconstructor fails to gather enough fragments to rebuild a
missing fragment, log more details about the responses that it *did*
get:

  - log total number of ok responses, as well as the number of useful
    responses, to reveal if, for example, there might have been
    duplicate frag indexes or mixed etags.

  - log the mix of error status codes received to reveal if, for
    example, they were all 404s.

Also refactor reconstruct_fa to track all state related to a timestamp
in a small data encapsulation class rather than in multiple dicts.

Related-Bug: 1655608
Change-Id: I3f87933f788685775ce59f3724f17d5db948d502
2021-04-27 11:54:35 +01:00
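
A hedged sketch of the extra detail, assuming each error response exposes a
status attribute; the names and message format are illustrative:

    from collections import Counter

    def log_rebuild_failure(logger, path, useful_responses, error_responses):
        status_counts = Counter(resp.status for resp in error_responses)
        error_summary = ', '.join(
            '%dx %s' % (count, status)
            for status, count in sorted(status_counts.items()))
        logger.error(
            'Unable to get enough responses (%d useful) to rebuild %s; '
            'errors: %s', len(useful_responses), path, error_summary)
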
Alistair Coles 1dceafa7d5 ssync: sync non-durable fragments from handoffs
Previously, ssync would neither sync nor clean up non-durable data
fragments on handoffs. When the reconstructor is syncing objects from
a handoff node (a 'revert' reconstructor job) it may be useful, and is
not harmful, to also send non-durable fragments if the receiver has
older or no fragment data.

Several changes are made to enable this. On the sending side:

  - For handoff (revert) jobs, the reconstructor instantiates
    SsyncSender with a new 'include_non_durable' option.
  - If configured with the include_non_durable option, the SsyncSender
    calls the diskfile yield_hashes function with options that allow
    non-durable fragments to be yielded.
  - The diskfile yield_hashes function is enhanced to include a
    'durable' flag in the data structure yielded for each object.
  - The SsyncSender includes the 'durable' flag in the metadata sent
    during the missing_check exchange with the receiver.
  - If the receiver requests the non-durable object, the SsyncSender
    includes a new 'X-Backend-No-Commit' header when sending the PUT
    subrequest for the object.
  - The SsyncSender includes the non-durable object in the collection
    of synced objects returned to the reconstructor so that the
    non-durable fragment is removed from the handoff node.

On the receiving side:

  - The object server includes a new 'X-Backend-Accept-No-Commit'
    header in its response to SSYNC requests. This indicates to the
    sender that the receiver has been upgraded to understand the
    'X-Backend-No-Commit' header.
  - The SsyncReceiver is enhanced to consider non-durable data when
    determining if the sender's data is wanted or not.
  - The object server PUT method is enhanced to check for an
    'X-Backend-No-Commit' header before committing a diskfile.

If a handoff sender has both a durable and newer non-durable fragment
for the same object and frag-index, only the newer non-durable
fragment will be synced and removed on the first reconstructor
pass. The durable fragment will be synced and removed on the next
reconstructor pass.

Change-Id: I1d47b865e0a621f35d323bbed472a6cfd2a5971b
Closes-Bug: 1778002
2021-01-20 12:00:10 +00:00
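
Receiver-side handling in miniature: config_true_value is a real Swift
helper, but the writer API and the function shown are simplified
assumptions, not the object server's actual PUT path:

    from swift.common.utils import config_true_value

    def finish_put(request, writer, metadata, timestamp):
        no_commit = config_true_value(
            request.headers.get('X-Backend-No-Commit', 'false'))
        writer.put(metadata)  # write the .data file
        if not no_commit:
            # only make the fragment durable when the sender did not mark
            # the subrequest as a non-durable ssync revert
            writer.commit(timestamp)
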
Ade Lee 5320ecbaf2 replace md5 with swift utils version
md5 is not an approved algorithm in FIPS mode, and trying to
instantiate a hashlib.md5() will fail when the system is running in
FIPS mode.

md5 is allowed when in a non-security context.  There is a plan to
add a keyword parameter (usedforsecurity) to hashlib.md5() to annotate
whether or not the instance is being used in a security context.

In the case where it is not, the instantiation of md5 will be allowed.
See https://bugs.python.org/issue9216 for more details.

Some downstream python versions already support this parameter.  To
support these versions, a new encapsulation of md5() is added to
swift/common/utils.py.  This encapsulation is identical to the one being
added to oslo.utils, but is recreated here to avoid adding a dependency.

This patch is to replace the instances of hashlib.md5() with this new
encapsulation, adding an annotation indicating whether the usage is
a security context or not.

While this patch seems large, it is really just the same change over and
over again.  Reviewers need to pay particular attention to whether the
keyword parameter (usedforsecurity) is set correctly.  Right now, none of
the instances appears to be used in a security context.

Now that all the instances have been converted, we can update the bandit
run to look for these instances and ensure that new invocations do not
creep in.

With this latest patch, the functional and unit tests all pass
on a FIPS enabled system.

Co-Authored-By: Pete Zaitcev
Change-Id: Ibb4917da4c083e1e094156d748708b87387f2d87
2020-12-15 09:52:55 -05:00
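
The shape of the wrapper described above; the real one lives in
swift/common/utils.py, so this is only a simplified sketch:

    import hashlib

    def md5(string=b'', usedforsecurity=True):
        try:
            # newer interpreters accept the annotation directly
            return hashlib.md5(string, usedforsecurity=usedforsecurity)
        except TypeError:
            # older interpreters do not know the keyword at all
            return hashlib.md5(string)
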
Tim Burke 3c3cab2645 Stop invalidating suffixes post-SSYNC
We only need the invalidation post-rsync, since rsync was changing data
on disk behind Swift's back. Move the REPLICATE call down into the
rsync() helper function and drop it from the reconstructor entirely.

Change-Id: I576901344f1f3abb33b52b36fde0b25b43e54c8a
Closes-Bug: #1818709
2020-11-16 08:30:07 -06:00
Romain LE DISEZ 8c0a1abf74 Fix a race condition in case of cross-replication
In a situation where two nodes do not have the same version of a ring
and they both think the other node is the primary node of a partition,
a race condition can lead to the loss of some of the objects of the
partition.

The following sequence leads to the loss of some of the objects:

  1. A gets and reloads the new ring
  2. A starts to replicate/revert the partition P to node B
  3. B (with the old ring) starts to replicate/revert the (partial)
     partition P to node A
     => replication should be fast as all objects are already on node A
  4. B finished replication of (partial) partition P to node A
  5. B removes the (partial) partition P after replication succeeded
  6. A finishes replication of partition P to node B
  7. A removes the partition P
  8. B gets and reloads the new ring

All data transferred between steps 2 and 5 will be lost, as it is no longer
on node B and has also been removed from node A.

This commit makes the replicator/reconstructor hold a replication_lock
on partition P so that the remote node cannot start an opposite replication.

Change-Id: I29acc1302a75ed52c935f42485f775cd41648e4d
Closes-Bug: #1897177
2020-10-14 19:16:18 -04:00
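
A minimal per-partition lock in the spirit of the fix, using a lock file and
flock; Swift's real replication_lock helper in the diskfile layer differs in
detail (timeouts, lock counts), so treat this as a sketch:

    import fcntl
    import os
    from contextlib import contextmanager

    @contextmanager
    def replication_lock(partition_dir):
        lock_path = os.path.join(partition_dir, '.lock-replication')
        fd = os.open(lock_path, os.O_CREAT | os.O_WRONLY)
        try:
            # non-blocking: fail fast if another job (for example one
            # started by the remote node) already holds the partition
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            yield
        finally:
            os.close(fd)
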
Zuul 3cceec2ee5 Merge "Update hacking for Python3" 2020-04-09 15:05:28 +00:00
Andreas Jaeger 96b56519bf Update hacking for Python3
The repo is using both Python 2 and 3 now, so update hacking to
version 2.0, which supports Python 2 and 3. Note that the latest hacking
release, 3.0, only supports Python 3.

Fix problems found.

Remove hacking and friends from lower-constraints, they are not needed
for installation.

Change-Id: I9bd913ee1b32ba1566c420973723296766d1812f
2020-04-03 21:21:07 +02:00
Romain LE DISEZ 804776b379 Optimize obj replicator/reconstructor healthchecks
The DaemonStrategy class calls the Daemon.is_healthy() method every 0.1
seconds to ensure that all workers are running as wanted.

On the object replicator/reconstructor daemons, is_healthy() checks whether
the rings have changed to decide if workers must be created/killed. With
large rings, this operation can be CPU intensive, especially on low-end CPUs.

This patch:
- increases the check interval to 5 seconds by default, because none of
  these daemons is critical for performance (they are not in the datapath),
  but allows each daemon to change this value if necessary
- ensures that before doing a computation of all devices in the ring,
  object replicator/reconstructor checks that the ring really changed
  (by checking the mtime of the ring.gz files)

On an Atom N2800 processor, this patch reduced the CPU usage of the main
object replicator/reconstructor from 70% of a core to 0%.

Change-Id: I2867e2be539f325778e2f044a151fd0773a7c390
2020-04-01 08:03:32 -04:00
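
The cheap pre-check in miniature, assuming the daemon remembers the last
seen mtime of each ring.gz file; class and attribute names are illustrative:

    import os

    class RingMtimes(object):
        def __init__(self, ring_paths):
            self.mtimes = {path: os.path.getmtime(path) for path in ring_paths}

        def changed(self):
            # only when an mtime has moved is the expensive walk of all
            # ring devices worth doing
            changed = False
            for path, last in self.mtimes.items():
                current = os.path.getmtime(path)
                if current != last:
                    self.mtimes[path] = current
                    changed = True
            return changed
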
Tim Burke ff5ea003b3 ec: log durability of frags that fail to reconstruct
Whether the frag is durable or non-durable greatly affects how much I
care whether I can reconstruct it.

Change-Id: Ie6f46267d4bb567ecc0cc195d1fd7ce55c8cb325
2019-08-20 22:23:00 -07:00
Tim Burke e8e7106d14 py3: port obj/reconstructor tests
All of the swift changes we needed for this were already done elsewhere.

Change-Id: Ib2c26fdf7bd36ed1cccd5dbd1fa208f912f4d8d5
2019-06-10 08:31:41 -07:00
Kuan-Lin Chen 37fa12cd83 Do not sync suffixes when remote rejects reconstructor sync
Commit a0fcca1e made the reconstructor not sync suffixes when the remote
rejects a reconstructor revert. However, the exact same logic should
be applied to SYNC jobs as well. REPLICATE requests aren't generally
needed when using SSYNC (which the reconstructor always does).

If an ssync_sender fails to finish a sync, the reconstructor should skip
the REPLICATE call entirely and move on to the next partition without
causing any useless remote IO.

Change-Id: Ida50539e645ea7e2950ba668c7f031a8d10da787
Closes-Bug: #1665141
2019-06-03 18:39:51 +08:00
Clay Gerrard 585bf40cc0 Simplify empty suffix handling
We really only need to have one way to clean up empty suffix dirs, and
that's normally during suffix hashing which only happens when invalid
suffixes get rehashed.

When we iterate a suffix tree using yield hashes, we may discover an
expired or otherwise reapable hashdir - when this happens we will now
simply invalidate the suffix so that the next rehash can clean it up.

This simplification removes a misbehavior in the handling between the
normal suffix rehashing cleanup and what was implemented in ssync.

Change-Id: I5629de9f2e9b2331ed3f455d253efc69d030df72
Related-Change-Id: I2849a757519a30684646f3a6f4467c21e9281707
Closes-Bug: 1816501
2019-03-18 15:09:54 -05:00
Clay Gerrard ea8e545a27 Rebuild frags for unmounted disks
Change the behavior of the EC reconstructor to perform a fragment
rebuild to a handoff node when a primary peer responds with 507 to the
REPLICATE request.

Each primary node in an EC ring will sync with exactly three primary
peers: in addition to the left & right nodes we now select a third node
from the far side of the ring.  If any of these partners responds
unmounted, the reconstructor will rebuild its fragments to a handoff
node with the appropriate index.

To prevent ssync (which is uninterruptible) receiving a 409 (Conflict)
we must give the remote handoff node the correct backend_index for the
fragments it will receive.  In the common case we will use
deterministically different handoffs for each fragment index to prevent
multiple unmounted primary disks from forcing a single handoff node to
hold more than one rebuilt fragment.

Handoff nodes will continue to attempt to revert rebuilt handoff
fragments to the appropriate primary until that primary is remounted or
the ring is rebalanced.  After a rebalance of EC rings (potentially removing
unmounted/failed devices), it's most IO efficient to run in
handoffs_only mode to avoid unnecessary rebuilds.

Closes-Bug: #1510342

Change-Id: Ief44ed39d97f65e4270bf73051da9a2dd0ddbaec
2019-02-08 18:04:55 +00:00
Clay Gerrard fb0e7837af Cleanup EC and SSYNC frag index parameters
An object node should reject a PUT with 409 when the timestamp is less
than or equal to the timestamp of an existing version of the object.

However, if the PUT is part of an SSYNC, and the fragment archive has a
different index than the one on disk, we may store it.

We should store it if we're the primary holder for that fragment index.

Back before the related change we used to revert fragments to handoffs
and it caused a lot of problems.  Mainly multiple frag indexes piling up
on one handoff node.  Eventually we settled on handoffs only reverting
to primaries but there was some crufty flailing left over.

When EC frag duplication (multi-region EC) came in we also added a new
complexity because a node's primary index (the index in part_nodes list)
was no longer universally equal to the EC frag index (the storage
policy backend index).  There were a few places where we assumed
node_index == frag_index, some of which caused bugs which we've fixed.

This change tries to clean all that up.

Related-Change-Id: Ie351d8342fc8e589b143f981e95ce74e70e52784

Change-Id: I3c5935e2d5f1cd140cf52df779596ebd6442686c
2019-02-04 17:02:17 -06:00
Tim Burke 1d4309dd71 misc test cleanup
Change-Id: I21823e50af6d60bb5ee02427ddc499d700c43577
Related-Change: Ib33ff305615b2d342f0d673ded5ed8f11b663feb
Related-Change: I0855d8a549d1272d056963abed03338f80d68a53
2019-01-18 18:09:56 +00:00
Clay Gerrard 1d9204ac43 Use remote frag index to calculate suffix diff
... instead of the node index, which is different in multi-region EC and
wrongly leads us to always think we're out of sync.

Closes-Bug: #1811268

Change-Id: I0855d8a549d1272d056963abed03338f80d68a53
2019-01-11 14:32:14 -06:00
Tim Burke 3420921a33 Clean up HASH_PATH_* patching
Previously, we'd sometimes shove strings into HASH_PATH_PREFIX or
HASH_PATH_SUFFIX, which would blow up on py3. Now, always use bytes.

Change-Id: Icab9981e8920da505c2395eb040f8261f2da6d2e
2018-11-01 20:52:33 +00:00
Zuul 614e85d479 Merge "Remove empty directories after a revert job" 2018-11-01 04:34:04 +00:00
Clay Gerrard 441df4fc93 Use correct headers in reconstructor requests
As long as the reconstructor collects parts from all policies, each job
must be considered to have its own storage policy index, and we can't use
global state for policy-specific headers.  It's good hygiene to avoid
mutating the global state regardless.

Under load with multiple policies we observed essentially empty handoff
parts "re-appearing" on nodes until adding these changes.

Closes-Bug: #1671180
Change-Id: Id0e5f2743e05d81da7b26b2f05c90ba3c68e4d72
2018-10-31 08:41:56 -05:00
Alexandre Lécuyer d306345ddd Remove empty directories after a revert job
Currently, the reconstructor will not remove empty object and suffix
directories after processing a revert job. This will only happen during
its next run.

This patch will attempt to remove these empty directories immediately,
while we have the inodes cached.

Change-Id: I5dfc145b919b70ab7dae34fb124c8a25ba77222f
2018-10-26 09:29:14 +02:00
Zuul 3de21d945b Merge "Remove empty part dirs during ssync replication" 2018-06-23 02:19:18 +00:00
Samuel Merritt ecf47553b5 Make final stats dump after reconstructor runs once
When running in multiprocess mode, the object reconstructor would
periodically aggregate its workers' recon data into a single recon
measurement. However, at the end of the run, all that was left in
recon was the last periodic measurement; any work that took place
after that point was not recorded in the aggregate. It was, however,
recorded in the per-disk stats that the worker processes emitted.

This commit adds a final recon aggregation after the worker processes
have finished.

Change-Id: Ia6a3a931e9e7a23824765b2ab111a5492e509be8
2018-06-04 15:24:45 -07:00
Samuel Merritt a19548b3e6 Remove empty part dirs during ssync replication
When we're pushing data to a remote node using ssync, we end up
walking the entire partition's directory tree. We were already
removing reclaimable (i.e. old) tombstones and non-durable EC data
files plus their containing hash dirs, but we were leaving the suffix
dirs around for future removal, and we weren't cleaning up partition
dirs at all. Now we remove as much of the directory structure as we
can, even up to the partition dir, as soon as we observe that it's
empty.

Change-Id: I2849a757519a30684646f3a6f4467c21e9281707
Closes-Bug: 1706321
2018-05-01 17:18:22 -07:00
Samuel Merritt 26538d3f62 Make multiprocess reconstructor's logs more readable.
Much like the multiprocess object replicator, the reconstructor runs
multiple concurrent worker processes who all log to the same
destination. We re-use the same solution: prepend a prefix with the
worker index and the pid to all the logs emitted from each worker
process.

Example log line:

    [worker 12/24 pid=8539] I did a thing

Change-Id: Ie2f98201193952be4d387bbb01c7c6fccc017a8a
2018-04-25 11:18:35 -07:00
Samuel Merritt c4751d0d55 Make reconstructor go faster with --override-devices
The object reconstructor will now fork all available worker processes
when operating on a subset of local devices.

Example:
  A system has 24 disks, named "d1" through "d24"
  reconstructor_workers = 8
  invoked with --override-devices=d1,d2,d3,d4,d5,d6

In this case, the reconstructor will now use 6 worker processes, one
per disk. The old behavior was to use 2 worker processes, one for d1,
d3, and d5 and the other for d2, d4, and d6 (because 24 / 8 = 3, so we
assigned 3 disks per worker before creating another).

I think the new behavior better matches operators' expectations. If I
give a concurrent program six tasks to do and tell it to operate on up
to eight at a time, I'd expect it to do all six tasks at once, not run
two concurrent batches of three tasks apiece.

This has no effect when --override-devices is not specified. When
operating on all local devices instead of a subset, the new and old
code produce the same result.

The reconstructor's behavior now matches the object replicator's
behavior.

Change-Id: Ib308c156c77b9b92541a12dd7e9b1a8ea8307a30
2018-04-25 11:18:35 -07:00
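
The arithmetic of the example above as a hedged sketch; the real dispatch
logic lives in the reconstructor's worker setup and differs in detail:

    def workers_to_fork(reconstructor_workers, override_devices, all_devices):
        devices = override_devices or all_devices
        if not reconstructor_workers:
            return 1  # assumed: multiprocess mode disabled
        # one worker per (overridden) device, capped by reconstructor_workers
        return min(reconstructor_workers, len(devices))

    # e.g. workers_to_fork(8, ['d1', 'd2', 'd3', 'd4', 'd5', 'd6'], all_devices)
    # now yields 6 workers, one per overridden device
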
Samuel Merritt 728b4ba140 Add checksum to object extended attributes
Currently, our integrity checking for objects is pretty weak when it
comes to object metadata. If the extended attributes on a .data or
.meta file get corrupted in such a way that we can still unpickle it,
we don't have anything that detects that.

This could be especially bad with encrypted etags; if the encrypted
etag (X-Object-Sysmeta-Crypto-Etag or whatever it is) gets some bits
flipped, then we'll cheerfully decrypt the cipherjunk into plainjunk,
then send it to the client. Net effect is that the client sees a GET
response with an ETag that doesn't match the MD5 of the object *and*
Swift has no way of detecting and quarantining this object.

Note that, with an unencrypted object, if the ETag metadatum gets
mangled, then the object will be quarantined by the object server or
auditor, whichever notices first.

As part of this commit, I also ripped out some mocking of
getxattr/setxattr in tests. It appears to be there to allow unit tests
to run on systems where /tmp doesn't support xattrs. However, since
the mock is keyed off of inode number and inode numbers get re-used,
there's lots of leakage between different test runs. On a real FS,
unlinking a file and then creating a new one of the same name will
also reset the xattrs; this isn't the case with the mock.

The mock was pretty old; Ubuntu 12.04 and up all support xattrs in
/tmp, and recent Red Hat / CentOS releases do too. The xattr mock was
added in 2011; maybe it was to support Ubuntu Lucid Lynx?

Bonus: now you can pause a test with the debugger, inspect its files
in /tmp, and actually see the xattrs along with the data.

Since this patch now uses a real filesystem for testing filesystem
operations, tests are skipped if the underlying filesystem does not
support setting xattrs (e.g. tmpfs, or more than 4k of xattrs on ext4).

References to "/tmp" have been replaced with calls to
tempfile.gettempdir(). This will allow setting the TMPDIR envvar in
test setup and getting an XFS filesystem instead of ext4 or tmpfs.

THIS PATCH SIGNIFICANTLY CHANGES TESTING ENVIRONMENTS

With this patch, every test environment will require TMPDIR to be
using a filesystem that supports at least 4k of extended attributes.
Neither ext4 nor tmpfs supports this. XFS is recommended.

So why all the SkipTests? Why not simply raise an error? We still need
the tests to run on the base image for OpenStack's CI system. Since
we were previously mocking out xattr, there wasn't a problem, but we
also weren't actually testing anything. This patch adds functionality
to validate xattr data, so we need to drop the mock.

`test.unit.skip_if_no_xattrs()` is also imported into `test.functional`
so that functional tests can import it from the functional test
namespace.

The related OpenStack CI infrastructure changes are made in
https://review.openstack.org/#/c/394600/.

Co-Authored-By: John Dickinson <me@not.mn>

Change-Id: I98a37c0d451f4960b7a12f648e4405c6c6716808
2017-11-03 13:30:05 -04:00
Pavel Kvasnička 163fb4d52a Always require device dir for containers
For test purposes (e.g. saio probetests) even if mount_check is False,
still require check_dir for account/container server storage when real
mount points are not used.

This behavior is consistent with the object-server's checks in diskfile.

Co-Author: Clay Gerrard <clay.gerrard@gmail.com>
Related lp bug #1693005
Related-Change-Id: I344f9daaa038c6946be11e1cf8c4ef104a09e68b
Depends-On: I52c4ecb70b1ae47e613ba243da5a4d94e5adedf2
Change-Id: I3362a6ebff423016bb367b4b6b322bb41ae08764
2017-09-01 10:32:12 -07:00
Clay Gerrard 63ca3a74ef Drop reconstructor stats when worker has no devices
If you're watching a (new) node's reconstruction_last time to ensure a
cycle has finished since the last ring rebalance, you won't ever see
reconstructors with no devices drop their recon stats.

Change-Id: I84c07fc6841119b00d1a74078fe53f4ce637187b
2017-08-21 17:50:10 +01:00
Romain LE DISEZ 69df458254 Allow to rebuild a fragment of an expired object
When a fragment of an expired object was missing, the reconstructor
ssync job would send a DELETE sub-request. This leads to a situation
where, for the same object and timestamp, some nodes have a data file,
while others can have a tombstone file.

This patch forces the reconstructor to reconstruct a data file, even
for expired objects. DELETE requests are only sent for tombstoned
objects.

Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Closes-Bug: #1652323
Change-Id: I7f90b732c3268cb852b64f17555c631d668044a8
2017-08-04 23:05:08 +02:00
Tim Burke 8d05325f03 Test reconstruct() with no EC policies
We have a test for get_local_devices, but let's make some broader
assertions as well.

Related-Bug: #1707595
Change-Id: Ifa696207ffdb3b39650dfeaa3e7c6cfda94050db
2017-08-01 09:18:07 +01:00
Kota Tsuyuzaki 45cc1d02d0 Fix reconstructor to be able to run in a non-EC policy environment
Since the related change, the object-reconstructor gathers the local devices
for EC policies via the get_local_devices method, but that method raises a
TypeError when attempting to *reduce* an empty list of sets. The list can be
empty when no EC config is found in swift.conf.

This patch fixes get_local_devices to return an empty set, without errors,
even when there is no EC config in swift.conf.

Co-Authored-By: Kirill Zaitsev <k.zaitsev@me.com>
Change-Id: Ic121fb547966787a43f9eae83c91bb2bf640c4be
Related-Change: 701a172afa
Closes-Bug: #1707595
2017-07-31 18:46:22 +09:00
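
The failure mode and its fix in miniature: reduce() over an empty sequence
raises TypeError unless an initial value is supplied (the helper name is
illustrative):

    from functools import reduce

    def union_of_device_sets(device_sets):
        # device_sets is empty when swift.conf defines no EC policies; the
        # initial empty set keeps reduce() from raising TypeError
        return reduce(lambda a, b: a | b, device_sets, set())
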
Alistair Coles 56a18ac9b7 Add unit test for ObjectReconstructor.is_healthy
Add a test that verifies that get_all_devices does
fetch devices from the ring.

Related-Change: I28925a37f3985c9082b5a06e76af4dc3ec813abe

Change-Id: Ie2f83694f14f9a614b5276bbb859b9a3c0ec5dcb
2017-07-27 14:14:26 +01:00