The existing test fails on macOS because the value of errno.ENODATA is
platform dependent. On macOS ENODATA is 96:
% man 2 intro|grep ENODATA
96 ENODATA No message available.
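A hedged sketch of the portable idiom, assuming the test only needs to
simulate the error (the real test lives in the unit test suite):

    import errno

    # Sketch: rather than hard-coding 61 (Linux) or 96 (macOS), build the
    # simulated failure from errno.ENODATA so it works on either platform.
    exc = IOError(errno.ENODATA, 'No message available')
    assert exc.errno == errno.ENODATA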
Change-Id: Ibc760e641d4351ed771f2321dba27dc4e5b367c1
This has been available since py32 and was backported to py27; there
is no point in us continuing to carry the old idiom forward.
Change-Id: I21f64b8b2970e2dd5f56836f7f513e7895a5dc88
This reverts the fallocate- and punch_hole-related parts of commit
c78a5962b5.
Closes-Bug: #2031035
Related-Change: I3e26f8d4e5de0835212ebc2314cac713950c85d7
Change-Id: I8050296d6982f70bb64a63765b25d287a144cb8d
Replicated, unencrypted metadata is written to disk differently on py2
vs py3, and has been since we started supporting py3. Fortunately,
we can inspect the raw xattr bytes to determine whether the pickle
was written using py2 or py3, so we can properly read legacy py2 meta
under py3 rather than hitting a unicode error.
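A minimal sketch of the fallback idea, assuming the metadata is a pickled
dict read straight from the xattr (the real check in diskfile inspects the
raw bytes more carefully):

    import pickle

    def read_metadata(raw_xattr_bytes):
        # Hedged sketch: a pickle written under py3 loads cleanly; one
        # written under py2 with non-ASCII str values can raise
        # UnicodeDecodeError, so fall back to decoding those bytes as
        # latin1 instead of erroring out.
        try:
            return pickle.loads(raw_xattr_bytes)
        except UnicodeDecodeError:
            return pickle.loads(raw_xattr_bytes, encoding='latin1')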
Closes-Bug: #2012531
Change-Id: I5876e3b88f0bb1224299b57541788f590f64ddd4
`much_older` has to be much older than `older`, or the test gets
flaky. See
- test_cleanup_ondisk_files_reclaim_non_data_files,
- test_cleanup_ondisk_files_reclaim_with_data_files, and
- test_cleanup_ondisk_files_reclaim_with_data_files_legacy_durable
for a more standard definition of "much_older".
Closes-Bug: #2017024
Change-Id: I1eaa501827f4475ddc0c20d82cf0a6d4a5e98f75
pytest still complains about some 20k warnings, but the vast majority
are actually because of eventlet, and a lot of those will get cleaned up
when upper-constraints picks up v0.33.2.
Change-Id: If48cda4ae206266bb41a4065cd90c17cbac84b7f
Historically, we've allowed objects to get on disk that may not have had
all of their required EC metadata. Add a new method to sanity-check
metadata in the auditor, and quarantine such invalid data.
Additionally, call validation early in the reconstructor's
reconstruct_fa() method so we do not have to attempt reconstruction for
invalid EC fragments.
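A minimal sketch of the kind of check involved, assuming a small set of
required EC sysmeta keys (the real validation covers more cases):

    REQUIRED_EC_KEYS = (
        'X-Object-Sysmeta-Ec-Frag-Index',
        'X-Object-Sysmeta-Ec-Etag',
        'X-Object-Sysmeta-Ec-Content-Length',
    )

    def validate_ec_metadata(metadata):
        # Hedged sketch: a fragment whose metadata is missing any required
        # EC key cannot be used for reconstruction and should be quarantined.
        missing = [key for key in REQUIRED_EC_KEYS if key not in metadata]
        if missing:
            raise ValueError('missing EC metadata keys: %s' % ', '.join(missing))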
Change-Id: I73551007843d27041f594923c59e6fd89caf17e5
Currently, when the replicator or auditor hits an ENODATA corruption,
we bail. This patch allows them to skip it and continue, in an effort
to make progress on the work that needs to be done.
Change-Id: Id4c6e31332a93eb64a7660cec2a42ef686a84d22
We've seen a disk corruption take place that throws -ENODATA when a diskfile
is opened. We currently don't quarantine on this IOError.
This patch catches the error and quarantines the datafile. The catch happens in
diskfile.open in order to group similar quarantines in the same place
for easier maintenance.
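In sketch form, with hypothetical helper names (the real catch sits in
diskfile.open alongside the other quarantine paths):

    import errno

    def open_datafile(df):
        # Hedged sketch: treat an ENODATA IOError from a corrupted datafile
        # the same way other unreadable files are handled -- quarantine it
        # and move on.
        try:
            return df.open()
        except IOError as err:
            if err.errno == errno.ENODATA:
                df.quarantine('Failed to read due to ENODATA')  # hypothetical API
                return None
            raise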
Change-Id: I729e3ba5f19160aa09dd8eb983d9594cb102e85b
Previously, after reverting handoff files, the reconstructor would
only purge tombstones and data files for the reverted fragment
index. Any meta files were not purged because the partition might
also be on a primary node for a different fragment index.
For example, if, before the reconstructor visits, the object hash dir
contained:
t1#1#d.data
t1#2#d.data
t2.meta
where frag index 1 is a handoff and gets reverted, then, after the
reconstructor has visited, the hash dir should still contain:
t1#2#d.data
t2.meta
If, before the reconstructor visits, the object hash dir contained:
t1#1#d.data
t2.meta
then, after the reconstructor has visited, the hash dir would still
contain:
t2.meta
The retention of meta files is undesirable when the partition is a
"pure handoff" i.e. the node is not a primary for the partition for
any fragment index. With this patch the meta files are purged after
being reverted if the reconstructor has no sync job for the partition
(i.e. the partition is a "pure handoff") and there are no more
fragments to revert.
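In sketch form, the new condition amounts to something like this (names are
hypothetical; the real check is in the reconstructor's revert handling):

    def should_purge_meta(has_sync_job, remaining_frag_indexes):
        # Hedged sketch: purge .meta files only when the partition is a
        # "pure handoff" (no sync job for it) and nothing is left to revert.
        return not has_sync_job and not remaining_frag_indexes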
Change-Id: I107af3bc2d62768e063ef3176645d60ef22fa6d4
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
DiskFileManager will remove any stale files during
cleanup_ondisk_files(): these include tombstones and non-durable EC
data fragments whose timestamps are older than reclaim_age. It can
usually be safely assumed that a non-durable data fragment older than
reclaim_age is not going to become durable. However, if an agent PUTs
objects with specified older X-Timestamps (for example the reconciler
or container-sync) then there is a window of time during which the
object server has written an old non-durable data file but has not yet
committed it to make it durable.
Previously, if another process (for example the reconstructor) called
cleanup_ondisk_files during this window then the non-durable data file
would be removed. The subsequent attempt to commit the data file would
then result in a traceback due to there no longer being a data file to
rename, and of course the data file is lost.
This patch modifies cleanup_ondisk_files to not remove old, otherwise
stale, non-durable data files that were only written to disk in the
preceding 'commit_window' seconds. 'commit_window' is configurable for
the object server and defaults to 60.0 seconds.
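A minimal sketch of the guard, with hypothetical names (the real check is in
DiskFileManager.cleanup_ondisk_files and uses the file's mtime):

    import os
    import time

    COMMIT_WINDOW = 60.0  # seconds, matching the documented default

    def ok_to_remove(non_durable_data_file, is_reclaimable):
        # Hedged sketch: even when the fragment's timestamp is older than
        # reclaim_age (is_reclaimable), leave the file alone if it was
        # written to disk within the last commit_window seconds -- a commit
        # making it durable may still be in flight.
        age_on_disk = time.time() - os.stat(non_durable_data_file).st_mtime
        return is_reclaimable and age_on_disk > COMMIT_WINDOW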
Closes-Bug: #1936508
Related-Change: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
Change-Id: I5f3318a44af64b77a63713e6ff8d0fd3b6144f13
If a previous partition power increase failed to cleanup all files in
their old partition locations, then during the next partition power
increase the relinker may find the same file to relink in more than
one source partition. This currently leads to an error log due to the
second relink attempt getting an EEXIST error.
With this patch, when an EEXIST is raised, the relinker will attempt
to create/verify a link from older partition power locations to the
next part power location, and if such a link is found then suppress
the error log.
During the relink step, if an alternative link is verified and if a
file is found that is neither linked to the next partition power
location nor in the current part power location, then the file is
removed during the relink step. That prevents the same EEXIST from occurring
again during the cleanup step when it may no longer be possible to
verify that an alternative link exists.
For example, consider identical filenames in the N+1th, Nth and N-1th
partition power locations, with the N+1th being linked to the Nth:
- During relink, the Nth location is visited and its link is
verified. Then the N-1th location is visited and an EEXIST error
is encountered, but the new check verifies that a link exists to
the Nth location, which is OK.
- During cleanup the locations are visited in the same order, but
files are removed so that the Nth location file no longer exists
when the N-1th location is visited. If the N-1th location still
has a conflicting file then existence of an alternative link to
the Nth location can no longer be verified, so an error would be
raised. Therefore, the N-1th location file must be removed during
relink.
The error is only suppressed for tombstones. The number of partition
power locations that the relinker will look back over may be configured
using the link_check_limit option in a conf file or --link-check-limit
on the command line, and defaults to 2.
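A minimal sketch of the alternative-link check (hypothetical helper; the
real relinker logic also respects link_check_limit):

    import os

    def link_already_exists_elsewhere(filename, older_part_dirs, new_path):
        # Hedged sketch: if the file in any older partition power location
        # is already a hard link to the file at the next partition power
        # location, the EEXIST is benign and the error log can be suppressed.
        for part_dir in older_part_dirs:
            candidate = os.path.join(part_dir, filename)
            try:
                if os.path.samefile(candidate, new_path):
                    return True
            except OSError:
                continue
        return False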
Closes-Bug: 1921718
Change-Id: If9beb9efabdad64e81d92708f862146d5fafb16c
When the replicator update method calls diskfile._get_hashes on a
local partition the partition path can be unicode [1]. If _get_hashes
does a listdir then os.listdir will return unicode values, and any new
suffixes will be written to the hashes.pkl as unicode. When the
hashes.pkl is next read, these unicode suffixes would previously fail
the valid_suffix check and read_hashes would return {'valid': False},
causing a rehash.
When _get_hashes is called from an object server REPLICATE handler,
the device path is bytes and any listdir would return suffixes as
bytes, so the suffix would then be written to hashes.pkl as bytes, and
not cause an error when read_hashes was next called.
This patch modifies diskfile.valid_suffix to allow suffixes to be
unicode or byte strings.
[1] the replication job device is set from the ring devs, and rings
can have unicode device values.
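A minimal sketch of a suffix check that tolerates both types (hypothetical
reimplementation, not the exact code of diskfile.valid_suffix):

    def valid_suffix(value):
        # Hedged sketch: accept unicode or byte strings, then apply the
        # usual "three lowercase hex characters" rule.
        if isinstance(value, bytes):
            try:
                value = value.decode('ascii')
            except UnicodeDecodeError:
                return False
        elif not isinstance(value, str):
            return False
        return len(value) == 3 and all(c in '0123456789abcdef' for c in value)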
Change-Id: I5291aea2b4f1b3dbb6906f943ccf63ad8e716ef7
The relinker currently blindly calls get_hashes on partitions in the
new, upper half of the partition space, which can cause extra handoffs
to be created if you're rerunning the relink step for safety.
Even if the relinker were smarter, though, there's no reason that a
curious operator should end up creating empty handoffs just because
they're poking devices with REPLICATE requests.
Drive-by: be a little more strict about the "suffixes" we're willing to
put in hashes.invalid and read out of hashes.pkl.
Change-Id: I9aace80088cd00d02c418fe4d782b662fb5c8bcf
Refactor to make more use of the utils.get_partition_for_hash() helper
function that was added in [1].
[1] Related-Change: I5fbf8671f2ee63467d46f651e2c24529a2697626
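For reference, the helper boils down to something like the following (a
hedged sketch, not the exact code in swift.common.utils):

    import binascii
    import struct

    def get_partition_for_hash(hex_hash, part_power):
        # Sketch: take the top part_power bits of the binary hash to pick
        # the partition, mirroring how the ring maps hashes to partitions.
        raw = binascii.unhexlify(hex_hash)
        part_shift = 32 - int(part_power)
        return struct.unpack_from('>I', raw)[0] >> part_shift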
Change-Id: I0d3c4cae88ed50a09e892e6d77b5aaa96e6195a5
Previously, when a diskfile was relinked in a new partition,
diskfile.relink_paths() would test for the existence of the link
before attempting to create it. This 'test then set' might race with
another process creating the same diskfile (e.g. a concurrent PUT
which created the very same object in an old partition that the
relinker is also relinking). The race might lead to an attempt to
create a link that already exists (despite the pre-check) and an
EEXIST exception being raised.
This patch modifies relink_paths to tolerate EEXIST exceptions but
only if the existing file is a link to the target file. Otherwise the
EEXIST exception is still raised.
The 'check_existing' argument for relink_paths is removed since it is
no longer relevant.
relink_paths() might also raise an ENOENT exception if, while a
diskfile is being relinked in the 'new' partition dir, another process
PUTs a newer version of the same object and as a result cleans up the
older diskfile in the 'old' partition before the first process has
called os.link().
This patch modifies relink_paths() to tolerate ENOENT exceptions from
os.link() but only if the target path (i.e. the file in the 'old'
partition) no longer exists. Otherwise the ENOENT will still be
raised.
This patch also modifies relink_paths() to return a boolean indicating
if the hard link was created by the call to the method (True) or not
(False).
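A minimal sketch of the tolerant behaviour described above (a hypothetical
simplification of diskfile.relink_paths):

    import errno
    import os

    def relink_paths(target_path, new_target_path):
        # Hedged sketch. Returns True if this call created the hard link,
        # False if a racing process already did the equivalent work.
        try:
            os.link(target_path, new_target_path)
        except OSError as err:
            if err.errno == errno.EEXIST:
                # Another process linked the same diskfile first; that is
                # fine as long as the existing file links to our target.
                if not os.path.samefile(target_path, new_target_path):
                    raise
                return False
            if err.errno == errno.ENOENT and not os.path.exists(target_path):
                # A newer PUT cleaned up the old diskfile before we linked it.
                return False
            raise
        return True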
Closes-Bug: 1917658
Change-Id: I65d4b83c56699ed566fbfb7068f9d2681ca67aa3
Add unit tests for diskfile.relink_paths to cover the new
object dir existing but not being a dir.
Change-Id: I5fbf8671f2ee63467d46f651e2c24529a2697626
Related-Change: Ia041f7b7dbd4e264072325b4b52718fef9966708
When a diskfile is written in a new diskfile dir, then in
diskfile.relink_paths the existence of the diskfile dir is tested
before a call to makedirs. If more than one process is taking the same
action concurrently, they may race such that one tests for existence,
then another calls makedirs, then the first calls makedirs and an
exception is raised. The first process therefore fails to relink the
object file, which is particularly problematic if the first process
was writing a newer object file.
This patch changes relink_paths to unconditionally attempt the
makedirs but catch and ignore any EEXIST errors.
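In sketch form, the change is the classic "just try it" pattern (hedged;
the real call sits inside diskfile.relink_paths):

    import errno
    import os

    def makedirs_tolerate_eexist(path):
        # Hedged sketch: attempt the makedirs unconditionally and treat
        # EEXIST as success, so concurrent writers cannot race a separate
        # existence check.
        try:
            os.makedirs(path)
        except OSError as err:
            if err.errno != errno.EEXIST:
                raise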
Change-Id: Ia041f7b7dbd4e264072325b4b52718fef9966708
Previously, ssync would neither sync nor clean up non-durable data
fragments on handoffs. When the reconstructor is syncing objects from
a handoff node (a 'revert' reconstructor job) it may be useful, and is
not harmful, to also send non-durable fragments if the receiver has
older or no fragment data.
Several changes are made to enable this. On the sending side:
- For handoff (revert) jobs, the reconstructor instantiates
SsyncSender with a new 'include_non_durable' option.
- If configured with the include_non_durable option, the SsyncSender
calls the diskfile yield_hashes function with options that allow
non-durable fragments to be yielded.
- The diskfile yield_hashes function is enhanced to include a
'durable' flag in the data structure yielded for each object.
- The SsyncSender includes the 'durable' flag in the metadata sent
during the missing_check exchange with the receiver.
- If the receiver requests the non-durable object, the SsyncSender
includes a new 'X-Backend-No-Commit' header when sending the PUT
subrequest for the object.
- The SsyncSender includes the non-durable object in the collection
of synced objects returned to the reconstructor so that the
non-durable fragment is removed from the handoff node.
On the receiving side:
- The object server includes a new 'X-Backend-Accept-No-Commit'
header in its response to SSYNC requests. This indicates to the
sender that the receiver has been upgraded to understand the
'X-Backend-No-Commit' header.
- The SsyncReceiver is enhanced to consider non-durable data when
determining if the sender's data is wanted or not.
- The object server PUT method is enhanced to check for an
'X-Backend-No-Commit' header before committing a diskfile.
If a handoff sender has both a durable and newer non-durable fragment
for the same object and frag-index, only the newer non-durable
fragment will be synced and removed on the first reconstructor
pass. The durable fragment will be synced and removed on the next
reconstructor pass.
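A minimal sketch of the receiving side's decision (hypothetical and
simplified; the real logic is in the object server PUT path):

    def should_commit(headers):
        # Hedged sketch: an SSYNC sender that is only transferring a
        # non-durable fragment sets X-Backend-No-Commit on its PUT
        # subrequest; the receiver then writes the .data file but skips
        # making it durable.
        value = headers.get('X-Backend-No-Commit', 'false')
        return str(value).lower() not in ('true', 't', '1', 'yes', 'on')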
Change-Id: I1d47b865e0a621f35d323bbed472a6cfd2a5971b
Closes-Bug: 1778002
md5 is not an approved algorithm in FIPS mode, and trying to
instantiate a hashlib.md5() will fail when the system is running in
FIPS mode.
md5 is allowed when in a non-security context. There is a plan to
add a keyword parameter (usedforsecurity) to hashlib.md5() to annotate
whether or not the instance is being used in a security context.
In the case where it is not, the instantiation of md5 will be allowed.
See https://bugs.python.org/issue9216 for more details.
Some downstream python versions already support this parameter. To
support these versions, a new encapsulation of md5() is added to
swift/common/utils.py. This encapsulation is identical to the one being
added to oslo.utils, but is recreated here to avoid adding a dependency.
This patch is to replace the instances of hashlib.md5() with this new
encapsulation, adding an annotation indicating whether the usage is
a security context or not.
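The encapsulation is roughly as follows (a hedged sketch; the real helper in
swift/common/utils.py handles a few more cases):

    import hashlib

    def md5(string=b'', usedforsecurity=True):
        # Hedged sketch: pass the annotation through where the interpreter
        # supports it, and fall back to the plain call where it does not.
        try:
            return hashlib.md5(string, usedforsecurity=usedforsecurity)
        except TypeError:
            return hashlib.md5(string)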
While this patch seems large, it is really just the same change over and
over again. Reviewers need to pay particular attention to whether the
keyword parameter (usedforsecurity) is set correctly. Right now, all
of them appear to be not used in a security context.
Now that all the instances have been converted, we can update the bandit
run to look for these instances and ensure that new invocations do not
creep in.
With this latest patch, the functional and unit tests all pass
on a FIPS enabled system.
Co-Authored-By: Pete Zaitcev
Change-Id: Ibb4917da4c083e1e094156d748708b87387f2d87
This only applies to post-sync REPLICATE calls, none of which actually
look at the response anyway.
Change-Id: I1de62140e7eb9a23152bb9fdb1fa0934e827bfda
In a situation where two nodes do not have the same version of a ring
and they both think the other node is the primary node of a partition,
a race condition can lead to the loss of some of the objects in the
partition.
The following sequence leads to the loss of some of the objects:
1. A gets and reloads the new ring
2. A starts to replicate/revert the partition P to node B
3. B (with the old ring) starts to replicate/revert the (partial)
partition P to node A
=> replication should be fast as all objects are already on node A
4. B finished replication of (partial) partition P to node A
5. B removes the (partial) partition P after replication succeeded
6. A finishes replication of partition P to node B
7. A removes the partition P
8. B gets and reloads the new ring
All data transferred between steps 2 and 5 will be lost, as it is no
longer on node B and has also been removed from node A.
This commit makes the replicator/reconstructor hold a replication_lock
on partition P so that the remote node cannot start an opposite replication.
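The shape of the fix is roughly this (a hedged sketch with a hypothetical
helper; Swift's real code extends its lock_path() utility on the device and
partition directories):

    import fcntl
    import os
    from contextlib import contextmanager

    @contextmanager
    def partition_replication_lock(part_path):
        # Hedged sketch: hold an exclusive, non-blocking lock in the
        # partition directory so an opposite replication started by the
        # remote node fails fast instead of racing the local job.
        fd = os.open(os.path.join(part_path, '.lock-replication'),
                     os.O_CREAT | os.O_WRONLY)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            yield
        finally:
            os.close(fd)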
Change-Id: I29acc1302a75ed52c935f42485f775cd41648e4d
Closes-Bug: #1897177
Ordinarily, an ENOENT in _finalize_durable should mean something's gone
off the rails -- we expected to be able to mark data durable, but
couldn't!
If there are concurrent writers, though, it might actually be OK:
Client A writes .data
Client B writes .data
Client B finalizes .data *and cleans up on-disk files*
Client A tries to finalize but .data is gone
Previously, the above would cause the object server to 500, and if
enough of them did this, the client may see a 503. Now, call it good so
clients get 201s.
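A minimal sketch of the tolerant path (hypothetical and simplified from
_finalize_durable):

    import errno
    import os

    def finalize_durable(data_file_path, durable_data_file_path):
        # Hedged sketch: if the rename fails with ENOENT but some .data
        # file is already present in the hash dir, a concurrent writer
        # finished (and cleaned up) first -- call it good so the client
        # still gets a 201.
        try:
            os.rename(data_file_path, durable_data_file_path)
        except OSError as err:
            hash_dir = os.path.dirname(durable_data_file_path)
            survivors = [f for f in os.listdir(hash_dir)
                         if f.endswith('.data')]
            if err.errno == errno.ENOENT and survivors:
                return
            raise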
Change-Id: I4e322a7be23870a62aaa6acee8435598a056c544
Closes-Bug: #1719860
The repo is now using both Python 2 and 3, so update hacking to
version 2.0, which supports Python 2 and 3. Note that the latest hacking
release, 3.0, only supports Python 3.
Fix problems found.
Remove hacking and friends from lower-constraints, they are not needed
for installation.
Change-Id: I9bd913ee1b32ba1566c420973723296766d1812f
We recently started trying to use O_TMPFILE without the kernel version
check; we gracefully degrade when it isn't available, and we don't give
operators any way to avoid the O_TMPFILE attempt -- logging at warning
seems way too noisy.
Change-Id: I8f42127fd6eb930282161930c65e8614f3e3175f
Related-Change: I3599e2ab257bcd99467aee83b747939afac639d8
If the data in a hashes.pkl is corrupted but still de-serializes without
errors, it will mess up replication and never get fixed. This
happens, for example, if one of the keys is a NULL byte.
This patch checks if the dict keys in hashes.pkl are valid strings and
invalidates it if not.
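A minimal sketch of the check (hypothetical helper; the real validation
happens when read_hashes loads the pickle):

    def hashes_look_valid(hashes):
        # Hedged sketch: only trust a hashes.pkl whose keys are all either
        # the bookkeeping fields or three-character lowercase hex suffixes.
        for key in hashes:
            if key in ('valid', 'updated'):
                continue
            if not (isinstance(key, str) and len(key) == 3 and
                    all(c in '0123456789abcdef' for c in key)):
                return False
        return True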
Closes-Bug: 1830881
Change-Id: I84b062d062ff49935feed0aee3e1963bb72eb5ea
I'm pretty sure I botched the intent of the test when I touched it in
https://github.com/openstack/swift/commit/36c4297
As a side-effect, clean up a warning on py2:
UnicodeWarning: Unicode equal comparison failed to convert
both arguments to Unicode - interpreting them as being unequal
Change-Id: I4da8264f85e688bc5109016c197378ca6dfe06a5
In the diskfile module, there are a couple of cases where we would
quarantine the suffix dir if a single object within is not a directory.
Change the code so that we quarantine only the object.
Change-Id: I35eb16f568c00d973eb089aedd0e5a5619c25a2f
We really only need to have one way to clean up empty suffix dirs, and
that's normally during suffix hashing which only happens when invalid
suffixes get rehashed.
When we iterate a suffix tree using yield hashes, we may discover an
expired or otherwise reapable hashdir - when this happens we will now
simply invalidate the suffix so that the next rehash can clean it up.
This simplification removes a misbehavior in the handling between the
normal suffix rehashing cleanup and what was implemented in ssync.
Change-Id: I5629de9f2e9b2331ed3f455d253efc69d030df72
Related-Change-Id: I2849a757519a30684646f3a6f4467c21e9281707
Closes-Bug: 1816501
Previously, O_TMPFILE support was detected by checking the
kernel version, as it was officially introduced in XFS in 3.15.
The problem is that RHEL has backported the support to at least
RHEL 7.6 but the kernel version is not updated.
This patch changes the detection so that O_TMPFILE support is determined
by actually attempting to open a file with the O_TMPFILE flag, and keeps
the result cached in DiskFileManager so that the check only happens once
while the process is running.
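The detection amounts to a probe like this (a hedged sketch; the real check
is cached on the DiskFileManager):

    import errno
    import os

    def supports_o_tmpfile(dir_path):
        # Hedged sketch: try to open an anonymous temp file in dir_path
        # with O_TMPFILE; failure with EINVAL/EISDIR/EOPNOTSUPP means the
        # kernel or filesystem does not support it.
        if not hasattr(os, 'O_TMPFILE'):
            return False
        try:
            fd = os.open(dir_path, os.O_WRONLY | os.O_TMPFILE)
        except OSError as err:
            if err.errno in (errno.EINVAL, errno.EISDIR, errno.EOPNOTSUPP):
                return False
            raise
        os.close(fd)
        return True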
Change-Id: I3599e2ab257bcd99467aee83b747939afac639d8
Note that we also want to prevent any IndexErrors that might arise from
pickled data already on disk.
Change-Id: If56190134edddb4b778e41a2cdd360008fbfb0e4
Closes-Bug: 1772378
Commit e199192cae introduced the ability
to have multiple SSYNC requests running on a single device. It lacks a
safeguard to ensure that only one SSYNC request can be running on a
given partition.
This commit updates replication_lock to allow the device to be locked
up to N times, then lock the partition related to an SSYNC request once.
Change-Id: Id053ed7dd355d414d7920dda79a968a1c6677c14