Currently, when the object-server serves a GET request and the
DiskFile reader iterates over disk file chunks, there is no explicit
eventlet sleep. When the network outpaces slow disk IO, a single
large, slow GET request can prevent the eventlet hub from scheduling
any other green threads for a long period of time. To improve this,
this patch adds a configurable sleep to the DiskFile reader: a new
'cooperative_period' parameter with a default value of 0 (disabled).
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I80b04bad0601b6cd6caef35498f89d4ba70a4fd4
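The effect of the new parameter can be sketched with a plain generator
(eventlet itself is omitted; the `sleep` callback stands in for
`eventlet.sleep(0)`, and the name `cooperative_iter` is illustrative):

```python
def cooperative_iter(chunks, period, sleep=lambda: None):
    # Yield control back to the hub after every `period` chunks.
    # period=0 disables the cooperative sleep, matching the default.
    for i, chunk in enumerate(chunks, 1):
        if period > 0 and i % period == 0:
            sleep()  # in Swift this would be eventlet.sleep(0)
        yield chunk

calls = []
list(cooperative_iter([b'x'] * 10, period=3, sleep=lambda: calls.append(1)))
print(len(calls))  # → 3
```

With period=3, a 10-chunk read yields to the hub three times instead of
monopolizing it for the whole transfer.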
Previously, clients could bypass fallocate_reserve checks by uploading
with `Transfer-Encoding: chunked` rather than sending a `Content-Length`.
Now, a chunked transfer may still push a disk past the reserve threshold,
but once over the threshold, further PUTs and POSTs will 507. DELETEs
will still be allowed.
Closes-Bug: #2031049
Change-Id: I69ec7193509cd3ed0aa98aca15190468368069a5
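The reserve check amounts to a free-space test that must apply to every
write, whether or not a Content-Length was supplied up front (the
function name and byte-based threshold here are illustrative; Swift's
fallocate_reserve can also be expressed as a percentage):

```python
import os

def has_reserve_space(device_path, reserve_bytes):
    # Refuse the write (507 Insufficient Storage) once free space on
    # the device would fall below the configured reserve.
    st = os.statvfs(device_path)
    return st.f_bavail * st.f_frsize >= reserve_bytes
```

A chunked PUT has no Content-Length to check before writing, so the
test has to be applied as data arrives rather than only at open time.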
This is intended to be purely internal to our diskfile implementation,
allowing us to have more context about whether we're writing data,
metadata, or a tombstone when calling fallocate in DiskFileWriter.open.
Change-Id: I8c4214080a6bd73787690f75b8c150c999d948fe
The stack showing how we call self._logger.error isn't nearly as
interesting as the stack for the caught exception.
Change-Id: I6bdff90f9d3d26a46b1f19c87db0f3c60260c5b7
Replicated, unencrypted metadata is written down differently on py2
vs py3, and has been since we started supporting py3. Fortunately,
we can inspect the raw xattr bytes to determine whether the pickle
was written using py2 or py3, so we can properly read legacy py2 meta
under py3 rather than hitting a unicode error.
Closes-Bug: #2012531
Change-Id: I5876e3b88f0bb1224299b57541788f590f64ddd4
Historically, we've allowed objects to get on disk that may not have had
all of their required EC metadata. Add a new method to sanity-check
metadata in the auditor, and quarantine such invalid data.
Additionally, call validation early in the reconstructor's
reconstruct_fa() method so we do not have to attempt reconstruction for
invalid EC fragments.
Change-Id: I73551007843d27041f594923c59e6fd89caf17e5
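A minimal sketch of such a sanity check (the set of required keys shown
is an illustrative subset, not the full list the auditor validates):

```python
REQUIRED_EC_KEYS = (
    'X-Object-Sysmeta-Ec-Frag-Index',  # illustrative subset of keys
    'X-Object-Sysmeta-Ec-Etag',
)

def validate_ec_metadata(metadata):
    # Raise on fragments that reached disk without required EC keys;
    # the auditor would quarantine such data, and the reconstructor
    # would skip reconstruction for it.
    missing = [k for k in REQUIRED_EC_KEYS if k not in metadata]
    if missing:
        raise ValueError('missing EC metadata: %s' % ', '.join(missing))
```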
Currently when the replicator or auditor hits an ENODATA corruption
we bail. This patch will allow them to skip and continue in an effort
to make progress on the work that needs to be done.
Change-Id: Id4c6e31332a93eb64a7660cec2a42ef686a84d22
We've seen a disk corruption that throws -ENODATA when a diskfile
is opened. We currently don't quarantine on this IOError.
This patch catches and quarantines the datafile. The catch happens in
diskfile.open in order to group similar quarantines in the same place
for easier maintenance.
Change-Id: I729e3ba5f19160aa09dd8eb983d9594cb102e85b
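The shape of the catch can be sketched as follows (the `opener` and
`quarantine` callables are stand-ins for the real diskfile internals):

```python
import errno

def open_with_quarantine(opener, quarantine):
    # An ENODATA IOError on open/read indicates on-disk corruption:
    # quarantine the datafile rather than letting the error propagate.
    try:
        return opener()
    except IOError as e:
        if e.errno == errno.ENODATA:
            quarantine('Failed to open due to ENODATA')
            return None
        raise
```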
Previously, after reverting handoff files, the reconstructor would
only purge tombstones and data files for the reverted fragment
index. Any meta files were not purged because the partition might
also be on a primary node for a different fragment index.
For example, if, before the reconstructor visits, the object hash dir
contained:
t1#1#d.data
t1#2#d.data
t2.meta
where frag index 1 is a handoff and gets reverted, then, after the
reconstructor has visited, the hash dir should still contain:
t1#2#d.data
t2.meta
If, before the reconstructor visits, the object hash dir contained:
t1#1#d.data
t2.meta
then, after the reconstructor has visited, the hash dir would still
contain:
t2.meta
The retention of meta files is undesirable when the partition is a
"pure handoff" i.e. the node is not a primary for the partition for
any fragment index. With this patch the meta files are purged after
being reverted if the reconstructor has no sync job for the partition
(i.e. the partition is a "pure handoff") and there are no more
fragments to revert.
Change-Id: I107af3bc2d62768e063ef3176645d60ef22fa6d4
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
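The purge decision reduces to a sketch like this (names are
illustrative):

```python
def should_purge_meta(has_sync_job, remaining_frag_indexes):
    # Meta files are purged only on a "pure handoff": the reconstructor
    # has no sync job for the partition and no fragments are left to
    # revert.
    return not has_sync_job and not remaining_frag_indexes
```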
When a non-durable EC data fragment has been reverted from a handoff
node, it should be removed if its mtime is older than the
commit_window. The test on the mtime was broken [1]: an incomplete
file path was given and the test always returned False i.e. the file
was never considered old enough to remove. As a result, non-durable
files would remain on handoff nodes until their reclaim age had
passed.
[1] Related-Change: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
Change-Id: I7f6458af3ed753ef8700a456d5a977b847f17ee8
Closes-Bug: 1951598
DiskFileManager will remove any stale files during
cleanup_ondisk_files(): these include tombstones and nondurable EC
data fragments whose timestamps are older than reclaim_age. It can
usually be safely assumed that a non-durable data fragment older than
reclaim_age is not going to become durable. However, if an agent PUTs
objects with specified older X-Timestamps (for example the reconciler
or container-sync) then there is a window of time during which the
object server has written an old non-durable data file but has not yet
committed it to make it durable.
Previously, if another process (for example the reconstructor) called
cleanup_ondisk_files during this window then the non-durable data file
would be removed. The subsequent attempt to commit the data file would
then result in a traceback due to there no longer being a data file to
rename, and of course the data file is lost.
This patch modifies cleanup_ondisk_files to not remove old, otherwise
stale, non-durable data files that were only written to disk in the
preceding 'commit_window' seconds. 'commit_window' is configurable for
the object server and defaults to 60.0 seconds.
Closes-Bug: #1936508
Related-Change: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
Change-Id: I5f3318a44af64b77a63713e6ff8d0fd3b6144f13
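The guard can be sketched like this (the helper name is illustrative,
and a fresh temporary file stands in for a just-written non-durable
.data file):

```python
import os
import time

def is_reclaimable(path, commit_window=60.0):
    # Only treat an old non-durable file as removable if it was written
    # to disk more than commit_window seconds ago, so an in-flight
    # commit (the rename that makes it durable) cannot be raced.
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        return False  # already gone; nothing to remove
    return (time.time() - mtime) > commit_window
```

A file inside the window is kept even if its X-Timestamp is older than
reclaim_age, which is exactly the reconciler/container-sync case above.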
The reconstructor may revert a non-durable datafile on a handoff
concurrently with an object server PUT that is about to make the
datafile durable. This could previously lead to the reconstructor
deleting the recently written datafile before the object-server
attempts to rename it to a durable datafile, and consequently a
traceback in the object server.
The reconstructor will now only remove reverted nondurable datafiles
that are older (according to mtime) than a period set by a new
nondurable_purge_delay option (defaults to 60 seconds). More recent
nondurable datafiles may be made durable or will remain on the handoff
until a subsequent reconstructor cycle.
Change-Id: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
It doesn't explain how we saw the traceback we did (self.logger.txn_id
should have been set a few lines earlier in __call__!), but at least now
the transaction ID should come across if we're on the happy path.
Change-Id: Ifa6f81bc02c7c84ad1f4c9ff694b645348c7c6c4
Related-Bug: #1922810
If a previous partition power increase failed to cleanup all files in
their old partition locations, then during the next partition power
increase the relinker may find the same file to relink in more than
one source partition. This currently leads to an error log due to the
second relink attempt getting an EEXIST error.
With this patch, when an EEXIST is raised, the relinker will attempt
to create/verify a link from older partition power locations to the
next part power location, and if such a link is found then suppress
the error log.
During the relink step, if an alternative link is verified, and a
file is found that is neither linked to the next partition power
location nor in the current part power location, then that file is
removed. This prevents the same EEXIST occurring again during the
cleanup step, when it may no longer be possible to verify that an
alternative link exists.
For example, consider identical filenames in the N+1th, Nth and N-1th
partition power locations, with the N+1th being linked to the Nth:
- During relink, the Nth location is visited and its link is
verified. Then the N-1th location is visited and an EEXIST error
is encountered, but the new check verifies that a link exists to
the Nth location, which is OK.
- During cleanup the locations are visited in the same order, but
files are removed so that the Nth location file no longer exists
when the N-1th location is visited. If the N-1th location still
has a conflicting file then existence of an alternative link to
the Nth location can no longer be verified, so an error would be
raised. Therefore, the N-1th location file must be removed during
relink.
The error is only suppressed for tombstones. The number of partition
power locations that the relinker will look back over may be configured
using the link_check_limit option in a conf file or --link-check-limit
on the command line, and defaults to 2.
Closes-Bug: 1921718
Change-Id: If9beb9efabdad64e81d92708f862146d5fafb16c
When the replicator update method calls diskfile._get_hashes on a
local partition the partition path can be unicode [1]. If _get_hashes
does a listdir then os.listdir will return unicode values, and any new
suffixes will be written to the hashes.pkl as unicode. When the
hashes.pkl is next read, these unicode suffixes would previously fail
the valid_suffix check and read_hashes would return {'valid': False},
causing a rehash.
When _get_hashes is called from an object server REPLICATE handler,
the device path is bytes and any listdir would return suffixes as
bytes, so the suffix would then be written to hashes.pkl as bytes, and
not cause an error when read_hashes was next called.
This patch modifies diskfile.valid_suffix to allow suffixes to be
unicode or byte strings.
[1] the replication job device is set from the ring devs, and rings
can have unicode device values.
Change-Id: I5291aea2b4f1b3dbb6906f943ccf63ad8e716ef7
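A sketch of the relaxed check (Swift suffixes are three lowercase hex
characters; the real implementation differs in detail):

```python
def valid_suffix(value):
    # Accept the suffix whether it arrives as unicode (from os.listdir
    # on a unicode device path) or as bytes (from a REPLICATE handler
    # working with a bytes device path).
    if isinstance(value, bytes):
        try:
            value = value.decode('ascii')
        except UnicodeDecodeError:
            return False
    return len(value) == 3 and all(c in '0123456789abcdef' for c in value)

print(valid_suffix(b'a1b'), valid_suffix(u'a1b'), valid_suffix(u'xyz'))
# → True True False
```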
The relinker currently blindly calls get_hashes on partitions in the
new, upper half of the partition space, which can cause extra handoffs
to be created if you're rerunning the relink step for safety.
Even if the relinker were smarter, though, there's no reason that a
curious operator should end up creating empty handoffs just because
they're poking devices with REPLICATE requests.
Drive-by: be a little more strict about the "suffixes" we're willing to
put in hashes.invalid and read out of hashes.pkl.
Change-Id: I9aace80088cd00d02c418fe4d782b662fb5c8bcf
Before removing object files from the old partition ensure that all
required links exist in the new partition. Allow for the object files
in the old partition to not necessarily be among the required files
if, for example, newer files have been created in the new partition
since relinking occurred.
When required links do not already exist in the new partition an
attempt is made to create them during cleanup.
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Change-Id: I6e01ed29eb8971eeff96887efbfdfb76a01b00c3
Previously, when a diskfile was relinked in a new partition, the
diskfile.relink_paths() would test for the existence of the link
before attempting to create it. This 'test then set' might race with
another process creating the same diskfile (e.g. a concurrent PUT
which created the very same object in an old partition that the
relinker is also relinking). The race might lead to an attempt to
create a link that already exists (despite the pre-check) and an
EEXIST exception being raised.
This patch modifies relink_paths to tolerate EEXIST exceptions but
only if the existing file is a link to the target file. Otherwise the
EEXIST exception is still raised.
The 'check_existing' argument for relink_paths is removed since it is
no longer relevant.
relink_paths() might also raise an ENOENT exception if, while a
diskfile is being relinked in the 'new' partition dir, another process
PUTs a newer version of the same object and as a result cleans up the
older diskfile in the 'old' partition before the first process has
called os.link().
This patch modifies relink_paths() to tolerate ENOENT exceptions from
os.link() but only if the target path (i.e. the file in the 'old'
partition) no longer exists. Otherwise the ENOENT will still be
raised.
This patch also modifies relink_paths() to return a boolean indicating
if the hard link was created by the call to the method (True) or not
(False).
Closes-Bug: 1917658
Change-Id: I65d4b83c56699ed566fbfb7068f9d2681ca67aa3
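A simplified sketch of the tolerant relink (omitting the real
function's directory creation and logging):

```python
import errno
import os

def relink_paths(target_path, new_target_path):
    # Returns True only if this call created the link. EEXIST is
    # tolerated when the existing file is already a link to the target;
    # ENOENT is tolerated when the old-partition file was cleaned up by
    # a concurrent writer before os.link() ran.
    try:
        os.link(target_path, new_target_path)
    except OSError as e:
        if e.errno == errno.EEXIST:
            if not os.path.samefile(target_path, new_target_path):
                raise
            return False
        if e.errno == errno.ENOENT and not os.path.exists(target_path):
            return False
        raise
    return True
```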
When a diskfile is written in a new diskfile dir, then in
diskfile.relink_paths the existence of the diskfile dir is tested
before a call to makedirs. If more than one process is taking the same
action concurrently they may race such that one tests for existence,
then another calls makedirs, then the first calls makedirs and an
exception is raised. The first process therefore fails to relink the
object file, which is particularly problematic if the first process
was writing a newer object file.
This patch changes relink_paths to unconditionally attempt the
makedirs but catch and ignore any EEXIST errors.
Change-Id: Ia041f7b7dbd4e264072325b4b52718fef9966708
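The fix amounts to the standard unconditional-makedirs pattern:

```python
import errno
import os

def ensure_dirs(path):
    # Attempt makedirs unconditionally and tolerate a concurrent
    # creator: catch and ignore EEXIST instead of using the racy
    # "if not exists: makedirs" pattern.
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise
```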
Previously, ssync would not sync nor cleanup non-durable data
fragments on handoffs. When the reconstructor is syncing objects from
a handoff node (a 'revert' reconstructor job) it may be useful, and is
not harmful, to also send non-durable fragments if the receiver has
older or no fragment data.
Several changes are made to enable this. On the sending side:
- For handoff (revert) jobs, the reconstructor instantiates
SsyncSender with a new 'include_non_durable' option.
- If configured with the include_non_durable option, the SsyncSender
calls the diskfile yield_hashes function with options that allow
non-durable fragments to be yielded.
- The diskfile yield_hashes function is enhanced to include a
'durable' flag in the data structure yielded for each object.
- The SsyncSender includes the 'durable' flag in the metadata sent
during the missing_check exchange with the receiver.
- If the receiver requests the non-durable object, the SsyncSender
includes a new 'X-Backend-No-Commit' header when sending the PUT
subrequest for the object.
- The SsyncSender includes the non-durable object in the collection
of synced objects returned to the reconstructor so that the
non-durable fragment is removed from the handoff node.
On the receiving side:
- The object server includes a new 'X-Backend-Accept-No-Commit'
header in its response to SSYNC requests. This indicates to the
sender that the receiver has been upgraded to understand the
'X-Backend-No-Commit' header.
- The SsyncReceiver is enhanced to consider non-durable data when
determining if the sender's data is wanted or not.
- The object server PUT method is enhanced to check for an
'X-Backend-No-Commit' header before committing a diskfile.
If a handoff sender has both a durable and newer non-durable fragment
for the same object and frag-index, only the newer non-durable
fragment will be synced and removed on the first reconstructor
pass. The durable fragment will be synced and removed on the next
reconstructor pass.
Change-Id: I1d47b865e0a621f35d323bbed472a6cfd2a5971b
Closes-Bug: 1778002
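The sender-side header handling can be sketched as follows
(X-Backend-No-Commit is the header introduced by this change; the
frag-index header name is an assumption for illustration):

```python
def ssync_put_headers(timestamp, frag_index, durable):
    # For a non-durable fragment the sender adds X-Backend-No-Commit so
    # the receiver writes the .data file without committing it durable.
    headers = {
        'X-Timestamp': timestamp,
        'X-Object-Sysmeta-Ec-Frag-Index': str(frag_index),  # assumed name
    }
    if not durable:
        headers['X-Backend-No-Commit'] = 'True'
    return headers

print('X-Backend-No-Commit' in ssync_put_headers('123.456', 2, False))
# → True
```

The receiver only honors the header after advertising
X-Backend-Accept-No-Commit in its SSYNC response, so un-upgraded
receivers are never sent uncommittable PUTs.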
...because otherwise I'll find myself wanting to just use it
*everywhere*, which seems to defeat the purpose.
Change-Id: I88c7a9fb78890302bb71aeaf63ad587e59767726
md5 is not an approved algorithm in FIPS mode, and trying to
instantiate a hashlib.md5() will fail when the system is running in
FIPS mode.
md5 is allowed when in a non-security context. There is a plan to
add a keyword parameter (usedforsecurity) to hashlib.md5() to annotate
whether or not the instance is being used in a security context.
In the case where it is not, the instantiation of md5 will be allowed.
See https://bugs.python.org/issue9216 for more details.
Some downstream python versions already support this parameter. To
support these versions, a new encapsulation of md5() is added to
swift/common/utils.py. This encapsulation is identical to the one being
added to oslo.utils, but is recreated here to avoid adding a dependency.
This patch is to replace the instances of hashlib.md5() with this new
encapsulation, adding an annotation indicating whether the usage is
a security context or not.
While this patch seems large, it is really just the same change over and over
again. Reviewers need to pay particular attention as to whether the
keyword parameter (usedforsecurity) is set correctly. Right now, all
of them appear to be not used in a security context.
Now that all the instances have been converted, we can update the bandit
run to look for these instances and ensure that new invocations do not
creep in.
With this latest patch, the functional and unit tests all pass
on a FIPS enabled system.
Co-Authored-By: Pete Zaitcev
Change-Id: Ibb4917da4c083e1e094156d748708b87387f2d87
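The encapsulation is roughly the following (a sketch of the utils
wrapper, not its exact code):

```python
import hashlib

def md5(string=b'', usedforsecurity=True):
    # Pass the annotation through where the interpreter supports it
    # (Python 3.9+, and downstream builds that backported the keyword);
    # otherwise fall back to a plain md5().
    try:
        return hashlib.md5(string, usedforsecurity=usedforsecurity)
    except TypeError:
        return hashlib.md5(string)

print(md5(b'', usedforsecurity=False).hexdigest())
# → d41d8cd98f00b204e9800998ecf8427e
```

Callers computing ETags and ring hashes pass usedforsecurity=False,
which is what lets them run under FIPS mode.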
This only applies to post-sync REPLICATE calls, none of which actually
look at the response anyway.
Change-Id: I1de62140e7eb9a23152bb9fdb1fa0934e827bfda
In a situation where two nodes do not have the same version of a ring
and they both think the other node is the primary node of a partition,
a race condition can lead to the loss of some of the objects of the
partition.
The following sequence leads to the loss of some of the objects:
1. A gets and reloads the new ring
2. A starts to replicate/revert the partition P to node B
3. B (with the old ring) starts to replicate/revert the (partial)
partition P to node A
=> replication should be fast as all objects are already on node A
4. B finished replication of (partial) partition P to node A
5. B removes the (partial) partition P after replication succeeded
6. A finishes replication of partition P to node B
7. A removes the partition P
8. B gets and reloads the new ring
All data transferred between steps 2 and 5 will be lost, as it is no
longer on node B and has also been removed from node A.
This commit makes the replicator/reconstructor hold a replication_lock
on partition P so that the remote node cannot start an opposite
replication.
Change-Id: I29acc1302a75ed52c935f42485f775cd41648e4d
Closes-Bug: #1897177
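The locking can be sketched with a per-partition flock (the lock-file
name and context-manager shape are illustrative, not Swift's exact
implementation):

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def replication_lock(partition_path):
    # Hold an exclusive, non-blocking lock on a per-partition lock file
    # for the duration of a replication job, so a REPLICATE triggered
    # by the remote node cannot start an opposite replication of the
    # same partition concurrently.
    fd = os.open(os.path.join(partition_path, '.lock-replication'),
                 os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the flock
```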
Ordinarily, an ENOENT in _finalize_durable should mean something's gone
off the rails -- we expected to be able to mark data durable, but
couldn't!
If there are concurrent writers, though, it might actually be OK:
Client A writes .data
Client B writes .data
Client B finalizes .data *and cleans up on-disk files*
Client A tries to finalize but .data is gone
Previously, the above would cause the object server to 500, and if
enough of them did this, the client may see a 503. Now, call it good so
clients get 201s.
Change-Id: I4e322a7be23870a62aaa6acee8435598a056c544
Closes-Bug: #1719860
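The tolerant finalize can be sketched as follows (the `rename`
callable stands in for the real durable-commit rename):

```python
import errno
import os

def finalize_durable(rename, data_file_path, durable_data_file_path):
    # If the rename hits ENOENT because a concurrent writer already
    # finalized an equivalent .data file and cleaned ours up, treat it
    # as success so the client gets a 201 instead of a 500/503.
    try:
        rename(data_file_path, durable_data_file_path)
    except OSError as e:
        if e.errno == errno.ENOENT and not os.path.exists(data_file_path):
            return  # another writer won the race; the data is durable
        raise
```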
We recently started trying to use O_TMPFILE without the kernel version
check, we gracefully degrade when it isn't available, and we don't give
operators any way to avoid the O_TMPFILE attempt -- logging at warning
seems way too noisy.
Change-Id: I8f42127fd6eb930282161930c65e8614f3e3175f
Related-Change: I3599e2ab257bcd99467aee83b747939afac639d8
This is to fix the duplicated words issue, like:
* "both of which which must be stripped off"
* "In addition the the values set"
and so on
Change-Id: Id3d84281f15815b4185c76874575e91a3589981b
Either all or none of account, container, and object should be provided.
If we get some but not all, that's indicating some kind of a coding bug;
there's a chance it may be benign, but it seems safer to fail early and
loudly.
Change-Id: Ia9a0ac28bde4b5dcbf6e979c131e61297577c120
Related-Change: Ic2e29474505426dea77e178bf94d891f150d851b
If the data in a hashes.pkl is corrupted but still de-serializes
without errors, it will mess up replication and never get fixed. This
happens, for example, if one of the keys is a NULL byte.
This patch checks if the dict keys in hashes.pkl are valid strings and
invalidates it if not.
Closes-Bug: 1830881
Change-Id: I84b062d062ff49935feed0aee3e1963bb72eb5ea
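The validation amounts to a sketch like this (Swift suffix keys are
three lowercase hex characters; bookkeeping keys in the real pickle
are omitted here):

```python
def hashes_keys_valid(hashes):
    # Consider the de-serialized hashes.pkl corrupt if any suffix key
    # is not a 3-char hex string (e.g. contains a NULL byte); corrupt
    # pickles are invalidated so a fresh rehash repairs them.
    def valid_suffix(key):
        return (isinstance(key, str) and len(key) == 3
                and all(c in '0123456789abcdef' for c in key))
    return all(valid_suffix(k) for k in hashes)

print(hashes_keys_valid({'a1b': 'hash'}), hashes_keys_valid({'\x00': 'hash'}))
# → True False
```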
I'm pretty sure I botched the intent of the test when I touched it in
https://github.com/openstack/swift/commit/36c4297
As a side-effect, clean up a warning on py2:
UnicodeWarning: Unicode equal comparison failed to convert
both arguments to Unicode - interpreting them as being unequal
Change-Id: I4da8264f85e688bc5109016c197378ca6dfe06a5
In the diskfile module, there are a couple of cases where we would
quarantine the suffix dir if a single object within is not a directory.
Change the code so that we quarantine only the object.
Change-Id: I35eb16f568c00d973eb089aedd0e5a5619c25a2f