Currently, when the object-server serves a GET request and the
DiskFile reader iterates over disk file chunks, there is no explicit
eventlet sleep. When the network outpaces slow disk IO, a single
large, slow GET request can prevent the eventlet hub from scheduling
any other green threads for a long period of time. To improve this,
this patch adds a configurable sleep to the DiskFile reader: a new
'cooperative_period' parameter with a default value of 0 (disabled).
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I80b04bad0601b6cd6caef35498f89d4ba70a4fd4
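The effect of the new parameter can be sketched with a plain generator
(eventlet itself is omitted; the `sleep` callback stands in for
`eventlet.sleep(0)`, and the name `cooperative_iter` is illustrative):

```python
def cooperative_iter(chunks, period, sleep=lambda: None):
    # Yield control back to the hub after every `period` chunks.
    # period=0 disables the cooperative sleep, matching the default.
    for i, chunk in enumerate(chunks, 1):
        if period > 0 and i % period == 0:
            sleep()  # in Swift this would be eventlet.sleep(0)
        yield chunk

calls = []
list(cooperative_iter([b'x'] * 10, period=3, sleep=lambda: calls.append(1)))
print(len(calls))  # → 3
```

With period=3, a 10-chunk read yields to the hub three times instead of
monopolizing it for the whole transfer.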
Previously, clients could bypass fallocate_reserve checks by uploading
with `Transfer-Encoding: chunked` rather than sending a `Content-Length`.
Now, a chunked transfer may still push a disk past the reserve threshold,
but once over the threshold, further PUTs and POSTs will 507. DELETEs
will still be allowed.
Closes-Bug: #2031049
Change-Id: I69ec7193509cd3ed0aa98aca15190468368069a5
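The reserve check amounts to a free-space test that must apply to every
write, whether or not a Content-Length was supplied up front (the
function name and byte-based threshold here are illustrative; Swift's
fallocate_reserve can also be expressed as a percentage):

```python
import os

def has_reserve_space(device_path, reserve_bytes):
    # Refuse the write (507 Insufficient Storage) once free space on
    # the device would fall below the configured reserve.
    st = os.statvfs(device_path)
    return st.f_bavail * st.f_frsize >= reserve_bytes
```

A chunked PUT has no Content-Length to check before writing, so the
test has to be applied as data arrives rather than only at open time.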
This is intended to be purely internal to our diskfile implementation,
allowing us to have more context about whether we're writing data,
metadata, or a tombstone when calling fallocate in DiskFileWriter.open.
Change-Id: I8c4214080a6bd73787690f75b8c150c999d948fe
The stack showing how we call self._logger.error isn't nearly as
interesting as the stack for the caught exception.
Change-Id: I6bdff90f9d3d26a46b1f19c87db0f3c60260c5b7
Replicated, unencrypted metadata is written down differently on py2
vs py3, and has been since we started supporting py3. Fortunately,
we can inspect the raw xattr bytes to determine whether the pickle
was written using py2 or py3, so we can properly read legacy py2 meta
under py3 rather than hitting a unicode error.
Closes-Bug: #2012531
Change-Id: I5876e3b88f0bb1224299b57541788f590f64ddd4
Historically, we've allowed objects to get on disk that may not have had
all of their required EC metadata. Add a new method to sanity-check
metadata in the auditor, and quarantine such invalid data.
Additionally, call validation early in the reconstructor's
reconstruct_fa() method so we do not have to attempt reconstruction for
invalid EC fragments.
Change-Id: I73551007843d27041f594923c59e6fd89caf17e5
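A minimal sketch of such a sanity check (the set of required keys shown
is an illustrative subset, not the full list the auditor validates):

```python
REQUIRED_EC_KEYS = (
    'X-Object-Sysmeta-Ec-Frag-Index',  # illustrative subset of keys
    'X-Object-Sysmeta-Ec-Etag',
)

def validate_ec_metadata(metadata):
    # Raise on fragments that reached disk without required EC keys;
    # the auditor would quarantine such data, and the reconstructor
    # would skip reconstruction for it.
    missing = [k for k in REQUIRED_EC_KEYS if k not in metadata]
    if missing:
        raise ValueError('missing EC metadata: %s' % ', '.join(missing))
```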
Currently when the replicator or auditor hits an ENODATA corruption
we bail. This patch will allow them to skip and continue in an effort
to make progress on the work that needs to be done.
Change-Id: Id4c6e31332a93eb64a7660cec2a42ef686a84d22
We've seen a disk corruption that throws -ENODATA when a diskfile
is opened. We currently don't quarantine on this IOError.
This patch catches and quarantines the datafile. The catch happens in
diskfile.open in order to group similar quarantines in the same place
for easier maintenance.
Change-Id: I729e3ba5f19160aa09dd8eb983d9594cb102e85b
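The shape of the catch can be sketched as follows (the `opener` and
`quarantine` callables are stand-ins for the real diskfile internals):

```python
import errno

def open_with_quarantine(opener, quarantine):
    # An ENODATA IOError on open/read indicates on-disk corruption:
    # quarantine the datafile rather than letting the error propagate.
    try:
        return opener()
    except IOError as e:
        if e.errno == errno.ENODATA:
            quarantine('Failed to open due to ENODATA')
            return None
        raise
```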
Previously, after reverting handoff files, the reconstructor would
only purge tombstones and data files for the reverted fragment
index. Any meta files were not purged because the partition might
also be on a primary node for a different fragment index.
For example, if, before the reconstructor visits, the object hash dir
contained:
t1#1#d.data
t1#2#d.data
t2.meta
where frag index 1 is a handoff and gets reverted, then, after the
reconstructor has visited, the hash dir should still contain:
t1#2#d.data
t2.meta
If, before the reconstructor visits, the object hash dir contained:
t1#1#d.data
t2.meta
then, after the reconstructor has visited, the hash dir would still
contain:
t2.meta
The retention of meta files is undesirable when the partition is a
"pure handoff" i.e. the node is not a primary for the partition for
any fragment index. With this patch the meta files are purged after
being reverted if the reconstructor has no sync job for the partition
(i.e. the partition is a "pure handoff") and there are no more
fragments to revert.
Change-Id: I107af3bc2d62768e063ef3176645d60ef22fa6d4
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
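The purge decision reduces to a sketch like this (names are
illustrative):

```python
def should_purge_meta(has_sync_job, remaining_frag_indexes):
    # Meta files are purged only on a "pure handoff": the reconstructor
    # has no sync job for the partition and no fragments are left to
    # revert.
    return not has_sync_job and not remaining_frag_indexes
```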
When a non-durable EC data fragment has been reverted from a handoff
node, it should be removed if its mtime is older than the
commit_window. The test on the mtime was broken [1]: an incomplete
file path was given and the test always returned False i.e. the file
was never considered old enough to remove. As a result, non-durable
files would remain on handoff nodes until their reclaim age had
passed.
[1] Related-Change: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
Change-Id: I7f6458af3ed753ef8700a456d5a977b847f17ee8
Closes-Bug: 1951598
DiskFileManager will remove any stale files during
cleanup_ondisk_files(): these include tombstones and nondurable EC
data fragments whose timestamps are older than reclaim_age. It can
usually be safely assumed that a non-durable data fragment older than
reclaim_age is not going to become durable. However, if an agent PUTs
objects with specified older X-Timestamps (for example the reconciler
or container-sync) then there is a window of time during which the
object server has written an old non-durable data file but has not yet
committed it to make it durable.
Previously, if another process (for example the reconstructor) called
cleanup_ondisk_files during this window then the non-durable data file
would be removed. The subsequent attempt to commit the data file would
then result in a traceback due to there no longer being a data file to
rename, and of course the data file is lost.
This patch modifies cleanup_ondisk_files to not remove old, otherwise
stale, non-durable data files that were only written to disk in the
preceding 'commit_window' seconds. 'commit_window' is configurable for
the object server and defaults to 60.0 seconds.
Closes-Bug: #1936508
Related-Change: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
Change-Id: I5f3318a44af64b77a63713e6ff8d0fd3b6144f13
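The guard can be sketched like this (the helper name is illustrative,
and a fresh temporary file stands in for a just-written non-durable
.data file):

```python
import os
import time

def is_reclaimable(path, commit_window=60.0):
    # Only treat an old non-durable file as removable if it was written
    # to disk more than commit_window seconds ago, so an in-flight
    # commit (the rename that makes it durable) cannot be raced.
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        return False  # already gone; nothing to remove
    return (time.time() - mtime) > commit_window
```

A file inside the window is kept even if its X-Timestamp is older than
reclaim_age, which is exactly the reconciler/container-sync case above.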
The reconstructor may revert a non-durable datafile on a handoff
concurrently with an object server PUT that is about to make the
datafile durable. This could previously lead to the reconstructor
deleting the recently written datafile before the object-server
attempts to rename it to a durable datafile, and consequently a
traceback in the object server.
The reconstructor will now only remove reverted nondurable datafiles
that are older (according to mtime) than a period set by a new
nondurable_purge_delay option (defaults to 60 seconds). More recent
nondurable datafiles may be made durable or will remain on the handoff
until a subsequent reconstructor cycle.
Change-Id: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
It doesn't explain how we saw the traceback we did (self.logger.txn_id
should have been set a few lines earlier in __call__!), but at least now
the transaction ID should come across if we're on the happy path.
Change-Id: Ifa6f81bc02c7c84ad1f4c9ff694b645348c7c6c4
Related-Bug: #1922810
If a previous partition power increase failed to cleanup all files in
their old partition locations, then during the next partition power
increase the relinker may find the same file to relink in more than
one source partition. This currently leads to an error log due to the
second relink attempt getting an EEXIST error.
With this patch, when an EEXIST is raised, the relinker will attempt
to create/verify a link from older partition power locations to the
next part power location, and if such a link is found then suppress
the error log.
During the relink step, if an alternative link is verified, and a
file is found that is neither linked to the next partition power
location nor in the current part power location, then that file is
removed. This prevents the same EEXIST occurring again during the
cleanup step, when it may no longer be possible to verify that an
alternative link exists.
For example, consider identical filenames in the N+1th, Nth and N-1th
partition power locations, with the N+1th being linked to the Nth:
- During relink, the Nth location is visited and its link is
verified. Then the N-1th location is visited and an EEXIST error
is encountered, but the new check verifies that a link exists to
the Nth location, which is OK.
- During cleanup the locations are visited in the same order, but
files are removed so that the Nth location file no longer exists
when the N-1th location is visited. If the N-1th location still
has a conflicting file then existence of an alternative link to
the Nth location can no longer be verified, so an error would be
raised. Therefore, the N-1th location file must be removed during
relink.
The error is only suppressed for tombstones. The number of partition
power locations that the relinker will look back over may be configured
using the link_check_limit option in a conf file or --link-check-limit
on the command line, and defaults to 2.
Closes-Bug: 1921718
Change-Id: If9beb9efabdad64e81d92708f862146d5fafb16c
When the replicator update method calls diskfile._get_hashes on a
local partition the partition path can be unicode [1]. If _get_hashes
does a listdir then os.listdir will return unicode values, and any new
suffixes will be written to the hashes.pkl as unicode. When the
hashes.pkl is next read, these unicode suffixes would previously fail
the valid_suffix check and read_hashes would return {'valid': False},
causing a rehash.
When _get_hashes is called from an object server REPLICATE handler,
the device path is bytes and any listdir would return suffixes as
bytes, so the suffix would then be written to hashes.pkl as bytes, and
not cause an error when read_hashes was next called.
This patch modifies diskfile.valid_suffix to allow suffixes to be
unicode or byte strings.
[1] the replication job device is set from the ring devs, and rings
can have unicode device values.
Change-Id: I5291aea2b4f1b3dbb6906f943ccf63ad8e716ef7
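A sketch of the relaxed check (Swift suffixes are three lowercase hex
characters; the real implementation differs in detail):

```python
def valid_suffix(value):
    # Accept the suffix whether it arrives as unicode (from os.listdir
    # on a unicode device path) or as bytes (from a REPLICATE handler
    # working with a bytes device path).
    if isinstance(value, bytes):
        try:
            value = value.decode('ascii')
        except UnicodeDecodeError:
            return False
    return len(value) == 3 and all(c in '0123456789abcdef' for c in value)

print(valid_suffix(b'a1b'), valid_suffix(u'a1b'), valid_suffix(u'xyz'))
# → True True False
```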
The relinker currently blindly calls get_hashes on partitions in the
new, upper half of the partition space, which can cause extra handoffs
to be created if you're rerunning the relink step for safety.
Even if the relinker were smarter, though, there's no reason that a
curious operator should end up creating empty handoffs just because
they're poking devices with REPLICATE requests.
Drive-by: be a little more strict about the "suffixes" we're willing to
put in hashes.invalid and read out of hashes.pkl.
Change-Id: I9aace80088cd00d02c418fe4d782b662fb5c8bcf
Before removing object files from the old partition ensure that all
required links exist in the new partition. Allow for the object files
in the old partition to not necessarily be among the required files
if, for example, newer files have been created in the new partition
since relinking occurred.
When required links do not already exist in the new partition an
attempt is made to create them during cleanup.
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Change-Id: I6e01ed29eb8971eeff96887efbfdfb76a01b00c3
Previously, when a diskfile was relinked in a new partition, the
diskfile.relink_paths() would test for the existence of the link
before attempting to create it. This 'test then set' might race with
another process creating the same diskfile (e.g. a concurrent PUT
which created the very same object in an old partition that the
relinker is also relinking). The race might lead to an attempt to
create a link that already exists (despite the pre-check) and an
EEXIST exception being raised.
This patch modifies relink_paths to tolerate EEXIST exceptions but
only if the existing file is a link to the target file. Otherwise the
EEXIST exception is still raised.
The 'check_existing' argument for relink_paths is removed since it is
no longer relevant.
relink_paths() might also raise an ENOENT exception if, while a
diskfile is being relinked in the 'new' partition dir, another process
PUTs a newer version of the same object and as a result cleans up the
older diskfile in the 'old' partition before the first process has
called os.link().
This patch modifies relink_paths() to tolerate ENOENT exceptions from
os.link() but only if the target path (i.e. the file in the 'old'
partition) no longer exists. Otherwise the ENOENT will still be
raised.
This patch also modifies relink_paths() to return a boolean indicating
if the hard link was created by the call to the method (True) or not
(False).
Closes-Bug: 1917658
Change-Id: I65d4b83c56699ed566fbfb7068f9d2681ca67aa3
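A simplified sketch of the tolerant relink (omitting the real
function's directory creation and logging):

```python
import errno
import os

def relink_paths(target_path, new_target_path):
    # Returns True only if this call created the link. EEXIST is
    # tolerated when the existing file is already a link to the target;
    # ENOENT is tolerated when the old-partition file was cleaned up by
    # a concurrent writer before os.link() ran.
    try:
        os.link(target_path, new_target_path)
    except OSError as e:
        if e.errno == errno.EEXIST:
            if not os.path.samefile(target_path, new_target_path):
                raise
            return False
        if e.errno == errno.ENOENT and not os.path.exists(target_path):
            return False
        raise
    return True
```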
When a diskfile is written in a new diskfile dir, then in
diskfile.relink_paths the existence of the diskfile dir is tested
before a call to makedirs. If more than one process is taking the same
action concurrently they may race such that one tests for existence,
then another calls makedirs, then the first calls makedirs and an
exception is raised. The first process therefore fails to relink the
object file, which is particularly problematic if the first process
was writing a newer object file.
This patch changes relink_paths to unconditionally attempt the
makedirs but catch and ignore any EEXIST errors.
Change-Id: Ia041f7b7dbd4e264072325b4b52718fef9966708
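The fix amounts to the standard unconditional-makedirs pattern:

```python
import errno
import os

def ensure_dirs(path):
    # Attempt makedirs unconditionally and tolerate a concurrent
    # creator: catch and ignore EEXIST instead of using the racy
    # "if not exists: makedirs" pattern.
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise
```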
Previously, ssync would not sync nor cleanup non-durable data
fragments on handoffs. When the reconstructor is syncing objects from
a handoff node (a 'revert' reconstructor job) it may be useful, and is
not harmful, to also send non-durable fragments if the receiver has
older or no fragment data.
Several changes are made to enable this. On the sending side:
- For handoff (revert) jobs, the reconstructor instantiates
SsyncSender with a new 'include_non_durable' option.
- If configured with the include_non_durable option, the SsyncSender
calls the diskfile yield_hashes function with options that allow
non-durable fragments to be yielded.
- The diskfile yield_hashes function is enhanced to include a
'durable' flag in the data structure yielded for each object.
- The SsyncSender includes the 'durable' flag in the metadata sent
during the missing_check exchange with the receiver.
- If the receiver requests the non-durable object, the SsyncSender
includes a new 'X-Backend-No-Commit' header when sending the PUT
subrequest for the object.
- The SsyncSender includes the non-durable object in the collection
of synced objects returned to the reconstructor so that the
non-durable fragment is removed from the handoff node.
On the receiving side:
- The object server includes a new 'X-Backend-Accept-No-Commit'
header in its response to SSYNC requests. This indicates to the
sender that the receiver has been upgraded to understand the
'X-Backend-No-Commit' header.
- The SsyncReceiver is enhanced to consider non-durable data when
determining if the sender's data is wanted or not.
- The object server PUT method is enhanced to check for an
'X-Backend-No-Commit' header before committing a diskfile.
If a handoff sender has both a durable and newer non-durable fragment
for the same object and frag-index, only the newer non-durable
fragment will be synced and removed on the first reconstructor
pass. The durable fragment will be synced and removed on the next
reconstructor pass.
Change-Id: I1d47b865e0a621f35d323bbed472a6cfd2a5971b
Closes-Bug: 1778002
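The sender-side header handling can be sketched as follows
(X-Backend-No-Commit is the header introduced by this change; the
frag-index header name is an assumption for illustration):

```python
def ssync_put_headers(timestamp, frag_index, durable):
    # For a non-durable fragment the sender adds X-Backend-No-Commit so
    # the receiver writes the .data file without committing it durable.
    headers = {
        'X-Timestamp': timestamp,
        'X-Object-Sysmeta-Ec-Frag-Index': str(frag_index),  # assumed name
    }
    if not durable:
        headers['X-Backend-No-Commit'] = 'True'
    return headers

print('X-Backend-No-Commit' in ssync_put_headers('123.456', 2, False))
# → True
```

The receiver only honors the header after advertising
X-Backend-Accept-No-Commit in its SSYNC response, so un-upgraded
receivers are never sent uncommittable PUTs.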
...because otherwise I'll find myself wanting to just use it
*everywhere*, which seems to defeat the purpose.
Change-Id: I88c7a9fb78890302bb71aeaf63ad587e59767726
md5 is not an approved algorithm in FIPS mode, and trying to
instantiate a hashlib.md5() will fail when the system is running in
FIPS mode.
md5 is allowed when in a non-security context. There is a plan to
add a keyword parameter (usedforsecurity) to hashlib.md5() to annotate
whether or not the instance is being used in a security context.
In the case where it is not, the instantiation of md5 will be allowed.
See https://bugs.python.org/issue9216 for more details.
Some downstream python versions already support this parameter. To
support these versions, a new encapsulation of md5() is added to
swift/common/utils.py. This encapsulation is identical to the one being
added to oslo.utils, but is recreated here to avoid adding a dependency.
This patch is to replace the instances of hashlib.md5() with this new
encapsulation, adding an annotation indicating whether the usage is
a security context or not.
While this patch seems large, it is really just the same change over and over
again. Reviewers need to pay particular attention as to whether the
keyword parameter (usedforsecurity) is set correctly. Right now, all
of them appear to be not used in a security context.
Now that all the instances have been converted, we can update the bandit
run to look for these instances and ensure that new invocations do not
creep in.
With this latest patch, the functional and unit tests all pass
on a FIPS enabled system.
Co-Authored-By: Pete Zaitcev
Change-Id: Ibb4917da4c083e1e094156d748708b87387f2d87
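The encapsulation is roughly the following (a sketch of the utils
wrapper, not its exact code):

```python
import hashlib

def md5(string=b'', usedforsecurity=True):
    # Pass the annotation through where the interpreter supports it
    # (Python 3.9+, and downstream builds that backported the keyword);
    # otherwise fall back to a plain md5().
    try:
        return hashlib.md5(string, usedforsecurity=usedforsecurity)
    except TypeError:
        return hashlib.md5(string)

print(md5(b'', usedforsecurity=False).hexdigest())
# → d41d8cd98f00b204e9800998ecf8427e
```

Callers computing ETags and ring hashes pass usedforsecurity=False,
which is what lets them run under FIPS mode.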
This only applies to post-sync REPLICATE calls, none of which actually
look at the response anyway.
Change-Id: I1de62140e7eb9a23152bb9fdb1fa0934e827bfda
In a situation where two nodes do not have the same version of a ring
and they both think the other node is the primary node of a partition,
a race condition can lead to the loss of some of the objects of the
partition.
The following sequence leads to the loss of some of the objects:
1. A gets and reloads the new ring
2. A starts to replicate/revert the partition P to node B
3. B (with the old ring) starts to replicate/revert the (partial)
partition P to node A
=> replication should be fast as all objects are already on node A
4. B finished replication of (partial) partition P to node A
5. B removes the (partial) partition P after replication succeeded
6. A finishes replication of partition P to node B
7. A removes the partition P
8. B gets and reloads the new ring
All data transferred between steps 2 and 5 will be lost, as it is no
longer on node B and has also been removed from node A.
This commit makes the replicator/reconstructor hold a replication_lock
on partition P so that the remote node cannot start an opposite
replication.
Change-Id: I29acc1302a75ed52c935f42485f775cd41648e4d
Closes-Bug: #1897177
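The locking can be sketched with a per-partition flock (the lock-file
name and context-manager shape are illustrative, not Swift's exact
implementation):

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def replication_lock(partition_path):
    # Hold an exclusive, non-blocking lock on a per-partition lock file
    # for the duration of a replication job, so a REPLICATE triggered
    # by the remote node cannot start an opposite replication of the
    # same partition concurrently.
    fd = os.open(os.path.join(partition_path, '.lock-replication'),
                 os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the flock
```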
Ordinarily, an ENOENT in _finalize_durable should mean something's gone
off the rails -- we expected to be able to mark data durable, but
couldn't!
If there are concurrent writers, though, it might actually be OK:
Client A writes .data
Client B writes .data
Client B finalizes .data *and cleans up on-disk files*
Client A tries to finalize but .data is gone
Previously, the above would cause the object server to 500, and if
enough of them did this, the client may see a 503. Now, call it good so
clients get 201s.
Change-Id: I4e322a7be23870a62aaa6acee8435598a056c544
Closes-Bug: #1719860
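The tolerant finalize can be sketched as follows (the `rename`
callable stands in for the real durable-commit rename):

```python
import errno
import os

def finalize_durable(rename, data_file_path, durable_data_file_path):
    # If the rename hits ENOENT because a concurrent writer already
    # finalized an equivalent .data file and cleaned ours up, treat it
    # as success so the client gets a 201 instead of a 500/503.
    try:
        rename(data_file_path, durable_data_file_path)
    except OSError as e:
        if e.errno == errno.ENOENT and not os.path.exists(data_file_path):
            return  # another writer won the race; the data is durable
        raise
```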
We recently started trying to use O_TMPFILE without the kernel version
check, we gracefully degrade when it isn't available, and we don't give
operators any way to avoid the O_TMPFILE attempt -- logging at warning
seems way too noisy.
Change-Id: I8f42127fd6eb930282161930c65e8614f3e3175f
Related-Change: I3599e2ab257bcd99467aee83b747939afac639d8
This is to fix the duplicated words issue, like:
* "both of which which must be stripped off"
* "In addition the the values set"
and so on
Change-Id: Id3d84281f15815b4185c76874575e91a3589981b
Either all or none of account, container, and object should be provided.
If we get some but not all, that's indicating some kind of a coding bug;
there's a chance it may be benign, but it seems safer to fail early and
loudly.
Change-Id: Ia9a0ac28bde4b5dcbf6e979c131e61297577c120
Related-Change: Ic2e29474505426dea77e178bf94d891f150d851b
If the data in a hashes.pkl is corrupted but still de-serializes
without errors, it will mess up replication and never get fixed. This
happens, for example, if one of the keys is a NULL byte.
This patch checks if the dict keys in hashes.pkl are valid strings and
invalidates it if not.
Closes-Bug: 1830881
Change-Id: I84b062d062ff49935feed0aee3e1963bb72eb5ea
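The validation amounts to a sketch like this (Swift suffix keys are
three lowercase hex characters; bookkeeping keys in the real pickle
are omitted here):

```python
def hashes_keys_valid(hashes):
    # Consider the de-serialized hashes.pkl corrupt if any suffix key
    # is not a 3-char hex string (e.g. contains a NULL byte); corrupt
    # pickles are invalidated so a fresh rehash repairs them.
    def valid_suffix(key):
        return (isinstance(key, str) and len(key) == 3
                and all(c in '0123456789abcdef' for c in key))
    return all(valid_suffix(k) for k in hashes)

print(hashes_keys_valid({'a1b': 'hash'}), hashes_keys_valid({'\x00': 'hash'}))
# → True False
```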
I'm pretty sure I botched the intent of the test when I touched it in
https://github.com/openstack/swift/commit/36c4297
As a side-effect, clean up a warning on py2:
UnicodeWarning: Unicode equal comparison failed to convert
both arguments to Unicode - interpreting them as being unequal
Change-Id: I4da8264f85e688bc5109016c197378ca6dfe06a5
In the diskfile module, there are a couple of cases where we would
quarantine the suffix dir if a single object within is not a directory.
Change the code so that we quarantine only the object.
Change-Id: I35eb16f568c00d973eb089aedd0e5a5619c25a2f