This patch plumbs the object-reconstructor stats that are dropped
into recon cache out through the middleware and swift-recon tool.
This adds a '/recon/reconstruction/object' endpoint to the middleware.
As such the swift-recon tool has grown a '-R' or '--reconstruction'
option to access this data from each node.
Plus some tests and documentation updates.
Change-Id: I98582732ca5ccb2e7d2369b53abf9aa8c0ede00c
To make it easier to have access to the sharding stats add
/recon/sharding as a recon middleware endpoint.
This allows an easy way to ask a container server for its sharding
stats using REST inside the cluster:
curl <container-server>/recon/sharding
Also add a get_recon method to the direct client so it can also be used
easily inside tooling and probe tests.
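For tooling that cannot use the direct client, the same request can be
sketched with the standard library. This is only an illustration: the
helper names recon_url and fetch_recon are hypothetical (they are not the
get_recon method added here), and the host and port are assumptions.

```python
import json
from urllib.request import urlopen


def recon_url(host, port, recon_type):
    """Build the URL of a recon endpoint, e.g. /recon/sharding."""
    return "http://%s:%d/recon/%s" % (host, port, recon_type)


def fetch_recon(host, port, recon_type, timeout=5):
    """GET a recon endpoint and decode its JSON body (needs a live server)."""
    with urlopen(recon_url(host, port, recon_type), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling fetch_recon("127.0.0.1", 6201, "sharding") would then return the
decoded stats, assuming a container server listens on that address.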
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I2a6024277d1198d8c996682682bfe28797344951
...which helps us differentiate between a drive that's not mounted vs.
not a dir better in log messages. We were already doing that a bit in
diskfile.py, and it seems like a useful distinction; let's do it more.
While we're at it, remove some log translations.
Related-Change: I941ffbc568ebfa5964d49964dc20c382a5e2ec2a
Related-Change: I3362a6ebff423016bb367b4b6b322bb41ae08764
Change-Id: Ife0d34f9482adb4524d1ab1fe6c335c6b287c2fd
Partial-Bug: 1674543
For test purposes (e.g. saio probetests) even if mount_check is False,
still require check_dir for account/container server storage when real
mount points are not used.
This behavior is consistent with the object-server's checks in diskfile.
Co-Author: Clay Gerrard <clay.gerrard@gmail.com>
Related lp bug #1693005
Related-Change-Id: I344f9daaa038c6946be11e1cf8c4ef104a09e68b
Depends-On: I52c4ecb70b1ae47e613ba243da5a4d94e5adedf2
Change-Id: I3362a6ebff423016bb367b4b6b322bb41ae08764
There were several implementations of hashing the content
of a file in cli/recon.py and common/middleware/recon.py.
This patch relocates one implementation (_hash_for_ringfile,
introduced in the Related Change) to common/utils.py and
refactors recon cli and middleware to use that function.
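A minimal sketch of the kind of shared helper described above, assuming a
chunked read so large ring files are not loaded into memory at once. The
name md5_of_file and the chunk size are illustrative and not necessarily
what common/utils.py uses.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```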
Also improves use of mocking in the unit tests to eliminate passing
custom file opener functions to the ReconMiddleware get_ring_md5
and get_swift_conf_md5 methods.
Related-Change: I9623752c3cd2361f57864f3e938e1baf5e9292d7
Change-Id: Iaad88e49aadeb28f614aafa1e9596fe07ce9793a
The actual value computed by md5 isn't that important; even in recon
it's only used as an opaque identifier that is assumed to be consistent
across nodes for the same file.
However the way these tests were written with hard coded md5 values
makes them brittle to changes in the RingData format and susceptible
to the burden of needless unrelated test maintenance churn.
e.g.
Related-Change: I23b5e0a8082b30ca257aeb1fab03ab74e6f0b2d4
Change-Id: I9623752c3cd2361f57864f3e938e1baf5e9292d7
Today recon will include normal files in the payload it returns for
/recon/unmounted and /recon/diskusage. As a result it can trigger
bogus alarms on any operations-side monitoring checking for unmounted
disks or disks that show up in diskusage with weird looking stats.
This change adds an isdir check for the entries it finds in /srv/node.
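Roughly, the check amounts to the following sketch. The function name
device_entries is hypothetical and the real middleware code may be
structured differently.

```python
import os


def device_entries(devices_root):
    """List only the directories under devices_root, skipping plain files."""
    return sorted(
        name for name in os.listdir(devices_root)
        if os.path.isdir(os.path.join(devices_root, name))
    )
```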
Change-Id: Iad72e03fdda11ff600b81b4c5d58020cc4b9048e
Closes-bug: #1556747
a1c32702, 736cf54a, and 38787d0f remove uses of `simplejson` from
various parts of Swift in favor of the standard library `json`
module (introduced in Python 2.6). This commit performs the remaining
`simplejson` to `json` replacements, removes two comments highlighting
quirks of simplejson with respect to Unicode, and removes the references
to it in setup documentation and requirements.txt.
There were a lot of places where we were importing json from
swift.common.utils, which is less intuitive than a direct `import json`,
so that replacement is made as well.
(And in two more tiny drive-bys, we add some pretty-indenting to an XML
fragment and use `super` rather than naming a base class explicitly.)
Change-Id: I769e88dda7f76ce15cf7ce930dc1874d24f9498a
Previously the recon middleware was doing a basic scan for object
rings that exist at init time. In situations where an object-server
was started without an object ring present, but received one shortly
after, recon still would not report it in the /recon/ringmd5 response.
This persists even when object-server gleefully chugs along after
picking up the ring, and recon's behavior would only be corrected by
an object-server reload/restart.
This change brings the middleware a bit more up to date to use the
common POLICIES instance to determine what policies were already loaded
based on configuration, and derives the path for each ring.
This effectively makes the config the source of truth for what rings
*should* be present, rather than what's present at startup. Since we
already dynamically check in ReconMiddleware.get_ring_md5 whether each
of the predetermined ring files exist, recon now correctly reports a
previously-missing ring whenever it falls into place.
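As an illustration only, deriving ring paths from policy indexes and
checking for them at request time could look like the sketch below. It
assumes the usual naming where policy 0 uses object.ring.gz and policy N
uses object-N.ring.gz; the function names are hypothetical and the real
code goes through the POLICIES instance instead of raw indexes.

```python
import os


def object_ring_paths(swift_dir, policy_indexes):
    """Map configured policy indexes to the ring files that should exist."""
    paths = []
    for index in policy_indexes:
        ring_name = "object" if index == 0 else "object-%d" % index
        paths.append(os.path.join(swift_dir, ring_name + ".ring.gz"))
    return paths


def existing_rings(paths):
    """Checked on every request, so a ring that shows up later is reported."""
    return [p for p in paths if os.path.exists(p)]
```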
Change-Id: Ia079418e54ffac5e01ef6a15511f5069b7fe83ea
This patch adds the count of object replication failures to recon.
In addition, "failure_nodes" is added to the Account Replicator and
Container Replicator.
Recon shows the count of object replication failures as follows:
$ curl http://<ip>:<port>/recon/replication/object
{
"replication_last": 1416334368.60865,
"replication_stats": {
"attempted": 13346,
"failure": 870,
"failure_nodes": {
"192.168.0.1": {"sdb1": 3},
"192.168.0.2": {"sdb1": 851,
"sdc1": 1,
"sdd1": 8},
"192.168.0.3": {"sdb1": 3,
"sdc1": 4}
},
"hashmatch": 0,
"remove": 0,
"rsync": 0,
"start": 1416354240.9761429,
"success": 1908
},
"replication_time": 2316.5563162644703,
"object_replication_last": 1416334368.60865,
"object_replication_time": 2316.5563162644703
}
Note that 'object_replication_last' and 'object_replication_time' are
considered to be transitional and will be removed in the subsequent
releases. Use 'replication_last' and 'replication_time' instead.
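To show how "failure_nodes" relates to the overall "failure" count, here
is a small sketch that totals the per-device counts from the example
response above. The helper name failures_per_node is hypothetical.

```python
def failures_per_node(failure_nodes):
    """Sum the per-device failure counts for each node IP."""
    return {ip: sum(devs.values()) for ip, devs in failure_nodes.items()}


# Values copied from the example response above.
failure_nodes = {
    "192.168.0.1": {"sdb1": 3},
    "192.168.0.2": {"sdb1": 851, "sdc1": 1, "sdd1": 8},
    "192.168.0.3": {"sdb1": 3, "sdc1": 4},
}
per_node = failures_per_node(failure_nodes)
total = sum(per_node.values())
```

For this example the per-node sums add up to the reported "failure"
value of 870.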
Additionally, this patch adds the count to swift-recon, where it is
shown as follows:
$ swift-recon object -r
===============================================================================
--> Starting reconnaissance on 4 hosts
===============================================================================
[2014-11-27 16:14:09] Checking on replication
[replication_failure] low: 0, high: 0, avg: 0.0, total: 0, Failed: 0.0%, no_result: 0, reported: 4
[replication_success] low: 3, high: 3, avg: 3.0, total: 12, Failed: 0.0%, no_result: 0, reported: 4
[replication_time] low: 0, high: 0, avg: 0.0, total: 0, Failed: 0.0%, no_result: 0, reported: 4
[replication_attempted] low: 1, high: 1, avg: 1.0, total: 4, Failed: 0.0%, no_result: 0, reported: 4
Oldest completion was 2014-11-27 16:09:45 (4 minutes ago) by 192.168.0.4:6002.
Most recent completion was 2014-11-27 16:14:19 (-10 seconds ago) by 192.168.0.1:6002.
===============================================================================
In a cluster where some servers run with this patch and others run
without it, swift-recon executed on a patched server prints unnecessary
information such as [failure], [success] and [attempted], because the
unpatched servers cannot send a response containing the information
this patch needs.
Therefore, once you apply this patch, also apply it to the other
servers before you execute swift-recon.
DocImpact
Change-Id: Iecd33655ae2568482833131f422679996c374d78
Co-Authored-By: Kenichiro Matsuda <matsuda_kenichi@jp.fujitsu.com>
Co-Authored-By: Brian Cline <bcline@softlayer.com>
Implements: blueprint enable-object-replication-failure-in-recon
This change adds a time call to the recon middleware and a --time
parameter to the recon CLI. This is useful for checking whether time in
the cluster is synchronized.
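One possible way to judge synchronization, sketched under the assumption
that the node reports its current time as a Unix timestamp: a remote
clock is treated as in sync if its reported time falls between the
moment the request was sent and the moment the response arrived. The
function name time_drift is hypothetical and this is not necessarily how
the CLI does it.

```python
def time_drift(remote_time, sent_at, received_at):
    """Return 0.0 if remote_time lies in [sent_at, received_at].

    Otherwise return the signed distance in seconds to the nearest bound
    (negative when the remote clock is behind, positive when ahead).
    """
    if remote_time < sent_at:
        return remote_time - sent_at
    if remote_time > received_at:
        return remote_time - received_at
    return 0.0
```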
Change-Id: I62373e681f64d0bd71f4aeb287953dd3b2ea5662
This is a follow-on from a previous commit which added recon info
for swift-drive-audit (https://review.openstack.org/#/c/122468/).
Here, the "--driveaudit" option is added to swift-recon tool. This
feature gives the statistics for the system-wide drive errors flagged
by swift-drive-audit. An example of the output is as follows:
(verbose mode)
swift-recon --driveaudit -v
===============================================================================
--> Starting reconnaissance on 5 hosts
===============================================================================
[2015-03-11 17:13:39] Checking drive-audit errors
-> http://1.2.3.4:6000/recon/driveaudit: {'drive_audit_errors': 14}
-> http://1.2.3.5:6000/recon/driveaudit: {'drive_audit_errors': 0}
-> http://1.2.3.6:6000/recon/driveaudit: {'drive_audit_errors': 37}
-> http://1.2.3.7:6000/recon/driveaudit: {'drive_audit_errors': 101}
-> http://1.2.3.8:6000/recon/driveaudit: {'drive_audit_errors': 0}
[drive_audit_errors] low: 0, high: 101, avg: 30.4, total: 152, Failed: 0.0%, no_result: 0, reported: 5
===============================================================================
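The summary line in the output above can be reproduced from the per-host
values with a sketch like this one. The helper name summarize is
hypothetical; swift-recon has its own stats code.

```python
def summarize(values):
    """Compute the low/high/avg/total figures shown in the summary line."""
    total = sum(values)
    return {
        "low": min(values),
        "high": max(values),
        "avg": total / float(len(values)),
        "total": total,
    }


# drive_audit_errors reported by the five hosts in the example output.
stats = summarize([14, 0, 37, 101, 0])
```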
Change-Id: Ia16c52a9d613eeb3de1a5a428d88dd1233631912
Since the release of Swift 2.0.0, some recon responses still do not
show per-policy information. Worse, some recon results only count
policy-0's values, so the true total is not shown in the recon results.
This patch makes the count of quarantined files policy-aware for recon
requests. For example, if policy-0 has 2 quarantined objects and
policy-1 has 3, recon sums the amounts across all policies and also
shows the count for each policy, as follows.
$ curl http://<host>:<port>/recon/quarantined
{"accounts": 0, "containers": 0, "objects": 5, "policies": {"0":
{"objects": 2}, "1": {"objects": 3}}}
Moreover, this patch adds stats for each policy in CLI output.
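The response shape above can be sketched as follows. This only
illustrates the summing; quarantined_summary is a hypothetical name and
the per-policy counts are passed in rather than read from disk.

```python
def quarantined_summary(objects_by_policy, accounts=0, containers=0):
    """Build a /recon/quarantined style payload from per-policy counts."""
    return {
        "accounts": accounts,
        "containers": containers,
        # The top-level number is the sum across every policy.
        "objects": sum(objects_by_policy.values()),
        "policies": {
            str(index): {"objects": count}
            for index, count in objects_by_policy.items()
        },
    }


summary = quarantined_summary({0: 2, 1: 3})
```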
Change-Id: I07217c635f6fc4ea809ddbc3d859c4e81c4fde37
Related-Bug: 1375327
Related-Bug: 1375332
Recon middleware returns object ring file MD5 sums; this patch
updates it to include other object files that may be present
because of Storage Policies. Also adds unit test coverage for
the MD5 reporting function which previously had none.
The recon script will now check that all rings the server responds with
match the on-disk MD5s regardless of server type, including any
storage policy object rings.
Note the small change to the ring save method, needed to
stimulate the right code paths in 2.6 and 2.7 versions of
gzip to enable testing of ring MD5 sums.
DocImpact
Implements: blueprint storage-policies
Change-Id: I01efd2999d6d9c57ee8693ac3a6236ace17c5566
I've seen several folks recently have problems with their Swift
clusters because they had different hash prefixes on different
nodes. Let's help them out by having recon check that.
Note that MD5-equality is stronger than what we need (which is
ConfigParser-equality for a particular set of keys), but this way we
don't expose the secret hash prefix and suffix across the internal
network, just the MD5 checksum of the file containing them.
Change-Id: I3af984ee45947345891b3c596a88e3464f178cc7
- swift-recon now handles parsing instances where the 'mounted' key (in
unmounted and disk_usage) is an error message instead of a bool.
- Adds checkmount exception handling to the recon unmounted endpoint.
- Updates the existing unit test to have ismount throw an error.
- Updates unit tests to cover the corner cases.
Change-Id: Id51d14a8b98de69faaac84b2b34b7404b7df69e9
swift.common.utils.ismount may raise an OSError in some special cases,
and the request against /recon/diskusage did not handle it before. With
this patch, the value of the 'mounted' key is the error's message.
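A rough sketch of that behaviour, with the mount check passed in as a
parameter so it can be exercised without real devices. The name
mount_status and the check_mount parameter are hypothetical.

```python
def mount_status(path, check_mount):
    """Return the bool from check_mount, or the error message on OSError."""
    try:
        return check_mount(path)
    except OSError as err:
        # Report the failure in place of the bool instead of raising.
        return str(err)
```

Consumers of the payload therefore have to accept either a bool or a
string for 'mounted'.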
Change-Id: I5d9018f580181e618a3fa072b7a760d41795d8eb
Closes-Bug: #1249181
except x,y: was deprecated and is removed in Python 3.x.
Use "except x as y:" instead which works in any Python
version >= 2.6.
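For illustration, the new form looks like this (read_or_none is a
made-up helper, not code from the tree):

```python
def read_or_none(path):
    """Return the file's contents, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read()
    except (IOError, OSError) as err:  # old spelling: except IOError, err:
        print("could not read %s: %s" % (path, err))
        return None
```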
Change-Id: I7008c74b807340f3457d3a0c8bd0b83f23169d14
pyflakes itself can't be used in any automated gating way, because there are
two sets of false errors it raises. However, as an exercise, cleaning up the
'valid' ones uncovered three actual bugs. The other changes (mostly unused
variables) are included here for fun.
Command run: pyflakes swift | grep -v "undefined name '_'"
Change-Id: I18696bf047dedad1a9fdbde3463e214fba95f7c6
I've been doing this with cluster-wide log searches for far too long.
This adds support for reporting the oldest replication pass
completion as well as the most recent. This is quite useful for
finding those odd replicators that have hung up for some reason and
need intervention.
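In essence this is a min/max over the per-host completion timestamps, as
in the sketch below. The function name completion_extremes and the
host-to-timestamp mapping are assumptions made for the example.

```python
def completion_extremes(last_completed):
    """Return ((host, ts) of oldest pass, (host, ts) of most recent pass).

    last_completed maps a host to the Unix timestamp of its last
    completed replication pass.
    """
    items = list(last_completed.items())
    oldest = min(items, key=lambda item: item[1])
    newest = max(items, key=lambda item: item[1])
    return oldest, newest
```

A host whose oldest completion lags far behind the rest is a candidate
for a hung replicator.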
Change-Id: I7fd7260eca162d6b085f3e82aaa3cf90670f2d53
This change replaces WebOb with a mostly compatible local library,
swift.common.swob. Subtle changes to WebOb's API over the years have been a
huge headache. Swift doesn't even run on the current version.
There are a few incompatibilities to simplify the implementation/interface:
* It only implements the header properties we use. More can be easily added.
* Casts header values to str on assignment.
* Response classes ("HTTPNotFound") are no longer subclasses, but partials
on Response, so things like isinstance no longer work on them.
* Unlike newer webob versions, will never return unicode objects.
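The isinstance point can be seen with a toy stand-in. The Response class
below is a simplified assumption made for this example, not the real
swift.common.swob class.

```python
from functools import partial


class Response(object):
    """Toy response type, only for demonstrating the partial pattern."""

    def __init__(self, status=200, body=""):
        self.status = status
        self.body = body


# A "response class" that is really a partial on Response.
HTTPNotFound = partial(Response, status=404)

resp = HTTPNotFound(body="missing")
is_response = isinstance(resp, Response)

# A partial is not a class, so it cannot be the second isinstance argument.
try:
    isinstance(resp, HTTPNotFound)
except TypeError:
    partial_is_a_class = False
else:
    partial_is_a_class = True
```

Code that needs to tell responses apart can compare the status instead.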
Change-Id: I76617a0903ee2286b25a821b3c935c86ff95233f
Make sure that empty but still valid results (like no unmounted drives)
aren't treated as 500 errors.
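The general idea, sketched with a hypothetical helper: test for a missing
result explicitly rather than relying on truthiness, since an empty list
is falsy but still a valid answer.

```python
def status_for(content):
    """Pick an HTTP status: only a missing (None) result is an error."""
    # "if not content" would wrongly turn [] or {} into a 500.
    return 500 if content is None else 200
```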
Change-Id: I9588e2711d7916406f15613d5a26b9f0cf38235a
Expand recon middleware to include support for account and container
servers in addition to the existing object servers. Also add support
for retrieving recent information from auditors, replicators, and
updaters. In the case of certain checks (such as container auditors)
the stats returned are only for the most recent path processed.
The middleware has also been refactored and should now also handle
errors better in cases where stats are unavailable.
While new checks have been added, the output from pre-existing
checks has not changed. This should allow existing 3rd party
utilities such as the Swift ZenPack to continue to function.
Change-Id: Ib9893a77b9b8a2f03179f2a73639bc4a6e264df7
My first stab at unittests for the recon middleware.
Also, made some minor changes to the middleware to make testing
easier now and down the road.
Change-Id: I23ce853398ff035ffbfc2082e90e22038832b966