50 Commits

Author SHA1 Message Date
Matthew Oliver 7a105b5ef0 Add and pipe reconstructor stats through recon
This patch plumbs the object-reconstructor stats that are dropped
into recon cache out through the middleware and swift-recon tool.

This adds a '/recon/reconstruction/object' endpoint to the middleware.
As such the swift-recon tool has grown a '-R' or '--reconstruction'
option to access this data from each node.

Plus some tests and documentation updates.

Change-Id: I98582732ca5ccb2e7d2369b53abf9aa8c0ede00c
2021-08-20 00:03:40 +00:00
Matthew Oliver 85e36f7122 recon: refactor common recon names into a common location
Change-Id: I0a0766cfb6672377de0f152ce179c874c327ec54
2021-06-29 15:22:57 -07:00
Tim Burke 39ad468dfe Add async_pending_last time to object.recon
The async_pending count isn't nearly as useful when we don't know how
out of date it is.

Change-Id: I3e5e904ffc0eba7a7e141e1c2d9f9840e4952041
2021-06-15 08:12:05 -07:00
Matthew Oliver 4ce907a4ae relinker: Add /recon/relinker endpoint and drop progress stats
To further benefit the stats capturing for the relinker, drop partition
progress to a new relinker.recon recon cache and add a new recon endpoint:

  GET /recon/relinker

To get live relinking progress data:

  $ curl http://127.0.0.3:6030/recon/relinker |python -mjson.tool
  {
      "devices": {
          "sdb3": {
              "parts_done": 523,
              "policies": {
                  "1": {
                      "next_part_power": 11,
                      "start_time": 1618998724.845616,
                      "stats": {
                          "errors": 0,
                          "files": 1630,
                          "hash_dirs": 1630,
                          "linked": 1630,
                          "policies": 1,
                          "removed": 0
                      },
                      "timestamp": 1618998730.24672,
                      "total_parts": 1029,
                      "total_time": 5.400741815567017
                  }},
              "start_time": 1618998724.845946,
              "stats": {
                  "errors": 0,
                  "files": 836,
                  "hash_dirs": 836,
                  "linked": 836,
                  "removed": 0
              },
              "timestamp": 1618998730.24672,
              "total_parts": 523,
              "total_time": 5.400741815567017
          },
          "sdb7": {
              "parts_done": 506,
              "policies": {
                  "1": {
                      "next_part_power": 11,
                      "part_power": 10,
                      "parts_done": 506,
                      "start_time": 1618998724.845616,
                      "stats": {
                          "errors": 0,
                          "files": 794,
                          "hash_dirs": 794,
                          "linked": 794,
                          "removed": 0
                      },
                      "step": "relink",
                      "timestamp": 1618998730.166175,
                      "total_parts": 506,
                      "total_time": 5.320528984069824
                  }
              },
              "start_time": 1618998724.845616,
              "stats": {
                  "errors": 0,
                  "files": 794,
                  "hash_dirs": 794,
                  "linked": 794,
                  "removed": 0
              },
              "timestamp": 1618998730.166175,
              "total_parts": 506,
              "total_time": 5.320528984069824
          }
      },
      "workers": {
          "100": {
              "drives": ["sda1"],
              "return_code": 0,
              "timestamp": 1618998730.166175}
      }}
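
A minimal sketch of how the payload above could be reduced to a per-device completion fraction. The function name `relinker_progress` is hypothetical and not part of Swift; only the `devices`, `parts_done` and `total_parts` keys are taken from the example response.

```python
def relinker_progress(recon_payload):
    """Return {device: fraction of partitions done} for a /recon/relinker body."""
    progress = {}
    for device, info in recon_payload.get("devices", {}).items():
        total = info.get("total_parts", 0)
        done = info.get("parts_done", 0)
        # Guard against a device that has not reported any partitions yet.
        progress[device] = done / total if total else 0.0
    return progress


# Trimmed copy of the example response shown above.
sample = {
    "devices": {
        "sdb3": {"parts_done": 523, "total_parts": 523},
        "sdb7": {"parts_done": 506, "total_parts": 506},
    }
}
print(relinker_progress(sample))
```

The body would normally come from parsing the JSON returned by the curl call; fetching is left out so the sketch runs without a cluster.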

Also, add a constant DEFAULT_RECON_CACHE_PATH to help fix failing tests
by mocking recon_cache_path, so that errors are not logged due
to dump_recon_cache exceptions.

Mock recon_cache_path more widely and assert no error logs more
widely.

Change-Id: I625147dadd44f008a7c48eb5d6ac1c54c4c0ef05
2021-05-10 16:13:32 +01:00
Tim Burke f2a4c50dce Include sharding cycle time in recon
Change-Id: Id7e828a56c8a62a1f3e9a1dbbff5a56c928ac6b8
2021-04-25 15:11:49 +00:00
Zuul 75e86425b9 Merge "Plumb sharding stats though recon middleware" 2021-02-26 18:42:21 +00:00
Matthew Oliver b1309c95e5 Plumb sharding stats though recon middleware
To make it easier to have access to the sharding stats add
/recon/sharding as a recon middleware endpoint.

This allows an easy way to ask a container server for its sharding
stats using REST inside the cluster:

  curl <container-server>/recon/sharding

Also add a get_recon method to the direct client so it can also be used
easily inside tooling and probe tests.

Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I2a6024277d1198d8c996682682bfe28797344951
2021-02-26 15:51:06 +00:00
Alistair Coles 9eac76258a Trivial fixes in recon middleware
Fix docstring and remove unused mount_check var.

Change-Id: I30ead8b72cb616d1311ffc81d9cfecb1afc9a05e
2021-02-24 11:02:47 +00:00
Tim Burke f192f51d37 Have check_drive raise ValueError on errors
...which helps us differentiate between a drive that's not mounted vs.
not a dir better in log messages. We were already doing that a bit in
diskfile.py, and it seems like a useful distinction; let's do it more.

While we're at it, remove some log translations.

Related-Change: I941ffbc568ebfa5964d49964dc20c382a5e2ec2a
Related-Change: I3362a6ebff423016bb367b4b6b322bb41ae08764
Change-Id: Ife0d34f9482adb4524d1ab1fe6c335c6b287c2fd
Partial-Bug: 1674543
2018-06-20 17:15:07 -07:00
Pavel Kvasnička 163fb4d52a Always require device dir for containers
For test purposes (e.g. saio probetests) even if mount_check is False,
still require check_dir for account/container server storage when real
mount points are not used.

This behavior is consistent with the object-server's checks in diskfile.

Co-Author: Clay Gerrard <clay.gerrard@gmail.com>
Related lp bug #1693005
Related-Change-Id: I344f9daaa038c6946be11e1cf8c4ef104a09e68b
Depends-On: I52c4ecb70b1ae47e613ba243da5a4d94e5adedf2
Change-Id: I3362a6ebff423016bb367b4b6b322bb41ae08764
2017-09-01 10:32:12 -07:00
Alistair Coles 609b5182c4 Refactor recon to use single md5_hash_for_file function
There were several implementations of hashing the content
of a file in cli/recon.py and common/middleware/recon.py.
This patch relocates one implementation (_hash_for_ringfile,
introduced in the Related Change) to common/utils.py and
refactors recon cli and middleware to use that function.

Also improves use of mocking in the unit tests to eliminate passing
custom file opener functions to the ReconMiddleware get_ring_md5
and get_swift_conf_md5 methods.

Related-Change: I9623752c3cd2361f57864f3e938e1baf5e9292d7

Change-Id: Iaad88e49aadeb28f614aafa1e9596fe07ce9793a
2016-12-02 18:22:59 +00:00
Clay Gerrard 053b625f42 Remove ring md5 integration check from recon unittests
The actual value computed by md5 isn't that important; even in recon
it's only used as an opaque identifier that is assumed to be consistent
across nodes for the same file.

However the way these tests were written with hard coded md5 values
makes them brittle to changes in the RingData format and susceptible
to the burden of needless unrelated test maintenance churn.

e.g.

Related-Change: I23b5e0a8082b30ca257aeb1fab03ab74e6f0b2d4

Change-Id: I9623752c3cd2361f57864f3e938e1baf5e9292d7
2016-11-30 16:55:05 -08:00
Brian Cline a537684c77 Don't report recon mount/usage status on files
Today recon will include normal files in the payload it returns for
/recon/unmounted and /recon/diskusage. As a result it can trigger
bogus alarms on any operations-side monitoring checking for unmounted
disks or disks that show up in diskusage with weird looking stats.

This change adds an isdir check for the entries it finds in /srv/node.
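
A rough sketch of such a filter, assuming the devices root is passed in explicitly. The helper name `list_device_dirs` is hypothetical; the real middleware code may be structured differently.

```python
import os


def list_device_dirs(devices_root):
    """Return the sorted names of entries under devices_root that are directories."""
    return sorted(
        entry
        for entry in os.listdir(devices_root)
        # Skip regular files so they never show up as "unmounted" devices.
        if os.path.isdir(os.path.join(devices_root, entry))
    )
```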

Change-Id: Iad72e03fdda11ff600b81b4c5d58020cc4b9048e
Closes-bug: #1556747
2016-03-14 00:17:47 -05:00
Zack M. Davis 1b8b08039a remove remaining simplejson uses, prefer standard library import
a1c32702, 736cf54a, and 38787d0f remove uses of `simplejson` from
various parts of Swift in favor of the standard library `json`
module (introduced in Python 2.6). This commit performs the remaining
`simplejson` to `json` replacements, removes two comments highlighting
quirks of simplejson with respect to Unicode, and removes the references
to it in setup documentation and requirements.txt.

There were a lot of places where we were importing json from
swift.common.utils, which is less intuitive than a direct `import json`,
so that replacement is made as well.

(And in two more tiny drive-bys, we add some pretty-indenting to an XML
fragment and use `super` rather than naming a base class explicitly.)

Change-Id: I769e88dda7f76ce15cf7ce930dc1874d24f9498a
2015-11-16 12:34:24 -08:00
Brian Cline 460a7e4b64 Fixes recon bug with initially missing rings
Previously the recon middleware was doing a basic scan for object
rings that exist at init time. In situations where an object-server
was started without an object ring present, but received one shortly
after, recon still would not report it in the /recon/ringmd5 response.
This persists even when object-server gleefully chugs along after
picking up the ring, and recon's behavior would only be corrected by
an object-server reload/restart.

This change brings the middleware a bit more up to date to use the
common POLICIES instance to determine what policies were already loaded
based on configuration, and derives the path for each ring.

This effectively makes the config the source of truth for what rings
*should* be present, rather than what's present at startup. Since we
already dynamically check in ReconMiddleware.get_ring_md5 whether each
of the predetermined ring files exist, recon now correctly reports a
previously-missing ring whenever it falls into place.

Change-Id: Ia079418e54ffac5e01ef6a15511f5069b7fe83ea
2015-09-13 19:10:17 -05:00
Hisashi Osanai 79ba4a8598 Enable Object Replicator's failure count in recon
This patch exposes the count of object replication failures in recon.
In addition, "failure_nodes" is added to Account Replicator and
Container Replicator.

Recon shows the count of object replication failures as follows:
$ curl http://<ip>:<port>/recon/replication/object
{
    "replication_last": 1416334368.60865,
    "replication_stats": {
        "attempted": 13346,
        "failure": 870,
        "failure_nodes": {
            "192.168.0.1": {"sdb1": 3},
            "192.168.0.2": {"sdb1": 851,
                            "sdc1": 1,
                            "sdd1": 8},
            "192.168.0.3": {"sdb1": 3,
                            "sdc1": 4}
        },
        "hashmatch": 0,
        "remove": 0,
        "rsync": 0,
        "start": 1416354240.9761429,
        "success": 1908
    },
    "replication_time": 2316.5563162644703,
    "object_replication_last": 1416334368.60865,
    "object_replication_time": 2316.5563162644703
}

Note that 'object_replication_last' and 'object_replication_time' are
considered to be transitional and will be removed in the subsequent
releases. Use 'replication_last' and 'replication_time' instead.
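
To show how the "failure_nodes" mapping relates to the overall "failure" count, here is a small sketch that totals failures per node. The helper `failures_by_node` is hypothetical; the data is copied from the example response above.

```python
def failures_by_node(replication_stats):
    """Sum the per-device failure counts for each node."""
    totals = {}
    for node, devices in replication_stats.get("failure_nodes", {}).items():
        totals[node] = sum(devices.values())
    return totals


stats = {
    "failure": 870,
    "failure_nodes": {
        "192.168.0.1": {"sdb1": 3},
        "192.168.0.2": {"sdb1": 851, "sdc1": 1, "sdd1": 8},
        "192.168.0.3": {"sdb1": 3, "sdc1": 4},
    },
}
per_node = failures_by_node(stats)
# The node with the most failures is the first place to look.
worst = max(per_node, key=per_node.get)
print(per_node, worst)
```

In this example the per-node totals add up to the top-level "failure" value, and most failures involve a single device on one node.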

Additionally this patch adds the count to swift-recon, where it is
shown as follows:
$ swift-recon object -r
===============================================================================
--> Starting reconnaissance on 4 hosts
===============================================================================
[2014-11-27 16:14:09] Checking on replication
[replication_failure] low: 0, high: 0, avg: 0.0, total: 0, Failed: 0.0%, no_result: 0, reported: 4
[replication_success] low: 3, high: 3, avg: 3.0, total: 12, Failed: 0.0%, no_result: 0, reported: 4
[replication_time] low: 0, high: 0, avg: 0.0, total: 0, Failed: 0.0%, no_result: 0, reported: 4
[replication_attempted] low: 1, high: 1, avg: 1.0, total: 4, Failed: 0.0%, no_result: 0, reported: 4
Oldest completion was 2014-11-27 16:09:45 (4 minutes ago) by 192.168.0.4:6002.
Most recent completion was 2014-11-27 16:14:19 (-10 seconds ago) by 192.168.0.1:6002.
===============================================================================

In a mixed cluster, where some servers run with this patch and others
run without it, swift-recon executed on a patched server prints
unnecessary information such as [failure], [success] and [attempted],
because the unpatched servers cannot send a response with the
information this patch needs.
Therefore, once you apply this patch to one server, apply it to the
other servers as well before you execute swift-recon.

DocImpact
Change-Id: Iecd33655ae2568482833131f422679996c374d78
Co-Authored-By: Kenichiro Matsuda <matsuda_kenichi@jp.fujitsu.com>
Co-Authored-By: Brian Cline <bcline@softlayer.com>
Implements: blueprint enable-object-replication-failure-in-recon
2015-08-18 11:40:02 +09:00
Jenkins 617c6b0107 Merge "Time synchronization check in recon." 2015-08-18 01:21:22 +00:00
Victor Stinner a0db56dcde Fix pep8 E265 warning of hacking 0.10
Fix the warning E265 "block comment should start with '# '" added in
pep8 1.5.

Change-Id: Ib57282e958be9c7cddffc7bca34fbbf1d4c460fd
2015-07-30 09:33:18 +02:00
Ondrej Novy dd2f1be3b1 Time synchronization check in recon.
This change adds a time call to the recon middleware and a --time
parameter to the recon CLI. This is useful for checking whether time in
the cluster is synchronized.

Change-Id: I62373e681f64d0bd71f4aeb287953dd3b2ea5662
2015-07-23 11:35:02 +02:00
Lorcan 0a46793662 Add swift-recon feature to track swift-drive-audit error count
This is a follow-on from a previous commit which added recon info
for swift-drive-audit (https://review.openstack.org/#/c/122468/).

Here, the "--driveaudit" option is added to swift-recon tool. This
feature gives the statistics for the system-wide drive errors flagged
by swift-drive-audit. An example of the output is as follows:
(verbose mode)

swift-recon --driveaudit -v
===============================================================================
--> Starting reconnaissance on 5 hosts
===============================================================================
[2015-03-11 17:13:39] Checking drive-audit errors
-> http://1.2.3.4:6000/recon/driveaudit: {'drive_audit_errors': 14}
-> http://1.2.3.5:6000/recon/driveaudit: {'drive_audit_errors': 0}
-> http://1.2.3.6:6000/recon/driveaudit: {'drive_audit_errors': 37}
-> http://1.2.3.7:6000/recon/driveaudit: {'drive_audit_errors': 101}
-> http://1.2.3.8:6000/recon/driveaudit: {'drive_audit_errors': 0}
[drive_audit_errors] low: 0, high: 101, avg: 30.4, total: 152, Failed: 0.0%, no_result: 0, reported: 5
===============================================================================
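
The summary line in the output above can be reproduced from the per-host values. The following is only a sketch of that arithmetic, with a hypothetical `summarize` helper; it is not the swift-recon implementation.

```python
def summarize(values):
    """Return low/high/avg/total for a non-empty list of per-host counts."""
    total = sum(values)
    return {
        "low": min(values),
        "high": max(values),
        "avg": total / len(values),
        "total": total,
    }


# drive_audit_errors reported by the five hosts in the example output.
errors = [14, 0, 37, 101, 0]
summary = summarize(errors)
print(summary)
```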

Change-Id: Ia16c52a9d613eeb3de1a5a428d88dd1233631912
2015-03-23 11:38:32 +00:00
Daisuke Morita f8fa1a9234 Show each policy's information on quarantined files in recon
After the release of Swift ver. 2.0.0, some recon responses still do
not show per-policy information. To make things worse, some recon
results only count policy-0's score, so the total is not shown in the
recon results.

This patch makes the count of quarantined files policy-aware for recon
requests. Suppose the number of quarantined objects for policy-0 is 2
and the number for policy-1 is 3; recon sums up every policy's amount
and shows information for each policy as follows.

$ curl http://<host>:<port>/recon/quarantined
{"accounts": 0, "containers": 0, "objects": 5, "policies": {"0":
{"objects": 2}, "1": {"objects": 3}}}
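
As a quick check of the arithmetic in that response, the per-policy counts should add up to the top-level "objects" value. A short sketch using only the example body above:

```python
import json

body = (
    '{"accounts": 0, "containers": 0, "objects": 5, '
    '"policies": {"0": {"objects": 2}, "1": {"objects": 3}}}'
)
data = json.loads(body)

# Sum the quarantined object counts across all storage policies.
per_policy_total = sum(policy["objects"] for policy in data["policies"].values())
print(per_policy_total, data["objects"])
```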

Moreover, this patch adds stats for each policy in CLI output.

Change-Id: I07217c635f6fc4ea809ddbc3d859c4e81c4fde37
Related-Bug: 1375327
Related-Bug: 1375332
2015-01-20 18:42:20 +09:00
Paul Luse 8326dc9f2a Add Storage Policy Support to Recon Middleware
Recon middleware returns object ring file MD5 sums; this patch
updates it to include other object files that may be present
because of Storage Policies.  Also adds unit test coverage for
the MD5 reporting function which previously had none.

The recon script will now check that all rings the server responds with
match the on-disk md5s regardless of server type, including any
storage policy object rings.

Note the small change to the ring save method, needed to
stimulate the right code paths in 2.6 and 2.7 versions of
gzip to enable testing of ring MD5 sums.

DocImpact
Implements: blueprint storage-policies
Change-Id: I01efd2999d6d9c57ee8693ac3a6236ace17c5566
2014-06-18 21:09:54 -07:00
Samuel Merritt 31dac18625 Check swift.conf MD5 with recon
I've seen several folks recently have problems with their Swift
clusters because they had different hash prefixes on different
nodes. Let's help them out by having recon check that.

Note that MD5-equality is stronger than what we need (which is
ConfigParser-equality for a particular set of keys), but this way we
don't expose the secret hash prefix and suffix across the internal
network, just the MD5 checksum of the file containing them.
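
A sketch of the comparison this enables, assuming the checksums have already been collected from each node. The helper `nodes_differing_from_local` and the sample values are hypothetical.

```python
def nodes_differing_from_local(local_md5, md5_by_node):
    """Return the sorted nodes whose swift.conf checksum differs from the local one."""
    return sorted(
        node for node, md5 in md5_by_node.items() if md5 != local_md5
    )


# Made-up checksums for illustration; real values are hex MD5 digests of swift.conf.
reported = {
    "10.0.0.1:6000": "aaa111",
    "10.0.0.2:6000": "aaa111",
    "10.0.0.3:6000": "bbb222",
}
print(nodes_differing_from_local("aaa111", reported))
```

Only the checksums cross the network, so a mismatch can be detected without revealing the hash prefix or suffix themselves.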

Change-Id: I3af984ee45947345891b3c596a88e3464f178cc7
2014-04-10 14:08:27 -07:00
Greg Lange 8b4876f32a Fix recon docs
Change-Id: Icaa0f61e5796253dcc57b8c005577890de8aa537
2014-02-10 14:31:14 +00:00
Peter Portante a708295d82 Remove trailing slash for consistency
Change-Id: Idd4fd116b6be226e46e33f421883b6fb34947a84
Signed-off-by: Peter Portante <peter.portante@redhat.com>
2014-01-06 18:12:42 -05:00
Florian Hines 62254e42c4 Fix checkmount error parsing in swift-recon
- swift-recon now handles parsing instances where 'mounted' key (in unmounted
  and disk_usage) is an error message instead of a bool.
- Adds checkmount exception handling to the recon unmounted endpoint.
- Updates existing unittest to have ismount throw an error.
- Updates unittests to cover the corner cases

Change-Id: Id51d14a8b98de69faaac84b2b34b7404b7df69e9
2013-12-28 20:58:27 -08:00
Kun Huang fd4843f8e7 catch OSError to prevent breaking request /recon/diskusage
swift.common.utils.ismount may raise OSError in some special cases,
and the request against /recon/diskusage did not handle it before.
This patch makes the value of the mounted key the error's message.
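
The behaviour can be sketched as follows. The function `mount_status` is hypothetical, and the injectable `ismount` argument stands in for swift.common.utils.ismount; the default `os.path.ismount` is used here only so that the sketch is self-contained.

```python
import os


def mount_status(path, ismount=os.path.ismount):
    """Return True/False for the mount state, or the error message on OSError."""
    try:
        return ismount(path)
    except OSError as err:
        # Report the failure in place of the bool instead of failing the request.
        return str(err)
```

Callers then need to accept either a bool or a string for the mounted value.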

Change-Id: I5d9018f580181e618a3fa072b7a760d41795d8eb
Closes-Bug: #1249181
2013-11-13 22:46:20 +08:00
ZhiQiang Fan f72704fc82 Change OpenStack LLC to Foundation
Change-Id: I7c3df47c31759dbeb3105f8883e2688ada848d58
Closes-bug: #1214176
2013-09-20 01:02:31 +08:00
Clay Gerrard ce12d66cf9 fix swift i18n
Change-Id: I53cea28a6d7593a1b308dbcf77dddf7f40d76cb2
2013-09-09 20:25:00 -07:00
Dirk Mueller 3d36a76156 Use Python 3.x compatible except construct
except x,y: was deprecated and is removed in Python 3.x.
Use "except x as y:" instead which works in any Python
version >= 2.6.
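
For illustration, the compatible form looks like this. The function `parse_port` is a made-up example and does not come from the Swift code base.

```python
def parse_port(text):
    """Return the port as an int, or an error string if text is not a number."""
    try:
        return int(text)
    except ValueError as err:
        # "except ValueError, err:" would be a syntax error on Python 3.x.
        return "invalid port: %s" % err


print(parse_port("6000"))
print(parse_port("not-a-port"))
```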

Change-Id: I7008c74b807340f3457d3a0c8bd0b83f23169d14
2013-09-07 10:50:54 +02:00
Alex Gaynor 0f3b0410e3 Removed unnecessary monkeypatching of __builtin__
Replaced it with explicitly importing the gettext function, which is
significantly more readable.

Change-Id: Ia0a7edcf685fb6e4052a8290367b233169529ab8
2013-07-27 21:34:35 -07:00
Marcelo Martins 7fbb97b39e Retrieve the swift version with recon
Adding a '/recon/version' in order to get the swift version

Change-Id: I7b7ddbe70abb87c6a3b1010ddefa09d0acc09710
2013-05-23 15:06:12 -05:00
Monty Taylor abe70e8323 Cleanup based on pyflakes.
pyflakes itself can't be used in any automated gating way, because there are
two sets of false errors it raises. However, as an exercise, cleaning up the
'valid' ones uncovered three actual bugs. The other changes (mostly unused
variables) are included here for fun.

Command run: pyflakes swift | grep -v "undefined name '_'"

Change-Id: I18696bf047dedad1a9fdbde3463e214fba95f7c6
2013-02-01 07:50:17 +11:00
Michael Barton c45e435d1f Add wsgify and split_path utilities to swob
And refactor some of the code to use them.

Remove unused imports.

Change-Id: Ica479c10247fa85c740bb99cf7d1db7fbb1b2c80
2013-01-25 00:38:32 -08:00
gholt a88b412e17 swift-recon: Added oldest and most recent repl
I've been doing this with cluster-wide log searches for far too long.
This adds support for reporting the oldest replication pass
completion as well as the most recent. This is quite useful for
finding those odd replicators that have hung up for some reason and
need intervention.

Change-Id: I7fd7260eca162d6b085f3e82aaa3cf90670f2d53
2013-01-12 05:49:14 +00:00
John Dickinson 8ac292595f changed TRUE_VALUES references to utils.config_true_value() call
cleaned up pep8 (v1.3.3) in all files this patch touches

Change-Id: I30e8314dfdc23fb70ab83741a548db9905dfccff
2012-10-29 13:59:01 -07:00
Michael Barton 5e3e9a882d local WSGI Request and Response classes
This change replaces WebOb with a mostly compatible local library,
swift.common.swob.  Subtle changes to WebOb's API over the years have been a
huge headache.  Swift doesn't even run on the current version.

There are a few incompatibilities to simplify the implementation/interface:
 * It only implements the header properties we use.  More can be easily added.
 * Casts header values to str on assignment.
 * Response classes ("HTTPNotFound") are no longer subclasses, but partials
   on Response, so things like isinstance no longer work on them.
 * Unlike newer webob versions, will never return unicode objects.
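
The isinstance point can be shown with a stand-in class. This is a sketch using `functools.partial` on a toy `Response`; it is not the swob implementation, and the class below is hypothetical.

```python
from functools import partial


class Response:
    """Toy stand-in for a WSGI response class."""

    def __init__(self, status=200, body=""):
        self.status = status
        self.body = body


# A "response class" built as a partial rather than a subclass.
HTTPNotFound = partial(Response, status=404)

resp = HTTPNotFound(body="missing")
print(resp.status, isinstance(resp, Response))

try:
    isinstance(resp, HTTPNotFound)
except TypeError:
    # A partial object is not a type, so it cannot be used with isinstance.
    print("isinstance does not accept a partial")
```

Code that needs to distinguish responses has to compare the status instead of the class.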

Change-Id: I76617a0903ee2286b25a821b3c935c86ff95233f
2012-09-28 14:48:48 -07:00
Florian Hines 243b439507 Ensure empty results are returned
Make sure that empty but still valid results (like no unmounted drives)
aren't treated as 500 errors.

Change-Id: I9588e2711d7916406f15613d5a26b9f0cf38235a
2012-05-31 18:25:05 -05:00
Florian Hines ccb6334c17 Expand recon middleware support
Expand recon middleware to include support for account and container
servers in addition to the existing object servers. Also add support
for retrieving recent information from auditors, replicators, and
updaters. In the case of certain checks (such as container auditors)
the stats returned are only for the most recent path processed.

The middleware has also been refactored and should now also handle
errors better in cases where stats are unavailable.

While new checks have been added the output from pre-existing
checks has not changed. This should allow existing 3rd party
utilities such as the Swift ZenPack to continue to function.

Change-Id: Ib9893a77b9b8a2f03179f2a73639bc4a6e264df7
2012-05-24 14:50:00 -05:00
Jenkins 6168e37fd5 Merge "tests for recon middleware." 2012-03-22 20:39:50 +00:00
John Dickinson 1ecf5ebba1 updated copyright date for all files
Change-Id: Ifd909d3561c2647770a7e0caa3cd91acd1b4f298
2012-03-19 13:45:34 -05:00
Florian Hines 0a461a5b8a tests for recon middleware.
My first stab at unittests for the recon middleware.
Also, made some minor changes to the middleware to make testing
easier now and down the road.

Change-Id: I23ce853398ff035ffbfc2082e90e22038832b966
2012-03-19 13:44:43 -05:00
Chmouel Boudjnah 16a5faaaba PEP8 fixes.
Change-Id: I3c33c03547f97ca7afbb47c3bddfdeabf152afe2
2012-01-20 15:07:55 -06:00
Florian Hines 413ca11a5f Add sockstat info to recon.
Adds support for pulling info from /proc/net/sockstat and /proc/net/sockstat6 via recon.

Change-Id: Idb403c6eda199c5d36d96cc9027ee249c12c7d8b
2011-11-15 17:55:14 +00:00
Florian Hines e9b5cb83ac simplejson import and exception/logging fixes 2011-09-01 13:46:13 -05:00
Florian Hines b762c5acd0 pep8 fix 2011-08-14 10:49:15 -05:00
Florian Hines dcd39d098f account for parent/.. hardlinks 2011-08-12 16:29:13 -05:00
Florian Hines 44803a835d add quarantine stats 2011-08-12 15:01:28 -05:00
Florian Hines 7938a5d777 quick comment on how to load recon.py 2011-08-01 21:43:55 -05:00
Florian Hines aa622eb799 recon middlewear for the object server and utils for cluster monitoring 2011-07-27 10:41:07 -05:00