Offer it both by service and as a single, more easily searchable, page.
That admin guide is *still* too long, but this should help a bit.
Change-Id: I946c72f40dce2f33ef845a0ca816038727848b3a
* Get rid of a bunch of accidental blockquote formatting
* Always declare a lexer to use for ``.. code::`` blocks
Change-Id: I8940e75b094843e542e815dde6b6be4740751813
This patch plumbs the object-reconstructor stats that are dropped
into recon cache out through the middleware and swift-recon tool.
This adds a '/recon/reconstruction/object' to the middleware. As such
the swift-recon tool has grown a '-R' or '--reconstruction' option
access this data from each node.
Plus some tests and documentation updates.
Change-Id: I98582732ca5ccb2e7d2369b53abf9aa8c0ede00c
Note that existing SAIOs with 60xx ports should still work fine.
Change-Id: If5dd79f926fa51a58b3a732b212b484a7e9f00db
Related-Change: Ie1c778b159792c8e259e2a54cb86051686ac9d18
Swift servers can now be seamlessly reloaded by sending them a SIGUSR1
(instead of a SIGHUP). The server forks off a synchronized child to
wait to close the old listen socket(s) until the new server has started
up and bound its listen socket(s). The new server is exec'ed from the
old one so its PID doesn't change. This makes Systemd happier, so a
ReloadExec= stanza can now be used.
The seamless part means that incoming connections will alwyas get
accepted either by the old server or the new one. This eliminates
client-perceived "downtime" during server reloads, while allowing the
server to fully reload, re-reading configuration, becoming a fresh
Python interpreter instance, etc. The SO_REUSEPORT socket option has
already been getting used, so nothing had to change there.
This patch also includes a non-invasive fix for a current eventlet bug;
see https://github.com/eventlet/eventlet/pull/590
That bug prevents a SIGHUP "reload" from properly servicing existing
requests before old worker processes close sockets and exit. The
existing probtests missed this, but the new ones, in this patch, caught
it.
New probe tests cover both old SIGHUP "reload" behavior as well as the
new SIGUSR1 seamless reload behavior.
Change-Id: I3e5229d2fb04be67e53533ff65b0870038accbb7
The object server can be configured to leave a certain amount of disk
space free; default is 1%. This is useful in avoiding 100%-full
filesystems, as those can get Swift in a state where the filesystem is
too full to write tombstones, so you can't delete objects to free up
space.
When a cluster has accounts/containers and objects on the same disks,
then you can wind up with a 100%-full disk since account and container
servers don't respect fallocate_reserve. This commit makes account and
container servers respect fallocate_reserve so that disks shared
between account/container and object rings won't get 100% full.
When a disk's free space falls below the configured reserve, account
and container PUT, POST, and REPLICATE requests will fail with a 507
status code. These are the operations that can significantly increase
the disk space used by a given database.
I called the parameter "fallocate_reserve" for consistency with the
object server. No actual fallocate() call happens under Swift's
control in the account or container servers (sqlite3 might make such a
call, but it's out of our hands).
Change-Id: I083442eef14bf83c0ea717b1decb3e6b56dbf1d0
With a sufficiently undispersed ring it's possible to move an entire
replicas worth of parts and yet the value of dispersion may not get any
better (even though in reality dispersion has dramatically improved).
The problem is dispersion will currently only represent up to one whole
replica worth of parts being undispersed.
However with EC rings it's possible for more than one whole replicas
worth of partitions to be undispersed, in these cases the builder will
require multiple rebalance operations to fully disperse replicas - but
the dispersion value should improve with every rebalance.
N.B. with this change it's possible for rings with a bad dispersion
value to measure as having a significantly smaller dispersion value
after a rebalance (even though they may not have had their dispersion
change) because the total amount of bad dispersion we can measure has
been increased but we're normalizing within a similar range.
Closes-Bug: #1697543
Change-Id: Ifefff0260deac0c3e8b369a1e158686c89936686
Update the doc link brought by the doc migration.
Although we had some effort to fix these, it still left lots of bad
doc link, I separate these changes into 3 patches aim to fix all of
these, this is the 2st patch for doc/manpages.
Change-Id: Id426c5dd45a812ef801042834c93701bb6e63a05
* Change some absolute URLs to internal links
* Fix some bulletted list indentation
* Choose a better lexer for some syntax highlighting
* Use ``inline code`` instead of `italics` for some example command
lines
* Change some quoted paragraphs that only included inlined code to be
proper code blocks
Change-Id: Iaaa7eefb690122f5af9dcb1c871358c22335c743
Fixies this problem:
* swift-drive-audit needs to be run by root, because only root have
"umount" permission
* swift-object servers typically runs as user swift
* if swift-drive-audit is run by root, /var/cache/swift/drive.recon is
owned by root, with 0o600
* recon middleware (inside swift-object-server) can't read this cache
file: swift-object: Error reading recon cache file
This patch adds "user" option to drive-audit config file. Recon cache
is chowned to this user.
Change-Id: Ibf20543ee690b7c5a37fabd1540fd5c0c7b638c9
This comes from discussion in Bristol Hackathon (Feb 2016).
Currently Swift has a couple of choices (Global Cluster and Container
Sync) to sync the stored data into geographically distributed locations.
This patch adds the summary of the discussion comparing between
Global Cluster and Container Sync to enable operators to know which
functionality fits their own use case.
And, to be fairness with container-sync, this patch moves global
cluster docs into overview_global_cluster.rst from admin_guide.rst.
Co-Authored-By: Alistair Coles <alistair.coles@hpe.com>
Change-Id: I624eb519503ae71dbc82245c33dab6e8637d0f8b
An high or increasing partition count due to storing handoffs can have
some severe side-effects, and replication might never be able to catch
up. This patch adds a note to the admin_guide how to check this.
Change-Id: Ib4e161d68f1a82236dbf5fac13ef9a13ac4bbf18
This change updates the admin guide to point out the dispersion command
in swift-ring-builder and mentions the dispersion verbose table to make
it more obvious to operators.
Change-Id: I72b4c8b2d718e6063de0fdabbaf4f2b73694e0a4
Changing the recommended ports for Swift services
from ports 6000-6002 to unused ports 6200-6202;
so they do not conflict with X-Windows or other services.
Updated SAIO docs.
DocImpact
Closes-Bug: #1521339
Change-Id: Ie1c778b159792c8e259e2a54cb86051686ac9d18
This patch cleans up some rough edges that were left (due to
time constraints) in the original commit.
Change-Id: Id4480be8dc1b5c920c19988cb89ca8b60ace91b4
Co-Authored-By: Gerry Drudy gerry.drudy@hpe.com
This removes notes stating support for Erasure coding as beta. Questions
regarding the stability of EC are coming up regularly, and are often referring
to the docs that state EC as still in beta.
Besides this, a note marking statsd support as beta has been removed as well.
Change-Id: If4fb6a5c4cb741d42953db3cee8cb17a1d774e15
The log_statsd_host value can now be an IPv6 address or a hostname
which only resolves to an IPv6 address. In both cases, the new
behavior is to use an AF_INET6 socket on which .sendto() is called
with the originally-configured hostname (or IP). This means the
Swift process is not caching a DNS resolution for the lifetime of
the process (a good thing).
If a hostname resolves to both an IPv6 or IPv4 address, an AF_INET
socket is used (i.e. only the IPv4 address will receive the UDP
packet).
The old behavior is preserved: any invalid IP address literals and
failures in DNS resolution or actual StatsD packet sending do not
halt the process or bubble up; they are caught, logged, and
otherwise ignored.
Change-Id: Ibddddcf140e2e69b08edf3feed3e9a5fa17307cf
Added region prefix to example commands for adding devices to ring.
Also updates description to include region prefix.
Change-Id: Ie6d6485b497cea973e37909b5b19b44946c8aa89
Currently, statistics are organized by command. However, it would also be
useful to display statistics organized by policy. Different policies may be
based on different storage properties (ie, faster disks).
With this change, all the statistics for object timers will be sent per policy
as well.
Policy statistics reporting will use policy index and the name in Graphite will
show as proxy-server.object.policy.<policy-index>.<verb>, etc.
Updated unit tests for per-policy stat reporting and added new unit tests for
invalid cases.
Updated documentation in the Administrator's Guide to reflect this new
aggregation.
Change-Id: Id70491e4833791a3fb8ff385953d69018514cd9c
This patch makes the count of object replication failure in recon.
And "failure_nodes" is added to Account Replicator and
Container Replicator.
Recon shows the count of object repliction failure as follows:
$ curl http://<ip>:<port>/recon/replication/object
{
"replication_last": 1416334368.60865,
"replication_stats": {
"attempted": 13346,
"failure": 870,
"failure_nodes": {
"192.168.0.1": {"sdb1": 3},
"192.168.0.2": {"sdb1": 851,
"sdc1": 1,
"sdd1": 8},
"192.168.0.3": {"sdb1": 3,
"sdc1": 4}
},
"hashmatch": 0,
"remove": 0,
"rsync": 0,
"start": 1416354240.9761429,
"success": 1908
},
"replication_time": 2316.5563162644703,
"object_replication_last": 1416334368.60865,
"object_replication_time": 2316.5563162644703
}
Note that 'object_replication_last' and 'object_replication_time' are
considered to be transitional and will be removed in the subsequent
releases. Use 'replication_last' and 'replication_time' instead.
Additionaly this patch adds the count in swift-recon and it will be
showed as follows:
$ swift-recon object -r
========================================================================
=======
--> Starting reconnaissance on 4 hosts
========================================================================
=======
[2014-11-27 16:14:09] Checking on replication
[replication_failure] low: 0, high: 0, avg: 0.0, total: 0, Failed: 0.0%,
no_result: 0, reported: 4
[replication_success] low: 3, high: 3, avg: 3.0, total: 12,
Failed: 0.0%, no_result: 0, reported: 4
[replication_time] low: 0, high: 0, avg: 0.0, total: 0, Failed: 0.0%,
no_result: 0, reported: 4
[replication_attempted] low: 1, high: 1, avg: 1.0, total: 4,
Failed: 0.0%, no_result: 0, reported: 4
Oldest completion was 2014-11-27 16:09:45 (4 minutes ago) by
192.168.0.4:6002.
Most recent completion was 2014-11-27 16:14:19 (-10 seconds ago) by
192.168.0.1:6002.
========================================================================
=======
In case there is a cluster which has servers, a server runs with this
patch and the other servers run without this patch. If swift-recon
executes on the server which runs with this patch, there are unnecessary
information on the output such as [failure], [success] and [attempted].
Because other servers which run without this patch are not able to
send a response with information that this patch needs.
Therefore once you apply this patch, you also apply this patch to other
servers before you execute swift-recon.
DocImpact
Change-Id: Iecd33655ae2568482833131f422679996c374d78
Co-Authored-By: Kenichiro Matsuda <matsuda_kenichi@jp.fujitsu.com>
Co-Authored-By: Brian Cline <bcline@softlayer.com>
Implements: blueprint enable-object-replication-failure-in-recon
When rsync pushes to a remote node with an unmounted drive and if
certain steps are not taken, rsync may attempt to write files to
the local drive at the location where the drive was mounted.
There are two suggested solutions for this issue:
1) Set the permissions for all mount points in /srv/node/
to root:root 755
2) Mount the drives elsewhere and symlink the drives to /srv/.../
The first method ensures that only root and not the swift user
can write in the /srv/.../ directories.
The second method will prompt a broken link issue if rsync
attempts to write to an unmounted drive.
Change-Id: I60ce4ed9ef8401768d5f78b6806cbb2e2a65303e
Closes-Bug: #1470576
This provides the capability to specify a project_name,
project_domain_name and user_domain_name in /etc/swift/dispersion.conf.
If this values are set in dispersion.conf they get populated to the
swift-client. With this it is possible to have a specific dispersion
project specified, which is not the keystone default domain. Changes
were applied to swift-dispersion-populate and swift-dispersion-report.
Relevant man pages, the example dispersion.conf and the admin guide were
updated accordingly.
DocImpact
Closes-Bug: #1468374
Change-Id: I0e716f8d281b4d0f510bc568bcee4a13fc480ff7
This change add call time to recon middleware and param --time to
recon CLI. This is usefull for checking if time in cluster is
synchronized.
Change-Id: I62373e681f64d0bd71f4aeb287953dd3b2ea5662
Doesn't work for anything other than policy 0. updated to allow user
to specify policy name on cmd line (as with object-info) which
then makes populate/report work with 3x, 2x, or EC style policies
Change-Id: Ib7c298f0f6d666b1ecca25315b88539f45cf9f95
Closes-Bug: 1458688
Sometimes, I get handed a builder file in a support ticket and a
question of the form "why is the balance [not] doing $thing?". When
that happens, I add a bunch of print statements to my local
swift/common/ring/builder.py, figure things out, and then delete the
print statements. This time, instead of deleting the print statements,
I turned them into debug() calls and added a "--debug" flag to the
rebalance command in hopes that someone else will find it useful.
Change-Id: I697af90984fa5b314ddf570280b4585ba0ba363c