Suppress IPv6 metadata DAD failure and delete address

IPv4 DAD is non-existent in Linux or its failure is silent, so we
never needed to catch and ignore it. On the other hand IPv6 DAD
failure is explicit, hence comes this change.

This of course leaves the metadata service dead on hosts where
duplicate address detection failed. But if we catch the
DADFailed exception and delete the address, at least other
functions of the dhcp-agent should not be affected.

With this the IPv6 isolated metadata service is not redundant, which
is the best we can do without a redesign.

Also document the promised service level of isolated metadata.

Added additional tests for the metadata driver as well.

Change-Id: I6b544c5528cb22e5e8846fc47dfb8b05f70f975c
Partial-Bug: #1953165
This commit is contained in:
Bence Romsics 2023-03-06 13:04:01 +01:00 committed by Brian Haley
parent 3eb907c1f4
commit 2aee961ab6
11 changed files with 138 additions and 26 deletions

View File

@ -441,6 +441,38 @@ To test the HA of DHCP agent:
#. Start DHCP agent on HostB. The VM gets the wanted IP again.
No HA for metadata service on isolated networks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All Neutron backends using the DHCP agent can also provide `metadata service
<https://docs.openstack.org/nova/latest/user/metadata.html>`_ in isolated
networks (i.e. networks without a router). In this case the DHCP agent manages
the metadata service (see config option `enable_isolated_metadata
<https://docs.openstack.org/neutron/latest/configuration/dhcp-agent.html#DEFAULT.enable_isolated_metadata>`_).
Note however that the metadata service is only redundant for IPv4, and not
IPv6, even when the DHCP service is configured to be highly available
(config option `dhcp_agents_per_network
<https://docs.openstack.org/neutron/latest/configuration/neutron.html#DEFAULT.dhcp_agents_per_network>`_
> 1). This is because the DHCP agent will insert a route to the well known
metadata IPv4 address (`169.254.169.254`) via its own IP address, so it will
be reachable as long as the DHCP service is available at that IP address.
This also means that recovery after a failure is tied to the renewal of the
DHCP lease, since that route will only change if the DHCP server for a VM
changes.
With IPv6, the well known metadata IPv6 address (`fe80::a9fe:a9fe`) is used,
but directly configured in the DHCP agent network namespace.
Due to the enforcement of duplicate address detection (DAD), this address
can only be configured in at most one DHCP network namespaces at any time.
See `RFC 4862 <https://www.rfc-editor.org/rfc/rfc4862#section-5.4>`_ for
details on the DAD process.
For this reason, even when you have multiple DHCP agents, an arbitrary one
(where the metadata IPv6 address is not in `dadfailed` state) will serve all
metadata requests over IPv6. When that metadata service instance becomes
unreachable there is no failover and the service will become unreachable.
Disabling and removing an agent
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@ -41,6 +41,7 @@ from neutron.agent.linux import external_process
from neutron.agent.linux import ip_lib
from neutron.agent.linux import iptables_manager
from neutron.cmd import runtime_checks as checks
from neutron.common import _constants as common_constants
from neutron.common import utils as common_utils
from neutron.ipam import utils as ipam_utils
from neutron.privileged.agent.linux import dhcp as priv_dhcp
@ -1841,7 +1842,7 @@ class DeviceManager(object):
if self.conf.force_metadata or self.conf.enable_isolated_metadata:
ip_cidrs.append(constants.METADATA_CIDR)
if netutils.is_ipv6_enabled():
ip_cidrs.append(constants.METADATA_V6_CIDR)
ip_cidrs.append(common_constants.METADATA_V6_CIDR)
self.driver.init_l3(interface_name, ip_cidrs,
namespace=network.namespace)

View File

@ -102,6 +102,10 @@ class AddressNotReady(exceptions.NeutronException):
"become ready: %(reason)s")
class DADFailed(AddressNotReady):
pass
InvalidArgument = privileged.InvalidArgument
@ -592,7 +596,7 @@ class IpAddrCommand(IpDeviceCommandBase):
"""Wait until an address is no longer marked 'tentative' or 'dadfailed'
raises AddressNotReady if times out, address not present on interface
or DAD fails
raises DADFailed if Duplicate Address Detection fails
"""
def is_address_ready():
try:
@ -604,7 +608,7 @@ class IpAddrCommand(IpDeviceCommandBase):
# Since both 'dadfailed' and 'tentative' will be set if DAD fails,
# check 'dadfailed' first just to be explicit
if addr_info['dadfailed']:
raise AddressNotReady(
raise DADFailed(
address=address, reason=_('Duplicate address detected'))
if addr_info['tentative']:
return False

View File

@ -33,6 +33,7 @@ from neutron.agent.l3 import namespaces
from neutron.agent.linux import external_process
from neutron.agent.linux import ip_lib
from neutron.agent.linux import utils as linux_utils
from neutron.common import _constants as common_constants
from neutron.common import coordination
from neutron.common import metadata as comm_meta
from neutron.common import utils as common_utils
@ -241,9 +242,30 @@ class MetadataDriver(object):
# HAProxy cannot bind() until IPv6 Duplicate Address Detection
# completes. We must wait until the address leaves its 'tentative'
# state.
ip_lib.IpAddrCommand(
parent=ip_lib.IPDevice(name=bind_interface, namespace=ns_name)
).wait_until_address_ready(address=bind_address_v6)
try:
ip_lib.IpAddrCommand(
parent=ip_lib.IPDevice(name=bind_interface,
namespace=ns_name)
).wait_until_address_ready(address=bind_address_v6)
except ip_lib.DADFailed as exc:
# This failure means that another DHCP agent has already
# configured this metadata address, so all requests will
# be via that single agent.
LOG.info('DAD failed for address %(address)s on interface '
'%(interface)s in namespace %(namespace)s on network '
'%(network)s, deleting it. Exception: %(exception)s',
{'address': bind_address_v6,
'interface': bind_interface,
'namespace': ns_name,
'network': network_id,
'exception': str(exc)})
try:
ip_lib.delete_ip_address(bind_address_v6, bind_interface,
namespace=ns_name)
except Exception as exc:
# do not re-raise a delete failure, just log
LOG.info('Address deletion failure: %s', str(exc))
return
pm.enable()
monitor.register(uuid, METADATA_SERVICE_NAME, pm)
cls.monitors[router_id] = pm
@ -338,6 +360,6 @@ def apply_metadata_nat_rules(router, proxy):
if netutils.is_ipv6_enabled():
for c, r in proxy.metadata_nat_rules(
proxy.metadata_port,
metadata_address=(constants.METADATA_V6_IP + '/128')):
metadata_address=(common_constants.METADATA_V6_CIDR)):
router.iptables_manager.ipv6['nat'].add_rule(c, r)
router.iptables_manager.apply()

View File

@ -86,3 +86,6 @@ TRAIT_NETWORK_TUNNEL = 'CUSTOM_NETWORK_TUNNEL_PROVIDER'
# The lowest binding index for L3 agents and DHCP agents.
LOWEST_AGENT_BINDING_INDEX = 1
# Neutron-lib defines this with a /64 but it should be /128
METADATA_V6_CIDR = constants.METADATA_V6_IP + '/128'

View File

@ -33,7 +33,9 @@ AGENTS_SCHEDULER_OPTS = [
'network. If this number is greater than 1, the '
'scheduler automatically assigns multiple DHCP agents '
'for a given tenant network, providing high '
'availability for the DHCP service.')),
'availability for the DHCP service. However this does '
'not provide high availability for the IPv6 metadata '
'service in isolated networks.')),
cfg.BoolOpt('enable_services_on_agents_with_admin_state_down',
default=False,
help=_('Enable services on an agent with admin_state_up '

View File

@ -37,6 +37,7 @@ from neutron.agent.linux import dhcp
from neutron.agent.linux import interface
from neutron.agent.linux import utils as linux_utils
from neutron.agent.metadata import driver as metadata_driver
from neutron.common import _constants as common_constants
from neutron.common import config as common_config
from neutron.common import utils
from neutron.conf.agent import common as config
@ -1929,7 +1930,7 @@ class TestDeviceManager(base.BaseTestCase):
expected_ips = ['172.9.9.9/24', const.METADATA_CIDR]
if ipv6_enabled:
expected_ips.append(const.METADATA_V6_CIDR)
expected_ips.append(common_constants.METADATA_V6_CIDR)
expected = [mock.call.get_device_name(port)]

View File

@ -33,6 +33,7 @@ import testtools
from neutron.agent.linux import dhcp
from neutron.agent.linux import ip_lib
from neutron.cmd import runtime_checks as checks
from neutron.common import _constants as common_constants
from neutron.common import utils as common_utils
from neutron.conf.agent import common as config
from neutron.conf.agent import dhcp as dhcp_config
@ -3295,7 +3296,7 @@ class TestDeviceManager(TestConfBase):
if enable_isolated_metadata or force_metadata:
expect_ips.extend([
constants.METADATA_CIDR,
constants.METADATA_V6_CIDR])
common_constants.METADATA_V6_CIDR])
mgr.driver.init_l3.assert_called_with('ns-XXX',
expect_ips,
namespace='qdhcp-ns')

View File

@ -792,7 +792,7 @@ class TestIpAddrCommand(TestIPCmdBase):
def test_wait_until_address_dadfailed(self):
self.addr_cmd.list = mock.Mock(
return_value=[{'tentative': True, 'dadfailed': True}])
with testtools.ExpectedException(ip_lib.AddressNotReady):
with testtools.ExpectedException(ip_lib.DADFailed):
self.addr_cmd.wait_until_address_ready('abcd::1234')
@mock.patch.object(common_utils, 'wait_until_true')

View File

@ -25,6 +25,7 @@ from oslo_utils import uuidutils
from neutron.agent.l3 import agent as l3_agent
from neutron.agent.l3 import router_info
from neutron.agent.linux import external_process as ep
from neutron.agent.linux import ip_lib
from neutron.agent.linux import iptables_manager
from neutron.agent.linux import utils as linux_utils
from neutron.agent.metadata import driver as metadata_driver
@ -76,6 +77,7 @@ class TestMetadataDriverProcess(base.BaseTestCase):
EUNAME = 'neutron'
EGNAME = 'neutron'
METADATA_DEFAULT_IP = '169.254.169.254'
METADATA_DEFAULT_IPV6 = 'fe80::a9fe:a9fe'
METADATA_PORT = 8080
METADATA_SOCKET = '/socket/path'
PIDFILE = 'pidfile'
@ -140,7 +142,7 @@ class TestMetadataDriverProcess(base.BaseTestCase):
agent._process_updated_router(router)
f.assert_not_called()
def test_spawn_metadata_proxy(self):
def _test_spawn_metadata_proxy(self, dad_failed=False):
router_id = _uuid()
router_ns = 'qrouter-%s' % router_id
service_name = 'haproxy'
@ -165,22 +167,32 @@ class TestMetadataDriverProcess(base.BaseTestCase):
'NamespaceManager.list_all', return_value={}),\
mock.patch(
'neutron.agent.linux.ip_lib.'
'IpAddrCommand.wait_until_address_ready') as mock_wait:
'IpAddrCommand.wait_until_address_ready') as mock_wait,\
mock.patch(
'neutron.agent.linux.ip_lib.'
'delete_ip_address') as mock_del:
agent = l3_agent.L3NATAgent('localhost')
agent.process_monitor = mock.Mock()
cfg_file = os.path.join(
metadata_driver.HaproxyConfigurator.get_config_path(
agent.conf.state_path),
"%s.conf" % router_id)
mock_open = self.useFixture(
lib_fixtures.OpenFixture(cfg_file)).mock_open
mock_wait.return_value = True
if dad_failed:
mock_wait.side_effect = ip_lib.DADFailed(
address=self.METADATA_DEFAULT_IP, reason='DAD failed')
else:
mock_wait.return_value = True
agent.metadata_driver.spawn_monitored_metadata_proxy(
agent.process_monitor,
router_ns,
self.METADATA_PORT,
agent.conf,
bind_address=self.METADATA_DEFAULT_IP,
router_id=router_id)
router_id=router_id,
bind_address_v6=self.METADATA_DEFAULT_IPV6,
bind_interface='fake-if')
netns_execute_args = [
service_name,
@ -188,6 +200,8 @@ class TestMetadataDriverProcess(base.BaseTestCase):
log_tag = ("haproxy-" + metadata_driver.METADATA_SERVICE_NAME +
"-" + router_id)
bind_v6_line = 'bind %s:%s interface %s' % (
self.METADATA_DEFAULT_IPV6, self.METADATA_PORT, 'fake-if')
cfg_contents = metadata_driver._HAPROXY_CONFIG_TEMPLATE % {
'user': self.EUNAME,
'group': self.EGNAME,
@ -200,19 +214,35 @@ class TestMetadataDriverProcess(base.BaseTestCase):
'pidfile': self.PIDFILE,
'log_level': 'debug',
'log_tag': log_tag,
'bind_v6_line': ''}
'bind_v6_line': bind_v6_line}
mock_open.assert_has_calls([
mock.call(cfg_file, 'w'),
mock.call().write(cfg_contents)],
any_order=True)
if dad_failed:
agent.process_monitor.register.assert_not_called()
mock_del.assert_called_once_with(self.METADATA_DEFAULT_IPV6,
'fake-if',
namespace=router_ns)
else:
mock_open.assert_has_calls([
mock.call(cfg_file, 'w'),
mock.call().write(cfg_contents)], any_order=True)
env = {ep.PROCESS_TAG: service_name + '-' + router_id}
ip_mock.assert_has_calls([
mock.call(namespace=router_ns),
mock.call().netns.execute(netns_execute_args, addl_env=env,
run_as_root=True)
])
env = {ep.PROCESS_TAG: service_name + '-' + router_id}
ip_mock.assert_has_calls([
mock.call(namespace=router_ns),
mock.call().netns.execute(netns_execute_args, addl_env=env,
run_as_root=True)
])
agent.process_monitor.register.assert_called_once_with(
router_id, metadata_driver.METADATA_SERVICE_NAME,
mock.ANY)
mock_del.assert_not_called()
def test_spawn_metadata_proxy(self):
self._test_spawn_metadata_proxy()
def test_spawn_metadata_proxy_dad_failed(self):
self._test_spawn_metadata_proxy(dad_failed=True)
def test_create_config_file_wrong_user(self):
with mock.patch('pwd.getpwnam', side_effect=KeyError):

View File

@ -0,0 +1,16 @@
---
issues:
- |
The high availability of metadata service on isolated networks is limited
or non-existent. IPv4 metadata is redundant when the DHCP agent managing
it is redundant, but recovery is tied to the renewal of the DHCP lease,
making most recoveries very slow. IPv6 metadata is not redundant at all
as the IPv6 metadata address can only be configured in a single place at
a time as it is link-local. Multiple agents trying to configure it will
generate an IPv6 duplicate address detection failure.
Administrators may observe the IPv6 metadata address in "dadfailed" state
in the DHCP namespace for this reason, which is only an indication it is
not highly available. Until a redesign is made to the isolated metadata
service there is not a better deployment option. See `bug 1953165
<https://bugs.launchpad.net/neutron/+bug/1953165>`_ for information.