Commit Graph

239 Commits

Author SHA1 Message Date
Brian Haley 929b383743 Fix some new pylint "R" warnings
After updating pylint, it started emitting additional "R"
warnings in some cases; fix some of them:

  use-a-generator,
  unnecessary-lambda-assignment,
  consider-using-max-builtin,
  consider-using-generator,
  consider-using-in,
  use-list-literal,
  consider-using-from-import

Trivialfix

Change-Id: Ife6565cefcc30b4e8a0df9121c9454cf744225df
2023-07-18 18:06:51 -04:00
labedz f430cd0072 Don't set HA ports down during L3 agent restart
Because of the fix for bug [1] and an issue with linux_utils
get_process_count_by_name(), the L3 agent puts all its HA ports down
during the initialization phase. Unfortunately, such an operation can
break already working L3 communication. Rewiring an ha-* port from the
down state back to up can take a few seconds, and some VRRP packets may
be lost in the meantime. That triggers keepalived on the other node, so
a router HA state change may be triggered.

This change prevents putting HA ports down when, during the
initialization phase, the L3 agent finds its own already configured
network namespaces. The existence of such a namespace is good evidence
that a network configuration already exists, so the host wasn't
rebooted and this is most probably just an agent restart.

[1] https://bugs.launchpad.net/neutron/+bug/1597461

Closes-Bug: #1959151
Change-Id: Id9c906b2d141c3bedd80fb5f868190f8a4b66f54
2022-03-01 14:27:42 +00:00
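
A minimal sketch of the namespace heuristic described in the commit
above; the directory path, namespace prefix and function name are
assumptions for illustration, not the agent's actual code:

  import os

  NS_PREFIX = "qrouter-"  # assumed router namespace prefix

  def should_force_ha_ports_down(netns_dir="/var/run/netns"):
      """Only force HA ports down when no router namespaces survived."""
      try:
          namespaces = os.listdir(netns_dir)
      except OSError:
          return True  # nothing configured yet, e.g. a freshly booted host
      return not any(ns.startswith(NS_PREFIX) for ns in namespaces)

  if __name__ == "__main__":
      print("force HA ports down:", should_force_ha_ports_down())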
Slawek Kaplonski 41159bd9a4 Clean up router for which processing of an added router failed
In the _process_added_router() method of the L3 agent, if processing a
router fails, its router_info should be cleaned up, e.g. removed from
the router cache, so it will not be treated as an updated router in the
next iteration of the agent.

Closes-Bug: #1947993
Change-Id: Ic0bc3d951d32efadc116708bfe518a711730429d
2021-11-08 16:42:08 +01:00
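
An illustrative sketch of the cleanup described above, with hypothetical
helper and cache names rather than the agent's real ones:

  def process_added_router(router, router_cache, build_router_info):
      """Add a router to the cache and roll back if processing fails."""
      router_id = router["id"]
      router_cache[router_id] = build_router_info(router)
      try:
          router_cache[router_id].process()
      except Exception:
          # Drop the cache entry so the next iteration re-adds the router
          # instead of treating it as an already-known, updated one.
          router_cache.pop(router_id, None)
          raise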
Nurmatov Mamatisa ef83719da2 Use payloads for ROUTER AFTER_ callbacks
This patch switches over to callback payloads for ROUTER
AFTER_CREATE, AFTER_UPDATE and AFTER_DELETE events.

Change-Id: Ie818ffbb1a291faa80501157b46ff6671d5c26ba
2021-08-09 14:13:28 +00:00
Nurmatov Mamatisa 40c8f60ee3 Use payloads for ROUTER callbacks
This patch switches over to callback payloads for ROUTER
BEFORE_CREATE, PRECOMMIT_CREATE, BEFORE_UPDATE and
PRECOMMIT_DELETE events.

Change-Id: I4a52c773d3f753c918df0986f1d261083156651c
2021-08-02 12:32:30 +03:00
Slawek Kaplonski 6ce48c30bd [L3] Use processing queue for network update events
Router_info's _process_internal_ports() method is the one which
manipulates the router_info.internal_ports cache, and the L3 agent's
network_update() method relies on that Router_info cache to check
whether an updated network is connected to the router or not.
So they shouldn't run together, as that may cause race conditions
and unexpected issues like the one described in the related bug.

Until now, the network_update event was the only one processed without
using the queue of events, and because of that the race condition
described above was possible.
To fix that, this patch changes the network_update method so that it
now adds an update event for each router hosted by the agent to the
queue. Those per-router events are then processed, checking whether the
network is actually connected to the router and, if so, scheduling a
router update to be processed.

Closes-Bug: #1933234
Change-Id: I2efe66a7415f7a18fb85bd2536a1901e751d6203
2021-07-08 17:03:43 +02:00
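
A rough sketch of the queued approach from the commit above; the queue
shape and callback names are assumptions, not the actual
ResourceProcessingQueue API:

  import queue

  updates = queue.Queue()

  def network_update(network_id, hosted_router_ids):
      # Enqueue one event per hosted router instead of touching caches inline.
      for router_id in hosted_router_ids:
          updates.put(("network_update", router_id, network_id))

  def process_events(router_has_network, schedule_router_update):
      while not updates.empty():
          event, router_id, network_id = updates.get()
          if event == "network_update" and router_has_network(router_id,
                                                              network_id):
              schedule_router_update(router_id)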
Slawek Kaplonski 5c9a7fe1b4 Add extra logs to the network update callback in L3 agent
It may be useful when debugging some L3 and event-related issues.

Related-bug: #1933234
Change-Id: I4bcba0ae82d99fac962d758b48b1727f344ec7bb
2021-06-29 13:48:59 +02:00
LIU Yulong ac1597d009 [L3] Add some logs for router processing
In order to dig into the real action of a ResourceUpdate, add logs for:
1. add/update router
2. delete router
3. delete namespace
4. agent extension router add/delete/update actions

Change-Id: I5c0ff485cd0c966afe535f8063deca6e410e012d
Related-bug: #1881995
2021-06-22 01:41:28 +00:00
Slawek Kaplonski 0d8ae15767 Remove update_initial_state() method from the HA router
This method was intended to check the state of the HA router on the
node and update it in the neutron server.
Patch [1] added a check of the initial status to the
neutron_keepalived_state_change_monitor process.
The old method could also cause race conditions in which the event that
sets the correct state of the router is not processed, so the router
may end up with two nodes in the "primary" state in Neutron's DB.

Neutron_keepalived_state_change_monitor used to notify the agent about
the router's initial state only if that state was 'primary'.
Now it always notifies the agent, letting the agent set the router's
state to 'backup' if needed (which was previously done by the removed
update_initial_state() method).

[1] https://review.opendev.org/c/openstack/neutron/+/642295

Change-Id: I2cc58c30cf844ee0ecf0611ecdec430086464790
Closes-Bug: #1916022
2021-02-23 14:58:29 +00:00
Slawek Kaplonski 489e0ead72 Fix migration from the HA to non-HA routers
If any failure occurs while switching an HA router to be down,
router_info will remain stored in the L3 agent's cache as an HaRouter.
If the next update on that router is a migration to a non-HA router,
this is the wrong class and it causes further issues, e.g. with
remove_vip_by_ip_address(), which is valid only for HA routers.

This patch fixes that issue by checking the router's ha and
distributed flags and updating the local cache with a new router_info
class if at least one of those flags doesn't match.

Change-Id: Ib0d3a501f88c149baea7d715c7cfe5811bc85e4f
Closes-Bug: #1892846
2020-11-16 21:56:30 +01:00
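
A small sketch of that consistency check; the attribute and helper
names are illustrative, not the L3 agent's actual implementation:

  def flags_match(cached_info, router):
      return (getattr(cached_info, "ha", False) == router.get("ha", False) and
              getattr(cached_info, "distributed", False) ==
              router.get("distributed", False))

  def refresh_router_info(router_cache, router, build_router_info):
      """Rebuild the cached router_info when ha/distributed flags changed."""
      cached = router_cache.get(router["id"])
      if cached is None or not flags_match(cached, router):
          router_cache[router["id"]] = build_router_info(router)
      return router_cache[router["id"]]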
LIU Yulong d13efc6314 [L3] Let agent extension do delete router first
Some agent extension implementations may need the router_info to do
some cleanup work, so this patch just moves the extension delete action
forward.

Closes-Bug: #1897423
Change-Id: I3434ec7c0942229b99e67de7500090dedb37b13f
2020-10-07 13:38:11 +00:00
Bernard Cafarelli 5ce0595803 Set process name for agents
Now that we use setproctitle for neutron-server workers (and
neutron-keepalived-state-change), this has the side effect of changing
the process name for agents, impacting some monitoring systems. More
details in launchpad bug.

This patch fixes it by setting the name with setproctitle to:
agent name (original process name).

Also use the newly introduced name constants to replace existing
hardcoded uses.

Change-Id: I74c3a4d3e9f833752571a75f196560cd45529385
Closes-Bug: #1881297
2020-07-01 12:28:29 +02:00
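
A minimal sketch of the renaming scheme, assuming the third-party
setproctitle package is installed; the agent name below is only an
example:

  from setproctitle import getproctitle, setproctitle

  def set_agent_proctitle(agent_name):
      """Rename the process to 'agent name (original process name)'."""
      original = getproctitle()
      setproctitle("%s (%s)" % (agent_name, original))

  if __name__ == "__main__":
      set_agent_proctitle("neutron-l3-agent")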
Slawek Kaplonski 6b360d2343 Report L3 extensions enabled in the L3 agent's config
Change-Id: I187ab3bf04d19c07c3f04ffb2161399a7dfd7ff3
Closes-Bug: #1876898
2020-05-05 12:55:49 +02:00
LIU Yulong 12b9149e20 Not remove the running router when MQ is unreachable
When the L3 agent gets a router update notification, it will try to
retrieve the router info from the neutron server. But if, at this time,
the message queue is down or unreachable, it will get message queue
related exceptions and the resync actions will be run. Sometimes a
RabbitMQ cluster is not that easy to recover, and a long MQ recovery
time will cause the router info sync RPC to never succeed before it
reaches the max retry count. Then the bad thing happens: the L3 agent
tries to remove the router, which basically shuts down all the existing
L3 traffic of this router.

This patch simply removes that final router removal action and lets
the router keep running as it is.

Closes-Bug: #1871850
Change-Id: I9062638366b45a7a930f31185cd6e23901a43957
2020-04-24 17:44:27 -04:00
Oleg Bondarev 5663517613 Support L3 agent cleanup on shutdown
Add an option to delete all routers on agent shutdown.

Closes-Bug: #1851609
Change-Id: I7a4056680d8453b2ef2dcc853437a0ec4b3e8044
2019-12-16 17:01:31 -05:00
Brian Haley 555238da69 Start using oslo_utils.netutils.is_ipv6_enabled()
Seems that is_enabled_and_bind_by_default() from
neutron.common.ipv6_utils was copied directly into
oslo_utils.netutils, so start using it instead.

Trivialfix

Change-Id: I00fa441e7a20fcd1115485bb8ab75750e6a8cf07
2019-10-16 21:44:56 -04:00
Zuul 1c2e10f859 Merge "Remove get_external_network_id for router" 2019-09-25 19:30:14 +00:00
LIU Yulong f51e5ce924 Remove get_external_network_id for router
The L3 agent has supported multiple external networks for a long
time, so remove this RPC call since it is not used anymore.
According to the code searches [1] and [2], we only remove the
neutron built-in L3 agent RPC; on the neutron server side and in the
RPC callback classes, the function is still retained.

[1] http://codesearch.openstack.org/?q=get_external_network_id
[2] http://codesearch.openstack.org/?q=L3RpcCallback

Change-Id: I764423e175d6e82729a647e415a9f267f495916f
Closes-Bug: #1844168
2019-09-20 13:31:32 +00:00
Michal Arbet 49a66dba31 Fix py3 compatibility
The fetch_and_sync_all_routers method uses Python's range function,
which accepts only integers.

This patch fixes the division behaviour in py3, where the result is a
float, by casting it to an int as it is represented in py2.

Change-Id: Ifffdee0d4a3226d4871cfabd0bdbf13d7058a83e
Closes-Bug: #1824334
2019-09-10 05:47:24 +00:00
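
The core of that py2/py3 difference in a few lines; the chunking
numbers are only an example, not the values used by the agent:

  routers = list(range(10))
  chunk_size = 4

  # py3: 10 / 4 == 2.5, and range(2.5) raises TypeError.
  # Truncating the quotient to an int restores the py2 behaviour.
  for i in range(int(len(routers) / chunk_size) + 1):
      print(routers[i * chunk_size:(i + 1) * chunk_size])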
Rodolfo Alonso Hernandez 3f022a193f Delay HA router transition from "backup" to "master"
As described in the bug, when an HA router transitions from "master" to
"backup", the "keepalived" processes will set the virtual IP on all
other HA routers. Each HA router will then advertise it and
"keepalived" will decide, according to a trivial algorithm (higher
interface IP), which one should be "master". At this point, the other
"keepalived" processes running on the other servers will remove the HA
router virtual IP assigned an instant before.

To avoid transitioning some routers from "backup" to "master" and then
back to "backup" in a very short period, this patch delays the "backup"
to "master" transition, waiting for a possible new "backup" state. If,
during the waiting period before setting the HA state to "master" (set
to the HA VRRP advert time, 2 seconds by default), the L3 agent
receives a new "backup" HA state, the L3 agent does nothing.

Closes-Bug: #1837635

Change-Id: I70037da9cdd0f8448e0af8dd96b4e3f5de5728ad
2019-08-27 16:47:00 +00:00
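
An illustrative way to implement such a delayed promotion; this is a
sketch with assumed names, not the agent's actual state-change
handling:

  import threading

  ADVERT_INTERVAL = 2.0  # default HA VRRP advert time, in seconds

  class HaStateDelayer(object):
      def __init__(self, apply_state):
          self._apply_state = apply_state
          self._pending = None

      def handle_state(self, state):
          if self._pending is not None:
              self._pending.cancel()   # a newer event supersedes the pending one
              self._pending = None
          if state == "master":
              # Promote only if no "backup" event arrives within the advert time.
              self._pending = threading.Timer(
                  ADVERT_INTERVAL, self._apply_state, args=("master",))
              self._pending.start()
          else:
              self._apply_state(state)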
Zuul 0cde163967 Merge "Remove 'gateway_external_network_id' config option" 2019-08-05 12:40:08 +00:00
Adrian Chiris 0e80d2251e Pass get_networks() callback to interface driver
In order to support out of tree interface drivers it is required
to pass a callback to allow the drivers to query information about
the network.

- Allow passing **kwargs to interface drivers
- Pass get_networks() as `get_networks_cb` kw arg
  `get_networks_cb` has the same API as
  `neutron.neutron_plugin_base_v2.NeutronPluginBaseV2.get_networks()`
   minus the request context, which will be embedded in the callback
   itself.

The out of tree interface drivers in question are:

MultiInterfaceDriver - a per-physnet interface driver that delegates
                       operations on a per-physnet basis.
IPoIBInterfaceDriver - an interface driver for IPoIB (IP over Infiniband)
                       networks.

Those drivers are part of networking-mlnx [1]; their implementation
is vendor agnostic, so they can later be moved to a more common place
if desired.

[1] https://github.com/openstack/networking-mlnx

Change-Id: I74d9f449fb24f64548b0f6db4d5562f7447efb25
Closes-Bug: #1834176
2019-07-30 20:21:16 +03:00
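
A sketch of how such a callback might be wired into a driver via
**kwargs, using functools.partial to embed the request context; all
class and function names here are made up for illustration:

  import functools

  class ExampleInterfaceDriver(object):
      def __init__(self, conf, get_networks_cb=None, **kwargs):
          self.conf = conf
          self.get_networks_cb = get_networks_cb

      def plug(self, network_id):
          # The driver can ask about the network without knowing the context.
          nets = self.get_networks_cb(filters={"id": [network_id]})
          print("plugging into", nets)

  def fake_get_networks(context, filters=None, fields=None):
      return [{"id": (filters or {}).get("id", ["?"])[0], "mtu": 1500}]

  driver = ExampleInterfaceDriver(
      conf=None,
      get_networks_cb=functools.partial(fake_get_networks, "admin-context"))
  driver.plug("net-1")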
Slawek Kaplonski 9b2e472ae9 Remove 'gateway_external_network_id' config option
This option has been deprecated for a couple of releases already.
In Stein we removed the 'external_network_bridge' option from the L3
agent's config, so now it's time to remove this one as well.

A new upgrade check is also introduced to warn users if
gateway_external_network_id was used in the deployment.

This patch also removes the method _check_router_needs_rescheduling()
from the neutron/db/l3_db.py module as it is not needed anymore.

This patch also removes unit tests:
test_update_gateway_agent_exists_supporting_network
test_update_gateway_agent_exists_supporting_multiple_network
test_router_update_gateway_no_eligible_l3_agent
from neutron/tests/unit/extensions/test_l3.py module as those
tests are not needed when there is no "gateway_external_network_id"
config option anymore.

Change-Id: Id01571cd42cfe9c5ce91e90159917c7d3c963878
2019-07-26 13:19:14 +02:00
Adrian Chiris c62c67f413 Add RPC method to get networks for L3 and DHCP agents
- Added get_networks() RPC call for DHCP agent
- Added get_networks() RPC call for L3 agent

This change is required in order to support out of tree
MultiInterfaceDriver and IPoIBInterfaceDriver interface drivers
as they require information on the network a port is being plugged
to.

These RPCs will be passed as kwargs when loading the relevant
interface driver.

get_networks() keyword args map to the keyword arguments of:
neutron.neutron_plugin_base_v2.NeutronPluginBaseV2.get_networks()

Change-Id: I11d82380aad8655a4fdc9656737b912b16e2859b
Partial-Bug: #1834176
2019-07-23 10:55:18 +03:00
LIU Yulong 9c4bd4bd9a Add a common timecost wrapper
And apply it to all the L3 RPC functions. It can be moved to
neutron-lib if it becomes widely used across other projects.

Related-Bug: #1835663
Change-Id: Ie7743db097fd45df432af341470336d6a5662c6f
2019-07-15 21:34:05 +08:00
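
A small decorator in the spirit of that time-cost wrapper; the logging
format and names are assumptions rather than the actual Neutron helper:

  import functools
  import logging
  import time

  LOG = logging.getLogger(__name__)

  def timecost(func):
      @functools.wraps(func)
      def wrapper(*args, **kwargs):
          start = time.time()
          try:
              return func(*args, **kwargs)
          finally:
              LOG.info("%s took %.3f seconds", func.__name__,
                       time.time() - start)
      return wrapper

  @timecost
  def sync_routers(host, router_ids=None):
      time.sleep(0.1)  # stand-in for the real RPC work
      return router_ids or []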
Miguel Lavalle 0b3f5f429d Support multiple external networks in L3 agent
Change [1] removed the deprecated option external_network_bridge. Per
commit message in change [2], "l3 agent can handle any networks by
setting the neutron parameter external_network_bridge and
gateway_external_network_id to empty". So the consequence of [1] was to
introduce a regression whereby multiple external networks are not
supported by the L3 agent anymore.

This change proposes a new simplified rule. If
gateway_external_network_id is defined, that is the network that the L3
agent will use. If not and multiple external networks exist, the L3
agent will handle any of them.

[1] https://review.opendev.org/#/c/567369/
[2] https://review.opendev.org/#/c/59359

Change-Id: Idd766bd069eda85ab6876a78b8b050ee5ab66cf6
Closes-Bug: #1824571
2019-05-27 19:23:28 -05:00
LIU Yulong 0f471a47c0 Async notify neutron-server for HA states
The RPC notifier method can sometimes be time-consuming, which causes
other parallel processing resources to fail to send notifications in
time. This patch makes the notification asynchronous.

Closes-Bug: #1824911
Change-Id: I3f555a0c78fbc02d8214f12b62c37d140bc71da1
2019-05-10 15:37:27 +00:00
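
One simple way to decouple state reporting from the event handler,
shown as a sketch; the real agent uses its own notifier machinery, and
the names below are illustrative:

  import queue
  import threading

  def start_async_notifier(send_report):
      """Return an enqueue function; a worker thread does the slow reporting."""
      pending = queue.Queue()

      def _worker():
          while True:
              router_id, state = pending.get()
              send_report(router_id, state)  # may block; events keep flowing
              pending.task_done()

      threading.Thread(target=_worker, daemon=True).start()
      return lambda router_id, state: pending.put((router_id, state))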
Zuul 554b7cd228 Merge "Add router_factory to l3-agent and L3 extension API" 2019-04-27 06:37:15 +00:00
Yang Youseok ec875b42b6 Add router_factory to l3-agent and L3 extension API
Currently, most implementations override the L3NatAgent class itself
for their own logic since there is no proper interface to extend the
RouterInfo class. This adds unnecessary complexity for developers
who just want to extend the router mechanism instead of the whole RPC.

Add a RouterFactory class with which developers can register a
RouterInfo class and to which RouterInfo creation is delegated.
Separate the functions and variables currently used externally from
RouterInfo into an abstract class, so that extensions can use the basic
interface.

Provide the router registration function in the l3 extension API so
that extensions can extend RouterInfo itself, corresponding to each
feature (ha, distributed, ha + distributed).

Depends-On: https://review.openstack.org/#/c/620348/
Closes-Bug: #1804634
Partially-Implements: blueprint openflow-based-dvr
Change-Id: I1eff726900a8e67596814ca9a5f392938f154d7b
2019-04-26 10:22:50 +09:00
LIU Yulong 9d60716cf1 Add update_id for ResourceUpdate
Add a unique id to each resource update, so we can calculate the
resource processing time and track it.

Related-Bug: #1825152
Related-Bug: #1824911
Related-Bug: #1821912
Related-Bug: #1813787

Change-Id: Ib4d197c6c180c32860964440882393794aabb6ef
2019-04-25 09:09:27 +08:00
Boden R 9bbe9911c4 remove neutron.common.constants
All of the externally consumed variables from neutron.common.constants
now live in neutron-lib. This patch removes neutron.common.constants
and switches all uses over to lib.

NeutronLibImpact

Depends-On: https://review.openstack.org/#/c/647836/
Change-Id: I3c2f28ecd18996a1cee1ae3af399166defe9da87
2019-04-04 14:10:26 -06:00
Zuul 7198fb6a0a Merge "Remove deprecated 'external_network_bridge' option" 2019-03-13 15:42:44 +00:00
Sławek Kapłoński b09b44608b Remove deprecated 'external_network_bridge' option
This option is deprecated and was marked for deletion in Ocata. As we
are now in the Stein development cycle, I think it's a good time to
remove it.

Change-Id: I07474713206c218710544ad98c08caaa37dbf53a
2019-03-09 22:07:38 +00:00
Swaminathan Vasudevan d9e0bab6ac DVR-HA: Unbinding a HA router from agent does not clear HA interface
Removing an active or a standby HA router from an agent that has a
valid DVR serviceable port (such as DHCP) does not remove the
HA interface associated with the router in the SNAT namespace.

When we try to add the HA router back to the agent, it adds more
than one HA interface to the SNAT namespace, causing more problems,
and we sometimes also see multiple active routers.

This bug might have been introduced by this patch [1].

Fix the problem by adding the router namespaces without HA
interfaces when there is no HA, and re-inserting the HA interfaces
into the namespace when the HA router is bound to the agent.

[1] https://review.openstack.org/#/c/522362/
Closes-Bug: #1816698

Change-Id: Ie625abcb73f8185bb2bee06dcd26a01d8af0b0d1
2019-03-07 18:11:32 +00:00
Brian Haley 22369ba7fe Do not print router resize messages when not resizing
I noticed in the functional logs that the l3-agent is constantly
logging this message, even when just adding or removing a single
router:

  Resizing router processing queue green pool size to: 8

It's misleading as the pool is not being resized, it's still 8,
so let's only log when we're actually changing the pool size.

Change-Id: I5dc42fa4b4c1964b7d027681b61550cd82e83234
2019-02-28 11:57:01 -05:00
Zuul 98fdc53c80 Merge "Call _safe_router_removed during pool resize testing" 2019-02-21 00:28:44 +00:00
Zuul f6c6be78ee Merge "Not set the HA port down at regular l3-agent restart" 2019-02-18 22:21:54 +00:00
LIU Yulong a10281bf23 Call _safe_router_removed during pool resize testing
Closes-Bug: #1816239
Change-Id: Ie93d17ff8b5825e401e342d215db4bcfd7b1cd3e
2019-02-18 15:43:24 -05:00
LIU Yulong 5b7d444b31 Not set the HA port down at regular l3-agent restart
If the l3-agent was restarted by a regular action, such as a config
change, package upgrade, manual service restart etc., we should not set
the HA port down during such scenarios, unless the physical host was
rebooted, i.e. the VRRP processes were all terminated.

This patch adds a new RPC call during l3 agent init that tries to
retrieve the HA router count first, and then compares the VRRP process
(keepalived) count and the 'neutron-keepalived-state-change' count
with the hosted router count. If the counts match, the action that
sets the HA port to the 'DOWN' state is no longer triggered.

Closes-Bug: #1798475
Change-Id: I5e2bb64df0aaab11a640a798963372c8d91a06a8
2019-02-14 16:58:22 +08:00
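
A sketch of that restart heuristic as a plain comparison; the parameter
names are hypothetical, and the real agent gets these counts via RPC
and process inspection:

  def should_reset_ha_ports(ha_router_count, keepalived_count,
                            state_change_count):
      """Force HA ports down only when the process counts don't line up."""
      if ha_router_count == 0:
          return False
      return not (keepalived_count == ha_router_count == state_change_count)

  print(should_reset_ha_ports(3, 3, 3))  # False: regular agent restart
  print(should_reset_ha_ports(3, 0, 0))  # True: processes gone, host rebooted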
LIU Yulong 837c9283ab Dynamically increase l3 router process queue green pool size
There is a race condition between nova-compute booting an instance and
the l3-agent processing the DVR (local) router on the compute node.
This issue can be seen when a large number of instances are booted on
the same host and the instances are under different DVR routers, so the
l3-agent will concurrently process all these DVR routers on this host
at the same time.
For now we have a green pool for the router ResourceProcessingQueue
with 8 greenlets, but some of these routers can still be left waiting;
even worse, there are time-consuming actions during the router
processing procedure, for instance installing ARP entries, iptables
rules, route rules etc.
So when the VM is up, it will try to get metadata via the local proxy
hosted by the DVR router, but the router is not ready yet on that host,
and finally those instances will not be able to set up some config in
the guest OS.

This patch adds a new measurement based on the router quantity to
determine the L3 router processing queue green pool size. The pool size
will be limited to between 8 (the original value) and 32, because we do
not want the L3 agent to spend too many host resources on processing
routers on the compute node.

Related-Bug: #1813787
Change-Id: I62393864a103d666d5d9d379073f5fc23ac7d114
2019-02-14 16:27:03 +08:00
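
The sizing rule reads naturally as a clamp; a tiny sketch with the
bounds from the commit message (the exact formula in the patch may
differ):

  def router_pool_size(router_count, lower=8, upper=32):
      """Scale the green pool with the router count, within fixed bounds."""
      return max(lower, min(upper, router_count))

  for count in (3, 8, 20, 100):
      print(count, "->", router_pool_size(count))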
Boden R 024802aafd remove neutron.common.rpc
The neutron.common.rpc module has been in neutron-lib for a while now
and neutron is shimmed to use neutron-lib already.
This patch removes neutron.common.rpc and switches the code over to use
neutron-lib's implementation where needed.

NeutronLibImpact

Change-Id: I733f07a8c4a2af071b3467bd710290eee11a4f4c
2019-02-06 11:05:55 -07:00
Marc Koderer 64f2fe7060 Change log level for l3 agent router updates
In case of an l3 agent sync it is important to understand when
a router is processing an update to identify when it applies
changes that can cause failovers.

Change-Id: Ie9ba2a8ffebfcc3bfb35f7a48f73a25352309b4e
2019-02-04 08:52:22 +01:00
Boden R 68fd13af40 remove neutron.common.exceptions
Today the neutron common exceptions already live in neutron-lib and are
shimmed from neutron. This patch removes the neutron.common.exceptions
module and changes neutron's imports over to use their respective
neutron-lib exception module instead.

NeutronLibImpact

Change-Id: I9704f20eb21da85d2cf024d83338b3d94593671e
2019-02-01 14:35:00 -07:00
Brian Haley a78bf152b1 Change L3 agent to log message after failure
If the L3 agent fails to send its report_state
to the server, it logs an exception:

   Failed reporting state!: MessagingTimeout: Timed out...

If it then tries a second time and succeeds, it just
goes on happily. It would be nice if it logged that
it had succeeded on the subsequent attempt so someone
looking at the logs knows it recovered.

Change-Id: I9019782588caffcb647cf1fd557f76ce89cea254
2019-01-24 16:13:16 -05:00
Brian Haley 4bb78e8c21 Fix l3-agent usage of L3AgentExtension class
The L3AgentExtension class delete_router() method expects a
dict as its 'data' argument, but the l3-agent code that
deletes a router was passing just the router ID. Change it to
correctly pass a router dictionary if one exists.

Change-Id: I112d1f8dce9defddfbd8fbfa75bf538e308e1561
Closes-bug: #1809134
2019-01-17 20:33:35 +00:00
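
A sketch of the corrected call; the manager and cache names are
illustrative stand-ins for the agent's internals:

  def notify_extensions_router_deleted(l3_ext_manager, context,
                                       router_cache, router_id):
      """Pass a router dict to the extensions, not just the router ID."""
      router_info = router_cache.get(router_id)
      router = router_info.router if router_info else {"id": router_id}
      l3_ext_manager.delete_router(context, router)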
Slawek Kaplonski 5018d70241 Fix connection between 2 dvr routers
When 2 DVR routers are connected to each other with a
tenant network, those routers always need to be deployed
on the same compute nodes.
So this patch changes the DVR routers scheduler so that it creates a
DVR router on each host on which there are VMs or other DVR routers
connected to the same subnets.

Co-Authored-By: Swaminathan Vasudevan <SVasudevan@suse.com>

Closes-Bug: #1786272

Change-Id: I579c2522f8aed2b4388afacba34d9ffdc26708e3
2018-11-01 18:01:25 +01:00
Zuul 4d2fae6b0c Merge "Rename router processing queue code to be more generic" 2018-09-05 08:19:14 +00:00
Boden R 73c7eddb5a use callback payloads for ROUTER/ROUTER_GATEWAY BEFORE_DELETE events
This patch switches callbacks over to the payload object style events
[1] for ROUTER and ROUTER_GATEWAY BEFORE_DELETE based notifications. To
do so a DBEventPayload object is used with the publish() method to pass
along the related data.

NeutronLibImpact

[1] https://docs.openstack.org/neutron-lib/latest/contributor/callbacks.html#event-payloads

Change-Id: I3ce4475643f4f0afed01f2e9956b3bf84714e6f2
2018-07-23 14:03:10 -06:00
Brian Haley f24f3b6b7b Rename router processing queue code to be more generic
Moved the router processing queue code to the agent/common
directory and renamed it "resource processing queue".  This
way it can be consumed by other agents, or possibly even
moved to neutron-lib in the future.

Change-Id: I735cf5b0a915828c420c3316b78a48f6d54035e6
2018-07-20 15:09:20 -04:00
Brian Haley 7cfdf4aa81 Fix all pep8 E129 errors
Fixed all pep8 E129 errors and changed tox.ini to no longer
ignore them.

Change-Id: I0b06d99ce1d473b79a4cfdd173baa4f02e653847
2018-05-03 13:44:04 +09:00