Massively Distributed RPC test plan

This formalizes the discussion started in
https://etherpad.openstack.org/p/1BGhFHDIoi

Change-Id: Idc7ab4eeef24ef57a535b9567541f30734191e87
Co-Authored-By: Ken Giusti <kgiusti@gmail.com>
Co-Authored-By: andy smith <ansmith@redhat.com>
Co-Authored-By: Paul-Andre Raymond <paul-andre.raymond@b-yond.com>

Environment description
^^^^^^^^^^^^^^^^^^^^^^^
The environment description includes the hardware specification of the servers,
the network parameters, and the operating system configuration.
Hardware
""""""""
This section contains a list of all types of hardware nodes.

+-----------+-------+----------------------------------------------------+
| Parameter | Value | Comments                                           |
+===========+=======+====================================================+
| model     |       | e.g. Supermicro X9SRD-F                            |
+-----------+-------+----------------------------------------------------+
| CPU       |       | e.g. 6 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz |
+-----------+-------+----------------------------------------------------+
Network
"""""""
This section contains the list of interfaces and network parameters. In the
context of a massively distributed cloud (e.g. across a WAN), the network links
may present different characteristics in terms of latency, bandwidth and packet
loss. These characteristics can be emulated (e.g. with `tc`_) or be the result
of a real deployment over a large geographical area. In any case, the link
characteristics must be described.

+------------------+-------+-------------------------+
| Parameter        | Value | Comments                |
+==================+=======+=========================+
| card model       |       | e.g. Intel              |
+------------------+-------+-------------------------+
| driver           |       | e.g. ixgbe              |
+------------------+-------+-------------------------+
| speed            |       | e.g. 10G or 1G          |
+------------------+-------+-------------------------+
Software
""""""""
This section describes the installed operating system, other relevant
system-level parameters and software.

+-----------------+-------+---------------------------+
| Parameter       | Value | Comments                  |
+=================+=======+===========================+
| OS              |       | e.g. Ubuntu 14.04.3       |
+-----------------+-------+---------------------------+
| oslo.messaging  |       | e.g. 5.30.0               |
+-----------------+-------+---------------------------+
Backend-specific versions must be gathered, as well as the versions of
third-party tools.

* RabbitMQ backend

+-----------------+-------+---------------------------+
| Parameter       | Value | Comments                  |
+=================+=======+===========================+
| RMQ server      |       | e.g. 3.6.11               |
+-----------------+-------+---------------------------+
| kombu client    |       | e.g. 4.10                 |
+-----------------+-------+---------------------------+
| AMQP client     |       | e.g. 2.2.1                |
+-----------------+-------+---------------------------+
* AMQP backend

+----------------------+-------+---------------------------+
| Parameter            | Value | Comments                  |
+======================+=======+===========================+
| Qpid dispatch router |       | e.g. 0.8.0                |
+----------------------+-------+---------------------------+
| python-qpid-proton   |       | e.g. 0.17.0               |
+----------------------+-------+---------------------------+
| pyngus               |       | e.g. 2.2.1                |
+----------------------+-------+---------------------------+
* ZeroMQ backend

+----------------------+-------+---------------------------+
| Parameter            | Value | Comments                  |
+======================+=======+===========================+
| pyzmq                |       | e.g. 17.0.0               |
+----------------------+-------+---------------------------+
| redis-server         |       | e.g. 4.0                  |
+----------------------+-------+---------------------------+
* Kafka backend

+----------------------+-------+---------------------------+
| Parameter            | Value | Comments                  |
+======================+=======+===========================+
| Kafka server         |       | e.g. 0.10.2               |
+----------------------+-------+---------------------------+
| kafka-python         |       | e.g. 1.3.4                |
+----------------------+-------+---------------------------+
| Java RE              |       | e.g. 8u144                |
+----------------------+-------+---------------------------+
Messaging middleware topology
"""""""""""""""""""""""""""""
This section describes the actual deployment of the messaging middleware. A
graph may be used to illustrate the topology of the messaging entities (e.g.
federated RabbitMQ clusters, a set of qdrouterd daemons).
OpenStack version
"""""""""""""""""
For the operational tests, the OpenStack version must be specified.

.. _message_queue_performance:

==========================
Massively Distributed RPCs
==========================
:status: **draft**
:version: 1.0
:Abstract:
  This document describes a test plan for evaluating the OpenStack message bus
  in the context of a massively distributed cloud architecture. For instance, a
  massively distributed cloud addresses the use case of Fog or Edge computing,
  where services are geographically distributed across a large area. For the
  time being, OpenStack inter-service communications rely on oslo_messaging,
  which defines a common abstraction over different instantiations of the
  message bus. Historically, broker-based implementations (e.g. RabbitMQ, Qpid)
  have competed with brokerless implementations (e.g. ZeroMQ), but with the
  advent of AMQP 1.0 in oslo_messaging, `alternative non-broker`_ messaging
  systems can now be envisioned. In the latter, messages can traverse a set of
  inter-connected agents (brokers or routers) before reaching their destination.
  The test plan takes place in the context of a prospective effort to evaluate
  the distribution of the messaging bus using emerging solutions (e.g. Qpid
  Dispatch Router) or established ones (e.g. ZeroMQ), compared to traditional
  and centralized solutions (e.g. RabbitMQ). Finally, the scope of this test
  plan is RPC communication between OpenStack services; notifications are out
  of scope.
Test Plan
=========
Test Environment
----------------
Most of the following test cases are synthetic tests, performed on top of
oslo_messaging in isolation from any OpenStack component. The test plan is
completed by operational testing, which aims to evaluate the overall behaviour
of OpenStack using a similar deployment of the messaging middleware.
Preparation
^^^^^^^^^^^
For the synthetic tests, tools like `ombt2`_ or `simulator`_ can be used. In
the former case, the tool must be configured to use a separate control bus
(e.g. RabbitMQ), distinct from the message bus under test, to avoid any
unwanted perturbation of the measurements. Failure injection can leverage
`os_faults`_. Both synthetic and operational experiments can be scripted using
`enoslib`_. Finally, operational testing can leverage `rally`_.
.. include:: environment_description.rst
Test Case 1 : One single large distributed target
-------------------------------------------------
Description
^^^^^^^^^^^
In this test case, clients send requests to the same Target and servers serve
those requests. The goal of this test case is to evaluate how large a single
distributed queue can be in terms of the number of clients and servers.
Moreover, RPC clients and servers must be distributed evenly across the
messaging components.
Methodology
^^^^^^^^^^^^
.. code-block:: none

    Start:
      * Provision a single RPC server on Target T
      * Provision a single RPC client
      * RPC client issues calls to T using a fixed delay between two messages.

    Repeat:
      * Add additional clients until RPC server CPU utilization reaches >70%
      * Provision another RPC server on T

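The following sketch illustrates how the clients and servers of this test case
could be implemented on top of oslo.messaging. It is only a minimal example:
the transport URL, topic name, ``echo`` method and delay value are illustrative
assumptions, not values prescribed by this test plan.

.. code-block:: python

    # Minimal sketch (illustrative names and values, not prescribed here).
    import time

    from oslo_config import cfg
    import oslo_messaging


    class BenchEndpoint(object):
        """Server-side endpoint answering the synthetic RPC calls."""

        def echo(self, ctxt, payload):
            # Echo the payload back so the caller can measure the round trip.
            return payload


    def start_server(transport_url, topic, server_name):
        # Every server started with the same topic subscribes to the same
        # distributed Target T; calls are load-balanced among those servers.
        transport = oslo_messaging.get_transport(cfg.CONF, url=transport_url)
        target = oslo_messaging.Target(topic=topic, server=server_name)
        server = oslo_messaging.get_rpc_server(transport, target,
                                               [BenchEndpoint()])
        server.start()
        return server


    def run_client(transport_url, topic, delay=0.1, n_messages=1000):
        transport = oslo_messaging.get_transport(cfg.CONF, url=transport_url)
        client = oslo_messaging.RPCClient(transport,
                                          oslo_messaging.Target(topic=topic))
        latencies = []
        for seq in range(n_messages):
            start = time.time()
            client.call({}, 'echo', payload={'seq': seq})  # blocking RPC call
            latencies.append(time.time() - start)  # per-call round-trip time
            time.sleep(delay)  # fixed delay between two messages
        return latencies

Additional clients are then started, and further servers are provisioned on the
same topic, following the procedure above.
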
List of performance metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following metrics are recorded for each repetition.
======== ================== ================= ================================
Priority Value              Measurement Units Description
======== ================== ================= ================================
1        Messages rate      msg/sec           Number of calls made by the
                                              callers per second
                                              (overall and by client)
2        Latency            ms                The round-trip latency in
                                              message processing
2        Latency stddev     ms                Standard deviation of the latency.
3        Sent               -                 The number of messages sent
                                              (overall and by client)
3        Processed          -                 The number of messages processed
                                              (overall and by server)
4        Throughput         bytes/sec         Volume of raw data flowing
                                              through the bus by unit of time.
======== ================== ================= ================================
.. note::

  - This test case can be run for RPC call and RPC cast.
  - In the case of RPC *call* tests, throughput and latency should be
    measured from the RPC client (the caller). For *cast* tests, the latency
    and throughput should be measured from the RPC server (since the client
    does not block for an ack). More specifically, the latency is the time
    taken by a message to reach the server. The throughput is calculated by
    dividing the total number of messages by the time interval between the
    first message sent by a client and the last message received by a server.
  - Throughput is correlated to the message rate but depends on the actual
    encoding of the message payload. It can be obtained by different means,
    e.g. monitoring statistics from the bus itself or an estimation based on
    the wire protocol used. The method must be specified clearly to allow
    fair comparisons.

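For illustration, the following small helper (an assumption of this test plan,
not part of any existing tool) derives the message rate and the throughput from
raw per-message timestamps and payload sizes:

.. code-block:: python

    # Illustrative helper: message rate and throughput from raw measurements.
    def rates(send_times, recv_times, payload_sizes):
        """send_times/recv_times are per-message timestamps in seconds;
        payload_sizes are the encoded message sizes in bytes."""
        interval = max(recv_times) - min(send_times)
        message_rate = len(send_times) / interval      # msg/sec
        throughput = sum(payload_sizes) / interval     # bytes/sec
        return message_rate, throughput
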
Test Case 2: Multiple distributed targets
-----------------------------------------
Description
^^^^^^^^^^^
The objective of the test case is to evaluate how many queues can be
simultaneously active and managed by the messaging middleware.
Methodology
^^^^^^^^^^^^
.. code-block:: none

    Start:
      * Provision a single RPC server on Target T
      * Provision a single RPC client
      * RPC client issues calls to T using a fixed delay between two messages.

    Repeat:
      * Add an additional (client, server) couple on another Target.

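A possible way to provision the additional couples is to derive one distinct
topic (i.e. one Target) per couple, as in the sketch below. The topic naming
scheme is an assumption, and ``start_server`` refers to the helper sketched in
Test Case 1.

.. code-block:: python

    # Illustrative sketch: one distinct Target (topic) per (client, server)
    # couple; the topic names and the start_server helper from Test Case 1
    # are assumptions of this document.
    def provision_couples(transport_url, n_couples):
        couples = []
        for i in range(n_couples):
            topic = 'bench_topic_%d' % i  # one Target per (client, server)
            server = start_server(transport_url, topic, 'server-%d' % i)
            couples.append((topic, server))
        return couples
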
List of performance metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following metrics are recorded for each repetition.
======== ================== ================= ================================
Priority Value              Measurement Units Description
======== ================== ================= ================================
1        Messages rate      msg/sec           Number of calls made by the
                                              callers per second
                                              (overall and by client)
2        Latency            ms                The round-trip latency in
                                              message processing
2        Latency stddev     ms                Standard deviation of the latency.
3        Sent               -                 The number of messages sent
                                              (overall and by client)
3        Processed          -                 The number of messages processed
                                              (overall and by server)
4        Throughput         bytes/sec         Volume of raw data flowing
                                              through the bus by unit of time.
======== ================== ================= ================================
.. note::

  This test case can be run for RPC call and RPC cast.
  Note that throughput is less interesting in the case of cast
  messages since it can be artificially high due to the lack of ack.

Test Case 3 : one single large distributed fanout
-------------------------------------------------
Description
^^^^^^^^^^^
The goal of this test case is to evaluate the ability of the message bus to
handle a large fanout.
Methodology
^^^^^^^^^^^
.. code-block:: none

    Start:
      * Provision a single RPC server on Target T
      * Provision a single RPC client
      * RPC client issues fanout cast to T:
        - 1 cast every second
        - n messages

    Repeat:
      * Add additional RPC server on T

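With oslo.messaging, a fanout cast can be issued by preparing the RPC client
with ``fanout=True``. The sketch below is illustrative only; the transport URL,
topic and ``notify`` method are assumptions.

.. code-block:: python

    # Illustrative fanout caster; names and values are assumptions.
    import time

    from oslo_config import cfg
    import oslo_messaging


    def run_fanout_client(transport_url, topic, n_messages):
        transport = oslo_messaging.get_transport(cfg.CONF, url=transport_url)
        client = oslo_messaging.RPCClient(transport,
                                          oslo_messaging.Target(topic=topic))
        # fanout=True delivers each cast to every RPC server listening on T.
        fanout = client.prepare(fanout=True)
        for seq in range(n_messages):
            fanout.cast({}, 'notify', payload={'seq': seq})
            time.sleep(1.0)  # 1 cast every second

Each RPC server would expose a corresponding ``notify`` endpoint method and
record the reception time of every message.
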
List of performance metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following metrics are recorded for each repetition.
======== ================== ================= ================================
Priority Value              Measurement Units Description
======== ================== ================= ================================
1        Latency            ms                Latency
2        Sent               -                 The number of messages sent
                                              (overall and by client)
2        Processed          -                 The number of messages processed
                                              (overall and by server)
======== ================== ================= ================================
.. note::

  In the case of a fanout cast, no ack is sent back to the sender. The latency
  is the time interval between the moment the message is sent and the moment it
  has been received by all the servers.

Test Case 4 : multiple distributed fanouts
------------------------------------------
Description
^^^^^^^^^^^
The goal of this test case is to evaluate how the message bus scales with the
number of fanouts it has to handle.
Methodology
^^^^^^^^^^^
.. code-block:: none

    Start:
      * Provision n RPC servers on Target T
      * Provision a single RPC client
      * RPC client issues fanout cast to T:
        - 1 cast every second
        - m messages

    Repeat:
      * Add (n RPC servers, 1 RPC client) on another Target

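Building on the fanout client sketched in Test Case 3, the additional
(n RPC servers, 1 RPC client) groups can, for instance, be driven concurrently
with one thread per Target; the topic naming scheme and the
``run_fanout_client`` helper are assumptions of this document.

.. code-block:: python

    # Illustrative sketch: one fanout client per Target, driven concurrently.
    # run_fanout_client refers to the helper sketched in Test Case 3.
    import threading


    def run_all_fanouts(transport_url, n_targets, n_messages):
        threads = []
        for i in range(n_targets):
            topic = 'fanout_topic_%d' % i  # one Target per group
            t = threading.Thread(target=run_fanout_client,
                                 args=(transport_url, topic, n_messages))
            t.start()
            threads.append(t)
        for t in threads:
            t.join()
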
List of performance metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following metrics are recorded for each repetition.
======== ================== ================= ================================
Priority Value              Measurement Units Description
======== ================== ================= ================================
1        Latency            ms                Latency
2        Sent               -                 The number of messages sent
                                              (overall and by client)
2        Processed          -                 The number of messages processed
                                              (overall and by server)
======== ================== ================= ================================
.. note::

  In the case of a fanout cast, no ack is sent back to the sender. The latency
  is the time interval between the moment the message is sent and the moment it
  has been received by all the servers.

Test Case 5 : Resilience
------------------------
Description
^^^^^^^^^^^
Usual centralized solutions offer mechanisms to increase their scalability
while providing high availability (e.g. RabbitMQ clustering and mirroring).
This kind of solution fits the single-datacenter case well but does not cope
with the distributed case, where high latency between communicating entities
can be observed. In a massively distributed case, communicating entities may
also fail more often (link down, hardware failure). The goal of this test case
is to evaluate the resiliency of the messaging layer to failures.
Methodology
^^^^^^^^^^^
The messaging infrastructure must be configured in such a way as to ensure
functionality can be preserved in the case of loss of any one messaging
component (e.g. three rabbit brokers in a cluster, two alternate paths across a
router mesh, etc.) Each messaging client must be configured with a fail-over
address for re-connecting to the message bus should its primary connection fail
(see the oslo.messaging documentation for `TransportURL`_ addresses).
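
For instance, with the RabbitMQ driver a fail-over address can be expressed by
listing several hosts in the transport URL. The sketch below is illustrative;
the host names and credentials are placeholders.

.. code-block:: python

    # Illustrative only: host names and credentials are placeholders.
    from oslo_config import cfg
    import oslo_messaging

    url = 'rabbit://guest:guest@broker-1:5672,guest:guest@broker-2:5672/'
    transport = oslo_messaging.get_transport(cfg.CONF, url=url)
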
The test environment is the same as that of Test Case 1: One single large
distributed target, with the caveat that each process comprising the message
bus maintains a steady-state CPU load of approximately 50%. In other words,
the test traffic should maintain a reasonable and consistent load on the
message bus without overloading it. Additionally, the test will be based on
RPC call traffic. RPC cast traffic is sent "least effort": cast messages are
more likely to be dropped than calls since there are no return ACKs in the
case of cast.
.. code-block:: none

    Start:
      * Provision the test environment as described above

    Phase 1:
      * reboot one component of the message bus (e.g. a single rabbit broker in
        the cluster)
      * wait until the component recovers and the message bus returns to steady
        state

    Repeat:
      * Phase 1 for each component of the message bus

    Phase 2:
      * force the failure of one of the TCP connections linking two components
        of the message bus (e.g. the connection between two rabbit brokers in a
        cluster).
      * wait 60 seconds
      * restore the connection
      * wait until message bus stabilizes

    Repeat:
      * Phase 2 for each TCP connection connecting any two message bus
        components.

    Phase 3:
      * force the failure of one of the TCP connections linking one client and
        its connected endpoint
      * wait until client reconnects using one fail-over address
      * restore the connection

    Repeat:
      * Phase 3 for each TCP connection connecting one client and the bus

.. note::

  Message bus backends are likely to offer specific ways to know when a steady
  state is reached after the recovery of one agent (e.g. polling the RabbitMQ
  API for the cluster status). This must be clearly stated in the test results.

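The reboots and connection failures described above can, for instance, be
injected with `os_faults`_. The sketch below is a rough illustration: the
os-faults configuration file, the service name and the exact API calls are
assumptions to be adapted to the deployment under test.

.. code-block:: python

    # Illustrative sketch; the configuration file and service name are
    # assumptions specific to the deployment under test.
    import os_faults

    cloud = os_faults.connect(config_filename='os-faults.yaml')
    cloud.verify()                          # check connectivity to the nodes
    service = cloud.get_service(name='rabbitmq')
    one_node = service.get_nodes().pick()   # pick one component of the bus
    one_node.reboot()                       # Phase 1: reboot a single component
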
List of performance metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^
======== =================== ================= ===============================
Priority Value               Measurement Units Description
======== =================== ================= ===============================
1        Call failures       -                 Total number of RPC call
                                               operations that failed grouped
                                               by exception type.
2        Reboot recovery     seconds           The average time between a
                                               component reboot and recovery
                                               of the message bus
2        Reconnect recovery  seconds           The average time between the
                                               restoration of an internal
                                               TCP connection and the recovery
                                               of the message bus
3        Message duplication -                 Total number of duplicated
                                               messages received by the servers.
======== =================== ================= ===============================
Common metrics to all test cases
--------------------------------
For each agent involved in the communication middleware, metrics about its
resource consumption under load must be gathered.

======== ================= ================= ================================
Priority Value             Measurement Units Description
======== ================= ================= ================================
1        CPU load          MHz               CPU load
2        RAM consumption   GB                RAM consumption
3        Open connections  -                 Number of TCP sockets opened
======== ================= ================= ================================
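These metrics can be gathered on each agent with any monitoring tool. A minimal
sketch using psutil is given below; note that psutil reports the CPU load as a
percentage rather than in MHz, and the sampling period is an arbitrary choice.

.. code-block:: python

    # Illustrative collector: samples CPU, RAM and open TCP sockets on the
    # host running a messaging agent.
    import time

    import psutil


    def sample(period=5):
        while True:
            cpu = psutil.cpu_percent(interval=None)           # CPU load (%)
            ram = psutil.virtual_memory().used                # bytes
            tcp = len(psutil.net_connections(kind='tcp'))     # open sockets
            print(cpu, ram, tcp)
            time.sleep(period)
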
Test Case 6 : Operational testing
----------------------------------
Description
^^^^^^^^^^^
Operational testing intends to evaluate the correct behaviour of a running
OpenStack on top of a specific deployment of the messaging middleware.
This test case aims to measure the correct behaviour of OpenStack
over a WAN at the messaging-plane level. It relies on `rally`_, which runs
workloads against the running OpenStack. The Rally reports can then be used to
get the execution time of operations and the failure rate in order to evaluate
OpenStack. The chosen Rally scenarios are those known to be intensive on the
messaging layer (e.g. `Neutron scenarios`_ and `Nova scenarios`_).
List of performance metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Since `rally`_ is used, the performance metrics are those reported by the
framework.
.. references:
.. _alternative non-broker: https://review.openstack.org/#/c/314603
.. _ombt2: https://github.com/kgiusti/ombt
.. _simulator: https://github.com/openstack/oslo.messaging/blob/master/tools/simulator.py
.. _enoslib: https://github.com/BeyondTheClouds/enoslib
.. _rally: https://github.com/openstack/rally
.. _Neutron scenarios: https://github.com/openstack/rally/tree/0aa2f58e43ff143a27880b5fa527b09c7670d5f7/samples/tasks/scenarios/neutron
.. _Nova scenarios: https://github.com/openstack/rally/tree/0aa2f58e43ff143a27880b5fa527b09c7670d5f7/samples/tasks/scenarios/nova
.. _TransportURL: https://docs.openstack.org/oslo.messaging/latest/reference/transport.html
.. _tc: http://www.tldp.org/HOWTO/html_single/Traffic-Control-HOWTO/
.. _os_faults: https://github.com/openstack/os-faults
Reports
=======