============================
Magnum Troubleshooting Guide
============================

This guide is intended for users who use Magnum to deploy and manage
clusters of hosts for a Container Orchestration Engine.  It describes
common failure conditions and techniques for troubleshooting.  To help
users quickly identify the relevant information, the guide is
organized as a list of failure symptoms, each with suggestions and
pointers to the details for troubleshooting.

A separate section `for developers`_ describes useful techniques such as
debugging unit tests and gate tests.

================
Failure symptoms
================

My cluster-create takes a really long time
  If you are using devstack on a small VM, cluster-create will take a long
  time and may eventually fail because of insufficient resources.
  Another possible reason is that a process on one of the nodes is hung
  and heat is still waiting on the signal.  In this case, it will eventually
  fail with a timeout, but since heat has a long default timeout, you can
  look at the `heat stacks`_ and check the WaitConditionHandle resources.

My cluster-create fails with error: "Failed to create trustee XXX in domain XXX"
  Check the `trustee for cluster`_

Kubernetes cluster-create fails
  Check the `heat stacks`_, log into the master nodes and check the
  `Kubernetes services`_ and `etcd service`_.

Swarm cluster-create fails
  Check the `heat stacks`_, log into the master nodes and check the `Swarm
  services`_ and `etcd service`_.

Mesos cluster-create fails
  Check the `heat stacks`_, log into the master nodes and check the `Mesos
  services`_.

I get the error "Timed out waiting for a reply" when deploying a pod
  Verify the `Kubernetes services`_ and `etcd service`_ are running on the
  master nodes.

I deploy pods on Kubernetes cluster but the status stays "Pending"
  The pod status is "Pending" while the Docker image is being downloaded,
  so if the status does not change for a long time, log into the minion
  node and check for `Cluster internet access`_.

I deploy pods and services on Kubernetes cluster but the app is not working
  The pods and services are running and the status looks correct, but
  if the app is performing communication between pods through services,
  verify `Kubernetes networking`_.

Swarm cluster is created successfully but I cannot deploy containers
  Check the `Swarm services`_ and `etcd service`_ on the master nodes.

Mesos cluster is created successfully but I cannot deploy containers on Marathon
  Check the `Mesos services`_ on the master node.

I get a "Protocol violation" error when deploying a container
  For Kubernetes, check the `Kubernetes services`_ to verify that
  kube-apiserver is running to accept the request.
  Check `TLS`_ and `Barbican service`_.

My cluster-create fails with a resource error on docker_volume
  Check for available volume space on Cinder and the `request volume
  size`_ in the heat template.
  Run "nova volume-list" to check the volume status.


=======================
Troubleshooting details
=======================

Heat stacks
-----------

A cluster is deployed by a set of heat stacks:  one top-level stack and several
nested stacks.  The stack names are prefixed with the cluster name, and the
nested stack names contain descriptive internal names like *kube_masters*
and *kube_minions*.

To list the status of all the stacks for a cluster::

    heat stack-list -n | grep *cluster-name*

If the cluster has failed, then one or more of the heat stacks would have
failed. From the stack list above, look for the stacks that failed, then
look for the particular resource(s) that failed in the failed stack by::

    heat resource-list *failed-stack-name* | grep "FAILED"

The resource_type of the failed resource should point to the OpenStack
service, e.g. OS::Cinder::Volume.  Check for more details on the failure by::

    heat resource-show *failed-stack-name* *failed-resource-name*

The resource_status_reason may give an indication on the failure, although
in some cases it may only say "Unknown".
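When many nested stacks are involved, the two lookups above can be scripted.
The sketch below is a minimal helper, not part of Magnum: the column order in
the heat client's table output is an assumption, so adjust the field numbers
if your client prints differently.

.. code-block:: bash

    # failed_resources: print the names of FAILED resources from
    # `heat resource-list` table output (name is the 2nd |-separated field).
    failed_resources() {
        awk -F'|' '/FAILED/ { gsub(/ /, "", $2); print $2 }'
    }

    # Normally: heat resource-list <failed-stack-name> | failed_resources
    # Demonstrated here with captured sample output:
    printf '%s\n' \
        '| docker_volume | 1234 | OS::Cinder::Volume | CREATE_FAILED   | t |' \
        '| kube_masters  | 5678 | OS::Heat::Stack    | CREATE_COMPLETE | t |' \
        | failed_resources

Each name printed can then be fed to *heat resource-show* as described above.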

If the failed resource is OS::Heat::WaitConditionHandle, this indicates that
one of the services that are being started on the node is hung.  Log into the
node where the failure occurred and check the respective `Kubernetes
services`_, `Swarm services`_ or `Mesos services`_.  If the failure is in
other scripts, look for them as `Heat software resource scripts`_.


Trustee for cluster
-------------------
When a user creates a cluster, Magnum will dynamically create a service account
for the cluster. The service account will be used by the cluster to
access the OpenStack services (e.g. Neutron, Swift). A trust relationship
will be created between the user who created the cluster (the "trustor") and
the service account created for the cluster (the "trustee"). For details,
please refer to the spec
`create-trustee-user-for-each-cluster
<http://git.openstack.org/cgit/openstack/magnum/tree/specs/create-trustee-user-for-each-cluster.rst>`_.

If Magnum fails to create the trustee, check the magnum config file (usually
in /etc/magnum/magnum.conf). Make sure 'trustee_*' and 'auth_uri' are set and
their values are correct::

    [keystone_authtoken]
    auth_uri = http://controller:5000/v3
    ...

    [trust]
    trustee_domain_admin_password = XXX
    trustee_domain_admin_id = XXX
    trustee_domain_id = XXX

If the 'trust' group is missing, you might need to create the trustee domain
and the domain admin:

.. code-block:: bash

    source /opt/stack/devstack/accrc/admin/admin
    export OS_IDENTITY_API_VERSION=3
    unset OS_AUTH_TYPE
    openstack domain create magnum
    openstack user create trustee_domain_admin --password=secret \
        --domain=magnum
    openstack role add --user=trustee_domain_admin --user-domain magnum --domain=magnum admin

    source /opt/stack/devstack/functions
    export MAGNUM_CONF=/etc/magnum/magnum.conf
    iniset $MAGNUM_CONF trust trustee_domain_id \
        $(openstack domain show magnum | awk '/ id /{print $4}')
    iniset $MAGNUM_CONF trust trustee_domain_admin_id \
        $(openstack user show trustee_domain_admin | awk '/ id /{print $4}')
    iniset $MAGNUM_CONF trust trustee_domain_admin_password secret

Then, restart magnum-api and magnum-conductor to pick up the new configuration.
If the problem persists, you might want to manually verify your domain
admin credentials to ensure they have the right privileges. To do that, run
the script below with the credentials replaced (you must use the IDs where
specified). If it fails, the credentials you provided are invalid.

.. code-block:: python

    from keystoneauth1.identity import v3 as ka_v3
    from keystoneauth1 import session as ka_session
    from keystoneclient.v3 import client as kc_v3

    auth = ka_v3.Password(
        auth_url=YOUR_AUTH_URI,
        user_id=YOUR_TRUSTEE_DOMAIN_ADMIN_ID,
        domain_id=YOUR_TRUSTEE_DOMAIN_ID,
        password=YOUR_TRUSTEE_DOMAIN_ADMIN_PASSWORD)

    session = ka_session.Session(auth=auth)
    domain_admin_client = kc_v3.Client(session=session)
    user = domain_admin_client.users.create(
        name='anyname',
        password='anypass')


TLS
---
*To be filled in*


Barbican service
----------------
*To be filled in*


Cluster internet access
-----------------------
The nodes for Kubernetes, Swarm and Mesos are connected to a private
Neutron network, so to provide access to the external internet, a router
connects the private network to a public network.  With devstack, the
default public network is "public", but this can be replaced by the
parameter "external-network" in the ClusterTemplate.  The "public" network
with devstack is actually not a real external network, so it is in turn
routed to the network interface of the host for devstack.  This is
configured in the file local.conf with the variable PUBLIC_INTERFACE,
for example::

    PUBLIC_INTERFACE=eth1

If the route to the external internet is not set up properly, the etcd
discovery will fail (if using public discovery) and container images
cannot be downloaded, among other failures.

First, check for connectivity to the external internet by pinging
an external IP (the IP shown here is an example; use an IP that
works in your case)::

    ping 8.8.8.8

If the ping fails, there is no route to the external internet.
Check the following:

- Is PUBLIC_INTERFACE in devstack/local.conf the correct network
  interface?  Does this interface have a route to the external internet?
- If "external-network" is specified in the ClusterTemplate, does this
  network have a route to the external internet?
- Is your devstack environment behind a firewall?  This can be the case for some
  enterprises or countries.  In this case, consider using a `proxy server
  <https://github.com/openstack/magnum/blob/master/doc/source/magnum-proxy.rst>`_.
- Is the traffic blocked by the security group? Check the
  `rules of security group
  <http://docs.openstack.org/ops-guide/ops-user-facing-operations.html#security-groups>`_.
- Is your host NAT'ing your internal network correctly? Check your host
  `iptables <http://docs.openstack.org/ops-guide/ops-network-troubleshooting.html#iptables>`_.
- Use *tcpdump* for `networking troubleshooting
  <http://docs.openstack.org/ops-guide/ops-network-troubleshooting.html#tcpdump>`_.
  You can run *tcpdump* on the interfaces *docker0*, *flannel0* and *eth0* on the
  node and then run *ping* to see the path of the message from the container.

If ping is successful, check that DNS is working::

    wget google.com

If DNS works, you should get back a few lines of HTML text.

If the name lookup fails, check the following:

- Is the DNS entry correct in the subnet?  Try "neutron subnet-show
  <subnet-id>" for the private subnet and check dns_nameservers.
  The IP should be either the default public DNS 8.8.8.8 or the value
  specified by "dns-nameserver" in the ClusterTemplate.
- If you are using your own DNS server by specifying "dns-nameserver"
  in the ClusterTemplate, is it reachable and working?
- More help on `DNS troubleshooting <http://docs.openstack.org/ops-guide/ops-network-troubleshooting.html#debugging-dns-issues>`_.
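
To check the dns_nameservers value from a script, the field can be pulled out
of the subnet-show table with awk.  The two-column table layout used below is
an assumption about the client's output format.

.. code-block:: bash

    # dns_servers: extract dns_nameservers from `neutron subnet-show` output.
    dns_servers() {
        awk -F'|' '$2 ~ /dns_nameservers/ { gsub(/ /, "", $3); print $3 }'
    }

    # Normally: neutron subnet-show <subnet-id> | dns_servers
    # Demonstrated here with captured sample output:
    printf '%s\n' \
        '| cidr            | 10.0.0.0/24 |' \
        '| dns_nameservers | 8.8.8.8     |' \
        | dns_servers

The printed value should be 8.8.8.8 or the "dns-nameserver" set in the
ClusterTemplate, as described above.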


Kubernetes networking
---------------------

The networking between pods is different and separate from the neutron
network set up for the cluster.
Kubernetes presents a flat network space for the pods and services
and uses different network drivers to provide this network model.

It is possible for the pods to come up correctly and be able to connect
to the external internet, but they cannot reach each other.
In this case, the app in the pods may not be working as expected.
For example, if you are trying the `redis example
<https://github.com/kubernetes/kubernetes/blob/release-1.1/examples/redis/README.md>`_,
the key:value may not be replicated correctly.  In this case, use the
following steps to verify the inter-pods networking and pinpoint problems.

Since the steps are specific to the network drivers, refer to the
particular driver being used for the cluster.

Using Flannel as network driver
...............................

Flannel is the default network driver for Kubernetes clusters.  Flannel is
an overlay network that runs on top of the neutron network.  It works by
encapsulating the messages between pods and forwarding them to the
correct node that hosts the target pod.

First check the connectivity at the node level.  Log into two
different minion nodes, e.g. node A and node B, run a docker container
on each node, attach to the container and find the IP.

For example, on node A::

    sudo docker run -it alpine
    # ip -f inet -o a | grep eth0 | awk '{print $4}'
    10.100.54.2/24

Similarly, on node B::

    sudo docker run -it alpine
    # ip -f inet -o a | grep eth0 | awk '{print $4}'
    10.100.49.3/24

Check that the containers can see each other by pinging from one to another.

On node A::

    # ping 10.100.49.3
    PING 10.100.49.3 (10.100.49.3): 56 data bytes
    64 bytes from 10.100.49.3: seq=0 ttl=60 time=1.868 ms
    64 bytes from 10.100.49.3: seq=1 ttl=60 time=1.108 ms

Similarly, on node B::

    # ping 10.100.54.2
    PING 10.100.54.2 (10.100.54.2): 56 data bytes
    64 bytes from 10.100.54.2: seq=0 ttl=60 time=2.678 ms
    64 bytes from 10.100.54.2: seq=1 ttl=60 time=1.240 ms

If the ping is not successful, check the following:

- Is neutron working properly?  Try pinging between the VMs.

- Are the docker0 and flannel0 interfaces configured correctly on the
  nodes? Log into each node and find the Flannel CIDR by::

    cat /run/flannel/subnet.env | grep FLANNEL_SUBNET
    FLANNEL_SUBNET=10.100.54.1/24

  Then check the interfaces by::

    ifconfig flannel0
    ifconfig docker0

  The correct configuration should assign flannel0 with the "0" address
  in the subnet, like *10.100.54.0*, and docker0 with the "1" address, like
  *10.100.54.1*.

- Verify that the IPs assigned to the containers, as found above, are in the
  correct Flannel subnet.  If they are not, the docker daemon is not configured
  correctly with the parameter *--bip*.  Check the systemd service for docker.

- Is Flannel running properly?  Check `Running Flannel`_.

- Ping and try `tcpdump
  <http://docs.openstack.org/ops-guide/ops-network-troubleshooting.html#tcpdump>`_
  on each network interface along the path between two nodes
  to see how far the message is able to travel.
  The message path should be as follows:

  1. Source node: docker0
  2. Source node: flannel0
  3. Source node: eth0
  4. Target node: eth0
  5. Target node: flannel0
  6. Target node: docker0
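
The expected flannel0 and docker0 addresses in the check above can be derived
mechanically from FLANNEL_SUBNET.  This is a minimal sketch assuming the
default /24 segment; it falls back to a sample subnet when
*/run/flannel/subnet.env* is absent (e.g. when run off-node).

.. code-block:: bash

    # Derive the addresses flannel0 (the "0" address) and docker0 (the "1"
    # address) should carry from the node's Flannel subnet.
    subnet=$(sed -n 's/^FLANNEL_SUBNET=//p' /run/flannel/subnet.env 2>/dev/null)
    subnet=${subnet:-10.100.54.1/24}    # sample value when run off-node

    prefix=$(echo "$subnet" | sed 's#\.[0-9]*/[0-9]*$##')
    echo "flannel0 should be ${prefix}.0"
    echo "docker0 should be ${prefix}.1"

Compare the printed addresses against the *ifconfig* output above.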

If ping works, this means the flannel overlay network is functioning
correctly.

The containers created by Kubernetes for pods will be on the same IP
subnet as the containers created directly in Docker as above, so they
will have the same connectivity.  However, the pods still may not be
able to reach each other because normally they connect through
Kubernetes services rather than directly.  The services are supported
by kube-proxy and rules inserted into iptables, so their networking
paths have some extra hops where problems may occur.
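
Those extra hops are NAT rules that kube-proxy inserts into iptables, so one
quick sanity check is to look for rules mentioning the service IP in
*iptables-save* output.  The KUBE- chain-name prefix and the sample rule below
are assumptions that depend on the Kubernetes version and proxy mode.

.. code-block:: bash

    # kube_rules <service-ip> -- show KUBE-* rules touching that address.
    # Feed it the output of `sudo iptables-save -t nat` on a node.
    kube_rules() {
        grep -F -- "$1" | grep -e '^-A KUBE'
    }

    # Demonstrated here with captured sample output (rule names hypothetical):
    printf '%s\n' \
        '-A KUBE-SERVICES -d 10.254.21.158/32 -p tcp -j KUBE-SVC-XYZ' \
        '-A POSTROUTING -s 10.0.0.0/24 -j MASQUERADE' \
        | kube_rules 10.254.21.158

If no rule mentions the service IP, kube-proxy has not programmed the service,
which points at the kube-proxy checks below.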

To check the connectivity at the Kubernetes pod level, log into the
master node and create two pods and a service for one of the pods.
You can use the examples provided in the directory
*/etc/kubernetes/examples/* for the first pod and service.  This will
start up an nginx container and a Kubernetes service to expose the
endpoint.  Create another manifest for a second pod to test the
endpoint::

    cat > alpine.yaml << END
    apiVersion: v1
    kind: Pod
    metadata:
      name: alpine
    spec:
      containers:
      - name: alpine
        image: alpine
        args:
        - sleep
        - "1000000"
    END

    kubectl create -f /etc/kubernetes/examples/pod-nginx-with-label.yaml
    kubectl create -f /etc/kubernetes/examples/service.yaml
    kubectl create -f alpine.yaml

Get the endpoint for the nginx-service, which should route message to the pod
nginx::

    kubectl describe service nginx-service | grep -e IP: -e Port:
    IP:                     10.254.21.158
    Port:                   <unnamed>       8000/TCP

Note the IP and port to use for checking below.  Log into the node
where the *alpine* pod is running.  You can find the hosting node by
running this command on the master node::

    kubectl get pods -o wide  | grep alpine | awk '{print $6}'
    k8-gzvjwcooto-0-gsrxhmyjupbi-kube-minion-br73i6ans2b4

To get the IP of the node, query Nova on devstack::

    nova list

On this hosting node, attach to the *alpine* container::

    export DOCKER_ID=`sudo docker ps | grep k8s_alpine | awk '{print $1}'`
    sudo docker exec -it $DOCKER_ID sh

From the *alpine* pod, you can try to reach the nginx pod through the nginx
service using the IP and Port found above::

    wget 10.254.21.158:8000

If the connection is successful, you should receive the file *index.html* from
nginx.

If the connection is not successful, you will get an error message like::

    wget: can't connect to remote host (10.100.54.9): No route to host

In this case, check the following:

- Is kube-proxy running on the nodes? It runs as a container on each node.
  Check by logging into the minion nodes and running::

    sudo docker ps | grep k8s_kube-proxy

- Check the log from kube-proxy by running on the minion nodes::

    export PROXY=`sudo docker ps | grep "hyperkube proxy" | awk '{print $1}'`
    sudo docker logs $PROXY

- Try additional `service debugging
  <https://github.com/kubernetes/kubernetes/blob/release-1.1/docs/user-guide/debugging-services.md>`_.
  To see what's going on during provisioning::

    kubectl get events

  To get information on a service in question::

    kubectl describe services <service_name>



etcd service
------------

The etcd service is used by many other components for key/value pair
management, therefore if it fails to start, these other components
will not be running correctly either.
Check that etcd is running on the master nodes by::

    sudo service etcd status -l

If it is running correctly, you should see that the service is
successfully deployed::

    Active: active (running) since ....

The log message should show the service being published::

    etcdserver: published {Name:10.0.0.5 ClientURLs:[http://10.0.0.5:2379]} to cluster 3451e4c04ec92893

In some cases, the service may show as *active* but may still be stuck
in discovery mode and not fully operational.  The log message may show
something like::

    discovery: waiting for other nodes: error connecting to https://discovery.etcd.io, retrying in 8m32s

If this condition persists, check for `Cluster internet access`_.

If the daemon is not running, the status will show the service as failed,
something like::

    Active: failed (Result: timeout)

In this case, try restarting etcd by::

    sudo service etcd start

If etcd continues to fail, check the following:

- Check the log for etcd::

    sudo journalctl -u etcd

- etcd requires discovery, and the default discovery method is the
  public discovery service provided by etcd.io; therefore, a common
  cause of failure is that this public discovery service is not
  reachable.  Check by running on the master nodes::

    source /etc/sysconfig/heat-params
    curl $ETCD_DISCOVERY_URL

  You should receive something like::

    {"action":"get",
     "node":{"key":"/_etcd/registry/00a6b00064174c92411b0f09ad5466c6",
             "dir":true,
             "nodes":[
               {"key":"/_etcd/registry/00a6b00064174c92411b0f09ad5466c6/7d8a68781a20c0a5",
                "value":"10.0.0.5=http://10.0.0.5:2380",
                "modifiedIndex":978239406,
                "createdIndex":978239406}],
             "modifiedIndex":978237118,
             "createdIndex":978237118}
    }

  The list of master IPs is provided by Magnum during cluster deployment,
  therefore it should match the current IPs of the master nodes.
  If the public discovery service is not reachable, check the
  `Cluster internet access`_.
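
To compare the registered members against the current master IPs in a script,
the value fields can be pulled out of the discovery JSON with grep and sed;
a real JSON parser such as *jq* is nicer if available.

.. code-block:: bash

    # members: extract the member addresses registered at the discovery URL.
    members() {
        grep -o '"value":"[^"]*"' | sed 's/"value":"//; s/"$//'
    }

    # Normally: curl -s $ETCD_DISCOVERY_URL | members
    # Demonstrated here with a captured sample record:
    echo '{"node":{"nodes":[{"value":"10.0.0.5=http://10.0.0.5:2380"}]}}' | members

Each printed entry is a *name=peer-URL* pair; the IPs should match the
current master nodes.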

Running Flannel
---------------

When deploying a COE, Flannel is available as a network driver for
certain COE types.  Magnum currently supports Flannel for Kubernetes
and Swarm clusters.

Flannel provides a flat network space for the containers in the cluster:
they are allocated IPs in this network space and have connectivity
to each other.  Therefore, if Flannel fails, some containers will not
be able to access services from other containers in the cluster.  This can be
confirmed by running *ping* or *curl* from one container to another.

The Flannel daemon is run as a systemd service on each node of the cluster.
To check Flannel, run on each node::

    sudo service flanneld status

If the daemon is running, you should see that the service is successfully
deployed::

    Active: active (running) since ....

If the daemon is not running, the status will show the service as failed,
something like::

    Active: failed (Result: timeout) ....

or::

    Active: inactive (dead) ....

The Flannel daemon may also be running but not functioning correctly.
Check the following:

- Check the log for Flannel::

    sudo journalctl -u flanneld

- Since Flannel relies on etcd, a common cause for failure is that the
  etcd service is not running on the master nodes.  Check the `etcd service`_.
  If the etcd service failed, once it has been restored successfully, the
  Flannel service can be restarted by::

    sudo service flanneld restart

- Magnum writes the configuration for Flannel in a local file on each master
  node.  Check for this file on the master nodes by::

    cat /etc/sysconfig/flannel-network.json

  The content should be something like::

    {
      "Network": "10.100.0.0/16",
      "Subnetlen": 24,
      "Backend": {
        "Type": "udp"
      }
    }

  where the values for the parameters must match the corresponding
  parameters from the ClusterTemplate.

  Magnum also loads this configuration into etcd, therefore, verify
  the configuration in etcd by running *etcdctl* on the master nodes::

    . /etc/sysconfig/flanneld
    etcdctl get $FLANNEL_ETCD_KEY/config

- Each node is allocated a segment of the network space.  Check
  for this segment on each node by::

    grep FLANNEL_SUBNET /run/flannel/subnet.env

  The containers on this node should be assigned an IP in this range.
  The nodes negotiate for their segment through etcd, and you can use
  *etcdctl* on the master node to query the network segment associated
  with each node::

    . /etc/sysconfig/flanneld
    for s in `etcdctl ls $FLANNEL_ETCD_KEY/subnets`
    do
        echo $s
        etcdctl get $s
    done

    /atomic.io/network/subnets/10.100.14.0-24
    {"PublicIP":"10.0.0.5"}
    /atomic.io/network/subnets/10.100.61.0-24
    {"PublicIP":"10.0.0.6"}
    /atomic.io/network/subnets/10.100.92.0-24
    {"PublicIP":"10.0.0.7"}

  Alternatively, you can read the full record in etcd by::

    curl http://<master_node_ip>:2379/v2/keys/coreos.com/network/subnets

  You should receive a JSON snippet that describes all the segments
  allocated.

- This network segment is passed to Docker via the parameter *--bip*.
  If this is not configured correctly, Docker would not assign the correct
  IP in the Flannel network segment to the container.  Check by::

    cat /run/flannel/docker
    ps -aux | grep docker

- Check the interface for Flannel::

    ifconfig flannel0

  The IP should be the first address in the Flannel subnet for this node.

- Flannel has several different backend implementations, each with
  specific requirements.  The *udp* backend is the most general and has
  no requirements on the network.  The *vxlan* backend requires vxlan
  support in the kernel, so ensure that the image used provides
  vxlan support.  The *host-gw* backend requires that all the hosts are
  on the same L2 network.  This is currently met by the private Neutron
  subnet created by Magnum; however, if another network topology is used
  instead, ensure that this requirement is met if *host-gw* is used.
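
The *--bip* check above boils down to comparing network prefixes, which can be
scripted.  *same_subnet* here is a hypothetical helper assuming the default
Subnetlen of 24.

.. code-block:: bash

    # same_subnet <container-ip> <flannel-subnet> -- compare /24 prefixes.
    same_subnet() {
        c=$(echo "$1" | cut -d. -f1-3)
        s=$(echo "$2" | cut -d/ -f1 | cut -d. -f1-3)
        if [ "$c" = "$s" ]; then echo match; else echo MISMATCH; fi
    }

    same_subnet 10.100.54.2 10.100.54.1/24   # a container on this node
    same_subnet 10.100.99.2 10.100.54.1/24   # would point at a bad --bip

A MISMATCH for a container on the node indicates the docker daemon was not
started with the *--bip* value from */run/flannel/docker*.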

Current known limitation:  the image fedora-21-atomic-5.qcow2 has
Flannel version 0.5.0.  This version has known bugs that prevent the
vxlan and host-gw backends from working correctly.  Only the udp backend
works for this image.  Version 0.5.3 and later should work correctly.
The image fedora-21-atomic-7.qcow2 has Flannel version 0.5.5.

Kubernetes services
-------------------
*To be filled in*

(How to introspect k8s when heat works and k8s does not)

An additional `Kubernetes troubleshooting guide
<http://kubernetes.io/v1.0/docs/troubleshooting.html>`_ is available.

Swarm services
--------------
*To be filled in*

(How to check on a swarm cluster: see membership information, view master,
agent containers)

Mesos services
--------------
*To be filled in*


Barbican issues
---------------
*To be filled in*


Docker CLI
----------
*To be filled in*


Request volume size
-------------------
*To be filled in*


Heat software resource scripts
------------------------------
*To be filled in*


==============
For Developers
==============

This section is intended to help with issues that developers may
run into in the course of their development adventures in Magnum.

Troubleshooting in Gate
-----------------------

Simulating gate tests
  *Note*: This is adapted from Devstack Gate's `README`_, which
  is worth a quick read to better understand the following.

  #. Boot a VM as described in Devstack Gate's `README`_.
  #. Provision this VM like so::

      apt-get update \
      && apt-get upgrade -y \ # Kernel upgrade, as recommended by README, select to keep existing grub config
      && apt-get install -y git tmux vim \
      && git clone https://git.openstack.org/openstack-infra/system-config \
      && system-config/install_puppet.sh && system-config/install_modules.sh \
      && puppet apply \
      --modulepath=/root/system-config/modules:/etc/puppet/modules \
      -e "class { openstack_project::single_use_slave: install_users => false,
      ssh_key => \"$( cat .ssh/authorized_keys | awk '{print $2}' )\" }" \
      && echo "jenkins ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers \
      && cat ~/.ssh/authorized_keys >> /home/jenkins/.ssh/authorized_keys
  #. Compare ``~/.ssh/authorized_keys`` and ``/home/jenkins/.ssh/authorized_keys``.  Your original public SSH key should now be in ``/home/jenkins/.ssh/authorized_keys``.  If it's not, explicitly copy it (this can happen if you spin up a VM using ``--key-name <name>``, for example).
  #. Assuming all is well up to this point, it's time to ``reboot`` into the latest kernel.
  #. Once you've booted into the new kernel, log back in as the ``jenkins`` user to continue setting up the simulation.
  #. Now it's time to set up the workspace::

      export REPO_URL=https://git.openstack.org
      export WORKSPACE=/home/jenkins/workspace/testing
      export ZUUL_URL=/home/jenkins/workspace-cache2
      export ZUUL_REF=HEAD
      export ZUUL_BRANCH=master
      export ZUUL_PROJECT=openstack/magnum
      mkdir -p $WORKSPACE
      git clone $REPO_URL/$ZUUL_PROJECT $ZUUL_URL/$ZUUL_PROJECT \
      && cd $ZUUL_URL/$ZUUL_PROJECT \
      && git checkout remotes/origin/$ZUUL_BRANCH
  #. At this point, you may be wanting to test a specific change. If so, you can pull down the changes in ``$ZUUL_URL/$ZUUL_PROJECT`` directory::

      cd $ZUUL_URL/$ZUUL_PROJECT \
      && git fetch https://review.openstack.org/openstack/magnum refs/changes/83/247083/12 && git checkout FETCH_HEAD
  #. Now you're ready to pull down the ``devstack-gate`` scripts that will let you run the gate job on your own VM::

      cd $WORKSPACE \
      && git clone --depth 1 $REPO_URL/openstack-infra/devstack-gate
  #. Now you can kick off the job using the following script (the ``devstack-gate`` documentation suggests just copying from the job, which can be found in the `project-config <https://github.com/openstack-infra/project-config>`_ repository); naturally, it should be executable (``chmod u+x <filename>``)::

      #!/bin/bash -xe
      cat > clonemap.yaml << EOF
      clonemap:
        - name: openstack-infra/devstack-gate
          dest: devstack-gate
      EOF
      /usr/zuul-env/bin/zuul-cloner -m clonemap.yaml --cache-dir /opt/git \
          git://git.openstack.org \
          openstack-infra/devstack-gate
      export PYTHONUNBUFFERED=true
      export DEVSTACK_GATE_TIMEOUT=240 # bump this if you see timeout issues.  Default is 120
      export DEVSTACK_GATE_TEMPEST=0
      export DEVSTACK_GATE_NEUTRON=1
      # Enable tempest for tempest plugin
      export ENABLED_SERVICES=tempest
      export BRANCH_OVERRIDE="default"
      if [ "$BRANCH_OVERRIDE" != "default" ] ; then
          export OVERRIDE_ZUUL_BRANCH=$BRANCH_OVERRIDE
      fi
      export PROJECTS="openstack/magnum $PROJECTS"
      export PROJECTS="openstack/python-magnumclient $PROJECTS"
      export PROJECTS="openstack/barbican $PROJECTS"
      export DEVSTACK_LOCAL_CONFIG="enable_plugin magnum git://git.openstack.org/openstack/magnum stable/ocata"
      export DEVSTACK_LOCAL_CONFIG+=$'\n'"enable_plugin ceilometer git://git.openstack.org/openstack/ceilometer stable/ocata"
      # Keep localrc to be able to set some vars in post_test_hook
      export KEEP_LOCALRC=1
      function gate_hook {
           cd /opt/stack/new/magnum/
          ./magnum/tests/contrib/gate_hook.sh api # change this to swarm to run swarm functional tests or k8s to run kubernetes functional tests
      }
      export -f gate_hook
      function post_test_hook {
          source $BASE/new/devstack/accrc/admin/admin
          cd /opt/stack/new/magnum/
          ./magnum/tests/contrib/post_test_hook.sh api # change this to swarm to run swarm functional tests or k8s to run kubernetes functional tests
      }
      export -f post_test_hook
      cp devstack-gate/devstack-vm-gate-wrap.sh ./safe-devstack-vm-gate-wrap.sh
      ./safe-devstack-vm-gate-wrap.sh

Helpful nuances about the Devstack Gate
  * Main job is in ``project-config``'s `magnum.yaml <https://github.com/openstack-infra/project-config/blob/master/jenkins/jobs/magnum.yaml>`_.

    * Must modify parameters passed in since those are escaped:

      * Anything with ``{}`` should be set as an environment variable

      * Anything with ``{{ }}`` should have those brackets changed to
        single brackets - ``{}``.

      * As with the documentation for Devstack Gate, you can just create
        a new file for the job you want, paste in what you want, then
        ``chmod u+x <filename>`` and run it.

    * Parameters can be found in `projects.yaml <https://github.com/openstack-infra/project-config/blob/master/jenkins/jobs/projects.yaml>`_.
      This file changes frequently, so it is more reliable to search
      for the magnum jobs in it, where you'll see examples of what
      gets passed in.

  * Three jobs are usually run as a part of the Magnum gate, all of which are found in ``project-config``'s `macros.yaml <https://github.com/openstack-infra/project-config/blob/master/jenkins/jobs/macros.yaml>`_:

    * link-logs

    * net-info

    * devstack-checkout

  * After you run a job, it's ideal to clean up and start over with a
    fresh VM to best simulate the Devstack Gate environment.

.. _README: https://github.com/openstack-infra/devstack-gate/blob/master/README.rst#simulating-devstack-gate-tests