Debugging a disconnected managed node from Conductor

By Tyler Carroll posted 02-05-2019 16:56

  
Before beginning this process, if you have deployed an HA Conductor, please ensure that you have full connectivity between the Conductor nodes. If show system connectivity shows that a Conductor node is disconnected, please refer to this blog post first to establish full connectivity between nodes within a Router or Conductor: https://community.128technology.com/blogs/tyler-carroll/2019/02/06/debugging-a-disconnected-peer-node. If an HA Conductor has connectivity issues between its nodes, then the managed nodes may display incorrect connectivity information.

In this blog post I will discuss some troubleshooting techniques to determine why a managed node is in a disconnected state from the point of view of a Conductor. In this example I have deployed a 128T Conductor named Conductor, which has two nodes, T121_DUT1 and T121_DUT2, and a 128T Router named Router, which has two nodes, T121_DUT3 and T121_DUT4. Here you can see that, from the point of view of Conductor node T121_DUT1, Router node T121_DUT4 is in a disconnected state:
admin@T121_DUT1.Conductor# show system connectivity
Tue 2019-02-05 19:05:27 UTC

===================== ===================== ==============
Local Node             Remote Node           State
===================== ===================== ==============
T121_DUT1.Conductor    T121_DUT2.Conductor   connected
T121_DUT1.Conductor    T121_DUT3.Router      connected
T121_DUT1.Conductor    T121_DUT4.Router      disconnected
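
If you capture this output to a file or a string, the disconnected nodes can be picked out programmatically. The following is a minimal sketch, assuming the three-column layout shown above; the function name disconnected_nodes is hypothetical and not part of any 128T tooling.

```python
from typing import List


def disconnected_nodes(output: str) -> List[str]:
    """Return the remote nodes whose state is anything other than 'connected'."""
    nodes = []
    for line in output.splitlines():
        parts = line.split()
        # Data rows have exactly three columns, and both node names
        # are written as <node>.<router>, so they contain a dot.
        if len(parts) == 3 and "." in parts[0] and "." in parts[1]:
            if parts[2] != "connected":
                nodes.append(parts[1])
    return nodes


sample = """\
===================== ===================== ==============
Local Node             Remote Node           State
===================== ===================== ==============
T121_DUT1.Conductor    T121_DUT2.Conductor   connected
T121_DUT1.Conductor    T121_DUT3.Router      connected
T121_DUT1.Conductor    T121_DUT4.Router      disconnected
"""

print(disconnected_nodes(sample))  # ['T121_DUT4.Router']
```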

First, make sure that 128T is up and running on the node. You can quickly determine this information from the Conductor by using the show assets command or checking the assets page in the GUI:
admin@T121_DUT1.Conductor# show assets
Tue 2019-02-05 19:32:24 UTC

=========== =========== =========== ================================================ ========= ========
 Router      Node        Asset Id    128T Version                                     Status    Errors
=========== =========== =========== ================================================ ========= ========
 Conductor   T121_DUT1   T121_DUT1   4.2.0-0.201902041802.snapshot.debug.el7.centos   running        0
             T121_DUT2   T121_DUT2   4.2.0-0.201902041802.snapshot.debug.el7.centos   running        0
 Router      T121_DUT3   T121_DUT3   4.2.0-0.201902041802.snapshot.debug.el7.centos   running        0
             T121_DUT4   T121_DUT4   4.2.0-0.201902041802.snapshot.debug.el7.centos   stopped        0

If 128T on the node is stopped then you can start it directly from the Conductor by executing the command send command start router <router-name> node <node-name>, or pressing the start button on the assets page in the GUI.
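
For a deployment with many assets, it can help to scan the show assets output for stopped nodes and build the matching start commands. This is only a sketch, assuming the column layout shown above, where the Router column is left blank on continuation rows; the helper name stopped_nodes is hypothetical.

```python
from typing import List, Tuple


def stopped_nodes(output: str) -> List[Tuple[str, str]]:
    """Return (router, node) pairs whose Status column reads 'stopped'."""
    current_router = None
    result = []
    for line in output.splitlines():
        parts = line.split()
        # Skip blank lines and the ===== rule lines.
        if not parts or set(line.strip()) <= {"=", " "}:
            continue
        if len(parts) == 6:
            # Full row: router, node, asset id, version, status, errors.
            current_router, node, status = parts[0], parts[1], parts[4]
        elif len(parts) == 5 and current_router is not None:
            # Continuation row: the router column is blank.
            node, status = parts[0], parts[3]
        else:
            continue  # header, timestamp, or prompt line
        if status == "stopped":
            result.append((current_router, node))
    return result


sample = """\
=========== =========== =========== ================================================ ========= ========
 Router      Node        Asset Id    128T Version                                     Status    Errors
=========== =========== =========== ================================================ ========= ========
 Conductor   T121_DUT1   T121_DUT1   4.2.0-0.201902041802.snapshot.debug.el7.centos   running        0
             T121_DUT2   T121_DUT2   4.2.0-0.201902041802.snapshot.debug.el7.centos   running        0
 Router      T121_DUT3   T121_DUT3   4.2.0-0.201902041802.snapshot.debug.el7.centos   running        0
             T121_DUT4   T121_DUT4   4.2.0-0.201902041802.snapshot.debug.el7.centos   stopped        0
"""

for router, node in stopped_nodes(sample):
    # Command form taken from the paragraph above.
    print(f"send command start router {router} node {node}")
```

With the sample above this prints the single command for Router node T121_DUT4, which you would then run from the Conductor.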

Next, double check the environment config of the disconnected node. Examine the file /etc/128technology/global.init and make sure it contains the correct number of Conductor addresses and that these addresses are correct. Please note that the Conductor node names do not matter in global.init; they are just placeholders, because the managed nodes dynamically learn the names of the Conductor nodes. If anything appears incorrect in global.init, correct it and restart 128T:
[root@t121-dut4 ~]# cat /etc/128technology/global.init
{
  "init" : {
    "control" : {
      "T121_DUT3" : {
        "host" : "172.16.1.3"
      },
      "T121_DUT4" : {
        "host" : "172.16.1.4"
      }
    },
    "conductor" : {
      "conductor-node-two" : {
        "host" : "192.168.1.11"
      },
      "conductor-node-one" : {
        "host" : "192.168.1.10"
      }
    },
    "routerName" : "Router"
  }
}
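
The same check can be scripted. The sketch below assumes that global.init is plain JSON with the structure shown above and that the hosts are written as IP addresses, as in this example; the function names and the expected_addresses argument are hypothetical. The placeholder node names are deliberately ignored, since only the addresses matter.

```python
import ipaddress
import json
from typing import Dict, Iterable, List


def conductor_hosts(global_init_text: str) -> Dict[str, str]:
    """Map each Conductor placeholder name to its configured host."""
    init = json.loads(global_init_text)["init"]
    return {name: entry["host"] for name, entry in init.get("conductor", {}).items()}


def check_conductor_hosts(global_init_text: str, expected_addresses: Iterable[str]) -> List[str]:
    """Return a list of problems found; an empty list means no mismatch."""
    hosts = conductor_hosts(global_init_text)
    problems = []
    for name, host in hosts.items():
        try:
            ipaddress.ip_address(host)  # assumption: hosts are IP literals
        except ValueError:
            problems.append(f"{name}: {host!r} is not a valid IP address")
    configured = set(hosts.values())
    expected = set(expected_addresses)
    for address in sorted(expected - configured):
        problems.append(f"missing Conductor address {address}")
    for address in sorted(configured - expected):
        problems.append(f"unexpected Conductor address {address}")
    return problems


sample = """
{
  "init" : {
    "control" : {
      "T121_DUT3" : { "host" : "172.16.1.3" },
      "T121_DUT4" : { "host" : "172.16.1.4" }
    },
    "conductor" : {
      "conductor-node-two" : { "host" : "192.168.1.11" },
      "conductor-node-one" : { "host" : "192.168.1.10" }
    },
    "routerName" : "Router"
  }
}
"""

print(check_conductor_hosts(sample, ["192.168.1.10", "192.168.1.11"]))  # []
```

On a real node you would read the text from /etc/128technology/global.init and pass in the Conductor addresses you expect for your deployment.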

Next, check the output of show system connectivity internal on the disconnected node and see if there were any errors trying to connect to the Conductor. I have cut out some of the output in the example:
admin@T121_DUT4.Router# show system connectivity internal
Tue 2019-02-05 21:33:04 UTC

============ ==================================== ========================= ================= ====================================
 Local Node   Remote Node                          Service                   Address           Message
============ ==================================== ========================= ================= ====================================
 T121_DUT4    T121_DUT4.Router                     Zookeeper                 127.0.0.1:4370    Connected
 T121_DUT4    T121_DUT4.Router                     epm                       127.0.0.2:14444   Connected
 T121_DUT4    T121_DUT4.Router                     ntp                       127.0.0.1:4390    Connected
...
 T121_DUT4    T121_DUT4.Router                     ssc                       127.0.0.2:12222   Connected
 T121_DUT4    UNKNOWN-conductor-node-one.UNKNOWN   ssc                       127.0.1.2:12222   Public key authentication failed :
                                                                                               192.168.1.11

Completed in 1.13 seconds

This error indicates that the Conductor is missing the managed node's public key for internal process communication. This key is copied to the Conductor automatically when the managed node is added as an asset. You can try restarting the salt minion on the managed node to force a resync by running systemctl restart salt-minion from Linux on the managed node. Give the asset a few minutes, then check whether it reaches the running state and whether the error has gone away. If the problem persists, you can copy the public key manually. You can find the public key of the managed node at /etc/128technology/ssh/pdc_ssh_key.pub:
[root@t121-dut4 ~]# cat /etc/128technology/ssh/pdc_ssh_key.pub
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCxyS2oMZn6CI8kTUKgDwYvPLuub15ZAlPIX93EkSD90+QvUvgNDPVO77fyJBCQhVBlbmT8RJ4eC5CjllfWdGkL6lYUsR8QFYq8MFWEsaoRm+Htlm5VyaPEN96pwv2/XfxuHdND5DJzlEKK4YC3AGfTxF6f0QRfjx/FFnMBS1Ok7ZXZsnBBBhcXESI4ZHWGKowZqSTxgahS49VtJN7cEYPgw+zI7CLlW3hbwCR+JlgDL9cTrcv09nbRavSLULPBzCdlzByeYGD7MiKxeyRgCEW3B+GRLKwW8qYsWPioORggTNrxp41fF7KwrVReETuBxBfi7RrnAUVHpd3H2Mj9WPjn T121_DUT4

Take the contents of the public key file and add them to the authorized_keys file on the Conductor, located at /etc/128technology/ssh/authorized_keys. Include the key type, the key itself, and the comment at the end of the line, which will match the node name:
[root@t121-dut1 ~]# cat /etc/128technology/ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDEJ31GY7HmVJB36T5LHph7zgsgoQppQV14S/yx4lT3UJLsFnAIDjW/IZcVfjVuES3HhzXFPfFyZAwHB/GDsZKyuNerY5rAviro5AQmtKXV4K11VMcw6v9GWTymSyuTWh1FgV/fzeE+bAqJ3IVIqIsp91SovhHHnJLKsM08RtjvyFdGlVke6FPTA+pcPtvbMQyHe8XQnymuZD76DSnAJjHQ/bEpF1tkE/kjlapNAjcO8oiHpKfjlDiAba8VUErsk3/wWNy4BCiArrJPeajw5g9VEtLukJ6CGjqEBA7pXERScn5ZuCF2TMfW60yKrgoStA+T85gooilBBqectihrXh// T121_DUT1

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCo4u57y11SYzfAHTu+XFdyGMx5rkZ/W//Tkt9+exQjDjqNVvajtEtDhSvbcbZxEOnv6R9dho88qsU8bmcHAhxwRylPk7viIJbAKEMzdmGZNfwPUnr9Hvgk4v57yElqcAtROQPMFB4lLwmRQ6R5cWaC0SwqInwHBOdD7SW4cbMG4/XfO+DXoFkalRhuh6GD4WeBi07YPRR3hq/K4fvs6/9TXKLQFpVrxYFSrDiAMdLkSzLXKLO9AD1ALXcvGRm6KhVuNeZbY8nbIUQpD49l9JfCdNNjsPevci1qdAkmoL+kl1eNdGe4Kzed4J4N1bN8G2luNKB5Q6Gi8VCa6lvj7up1 T121_DUT3

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCxyS2oMZn6CI8kTUKgDwYvPLuub15ZAlPIX93EkSD90+QvUvgNDPVO77fyJBCQhVBlbmT8RJ4eC5CjllfWdGkL6lYUsR8QFYq8MFWEsaoRm+Htlm5VyaPEN96pwv2/XfxuHdND5DJzlEKK4YC3AGfTxF6f0QRfjx/FFnMBS1Ok7ZXZsnBBBhcXESI4ZHWGKowZqSTxgahS49VtJN7cEYPgw+zI7CLlW3hbwCR+JlgDL9cTrcv09nbRavSLULPBzCdlzByeYGD7MiKxeyRgCEW3B+GRLKwW8qYsWPioORggTNrxp41fF7KwrVReETuBxBfi7RrnAUVHpd3H2Mj9WPjn T121_DUT4

There is no need to restart the managed node after copying the public key to the Conductor. If everything is done properly, the node will connect automatically.
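
To avoid adding the same key twice, the manual copy can be made idempotent. The sketch below works on the file contents as strings and assumes each authorized_keys line has the plain "type key comment" form shown above, with no leading options; the function name ensure_key_present and the short key strings are placeholders, not real keys.

```python
from typing import Tuple


def ensure_key_present(authorized_keys_text: str, public_key_line: str) -> Tuple[str, bool]:
    """Return (new_text, changed), appending the key only if it is missing."""
    key_fields = public_key_line.split()
    if len(key_fields) < 2:
        raise ValueError("expected '<type> <key> [comment]'")
    key_type, key_body = key_fields[0], key_fields[1]
    for line in authorized_keys_text.splitlines():
        fields = line.split()
        # Compare type and key material only; the comment may differ.
        if len(fields) >= 2 and fields[0] == key_type and fields[1] == key_body:
            return authorized_keys_text, False
    text = authorized_keys_text
    if text and not text.endswith("\n"):
        text += "\n"
    return text + public_key_line.strip() + "\n", True


# Placeholder keys for illustration only.
existing = "ssh-rsa AAAAEXAMPLEKEY1 T121_DUT1\nssh-rsa AAAAEXAMPLEKEY3 T121_DUT3\n"
new_key = "ssh-rsa AAAAEXAMPLEKEY4 T121_DUT4\n"

updated, changed = ensure_key_present(existing, new_key)
print(changed)                                      # True
print(ensure_key_present(updated, new_key)[1])      # False
```

In practice you would read /etc/128technology/ssh/authorized_keys on the Conductor, pass in the line from the managed node's pdc_ssh_key.pub, and write the result back only when changed is True.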

There are a variety of other errors that could appear in the show system connectivity internal command. Here is the complete list:

TCP/Socket Errors:
  • Failed to resolve server host
  • Connect to server timed out
  • No route to server
  • Connection refused by server
  • Failed to connect to server
OpenSSH Errors:
  • Public key authentication failed
  • Known host verification failed
  • Server keepalive timeout
  • Client keepalive timeout
  • Server not responding
  • Client not responding
Generic Errors:
  • Disconnected
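
When sifting through output from many nodes, it can be useful to sort messages into the three groups listed above. This is a rough sketch that assumes the Message column begins with one of the listed strings, possibly followed by extra detail such as an address; the names ERROR_CATEGORIES and classify are hypothetical.

```python
from typing import Optional

# Message prefixes copied from the list above, grouped by category.
ERROR_CATEGORIES = {
    "TCP/Socket": [
        "Failed to resolve server host",
        "Connect to server timed out",
        "No route to server",
        "Connection refused by server",
        "Failed to connect to server",
    ],
    "OpenSSH": [
        "Public key authentication failed",
        "Known host verification failed",
        "Server keepalive timeout",
        "Client keepalive timeout",
        "Server not responding",
        "Client not responding",
    ],
    "Generic": ["Disconnected"],
}


def classify(message: str) -> Optional[str]:
    """Return the category for a message, or None if it is not a listed error."""
    # Collapse whitespace so messages wrapped across table lines still match.
    text = " ".join(message.split())
    for category, prefixes in ERROR_CATEGORIES.items():
        for prefix in prefixes:
            if text.startswith(prefix):
                return category
    return None


print(classify("Public key authentication failed :\n 192.168.1.11"))  # OpenSSH
print(classify("Connected"))                                          # None
```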

The error message should give you a clue as to why a managed node cannot communicate with the Conductor, such as having no route to reach the Conductor node, or having the connection refused because port 930 is blocked by the Conductor node. If the problem persists, please execute save tech-support-info on each node and file a defect for further triage.

Comments

04-22-2019 05:25

very good article for the keys, thanks!