Debugging a disconnected peer node

By Tyler Carroll posted 02-06-2019 13:33

  
In this blog post I will discuss some troubleshooting techniques for determining why a Router's peer node is in a disconnected state. In this example I have deployed a 128T Router named Router, which has two nodes: T121_DUT1 and T121_DUT2. Here you can see that, from the point of view of node T121_DUT1, node T121_DUT2 is in a disconnected state:
admin@T121_DUT1.Router# show system connectivity
Tue 2019-02-05 19:05:27 UTC

===================== ===================== ==============
Local Node             Remote Node           State
===================== ===================== ==============
T121_DUT1.Router       T121_DUT2.Router      disconnected

First, make sure that 128T is up and running on the node. To check, log into the node, drop to the Linux shell, execute systemctl status 128T, and look at the line that starts with Active:. If 128T is not running, you can start it by executing systemctl start 128T. I have cut out some of the output in the example:
[root@t121-dut4 ~]# systemctl status 128T
● 128T.service - 128T service
   Loaded: loaded (/usr/lib/systemd/system/128T.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2019-02-05 21:32:52 UTC; 35s ago
 Main PID: 23000 (processManager)
    Tasks: 387
   Memory: 565.3M
   CGroup: /system.slice/128T.service
           ├─23000 /usr/bin/processManager
           ├─23011 /usr/bin/secureCommunicationManager --startup-log-level Debug
           ├─23017 /usr/bin/persistentDataManager --startup-log-level Debug
           ├─23051 /usr/sbin/openssh-fips/sshd -p 930 -o ListenAddress=0.0.0.0 -D -o ClientAliveInterval=1 -o ClientAliveCountMax=9 ...
           ├─23060 /usr/sbin/openssh-fips/sshd -p 931 -o ListenAddress=127.0.0.1 -D -o ClientAliveInterval=1 -o ClientAliveCountMax=...
           ├─23084 /usr/bin/openssh-fips/ssh -L 127.0.0.1:4370:127.0.0.1:2181 172.16.1.3 -p 930 -N -o StrictHostKeyChecking=no -o Us...
           ├─23086 /usr/bin/openssh-fips/ssh -L 127.0.0.1:4371:127.0.0.1:2181 127.0.0.1 -p 930 -N -o StrictHostKeyChecking=no -o Use..
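
If you want to script this check rather than read the output by eye, the idea can be sketched in Python as below. This is only an illustration: the function names service_state and is_running are my own, and the sample text is a shortened copy of the output shown above.

```python
import re


def service_state(status_output: str) -> str:
    """Return the state on the 'Active:' line, e.g. 'active (running)'.

    Returns 'unknown' if the text has no 'Active:' line.
    """
    for line in status_output.splitlines():
        # Capture the state word plus the optional sub-state in parentheses.
        match = re.match(r"\s*Active:\s*(\S+(?:\s+\([^)]*\))?)", line)
        if match:
            return match.group(1)
    return "unknown"


def is_running(status_output: str) -> bool:
    """True only when systemd reports the unit as active and running."""
    return service_state(status_output) == "active (running)"


# Shortened copy of the `systemctl status 128T` output shown above.
sample = "\n".join([
    "   Loaded: loaded (/usr/lib/systemd/system/128T.service; enabled; vendor preset: disabled)",
    "   Active: active (running) since Tue 2019-02-05 21:32:52 UTC; 35s ago",
    " Main PID: 23000 (processManager)",
])

print(service_state(sample))
```

On a live node you would feed it the real command output, for example the text captured from running systemctl status 128T through Python's subprocess module.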


Next, double-check the environment configuration of both nodes. Examine the file /etc/128technology/global.init on each node, make sure it contains two control nodes, and verify that the addresses are correct. If anything in global.init is incorrect, correct it and restart 128T:

[root@t121-dut1 ~]# cat /etc/128technology/global.init
{
  "init" : {
    "control" : {
      "T121_DUT1" : {
        "host" : "172.16.1.1"
      },
      "T121_DUT2" : {
        "host" : "172.16.1.2"
      }
    },
    "conductor" : {
    },
    "routerName" : "Router"
  }
}
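
The same checks can be sketched as a small Python function. This is a minimal sketch, not a 128T tool: the function name check_global_init is hypothetical, and it assumes every "host" value is a literal IP address, as in the example above. It cannot tell you whether an address is the right one for your network, only that the file has the expected shape.

```python
import ipaddress
import json


def check_global_init(text: str, expected_nodes: int = 2) -> list:
    """Return a list of problems found in the contents of global.init."""
    problems = []
    init = json.loads(text).get("init", {})
    control = init.get("control", {})

    # A two-node router should list both nodes under "control".
    if len(control) != expected_nodes:
        problems.append(
            f"expected {expected_nodes} control nodes, found {len(control)}"
        )

    # Assumption: each host is an IP literal, as in the example above.
    for name, entry in control.items():
        host = entry.get("host", "")
        try:
            ipaddress.ip_address(host)
        except ValueError:
            problems.append(f"{name}: host {host!r} is not a valid IP address")

    if not init.get("routerName"):
        problems.append("routerName is missing or empty")
    return problems


# The example file from above.
sample = """
{
  "init" : {
    "control" : {
      "T121_DUT1" : { "host" : "172.16.1.1" },
      "T121_DUT2" : { "host" : "172.16.1.2" }
    },
    "conductor" : { },
    "routerName" : "Router"
  }
}
"""

print(check_global_init(sample))
```

An empty list means none of these checks failed. Run it against the file from each node and compare the two results.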

Next, check the output of show system connectivity internal on each node to see whether there were any errors connecting to its peer node. I have cut out some of the output in the example:
admin@T121_DUT1.Router# show system connectivity internal
Tue 2019-02-05 21:33:04 UTC

============ ==================================== ========================= ================= ====================================
 Local Node   Remote Node                          Service                   Address           Message
============ ==================================== ========================= ================= ====================================
 T121_DUT1    T121_DUT1.Router                     Zookeeper                 127.0.0.1:4370    Connected
 T121_DUT1    T121_DUT1.Router                     epm                       127.0.0.2:14444   Connected
 T121_DUT1    T121_DUT1.Router                     ntp                       127.0.0.1:4390    Connected
...
 T121_DUT1    T121_DUT1.Router                     ssc                       127.0.0.2:12222   Connected
 T121_DUT1    T121_DUT2.Router                     ssc                       127.0.0.3:12222   Public key authentication failed :
                                                                                               192.168.1.11

Completed in 1.13 seconds
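
On a node with many services the failing row is easy to miss, so it can help to filter the table. The Python sketch below is one way to do that. It assumes the layout shown above (columns separated by two or more spaces, long messages wrapped onto an indented continuation line) and the function name failed_connections is my own.

```python
import re


def failed_connections(output: str) -> list:
    """Return (remote node, service, address, message) for every row
    whose message is anything other than 'Connected'."""
    rows = []
    for line in output.splitlines():
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) == 5:
            rows.append(fields)
        elif (len(fields) == 1 and fields[0] and rows
              and line.startswith(" " * 20)):
            # A deeply indented single field is the wrapped tail of the
            # previous row's message.
            rows[-1][4] += " " + fields[0]
    return [
        tuple(row[1:]) for row in rows
        if row[0] != "Local Node" and row[4] != "Connected"
    ]


# Two rows copied from the output above, plus the wrapped continuation.
sample = "\n".join([
    " Local Node   Remote Node                          Service                   Address           Message",
    " T121_DUT1    T121_DUT1.Router                     ssc                       127.0.0.2:12222   Connected",
    " T121_DUT1    T121_DUT2.Router                     ssc                       127.0.0.3:12222   Public key authentication failed :",
    "                                                                                               192.168.1.11",
    "",
    "Completed in 1.13 seconds",
])

for remote, service, address, message in failed_connections(sample):
    print(remote, service, address, message)
```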

This particular error indicates that the peer node is missing the current node's public key, which is used for internal process communication. This key is copied between the nodes of a router during installation or during ZTP, depending on the process used to provision the node. To correct the error you can copy the public key manually. You can find the node's public key at /etc/128technology/ssh/pdc_ssh_key.pub:
[root@t121-dut1 ~]# cat /etc/128technology/ssh/pdc_ssh_key.pub
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCxyS2oMZn6CI8kTUKgDwYvPLuub15ZAlPIX93EkSD90+QvUvgNDPVO77fyJBCQhVBlbmT8RJ4eC5CjllfWdGkL6lYUsR8QFYq8MFWEsaoRm+Htlm5VyaPEN96pwv2/XfxuHdND5DJzlEKK4YC3AGfTxF6f0QRfjx/FFnMBS1Ok7ZXZsnBBBhcXESI4ZHWGKowZqSTxgahS49VtJN7cEYPgw+zI7CLlW3hbwCR+JlgDL9cTrcv09nbRavSLULPBzCdlzByeYGD7MiKxeyRgCEW3B+GRLKwW8qYsWPioORggTNrxp41fF7KwrVReETuBxBfi7RrnAUVHpd3H2Mj9WPjn T121_DUT1


Take the contents of the public key file and add them to the authorized_keys file on the peer node, located at /etc/128technology/ssh/authorized_keys. Include the whole line: the key type, the key itself, and the comment at the end of the line, which matches the node name:

[root@t121-dut2 ~]# cat /etc/128technology/ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCxyS2oMZn6CI8kTUKgDwYvPLuub15ZAlPIX93EkSD90+QvUvgNDPVO77fyJBCQhVBlbmT8RJ4eC5CjllfWdGkL6lYUsR8QFYq8MFWEsaoRm+Htlm5VyaPEN96pwv2/XfxuHdND5DJzlEKK4YC3AGfTxF6f0QRfjx/FFnMBS1Ok7ZXZsnBBBhcXESI4ZHWGKowZqSTxgahS49VtJN7cEYPgw+zI7CLlW3hbwCR+JlgDL9cTrcv09nbRavSLULPBzCdlzByeYGD7MiKxeyRgCEW3B+GRLKwW8qYsWPioORggTNrxp41fF7KwrVReETuBxBfi7RrnAUVHpd3H2Mj9WPjn T121_DUT1

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCo4u57y11SYzfAHTu+XFdyGMx5rkZ/W//Tkt9+exQjDjqNVvajtEtDhSvbcbZxEOnv6R9dho88qsU8bmcHAhxwRylPk7viIJbAKEMzdmGZNfwPUnr9Hvgk4v57yElqcAtROQPMFB4lLwmRQ6R5cWaC0SwqInwHBOdD7SW4cbMG4/XfO+DXoFkalRhuh6GD4WeBi07YPRR3hq/K4fvs6/9TXKLQFpVrxYFSrDiAMdLkSzLXKLO9AD1ALXcvGRm6KhVuNeZbY8nbIUQpD49l9JfCdNNjsPevci1qdAkmoL+kl1eNdGe4Kzed4J4N1bN8G2luNKB5Q6Gi8VCa6lvj7up1 T121_DUT2
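
To confirm the copy worked, you can compare the two files. The Python sketch below shows the idea. It assumes each authorized_keys entry has the plain "type key comment" layout shown above with no leading options field, the helper names are hypothetical, and the keys in the sample are shortened placeholders rather than real keys.

```python
def key_fields(line: str):
    """Split a key line into (type, key, comment), or None if it is not one."""
    parts = line.split()
    if len(parts) < 2 or parts[0].startswith("#"):
        return None
    return parts[0], parts[1], " ".join(parts[2:])


def is_authorized(public_key_line: str, authorized_keys_text: str) -> bool:
    """True if the key type and key body appear in the authorized_keys text."""
    wanted = key_fields(public_key_line)
    if wanted is None:
        return False
    for line in authorized_keys_text.splitlines():
        found = key_fields(line)
        # Compare the type and the key itself; the comment is informational.
        if found is not None and found[:2] == wanted[:2]:
            return True
    return False


# Placeholder keys, standing in for pdc_ssh_key.pub on the local node and
# authorized_keys on the peer node.
local_pub = "ssh-rsa AAAAexampleKeyOne T121_DUT1"
peer_authorized = "\n".join([
    "ssh-rsa AAAAexampleKeyOne T121_DUT1",
    "",
    "ssh-rsa AAAAexampleKeyTwo T121_DUT2",
])

print(is_authorized(local_pub, peer_authorized))
```

A partially pasted key is a common mistake, and this comparison catches it because the key body must match exactly.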

There is no need to restart either node after copying the public key to the peer node. If everything was done properly, the nodes will connect automatically.

A variety of other errors can appear in the output of the show system connectivity internal command. Here is the complete list:

TCP/Socket Errors:
  • Failed to resolve server host
  • Connect to server timed out
  • No route to server
  • Connection refused by server
  • Failed to connect to server
OpenSSH Errors:
  • Public key authentication failed
  • Known host verification failed
  • Server keepalive timeout
  • Client keepalive timeout
  • Server not responding
  • Client not responding
Generic Errors:
  • Disconnected
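
If you are collecting these messages from several nodes, it can be useful to group them by the categories above. Here is a minimal Python sketch that does so. The table is copied from the list above, while the function name categorize and the prefix matching are my own choices, made because a message can carry extra detail after the text (such as the address in the earlier example).

```python
# Error messages from the list above, grouped by category.
ERROR_CATEGORIES = {
    "TCP/Socket": [
        "Failed to resolve server host",
        "Connect to server timed out",
        "No route to server",
        "Connection refused by server",
        "Failed to connect to server",
    ],
    "OpenSSH": [
        "Public key authentication failed",
        "Known host verification failed",
        "Server keepalive timeout",
        "Client keepalive timeout",
        "Server not responding",
        "Client not responding",
    ],
    "Generic": [
        "Disconnected",
    ],
}


def categorize(message: str):
    """Return the category of an error message, or None if it is not listed."""
    text = message.strip()
    for category, known_messages in ERROR_CATEGORIES.items():
        for known in known_messages:
            # Match on the prefix so trailing detail does not matter.
            if text.startswith(known):
                return category
    return None


print(categorize("Public key authentication failed : 192.168.1.11"))
```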

The error message should give you a clue as to why one node cannot communicate with another, such as having no route to reach its peer node, or having the connection refused because port 930 is blocked by the peer node. If the problem persists, please execute save tech-support-info on each node and file a defect for further triage.
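
For the TCP/Socket errors, a quick way to narrow things down is to try a plain TCP connection to the peer on port 930 yourself. The Python sketch below does that and reports the result using the wording from the list above. To be clear, the mapping from Python exceptions to those messages is my own approximation and not how 128T produces them, and a successful connect only shows that the port is reachable; it says nothing about SSH authentication.

```python
import errno
import socket


def probe(host: str, port: int = 930, timeout: float = 3.0) -> str:
    """Try a TCP connection and describe the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "Connected"
    except socket.gaierror:
        return "Failed to resolve server host"
    except socket.timeout:
        return "Connect to server timed out"
    except ConnectionRefusedError:
        return "Connection refused by server"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "No route to server"
        return "Failed to connect to server"


if __name__ == "__main__":
    # 172.16.1.2 is the peer address from the global.init example above;
    # substitute the address of your own peer node.
    print(probe("172.16.1.2", 930))
```

Run it from the node that reports the error, so that the probe takes the same network path as the failing connection.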

If you are debugging a disconnected node on a managed router, please refer to this blog post instead:
https://community.128technology.com/blogs/tyler-carroll/2019/02/05/debugging-a-disconnected-node-from-conductor