Tuesday, September 8, 2015

Bidirectional replication with Unison



About UNISON


Unison is a file-synchronization tool for Unix and Windows. It allows two replicas of a collection of files and directories to be stored on different hosts (or different disks on the same host), modified separately, and then brought up to date by propagating the changes in each replica to the other.

For this implementation I use unison227-2.27.57-13.el6.x86_64.rpm on RHEL6.

The setup works under the following scenario:

1) Replication is between two hosts (node1 & node2)
2) The synced folder is the user's home directory, /home/rsftp
3) File transfer is done over the passwordless ssh protocol
4) Synchronization runs at a 1-minute interval via the cron scheduler

Step 1: Install unison rpm and add user

On both nodes, install the package

rpm  -ivh  unison227-2.27.57-13.el6.x86_64.rpm
useradd rsftp 
echo  rsftp!@#  |  passwd  --stdin  rsftp

The command above sets "rsftp!@#" as the password for the rsftp user.

Step 2: Create ssh public key and exchange to both nodes

As rsftp user, 

On node1:

[rsftp@node1 ~] $ ssh-keygen -t  dsa
[rsftp@node1 ~] $ ssh-copy-id  -i  /home/rsftp/.ssh/id_dsa.pub node2

On node2:

[rsftp@node2 ~] $ ssh-keygen -t  dsa
[rsftp@node2 ~] $ ssh-copy-id  -i  /home/rsftp/.ssh/id_dsa.pub node1


Once done, try ssh-ing to the partner node; the expected result is a passwordless login.
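
A quick check, using the node names from this setup; each command should print the partner's hostname without prompting for a password:

[rsftp@node1 ~] $ ssh node2 hostname
[rsftp@node2 ~] $ ssh node1 hostname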

Step 3: Modify unison configuration file

The configuration file is /home/rsftp/.unison/default.prf

On node1, put these lines

root = /home/rsftp
root = ssh://node2//home/rsftp
auto = true
batch = true

do the same on node2,

root = /home/rsftp
root = ssh://node1//home/rsftp
auto = true
batch = true
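
Before handing this over to cron, it is worth running Unison once by hand as the rsftp user; with no arguments it picks up the default profile above, and the batch/auto options make the run non-interactive:

[rsftp@node1 ~] $ /usr/bin/unison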

Step 4: Create a cronjob on both nodes

As rsftp user

crontab -e

put this line

*/1 * * * * /usr/bin/unison > /dev/null 2>&1
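
If a sync ever takes longer than a minute (large files, slow link), overlapping runs can step on each other. One optional variant, assuming the flock utility from util-linux is available (the lock file path here is my own choice):

*/1 * * * * /usr/bin/flock -n /tmp/unison-rsftp.lock /usr/bin/unison > /dev/null 2>&1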


Step 5: Create a 10MB sample file with the dd command

On node1,

[rsftp@node1 ~] $ cd  ~
[rsftp@node1 ~] $ dd if=/dev/zero of=file1.dat  count=1 bs=10M


After 1 minute has passed, list the files in the rsftp home directory on node2.

On node2,

[rsftp@node2 ~] $ cd  ~
[rsftp@node2 ~] $ ls -lah


You should see file1.dat with a size of 10MB. Repeat the steps above: create a file2.dat on node2 and list it on node1. Also try other operations, such as deleting and modifying a file.

Thursday, September 3, 2015

iscsi target, initiator and multipath configuration


What is iSCSI?

iSCSI (Internet Small Computer System Interface) is a TCP/IP-based protocol for sending SCSI commands over IP networks. This allows iSCSI infrastructures to extend beyond the local LAN and be used over a WAN.

It is typically viewed as a low-cost alternative to a Fibre Channel SAN; however, its speed is limited by the network infrastructure. It is recommended to use a separate, dedicated link for iSCSI.

In this tutorial, I will demonstrate how to install and configure an iSCSI target (server), an initiator (client) and multipathing for redundancy.

For a brief explanation, I would suggest reading Redhat Multipath.

The setup consists of two virtual machines running RHEL6 (VMware Player) with a local repo; refer here for setting up local repositories.


A) host - iscsi target (server)

IP: 192.168.136.128 (represent 1st storage controller)
IP: 192.168.136.138 (represent 2nd storage controller)

B) node1 - iscsi initiator (client)

IP: 192.168.136.129


Step 1: Install packages and configure a backing store on the target

To create a target, we first need to install the SCSI target daemon and utility programs:

yum -y install scsi-target-utils

and create a backing store (which will be presented as a LUN to the client). This can be a regular file, a partition, a logical volume or even an entire drive; for flexibility we will use LVM.

In the VM settings, I added a new 10GB vdisk to the target server.


If the VM was running when you added the new vdisk, use the scsi-rescan command (part of the sg3_utils package) to rescan for the new disk, so no reboot is required.

scsi-rescan
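
If sg3_utils is not installed, an equivalent rescan can be triggered through sysfs; the host numbers vary per system, so this simply loops over all of them:

for h in /sys/class/scsi_host/host*/scan; do echo "- - -" > $h; done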

From fdisk I can see the new disk, /dev/sdc:

fdisk -l |grep -i sd

Disk /dev/sdc: 10.7 GB, 10737418240 bytes
.
.
output truncated

With sdc available, we can proceed with logical volume creation, using 100% of the disk:

pvcreate /dev/sdc
vgcreate vgiscsi /dev/sdc
lvcreate -n lvol01  -l 100%FREE vgiscsi 

We will specify the backing store in /etc/tgt/targets.conf. Edit this file (in vi, Shift+G jumps to the last line) and append:


<target iqn.2015-09.serveritas.com.ansible:lun1>
        backing-store /dev/vgiscsi/lvol01
</target>

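Optionally, access can be restricted to a specific initiator by adding an initiator-address line inside the target block; a minimal sketch using node1's address from this setup:

<target iqn.2015-09.serveritas.com.ansible:lun1>
        backing-store /dev/vgiscsi/lvol01
        initiator-address 192.168.136.129
</target>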

Enable and start tgtd daemon

chkconfig tgtd on ; service tgtd start 

Check the backing store created above

tgt-admin -s

Target 1: iqn.2015-09.serveritas.com.ansible:lun1
    System information:
        Driver: iscsi
        State: ready
    I_T nexus information:
    LUN information:
            LUN: 1
            Type: disk
            SCSI ID: IET     00010001
            SCSI SN: beaf11
            Size: 10733 MB, Block size: 512
            Online: Yes
            Removable media: No
            Prevent removal: No
            Readonly: No
            Backing store type: rdwr
            Backing store path: /dev/vgiscsi/lvol01
            Backing store flags:
    Account information:
    ACL information:
    ALL

At this point, bring down the second NIC (eth1) on the target; we will bring it back up later for multipathing.

Step 2: Install packages and configure a LUN on the initiator

yum -y install iscsi-initiator-utils

Enable and start iscsi daemon

chkconfig iscsi on ; service iscsi restart

Before we can start using a target, we must first discover it. Discovering a target stores configuration and discovery information for that target under /var/lib/iscsi/nodes.

iscsiadm -m discovery -t sendtargets -p 192.168.136.128
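
The discovered records can be reviewed at any time with:

iscsiadm -m node

The output lists the portal and target IQN, e.g. 192.168.136.128:3260,1 iqn.2015-09.serveritas.com.ansible:lun1.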






Let's have a look at the existing disks on node1 before adding the LUN.




Now let's use the LUN by logging in to the iSCSI target:

iscsiadm -m node -T iqn.2015-09.serveritas.com.ansible:lun1 [ -p 192.168.136.128 ] -l





From fdisk, the new disk appears as sdb. You could make a filesystem on it now, but let's continue with multipath.







Step 3: Install packages and configure multipathing on the initiator

Multipathing allows you to combine multiple physical connections between a server and a storage array into one virtual device. This can be done to provide a more resilient connection to the storage array.

To simulate the scenario above, we now bring up the second interface on the target.

On the initiator, re-run discovery, but this time with the second IP of the target:

iscsiadm -m discovery -t sendtargets -p 192.168.136.138

Log in to the target using both IP addresses (representing the two storage controllers):

iscsiadm -m node -T iqn.2015-09.serveritas.com.ansible:lun1 [ -p 192.168.136.138 ] -l

Find the iSCSI disk name

grep "Attached SCSI" /var/log/messages

Sep  3 02:29:27 node1 kernel: sd 35:0:0:1: [sdb] Attached SCSI disk
Sep  3 02:29:44 node1 kernel: sd 36:0:0:1: [sdc] Attached SCSI disk


The output above shows sdb and sdc; they are actually the same disk coming through two paths.

Now install device-mapper-multipath on node1 and enable it.

yum -y install device-mapper-multipath
chkconfig multipathd on
service multipathd start

Monitor the status with multipath command.

multipath -l
multipath -ll

If everything is good, the paths will show as active ready.


RHEL6 supports multipathing using the dm-multipath subsystem, in which the kernel device mapper is used to create the virtual device.

Once device-mapper-multipath is installed, configured and started, the device node will be listed in /dev/mapper. In this example the name is 1IET\x20\x20\x20\x20\x2000010001, and that is not user friendly :(

To make it more human readable, run:

mpathconf --user_friendly_names y

This puts the following lines in /etc/multipath.conf:

## Use user friendly names, instead of using WWIDs as names.
defaults {
        user_friendly_names yes
}


The device name now becomes mpatha, but the WWID still contains spaces between 1IET and 00010001.
Remove the spaces for simplicity by adding this line to the defaults section of /etc/multipath.conf:

getuid_callout "/lib/udev/scsi_id --replace-whitespace --whitelisted --device=/dev/%n"


There are situations where we want to identify which device is for which use; for example, you may have a few multipath disks such as mpatha, mpathb and mpathc. Let's give this one an alias, webdata.

Find the WWID of the disk:

scsi_id --whitelisted --device=/dev/sdb





Modify /etc/multipath.conf as below:

## Use user friendly names, instead of using WWIDs as names.
defaults {
        user_friendly_names no
}

## Give an alias/name to a path so it is easy to identify
multipaths {
        multipath {
                wwid "1IET     00010001"
                alias "webdata"
        }
}

Restart multipathd and check:

service multipathd restart
multipath -ll


Step 4: Path down simulation

On the target server, let's bring down eth1 (the second storage controller link):

ifdown eth1

On the initiator, check the multipath status:

multipath -ll


The second disk (sdb) now shows a failed faulty status.

Bring eth1 back up on the target; the status should return to active ready.

/var/log/messages shows that the connection was lost and, once the link is ready again, that the paths are reinstated and active:


Sep  6 23:26:07 node1 iscsid: connect to 192.168.136.138:3260 failed (No route to host)
Sep  6 23:26:13 node1 iscsid: connect to 192.168.136.138:3260 failed (No route to host)
Sep  6 23:26:19 node1 iscsid: connect to 192.168.136.138:3260 failed (No route to host)
Sep  6 23:26:25 node1 iscsid: connect to 192.168.136.138:3260 failed (No route to host)
Sep  6 23:26:32 node1 iscsid: connect to 192.168.136.138:3260 failed (No route to host)
Sep  6 23:26:50 node1 multipathd: webdata: sdb - directio checker reports path is down
Sep  6 23:26:50 node1 iscsid: connection2:0 is operational after recovery (57 attempts)
Sep  6 23:26:56 node1 multipathd: webdata: sdb - directio checker reports path is up
Sep  6 23:26:56 node1 multipathd: 8:16: reinstated
Sep  6 23:26:56 node1 multipathd: webdata: remaining active paths: 2

For normal operation, we would make a filesystem on the LUN and mount it somewhere on the server:

pvcreate /dev/mapper/webdata
vgcreate vgwebdata /dev/mapper/webdata
lvcreate -n lvol01  -l 100%FREE vgwebdata
mkfs.ext4 /dev/mapper/vgwebdata-lvol01
mount /dev/mapper/vgwebdata-lvol01 /var/www/html
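
To make the mount persistent across reboots, an /etc/fstab entry with the _netdev option (so the mount waits for networking and the iSCSI service at boot) would look roughly like this, using the names from above:

/dev/mapper/vgwebdata-lvol01  /var/www/html  ext4  _netdev  0 0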

Monday, August 31, 2015

Monitor Remote Linux System with Nagios using SSH


NRPE is the most popular method of monitoring remote Linux systems. But in some cases we don't want to install NRPE on the remote system, we can't install it, or we may be restricted by a firewall, since NRPE requires TCP port 5666 to be open. In that case we can use SSH with the check_by_ssh method.


The Nagios server checks the remote Linux servers using the SSH protocol; in this case the monitoring host is the SSH client, whereas the remote servers are the SSH servers.

In this tutorial, I will set up a Nagios server to monitor one remote host (remote1), all running RHEL6.x.

Download the Nagios installer from nagios.org (I use nagios-4.0.8 (Nagios Core)) and nagios-plugins-2.0.3, and extract the tarballs on the server.

Step 1:  Create Nagios User and Group

useradd nagios
passwd nagios
groupadd nagcmd
usermod -G nagcmd nagios

Step 2: Install Nagios and its dependencies on the Nagios host

yum -y install openssl-devel.x86_64
yum -y install perl-Date-Manip.noarch
yum -y install perl-TimeDate.noarch perl-Date-Calc.noarch perl-DateTime.x86_64
yum -y install mysql.x86_64
yum install -y httpd php gcc glibc glibc-common gd gd-devel make net-snmp

Depending on your RHEL installation, Nagios may need additional libraries or packages to be installed.

./configure --with-nagios-group=nagios --with-command-group=nagcmd

Look carefully at the messages and check the dependencies it reports before compiling; once everything is good, proceed with:

make all ; make install ;  make install-init ;  make install-config ; make install-commandmode

The one-liner above compiles Nagios and installs the program, libraries and configuration files.

Step 3: Install plugins

./configure --with-nagios-group=nagios --with-command-group=nagcmd --with-gd-lib=/usr/lib --with-gd-inc=/usr/include --with-openssl=/usr/bin/openssl  --enable-perl-modules

make && make install

The plugins are installed in /usr/local/nagios/libexec/.

Set the httpd and nagios services to start at runlevels 3 and 5; we will start them in the next step.

chkconfig --level 35 nagios on
chkconfig --level 35 httpd on


Step 4: Create a Default User for Web Access

htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin

Type in the nagiosadmin password; this is the password you will use to log in on the main page.

service nagios start
service httpd start

Access the main page at  http://Your-server-IP-address/nagios

At this point, you will see only one host, localhost (the Nagios server itself).

Step 5: Configure password-less ssh login to the remote servers.

We need to ensure that the central Nagios server can connect to the remote host via SSH without being prompted for a password. This requires creating a password-less public/private key pair as the user running the Nagios service (typically "nagios"), sending the public key to the remote server, and then (as the "nagios" user) logging into the remote system.

On the remote servers, add nagios user.

useradd nagios
passwd nagios
groupadd nagcmd
usermod -G nagcmd nagios

On the Nagios server, if you are working as root, change to the nagios user:

su - nagios

Create a public key and distribute it to the remote servers.

ssh-keygen -t dsa

Accept the default values and do not specify a passphrase;
the public key will be created in /home/nagios/.ssh/id_dsa.pub.

ssh-copy-id -i /home/nagios/.ssh/id_dsa.pub nagios@remote1

Once you are done with the task above, ssh to the remote server; this time you won't be asked for a password.
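
A quick check from the Nagios server (nms is the host name I use for it in the definitions further below):

[nagios@nms ~]$ ssh nagios@remote1 uptime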

Step 6: Copy nagios plugins to the remote server.

On the Nagios server, as the nagios user, copy the plugins to the remote server; I will copy them all to /home/nagios/libexec:

scp /usr/local/nagios/libexec/*  nagios@remote1:/home/nagios/libexec

On the Nagios server the plugins are at /usr/local/nagios/libexec/, while on remote1 they are at /home/nagios/libexec.
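
Before wiring anything into Nagios, it is worth confirming that a plugin runs over SSH. A manual sketch, mirroring the disk check used as command5 below:

[nagios@nms ~]$ ssh nagios@remote1 "/home/nagios/libexec/check_disk -u GB -w 20% -c 10% -p /"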

Step 7: Detail configuration

On the Nagios server, in the localhost.cfg configuration file, define the remote hosts to check. You can also define a host group for a better view on the Nagios web page.

I will define entries for the Nagios server itself, called nms, and the remote server, named remote1.
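
A host group definition would look like the sketch below (the group name is my own choice; the members match the hosts defined next):

define hostgroup{
        hostgroup_name  remote-linux
        alias           Remote Linux servers
        members         nms,remote1
        }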

# Define a host for the nagios server in credit dept

define host{
        use                     linux-server
        host_name               nms
        alias                   nms
        address                 192.168.122.16
        icon_image              redhat.gif
        statusmap_image         redhat.gd2
        check_command           check_tcp_port          ; command1
        passive_checks_enabled  1
}


## Define a host for linux servers in credit dept in a remote site

## 1) remote1 ##

define host{
        use                     linux-server
        host_name               remote1
        alias                   remote1
        address                 10.22.122.122
        icon_image              redhat.gif
        statusmap_image         redhat.gd2
        check_command           check_tcp_port          ; command1
        passive_checks_enabled  1
}


## Now lets define the service to check in remote1 server

# remote1

define service{
        use                     local-service
        host_name               remote1
        service_description     MySQL
        check_command           check_ssh_mysql_port   ; command2
        }


define service{
        use                     local-service
        host_name               remote1
        service_description     VSFTPD-HA
        check_command           check_ssh_ftp          ; command3
        }


define service{
        use                     local-service
        host_name               remote1
        service_description     VSFTPD-CERT-EXPIRY
        check_command           check_ssh_ssl_cert     ; command4
        }


define service{
        use                     local-service
        host_name               remote1
        service_description     Disk Utilization
        check_command           check_ssh_free_space   ; command5
        }


define service{
        use                     local-service
        host_name               remote1
        service_description     MEMORY Utilization
        check_command           check_ssh_free_mem     ; command6
        }


define service{
        use                     local-service
        host_name               remote1
        service_description     Uptime
        check_command           check_ssh_uptime       ; command7
        }


define service{
        use                     local-service
        host_name               remote1
        service_description     MySQL Replication
        check_command           check_ssh_mysqlrepl    ; command8
        }



## End host and service definitions


The icon_image and statusmap_image directives set redhat.gif and redhat.gd2 as the icons for that particular host in the Nagios web interface. The check_command uses a predefined command called check_tcp_port, which checks port 22 (ssh); if the port connects, the host status is up.
We will come to the check commands (annotated command1 to command8) after this.

Why not just use a ping check? As I said earlier, you might have a remote server that is behind a firewall and not pingable for whatever reason.

Now let's look at the command.cfg entries and their relation to localhost.cfg.

## Check using SSH command ##


define command {
command_name check_ssh_port
command_line $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "/home/nagios/libexec/check_ssh -t 60 -H $HOSTADDRESS$" -E
}

#command1
define command {
command_name check_tcp_port
command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p 22
}


#command2
define command {
command_name check_ssh_mysql_port
command_line $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "/home/nagios/libexec/check_tcp -p 3306 -t 120 $HOSTADDRESS$" -E -t 120
}

#command3
define command {
command_name check_ssh_ftp
command_line $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "/home/nagios/libexec/check_ftp -p 21 -t 60 $HOSTADDRESS$" -E
}

#command4
define command {
command_name check_ssh_ssl_cert
command_line $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "/home/nagios/libexec/check_ssl_cert -H $HOSTADDRESS$ -p 21 -P ftp -s -w 120 -c 60" -E -t 120
}

#command5
define command {
command_name check_ssh_free_space
command_line $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "/home/nagios/libexec/check_disk -u GB -w 20% -c 10% -p / -p /opt -p /var" -E -t 120
}

#command6
define command {
command_name check_ssh_free_mem
command_line $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "/home/nagios/libexec/check_mem.pl -f -w 10 -c 5" -E -t 120
}

#command7
define command {
command_name check_ssh_uptime
command_line $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "/home/nagios/libexec/check_uptime" -E -t 120
}

#command8
define command {
command_name check_ssh_mysqlrepl
command_line $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "/home/nagios/libexec/check_mysql_slavestatus.sh -H localhost -P 3306 -u xxxxx -p xxxxxx -w 60 -c 120" -E -t 120
}

## End command definitions

You may also run a command manually before adding it to the Nagios configuration. For example, on the Nagios server as the nagios user, to check the uptime of the remote server, run:

/usr/local/nagios/libexec/check_by_ssh -H remote1 -C "/home/nagios/libexec/check_uptime -u days -w 5 -c 1" -E

Once the hosts, services and commands are defined, first check whether there are any errors:

/usr/local/nagios/bin/nagios -v  /usr/local/nagios/etc/nagios.cfg

Finally, apply the changes by restarting Nagios:

/etc/init.d/nagios restart

Give it a few minutes for the next round of checks to run; you will then see the two hosts with the service statuses defined in the configuration.


Host check status - checking port 22 (ssh)


Service status on a host



Saturday, April 11, 2015

FTP - FTPes vs Firewall and IPS - Who is blocking ??


A couple of months ago I had a really strange situation: the FTP client suddenly was not able to log in to the FTP server (vsftpd), let alone put and get data. It had worked all this while.

Since the setup was on the customer's premises, I needed to find out what went wrong. First they told me that they had replaced the old firewall with a brand new one. Aha... could that be the culprit?

Details of my environment:

1) There was an FTP server (vsftpd) running on RHEL 6.5.
    - selinux enabled
    - SSL enabled, which makes the communication and file transfer encrypted, hence the connection being used is FTPES
    - passive mode enabled; the control port is 21/tcp and the data ports are in the range 30000-30100/tcp

Click Here for more explanation on active vs passive FTP.


2) Two FTP clients, namely an IBM DataPower appliance and the FileZilla FTP client on Windows.

The connection illustration shows three agencies, which I name Agency-A, Agency-B and Agency-X, acting as the FTP clients. They are all at different sites, on different networks and behind different firewalls.

Let's focus on the Agency-X to Agency-A connection, since the issue was between them.



Below are the steps I used to find the root cause:

1) Testing behind the firewall in Agency-A, between the client and the server, worked fine.
2) From Agency-X to Agency-A, there were three sets of tests:

    A) use the default settings with SSL enabled (ftpes) - failed
    B) disable SSL (configured in vsftpd) - success
    C) telnet to port 21 and randomly to ports 30000-30100 - success, which means the firewall was not blocking the TCP ports

Could it be a setting in vsftpd? I don't think so, since it worked previously; however, let's get more information from the log.

In vsftpd.conf, I enabled the debug_ssl option. For completeness, here is the configuration:

anonymous_enable=NO
local_enable=YES
write_enable=YES
local_umask=022
dirmessage_enable=YES
xferlog_enable=NO
connect_from_port_20=YES
xferlog_file=/var/log/xferlog
xferlog_std_format=YES
ftpd_banner=Welcome to server FTP service.
chroot_local_user=YES
listen=YES

pam_service_name=vsftpd
userlist_enable=YES
tcp_wrappers=YES

ssl_enable=YES
allow_anon_ssl=NO
force_local_data_ssl=YES
force_local_logins_ssl=YES
ssl_tlsv1=YES
ssl_sslv2=NO
ssl_sslv3=NO
ssl_ciphers=HIGH
require_ssl_reuse=NO
ssl_request_cert=no

rsa_cert_file=/etc/vsftpd/server.crt
rsa_private_key_file=/etc/vsftpd/server.key

pasv_enable=YES
pasv_min_port=30000
pasv_max_port=30100

dual_log_enable=YES
log_ftp_protocol=YES
vsftpd_log_file=/var/log/vsftpd.log
xferlog_enable=YES
xferlog_std_format=NO
xferlog_file=/var/log/xferlog
debug_ssl=yes

Now, let's look at the vsftpd.log entries:

05:54:52 2015 [pid 20383] CONNECT: Client "10.x.x.x"
Thu Jan 22 05:54:52 2015 [pid 20383] FTP response: Client "10.x.x.x", "220 Welcome to server FTP service."
Thu Jan 22 05:54:52 2015 [pid 20383] FTP command: Client "10.x.x.x", "AUTH TLS"
Thu Jan 22 05:54:52 2015 [pid 20383] FTP response: Client "10.x.x.x", "234 Proceed with negotiation."
Thu Jan 22 05:54:52 2015 [pid 20383] DEBUG: Client "10.x.x.x", "SSL_accept failed: error:00000000:lib(0):func(0):reason(0)"

 "DEBUG: Client "10.x.x.x", "SSL_accept failed: error:00000000:lib(0):func(0):reason(0)"

The line above indicates something wrong with SSL; in effect it says, "Hey FTP server, I am the client; since you require an SSL connection, I need your certificate to continue the handshake, but I am not getting it."

Well then, who is interrupting or denying the handshake??
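
One quick way to test the TLS handshake itself from the client side is openssl s_client with explicit FTPS (the hostname below is a placeholder for the server's address):

openssl s_client -connect ftp.example.com:21 -starttls ftp

If the handshake completes, you will see the server certificate; if something in the path is dropping the encrypted negotiation, the command stalls or fails right after AUTH TLS is accepted.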

It took me a few days of googling to work out what the root cause could be; the hunt for clues brought me to this thread:

"http://www.experts-exchange.com/Software/Internet_Email/File_Sharing/Q_22690366.html"

As I have a subscription with Redhat, I opened a case, and it took some time for the Redhat support engineer to trace. From the trace it was noticed that the client terminates the connection, but why?

strace output:

~~~
16966 write(0, "234 Proceed with negotiation.\r\n", 31) = 31 <0.000034>
16966 read(0, 0x7f6fb88c5d30, 11)       = -1 ECONNRESET (Connection reset by peer) <0.003219>        <<---
16966 brk(0x7f6fb88fa000)               = 0x7f6fb88fa000 <0.000031>
16966 fcntl(4, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0 <0.000072>
16966 write(4, "Fri Jan 23 17:01:34 2015 [pid 16"..., 127) = 127 <0.000067>
16966 fcntl(4, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0 <0.000018>
16966 fcntl(0, F_GETFL)                 = 0x2 (flags O_RDWR) <0.000023>
16966 fcntl(0, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.000017>
16966 write(0, "500 OOPS: ", 10)        = -1 EPIPE (Broken pipe) <0.000022>
16966 --- SIGPIPE (Broken pipe) @ 0 (0) ---
16966 rt_sigreturn(0xd)                 = -1 EPIPE (Broken pipe) <0.000016>
16966 write(0, "error:00000000:lib(0):func(0):re"..., 39) = -1 EPIPE (Broken pipe) <0.000017>
16966 --- SIGPIPE (Broken pipe) @ 0 (0) ---
~~~


As soon as the vsftpd server proceeds with the SSL connection, the client terminates it; here we can see the connection reset by peer.

Now, the last thing to check was the firewall itself. To make a long story short, I met the firewall guy and asked him to check whether there were any indicators in the firewall log, but there was nothing about the firewall blocking the designated ports.

Again, this was not a port connectivity issue; I emphasized checking whether there were rules that deny encrypted content passing through the firewall, since the protocol being used is FTPES.

Voila... he finally noticed that in the default settings that come with the appliance, the Intrusion Prevention System (IPS) is ENABLED and it DROPPED the encrypted file transfer protocol.

It took the firewall guy just a couple of minutes to configure it to allow the protocol. Root cause found and problem resolved!!

Tuesday, March 3, 2015

LibVirt Fencing - RHEL Virtual Machine High Availability


To configure fencing for virtual machines running on a RHEL 6 host with libvirt, you can configure a fencing device of the Fence virt (Multicast Mode) type.

fence_virt and fence_xvm are I/O fencing agents that can be used with virtual machines.

Libvirt fencing in multicast mode works by sending a fencing request, signed with a shared secret key, to the multicast group; the hypervisor running the virtual machine needs to run a daemon to handle the request.

On your host machine or hypervisor

1) Install the fence-virtd, fence-virtd-libvirt and fence-virtd-multicast packages:

yum -y install fence-virtd{,-libvirt,-multicast}

2) Create a shared secret key, /etc/cluster/fence_xvm.key; you will need to create the /etc/cluster directory first:

mkdir -p /etc/cluster

dd if=/dev/urandom  of=/etc/cluster/fence_xvm.key bs=1k  count=4


3) Configure the fence_virtd daemon

fence_virtd -c

Module search path [/usr/lib64/fence-virt]:

Available backends:
    libvirt 0.1
Available listeners:
    serial 0.4
    multicast 1.1

Listener modules are responsible for accepting requests
from fencing clients.

Listener module [multicast]:

The multicast listener module is designed for use environments
where the guests and hosts may communicate over a network using
multicast.

The multicast address is the address that a client will use to
send fencing requests to fence_virtd.

Multicast IP Address [225.0.0.12]:

Using ipv4 as family.

Multicast IP Port [1229]:

Setting a preferred interface causes fence_virtd to listen only
on that interface.  Normally, it listens on the default network
interface.  In environments where the virtual machines are
using the host machine as a gateway, this *must* be set
(typically to virbr0).
Set to 'none' for no interface.

Interface [br0]:

The key file is the shared key information which is used to
authenticate fencing requests.  The contents of this file must
be distributed to each physical host and virtual machine within
a cluster.

Key File [/etc/cluster/fence_xvm.key]:

Backend modules are responsible for routing requests to
the appropriate hypervisor or management layer.

Backend module [libvirt]:

The libvirt backend module is designed for single desktops or
servers.  Do not use in environments where virtual machines
may be migrated between hosts.

Libvirt URI [qemu:///system]:

Configuration complete.

=== Begin Configuration ===
fence_virtd {
    listener = "multicast";
    backend = "libvirt";
    module_path = "/usr/lib64/fence-virt";
}

listeners {
    multicast {
        key_file = "/etc/cluster/fence_xvm.key";
        address = "225.0.0.12";
        family = "ipv4";
        port = "1229";
        interface = "br0";
    }

}

backends {
    libvirt {
        uri = "qemu:///system";
    }

}

=== End Configuration ===
Replace /etc/fence_virt.conf with the above [y/N]? y

Please note that in my setup I use br0 as the network interface; yours might be different.

4) Enable and start the fence_virtd service on your hypervisor

chkconfig fence_virtd on ; service fence_virtd start

5) Distribute /etc/cluster/fence_xvm.key to the cluster nodes, into their /etc/cluster/ directory.

In my setup I have two nodes,

virsh list --all

  Id    Name                           State
 ----------------------------------------------------
 1     node1                          running
 2     node2                          running





On the cluster nodes, check that the fencing device works. For example, on node1, get the list of guests with:

fence_xvm -o list

node1                f238c4a1-d6ce-e920-a5af-70fbc62b3203 on
node2                608b396f-becb-5e54-081a-692301aee064 on

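You can also exercise the agent directly, bypassing the cluster stack, to confirm the key and the multicast path work (this really reboots the guest, so use a node you can afford to bounce):

fence_xvm -o reboot -H node2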

On node1, try fencing the other node through the cluster:

fence_node node2

If your configuration works, the second node will be rebooted by the host, i.e. the running node asks the hypervisor to reboot the other node.


Since this tutorial is not a step-by-step guide to creating a cluster on RHEL but is focused on libvirt fencing, here is the configuration (/etc/cluster/cluster.conf) for my cluster:

<?xml version="1.0"?>
<cluster config_version="39" name="mycluster">
    <clusternodes>
        <clusternode name="exis01.ex.net.my" nodeid="1">
            <fence>
                <method name="Method">
                    <device domain="exis01.ex.net.my" name="fencexvm"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="exis02.ex.net.my" nodeid="2">
            <fence>
                <method name="Method">
                    <device domain="exis02.ex.net.my" name="fencexvm"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="1" transport="udpu" two_node="1"/>
    <fencedevices>
        <fencedevice agent="fence_xvm" name="fencexvm"/>
    </fencedevices>
    <rm>
        <failoverdomains>
            <failoverdomain name="anynode" nofailback="1" ordered="1">
                <failoverdomainnode name="exis01.ex.net.my" priority="1"/>
                <failoverdomainnode name="exis02.ex.net.my" priority="1"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <ip address="172.16.16.32" sleeptime="3"/>
        </resources>
        <service domain="anynode" name="myclusterha" recovery="relocate">
            <ip ref="172.16.16.32"/>
        </service>
    </rm>
    <logging>
        <logging_daemon debug="on" name="rgmanager"/>
    </logging>
</cluster>
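
For reference, the fencing-related parts of this cluster.conf could also be built with the ccs tool instead of editing the XML by hand. A rough sketch, run against one node with ricci running, reusing the node and device names from above:

ccs -h exis01.ex.net.my --addfencedev fencexvm agent=fence_xvm
ccs -h exis01.ex.net.my --addmethod Method exis01.ex.net.my
ccs -h exis01.ex.net.my --addmethod Method exis02.ex.net.my
ccs -h exis01.ex.net.my --addfenceinst fencexvm exis01.ex.net.my Method domain=exis01.ex.net.my
ccs -h exis01.ex.net.my --addfenceinst fencexvm exis02.ex.net.my Method domain=exis02.ex.net.my
ccs -h exis01.ex.net.my --sync --activate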