Tuesday, June 18, 2013

Guide to unexpected Linux system restarts

Sometimes you really have no clue about the root cause of a system restart. Neither you nor your colleagues triggered it, so what did?

Three common causes:

1) A deliberate action of a user (fence event, shutdown command)
2) Software error (kernel panic, NMI, etc.)
3) Hardware fault/power failure in the server (power supply, disk, memory, system board, etc.)

Environment

Do we have a real picture of what has been configured and the function of each box?

  • Is the server part of a cluster (cluster node) with fence device?
  • What software is installed, and does it perform any tasks that would change its typical resource use?
  • Is the server hardware capable of rebooting itself during a system hang (e.g. configured with health-monitoring software such as HP ASR)?
  • Does it have a Baseboard Management Controller (HP iLO, Dell DRAC, etc.) connected to the system?

Gathering information

Software faults will most typically leave traces in /var/log/messages.

Hardware faults are difficult to diagnose from the OS level; be alert for power failures, maintenance events, or other environmental occurrences around the time of the restart.

Investigation

Examine /var/log/messages.

Many, but not all, restart causes will leave traces in /var/log/messages. Every full system restart begins by logging the kernel command line, so searching the message log for the phrase "Command line" is the first step of an investigation.

Aug 22 03:18:15 node1 kernel: Command line: ro root=LABEL=/ rhgb quiet crashkernel=128M@16M
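
A quick way to do this, sketched below assuming a default RHEL-style syslog layout (current and rotated logs matching /var/log/messages*), is a grep across all the message logs:

# list every recorded boot in the current and rotated message logs
grep -h "Command line" /var/log/messages*

# then scan for the common clue phrases discussed below
egrep -h "system reboot|Switching to runlevel|fencing node|Machine Check|NMI received|soft lockup|blocked for more than" /var/log/messages*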

Now look for output like the following in the log.

User initiated:

shutdown: shutting down for system reboot 
init: Switching to runlevel: 6 
exiting on signal 15 
Got SIGTERM, quitting

Veritas Cluster Fence:

GAB WARNING V-15-1-20138 Port h isolated due to client process failure

RHEL High-Availability Cluster Suite Fence Event

fenced[xxxx]: fencing node "node1.example.com" 


Hardware Fault

CPU 1: Machine Check Exception: 3 Bank 3: ba00000000070f0f

Thermal Event/Cooling Failure

kernel: CPUX: Temperature above threshold, cpu clock throttled
kernel: CPUX: Core power limit notification (total events = 1)


Power Button Pressed

received event "button/power PWRF 00000000 00000000"
 

Non-Maskable Interrupt Received

Uhhuh. NMI received for unknown reason XX.


Kernel Soft Lockup

kernel: BUG: soft lockup - CPU#7 stuck for 10s!
 


Task Blocked for Too Long

kernel: INFO: task khugepaged:60 blocked for more than 120 seconds.
 


The messages above are not necessarily the root cause of the reboot, but they are important clues for further investigation.





Wednesday, April 3, 2013

RHEV 3.1, Active Directory, VMs High Availability Cluster, fence_rhevm

It has been quite some time since I left my RHEV 3.0 implementation; now I have the chance to evaluate RHEV 3.1. New features, good to go.

With limited resources, my setup is as follows:

1) One RHEV Hypervisor with local storage: IBM System x3650, 4GB RAM
2) RHEV Manager: a VM on my Windows desktop
3) A DNS server (BIND) running on the RHEV Manager server
4) A domain, mms.local, used for both DNS and AD
5) Red Hat Evaluation subscription
6) Windows 2003 Server (Active Directory) VM running in RHEV
7) A few CentOS 6 and Windows XP VMs running in RHEV

This setup follows the Red Hat Enterprise Virtualization 3.1 Evaluation Guide, track B, for a minimal setup.
 
For the RHEV Hypervisor and Manager installation, follow the Evaluation Guide; it will not be shown in this entry. Additional tasks such as installing the RHEV guest tools on Windows VMs won't be explained here either; you may refer to the Red Hat knowledge base.

The objectives are:

RHEV:

installing the RHEV guest tools on CentOS 6 VMs
adding Active Directory for RHEV users
troubleshooting AD connectivity
user portal preview
using the RHEV API to manage VMs from the command line


High Availability Clustering:

installing CentOS 6 HA components
setting up HA services: virtual IP, storage, and Apache
setting up a fence device using fence_rhevm
testing fencing from the command line
a preview of high availability in action

RHEV screenshots:

1) RHEVM admin login (https://rhevm.mms.local/webadmin)

2) Listing items in rhevm

3) Cluster

4) Host

5) Storage

6) Virtual Disk

7) Virtual Machines

8) Active Directory user listing

9) Dashboard

10) Events

Adding Active Directory to RHEV

Without an external directory, RHEV provides only a single user (admin@internal) to manage the entire system. Sometimes you may need to allow users to manage their own resources, such as creating, stopping, starting, or pausing virtual machines.

Now, say you have an Active Directory (2003 or 2008) and you need to add it to your RHEV infrastructure.

Before that, make sure your RHEV-M host is able to communicate with the AD server.
Below are the configuration files you might need to refer to:

1) /etc/resolv.conf

nameserver 127.0.0.1
nameserver 8.8.8.8
search mms.local

2) DNS records for the mms.local domain in /var/named

$ttl 38400
@ IN SOA rhevm.mms.local. root.mms.local. (
1323918962
10800
3600
604800
38400 )
@ IN NS rhevm
rhevm.mms.local. IN A 172.24.101.31
rhevh.mms.local. IN A 172.24.101.32
mmssvrad.mms.local. IN A 172.24.101.33
_kerberos._udp IN SRV 0 100 88 mmssvrad.mms.local.
_kerberos._tcp IN SRV 0 100 88 mmssvrad.mms.local.
_ldap._tcp IN SRV 0 100 389 mmssvrad.mms.local.
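
As a quick sanity check that the SRV records above resolve (the hostnames and the local nameserver are from this example setup), query them with dig:

dig +short _ldap._tcp.mms.local SRV @127.0.0.1
# expected: 0 100 389 mmssvrad.mms.local.
dig +short _kerberos._tcp.mms.local SRV @127.0.0.1
# expected: 0 100 88 mmssvrad.mms.local.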

3) /etc/ovirt-engine/krb5.conf

[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
[libdefaults]
default_realm = MMS.LOCAL
dns_lookup_realm = true
dns_lookup_kdc = true
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
[realms]
MMS.LOCAL = {
kdc = mmssvrad.mms.local
}
[domain_realm]
mms.local = MMS.LOCAL

Add the domain:

rhevm-manage-domains -action=add -domain=mms.local -provider=ActiveDirectory -user=rhevadmin -interactive -addPermissions

Restart the ovirt-engine and jbossas services, then open the RHEV webadmin; you should now see a new domain in the domain drop-down box. Log in using that user name; mine is rhevadmin.

To check whether this setup survives a reboot, restart your RHEV-M host and revalidate the configuration:

rhevm-manage-domains --help
rhevm-manage-domains -action=list
rhevm-manage-domains -action=validate

You might want to look in the log files in /var/log/ovirt-engine for errors:

engine.log
engine-manage-domains.log

User portal

As you add users in AD, you, as the RHEV admin, will grant permissions per user, for example to manage VMs, as shown below:

Installing the RHEV guest tools on CentOS 6 VMs

There is no official RHEV guest tool for CentOS 6, but you may use one from the community. I found one here:

yum -y install wget
wget http://www.dreyou.org/ovirt/ovirt-dre.repo -P /etc/yum.repos.d/
yum -y install rhev-agent-pam-rhev-cred rhev-agent
service rhev-agentd start

This tool provides information to the RHEV admin console, such as the VM's IP address and its CPU, RAM, and network utilization.

Using the RHEV API to manage VMs from the command line

Sometimes you need a simple yet powerful interface to manage your RHEV environment, such as powering a VM down or up.

You need to have the required tools and some configurations:

- Install rhevm-cli.
- Download the CA certificate from the RHEV Manager admin portal if you use HTTPS:

wget https://rhevm.mms.local/ca.crt

- Connect to the RHEV-M:

rhevm-shell -c -l "https://rhevm.mms.local/api" -P 443 -u "admin@internal" -A ca.crt

- Once connected, you may use the available commands.
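
For example (a sketch based on the ovirt-shell command syntax; the VM name centos6N1 is from this setup), at the rhevm-shell prompt:

list vms
show vm centos6N1
action vm centos6N1 stop
action vm centos6N1 start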

VM High Availability Clustering

I will use three CentOS 6 guests without a quorum disk. The cluster resources are a floating IP address; a shared/floating disk (a new feature in RHEV 3.1) shared among the three hosts, formatted as ext4, and auto-mounted to /mnt on the active host by the cluster service; and finally a web server (Apache) serving an index.html from /mnt.

On CentOS 6, install virtualization package groups such as:

Virtualization
Virtualization Platform
Virtualization Tools

Configure host resolution in /etc/hosts so that the hosts can reach each other by name.

127.0.0.1   localhost localhost.localdomain
172.24.101.26  node1.mms.local
172.24.101.27  node2.mms.local
172.24.101.25  node3.mms.local
172.24.101.31  rhevm.mms.local

Enable the luci service; you will need it to configure the cluster via its web interface, e.g. https://172.24.101.26:8084.
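
A minimal sketch of getting luci and its node agent ricci running (assuming the stock CentOS 6 High Availability packages; run the ricci steps on every node):

# on the node that will serve the web UI (listens on port 8084)
yum -y install luci
service luci start && chkconfig luci on

# on every cluster node
yum -y install ricci
passwd ricci                # luci authenticates to the nodes as the ricci user
service ricci start && chkconfig ricci on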

Below are some screenshots of the configured cluster in my luci interface.

All of the details above live in a single file, /etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster config_version="38" name="webha">
    <clusternodes>
        <clusternode name="node1.mms.local" nodeid="1">
            <fence>
                <method name="Method">
                    <device name="rhevmfence" port="centos6N1"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="node2.mms.local" nodeid="2">
            <fence>
                <method name="Method">
                    <device name="rhevmfence" port="centos6N2"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="node3.mms.local" nodeid="3">
            <fence>
                <method name="Method">
                    <device name="rhevmfence" port="centos6N3"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <fencedevices>
        <fencedevice agent="fence_rhevm" ipaddr="rhevm.mms.local" ipport="443" login="admin@internal" name="rhevmfence" passwd="redhat" power_wait="3" ssl="on"/>
    </fencedevices>
    <rm>
        <failoverdomains>
            <failoverdomain name="fodomain" nofailback="1" ordered="1">
                <failoverdomainnode name="node1.mms.local" priority="1"/>
                <failoverdomainnode name="node2.mms.local" priority="2"/>
                <failoverdomainnode name="node3.mms.local" priority="3"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <ip address="172.24.101.29" sleeptime="3"/>
            <apache config_file="conf/httpd.conf" name="httpd" server_root="/etc/httpd" shutdown_wait="0"/>
            <fs device="/dev/vdb" force_fsck="1" force_unmount="1" fsid="56578" mountpoint="/mnt" name="disk" self_fence="1"/>
        </resources>
        <service domain="fodomain" name="haservice" recovery="relocate">
            <ip ref="172.24.101.29"/>
            <apache ref="httpd"/>
            <fs ref="disk"/>
        </service>
    </rm>
</cluster>
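
If you edit cluster.conf by hand rather than through luci, a sketch of the usual follow-up with the standard RHEL 6 HA tooling is to validate the file and push the new config_version to the other nodes:

ccs_config_validate
cman_tool version -r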


Get the status of the running cluster using the clustat command.

Test fencing from the command line. We will use the fence agent directly to verify that the fence mechanism works; in a real situation the problematic node (for example, one that hangs) is fenced automatically.

In my three-node cluster, I can fence another node from the command line; for example, using fence_rhevm I can fence node2.

Before that, get node2's status using fence_rhevm:
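
The status call mirrors the reboot call further below, just with -o status (the credentials, RHEV-M address, and centos6N2 port name are from this setup):

fence_rhevm -o status -z -a rhevm.mms.local -u 443 -l 'admin@internal' -p 'redhat' -n centos6N2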

From another node (node3), fence node2 (which is currently running the cluster service):

fence_rhevm -o reboot -z -a rhevm.mms.local -u 443 -l 'admin@internal' -p 'redhat' -n centos6N2

You will see node2 being rebooted in the RHEV-M page; also monitor the output of clustat, where the service is then transferred to another node. The example below refreshes the status at a two-second interval:

clustat -i 2
To check on how the fencing went, you might want to look at /var/log/cluster/fenced.log.

Now the cluster service is running on node1.

That's it.

I hope you readers find something useful in this information.

Tuesday, January 22, 2013

It's Cygwin

What is Cygwin, and why use it?


Cygwin is a set of powerful tools to assist developers in migrating applications from UNIX/Linux to the Microsoft Windows platform. Cygwin delivers the open source standard Red Hat GNU gcc compiler and gdb debugger on Windows.

With Cygwin, administrators can log in remotely to any PC, fix problems within a POSIX/Linux/UNIX shell on any Windows machine, and run shell scripts.

Installation


For this tutorial, I will use Red Hat Cygwin and install it on Windows XP SP3.
Red Hat Cygwin is supported on Windows 2000, XP, 2003 Server, Vista, 2008 Server, 7, and 2008 Server R2. Cygwin is a 32-bit platform that operates on both 32-bit and 64-bit Windows installations.

1) Download the installer
ftp://ftp.ges.redhat.com/private/releng/cygwin-1.8/rhsetup.exe

2) Run the setup file (rhsetup.exe).

Similar to installing packages in Linux, dependencies are resolved.

This shall give you enough time for a coffee break.

There it is, installation complete.

3) Using Cygwin

There are many tools and applications you can use; what I want to show is how to bring an X application from Linux to your Windows desktop.


You will find the Cygwin terminal shortcut on the desktop. Depending on which applications you chose earlier, the Linux programs are ready to use, e.g. ssh.

First, run the X Win Server; you can find it in the Windows Start menu.

Shown here is xclock.

A connection using ssh -X to a Linux host.
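
Roughly, the sequence looks like this (the remote hostname is hypothetical, and X11 forwarding must be allowed on the remote sshd):

# in the Cygwin terminal, with the X server already running
export DISPLAY=:0.0
ssh -X user@linuxhost.example.com
# then, on the remote host, launch any GUI program:
xclock &
gedit &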

It's gedit; let's write something and save.

You may launch other GUI programs as well.

That is some of what Cygwin can do.

Saturday, January 19, 2013

11 Bizarre Things People Say About Linux


Well, Linux has coexisted with other operating systems for many years, but the misunderstandings are still there.


1. There is no anti-virus software in Linux
2. Linux always crashes.
3. Surely I cannot run any application on a Linux server?
4. What about the future?
5. Isn’t it easy to hack a Linux box with a root account?
6. Linux does not connect to my MS Windows machines.
7. Where does Linux leave UNIX?
8. Linux is not configurable or adaptable.
9. Linux isn’t international.
10. There is no commercial support for Linux servers and related software.

More details:

linuxit.com - flash format
linuxit.com - html

Thursday, January 17, 2013

DNS records: queries return different results.

Scenario:

Name resolution is not consistent.

A friend of mine complained that the link to his company's web app sometimes works and sometimes does not.

Details:

http://myapp.hiscompany.com (accessible)

A few minutes later, or from another PC/host:

http://myapp.hiscompany.com (page not found)

Resolution:

First, figure out whether DNS queries return consistent results. Queries using the host and dig commands from Linux and nslookup from a Windows machine returned no result.

A web-based lookup via http://zoneedit.com/lookup.html did return a result.

Nail it down by querying the company's DNS servers directly. The first server (ns1.hiscompany.com) resides within the company network, while the second server (ns2.hiscompany.com) is hosted at the ISP.

Querying the first DNS server returned a result, while querying the second did not: the records are not in sync between the two servers. By default, the records should be kept in sync by the zone transfer facility if BIND on Linux is used.
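
For example (the hostnames are from this scenario; the comments reflect the behaviour observed above):

dig @ns1.hiscompany.com myapp.hiscompany.com A +short   # returns the address record
dig @ns2.hiscompany.com myapp.hiscompany.com A +short   # returns nothing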

Fact: 
DNS servers retrieve information from other DNS servers.