Archive for the 'UNIX' Category

25
Jan

Solaris MPxIO: LUN distribution

Lately my work tasks are leading me to learn a lot about Solaris storage configuration. I have a “test” environment with 13 HP ProLiant servers, Solaris x86 and HP EVA8400 storage.

EVA8400 is advertised as an active-active (AA) storage – a LUN can be accessed via both controllers. But if you look at the small letters you will see that it is actually asymmetric AA storage (AAA), which is a cheaper class of AA storages. While all LUNs can still be accessed via both controllers, LUNs still have owning controller. If I/O request comes to a non-owning controller it will be transfered to the owner and only then performed. This of course has impact on the performance. To avoid this EVA monitors the requests, and if more than 60% of requests are coming through non-owning controller ownership is changed.

Preferred method of access for AAA storages is using asymmetric logical unit access (ALUA) – method that enables target (storage system) to set different access characteristics to different paths based on the owning controller (this is known as target port group support (TPGS)). For example, if Controller_A owns LUN_1, all paths for LUN_1 to Controller_A will be marked as Active/Optimized, while all paths to Controller_B will be marked as Active/NonOptimized.

Knowing all this, it is only obvious that ideal setup for AAA storages is when one half of LUNs is owned by one controller and the other half by second controller. This enables us to balance the load on both controllers and get the maximum performance out of the storage.

Now, on the client side some systems have Veritas suite installed, but there are a couple of servers that are using Solaris native MPxIO as a multipathing solution. By specification MPxIO is ALUA aware plus it comes with a nice feature to balance the traffic across all Active paths. EVA8400 controllers have 4 x 4Gbps fibre channel ports each, while my servers have 2 x 8Gbps HBAs. Spreading the traffic across all 4 target ports would help me get needed performance.

Everything sounds perfect in theory, but I’m having problems to make this work in practice. For some reason MPxIO does all the I/O over Controller A and, since with MPxIO path priority can not be set manually, after a while all LUNs are moved to this controller leaving second controller idle. Since TPGS specification is vendor specific there seems to be some incompatibility between HP and Sun implementation. Or maybe there are some hidden undocumented options that have to be set – so far I had no luck in finding these. :-/

During the tests and in order to see how the LUNs were spread I wrote a script that counts Active paths per target port. Maybe someone else will find it useful. Download can be found here. And here is how it looks in action:

root@xdb1-ora:~# get_san_paths.py
Initiator ports found:    2
Target ports found:       8
LUNs found:              87
 
Path \ LUNs:
  50001fe150229f78: 77 LUNs
  50001fe150229f79: 77 LUNs
  50001fe150229f7c: 10 LUNs
  50001fe150229f7d: 10 LUNs
  50001fe150229f7a: 77 LUNs
  50001fe150229f7b: 77 LUNs
  50001fe150229f7e: 10 LUNs
  50001fe150229f7f: 10 LUNs

18
Oct

FreeIPA and automount NIS maps

I am playing since recently with FreeIPA, Red Hat's identity management solution built on top of Red Hat's DS389 directory server. One of the main reasons why I decided for FreeIPA (apart from integrated Kerberos for single sign-on and possible integration with Microsoft Active Directory in the future) is, also integrated, NIS server - proxy system that receives requests from NIS clients, gets the data from LDAP server and sends it back to clients. Now, in order to understand why I need to support both LDAP and NIS you need to know few things about the environment I'm in charge of.

I'm working for a software development company. We produce billing software for telecommunication operators - mainly used by mobile telecommunication companies. That means, when you make a call, your call needs to be tracked, recorded and properly billed on the end, all done by our software (called BSCS btw). Sounds simple enough. Multiply that by one hundred million customers making calls and it's not so simple anymore. :) Anyway, our customers use our software on different platforms, most of them use HP-UX, some are on Solaris, some AIX, some are on Linux and we even have some customers on Tru64. In order to provide support to all those customers we need to have all those systems as well. So on the end we end up 100+ servers of all types of UNIX systems. That's not a big problem, it's even interesting, but the problem comes up when those systems are not being upgraded. We have Solaris 2.6 servers and Tru64 4.0D servers, until recently we even had AIX 4.3 and HP-UX 10.30 servers. All of the mentioned systems are 13 years old! Scary!

As you can imagine, those outdated systems do not support many things we take for granted today. Shadow passwords and LDAP authentication are few of those things. And this gets us back to the main topic of this post. FreeIPA (or rather DS389) provides integrated NIS server for unlucky people like myself via SLAPI-NIS plugin. All you have to do in order to use it, is to enable compat and NIS plugin.

# ipa-compat-manage enable
Directory Manager password:
 
Enabling plugin
This setting will not take effect until you restart Directory Server.
 
# ipa-nis-manage enable
Directory Manager password:
 
Enabling plugin
This setting will not take effect until you restart Directory Server.
The rpcbind service may need to be started.

And after directory server restart you have a working NIS server.

# rpcinfo -p
program vers proto   port  service
100000     4   tcp    111  portmapper
100000     3   tcp    111  portmapper
100000     2   tcp    111  portmapper
100000     4   udp    111  portmapper
100000     3   udp    111  portmapper
100000     2   udp    111  portmapper
100024     1   udp  49833  status
100024     1   tcp  36837  status
100004     2   udp    699  ypserv
100004     2   tcp    699  ypserv

By default only passwd, group and netgroup maps are supported but other maps can easily be added. In our environment we are heavily relaying on automounter maps so I had to find a way to add them into FreeIPA NIS server. Luckily, as everything else with FreeIPA, this is very simple. First let me show you how to add automount entries in FreeIPA, it is surprisingly easy.

When it comes to automounter, FreeIPA has support for different locations. So for example, you can have different maps for your production environment, test environment and DMZ environment. Pretty neat. In my example, I will create a new location for our DMZ environment.

# ipa automountlocation-add dmz
  Location: dmz

New location is automatically created with auto.master and auto.direct maps.

# ipa automountmap-find dmz
  Map: auto.master
 
  Map: auto.direct
  ----------------------------
  Number of entries returned 2
  ----------------------------

I would like to add a new map for user home folders.

# ipa automountmap-add dmz auto.home
  Map: auto.home

Then we need to add an entry in auto.master map to associate /home mount point with auto.home map.

# ipa automountkey-add dmz auto.master /home --info=auto.home
  Key: /home
  Mount information: auto.home

Finally, we add an entry into auto.home map specifying which share to mount for user miljan.

# ipa automountkey-add dmz auto.home miljan --info=filer01:/vol/users/home/miljan
  Key: miljan
  Mount information: filer01:/vol/users/home/miljan

And voila, when user miljan logs-on he will have his home folder mounted.

Final step would be to have this in NIS as well. For this we need to manually add few entries into LDAP server. In the example below we add support for auto.master map. There are probably few things you would need to change, though. First, the domain name in DN and nis-domain lines - in the example I am using example.com as a domain. Second, nis-base line - value of this attribute needs to be the DN of your automount map.

# ldapadd -x -D "cn=Directory Manager" -W
dn: nis-domain=example.com+nis-map=auto.master,cn=NIS Server,cn=plugins,cn=config
objectClass: extensibleObject
nis-domain: example.com
nis-map: auto.master
nis-base: automountmapname=auto.master,cn=dmz,cn=automount,dc=example,dc=com
nis-filter: (objectclass=*)
nis-key-format: %{automountKey}
nis-value-format: %{automountInformation}

Repeat the same for auto.home map and you are set to go.

$ ypcat -d example.com -h freeipa.example.com -k auto.master
/home auto.home
 
$ ypcat -d example.com -h freeipa.example.com -k auto.home
miljan filer01:/vol/users/home/miljan

Nice and easy. :)

06
Oct

SVC lsmigrate

A year or so ago, I got annoyed by the ugly and unreadable output of IBM SVC lsmigrate command so I sat down and wrote a short script that will provide much nicer and more informational output. Output includes information about the VDisk that is being migrated (name, ID and size), destination MDisk group and migration information (number of threads and progress). If started in verbose mode, information about the source MDisk group is printed as well.

Non-verbose mode:

$ ./svc_lsmigrate.py -H svccluster
# (ID  ) Vdisk         Size      (ID ) Mdisk Group    Threads Progress
======================================================================
1 (67  ) esx_srvf05_d  1000.00GB (8  ) DS482_5r10_SK1 1       48 %
2 (157 ) esx_srvf06_g  1000.00GB (5  ) DS483_8r5_2SK3 1       96 %
3 (118 ) esx_srvf01_h  1022.00GB (1  ) DS482_8r5_SK3  1       63 %
4 (117 ) esx_srvf01_g     1.00TB (1  ) DS482_8r5_SK3  1       63 %
5 (120 ) esx_srvf01_i  1023.00GB (1  ) DS482_8r5_SK3  1       63 %
6 (39  ) tsm_disk4        5.00GB (10 ) DS484_9r5_1SK2 1       98 %
7 (19  ) oracode_tunis  800.00GB (5  ) DS483_8r5_2SK3 1       59 %

If you find the script useful, you can download it here.

Note: in order for script to work, you need to have SVC connection parameters set in your SSH config file. Example could be:

$ grep -p svccluster ~/.ssh/config
Host svccluster
  Hostname 192.168.1.34
  User admin
  IdentityFile ~/.ssh/admin.key

Note #2: Script was tested on SVC software levels 4.3 and 5.1.

04
May

Linux LVM: logical volume migration

I was recently confronted with the task of migrating logical volume holding MySQL databases to a separate physical volume. Due to important application running on top of MySQL any downtime was out of the question.

Easiest way to perform this operation would be using pvmove command.

[root@r2d2 ~]# pvmove -n test /dev/sda5 /dev/sdb1
/dev/sda5: Moved: 6.2%
/dev/sda5: Moved: 18.8%
/dev/sda5: Moved: 28.1%
/dev/sda5: Moved: 37.5%
/dev/sda5: Moved: 46.9%
/dev/sda5: Moved: 56.2%
/dev/sda5: Moved: 65.6%
/dev/sda5: Moved: 75.0%
/dev/sda5: Moved: 84.4%
/dev/sda5: Moved: 93.8%
/dev/sda5: Moved: 100.0%

In the example we moved logical volume 'test' from physical volume /dev/sda5 to /dev/sdb1. While pvmove is running, we can get the necessary information including the progress by issuing command lvs.

[root@r2d2 ~]# lvs -a -o+devices
LV        VG  Attr   LSize  Move     Copy% Devices
home      sys -wi-ao 20.00G                /dev/sda5(544)
mysql     sys -wi-a- 1.00G                 /dev/sda5(1888)
[pvmove0] sys p-C-ao 1.00G /dev/sda5 56.25 /dev/sda5(1920),/dev/sdb1(0)
root      sys -wi-ao 7.00G                 /dev/sda5(256)
swap      sys -wi-ao 2.00G                 /dev/sda5(480)
test      sys -wI-ao 1.00G                 pvmove0(0)
usr       sys -wi-ao 8.00G                 /dev/sda5(0)
usr       sys -wi-ao 8.00G                 /dev/sda5(1856)
var       sys -wi-ao 2.00G                 /dev/sda5(224)
var       sys -wi-ao 2.00G                 /dev/sda5(1824)

Another, and somewhat more tricky way of doing this is to use LV mirroring. I say more tricky because it requires more work and the last part of the operation is not really intuitive so it may be prone to errors.

We start with this:

[root@r2d2 ~]# pvs
PV        VG  Fmt  Attr PSize   PFree
/dev/sda5 sys lvm2 a-   220.75G 179.75G
/dev/sdb1 sys lvm2 a-   3.72G   3.72G
 
[root@r2d2 ~]# lvs -a -o+devices
LV   VG  Attr   LSize Devices
[---]
test sys -wi-a- 1.00G /dev/sda5(1920)
[---]

Two PVs, the source - sda5, and unused, target PV - sdb1. Logical volume test is the LV we want to move to sdb1.

In order to verify that the content of the LV is consistent on the end, we will create a sample file and compare its content before and after.

[root@r2d2 ~]# dd if=/dev/zero of=/mnt/random bs=512 count=20000
20000+0 records in
20000+0 records out
10240000 bytes (10 MB) copied, 0.122046 s, 83.9 MB/s
 
[root@r2d2 ~]# md5sum /mnt/random
596c35b949baf46b721744a13f76a258 /mnt/random

We start by adding another copy to 'test' LV and placing it on a new PV.

[root@r2d2 ~]# lvconvert -m 1 sys/test /dev/sdb1
Not enough PVs with free space available for parallel allocation.
Consider --alloc anywhere if desperate.
Unable to allocate extents for mirror(s).

Hm, confusing error. :) The reason for this is that Linux LVM implementation requires mirror to use a log volume, and this log volume needs to reside on a PV of its own. In order to go around this we can instruct LVM to place the log in the memory. This is not recommended for situations when mirroring is needed on the long run, since mirror needs to be rebuilt every time the LV is activated. But this is the perfect solution for our needs as our mirror will not last for long.

[root@r2d2 ~]# lvconvert -m 1 sys/test /dev/sdb1 --mirrorlog core
sys/test: Converted: 6.2%
sys/test: Converted: 15.6%
sys/test: Converted: 28.1%
sys/test: Converted: 37.5%
sys/test: Converted: 46.9%
sys/test: Converted: 56.2%
sys/test: Converted: 65.6%
sys/test: Converted: 75.0%
sys/test: Converted: 84.4%
sys/test: Converted: 93.8%
sys/test: Converted: 100.0%
Logical volume test converted.

Again, we can get all information with lvs command.

[root@r2d2 ~]# lvs -a -o+devices
LV              VG  Attr   LSize Copy%  Devices
home            sys -wi-ao 20.00G       /dev/sda5(544)
mysql           sys -wi-a- 1.00G        /dev/sda5(1888)
root            sys -wi-ao 7.00G        /dev/sda5(256)
swap            sys -wi-ao 2.00G        /dev/sda5(480)
test            sys mwi-a- 1.00G 15.62  test_mimage_0(0),test_mimage_1(0)
[test_mimage_0] sys Iwi-ao 1.00G        /dev/sda5(1920)
[test_mimage_1] sys Iwi-ao 1.00G        /dev/sdb1(0)
usr             sys -wi-ao 8.00G        /dev/sda5(0)
usr             sys -wi-ao 8.00G        /dev/sda5(1856)
var             sys -wi-ao 2.00G        /dev/sda5(224)
var             sys -wi-ao 2.00G        /dev/sda5(1824)

So now we have our LV mirrored with copies on both PVs. Just for test, we can verify the content of the file.

[root@r2d2 ~]# md5sum /mnt/random
596c35b949baf46b721744a13f76a258 /mnt/random

Seems fine. :)

Last step, the one where we have to be more careful, is removing mirror copy from old PV. We have to be more careful because the PV name in this case is the name of a PV we want to remove from mirroring, which is the opposite from what we had when establishing the mirror. If we use wrong PV name we will end up on the begining, with LV on a wrong PV and without mirroring.

[root@r2d2 ~]# lvconvert -m 0 sys/test /dev/sda5
Logical volume test converted.

Again we use lvs to see the status.

[root@r2d2 ~]# lvs -a -o+devices
LV   VG  Attr   LSize Devices
[---]
test sys -wi-a- 1.00G /dev/sdb1(0)
[---]

On the end, to verify that we have a clean situation, we can remount the LV and compare the content.

[root@r2d2 ~]# umount /mnt/
[root@r2d2 ~]# lvchange -an /dev/sys/test
[root@r2d2 ~]# lvchange -ay /dev/sys/test
[root@r2d2 ~]# mount /dev/sys/test /mnt
 
[root@r2d2 ~]# md5sum /mnt/random
596c35b949baf46b721744a13f76a258 /mnt/random

Everything fine. :)

Just a small note for the end. In case you try to create a mirror on a logical volume that was not yet activated, sync of the mirror will start after the activation.

[root@r2d2 ~]# lvconvert -m 1 sys/test /dev/sdb1 --mirrorlog core
Conversion starts after activation

30
Nov

PS1 for your Shell?

Few years ago I went on a quest to find a perfect shell prompt. I asked the mighty Internets for ideas, but it seemed futile. I tried many things, simple prompts, complex prompts, but nothing could satisfy my requirements (I don't even remember what were my requirements back then.) So I picked best of both worlds and got this little monster.

:) oscar:~#

Happy face! And in case of an error, it looks sad.

:( 2 oscar:~#

Cute, a? It even prints the exit code. Useful and cute at the same time! And here is definition of the prompt. As you can see it uses simple function to determine return code of executed command and adjust its feelings accordingly.

smiley() {
    RC=$?
    [[ ${RC} == 0 ]] && echo ':)' || echo ":( ${RC}"
}
export PS1="\$(smiley) \u@\h:\w\\$ "

I think I got the idea for the smiley thing somewhere, but unfortunately I don't remember anymore where from.
For more adventurous people, maybe this prompt would be more interesting.

export PS1="\[\033[0;36m\]\033(0l\033(B\[\033[0m\][\[\033[1;31m\]\u\[\033[0m\]]\[\033[0;36m\]\033(0q\033(B\[\033[0m\][\[\033[1;33m\]@\h\[\033[0m\]]\[\033[0;36m\]\033(0q\033(B\[\033[0m\][\[\033[0;37m\]\T\[\033[0m\]]\[\033[0;36m\]\033(0q\033(B\033(0q\033(B\033(0q\033(B\033(0q\033(B\033(0q\033(B\033(0q\033(B\033(0q\033(B\033(0q\033(B\[\033[0m\][\[\033[1;33m\]\w\[\033[0m\]]\n\[\033[0;36m\]\033(0m\033(B\[\033[0m\]\$ "

Magic! :)

What is your favorite prompt? Please leave boring \u@\h:\w for dinner with your parents. :P
25
Mar

Why I love strace

Strace is a tool that should be in a toolbox of every system administrator. Not only that it can help in troubleshooting simple problems (ie. missing libraries in newly created chroot, which ldd mysteriously misses to report) but it also helps in debugging very complex system problems and performance issues.

Recently I experienced a very strange problem with one of the RHEL 3 servers we’ve got. Problem manifested in a very strange way, SSH and su logins hanged, other daemons were also hanging during the startup, only way to reboot or shutdown the server was to physically press the restart/power off button, etc. All this could have been caused by problems on both software and hardware level. First suspicious was bad RAID controller, but after tests this proved to be a mislead. After more tests and brainstorms hardware problems were definitely excluded, so problem has to be on the software side. But what could be the problem?

After few more misleading steps I tried to trace system calls created by su command and found very interesting results.

$ strace -f -s 1024 -o /tmp/su.strace.out su -
[-- cut --]
3138 open(“/dev/audit”, O_RDWR) = 3
3138 fcntl64(3, F_GETFD) = 0
3138 fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
3138 ioctl(3, 0x801c406f

And this is where the strace output ends and su command hangs. Audit device file is opened (file descriptor 3) and as soon as the first request is dispatched to this device (ioctl system call to file descriptor 3) command freezes. According to this I should just disable audit on the server and the problem will be gone.
As a test, audit daemon was temporarily stopped and I tried to switch to another user and the problem was indeed gone.

After searching for similar problems with audit daemon I found an article in Red Hat knowledge base regarding the exactly same
issue (http://kbase.redhat.com/faq/FAQ_79_6169.shtm).
From the article:

When the free space in the filesystem holding the audit logs is less than 20%, the above notify command will error out and auditd will enter suspend mode. This causes all system calls to block.

So this behavior is not a bug but actual feature of the software. :o) From security point of view this is expected behaviour – attacker could fill up filesystem where audit logs are stored before the attack and audit will be disabled, meaning no logs of his activity, so better not to allow ANY activity on the system if audit is not able to write to its logs. But still, this kind of behaviour renders the system completely useless to legitimate users.

The topic of this post is not audit, so I will stop here. Important thing is that strace led us directly to the main source of the problem. Resolution of issues like this would be much more complex and time consuming without this great little tool. :)

05
Sep

Easy way to read MBR?

10$ question. Sometime ago you have created backup of your systems Master Boot Record (MBR). Now, after some change, you noticed you did a fatal mistake and your partition table is corrupted and you need to recover it from the backup you created, but you are not sure if it is the correct version. The question is, what is the easiest way to read partition table from the backup of your MBR? No, Hex editor is not the easiest way to do it (and it is bad for your eyes :)). I wonder how many of you said ‘file’ command? Yes, magical file command is able to read the data from the mbr dump and prints you the actual partition table. Here is an example from my laptop.

# file mbr.bin
mbr.bin: x86 boot sector;
partition 1: ID=0×83, active, starthead 1, startsector 63, 40949622 sectors;
partition 2: ID=0×82, starthead 254, startsector 40949685, 2088450 sectors;
partition 3: ID=0x8e, starthead 254, startsector 43038135, 74172105 sectors, code offset 0×48

As you can see I have only 3 partitions on the disk. First one has type 0×83, which is HEX id for ext3 type of partition and it is my / partition (you don’t see it here, but I know it :)). It is also active partition, it means that it is used for booting the system. You can also see the size of the partition in sectors. Knowing that one sector has length of 512 bytes you can easily find out the size of the partition.

# echo $(((40949622/2)/1024))
19994
# df -k /
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 19833488 6504676 12305072 35% /

That’s it, correct. :)

Next partition is 0×82 which is swap partition. And last partition is 0x8e which is id for Linux LVM partition.

While I am here, I could also explain what is MBR and how it is used.

MBRMain Boot Record resides in first 512 bytes of your bootable disk. Besides partition table it also holds bootloader and something called a magic number. As you can see on the picture, bootloader takes the biggest part of MBR, whole 446 bytes. During the boot process BIOS search for a bootable devices attached to your system and once it finds it it looks at the MBR and loads the bootloader, also called primary bootloader. Primary bootloader looks at the partition table inside MBR (next 64 bytes after the bootloader) and searches for an active partition. When it finds active partition it loads the secondary boot loader from that partitions boot record which, in turn, loads the kernel, and so on.

Magic number is used for sanity check of your MBR. It holds only 2 bytes and should be 0xAA55.

So, in short words, MBR is used to easily locate and load kernel from the correct device. (It is also used by your operating system to find the layout of the disk, but that is another story.)

PS: You can create a dump of your MBR by issuing next command:

# dd if=/dev/sda of=mbr.bin bs=512 count=1

Replace /dev/sda with the correct address to your disk.

PS 2: Sorry for bad quality of the MBR scheme, but I didn’t have much time to work on it and I am not a graphic designer. :D

26
Jul

HP-UX UNIX95 Compatibility

HP-UX is well known for the ease of patch and product manipulation. These operations are done via software called Software Distributor (SD). Situations where SD fails are very rare but they can be very strange.

One of those weird situations happened to me last week. I downloaded patch bundle from HP site and tried to create a depot. Very simple action – untar the bundle, run the create_depot_hp-ux_11 script and the script and SD will do all the necessary things. But, here comes the weird part – checksum error for all patches in the bundle.

# create_depot_hp-ux_11
DEPOT: /var/depot
BUNDLE: BUNDLE
TITLE: Patch Bundle
UNSHAR: y
PSF: depot.psf
Expanding patch shar files…
x – PHCO_23651.text
x – PHCO_23651.depot [compressed]
ERROR: wc results of PHCO_23651.depot are 7082 23582 522240 should be 7082 18520 522240
x – PHKL_18543.text
x – PHKL_18543.depot [compressed]
ERROR: wc results of PHKL_18543.depot are 146386 592281 20377600 should be 146386 524212 20377600

I checked the checksum of the bundle itself and it seemed perfectly fine. What a puzzle, a?

Here is the story. HP-UX was supposed to be compatible with UNIX95 specification, but the problem is that, for some reason, this compatibility breaks SD. This compatibility is enforced by environment variable called UNIX95. So if you ever notice problem like this, check first if this variable is active on your server and if that is the case just simply unset it and your SD will be fully functional again.

# set|grep UNIX95
UNIX95=yes
# unset UNIX95
# create_depot_hp-ux_11
DEPOT: /var/depot
BUNDLE: BUNDLE
TITLE: Patch Bundle
UNSHAR: y
PSF: depot.psf
Expanding patch shar files…
x – PHCO_23651.text
x – PHCO_23651.depot [compressed]
x – PHKL_18543.text
x – PHKL_18543.depot [compressed]

Happy patching! :)

12
Jul

AIX 6 ready for download!

Like I previously announced, IBM AIX 6 Beta will be openly available for free download and testing. This time has come and you can start downloading it right now from this page. More info here.

AIX 6 should bring a lot of new stuff especially when it comes to virtualization and high-availability issues. Some new features are ported directly from fault-tolerant systems which should provide even more stable and reliable systems. There will be no official support for Beta testing, but you can ask for help on one of the IBM forums.

Openness of IBM is a pretty new thing. This change in IBM policy is probably influenced by SUN’s opening of Solaris to the community. But even though some changes started, IBM is still far away from OpenSource and from opening code of it’s product to the OpenSource community. And that is a pity because I would really like to see the same usability features on some other UNIX operating systems. Sadly, even Linux is far behind AIX when it comes to usability.

08
Jul

32 * 2 = 16h

Last week I had an interesting assignment, upgrading one AIX 5.2 server from 32bit to 64bit kernel. Process should be pretty straight forward and is very nicely explained in AIX documentation, but as usual, all actions that require application stopping have to be done after working hours – in this case after 9pm. Considering that all changes, system reboot and application start/stop sequence should not take more than 45 minutes this is not a big problem. As many times before, I didn’t count on good ol’ friend of all system administrators – Murphy.

But, let’s start from the start. First thing I did was to check if the server supports 64bit environment and what version of the kernel is currently running.

# bootinfo -y
64
# bootinfo -K
32

So, the hardware on this server is 64bit (as expected) and active kernel is 32bit. Now, let’s stop applications. Only important application on this server is a production Oracle database. We have to stop it before reboot. (Important thing to note at this moment is the version of database, it is old 8.1.7.4 release of Oracle.)

# su – oracle
% sqlplus /nolog
   
SQL*Plus: Release 8.1.7.0.0 – Production on Wed Jul 4 21:01:20 2007
   
(c) Copyright 2000 Oracle Corporation. All rights reserved.
   
SQL> conn / as sysdba
Connected.(
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> exit
Disconnected from Oracle8i Enterprise Edition Release 8.1.7.4.0 – Production
JServer Release 8.1.7.4.0 – Production

In order to be able to execute 64bit binaries we must edit /etc/inittab so the syscall64 kernel extension is loaded during the boot. This is need even with 64bit kernel.

# mkitab “load64bit:2:wait:/etc/methods/cfg64 >/dev/console 2>&1″

The switch to 64bit kernel is done by simply relinking paths to the kernel and libraries, and updating boot image on the boot device. Followed by a reboot. Simple as that.

# ln -sf /usr/lib/boot/unix_64 /unix
# ln -sf /usr/lib/boot/unix_64 /usr/lib/boot/unix
# bosboot -a
# shutdown -Fr

After the reboot, I checked the version of running kernel to see if the change actually took place.

# bootinfo -K
64

Perfect! so simple isn’t it. I just love when things go so smoothly. Now let’s start Oracle.

# su – oracle
% sqlplus /nolog
Could not load program sqlplus:
Symbol resolution failed for sqlplus because:
Symbol pw_post (number 272) is not exported from dependent module /unix.
Symbol pw_wait (number 273) is not exported from dependent module /unix.
Symbol pw_config (number 274) is not exported from dependent module /unix.
Symbol aix_ora_pw_version3_required (number 275) is not exported from dependent module /unix.
Examine .loader section symbols with the ‘dump -Tv’ command.

“Argh, this can’t be happening!” I was thinking, so I tried again. Surprisingly, that didn’t help. After the initial shock, I looked at the message more carefully and tried to figure out what the hell it meant. Kernel doesn’t support necessary Oracle symbols – so maybe the Oracle kernel extension is not loaded, let’s check.

# loadext -r
   
Oracle Kernel Extension Loader for AIX
Copyright (c) 1998,1999 Oracle Corporation
   
sh: /usr/sbin/crash: not found
No Kernel Extension is currently running.

I was on a right trail. But this is strange, Oracle kernel extension is loaded from /etc/inittab during the boot, it SHOULD be loaded. Maybe the inittab got corrupted.

# lsitab -a|grep ora
orapw:2:wait:/etc/loadext -l /etc/pw-syscall

It is there. In the agony I thought maybe syscall64 extension was not loaded so it failed (although it should not matter).

# genkex|grep syscall
4635e70 390 /usr/lib/drivers/syscalls64.ext

It is there. Let’s try to call it manually, maybe it will work now.

# loadext -l /etc/pw-syscall
   
Oracle Kernel Extension Loader for AIX
Copyright (c) 1998,1999 Oracle Corporation
   
Kernel Extension Version: 3
SYS_SINGLELOAD: Exec format error
kmid: 0 (0×0)
path: ‘/etc/pw-syscall’
libpath: ”

Maybe, this extension does not support 64bit environment?

# strings /etc/pw-syscall|head -3
Kernel Extension Version: 3
$Revision: 1.9 $
Supported Oracle Instances: 32-bit & 64-bit

Now I am puzzled even more.

At this point I felt stuck. Reverting back to 32bit kernel was not even an option as this was only one part of the big migration process on this server. But, on the other hand Oracle has to be up and running by morning – this is a very important production server. As I am not an Oracle guru and there was no one from DB team around to ask for advice, I asked Google for help. As many times before, it proved to be wise choice. People already had this problem and solved it by applying small patch for Oracle.

Important thing here is that Oracle version 8 does not support 64bit kernel on AIX. It requires patch number 2896876 in order to do so.

After applying this patch you get a new kernel extension which loads without complaining.

# genkex|grep syscall
466c850 1218 /etc/pw-syscall64
4641ec0 390 /usr/lib/drivers/syscalls64.ext

Now, let’s try to start Oracle.

# su – oracle
% sqlplus /nolog
   
SQL*Plus: Release 8.1.7.0.0 – Production on Thu Jul 5 00:47:45 2007
   
(c) Copyright 2000 Oracle Corporation. All rights reserved.
   
SQL> conn / as sysdba
Connected to an idle instance.
SQL> startup
ORACLE instance started.
   
Total System Global Area  178704276 bytes
Fixed Size                    73620 bytes
Variable Size             135630848 bytes
Database Buffers           41943040 bytes
Redo Buffers                1056768 bytes
Database mounted.
Database opened.
SQL> exit
Disconnected
% ^D

Nice. :) Next thing is to change inittab to load new Oracle kernel extension,

# chitab “orapw:2:wait:/etc/loadext -l /etc/pw-syscall64″

stop oracle and reboot server again to see how it will behave after the reboot. Luckily everything works fine so at 01am I can finally go home. It was about time since I was there for almost 16 hours (hence the subject of the post.) Ah, the pleasures of being a system administrators are flexible working hours, isn’t it? :)