Oracle Database Appliance – Problem Replacing Shared Disk (2.4)

As with all systems, a disk failure can happen and so we had a failing disk on one of our ODA’s. As mentioned in the ODA documentation, replacing a disk is easy and OAK does everything for you. Actually it is doing everything for you, from all OS related actions (like multipath configuration and partitioning) to dropping and adding the replaced disk in ASM.

It all looked fine until  we got a close look after OAK was done adding the disk. Replacing the disk introduced the following 2 problems:

1. The ASM alertlog keeps logging:

WARNING: Read Failed. group:0 disk:1 AU:0 offset:0 size:4096
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_25561.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096

and the /var/log/messages keeps logging messages like:

May 23 22:14:58 emn-odadb-ts1-03 kernel: end_request: I/O error, dev dm-22, sector
May 23 22:14:58 emn-odadb-ts1-03 kernel: Buffer I/O error on device dm-46, logical block 0
....

2. The size of the second partition on the new disk (used by diskgroup RECO) is not the same as the size of all other (orginal) disks. Partition 2 on original disks are 80800 MB but the partition 2 on the new disk is just 75080 MB.

Causes

Cause problem 1:
At first it looked like problem 1 (I/O errors in ASM alertlog and OS messages file) was because of the new disk also failing, but looking further the logfile /opt/oracle/oak/log/<ODA node name>/oak/oakd.log showed the new disk got a new device name, was partitioned and correctly added to the RECO diskgroup. The I/O errors the alertlog and messages file are showing are about the OLD (physically remove) disk and it is pretty hard to read/write to a physically removed disk :-).

The reason why the old disk device still exists, is because there are still processes (Grid Infra/ASM/Database) having open file descriptors to the old device so Linux is not able to remove the device.

Use the following commands to get a list of process id’s for the processes that are still having open file descriptors to the device file of the removed disk:

/sbin/fuser /dev/mapper/<diskname>p1
/sbin/fuser /dev/mapper/<diskname>p2

Example:

/sbin/fuser /dev/mapper/HDD_E1_S05_992884312p2
/dev/mapper/HDD_E1_S05_992884312p2:  3254  3298  5196
ps -ef|grep 3298
grid      3298     1  0  2012 ?        00:00:40 asm_vbg0_+ASM1

Cause problem 2:The /sbin/parted command that is executed by OAK when the new disk is inserted, is called to create 2 partitions on the new disk with using an partition size for partition 2, starting at the next cylinder after where partition 1 ends and ending on the cylinder based on 99% of the disk. When deploying the ODA these partitions are defined exactly the same way, but it seems that due to a change in the parted utility or because the new disk is different (Vendor: HITACHI, Model: HUS1560SCSUN600G) than the original disk (Vendor: SEAGATE, Model: ST360057SSUN600G) the 99% results in different end cylinder number, resulting in a smaller partition (around 6 GB smaller).

Solutions

Solution problem 1:
There is no real solution for getting rid of the old disk device and there is a generic ASM problem (MOS note 1485163.1: Disks of Dismounted Diskgroup Are Still Hold / Lock By Oracle Process) created, and still open, for this problem. It is not an ODA specific problem. The only way to get rid of the old device is to restart CRS stack per node, thus including all database instances running on that node.

Solution problem 2:This problem is said to be fixed in ODA 2.6.0.0.0 where you have the oppertunity to reinitialize the disk using oakcli which will recreate the disk partitions with the correct size (and of course all OS and ASM related actions too).

Posted in Database Appliance | Tagged , , , , , , , , , , , | Leave a comment

Oracle Database Appliance – /opt filling up

Last time we had the /opt filesystem on a couple of ODA nodes filling up. It turned out that the OS Watcher archive directory structure (/opt/oracle/oak/osw/archive) contained lots of old files that should have been cleaned by OSWatcher.

When OSWatcher is started, it will also start a script called OSWatcherFM.sh that is responsible for cleaning up old files. It turned out this OSWatcherFM.sh script was not running and so the archive directory structure was not cleaned up.

Solution:

The solution is simply to restart the OSWatcher which in turn will start the OSWatcherFM.sh script. Execute the following commands as user root:

/opt/oracle/oak/osw/stopOSW.sh

/opt/oracle/oak/osw/startOSW.sh 10 504 gzip

Check if both OSWatcher.sh and OSWatcherFM.sh are running using:

ps -ef|grep OSW

oracle   10046 24783  0 10:23 pts/0    00:00:00 grep OSW
root     12704     1  0  2012 ?        01:11:42 /usr/bin/ksh ./OSWatcher.sh 10 504 gzip
root     12922 12704  0  2012 ?        00:16:48 /usr/bin/ksh ./OSWatcherFM.sh 504

Note: it can take a couple of seconds before the OSWatcherFM.sh script is started, so if it doesn’t show try again a couple of seconds later.

The OSWachterFM.sh script will cleanup the old files after a couple of minutes, so after some minutes the /opt/oracle/oak/osw/archive directory structure will be cleaned up.

Posted in Database Appliance | Tagged , , , , , , | Leave a comment

ORA-29283: invalid file operation using UTL_FILE.FREMOVE

I’ve been breaking my head over a problem I had with the UTL_FILE.FREMOVE command trying to remove a file created by another (not oracle) process. On *nix command line, logged in as user oracle (same as database) I’m able to remove the file using the “rm” command, but when using UTL_FILE.FREMOVE it comes back with ORA-29283.

Sample of directory and file permission for this article:

drwxrwsr-x   oracle:oinstall   /var/import

Content of /var/import:
rw-r--r--    grid:oinstall     something.tst

The first thing I thought about, was that Oracle was looking at the privileges of the file (which shows that the group oinstall only is allowed to READ the file) before it would remove it, but the Oracle documentation clearly states that:

The FREMOVE procedure does not verify privileges before deleting a file. The O/S verifies file and directory permissions.

Because on the directory level oracle (through being part of the oinstall group) is allowed to remove files from the directory, as is also proved by the “rm” command, and UTL_FILE.FREMOVE does not verify the permissions, UTL_FILE.FREMOVE should be able todo so too. So why is the command failing?

Quick answer: Oracle actually does some sort of a permission check before removing the file!

Using the “truss” command on an Oracle session using UTL_FILE.FREMOVE showed me that Oracle is executing the following system calls when removing a file with UTL_FILE.FREMOVE:

1.  Get file attributes (that does not include permissions)
stat(“/var/import/something.tst”, ………….
lstat(“/var/import/something.tst”, ………….

2.  Open the file using file access modes for writing and appending
open(“/var/import/something.tst”, O_WRONGLY|O_APPEND)

3.  Get information about the open file descriptor
fstat(……..)

4.  Close the open file descriptor
close(……..)

5.  Actually remove the file
unlink(“/var/import/something.tst”)

The problem lies in steps 2, 3 and 4, where Oracle just tries to open the file you want to remove for writing, which fails because oracle is not allowed to write to the file (on the file level). In my opinion Oracle actually does check permissions by opening the file for writing and immediately closing it.

Conclusion:

If you want to be able to use UTL_FILE.FREMOVE, make sure you have rw (READ/WRITE) access DIRECTLY on the file. Solutions (maybe not applicable in your environment) might be changing the umask or using ACL’s, otherwise some external solutions should be used.

Posted in Database | Tagged , , , , , , , , , | Leave a comment

ORA-600 [kkpo_rcinfo_defstg:objnotfound] on a 11.2 RAC instance

RECOMMENDATION: Set the deferred_segment_creation parameter to FALSE for 11.2 RAC databases!

Recently our monitoring picked up an ORA-600 from the alertlog of one of our 11.2 RAC instances.

ORA-00600: internal error code, arguments: [kkpo_rcinfo_defstg:objnotfound],[87421], [], [], [], [], [], [], [], [], [], []

Oracle Support was telling me that I was hitting bug 14252187 (Bug 14252187 – ORA-600 [qesmaGetTblSeg1] from deferred segment creation in RAC (Doc ID 14252187.8)) and there is no patch available. The ORA-600 is generated on one of the RAC instances (maybe more, but I have only seen this on 2-node RAC databases) while the other RAC instance does not have the problem.

Once the ORA-600 is thrown, the error is very easy to reproduce by looking at the trace file that is generated to find the (table) object that is involved. Just running a simple SELECT count(1) FROM <tablename> will throw you the error on 1 RAC instance and will show a result on the other RAC instance.

Solution

The only way to solve the problem (there is no patch for it, except an Exadata database patch 11.2.0.3.8) is to:

  1. export (datapump) table (object)
  2. drop table (object)
  3. purge recyclebin
  4. import (datapump) table (object)

Where the table (object) is the object that is involved (most of the times it is a very small table because it was a deferred object). You have to make sure you run the datapump export on the RAC instance that is able to access the object (so execute the SELECT count(1) FROM <tablename> to find out the RAC instance you can use!).

Prevention

The only workaround to prevent this problem from happening, is to set the init parameter deferred_segment_creation to FALSE, that disables the creation of deferred objects. The problem with this workaround is that is will only prevent the problem from happening for tables (objects) that are created after you have set this parameter to false.

Unfortunately there is not way to predict which object is going to get into problems, it could be any of the “deferred” objects. I have to say Oracle Support is not very helpfull on this matter, they just say to set the parameter to false and do an export/import when the error occurs. They just don’t want to hear that this workaround will most likely cause downtime for a production application.

Most likely you already have deferred objects in your database, because the default value for the deferred_segment_creation parameter is TRUE. In a RAC database these are all objects that could eventually result into this ORA-600 bug. To get rid of all deferred objects in the application schema’s I have created some PL/SQL code, that uses the DBMS_SPACE_ADMIN.MATERIALIZE_DEFERRED_SEGMENTS procedure to for the creation of a segment for a deferred table or table partition. I have asked Oracle Support if this will prevent this error for sure, but I haven’t received a response yet.

SET SERVEROUTPUT ON LINESIZE 200 PAGESIZE 500 HEAD OFF FEEDBACK OFF

-- Will use DBMS_SPACE_ADMIN.MATERIALIZE_DEFERRED_SEGMENTS to force segment creation on tables, table partitions and table subpartitions

DECLARE
  CURSOR c_deftabs IS
    SELECT
      owner, table_name
    FROM
      dba_tables
    WHERE
        segment_created='NO'
      AND
        owner NOT IN ('SYS','SYSTEM','WMSYS','EXFSYS','DBSNMP','CTXSYS','APPQOSSYS','XDB','ANONYMOUS','DIP','OUTLN','XS$NULL','MDSYS')
    ORDER BY
      owner, table_name;

  CURSOR c_deftabparts IS
    SELECT
      table_owner, table_name, partition_name
    FROM
      dba_tab_partitions
    WHERE
        segment_created='NO'
      AND
        table_owner NOT IN ('SYS','SYSTEM','WMSYS','EXFSYS','DBSNMP','CTXSYS','APPQOSSYS','XDB','ANONYMOUS','DIP','OUTLN','XS$NULL','MDSYS')
    ORDER BY
      table_owner, table_name, partition_name;

  CURSOR c_deftabsubparts IS
    SELECT
      table_owner, table_name, partition_name, subpartition_name
    FROM
      dba_tab_subpartitions
    WHERE
        segment_created='NO'
      AND
        table_owner NOT IN ('SYS','SYSTEM','WMSYS','EXFSYS','DBSNMP','CTXSYS','APPQOSSYS','XDB','ANONYMOUS','DIP','OUTLN','XS$NULL','MDSYS')
    ORDER BY
      table_owner, table_name, partition_name, subpartition_name;


  l_stmt    VARCHAR2(255);
BEGIN
  dbms_output.put_line('Force segment creation for deferred tables:');
  dbms_output.put_line('============================================================================================');

  FOR r_deftab IN c_deftabs
  LOOP
    BEGIN
      -- Force segment creation
      DBMS_SPACE_ADMIN.MATERIALIZE_DEFERRED_SEGMENTS(schema_name => r_deftab.owner, table_name => r_deftab.table_name);
      dbms_output.put_line('Segment created for deferred table: '||r_deftab.owner||'.'||r_deftab.table_name);
    EXCEPTION
      WHEN OTHERS THEN
        dbms_output.put_line('[ERROR] Error executing statement:');
        dbms_output.put_line(l_stmt);
        dbms_output.put_line(SUBSTR(SQLERRM, 1, 200));
    END;
  END LOOP;

  -- Force segment creation for table partitions
  FOR r_deftabpart IN c_deftabparts
  LOOP
    BEGIN
      -- Force segment creation for table partition
      DBMS_SPACE_ADMIN.MATERIALIZE_DEFERRED_SEGMENTS(schema_name => r_deftabpart.table_owner, table_name => r_deftabpart.table_name);
      dbms_output.put_line('Segment created for deferred table: '||r_deftabpart.table_owner||'.'||r_deftabpart.table_name);
    EXCEPTION
      WHEN OTHERS THEN
        dbms_output.put_line('[ERROR] Error executing statement:');
        dbms_output.put_line(l_stmt);
        dbms_output.put_line(SUBSTR(SQLERRM, 1, 200));
    END;
  END LOOP;

  -- Force segment creation for table partitions (including subpartitions)
  FOR r_deftabsubpart IN c_deftabsubparts
  LOOP
    BEGIN
      -- Force segment creation for table subpartition
      DBMS_SPACE_ADMIN.MATERIALIZE_DEFERRED_SEGMENTS(schema_name => r_deftabsubpart.table_owner, table_name => r_deftabsubpart.table_name);
      dbms_output.put_line('Segment created for deferred table: '||r_deftabsubpart.table_owner||'.'||r_deftabsubpart.table_name);
    EXCEPTION
      WHEN OTHERS THEN
        dbms_output.put_line('[ERROR] Error executing statement:');
        dbms_output.put_line(l_stmt);
        dbms_output.put_line(SUBSTR(SQLERRM, 1, 200));
    END;
  END LOOP;
END;
/

 

Posted in RAC | Tagged , , , , , , , , , | Leave a comment

ORA-19809: limit exceeded for recovery files

In our cloning procedures we use the RMAN “DUPLICATE” command for duplicating databases. This all went fine for a long time until one moment we start getting errors during the duplication process (from the point where media recovery gets started). Here is an example of the errors we received:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 01/07/2013 13:52:31
RMAN-05501: aborting duplication of target database
RMAN-03015: error occurred in stored script Memory Script
ORA-19870: error while restoring backup piece oanupn5l_1_1
ORA-19809: limit exceeded for recovery files
ORA-19804: cannot reclaim 15728640 bytes disk space from 5218762752 limit
RMAN>

RMAN fails while it is trying to restore archivelogs that will be used for media recovery. It tells you that there is not enough space in the FRA (of the auxiliary database) to restore the archivelogs, even when the FRA for the auxiliary database is totally empty. The problem is due to a bug which makes RMAN to take the FRA usage (ignoring the value of reclaimable space) of the target database and compare that to FRA size of the auxiliary database.

Workaround

If you cannot have the same FRA size for the auxiliary database as the target database, then the workaround we use to “solve” this problem with cloning databases, is to just increase the size of the FRA (parameter db_recovery_file_dest_size) on the auxiliary database to be bigger than the used space (including reclaimable space!) of the FRA for the target database.
If you don’t have enough physical disk space available for the FRA for the auxiliary database then you can  just temporarily increase the db_recovery_file_dest_size parameter to a value that’s bigger than the physical available space. It is just to make RMAN happy when it checks the FRA usage and it will proceed with the media recovery. After you have finished the clone you can resize the db_recovery_file_dest_size parameter again to its original value.

Patch

This bug is registered by Oracle support as Bug 13741583 – RMAN duplication erroneously fails with ORA-19804 using fast recovery area [ID 13741583.8].

It seems a patch has currently been made available for 11.2.0.3 databases running on Solaris SPARC (64 bit) and Linux x86-64 (Patch 13741583: RMAN DUPLICATION ERRONEOUSLY FAILS WITH ORA-19804).

Posted in Database | Tagged , , , , , , , | Leave a comment

Oracle Database Appliance – /boot filling up

After the installation of the ODA bundle patch 2.4.0.0.0, the /boot filesystem on the ODA nodes was filled up to 84% (16 MB of 99 MB was free) and so Oracle Enterprise Manager started sending warnings about the default 80% warning threshold being overridden for the /boot filesystem.

Although the /boot filesystem is, while not patching or so, not growing. I could have just increased the threshold for the /boot filesystem. But what worried me more, was that with the next ODA bundle patch there would be a big chance of the /boot filesystem becomming completely filled up, because of a new kernel version that probably gets installed. And I don’t like the idea of yet another ODA patch failing because of this and hoping that Oracle has good error handling built-in that makes sure everything still works after the failed kernel update installation!
Although the installation of the ODA bundle patch 2.4.0.0.0 on all our ODA’s went completely fine this time, I wouldn’t put my money on it!

Anyway I had opened a Service Request asking what I could safely cleanup from this /boot filesystem without getting problems with booting my ODA nodes and with future ODA bundle patches that expect certain files to be there. Oracle Support got development involved and they came up with the following answer:

  • Development will add a cleanup procedure in the future ODA bundle patches
  • (Re)move all /boot/*.dup_orig
  • Move all /boot/*2.6.18* files to a backup directory

What I’ve done (on each ODA node), as user root:

Only move this files if you ODA 2.4.0.0.0 running!!!!

mkdir -p /root/backup/boot
mv /boot/*.dup_orig /root/backup/boot
mv /boot/*2.6.18* /root/backup/boot

I’ve rebooted the ODA node to make sure we don’t have a problem the next time the ODA nodes gets rebooted (during the next bundle patch installation)!

I’m not sure if people having ODA’s that have been delivered more recently have the same problem because they probably didn’t have to upgrade from ODA version 2.1.0.0.0 and so don’t have that much different kernel updates in the /boot filesystem.
If you do have the same “problem” you could also wait for the next ODA bundle patch, but
make sure that the cleanup procedure (that development has promised) is really implemented!

 

Posted in Database Appliance | Tagged , , , , , , , , | Leave a comment

Determine versions of connected Oracle clients

Last week I was trying to find a way to determine the version of the Oracle clients that are connected/connecting to databases from these databases. We cannot connect to each application server to figure out the used version and certainly not all laptops and desktops from which some users can connect.

The v$session_connect_info view contains a column named client_version so I was hopefull this would provide me with this information, but this info turns out to be VERY unreliable (if present). So I posted the question on the My Oracle Support communities and got a very usefull answer from BobB and MarkDPowell (some very active and bright users on these communities!) comming up with a table called SYS.X$KSUSECON and some decoding on a column (ksuseclvsn) that contains the client version codes as a number.
Using this table you can find the version of connected clients (tested from 11.1.0.7 and higher). Pre 11.1 connected Oracle clients will have a version number of 0 and won’t show up in the query listed next.

Continue reading

Posted in Database | Tagged , , , , , , | Leave a comment

Cryptographic flaws in Oracle Database authentication protocol

Recently a security researcher (Esteban Martinez Fayo) made the world aware of a problem with the O5LOGON Oracle database authentication protocol (used in 11g – 11.1 & 11.2). This problem, known as CVE-2012-3137, makes it relatively simple for attackers to get hold of passwords using a brute-force attack on the encrypted (AES  -192 bit) session key that is returned by the Oracle database when connecting. This means you don’t need the password hash (SHA-1 hash as of 11g) to brute force the password anymore. The information (the encrypted session key – AUTH_SESSKEY – and the password SALT value – AUTH_VFR_DATA) returned by the server at an very early state of the authentication process if enough.

This article is gives more information about the actual problem and the steps you can take to protect your databases from leaking password information.

Continue reading

Posted in Database, Security | Tagged , , , , , , , , , , , , , , , , , , , , , , , | 11 Comments

Linking Oracle BBED in 11g

Warning:

I use the Oracle Block Browser and Edit tool (bbed) for studying the internal structure of database blocks. Be very careful using the tool, you can destroy blocks using this tool so make sure you know what you are doing.

There is a nice document (PDF) about the BBED tool, titled Disassembling the Oracle Data Block which can be found here: http://orafaq.com/papers/dissassembling_the_data_block.pdf

Copy the following files from an 10g Oracle Home to 11g Oracle Home:

  • $ORACLE_HOME/rdbms/lib/ssbbded.o
  • $ORACLE_HOME/rdbms/lib/sbbdpt.o
  • $ORACLE_HOME/rdbms/mesg/bbedus.msb
  • $ORACLE_HOME/rdbms/mesg/bbedus.msg

Important: make sure you copy these files from an Oracle Home installed on the same server architecture!

Issue the following command (with the 11g Oracle home environment variables set):

make -f $ORACLE_HOME/rdbms/lib/ins_rdbms.mk BBED=$ORACLE_HOME/bin/bbed $ORACLE_HOME/bin/bbed

Output:

Linking BBED utility (bbed)
rm -f /u01/app/oracle/product/11.2.0.2/bin/bbed
/usr/ccs/bin/ld -o /u01/app/oracle/product/11.2.0.2/bin/bbed
-L/u01/app/oracle/product/11.2.0.2/rdbms/lib/
-L/u01/app/oracle/product/11.2.0.2/lib/
-dy /u01/app/oracle/product/11.2.0.2/lib/prod/lib/v9/crti.o
/u01/app/oracle/product/11.2.0.2/lib/prod/lib/v9/crt1.o
/u01/app/oracle/product/11.2.0.2/lib/s0main.o
/u01/app/oracle/product/11.2.0.2/rdbms/lib/ssbbded.o
/u01/app/oracle/product/11.2.0.2/rdbms/lib/sbbdpt.o
`cat /u01/app/oracle/product/11.2.0.2/lib/ldflags`   
-lncrypt11 -lnsgr11 -lnzjs11 -ln11 -lnl11  -ldbtools11
-lclntsh  `cat /u01/app/oracle/product/11.2.0.2/lib/ldflags`   
-lncrypt11 -lnsgr11 -lnzjs11 -ln11 -lnl11
-lnro11 `cat /u01/app/oracle/product/11.2.0.2/lib/ldflags`   
-lncrypt11 -lnsgr11 -lnzjs11 -ln11 -lnl11 -lnnz11 -lzt11 -lztkg11
-lztkg11 -lclient11 -lnnetd11  -lvsn11  -lcommon11 -lgeneric11
-lmm -lsnls11 -lnls11  -lcore11 -lsnls11  -lnls11 -lcore11 -lsnls11
-lnls11 -lxml11  -lcore11 -lunls11 -lsnls11 -lnls11 -lcore11
-lnls11 `cat /u01/app/oracle/product/11.2.0.2/lib/ldflags`   
-lncrypt11 -lnsgr11 -lnzjs11 -ln11 -lnl11
-lnro11 `cat /u01/app/oracle/product/11.2.0.2/lib/ldflags`   
-lncrypt11 -lnsgr11 -lnzjs11 -ln11 -lnl11 -lclient11 -lnnetd11 
-lvsn11  -lcommon11 -lgeneric11   -lsnls11 -lnls11  -lcore11
-lsnls11  -lnls11 -lcore11 -lsnls11 -lnls11 -lxml11  -lcore11
-lunls11 -lsnls11 -lnls11 -lcore11 -lnls11  -lclient11 -lnnetd11 
-lvsn11  -lcommon11 -lgeneric11 -lsnls11 -lnls11  -lcore11
-lsnls11  -lnls11 -lcore11 -lsnls11 -lnls11 -lxml11  -lcore11
-lunls11 -lsnls11 -lnls11 -lcore11
-lnls11   `cat /u01/app/oracle/product/11.2.0.2/lib/sysliblist`
-R /opt/SUNWcluster/lib/sparcv9
-R/u01/app/oracle/product/11.2.0.2/lib
-R /opt/ORCLcluster/lib/ 
-Y P,:/opt/SUNWcluster/lib/sparcv9:/opt/ORCLcluster/lib/:/usr/ccs/lib/sparcv9:/usr/lib/sparcv9
-Qy -lm  /u01/app/oracle/product/11.2.0.2/lib/prod/lib/v9/crtn.o

Starting Oracle’s Block Browser and Editor (BBED):

$ORACLE_HOME/bin/bbed

$ORACLE_HOME/bin/bbed
Password:

BBED: Release 2.0.0.0.0 - Limited Production on Fri Sep 7 11:56:19 2012

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

************* !!! For Oracle Internal Use only !!! ***************

BBED>

When you start BBED it will ask you for a password. This password is: blockedit

Posted in Database, Uncategorized | Tagged , , , , , | Leave a comment

Grid Infra – ASM 11.2.0.3 bug on Solaris – slice 2 CANDIDATE

We recently ran into a bug (at least that how I look at it) that is introduced with ASM 11.2.0.3. As of this version, ASM shows slice 2 (the complete disk, starting a cylinder 0 thus inlcuding the VTOC) as an CANDIDATE disk in ASM. In previous versions of ASM (Grid Infra) only slices that started from cylinder 3 (or higher) where shown as CANDIDATE disks in ASM, so you couldn’t accidently select slice 2 to be used for a diskgroup.

In ASM 11.2.0.3 it is possible to add the slice 2 (for example *c0t0d0s2) to an ASM diskgroup while another slice of the same disk, starting at cylinder 3, is already part of the same or even another diskgroup. You can understand what a mess this gives!
Example partition table of solaris disk, where until 11.2.0.3 you would only see partition (slice) 0, starting at cylinder 3 (see MOS note: ASM Does Not Discover Disk on Solaris [ID 368840.1]).

Current partition table (original):
Total disk cylinders available: 13651 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       3 - 13650       49.98GB    (13648/0/0) 104816640
  1 unassigned    wu       0                0         (0/0/0)             0
  2     backup    wu       0 - 13650       49.99GB    (13651/0/0) 104839680
  3 unassigned    wm       0                0         (0/0/0)             0
  4 unassigned    wm       0                0         (0/0/0)             0
  5 unassigned    wm       0                0         (0/0/0)             0
  6 unassigned    wm       0                0         (0/0/0)             0
  7 unassigned    wm       0                0         (0/0/0)             0

ASM (11.2.0.3) discovers the following disks now. Take a good look at the disk number of the MEMBER disks and CANDIDATE disks:

SQL> select path,header_status from v$asm_disk order by 1,2;

PATH                                               HEADER_STATUS
-------------------------------------------------- ---------------
/dev/rdsk/c3t60060E8005492800000049280000319Dd0s0  MEMBER
/dev/rdsk/c3t60060E8005492800000049280000319Dd0s2  CANDIDATE
/dev/rdsk/c3t60060E8005492800000049280000319Ed0s0  MEMBER
/dev/rdsk/c3t60060E8005492800000049280000319Ed0s2  CANDIDATE
/dev/rdsk/c3t60060E8005492800000049280000319Fd0s0  MEMBER
/dev/rdsk/c3t60060E8005492800000049280000319Fd0s2  CANDIDATE
/dev/rdsk/c3t60060E800549280000004928000031A0d0s0  MEMBER
/dev/rdsk/c3t60060E800549280000004928000031A0d0s2  CANDIDATE
/dev/rdsk/c3t60060E800549280000004928000031A6d0s0  MEMBER
/dev/rdsk/c3t60060E800549280000004928000031A6d0s2  CANDIDATE
/dev/rdsk/c3t60060E800549280000004928000031A7d0s0  MEMBER
/dev/rdsk/c3t60060E800549280000004928000031A7d0s2  CANDIDATE
/dev/rdsk/c3t60060E800549280000004928000031A8d0s0  MEMBER
/dev/rdsk/c3t60060E800549280000004928000031A8d0s2  CANDIDATE
/dev/rdsk/c3t60060E800549280000004928000031A9d0s0  MEMBER
/dev/rdsk/c3t60060E800549280000004928000031A9d0s2  CANDIDATE
/dev/rdsk/c3t60060E800549280000004928000031CBd0s0  MEMBER
/dev/rdsk/c3t60060E800549280000004928000031CBd0s2  CANDIDATE

I have created an SR for this problem, but until then you’ll have the following options:

  1. Set the ASM_DISKSTRING parameter to /dev/rdsk/*s0 (or whatever slice you normally use)
  2. Make sure the user grid does not have any privileges on the /dev/rdsk/*s2 devices.

Update 26-02-2013:
Oracle Support finally came back with an answer and a patch. The most important thing is that they agree that this is actually a bug and created a bug (bug nr. 14577881) for it. Support tells me that the problem is “really” fixed in Oracle 12.2 (when even 12.1 is not even released yet ;-)) but they have created a backport request for it. I just got back that the backport is available as a one-off patch downloadable as patch 14577881 for 11.2.0.3.

Posted in Grid Infrastructure, Uncategorized | Tagged , , , , , , , , , , , , , , , , | 2 Comments