Oracle Database Appliance



Oracle Database Appliance – Problem Replacing Shared Disk (2.4)

As with all systems, a disk failure can happen, and so we had a failing disk on one of our ODA's. As mentioned in the ODA documentation, replacing a disk is easy and OAK does everything for you: all OS related actions (like multipath configuration and partitioning) as well as dropping and adding the replaced disk in ASM.

It all looked fine until we took a closer look after OAK was done adding the disk. Replacing the disk introduced the following two problems:

1. The ASM alertlog keeps logging:

WARNING: Read Failed. group:0 disk:1 AU:0 offset:0 size:4096
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_25561.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096

and the /var/log/messages keeps logging messages like:

May 23 22:14:58 emn-odadb-ts1-03 kernel: end_request: I/O error, dev dm-22, sector
May 23 22:14:58 emn-odadb-ts1-03 kernel: Buffer I/O error on device dm-46, logical block 0
....

2. The size of the second partition on the new disk (used by diskgroup RECO) is not the same as on all other (original) disks. Partition 2 on the original disks is 80800 MB, but partition 2 on the new disk is just 75080 MB.

Causes

Cause problem 1:
At first it looked like problem 1 (I/O errors in the ASM alertlog and OS messages file) was caused by the new disk failing as well, but on closer inspection the logfile /opt/oracle/oak/log/<ODA node name>/oak/oakd.log showed that the new disk got a new device name, was partitioned and was correctly added to the RECO diskgroup. The I/O errors in the alertlog and messages file are about the OLD (physically removed) disk, and it is pretty hard to read from or write to a physically removed disk :-).

The reason the old disk device still exists is that there are still processes (Grid Infra/ASM/Database) holding open file descriptors to the old device, so Linux is not able to remove it.

Use the following commands to get a list of process IDs for the processes that still have open file descriptors to the device file of the removed disk:

/sbin/fuser /dev/mapper/<diskname>p1
/sbin/fuser /dev/mapper/<diskname>p2

Example:

/sbin/fuser /dev/mapper/HDD_E1_S05_992884312p2
/dev/mapper/HDD_E1_S05_992884312p2:  3254  3298  5196
ps -ef|grep 3298
grid      3298     1  0  2012 ?        00:00:40 asm_vbg0_+ASM1

Cause problem 2:
The /sbin/parted command that is executed by OAK when a new disk is inserted creates 2 partitions on the new disk, with partition 2 starting at the cylinder right after the one where partition 1 ends and ending on the cylinder corresponding to 99% of the disk. When deploying the ODA these partitions are defined in exactly the same way, but it seems that, due to a change in the parted utility or because the new disk is a different model (Vendor: HITACHI, Model: HUS1560SCSUN600G) than the original disks (Vendor: SEAGATE, Model: ST360057SSUN600G), the 99% results in a different end cylinder number and thus in a smaller partition (around 6 GB smaller).
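
You can verify this yourself by printing the partition tables of an original disk and of the replaced disk with parted. A minimal sketch, using placeholder device names (take the real names from /dev/mapper, run as user root):

/sbin/parted /dev/mapper/<original diskname> unit MB print
/sbin/parted /dev/mapper/<new diskname> unit MB print

The output lists the start, end and size of both partitions in MB, so the roughly 6 GB difference on partition 2 is immediately visible.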

Solutions

Solution problem 1:
There is no real solution for getting rid of the old disk device; a generic ASM bug (MOS note 1485163.1: Disks of Dismounted Diskgroup Are Still Hold / Lock By Oracle Process) has been filed for this problem and is still open. It is not an ODA specific problem. The only way to get rid of the old device is to restart the CRS stack node by node, which also restarts all database instances running on that node.
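
A minimal sketch of such a node-by-node restart, assuming the Grid Infrastructure home path below matches your installation (run as user root, on one node at a time and only after the previous node is completely up again):

export GRID_HOME=/u01/app/11.2.0.3/grid   # assumption: adjust to your GI home
$GRID_HOME/bin/crsctl stop crs
$GRID_HOME/bin/crsctl start crs
$GRID_HOME/bin/crsctl check crs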

Solution problem 2:
This problem is said to be fixed in ODA 2.6.0.0.0, where you have the opportunity to reinitialize the disk using oakcli, which will recreate the disk partitions with the correct size (and of course perform all OS and ASM related actions too).

Oracle Database Appliance – /opt filling up

Recently we had the /opt filesystem on a couple of ODA nodes filling up. It turned out that the OS Watcher archive directory structure (/opt/oracle/oak/osw/archive) contained lots of old files that should have been cleaned up by OSWatcher.

When OSWatcher is started, it will also start a script called OSWatcherFM.sh that is responsible for cleaning up old files. It turned out this OSWatcherFM.sh script was not running and so the archive directory structure was not cleaned up.

Solution:

The solution is simply to restart OSWatcher, which in turn will start the OSWatcherFM.sh script. Execute the following commands as user root:

/opt/oracle/oak/osw/stopOSW.sh

/opt/oracle/oak/osw/startOSW.sh 10 504 gzip

Check if both OSWatcher.sh and OSWatcherFM.sh are running using:

ps -ef|grep OSW

oracle   10046 24783  0 10:23 pts/0    00:00:00 grep OSW
root     12704     1  0  2012 ?        01:11:42 /usr/bin/ksh ./OSWatcher.sh 10 504 gzip
root     12922 12704  0  2012 ?        00:16:48 /usr/bin/ksh ./OSWatcherFM.sh 504

Note: it can take a couple of seconds before the OSWatcherFM.sh script is started, so if it doesn't show up, try again a couple of seconds later.

The OSWatcherFM.sh script will clean up the old files after a couple of minutes, so after some minutes the /opt/oracle/oak/osw/archive directory structure will be cleaned up.
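
To verify that the cleanup really happens, you can for example compare the size of the archive directory (and the /opt filesystem usage) before and after, a simple sketch run as user root:

du -sh /opt/oracle/oak/osw/archive
df -h /opt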

Oracle Database Appliance – /boot filling up

After the installation of ODA bundle patch 2.4.0.0.0, the /boot filesystem on the ODA nodes was filled up to 84% (16 MB of 99 MB was free), so Oracle Enterprise Manager started sending warnings about the default 80% warning threshold being exceeded for the /boot filesystem.

The /boot filesystem does not grow outside of patching, so I could simply have increased the threshold for the /boot filesystem. What worried me more, however, was that with the next ODA bundle patch there would be a big chance of the /boot filesystem filling up completely, because a new kernel version would probably get installed. And I don't like the idea of yet another ODA patch failing because of this, while hoping that Oracle has good error handling built in that makes sure everything still works after a failed kernel update installation!
Although the installation of ODA bundle patch 2.4.0.0.0 on all our ODA's went completely fine this time, I wouldn't put my money on it!

Anyway, I opened a Service Request asking what I could safely clean up from this /boot filesystem without getting problems booting my ODA nodes or with future ODA bundle patches that expect certain files to be there. Oracle Support got development involved and they came up with the following answer:

  • Development will add a cleanup procedure in the future ODA bundle patches
  • (Re)move all /boot/*.dup_orig
  • Move all /boot/*2.6.18* files to a backup directory

What I’ve done (on each ODA node), as user root:

Only move these files if your ODA is running 2.4.0.0.0!

mkdir -p /root/backup/boot
mv /boot/*.dup_orig /root/backup/boot
mv /boot/*2.6.18* /root/backup/boot
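
Before and after moving the files it may be wise to do a quick sanity check, so you don't touch files belonging to the kernel that is actually running and you can confirm that space has been freed. A simple sketch (run as user root; on ODA 2.4.0.0.0 the running kernel should be a 2.6.32 UEK kernel, not a 2.6.18 one):

uname -r       # the running kernel version must NOT match the files you move away
ls -l /boot    # check which kernel files are present
df -h /boot    # compare the usage before and after the move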

I rebooted the ODA node afterwards to make sure we don't run into a problem the next time the ODA nodes get rebooted (during the next bundle patch installation)!

I'm not sure if people with ODA's that have been delivered more recently have the same problem, because they probably didn't have to upgrade from ODA version 2.1.0.0.0 and so don't have that many different kernel versions in the /boot filesystem.
If you do have the same "problem" you could also wait for the next ODA bundle patch, but make sure that the cleanup procedure (that development has promised) is really implemented!

 

Oracle Database Appliance – Installing Bundle Patch 2.3.0.0.0

On 23 July 2012, ODA bundle patch 2.3.0.0.0 was released, a long-awaited version which should (and indeed does) include multi-home support. As with each new bundle patch, the patch process is getting more robust, and the long-awaited support for rolling upgrades of the Grid Infrastructure and RAC databases is implemented in this version. Unfortunately the installation of the infrastructure part still requires downtime of the complete cluster stack and a reboot of both nodes (although you can skip the reboot part of the infra patch and reboot the nodes at a more convenient time).

At the beginning I was very surprised how well this new bundle patch installed on one of our test ODA's. It installed without any problems, and in a couple of hours I brought this ODA from 2.1.0.3.1 to 2.3.0.0.0 (2.1.0.3.1 -> 2.2.0.0.0 – only infra + GI, and 2.2.0.0.0 -> 2.3.0.0.0), including database upgrades. Unfortunately, after this successful installation and the upgrade of 7 test databases, the first node of the next ODA I patched crashed during patching of the 2.3 infra component. After cleaning up and restarting I was able to fix this ODA, but while patching the third ODA, the second node crashed during the installation of the 2.3 infra patch component. This time cleaning up and restarting the patch didn't help.

On this ODA I now have one node with an old BIOS version and, even worse, the first node has a core configuration key applied while OAK on the second node no longer knows about the key applied on the first node.

Solution for the core_config_key problem: run the oakcli apply core_config_key <key file> again on the second node. After the reboot (it will only reboot the second node) OAK knows about the key again.

My guess is that the problem lies in the combination of the usage of core configuration keys and the BIOS upgrade that is done during the installation of the infra component/phase of the 2.3 bundle patch. At the moment I have an SR opened with Oracle Support and I can’t patch my other ODA’s until the problem is solved.

Although I had problems with the installation of bundle patch 2.3.0.0.0, I will describe the patching process here, to give some idea of how long things take and what questions you can expect.

Patch installation – component infra

The first component that needs to be patched is the infra component, which for bundle patch 2.3.0.0.0 consists of firmware patches (BIOS, ILOM, SAS disks) and of course a new version of OAK. As mentioned, the installation of the infra component requires downtime of the complete cluster (the complete Grid Infrastructure will be shut down).

Mandatory requirements before installing the 2.3 infra component:

  • Copy the zipfile containing the 2.3 bundle patch (p13982331_23000_Linux-x86-64.zip) to BOTH nodes of the ODA
  • Unpack this patch using the oakcli unpack -package command on BOTH nodes of the ODA (see the example after this list)
  • You need the passwords (just sudo isn’t enough) for OS user root
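
A minimal example of the unpack step, assuming the bundle patch zipfile was copied to /tmp (the path is just an assumption; use the location you copied the file to, and run the command as user root on BOTH nodes):

oakcli unpack -package /tmp/p13982331_23000_Linux-x86-64.zip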

After that start the installation of the infra component using the command (run as user root):

 oakcli update -patch 2.3.0.0.0 --infra

It will take around 45 minutes from starting the patch process until both ODA nodes are rebooted and the complete CRS stack and databases are up and running again (time may differ a little depending on the number of databases).

Also make sure to do the checks mentioned in the 2.3 bundle patch release notes on BOTH NODES (run as user root):

dmidecode -t 1 | grep "Serial Number"
fwupdate list disk | grep -A5 CONTROLLER

For the dmidecode command, the serial number of your ODA should be returned on both nodes. For the fwupdate command, the output should contain a value in the "BIOS version" column for controllers c1 and c2 on both nodes. If either of these checks fails, reboot the node and check again.

Patch installation – component GI

As of ODA bundle patch 2.3.0.0.0, patching the Grid Infrastructure (installation of a PSU) is done in a rolling fashion. So unlike with the infra component, all RAC databases remain available during the patch process. The process will first patch the Grid Infrastructure environment on ODA node 1 (SC0) and, when that is finished and all instances are started again, it will start patching the Grid Infrastructure on ODA node 2 (SC1).

Requirements before installing the 2.3 GI component:

  • You need the passwords (just sudo isn’t enough) for OS users root and grid

After that start the installation of the gi component using the command (run as user root):

 oakcli update -patch 2.3.0.0.0 --gi

It will take around 30 minutes for the Grid Infrastructure patch to complete and, as mentioned, it doesn't require downtime for RAC databases.

Patch installation – creating 11.2.0.3.3 dbhome

Because I didn't have my ODA's running on ODA version 2.2.0.0.0 yet, mainly because that bundle patch introduced a very nasty bug that crashed 11.2.0.2.x (RAC) databases, I first had to install the mandatory parts of that bundle patch to be able to install the 2.3 bundle patch. I decided to only install the infra and GI components of the 2.2 bundle patch, use the new 2.3 oakcli create dbhome command to create a new 11.2.0.3 Oracle home, and use the oakcli upgrade database command to upgrade my databases from 11.2.0.2.5 (ODA 2.1.0.3.1) to 11.2.0.3.3 (ODA 2.3.0.0.0). Using the database upgrade functionality of ODA 2.3 also avoids the known issue of ODA 2.2 where databases with a name containing capital letters couldn't be upgraded.

Requirements for creating an Oracle home (11.2.0.3.3):

  • You need the passwords (just sudo isn’t enough) for OS users root and oracle
  • You need the password for user SYS on the ASM instances

To create a new 11.2.0.3.3 Oracle database home use the following command (run as user root):

oakcli create dbhome -version 11.2.0.3.3

It takes around 6 minutes to create this new Oracle home.

Patch installation – upgrading databases

After creating the new 11.2.0.3.3 Oracle home for databases I upgraded all databases running Oracle 11.2.0.2.5 to Oracle 11.2.0.3.3 using the oakcli upgrade database command.

Requirements for upgrading databases using the oakcli upgrade database command:

  • You need the passwords (just sudo isn’t enough) for OS users root and oracle
  • A list of all available dbhomes
    Use the command oakcli show dbhomes
  • A list of all available databases
    Use the command oakcli show databases

To upgrade 1 database use the following command (run as user root):

oakcli upgrade database -db <dbname> -to <destination home name>

Example:
oakcli upgrade database -db mcldbts1 -to OraDb11203_home1

To upgrade all databases running from one Oracle home (serial process – run as user root):

oakcli upgrade database -from <source home name> -to <destination home name>

Example:
oakcli upgrade database -from testdbb1 -to OraDb11203_home1

The database(s) get upgraded by oakcli using the DBUA in the background. The logfiles of the upgrade process can be found under the directory structure:

/u01/app/oracle/cfgtoollogs/dbua
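
To follow the progress of a running upgrade you can, for example, watch the most recent DBUA logfile. A sketch, assuming DBUA writes its usual per-database subdirectories containing a trace.log file (the exact directory and file names depend on the database name and the DBUA run, so check with ls first):

ls -ltr /u01/app/oracle/cfgtoollogs/dbua
tail -f /u01/app/oracle/cfgtoollogs/dbua/<dbname>/upgrade1/trace.log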

Oracle Database Appliance – cleanupDeploy X-windows failure

Whenever something goes wrong during the deployment of an ODA, you can try to fix the problem and restart the deployment from the step where the deployment failed (using the GridInst.pl script). Sometimes however when the deployment fails, restarting the deployment doesn’t work and the only (or fastest) solution is to cleanup and start all over. There is a script called cleanupDeploy.pl which does this for you.

A couple of days ago a colleague of mine needed to use this cleanup script because of an error somewhere in deployment step 15 (RunRootScripts), after which the deployment couldn't be restarted. After using the cleanupDeploy.pl script to clean up the failed ODA deployment, the ODA was brought back to its pre-deployment state, but he wasn't able to open an xterm (a problem with X-windows) to start the graphical ODA deployment process.

It took some time before the problem was found, but it turned out that the cleanupDeploy.pl script had removed the localhost entry from the /etc/hosts file. Manually adding this line back to the /etc/hosts file on both ODA nodes fixed the problem.
So just add the following line to the /etc/hosts file:

127.0.0.1    localhost.localdomain localhost
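
A quick way to verify that the entry is back and resolving correctly on both nodes (a simple sketch):

grep -w localhost /etc/hosts
ping -c 1 localhost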

The problem seems to exist only in the 2.1.0.0.0 cleanupDeploy.pl script. Running the cleanupDeploy.pl from the 2.1.0.3.1 image did not remove this line from the /etc/hosts file.

Oracle Database Appliance – odachk

With the release of ODA patch bundle 2.2.0.0.0, a tool that was previously only available on Exadata became available for the ODA too. This tool, named "Exadata Configuration Audit Tool – exachk", has been modified and renamed to odachk and is located in /opt/oracle/oak/odachk.

With this tool you can check the configuration of your complete ODA environment (numerous checks for OS, GI and RDBMS), and it will show you where there are, or might be, problems.

Running the tool

Run the following command as user oracle (full check with verbose option) to start the tool. You will be asked some questions before the checks are executed.

cd /opt/oracle/oak/odachk
./odachk -a -o verbose

There are more command line options that can be found in the User Guide.

Example output

Screendump “odachk” command execution: odachk_testdb_output.txt
HTML file that is generated by “odachk” command: odachk_MCLDB_081512_092757.html

Bugs

At the moment there is at least one annoying bug in the odachk tool. The check "One or more software or firmware versions are NOT up to date with OAK repository" will FAIL, because odachk uses the software and firmware versions of the OAK 2.1.0.0.0 repository instead of the latest (for now 2.2.0.0.0). I've raised an SR for this bug and Oracle Support mentioned that it will be fixed as part of the next ODA patch bundle, and that this is the reason why they didn't mention odachk in MOS note 888888.1.

With the release of ODA bundle patch 2.3.0.0.0 the odachk utility is officially released. I did some testing with it, and the bug mentioned before is indeed fixed. I have uploaded a new sample HTML report created by this utility and the output while running the command.

Userguide

There is no specific user guide for the odachk tool; the readme for this tool directs you to My Oracle Support note 1070954.1 (Oracle Exadata Database Machine exachk or HealthCheck), from where you can download the exachk tool including the ExchUserGuide.pdf file.

Oracle Exadata Database Machine exachk or HealthCheck [ID 1070954.1]

Oracle Database Appliance – Installing Patch Bundle 2.2

On 17 April 2012, ODA patch bundle 2.2.0.0.0 was released, which includes a Linux kernel upgrade (2.6.18-194.32.1.0.1.el5 to 2.6.32-300.11.1.el5uek – Unbreakable Enterprise Kernel), Oracle Grid Infrastructure patchset 11.2.0.3.2 and Oracle RDBMS patchset 11.2.0.3.2.
Oracle made the installation of the various parts of the patch bundle a lot more flexible. Now you can install the infrastructure (OS, firmware, OAK, etc.), GI and RDBMS parts independently of each other. You can even choose to just install the Oracle RDBMS software without upgrading the existing databases and keep them running on the current 11.2.0.2.x patchset, or just install the "infra" part without upgrading the rest. Unfortunately, if you are using ACFS you have to upgrade to GI patchset 11.2.0.3.2 directly after upgrading the "infra" part, because ACFS in GI 11.2.0.2.x doesn't work with the new Linux kernel.
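
As a sketch, the separate parts correspond to separate oakcli runs (as user root; these are the same flags used for the combined run mentioned below and for the 2.3 bundle patch described earlier in this document):

oakcli update -patch 2.2.0.0.0 --infra
oakcli update -patch 2.2.0.0.0 --gi
oakcli update -patch 2.2.0.0.0 --database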

Patch installation – part 1 (infra)

As with the previous ODA patch bundles, the installation of patch bundle 2.2 isn’t rolling and you will have downtime. The first part of the installation (infra) does a shutdown of the complete CRS stack (cluster) and will reboot both nodes at the end (at the same time). Depending on how many databases are running on your ODA, it will take 45 – 60 minutes before everything is running again.

Patch installation – part 2 (GI & RDBMS)

Especially if you are using ACFS (the /cloudfs filesystem) you will have to install the Grid Infrastructure (GI) patchset 11.2.0.3.2 as soon as possible. Installation of the GI patchset isn't rolling, so again the complete CRS stack (cluster) will be shut down. I chose to install the GI patchset and RDBMS patchset (software only) in one "oakcli update -patch 2.2.0.0.0 --gi --database" run and it took around 45 minutes to complete (cluster resources like databases aren't available during this period).

Known issues

There are some "known issues" that come with this patch bundle. I will list the most annoying ones here:

  • The privileges on the "oracle" executable in the old (11.2.0.2) home become incorrect after installing this patch bundle, so you have to fix them as noted in the patch readme (see the sketch after this list).
  • With the installation of the 2.2.0.0.0 patch bundle you can choose to let all or a selection of databases be upgraded to 11.2.0.3. However, there is a bug that prevents databases with capitalized database names from being upgraded automatically.
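
The patch readme is authoritative for the exact fix of the privilege issue, but as a rough sketch of what the result usually looks like for an 11.2 home: the oracle binary normally carries the setuid/setgid bits (permissions 6751, shown as -rwsr-s--x). Both the home path and the permission value below are assumptions based on that general rule, not taken from the readme:

ls -l /u01/app/oracle/product/11.2.0.2/dbhome_1/bin/oracle    # check the current permissions
chmod 6751 /u01/app/oracle/product/11.2.0.2/dbhome_1/bin/oracle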

Problems I ran into

Although the installation of ODA patch bundle 2.2.0.0.0 went without visible problems, and it even fixed a problem with a shared disk (state details for a disk: PredictiveFail), it introduced a new and, I think, much bigger problem with the availability of our RAC databases. Whenever I reboot one of the ODA nodes, the RAC instances on the remaining node crash with an ORA-07445 (generated by the LMD0 daemon – Global Enqueue Service Daemon), see the error below. I have a Service Request opened for this and I will update this post whenever the problem is solved.

SKGXP: ospid 16245: network interface with IP address 169.254.117.116 no longer running (check cable)
Exception [type: SIGSEGV, Invalid permissions for mapped object] [ADDR:0x7FEA51FBB592] [PC:0x7FEA548AA5E7, skgxp_local_status_change()+191] [flags: 0x0, count: 1]
Errors in file /u01/app/oracle/diag/rdbms/tstdb1/tstdb11/trace/tstdb11_lmd0_16245.trc  (incident=121697):
ORA-07445: exception encountered: core dump [skgxp_local_status_change()+191] [SIGSEGV] [ADDR:0x7FEA51FBB592] [PC:0x7FEA548AA5E7] [Invalid permissions for mapped object] []

Problem solved!

Finally the problem with the crashing RAC instance (the surviving one) has been solved. You have to upgrade your databases (RDBMS) to 11.2.0.3 or, if you need to stay on 11.2.0.2, you will have to apply patch 12628521 (SKGXP V3.4 – CUMULATIVE FIXES PATCH 6.1; for a description of the bug fixed see MOS note 11711682.8).

The problem is described as a generic 11.2.0.2 problem, but it doesn't occur on ODA 2.1.0.3.1, so it has to be something in the combination with either the new GI (11.2.0.3) that comes with ODA 2.2.0.0.0 or the kernel/OS upgrade. I did ask Oracle about this, but they didn't know (they wanted to look further into it).

Oracle Database Appliance – Safely usable ASM diskgroup size

Not long ago I got a warning from Oracle Enterprise Manager that the REDO diskgroup on one of our ODA's exceeded the warning threshold of 75%. Looking at the number of database instances and the configured online redo log sizes, I couldn't understand how this was possible, while the ODA documentation states that the +REDO diskgroup size is 97.3 GB (4 * 73 GB / 3 – high redundancy is used on the ODA diskgroups).

I know the divide by 3 is a rough calculation, but I only had 39 GB of redo logs in the REDO diskgroup. Within ASM the diskgroup showed that 158 GB was still free (FREE_MB column of v$asm_diskgroup), so around 52 GB after triple mirroring, but then the column REQUIRED_MIRROR_FREE_MB came into sight, which showed that 140 GB was required for mirroring.
ASM documentation is clear about the calculation of the required size for mirroring: The value is the total raw space for all of the disks in the two largest failure groups.

This means for the ODA that the actual "safely" usable REDO diskgroup size can be calculated as: total_mb – required_mirror_free_mb = 280016 – 140008 = 140008 MB (raw size), which after triple mirroring leaves (140008/3) 45.6 GB, differing by 51.7 GB from the value that Oracle notes in its documentation.
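
You can do the same calculation for all diskgroups directly on the ASM instance. A minimal sketch, assuming the environment settings below match your installation (ORACLE_HOME and ORACLE_SID are assumptions; run it as the grid user on one of the ODA nodes):

export ORACLE_HOME=/u01/app/11.2.0.3/grid
export ORACLE_SID=+ASM1
$ORACLE_HOME/bin/sqlplus -s / as sysasm <<'EOF'
set linesize 200
column name format a10
-- safely usable raw space = total_mb - required_mirror_free_mb; divide by 3 for high redundancy
select name, total_mb, free_mb, required_mirror_free_mb, usable_file_mb,
       round((total_mb - required_mirror_free_mb) / 3) as safe_usable_mb
from v$asm_diskgroup;
exit
EOF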

So here is a table with the “safely” usable diskgroup sizes:

Actual diskgroup size (safely usable)

Diskgroup   Internal backup   External backup
+DATA       1.41 TB           2.81 TB
+RECO       1.87 TB           0.46 TB
+REDO       45.6 GB           45.6 GB

Oracle documentation size

Diskgroup   Internal backup   External backup
+DATA       1.6 TB            3.2 TB
+RECO       2.4 TB            0.8 TB
+REDO       97.3 GB           97.3 GB

Of course you can use more space, effectively resulting in a negative value for USABLE_FILE_MB, but the question is whether you want that to happen!

Oracle Database Appliance – Installing Patch Bundle 2.1.0.3.0

We just finished installing ODA Patch Bundle 2.1.0.3.0 on one of our ODA's… There goes one of the key features of the ODA – simple one-button patch installation. As a DBA (and I think this goes for all administrators) I am very reserved when these kinds of statements are made.

First of all, applying Patch Bundle 2.1.0.3.0 (and Patch Bundle 2.1.0.3.1, but that one is very small and doesn't require downtime) on one ODA took 3.5 hours to complete, so a pretty long maintenance window for the kind of patches that are part of the patch bundle (ILOM/BIOS and Grid Infrastructure patches).

Unfortunately we ran into different problems when installing Patch Bundle 2.1.0.3.0 on our ODA’s. This post will give you a description of what went wrong:

  • ILOM/BIOS firmware update failed
    Cause: firewall between the public network interface and the ILOM interface
  • Cluster Ready Services (CRS) got into an undefined state: running, but not aware it was running
    Cause: reduced number of enabled CPU cores (Core Configuration Key)
  • GI (Grid Infrastructure) patch failed
    Cause: Oracle Enterprise Manager Grid Control agent running


Oracle Database Appliance – Critical Patch Bundle 2.1.0.3.1

Oracle found a bug (a Seagate firmware issue) for ODA’s with Patch Bundle 2.1.0.3.0 applied. They highly recommend the installation of patch 13817532 after you’ve applied Patch Bundle 2.1.0.3.0. The problem is described in MOS note 1438089.1, but in short the problem is that a disk failure could trigger a complete system (ODA) shutdown.

One good thing about this patch (of course it comes in the form of a Patch Bundle, with version 2.1.0.3.1, which you can only apply AFTER you've applied Patch Bundle 2.1.0.3.0 – it is not a cumulative patch) is that it is the first ODA Patch Bundle that doesn't require downtime (it only patches OAK)!