
DAOS problems after update to Domino 12.0.1 IF1 with reproducible server crash

Detlev Poettgen  February 15 2022 03:06:38 PM


Based on the experience described below, I cannot currently recommend updating to Domino 12.0.1 IF1 if you use DAOS on your Domino server. I would wait until this issue is resolved.

We have a support case open with HCL on this and hope this can be resolved quickly.

Update 2022-02-16:
HCL has already looked into it and offered us a new hotfix (HF24) via the case. If you have already run into the same issue, you should open a support case and request the hotfix, too.

Update 2022-02-22:
HCL published a new Technote today:

HCL developers are actively working on these issues. Our Performance team was able to reproduce these issues under a heavy workload and is in the process of testing our fixes under that workload.

If you are encountering any of these issues or something similar, please open a Support ticket to have your issue analyzed and escalated. (Include console logs and NSD if applicable.) If your issue is determined to be one of the issues that HCL has tested and verified, Support can provide a hotfix to you.

HCL will produce a 12.0.1 IF2 release containing the fixes as soon as possible.

Update 2022-03-05:
HCL published a new 12.0.1 IF2, which contains four DAOS fixes.

DCKTCARNVR        Fixed an issue where an error may result in long held locks on daoscat.nsf during replication
SPPPCAMM6Y        Fixed an issue where there were multiple locks on daoscat.nsf
HPRHCASE7N        Fixed Domino crashes related to DAOS
BSPRCBQLLJ        Fixed deadlock and performance issues related to DAOS

We are planning to try IF2 this week to see whether it also solves our issue.

Update 2022-04-13:
The update of the Domino servers to Domino v12.0.1 IF2 was successful and without any DAOS issues.
So if you are planning to upgrade to Domino v12.0.1 you should install IF2.
If you are already running 12.0.1 you should install IF2, too.

One last hint and lesson learned: If you need to rebuild the DAOS catalog because it is corrupted or missing, you should run the command offline, not from the console while the server is up and running.

So what happened?

After a successful update from v11 to Domino v12.0.1 with Interim Fix 1 (Hotfix 11), the first restart was normal.
But after about 30 minutes, "Long Held Lock Dump" messages appeared, and a while later the server became unresponsive for users.

On the server console we saw many messages like this:

[22C4:0142-27D8] LkMgr BEGIN Long Held Lock Dump ------------------
[22C4:0142-27D8] Lock(Mode=X  * LockID(CONTLONGKEY DB=f:\Domino\data\daoscat.nsf RRV=14545618 len=48 hKey=0xC0190341 SkipLastDWORD)) Waiters countNonIntentLocks = 1 countIntentLocks = 0, queuLength = 2
[22C4:0142-27D8]    Req(Status=Granted Mode=X Class=Manual Nest=0 Cnt=1 0000
[22C4:0142-27D8]        Tran=0 Func=N/A x\ehashr6.c:899 [27C8:0002-000000000000275C])
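If you want to check quickly whether a saved console log contains lock entries like the one above, a small script can count them per database. This is only a minimal sketch: the regular expression is derived solely from the single "Lock(...)" line shown in this post, the function name count_locked_dbs is made up for illustration, and paths containing spaces would be cut off at the first space.

```python
import re
from collections import Counter

# Pattern derived only from the "Lock(...)" line quoted in this post.
# Assumption: other Domino versions may format this line differently.
LOCK_RE = re.compile(
    r"Lock\(Mode=(?P<mode>\w+)\s+\*\s+LockID\(\w+\s+DB=(?P<db>\S+)"
)


def count_locked_dbs(lines):
    """Count lock entries per database path in console log lines."""
    counts = Counter()
    for line in lines:
        match = LOCK_RE.search(line)
        if match:
            counts[match.group("db")] += 1
    return counts


# Sample taken from the console output above (raw strings keep the backslashes).
SAMPLE = [
    r"[22C4:0142-27D8] LkMgr BEGIN Long Held Lock Dump ------------------",
    r"[22C4:0142-27D8] Lock(Mode=X  * LockID(CONTLONGKEY DB=f:\Domino\data\daoscat.nsf RRV=14545618 len=48 hKey=0xC0190341 SkipLastDWORD)) Waiters countNonIntentLocks = 1 countIntentLocks = 0, queuLength = 2",
    r"[22C4:0142-27D8]    Req(Status=Granted Mode=X Class=Manual Nest=0 Cnt=1 0000",
]

if __name__ == "__main__":
    for db, n in count_locked_dbs(SAMPLE).items():
        print(db, n)
```

To use it on a real log, pass the lines of the log file instead of SAMPLE, for example count_locked_dbs(open(path, errors="replace")), where path is wherever you saved the console output. A high count for daoscat.nsf would match the symptom described here.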

After restarting and checking the DAOS status, we observed that DAOS was out of sync, so we issued a load daosmgr resync.
But the resync never finished and the server became unresponsive again, showing these messages:

semaphore invalid or not allocated

Notes clients were no longer able to connect to the server, and even the server console no longer accepted console commands.

In the end, we decided to downgrade back to 11.0.1 FP4 and rebuilt the DAOS catalog; after that, no more errors occurred.

The same behavior occurred on a second large mail server as well. As a result, that server was also no longer available for clients and could only be terminated hard via nsd -kill.

The problem should be solved with 12.0.1 IF1, but unfortunately it is not: