Last year I blogged about a permissions issue with Domino 7 running in a Solaris zone. Domino’s own ID file’s permissions are reset so only root has access rights, which stops the server from executing its own code. The problem also affects other files modified by the HTTP task, such as DTF files, cgi-bin, and even the web logs. The problem still affects us, but we’ve discovered a way to alleviate it, so our solution may help anyone suffering similar problems.
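For anyone hitting the same symptom, the broken state is easy to spot from the shell: files end up with their permissions stripped so the server account loses access. The sketch below is illustrative only — it demonstrates the detect-and-repair idea against a throwaway directory rather than a real Domino data path, and the file names are made up; in production you would run the repair as root against your actual data directory.

```shell
# Illustrative only: simulate the symptom in a throwaway directory,
# then detect and repair it. Real paths (the Domino data directory)
# and file names would differ; run repairs as root.
DEMO_DIR=$(mktemp -d)
touch "$DEMO_DIR/server.id" "$DEMO_DIR/names.nsf"

# Simulate the fault: permissions reset so the owning account
# loses all access to its own file.
chmod 000 "$DEMO_DIR/server.id"

# Detect files the owner can no longer read.
echo "broken:"
find "$DEMO_DIR" -type f ! -perm -u+r

# Repair: give the owner back read/write (in production you would
# also chown the files back to the notes user).
find "$DEMO_DIR" -type f ! -perm -u+r -exec chmod u+rw {} \;

# Verify nothing is left broken.
echo "remaining: $(find "$DEMO_DIR" -type f ! -perm -u+r | wc -l)"
rm -rf "$DEMO_DIR"
```

A cron job running a check like the `find` above is a cheap way to get an early warning before the server itself starts failing.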
Traditionally we’ve been very careful with our hardware updates, always making a single change at a time. However, last year strategic decisions to consolidate hardware and move towards greener server rooms meant we had to move our Domino systems from dedicated servers with their own disk arrays to services hosted on Sun Enterprise servers, with our data moving onto a SAN. Due to circumstances beyond everyone’s control, the original plan to move step by step to the new consolidated approach was replaced by a very aggressive timetable (i.e. everything done at once). At the same time our servers were moved from Solaris 9 to 10 and into Zones, as per our current Solaris virtualisation policy.
For me this was too many changes at once, and even now I am certain the change to using Zones is part of the underlying cause of our problems.
‘Identifying’ the problem
We reported the problem to Lotus last summer. Since then we’ve been back and forth between Lotus and Sun to identify where the issue lies, regularly updating our systems to record more debugging information.
Since we’ve been unable to reproduce the problem on demand, it was often weeks before we could update Lotus with further data. But recently the issue escalated to the point where it was occurring several times a day on both user-facing servers (where previously it was once a month on only one server). As the problems increased, we started to ‘poke and hope’, trying every idea possible while planning both a move to Domino 8.x and the de-zoning of one server.
Our thinking has always been that the cause is either the HTTP task or the Novell Identity Manager driver that updates the address book. Initially the problem only affected the server IDM was connected to, so this was our first culprit. When the problems started to occur on a server which did not run IDM, our thinking moved towards the HTTP task, especially since our backup server does not run HTTP and has never shown the problem. Although our two live servers usually have 1000 concurrent users during office hours, the problem often occurred late in the evening when there’s less user activity. So whatever the cause was, it wasn’t easy to identify, and we couldn’t trace the problem to any particular user activity.
During our ‘crisis’ week, we poked around the system trying to find anything that would alleviate our problems.
As we poked around the system and our own code, we realised a number of our agents have seasonal usage patterns that matched the increase in server issues. These heavily used agents, such as one that issues exam results, logged users’ actions and data to text logs.
As part of our ‘poke and hope’ plan, we switched off all text file logging in HTTP agents. Since then the problems have not reoccurred within our reboot periods (we reboot each server once a week). I have since intentionally brought a live server down using a text-logging agent. However, after further testing against our development servers, the cause appears to be more complex than it first seemed. Despite hammering our test servers with JMeter and the same agents, we rarely replicate the issue. So the actual issue is more than a simple bug in the text-logging code; there’s a more complicated interaction occurring. But switching off the code seems to reduce the likelihood of the ‘stars aligning’.
The text logging uses the standard LotusScript method for opening and writing files:
fileNum% = FreeFile()
Open filename For Output As fileNum%
Print #fileNum%, strOutputString
Close #fileNum%
So there’s nothing unusual about the code.
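If the logging has to stay switched on, one defensive variant — a sketch only, not our production code, and `LogLine` with its arguments are illustrative names — is to wrap those same calls in an error handler so the file handle is always released even when the open or write fails, for example because the file’s permissions have been reset:

```lotusscript
' Sketch only: the same standard calls, wrapped so a failed Open or
' Print (e.g. after a permissions reset) cannot leave the handle open.
Sub LogLine(filename As String, strOutputString As String)
	fileNum% = FreeFile()
	On Error Goto cleanup
	Open filename For Output As fileNum%
	Print #fileNum%, strOutputString
cleanup:
	On Error Resume Next
	Close #fileNum%
End Sub
```

This doesn’t address the underlying permissions reset, but it at least keeps a failing agent from compounding the problem with leaked file handles.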
The problem is still open with Lotus, and they remain unable to identify its cause. However, early indications are that the changes we’ve made have significantly reduced the number of occasions on which the permissions issue occurs.