from MICHLIB-L in its entirety, headers and email addresses trimmed; if you need to know more or who to contact, you should already be on that mailing list.
--
From: Debbi Schaubman
As many of you may already know, MeLCat is still down. We're currently
estimating that we'll be back up around 12:30pm, but that estimate may be
adjusted as the morning proceeds. We will post updates as needed.
Please note that you can continue to do work on the DCB servers and your
local III servers; MeLCat-related transactions will be held in a queue
until the central server is up.
For those wondering what happened and why this is taking so long, read on
. . .
At approximately 4:30pm on Wednesday, we noticed that the system was not
behaving correctly -- MLC staff were not able to log in to one of the
administrative interfaces to the system. At roughly that same time, an
automated monitoring tool notified Innovative Interfaces staff of I/O
errors. Innovative staff then began investigating the cause of the
problem. DIT staff verified that I/O errors had occurred and contacted
a technician from Sun. Sun and DIT staff identified the source of the
I/O errors (one of the SCSI channels). They suspect that the errors on
the SCSI channel knocked a few of the array mount points offline. They
unmounted the corrupted volumes and ran file system
checks. This work was completed at 8pm.
We thought we'd be right back in business at that point but we were wrong
-- the system refused to process transactions in a normal fashion. Further
investigation by Innovative Interfaces staff led to the discovery of
missing system files. We believe that these files must have been
corrupted and then dropped during the remount/file system checks.
Innovative Interfaces support managers advised that the best solution
would be to restore from backup. However, in order to do that, we needed
to first run an incremental backup to grab all the database changes that
occurred since the last full backup 2 days prior. The incremental backup
started at approximately 11pm. It was completed at approximately
12:30am. The data restore from the full backup began around 12:45am.
(Many thanks are due to the DIT staff member who coordinated the change of
tapes at the hosting center. Without him working well past his normal
hours, we would be looking at another 24-36 hours of downtime.)
At this point, the full restore has completed, as has the restore from the
incremental. However, that incremental backup had approximately 5.5
million transactions on it. We need those transactions to be processed
before we can bring the system back up, since the system must come back up
at exactly the same point it was at when we took it down. Otherwise, data
will be lost and/or fall out of sync.
Innovative staff called at 8am to let us know that the system was
processing the transactions at a rapid pace but, given their sheer
quantity, they anticipated another 4 hours of downtime.
Once Innovative tells us that things are caught up, we'll bring the system
back up and send out an announcement. We will also closely monitor the
system to make sure that all is well.
DIT staff will be talking with Sun staff to unravel more of the mystery
and determine what steps should be taken to prevent this from occurring
again.
Thank you for your patience,
Debbi