System Status
Welcome to the CSE Computing Facilities status page. This page is intended to provide information on the current state of systems at CSE, particularly scheduled, current, and recent system events. It is maintained by System Support, part of the Computing Support Group.
Tuesday 28th March, 2006 - 4:30am to 9am Unscheduled network outage Forum Announcement
Interconnectivity between internal CSE networks and the larger university was lost at approximately 4:30am this morning.
Service was restored just after 9am (~9:05am).
We understand that the outage was caused by some configuration testing, intended to resolve network issues, that ultimately disabled the main building router. As this router acts as the gateway for our main subnetworks, and has several key servers plugged directly into it (for high-speed data), pretty much everything stopped working for the duration...
Tuesday 24th January, 2006 - 5pm to 6pm Scheduled outage of grieg server.
The grieg fileserver needs to receive a hardware upgrade. This downtime will affect:
- CSG Support Forums
- All class and personal sites that use a database backend (or similar) which is hosted on grieg.
If all goes well, the downtime should be between 15 and 30 minutes.
Updates will be posted here during the upgrade process. Any questions should be posted in the relevant thread in the support forum.
Updates:
- 5:00pm
  - Begun.
- 5:06pm
  - Progressing well. If this continues, it should be finished around 5:20pm.
- 5:18pm
  - Copying complete, rebooting.
- 5:21pm
  - Rebooted. All finished. Looks good.
Wednesday 11th January, 2006 - 7:30 to 8:30am Scheduled outage of file servers eno, elfman and kamen.
- 7:36
  - Eno about to be rebooted.
- 7:58
  - Eno filesystems need an fsck - might be a while.
- 7:59
  - Elfman about to be rebooted.
- 8:25
  - Elfman back up, eno still checking filesystems...
- 8:29
  - Kamen about to reboot.
- 8:40
  - Kamen back up, eno still checking 1 filesystem.
- 9:28
  - Eno fully checked and back up. One duplicate block was found and fixed.
All three fileservers are now running 2.6.14.3 kernels.
On the morning of 11th January we plan to upgrade the three main file servers to a 2.6 kernel (they are currently running 2.4.30).
This should be a straightforward reboot taking a few minutes, but the upgrade is substantial enough that we expect something to go wrong, though we don't know what.
So we are reserving a one-hour window to deal with any issues that come up. Please do not expect to use files on these servers during this time.
Friday/Saturday 16-17th August, 2005 SSL Certificate issues
Due to an unfortunate issue with our new root CA certificate, most of our SSL certificates had been identified as "expired" since late Friday.
This has now been rectified by issuing a new root CA certificate. We apologise for any inconvenience. You can find details on the new certificates here.
Tuesday/Wednesday 9-10th August, 2005 Loss of power - full shutdown
Update: The power upgrade went fairly smoothly. A number of tasks took quite a bit longer than we were expecting, so whilst most critical services were up at 9am, some (non-critical) computers will not be turned on for a few more hours. Additionally, several servers had problems on startup. Notably, an issue with the elfman fileserver meant that some home directories appeared to disappear, and grieg had a similar problem (/srvr). These should be ok now, but some client computers may still have issues. Please contact System Support with details of your problem, any errors, and the computer you are using.
--
Due to work on our local power substation, CSE will be losing its electricity supply from 10pm on Tuesday 9th August. Servers will be shut down as of 8pm, and it is recommended that desktops be off by 6pm.
The work on the substation is expected to be finished by 3am, and CSE services should be back to normal by 8am on Wednesday 10th August.
UPDATE: We have been advised that electrical work in the server room may not be completed until after 7am. We now hope to have service restored by 9am, if the electrical work completes on time.
This power outage will also be used to implement new power feeds into our main server room to improve capacity and redundancy.
Tuesday 9th August, 2005 Kamen fileserver unscheduled outage 3:20pm-4:05pm
At approximately 3:20pm today, power to the fileserver kamen was inadvertently removed (loose plugs). This outage was complicated by some networking issues on reboot, which we have worked around. We are currently closely investigating the cause to confirm that it won't affect numerous servers during our power outage tonight.
Thursday 7th July, 2005 CSE Mirror down for maintenance
The CSE mirror is down for maintenance from approximately 7:30pm. Read-only access may be available from about 9pm (Update: problems during installation, delay in return of read-only service. Update2: Issues resolved at 12:30am). Full service is expected to be restored by about 5pm on Friday 8th July.
This maintenance includes a disk upgrade doubling its capacity. Full details can be found here.
Thursday 9th June, 2005 CSE Internet Connection lost
At around 12am this morning, CSE's link to the campus-wide network went dead. Service was restored by University ITS at around 9am. We have not yet been informed of the cause.
Network traffic to CSE from outside (including the rest of the university network), and vice versa, did not flow at all during this time (12am-9am, 9th June).
Sunday/Monday 24/25th April, 2005 DNS server down, now fixed
A DNS server crashed on Sunday evening. This caused many services (notably CGI) to act very slowly (waiting for a DNS query to time out). The server has been rebooted and things should have returned to normal as of around 10:15am, Monday 25th April.
Sunday 24th April, 2005 (varying times) Network outages on CSE networks
Due to maintenance on the router connecting CSE's networks to the UWN, there will be rolling outages on CSE's external connectivity during the day. Service Desk advise that each outage may last up to an hour. CSE's internal networking should not be affected.
Tuesday 4th April, 2005 (9am) Kamen fileserver reboot
The reboot of kamen went reasonably smoothly, though it took a bit longer than we would have liked. Hopefully the new tracing messages that it is generating will be helpful.
Tuesday 4th April, 2005 (before 9am) Kamen fileserver reboot
The kamen fileserver will be rebooted early on Tuesday morning to install a new kernel which should help track a recent problem with 'nul' characters appearing in mail files.
A quick reboot was tried during the day on Monday, but the ethernet didn't work after the reboot, necessitating lots of head-scratching and turning the power off for 10 seconds. Fortunately this second response fixed the network.
Thursday 17th March, 2005 (9am) Eno fileserver error
One of the drive arrays on the eno fileserver has failed. We're currently looking into the situation. All home directories on /import/eno/1 are unavailable. This includes most class accounts.
Update (10:03am): The fsck is finished and the filesystem is back online. It took around half an hour to complete.
Update (9:50am): The filesystem is currently undergoing an fsck (filesystem check). This could take anywhere from 15 minutes to a couple of hours.
It appears that around 5:17am this morning, one half of eno's disk array disappeared. This is the sort of incident that would happen if someone knocked out one of the cables, but no-one was around...
Monday-Tuesday 14-15th March, 2005 Closure of K17 labs
The labs in K17 - banjo, oud, lyre and chil - will be closed to students Monday afternoon in order to prepare for an electricity shutdown on K17 ground floor early Tuesday morning (6:00-9:00). The labs will be opened again at 9:00 Tuesday.
17:00, Thursday 17th February, 2005 Williams upgrade
Update: The upgrade ran smoothly and on time.
Williams will be upgraded as per weill and wagner's recent upgrades. The only difference is extra notice due to its role as a simulations server, plus it will retain the extra RAM (6G total). Please inform SS if this will cause a critical interruption to your activities.
17:30, Tuesday 15th February, 2005 Web server (albeniz) upgrade
The web server for www.cse.unsw.edu.au, albeniz, will be upgraded at 5:30pm today. There should be minimal interruption to services (in the order of one minute).
Please inform System Support (ss@cse.unsw.edu.au, x54199) if you encounter any peculiarities with the updated server.
17:00, Tuesday 15th February, 2005 Wagner upgrade
Update: The upgrade (also) went very smoothly. The host was inaccessible for under a minute.
Wagner will be upgraded at 5pm today, with the same details as the weill upgrade below (8th Feb).
Please inform System Support (ss@cse.unsw.edu.au, x54199) if you encounter any peculiarities with the updated server.
Sunday 13th February, 2005 Mail server problems - resolved
The problems with the mail server were caused by an administrative error on Friday. The error was rectified late on Saturday evening, but by that time there was a substantial queue of unprocessed mail (many duplicates) that needed to be dealt with. This queue took about 24 hours to clear completely, so duplicates could have been received as late as 8:50pm on Sunday.
Saturday 12th February, 2005 Mail server problems
The hard drive on the mail server filled up early this morning. This caused it to generate many duplicates of emails. We have disabled outgoing mail and are currently working to eliminate duplicates. We hope that mail service should be restored sometime between 6pm and 7pm, though it may take another few hours for the backlog to clear.
16:00, Tuesday 8th February, 2005 Weill upgrade
Update: The upgrade went very smoothly. The host was inaccessible for under a minute.
At 4pm today, we will be turning off the old weill and turning on a new one. Thus, all current sessions through it (login, IMAP, XDM/X-win32, etc) will be dropped. The outage shouldn't be longer than a minute or two. If you would rather not be interrupted, then we suggest you temporarily move to wagner instead.
There are two significant changes occurring:
- The new weill will be running a 2.6.10 Linux kernel (instead of a 2.4 series kernel). The 2.6 series has a number of enhancements. We're particularly curious about how scheduler changes will affect the performance of our busiest servers - the login servers wagner, weill and williams.
- New hardware. The main difference is the CPU, which goes from 2.4GHz to 3.2GHz. Of note is that this is the first of several servers in our new "blade" server chassis. Some discussion on these can be found in an older CSG newsletter, though our new chassis is a 10-server Dell one (with 6 installed currently), rather than the 14-server Intel one pictured.
Please inform System Support (ss@cse.unsw.edu.au, x54199) if you encounter any peculiarities with the updated server.
Monday, 7th February, 2005 Britten server failure
On the weekend, one of our DNS/YP/NTP secondary servers had a hardware failure. Since we have 3 servers in this role (providing redundancy), and they would soon be undergoing a hardware upgrade, it was decided to fully decommission britten. The beethoven server is currently also masquerading as britten, and uses of britten will be phased out. The most notable of these is that anyone using any of britten's IP addresses as a DNS server will need to change it. For most people, this means changing 129.94.172.11 to 129.94.172.7 or 129.94.172.12.
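For machines that list their DNS servers in a resolv.conf-style file, the address change described above could be sketched roughly as follows. This is only an illustration, not a System Support tool: the function name is hypothetical, the choice of 129.94.172.7 over 129.94.172.12 as the default replacement is arbitrary, and it assumes plain "nameserver ADDRESS" lines.

```python
# Sketch only: swap the retired britten address out of resolv.conf-style text.
# The function name and the default replacement are assumptions made for
# illustration; either 129.94.172.7 or 129.94.172.12 may be used.

RETIRED = "129.94.172.11"
REPLACEMENT = "129.94.172.7"


def replace_retired_nameserver(text, retired=RETIRED, replacement=REPLACEMENT):
    """Return the text with the retired nameserver address replaced.

    If the replacement address is already listed, the retired entry is
    dropped instead, so the same server is not listed twice.
    """
    out = []
    seen = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            addr = replacement if fields[1] == retired else fields[1]
            if addr in seen:
                continue  # already listed; skip the duplicate
            seen.add(addr)
            out.append(f"nameserver {addr}")
        else:
            out.append(line)  # leave search/domain/comment lines untouched
    return "\n".join(out) + ("\n" if out else "")
```

The function only transforms text; reading and writing the actual file (and keeping a backup of it) is left to whoever applies the change.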
08:30 Tuesday 1st February, 2005 Elfman fileserver maintenance
Update: The maintenance was completed in around 5 minutes, and the orange light is now off. Please inform SS if you encounter any problems.
The fileserver 'elfman' will be taken down for maintenance for 10-15 minutes at 8:30am, on Tuesday the 1st February. This will affect all users and groups with home directories on elfman.
Please inform System Support (ss@cse.unsw.edu.au, x54199) if you believe this will be problematic for you. This page will be updated with information post-maintenance.
If you aren't sure which fileserver you are on, you can run acc and check whether your home directory is on /import/elfman/1 or /import/elfman/2. Alternatively, visit this webpage.
Background: elfman recently developed an orange light (which we are told means "Critical error") that we'd like to investigate under controlled conditions.
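The home directory check mentioned above can also be scripted. The following is a minimal sketch, not a replacement for acc: the function name is hypothetical, and it assumes that resolving your home directory path on a CSE machine yields the /import/... location.

```python
import os

# The two elfman filesystems named in the announcement above.
ELFMAN_PREFIXES = ("/import/elfman/1", "/import/elfman/2")


def is_on_elfman(home_dir):
    """Return True if the given path lies under one of the elfman filesystems."""
    path = os.path.normpath(home_dir)
    return any(path == p or path.startswith(p + "/") for p in ELFMAN_PREFIXES)


if __name__ == "__main__":
    # Assumption: the resolved home directory path reveals the fileserver.
    home = os.path.realpath(os.path.expanduser("~"))
    status = "is on elfman" if is_on_elfman(home) else "is not on elfman"
    print(home, status)
```

The prefix comparison appends a trailing slash so that a path such as /import/elfman/10 would not be mistaken for /import/elfman/1.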