System Restarts - Know your Hardware

Sunday, May 12, 2019 5:28 AM

Problem(s) Encountered:

Had several problems rebooting the whole system after the lightning shutdown.  The lightning recovery page does not cover starting the airbags or starting prospero.  The 2.4-m manual also does not mention starting prospero or the related systems (isis, caliban, etc.).  Also very confused about the Windows Vista machine used for running the slit viewer.  The doc says to restart the computer, but the photo is not linked.  A box in the computer room labeled slit viewer has no lights and does not seem to be plugged into power, and we are not sure whether pressing the button is actually what brings it back on.

We also wasted 2 hours in the middle of the night because the slit viewer went blank.  It turned out that the subframe box had been accidentally ticked, but we thought the issue was that it was not talking to the camera.  We ended up shutting down and rebooting the IC and hiltner a couple of times and getting messed up in the process.  The problem was compounded by the lack of documentation for the slit viewer: I forgot at first that there was no manual and kept going back to the old one that is still online, going round and round in circles thinking it was out of date and still finding nothing else.

Finally, prospero has been acting up.  It doesn't accept the default directory name for runinit, so everything just gets dumped into /data/hiltner/.  

But even worse, it then started choking on readout.  It would seem to recover, and it claimed to have found an existing file with the same name (one it should not have been using as a name in the first place) and then claimed to rename the file.  But in reality it seems the new data were not saved at all; at least we cannot find them.  The status window shows strange parameters.  We had to quit and restart prospero a few times.  But now it seems that it never saved the new exposures we have been taking for the last 2 hours.  It has copied existing files to new file names and not saved the new data at all.  It is incredibly disheartening and frustrating.  See the file headers of the 3 fits files in /data/hiltner and the file names in 190512.  Time stamps show duplicates.  We moved the files into that directory by hand at the first prospero reboot.

Also, we cannot get the guider to work.  It SEEMS correlated with the rotator.  As on the other night, all was fine until we rotated.  Observing a globular cluster, there should be lots of stars in the guider cam, even without setting the new rotator angle in Jskycalc.  But the guider cam looks like no photons are received.  So we've been guiding by hand.

Solution:

Be sure to read through all lightning procedures before storms are imminent.  That can save a lot of headache later on.

A)  I have updated the 2.4m lightning shutdown(/startup) procedures to include information on bringing the airbag support system back online.  Data-systems, such as Prospero (and Owl), were not historically included in these write-ups as they are not related to the telescope.  However, I did go ahead and add a couple lines with links to the pertinent pages that detail starting Prospero (as well as ISIS, Caliban, and the TCS agent) and Owl.  The relevant link for bringing up Prospero and data-handling systems can be found here.

B)  The slit-viewing camera for CCDS was recently replaced.  This is the first observing program to utilize the FLI camera with MaximDL for CCDS slit acquisition.  In fact, final camera mounting/alignment wasn’t completed until last week.  As such, I have not yet had time to update the CCDS documentation to revise slit-viewing procedures.  The computer currently being used for slit acquisition with CCDS is the old guider computer for the 2.4m (for some reason, MaximDL as found on the Andor slit-viewing PC does not recognize FLI cameras in the setup procedures).  This PC (currently labelled as guider PC) sits on top of the Andor slit-viewing PC (labelled currently as slitviewer).  I will make labels that better describe these two different PCs.  That said, the CCDS slit-viewing PC was still powered on and appeared to be initialized and ready for use when I was signed into the various machines around 2215 last night to bring up all systems.  This morning, it appears it had been used through the night and was working normally when tested.  Regardless, to reiterate, documentation and labelling will be updated to reflect the new FLI/Maxim-based slit-acquisition for CCDS soon.

C)  In regards to the slit-viewer going “blank”:  I’m unsure why the IC and hiltner were rebooted a couple of times.  They have nothing to do with the slit-viewer.  Again, documentation will be updated for this extremely recent update to the CCDS instrument.  That said, the new slit-acquisition camera and software are synonymous with the guider systems.  A good place to get information on operating MaximDL is the Autoguiding & Acquisition link on the MDM website, found under the Useful General Documentation for Observing link from the site’s homepage.

D)  In regards to Prospero:

1) Directory structure:  Prospero has historically disallowed changing the working directory to anything other than /data/hiltner.  Observers can “trick” Prospero into writing files to an alternate directory by adding a path to the filename.  My preference is to simply create a subdirectory within /data/hiltner through a terminal and move the night’s files over at the conclusion of observations.

2) I’m not sure what “choking on readout” necessarily means.  I’ll take this time, however, to reiterate that the Prospero Status & Prospero Command windows should be set as “Always on visible workspace” (click on the icon at the top-left of each terminal and choose the appropriate option).  This will keep the Prospero Status window from going blank when switching between various workspaces.  Also, I seem to recall that if this option wasn’t checked for the Prospero Command window, it would occasionally suspend integrations for some reason.

3) I’m also not sure what was going on with the “duplicate files” issue.  Indeed this morning, I attempted to take a couple test frames and ran into similar issues.  Below is step-by-step what I tried:

___________________________________________________________________

In prospero:

file etest (it changed the ’Next:’ file accordingly from ccds3.0015 to etest.0001)

go (ran normally but upon writing file, gave the warning about filename already existing.  But the listed filename was still ccds3*, not etest*.  So it reverted the filename for some reason.)

I then tried the disk recovery procedure, as listed here.  I clearly needed to revise this (as of this writing, revisions have been made) to tell users to quit isis, caliban, prospero, xmis2, etc. before rebooting hiltner.  Also, when I attempted to reboot hiltner, I found I needed superuser permissions.  Even sudo /sbin/reboot complained.  Once I rebooted, I restarted ISIS and CALIBAN and tried RECOVER 20 DISK1 & RECOVER 20 DISK2.  It attempted to recover the first file, but warned it needed to be renamed.  For the remaining files, it spit out the following error:

ERROR: No SIMPLE card in FITS file #n, skipping… where n went from 2 through 20.

Perhaps there was indeed only the one file in the queue?  I don’t know.

Anyways, at this point, I decided to take everything down and back up (mdm24ws1, IC).  After this, I got an initial complaint when starting ISIS, but quitting it and starting it anew resolved that (I’ve seen that happen before as well).  I moved all .fits files away from /data/hiltner (to /data/hiltner/cluster) and commenced with bringing everything back online.  All came up as expected.  I ran a startup and runinit in prospero, setting the file to test.  I then took a couple sample images.  All is now working correctly again.

___________________________________________________________________

The above is snipped and lightly revised from an email to Rick Pogge asking if he had any input on these file-naming issues.  No input at time of this writing.  

4) I cannot locate any lost files taken through Prospero.  I am under the assumption that they are indeed lost.  I appreciate that this is disheartening and sympathize.  That said, particularly when Prospero is spitting out errors about files already existing, it is good practice to verify a file is written before moving on to the next integration.  If issues appear, measures should be taken to ensure file integrity prior to running any additional exposures (e.g. moving file(s) to a separate subdirectory).  
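
The housekeeping described in D.1 and D.4 (check that each file was really written, then move the night's files into a subdirectory of /data/hiltner) could be sketched roughly as below.  This is only an illustration in Python, not part of Prospero or any MDM tool.  The function names, the *.fits pattern and the night-named subdirectory are assumptions made for the example.  The checks are deliberately crude: a file is accepted if it begins with a SIMPLE card and its size is a whole number of 2880-byte FITS blocks, and only byte-for-byte identical copies are flagged as duplicates.  A file that was copied and then had its header edited would not be caught.

```python
import hashlib
import shutil
from collections import defaultdict
from pathlib import Path

FITS_BLOCK = 2880  # the FITS standard pads files to 2880-byte blocks


def looks_like_fits(path):
    """Crude check: non-empty, block-aligned, and starts with a SIMPLE card."""
    path = Path(path)
    size = path.stat().st_size
    if size == 0 or size % FITS_BLOCK != 0:
        return False
    with path.open("rb") as fh:
        first_card = fh.read(80)
    return first_card.startswith(b"SIMPLE  =")


def find_duplicates(paths):
    """Return groups of files whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for p in paths:
        digest = hashlib.sha256(Path(p).read_bytes()).hexdigest()
        by_digest[digest].append(Path(p))
    return [sorted(group) for group in by_digest.values() if len(group) > 1]


def archive_night(data_dir, night):
    """Move files that pass both checks into data_dir/night.

    Files that fail a check are left in place and reported, so they can be
    inspected by hand before any further exposures are taken.
    """
    data_dir = Path(data_dir)
    dest = data_dir / night
    dest.mkdir(exist_ok=True)
    files = sorted(data_dir.glob("*.fits"))  # assumed naming pattern
    dupes = {p for group in find_duplicates(files) for p in group}
    moved, held = [], []
    for f in files:
        if looks_like_fits(f) and f not in dupes:
            shutil.move(str(f), str(dest / f.name))
            moved.append(f.name)
        else:
            held.append(f.name)
    return moved, held
```

Calling archive_night("/data/hiltner", "190512") would return two lists of file names: those moved into /data/hiltner/190512 and those held back for inspection.  Holding suspect files back rather than deleting or renaming them keeps the evidence intact if the file-naming problem recurs.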

E)  In regards to guiding:  Guiding on the whole has appeared to be an issue during this run.  I have recommended reaching out to John T. and/or Justin R. (both on distribution again here) for any useful tips and tricks for the guide systems.  I’ve also cc’ed the 1.3m observer in case she can benefit from any tips or tricks.  Unfortunately, I do not have any experience running the guide systems.  That said, there is a seemingly very comprehensive document, Autoguiding & Acquisition, found on the General Observing Documentation page that should be quite instructive, particularly section 2.3 and on.

IN SUMMARY:  I’m not sure how these problems coalesced through the night, particularly with regards to prospero.  I was signed in and assisting the observer as late as 2215 MST.  I logged into mdm24ws1, TCS PC and the slit-viewing PC.  I initialized the TCS and guider programs and verified functionality as best as possible with the telescope at zenith.  On mdm24ws1 I brought up all systems, including data-handling systems, and verified that files were writing as expected.  Whatever happened, happened after this.

As of this afternoon, I have brought all systems up fresh.  Daylight tests indicate that CCDS is working properly; Prospero is properly writing (and recording) files; the slit-viewer is properly resolving the flat lamp-illuminated slit; MaximDL Pro guider software is functioning as expected.