May 05, 2008

Stanford Pervasive Parallelism Lab (PPL)

On Friday, Stanford launched the Pervasive Parallelism Lab (PPL).  There's been lots of press describing it.  The general plan for the lab is to develop a common paradigm for programming new architectures like GPUs, the Cell processor, and Intel's Larrabee, as well as multi-core CPUs.  This is something we at FAH are very interested in, as we have had to maintain a unique code path for each of these (i.e. separate code for the high performance part of the ATI GPU, NVIDIA GPU, PS3, and SMP clients).  Having a single code path for all of them would be very, very exciting, since it would help keep FAH code development on new hardware going smoothly.

May 04, 2008

UPDATE: server back up

The server that went down was reset by our sysadmins just now (thanks to them for coming in on a Sunday evening) and we've got the server code running on it again.

server down, low on work

Looks like one of our key servers went down, so regular FAH clients (non-adv, non-PS3, non-SMP, non-GPU) will be low on work until the sysadmins get the machine back online.  The other platforms (PS3, GPU, SMP, and adv settings) should have plenty of jobs, and some even have their own assignment servers (in the case of GPU and PS3).  The sysadmins work M-F, so we expect that they will do a reboot on Monday morning.  In the meantime, we have brought some new servers online with jobs, but they are getting hit hard at the moment.

Finally, we are prepping a series of servers to add 1 *million* jobs (I always imagine Dr. Evil saying that), hopefully this week, so being low on work won't be an ongoing problem once they're up.  However, until Monday morning (i.e. the next 16-18 hours or so), it will likely be tight (for non-SMP, non-adv, non-GPU clients).

April 27, 2008

Stats back to regular updates (everything)

We put the user and team file updates back to every 3 hours (used by 3rd party stats), so we're now back to regular behavior.  It looks like we should be ok from here on out. 

There are two upshots of this mess.  First, we've developed some new emergency procedures to deal with such backlogs better in the future.  Second, we have plans for how to refactor the stats input code to speed up the process by at least 2x and potentially as much as 5x.  That should help in general (perhaps letting us go back to hourly updates).

April 26, 2008

Stats back to regular updates (almost)

We've got the stats back to their regular 2-hour updates, although we're still keeping the 3rd party stats at every 12 hours until Sunday morning PST.  We've cleaned up some aspects of how the internal stats scripts work and see a way to speed up the stats input significantly (perhaps 4x), but we'll wait until this mess has blown over before making any further improvements (to avoid introducing any new errors at a sensitive time).

Update: slow stats pt 3

The last stats update (started at 8pm PST) just finished (8:50pm PST), which is very good news.  The next update should be much closer to normal.  If that goes well, we will turn the external stats access back on tonight PST (i.e. in about an hour or two).

Serverstat page streamlining

We've been working on several internal FAH scripts.  Most changes are behind the scenes (mostly to aid the FAH team in detecting server problems in real time).  However, some of these changes have led to a streamlining of the serverstat page.  In particular, we've cleared out some older servers and removed some columns that we don't use very much.  The goal is for the page to show just the most critical information, making it more obvious what the issues are.

Update: slow stats pt 2

The last stats update just finished (taking about 6 hours).  The next one should be a lot faster since there are fewer stats built up, and the one after that faster yet.  We'll keep the outside access to the db down until we're back to updating every 2 hours though.  It looks like that will be tomorrow morning (Sunday morning PST).

Since things are looking like they're settling down, we put back the osstats page if anyone is curious.

Update: slow stats

It looks like our stats update has worked well.  The last update started at 8pm PST!  It has now finished in about an hour after our modifications.  The next stats update has more stats in it (almost 24 hours of WU data), but it looks like it will go much faster (hopefully 2-3 hours).  That means the update after that will only have 2-3 hours of stats info in it and should hopefully only take ~20 minutes, and then we're caught up.
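For anyone curious about the arithmetic behind that estimate, here is a rough sketch. It assumes (and this is only an assumption for illustration, not a description of our actual stats scripts) that the time an update takes scales linearly with the hours of WU data waiting in it. The function name catch_up and its parameters are made up for this example.

```python
def catch_up(backlog_hours, run_hours_per_data_hour, short_run_hours=0.5):
    """Estimate the length (in hours) of each successive stats update.

    Assumes update time is proportional to the hours of WU data waiting,
    which is an illustrative simplification, not a measured property.
    """
    if not 0 <= run_hours_per_data_hour < 1:
        # At a rate of 1 or more, data piles up faster than it is processed.
        raise ValueError("rate must be in [0, 1) for the backlog to shrink")
    runs = []
    while True:
        run = backlog_hours * run_hours_per_data_hour
        runs.append(run)
        if run <= short_run_hours:
            return runs
        # The next update holds whatever accumulated while this one ran.
        backlog_hours = run


# Hypothetical midpoint: ~24 hours of data processed in ~2.5 hours.
runs = catch_up(24, 2.5 / 24)
print([round(r * 60) for r in runs])  # run lengths in minutes
```

With the 2.5 hour midpoint, the follow-up update comes out to roughly a quarter of an hour; taking the full 3 hours instead gives about 22 minutes, which is where the ~20 minute figure lands.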

Temporary stats changes to get the stats moving

As I've posted below, we have a backlog of stats to input into our db for donors to access their scores.  To speed up this process, we've made some temporary changes.  We've disabled stats updates from our web site.  We've also limited our updates on the external stats pages for teams and donors to once every 12 hours, at 6am and 6pm PST.  We're trying to streamline the process so that we can get the backlog through and get back to business as usual.

This is just temporary, but will be a big help.  Once the backlog is through, the points will be up and hopefully all will be back to normal.  We expect this may take as long as 2 days, and will give updates along the way.