Joe has been pounding on the v5 WS, trying to shake it out after the recent disaster with problems returning NVIDIA GPU WUs. The upshot of all of this is that the v5 server code was pushed hard in many ways and several issues have now been found. Joe is testing them, but we're hopeful that, beyond the initial good news we had a few days ago, several additional issues may now be fixed.
It's too early to tell since we're still testing, but I'm optimistic. This only affects particular servers (vsp07b, vspg10a, vsp11a) and the vsp09a CS.
Here's a heads up for donors running with clients configured for small WUs (v6 clients) or normal WUs (in pre-6 clients); note that this does not affect "big WU" client configs for v6 or earlier. It looks like we're running low on those over the weekend. I hope to have this resolved by Monday, but likely not tomorrow (Sunday). One workaround is to configure your client for medium-sized WUs:
Acceptable size of work assignment and work result packets (bigger units
may have large memory demands) -- 'small' is <5MB, 'normal' is <10MB, and
'big' is >10MB (small/normal/big) [normal]?
This option states a preference for the size of work units downloaded and uploaded to the project servers. Bigger units will also have bigger memory requirements. If you run on a slower broadband or dialup internet connection, small is the recommended setting to ease your bandwidth usage.
Please see our installation guides if you're not familiar with these settings. In general, the larger the setting here, the less likely we'll run out of WUs, since we'll assign small WUs to big WU clients if we run out of big WUs to give out, but of course won't send big WUs to small WU clients.
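To make the fallback rule above concrete, here is a minimal sketch of how a size-aware assignment could behave. This is not the actual assignment server code: the function name `pick_wu_size`, the `available` dictionary, and the choice to try the preferred size first before falling back to smaller ones are all assumptions made for illustration.

```python
# Hypothetical sketch only; the real assignment logic may differ.
SIZES = ["small", "normal", "big"]  # ordered smallest to largest


def pick_wu_size(client_pref, available):
    """Return the WU size class to assign, or None if nothing suitable is queued.

    client_pref: the client's configured size ("small", "normal" or "big").
    available:   dict mapping size class -> number of WUs currently queued
                 (an assumed data structure, not a real server table).
    """
    limit = SIZES.index(client_pref)
    # Assumption: try the preferred size first, then fall back to smaller ones.
    # Sizes larger than the client's preference are never considered.
    for size in reversed(SIZES[: limit + 1]):
        if available.get(size, 0) > 0:
            return size
    return None
```

With a queue such as `{"small": 0, "normal": 5, "big": 5}`, a client set to small gets nothing, while a client set to normal or big still receives work. That is why a larger setting makes it less likely that a client sits idle.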
We have been working to track down the nasty bug on the NVIDIA GPU WS's that is causing problems for donors sending back WUs. We have been trying different fixes over the last week, but this has been very tricky to figure out.
After another brainstorming session this afternoon, I think we have a good plan for the short term and long term. I hope that new WUs being assigned won't see this problem due to rerouting of assignments. Joe is also going to pound out the bugs on his new WS on vspg11a to get that going.
I'm very sorry for this major issue. This has been called the worst outage we've had and I think we agree. I've had a long chat with the development team about this and we've talked about how to fix issues in the WS code release cycle. I think the plan we have in place will stop this from happening in the future, but the main issue right now is to solve the problems at hand.
UPDATE 6pm 2/19/2010 -- after a week of working on this, trying lots of stuff, and nothing working, I think we've found something promising. I'm nervous typing this as everything looked promising before, but at least I think Joe's found the reason for the problem, which is the hard part.
UPDATE 11pm -- so far so good. It looks like this fix may be sticking.
UPDATE 7:30am 2/20/2010 -- looks like the fix is indeed working. We will continue to monitor the servers closely over the weekend.
Our main stats web server is being hit with a denial-of-service-like attack from several machines. They are accessing cgi-bin URLs multiple times per second per IP, which is slowing down the web server for everyone else. We have banned some IPs, and we will look back and ban more as needed.
Please stop running scripts -- it ruins the stats for everyone else.
UPDATE: It seems like the DOS-ers are often going to the fahproject page, so we have deactivated it for now to keep the rest of the site up. This seems to have helped a lot, coupled with some IP banning.
toTOW, one of the moderators from the FoldingForum.org community, has helped to found a new web site related to Folding@home: FAH-addict. This site has been great about giving the latest info on project news, FAH updates, tutorials, and even hardware reviews. It is also available in both English and French. Since we work closely with the FoldingForum mods to coordinate Folding@home rollouts, a lot of information is available at FAH-addict as well. I suggest you check it out: http://www.fah-addict.net/
We're done with the bulk of our initial update to new hardware. We'll be doing some more work in the future to build up additional capacity, ideally getting to the point where the stats are never off line. For now I think we're in good shape. The stats are much faster than before, so we've turned back on a lot of the capabilities we previously turned off. Also, stats updates are taking about 5 minutes and are now limited not so much by db access as by other issues.
Moreover, we have now set the third party stats to update once an hour (instead of once every 3 hours). It's set to update 10 minutes before the hour, every hour, so checking on the hour should be safe.
Note that the pages that are updated are:
Please do not use scripts to access our main pages (i.e. anything with a cgi-bin in the url). We reserve the right to ban any IP that violates this rule, as it slows down the stats for everyone else.
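For anyone maintaining a third-party stats site, the hourly schedule described above (an update starting 10 minutes before each hour) could be handled with something like the sketch below. It is only an illustration: the function names are made up here, and it assumes the update really does start at minute 50 and finishes before the top of the hour. It applies to the third-party stats files, not to the cgi-bin pages.

```python
from datetime import datetime, timedelta

UPDATE_MINUTE = 50  # "10 minutes before the hour", per the schedule above


def next_safe_fetch(now):
    """Return the next top-of-the-hour time strictly after `now`."""
    top = now.replace(minute=0, second=0, microsecond=0)
    return top + timedelta(hours=1)


def last_update_started(now):
    """Return the most recent scheduled update start (hh:50) at or before `now`."""
    candidate = now.replace(minute=UPDATE_MINUTE, second=0, microsecond=0)
    if candidate > now:
        # This hour's update has not started yet, so use the previous hour's.
        candidate -= timedelta(hours=1)
    return candidate
```

A script would sleep until `next_safe_fetch(datetime.now())`, fetch once, and repeat, so it makes one request per hour.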
We've talked about this for some time, but now's the time to start the migration to the new stats db hardware. We are doing it now and everything looks ok so far. We are keeping several safeguards in place in case there is a problem.
If there is a problem with the stats, please bear with us. There are several links we need to update and it's possible that a link is still pointing to the old db. Also, in case of emergency, we are keeping track of all the new stats from this point in a special place, so even in the worst case scenario, we can just go back to the old db and input all the new stats into it.
So, the stats will be down for a bit and there may be some inconsistencies for a day or so while we get all the links updated. The good news is that we'll have much faster stats soon, which will be great for all of us.
UPDATE 1 The migration is now done and it looks like everything is working. We've tested out the stats pages and done a small manual stats update. All looks good. However, since stats are so important, before diving in and just putting everything back to automatic updates, I wanted to see if donors see any problems. If you do, please report them in our forum (http://foldingforum.org).
UPDATE 2 It looks like everything has migrated well. We have the stats back on normal updates and those updates are going fast (under 10 minutes). With the new hardware, I bet we can make it even faster, but that's for later. We have turned back on certain features we previously turned off (e.g. CPU counts). We have more ambitious plans for the future, ideally getting to the point where the stats are never off line (even during updates), which is now possible with the new hardware.
Over the weekend, we've had trouble with cron on the server which initiates stats updates. This doesn't affect the stats data itself (past, present, or future), just the initiation of various tasks. This machine was backed up and then restarted this morning and it looks to be happy now. We are keeping an eye on things to make sure that all is back to normal.