Message boards : Number crunching : No Work Units
Sid Celery | Joined: 11 Feb 08 | Posts: 2220 | Credit: 42,306,038 | RAC: 24,321
> Sid, I hold 3 days of reserve just for occasions like this.

I don't like to hog WUs and was reluctant to increase from the default 0.25, but maybe it's a wise move after all. I've been away a few days and seen 13 tasks download, all of which were sent straight back with checksum errors like others have reported. Then nothing, then 7 more WUs with the same problem. Some came through OK, but my last 2 finished within 30 minutes of me getting home (how I'd appreciate some long-running models right now!)

On the plus side, now that my lockfile errors have disappeared, it may be that I should increase my runtime further to get more out of the few WUs that make it here successfully. That way, maybe I'll call for fewer WUs and help give everyone else access to what's left. Do people think that's the best plan at the current time - so that even if only one server is running it'll stand a better chance of keeping us busy?
Joined: 30 May 06 | Posts: 5691 | Credit: 5,859,226 | RAC: 0
> Sid, I hold 3 days of reserve just for occasions like this.

I would set up for 3 days of extra work and set a longer run time. I would expect a ton of error messages once the system comes back online. Every computer is going to be asking for work, and I bet that the server won't be able to handle the crush. But when it is your turn, grab a bunch to keep your system busy for a few days while everyone is getting work for their systems and the server is overloaded. I would think that there is more than enough work to go around when the system is running correctly, so I don't think grabbing 3 days of extra work is hogging tasks by any means.
Sid Celery | Joined: 11 Feb 08 | Posts: 2220 | Credit: 42,306,038 | RAC: 24,321
> I would set up for 3 days of extra work and set a longer run time. I would expect a ton of error messages once the system comes back online. Every computer is going to be asking for work, and I bet that the server won't be able to handle the crush. But when it is your turn, grab a bunch to keep your system busy for a few days while everyone is getting work for their systems and the server is overloaded. I would think that there is more than enough work to go around when the system is running correctly, so I don't think grabbing 3 days of extra work is hogging tasks by any means.

I take your point, but I've increased runtimes from 3 to 4 hours for the moment and kept my buffer at one day. Once the servers are back running, everyone will be trawling to fill up their backlogs, and the work will get swallowed up by the few, leaving others short. Only when the backlog has been taken up will I consider increasing the buffer.

It makes sense for everyone to reduce their buffers just to allow everyone to get something, then raise them again once the rush is over. The rush will be shorter if that happens. However, I understand human nature and fully realise that people have the attitude "all for one and I hope it's me" and will sit on a pile of unused WUs for days while others remain out of work, so my idea will fall on deaf ears.
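(For readers wondering where the buffer sizes discussed above actually live: in the BOINC client they map to the work-buffer preferences, settable on the project website or locally via a `global_prefs_override.xml` in the BOINC data directory. A minimal sketch; the values shown are illustrative examples, not recommendations:)

```xml
<!-- global_prefs_override.xml (BOINC data directory).
     work_buf_min_days is the minimum buffer / "connect about every
     X days" setting (the 0.25 default mentioned above);
     work_buf_additional_days is the extra cache on top of it.
     Example values only. -->
<global_preferences>
   <work_buf_min_days>1.0</work_buf_min_days>
   <work_buf_additional_days>0.0</work_buf_additional_days>
</global_preferences>
```

After editing the file, the client needs to re-read it (e.g. "Read local prefs file" in the BOINC Manager) before the new buffer takes effect.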
Mod.Sense | Volunteer moderator | Joined: 22 Aug 06 | Posts: 4018 | Credit: 0 | RAC: 0
> It makes sense for everyone to reduce their buffers just to allow everyone to get something, then raise them again once the rush is over. The rush will be shorter if that happens.

If I had a magic wand, that would be what I'd do. And when work becomes available, I'd give as much as requested to machines with rare internet connections, and nothing to machines that still have work to do. Then I'd catch up on the machines I shorted earlier.

...but the scheduler isn't that sophisticated, and 98% of the people don't read the message boards, so the server is going to be pounded no matter what. But Sid's got the right idea. Just take what you need. Then, when work becomes plentiful again, take on a reserve.

Sid, in general, I'd say as long as you complete the work before the deadlines, no-one is going to accuse you of hoarding. Especially since there is usually plenty of work to go around on Rosetta. On the other hand, the team likes to see results as soon as possible. So the 2-3 day buffer is a good compromise. It gives you enough work to ride through almost all outages, and gives the completed results back in a timely manner.

[edit] Having the buffer of work also gives you some room to suspend your network connection after you see outages, and avoid hitting the server at its busiest times. The problem I always have is remembering to set it back on again the next day.

Rosetta Moderator: Mod.Sense
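(The suspend-network trick described above can also be scripted with the `boinccmd` tool that ships with the BOINC client, which helps with the "remembering to set it back on" problem; a sketch, assuming a standard client install where `boinccmd` can reach the running client:)

```shell
# Stop all network traffic now (what the BOINC Manager shows as
# "Network activity suspended"):
boinccmd --set_network_mode never

# Re-enable automatic network use later - e.g. run this from a
# scheduled job the next morning so it can't be forgotten:
boinccmd --set_network_mode auto
```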
Sid Celery | Joined: 11 Feb 08 | Posts: 2220 | Credit: 42,306,038 | RAC: 24,321
> If I had a magic wand, that would be what I'd do. And when work becomes available, I'd give as much as requested to machines with rare internet connections, and nothing to machines that still have work to do. Then I'd catch up on the machines I shorted earlier.

There'd be a neat academic exercise for someone here relating to the number of cores available, then RAC etc, but it's not going to come to anything, like you say. On the plus side, those who don't read the message boards are likely to stay at the default 0.25 buffer anyway, and it's only the active crunchers who'll dial it in too.

> Sid, in general, I'd say as long as you complete the work before the deadlines, no-one is going to accuse you of hoarding. Especially since there is usually plenty of work to go around on Rosetta. On the other hand, the team likes to see results as soon as possible. So the 2-3 day buffer is a good compromise. It gives you enough work to ride through almost all outages, and gives the completed results back in a timely manner.

I'm sure that's right, but I'd likely accuse myself of it. I'm a pretty screwed-up individual on that kind of thing! Having lots of WUs hanging round makes people feel warm and fluffy, until something goes wrong and it's several days before they get to the top of the pile and get reported, by which time everyone else is stacked with them too and it becomes a bigger issue. Yes, that's right. I'm in manufacturing.... ;)

> Having the buffer of work also gives you some room to suspend your network connection after you see outages, and avoid hitting the server at its busiest times. The problem I always have is remembering to set it back on again the next day.

Good point, which I hadn't thought about. These last few days have made me rethink my view on the thread about increasing default run-times site-wide. I still support the view that it should be done step by step (default 4hrs, 2hr minimum first etc), but the urgency of the issue has been highlighted for everyone now.
Joined: 3 Nov 05 | Posts: 1833 | Credit: 120,154,827 | RAC: 17,702
> Yes, that's right. I'm in manufacturing.... ;)

Maybe there should be some TPS implementation! Principle 5 might be a good place to start... http://en.wikipedia.org/wiki/The_Toyota_Way
Sid Celery | Joined: 11 Feb 08 | Posts: 2220 | Credit: 42,306,038 | RAC: 24,321
> Yes, that's right. I'm in manufacturing.... ;)

Yes, you've worked me out very quickly. It was 8-10 years ago that I qualified in principles of world-class manufacturing at UCE Birmingham, which only formalised what I'd been doing by second nature for the previous 20 years. Some of these principles can work side by side, but I'd start with 1, then 12, 13 & 14, otherwise 5 becomes another problem rather than a route to a solution. Far too big a subject to talk about here, but it's possible to see some aspects in action already.

All I'd add is that while the principles are always correct, as users here we need to bear in mind the resources available to put them into effect. We see lots of posts reinforcing the symptoms without giving realistic time for the solution to come through. Sometimes the solution is just temporary and doesn't go to the root cause because of money, time or available expertise. When I see people threatening to abandon the project due to a temporary situation, it smacks of impatience, a lack of understanding and even a lack of respect. Let's give the guys a break occasionally. Not every problem can be solved by flicking the appropriate switch.
Joined: 30 May 06 | Posts: 5691 | Credit: 5,859,226 | RAC: 0
> Yes, that's right. I'm in manufacturing.... ;)

You've got graduates and students and full-time staff working, but this is still a university project at the moment and still in its infancy, and some of us, myself included, get caught up in expecting perfection from the team. I wonder just how large this "team" is?

I think the biggest irritation factor amongst the user group is the lack of any news. It seems that the mod is the only one at times who has a vague idea as to what happened. The same goes for the rash of m5 errors or whatever that was: no news on that. But since it's the holidays, I suspect no one is around to explain that error.
Joined: 11 Aug 07 | Posts: 49 | Credit: 1,786,248 | RAC: 0
I agree with both of the last two posts: it's a combination of a lack of patience (and/or people's backup plans if something like this happens) and the lack of news whenever something like this happens.

Also, keep in mind that right now the public schools (and maybe the UofW) are resuming classes today, and that could be "forcing" the R@H project to briefly go on the backburner as everyone settles back in with new classes, teachers, etc.

Another thing to keep in mind for this particular crisis: the states of Washington and Oregon have been in a weather crisis for the last three weeks, the likes of which we've not seen in over 40 years. Yes, the national news has been covering the storms more from the midwest and New England, but our snowfall has so far been about 10 times our normal amount. I don't know about Seattle, where R@H is based, but Portland practically shut down for about 5 days leading up to Christmas because we couldn't handle the amount of snow and ice we got. I only mention this because there could have been server problems during that time that no one could fix, because no one could physically drive to the server to fix it. Add to that two holidays a week apart, and that may have made the problems worse.

I'm just trying to alleviate at least a bit of the impatience that's going around. However, more news would be appreciated (and more than just the Mod saying something vague).
Mod.Sense | Volunteer moderator | Joined: 22 Aug 06 | Posts: 4018 | Credit: 0 | RAC: 0
Yes, unfortunately, I've not heard anything either. So I can only infer from what we all observe and a bit of experience seeing similar holiday symptoms in the past. Rosetta Moderator: Mod.Sense |
Joined: 30 May 06 | Posts: 5691 | Credit: 5,859,226 | RAC: 0
As of 5 Jan 2009 19:35:31 UTC (updated every 10 minutes) they have taken the whole system down with the exception of the scheduler and the web server! That looks serious enough. |
bono_vox | Joined: 5 Dec 05 | Posts: 8 | Credit: 371,092 | RAC: 0
Right now (as of 5 Jan 2009 19:35:31 UTC), with the exception of "Data-driven web pages" and the "Scheduler", all programs are "Not Running".
Evan | Joined: 23 Dec 05 | Posts: 268 | Credit: 402,585 | RAC: 0
Looks like your worries are at an end. I have just had 15 work units downloaded and there are about 19000 in the queue. |
Sid Celery | Joined: 11 Feb 08 | Posts: 2220 | Credit: 42,306,038 | RAC: 24,321
> Right now (as of 5 Jan 2009 19:35:31 UTC), with the exception of "Data-driven web pages" and the "Scheduler", all programs are "Not Running".

I saw the same thing earlier today - maybe 8 hours ago. But like Evan says, both make_work servers are running now, though not fully delivering requests. I've got 2 WUs already for my quad, which weren't failures from other users, so that's a start. Second requests aren't being filled yet, though.

Don't get greedy, guys. Take the first few to get you running, then hold off for a couple of hours until everyone gets working before filling up again.
bono_vox | Joined: 5 Dec 05 | Posts: 8 | Credit: 371,092 | RAC: 0
Well, now I'm downloading 12 big files (from 2.13 to 12.46MB) named "homfragments_????.zip". Hopefully I'll have enough WUs until everything goes back to normal.

EDIT: Done! 12 new WUs and network activity set to "suspended". My other computer will be doing some WCG for now.
FoldingSolutions | Joined: 2 Apr 06 | Posts: 129 | Credit: 3,506,690 | RAC: 0
> Well, now I'm downloading 12 big files (from 2.13 to 12.46MB) named "homfragments_????.zip". Hopefully I'll have enough WUs until everything goes back to normal.

I've got 5 WUs now. Just keep pressing update; eventually you get something :)

EDIT: slow downloads though!!
Joined: 30 May 06 | Posts: 5691 | Credit: 5,859,226 | RAC: 0
It is still showing no work for me at the moment... I suppose that's due to system overload. Well, it's just going to cycle for a while until it does get work.

Whoa... just 20 mins later I got 21 tasks!
P.P.L. | Joined: 20 Aug 06 | Posts: 581 | Credit: 4,865,274 | RAC: 0
Morning all. She's all GREEN now, if I can just get some!

pete.
Joined: 30 May 06 | Posts: 5691 | Credit: 5,859,226 | RAC: 0
> Morning all.

Hang in there... I got assigned 21 tasks, but getting them downloaded is another problem at the moment. Probably comm overload.
LizzieBarry | Joined: 25 Feb 08 | Posts: 76 | Credit: 201,862 | RAC: 0
> However, I understand human nature and fully realise that people have the attitude "all for one and I hope it's me"...

> Having the buffer of work also gives you some room to suspend your network connection after you see outages, and avoid hitting the server at its busiest times.

> EDIT: Done! 12 new WUs and network activity set to "suspended". My other computer will be doing some WCG for now.

Not all human nature, then!

> I've got 5 WUs now. Just keep pressing update; eventually you get something :)

OK, maybe some. :)
©2025 University of Washington
https://www.bakerlab.org