Message boards : Number crunching : Why did I get Compute Error?
Author | Message |
---|---|
Idan Shifres Joined: 12 Dec 08 Posts: 11 Credit: 126,517 RAC: 0 |
Hi, I have no idea why I got a Compute/Client error. Maybe someone can shed some light on this: https://boinc.bakerlab.org/rosetta/result.php?resultid=217026695 Thanks in advance. |
Evan Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
Did you reboot your computer several times while running the project? The statement "too many restarts" usually points to this problem. Rosetta will only allow, I think, 3 restarts per work unit before declaring a compute error. If you have to reboot the computer several times for any reason, suspend the project first, using the activity tab at the top of the page and then suspend. |
Idan Shifres Joined: 12 Dec 08 Posts: 11 Credit: 126,517 RAC: 0 |
Quoting Evan: "Did you reboot your computer several times while running the project? The statement "too many restarts" usually points to this problem. Rosetta will only allow, I think, 3 restarts per work unit before declaring a compute error. If you have to reboot the computer several times for any reason, suspend the project using the activity tab at the top of the page and then suspend." I didn't restart the computer that crunched this WU. Does a restart mean only restarting the computer, or also restarting work on Rosetta? Because if the WU is big, meaning it takes about 20 hours, and my BOINC preferences switch between projects every hour or so, then Rosetta will be "restarted" quite a few times on this big WU... Could that be the problem? |
Paul Joined: 29 Oct 05 Posts: 193 Credit: 66,952,147 RAC: 9,308 |
One failed work unit is no reason for alarm. Switching between projects usually does not cause a work unit to fail. I have been 100% R@H for a while, so I only have a little experience with switching. One of the complaints about R@H has been failed work units. The WU will go to another computer for processing. If the same unit fails on multiple computers, the project team will figure out what is wrong. Keep crunching. Thx! Paul |
Idan Shifres Joined: 12 Dec 08 Posts: 11 Credit: 126,517 RAC: 0 |
Quoting Paul: "One failed work unit is no reason for alarm." Thanks for the feedback. I'm not alarmed or anything; it's just unusual, and I thought there might be something wrong with my computer... :) I have another WU that has been running for 25 hours; hopefully this one will finish well :) It's just that I don't want my computer to work so hard for a result that can't be used... |
Evan Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
Quoting Idan Shifres: "Can a restart mean only restarting of a computer or restarting work on rosetta? coz, if the WU is big - which means take about 20 hours, and my BOINC prefences switch between projects every 1 hour or so, it means that rosetta will be "restarted" quite a few times with this big WU..." If you switch between projects, enable the "Leave applications in memory while suspended" option, which you will find on the computer preferences page. This ensures that the task resumes from where it left off, so you won't lose anything while the other project is running. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Yep, Evan gets the prize. If you exit BOINC (not close it, but exit), shut down the machine, or suspend the task without the keep-in-memory setting, then the task will be "restarted" when it gets to run again. The number of restarts is not counted per task. It is counted per checkpoint. But some tasks have long-running models and few, if any, checkpoints, so in practice it often amounts to the same thing. A checkpoint is always recorded at the end of a model. Some types of tasks can record checkpoints within a model as well. I believe it takes 5 restarts before the task is ended. The 3X figure is something else: when a task runs for more than 3 times your runtime preference, the watchdog will shut it down. The idea is simply that if your machine begins the task that many times and hasn't made enough progress to reach a checkpoint, then, for whatever reason, this task is probably not a good fit for your machine. So it is ended and another one will begin, which will hopefully be a better fit (which it often is, because it has models that complete sooner, or a new application version, or better checkpointing). Rosetta Moderator: Mod.Sense |
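
For anyone who wants to see the rule spelled out, here is a rough toy model in Python of what Mod.Sense describes. It is only a sketch, not Rosetta's or BOINC's actual code: the class `ToyTask`, the helper functions, and the constant names are all made up for illustration, the limit of 5 restarts is the "I believe" figure from the post above, and treating the fifth restart as the one that ends the task is an assumption.

```python
from dataclasses import dataclass

MAX_RESTARTS = 5      # assumed limit ("I believe it takes 5 restarts")
WATCHDOG_FACTOR = 3   # watchdog limit: 3x the runtime preference


@dataclass
class ToyTask:
    """Hypothetical stand-in for a work unit; not Rosetta's real code."""
    restarts_since_checkpoint: int = 0
    ended: bool = False

    def checkpoint(self) -> None:
        # Reaching a checkpoint means progress was saved, so the count resets.
        self.restarts_since_checkpoint = 0

    def suspend_and_resume(self, keep_in_memory: bool) -> None:
        # A task left in memory continues where it stopped: no restart.
        if keep_in_memory or self.ended:
            return
        self.restarts_since_checkpoint += 1
        if self.restarts_since_checkpoint >= MAX_RESTARTS:
            self.ended = True  # would be reported as "too many restarts"


def watchdog_should_stop(run_hours: float, runtime_pref_hours: float) -> bool:
    # True once the task has run more than 3x the runtime preference.
    return run_hours > WATCHDOG_FACTOR * runtime_pref_hours


def survives_switching(switches: int, keep_in_memory: bool,
                       checkpoint_every: int = 0) -> bool:
    """Run `switches` suspend/resume cycles; checkpoint every N cycles (0 = never)."""
    task = ToyTask()
    for i in range(1, switches + 1):
        task.suspend_and_resume(keep_in_memory)
        if checkpoint_every and i % checkpoint_every == 0:
            task.checkpoint()
    return not task.ended
```

In Idan's scenario (a roughly 20-hour WU with a project switch about every hour, so about 19 suspensions), `survives_switching(19, keep_in_memory=False)` ends the toy task if it never reaches a checkpoint, while `survives_switching(19, keep_in_memory=True)` does not. A task that checkpoints between suspensions also survives, which fits Paul's point that switching usually does no harm.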
©2025 University of Washington
https://www.bakerlab.org