Why did I get Compute Error?

Message boards : Number crunching : Why did I get Compute Error?

Idan Shifres
Joined: 12 Dec 08
Posts: 11
Credit: 126,517
RAC: 0
Message 58289 - Posted: 31 Dec 2008, 13:04:07 UTC

Hi, I have no idea why I got a Compute/Client error. Maybe someone can shed some light on this: https://boinc.bakerlab.org/rosetta/result.php?resultid=217026695

Thanks in advance.
ID: 58289 · Rating: 0
Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 58291 - Posted: 31 Dec 2008, 13:41:05 UTC

Did you reboot your computer several times while running the project? The statement "too many restarts" usually points to this problem. Rosetta will only allow, I think, 3 restarts per work unit before declaring a compute error. If you have to reboot the computer several times for any reason, suspend the project first, using the activity tab at the top of the page and then choosing suspend.
ID: 58291 · Rating: 0
Idan Shifres
Joined: 12 Dec 08
Posts: 11
Credit: 126,517
RAC: 0
Message 58293 - Posted: 31 Dec 2008, 13:53:48 UTC - in response to Message 58291.  

Did you reboot your computer several times while running the project? The statement "too many restarts" usually points to this problem. Rosetta will only allow, I think, 3 restarts per work unit before declaring a compute error. If you have to reboot the computer several times for any reason, suspend the project first, using the activity tab at the top of the page and then choosing suspend.


I didn't restart the computer that crunched this WU.
Does "restart" mean only restarting the computer, or also restarting work on Rosetta? Because if the WU is big (meaning it takes about 20 hours) and my BOINC preferences switch between projects every hour or so, Rosetta will be "restarted" quite a few times on this big WU...

Could that be the problem?
ID: 58293 · Rating: 0
Paul
Joined: 29 Oct 05
Posts: 193
Credit: 66,952,147
RAC: 9,308
Message 58295 - Posted: 31 Dec 2008, 14:09:43 UTC - in response to Message 58293.  

One failed work unit is no reason for alarm.

Switching between projects usually does not cause a work unit to fail. I have been running 100% R@H for a while, so I have only a little experience with it.

One of the complaints about R@H has been failed work units. The WU will go to another computer for processing. If the same unit fails on multiple computers, the project team will figure out what is wrong.

Keep crunching.
Thx!

Paul

ID: 58295 · Rating: 0
Idan Shifres
Joined: 12 Dec 08
Posts: 11
Credit: 126,517
RAC: 0
Message 58296 - Posted: 31 Dec 2008, 14:15:42 UTC - in response to Message 58295.  

One failed work unit is no reason for alarm.

Switching between projects usually does not cause a work unit to fail. I have been running 100% R@H for a while, so I have only a little experience with it.

One of the complaints about R@H has been failed work units. The WU will go to another computer for processing. If the same unit fails on multiple computers, the project team will figure out what is wrong.

Keep crunching.


Thanks for the feedback. I'm not alarmed or anything; it's just unusual, and I thought there might be something wrong with my computer... :)

I have another WU that has been running for 25 hours; hopefully this one will finish well :) It's just that I don't want my computer to work so hard for a result that can't be used...
ID: 58296 · Rating: 0
Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 58298 - Posted: 31 Dec 2008, 15:12:13 UTC

Does "restart" mean only restarting the computer, or also restarting work on Rosetta? Because if the WU is big (meaning it takes about 20 hours) and my BOINC preferences switch between projects every hour or so, Rosetta will be "restarted" quite a few times on this big WU...

Could that be the problem?

If you switch between projects, enable the "Leave applications in memory while suspended" option, which you will find on the computer preferences page. This ensures that the task resumes from where it left off, so you won't lose anything while the other project runs.
ID: 58298 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58305 - Posted: 31 Dec 2008, 16:36:02 UTC

Yep, Evan gets the prize. If you exit BOINC (not close it, but exit), shut down the machine, or suspend the task without the keep in memory setting, then the task will be "restarted" when it gets to run again.

The number of restarts is not counted per task. It is counted per checkpoint. But some tasks have long-running models and few, if any, checkpoints, so in many cases it amounts to basically the same thing. A checkpoint is always recorded at the end of a model. Some types of tasks can record checkpoints within a model as well.

I believe it takes 5 restarts before the task is ended. The 3X figure is a different limit: when a task runs for 3 times more than your runtime preference, the watchdog will shut it down.

The idea is simply that if your machine begins the task that many times and hasn't made enough progress to reach a checkpoint, then, for whatever reason, this task is probably not a good fit with your machine. So, it is ended and another one will begin, which will hopefully be a better fit (which it often is because it has models that complete sooner, or a new application version, or better checkpointing).
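
To make the two limits above concrete, here is a minimal Python sketch. It is a toy model, not Rosetta's actual code: the `Task` class, the function name, and the constants are all hypothetical, the limit of 5 is only the value recalled above, and it is an assumption that the task is ended exactly on the fifth restart without a checkpoint.

```python
MAX_RESTARTS = 5      # assumed value ("I believe it takes 5 restarts")
WATCHDOG_FACTOR = 3   # watchdog limit: 3x the runtime preference


class Task:
    """Toy model of a task; all names here are hypothetical."""

    def __init__(self):
        self.restarts_since_checkpoint = 0
        self.ended_with_error = False

    def restart(self):
        """The task resumes after having been removed from memory."""
        self.restarts_since_checkpoint += 1
        # Assumption: the task is ended as soon as the limit is reached.
        if self.restarts_since_checkpoint >= MAX_RESTARTS:
            self.ended_with_error = True

    def checkpoint(self):
        """Restarts are counted per checkpoint, so the counter resets."""
        self.restarts_since_checkpoint = 0


def watchdog_should_end(elapsed_hours, runtime_preference_hours):
    """Separate limit: the task has run more than 3x the runtime preference."""
    return elapsed_hours > WATCHDOG_FACTOR * runtime_preference_hours
```

In this model, a task that reaches a checkpoint between restarts never hits the limit, while a long-running model with no checkpoints fails after five restarts in a row, which matches the case of switching projects every hour without the keep in memory setting.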
Rosetta Moderator: Mod.Sense
ID: 58305 · Rating: 0