Message boards : Number crunching : About Unix/Linux hanging WUs.
Author | Message |
---|---|
Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
I just got my hands on a nice Linux box, so I'm a bit new to running in that environment. But I've read about others having WUs hang: they don't end properly, and show as running but don't consume any CPU time. My machine has 8 processors and was typically running 8 tasks at a time. Until today I had been running 24hr runtimes, but I decided to drop down to a 1hr runtime preference and updated the project. Within an hour I noticed top no longer showing me 8 high-CPU processes. As time went on, the number active dropped and dropped. I finally shut down BOINC and restarted, and now happily have 8 running again. I wanted to document this and see if perhaps it gives a test case for the project team to study this and get it fixed. What seems to have happened is that the runtime preference was trying to propagate to the running tasks. The watchdog then noticed that the target runtime (now an hour) had been exceeded by more than 4x, so it tried to shut down the tasks. That apparently is when they stopped utilizing CPU, but BOINC didn't begin any new task. Here is an example: https://boinc.bakerlab.org/rosetta/result.php?resultid=154728963 Under my original 24hr runtime, running for 18,000 seconds was fine. Under my new runtime preference (1hr), that exceeds the limit the watchdog watches for. And so the task got reported back with client error, exit status 193, and "Rosetta score is stuck or going too long. Watchdog is ending the run! CPU time: 17989 seconds. Greater than 4X preferred time: 3600 seconds" So, even if you already knew the watchdog was one cause of these hung WUs, this gives you a way to create a watchdog problem: run some WUs for more than 4hrs, change the runtime preference to 1hr, and the next time the watchdog kicks in and looks around, it should hang the WU. Other notes: this box runs only Rosetta. It's running Red Hat Enterprise Linux. 
Here's a link to the host: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=780012 Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
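The arithmetic behind the reported error message can be sketched as follows. This is only an illustration of the "greater than 4X preferred time" rule quoted in the post, not the actual Rosetta watchdog code; the function name and the `factor` parameter are hypothetical.

```python
def watchdog_would_end_run(cpu_seconds, preferred_runtime_seconds, factor=4):
    """Return True if CPU time exceeds factor x the preferred runtime.

    Hypothetical helper: it mirrors the rule stated in the error message
    ("Greater than 4X preferred time"), nothing more.
    """
    return cpu_seconds > factor * preferred_runtime_seconds


# Figures taken from the post: 17989 s of CPU time.
one_hour = 3600
twenty_four_hours = 24 * 3600

# With a 1hr preference the limit is 4 * 3600 = 14400 s, so the rule trips.
tripped_at_1hr = watchdog_would_end_run(17989, one_hour)

# With the original 24hr preference the limit is 345600 s, so it does not.
tripped_at_24hr = watchdog_would_end_run(17989, twenty_four_hours)
```

This matches the reproduction recipe above: any task that has already run for more than 4 hours crosses the limit as soon as the preference drops to 1 hour.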
DJStarfox Joined: 19 Jul 07 Posts: 145 Credit: 1,250,162 RAC: 0 |
I have the same problem. I posted in the "Problems with version 5.96" thread. My runtime preference is 2 hours. Perhaps there is something wrong with the way their watchdog process ends tasks? |
Conan Joined: 11 Oct 05 Posts: 153 Credit: 4,350,424 RAC: 4,162 |
G'Day Feet1st, It is not just a Rosetta problem. I have had it happen with Rosetta, but not for a while. Ralph has done it a few times as well. Einstein has recently done it to me heaps of times. BOINC Manager says running but nothing is happening. A quick check shows that the CPU has gone to sleep. I suspend the WU, wait till another one starts, then resume the WU. The WU then usually completes OK. If this fails then I have to restart BOINC Manager to get it working again. It may be a BOINC problem rather than a specific project problem. |
Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Thanks Conan, good to know. Do later versions of BOINC help? I guess I just installed the latest stable copy (5.10.45)... so I guess not. I have difficulty reading the dump of the WUs. Any scripts handy to show only the WUs that are currently supposedly running? Or otherwise to isolate which ones to try suspending individually? |
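One possible way to isolate the hung tasks on Linux is to sample each process's CPU time twice from `/proc/<pid>/stat` and list the ones that did not advance. The sketch below is an assumption-laden illustration, not a BOINC tool: it assumes the science application's command name contains "rosetta" (the kernel truncates command names to 15 characters), and all function names are hypothetical.

```python
import os
import time


def cpu_ticks_from_stat(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces,
    # so split on the last ')' before splitting the remaining fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the state (field 3 in proc(5)); utime and stime are
    # fields 14 and 15, i.e. indexes 11 and 12 here.
    return int(fields[11]) + int(fields[12])


def snapshot(name_fragment):
    """Map pid -> CPU ticks for processes whose command name contains name_fragment."""
    ticks = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as handle:
                line = handle.read()
        except OSError:
            continue  # process exited between listdir and open
        command = line[line.index("(") + 1 : line.rindex(")")]
        if name_fragment in command:
            ticks[int(entry)] = cpu_ticks_from_stat(line)
    return ticks


def find_stalled(before, after):
    """Return pids present in both snapshots whose CPU ticks did not advance."""
    return sorted(pid for pid in before if pid in after and after[pid] == before[pid])


def report_stalled(name_fragment="rosetta", interval=10.0):
    """Sample twice, interval seconds apart, and return the pids that used no CPU."""
    before = snapshot(name_fragment)
    time.sleep(interval)
    return find_stalled(before, snapshot(name_fragment))
```

Calling `report_stalled()` returns the pids of matching processes that consumed no CPU during the interval. A task that BOINC has deliberately suspended would also show up, so the list is a starting point for deciding which WUs to suspend and resume, not a diagnosis.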
©2025 University of Washington
https://www.bakerlab.org