Message boards : Number crunching : Report Problems with Rosetta Version 5.25
Author | Message |
---|---|
mnb Send message Joined: 15 Dec 05 Posts: 51 Credit: 69,458 RAC: 0 |
Yeah, there was a problem with my CPU fan. For some reason it reset to 0 rpm about every 3 seconds, although there was no visual hint that it was malfunctioning. I cleaned the heatsink and fan, removed the metal ring covering the fan, and from now on I'm going to keep the PCalert4 monitoring program running. I'm also using BES to throttle CPU usage to 75%. It seems to lower the temperature by some 5 degrees Celsius. Thank you very much. |
STE\/E Send message Joined: 17 Sep 05 Posts: 125 Credit: 4,103,208 RAC: 0 |
Result ID 30771861 Name t347__CASP7_ASSEMBLEABRELAX_SAVE_ALL_OUT_new6to205hom022__991_873_1 Workunit 25596465 Created 31 Jul 2006 6:59:08 UTC Sent 31 Jul 2006 10:15:14 UTC Received 31 Jul 2006 16:47:30 UTC Server state Over Outcome Client error Client state Computing Exit status -2147483645 (0x80000003) Computer ID 263791 Report deadline 7 Aug 2006 10:15:14 UTC CPU time 14406.359375 stderr out <core_client_version>5.4.9</core_client_version> <message> One or more arguments are invalid (0x80000003) - exit code -2147483645 (0x80000003) </message> <stderr_txt> # random seed: 3336638 # cpu_run_time_pref: 3600 ********************************************************************** Rosetta score is stuck or going too long. Watchdog is ending the run! CPU time: 14406.4 seconds. Greater than 4X preferred time: 3600 seconds ********************************************************************** GZIP SILENT FILE: .xxt347.out WARNING! attempt to gzip file .xxt347.out failed: file does not exist. Unhandled Exception Detected... - Unhandled Exception Record - Reason: Breakpoint Encountered (0x80000003) at address 0x77F767CD Engaging BOINC Windows Runtime Debugger... ******************** |
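The exit status in the report above appears twice, once as a signed decimal and once as hex. A minimal sketch of how the two forms relate, assuming the status is a 32-bit value (the helper name `to_unsigned_hex` is hypothetical, not part of BOINC):

```python
def to_unsigned_hex(exit_status: int, bits: int = 32) -> str:
    """Render a signed exit status as zero-padded unsigned hex.

    Assumes the status is stored in `bits` bits (32 here, matching the
    eight hex digits shown in the report).
    """
    mask = (1 << bits) - 1
    # Masking maps a negative two's-complement value to its unsigned form.
    return f"0x{exit_status & mask:0{bits // 4}X}"


print(to_unsigned_hex(-2147483645))  # the Windows status from the report
print(to_unsigned_hex(131))          # the Linux exit code seen later in the thread
```

This only converts notation; it says nothing about what the code means on a given platform.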
Ananas Send message Joined: 1 Jan 06 Posts: 232 Credit: 752,471 RAC: 0 |
|
Team TMR Send message Joined: 2 Nov 05 Posts: 21 Credit: 1,583,679 RAC: 0 |
I've had at least 5 WUs fail in the last week because they ran for over 12 hours (default 3 hour target CPU time in effect), and another is going to fail within the next hour (it's already up to 12.5 hours). They're not getting credit either. |
Ananas Send message Joined: 1 Jan 06 Posts: 232 Credit: 752,471 RAC: 0 |
If your PC isn't very fast and this WU needs so much time for one decoy, it might help to set a higher target time and contact/update the Rosetta server. The Rosetta client notices the increased time limit and (hopefully) will allow more time for that result. |
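The watchdog message quoted earlier in the thread says the run was ended for being "Greater than 4X preferred time". A rough sketch of that arithmetic, assuming the limit is simply four times the target CPU time (the real watchdog may check other conditions as well, and both function names are hypothetical):

```python
WATCHDOG_FACTOR = 4  # assumption taken from the "Greater than 4X preferred time" message


def watchdog_limit_seconds(target_hours: float, factor: int = WATCHDOG_FACTOR) -> float:
    """CPU-time limit in seconds for a given target run time in hours."""
    return target_hours * 3600 * factor


def would_be_ended(cpu_seconds: float, target_hours: float) -> bool:
    """True if the CPU time already exceeds the assumed watchdog limit."""
    return cpu_seconds > watchdog_limit_seconds(target_hours)


# Default 3 hour target: the limit is 12 hours, so a 12.5 hour WU is over it.
print(watchdog_limit_seconds(3) / 3600)
print(would_be_ended(12.5 * 3600, target_hours=3))
# With a 6 hour target, the same WU would still be inside the assumed limit.
print(would_be_ended(12.5 * 3600, target_hours=6))
```

Under this assumption, raising the target time raises the limit proportionally, which is why a higher target might let a slow result finish.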
Pepo Send message Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0 |
I was having this problem with 5.22 (also there) already and now the same happens with 5.25 - stalled/hanging Rosettas. I noticed that the running Rosetta app (result 29859752) never exited, and BOINC did not start any other app for days; it was also unable to run benchmarks or remove the app from memory, until I suspended the result. I'll try to unsuspend the result and wait to see what happens (maybe a few hours, until tomorrow), but progress and time do not increment at all, boinc.log does not mention restarting Rosetta, only pausing the previous app (although BCC is telling me Rosetta is running), and the machine is 99% idle. :-( Peter
Relevant lines from log:
2006-07-24 23:49:13 [---] Rescheduling CPU: files downloaded
2006-07-25 00:22:50 [---] Rescheduling CPU: application exited
2006-07-25 00:22:51 [Einstein@Home] Computation for task h1_0208.0_S5R1__5364_S5R1a_0 finished
2006-07-25 00:22:51 [rosetta@home] Starting task t347__CASP7_ABRELAX_SAVE_ALL_OUT_121to205hom003__847_2818_0 using rosetta version 525
2006-07-25 01:22:51 [SETI@home] Restarting task 16my06ad.2870.14096.47174.3.139_3 using setiathome_enhanced version 512
2006-07-25 01:22:51 [rosetta@home] Pausing task t347__CASP7_ABRELAX_SAVE_ALL_OUT_121to205hom003__847_2818_0 (removed from memory)
2006-07-25 02:22:51 [SETI@home Beta Test] Restarting task 02ap05aa.20527.464.490894.3.48_1 using setiathome_enhanced version 512
2006-07-25 02:22:51 [SETI@home] Pausing task 16my06ad.2870.14096.47174.3.139_3 (removed from memory)
2006-07-25 03:22:52 [SETI@home Beta Test] Pausing task 02ap05aa.20527.464.490894.3.48_1 (removed from memory)
2006-07-25 03:22:52 [SETI@home] Restarting task 16my06ad.2870.14096.47174.3.139_3 using setiathome_enhanced version 512
2006-07-25 04:22:52 [SETI@home] Pausing task 16my06ad.2870.14096.47174.3.139_3 (removed from memory)
2006-07-25 04:22:52 [Einstein@Home] Starting task h1_0208.0_S5R1__5363_S5R1a_0 using einstein_S5R1 version 401
2006-07-25 05:22:53 [Einstein@Home] Pausing task h1_0208.0_S5R1__5363_S5R1a_0 (removed from memory)
2006-07-25 06:22:53 [SETI@home] Restarting task 16my06ad.2870.14096.47174.3.139_3 using setiathome_enhanced version 512
2006-07-25 06:22:53 [rosetta@home] Pausing task t347__CASP7_ABRELAX_SAVE_ALL_OUT_121to205hom003__847_2818_0 (removed from memory)
2006-07-25 06:22:55 [---] Suspending work fetch because computer is overcommitted.
2006-07-25 07:22:53 [---] Using earliest-deadline-first scheduling because computer is overcommitted.
2006-07-25 07:22:53 [SETI@home] Pausing task 16my06ad.2870.14096.47174.3.139_3 (removed from memory)
2006-07-28 16:52:39 [---] Suspending computation - running CPU benchmarks
2006-07-28 16:52:39 [rosetta@home] Pausing task t347__CASP7_ABRELAX_SAVE_ALL_OUT_121to205hom003__847_2818_0 (removed from memory)
2006-07-28 16:52:41 [---] Running CPU benchmarks
2006-07-28 16:52:49 [---] Failed to stop applications; aborting CPU benchmarks
2006-07-28 16:52:50 [---] Resuming computation
2006-07-28 16:52:50 [---] Rescheduling CPU: Resuming computation
2006-07-28 16:52:50 [---] Process 9951 not found
2006-07-31 12:13:04 [-manually-suspended-rosetta-] Rescheduling CPU: result suspended, resumed or aborted by user
2006-07-31 12:13:08 [-manually-suspended-rosetta-] Rescheduling CPU: result suspended, resumed or aborted by user
2006-07-31 12:13:08 [rosetta@home] Pausing task t347__CASP7_ABRELAX_SAVE_ALL_OUT_121to205hom003__847_2818_0 (removed from memory)
2006-07-31 12:13:08 [Einstein@Home] Restarting task h1_0208.0_S5R1__5363_S5R1a_0 using einstein_S5R1 version 401
......
2006-08-02 18:31:26 [---] Rescheduling CPU: result suspended, resumed or aborted by user
2006-08-02 18:31:27 [---] Using earliest-deadline-first scheduling because computer is overcommitted.
2006-08-02 18:31:27 [SETI@home] Pausing task 10ap05ac.26689.31026.679814.3.137_2 (removed from memory)
2006-08-02 18:31:27 [---] Suspending work fetch because computer is overcommitted.
2006-08-02 18:36:04 [---] Rescheduling CPU: result suspended, resumed or aborted by user
Peter |
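Log entries like the ones in the post above follow a `timestamp [project] message` pattern, so they can be filtered per project to see when a task was started and paused. A small sketch under that assumption (the regular expression and both function names are illustrative, not a BOINC format specification):

```python
import re
from datetime import datetime

# Assumed layout: "YYYY-MM-DD HH:MM:SS [project] message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(.*?)\] (.*)$")


def parse_line(line):
    """Return (datetime, project, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp, project, message = match.groups()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), project, message


def project_events(lines, project):
    """Keep only the parsed entries that belong to one project."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry[1] == project]


sample = [
    "2006-07-25 00:22:51 [rosetta@home] Starting task t347__CASP7_ABRELAX_SAVE_ALL_OUT_121to205hom003__847_2818_0 using rosetta version 525",
    "2006-07-25 01:22:51 [SETI@home] Restarting task 16my06ad.2870.14096.47174.3.139_3 using setiathome_enhanced version 512",
    "2006-07-25 01:22:51 [rosetta@home] Pausing task t347__CASP7_ABRELAX_SAVE_ALL_OUT_121to205hom003__847_2818_0 (removed from memory)",
]

events = project_events(sample, "rosetta@home")
for when, _, message in events:
    print(when, message.split(" task ")[0])
print((events[-1][0] - events[0][0]).total_seconds())  # seconds between start and pause
```

Filtering the full log this way makes it easier to spot a task that is paused repeatedly without ever being restarted.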
TCU Computer Science Send message Joined: 7 Dec 05 Posts: 28 Credit: 12,861,977 RAC: 0 |
I was having this problem with 5.22 (also there) already and now the same happens with 5.25 - stalled/hanging Rosettas. The first time I saw this problem was with Ralph 5.18, then a couple of instances with Rosetta 5.22, then some with Rosetta 5.25 (https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1891#20832). I've seen the problem on Mac OS X and Linux (CentOS 4.3) but never on Windows. When the problem occurs on Linux, I stop BOINC but the Rosetta process remains in the process list. I have to kill it manually before restarting BOINC. |
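The manual cleanup described above (stop BOINC, then kill the leftover Rosetta process) could be scripted roughly as follows. This is only a sketch: the `rosetta` name prefix and the executable name in the sample are assumptions, the helper names are hypothetical, and `kill_leftovers` should only be run after BOINC itself has been stopped.

```python
import os
import signal
import subprocess


def find_leftover_pids(ps_output: str, name_prefix: str = "rosetta") -> list:
    """Extract PIDs from `ps -A -o pid,comm` output whose command name
    starts with name_prefix (an assumed prefix, adjust as needed)."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # first line is the header
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        # Some systems print a full path in the command column.
        command = os.path.basename(parts[1].strip())
        if command.startswith(name_prefix):
            pids.append(int(parts[0]))
    return pids


def kill_leftovers(name_prefix: str = "rosetta") -> None:
    """Send SIGTERM to every matching process (Linux / Mac OS X only)."""
    result = subprocess.run(["ps", "-A", "-o", "pid,comm"],
                            capture_output=True, text=True, check=True)
    for pid in find_leftover_pids(result.stdout, name_prefix):
        os.kill(pid, signal.SIGTERM)


# Hypothetical ps output, for illustration only.
sample = """  PID COMMAND
 9940 boinc
 9951 rosetta_5.25_i686-pc-linux-gnu
"""
print(find_leftover_pids(sample))
```

Keeping the parsing separate from the kill step makes it possible to check which PIDs would be affected before sending any signal.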
Ananas Send message Joined: 1 Jan 06 Posts: 232 Credit: 752,471 RAC: 0 |
Maybe this helps the developers, it is the stdout of an endless running one that I am currently trying : stdout belongs to FRA_t384_CASP7_hom001_4_t384_4_1ofgA_IGNORE_THE_REST_866_1060_22 The result seems to be stuck at fraction_done=0.273800 (for nearly 9 hours now) stdout.txt is updated and growing but that's about all that changes. Random seed is 2033319 RAM usage is at 151MB, the box doesn't have much RAM but there's still physical RAM left without swapping as the other results need less. |
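A result that sits at the same `fraction_done` for hours, as described above, can be flagged by comparing periodic readings. The sketch below assumes you already have a list of (time in seconds, fraction_done) samples from some source; how to collect them is left open, and the function names and the 8 hour threshold are illustrative.

```python
def stalled_for_seconds(samples):
    """Seconds for which fraction_done has kept its most recent value.

    samples: list of (time_seconds, fraction_done), oldest first.
    """
    if not samples:
        return 0.0
    last_time, last_value = samples[-1]
    since = last_time
    # Walk backwards until the reported fraction differs from the latest one.
    for when, value in reversed(samples):
        if value != last_value:
            break
        since = when
    return last_time - since


def looks_stalled(samples, threshold_hours: float = 8.0) -> bool:
    """True if progress has been flat for at least threshold_hours."""
    return stalled_for_seconds(samples) >= threshold_hours * 3600


readings = [(0, 0.10), (3600, 0.2738), (7200, 0.2738), (36000, 0.2738)]
print(stalled_for_seconds(readings) / 3600)  # hours without change
print(looks_stalled(readings))
```

A flat reading alone does not prove a hang, since a single long model can also leave the percentage unchanged for a while, so a flag like this is a prompt to look closer rather than a reason to abort.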
Pepo Send message Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0 |
I noticed that the running Rosetta app (result 29859752) ceased to exit and Boinc did not start any other app for days....... The CPU time stayed at the same 0:59:20 (probably the 1 hour switch point) for the whole night, and no other app was started in the meantime, although one should have been. Aborted. Another very similar WU is already overdue, but I'll let it run to see whether it succeeds. Peter |
Send message Joined: 11 Oct 05 Posts: 153 Credit: 4,387,904 RAC: 23 |
Have also noticed a lot of work units that just stop. Boinc Manager says they are running but no counter is moving, either "cpu time" or "to completion". Suspending and resuming does not work. Stopping and restarting Boinc Manager does not work, a reboot seems to have gotten the work units going again. Only had one on the Windows XP machine but up to 8 at once on the Linux machines, all AMD processors. My Intel Windows machine has had no problem so far. Have had to abort one that would not move on Linux machine. This has only been happening with 5.25. |
Send message Joined: 11 Oct 05 Posts: 153 Credit: 4,387,904 RAC: 23 |
Have also noticed a lot of work units that just stop. Boinc Manager says they are running but no counter is moving, either "cpu time" or "to completion". Suspending and resuming does not work. Stopping and restarting Boinc Manager does not work, a reboot seems to have gotten the work units going again. Only had one on the Windows XP machine but up to 8 at once on the Linux machines, all AMD processors. My Intel Windows machine has had no problem so far. Have had to abort one that would not move on Linux machine. A follow-up on these stopped and restarted work units: 3 of the 4 restarted units errored out at the same time (I happened to be watching the screen at the time), giving back "unrecoverable error". This happened when another project's WU finished and BOINC switched to start another WU; this seems to have caused the 3 Rosetta WUs to switch as well, but instead of checkpointing they all just failed. The work units are: t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom013__1022_3203_0 (process exited with code 131 (0x83)) (SIGSEGV: segmentation violation) t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom015__1022_3194_0 (process exited with code 131 (0x83)) (SIGSEGV: segmentation violation) t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom010__1022_3198_0 (process exited with code 131 (0x83)) (SIGSEGV: segmentation violation) I have had at least 7 WUs fail with the same error since 31/7. |
Send message Joined: 31 May 06 Posts: 33 Credit: 97,311 RAC: 0 |
Got another error 131 (0x83): Wed 02 Aug 2006 10:09:39 PM CEST|rosetta@home|Unrecoverable error for result FRA_t384_CASP7_hom001_4_t384_4_1ofgA_IGNORE_THE_REST_321_1060_1_0 (process exited with code 131 (0x83)) BOINC 5.4.9 Linux 2.6.8-3-686-smp #1 SMP Sat Jul 15 08:52:57 UTC 2006 i686 GNU/Linux HTH, Alex. "I am tired of all this sort of thing called science here... We have spent millions in that sort of thing for the last few years, and it is time it should be stopped." -- Simon Cameron, U.S. Senator, on the Smithsonian Institute, 1901. |
Send message Joined: 11 Oct 05 Posts: 153 Credit: 4,387,904 RAC: 23 |
After checking with another WU that has stopped on my AMD Opteron Linux machine, I too have noticed that when the WU stops it does not switch to another project's WU after due time but stays locked to the Rosetta WU, with the Status showing "running" but nothing happening. So far a reboot is the only way to get them moving again, but I don't plan on doing that every time a WU locks up. Will probably just abort. |
Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
I noticed this on a box that was crunching only Rosetta. Sorry, I wasn't on the box long enough to get details. I'd told it to crunch 1hr WUs, and it was doing fine, but then it hit one that ran overnight, and in the morning it showed 100% but was still crunching it. I didn't wait long, but never saw the steps increase, and it had other WUs on deck but never began them. It was this host; probably the next WU (in time-issued order) was the one it was hung on, so I guess that would make it this WU: FRA_t386_CASP7_hom001_4_t386_4_2f6sA_IGNORE_THE_REST_121_1061_32_0 Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Send message Joined: 11 Oct 05 Posts: 153 Credit: 4,387,904 RAC: 23 |
Have had another 10 WU's fail with the same error code; (process exited with code 131 (0x83)) (SIGSEGV: segmentation violation) All 10 WU's working on 2 Linux machines. t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom015__1022_3194_0 t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom006__1022_3195_0 t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom010__1022_3198_0 t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom013__1022_3202_0 t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom013__1022_3203_0 t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom017__1022_3203_0 t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom002__1022_3204_0 t382__CASP7_ABRELAX_SAVE_ALL_OUT_hom001__1012_80239_0 t382__CASP7_ABRELAX_SAVE_ALL_OUT_hom001__1012_80282_0 t382__CASP7_ABRELAX_SAVE_ALL_OUT_hom001__1012_80297_0 t347__CASP7_ASSEMBLEABRELAX_SAVE_ALL_OUT_new6to205hom001__991_3157_1 |
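To see whether the failures above cluster on particular targets, the work unit names can be grouped by their leading target ID. A quick sketch, assuming the target is the text before the first underscore (true for the `t372__...` style names listed here, but not for the `FRA_...` names seen elsewhere in the thread); `target_of` is a hypothetical helper:

```python
from collections import Counter


def target_of(wu_name: str) -> str:
    """Leading target ID of a work unit name, e.g. 't372'."""
    return wu_name.split("_", 1)[0]


failed = [
    "t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom015__1022_3194_0",
    "t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom006__1022_3195_0",
    "t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom010__1022_3198_0",
    "t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom013__1022_3202_0",
    "t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom013__1022_3203_0",
    "t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom017__1022_3203_0",
    "t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom002__1022_3204_0",
    "t382__CASP7_ABRELAX_SAVE_ALL_OUT_hom001__1012_80239_0",
    "t382__CASP7_ABRELAX_SAVE_ALL_OUT_hom001__1012_80282_0",
    "t382__CASP7_ABRELAX_SAVE_ALL_OUT_hom001__1012_80297_0",
    "t347__CASP7_ASSEMBLEABRELAX_SAVE_ALL_OUT_new6to205hom001__991_3157_1",
]

counts = Counter(target_of(name) for name in failed)
for target, count in counts.most_common():
    print(target, count)
print(sum(counts.values()))  # number of names in the list as posted
```

A per-target count like this is a compact way to report a batch of failures to the developers.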
RosettaMac Send message Joined: 16 Jul 06 Posts: 2 Credit: 1,053 RAC: 0 |
The work unit FRA_t384_CASP7_hom001_4_t384_4_1ofgA_IGNORE_THE_REST_724_1060_27 has been running for about 12 hours and shows progress at 1.27 percent. Is this normal or should I abort? |
RosettaMac Send message Joined: 16 Jul 06 Posts: 2 Credit: 1,053 RAC: 0 |
The work unit FRA_t384_CASP7_hom001_4_t384_4_1ofgA_IGNORE_THE_REST_724_1060_27 has been running for about 12 hours and shows progress at 1.27 percent. Is this normal or should I abort? Never mind...just minutes after I posted the above, the work unit perversely finished. I'm new to this and never had one run that long. |
Send message Joined: 2 Nov 05 Posts: 258 Credit: 4,496,604 RAC: 2,255 |
The work unit FRA_t384_CASP7_hom001_4_t384_4_1ofgA_IGNORE_THE_REST_724_1060_27 has been running for about 12 hours and shows progress at 1.27 percent. Is this normal or should I abort? Welcome to the forums. You can choose to run jobs up to 1 day. |
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
RosettaMac's results seem to hover around the 10800 second mark, so he's probably using the default 3 hour time setting. That's an impressive time for a single decoy. If someone sits there and watches the client at 1.xx% and considers it stuck, chances are it's working on the first model/decoy, i.e. it would be ticking right along if the client figured it could produce 100-300 models a day when you saw the 1.xx% statement. :) Congratulations on sticking it out and finishing that WU. If you're tempted to kill off a job in the future, you're supposed to be able to view the graphics to see the model moving and changing. |
Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Here are some references describing the completion % and the requirement to run at least one complete model: Progress % not advancing; Time to completion going up; adjustable work units; FAQ. Welcome to Rosetta!...Mac :) Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
©2025 University of Washington
https://www.bakerlab.org