Message boards : Number crunching : Report Problems with Rosetta Version 5.25
Author | Message |
---|---|
tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
I have been 'trying' to run version 5.25 for over a day now, and have seen that both my Pentium dual core and single core Linux machines are stopping in the middle of the WU. So I put the dual into 'Rosetta only mode' and it has so far processed both WU given it. The single core is doing better with only 2 out of about 7 WU that hung up. The dual core, this is the first time in over a day that one has got past about 68%. So that is only 2 out of about 8 that worked on it. I just wanted to pass this information on to the 'Rosetta team'. I will let the two I have in queue finish then disconnect from the project and check back in a couple months again. Hi kmanley, that is the new "0%-stuck", which seems to affect only Linux machines that crunch multiple projects. It's known and well described by many Linux users (it happens when Rosetta gets swapped out by another project; when it gets swapped in again the CPU is not used, although BOINC reports it as running). So far there are only workarounds: 1. Put your host in Rosetta-only mode, 2. Restart BOINC often. Neither option is convenient; however, you could decide to crunch Rosetta exclusively for a week, then another project and so on. There should be a new application in the coming weeks; whether it solves the problem is another question. |
Tino Ruiz Joined: 12 Oct 05 Posts: 13 Credit: 397,392 RAC: 0 |
It appears to be when the app tries to read the following file: I'm sorry, but where exactly do I put that file? I've looked *everywhere* for a BOINC install directory but couldn't find one. Does anyone know the default directory for a debian-based distro (Xubuntu)? |
Pappa Joined: 4 Aug 06 Posts: 3 Credit: 302,149 RAC: 0 |
Ethan Saenger, noted (not Seti specifically) that even in Seti a -9 "Noisy Workunit" receives credit for the time it ran... This is calculated on the benchmark. Of the machines I am rotating through various projects, I have one that had an error... https://boinc.bakerlab.org/rosetta/result.php?resultid=32537818... I noted the error so that I can remove it from the Cross Project Stats that I am collecting... That specific error you will have to look for in your database, as it is no longer viewable... That is a single error out of what I presume are over 100 returned results https://boinc.bakerlab.org/rosetta/results.php?hostid=284093. So in most cases, unless a machine goes "rogue" and just starts mangling results, in which case I would hope that you have a mechanism for reducing the number of workunits to less than one/day, I would presume that you would hope it was a software glitch... Then giving the User Partial credit. Regards Pappa From Fuzzy: |
Joined: 18 Sep 05 Posts: 655 Credit: 12,080,688 RAC: 3,254 |
This wu would seem to be stuck on my Evesham node. A 2.533GHz P-IV northwood, (no hyperthreading, not overclocked), Windows NT4 SP6a, BOINC 5.2.13. Showing CPU time 08:59:27, Progress 48.74%, To completion 13:47:12. This machine is set to 20 hour wu's. When this wu is "running" nothing changes on BOINC Manager, the System Idle Process is 99% active, and my CPU temperature is a refreshing 42C. Clicking "Show Graphics" does nothing. Suspending it, another project pops into life; with my current STD, it happens to be MCDN. Suspending that so Rosetta is top again, it enters the same state: stuck, no processing. By judicious "Suspend" fiddling, I've established that swapping between Rosetta/Einstein and Rosetta/SIMAP does not alter anything. I do not believe, therefore, that it is a Rosetta/MCDN interaction. The message log looks totally normal: Rosetta suspending (left in memory), and another project resuming, then that project suspending (left in memory) and Rosetta resuming. In the Rosetta projects directory, there are no files flagged as being modified 1st September, (it is 10:00:00 1st September my time as I write), so it is possible it has been in this state since yesterday; there are 5 files showing yesterday as their last changed date. Currently, I have Rosetta suspended so the other projects can get on with their productive work. If there is anything I can do with this one, or any diagnostic info I can obtain, please advise. I will leave it in this state until 18:00:00 my time, (roughly 8 hours); if nothing, then I will abort it as we are going away for the weekend. *** EDIT *** Grammar. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
@adrianxw Restart BOINC and the WU will resume. This is the new "0%-Bug" which appears to happen when switching projects. So far there were only Linux puters reporting it, but it seems Windows machines are in rare cases affected as well. :-( |
Joined: 18 Sep 05 Posts: 655 Credit: 12,080,688 RAC: 3,254 |
I stopped/started BOINC. Once restarted, I removed Rosetta from suspension and forced a scheduling event. Rosetta dropped back to the last checkpoint at 08:47:18 48.70% and started running. I can't say if this is the same fault LINUX is having. If it was a general problem, I'd expect it to be present in roughly the Linux/Windows ratio rather than rare on one system. I have fiddled a lot switching it in and out, and the % complete has never dropped to zero, as it appears to for most who have reported this issue. Maybe it is the same problem but it manifests slightly differently across OS's? That might give a handle to the root cause. Whatever, I hope that adds a clue to the hunt. I'll keep watching it. *** EDIT *** It has now reached 49.92% complete so has passed the point it stopped before. Of course, it was not pre-empted this time. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
I stopped/started BOINC. Once restarted, I removed Rosetta from suspension and forced a scheduling event. Rosetta dropped back to the last checkpoint at 08:47:18 48.70% and started running. All reports I have read so far describe that the CPU-Load goes to 0% while the WU is marked running in BOINC but is actually not advancing. The progress of the WU does not go back to 0% (unless you do a restart before the first checkpoint was written). By naming it "0%-Bug" I was referring to the CPU-Load, not the progress bar. |
Joined: 18 Sep 05 Posts: 655 Credit: 12,080,688 RAC: 3,254 |
Fair enough, that does sound like what I was seeing. In fact, it was not the stationary BOINC Manager that first caught my eye, it was the surprisingly low CPU temperature on the MoBo monitor. I hope they fix that soon; the machine that was showing this is BOINC only, (most of the time, certainly at present; it is, in fact, a backup web server), and I only look at it from time to time. With Rosetta set to 50% CPU quota on it, that is potentially a lot of lost CPU time. *** EDIT *** The wu is now Pre-empted, but is showing 58.34% complete, so is clearly doing something. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
Fair enough, that does sound like what I was seeing. In fact, it was not the stationary BOINC Manager that first caught my eye, it was the surprisingly low CPU temperature on the MoBo monitor. I hope they find the problem. As it is only a sporadic failure it might not be that easy. As a workaround you can let project a crunch for two weeks at 100% and then project b and so on. Quite inconvenient, but it should prevent further such instances. |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
As a workaround you can let project a crunch for two weeks at 100% and then project b and so on. Quite inconvenient, but should prevent further such instances. You could set the "Switch between applications every" time in general preferences to be large enough so that Rosetta WUs complete without being switched. |
terry Joined: 7 Aug 06 Posts: 1 Credit: 22,721 RAC: 0 |
I've got two files that have stalled - that is - though the manager shows them as running, the cpu usage timer doesn't increase. I let them both run for a while to see if they would start to move again - but no change. When I suspended them the manager moved on to the next file and it's been working well since. How do I return them unfinished if indeed that's what is needed - or do I just abort them? |
R.L. Casey Joined: 7 Jun 06 Posts: 91 Credit: 2,728,885 RAC: 0 |
I've got two files that have stalled - that is - though the manager shows them as running, the cpu usage timer doesn't increase. I let them both run for a while to see if they would start to move again - but no change. When I suspended them the manager moved on to the next file and it's been working well since. How do I return them unfinished if indeed that's what is needed - or do I just abort them? You can abort them; each will be noted as an invalid result and the Work Units will be sent out for someone else to crunch. You may want to post the WU numbers to the Report Problems with Rosetta Version 5.25 thread if that's the version you are using. Keep crunching! |
tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
I've got two files that have stalled - that is - though the manager shows them as running, the cpu usage timer doesn't increase. I let them both run for a while to see if they would start to move again - but no change. When I suspended them the manager moved on to the next file and it's been working well since. How do I return them unfinished if indeed that's what is needed - or do I just abort them? This error has been reported repeatedly. The workaround is to restart BOINC. |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi all. Can someone shed some light on these errors? I have never had a problem with any project before. Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x00730D5C read attempt to address 0xFFEAFF62 Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x00730DE9 read attempt to address 0xEFECA35C Windows also put up an error box the first time, not the second. |
Andrew Joined: 7 Mar 06 Posts: 1 Credit: 28,863 RAC: 0 |
Hi Rhiju, This is Andrew from the Kuhlman lab. I've got boinc running on a Mac here, and over the weekend it didn't fetch any jobs. Here are the last several messages from Boinc Manager: Fri Sep 1 12:09:50 2006|rosetta@home|Pausing result NMR_1mzl_CASPR_1_1mzl_1_id_model_13IGNORE_THE_REST_idl_1221_1677_0 (removed from memory) Fri Sep 1 12:11:10 2006||Resuming computation and network activity Fri Sep 1 12:11:10 2006||request_reschedule_cpus: Resuming activities Sat Sep 2 14:26:18 2006||Suspending computation and network activity - running CPU benchmarks Sat Sep 2 14:26:18 2006|rosetta@home|Pausing result NMR_1mzl_CASPR_1_1mzl_1_id_model_13IGNORE_THE_REST_idl_1221_1677_0 (removed from memory) Sat Sep 2 14:26:20 2006||Running CPU benchmarks Sat Sep 2 14:26:28 2006||Failed to stop applications; aborting CPU benchmarks Sat Sep 2 14:26:29 2006||Resuming computation and network activity Sat Sep 2 14:26:29 2006||request_reschedule_cpus: Resuming activities Sat Sep 2 14:26:29 2006||ACTIVE_TASK_SET::check_app_exited(): pid 19632 not found Mon Sep 4 07:26:53 2006||Suspending work fetch because computer is overcommitted. Mon Sep 4 07:26:53 2006||Using earliest-deadline-first scheduling because computer is overcommitted. Tue Sep 5 08:17:26 2006||Suspending computation and network activity - user is active Tue Sep 5 08:17:26 2006|rosetta@home|Pausing result NMR_1mzl_CASPR_1_1mzl_1_id_model_13IGNORE_THE_REST_idl_1221_1677_0 (removed from memory) |
tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
Hi Rhiju, You have a stuck WU which does not advance, probably your CPU-Load is 0%. Check whether this is the case and restart BOINC and the WU will finish. |
Pepo Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0 |
You have a stuck WU which does not advance, probably your CPU-Load is 0%. Check whether this is the case and restart BOINC and the WU will finish. A few days ago it happened on my host 290356 (Linux x86) that after restarting Boinc, the stuck app was left in memory and I had to kill it by hand. (Possibly Boinc lost track of it? A new Rosetta WU was started and ran continuously for 3:59 hours, then Boinc made an attempt to start Seti Beta, but nothing happened and the host was idle, so 3 hours later I restarted the whole Boinc.) Peter |
Pepo Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0 |
Just now exactly the same thing happened. Rosetta was running 89:39.34 from start, then Boinc (5.5.15, it could also be a problem of this alpha version) made an attempt to start Seti Beta. Seti Beta is nowhere, Rosetta is there and sleeping and the machine is idle. After stopping Boinc... 4 rosetta_5.25_i6 processes (probably threads) are still there. And after starting Boinc it launched Seti Beta, while the old Rosetta processes are still sleeping there. XXXX Possibly a coincidence (because of the STD, LTD and shares, thus running Seti Beta most often), but if... Peter |
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
You might want to ask about the status of the Access Violation errors that were reported on Ralph. In this thread. |
tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
@Pepo Is the rosetta process still present if you Exit BOINC (not just stopping)? If there is no BOINC.exe process present but there is a rosetta process, then there is something wrong with BOINC I think, since it should kill all child processes when it exits. |
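
For anyone who would rather not watch BOINC Manager by hand, here is a rough sketch of how the "0%-Bug" symptom from this thread (task marked running, CPU time not advancing) could be checked on a Linux host. It is not part of BOINC or Rosetta. It assumes the Linux `/proc/<pid>/stat` layout, that you have already looked up the PID of the Rosetta process yourself, and that a one-minute sampling interval is long enough; the function names are made up for illustration.

```python
import time


def cpu_ticks_from_stat(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The process name (field 2) is wrapped in parentheses and may contain
    # spaces, so only the text after the last ')' is split into fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    utime, stime = int(rest[11]), int(rest[12])
    return utime + stime


def is_stalled(ticks_before: int, ticks_after: int, min_ticks: int = 1) -> bool:
    """True if the process gained fewer than min_ticks of CPU time."""
    return (ticks_after - ticks_before) < min_ticks


def read_cpu_ticks(pid: int) -> int:
    """Read the current CPU tick count of a process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return cpu_ticks_from_stat(f.read())


def check_pid(pid: int, interval_s: float = 60.0) -> bool:
    """Sample CPU time twice, interval_s apart, and report a stall.

    Hypothetical helper: only meaningful while BOINC shows the task as
    running, since a pre-empted task is expected to use no CPU.
    """
    before = read_cpu_ticks(pid)
    time.sleep(interval_s)
    after = read_cpu_ticks(pid)
    return is_stalled(before, after)
```

If `check_pid` returns True while the task is listed as running, that matches the reports above, and the workaround mentioned in the thread (restart BOINC) would be the next step.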
©2025 University of Washington
https://www.bakerlab.org