Message boards : Number crunching : Report Problems with Rosetta Version 5.25
Author | Message |
---|---|
Ethan Volunteer moderator Joined: 22 Aug 05 Posts: 286 Credit: 9,304,700 RAC: 0 |
The scientists are looking into these errors. Let's wait and see what they're able to do. |
Saenger Joined: 19 Sep 05 Posts: 271 Credit: 824,883 RAC: 0 |
So I think the labelling should be changed, as it's also possible that a result is really invalid, for example when the hardware is faulty and delivers no useful results. It would be difficult to decide: Is the result invalid because the computer failed? Or is the result invalid because the "routines / parameter combination" used doesn't work? The second is a very useful result for Rosetta. That's right. I don't know how it can be determined, or if it can be at all. If it can't, I prefer the solution with somewhat fewer credits, not necessarily half, but fewer. But if possible, the "useful errors" should definitely get credit, while broken hardware should not. But I will wait and see; it's nothing important. |
Tino Ruiz Joined: 12 Oct 05 Posts: 13 Credit: 397,392 RAC: 0 |
Sigh...looks like I spoke too soon. :-/ I had to abort this unit because it's stuck, again. And leaving the app in memory is not an option for me as I'm attached to 14 projects. Mon 21 Aug 2006 03:24:21 PM AST|rosetta@home|Unrecoverable error for result 1dhn__BOINC_BACKBONE_O_PENALTY_ABRELAX_SAVE_ALL_OUT__1176_735_0 (aborted by user) |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
My opinion is that if the "decoys" are OK you should get credit for them. That sounds like a good idea. At the moment they seem to be using claimed credit, which can be many times higher than what the WU would have gotten if it had been valid. Obviously that needs work. |
Conan Joined: 11 Oct 05 Posts: 153 Credit: 4,387,904 RAC: 23 |
Still getting lots of these errors where the WU hangs or just errors out. The ones that hang say they are running (for hours) but the CPU is idle, so I have to abort them.

The following have the error "process exited with code 131" "SIGSEGV:segmentation violation". The times are where the counters stopped.

- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28811448 (1 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28811405 (1 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28811373 (3.34 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28130018 (1.5 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28130012 (1.5 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28129987 (2 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28129980 (1 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28129979 (2 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28129978 (2 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28129967 (1 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606986 (1 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606985 (2.64 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606960 (1 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606959 (0.5 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606946 (1.87 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606945 (2.74 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606904 (1.85 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606894 (1 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606893 (1 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606884 (1.67 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606869 (1 hr)

These next ones I had to abort, as they just hung with error SIGSEGV; the timers stopped and CPU usage dropped to zero.

- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28811399 (1 hr, error : glibc detected : corrupted double-linked list 0x0aa89e38 : SIGSEGV)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28197646 (1.5 hr, error : glibc detected : corrupted double-linked list 0x09f18228 : SIGSEGV)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606951 (1 hr, error : glibc detected : corrupted double-linked list 0x0b3e0950 : SIGSEGV)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28811398 (1 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28130073 (1.5 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28130028 (1.5 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28129992 (1 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606949 (1 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606947 (1 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606937 (2.66 hr)
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606898 (1.89 hr)

Also had "process got signal 11 : SIGSEGV":

- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28811388 (1 hr)

And process exited with code 1 ERROR::Exit at:fragments.cc line:459 FILE_LOCK::unlock():close failed.:Bad file descriptor

All the above are on my 2 Linux machines. 2 more are on my Windows machine:

- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28228077 (91 sec) "unhandled exception record" Reason : Access Violation (0xc0000005) at address 0x004A4529 read attempt to address 0x00000024
- https://boinc.bakerlab.org/rosetta/workunit.php?wuid=29009370 (1014 sec) 'Incorrect Function. exit code 1 ERROR::Exit at: .dock_structure.cc line:401

A lot of the "code 131" errors seem to happen when the BOINC Manager switches from one project to another. When switching, the WU errors out.

I hope this helps the developers, as it is becoming a nuisance to me and I might have to stop using the Linux machines for Rosetta so they keep doing something useful rather than being stuck on a WU not doing anything. |
Ethan Volunteer moderator Joined: 22 Aug 05 Posts: 286 Credit: 9,304,700 RAC: 0 |
From Fuzzy: I hope the practice continues. If the WU is what is wrong, and it has nothing to do with your system, and you have spent, say, 23 hours of a 24-hour unit working, why should you not get credit? |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
Conan, I've been trying to figure out what might be causing your problem, but I haven't really been able to. One idea, though, is that the version of BOINC you're using might have a bug when used on a multiprocessor machine. If you're interested, here's something you could try. Download the latest recommended version of BOINC to one of your Linux machines. http://boinc.berkeley.edu/download.php?all_platforms=1 Then stop BOINC, install the downloaded BOINC into a NEW directory, start BOINC in that new directory, and attach to Rosetta. This should test for a bad or corrupted BOINC client as well as corruption in the Rosetta directory. |
Whl. Joined: 29 Dec 05 Posts: 203 Credit: 275,802 RAC: 0 |
Carrying on from: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=2162#24232 Whl wrote: Here. Ethan wrote: Note the 2nd half of my message :) Sorry Ethan. Had to go do some stuff there. It's nearly 5:15 am here in Scotland. My team mate is Nite Owl and I don't have all the information on his WU right now. But it probably doesn't matter now anyway, as he has moved all of his machines to WCG Yesterday. |
Ethan Volunteer moderator Joined: 22 Aug 05 Posts: 286 Credit: 9,304,700 RAC: 0 |
he has moved all of his machines to WCG Yesterday. Sorry we couldn't have helped sooner. It's good to know another project that uses Rosetta will benefit from the extra work. |
Conan Joined: 11 Oct 05 Posts: 153 Credit: 4,387,904 RAC: 23 |
Thanks AMD_is_logical, What you say has merit but I don't only run Rosetta on the Linux machines but 4 other projects on one and 5 other projects on the other as well and have no trouble with them. I would have to copy over all the files for the other projects into the new folder so I can keep working, would I not? Possibly just have to rename the folders maybe? The 2 machines in question are: AMD Opteron Dual 848 (2 CPUs) with 2 GB RAM, 2 x 250 GB HD, Linux Fedora Core 3; and AMD Opteron Dual 275 (2 CPUs) with 4 GB RAM, 2 x 250 GB HD, Linux Fedora Core 3. Chips are standard and not overclocked. Would not a corrupted Boinc programme affect the other projects as well? |
tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
Thanks AMD_is_logical, Conan, you can try just upgrading to 5.5.13 without uninstalling your current version: http://boinc.berkeley.edu/download_all.php?platform=linux&version=5.5.13&type=sea That would not reset any of your projects or abort current WUs. Or you can try uninstalling BOINC and reinstalling 5.5.13. I _assume_ it does not reset your projects and WUs either, but as a safety measure you can set all projects to "no new work", crunch your cache empty, uninstall BOINC, and reinstall it. If you are attached to many projects you might also try out BAM, an account manager which allows you to manage your different projects on one single webpage: http://www.boincstats.com/bam/ Btw, recently I had a hanging WU with 0% processor usage as well on my Windows box (first time). It happened after another task utilized the CPU 100%, and when I killed that task via Task Manager the Rosetta task did not kick in properly. I had to restart BOINC in order to get the process running again. |
Joined: 30 Apr 06 Posts: 115 Credit: 1,307,916 RAC: 0 |
he has moved all of his machines to WCG Yesterday. Ethan - could you please elaborate on this? Does WCG crunch Rosetta too? If so, cool! Team Starfire World BOINC |
Saenger Joined: 19 Sep 05 Posts: 271 Credit: 824,883 RAC: 0 |
he has moved all of his machines to WCG Yesterday. WCG uses the Rosetta algorithm for Human Proteome Folding. Fight AIDS and Cancer use different applications, afaik. |
tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
he has moved all of his machines to WCG Yesterday. You may want to read the latest journal entry from David Baker. I quote: I'm working tonight on a manuscript with my former graduate student Rich Bonneau on some of the results from HPF1 done on the world community grid. We predicted structures for all the proteins in one of the best studied eukaryotic organisms--the yeast used to make bread and beer, and then integrated these predictions with other experimental data to assign 500 proteins of previously unknown structure to protein structural families. After this is done, we will start working on the report on the structures of human proteins also done in HPF1. These efforts used the low resolution version of rosetta (which is all we had several years ago when the HPF project started); I am of course excited about HPF2 which is using the protocol we have been improving on rosetta@home (I sent Rich and the collaborators at IBM the code last March) and should produce much more accurate models. So yes, they use the Rosetta application as well, although not the latest one. Their goal is different, though: they study a limited set of specific proteins, whereas Rosetta@home tries to improve the overall prediction capabilities of Rosetta, shows what can be achieved in competitions (CASP), and will soon start the HIV research. My understanding is that WCG focuses more on a smooth crunching experience without taking too many risks (updating the application often, not using redundancy (WCG uses a quorum of 3), etc.), whereas Rosetta does science at the very front in different directions, thus taking more risks (WU errors, bugs in new versions, quorum of 1, etc.). |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
What you say has merit but I don't only run Rosetta on the Linux machines but 4 other projects on one and 5 other projects on the other as well and have no trouble with them. I would have to copy over all the files for the other projects into the new folder so I can keep working, would I not? Possibly just have to rename the folders maybe? What I had in mind was a temporary test just to see if Rosetta worked running by itself with a fresh start and the latest recommended BOINC client. If it did, I would then have suggested upgrading the BOINC client in your current BOINC directory. If that failed to help, I would then have suggested suspending the other projects to see whether there was some sort of interaction between the various projects. Or you could skip the test and just upgrade the BOINC client, as tralala suggests. I was a little hesitant about suggesting changes to your main BOINC directory without some evidence that it would help. Would not a corrupted Boinc programme affect the other projects as well? If this bug always showed itself it would have been found long ago, so it must be something subtle. Perhaps it's only seen with a particular version of the BOINC client and a particular version of Rosetta when running on a dual-processor machine. |
Tino Ruiz Joined: 12 Oct 05 Posts: 13 Credit: 397,392 RAC: 0 |
Tue 22 Aug 2006 10:13:26 AM AST|rosetta@home|Unrecoverable error for result FRA_t368_CASPR_hom001_7_t368_7_dec146IGNORE_THE_REST_1_1179_407_0 (aborted by user) Same deal, it keeps getting stuck. I'm on a single core, single processor CPU. :-/ |
Joined: 30 Apr 06 Posts: 115 Credit: 1,307,916 RAC: 0 |
Thank you both for the info Saenger and Tralala! |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
Tue 22 Aug 2006 10:13:26 AM AST|rosetta@home|Unrecoverable error for result FRA_t368_CASPR_hom001_7_t368_7_dec146IGNORE_THE_REST_1_1179_407_0 (aborted by user) Have you run any diagnostics such as Memtest86 and SuperPi ? |
Tino Ruiz Joined: 12 Oct 05 Posts: 13 Credit: 397,392 RAC: 0 |
Sigh...it's not my PC. Look, every project runs fine, I stress my PC 24/7. Yes I've tried diagnostic tools but they always turn out ok. For the past few weeks a lot of people have complained about this "stuck" unit issue, so I *know* I'm not alone. Something is broken in the Linux version for sure. |
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
Is it possible to set up a second profile (i.e. Home and Work instead of just default) here on BOINC, and run one of the two Linux machines with 100% Rosetta with WUs set to have a 1-hour time limit? Run for a day to prove that Rosetta is fine on your system as the only app. Point A. If it passes, then add 1 more BOINC project to the mix (2-hour switch, don't leave in memory). Run for a day. If adding a project fails, turn on "leave in memory" and try again. If it still fails with "leave in memory" on, report findings. If adding a project passes, add a couple more BOINC projects to the mix. Go to Point A. Or add Ralph, and see if Ralph will pass back enough information to track down the problem. |
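
For anyone who wants to follow the step-by-step isolation test BennyRop describes, here is a rough Python sketch of the loop. It is only an illustration, not a BOINC tool: `run_trial` is a hypothetical callback that you would have to do by hand (run the listed projects for a day with "leave in memory" on or off, then report whether any WU hung or errored out), and `Finding`, `isolate_conflict`, and the project names are made up for this example. The sketch adds one project at a time rather than "a couple", and applies the same "leave in memory" retry to the Rosetta-only run, which the post does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical callback: run BOINC for a day with these projects attached and
# the given "leave in memory" setting; return True if no WU hung or errored.
TrialFn = Callable[[List[str], bool], bool]


@dataclass
class Finding:
    projects: List[str]             # projects attached when the failure appeared
    fixed_by_leave_in_memory: bool  # True if the retry with the setting on passed


def isolate_conflict(base: str, others: List[str],
                     run_trial: TrialFn) -> Optional[Finding]:
    """Add projects one at a time; return the first failing mix, or None."""
    active: List[str] = []
    for project in [base] + list(others):
        active.append(project)
        # "Point A": run the current mix with "leave in memory" off.
        if run_trial(list(active), False):
            continue  # passed, so add the next project and go round again
        # Failed: turn on "leave in memory" and try the same mix once more.
        fixed = run_trial(list(active), True)
        return Finding(list(active), fixed)
    return None  # every mix passed
```

A finding with `fixed_by_leave_in_memory` set to True would point at the swap-out on project switch, which matches the observation earlier in the thread that "code 131" errors show up when the manager switches projects; a finding with it set to False is the "report findings" case.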
©2025 University of Washington
https://www.bakerlab.org