Message boards : Number crunching : Report problems with Rosetta version 5.32
Author | Message |
---|---|
Keith Akins Send message Joined: 22 Oct 05 Posts: 176 Credit: 71,779 RAC: 0 |
Hm. I got a "Read Access Violation" for: 1hz6A_BOINC_NEWRELAXFLAGS_ABRELAX_SAVE_ALL_OUT__1275_27001_0. It seems to be happening once or twice a day. |
Send message Joined: 22 May 06 Posts: 15 Credit: 1,424,082 RAC: 0 |
|
Sam Miorelli Send message Joined: 16 Feb 06 Posts: 7 Credit: 1,303,044 RAC: 0 |
I've just started running Rosetta on an Athlon 64 X2 4200+ (not overclocked), and while the other projects seem to be going OK on it, I've already had a Rosetta WU crash. I get the Windows process dump reporting message when this happens, so I believe it is occurring while the screensaver is running. I had a similar problem on a P4 3 GHz Prescott machine over the summer that eventually resulted in me no longer running Rosetta on it. The exit code from BOINC is below. Does anyone know what caused this error? 10/20/2006 2:14:33 PM|rosetta@home|Unrecoverable error for result 1hz6A_BOINC_NATIVEJUMPS_CLOSE_CHAINBREAKS_VARY_BOND_ANGLES_SAVE_ALL_OUT__1273_41187_0 ( - exit code 1073807364 (0x40010004)) |
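BOINC prints Windows exit codes as signed decimal numbers followed by the hex form. A minimal sketch of how the two relate is below; the symbolic names come from Microsoft's NTSTATUS definitions, the helper name `decode_exit_code` is made up for illustration, and only the two codes that appear in this thread are listed.

```python
# Symbolic names for the two exit codes reported in this thread.
# The table is deliberately tiny; it is not a full NTSTATUS list.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0x40010004: "DBG_TERMINATE_PROCESS",
}


def decode_exit_code(code):
    """Return (hex string, symbolic name) for a BOINC-reported exit code.

    BOINC may print the code as a signed 32-bit integer, so it is
    masked to an unsigned 32-bit value before the lookup.
    """
    unsigned = code & 0xFFFFFFFF
    name = KNOWN_STATUS.get(unsigned, "unknown")
    return "0x{:08x}".format(unsigned), name


if __name__ == "__main__":
    for code in (1073807364, -1073741819):
        print(code, *decode_exit_code(code))
```

A name only says how the process ended, not why. It may still help when sorting reports into "crashed" and "was terminated from outside".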
Keith Akins Send message Joined: 22 Oct 05 Posts: 176 Credit: 71,779 RAC: 0 |
DOC_2PTC_pose_u_pert_bbmin_from_short_relax_1290_187_0 <core_client_version>5.4.9</core_client_version> <stderr_txt> # cpu_run_time_pref: 28800 # random seed: 1931404 ********************************************************************** Rosetta score is stuck or going too long. Watchdog is ending the run! Stuck at score -7.93815 for 3600 seconds ********************************************************************** GZIP SILENT FILE: .ee2PTC.out </stderr_txt> 1dtj__BOINC_NEWRELAXFLAGS_WOBBLECCD_ABRELAX_SAVE_ALL_OUT__1285_6677_0 <core_client_version>5.4.9</core_client_version> <stderr_txt> # cpu_run_time_pref: 28800 # random seed: 2428124 ********************************************************************** Rosetta score is stuck or going too long. Watchdog is ending the run! Stuck at score -1.32255 for 3600 seconds ********************************************************************** GZIP SILENT FILE: .xx1dtj.out </stderr_txt> |
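For anyone collecting several of these watchdog reports, the stuck score and duration can be pulled out of the stderr text with a regular expression. This is only a sketch: the pattern is based on the wording of the messages quoted in this thread, and `parse_watchdog` is a hypothetical helper, not part of BOINC or Rosetta.

```python
import re

# Matches lines such as "Stuck at score -7.93815 for 3600 seconds".
STUCK_RE = re.compile(r"Stuck at score (-?\d+(?:\.\d+)?) for (\d+) seconds")


def parse_watchdog(stderr_text):
    """Return a list of (score, seconds) pairs found in the stderr text."""
    return [(float(score), int(secs)) for score, secs in STUCK_RE.findall(stderr_text)]


if __name__ == "__main__":
    sample = (
        "Rosetta score is stuck or going too long. Watchdog is ending the run! "
        "Stuck at score -7.93815 for 3600 seconds"
    )
    print(parse_watchdog(sample))
```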
Send message Joined: 25 Mar 06 Posts: 71 Credit: 1,630,458 RAC: 0 |
This one seemed fine at first but has an error and no credit granted. 42988975 <core_client_version>5.4.9</core_client_version> <stderr_txt> # random seed: 2052132 # cpu_run_time_pref: 14400 WARNING! error deleting file .aa1t4o.out ====================================================== DONE :: 1 starting structures built 17 (nstruct) times This process generated 17 decoys from 17 attempts 0 starting pdbs were skipped ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... </stderr_txt> Validate state Workunit error - check skipped Claimed credit 31.7338913214193 Granted credit 0 application version 5.32 |
Send message Joined: 15 Jul 06 Posts: 76 Credit: 5,263,150 RAC: 0 |
I don't know if this is BOINC or Rosetta but Rosetta is the only project I'm working on, and it's been crashing like a NASCAR driver for the last few days. Just today I've had 3 crashes. Here's my system: 10/21/2006 8:37:27 AM||Starting BOINC client version 5.4.11 for windows_intelx86 10/21/2006 8:37:27 AM||libcurl/7.15.3 OpenSSL/0.9.8a zlib/1.2.3 10/21/2006 8:37:27 AM||Data directory: C:\Program Files\BOINC 10/21/2006 8:37:27 AM||Processor: 2 GenuineIntel Intel(R) Pentium(R) 4 CPU 3.20GHz 10/21/2006 8:37:27 AM||Memory: 1022.09 MB physical, 2.40 GB virtual 10/21/2006 8:37:27 AM||Disk: 145.27 GB total, 122.85 GB free 10/21/2006 8:37:27 AM|rosetta@home|URL: https://boinc.bakerlab.org/rosetta/; Computer ID: 272841; location: home; project prefs: default 10/21/2006 8:37:27 AM||No general preferences found - using BOINC defaults 10/21/2006 8:37:27 AM||Local control only allowed 10/21/2006 8:37:27 AM||Listening on port 31416 So far I've had 3 crashes just today; here are the BOINC log errors and Windows event log entries: 10/21/2006 11:21:30 AM|rosetta@home|Unrecoverable error for result BENCH_ABRELAX_SAVE_ALL_OUT_4ubpA_BARCODE_R72_filters_1292_701_0 ( - exit code -1073741819 (0xc0000005)) (No Windows error) ============================================= 10/21/2006 1:19:58 PM|rosetta@home|Unrecoverable error for result 1b72__LARS_ABRELAX_PAIR5_BARCODE__1294_672_0 ( - exit code -1073741819 (0xc0000005)) Event Type: Error Event Source: Application Error Event Category: None Event ID: 1001 Date: 10/21/2006 Time: 1:19:57 PM User: N/A Computer: KAREN_8400 Description: Fault bucket 334968245. For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp. Data: 0000: 42 75 63 6b 65 74 3a 20 Bucket: 0008: 33 33 34 39 36 38 32 34 33496824 0010: 35 0d 0a 5.. 
======================================================= 10/21/2006 4:32:35 PM|rosetta@home|Unrecoverable error for result 1r69__BOINC_NEWRELAXFLAGS_DOUBLEFARLXCYCLES_ABRELAX_SAVE_ALL_OUT__1287_6053_0 ( - exit code -1073741819 (0xc0000005)) Event Type: Error Event Source: Application Error Event Category: None Event ID: 1001 Date: 10/21/2006 Time: 4:32:35 PM User: N/A Computer: KAREN_8400 Description: Fault bucket 335025642. For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp. Data: 0000: 42 75 63 6b 65 74 3a 20 Bucket: 0008: 33 33 35 30 32 35 36 34 33502564 0010: 32 0d 0a 2.. ========================================================== I also got 2 Windows event log errors for which I have no log entry in BOINC: ============ 1 ============ Event Type: Error Event Source: Application Error Event Category: None Event ID: 1000 Date: 10/21/2006 Time: 12:39:05 PM User: N/A Computer: KAREN_8400 Description: Faulting application rosetta_5.32_windows_intelx86.exe, version 0.0.0.0, faulting module rosetta_5.32_windows_intelx86.exe, version 0.0.0.0, fault address 0x0036d5d2. For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp. Data: 0000: 41 70 70 6c 69 63 61 74 Applicat 0008: 69 6f 6e 20 46 61 69 6c ion Fail 0010: 75 72 65 20 20 72 6f 73 ure ros 0018: 65 74 74 61 5f 35 2e 33 etta_5.3 0020: 32 5f 77 69 6e 64 6f 77 2_window 0028: 73 5f 69 6e 74 65 6c 78 s_intelx 0030: 38 36 2e 65 78 65 20 30 86.exe 0 0038: 2e 30 2e 30 2e 30 20 69 .0.0.0 i 0040: 6e 20 72 6f 73 65 74 74 n rosett 0048: 61 5f 35 2e 33 32 5f 77 a_5.32_w 0050: 69 6e 64 6f 77 73 5f 69 indows_i 0058: 6e 74 65 6c 78 38 36 2e ntelx86. 0060: 65 78 65 20 30 2e 30 2e exe 0.0. 0068: 30 2e 30 20 61 74 20 6f 0.0 at o 0070: 66 66 73 65 74 20 30 30 ffset 00 0078: 33 36 64 35 64 32 0d 0a 36d5d2.. 
=================== 2 ========================== Event Type: Error Event Source: Application Error Event Category: None Event ID: 1000 Date: 10/21/2006 Time: 3:26:13 PM User: N/A Computer: KAREN_8400 Description: Faulting application rosetta_5.32_windows_intelx86.exe, version 0.0.0.0, faulting module rosetta_5.32_windows_intelx86.exe, version 0.0.0.0, fault address 0x0036cf47. For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp. Data: 0000: 41 70 70 6c 69 63 61 74 Applicat 0008: 69 6f 6e 20 46 61 69 6c ion Fail 0010: 75 72 65 20 20 72 6f 73 ure ros 0018: 65 74 74 61 5f 35 2e 33 etta_5.3 0020: 32 5f 77 69 6e 64 6f 77 2_window 0028: 73 5f 69 6e 74 65 6c 78 s_intelx 0030: 38 36 2e 65 78 65 20 30 86.exe 0 0038: 2e 30 2e 30 2e 30 20 69 .0.0.0 i 0040: 6e 20 72 6f 73 65 74 74 n rosett 0048: 61 5f 35 2e 33 32 5f 77 a_5.32_w 0050: 69 6e 64 6f 77 73 5f 69 indows_i 0058: 6e 74 65 6c 78 38 36 2e ntelx86. 0060: 65 78 65 20 30 2e 30 2e exe 0.0. 0068: 30 2e 30 20 61 74 20 6f 0.0 at o 0070: 66 66 73 65 74 20 30 30 ffset 00 0078: 33 36 63 66 34 37 0d 0a 36cf47.. ================================================ I hope this information will help someone debug this. Much of the other error information I've seen in my (incomplete) glance through the threads seems to be flavors of Unix. --hedera Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic. ![]() |
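The "Data:" sections in the event log entries above are plain ASCII shown as hex bytes, so they repeat what the description already says. A small sketch, assuming only the byte values are passed in (the offsets such as `0000:` and the ASCII column on the right are left out); `decode_event_data` is a name invented for this example.

```python
def decode_event_data(hex_bytes):
    """Decode space-separated hex byte values into ASCII text."""
    return bytes.fromhex(hex_bytes).decode("ascii")


if __name__ == "__main__":
    # Byte values copied from the first Event ID 1001 entry above.
    data = "42 75 63 6b 65 74 3a 20 33 33 34 39 36 38 32 34 35 0d 0a"
    print(repr(decode_event_data(data)))  # prints 'Bucket: 334968245\r\n'
```

The decoded text matches the "Fault bucket 334968245" line in the description, so the hex dump carries no extra debugging information beyond the bucket number.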
Keith Akins Send message Joined: 22 Oct 05 Posts: 176 Credit: 71,779 RAC: 0 |
FRA_t380_NEWFLAGS_hom001_4_t380_4_2fhqA_IGNORE_THE_REST_162_1296_154_0 <core_client_version>5.4.9</core_client_version> <stderr_txt> # cpu_run_time_pref: 28800 # random seed: 1675502 ********************************************************************** Rosetta score is stuck or going too long. Watchdog is ending the run! Stuck at score 4.843 for 3600 seconds ********************************************************************** GZIP SILENT FILE: .aat380.out </stderr_txt> |
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=43235925 https://boinc.bakerlab.org/rosetta/result.php?resultid=43309102 https://boinc.bakerlab.org/rosetta/result.php?resultid=43333974 https://boinc.bakerlab.org/rosetta/result.php?resultid=43586861 Are these errors pointing to Rosetta client issues, WU issues, or possible client software/hardware issues? |
Keith Akins Send message Joined: 22 Oct 05 Posts: 176 Credit: 71,779 RAC: 0 |
Hm. One memory read access violation, two watchdog shutdowns and one without any watchdog or debug info. Maybe we'll get some answers this week. At least you got credit for your crashed WUs. I've got crashed WUs going back three to four days without any granted credit. Maybe the server doesn't like me. |
Mike Gelvin Send message Joined: 7 Oct 05 Posts: 65 Credit: 10,612,039 RAC: 0 |
Running 5.32. Workunit: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=38429684 It has frozen at 46 minutes of runtime. I suspect this occurred 27 hours ago, as my system idle time is up to 27 hours now. I have shut BOINC down and restarted. The same work unit is now running again from 0. |
Mike Gelvin Send message Joined: 7 Oct 05 Posts: 65 Credit: 10,612,039 RAC: 0 |
Running 5.32. The result errored out after about 12 minutes. The dump is available on the result page. An interesting thing to note is that when it "hung" it dumped with an error of LoadLibraryA(srcsrv.dll): GetLastError = 126, followed by an access violation. Then it hung. So there are 2 dumps in the file. I would have thought they would be a bit more aggressive on Ralph in taking care of the problems posted in this thread. I don't like these kinds of problems on my production machines, but I do have a machine that Ralph runs on to help ferret out these problems. |
Keith Akins Send message Joined: 22 Oct 05 Posts: 176 Credit: 71,779 RAC: 0 |
Result ID: 43550168 FRA_t369_NEWFLAGS_hom001_4_t369_4_1rxqA_IGNORE_THE_REST_131_1302_9_0 <core_client_version>5.4.9</core_client_version> <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 28800 # random seed: 1587362 Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x0076A524 read attempt to address 0x00000011 |
Keith Akins Send message Joined: 22 Oct 05 Posts: 176 Credit: 71,779 RAC: 0 |
Oh, I've had to reboot five times in the last three days. |
Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
We're of course concerned about some of the reports below of machines that constantly error out. We did check all these workunits on ralph -- the error rates are pretty low there. Even weirder, the error rates here on rosetta@home are pretty low too! It's possible that the next update (to 5.34) will help; please keep posting if these problems keep occurring. Otherwise, my best advice is to help us out by running on ralph -- and to avoid futzing around too much with the "show graphics" window. There are obviously issues with turning on and off graphics, and the BOINC developers are thinking of ways to fix them. running 5.32 |
Keith Akins Send message Joined: 22 Oct 05 Posts: 176 Credit: 71,779 RAC: 0 |
OK, I'm biting on the RALPH thing. If I attach, would you prefer me to run XP or Linux? |
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
One of the systems having continuous errors lately ended up being a problem with bad RAM. Perhaps others having a high failure rate could test their systems with memtest86+ from http://www.memtest.org/. I remember some of the errors that popped up with Distributed Folding actually telling us to test our systems with memtest86 to verify that our memory was okay. |
Send message Joined: 11 Oct 05 Posts: 153 Credit: 4,350,980 RAC: 4,204 |
Have been getting lock-ups for a couple of weeks now. Originally caused by Ralph, then moved across to Rosetta. It locks up on the screen saver and I am unable to get back to the main programme. Getting this error: ********************************************************************** Rosetta score is stuck or going too long. Watchdog is ending the run! Stuck at score 0.682356 for 3600 seconds ********************************************************************** GZIP SILENT FILE: .dd1BVK.out The stuck score varies: https://boinc.bakerlab.org/rosetta/result.php?resultid=42889001 https://boinc.bakerlab.org/rosetta/result.php?resultid=42888956 https://boinc.bakerlab.org/rosetta/result.php?resultid=42230633 https://boinc.bakerlab.org/rosetta/result.php?resultid=42230590 https://boinc.bakerlab.org/rosetta/result.php?resultid=42230539 https://boinc.bakerlab.org/rosetta/result.php?resultid=42230545 Have also had compute errors of "exit code -1073741819" and "Unhandled Exception Detected: reason Access Violation (0xc0000005)": at address 0x0076CF20 read attempt to address 0x00000011 on result id 42230652; at address 0x0076CE15 read attempt to address 0x000000A on result id 42230637; at address 0x0076D4FD read attempt to address 0x00000017 on result id 42230634; at address 0x0076D514 read attempt to address 0x00000011 on result id 42230565; at address 0x0076D4FD read attempt to address 0x00000017 on result id 42230423. One other curious result that I received was classed as successful but only did 1 decoy (my settings are for 6 hours) and returned 3.10 Cobblestones, which seems a bit low in any book. The result is https://boinc.bakerlab.org/rosetta/result.php?resultid=42230653. The lock-ups often require a reboot to get things going again. |
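One thing the access violation lines above have in common is that every "read attempt" targets a very small address. Reads that land in the first page of memory are often what a null pointer plus a small field offset looks like, though that is only a heuristic and not a diagnosis of the Rosetta code. A sketch of how the reports could be checked, with the 4 KiB page size and the helper names being assumptions made for illustration:

```python
import re

# Matches "at address 0x... read attempt to address 0x..." as printed above.
AV_RE = re.compile(
    r"at address (0x[0-9A-Fa-f]+) read attempt to address (0x[0-9A-Fa-f]+)"
)

PAGE_SIZE = 0x1000  # assumption: 4 KiB pages, first page never mapped


def read_targets(text):
    """Return (instruction address, target address) pairs as integers."""
    return [(int(pc, 16), int(target, 16)) for pc, target in AV_RE.findall(text)]


def all_near_null(text):
    """True if at least one report was found and every target is in page 0."""
    pairs = read_targets(text)
    return bool(pairs) and all(target < PAGE_SIZE for _, target in pairs)


if __name__ == "__main__":
    report = (
        "at address 0x0076CF20 read attempt to address 0x00000011 "
        "at address 0x0076CE15 read attempt to address 0x000000A "
        "at address 0x0076D4FD read attempt to address 0x00000017"
    )
    print(read_targets(report))
    print(all_near_null(report))
```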
Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
The watchdog is set to trigger at 4x the preferred run time. On a 667 MHz box this means that the FRA_... WUs error out when run with a 1 hr pref, for example this and that task. Maybe it would be more useful to have the watchdog trigger at a minimum of (say) 8 hrs, or 4x pref, whichever is greater? If a normally running decoy of this series really does need 5 or 6 hours to complete on some boxes, it is not appropriate to have the watchdog kill it somewhere between 4 and 5 hours when it would probably have run OK. User workaround: from the user end, the workaround is not to use a pref lower than 2 hrs on a box with a clock speed less than (say) 1 GHz, and not to use a pref less than 3 hrs on a box with a clock speed of 667 MHz or less. Anyone illicitly using a box slower than the Rosetta recommended minimum of 500 MHz should use an even longer preferred run time. River~~ |
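The difference between the rule described in the post and the suggested one can be put in a few lines. This is a sketch of the arithmetic only; the function names are made up and nothing here reflects how the Rosetta watchdog is actually coded.

```python
HOUR = 3600  # seconds


def current_timeout(pref_seconds):
    """Watchdog limit as described in the post: 4x the preferred run time."""
    return 4 * pref_seconds


def proposed_timeout(pref_seconds, floor_seconds=8 * HOUR):
    """Suggested limit: 4x the preferred run time, but never below a floor."""
    return max(floor_seconds, 4 * pref_seconds)


if __name__ == "__main__":
    for pref_hours in (1, 2, 3, 8):
        pref = pref_hours * HOUR
        print(
            pref_hours,
            current_timeout(pref) // HOUR,
            proposed_timeout(pref) // HOUR,
        )
```

With a 1 hr pref the current rule gives 4 hours and the suggested one gives 8, so a decoy needing 5 or 6 hours would survive. From a 2 hr pref upward the two rules agree.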
TLAF Send message Joined: 17 Oct 06 Posts: 2 Credit: 2,535,507 RAC: 0 |
With regards to this result: https://boinc.bakerlab.org/rosetta/result.php?resultid=43474514 I had the graphics window open for about 30 seconds before the WU failed. Now that may be purely coincidence but with no problems on this CPU before (and having never opened the graphics window before) I find that rather unlikely. Hope that helps. N.B. This is a repost of https://boinc.bakerlab.org/rosetta/forum_thread.php?id=2473 |
Send message Joined: 11 Oct 05 Posts: 153 Credit: 4,350,980 RAC: 4,204 |
The freezing of the Rosetta and Ralph work units is definitely a screensaver problem. I have Rosetta running on 7 machines; all bar 2 do not have graphics (2 are Linux and 3 are XP installed as services, 2 are XP installed at default user settings). The 2 machines that use graphics are the only 2 machines to have any problems, whether it be the WU becoming stuck or returning some 'access violation' or 'exit code' with no cause. These are a few more that stuck then errored out: https://boinc.bakerlab.org/rosetta/result.php?resultid=43600036 https://boinc.bakerlab.org/rosetta/result.php?resultid=42883408 https://boinc.bakerlab.org/rosetta/result.php?resultid=42889015 https://boinc.bakerlab.org/rosetta/result.php?resultid=42888958 These 2 had access violations: at address 0x0076D4FD read attempt to address 0x00000012 on result id 43369182; at address 0x0076D507 read attempt to address 0x00000011 on result id 42889050. And these 2 came up as invalid with no real error, just 'exit code 1073807364 (0x40010004)': https://boinc.bakerlab.org/rosetta/result.php?resultid=42888965 https://boinc.bakerlab.org/rosetta/result.php?resultid=42888964 Hope this can help, as it is limiting my output when the screen stops doing anything and you find that the CPUs are not doing anything either. On one machine, when this happened on 19/10, the computer then did nothing until I came back from a break on 24/10: 5 days of lost production. |