Message boards : Number crunching : RAC dropping, BOINC dropping comms
Author | Message |
---|---|
larry1186 Joined: 18 Apr 06 Posts: 7 Credit: 329,257 RAC: 0 |
...one's computer may be asleep for hours or days (3-day weekend?). ...or a four-day weekend... :( I have my computer at work set up and running a CPDN-SA model which was almost done, and I expected it to be finished when I got back from Thanksgiving break on Monday. It lost the connection to localhost at about 7:30 pm Wednesday, a couple of hours after I left for the long weekend. The model finally finished this morning, though... Four days of a 3.06 GHz dual processor sitting idle is painful. Earlier this week I found out that one of my projects had the hostid set to zero. I changed it to what it should be, and the manager hasn't dropped its connection since. We shall see what the weekend brings. Don't get distracted by shiny objects. |
zombie67 [MM] Joined: 11 Feb 06 Posts: 316 Credit: 6,621,003 RAC: 0 |
Wow. Good thing I read this thread. Yep, the same thing is happening to me (boinc dead while boincmgr still running). It started (for me) just after upgrading to Crunch3r's 5.7.5 client (affinity turned on) a couple of weeks ago. Since then, it has happened 15-20 times across 10 machines. I was just about to switch back to 5.4.11, but now I see there is no point. It is happening with 5.4.11 too. Very frustrating! Lots of crunching time down the drain! Reno, NV Team: SETI.USA |
Blainer Joined: 14 Nov 06 Posts: 1 Credit: 1,814,334 RAC: 0 |
I am having BOINC crashing as well, only when Rosetta is downloading. SETI and Einstein do not have any problems. It crashes with the same memory address as the others have reported as well. Running on a Core2Duo, XP Pro SP2, BOINC 5.4.11, and happened with both Rosetta 5.40 and 5.41. Here's the latest dump: *** UNHANDLED EXCEPTION **** Reason: Access Violation (0xc0000005) at address 0x0033B014 read attempt to address 0x00000008 *** Dump of the (offending) thread: *** eax=013dfc98 ebx=00f24118 ecx=00000000 edx=00f241c0 esi=01251b48 edi=00f241c0 eip=0033b014 esp=01defee0 ebp=00fbd4b0 cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010202 ChildEBP RetAddr Args to Child 01defee0 0033adcd 00f241c0 00000000 01251b48 00000015 libcurl!Curl_llist_insert_next+0x5 (c:boincsrcsdkscurllibllist.c:78) FPO: [3,0,0] 01deff00 0032f7b3 00f24118 00f241c0 00000015 00fbd4b0 libcurl!Curl_hash_add+0xb (c:boincsrcsdkscurllibhash.c:165) FPO: [4,0,0] 01deff24 0032fae5 012e2880 01248520 00f93dc8 00000050 libcurl!Curl_cache_addr+0x19 (c:boincsrcsdkscurllibhostip.c:361) FPO: [4,1,0] 01deff48 0032fb52 003c7170 0032fd7c 0122afe8 00000000 libcurl!addrinfo_callback+0x15 (c:boincsrcsdkscurllibhostasyn.c:131) FPO: [0,1,0] 01deff50 0032fd7c 0122afe8 00000000 003c7170 00000000 libcurl!Curl_addrinfo4_callback+0x12 (c:boincsrcsdkscurllibhostasyn.c:161) FPO: [3,0,0] 01deff80 7c349565 00000000 00000000 0012ed08 00b873d8 libcurl!gethostbyname_thread+0x0 (c:boincsrcsdkscurllibhostthre.c:335) FPO: [1,4,0] 01deffb4 7c80b683 00b873d8 00000000 0012ed08 00b873d8 MSVCR71!__endthreadex+0x0 (c:boincsrcsdkscurllibhostthre.c:335) 01deffec 00000000 7c3494f6 00b873d8 00000000 000000c8 kernel32!_BaseThreadStart@8+0x0 (c:boincsrcsdkscurllibhostthre.c:335) Exiting... I hope this is figured out soon. I'm leaving for the weekend, and I'd love to leave the system on for 72 hours of straight processing time, but there's no point if BOINC is probably going to die 3 hours after I leave. :D |
Nothing But Idle Time Joined: 28 Sep 05 Posts: 209 Credit: 139,545 RAC: 0 |
...I hope this is figured out soon. ... We sit here and discuss the problem amongst ourselves; the issue was reported on a BOINC client forum. But is anyone trying to resolve it? Does anyone capable of resolving it consider this a problem at all? Are the right people looking into it -- BOINC? Rosetta? Or are we just hoping for action that isn't coming? |
Marky-UK Joined: 1 Nov 05 Posts: 73 Credit: 1,689,495 RAC: 0 |
I've sent details and one crash log to Rom and should be able to send some more next week. |
Nothing But Idle Time Joined: 28 Sep 05 Posts: 209 Credit: 139,545 RAC: 0 |
I've sent details and one crash log to Rom and should be able to send some more next week. You are proactive and I hope it bears fruit. |
Joined: 14 Mar 06 Posts: 30 Credit: 2,347,485 RAC: 0 |
This is definitely frustrating, since more than half my farm is across town and it keeps happening. |
Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Ed, you mentioned in your other post that you are running 4 projects. I've noticed that when I bring BOINC back up after such a drop, it is ALWAYS downloading files. That implies to me that something occurs during the download which causes a hiccup in the TCP stack on the PC, and causes the manager to lose contact with the boinc.exe that controls everything (which eventually causes everything to shut down). Do you know if there is any pattern in what is being downloaded at the time? Since I always run Rosetta, I always see Rosetta being downloaded. But I was wondering if you find you are just as likely to catch the other projects downloading at the time you restart? It is sometimes hard to catch the transfers, since the screen only refreshes every 5 seconds, and your BOINC Manager generally doesn't open to the transfers tab. If you have a fast connection, or the file left is only a few bytes long (as mine sometimes is), you can miss them. I happen to see them because I set BOINC to only use the network at night when I'm away from the machine. So, anytime I am AT the machine to restart BOINC, it is during the hours BOINC does not use the network... so the transfers that were in progress during the drop are suspended until the next night. I tried the suggestion in this thread to limit BOINC to one connection at a time. As best I can tell, BOINC is ignoring that setting and still using two at a time. So, I still see the BOINC manager dropping the localhost, and setting it to one file at a time did not seem to help the problem. I should also note that now that I've had more occurrences of the problem, I've seen cases where more than one file is in my transfers list as well. In fact, that is how I saw that my setting requesting only one connection at a time was apparently ignored. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
zombie67 [MM] Joined: 11 Feb 06 Posts: 316 Credit: 6,621,003 RAC: 0 |
Some observations: 1) Seems to be happening across several versions of BOINC (5.4.9, 5.4.11, 5.7.5 Crunch3r). But what about OS? I have 6 Macs, none of which have had this problem *at all*. 2) Perhaps it is project related? My Macs crunch SETI only. Perhaps it is a problem with Rosetta? Maybe it started with 5.40 or 5.41? Just a stab in the dark. Can anyone tie the start of the problem to the release of either of those? 3) I recently increased my resource allocation for Rosetta, and noticed the problems happening more frequently. Maybe just coincidence. Reno, NV Team: SETI.USA |
Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Zombie, I'm running three machines on Windows. All have had the problem at one time or another. I ALWAYS see files being transferred when I fire BOINC back up again, and for this reason, I have been of the camp that feels this is either a BOINC or Windows TCP stack problem, not Rosetta. ...But I only crunch Rosetta and Ralph now, so that's why I am hoping Ed can share some of his experience from a setup with 4 projects going. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Colin Smith Joined: 29 Sep 06 Posts: 1 Credit: 894,657 RAC: 0 |
I run 3 projects: Einstein, Rosetta, and Climate Prediction. I have been having problems since November 7 or 8, about the same time as everybody else. After trying to figure out why BOINC wouldn't run anymore, and hearing that all the people who were having difficulties were running Rosetta, I suspended Rosetta to see if it would make a difference. It ran for 2 days without quitting, when before I was lucky if it lasted 4 hours. Just to make sure, I resumed Rosetta again today, and within a couple of hours I found BOINC down again. Based on this, I am pretty sure that something in Rosetta is causing it. |
zombie67 [MM] Joined: 11 Feb 06 Posts: 316 Credit: 6,621,003 RAC: 0 |
Hmmmmmm..... Maybe this is a coincidence, but it's been 36 hours since any of my Windows machines have had another problem. Knock wood. For the last several weeks, I haven't gone more than 8 hours without at least one crapping out. Reno, NV Team: SETI.USA |
Joined: 14 Mar 06 Posts: 30 Credit: 2,347,485 RAC: 0 |
[quote]Ed, you mentioned in your other post that you are running 4 projects. I've noticed that it seems when I bring BOINC back up after such a drop, that it is ALWAYS downloading files....[/quote] I only run Rosetta. New theory is that this is related to hyperthreaded Intel processors, so I am trying limiting max processors to 1. |
zombie67 [MM] Joined: 11 Feb 06 Posts: 316 Credit: 6,621,003 RAC: 0 |
Maybe this is a coincidence, but it's been 36 hours since any of my Windows machines have had another problem. Knock wood. For the last several weeks, I haven't gone more than 8 hours without at least one crapping out. Spoke too soon. Just had one fail. It's an HT machine, if that helps. https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=318945 Reno, NV Team: SETI.USA |
Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
All of mine are HT machines. Ed may be on to something there. Keep us posted. Also, Ed, sorry that I misread the post of yours that I'd quoted. I see now that you were quoting another user's post, and it was that user who had 4 projects. Life is full of details. :( Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Nothing But Idle Time Joined: 28 Sep 05 Posts: 209 Credit: 139,545 RAC: 0 |
Just lost connection to the client a few minutes ago. I selected the Transfers tab before exiting the manager. When I restarted the manager I saw 2 Rosetta files in download status and 10 more Rosetta files in download-pending status. The associated task appears to be 2tif_ETABLE_TEST_ABRELAX_nov19_1411_6818_2. I'm using BOINC 5.4.9, Windows XP, hyperthreaded, 2 active projects with Rosetta getting 60% resource share. But my error rate isn't as high as some people seem to experience. Dump Timestamp : 12/05/06 21:14:25 Dump Timestamp : 11/23/06 23:46:30 Dump Timestamp : 11/22/06 15:47:48 Dump Timestamp : 11/10/06 00:16:06 Dump Timestamp : 11/09/06 16:09:28 Dump Timestamp : 11/03/06 15:26:57 |
Joined: 3 Nov 05 Posts: 1833 Credit: 120,243,449 RAC: 22,685 |
I've had this happen on a machine twice now: Win XP, Athlon XP-M, non-service install, running only Rosetta. All my other machines run a service install, so I've not seen this before. This machine is running BOINC from a network drive, but it's been running like this for about a year now and I've never had a problem with it dropping comms before without the hub or server being powered down (or changes being made to the onboard NIC), neither of which has happened recently. The first time I didn't think anything of it and just restarted BOINC. This morning I started BOINC before I had a good look under all the tabs. If it happens again I'll have a look through the logs... Danny |
Marky-UK Joined: 1 Nov 05 Posts: 73 Credit: 1,689,495 RAC: 0 |
I had another crash overnight, my first since 5.41 was released, which pretty much rules out the long command line as the culprit. I've had the crash on various PC configurations; the only common factor is Rosetta. NT4, Win2000 and WinXP. PIII, P4 and Athlon XP. Single CPU, dual CPU (not dual core) and hyperthreaded P4. |
Nothing But Idle Time Joined: 28 Sep 05 Posts: 209 Credit: 139,545 RAC: 0 |
...I've had the crash on various PCs configurations, the only common factor is Rosetta... If only Rosetta alone were affected by this phenomenon; but when the client stops, so do all other attached projects. I've resorted to filling up my cache and setting Rosetta to "No new tasks": after all the queued Rosetta tasks are done, I report them and get new tasks under my supervision. If the downloads cause a problem, I can remedy it immediately. |
zombie67 [MM] Joined: 11 Feb 06 Posts: 316 Credit: 6,621,003 RAC: 0 |
Went three whole days without it happening, then bang, 3 machines failed overnight. Two were P4 w/ HT, one was an X2 (no HT obviously). These three machines are all Windows, all running 20+ projects, none running as a service. Let me know if there is any further information that would be useful. Reno, NV Team: SETI.USA |
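
For anyone who wants to put numbers on how often the client dies, here is a minimal Python sketch that takes the six "Dump Timestamp" values Nothing But Idle Time posted earlier in this thread and works out the hours between consecutive crashes. It assumes the timestamps are MM/DD/YY HH:MM:SS (which the 11/23/06 entry suggests); the function name `crash_gaps_in_hours` is made up for illustration and is not part of BOINC or any of its tools.

```python
from datetime import datetime

# Dump timestamps copied from the post in this thread.
# Assumed format: MM/DD/YY HH:MM:SS.
DUMP_TIMESTAMPS = [
    "12/05/06 21:14:25",
    "11/23/06 23:46:30",
    "11/22/06 15:47:48",
    "11/10/06 00:16:06",
    "11/09/06 16:09:28",
    "11/03/06 15:26:57",
]


def crash_gaps_in_hours(stamps):
    """Return the hours between consecutive crashes, oldest first.

    Hypothetical helper for illustration; not a BOINC API.
    """
    times = sorted(datetime.strptime(s, "%m/%d/%y %H:%M:%S") for s in stamps)
    return [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    gaps = crash_gaps_in_hours(DUMP_TIMESTAMPS)
    for gap in gaps:
        print(f"{gap:8.1f} hours between crashes")
    print(f"shortest gap: {min(gaps):.1f} h, longest gap: {max(gaps):.1f} h")
```

Pasting your own dump timestamps into the list would let you compare crash frequency before and after a change such as suspending Rosetta or limiting the client to one processor, which may be more telling than going by memory.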
©2025 University of Washington
https://www.bakerlab.org