Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1759 Credit: 18,534,891 RAC: 388 |
Well, at least it's been a while since the last time. boinc-process host is down again, so no Validation until it lives again. Grant Darwin NT |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1759 Credit: 18,534,891 RAC: 388 |
Well, at least it's been a while since the last time. And now the download server has died as well. Grant Darwin NT |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2199 Credit: 41,963,219 RAC: 18,421 |
Got up to find only 6 Rosetta tasks running, plus 4 waiting for memory and 6 cores idle, while RAM is at 65% used and 5.5GB free. Very, very odd. My only settings that are more restrictive are Disk 50%, Memory in use 85%, Memory not in use 95%. More likely it's that I had a faulty RAM stick the other month, so I'm only running with 16GB RAM rather than 32GB. |
Klimax Send message Joined: 27 Apr 07 Posts: 44 Credit: 2,801,675 RAC: 26 |
BTW: My cache is configured for 1+1. As long as the computer is running 24h there should be a day of reserve. Setting it that way may not give you what you might expect it to. After quick verification on another project... damn, you are correct. I was misunderstanding that option for the past 17 years or so. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2199 Credit: 41,963,219 RAC: 18,421 |
BTW: My cache is configured for 1+1. As long as the computer is running 24h there should be a day of reserve. With it set to 1+1 you will get one day's worth, plus another day's worth, but then the cache will run down to just under 1 day's worth, then it will refill the 1 day and then refill the second additional day. I also misunderstood it for a decade or more, but in the end I decided I <did> actually want somewhere between the minimum and maximum amount of days and didn't really care where I was as long as I was in that general area. My target actually hovered between 1 and 1.5 days total back then, but more recently I've found it more appropriate for me to halve that so no one project runs away with itself too far when Rosetta tasks sometimes become available. As they have in the last hour or so. The trouble is, as well as the boinc-process server being down, so is the download server, boinc-files.bakerlab.org, so downloads of the necessary files are failing atm. Fingers crossed someone notices. |
tgbauer Send message Joined: 5 Jan 06 Posts: 11 Credit: 104,888,131 RAC: 70,465 |
high RAM usage is generally only for the first 30min or so. After that, it drops down to 1GB or less. This is not my experience. I have Beta 6.06 tasks that are currently near 50% complete and RAM usage is between 2.26GB and 2.50GB each (1.7GB to 2.2GB compressed). Sounds like limiting the Rosetta task count is the only recourse, because the RAM to CPU ratio is so far off, I can't prioritize the more RAM-efficient tasks, and swapping causes tasks to take 10x longer. |
mmonnin Send message Joined: 2 Jun 16 Posts: 61 Credit: 25,390,629 RAC: 274 |
high RAM usage is generally only for the first 30min or so. After that, it drops down to 1GB or less. I agree, I have high RAM usage the entire time in Linux. A Win10 system had lower RAM usage than Linux, and I could run 100% R@H with 2GB RAM per thread and still use it as my primary desktop. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2199 Credit: 41,963,219 RAC: 18,421 |
The trouble is, as well as the boinc-process server being down, so is the download server, boinc-files.bakerlab.org so the necessary files are failing atm. Looks like it was fixed 3 or 4 hours later. By the time I got back from work to do something about it, the few remaining tasks had all been snapped up again <sigh> |
Jonathan Send message Joined: 31 Jul 24 Posts: 2 Credit: 165,784 RAC: 1,621 |
Hi, I had to abort a couple of Rosetta Beta work units on my arm64 Linux (RPi5) machine as they made it unresponsive. Possibly they were memory constrained with 4 cores and 4GB of memory, but whatever the reason the machine became unresponsive to ssh. Tasks: 1584608775 1409829153 6297726 13 Oct 2024, 9:04:25 UTC 17 Oct 2024, 9:03:56 UTC Aborted 44.46 36.32 --- Rosetta Beta v6.06 aarch64-unknown-linux-gnu; 1584607687 1409833590 6297726 13 Oct 2024, 9:04:25 UTC 17 Oct 2024, 9:03:56 UTC Aborted 4.78 0.00 --- Rosetta Beta v6.06 aarch64-unknown-linux-gnu. Stderr output: <core_client_version>7.20.5</core_client_version> <![CDATA[ <message> aborted by user</message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_aarch64-unknown-linux-gnu @SETDB1_8UWP_boinc_fulldb_6hkEP2_0_3936.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 Using database: database_f5ae1de8e1/database Starting watchdog... Watchdog active. [the "Starting watchdog... Watchdog active." pair repeats many more times] </stderr_txt> ]]> |
Jonathan Send message Joined: 31 Jul 24 Posts: 2 Credit: 165,784 RAC: 1,621 |
apologies duplicate |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1759 Credit: 18,534,891 RAC: 388 |
And the boinc-process host is down again. Grant Darwin NT |
tgbauer Send message Joined: 5 Jan 06 Posts: 11 Credit: 104,888,131 RAC: 70,465 |
Have a work unit that doesn't seem to be getting as far as others, and has an unusually long model (the graphics shows a dot with a line that seems to go on into infinity). Other tasks are running as expected.
This is stderr.txt command: rosetta_4.20_x86_64-apple-darwin -run:protocol jd2_scripting @flags_rb_09_09_632102_625918__t000__0_C1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_rb_09_09_632102_625918__t000__0_C1_robetta.zip -frag_weight_aligned 0.5 -max_registry_shift 4 -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3499362 Using database: database_357d5d93529_n_methyl/minirosetta_database error: zipfile probably corrupt (segmentation violation) error: zipfile probably corrupt (illegal instruction) BOINC:: CPU time: 64841.5s, 36000s + 28800s[2024-10-21 22:25: 9:] :: BOINC Output exists: default.out.gz Size: WARNING! cannot get file size for default.out.gz: could not open file. -1 InternalDecoyCount: 0 (GZ) ----- 0 ----- Stream information inconsistent. Writing W_0000001 error: zipfile probably corrupt (segmentation violation) |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2199 Credit: 41,963,219 RAC: 18,421 |
Have a work unit that doesn't seem to be getting as far as others, and has an unusually long model (the graphics shows a dot with a line that seems to go on into infinity). It's probably already errored out by now, but with all those errors and running over 2.5 days without starting, you should abort it if it's still going. It hasn't started, let alone stood any chance of finishing. Let your core have something more productive to run. |
tgbauer Send message Joined: 5 Jan 06 Posts: 11 Credit: 104,888,131 RAC: 70,465 |
Fortunately this seems to be a one-off and other tasks are processing as expected. Restarting the BOINC client caused it to realize it needed to error out this task. Maybe at some point the BOINC client will recognize similar errors (for any project) and avoid the need for a restart or abort. |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1759 Credit: 18,534,891 RAC: 388 |
And the boinc-process host is down again. Still dead, so still no work being Validated. Grant Darwin NT |
tgbauer Send message Joined: 5 Jan 06 Posts: 11 Credit: 104,888,131 RAC: 70,465 |
Looks like Application "Rosetta Beta 6.06" tasks are using 2.5GB of RAM each! That becomes a bit inefficient when you have 128 cores in a computer and 128GB RAM (only 46/128 cores used). Ones before that and "Rosetta 4.20" are consuming less than 0.5GB (and all 128 cores used). The recent Beta 6.06 tasks are now using less than 1GB (600MB compressed). Thank you for fixing the RAM size! Now I'm able to use all cores again. |
Bill Swisher Send message Joined: 10 Jun 13 Posts: 48 Credit: 37,476,976 RAC: 126,659 |
It appears that they (whoever they are) have resolved the massive memory gobbling. Do you think I would be wise to remove the limitation on the beta runs? I currently have it limited to only 6 per computer. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2199 Credit: 41,963,219 RAC: 18,421 |
I think so. It's possible it ran short of RAM as some tasks are demanding high amounts recently, but better to think of it as a one-off and just move on. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2199 Credit: 41,963,219 RAC: 18,421 |
And the boinc-process host is down again. Still dead, so still no work being Validated. It came back about 8hrs ago. Everything nearly cleared down now. And some tasks became available, but have all been gobbled up again. All very hand-to-mouth |
Matthew Tireman Send message Joined: 24 Mar 20 Posts: 6 Credit: 387,215 RAC: 1 |
:/ |
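
For anyone puzzling over the 1+1 cache discussion earlier in the thread, here is a minimal Python sketch of the behaviour as Sid Celery describes it: the cache fills to the minimum plus the additional days, runs down to just under the minimum, then refills to the top. The function name `simulate_buffer` and the refill rule are assumptions taken from that description, not code from the BOINC client, and the model assumes the host crunches 24/7 with work always available.

```python
def simulate_buffer(min_days, extra_days, hours, step_hours=1.0):
    """Simulate a work cache (measured in days of work) on a 24/7 host.

    Assumed rule, taken from the forum description: new work is only
    requested once the cache drops below min_days, and each request
    tops it up to min_days + extra_days.
    """
    buffered = min_days + extra_days      # start with a full cache
    step_days = step_hours / 24.0
    history = []
    for _ in range(int(hours / step_hours)):
        buffered -= step_days             # crunching drains the cache
        if buffered < min_days:           # fell below the minimum
            buffered = min_days + extra_days
        history.append(buffered)
    return history


if __name__ == "__main__":
    levels = simulate_buffer(min_days=1, extra_days=1, hours=72)
    print(f"lowest: {min(levels):.2f} days, highest: {max(levels):.2f} days")
```

Under these assumptions a 1+1 setting leaves the cache moving between roughly 1 and 2 days of work rather than sitting at a fixed one-day reserve, which is the point made in the replies.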
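
The "only 46/128 cores used" figure reported above can be checked with some rough arithmetic. The Python sketch below is only an estimate: `max_concurrent_tasks` is a hypothetical helper, and the 90% usable-memory fraction is an assumption (tgbauer's actual memory preference is not stated in the thread), chosen because it happens to reproduce the number in the post.

```python
def max_concurrent_tasks(cores, total_ram_gb, ram_per_task_gb, usable_fraction=0.9):
    """Estimate how many tasks can run at once.

    The count is capped by the number of cores and by how many tasks
    fit into the share of RAM the client is allowed to use.
    usable_fraction is an assumed memory preference, not a known setting.
    """
    usable_ram_gb = total_ram_gb * usable_fraction
    by_memory = int(usable_ram_gb // ram_per_task_gb)
    return min(cores, by_memory)


if __name__ == "__main__":
    # 128 cores, 128GB RAM, 2.5GB per Beta 6.06 task (figures from the thread)
    print(max_concurrent_tasks(128, 128, 2.5))
    # Same host with tasks at 0.5GB each
    print(max_concurrent_tasks(128, 128, 0.5))
```

With 2.5GB per task the memory cap, not the core count, sets the limit; at 0.5GB per task every core can be kept busy.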
©2025 University of Washington
https://www.bakerlab.org