Problems and Technical Issues with Rosetta@home

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1759
Credit: 18,534,891
RAC: 388
Message 109881 - Posted: 16 Oct 2024, 11:14:57 UTC

Well, at least it's been a while since the last time.

boinc-process host is down again, so no Validation until it lives again.
Grant
Darwin NT
ID: 109881 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1759
Credit: 18,534,891
RAC: 388
Message 109884 - Posted: 17 Oct 2024, 5:46:19 UTC - in response to Message 109881.  

Well, at least it's been a while since the last time.

boinc-process host is down again, so no Validation until it lives again.
And now the download server has died as well.
Grant
Darwin NT
ID: 109884 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2199
Credit: 41,964,509
RAC: 18,459
Message 109886 - Posted: 17 Oct 2024, 8:04:12 UTC - in response to Message 109858.  

Got up to find only 6 Rosetta tasks running, plus 4 waiting for memory and 6 cores idle, while RAM is at 65% used with 5.5GB free.
Five of the tasks are using 310-440MB; only one is using 2.122GB.
This is very odd
Very, very odd.
Most of my Tasks are now using around 2GB of RAM, even after running for a few hours.

I'd suggest checking your "When and how BOINC uses your computer" preferences.

These are mine. The most likely to be causing issues are the Memory preferences. Is "Leave non-GPU tasks in memory while suspended" selected? Low "Use at most" values would also cause issues.

Disk
    Use no more than 60 % of total

Memory
    When computer is in use, use at most 95 %
    When computer is not in use, use at most 98 %

My only settings that are more restrictive are
Disk 50%
Memory in use 85%
Memory not in use 95%

More likely it's that I had a faulty RAM stick the other month, so I'm only running with 16GB of RAM rather than 32GB.
ID: 109886 · Rating: 0
Klimax
Joined: 27 Apr 07
Posts: 44
Credit: 2,801,675
RAC: 26
Message 109887 - Posted: 17 Oct 2024, 9:48:37 UTC - in response to Message 109871.  

BTW: My cache is configured for 1+1. As long as the computer is running 24h, there should be a day of reserve.
Setting it that way may not give you what you might expect it to.
If you want 2 days' worth, then set it to 2 + 0.01.

Those additional days are just that: additional days. They will only be added when the cache gets low enough to reach the "Store at least" value and needs to be topped up. The additional day is then topped up as well, and the cache runs down again until the "Store at least" value is reached once more.


With it set to 1+1 you will get one day's worth, plus another day's worth, but then the cache will run down to just under 1 day's worth; then it will refill the 1 day and then refill the second, additional day.
With it set to 2 + 0.01, as it returns a Task it will download another to keep the cache at the 2-day level.
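
The difference between the two settings can be sketched with a toy model. This is only an illustration of the behaviour described in this post, not the BOINC client's actual work-fetch logic; the function name, the hourly time step, and the assumptions that the host crunches around the clock and refills instantly are all made up for the example.

```python
def simulate_cache(store_at_least, additional, hours=96):
    """Toy model: buffered work (in days) after each hour of 24/7 crunching.

    Assumption (taken from the description above, not from BOINC source code):
    work is only requested once the buffer drops below `store_at_least`,
    and the request then refills it to `store_at_least + additional`.
    """
    low = store_at_least
    high = store_at_least + additional
    level = high                      # start with a full cache
    history = []
    for _ in range(hours):
        level -= 1.0 / 24.0           # one hour of work is consumed
        if level < low:               # below "Store at least": top up fully
            level = high
        history.append(level)
    return history


swing = simulate_cache(1, 1)          # the "1+1" setting
steady = simulate_cache(2, 0.01)      # the "2 + 0.01" setting
print(f"1+1:      {min(swing):.2f} to {max(swing):.2f} days")
print(f"2 + 0.01: {min(steady):.2f} to {max(steady):.2f} days")
```

In this model the 1+1 cache swings between roughly one and two days of work, while 2 + 0.01 stays at about two days throughout.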

After a quick verification on another project... damn, you are correct. I have been misunderstanding that option for the past 17 years or so.
ID: 109887 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2199
Credit: 41,964,509
RAC: 18,459
Message 109889 - Posted: 17 Oct 2024, 11:11:05 UTC - in response to Message 109887.  

BTW: My cache is configured for 1+1. As long as the computer is running 24h, there should be a day of reserve.
With it set to 1+1 you will get one day's worth, plus another day's worth, but then the cache will run down to just under 1 day's worth; then it will refill the 1 day and then refill the second, additional day.
With it set to 2 + 0.01, as it returns a Task it will download another to keep the cache at the 2-day level.

After a quick verification on another project... damn, you are correct. I have been misunderstanding that option for the past 17 years or so.

I also misunderstood it for a decade or more, but in the end I decided I <did> actually want somewhere between the minimum and maximum number of days, and didn't really care where I was as long as I was in that general area.
My target actually hovered between 1 and 1.5 days total back then, but more recently I've found it more appropriate for me to halve that so no one project runs away with itself too far when Rosetta tasks sometimes become available.

As they have in the last hour or so.
The trouble is, as well as the boinc-process server being down, so is the download server, boinc-files.bakerlab.org, so downloads of the necessary files are failing at the moment.
Fingers crossed someone notices.
ID: 109889 · Rating: 0
tgbauer
Joined: 5 Jan 06
Posts: 11
Credit: 104,889,054
RAC: 70,434
Message 109891 - Posted: 17 Oct 2024, 18:59:38 UTC - in response to Message 109876.  
Last modified: 17 Oct 2024, 19:00:31 UTC

High RAM usage is generally only for the first 30 min or so. After that, it drops down to 1GB or less.


This is not my experience. I have Beta 6.06 tasks that are currently near 50% complete, and RAM usage is between 2.26GB and 2.50GB each (1.7GB to 2.2GB compressed).
Sounds like limiting the Rosetta task count is the only recourse: the RAM-to-CPU ratio is so far off, there is no way to prioritize the more RAM-efficient tasks, and swapping causes tasks to take 10x longer.
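
For anyone wanting to cap the task count this way, BOINC reads an app_config.xml file from the project folder (projects/boinc.bakerlab.org_rosetta, going by the paths in the stderr output posted in this thread). A minimal sketch that generates one follows. The application name rosetta_beta is a guess based on the executable name and should be checked against client_state.xml; the limit of 6 is just an example value.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, max_concurrent):
    """Return app_config.xml text limiting how many tasks of one app run at once."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    ET.indent(root)  # pretty-print; needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


# "rosetta_beta" is an assumed app name; 6 is an example limit.
print(build_app_config("rosetta_beta", 6))
```

Save the output as app_config.xml in the project folder, then use Options > Read config files in BOINC Manager (or restart the client) for it to take effect.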
ID: 109891 · Rating: 0
mmonnin
Joined: 2 Jun 16
Posts: 61
Credit: 25,390,629
RAC: 274
Message 109893 - Posted: 17 Oct 2024, 22:49:37 UTC - in response to Message 109891.  

High RAM usage is generally only for the first 30 min or so. After that, it drops down to 1GB or less.


This is not my experience. I have Beta 6.06 tasks that are currently near 50% complete, and RAM usage is between 2.26GB and 2.50GB each (1.7GB to 2.2GB compressed).
Sounds like limiting the Rosetta task count is the only recourse: the RAM-to-CPU ratio is so far off, there is no way to prioritize the more RAM-efficient tasks, and swapping causes tasks to take 10x longer.


I agree, I have high RAM usage the entire time in Linux. A Win10 system had lower RAM usage than Linux, and I could run 100% R@H with 2GB of RAM per thread and still use it as my primary desktop.
ID: 109893 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2199
Credit: 41,964,509
RAC: 18,459
Message 109895 - Posted: 18 Oct 2024, 2:10:09 UTC - in response to Message 109889.  

The trouble is, as well as the boinc-process server being down, so is the download server, boinc-files.bakerlab.org, so downloads of the necessary files are failing at the moment.
Fingers crossed someone notices.

Looks like it was fixed 3 or 4 hours later. By the time I got back from work to do something about it, all of the few remaining tasks had been snapped up again <sigh>
ID: 109895 · Rating: 0
Jonathan
Joined: 31 Jul 24
Posts: 2
Credit: 165,784
RAC: 1,621
Message 109896 - Posted: 19 Oct 2024, 6:02:50 UTC
Last modified: 19 Oct 2024, 6:51:03 UTC

Hi, I had to abort a couple of Rosetta Beta work units on my arm64 Linux (RPi5) machine as they made it unresponsive. Possibly they were memory constrained, with 4 cores and 4GB of memory, but whatever the reason, the machine stopped responding to ssh.
1584608775	1409829153	6297726	13 Oct 2024, 9:04:25 UTC	17 Oct 2024, 9:03:56 UTC	Aborted	44.46	36.32	---	Rosetta Beta v6.06
aarch64-unknown-linux-gnu
1584607687	1409833590	6297726	13 Oct 2024, 9:04:25 UTC	17 Oct 2024, 9:03:56 UTC	Aborted	4.78	0.00	---	Rosetta Beta v6.06
aarch64-unknown-linux-gnu

Stderr output
<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
aborted by user</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_aarch64-unknown-linux-gnu @SETDB1_8UWP_boinc_fulldb_6hkEP2_0_3936.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937
Using database: database_f5ae1de8e1/database
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.

</stderr_txt>
]]>
ID: 109896 · Rating: 0
Jonathan
Joined: 31 Jul 24
Posts: 2
Credit: 165,784
RAC: 1,621
Message 109897 - Posted: 19 Oct 2024, 6:02:56 UTC
Last modified: 19 Oct 2024, 6:05:21 UTC

apologies duplicate
ID: 109897 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1759
Credit: 18,534,891
RAC: 388
Message 109901 - Posted: 23 Oct 2024, 8:41:15 UTC

And the boinc-process host is down again.
Grant
Darwin NT
ID: 109901 · Rating: 0
tgbauer
Joined: 5 Jan 06
Posts: 11
Credit: 104,889,054
RAC: 70,434
Message 109903 - Posted: 24 Oct 2024, 0:22:46 UTC
Last modified: 24 Oct 2024, 0:36:15 UTC

I have a work unit that doesn't seem to be getting as far as the others and has an unusually long model (the graphics show a dot with a line that seems to go on to infinity).
Other tasks are running as expected.



Application: Rosetta 4.20
Name: rb_09_09_632102_625918__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_2979545_8404
State: Running
Received: Saturday, October 19, 2024 at 03:24:01 AM
Report deadline: Tuesday, October 22, 2024 at 03:24:04 AM
Estimated computation size: 80,000 GFLOPs
CPU time: 2d 14:28:52
CPU time since checkpoint: 2d 14:28:52
Elapsed time: 2d 14:12:32
Estimated time remaining: ---
Fraction done: 100.000%
Virtual memory size: 34.42 GB
Working set size: 22.83 MB
Directory: slots/2
Process ID: 17683
Progress rate: 1.440% per hour
Executable: rosetta_4.20_x86_64-apple-darwin



This is stderr.txt
command: rosetta_4.20_x86_64-apple-darwin -run:protocol jd2_scripting @flags_rb_09_09_632102_625918__t000__0_C1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_rb_09_09_632102_625918__t000__0_C1_robetta.zip -frag_weight_aligned 0.5 -max_registry_shift 4 -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3499362
Using database: database_357d5d93529_n_methyl/minirosetta_database
error:  zipfile probably corrupt (segmentation violation)
error:  zipfile probably corrupt (illegal instruction)
BOINC:: CPU time: 64841.5s, 36000s + 28800s[2024-10-21 22:25: 9:] :: BOINC 
Output exists: default.out.gz Size: WARNING! cannot get file size for default.out.gz: could not open file.
-1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
error:  zipfile probably corrupt (segmentation violation)

ID: 109903 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2199
Credit: 41,964,509
RAC: 18,459
Message 109904 - Posted: 24 Oct 2024, 1:43:18 UTC - in response to Message 109903.  

I have a work unit that doesn't seem to be getting as far as the others and has an unusually long model (the graphics show a dot with a line that seems to go on to infinity).
Other tasks are running as expected.

CPU time: 2d 14:28:52
CPU time since checkpoint: 2d 14:28:52
Elapsed time: 2d 14:12:32
Estimated time remaining: ---


This is stderr.txt
error:  zipfile probably corrupt (segmentation violation)
error:  zipfile probably corrupt (illegal instruction)
BOINC:: CPU time: 64841.5s, 36000s + 28800s[2024-10-21 22:25: 9:] :: BOINC 
-----
Stream information inconsistent.
Writing W_0000001
error:  zipfile probably corrupt (segmentation violation)

It's probably already errored out by now, but with all those errors and running over 2.5 days without starting, you should abort it if it's still going.
It hasn't started, let alone stood any chance of finishing. Let your core have something more productive to run.
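
A rough way to spot a task like this from its properties: the command line asks for a checkpoint every 120 seconds and shows a 36000 s run timeout on top of the 28800 s target run time, yet "CPU time since checkpoint" equals the total CPU time. The sketch below encodes that as a heuristic. The function names and the rule itself are only an illustration, not anything the BOINC client is known to do.

```python
import re


def parse_duration(text):
    """Convert a BOINC-style duration such as '2d 14:28:52' to seconds."""
    match = re.fullmatch(r"(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def looks_stuck(cpu_time, since_checkpoint, run_time_limit=28800 + 36000):
    """Heuristic: never checkpointed and already past the run-time limit."""
    total = parse_duration(cpu_time)
    unsaved = parse_duration(since_checkpoint)
    return unsaved >= total and total > run_time_limit


# Values from the task properties quoted above.
print(parse_duration("2d 14:28:52"))
print(looks_stuck("2d 14:28:52", "2d 14:28:52"))
```

A task that has checkpointed recently, or that is still inside its run-time limit, is not flagged by this rule.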
ID: 109904 · Rating: 0
tgbauer
Joined: 5 Jan 06
Posts: 11
Credit: 104,889,054
RAC: 70,434
Message 109905 - Posted: 24 Oct 2024, 3:46:07 UTC - in response to Message 109904.  


It's probably already errored out by now, but with all those errors and running over 2.5 days without starting, you should abort it if it's still going.
It hasn't started, let alone stood any chance of finishing. Let your core have something more productive to run.


Fortunately this seems to be a one-off and other tasks are processing as expected.
Restarting the BOINC client caused it to realize it needed to error out this task.
Maybe at some point the BOINC client will recognize similar errors (for any project) and avoid the need for a restart or abort.
ID: 109905 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1759
Credit: 18,534,891
RAC: 388
Message 109906 - Posted: 24 Oct 2024, 4:29:38 UTC - in response to Message 109901.  

And the boinc-process host is down again.
Still dead, so still no work being Validated.
Grant
Darwin NT
ID: 109906 · Rating: 0
tgbauer
Joined: 5 Jan 06
Posts: 11
Credit: 104,889,054
RAC: 70,434
Message 109908 - Posted: 24 Oct 2024, 12:39:37 UTC - in response to Message 109875.  

Looks like Application "Rosetta Beta 6.06" tasks are using 2.5GB of RAM each! That becomes a bit inefficient when you have 128 cores in a computer and 128GB RAM (only 46/128 cores used). The ones before that, and "Rosetta 4.20", consume less than 0.5GB (and all 128 cores are used).
Is it possible to limit the RAM usage per task, so I can use all cores again?

The recent beta 6.06 tasks are now using less than 1GB (600MB compressed). Thank you for fixing the RAM size!
Now I'm able to use all cores again
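
The arithmetic behind the idle cores is simple enough to sketch. Assuming BOINC is allowed 90% of RAM (the limit isn't stated in the post, so 90% is purely an assumption) and each task needs about 2.5GB, a 128GB machine fits 46 tasks, which matches the 46/128 reported above. With tasks under 1GB (0.8GB is used as an example here), the core count becomes the limit again.

```python
def max_tasks(total_ram_gb, ram_limit_fraction, per_task_gb, cores):
    """Tasks that fit in the RAM BOINC may use, capped at the core count."""
    usable_gb = total_ram_gb * ram_limit_fraction
    return min(cores, int(usable_gb // per_task_gb))


# 128 cores, 128GB RAM; the 0.90 limit is an assumption, not from the post.
print(max_tasks(128, 0.90, 2.5, 128))   # tasks at about 2.5GB each
print(max_tasks(128, 0.90, 0.8, 128))   # tasks at about 0.8GB each
```

This ignores the memory used by the operating system and other programs, so real numbers will be somewhat lower.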
ID: 109908 · Rating: 0
Bill Swisher
Joined: 10 Jun 13
Posts: 48
Credit: 37,478,793
RAC: 126,619
Message 109909 - Posted: 24 Oct 2024, 17:07:01 UTC

It appears that they (whoever they are) have resolved the massive memory gobbling. Do you think I would be wise to remove the limitation on the beta runs? I currently have it limited to only 6 per computer.
ID: 109909 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2199
Credit: 41,964,509
RAC: 18,459
Message 109912 - Posted: 25 Oct 2024, 4:15:41 UTC - in response to Message 109905.  


It's probably already errored out by now, but with all those errors and running over 2.5 days without starting, you should abort it if it's still going.
It hasn't started, let alone stood any chance of finishing. Let your core have something more productive to run.

Fortunately this seems to be a one-off and other tasks are processing as expected.

I think so.
It's possible it ran short of RAM as some tasks are demanding high amounts recently, but better to think of it as a one-off and just move on.
ID: 109912 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2199
Credit: 41,964,509
RAC: 18,459
Message 109913 - Posted: 25 Oct 2024, 4:20:23 UTC - in response to Message 109906.  

And the boinc-process host is down again.
Still dead, so still no work being Validated.

It came back about 8hrs ago.
Everything nearly cleared down now.
And some tasks became available, but have all been gobbled up again.
All very hand-to-mouth
ID: 109913 · Rating: 0
Matthew Tireman
Joined: 24 Mar 20
Posts: 6
Credit: 387,215
RAC: 1
Message 109929 - Posted: 27 Oct 2024, 16:24:38 UTC
Last modified: 27 Oct 2024, 16:25:21 UTC

:/
ID: 109929 · Rating: 0

©2025 University of Washington
https://www.bakerlab.org