Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

Sid Celery

Joined: 11 Feb 08
Posts: 2331
Credit: 44,194,027
RAC: 27,408
Message 112628 - Posted: 6 May 2025, 9:50:06 UTC - in response to Message 112625.  

However, when you later write...
I have created a profile with the longer runtime. I will let it run for a while. And then probably revert to the 8 hour profile.

Your cache seems to hold 870 tasks (including running tasks).
Rosetta initially tells Boinc that tasks will take 8hrs, even if you've adjusted tasks to have a 24hr runtime.

870 tasks at 8hrs on a 128 thread server will take ~2.25days to complete - within the 3-day deadline.
But if they end up running 24hrs, they'll take ~6.75days to complete - ALL missing deadline.
Then your earlier unstarted tasks will get cancelled for not starting before deadline, while simultaneously grabbing more tasks because Rosetta (not Boinc) is misleading Boinc as to how big your cache is.

Any cache of tasks larger than 3days*128threads=384 running 24hrs each will miss deadline
The longest runtime your current cache size can successfully complete inside deadline is 10hrs - not notably different to the default 8hrs

The point being, with a fixed 3day deadline, if you treble runtime you have to reduce your cache-size an equivalent amount to continue to meet that hard deadline
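To make the arithmetic above easy to rerun with other numbers, here is a minimal sketch. It assumes every thread stays busy and every task runs for exactly the configured runtime; the function names are made up for illustration and are not part of Boinc or Rosetta.

```python
def days_to_clear(tasks, runtime_hours, threads):
    """Days needed to work through a cache if all threads stay busy."""
    return tasks * runtime_hours / threads / 24


def max_tasks(deadline_days, runtime_hours, threads):
    """Largest cache (running tasks included) that can finish inside the deadline."""
    return int(deadline_days * 24 * threads // runtime_hours)


def max_runtime_hours(tasks, deadline_days, threads):
    """Longest per-task runtime that still clears the whole cache by deadline."""
    return deadline_days * 24 * threads / tasks


if __name__ == "__main__":
    threads, deadline_days, cache = 128, 3, 870
    print(days_to_clear(cache, 8, threads))    # a little over 2.25 days
    print(days_to_clear(cache, 24, threads))   # a little over 6.75 days
    print(max_tasks(deadline_days, 24, threads))          # 384
    print(max_runtime_hours(cache, deadline_days, threads))  # about 10.6 hrs
```

These are best-case figures. Tasks that crash early, or a machine that isn't crunching 24/7, shift them in either direction, which is why a target below the 384 ceiling is the safer choice.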

Yeah, this is actually happening as predicted above.
You currently have 1100 errored tasks, largely comprising "Not started by deadline - canceled" plus "Timed out - no response" for those tasks that have started.
And those you have returned have been awarded credit, which is fortunate because all your 24hr tasks missed deadline by up to a day.
And you seem to have tried reducing your runtime to 22hrs or 20hrs and it's not producing any better outcomes.
The other thing we can say is that while each task is getting credited more, you're not getting noticeably better credit/hr, much as predicted, so it's a futile exercise.

It's perfectly legitimate to want to run longer runtimes, if you're happy with the risk of tasks crashing in that extra time and not being rewarded with any credits, but that requires a maximum number of tasks in the 300-350 range when using 24hr runtimes - and, importantly, <waiting> for your cache to actually reduce to 300-350 <before> increasing runtime from 8 to 24hrs in order to avoid these timeouts and cancellations.

By all means confirm that for yourself - your tasklist looks like a bit of a warzone atm with all its red warning messages

If you want to keep your settings as they currently are I think you may squeeze through with 12hr runtimes as a certain number of tasks are crashing out of their own accord in the current batch (project-related, not user-related)
I'm using 12hr runtimes quite successfully atm (albeit with a smaller cache). My personal view is that it will be a workable compromise setting for you of longer runtime vs completion by deadline. YMMV

Checking in on this (because I have nothing better to do) I did note the cache had reduced from 870 to 700ish at the time of my previous post, but didn't know if that was just a random fluctuation.
Now I can see runtime has been knocked back to 8hrs and the cache is down to 367. Deadlines were only being missed by 7hrs rather than a day, and there was a delay of over a day in downloading fresh tasks, so deadlines will now start to be hit. There's also a further delay in downloading going on currently, so (I speculate) in about a day's time runtime can be increased to 24hrs again.

If I've got that right, and Tom hasn't come back to confirm it yet, that's all exactly the right thing to do. Good job
Sid Celery

Joined: 11 Feb 08
Posts: 2331
Credit: 44,194,027
RAC: 27,408
Message 112629 - Posted: 6 May 2025, 14:41:30 UTC - in response to Message 112628.  
Last modified: 6 May 2025, 14:41:59 UTC

I know I'm obsessing over this, but I'm at a loose end, so why not...

In progress tasks are down to 286
All "Not started by deadline - canceled" and "Timed out - no response" error messages have disappeared. Errored tasks are down by 200 with no new ones being added
And task returns are already beating deadlines by as much as 1 day 10hrs, so there's no risk of missing out on credit and no resends going to other users who later find them cancelled by the server
All problem issues are solved, and with quite some headroom.

With a 128-thread server I wouldn't reduce the cache size any further - some might already consider that number to be on the low side, especially when tasks ready to send are so hand-to-mouth.
I'd also increase task runtime from 8 to 12hrs, which I personally consider to be a sweeter spot for longer runtimes than 24hrs, with fewer server hits than 8hrs and less problematic Boinc scheduling, given all the other vagaries we have to contend with here.

It all looks neatly balanced atm, with that option to slightly increase runtime as well without recreating problems.

IMO
Tom M

Joined: 20 Jun 17
Posts: 126
Credit: 27,939,808
RAC: 104,344
Message 112630 - Posted: 6 May 2025, 15:11:14 UTC - in response to Message 112629.  
Last modified: 6 May 2025, 16:07:39 UTC



It looks like I have switched back to the 8 hour profile overnight.
I will change the 22-24 hour profile to 12.

Boincmgr is set to 0.1/0.01 right now.

Do we have any idea what triggers the computation errors? I would like to lower my computation errors if possible. I am getting them on both of my systems: a Ryzen 3700X CPU and the Epyc CPU system.

Thank you.

===edit===
Bumped the cache from 0.1 to 0.2
The profile is now set to 12 hours.
Help, my tagline is missing..... Help, my tagline is......... Help, m........ Hel.....
Tom M

Joined: 20 Jun 17
Posts: 126
Credit: 27,939,808
RAC: 104,344
Message 112631 - Posted: 6 May 2025, 16:25:52 UTC - in response to Message 112630.  


Do we have any idea what the computation errors are triggered by? I would like to lower my computation errors if possible. I am getting them on both of my systems. A Ryzen 3700x cpu and the Epyc CPU system.


Apparently everyone is dying on line 2798 of the Beta tasks.

"...ERROR: Error in simple_cycpep_predict app! The imported native pose has a different number of residues than the sequence provided...."
Tom M

Joined: 20 Jun 17
Posts: 126
Credit: 27,939,808
RAC: 104,344
Message 112632 - Posted: 6 May 2025, 16:31:59 UTC - in response to Message 112630.  


===edit===
Bumped the cache from 0.1 to 0.2
The profile is now set to 12 hours.


And I have started my polling script again.
Sid Celery

Joined: 11 Feb 08
Posts: 2331
Credit: 44,194,027
RAC: 27,408
Message 112633 - Posted: 6 May 2025, 22:31:52 UTC - in response to Message 112630.  
Last modified: 6 May 2025, 23:03:49 UTC

On the computation errors, this comes from the project, not from any of us.
The last I heard, in the days when someone at Rosetta was speaking to me, was that it was easier to let those tasks error out after very few seconds than to try to extract them from the queue of tasks, which would take a lot of good tasks out as well as the bad. If that view holds, it's something we're going to continue to suffer, unfortunately. Not ideal but pragmatic.

On the cache size, I do agree with Grant's view that it should be kept low BUT only if there's a constant supply of tasks for us.
For some months now we <haven't> had a constant supply ready to send to us.
And this is only made worse by all the tasks that error out.

As such I can't agree with the cache only being 0.1 or 0.2 +0.01
With the number of threads you have, the hand-to-mouth supply of tasks and the regular computation errors, I would aim for a cache size somewhere between 0.5 and 1.0 + 0.01
That strikes me as the right ballpark for safety & reliability within the deadline, but tweak it to your own view of each of those competing issues within those bounds.
Any less and I can see you regularly having threads free without work.
Supply isn't trustworthy enough and, with all the computation errors, even what you do get you can't entirely rely on.

Having a 12hr runtime rather than 8hrs gives you that little bit more time to get good tasks through - that's one of its pluses IMO

Fwiw on my own machines, I've now settled on a 12hr runtime with a cache of 0.4 + 0.1 which works pretty well with just 16 threads on my main PC (and 6 threads on another and 8 on my work PC), though I run 2 other low-priority projects as a backup in case of unforeseen eventualities while they're unattended.

Edit: I see you're down to just 149 tasks now, which will be your 128 threads and only 21 tasks waiting to start for when others complete. This is way too tight. It looks like you're likely asking for tasks already but the project hasn't got them to send you. If you were asking for tasks with 0.5days worth left, rather than only 0.1 or 0.2, you'd stand a much better chance of getting some in time. Even 0.5days may not be enough time tbh.
You can only see how this goes. It's no good swinging from having too many tasks to complete by deadline all the way to not having enough tasks to keep all your threads occupied. There's a balance somewhere between the two to find.
Edit 2: Go straight to 1.0 + 0.01 - even if Rosetta had them all to send you it'd only be ~300 including running tasks which is far from excessive on a 128-thread server. It'd still be nearly 600 fewer than you were stockpiling before
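For a rough feel of what a cache setting means in task numbers, here is a back-of-envelope sketch. It assumes Boinc simply tries to hold (minimum + additional) days of work for every thread and divides that by whatever runtime estimate it currently has for a task. The real work-fetch logic is more involved than this, and the function name is made up for illustration.

```python
def estimated_cache_tasks(buffer_days, threads, estimated_task_hours):
    """Approximate tasks held for a given buffer, assuming one task per thread at a time."""
    return buffer_days * threads * 24 / estimated_task_hours


if __name__ == "__main__":
    buffer_days = 1.0 + 0.01   # "store at least" + "store up to an additional" days
    threads = 128
    # If Boinc still believes the 8hr estimate Rosetta gives it:
    print(round(estimated_cache_tasks(buffer_days, threads, 8)))    # 388
    # If its estimate has caught up with a 12hr runtime:
    print(round(estimated_cache_tasks(buffer_days, threads, 12)))   # 259
```

Under those assumptions the ~300 figure sits between the two estimates, and either one is well short of the 870 held before.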
©2025 University of Washington
https://www.bakerlab.org