Author | Message |
Sid Celery
Joined: 11 Feb 08 Posts: 2330 Credit: 44,187,498 RAC: 27,446
|
However, when you later write...
I have created a profile with the longer runtime. I will let it run for a while. And then probably revert to the 8 hour profile.
Your cache seems to hold 870 tasks (including running tasks).
The way Rosetta works is it initially tells Boinc tasks will take 8hrs, even if you've adjusted tasks to have a 24hr runtime.
870 tasks at 8hrs on a 128 thread server will take ~2.25days to complete - within the 3-day deadline.
But if they end up running 24hrs, they'll take ~6.75days to complete - ALL missing deadline.
Then your earlier unstarted tasks will get cancelled for not starting before deadline, while simultaneously grabbing more tasks because Rosetta (not Boinc) is misleading Boinc as to how big your cache is.
Any cache of tasks larger than 3days*128threads=384 running 24hrs each will miss deadline
The longest runtime your current cache size can successfully complete inside deadline is 10hrs - not notably different to the default 8hrs
The point being, with a fixed 3day deadline, if you treble runtime you have to reduce your cache-size an equivalent amount to continue to meet that hard deadline
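To make the arithmetic above easy to re-check, here is a minimal sketch in Python. It assumes a simplified model that is not spelled out in the thread: every one of the 128 threads runs Rosetta tasks back to back, each task runs for exactly its target runtime, and no new work is downloaded while the cache drains. The function names are made up for illustration.

```python
import math

def drain_days(tasks: int, runtime_hours: float, threads: int) -> float:
    """Days needed to finish `tasks` when each thread runs one task at a time."""
    return tasks * runtime_hours / threads / 24

def max_tasks(deadline_days: float, runtime_hours: float, threads: int) -> int:
    """Largest cache that can still finish inside the deadline."""
    return math.floor(deadline_days * 24 / runtime_hours * threads)

def max_runtime_hours(tasks: int, deadline_days: float, threads: int) -> float:
    """Longest per-task runtime a cache of `tasks` can finish inside the deadline."""
    return deadline_days * 24 * threads / tasks

THREADS = 128        # server thread count from the post
DEADLINE_DAYS = 3    # fixed Rosetta deadline from the post
CACHE = 870          # cached tasks, including running ones

print(round(drain_days(CACHE, 8, THREADS), 2))    # 2.27 days at 8hrs
print(round(drain_days(CACHE, 24, THREADS), 2))   # 6.8 days at 24hrs
print(max_tasks(DEADLINE_DAYS, 24, THREADS))      # 384 tasks at 24hrs
print(round(max_runtime_hours(CACHE, DEADLINE_DAYS, THREADS), 1))  # 10.6 hrs
```

Under those assumptions the numbers land close to the ones quoted: roughly 2.25 and 6.75 days, a 384-task ceiling at 24hrs, and a runtime limit of about 10hrs for an 870-task cache.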
Yeah, this is actually happening as predicted above.
You currently have 1100 errored tasks, largely comprising "Not started by deadline - canceled" plus "Timed out - no response" for those tasks that have started.
And those you have returned have been awarded credit, which is fortunate because all your 24hr tasks missed deadline by up to a day.
And you seem to have tried reducing your runtime to 22hrs or 20hrs and it's not producing any better outcomes.
The other thing we can say is that while each task is getting credited more, you're not getting any noticeably better credit/hr, much as predicted, so it's a futile exercise.
It's perfectly legitimate to want to run longer runtimes, if you're happy with the risk of tasks crashing in that extra time and not being rewarded with any credits, but that requires a maximum number of tasks in the 300-350 range when using 24hr runtimes - and, importantly, <waiting> for your cache to actually reduce to 300-350 <before> increasing runtime from 8 to 24hrs in order to avoid these timeouts and cancellations.
By all means confirm that for yourself - your tasklist looks like a bit of a warzone atm with all its red warning messages
If you want to keep your settings as they currently are, I think you may squeeze through with 12hr runtimes, as a certain number of tasks are crashing of their own accord in the current batch (project-related, not user-related).
I'm using 12hr runtimes quite successfully atm (albeit with a smaller cache). My personal view is that it will be a workable compromise setting for you of longer runtime vs completion by deadline. YMMV
Checking in on this (because I have nothing better to do) I did note the cache had reduced from 870 to 700ish at the time of my previous post, but didn't know if that was just a random fluctuation.
Now I can see runtime has been knocked back to 8hrs, cache is down to 367, deadlines were only being missed by 7hrs rather than a day and there was a delay in downloading fresh tasks of over a day so that deadlines will now start to be hit. Also there's a further delay in downloading currently going on so that (I speculate) in about a day's time, runtime can be increased to 24hrs again.
If I've got that right, and Tom hasn't come back to confirm it yet, that's all exactly the right thing to do. Good job
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2330 Credit: 44,187,498 RAC: 27,446
|
I know I'm obsessing over this, but I'm at a loose end, so why not...
In progress tasks are down to 286
All "Not started by deadline - canceled" and "Timed out - no response" error messages have disappeared. Errored tasks are down by 200 with no new ones being added
And task returns are already beating deadlines by as much as 1 day 10hrs, so there's no risk of missing out on credit and no resends going to other users who later find them cancelled by the server
All problem issues are solved, and with quite some headroom.
With a 128-thread server I wouldn't reduce the cache size any further - some might already consider that number to be on the low side, especially when tasks ready to send are so hand-to-mouth.
I'd also increase task runtime from 8 to 12hrs, which I personally consider a sweeter spot for longer runtimes than 24hrs: fewer server hits than at 8hrs, less problematic Boinc scheduling, and less exposure to all the other vagaries we have to contend with here.
It all looks neatly balanced atm, with that option to slightly increase runtime as well without recreating problems.
IMO
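As a rough check on the headroom, the sketch below estimates how many days are left before the 3-day deadline once a 286-task cache has drained, at 8, 12 and 24hr runtimes. It assumes all 128 threads run Rosetta continuously, every task runs its full target runtime, and the cache is not topped up meanwhile; `headroom_days` is a name made up for this example.

```python
def headroom_days(tasks: int, runtime_hours: float,
                  threads: int = 128, deadline_days: float = 3.0) -> float:
    """Deadline minus the time needed to finish the whole cache, in days."""
    drain = tasks * runtime_hours / threads / 24
    return deadline_days - drain

# 286 in-progress tasks, as reported above
for runtime in (8, 12, 24):
    print(f"{runtime:>2}hr runtime: {headroom_days(286, runtime):.2f} days of headroom")
```

In this simplified model all three runtimes stay inside the deadline at the current cache size, with the margin shrinking to well under a day at 24hrs. Real task returns will differ, since tasks can finish early or crash.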
|
|
Tom M
Joined: 20 Jun 17 Posts: 126 Credit: 27,917,436 RAC: 104,208
|
It looks like I have switched back to the 8 hour profile overnight.
I will change the 22-24 hour profile to 12.
Boincmgr is set to 0.1/0.01 right now.
Do we have any idea what is triggering the computation errors? I would like to lower my computation errors if possible. I am getting them on both of my systems: a Ryzen 3700X CPU and the Epyc CPU system.
Thank you.
===edit===
Bumped the cache from 0.1 to 0.2
The profile is now set to 12 hours.
Help, my tagline is missing..... Help, my tagline is......... Help, m........ Hel.....
|
|
Tom M
Joined: 20 Jun 17 Posts: 126 Credit: 27,917,436 RAC: 104,208
|
Do we have any idea what is triggering the computation errors? I would like to lower my computation errors if possible. I am getting them on both of my systems: a Ryzen 3700X CPU and the Epyc CPU system.
Apparently everyone is dying on line 2798 of the Beta tasks.
"...ERROR: Error in simple_cycpep_predict app! The imported native pose has a different number of residues than the sequence provided...."
|
|
Tom M
Joined: 20 Jun 17 Posts: 126 Credit: 27,917,436 RAC: 104,208
|
===edit===
Bumped the cache from 0.1 to 0.2
The profile is now set to 12 hours.
And I have started my polling script again.
|
|