Other projects.

Sid Celery

Joined: 11 Feb 08
Posts: 2465
Credit: 46,464,996
RAC: 274
Message 113134 - Posted: 5 Oct 2025, 16:41:49 UTC - in response to Message 113131.  

And yet another :-

October 3, 2025
We are aware of the issue with the scheduler returning "Another scheduler instance is running for this host" and have identified the cause in the config.xml template we adapted for the new containerized environment. We will fix it once we have confirmed that the new event-driven validation and assimilation pipelines are working correctly.
Uploads are being processed normally. We've confirmed that the new architecture for the containerized file_upload_handler pool behind Apache is correctly producing to the per-application Kafka (Redpanda) topics, storing the event and result data in separate queues on the local broker's partition.
As a result, there will be at least one more weekend sprint. Tentatively, we expect to be producing new workunits next week for MCM1, ARP1, and MAM1 beta version 7.07. Validations should resume over the weekend, and initial releases of batches will be intermittent.
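
For anyone trying to picture what "per-application topics" means in that update, here is a rough sketch. It is not the project's code: it uses a plain in-memory queue per application instead of a real Kafka/Redpanda client, and the function names and event fields (`produce_upload_event`, `consume_batch`, `workunit_id`, `result_name`) are made up for illustration.

```python
from collections import defaultdict, deque

# Hypothetical in-memory stand-in for per-application topics.
# A real deployment would use a Kafka/Redpanda client instead.
topics = defaultdict(deque)


def produce_upload_event(app, workunit_id, result_name):
    """Record that a result file for a workunit has been uploaded."""
    topics[app].append({"workunit_id": workunit_id, "result_name": result_name})


def consume_batch(app, max_events=100):
    """Drain up to max_events events from one application's topic."""
    queue = topics[app]
    batch = []
    while queue and len(batch) < max_events:
        batch.append(queue.popleft())
    return batch


def group_by_workunit(batch):
    """Group uploaded result names by workunit so state can be applied
    once per workunit instead of once per row."""
    grouped = defaultdict(list)
    for event in batch:
        grouped[event["workunit_id"]].append(event["result_name"])
    return dict(grouped)


# Example: uploads for two applications land in separate topics.
produce_upload_event("MCM1", 1, "wu1_r0")
produce_upload_event("MCM1", 1, "wu1_r1")
produce_upload_event("MCM1", 2, "wu2_r0")
produce_upload_event("ARP1", 7, "wu7_r0")
mcm1_ready = group_by_workunit(consume_batch("MCM1"))
```

The point of the sketch is only that each application gets its own stream of upload events, and a consumer can pick up a whole batch at once and work on it grouped by workunit.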

I've been away a few days and can confirm the above from my logs:

30/09/2025 12:04:54 | World Community Grid | Scheduler request to https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi failed: HTTP service unavailable
30/09/2025 23:42:51 | World Community Grid | Scheduler request to https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi failed: Error 403
...
03/10/2025 09:24:03 | World Community Grid | Scheduler request to https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi failed: Error 403
03/10/2025 13:19:00 | World Community Grid | Another scheduler instance is running for this host
Bryn Mawr

Joined: 26 Dec 18
Posts: 430
Credit: 14,933,154
RAC: 13
Message 113141 - Posted: 8 Oct 2025, 7:12:07 UTC

Further progress :-

October 7, 2025
We have resolved the issue with the BOINC scheduler configuration causing "Another scheduler instance is running for this host". Users should be able to report tasks. We will update as soon as we begin creating new workunits as we are still working to stand up the rest of the BOINC backend architecture.
The website went down briefly as we brought the scheduler online. We have adjusted the HAProxy configuration, and we will continue to adjust the Apache/HAProxy config if the website stops responding again.
We are still debugging issues with the new Kafka-based validation workflow. It works together with HAProxy routing rules that partition BOINC downloads and uploads by assigning servers equal hex buckets in the directory hierarchy BOINC expects (https://github.com/BOINC/boinc/wiki/DirHierarchy). The new file_upload_handler we wrote emits events to Kafka so we can batch them and respond to them in parallel. This removes the need for multiple round trips to the database for row-wise operations and polling; these are now simply batch applications of state after consuming the workunits ready for validation from the relevant Kafka topic for that application. It also allows us to perform validation and assimilation in the same process, at least for the projects we run ourselves (MCM1, MAM1, ARP1). While the Kafka/Redpanda learning curve was significant, we have successfully transitioned to an event-driven, in-memory, partitioned architecture that should let us keep pace with the upcoming GPU-enabled MAM1 application.
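
To make the "equal hex buckets" idea a bit more concrete, here is a minimal sketch of how files could be hashed into hex-named buckets and the buckets split evenly between servers. Everything in it is an assumption for illustration: the fanout of 1024, the MD5-based stand-in hash (which is not claimed to match BOINC's own hashing function), the server names, and the helper names. It does not show the actual HAProxy rules the project uses.

```python
import hashlib

FANOUT = 1024  # assumed number of buckets, named 0 .. 3ff in hex


def bucket_for(filename, fanout=FANOUT):
    """Map a filename to a bucket index.
    Stand-in hash for illustration, not BOINC's exact function."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % fanout


def bucket_dir(bucket):
    """Hex directory name for a bucket index, e.g. 1023 -> '3ff'."""
    return format(bucket, "x")


def assign_buckets(servers, fanout=FANOUT):
    """Give each server a contiguous, near-equal range of buckets."""
    base, extra = divmod(fanout, len(servers))
    ranges = {}
    start = 0
    for i, server in enumerate(servers):
        # The first `extra` servers take one additional bucket each.
        size = base + (1 if i < extra else 0)
        ranges[server] = range(start, start + size)
        start += size
    return ranges


def server_for(filename, ranges, fanout=FANOUT):
    """Find which server owns the bucket a filename hashes into."""
    bucket = bucket_for(filename, fanout)
    for server, owned in ranges.items():
        if bucket in owned:
            return server
    raise ValueError("no server owns bucket %d" % bucket)


# Hypothetical three-server pool.
ranges = assign_buckets(["upload-1", "upload-2", "upload-3"])
target = server_for("example_result_file_0", ranges)
```

Because the bucket depends only on the filename, the same file always routes to the same server, which is presumably what lets uploads and downloads be partitioned without a shared lookup.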
©2025 University of Washington
https://www.bakerlab.org