Message boards : Number crunching : CPU Optimization, GPU utilization: so sad!
Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0
Your point is a good one, Paul. This discussion dates back to the days when we had daily conversations with real project people, so the project staff are aware of it. One of the long-standing issues is that BOINC does not do LOTS of things very well ... people get mad when I say things like that, but it is the truth. One thing BOINC has historically not done well is report the CPU capabilities of a system, and that makes the creation and selection of the correct applications difficult. So most projects don't bother, or let the rabid crunchers and optimizers handle it on their own ... but that also means open source, which is not really available here ... The only project that does not follow this model is Einstein, but for whatever reason the technology they use is not in general use. I can give you my thoughts on that, but quite simply the projects are so isolated from each other, and so much effort goes into suppressing volunteer efforts, that, well, here we are ...
mikey Joined: 5 Jan 06 Posts: 1896 Credit: 10,138,586 RAC: 20,966
Your point is a good one, Paul.
In my workplace we call those "silos": everyone has their own area, and god forbid someone tries to intermix the different program areas!
Paul Joined: 29 Oct 05 Posts: 193 Credit: 66,952,147 RAC: 9,308
Even if BOINC does not correctly report the hardware, multiple math libraries can be linked into the executables; the executables would just be larger. The same operations can of course be done manually in software, but the processors have additional registers that enhance performance. Discussions from long ago indicate there is little advantage to SSE, but it would be good to investigate SSE2, SSE3, and SSE4. It would be great to find out whether these instructions could help and to get them into use. Thx! Paul
Michael G.R. Joined: 11 Nov 05 Posts: 264 Credit: 11,247,510 RAC: 0
What I'm saying is that even if BOINC doesn't report CPU capabilities at all, they could have only ONE version of the code, and that would be SSE/SSE2. Any CPU that doesn't handle SSE/SSE2 is getting old anyway (probably single core too - and implementing SSE would take some months, making those non-SSE machines even more obsolete), and the boost on the newer CPUs would more than compensate for the loss of some old Willamette P4s or whatever.
mikey Joined: 5 Jan 06 Posts: 1896 Credit: 10,138,586 RAC: 20,966
What I'm saying is that even if BOINC doesn't report CPU capabilities at all, they could have only ONE version of the code, and that would be SSE/SSE2.
I think Paul Buck's post also stated that BOINC itself has problems determining what a given CPU's capabilities are. I do not know where those limitations lie, though; he is much more versed in these things than I am, having followed it since the early days. I remember bits and pieces; he seems to remember much more, and more clearly!
Michael G.R. Joined: 11 Nov 05 Posts: 264 Credit: 11,247,510 RAC: 0
I think those limitations would only be a problem if there were, e.g., four versions of the code and BOINC had to determine which one was best for a particular CPU (though I wonder why that would be a problem; in our account, when you look up a computer, it shows CPU capabilities). But if there's just one main version of the code - like how BOINC SIMAP has SSE code - it'll just run that one by default.
mikey Joined: 5 Jan 06 Posts: 1896 Credit: 10,138,586 RAC: 20,966
I think those limitations would only be a problem if there were, e.g., four versions of the code and BOINC had to determine which one was best for a particular CPU (though I wonder why that would be a problem; in our account, when you look up a computer, it shows CPU capabilities).
I just looked up your PCs and did not see the SSE/SSE2/SSE3 capabilities listed. And if the project is programmed to use SSE2 and someone has SSE3, does it auto-downgrade? What if you only have SSE? What about non-Windows PCs?
Michael G.R. Joined: 11 Nov 05 Posts: 264 Credit: 11,247,510 RAC: 0
Hmm, I thought it did. Maybe I was wrong. Even in the BOINC app, when you start it, the "Messages" tab has a "Processor features" line that seems to show that, but I don't see SSE. Maybe it's in there under another name. But even without that, you can tell just by looking at the processor generation. All Core 2 Duos, for example, support SSE, SSE2, SSE3, etc.
If the project is programmed to use SSE2 and someone has SSE3, does it auto-downgrade? What if you only have SSE? What about non-Windows PCs?
These things are added on top of one another; they don't replace the previous one. So, for example, a CPU can be MMX + SSE + SSE2 + SSE3 at the same time. There's no downgrading: SSE2 doesn't replace SSE, they stand side by side. What I'm saying is that SSE has been in all x86 CPUs for many years now, and SSE2 too. At some point you have to stop supporting all the legacy computers, just like you can't count on people having floppy drives anymore. The project could do a lot more science by supporting SSEx extensions even if the cost is losing a few old Pentium 3s or whatever.
mikey Joined: 5 Jan 06 Posts: 1896 Credit: 10,138,586 RAC: 20,966
Hmm, I thought it did. Maybe I was wrong. Even in the BOINC app, when you start it, the "Messages" tab has a "Processor features" line that seems to show that, but I don't see SSE. Maybe it's in there under another name. But even without that, you can tell just by looking at the processor generation. All Core 2 Duos, for example, support SSE, SSE2, SSE3, etc.
I am learning new things here, but I think a few stats need to be run to confirm your thoughts. I am sure each project can run a report on the capabilities of each PC connected to it. I believe you are correct that upgrading the program will produce more results faster, but how many PCs would you exclude? I think there is a break-even point that needs to be explored. I also think those old computers can still be used by some, but their projects could be limited to just a few choices without too much inconvenience. At some point those old PCs cost more to run than the value they are contributing. Yes, we can find another cure faster, but at the cost of polluting the planet much faster too. Faster, more efficient PCs pollute less than their earlier counterparts.
Michael G.R. Joined: 11 Nov 05 Posts: 264 Credit: 11,247,510 RAC: 0
The point of finding cures and scientific breakthroughs faster through CPU optimization is not just "oh, we're impatient / it'll use less electricity / it'll be more pleasing from a computer geek's point of view". There are many, many, many people dying every day of diseases that we will someday cure, and the sooner that day comes, the more people we save. That's not even counting all those who suffer, even if they don't (yet) have those diseases, because they lose husbands, wives, fathers, mothers, brothers, sisters, kids, etc. This isn't just a cool theoretical exercise. By understanding biology better we can help millions of people.
Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0
The point of finding cures and scientific breakthroughs faster through the use of CPU optimization is not just: "oh, we're impatient/it'll use less electricity/it'll be more pleasing from a computer geek's point of view".
As a person who used to make a living staring at computer screens trying to figure out where the last error came from, I can tell you that nothing is more difficult to find than a bug that does not exist when the program is built with a "normal" compile but does exist in an application compiled with "optimized" options turned on. So, I know doing more tasks in a shorter time is a desirable goal ... but when one of the main points is to find the PROBLEMS IN THE APPLICATION, this is not necessarily the optimal way to get there from here ... I know I am not going to convince anyone who has their mind made up, but it is not as simple as it seems from the sidelines. The last points I will repeat: the fastest optimization is a more effective algorithm; the easiest program to debug is the one that is as close to "stock" as possible; and if we want to get the work done faster, get more computers on the task (engage your friends). The project may, I repeat MAY, actually know what they are doing, and if it would help, they would already have done all the things suggested to increase the runtime speed ...
Joined: 3 Nov 05 Posts: 1833 Credit: 120,150,994 RAC: 17,610
I couldn't agree with Paul more. If there were a GPU client available I'd be the first to go out and buy some decent GPUs, but as it stands, it seems that the limiting factor is staff time as well as compute power, as there are so many avenues to explore. The optimisations would help with the processing power but would be detrimental to the staff time available for doing the research, which leads to the improvements, which is what the project is all about. So in the meantime it's a case of getting as many CPUs connected as possible ;)
Sid Celery Joined: 11 Feb 08 Posts: 2220 Credit: 42,298,814 RAC: 24,407
Hmm, I thought it did. Maybe I was wrong. Even in the BOINC app, when you start it, the "Messages" tab has a "Processor features" line that seems to show that, but I don't see SSE. Maybe it's in there under another name.
For what it's worth, my Messages tab reports the following line:
26/03/2009 03:35:15||Processor features: fpu tsc pae nx sse sse2 pni
It's in there somewhere; "pni" (Prescott New Instructions) is the name Linux-style feature lists use for SSE3.
Michael G.R. Joined: 11 Nov 05 Posts: 264 Credit: 11,247,510 RAC: 0
I'm not disagreeing with you, Paul. In fact, I wrote a few comments earlier: "People in the Baker Lab are better positioned than I am to know what is the best use of their resources, but I just want to make sure that they're looking at all options when they make their choices. My goal is for scientific and medical breakthroughs, so however best we can reach that I'm fine with..."
I'm just saying that if one of the things holding optimizations back is that old CPUs don't have these instructions and that BOINC isn't good at juggling many versions of the code, we should just dump the old PCs. With the complexity of Rosetta's code, I imagine a lot of it is changing (trying new algorithms), but parts of it are pretty static. For example, if there's a part of the code that does some 3D repositioning of atoms, and all it does is that, there's a chance that part *isn't* changing anytime soon because it's just basic molecular physics; maybe that part could use SSE/SSE2/etc. without making the job of the people who work on the other algorithms harder. And maybe a lot of CPU cycles are eaten up doing "dumb" molecular physics while the "smart" part of the program, which is what is getting improved, isn't the CPU eater, so this optimization could help a lot.
Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0
I'm not disagreeing with you, Paul. In fact, I wrote a few comments earlier:
Sounded like it to me ... :) Just kidding ... this is one of those places where consensus can be hard to come by ...
"People in the Baker Lab are better positioned than I am to know what is the best use of their resources, but I just want to make sure that they're looking at all options when they make their choices. My goal is for scientific and medical breakthroughs, so however best we can reach that I'm fine with..."
Which creates other political problems with the user community. And your assumption is that the increased speed of the changed application will offset the loss of the other computers.
With the complexity of Rosetta's code, I imagine that a lot of stuff is changing (trying new algos), but that parts of it are pretty static.
On Ralph they are chasing some particularly nasty bugs that seem to "move" around a little ... so this is the type of error I am talking about. But more fundamentally, if they re-compile with SSE or whatever switch you choose, they have to start back at the beginning and validate that the changed code is operating as it should. In this case that is a potentially intractable problem, for the simple reason that changing the runtime code paths can (and here probably will) change the results.
Michael G.R. Joined: 11 Nov 05 Posts: 264 Credit: 11,247,510 RAC: 0
Which creates other political problems with the user community. And your assumption is that the increased speed of the changed application will offset the loss of the other computers.
Pentium IIIs, which were released 10 years ago, had SSE instructions. SSE2 came with the Pentium 4 a few years later. That means the computers we'd be dumping would mostly be about that old (if the change happened now; in reality the switch might take a few months or a year, so they'd be even older). I don't think that would be a very big "political" problem, especially since most computers that old probably don't have much RAM, and Rosetta's requirements have been slowly increasing. Expecting cutting-edge science to run on 10-year-old hardware is unrealistic. There's no question in my mind that an SSEx speedup on probably 90%+ of our current CPUs would more than offset the loss of probably <10% of the slowest CPUs on the project, and once the change is made, it keeps paying off forever (new CPUs joining the project...). Heck, if anyone has something pre-Pentium III on Rosetta, it's probably time to stop wasting electricity.
But more fundamentally, if they re-compile with SSE or whatever switch you choose, they have to start back at the beginning to validate the changed code is operating as it should. Particularly in this case a potentially intractable problem. For the simple reason that changing the runtime code paths can (and in this case probably will) change the results.
I'm not saying it would be easy; I'm saying it might be worth the trouble from a scientific point of view (if we can run more models in the same amount of time, the results will be better). There's a chance that what I mentioned in my last comment would make it less problematic than you think: part of the code is probably well-understood "static" 3D molecular dynamics (the laws of physics don't change...); they could isolate that part and validate the SSE version against the non-SSE version to make sure it gives the same results. AFAIK, SSE doesn't have to be used over the whole code base. Right now it's as if most Rosetta CPU donors have cars with 5-speed transmissions and the project only lets them shift up to 3rd gear...
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0
Michael G.R., perhaps you could post a few links to blogs or other reports of optimization projects that have shown such stellar results, or to manufacturers reporting their products will bring that level of improvement. Rosetta Moderator: Mod.Sense
Michael G.R. Joined: 11 Nov 05 Posts: 264 Credit: 11,247,510 RAC: 0
Not being a programmer, I cannot go into too much detail, but what comes to mind is Einstein@home:
"Einstein@home gained considerable attention from the world's distributed computing community when an optimized application for the S4 data set analysis was developed and released in March 2006 by project volunteer Akos Fekete, a Hungarian programmer. Fekete improved the official S4 application and introduced SSE, 3DNow! and SSE3 optimizations into the code, improving performance by up to 800%." (this is from their Wikipedia page)
Also, Folding@home uses the GROMACS molecular dynamics (?) engine, which uses SSEx extensions, giving it, from what I remember reading, quite a speed advantage over the non-SSEx FAH apps (though I'm not super familiar with Folding@home, so there could be other factors). BOINC SIMAP also has two versions of its app, one SSE and one non-SSE. It would be interesting to run both on the same machine and compare run times... But I suspect all of this would only give us a general idea; improvements for Rosetta@home might be of a different nature than those of other projects (a bigger or smaller speedup). I'm sure others here are more familiar with SSE extensions than I am, but my understanding is that when they can be used on floating-point math, they make a pretty big difference (some video and audio encoding software uses these extensions, and the difference can be significant). I also know that SSE2 (introduced on the P4) can do double-precision (64-bit) floating point, which is what is usually needed for scientific calculations, though maybe Rosetta@home doesn't need it. I'll keep an eye open for the info you asked for, but I don't have a file with those references at hand.
Joined: 3 Nov 05 Posts: 1833 Credit: 120,150,994 RAC: 17,610
There was talk of asking Akos to have a look at the RaH code at one point - I think he posted here too...
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0
Thanks Michael. Now I see where you are coming from. Just keep in mind that with Einstein, I believe you are comparing an "optimized" version with the original. I mean to say an entirely different rewrite of the program, so the changes involved potentially go well beyond just SSE optimizations. One programmer looked at the whole thing with a fresh eye and rewrote portions of code to achieve the same (?) result in less time. I put a ? after "same" because often the results are NOT identical, but they are sufficiently close for the operations being performed.
I don't know if you've ever written hundreds of thousands of lines of text and had someone else edit it, but there is always room for improvement. With text, improvements can be made to convey the same information with fewer words. Or one might introduce an illustration and eliminate several paragraphs that attempted to describe enough detail for the reader to form their own visualization. When someone translates text to another language, it is entirely likely that the result is more concise than the original, again just due to fresh eyes following through and getting a deep understanding of each area of the body of work. Programming is much the same. Since the scientists are primarily focused on the science and not the programming, it is entirely possible that major portions of the improvement were not due to the SSE optimization at all. And it is certainly possible that Rosetta could get a marked improvement with a fresh eye on the coding. But keep in mind that researchers all over the world have already done this over the course of many years with Rosetta. And the translation a number of years ago from FORTRAN to C++ brought in fresh eyes and a rework, similar to translating from English to German.
So, if my writing style is very wasteful of words to begin with, and someone else comes along and edits my work, they can show a dramatic improvement. But that can be more a reflection of the original work than of the quality of the edits. I'm not trying to disagree that optimizations can be beneficial. And keep in mind that I am just a moderator, not a Rosetta coder; I'm just trying to help a non-programmer get some frame of reference on what is involved. In the case of programming, the "optimized" version will often be MORE lines of code to maintain, rather than fewer: you've got some lines of code that now only run in certain environments, and others that do not. If you take my analogy to text and consider someone coming in and adding a screenshot for every page, the result might be a CLEARER description, but the book it makes will be much thicker to print and much harder to maintain. Or, perhaps better, picture a book that has screenshots of a Windows environment on every page, and now "enhance" it to also show screenshots for Mac as well. Rosetta Moderator: Mod.Sense
©2025 University of Washington
https://www.bakerlab.org