
Alternative benchmarking/points plan


ihrsetrdr (Señor Senior Member, joined May 17, 2005, High Desert, Calif.)
I wish I could direct link to this discussion, but as that is not possible, here is a transcript:

Alternative benchmarking/points plan

VijayPande said:
We've had this plan for some time, but haven't implemented it since it's such a radical change. However, perhaps this is a good time to reconsider it.

The goal of this change to how points are determined is to give donors very constant PPD. Here's how it would work:

1) The client runs a benchmark calculation and keeps track of how much CPU time it takes.

2) The benchmark has a certain number of points associated with it. Let's say, in this example, that if you can do the benchmark in 10 seconds, you get 1000 PPD (the actual number may be different, of course). If the client runs the benchmark in X seconds, then the PPD of that machine should be 1000*(10/X). For this example, let's say that X = 5.

3) The client keeps track of how much CPU time the core ran. Let's say in this example it took 2 days to do the WU. Then the client would report back to the server that it took 2 days and the machine is rated at 2000 PPD, so the server would award 4000 points for that WU. We could keep track of the PPD rating and the WU length to see whether it's comparable to other machines.

The basic idea here is that donors would get the same PPD on a given machine consistently. The challenge for us would be to make sure that the benchmark calculation is representative of the science being done (since donors will optimize benchmark PPD). Also, we'd have to make sure there are no easy ways to cheat in this scheme. Assuming this can be done, what do you think?
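
To make the arithmetic in Vijay's example concrete, here is a rough sketch of the proposed calculation in Python. The 10-second reference time, the 1000 PPD figure, and the function names are just the illustrative numbers from his post, not real F@H constants:

Code:
# Sketch of the proposed client-side points scheme (illustrative numbers only).

REFERENCE_SECONDS = 10.0   # time the reference machine needs for the benchmark
REFERENCE_PPD = 1000.0     # PPD rating for a machine that matches the reference

def machine_ppd_rating(benchmark_seconds):
    """Rate the machine: finishing the benchmark faster than the
    reference scales the PPD rating up proportionally."""
    return REFERENCE_PPD * (REFERENCE_SECONDS / benchmark_seconds)

def points_for_wu(benchmark_seconds, cpu_days_used):
    """Points = the machine's PPD rating times the CPU days the core consumed."""
    return machine_ppd_rating(benchmark_seconds) * cpu_days_used

# Vijay's example: benchmark runs in X = 5 s, the WU takes 2 CPU-days.
print(machine_ppd_rating(5.0))   # 2000.0 PPD rating
print(points_for_wu(5.0, 2.0))   # 4000.0 points for the WU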


mhouston said:
How would you deal with machine upgrades? Rebenchmark every day? And what about GPU support, which is where PPD really ends up all over the map, since the differential between high-end and low-end systems diverges as the proteins get larger?

bruce said:
As I'm sure you'll remember, this was tried once. It's simply too easy to underclock or do other things which slow down the benchmark. The overclocking can be restored once the benchmark has been completed, resulting in a higher score.

Do you plan to run the benchmark semi-continuously -- say 2% of production? Wouldn't that become a target for hackers?

What would happen to the QRB? Would this also replace the present bonus plan for early returns, which is responsible for the greatest part of the variation in PPD but also provides a strong incentive for donors to concentrate on getting work returned promptly, for the scientific benefit?

VijayPande said:
mhouston said:
How would you deal with machine upgrades? Rebenchmark every day? And what about GPU support, which is where PPD really ends up all over the map, since the differential between high-end and low-end systems diverges as the proteins get larger?

We'd be rebenchmarking every WU. The benchmark wouldn't be too slow. There would be a GPU benchmark. Recall that the goal here is consistent and predictable PPD. I agree that there will be a differential between high end and low end, but that's unavoidable. However, it does put a lot of significance on the benchmark calculation: if that's not reasonable, people will optimize their hardware in ways that aren't ideal for the project.

VijayPande said:
bruce said:
As I'm sure you'll remember, this was tried once. It's simply too easy to underclock or do other things which slow down the benchmark. The overclocking can be restored once the benchmark has been completed, resulting in a higher score.

Do you plan to run the benchmark semi-continuously -- say 2% of production? Wouldn't that become a target for hackers?

What would happen to the QRB? Would this also replace the present bonus plan for early returns, which is responsible for the greatest part of the variation in PPD but also provides a strong incentive for donors to concentrate on getting work returned promptly, for the scientific benefit?

We would have to be constantly benchmarking at random times (almost like random drug testing).

QRB: I'm confident we can come up with a plan which integrates that in. The benchmark would be about base points.
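
For a sense of what "random drug testing" style benchmarking could look like on the client, here is a small hypothetical sketch; the run_benchmark hook and the 15-minute mean interval are assumptions, not anything F@H has specified. Exponentially distributed gaps make the next benchmark time unpredictable:

Code:
import random
import time

def random_benchmark_loop(run_benchmark, mean_interval_s=900):
    """Call the (hypothetical) benchmark routine at unpredictable times.

    Exponentially distributed gaps give a memoryless schedule: knowing when
    the last benchmark ran says nothing about when the next one will fire,
    which is the point of the random-drug-testing analogy.
    """
    while True:
        time.sleep(random.expovariate(1.0 / mean_interval_s))
        run_benchmark()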

VijayPande said:
PS I forgot to mention that since this is such a radical change, it wouldn't be something to just immediately switch to. I would be tempted to have both systems running simultaneously, with the true points being given by the old system and the new system giving us some input. If nothing else, it's like having a mega-cluster of benchmark systems to compare PPD.

Ivoshiee said:
We have had this points issue since the beginning of FAH, and it seems we have stood still.

Maybe it is possible to find someone, at Stanford or elsewhere, to assign the topic to, perhaps even as a BA or BS thesis. All of it will need modeling, simulation, and analysis; then a "perfect" points system can be modeled from that work.

One easier option to introduce is re-benchmarking on the Stanford side. Since projects utilize different parts of the FAH core and may shift to different parts over their lifetime, re-benchmark projects on a weekly or monthly basis. Each new core build would also trigger automatic project re-benchmarking, and the project would get whatever points the new benchmark gives.

VijayPande said:
I'd like to avoid re-benchmarking as 1) that could take a lot of PG time away from other activities and 2) it could lead to unpredictable fluctuations in points, which is moving away from my primary goal of consistent PPD.

Also, this sort of thing isn't a good undergraduate thesis project, at least not in Chemistry or CS, as it would not be seen as being of intellectual interest to my colleagues. I do wonder if I could get someone in economics interested, but I am not associated with that department, so that's tougher.

Ivoshiee said:
VijayPande said:
I'd like to avoid re-benchmarking as 1) that could take a lot of PG time away from other activities and 2) it could lead to unpredictable fluctuations in points, which is moving away from my primary goal of consistent PPD.
1) It should not take away any PG time if it is an automatic procedure built into the FAH server side, and 2) if a running project changes over the course of its progress, or the processing code does, then it is good for its points to change as well.
Also, this sort of thing isn't a good undergraduate thesis project, at least not in Chemistry or CS, as it would not be seen as being of intellectual interest to my colleagues. I do wonder if I could get someone in economics interested, but I am not associated with that department, so that's tougher.
I am sure you'll find someone interested in doing that. :)

bruce said:
Ivoshiee said:
VijayPande said:
I'd like to avoid re-benchmarking as 1) that could take a lot of PG time away from other activities and 2) it could lead to unpredictable fluctuations in points, which is moving away from my primary goal of consistent PPD.
1) It should not take away any PG time if it is an automatic procedure built into the FAH server side, and 2) if a running project changes over the course of its progress, or the processing code does, then it is good for its points to change as well.
I think you're presuming that the Work Servers are identical to the benchmark hardware and that it's okay to suspend uploading/downloading for long enough to dedicate the server to benchmarking.

The other assumption would be that designated projects would direct a WU to a standard benchmark machine periodically and that it, too, can be dedicated. The real question here is how much of the benchmarking process is automated, and what additional automated code could/should be developed to perform repeated periodic benchmarks, plus some sort of pipeline that directs the data from a new benchmark to someone who compares the results with the previous benchmark(s) and decides whether the variation is sufficient to readjust the baseline points. (I don't think that latter part should ever be automated. A human needs to think about the reasonableness of the new data and follow a designated policy.)

Also, this sort of thing isn't a good undergraduate thesis project, at least not in Chemistry or CS, as it would not be seen as being of intellectual interest to my colleagues. I do wonder if I could get someone in economics interested, but I am not associated with that department, so that's tougher.
I am sure you'll find someone interested in doing that. :)
Personally, I doubt it. The conclusions could not be classified as a scholarly work. There is no real advance in useful knowledge that would be applicable outside of an extremely narrow field of interest -- specifically, the points competition in FAH. There's not going to be any thesis to publish. If I were a professor reviewing the proposed research or perhaps the data and the conclusions, I wouldn't grant a degree on the basis of it actually being something worthy of being called "research."

Ivoshiee said:
Personally, I doubt it. The conclusions could not be classified as a scholarly work. There is no real advance in useful knowledge that would be applicable outside of an extremely narrow field of interest -- specifically, the points competition in FAH. There's not going to be any thesis to publish. If I were a professor reviewing the proposed research or perhaps the data and the conclusions, I wouldn't grant a degree on the basis of it actually being something worthy of being called "research."
You see it a bit too narrowly here. From the history of this topic we can see that it is both a social and a mathematical problem. Obviously much of the thesis would not be about "the ideal points system for FAH". Maybe some implementation part of it would be, but largely it would be about modeling and analyzing a "reward system based on a set of input-output parameters and constraints", which should have (among other possible sets) all the FAH requirements and needs abstracted and defined in terms of mathematics. Once that part is done, the "politics is thrown out" and it is basically just a matter of finding the best theory and algorithm for the task. Whether it comes from the realm of social behaviour, physics, or somewhere else is up to the researcher to decide.

kasson said:
For the moment, please accept that benchmark problems would not be considered an appropriate scholarly project at Stanford in the fields that Professor Pande is associated with. If we consider the original proposal: assume we can provide a means to run a benchmark calibration as part of the work unit and a means (non-trivial) to detect changes to machine capability and ensure that the benchmark continues to be representative of the machine capability during the run. Given that, what do you think of basing points on this on-system benchmark? This is non-trivial for us to implement but could provide a much greater degree of consistency.

bruce said:
For WUs without a bonus, I think it's an excellent suggestion.

That still leaves a question about how to establish settings for QRB that are "fair" for everyone.

This discussion started because of discrepancies between WUs similar to p6901 and p2686. Both projects have the same K-factor, the same deadlines, etc., and (before the discussion started) the same baseline points. Are these two types of projects still going to have the same values in psummary, or will there be a systematic adjustment in baseline points that will make them somehow be adjusted ON DONOR SYSTEMS so that they're no longer identical?

The baseline points for p2684 were adjusted heuristically by the suggested factor of 10/7. We need a systematic plan that can avoid future heuristic adjustments, and that plan needs to apply to everything. It's not clear whether this proposal would accomplish that or not.

VijayPande said:
My plan would be to have this new system running alongside the current one for a while, which would allow us to try out how the new system would do before switching over to it. If nothing else, it would be a built-in way to do benchmarking on a variety of machines to look for PPD fluctuations. That data alone may be very useful in making our point that PPD *can't* be consistent with the current fixed-points method. If so, then we would have the needed information to decide how to move forward.
 
This strikes me as the Marxist approach to points. From each according to their ability, to each according to their needs.
 
This strikes me as the Marxist approach to points. From each according to their ability, to each according to their needs.
So we get to be the bourgeoisie? Everyone else who doesn't fold gets to be the proletariat? :p

"As the forces of production, most notably technology, improves, existing forms of social organization become inefficient and stifle further progress".(F@h points system)

See what's happening here? ;)
 
Within reason, PG can manipulate the PPD of, say, ATi and AMD machines to end complaints about low production. They can't go so far as to influence folks into buying hardware that slows the science, but they can go far enough to bring slow folders in. There will never be equal pay for equal work... as if there is with the QRB.

Bruce points out the weakness of the system: run the benchmark without the OC, then turn it back on and rake in the points. Random benchmarking sounds harder to implement than is practical. If hit with a random benchmark, the unscrupulous donor would turn off the OC, reinstall the client to get a new initial benchmark, and turn the OC back on.
 
...the thread continues:

bruce said:
We've spent quite a bit of time focusing on PPD for bigadv, to the exclusion of all other potential problems. What should we be doing about issues with A4 projects like this?
Subject: Projects 10062 - 10082

Will an internal benchmark that is (probably) based on Gromacs be equitable for ProtoMol work?

As you can see from the comments, it's not only an issue of low PPD, it's an issue of short deadlines.


Michael_McCord said:
Also, there would be fluctuation in PPD for people using their CPU heavily but intermittently, such as for gaming, when most of the CPU cycles are not being utilized for folding. I still like the idea of a standardized machine, which favors the typical donor/consumer, and it is whatever it is.

VijayPande said:
Yea, this is a good question that I'm sure others will ask. CPU time records just the amount of CPU time used. So if the core is only getting 1/4 of the CPU (let's say 3 other processes are running), then the WU will take longer in wall-clock time, but not longer in CPU time. The benchmark is based on *CPU time* used, not wall-clock time.

We would of course test this sort of thing to show that the PPD doesn't vary widely in these situations.
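
To illustrate the CPU-time-versus-wall-clock distinction Vijay is drawing, here is a small Python sketch; the kernel below is an arbitrary stand-in, not the actual benchmark calculation:

Code:
import time

def benchmark_kernel(iterations=5_000_000):
    # Arbitrary stand-in for the benchmark calculation.
    total = 0
    for i in range(iterations):
        total += i * i
    return total

wall_start = time.time()           # wall clock keeps advancing even when other
cpu_start = time.process_time()    # processes hold the CPU; process_time() does not
benchmark_kernel()
wall_elapsed = time.time() - wall_start
cpu_elapsed = time.process_time() - cpu_start

# On a loaded machine wall_elapsed grows, but cpu_elapsed stays roughly the
# same -- which is why rating the machine on CPU time keeps PPD stable when
# the donor games or runs other work alongside folding.
print(f"wall: {wall_elapsed:.2f}s  cpu: {cpu_elapsed:.2f}s")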

bruce said:
The PPD currently varies depending on how much time is used by gaming or whatever, so there's no change there.

A big part of this needs to be covered by donor education.

Suppose I run HFM and it tells me I'm getting 1000 PPD. Now suppose I spend an hour per day with a game that takes 100% of my system, or 2 hours a day with some other app that takes 50% of my system. At the end of the day, I will have earned 958 points, but I'll never notice. Actual points are accrued whenever WUs finish, so it's pretty much impossible to tell my actual overall PPD.

Donors are seeking to tune their systems for maximum throughput, and the only measurement of actual throughput is a short-term measurement. I'd guess that most of them don't understand that average throughput is less than peak throughput, or if they do, they don't care, because they don't really measure average throughput closely enough to matter. (I base my "most of them" statement on the amazing number of people who have absolutely no idea how to calculate PPD manually.) Everybody is focusing on the extrapolated time to complete 3 frames using the posted bonus calculations and the data extracted from psummary -- as presented by the 3rd-party tools.

Real-time benchmarking focuses on an entirely different aspect of the points system. I guess the fundamental question that we need to answer is how to explain it to the donors and especially how to explain it to the 3rd-party developers. Will it even be possible for them to extract enough data to manage the proposed changes, since they seem to be the primary conduit of information from FAH to the Donors with respect to benchmarking?

I don't know if anybody has noticed how many times V7 has been criticized for reporting lower PPDs than HFM does whenever the current WU has not folded continuously. Joe may eventually revise V7 to give a most-recent-frame or most-recent-3-frames calculation as well as an overall PPD number, but none of that has anything to do with the actual PPD that the Donor is earning.
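
As a quick check on Bruce's 958 figure: losing 1 of 24 hours of folding per day (one hour at 100% CPU, or two hours at 50%) scales 1000 PPD by 23/24, and 1000 * 23/24 is about 958. In Python, using his example numbers:

Code:
nominal_ppd = 1000
hours_lost_per_day = 1     # one hour at 100% CPU, or equivalently two hours at 50%
actual_points_per_day = nominal_ppd * (24 - hours_lost_per_day) / 24
print(round(actual_points_per_day))   # 958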

VijayPande said:
Bruce makes a very good point. Many donors never really try to see their true PPD (e.g. by averaging over a week to smooth out fluctuations), and all they know is what they get from some 3rd-party tool. This alone is a big issue if those 3rd-party tools aren't giving good, truly averaged values.


Michael_McCord said:
I still stand in favor of a standard benchmark machine. The main pluses for PG are its simplicity and reproducibility, so they can get back to more science and less nitpicking over slight variability between machines, which involves such things as RAM clock speed, timings and amount, L2 and L3 cache sizes, the degree of overclocking as a percentage of CPU clock speed, effects of heat, and the drain of CPU cycles from different types and/or numbers of GPUs folding simultaneously, to name a few. I think the massive undertaking of an independent benchmarking system is just not justified.


kasson said:
I think the thing to keep in mind with a standard benchmarked machine is that users *cannot* realistically expect to get consistent PPD across all projects unless their machine matches the benchmarking setup *exactly.* Even then, there will be some random fluctuations--e.g. for sequential runs of the same project on the standard benchmarking machine, we're seeing WU runtime fluctuation of a couple %. This is perhaps a donor education issue. Fluctuation is normal, and it is impossible to use a standardized setup to achieve consistent PPD across a range of hardware for a range of WU types (even using the same simulation engine). The "contents" of the simulation will have different demands on the hardware and cause different machines to scale differently compared to the standard machine. This is to be expected.

That said, I think either way is a valid choice. Just something to keep in mind.

VijayPande said:
Michael_McCord said:
I think the massive undertaking of an independent benchmarking system is just not justified.

The new system would *decrease* the amount of PG time spent and (if successful) give more consistent PPD. I am very worried that as time goes on, the very concept of having a benchmark machine will lead to wildly varying PPD even on similar machines. We have been doing internal tests of just different Linux distros on *identical* hardware, and we're seeing noticeable PPD variations even there.

What can't go on is a constant stream of points change requests where either PG members spend a lot of time on points (taking away time from science) or it's said that "PG doesn't care" or "PG doesn't listen" if we try to put science first.

One alternative is to keep the current system and for donors to become at peace with PPD variations. I'm not sure donors will be happy with that, but that's certainly the easiest from our side (nothing new to do), perhaps with only some tweaks in benchmarking protocol to help decrease variation as much as is possible with a single machine benchmark scheme.

bruce said:
Assuming that a continual benchmark process is built into each FahCore, how would this information be reported to the Donor (and to the 3rd-party benchmarking tools)? Maybe something like this (with variations based on the verbosity settings):

V7 log:
Code:
15:27:34:Unit 00:Completed 160000 out of 500000 steps  (32%)
15:31:21 Unit 00:Checkpoint written; Points factor 1.85
15:34:56:Unit 00:Completed 165000 out of 500000 steps  (33%)
15:42:20:Unit 00:Completed 170000 out of 500000 steps  (34%)
15:46:21 Unit 00:Checkpoint written; Points factor 1.87
15:49:47:Unit 00:Completed 175000 out of 500000 steps  (35%)
15:57:05:Unit 00:Completed 180000 out of 500000 steps  (36%)
16:01:21 Unit 00:Checkpoint written; Points factor 1.87
16:04:44:Unit 00:Completed 185000 out of 500000 steps  (37%)
16:12:11:Unit 00:Completed 190000 out of 500000 steps  (38%)
16:16:21 Unit 00:Checkpoint written; Points factor 1.42
16:20:24:Unit 00:Completed 195000 out of 500000 steps  (39%)
16:28:48:Unit 00:Completed 200000 out of 500000 steps  (40%)
16:31:21 Unit 00:Checkpoint written; Points factor 1.87
16:36:31:Unit 00:Completed 205000 out of 500000 steps  (41%)
V6 log:
Code:
[15:27:34] Completed 160000 out of 500000 steps  (32%)
[15:31:21] - Checkpoint written; Points factor 1.85
[15:34:56] Completed 165000 out of 500000 steps  (33%)
[15:42:20] Completed 170000 out of 500000 steps  (34%)
[15:46:21] - Checkpoint written; Points factor 1.87
[15:49:47] Completed 175000 out of 500000 steps  (35%)
[15:57:05] Completed 180000 out of 500000 steps  (36%)
[16:01:21] - Checkpoint written; Points factor 1.87
[16:04:44] Completed 185000 out of 500000 steps  (37%)
[16:12:11] Completed 190000 out of 500000 steps  (38%)
[16:16:21] - Checkpoint written; Points factor 1.42
[16:20:24] Completed 195000 out of 500000 steps  (39%)
[16:28:48] Completed 200000 out of 500000 steps  (40%)
[16:31:21] - Checkpoint written; Points factor 1.87
[16:36:31] Completed 205000 out of 500000 steps  (41%)

The Points factor would need to be available to the 3rd-party folks, which introduces a complication, though it's not insurmountable.

(The reason I'm reporting it with every checkpoint is that some donors have noticed that the frame times vary, depending on whether a checkpoint has been written during that 1% or not. That doesn't mean that the benchmarking process needs to be once per checkpoint. That's a separate consideration, and I vote for random times, at least once every 15 minutes. I don't recall how often the GPU core tested VRAM but there are some similarities.)
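
If FahCore logs ever did carry a "Points factor" line like Bruce's mock-up above, a 3rd-party tool could extract it with something along these lines. This is a sketch against the hypothetical log format, not any real FahCore output:

Code:
import re
from statistics import mean

# Matches the checkpoint lines in Bruce's mocked-up V6/V7 logs, e.g.
#   "[15:31:21] - Checkpoint written; Points factor 1.85"
#   "15:31:21 Unit 00:Checkpoint written; Points factor 1.85"
FACTOR_RE = re.compile(r"Checkpoint written; Points factor ([0-9.]+)")

def points_factors(log_text):
    """Return every points factor reported in the (hypothetical) log."""
    return [float(m.group(1)) for m in FACTOR_RE.finditer(log_text)]

def summarize(log_text):
    """Latest vs. averaged factor -- roughly the short-term versus average
    throughput distinction Bruce draws earlier in the thread."""
    factors = points_factors(log_text)
    return {"latest": factors[-1], "average": mean(factors)} if factors else {}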

I personally have to question whether Dr. Pande's goal is even feasible, as donors' machine configurations and usage variables are infinite.
 
If they really benchmarked every WU, there wouldn't be all this discussion. When there are problems, it is usually because they didn't benchmark the WU at all; rather, they assigned the same points because they assumed the WU was similar. The huge points on -bigadv are a prime example. I believe p2684, the slowest of the bunch, was the only a3/a5 WU actually benchmarked. The others were assigned the same value because of their similar size, but folded 50% faster. Now we're seeing quad-core processors throw up 60,000 PPD as a result.

I'm a contractor. Masons often get paid by piecework. The price per brick is set so that an average mason makes a mutually acceptable amount of money per day, and the faster the mason lays them, the more money they make. If the brick is bigger and harder to lay, I pay more per brick. If I ask them to lay concrete block, I pay a lot more per unit. That's the way points ought to work in FAH. All the Pande Group has to do is properly set the value of a WU.

However, they did away with that with the QRB and can never go back. I'm with Dr. McCord: choose a benchmark machine, one that can fold all the -bigadv, regular SMP, and uniprocessor WUs (now a hexacore with HT at a minimum), and use it to set the value of all the WUs. If you assign a WU value without a benchmark, be prepared to run a benchmark if there are complaints. The only problem there is that almost no one complains if a WU produces too much PPD, as happened with the A5 WUs.
 
Whew... finally got through all that. :)

I'm in full agreement with Charles... this idea is NUTS!!! How is it fun to always know how much PPD a system is going to make? It's not... it's boring... what would there be to talk about? Even with this proposed benchmark system there are still too many variables. People would try to build identical systems, and even then, when those two systems don't get the exact same PPD, those donors are going to complain. I think this idea solves nothing and would actually take away from the Folding experience.

Just benchmark the WUs correctly and forget about the whiners, PG.
 