- Joined
- May 17, 2005
- Location
- High Desert, Calif.
I wish I could direct link to this discussion, but as that is not possible, here is a transcript:
Alternative benchmarking/points plan
VijayPande said:We've had this plan for some time, but haven't implemented it since it's such a radical change. However, perhaps this is a good time to reconsider it.
The goal of this change to how points are determined is to give donors very constant PPD. Here's how it would work:
1) The client runs a benchmark calculation and keeps track of how much CPU time it takes.
2) The benchmark has a certain number of points associated with it. Let's say in this example that if you can do the benchmark in 10 seconds, you get 1000 PPD (the actual number may be different, of course). If the client runs the benchmark in X seconds, then the PPD of that machine should be 1000*(10/X). For this example, let's say that X=5.
3) The client keeps track of how much CPU time the core ran. Let's say in this example it took 2 days to do the WU. Then the client would report back to the server that it took 2 days and the machine is rated at 2000 PPD, so the server would award 4000 points for that WU. We could keep track of the PPD rating and the WU length to see whether it's comparable to other machines.
The basic idea here is that donors would get the same PPD on a given machine consistently. The challenge for us would be to make sure that the benchmark calculation is representative of the science being done (since donors will optimize for benchmark PPD). Also, we'd have to make sure there are no easy ways to cheat in this scheme. Assuming this can be done, what do you think?
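The arithmetic of the proposed scheme can be sketched in a few lines. This is only an illustration of the example numbers above (10-second reference benchmark, 1000 PPD reference rating); the function names and constants are hypothetical, not part of any actual FAH client.

```python
# Sketch of the proposed benchmark-based points scheme, using the
# example figures from the post above. Names here are illustrative only.

BENCHMARK_REFERENCE_SECONDS = 10.0  # reference machine finishes the benchmark in 10 s
BENCHMARK_REFERENCE_PPD = 1000.0    # ...and is rated at 1000 PPD

def machine_ppd(benchmark_seconds: float) -> float:
    """Rate a machine: halving the benchmark time doubles the PPD."""
    return BENCHMARK_REFERENCE_PPD * (BENCHMARK_REFERENCE_SECONDS / benchmark_seconds)

def wu_points(benchmark_seconds: float, cpu_days: float) -> float:
    """Points awarded for one WU: the machine's PPD rating times the days it ran."""
    return machine_ppd(benchmark_seconds) * cpu_days

# The example in the post: benchmark in 5 s gives a 2000 PPD rating,
# so a WU that took 2 days of CPU time earns 4000 points.
print(machine_ppd(5))    # 2000.0
print(wu_points(5, 2))   # 4000.0
```

Note that under this scheme the award depends only on the machine's own benchmark time and how long it ran, which is what makes the PPD constant for a given machine regardless of which project it draws.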
mhouston said:How would you deal with machine upgrades? Re-benchmark every day? And what about GPU support, which is really where PPD ends up all over the map, as the differential between high-end and low-end systems diverges as the proteins get larger?
bruce said:As I'm sure you'll remember, this was tried once. It's simply too easy to underclock or do other things which slow down the benchmark. The overclocking can be restored once the benchmark has been completed, resulting in a higher score.
Do you plan to run the benchmark semi-continuously -- say 2% of production? Wouldn't that become a target for hackers?
What would happen to the QRB? Would this also replace the present bonus plan for early returns which is responsible for the greatest part of the variation in PPD but also provides a strong incentive for donors to concentrate on getting work returned promptly for the scientific benefit?
VijayPande said:mhouston said:How would you deal with machine upgrades? Re-benchmark every day? And what about GPU support, which is really where PPD ends up all over the map, as the differential between high-end and low-end systems diverges as the proteins get larger?
We'd be re-benchmarking every WU; the benchmark wouldn't be too slow. There would be a GPU benchmark. Recall the goal here is consistent and predictable PPD. I agree that there will be a differential between high end and low end, but that's unavoidable. However, it does put a lot of significance into the benchmark calculation: if that's not reasonable, people will optimize their hardware in ways that are not ideal for the project.
VijayPande said:bruce said:As I'm sure you'll remember, this was tried once. It's simply too easy to underclock or do other things which slow down the benchmark. The overclocking can be restored once the benchmark has been completed, resulting in a higher score.
Do you plan to run the benchmark semi-continuously -- say 2% of production? Wouldn't that become a target for hackers?
What would happen to the QRB? Would this also replace the present bonus plan for early returns which is responsible for the greatest part of the variation in PPD but also provides a strong incentive for donors to concentrate on getting work returned promptly for the scientific benefit?
We would have to be constantly benchmarking at random times (almost like random drug testing).
QRB: I'm confident we can come up with a plan which integrates that in. The benchmark would be about base points.
VijayPande said:PS I forgot to mention that since this is such a radical change, it wouldn't be something to just immediately switch to. I would be tempted to have both systems running simultaneously, with the true points being given by the old system and the new system giving us some input. If nothing else, it's like having a mega-cluster of benchmark systems to compare PPD.
Ivoshiee said:We have had this points issue around since the beginning of FAH, and it seems that we have sat still. Maybe it is possible to find someone from Stanford or elsewhere to assign the topic to, perhaps even to write a BA or BS thesis on the subject. All of that will need modeling, simulation, and analysis; then we can model a "perfect" points system out of that work.
An easier option to introduce is re-benchmarking on the Stanford side. Since projects utilize different parts of the FAH core procedures, and may shift to different parts over their lifetime, projects could be re-benchmarked weekly or monthly. Also, each core build would trigger automatic project re-benchmarking, and each project would get however many points the benchmark yields.
VijayPande said:I'd like to avoid re-benchmarking as 1) that could take a lot of PG time away from other activities and 2) it could lead to unpredictable fluctuations in points, which is moving away from my primary goal of consistent PPD.
Also, this sort of thing isn't a good undergraduate thesis project, at least not in Chemistry or CS, as it would not be seen as being of intellectual interest to my colleagues. I do wonder if I could get someone in economics interested, but I am not associated with that department, so that's tougher.
Ivoshiee said:VijayPande said:I'd like to avoid re-benchmarking as 1) that could take a lot of PG time away from other activities and 2) it could lead to unpredictable fluctuations in points, which is moving away from my primary goal of consistent PPD.
1) It should not take away any PG time if it is an automatic procedure built into the FAH server side, and 2) if the running project changes over the course of its progress, or the processing code does, then it is good that its points change as well.
VijayPande said:Also, this sort of thing isn't a good undergraduate thesis project, at least not in Chemistry or CS, as it would not be seen as being of intellectual interest to my colleagues. I do wonder if I could get someone in economics interested, but I am not associated with that department, so that's tougher.
I am sure you can find someone interested in doing that.
bruce said:Ivoshiee said:1) It should not take away any PG time if it is an automatic procedure built into the FAH server side, and 2) if the running project changes over the course of its progress, or the processing code does, then it is good that its points change as well.
I think you're presuming that the Work Servers are identical to the benchmark hardware and that it's okay to suspend uploading/downloading for long enough to dedicate the server to benchmarking.
The other assumption would be that designated projects would direct a WU to a standard benchmark machine periodically and that it, too, can be dedicated. The real question here is how much of the benchmarking process is automated, and what additional automated code could or should be developed to perform repeated periodic benchmarks, plus some sort of a pipeline that directs the data from a new benchmark to someone who compares the results with the previous benchmark(s) and decides whether the variation is sufficient to readjust the baseline points. (I don't think that latter part should ever be automated. A human needs to think about the reasonableness of the new data and follow a designated policy.)
Ivoshiee said:I am sure you can find someone interested in doing that.
Personally, I doubt it. The conclusions could not be classified as scholarly work. There is no real advance in useful knowledge that would be applicable outside of an extremely narrow field of interest -- specifically, the points competition in FAH. There's not going to be any thesis to publish. If I were a professor reviewing the proposed research, or perhaps the data and the conclusions, I wouldn't grant a degree on the basis of it actually being something worthy of being called "research."
Ivoshiee said:bruce said:Personally, I doubt it. The conclusions could not be classified as scholarly work. There is no real advance in useful knowledge that would be applicable outside of an extremely narrow field of interest -- specifically, the points competition in FAH.
You see it a bit too narrowly here. From the history of this topic we can see that it is a social and mathematical problem. Obviously, much of the thesis would not be about "the ideal points system for FAH". Maybe some implementation part of it would be, but largely it would be about modeling and analyzing a "reward system based on a set of input-output parameters and constraints", which should have (among other possible sets) all the FAH requirements and needs abstracted and defined in mathematical terms. If that part is done, then the "politics is thrown out" and it is basically just a matter of finding the best theory and algorithm for the task. Whether that comes from the realm of social behaviour or physics or somewhere else is up to the researcher to decide.
kasson said:For the moment, please accept that benchmark problems would not be considered an appropriate scholarly project at Stanford in the fields that Professor Pande is associated with. If we consider the original proposal: assume we can provide a means to run a benchmark calibration as part of the work unit and a means (non-trivial) to detect changes to machine capability and ensure that the benchmark continues to be representative of the machine capability during the run. Given that, what do you think of basing points on this on-system benchmark? This is non-trivial for us to implement but could provide a much greater degree of consistency.
bruce said:For WUs without a bonus, I think it's an excellent suggestion.
That still leaves a question about how to establish settings for QRB that are "fair" for everyone.
This discussion started because of discrepancies between WUs similar to p6901 and p2686. Both projects have the same K-factor, the same deadlines, etc., and (before the discussion started) the same baseline points. Are these two types of projects still going to have the same values in psummary, or will there be a systematic adjustment in baseline points that will make them somehow be adjusted ON DONOR SYSTEMS so that they're no longer identical?
The baseline points for p2684 were adjusted heuristically by the suggested 10/7. We need a systematic plan that can avoid future heuristic adjustments and that plan needs to apply to everything. It's not clear whether this would accomplish that or not.
VijayPande said:My plan would be to have this new system running alongside the current one for a while, which would allow us to try out how the new system would do before switching over to it. If nothing else, it would be a built-in way to do benchmarking on a variety of machines to look for PPD fluctuations. That data alone may be very useful in making our point that PPD *can't* be consistent with the current fixed-points-based method. If so, then we would have the needed information to decide how to move forward.