K 10 svn:author V 3 bde K 8 svn:date V 27 2005-11-28T11:46:20.000000Z K 7 svn:log V 1105 Rearranged the polynomial evaluation some more to reduce dependencies. Instead of echoing the code in a comment, try to describe why we split up the evaluation in a special way. The new optimization is mostly to move the evaluation of w = z*z later so that everything else (except z = x*x) doesn't have to wait for w. On Athlons, FP multiplication has a latency of 4 cycles so this optimization saves 4 cycles per call provided no new dependencies are introduced. Tweaking the other terms in to reduce dependencies saves a couple more cycles in some cases (more on AXP than on A64; up to 8 cycles out of 56 altogether in some cases). The previous version had a similar optimization for s = z*x. Special optimizations like these probably have a larger effect than the simple 2-way vectorization permitted (but not activated by gcc) in the old version, since 2-way vectorization is not enough and the polynomial's degree is so small in the float case that non-vectorizable dependencies dominate. On an AXP, tanf() on uniformly distributed args in [-2pi, 2pi] now takes 34-55 cycles (was 39-59 cycles). END