Elementary Functions: Algorithms and Implementation

Elementary Functions: Algorithms and Implementation, 2nd Ed. · Jean-Michel Muller · 270 pages

Rigorous treatment of transcendental function implementation: IEEE 754 fundamentals (ULPs, rounding modes, FMA, subnormals), polynomial approximation via Chebyshev/Remez minimax, Horner/Estrin evaluation, range reduction (Cody-Waite, Payne-Hanek for large arguments), CORDIC for sin/cos/atan, and correct-rounding via Ziv's multilevel strategy.

Capabilities (6)
  • Evaluate polynomials efficiently using Horner's method (sequential) or Estrin's method (pipelined)
  • Design minimax polynomial approximations using Chebyshev polynomials and the Remez algorithm
  • Implement range reduction for exp/log/sin/cos using Cody-Waite or Payne-Hanek techniques
  • Reason about floating-point error in ULPs and design algorithms targeting ≤1 ulp accuracy
  • Apply CORDIC algorithm to compute sin/cos/atan via iterative shift-and-add rotations
  • Use FMA (fused multiply-add) to avoid intermediate rounding in double-double arithmetic
How to use

Install this skill and Claude can implement the full three-stage pipeline for any transcendental function — Cody-Waite range reduction, Horner/Estrin polynomial evaluation, and FMA-based reconstruction — while reasoning about per-stage ULP error budgets and targeting faithful or correctly-rounded output.

Why it matters

Explains exactly why naive implementations of exp, log, sin, and cos lose precision and how production libm libraries solve it — enabling Claude to implement correctly-rounded elementary functions, audit numerical code for precision loss, and choose the right accuracy-versus-performance tradeoff.

Example use cases
  • Implementing a double-precision sin(x) using Cody-Waite range reduction to [-π/4, π/4] followed by a degree-11 Horner polynomial targeting faithful rounding within 1 ULP
  • Tracing the ULP error introduced at each stage of a multi-step floating-point computation to identify which stage — range reduction, polynomial evaluation, or reconstruction — dominates total error
  • Implementing the CORDIC rotation-mode algorithm for an FPGA target where multiplication is expensive, using only integer shifts, adds, and a precomputed atan(2^-i) table with scale-factor correction

Elementary Functions Skill

IEEE 754 Floating-Point Fundamentals

Formats

Format                      Sign  Exponent  Mantissa  Total  Range
Binary32 (float)               1         8        23     32  ±3.4×10³⁸
Binary64 (double)              1        11        52     64  ±1.8×10³⁰⁸
Binary80 (x87 long double)     1        15      63+1     80  ±1.2×10⁴⁹³²
double representation:
  value = (-1)^sign × 1.mantissa × 2^(exponent - 1023)
  "hidden bit": leading 1 not stored (implied)
  subnormals: exponent = 0 → no hidden bit, gradual underflow

Rounding Modes (IEEE 754)

Round to nearest, ties to even (RNE) — default
Round toward +∞ (ceiling)
Round toward −∞ (floor)
Round toward zero (truncation)

Correct rounding: returned result = exact rounding of mathematical result
→ this is what IEEE 754 guarantees for +, -, *, /, √
→ NOT guaranteed by standard for transcendentals (exp, log, sin, etc.)
→ correctly rounded transcendentals = difficult research problem

ULP (Unit in the Last Place)

ulp(x): spacing between consecutive floats around x
       = 2^(exponent of x) × 2^(-52)  (for double)

Error measurement: "0.5 ulp error" means result within 0.5 spacing of exact
Correctly rounded → ≤ 0.5 ulp error
"Faithful rounding" → ≤ 1 ulp error (less strict, easier to achieve)

In code:
#include <math.h>
double ulp(double x) {                  // for finite x >= 0
    return nextafter(x, INFINITY) - x;  // next representable double above x, minus x
}

Fused Multiply-Add (FMA)

FMA(a, b, c) = a*b + c  with a single rounding (not two)
             = round(a×b + c) vs. round(round(a×b) + c)

Key uses:
1. Accurate dot products: one rounding per fma step instead of two
2. Exact multiplication error: e = fma(a, b, -p) recovers a×b - p exactly, where p = fl(a×b)
3. Error-free addition (2Sum, Knuth): s = fl(a+b), e = (a+b) - s, both representable
4. Fast building blocks for double-double arithmetic and compensated (Kahan) summation

CPU support: x86 FMA3 (Haswell/AVX2-era onward), ARM NEON/SVE; in C use fma() from <math.h> (C99) or __builtin_fma(a,b,c)

Polynomial Evaluation

Horner’s Method (preferred)

P(x) = a₀ + a₁x + a₂x² + ... + aₙxⁿ
     = a₀ + x(a₁ + x(a₂ + ... + x(aₙ₋₁ + x·aₙ)...))

Implementation (n multiplications, n additions, sequential):
double horner(double x, double* a, int n) {
    double p = a[n];
    for (int i = n-1; i >= 0; i--)
        p = p * x + a[i];
    return p;
}
// Error: bounded by n × ulp(P(x)) approximately
// Limitation: sequential → poor pipeline utilization on superscalar CPUs

Estrin’s Method (parallelizable)

// Group into pairs, then pairs-of-pairs
// P(x) = (a₀ + a₁x) + x²(a₂ + a₃x) + x⁴(a₄ + a₅x) + ...
// Then combine pairs: P = pair₀ + x²·pair₁ + x⁴·pair₂ + ...
// Allows multiple FP units to work in parallel
// Slightly higher error than Horner (more rounding steps) but faster on pipelined hardware
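A concrete sketch for degree 7 (the name estrin7 is illustrative): the four pair evaluations are independent, as are x² and x⁴, so a superscalar core can overlap them.

```c
/* Estrin evaluation of a[0] + a[1]x + ... + a[7]x^7.
   Pairs (a0+a1*x), (a2+a3*x), ... have no data dependence on each other;
   combining them needs only x^2 and x^4, also computable concurrently. */
double estrin7(double x, const double a[8]) {
    double x2  = x * x;
    double x4  = x2 * x2;
    double p01 = a[0] + a[1] * x;   /* these four lines can issue in parallel */
    double p23 = a[2] + a[3] * x;
    double p45 = a[4] + a[5] * x;
    double p67 = a[6] + a[7] * x;
    double p03 = p01 + p23 * x2;
    double p47 = p45 + p67 * x2;
    return p03 + p47 * x4;
}
```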

Approximation Methods

Chebyshev Polynomials (near-minimax)

T₀(x) = 1
T₁(x) = x
Tₙ(x) = 2x·Tₙ₋₁(x) - Tₙ₋₂(x)

Key property: Tₙ oscillates between ±1 on [-1,1] with n+1 equioscillation points
→ Chebyshev expansion gives near-minimax polynomial approximation

Process:
1. Map interval [a,b] to [-1,1]: t = 2(x-a)/(b-a) - 1
2. Compute Chebyshev coefficients at the nodes xₖ = cos(π(k+½)/N): cₙ = (2/N) Σₖ f(xₖ)Tₙ(xₖ)
3. Truncate series at desired precision
4. Convert to standard polynomial for Horner evaluation

Better than Taylor: equioscillates error vs. one-sided error buildup

Remez Algorithm (true minimax)

Goal: find polynomial P of degree n minimizing max|f(x) - P(x)| on [a,b]

Key theorem (Chebyshev equioscillation): P* is the minimax polynomial
iff the error f(x) - P*(x) equioscillates at n+2 points.

Remez exchange algorithm (iterative):
1. Initialize with Chebyshev nodes as extremal points
2. Solve linear system for P that equioscillates at current n+2 nodes
3. Find actual extrema of f - P between nodes
4. Exchange nodes, repeat until convergence

Tools: Sollya (remez()), Maple (numapprox[minimax]), Mathematica (MiniMaxApproximation)

Typical error: degree-11 polynomial for sin on [-π/4, π/4] ≈ 10⁻¹⁶ (double precision)

Rational Approximations

R(x) = P(x) / Q(x)  where P, Q are polynomials

Advantage: same accuracy as polynomial at lower degree (Padé approximants)
           converges where polynomials converge slowly (near singularities)
Disadvantage: one extra division; more complex error analysis

Good for: tan near ±π/2 (pole), atan over wide intervals, roots such as x^(1/3)
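A minimal illustration (a Padé approximant, not minimax): the [3/2] Padé form of atan, which matches the Taylor series x - x³/3 + x⁵/5 through x⁵ using only degree 3 over degree 2.

```c
/* [3/2] Pade approximant of atan: good only for small |x|,
   but shows how a rational form buys accuracy at low degree. */
double atan_pade(double x) {
    double x2 = x * x;
    return x * (15.0 + 4.0 * x2) / (15.0 + 9.0 * x2);
}
```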

Range Reduction

Why It’s Needed

Polynomial approximations only work on small intervals.
Strategy: reduce x to x' ∈ [-δ, δ], then apply polynomial, then reconstruct.

Example for exp(x):
  x = n·ln2 + r, where n = round(x/ln2), r = x - n·ln2 ∈ [-ln2/2, ln2/2]
  exp(x) = 2ⁿ × exp(r)    ← ldexp(exp(r), n) in C
  exp(r) computed by polynomial on small interval

Cody-Waite Method (moderate range)

// For sin/cos: reduce to [-π/4, π/4]
// Store π/2 as two-part float for accuracy: PIO2_HI + PIO2_LO
// n = round(x / (π/2))
// r = x - n*PIO2_HI - n*PIO2_LO  (two-step for accuracy)
// Then use sin/cos on r based on n mod 4

#define PIO2_HI 1.5707963267948966192e+00   // high part
#define PIO2_LO 6.123233995736765886130e-17 // low part (Cody-Waite)

Payne-Hanek Method (large arguments)

Problem: for large x (e.g., x = 1e16), n*PIO2 has catastrophic cancellation
Cody-Waite fails; need extended precision precomputed table of 4/π bits.

Method:
1. Use many extra bits of 4/π stored as integer array
2. Multiply x by appropriate portion of the table
3. Extract remainder mod 1 → accurate r

Used in: fdlibm's __kernel_rem_pio2, adopted by glibc and many other libm implementations

CORDIC Algorithm

Circular Mode (sin/cos/atan)

CORDIC (COordinate Rotation DIgital Computer):
Rotate vector (x, y) by angle θ using only shifts and additions.

Iteration: for i = 0, 1, 2, ..., n-1:
  σᵢ = sign(remaining angle)
  x_{i+1} = xᵢ - σᵢ · yᵢ · 2^(-i)
  y_{i+1} = yᵢ + σᵢ · xᵢ · 2^(-i)
  z_{i+1} = zᵢ - σᵢ · atan(2^(-i))

After n iterations:
  x_n = K · cos(θ)     (K ≈ 0.6073 = 1/∏√(1 + 2^(-2i)))
  y_n = K · sin(θ)
  z_n ≈ 0

Scale factor: K must be compensated (multiply result by 1/K)

CORDIC modes:
  Rotation mode: compute cos(z), sin(z) from input angle z
  Vectoring mode: compute atan(y/x), magnitude from input (x, y)

Used in: FPGAs, embedded processors, the original Intel 8087 x87 implementation, calculators

Implementation Strategy

High-Quality Library Design Pattern

Step 1: Range reduction
  → bring argument into small interval [-δ, δ]
  → choose δ small enough for target precision
  → must be done in extra precision (e.g., Cody-Waite two-step)

Step 2: Polynomial approximation
  → minimax (Remez) polynomial on reduced interval
  → evaluate with Horner (sequential) or Estrin (pipelined)
  → coefficients stored in C as double literals

Step 3: Reconstruction
  → recombine polynomial result with range reduction parameters
  → must use FMA or double-double when precision is tight

Step 4: Correct rounding (optional)
  → Ziv's multilevel strategy: first try at low precision,
    if result is ambiguous (near midpoint), retry at higher precision
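A crude form of the ambiguity check in step 4 (a sketch, not Ziv's full test): if perturbing the fast result by its known error bound cannot change the rounded double, the fast path is safe; otherwise recompute at higher precision.

```c
#include <stdbool.h>

/* y: fast-path result; abs_err: proven absolute error bound.
   Under round-to-nearest, y +/- abs_err rounds back to y exactly when
   abs_err is below ~half an ulp of y on both sides — then no higher-
   precision pass can produce a different rounded answer. */
bool rounding_is_safe(double y, double abs_err) {
    return (y + abs_err == y) && (y - abs_err == y);
}
```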

Error Budget Analysis

For double precision (52-bit mantissa), target: ≤ 1 ulp
Sources of error:
  1. Approximation error: controlled by polynomial degree
  2. Rounding in polynomial evaluation: Horner adds ~n ulp
  3. Range reduction error: catastrophic if done naively for large x
  4. Reconstruction rounding: use FMA if possible

Rule of thumb: work with 1-2 guard bits beyond target precision
               use double-double or quad-double when needed

Reference: Key Algorithm Applications

Function  Range reduction                            Method                      Notes
exp(x)    n = round(x/ln2), r = x - n·ln2            Polynomial + ldexp          n-bit table for speed
log(x)    factor out 2ⁿ, normalize mantissa          Polynomial + n·log2         Argument near 1
sin/cos   Payne-Hanek (large x), Cody-Waite (small)  Polynomial on [-π/4, π/4]   Sign from quadrant
atan(x)   reduce to [0, 1] via identity              Polynomial + identity       atan(x) = π/2 - atan(1/x)
sqrt(x)   normalize, extract exponent                Newton-Raphson refinement   IEEE mandates correct rounding