Exploiting the full potential of superscalar processors requires algorithm designers to streamline memory references and allow for efficient data reuse throughout the memory hierarchy. This paper presents two parameterized Householder QR factorization algorithms tailored to the cache and register architectures common in superscalar processors, and develops guidelines for selecting parameter values that optimize cache and register utilization. The new algorithms are implemented and performance-tuned on three systems: an Intel Pentium Pro, an IBM SP2 node, and a Silicon Graphics POWER Challenge XL processor. The results on these systems demonstrate that the tuned algorithms make effective use of each processor, and the tuning guidelines apply to numerical linear algebra on superscalar architectures more generally.
Published in ACM Transactions on Mathematical Software, this research aligns with the journal's focus on efficient and reliable mathematical algorithms: its optimized Householder QR factorization algorithms contribute to numerical computation and linear algebra software.