Dalian University of Technology Homepage Platform: Guo He (Chinese homepage): A Stall-Aware Warp Scheduling for dynamically optimizing thread-level parallelism in GPGPUs

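Guo He (郭禾)

Professor, Doctoral Supervisor, Master's Supervisor
Gender: Male
Alma mater: Dalian University of Technology
Degree: Master's
Affiliation: School of Software; International School of Information and Software
Contact: guohe@dlut.edu.cn
Email: guohe@dlut.edu.cn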

A Stall-Aware Warp Scheduling for dynamically optimizing thread-level parallelism in GPGPUs

Paper type: Conference paper

Publication date: 2015-06-08

Indexed by: EI, Scopus

Volume: 2015-June

Pages: 15-24

Abstract: General-Purpose Graphics Processing Units (GPGPUs) have been widely used in high-performance computing as application accelerators due to their massive parallelism and high throughput. A GPGPU generally contains two layers of schedulers, a cooperative-thread-array (CTA) scheduler and a warp scheduler, which administer the thread-level parallelism (TLP). Previous research shows that maximized TLP does not always deliver optimal performance. Unfortunately, existing warp scheduling schemes do not optimize TLP at runtime, which makes it impossible to fit the various access patterns of diverse applications. Dynamic TLP optimization in the warp scheduler remains a challenge in exploiting the highly parallel compute power of GPGPUs. In this paper, we comprehensively investigate the performance impact of TLP in the warp scheduler. Based on our analysis of pipeline efficiency, we propose Stall-Aware Warp Scheduling (SAWS), which optimizes the TLP according to the pipeline stalls. SAWS adds two modules to the original scheduler to dynamically adjust TLP at runtime. A trigger-based method is employed for a fast tuning response. We simulated SAWS and conducted extensive experiments on GPGPU-Sim using 21 paradigmatic benchmarks. Our numerical results show that SAWS effectively improves pipeline efficiency by reducing structural hazards without causing extra data hazards. SAWS achieves an average speedup of 14.7% (geometric mean), even higher than the existing Two-Level scheduling scheme with optimal fetch group sizes over a wide range of benchmarks. More importantly, compared with dynamic TLP optimization in CTA scheduling, SAWS still achieves a 9.3% performance improvement across the benchmarks, which shows that moving dynamic TLP optimization from the CTA scheduler to the warp scheduler is a competitive choice. © Copyright 2015 ACM.
