Dalian University of Technology faculty homepage: He Guo

Paper Publications

A Stall-Aware Warp Scheduling for dynamically optimizing thread-level parallelism in GPGPUs

Indexed by: Conference paper

Date of Publication: 2015-06-08

Included Journals: EI, Scopus

Volume: 2015-June

Page Number: 15-24

Abstract: General-Purpose Graphics Processing Units (GPGPUs) have been widely used in high-performance computing as application accelerators due to their massive parallelism and high throughput. A GPGPU generally contains two layers of schedulers, a cooperative-thread-array (CTA) scheduler and a warp scheduler, which administer the thread-level parallelism (TLP). Previous research shows that maximized TLP does not always deliver optimal performance. Unfortunately, existing warp scheduling schemes do not optimize TLP at runtime, which makes it impossible to fit the various access patterns of diverse applications. Dynamic TLP optimization in the warp scheduler remains a challenge in exploiting the highly parallel compute power of GPGPUs. In this paper, we comprehensively investigate the performance impact of TLP in the warp scheduler. Based on our analysis of the pipeline efficiency, we propose Stall-Aware Warp Scheduling (SAWS), which optimizes the TLP according to the pipeline stalls. SAWS adds two modules to the original scheduler to dynamically adjust TLP at runtime. A trigger-based method is employed for a fast tuning response. We simulated SAWS and conducted extensive experiments on GPGPU-Sim using 21 paradigmatic benchmarks. Our numerical results show that SAWS effectively improves the pipeline efficiency by reducing structural hazards without causing extra data hazards. SAWS achieves an average speedup of 14.7% (geometric mean), even higher than the existing Two-Level scheduling scheme with the optimal fetch group sizes over a wide range of benchmarks. More importantly, compared with dynamic TLP optimization in CTA scheduling, SAWS still has a 9.3% performance improvement across the benchmarks, which shows that moving dynamic TLP optimization from the CTA scheduler to the warp scheduler is a competitive choice. © Copyright 2015 ACM.

Profile

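Education:

Bachelor's degree: Department of Computer Science, Jilin University, 1982
Master's degree: Department of Computer Science, Dalian University of Technology, 1989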

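Research and work experience:

October 1986 to October 1987: Visiting scholar, Progeni Company, New Zealand
October 1990 to December 1992: Visiting scholar, Department of Computer Science, PDI Karlsruhe University, Germany
December 1992 to December 2007: Associate professor, Department of Computer Science, Dalian University of Technology
March 1995 to June 1996: Chief engineer, Dalian Golden Card Project system
January 2008 to present: Professor, School of Software, Dalian University of Technology
Retired in April 2020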

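Teaching:

1992 to 2007: Introduction to Computers, Computer Organization and Structure, Computer Architecture
2009 to 2019: Storage Technology, Computer Architecture, Parallel Computing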

Research:

Research interests: parallel and distributed computing.
