Due to the irregularity of graph data, designing efficient GPU-based graph algorithms remains a challenging task. Although GPUs offer massive parallelism, GPU-based graph computing is often limited by irregular memory access and work imbalance. To address these issues, in this paper we propose a fine-grained task distribution strategy for the triangle counting task. Extensive experiments and theoretical analysis confirm the superiority of our algorithm on both large real-world and synthetic graph datasets.
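The abstract does not spell out the strategy itself, but one common fine-grained scheme for GPU triangle counting assigns work per edge rather than per vertex, which mitigates the imbalance caused by skewed degree distributions. The CUDA sketch below is a minimal illustration of that edge-centric pattern on a CSR graph oriented as a DAG (edges directed from lower to higher vertex ID, so each triangle is counted exactly once); it is an assumed example, not the paper's actual algorithm, and all identifiers (countTriangles, edgeSrc, etc.) are hypothetical.

```cuda
// Hypothetical sketch: edge-centric (fine-grained) triangle counting
// on a CSR graph. One thread per directed edge (u, v); each thread
// intersects the sorted adjacency lists of u and v.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void countTriangles(const int *rowPtr, const int *colIdx,
                               const int *edgeSrc, int numEdges,
                               unsigned long long *count) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numEdges) return;
    int u = edgeSrc[e];   // source vertex of edge e
    int v = colIdx[e];    // destination vertex of edge e
    // Merge-based intersection of N(u) and N(v), both sorted ascending.
    int i = rowPtr[u], iEnd = rowPtr[u + 1];
    int j = rowPtr[v], jEnd = rowPtr[v + 1];
    unsigned long long local = 0;
    while (i < iEnd && j < jEnd) {
        int a = colIdx[i], b = colIdx[j];
        if (a == b) { ++local; ++i; ++j; }
        else if (a < b) ++i;
        else ++j;
    }
    if (local) atomicAdd(count, local);
}

int main() {
    // Tiny example: the complete graph K4, oriented low -> high.
    // Vertices 0..3; edges: 0->1, 0->2, 0->3, 1->2, 1->3, 2->3.
    int hRowPtr[]  = {0, 3, 5, 6, 6};
    int hColIdx[]  = {1, 2, 3, 2, 3, 3};
    int hEdgeSrc[] = {0, 0, 0, 1, 1, 2};  // source of each CSR edge
    int numEdges = 6;

    int *dRowPtr, *dColIdx, *dEdgeSrc;
    unsigned long long *dCount, hCount = 0;
    cudaMalloc(&dRowPtr, sizeof(hRowPtr));
    cudaMalloc(&dColIdx, sizeof(hColIdx));
    cudaMalloc(&dEdgeSrc, sizeof(hEdgeSrc));
    cudaMalloc(&dCount, sizeof(hCount));
    cudaMemcpy(dRowPtr, hRowPtr, sizeof(hRowPtr), cudaMemcpyHostToDevice);
    cudaMemcpy(dColIdx, hColIdx, sizeof(hColIdx), cudaMemcpyHostToDevice);
    cudaMemcpy(dEdgeSrc, hEdgeSrc, sizeof(hEdgeSrc), cudaMemcpyHostToDevice);
    cudaMemcpy(dCount, &hCount, sizeof(hCount), cudaMemcpyHostToDevice);

    countTriangles<<<(numEdges + 255) / 256, 256>>>(dRowPtr, dColIdx,
                                                    dEdgeSrc, numEdges,
                                                    dCount);
    cudaMemcpy(&hCount, dCount, sizeof(hCount), cudaMemcpyDeviceToHost);
    printf("triangles: %llu\n", hCount);  // K4 contains 4 triangles
    return 0;
}
```

Compared with a vertex-centric mapping, this per-edge assignment gives each thread a more uniform amount of work on skewed graphs, at the cost of an extra edge-source array and atomic accumulation; finer-grained variants go further and split a single neighbor-list intersection across a warp.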