ParCAn
Parallel code analysis on a GPU
In compiler construction there are analyses that propagate information along the edges of a graph and modify it until a fixed point is reached and the information no longer changes. In this project we built the ParCAn framework to accelerate such analyses by exploiting the massive parallelism of graphics cards.

In 2016, our research focus was on synchronization mechanisms for GPUs. Known synchronization methods for CPUs (e.g., a spin lock) cannot be used on GPUs without further adjustment, since the special architectural properties of GPUs easily lead to dead- and livelocks. Synchronization is required (even for predominantly data-parallel graph implementations) if data dependences occur dynamically. We have therefore developed a novel synchronization mechanism that solves two non-trivial problems related to GPUs: First, we prevent dead- and livelocks. Second, we retain as much parallelism as possible by allowing data-parallel threads to work concurrently on disjoint areas of a data structure. For example, think of threads that modify disjoint locations of a graph without affecting its structural integrity. In our approach, a programmer can provide rules that describe the conditions under which a parallel access is allowed. At runtime, we check these rules and determine how many threads can run in parallel.

We are currently extending the above synchronization mechanism with a scheduler that redistributes conflicting data accesses so that the SIMD execution on a GPU causes less serialization than without the re-ordering; hence, the degree of parallelism grows. The underlying idea exploits the fact that GPUs organize threads in hierarchical units. If the synchronization mechanism detects a conflicting access in one of these units, it checks whether the conflict also occurs in the next smaller unit. If it does not, the threads of that smaller unit can run in parallel, which is much better than serializing all threads of the enclosing unit. It is then the scheduler's task to re-distribute the detected collisions across the units so that as many threads as possible can run in parallel. As the scheduling is performed at run-time, it needs to be efficient, must itself run in parallel, and should make use of the dynamic thread creation capabilities of modern GPUs where possible.
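The warp-level part of this idea can be pictured with CUDA's warp primitives. The following sketch is only an illustration, not ParCAn's actual mechanism: it uses __match_any_sync (compute capability 7.0 and newer) to find the lanes of a warp that target the same CFG node, lets lanes with distinct targets run in parallel, and serializes only the colliding lanes. The names update_node() and work_node[] are hypothetical placeholders, and conflicts across warps or blocks would still have to be handled by the coarser levels of the mechanism.

// Illustrative sketch of warp-level conflict detection (an assumption for
// exposition, not the ParCAn implementation).
__device__ void update_node(int node);       // assumed analysis-specific update

__global__ void process_items(const int *work_node, int num_items) {
    int item = blockIdx.x * blockDim.x + threadIdx.x;
    if (item >= num_items) return;

    int node = work_node[item];               // CFG node this thread modifies
    unsigned lane   = threadIdx.x & 31;
    unsigned active = __activemask();         // lanes that still have work
    unsigned peers  = __match_any_sync(active, node);  // lanes hitting the same node
    int rank = __popc(peers & ((1u << lane) - 1));     // my turn within that group

    // Lanes with a unique node have rank 0 and all update in the first round,
    // i.e. in parallel; colliding lanes take turns in later rounds.
    for (int turn = 0; turn < 32; ++turn) {
        if (rank == turn)
            update_node(node);                // exactly one lane per node and turn
        __syncwarp(active);                   // order the turns within the warp
    }
}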
In addition, we focus on speeding up the communication phase of the BSP model. Communication here means the transfer of the graph in both directions (GPU ↔ host). The graph data structure used not only has a strong impact on this transfer, it also influences the run-time behavior of the computation phase. Since there is no publication in the literature that systematically investigates the impact of the data structure on the end-to-end run-time of a GPU graph analysis, we implemented a number of benchmarks that exercise different access patterns on graphs (e.g., successor/predecessor access, random node access) and eight different data structures to represent graphs on the GPU. For the measurements we used a number of structurally different graphs. The results are likely to help developers pick the right graph data structure for their GPU problem.
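To make the trade-offs concrete, here is a minimal sketch of one common candidate layout, a CSR-style adjacency structure. It is an illustrative assumption, not the benchmark code, and not necessarily the layout that performed best in our measurements; whether it wins depends on exactly the access patterns the benchmarks compare.

// Minimal sketch of a CSR-style graph layout; names are illustrative.
// Successors of node v are stored in adj[row[v] .. row[v+1]).
struct CsrGraph {
    int        num_nodes;
    const int *row;    // num_nodes + 1 offsets into adj
    const int *adj;    // concatenated successor lists
};

// One thread per node walks its successor list, the access pattern a forward
// data-flow analysis needs; predecessor or random access would favor other layouts.
__global__ void visit_successors(CsrGraph g, int *out_degree) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= g.num_nodes) return;

    int begin = g.row[v], end = g.row[v + 1];
    for (int e = begin; e < end; ++e) {
        int succ = g.adj[e];        // analysis-specific work on succ would go here
        (void)succ;
    }
    out_degree[v] = end - begin;    // trivial stand-in for a real per-node result
}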
Some analyses store their information in a global data structure that all threads can modify simultaneously. The high number of concurrent threads on a GPU in particular demands efficient synchronization. Thus, as part of the research project we implemented an efficient framework for establishing mutual exclusion; see the LNCS paper in the references. Previous approaches inevitably resulted in deadlocks when the GPU was fully utilized. Moreover, by using a variant of the inspection-execution paradigm, we further improved the efficiency of the framework.

Another research topic concerned the efficiency of graph data structures on GPUs. At its core, ParCAn implements a graph traversal algorithm: the program to be translated is converted into a graph, the control flow graph (CFG), on which the analyses are executed. Due to the large number of parallel accesses, the CFG is a performance-critical data structure for ParCAn. For this reason, we conducted an extensive study comparing the performance of graph data structures and used the results to determine the most suitable representation of the CFG. We also derived general criteria that predict the performance of a data structure under certain conditions. Even outside the context of ParCAn, developers can use these criteria, represented as a decision tree, to choose the most appropriate data structure for their static graph algorithms. The results of the study were presented at the GPGPU workshop; see the references.
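As a rough illustration of the inspection-execution idea (again an assumption for exposition, not the framework from the LCPC paper): every thread first tries to claim the node it wants to modify, only successful claimants execute their update in the current round, and the remaining items are deferred to a later round. The arrays owner[] (a per-node claim word initialized to -1 by the host) and deferred[] as well as update_node() are hypothetical.

// Rough sketch of an inspection/execution split, for exposition only.
__device__ void update_node(int node);        // assumed analysis-specific update

__global__ void round_based_updates(const int *work_node, int num_items,
                                    int *owner, int *deferred, int *num_deferred) {
    int item = blockIdx.x * blockDim.x + threadIdx.x;
    if (item >= num_items) return;

    int node = work_node[item];

    // Inspection: atomically claim the node; at most one thread per node wins.
    bool won = (atomicCAS(&owner[node], -1, item) == -1);

    if (won) {
        // Execution: the winner updates "its" node without further locking.
        update_node(node);
        atomicExch(&owner[node], -1);          // release the claim
    } else {
        int slot = atomicAdd(num_deferred, 1); // remember the deferred item ...
        deferred[slot] = item;                 // ... for the next round
    }
}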
Publications
- Blaß T., Philippsen M.:
Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU
12th Workshop on General Purpose Processing Using GPUs (Providence, RI, 13.04.2019 - 13.04.2019)
In: ACM (ed.): Proceedings of the 12th Workshop on General Purpose Processing Using GPUs, New York, NY, USA: 2019
DOI: 10.1145/3300053.3319416
URL: http://ieeetcca.org/2018/12/16/12th-workshop-on-general-purpose-processing-using-gpu-gpgpu-2019-asplos-2019/
- Blaß T., Philippsen M.:
GPU-Accelerated Fixpoint Algorithms for Faster Compiler Analyses (Best Paper Award)
28th International Conference on Compiler Construction (Washington, D.C., 16.02.2019 - 17.02.2019)
In: ACM (ed.): Proceedings of the 28th International Conference on Compiler Construction, New York, NY, USA: 2019
DOI: 10.1145/3302516.3307352
URL: http://www2.informatik.uni-erlangen.de/publication/download/cc19_parcan_blass.pdf
- Blaß T., Philippsen M., Veldema R.:
Efficient Inspected Critical Sections in Data-Parallel GPU Codes
30th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2017) (College Station, TX, 11.10.2017 - 13.10.2017)
In: Rauchwerger, Lawrence (ed.): Proceedings of the 30th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2017), Cham: 2019
DOI: 10.1007/978-3-030-35225-7_15
URL: https://www2.cs.fau.de/publication/download/lcpc2017_blass.pdf
- Blaß T.:
Ein datenparalleler Ansatz zur Beschleunigung von Datenflussanalysen mittels GPU (Dissertation, 2022)
DOI: 10.25593/978-3-96147-494-3