Marius Kamp, M. Sc.
Analysis of Code Repositories
(Own Funds)Term: since 01.01.2010
URL: https://www2.cs.fau.de/research/AnaCoRe/Software developers often modify their projects in a similar or repetitive way. The reasons for these changes include the adoption of a changed interface to a library, the correction of mistakes in functionally similar components, or the parallelization of sequential parts of a program. If developers have to perform the necessary changes on their own, the modifications can easily introduce errors, for example due to a missed change location. Therefore, an automatic technique is desireable that identifies similar changes and uses this knowledge to support developers with further modifications.
Extraction of Code-Changes
In 2017, we developed a new code recommendation tool called ARES (Accurate REcommendation System). It creates more accurate recommendation compared to previous tools as its algorithms take care of code movements during pattern and recommendation creation. The foundation of ARES lies in the comparison of two versions of the same program. It extracts the changes between the two versions and creates patterns based on the changed methods. ARES uses these patterns to suggest similar changes for the source code of different programs automatically.
The extraction of code changes is based on trees. In 2016 we developed (and visibly published) a new tree-based algorithm (MTDIFF) that improves the accuracy of the change extraction.
Symbolic Execution of Code-Fragments
In 2014 we developed a new symbolic code execution engine called SYFEX to determine the behavioral similarity of two code fragments. In this way we aim to improve the quality of the recommendations. Depending on the number and the generality of the patterns in the database, it is possible that without the new engine SIFE generates some unfitting recommendations. To present only the fitting recommendations to the developers, we compare the summary of the semantics/behavior of the recommendation with summary of the semantics/behavior of the database pattern. If both differ too severely, our tool drops the recommendation from the results. The distinctive features of SYFEX are its applicability to isolated code fragments and its automatic configuration that does not require any human interaction.
In 2015 SYFEX was refined and applied to code fragments from the repositories of different software projects. In 2016 we investigated to which extend SYFEX can be used to gauge the semantic similarity of submissions for a programming contest. In 2017 and 2018 we optimized the implementation of SYFEX. We also began collecting a data set of semantically similar methods from open source repositories. We published this data set in 2019.Techniques for symbolic execution use algorithms to check the satisfiability of logical/mathematical expressions in order to detect valid execution paths in a program. Usually, these algorithms account for a large part of the total runtime of a symbolic execution. To accelerate this satisfiability check, we experimented with a technique to replace complicated expressions with simpler equivalent expressions. These simpler expressions are obtained by using program synthesis. In the year 2020, we extended this program synthesis with a novel technique that can quickly detect whether a fixed set of operations can be used to construct an expression that is equivalent to the complicated expression. We published this approach in 2021 and were able to show that the technique can reduce the runtime of common program synthesizers by 33% on average. We subsequently extended this technique to other classes of program synthesis problems. In 2022, we performed a comprehensive evaluation of these extensions. This evaluation showed that these extensions similarly improve the runtime of program synthesizers on a larger class of program synthesis problems.
Detection of Semantically Similar Code Fragments
SYFEX computes the semantic similarity of two code fragments. Therefore, it allows to identify pairs or groups of semantically similar code fragments (semantic clones). However, the high runtime of SYFEX (and similar tools) limit their applicability to larger software projects. In 2016, we started the development of a technique to accelerate the detection of semantically similar code fragments. The technique is based on so-called base comparators that compare two code fragments using a single criterion (e.g., the number of used control structures or the structure of the control flow graph) and that have a low runtime. These base comparators can be combined to form a hierarchy of comparators. To compute the semantic similarity of two code fragments as accurately as possible, we use genetic programming to search for hierarchies that approximate the similarity values as reported by SYFEX for a number of pairs of code fragments. A prototype implementation confirmed that the method is capable of detecting pairs of semantically similar code fragments.
We further improved the implementation of this approach in 2017 and 2018. Additionally, we focused on evaluating the approach with pairs of methods from software repositories and from programming exercises. Moreover, we created a data set of semantically similar methods from open-source software repositories that we published in 2019.
Techniques for symbolic execution rely on algorithms to detect the satisfiability of logic/mathematic expressions. These are used to detect whether an execution path in a program is feasible. The algorithms often use a large amount of the total computation time. To improve the speed of this satisfiability check, in the years 2019 and 2020 we experimented with a technique to replace complicated expressions with simpler expressions that have the same meaning. These simpler expressions result from the application of program synthesis. In 2020 we augmented the program synthesis with a novel approach to detect beforehand if some operations can form an expression with the same meaning as a more complicated expression.
Semantic Code Search
The functionality that has to be implemented during the development of a software product is often already available as part of program libraries. It is often advisable to reuse such an implementation instead of rewriting it, for example to reduce the effort for developing and testing the code.
To reuse an implementation that fits the purpose, developers have to find it first. To this end developers already use code search engines on a regular basis. State-of-the-art search engines work on a syntactic level, i.e., the user specifies some keywords or names of variables and methods that should be searched for. However, current approaches do not consider the semantics of the code that the user seeks. As a consequence, relevant but syntactically different implementations often remain undetected ("false negatives") or the results include syntactically similar but semantically irrelevant implementations ("false positives"). The search for code fragments on a semantic level is the subject of current research.
In 2017 we began the development of a new method for semantic code search. The user specifies the desired functionality in terms of input/output examples. A function synthesis algorithm from the literature is then used to create a method that implements the specified functionality as accurately as possible. Using our approach to detect similar code fragments, this synthesized method is then compared to the methods of program libraries to find semantically similar implementations. These implementations are then presented as search results to the user. A first evaluation of our prototypical implementation shows the feasibility and practicability of the approach.
Clustering of Similar Code-Changes
To create generalized change patterns, it is necessary that the set of extracted code changes is split into subsets of changes that are similar to each other. In 2015 this detection of similar code changes was improved and resulted in a new tool, called C3. We developed and evaluated different metrics for a pairwise similarity comparison of the extracted code changes. Subsequently, we evaluated different clustering algorithms known from the literature and implemented new heuristics to automatically choose the respective parameters to replace the previous naive approach for the detection of similar code changes. This clearly improved the results compared to the previous approach, i.e., C3's new techniques detect more groups of similar changes that can be processed by SIFE to generate recommendations.
The aim of the second improvement is to automatically refine the resulting groups of similar code changes. For this purpose we evaluated several machine learning algorithms for outlier detection to remove those code changes that have been spuriously assigned to a group.
In 2016 we implemented a new similarity metric for the comparison of two code changes that essentially considers the textual difference between the changes (as generated, for example, by the Unix tool 'diff'). We published both a paper on C3 and the dataset (consisting of groups of similar changes) that we generated for the evaluation of our tool under an open-source license, see https://github.com/FAU-Inf2/cthree . This dataset can be used as a reference or as input data for future research. In addition, we prototypically extended C3 by techniques for an incremental similarity computation and clustering. This allows us to reuse results from previous runs and to only perform the absolutely necessary work whenever new code changes are added to a software archive.
Parallelization techniques for embedded systems in automation
(Own Funds)Term: 01.06.2009 - 31.12.2015
This project was launched in 2009 to address the refactorization and parallelization of applications used in the field of industry automation. These programs are executed on specially designed embedded systems. This hardware forms an industry standard and is used worldwide. As multicore-architectures are increasingly used in embedded systems, existing sequential software must be parallelized for these new architectures in order to improve performance. As these programs are typically used in the industrial domain for the control of processes and factory automation they have a long life cycle. Because of this, the programs often are not being maintained by their original developers any more. Besides that, a lot of effort was spent to guarantee that the programs work reliably. For these reasons the software is only extended in a very reluctant way.
Therefore, a migration of these legacy applications to new hardware and a parallelization cannot be done manually, as it is too error prone. Thus, we need tools that perform these tasks automatically or aid the developer with the migration and parallelization.
Research on parallelization techniques
We developed a special compiler for the parallelization of existing automation programs. First, we examined automation applications with respect to automatic parallelizability. We found that it is hard to perform an efficient automatic parallelization with existing techniques. Therefore this part of the project focuses on two steps to handle this situation. As first step, we developed a data dependence analysis that identifies potential critical sections in a parallel program, presents them to the programmer and adds their protection to the code. We ware able to show that our approach to identify critical sections finds atomic blocks that closely match the atomic blocks that an expert would add to the code. Besides that, we showed in 2014 that the impacts on execution times are negligible if our technique finds atomic blocks that are larger than necessary or that are not necessary at all.
As second step we have refined and enhanced existing techniques (software transactional memory (STM) and lock inference) to implement atomic blocks. In our approach, an atomic block uses STM as long as lock inference would lead to coarse-grained synchronization. The atomic block switches from STM to lock inference as soon as a fine-grained synchronization is possible. With this technique, an atomic block always uses fine-grained synchronization while the runtime overhead of STM is minimized at the same time. We showed that (compared to a pure STM or lock inference implementation) our technique speeds up execution times by a factor between 1.1 and 6.3. Although fine-grained synchronization in general leads to better performance than a coarse-grained solution, there are cases where a coarse-grained implementation shows equal performance. We therefore presented a runtime mechanism for an STM that also works together with our combined technique. This runtime mechanism starts with a small number of locks, i.e., a coarse-grained locking, where accesses to different shared variables are protected by the same lock. If this coarse-grained locking leads to too many non-conflicting accesses waiting for the same lock, our approach gradually increases the number of locks. This makes the locking more fine-grained so that non-conflicting accesses can be executed concurrently. Our runtime mechanism that dynamically tunes the locking-granularity makes the programs run up to 3.0 times faster than a fixed coarsegrained synchronization.
We completed this project part in 2014.
Research on migration techniques
Our research on the migration of legacy applications originally consisted of having a tool that automatically replaces suboptimal code constructs with better code. The code sequences that had to be replaced as well as the replacement codes were specified by developers by means of a newly developed pattern description language. However, we found this approach to be too difficult for novice developers.
This led us to the development of a new tool that automatically learns and generalizes patterns from source code archives, recognizes them in other projects, and presents recommendations to developers. The foundation of our tool lies in the comparison of two versions of the same program. It extracts the changes that were made between two source codes, derives generalized patterns of suboptimal and better code from these changes, and saves the patterns in a database. Our tool then uses these patterns to suggest similar changes for the source code of different programs.
In 2014 we developed a new symbolic code execution engine to minimize the number of wrong recommendations. Depending on the number and the generality of the patterns in the database, it is possible that without the new engine our tool generates some unfitting recommendations. To discard the unfitting ones, we compare the summary of the semantics/behavior of the recommendation with summary of the semantics/behavior of the database pattern. If both differ too severely, our tool drops the recommendation from the results. The distinctive features of our approach are its applicability to isolated code fragments and its automatic configuration that does not require any human interaction.
The latest results of our tool SIFE are found online (last update: 2014-05-09).
Parts of the project are funded by the "ESI-Anwendungszentrum" [http://www.esi-anwendungszentrum.de/]
Approximate Bit Dependency Analysis to Identify Program Synthesis Problems as Infeasible
International Conference on Verification, Model Checking, and Abstract Interpretation (VMCAI'2021) (Copenhagen, 17.01.2021 - 19.01.2021)
In: Fritz Henglein, Sharon Shoham, Yakir Vizel (ed.): Verification, Model Checking, and Abstract Interpretation (VMCAI 2021), Cham: 2021
Artifact for "Approximate Bit Dependency Analysis to Identify Program Synthesis Problems as Infeasible"
SeSaMe: A Data Set of Semantically Similar Java Methods
16th International Conference on Mining Software Repositories (MSR 2019) (Montréal, QC, Kanada, 26.05.2019 - 27.05.2019)
In: Proceedings of the 16th International Conference on Mining Software Repositories (MSR 2019), Piscataway, NJ, USA: 2019
, , :
SeSaMe: A Data Set of Semantically Similar Java Methods [Data set]
, , :
More Accurate Recommendations for Method-Level Changes
11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE2017) (Paderborn, 04.09.2017 - 08.09.2017)
In: Proceedings of 2017 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE2017), New York, NY, USA: 2017
, , , :
AuDoscore: Automatic Grading of Java or Scala Homework
Third Workshop "Automatische Bewertung von Programmieraufgaben" (ABP 2017) (Potsdam, 05.10.2017 - 06.10.2017)
In: Sven Strickroth Oliver Müller Michael Striewe (ed.): Proceedings of the Third Workshop "Automatische Bewertung von Programmieraufgaben" (ABP 2017) 2017
Open Access: http://ceur-ws.org/Vol-2015/ABP2017_paper_01.pdf
, , :
Entwicklung eines Werkzeugs zum Vergleich von Code-Fragmenten durch symbolische Ausführung (Master thesis, 2014)
Entwicklung eines Werkzeugs zur Extraktion von Mustern aus Software-Archiven zur Quellcode-Optimierung (Bachelor thesis, 2012)