A proposal for the measurement and documentation of research software sustainability in interactive metadata repositories

Stephan Druskat, {firstname}.{lastname} (at) hu-berlin.de

Abstract

I propose an interactive repository type for research software metadata which measures and documents software sustainability by accumulating metadata, and computing a sustainability measure over it. Such a repository would help to overcome technical barriers to software sustainability by furthering the discovery and identification of sustainable software, thereby also facilitating documentation of research software within the framework of software management plans.

Research software sustainability as a topic has found its way onto the agenda of funding agencies and will arguably be promoted by them in the future (cf. Hettrick, 2016, p. 11). However, as further documented by Hettrick (2016), the identification and discovery of "good", i.e., sustainable, research software present technical barriers to software sustainability.

Hettrick, Simon. 2016. Research Software Sustainability: Report on a Knowledge Exchange workshop. Tech. rep. The Software Sustainability Institute. http://repository.jisc.ac.uk/6332/. Last accessed 20 Apr 2016.

This paper proposes to tackle these barriers by accumulating software metadata in repositories and quantifying it, in order to firstly enable potential users to identify sustainable software through sustainability measurement, and secondly enable simplified discovery of suitable software through documentation of software and its sustainability.

Similar to the more advanced discussion about data sustainability and funders' subsequent mandate of data management plans (DMPs), software management plans (SMPs) would be an obvious tool for ensuring sustainability of software developed in funded projects. DMPs will feature data repositories as natural target points for research data releases as part of the data lifecycle documentation, and analogously, software repositories should be an integral part of software lifecycles as detailed in SMPs. In addition to source code repositories and distribution repositories for deliverables, there is already a large number of research data repositories which archive research software source code and deliverables along with some metadata (for an exemplary list of such repositories, browse the Registry of Research Data Repositories (Pampel et al., 2013) by content type "Software applications"). However, while the latter repositories may provide support for assessing the suitability of a piece of software for the intended research application, they do not provide information about, let alone a measure of, the software's sustainability.

Pampel, Heinz, Paul Vierkant, Frank Scholze, Roland Bertelmann, Maxi Kindling, Jens Klump, Hans-Jürgen Goebelbecker, Jens Gundlach, Peter Schirmbacher & Uwe Dierolf. 2013. Making research data repositories visible: The re3data.org registry. PLoS ONE 8(11). 1–10. doi:10.1371/journal.pone.0078080.

Nevertheless, a repository can be a suitable means to provide documentation and measurement of software sustainability, specifically by utilising software metadata. In order to do so, it must

To achieve interactive evaluation, the repository should enable its users to contribute to both the measurement and the documentation of the software's sustainability by allowing changes to, as well as feedback on, the metadata, and should provide a way to review the software with regard to its sustainability.

Such a repository would also benefit projects which will be subject to funding agencies' potential mandate of SMPs, as it represents a natural means of documenting the sustainability of software developed within the project.

The conception of such a repository poses at least three theoretical challenges, which are briefly touched upon below: (1) how to define sustainability, (2) how to define quantifiable parameters for sustainability, and (3) how to design an algorithm for computing a comprehensible and reproducible measure for sustainability, e.g., a "sustainability factor" similar to a journal's impact factor.

"Software sustainability" is an under-defined concept, with different agents approaching it from different angles. Gröger & Köhn (2015)Gröger, Jens & Marina Köhn. 2015. Nachhaltige Software. Dokumentation des Fachgesprächs "Nachhaltige Software" am 28.11.2014. Dokumentationen 07/2015. Umweltbundesamt. http://www.umweltbundesamt.de/en/publikationen/nachhaltige-software. Last accessed 22 Apr 2016. focus on the ecological sustainability of software, discussing software engineering only in passing. Tate (2005)Tate, Kevin. 2005. Sustainable software development: An agile perspective. Boston, Mass.: Addison-Wesley. focusses on the development process rather than the product. Goble (2014)Goble, Carole. 2014. Better software, better research. IEEE Internet Computing 18(5). 4–8. – specifically discussing research software – names training, availability, community recognition for developers, dedicated job descriptions, and funding as defining factors for sustainability.

In order to get a clearer picture of what sustainability can mean in the context of software, it helps to base an attempt at its definition on a more general concept of "sustainability". Following an enhancement (Jörissen et al., 1999) of a popular three-dimensional general model of sustainability ("three column model"), we can attempt a preliminary definition of "software sustainability" as follows. The three goals of software sustainability are (1) ensuring the existence of the software, (2) preserving the potential for productive operation of the software, and (3) creating and retaining possibilities for further development and adaptation of the software.

Jörissen, Juliane, Jürgen Kopfmüller, Volker Brandl & Michael Paetau. 1999. Ein integratives Konzept nachhaltiger Entwicklung (Wissenschaftliche Berichte FZKA 6393). Karlsruhe: Forschungszentrum Karlsruhe. http://www.itas.kit.edu/pub/v/1999/joua99a.pdf. Last accessed 23 Apr 2016.

A metadata repository measuring and documenting software sustainability must therefore accumulate quantifiable metadata pertaining to all three goals. Such metadata for a piece of software, phrased so that it can be quantified, could include the following.

For (1): whether the source code is versioned; whether a stable, widely used versioning platform is used; whether deliverables are available from a public repository; whether containers exist; etc.

For (2): whether it is platform-independent; whether it is implemented in a stable, widely used programming language/database/framework; whether it has comprehensive user documentation; whether it is internationalised; whether it is liberally licensed; whether it is interoperable with existing systems; whether its data model is generic; whether automated tests exist; whether it has an intuitive UX; etc.

For (3): whether it is open source; whether its source code is available from a public repository; whether it is easily buildable; whether it is implemented in a programming language for which many developers are available on the job market; whether it has comprehensive developer documentation; whether it is modularised/extensible; etc.

Defining such quantifiable parameters for sustainability is by no means trivial. The development of a list of parameters could, however, be partly facilitated through crowd-sourcing methods and empirical elicitation.
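To make the idea of quantifiable parameters grouped by goal more concrete, the following is a minimal sketch of how a metadata record could be structured. Everything in it is an assumption made for illustration: the `Parameter` class, the parameter keys, the goal labels, and the choice of yes/no/unanswered values are hypothetical and not a schema defined in this paper.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical short labels for the three goals of software sustainability.
GOALS = {1: "existence", 2: "productive operation", 3: "further development"}


@dataclass
class Parameter:
    """One yes/no question about a piece of software (illustrative only)."""
    key: str                      # hypothetical identifier
    goal: int                     # 1, 2, or 3, referring to GOALS
    question: str
    value: Optional[bool] = None  # None means "not yet answered"


# A small, made-up record using a few of the parameters named in the text.
record = [
    Parameter("source_versioned", 1, "Is the source code versioned?", True),
    Parameter("containers_exist", 1, "Do containers exist?", None),
    Parameter("platform_independent", 2, "Is it platform-independent?", True),
    Parameter("automated_tests", 2, "Do automated tests exist?", False),
    Parameter("open_source", 3, "Is it open source?", True),
    Parameter("developer_docs", 3,
              "Is there comprehensive developer documentation?", None),
]


def unanswered(params):
    """Return the keys of all parameters that still lack a value."""
    return [p.key for p in params if p.value is None]


def to_json(params):
    """Serialise the record as structured text, grouped by goal label."""
    grouped = {
        name: [asdict(p) for p in params if p.goal == goal]
        for goal, name in GOALS.items()
    }
    return json.dumps(grouped, indent=2)


print(unanswered(record))
print(to_json(record))
```

Keeping "not yet answered" distinct from "no" is a design choice of this sketch: it lets a repository ask the originator or its users for missing data points rather than silently counting them against the software.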

The accumulation of the metadata itself can be performed by different means, of which direct input by the originator of the software is always the starting point, followed potentially by harvesting from existing repositories, and dedicated crawling of, e.g., source code repositories. The latter two techniques can also be used for both verification and a preliminary quantification of the input, e.g., by comparing the number of code units with the number of existing unit tests, or looking for embedded documentation, such as Javadoc (http://www.oracle.com/technetwork/java/javase/documentation/javadoc-137458.html) or comments.
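The comparison of code units with unit tests and the search for embedded documentation can be sketched as follows. This is an illustration under stated assumptions, not a proposed crawler: it inspects Python source with the standard `ast` module as a stand-in for whatever language the crawled software uses, treats every function as a "code unit", treats functions whose names start with `test_` as unit tests, and treats docstrings as the counterpart of Javadoc comments. The function name `crawl_source` and the keys of the returned report are hypothetical.

```python
import ast


def crawl_source(source: str) -> dict:
    """Count code units, unit tests, and documented units in Python source."""
    tree = ast.parse(source)
    functions = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    # Assumption: a naming convention separates tests from code units.
    tests = [f for f in functions if f.name.startswith("test_")]
    units = [f for f in functions if not f.name.startswith("test_")]
    # Embedded documentation: units that carry a docstring.
    documented = [f for f in units if ast.get_docstring(f) is not None]
    return {
        "code_units": len(units),
        "unit_tests": len(tests),
        "documented_units": len(documented),
        "tests_per_unit": len(tests) / len(units) if units else None,
    }


# Made-up sample source standing in for a crawled file.
SAMPLE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def sub(a, b):
    return a - b

def test_add():
    assert add(1, 2) == 3
'''

report = crawl_source(SAMPLE)
print(report)
```

A report of this kind could be compared with what the originator entered, for example flagging a record that claims automated tests exist while the crawl finds none.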

Computing a measure over the metadata via the defined parameters, and publishing that measure, is the core function of the proposed metadata repository. The metadata must be available in the repository as structured text in a well-defined format, e.g., based on XML or JSON. The computing algorithm must take into account all of the parameters, but should also utilise additional information such as usage statistics (provided either by the owner of the data set, or by way of, e.g., automatically transmitted download statistics, or even real-time usage statistics implemented in the software itself), as well as reviews and evaluations by users of the software. Defining such an algorithm is beyond the scope of this paper.
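Purely to illustrate what "computing a measure over structured metadata" could look like mechanically, the sketch below reads a JSON record and averages yes/no parameters per goal. It is a placeholder and not a proposal for the algorithm itself: the JSON layout, the parameter keys, the equal weighting of parameters and goals, and the function names are all assumptions, and usage statistics and user reviews are ignored entirely.

```python
import json
from collections import defaultdict

# Made-up record in a hypothetical JSON layout.
RECORD_JSON = """
{
  "software": "example-tool",
  "parameters": [
    {"key": "source_versioned",     "goal": 1, "value": true},
    {"key": "public_deliverables",  "goal": 1, "value": false},
    {"key": "platform_independent", "goal": 2, "value": true},
    {"key": "automated_tests",      "goal": 2, "value": true},
    {"key": "open_source",          "goal": 3, "value": true},
    {"key": "developer_docs",       "goal": 3, "value": false}
  ]
}
"""


def goal_scores(record: dict) -> dict:
    """Share of parameters answered 'yes', computed separately per goal."""
    by_goal = defaultdict(list)
    for param in record["parameters"]:
        by_goal[param["goal"]].append(1.0 if param["value"] else 0.0)
    return {goal: sum(v) / len(v) for goal, v in sorted(by_goal.items())}


def sustainability_factor(record: dict) -> float:
    """Unweighted mean of the per-goal scores (an arbitrary assumption)."""
    scores = goal_scores(record)
    return sum(scores.values()) / len(scores)


record = json.loads(RECORD_JSON)
scores = goal_scores(record)
factor = sustainability_factor(record)
print(scores)
print(round(factor, 3))
```

Averaging per goal first means that a software with many satisfied parameters for one goal cannot mask a complete lack of support for another as easily as a flat average would allow. Whether that property is desirable is exactly the kind of question a real definition of the measure would have to settle.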

In addition to presenting the metadata for viewing, browsing, and searching, the repository should also be designed so that users can interact with data sets. Reviews of, as well as comments and votes on, single data points, sections of data, and whole data sets can facilitate successive refinement of the sustainability measure for a piece of software. This functionality can also be used for gamification purposes to attract both software originators and reviewers to the repository, and to provide projects with further incentive to submit their data, e.g., by publishing rankings and issuing graphics showing the "sustainability factor" for use on websites, or similar.

In summary, an interactive metadata repository, field-specific or not, that measures and documents the sustainability of software can be a valuable tool not only for the discovery of research software, but also for the identification of sustainable – and therefore preferable – software, and as an integral part of the implementation of SMPs.

This work is licensed under a Creative Commons Attribution 4.0 International License.