Thesis Guidelines/Archiving Thesis Data

Our chairs highly value open science principles. To uphold these principles and to allow a full evaluation of your thesis, including its results, we require you to archive all data used to create the thesis and the results it contains. In summary, your code and data must be documented, consistent, complete, and exercisable (see also ACM Artifact Review and Badging).

In any case, you need to use a repository in our group's theses project on KIT's GitLab. Ask your supervisor to create the repository and give you access.

In the following, we explain the requirements and rules. As always, there can be exceptions to these rules. If you need to deviate from them, discuss and clarify with your supervisor.

What to archive

You need to archive all research data, including but not limited to:

  • Source code for the implementation and evaluation (e.g., evaluation scripts)
  • Input and output data of your evaluation
  • Documentation: The documentation must at least contain instructions on how to install, set up, and execute the approach. It must also describe the different parts of the repository and where to find specific parts of your approach.
  • Presentation: Your presentation must be included in your repository.

External Services

If you use external services, you must include the inputs to and outputs from these services to improve reproducibility, because later calls to these services may return different values or the services may become unavailable. Additionally, saving the inputs and outputs can reduce the cost of calling external services. For example, if you use GPT-4 via the OpenAI API, we recommend saving the prompt together with the result of the API call, i.e., the answer.
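One way to record such calls is a small wrapper that stores every prompt with its answer and replays the stored answer on repeated calls. The following is a minimal sketch, not a prescribed implementation: the function name `recorded_call`, the `call_service` parameter, and the JSON record layout are assumptions made for illustration, and `call_service` stands in for whatever client function actually contacts the external service.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def recorded_call(prompt: str, call_service: Callable[[str], str], record_dir: Path) -> str:
    """Return the service answer for `prompt`, storing prompt and answer on disk.

    If a record for the same prompt already exists, it is replayed instead of
    calling the service again.
    """
    record_dir.mkdir(parents=True, exist_ok=True)
    # Name the record file after a hash of the prompt, so identical prompts
    # map to the same file (hypothetical naming scheme).
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    record_file = record_dir / f"{key}.json"

    if record_file.exists():
        # Replay the archived answer without contacting the service.
        return json.loads(record_file.read_text(encoding="utf-8"))["answer"]

    answer = call_service(prompt)
    record = {"prompt": prompt, "answer": answer}
    record_file.write_text(
        json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return answer
```

Committing the record directory to the repository archives the inputs and outputs. Depending on the service, it may also be useful to store further fields in the record, such as the model name or the call parameters.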

Chained Calls

If your approach requires manual chained calls (i.e., one tool is called, then another tool needs to be called, etc.), you need to provide the inputs and outputs of every call, explicitly including the intermediate results.
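If the chain can be scripted, the intermediate results can be written out automatically. The sketch below assumes that every step takes and returns a string. The name `run_chain`, the file naming scheme, and the record layout are hypothetical. For chains that are executed by hand, the same kind of record can be saved manually after each step.

```python
import json
from pathlib import Path
from typing import Callable, Sequence


def run_chain(
    initial_input: str,
    steps: Sequence[tuple[str, Callable[[str], str]]],
    out_dir: Path,
) -> str:
    """Run `steps` in order, feeding each output into the next step.

    The input and output of every step are written to a numbered JSON file,
    so the intermediate results are archived alongside the final one.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    current = initial_input
    for index, (name, step) in enumerate(steps, start=1):
        result = step(current)
        record = {"step": name, "input": current, "output": result}
        # Example file names: 01_upper.json, 02_reverse.json, ...
        (out_dir / f"{index:02d}_{name}.json").write_text(
            json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        current = result
    return current
```

For example, `run_chain("ab", [("upper", str.upper), ("reverse", lambda s: s[::-1])], Path("chain_records"))` would write one record per step into the directory `chain_records` and return the final output.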

Hindrances for archiving data

In certain cases, archiving the data might not be possible due to, e.g., legal reasons like non-disclosure agreements with companies. In such cases, we require you to have at least some example data that can be used instead to demonstrate the approach.

If you see any hindrances, discuss them with your supervisor.

How to archive your data

  • In any case, you need to use a repository in our chair's theses project on KIT's GitLab. It should contain at least:
    • Your presentation
    • Code that you produced
  • If your data ...
    • is smaller than 8 GB: Use your repository on GitLab.
    • is larger than 8 GB: Upload your data to another long-term storage:
      • Discuss this case with your supervisor.
      • The long-term storage should save your files for up to 10 years and must be immutable (no later changes allowed). We recommend public archives like Zenodo or Figshare.
      • If you cannot share the data publicly, you still need to share the data privately for evaluation purposes. Always discuss this case with your supervisor (see also the statements about hindrances above).
        • Non-public data: For data protected by an NDA or other confidentiality agreements (e.g., with participants of your user study), discuss solutions with your supervisor.
        • In exceptional cases, there is the option to use a special repository on KIT's GitLab with an increased quota. Ask your supervisor about this option.
        • If no other option is possible, the fallback solution is to store the data on a storage device (e.g., a USB stick). Ask your supervisor about this option.
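To decide which of the cases above applies, it can help to measure the size of your data directory. The following sketch sums the file sizes below a directory and compares the total with the limit. It assumes that "8 GB" means 8 * 10^9 bytes, which the text above does not specify. The function names are chosen for illustration.

```python
from pathlib import Path

# Assumption: "8 GB" is read as 8 * 10**9 bytes; adjust if GiB is meant.
SIZE_LIMIT_BYTES = 8 * 10**9


def directory_size(root: Path) -> int:
    """Sum the sizes in bytes of all files below `root` (symlinks are followed)."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def fits_in_repository(root: Path, limit: int = SIZE_LIMIT_BYTES) -> bool:
    """Return True if the data below `root` is smaller than `limit` bytes."""
    return directory_size(root) < limit
```

Calling `fits_in_repository(Path("data"))` on a hypothetical `data` directory returns `True` if its contents stay below the limit.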

Documentation needs to include

  • a LICENSE file describing distribution rights. We recommend an open-source license (see choosealicense.com and CC Chooser for help). You can and should ask your supervisor about options.
  • a README that describes the artifact, including the purpose, the setup, the data, and the usage:
    • Purpose: What does your artifact do?
    • Data: Provide information for understanding the context of the data. Context can be the domain, the intended application area, the overall research area, references to other data that was used as a basis for this data, and other necessary background information. Additionally, provide the information required to understand the data, such as what the data types are used for, how tables are structured (e.g., what the tables mean, if not directly and unambiguously clear), or where to find further information.
    • Setup: Provide clear information about how to prepare the artifact. This includes information about required hardware (GPU required? Special requirements for, e.g., large amounts of RAM (> 8 GB)? How much storage is roughly required?) and necessary software (Which software needs to be installed and in which version? Are there any dependencies that need to be installed?). If you used some special environment like a (high-performance) computational server, please state this.
    • Usage: Provide clear instructions on how to use the artifact. This requires at least a basic usage example to test the installation. Additionally, this should include instructions on how to replicate the results from the thesis. If keys or passwords are required to execute the artifact, document how and where to get them. If this is impossible, make sure to still offer an option to run a basic example without these keys, e.g., by providing the output of the API calls that require them.
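A simple script can check that the documentation items listed above are present before you hand in the repository. The sketch below is only a rough aid and rests on several assumptions: the files are named `LICENSE` and `README.md`, the README is written in Markdown, and each required topic appears in a heading. None of these conventions is mandated by the text above, and the check cannot judge the quality of the content.

```python
from pathlib import Path

# Assumption: each topic appears in a Markdown heading of the README.
REQUIRED_TOPICS = ("Purpose", "Data", "Setup", "Usage")


def missing_documentation(repo: Path) -> list[str]:
    """List the required documentation items that are missing from `repo`."""
    missing = []
    if not (repo / "LICENSE").is_file():
        missing.append("LICENSE")

    readme = repo / "README.md"
    if not readme.is_file():
        missing.append("README.md")
        return missing

    # Collect the heading texts, i.e., lines that start with "#".
    headings = [
        line.lstrip("#").strip().lower()
        for line in readme.read_text(encoding="utf-8").splitlines()
        if line.startswith("#")
    ]
    for topic in REQUIRED_TOPICS:
        if not any(topic.lower() in heading for heading in headings):
            missing.append(f"README heading: {topic}")
    return missing
```

An empty result means that all items were found. Any returned entry names a file or README heading that is still missing.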

Make Your Artifact Executable

If your artifact contains executable software, you have to make sure that other people can also execute the software. The following guidelines list some of the things that you might consider, but they are not exhaustive.

  1. Document dependencies (see also above) and provide means to easily install them. For example, provide a complete requirements.txt for your Python project or make sure that your pom.xml for your Maven project is complete. Do not forget to state the required versions, not only the name of the dependency! If possible, you might also provide a version of the required environment that contains all necessary dependencies (see also dockerization). Also, do not forget to state which version of your programming language you used, e.g., which Python version or which Java version.
  2. Dockerization or virtual environments (e.g., venv for Python) can make the setup of the project easy and replicable. If possible, provide a pre-built environment like a Docker image (also hosted at, e.g., Docker Hub or GitHub's container registry).
  3. For Eclipse plugins, please create a local update site and provide it in the archive.
  4. Provision of execution scripts: Provide scripts that execute your example and/or your evaluation in one or a few commands. This makes reproducing and verifying your results easy.
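For Python projects, the advice in item 1 to state required versions can be checked mechanically. The following sketch reports all lines of a `requirements.txt` that do not pin an exact version with `==`. It is a simplified check under stated assumptions: the function name is hypothetical, environment markers after `;` are ignored, and requirement lines that contain `#` inside a URL are not handled.

```python
def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return the requirement lines that do not pin an exact version with `==`."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        # Drop inline comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        # Only inspect the part before an environment marker (after ";").
        spec = line.split(";", 1)[0]
        if "==" not in spec:
            unpinned.append(line)
    return unpinned
```

A call such as `unpinned_requirements(Path("requirements.txt").read_text(encoding="utf-8"))`, with `Path` imported from `pathlib`, returns an empty list when every dependency is pinned.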