Research Data Management in HPC
NHR Tutorial (Online)
The NHR tutorial series is divided into the following modules. Details are listed below.
- 9:30 h - 9:40 h - Welcome (Anja Gerbes, ZIH, TU Dresden & Julian Kunkel, GWDG)
- 9:40 h - 11:30 h - Introduction to RDM
- 11:40 h - 12:40 h - Secure Workflow Tutorial
- 13:00 h - 13:45 h - Toward Enforcing a Strategy for Data Management in a HPC Center
- 14:00 h - 15:00 h - Coscine - FAIR play integrated right from the start
- 15:00 h - 15:15 h - Closing Words (Julian Kunkel, GWDG)
Module 1 - Introduction to RDM
on Thursday, 01/12/2022, 9:40 am - 11:30 am
Speaker: Christian Löschen, ZIH, TU Dresden
Data is the basis of all research. A good research process is therefore also based on good research data management (RDM). This introductory event examines what exactly lies behind this, which aspects and tasks are important in which project phase, and why it is worth paying attention to the topic. Finally, participants will jointly discuss which RDM topics are important, especially from the perspective of HPC, so that they can be addressed in future events.
Agenda
- Lecture: Introduction to Research Data Management
- Discussion: Specific requirements for RDM in HPC
Module 2 - Secure Workflow Tutorial
on Thursday, 01/12/2022, 11:40 am - 12:40 pm
Speaker: Trevor Khwam Tabougua, GWDG, Göttingen
Driven by the progress of data- and compute-intensive methods in various scientific domains, there is an increasing demand from researchers working with highly sensitive data for access to the computational resources needed to adapt those methods in their respective fields. Integrating reliable security measures into existing High Performance Computing (HPC) clusters, so that the computing needs of those researchers can be met cost-effectively, remains an open challenge. The fundamental problem with working securely with sensitive data is that HPC systems are shared systems that are typically tuned for the highest performance, not for high security. For instance, additional virtualization techniques are commonly not employed, so users typically have access to the host operating system. Since new vulnerabilities are continuously being discovered, relying solely on traditional Unix permissions is not secure enough. In this hands-on tutorial, we present a generic and secure workflow on our local HPC system, the Scientific Compute Cluster (SCC), that enables researchers to transfer, store, and analyze sensitive data.
Agenda
- Participants will get an exercise sheet and access to a test account
- Trevor will present the solution to the exercise sheet step-by-step
- Participants will have time to work through it themselves during the presentation and/or afterwards
Module 3 - Toward Enforcing a Strategy for Data Management in a HPC Center
on Thursday, 01/12/2022, 1 pm - 1:45 pm
Speakers: Julian Kunkel & Hendrik Nolte, GWDG, Göttingen
HPC systems typically offer a fast parallel filesystem where users can store their data during job execution. Since fast parallel filesystems like BeeGFS or Lustre are only supposed to hold hot data, they are not backed up and are more expensive per TB than other mass storage. However, users often keep their data on these filesystems much longer than intended, even to the point where it can be considered cold data. This incurs unnecessary costs and also leaves potentially important research data and results in a vulnerable state, since they are not backed up. This is a symptom of a larger problem: users often either do not have a data management plan or do not adhere to the data management plan they provided in their proposal for compute time. A general lack of data management is often associated with other issues as well, such as reproducibility problems. In this talk, a concept is presented and interactively discussed in which users provide a data management plan with their proposal, and automated processes deployed by admins use this plan to aid users by monitoring and enforcing compliance with it.
Agenda
- Talk
- Discussion
Module 4 - Coscine - FAIR play integrated right from the start
on Thursday, 01/12/2022, 2 pm - 3 pm
Speakers: Ilona Lang & Marcel Nellesen, RWTH Aachen
For many researchers, engagement with the FAIR principles does not begin until the publication of an article and the sometimes obligatory transfer of the research data to a repository. At this point, a significant amount of valuable information about the research project is often already lost. As a result, only a fraction of the data (and metadata) collected during a research project is ever published. This is a particularly difficult challenge when HPC systems are used, since large amounts of data are generated and access options for external researchers are limited. One way to make research data FAIR from the very beginning of its life cycle is to use, on a daily basis, a storage environment that implicitly implements the FAIR principles. To create such a storage environment, the research data management platform Coscine was developed at RWTH Aachen University. Coscine provides an integrated concept for research (meta)data management in addition to storage, management, and archiving of research data. To help researchers interact with Coscine through its interfaces and to improve integration with existing data management processes, tools, programs, and consultation for the technical adaptation of the platform are provided. This includes the collection or extraction of metadata based on the data or the environment in which it was generated. In this talk, we present how Coscine supports the FAIR principles, from the initial collection of data to its subsequent reuse.
Agenda
- Presentation
- Demonstration of Coscine
- Discussion
Registration
Link: https://event.zih.tu-dresden.de/nhr/rdm
Registration closes on 24/11/2022.
You will receive the access data shortly before the event by email to your registered email address.
Handouts
The course material (slides, sample application) will be available.
Further Information
Course language: English
Target group: HPC Basics / HPC User / HPC infrastructure providers & HPC RDM officials
If you have any further questions, please contact Anja Gerbes.