Introduction
Bootstrapping usually refers to a self-starting process that is supposed to continue without external input. (Wikipedia)
This guide helps students set up their workflow on the infrastructure in Milan. Similar instructions should also work on lxplus or on other tier3 sites.
Info
If you have problems with this tutorial, open an issue here.
Before embarking on any task, it is crucial to choose a suitable setup. This means deciding which machine will run the code, which runtime environment to use (how to set up the software), how to edit and save your code, how to share the output, and so on.
Where to run
The most important thing to decide is where you want to run your code. This can be different from the place where you develop your code or save your outputs. Some examples:
- your desktop/laptop
- a remote machine (usually a tier3 e.g. proof, lxplus, ...)
- a batch cluster (e.g. condor cluster in Milano, condor cluster at CERN, ...)
- grid (Worldwide LHC Computing Grid)
Keep in mind that it can be advantageous to employ more than one setup for a single task. For instance, you may choose to develop code and execute small tasks on your laptop. Then, when ready, you can seamlessly transition to a cluster or grid for running the final, resource-intensive workload.
| | Pro | Cons |
|---|---|---|
| Laptop | | |
| tier3 | | |
| condor batch | | |
| grid | | |
Many points in the "cons" column can be mitigated in several ways, for example:
- how to develop remotely: sshfs, Visual Studio Code remote development
- how to access data remotely: xrootd
- how to install cvmfs on your laptop: doc
- how to install python software on top of lcg: virtualenv
- how to use jupyter notebook remotely
- how to keep a session alive (tmux and screen)
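As an illustration of the virtualenv item above, here is a minimal sketch that uses the standard-library `venv` module (swapped in for the `virtualenv` tool, which works similarly). The function name `create_overlay_env` and the directory name are hypothetical. Whether packages from an LCG release remain visible inside the new environment depends on how that release exposes them (for example through `PYTHONPATH`), so treat that part as an assumption to verify on your machine.

```python
import venv
from pathlib import Path


def create_overlay_env(env_dir, with_pip=True):
    """Create a virtual environment that still sees the base site-packages.

    env_dir  : directory where the environment is created (hypothetical name)
    with_pip : bootstrap pip inside the environment (needs ensurepip)
    """
    env_dir = Path(env_dir)
    builder = venv.EnvBuilder(
        system_site_packages=True,  # keep the base interpreter's packages visible
        with_pip=with_pip,
    )
    builder.create(env_dir)
    # pyvenv.cfg records the options the environment was created with
    return env_dir / "pyvenv.cfg"


if __name__ == "__main__":
    cfg = create_overlay_env("myenv")
    print(cfg.read_text())
```

After creation, the environment is activated from the shell with `source myenv/bin/activate`, and packages installed with `pip` then go into `myenv` rather than into the read-only base installation.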
How to set up the software
Complex software comes in multiple versions and has many dependencies, so defining an environment and replicating it everywhere is difficult. An environment is made of several layers, for example:
- bare metal architecture: x86_64 processor
- operating system: centos7, almalinux 9 (el9)
- C++ compiler and related libraries: gcc 9.2.1, glib 2.30
- python: python 3.7.6
- python modules: numpy 1.18.2
- ROOT 6.20/04
- Athena rel 21
Some tools are distributed already compiled, so they need to match the compiler, the libraries, the CPU, and so on. Even when software is not distributed in compiled form (as is common with many Python packages), there are often numerous dependencies whose versions must be compatible with each other. Having an efficient way to create and activate a consistent software environment is therefore essential.
Remote machines such as tier3 or lxplus rely heavily on the CernVM File System (CVMFS) for software distribution. While it is possible to download and install software manually in your user directory, this approach is not very practical, especially when you need to share your environment setup with others.
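One way to make these layers visible is to record them from inside the environment itself. The following is a minimal sketch using only the Python standard library (Python 3.8 or later, for `importlib.metadata`). The function name `describe_environment` and the default package list are illustrative choices, not part of any ATLAS or LCG tool.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages=("numpy",)):
    """Collect a few environment layers as a plain dictionary."""
    layers = {
        "architecture": platform.machine(),           # e.g. x86_64
        "operating_system": platform.platform(),      # kernel and distribution string
        "python": platform.python_version(),
        "python_compiler": platform.python_compiler(),  # compiler used to build Python
        "libc": " ".join(platform.libc_ver()),          # empty on non-glibc systems
        "packages": {},
    }
    for name in packages:
        try:
            layers["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            layers["packages"][name] = None  # not installed in this environment
    return layers


if __name__ == "__main__":
    print(json.dumps(describe_environment(), indent=2))
```

Saving this output next to your results makes it easier to compare two machines when the same code behaves differently on each.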
Info
CVMFS is a read-only filesystem. It is usually mounted at /cvmfs.
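A quick way to tell whether CVMFS is reachable from a given machine is to test for a repository directory. The sketch below assumes the usual mount point `/cvmfs`; the repository names `atlas.cern.ch` and `sft.cern.ch` are examples, and the helper name `available_repositories` is hypothetical. Repositories are often mounted on demand, so checking a specific repository path tends to be more reliable than listing `/cvmfs` itself.

```python
import os


def available_repositories(root="/cvmfs", names=("atlas.cern.ch", "sft.cern.ch")):
    """Return a dict mapping each repository name to True if its directory exists."""
    return {name: os.path.isdir(os.path.join(root, name)) for name in names}


if __name__ == "__main__":
    for repo, found in available_repositories().items():
        status = "available" if found else "not found"
        print(f"{repo}: {status}")
```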
Several tools are available to set up the environment, and sometimes they can be combined. Some tools set up software that is already on CVMFS; others install the software (in your home directory) and then set it up:
- asetup: for ATLAS software
- lsetup: for many software packages related to ATLAS (e.g. ROOT)
- lsetup lcg: for all the software available in LCG
- virtualenv: for python packages
- Containers (docker, singularity): the nuclear option
Containers are quite different, since they provide a full software stack, starting from the operating system.
How to save your code
It is important to back up all your code and your results. CERN lets you create git repositories on its GitLab platform (gitlab.cern.ch). Git lets you back up your work and share development with other people who have different workflows. GitLab adds some advanced tools such as continuous integration, a docker registry, an issue tracker, ...
You can create several git repositories; for example, one can be for your thesis.
Warning
Git repositories should be used only for text files (code, documentation, ...); you should not use them to save your output. See the next section.
Warning
Don't put your repository folders inside the CERNBox folder on your laptop (synchronizing the same files with two services is a good way to corrupt them).
Tip
Sooner or later your computer will die: push your code as early as possible. Back up your results (ROOT files, plots, presentations, ...) on CERNBox or another backed-up location.
Tip
When you back up something, add some meta-information describing what it is. In ten years you will have forgotten.
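One simple way to attach meta-information is a small JSON "sidecar" file saved next to the output. This is a minimal sketch, not an established convention: the helper name `write_sidecar`, the `.meta.json` suffix, and the field names are all assumptions chosen for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_sidecar(data_file, description, **extra):
    """Write <data_file>.meta.json describing the file being backed up."""
    data_file = Path(data_file)
    sidecar = data_file.with_name(data_file.name + ".meta.json")
    record = {
        "file": data_file.name,
        "description": description,
        "created": datetime.now(timezone.utc).isoformat(),
        **extra,  # any additional fields, e.g. the git commit that produced the file
    }
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

For example, `write_sidecar("histograms.root", "Control-region histograms for the thesis", git_commit="abc1234")` (hypothetical file name and commit) creates `histograms.root.meta.json` in the same directory, so the description travels with the file when it is copied.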
How to share the output
Depending on the format and the size of the output you can choose different options, for example:
| | Pro | Cons |
|---|---|---|
| eos | | |
| local storage directories | | |
| s3 object storage | | |