ScPo PhD Breakfast 2024

Opinionated Advice served for breakfast

Florian Oswald

26 February, 2024

Digital Workflow

  • Johannes has given a great overview.
  • I can dish out opinionated advice now 😜
  • The topic was digital workflow
  • I’m going to change that to workflow only. Most of your work is done with the help of a computer, right?

Note

Different people will give you different advice. You will have to find the best mix for yourself.

Important

There is no point being fundamentalist. Not in life, and not when discussing workflows. ✌️

Tuesday Morning Preaching

Preaching

Investments

  • In research, as in life, you decide to make certain investments.
  • Some pay off, some don’t: it’s risky business. Investments are costly to make.
  • Huge investments towards end of life don’t make much sense.

My Advice

  1. Will say a few things about reproducibility and project setup.
  2. Will say a few words about the importance (or not?) of computational performance: Does it matter how fast your application runs and how much effort should you dedicate to this?
  3. Will list a few tools I find helpful. Those may not be helpful for you.

Reproducibility

Project Organisation

  • Folder structure is a first-order concern for your project.
  • There is no single best folder structure: any structure works as long as it’s simple, intuitive and helpful.

Minimum Requirement

There should be a separation along:

  1. Inputs: Data, parameters, etc
  2. Outputs: Numbers, tables, figures
  3. Code
  4. Paper/Report etc
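One way to enforce this separation is to define the project folders once and create them from code. The following is only a minimal sketch in Python; the folder names and the helper `make_project_dirs` are assumptions made for illustration, not a prescribed layout, and the same idea carries over to R, Stata or Julia.

```python
from pathlib import Path

# Hypothetical layout: the names are illustrative, adapt them to your project.
PROJECT_DIRS = [
    "data/raw",         # inputs: original data, never edited by hand
    "data/processed",   # inputs derived by code from data/raw
    "output/plots",     # outputs: figures
    "output/tables",    # outputs: tables and numbers
    "code",             # all scripts
    "paper",            # paper or report sources
]

def make_project_dirs(root):
    """Create the folder skeleton under `root` and return the created paths."""
    root = Path(root)
    created = []
    for rel in PROJECT_DIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
        created.append(path)
    return created
```

Calling `make_project_dirs("my_project")` (a made-up project name) would create the skeleton; every script can then build its paths from the same root instead of hard-coding them.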

Example?

Project Organisation: Good or Bad?

.
├── 20211107ext_2v1.do
├── 20220120ext_2v1.do
├── 20221101wave1.dta
├── james
│   └── NLSY97
│       └── nlsy97_v2.do
├── mary
│   └── NLSY97
│       └── nlsy97.do
├── matlab_fortran
│   ├── graphs
│   ├── sensitivity1
│   │   ├── data.xlsx
│   │   ├── good_version.do
│   │   └── script.m
│   └── sensitivity2
│       ├── models.f90
│       ├── models.mod
│       └── nrtype.f90
├── readme.do
├── scatter1.eps
├── scatter1_1.eps
├── scatter1_2.eps
├── ts.eps
├── wave1.dta
├── wave2.dta
├── wave2regs.dta
└── wave2regs2.dta

(scroll down! 😉)

Bad! 👎

  • Subdirectories are not helpful
  • File names are confusing
  • Code, data and output are not separated

Project Organisation: Good 👍

.
├── README.md
├── code
│   ├── R
│   │   ├── 0-install.R
│   │   ├── 1-main.R
│   │   ├── 2-figure2.R
│   │   └── 3-table2.R
│   ├── stata
│   │   ├── 1-main.do
│   │   ├── 2-read_raw.do
│   │   ├── 3-figure1.do
│   │   ├── 4-figure3.do
│   │   └── 5-table1.do
│   └── tex
│       ├── appendix.tex
│       └── main.tex
├── data
│   ├── processed
│   └── raw
└── output
    ├── plots
    └── tables

Good.

  • Meaningful subdirectories
  • Top-level README
  • Code, data and output are separated

Example: TIER Protocol structure

Best Project Structure?


Note

There is no unique best way to organize your project: Make it simple, intuitive and helpful.


Important

Ideally your entire project is under version control.

Automating Stuff

Manifesto

  1. Never again will I code a LaTeX table by hand.

  2. Never again will I copy and paste any kind of result.

  3. There are no screenshots in my pipeline.

  4. My code is not just a bunch of scripts.
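The first two points can be sketched in a few lines: results go from code straight into a `.tex` file, which the paper then includes. This is a minimal illustration using only the Python standard library; the helpers `latex_table` and `write_table` are hypothetical names, and in practice a dedicated table package in your language of choice will do more than this.

```python
from pathlib import Path

def latex_table(header, rows, fmt="{:.3f}"):
    """Return a LaTeX tabular built from a header and a list of rows."""
    def cell(x):
        # format floats consistently, leave everything else as text
        return fmt.format(x) if isinstance(x, float) else str(x)

    lines = [
        r"\begin{tabular}{l" + "r" * (len(header) - 1) + "}",
        r"\hline",
        " & ".join(header) + r" \\",
        r"\hline",
    ]
    for row in rows:
        lines.append(" & ".join(cell(x) for x in row) + r" \\")
    lines += [r"\hline", r"\end{tabular}"]
    return "\n".join(lines) + "\n"

def write_table(path, header, rows):
    """Write the table to disk so that the paper can include it."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(latex_table(header, rows))
    return path
```

Assuming the table is written to a path such as `output/tables/table1.tex` (an illustrative name), the paper would pull it in with `\input{output/tables/table1.tex}`, so rerunning the code updates the paper without any copy and paste.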

Action Points

  • I have a run script which runs my entire pipeline.
  • I delete my output folder frequently. Fully reproducible anyway. 😉
  • I automatically generate my tables.
  • I write all output to disk.
  • I use code packaging mechanisms and unit testing.
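A run script along these lines could look as follows. It is a minimal sketch under stated assumptions: `run_pipeline` and the two toy steps are hypothetical stand-ins for your real cleaning and estimation code, and the content they write is placeholder text, not a result. Because the function deletes the output folder before every run, make sure it only ever points at a folder that contains nothing but generated files.

```python
import shutil
from pathlib import Path

def run_pipeline(steps, output_dir):
    """Delete `output_dir`, recreate it, then run every step in order.

    `steps` is a list of (name, function) pairs; each function receives
    the output directory and writes its results there.
    """
    output_dir = Path(output_dir)
    if output_dir.exists():
        shutil.rmtree(output_dir)      # start from a clean slate every time
    output_dir.mkdir(parents=True)

    completed = []
    for name, step in steps:
        print(f"running {name} ...")
        step(output_dir)               # an exception stops the pipeline here
        completed.append(name)
    return completed

# Hypothetical steps with placeholder content.
def clean_data(out):
    (out / "clean.csv").write_text("id,x\n1,2\n")

def make_summary(out):
    n_rows = len((out / "clean.csv").read_text().splitlines()) - 1
    (out / "summary.txt").write_text(f"rows: {n_rows}\n")

STEPS = [("clean_data", clean_data), ("make_summary", make_summary)]
```

With this in place, `run_pipeline(STEPS, "output")` is the one command that regenerates everything, which is what makes deleting the output folder harmless.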

Code Editors

Use at least one of these

  • Visual Studio Code
  • Sublime Text
  • Vim
  • Emacs

Unit Tests

  1. The more code you write, the harder it gets to have an overview of what works and what does not.
  2. You may not even remember what “it works” means for a component after a few months: How did you actually determine that it does work?
  3. You want to be sure that a component still works as intended, even 3 years later.
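
# in src/ of your julia package
function makes_data()
    # makes some data (placeholder values, replace with your own logic)
    return collect(1:10)
end

# in test/ of your julia package
using Test

# use your function
d = makes_data()

# test a property you expect the data to have (here: its length)
@test length(d) == 10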

Computational Performance

Questions 1

  • How important is it to have performant code?
  • Why does it matter and how much should you care about this?


Questions 2

  • How should you approach this problem?
  • What is there to know?

Is it important?

  • Fast code \(\approx\) fast spaceship 🚀. A faster spaceship allows you to explore galaxies which are farther away. All else equal, a faster spaceship is better.
  • Of course building a fast spaceship is more complicated and harder to learn.
  • Key question: is it worth the effort to build the fast spaceship?
  • Do you need to be at the computational frontier, or not?

Beware

  • Nothing is a panacea. Nothing. There are just things that work, given some level of effort, and things that don’t.
  • Parallel Programming is often touted as the solution. It’s not. Find a language where it’s easy to test what you are doing.

Donald Knuth: Premature Optimization is the Root of All Evil

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

👉 First: make sure it’s correct, then worry about making it fast.

👉 Unit Tests!
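The order of operations can be sketched like this: keep a slow but obviously correct reference version, check the fast candidate against it, and only then look at timings. The cumulative-sum example and the names `cumsum_slow`, `cumsum_fast` and `check_and_time` are assumptions chosen purely for illustration.

```python
import timeit

def cumsum_slow(xs):
    """Reference version: obviously correct, quadratic in len(xs)."""
    return [sum(xs[: i + 1]) for i in range(len(xs))]

def cumsum_fast(xs):
    """Candidate fast version: a single pass with a running total."""
    out, total = [], 0
    for x in xs:
        total += x
        out.append(total)
    return out

def check_and_time(xs, repeats=5):
    """First check that both versions agree, only then compare timings."""
    assert cumsum_fast(xs) == cumsum_slow(xs), "fast version is wrong"
    t_slow = timeit.timeit(lambda: cumsum_slow(xs), number=repeats)
    t_fast = timeit.timeit(lambda: cumsum_fast(xs), number=repeats)
    return t_slow, t_fast
```

The equality check uses integers so that the comparison is exact; with floating-point data you would compare up to a tolerance instead. Keeping the slow version around as a unit-test oracle means every later optimization can be checked against it.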