Pipes and Filters

SciencesPo Intro To Programming 2024

Florian Oswald and The Software Carpentry

11 October, 2024

Combining Commands

  • We are now ready to combine some of the commands we learned.

  • You will see that here is where the real power lies.

  • Let’s navigate into our exercise data folder first.

$ cd ~/shell-lesson-data/exercise-data/proteins
$ ls
cubane.pdb   ethane.pdb   methane.pdb  octane.pdb   pentane.pdb  propane.pdb
  • Those are protein data bank files.

Capturing Output

  • Introducing the wc word count command.
wc cubane.pdb
20     156    1158 cubane.pdb

29 lines, 156 words, 1158 characters.

  • Let’s redirect the output of wc to a file instead with >:
wc -l *.pdb > lengths.txt
  • no ouput on screen, you see? but now there is a new file: lengths.txt.
  • Let’s concatenate its content (i.e. join together) and print to screen:
$ cat lengths.txt 
      20 cubane.pdb
      12 ethane.pdb
       9 methane.pdb
      30 octane.pdb
      21 pentane.pdb
      15 propane.pdb
     107 total

Reading Text Files

  • cat prints the entire thing to screen.
  • tail only the end
  • head only the beginning
  • less lets you scroll and read (arrows up/down or j (up) and k (down), q exits.)
$ head -n 3 ../animal-counts/animals.csv
$ tail -n 2 ../animal-counts/animals.csv
$ less ../../north-pacific-gyre/NENE01729A.txt

Printing Text with echo

  • The echo function prints text - by default to screen:
$ echo hi
hi
  • But you can redirect it to a file as well:
$ echo I said hi! > echofile1.txt

Challenge

Do 2 times in a row:

$ echo I said hi! > echofile1.txt

Now do twice (notice >>!)

$ echo I said hi! >> echofile2.txt
  • What’s happening?

Appending to Files

Challenge

Consider the file shell-lesson-data/exercise-data/animal-counts/animals.csv. What is result of this:

$ head -n 3 animals.csv > animals-subset.csv
$ tail -n 2 animals.csv >> animals-subset.csv
  1. The first three lines of animals.csv?
  2. The last two lines of animals.csv?
  3. The first three lines and the last two lines of animals.csv?
  4. The second and third lines of animals.csv?
Show Solution

Solution

Option 3 is correct.

Filtering Files with sort

  • sort reads a file and sorts it’s content to screen
  • it does not change the file.
$ sort -n lengths.txt 
   9 methane.pdb
  12 ethane.pdb
  15 propane.pdb
  20 cubane.pdb
  21 pentane.pdb
  30 octane.pdb
 107 total

Filtering Files and using the result

  • Cool 😎 but now we want to use this list.
  • Could save it to a new file?
$ sort -n lengths.txt > sorted_lengths.txt
$ head -n 2 sorted_lengths.txt
   9 methane.pdb
  12 ethane.pdb

Filtering Files and the pipe

  • We call | the pipe. It takes output from a command and gives it to another command.
  • Modern languages use their own version of this (R has a package and now also a native pipe, julia has of course a pipe etc. Stata not sure 😜)
  • The pipe allows us to do this without storing intermediate results.
$ sort -n lengths.txt | head -n 1
 9 methane.pdb
  • But, wait 🤔. Then we don’t even need lengths.txt:
$ wc -l *.pdb | sort -n | head -n 1
 9 methane.pdb
  • That’s a pipeline. 🤯

Piping Away

  • Make sure we are still in ~/shell-lesson-data/exercise-data/proteins

Pipe Dreams

Which of the following commands shows us the 3 files with the least number of lines in the current directory? Build the pipeline up from left to right to check!

  1. wc -l * > sort -n > head -n 3
  2. wc -l * | sort -n | head -n 1-3
  3. wc -l * | sort -n | tail -n 4 | head -n 3
  4. wc -l * | sort -n | head -n 3

Piping Away

  • Make sure we are still in ~/shell-lesson-data/exercise-data/proteins
Show Solution

Solution

Option 4 is correct. Option 3 finds the ones with most lines.

Cutting and Piping

  • We have a .csv file here: shell-lesson-data/exercise-data/animal-counts
  • Let’s use the cut command to get parts of it.
$ cd ~/shell-lesson-data/exercise-data/animal-counts
$ cut -d , -f 2 animals.csv

Building a Pipe

  • uniq filters adjacent matching lines in a file.
  • Can you extend the above command with uniq (and another command?) such that we get the list of unique animal names?
  • Add the -c flag to uniq to get a contingency table.

Building a Pipe

Show Solution

Solution

  1. cut -d , -f 2 animals.csv | sort | uniq
  2. cut -d , -f 2 animals.csv | sort | uniq -c

House Prices in France

The below dataset contains information on house sales (price, location, type of house etc). We call one record a housing transaction.

Using the shell:

  1. Use wget to download data to from here to your downloads folder as carburants.csv: wget https://static.data.gouv.fr/resources/demandes-de-valeurs-foncieres/20240408-125738/valeursfoncieres-2023.txt
  2. use wc -l to count how many rows (lines) there are
  3. use head -n 2 to see the first two rows (the header)
  4. Use the above solution to build a contingency table that tells us the number of housing transactions per commune. Show the 10 cities with most housing transactions.
  5. Compute the average of variable Valeur fonciere. You should use the awk command like this : awk 'BEGIN{s=0;}{s+=$1;}END{print s/(NR);} your_file.txt'

House Prices in France

The below dataset contains information on house sales (price, location, type of house etc). We call one record a housing transaction.

Using the shell:

  1. Compute the average of variable Valeur fonciere. You should use the awk command like this : awk 'BEGIN{s=0;}{s+=$1;}END{print s/(NR);} your_file.txt'

Real Data

Show Solution
  1. wget https://static.data.gouv.fr/resources/demandes-de-valeurs-foncieres/20240408-125738/valeursfoncieres-2023.txt
  2. wc -l valeursfoncieres-2023.txt
  3. head -n 2 valeursfoncieres-2023.txt
  4. cut -d '|' -f 18 valeursfoncieres-2023.txt | sort | uniq -c | sort -r | head -n 10
  5. cut -d '|' -f 11 valeursfoncieres-2023.txt | cut -d , -f 1 | awk 'BEGIN{s=0;}{s+=$1;}END{print s/(NR);}'
  6. If you have R installed you can check whether this is the same as reading this column into it (it’s not! odd.)
Rscript -e 'x = data.table::fread(cmd = "cut -d \'|\' -f 11 valeursfoncieres-2023.txt | cut -d , -f 1"); x[, mean(`Valeur fonciere`,na.rm = TRUE)]'