A matter of tests

A blog about my Raku projects


A matter of tests

I like the way Raku allows me to express myself. I like how the language is optimized for the maximum fun and easiness of expression.

To show this, let me tell you a story. At the beginning of September 2020 my wife Miriam asked me to verify a claim she read on the Internet, that the second wave of COVID-19 was due to the sheer increase of testing. We both doubted that claim, so we decided to get the official data and play with it.

Luckily the Italian government puts a great deal of data about the COVID-19 epidemics on Github, so I went on, did a git clone git://github.com/pcm-dpc/COVID-19.git, and started writing some code.

My aim was to plot both the “new cases” and the “tests,” and see if there was a correlation if any. In order to do so I started extracting the number of tests performed each day, to have a first taste of what I was looking at.

Here is my first program:

#!/usr/bin/env raku

sub readfiles {
  my @files = dir('COVID-19/dati-regioni/',
      test => { .starts-with: 'dpc-covid19-ita-regioni-20' }).sort; # [¹]
  my ($date, @tests);
  for @files -> $file {
    state $prevnumber = 0;                                          # [²]
    my $thisdate = $file.substr: 52, 8;
    my @lines = $file.lines;
    my $testspos = @lines[0].split(',').grep('tamponi', :k)[0];     # [³]
    my $ntests = 0;
    for @lines[1..^@lines.elems] -> $line {
      next unless $line.starts-with('2020');                        # [⁴]
      $ntests += $line.split(',')[$testspos];                       # [⁵]
    }
    my $total = $ntests - $prevnumber;
    $date.push: $thisdate;
    @tests.push: $total;
    $prevnumber = $ntests;
  }
  return $date, |@tests;
}

my ($date, @tests) = readfiles;

for ^$date -> $i {
  put "$date[$i] @tests[$i]";
}

Notes:
[¹] This line read the list of files containing regional data, selects those files whose name starts with a certain string, and sorts those file by name (and so by date, since luckily they’re using a YYYYMMDD format)
[²] There’s no “new tests” column, so I need to read the total number of tests so far and subtract the previous day’s value
[³] Here I read the headers, in the first line, split it using , as a separator, look for the header I want (“tamponi” = “swabs”) and use the :k adverb to get its position, or key
[⁴] Since the directory contain two more file, besides those tagged with a date, I skip them
[⁵] To get the data I need I split each line and select the array element, according to the index found in note [³]

So I ran the program like this: ./tests.raku > tests.dat to save the data in a file.
Let’s see what I got!
Gnuplot is my friend:

#!/usr/bin/gnuplot

set term qt persist
set xdata time
set timefmt "%Y%m%d"
set xtics timedate rotate format "%d/%m/%Y"
set grid
plot 'tests.dat' using 1:2 title 'Data' with lines

And here’s what I saw:

First graph: the seesaw

Why is it so jagged? Where the seesaw comes from?
Well, probably even doctors have weekends.

Let’s see if some filtering makes the things better, after all what I need is to verify whether the new cases follow the test increment or the other way round.
I will use Math::Libgsl::MovingWindow, one of the modules I wrote to provide an interface to a really awesome C library, the GNU Scientific Library.
To install the module use this command:

zef install Math::Libgsl::MovingWindow

In order to compute the moving mean of the dataset I’m going to use a moving window of one week.

Here’s the program:

#!/usr/bin/env raku

use Math::Libgsl::Constants;
use Math::Libgsl::Vector;
use Math::Libgsl::MovingWindow;

sub readfiles {
  my @files = dir('COVID-19/dati-regioni/',
      test => { .starts-with: 'dpc-covid19-ita-regioni-20' }).sort;
  my ($date, @tests);
  for @files -> $file {
    state $prevnumber = 0;
    my $thisdate = $file.substr: 52, 8;
    my @lines = $file.lines;
    my $testspos = @lines[0].split(',').grep('tamponi', :k)[0];
    my $ntests = 0;
    for @lines[1..^@lines.elems] -> $line {
      next unless $line.starts-with('2020');
      $ntests += $line.split(',')[$testspos];
    }
    my $total = $ntests - $prevnumber;
    $date.push: $thisdate;
    @tests.push: $total;
    $prevnumber = $ntests;
  }
  return $date, |@tests;
}

my constant $WEEK = 7;                                        # [¹]

# Read data from files
my ($date, @tests) = readfiles;

# MovingWindow
my Math::Libgsl::MovingWindow $mw .= new: :samples($WEEK);    # [²]
my $w = array-vec(
      -> $vec { $mw.mean($vec, :endtype(GSL_MOVSTAT_END_PADVALUE)) },
      @tests
    );                                                        # [³]

for ^$date -> $i {
  put "$date[$i] @tests[$i] { $w[$i].Int }";
}

Notes:
[¹] $WEEK is the moving window’s amplitude: a week
[²] Declare and initializes a Math::Libgsl::MovingWindow object with an amplitude of 7 samples
[³] array-vec converts a regular Raku array into a Math::Libgsl::Vector object, then it passes the Vector to a closure, which computes the moving mean over the whole Vector, and returns the Vector of the computed moving mean

I ran the program like this: ./tests.raku > tests.dat.
This Gnuplot script shows the two datasets overlaid:

#!/usr/bin/gnuplot

set term qt persist
set xdata time
set timefmt "%Y%m%d"
set xtics timedate rotate format "%d/%m/%Y"
set grid
plot 'tests.dat' using 1:2 title 'Data' with lines, 'tests.dat' using 1:3 title '7-day average' with lines

Second graph: the seesaw and the moving mean

OK then, let’s do the same for the new cases:

#!/usr/bin/env raku

use Math::Libgsl::Constants;
use Math::Libgsl::Vector;
use Math::Libgsl::MovingWindow;

sub readfiles {
  my @files = dir('COVID-19/dati-regioni/',
      test => { .starts-with: 'dpc-covid19-ita-regioni-20' }).sort;
  my ($date, @new-cases);
  for @files -> $file {
    my $today's-date = $file.substr: 52, 8;
    my @lines = $file.lines;
    my $newcasespos = @lines[0].split(',').grep('nuovi_positivi', :k)[0];
    my $ncases = 0;
    for @lines[1..^@lines.elems] -> $line {
      next unless $line.starts-with('2020');
      $ncases += $line.split(',')[$newcasespos];
    }
    $date.push: $today's-date;
    @new-cases.push: $ncases;
  }
  return $date, |@new-cases;
}

# Leggi i dati dai file
my ($date, @new-cases) = readfiles;

# MovingWindow
my constant $WEEK = 7;
my Math::Libgsl::MovingWindow $mw .= new: :samples($WEEK);
my $w = array-vec(
          -> $vec { $mw.mean($vec, :endtype(GSL_MOVSTAT_END_PADVALUE)) },
          @new-cases
        );

# Output
put "$date[$_] @new-cases[$_] { $w[$_].Int }" for ^$date;

I ran the program like this: ./newcases.raku > newcases.dat.
This script will generate the graph:

#!/usr/bin/gnuplot

set term qt persist
set xdata time
set timefmt "%Y%m%d"
set xtics timedate rotate format "%d/%m/%Y"
set grid
plot 'newcases.dat' using 1:2 title 'Raw data' at .35,.85 with lines, 'newcases.dat' using 1:3 title '7-day average' at .35,.9 with lines

Here’s the graph:

New cases: the seesaw and the moving mean

Fine. Now before putting both datasets on the same graph I need to normalize them, because the two ranges are very different: the number of tests is about ten times that of new cases.

#!/usr/bin/env raku

my (@date, @tests, @newcases);
for 'tests.dat'.IO.lines -> $line {
  my ($date, $, $, $tests) = $line.split: ' ';
  @date.push: $date;
  @tests.push: +$tests;
}
for 'newcases.dat'.IO.lines -> $line {
  my ($, $, $newcases) = $line.split: ' ';
  @newcases.push: +$newcases;
}

@newcases »-=» @newcases.min;                    # [¹]
@newcases »/=» @newcases.max;
@tests    »-=» @tests.min;
@tests    »/=» @tests.max;

for ^@date -> $i {
  put "@date[$i] @newcases[$i] @tests[$i]";
}

Notes:
[¹] This lines normalize the whole dataset.
@newcases.min returns the minimum value in the array, @newcases.max returns the maximum value in the array, while the »-=» hyperoperator subtracts from each element of the @newcases array its minimum value and stores back the result in the original arraythe and the »/=» hyperoperator divides each element of the @newcases array by its maximum value and stores back the result in the original array, thus normalizing it

I ran the program like this: ./normalize.raku > normalize.dat.
This Gnuplot script will show the two normalized datasets superimposed:

#!/usr/bin/gnuplot

set term qt persist
set xdata time
set timefmt "%Y%m%d"
set xtics timedate rotate format "%Y%m%d"
set grid
plot 'normalize.dat' using 1:2 title 'new cases 7-day average' at .5,.9 with lines, 'normalize.dat' using 1:3 title 'tests 7-day average' at .5,.8 with lines

New cases and tests, superimposed

Based on what I can see, in March and April the increase of tests followed the increase of new cases. In September the number of tests increased while the new cases remained almost stable, then in November the number of new cases dropped while the number of tests continued raising.
I think the graph proves that our doubts were justified.