Perl 6 By Example: Stacked Plots with Matplotlib

This blog post is part of my ongoing project
to write a book about Perl 6.

In a previous episode, we explored plotting git statistics in Perl 6 using
matplotlib.

Since I wasn’t quite happy with the result, I want to explore using stacked
plots for presenting the same information. In a regular plot, the y
coordinate of each plotted value is proportional to its value. In a
stacked plot, it is the distance to the previous value that is
proportional to its value. This is nice for values that add up to a
total that is also interesting.
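
To make the difference concrete, here is a minimal sketch with made-up
numbers (three authors, one day). The variable names are only for
illustration. The running total gives the upper edge of each band in a
stacked plot, and the thickness of each band is the original value:

```perl6
# Made-up commit counts of three authors on a single day
my @values = 3, 1, 2;

# Upper edge of each band: the running total over the authors
my @upper = [\+] @values;
say @upper;     # [3 4 6]

# Lower edge of each band: the upper edge of the band below it
my @lower = 0, |@upper[0 ..^ @upper.end];
say @lower;     # [0 3 4]

# The distance between the edges is the original value again
say @upper Z- @lower;   # (3 1 2)
```

The last band ends at 6, which is the total number of commits for that
day.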

Matplotlib offers a method called stackplot for that. Unlike multiple
plot calls on a subplot object, it requires a shared x axis for all data
series. So we must construct one array for each author of git commits,
where dates with no value come out as zero.

As a reminder, this is what the logic for extracting the stats looked
like in the first place:

my $proc = run :out, <git log --date=short --pretty=format:%ad!%an>;
my (%total, %by-author, %dates);
for $proc.out.lines -> $line {
    my ( $date, $author ) = $line.split: '!', 2;
    %total{$author}++;
    %by-author{$author}{$date}++;
    %dates{$date}++;
}
my @top-authors = %total.sort(-*.value).head(5)>>.key;

And some infrastructure for plotting with matplotlib:

use Inline::Python;

my $py = Inline::Python.new;
$py.run('import datetime');
$py.run('import matplotlib.pyplot');
sub plot(Str $name, |c) {
    $py.call('matplotlib.pyplot', $name, |c);
}
sub pydate(Str $d) {
    $py.call('datetime', 'date', $d.split('-').map(*.Int));
}

my ($figure, $subplots) = plot('subplots');
$figure.autofmt_xdate();

So now we have to construct an array of arrays, where each inner array
has the values for one author:

my @dates = %dates.keys.sort;
my @stack = $[] xx @top-authors;

for @dates -> $d {
    for @top-authors.kv -> $idx, $author {
        @stack[$idx].push: %by-author{$author}{$d} // 0;
    }
}

Now plotting becomes a simple matter of a method call, followed by the
usual commands adding a title and showing the plot:

$subplots.stackplot($[@dates.map(&pydate)], @stack);
plot('title', 'Contributions per day');
plot('show');

The result (again run on the zef source repository) is this:

[Figure: Stacked plot of zef contributions over time]

Comparing this to the previous visualization reveals a discrepancy:
There were no commits in 2014, and yet the stacked plot makes it appear
as if there were. In fact, the previous plots would have shown the same
“alternative facts” if we had chosen lines instead of points. This is
because matplotlib (like nearly all plotting libraries) interpolates
linearly between data points. But in our case, a date with no data
points means zero commits happened on that date.
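
A small sketch of what linear interpolation does, using two made-up data
points that are ten days apart. The numbers and variable names are
assumptions for illustration only:

```perl6
# Two made-up data points: 4 commits on day 0, 2 commits on day 10
my ($x0, $y0) = 0, 4;
my ($x1, $y1) = 10, 2;

# The value that a straight line between the two points suggests for day 5
my $x = 5;
my $y = $y0 + ($y1 - $y0) * ($x - $x0) / ($x1 - $x0);
say $y;     # 3
```

The line suggests three commits on day 5, even though the data contains
no commits at all for that day.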

To communicate this to matplotlib, we must explicitly insert zero values
for missing dates. This can be achieved by replacing

my @dates = %dates.keys.sort;

with the line

my @dates = %dates.keys.minmax;

The minmax method finds the minimal and maximal values, and returns them
in a Range. Assigning the range to an array turns it into an array of
all values between the minimal and the maximal value. The logic for
assembling the @stack variable already maps missing values to zero.
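
To see the mechanism in isolation, here is a small sketch with made-up
values. The first part shows minmax on plain integers. The second part
is an alternative that is not used in this article: it goes through Date
objects, assuming the keys are ISO dates (YYYY-MM-DD), so that the range
steps one calendar day at a time.

```perl6
# Made-up integers: minmax returns a Range, and assigning it fills in the gaps
my @filled = (3, 7, 5).minmax;
say @filled;    # [3 4 5 6 7]

# Alternative sketch (an assumption, not the article's code): a range of
# Date objects steps day by day, and .Str turns each back into YYYY-MM-DD
my @days = (Date.new('2017-01-30') .. Date.new('2017-02-02')).map(*.Str);
say @days;      # [2017-01-30 2017-01-31 2017-02-01 2017-02-02]
```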

The result looks a bit better, but still far from perfect:

[Figure: Stacked plot of zef contributions over time, with missing dates mapped to zero]

Thinking more about the problem, I realized that contributions from
separate days should not be joined together, because that produces
misleading results. Matplotlib also doesn’t support adding a legend
automatically to stacked plots, so this seems to be a dead end.

Since a dot plot didn’t work very well, let’s try a different kind of
plot that represents each data point separately: a bar chart, or more
specifically, a stacked bar chart. Matplotlib offers the bar plotting
method, and its named parameter bottom can be used to generate the
stacking:

my @dates = %dates.keys.sort;
my @stack = $[] xx @top-authors;
my @bottom = $[] xx @top-authors;

for @dates -> $d {
    my $bottom = 0;
    for @top-authors.kv -> $idx, $author {
        @bottom[$idx].push: $bottom;
        my $value = %by-author{$author}{$d} // 0;
        @stack[$idx].push: $value;
        $bottom += $value;
    }
}
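
To check what ends up in the two arrays, here is the same loop run on a
tiny made-up data set. The two authors, the dates and the counts are
invented for this example; only the loop itself is taken from above:

```perl6
# Made-up data: two authors, two dates
my @top-authors = <alice bob>;
my %by-author =
    alice => { '2017-01-01' => 2, '2017-01-02' => 1 },
    bob   => { '2017-01-02' => 3 };
my @dates = <2017-01-01 2017-01-02>;

my @stack  = $[] xx @top-authors;
my @bottom = $[] xx @top-authors;

for @dates -> $d {
    my $bottom = 0;     # each day starts stacking at zero
    for @top-authors.kv -> $idx, $author {
        @bottom[$idx].push: $bottom;
        my $value = %by-author{$author}{$d} // 0;
        @stack[$idx].push: $value;
        $bottom += $value;      # the next author's bar starts here
    }
}

say @stack;     # [[2 1] [0 3]]
say @bottom;    # [[0 0] [2 1]]
```

The first author's bars always start at zero, and the second author's
bars start where the first author's bars end on the same day.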

We need to supply color names ourselves, and set the edge color of the
bars to the same color; otherwise the black edge color dominates the
result:

my $width = 1.0;
my @colors = <red green blue yellow black>;
my @plots;

for @top-authors.kv -> $idx, $author {
    @plots.push: plot(
        'bar',
        $[@dates.map(&pydate)],
        @stack[$idx],
        $width,
        bottom => @bottom[$idx],
        color => @colors[$idx],
        edgecolor => @colors[$idx],
    );
}
plot('legend', $@plots, $@top-authors);

plot('title', 'Contributions per day');
plot('show');

This produces the first plot that’s actually informative and not
misleading (provided you’re not color blind):

[Figure: Stacked bar plot of zef contributions over time]

If you want to improve the result further, you could experiment with
limiting the number of bars by lumping together contributions by week or
month (or maybe $n-day periods).
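
A minimal sketch of lumping by month, with made-up counts standing in
for %dates. It assumes date strings in YYYY-MM-DD form, so the first
seven characters identify the month; %by-month is a name I introduce
just for this example:

```perl6
# Made-up commit counts per day
my %dates = '2017-01-03' => 2, '2017-01-20' => 1, '2017-02-01' => 4;

# Sum up the counts under the YYYY-MM prefix of each date
my %by-month;
for %dates.kv -> $date, $count {
    %by-month{ $date.substr(0, 7) } += $count;
}

say %by-month.sort;     # (2017-01 => 3 2017-02 => 4)
```

The same idea would have to be applied to the per-author counts before
building @stack and @bottom.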

Next, we’ll investigate ways to make the matplotlib API more idiomatic
to use from Perl 6 code.
