Coffee Boss day 2: Doing some work.

After PUBG last night, I spent a few hours trying to understand pandas and matplotlib. At 1:30am I called it a day. 10 hours into this task.

Looking at https://machinelearningmastery.com/time-series-data-visualization-with-python/ and Julia Evans’ https://github.com/jvns/pandas-cookbook.

Data description

The machine logs to two files:

  • datr<date>.csv which is a regular log of raw measurements from the scale. The frequency of this logging is set using the regularLogInterval variable in coffee_boss.ino. These values are unfiltered, and contain all of the weird noise that this circuit collects. They are, however, a true time-series.
    The R in datr stands for Regular. But Raw works too. These files end up big (~3MB).
  • datc<date>.csv which is a log of changes greater than a particular threshold, one designed to be just about the smallest thing that can happen with the machine. That threshold is specified in the changeThreshold variable in coffee_boss.ino. It’s currently 30, so this file will only log changes greater than 30 grams. This stream of measurements is also filtered: it’s a running median of the last 8 raw measurements, which filters out almost all of the noise. It uses the RunningMedian library for this. This is not a true time series, since the logging frequency is not constant.
    The C in datc stands for Change. I wish I’d thought of better names. These files are only a couple of KB in size. (There’s a rough Python sketch of this change-detection logic just after this list.)
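
The change-detection logic is simple enough to sketch in Python, though this is just an illustration of the idea, not the actual firmware code (the real thing is C++ in coffee_boss.ino):

from statistics import median
from collections import deque

# Sketch of the datc logic: keep a running median of the last 8 raw
# readings, and only log when it moves more than changeThreshold
# grams away from the last logged value. The real version uses the
# RunningMedian library.
WINDOW_SIZE = 8
CHANGE_THRESHOLD = 30  # grams

window = deque(maxlen=WINDOW_SIZE)
last_logged = None

def process_reading(raw_weight):
    global last_logged
    window.append(raw_weight)
    filtered = median(window)
    if last_logged is None or abs(filtered - last_logged) > CHANGE_THRESHOLD:
        last_logged = filtered
        return filtered  # this is the value that would be written to datc
    return None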

Both log files share the same format, a CSV with four columns:

  • datetime (%Y-%m-%dT%H:%M:%S)
  • date (%Y-%m-%d)
  • time (%H:%M:%S)
  • weight (float with 2 decimal places)

You can find some examples of these files in https://github.com/euphy/coffee_boss/tree/master/output. They look like this:

2019-09-26T02:57:53,2019-09-26,02:57:53,2249.48
2019-09-26T03:07:55,2019-09-26,03:07:55,2221.33
2019-09-26T03:08:51,2019-09-26,03:08:51,2119.35
2019-09-26T03:08:51,2019-09-26,03:08:51,1886.68
2019-09-26T03:08:51,2019-09-26,03:08:51,1895.43
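
As a sanity check on those format strings, one row parses like this in plain Python (the variable names are just for illustration):

from datetime import datetime

row = '2019-09-26T02:57:53,2019-09-26,02:57:53,2249.48'
stamp, date, time, weight = row.split(',')

# The first column parses with the %Y-%m-%dT%H:%M:%S format
parsed = datetime.strptime(stamp, '%Y-%m-%dT%H:%M:%S')
print(parsed, float(weight))  # 2019-09-26 02:57:53 2249.48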

I want to see this as a line graph with time along the x-axis stretching from left to right, and weight on the y-axis, bottom to top. Doing this with the datR files would be easiest, because they are naturally already time-series data, but they give a pretty awful output because they are so noisy. I’d have to do some filtering on them in Python. That’s not such a bad idea.

The datC files are already filtered, but they are not a true time series, so I have to either:

  • use the datR files and figure out how to filter the noise out of them. This seems like pandas work. I don’t know how to use pandas.
    Or
  • use the datC files and figure out how to present them as time series – interpolation of missing points perhaps, or some other built-in way to do this in matplotlib. I don’t know how to use matplotlib. (There’s a rough sketch of this idea just after this list.)
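
For the record, that second option might look something like this — a sketch I haven’t tested, using pandas resampling rather than anything matplotlib-specific:

from pandas import read_csv
from matplotlib import pyplot

# Parse the datestamp column into a proper datetime index, then
# resample onto a regular one-minute grid. Forward-fill rather than
# interpolate, because the weight genuinely stays put between logged
# changes.
column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datc20190926.csv', names=column_names,
                  parse_dates=['datestamp'], index_col='datestamp')

regular = series['weight'].resample('1min').ffill()
regular.plot()
pyplot.show()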

Simple plot with pandas and matplotlib

I’ve started with the datC approach:

from pandas import read_csv
from matplotlib import pyplot

# The files have no header row, so supply the column names here
column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datc20190926.csv', names=column_names)
series.plot()
pyplot.show()

Which renders a nice graph:

This shows what I expect, in a way. Two pots of coffee made, with about six cups being taken from each one.

There’s something wrong though. The first third is all a bit messy. Not sure what’s happening there, so we’ll look at the time, and hm, there isn’t even a time axis. The x-axis is a count of samples, not a time series. I can’t tell if that first disorganised section is an hour or nine hours. I can’t tell if the first pot of coffee was drunk in an hour or in twelve. Given that this data covers a full day (from midnight to midnight), it looks like one pot of coffee lasted all day, and there was another one made late at night (7pm maybe?).
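
One quick way to answer the “an hour or nine hours” question without fixing the axis is to look at the timestamps directly:

# Continuing from the script above: peek at the time range the messy
# first third actually covers, since the plot's x-axis won't say
third = len(series) // 3
print(series['datestamp'].iloc[0], '->', series['datestamp'].iloc[third])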

Simple plot of raw, time-series data

Let’s try with the raw data (output/datr20190926.csv). Same script as before, just with the filename swapped:
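
from pandas import read_csv
from matplotlib import pyplot

# Same plot as the datc one, just reading the raw datr file instead
column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names)
series.plot()
pyplot.show()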

Ok, that’s no better. There are a few bad samples in there that obscure the good ones. Can I filter them out? It still hasn’t got the right x-axis either, measuring samples rather than time.

Filter and remove out-of-range samples

Filtering out values that I _know_ are bad will help and is easy. See https://stackoverflow.com/questions/29594841/how-to-filter-out-values-by-values-in-pandas-dataframe.

from pandas import read_csv
from matplotlib import pyplot

column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names)
# Keep only plausible weights; anything outside this range is sensor noise
series = series[(series['weight'] > -2000) & (series['weight'] < 10000)]
series.plot()
pyplot.show()

Which filters out weights below -2000g and above 10000g. This is better because I can see the overall shape of the values and where they sit in the overall body of samples (across the whole day). I can see that the cups being taken from each pot are not regularly spaced.

But there’s still a lot of noise that doesn’t hit those thresholds, and importantly, this approach simply throws away the samples that are outside the bounds. So that means there is a time gap at those points, and if enough of them happen (I can only see two here), then the time-series is discontinuous.

So I think the approach is not to filter out and discard bad values, but to use the source log data to produce an entirely new stream of weight values using something like a moving averaging window. That’s how the firmware does it, and that gives a decent result.

Use a rolling() sample window to remove noise

There is a rolling_mean() function in pandas: https://www.programcreek.com/python/example/101378/pandas.rolling_mean. Oh it’s deprecated. And I don’t want a mean anyway, means are rubbish, I want the median.
https://stackoverflow.com/questions/43437657/rolling-mean-on-pandas-on-a-specific-column describes using .rolling().mean() instead, which is closer. I assume there’s a .median() version too.

from pandas import read_csv
from matplotlib import pyplot

column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names)
# Rolling mean over an 8-sample window, the same width as the firmware's filter
series['rolling'] = series['weight'].rolling(8).mean()
series.plot()
pyplot.show()

That’s a bit more like it. It proves that mean() isn’t the right one (mean averages are badly affected by outliers). So now I need to figure out how to plot only the rolling column, rather than both it and the raw weight. I think this is a matplotlib issue. Well, it is and it isn’t: http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/understand-df-plot-in-pandas/
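
The short version of that link, as far as I can tell: df.plot() plots every numeric column by default, so you point it at the one you want.

# Continuing from the script above: plot just the rolling column
# instead of every numeric column
series.plot(y='rolling')
pyplot.show()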

Using rolling().median()

from pandas import read_csv
from matplotlib import pyplot

column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names)
# Rolling median this time, which shrugs off the outliers
series['rolling'] = series['weight'].rolling(8).median()
series.plot(x='time', y='rolling')

# Rotate x ticks and tight_layout fits it all on the page
pyplot.xticks(rotation='vertical')
pyplot.tight_layout()
pyplot.show()

That’s more like it. It uses .median() instead of .mean() to deal better with outliers. I also figured out series.plot(x='time', y='rolling') to specify which columns to plot, rotated the time ticks so they don’t overlap, and tight_layout()’d it so they didn’t fall off the bottom of the page.

This calculates the median value in a rolling window of 8 samples, so at a reading every couple of seconds that’s about sixteen seconds of activity. I’m interested in filtering out some more of the jaggies, and would like to see the results of a few different window sizes plotted together. Doing it like the below was a guess, and it worked. It makes me start to think I’m getting a bit of a clue about how to use this toolset. Four hours in.

See the effects of different sized rolling windows

# Continuing from the script above: compare several window sizes
series['rolling4'] = series['weight'].rolling(4).median()
series['rolling8'] = series['weight'].rolling(8).median()
series['rolling12'] = series['weight'].rolling(12).median()
series['rolling24'] = series['weight'].rolling(24).median()
series.plot(x='time', y=['rolling4', 'rolling8', 'rolling12', 'rolling24'])

What was that weird blip at 3am? So now I know that 24 samples covers 48 seconds of activity and filters out almost all variance. It shows the gulp-gulp-gulp of the coffee cups going down (8:45 to 11:41), which is really cool and really clear. But I realised that’s not quite what I thought I was looking for.

This tells a story, and at first I thought it filtered out too much. It doesn’t show the sequence of events of lifting the coffee pot out, then returning it a bit lighter: because the pot is only off the scale for a short period of time, that signature just disappears. I thought this little signature would be important for recognising a cup of coffee being taken.

In fact, I think the heavily-filtered plot, with the simple chunky downward steps tells the story more clearly. If I could spot a drop of around a cup, then that’s simple.
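
Sketching that idea out (the numbers here are guesses, not anything tuned): diff the heavily-filtered column and look for downward steps of roughly a cup.

# Continuing from the script above: a rough sketch of spotting
# cup-sized drops. The 150-350g range is a guess at what a cup of
# coffee weighs, not a measured value.
steps = series['rolling24'].diff()
cups = series[(steps < -150) & (steps > -350)]
print(cups[['time', 'rolling24']])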

I’ve just realised that the matplotlib viewer that pops up when I do pyplot.show() is really good. It does a zoom-in on a section! That’s exactly what I wanted Excel to do for me. Excel really is the wrong tool for this job.

Adding more ticks to the x-axis

It does not show me very good ticks on the x-axis though; there aren’t enough. I’ll need to add some more.

from pandas import read_csv
from matplotlib import pyplot

column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names)
small_window = ('rolling4', 4)
large_window = ('rolling36', 36)

series[small_window[0]] = series['weight'].rolling(small_window[1]).median()
series[large_window[0]] = series['weight'].rolling(large_window[1]).median()

series.plot(x='time', y=[small_window[0], large_window[0]])

count = series['time'].count()
no_of_ticks = 24
size_of_segment = int(count / no_of_ticks)

# Build two lists: indices spread regularly across the series, and
# the time label found at each of those indices
indices = list()
tick_labels = list()
for i in range(0, no_of_ticks):
    # Put each tick in the middle of its segment, keeping it an
    # integer so it can index into the series
    position = (size_of_segment * i) + size_of_segment // 2
    indices.append(position)
    tick_labels.append(series['time'][position])

pyplot.grid(True, which='major')
pyplot.xticks(labels=tick_labels, ticks=indices, rotation='vertical')
pyplot.tight_layout()
pyplot.show()

That seems like a clunky way to do it. There must be a better one! That’s enough for today. Six hours work. Bit of Overwatch before bed.
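
Postscript: I suspect the less clunky way is to give pandas a real DatetimeIndex and let matplotlib’s date machinery pick the ticks by itself. A sketch, untried as of this post:

from pandas import read_csv
from matplotlib import pyplot

# Parse datestamp into a DatetimeIndex so matplotlib's date locators
# can choose sensible ticks automatically
column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names,
                  parse_dates=['datestamp'], index_col='datestamp')

series['rolling36'] = series['weight'].rolling(36).median()
series['rolling36'].plot()

pyplot.grid(True, which='major')
pyplot.tight_layout()
pyplot.show()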