After PUBG last night, I spent a few hours trying to understand pandas and matplotlib.
Looking at: https://machinelearningmastery.com/time-series-data-visualization-with-python/ and Julia Evans’ https://github.com/jvns/pandas-cookbook.
Data description
The machine logs to two files:
- datr<date>.csv, which is a regular log of raw measurements from the scale. The frequency of this logging is set using the regularLogInterval variable in coffee_boss.ino. These values are unfiltered, and contain all of the weird noise that this circuit collects. They are, however, a true time series. The R in datr stands for Regular, but Raw works too. These files end up big (~3 MB).
- datc<date>.csv, which is a log of changes greater than a particular threshold, designed to be just about the smallest thing that can happen with the machine. That threshold is specified in the changeThreshold variable in coffee_boss.ino. It’s currently 30, so this file will only log changes greater than 30 grams. This stream of measurements is also filtered, being a running median of the last 8 raw measurements, which filters out almost all of the noise. It uses the RunningMedian library for this. This is not a true time series, since the logging frequency is not constant. The C in datc stands for Change. I wish I’d thought of better names. These files are only a couple of Kb in size.
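The running-median idea is simple enough to sketch in Python. This is my approximation of what the RunningMedian Arduino library does (keep the last 8 samples, report their median), not its actual code:

```python
from collections import deque
from statistics import median

class RunningMedianFilter:
    """Keep the last `size` raw samples and report their median."""
    def __init__(self, size=8):
        self.samples = deque(maxlen=size)

    def add(self, value):
        self.samples.append(value)
        return median(self.samples)

f = RunningMedianFilter(size=8)
# A couple of wild spikes in otherwise steady readings get absorbed:
readings = [2250, 2249, 9999, 2251, 2250, -1800, 2249, 2250]
filtered = [f.add(r) for r in readings]
```

The median only moves if more than half the window is wrong, which is why the occasional mad spike from the circuit never makes it into datc.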
Both log files have the same format. CSVs with four columns:
- datetime (%Y-%m-%dT%H:%M:%S)
- date (%Y-%m-%d)
- time (%H:%M:%S)
- weight (float with 2 decimal places)
You can find some examples of these files in https://github.com/euphy/coffee_boss/tree/master/output. They look like this:
2019-09-26T02:57:53,2019-09-26,02:57:53,2249.48
2019-09-26T03:07:55,2019-09-26,03:07:55,2221.33
2019-09-26T03:08:51,2019-09-26,03:08:51,2119.35
2019-09-26T03:08:51,2019-09-26,03:08:51,1886.68
2019-09-26T03:08:51,2019-09-26,03:08:51,1895.43
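Each row can be picked apart with the Python stdlib alone, as a quick sanity check of the formats above:

```python
from datetime import datetime

# One raw line from a log file, split into its four columns
line = "2019-09-26T02:57:53,2019-09-26,02:57:53,2249.48"
datestamp, date, time, weight = line.split(',')

# The first column follows the %Y-%m-%dT%H:%M:%S format
parsed = datetime.strptime(datestamp, '%Y-%m-%dT%H:%M:%S')
weight = float(weight)
```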
I want to see this as a line graph with time along the x-axis.
The datC files are already filtered, but they are not a true time series, so I have to either:
- use the datR files and figure out how to filter the noise out of them. This seems like pandas work. I don’t know how to use pandas.
- or use the datC files and figure out how to present them as a time series – interpolation of missing points perhaps, or some other built-in way to do this in matplotlib. I don’t know how to use matplotlib.
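For what it’s worth, a sketch of that second option, using a tiny inline stand-in for a datC log rather than the real file: pandas can resample the irregular measurements onto a regular grid and interpolate the gaps.

```python
import pandas as pd

# An inline stand-in for a few rows of a datC log: irregular timestamps
idx = pd.to_datetime(['2019-09-26T02:57:53',
                      '2019-09-26T03:07:55',
                      '2019-09-26T03:08:51'])
weights = pd.Series([2249.48, 2221.33, 2119.35], index=idx)

# Resample onto a regular one-minute grid, then interpolate the gaps
regular = weights.resample('1min').mean().interpolate()
```

After this, `regular` has one value per minute from 02:57 to 03:08, so it plots as a true time series.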
Simple plot with pandas and matplotlib
I’ve started with the datC approach:
from pandas import read_csv
from matplotlib import pyplot
column_names = ['datestamp', 'date','time','weight']
series = read_csv('../output/datc20190926.csv', names=column_names)
series.plot()
pyplot.show()
Which renders a nice graph:

This shows what I expect, in a way. Two pots of coffee made, with about six cups being taken from each one.
There’s something wrong though. The first third is all a bit messy. Not sure what’s happening there, so we’ll look at the time, and hm, there isn’t even a time noted. The x-axis is a count of samples, not a time series. I can’t tell if that first disorganised section is an hour or nine hours, or when the first pot of coffee was made.
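One fix I suspect will help here: parse_dates and index_col are standard read_csv options, and parsing datestamp into a real DatetimeIndex means plotting uses actual times instead of row counts. Shown with an inline stand-in for the real file:

```python
import io
import pandas as pd

# An inline stand-in for the first rows of the real log file
csv = io.StringIO(
    "2019-09-26T02:57:53,2019-09-26,02:57:53,2249.48\n"
    "2019-09-26T03:07:55,2019-09-26,03:07:55,2221.33\n"
    "2019-09-26T03:08:51,2019-09-26,03:08:51,2119.35\n")
column_names = ['datestamp', 'date', 'time', 'weight']
series = pd.read_csv(csv, names=column_names,
                     parse_dates=['datestamp'], index_col='datestamp')
# series['weight'].plot() would now label the x-axis with real times,
# because the index is a DatetimeIndex rather than a row count
```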
Simple plot of raw, time-series data
Let’s try with the raw data (output/datr20190926.csv):

Filter and remove out-of-range samples
Filtering out values that I _know_ are bad will help and is easy. See https://stackoverflow.com/questions/29594841/how-to-filter-out-values-by-values-in-pandas-dataframe.
from pandas import read_csv
from matplotlib import pyplot
column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names)
series = series[(series['weight'] > -2000) & (series['weight'] < 10000 )]
series.plot()
pyplot.show()
Which filters out weights less than -2000 g and more than 10000 g. This is better because I can see the overall shape of the values and their positions in the overall body of samples (across the whole day). I can roughly see where the cups are being taken.

But there’s still a lot of noise that doesn’t hit those thresholds, and importantly, this approach simply throws away the samples that are outside the bounds. So that means there is a time gap at those points, and if enough of them happen (I can only see two here), then the time-series is discontinuous.
So I think the approach is not to filter out and discard bad values, but to smooth them out instead.
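A sketch of what I mean, on toy data: where() blanks the out-of-range samples to NaN instead of dropping their rows, and interpolate() bridges the gaps, so the series stays continuous:

```python
import pandas as pd

weights = pd.Series([2250.0, 2249.5, -4000.0, 2251.0, 12000.0, 2250.5])
# where() keeps in-range values and blanks the rest to NaN,
# so no rows (and no time steps) are thrown away
masked = weights.where((weights > -2000) & (weights < 10000))
# interpolate() then bridges the NaN gaps linearly
continuous = masked.interpolate()
```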
Use a rolling() sample window to remove noise
There is a rolling_mean() function in pandas: https://www.programcreek.com/python/example/101378/pandas.rolling_mean (though it looks like it has since been deprecated).
https://stackoverflow.com/questions/43437657/rolling-mean-on-pandas-on-a-specific-column describes using pandas.rolling().mean() instead, which is closer. I assume there’s a .median() function too.
from pandas import read_csv
from matplotlib import pyplot
column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names)
series['rolling'] = series['weight'].rolling(8).mean()
series.plot()
pyplot.show()

That’s a bit more like it. It proves that mean() isn’t the right one (mean averages are very affected by outliers). So now I need to figure out how to plot only rolling rather than rolling and the raw weight. I think this is a plotting option.
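A toy spike makes the mean-versus-median difference obvious:

```python
import pandas as pd

# One wild outlier in an otherwise steady signal
spiky = pd.Series([10.0, 10.0, 1000.0, 10.0, 10.0])
means = spiky.rolling(3).mean()      # dragged way off by the spike
medians = spiky.rolling(3).median()  # barely notices it
```

The rolling mean gets pulled to 340 for every window that contains the spike; the rolling median stays at 10 throughout.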
Using rolling().median()
from pandas import read_csv
from matplotlib import pyplot
column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names)
series['rolling'] = series['weight'].rolling(8).median()
series.plot(x='time', y='rolling')
# Rotate x ticks and tight_layout fits it all on the page
pyplot.xticks(rotation='vertical')
pyplot.tight_layout()
pyplot.show()

That’s more like it. It uses .median() instead of .mean() to deal better with outliers, and I also figured out series.plot(x='time', y='rolling') to specify which columns to use for the axes, rotated the time ticks so they don’t overlap, and tight_layout()’d it so they didn’t fall off the bottom of the page.
This calculates the median value in a rolling window of 8 samples, so that’s about sixteen seconds. I’m interested in filtering out some more of the noise.
See the effects of different sized rolling windows
series['rolling4'] = series['weight'].rolling(4).median()
series['rolling8'] = series['weight'].rolling(8).median()
series['rolling12'] = series['weight'].rolling(12).median()
series['rolling24'] = series['weight'].rolling(24).median()
series.plot(x='time', y=['rolling4', 'rolling8', 'rolling12', 'rolling24'])

What was that weird blip at

This tells a story, and at first I thought it filtered out too much. It doesn’t show the sequence of events of lifting the coffee pot out, then returning the pot a bit lighter. Because the pot is out for such a short period of time, that signature just disappears. I had thought that this little signature would be important for recognising a cup of coffee being taken.
In fact, I think the heavily-filtered plot, with the simple chunky downward steps tells the story more clearly. If I could spot a drop of around a cup, then that’s simple.
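If the goal is spotting “a drop of around a cup”, diff() on the heavily-filtered series might be all that’s needed. The cup weight range here (150–250 g) is my guess, not a measured figure:

```python
import pandas as pd

# A heavily-filtered weight series: two cups taken from the pot
filtered = pd.Series([2250.0, 2250.0, 2050.0, 2050.0, 1850.0, 1850.0])
steps = filtered.diff()
# Treat any step down of roughly a cup's weight as one cup poured.
# 150-250 g is a guess at a cup's weight, not a measured figure.
cups = steps[(steps < -150) & (steps > -250)]
```

Here `cups` picks out exactly the two downward steps, which is the “simple chunky downward steps” story in numbers.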
I’ve just realised that the pyplot.show()
Adding more ticks to the x-axis
It doesn’t show very good ticks on the x-axis though – there aren’t enough. I’ll need to add some more.
from pandas import read_csv
from matplotlib import pyplot
column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names)
small_window = ('rolling4', 4)
large_window = ('rolling36', 36)
series[small_window[0]] = series['weight'].rolling(small_window[1]).median()
series[large_window[0]] = series['weight'].rolling(large_window[1]).median()
series.plot(x='time', y=[small_window[0], large_window[0]])
count = series['time'].count()
no_of_ticks = 24
size_of_segment = int(count / no_of_ticks)
# Two lists: one of indices regularly spread out across the series,
# and one of the time labels found at those indices
indices = list()
tick_labels = list()
for i in range(0, no_of_ticks):
    position = (size_of_segment * i) + (size_of_segment // 2)
    indices.append(position)
    tick_labels.append(series['time'][position])
pyplot.grid(True, which='major')
pyplot.xticks(labels=tick_labels, ticks=indices, rotation='vertical')
pyplot.tight_layout()
pyplot.show()

That seems like a clunky way to do it. There must be a better one! That’s enough for today. Six hours work. Bit of Overwatch before bed.
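For next time: matplotlib has a MaxNLocator that picks up to n nicely-spaced ticks by itself, which might replace all the manual index arithmetic above. A rough sketch on dummy data, not tested against the real log:

```python
from matplotlib import pyplot
from matplotlib.ticker import MaxNLocator

# Plot a dummy falling series; ask matplotlib for at most ~24 x ticks,
# instead of computing tick positions and labels by hand
fig, ax = pyplot.subplots()
ax.plot(range(1000), [2250 - i for i in range(1000)])
ax.xaxis.set_major_locator(MaxNLocator(nbins=24))
fig.canvas.draw()
```

And if the datestamp column were parsed into a DatetimeIndex, matplotlib’s date locators would presumably take over the tick placement entirely.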