After PUBG last night, I spent a few hours trying to understand pandas and matplotlib, looking at https://machinelearningmastery.com/time-series-data-visualization-with-python/ and Julia Evans’ https://github.com/jvns/pandas-cookbook.
The machine logs to two files:
- datr<date>.csv is a regular log of raw measurements from the scale. The logging frequency is set by a variable in coffee_boss.ino. These values are unfiltered, and contain all of the weird noise that this circuit collects. They are, however, a true time series. The R in datr stands for Regular, but Raw works too. These files end up big (~3MB).
- datc<date>.csv is a log of changes greater than a particular threshold, which is designed to be just about the smallest thing that can happen with the machine. That threshold is set by a variable in coffee_boss.ino. It’s currently 30, so this file only logs changes greater than 30 grams. This stream of measurements is also filtered: it is a running median of the last 8 raw measurements (using the RunningMedian library), which filters out almost all of the noise. It is not a true time series, since the logging frequency is not constant. The C in datc stands for Change. I wish I’d thought of better names. These files are only a couple of KB in size.
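The running-median idea is the same in any language; here’s a minimal Python sketch of the filter (window size 8 matches the firmware, though the sample values here are invented for illustration):

```python
from collections import deque
from statistics import median

def running_median(samples, window=8):
    """Yield the median of the last `window` samples for each incoming sample."""
    buffer = deque(maxlen=window)  # oldest sample falls off automatically
    for s in samples:
        buffer.append(s)
        yield median(buffer)

# A wild spike in otherwise steady readings never makes it through the filter
readings = [2250, 2251, 2249, 9999, 2250, 2248, 2251, 2250]
print(list(running_median(readings, window=4)))
```

The spike influences the buffer for a few samples but never becomes the median, which is exactly why median beats mean for this kind of sensor noise.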
Both log files have the same format. CSVs with four columns:
- datetime (%Y-%m-%dT%H:%M:%S)
- date (%Y-%m-%d)
- time (%H:%M:%S)
- weight (float with 2 decimal places)
You can find some examples of these files in https://github.com/euphy/coffee_boss/tree/master/output. They look like this:

```
2019-09-26T02:57:53,2019-09-26,02:57:53,2249.48
2019-09-26T03:07:55,2019-09-26,03:07:55,2221.33
2019-09-26T03:08:51,2019-09-26,03:08:51,2119.35
2019-09-26T03:08:51,2019-09-26,03:08:51,1886.68
2019-09-26T03:08:51,2019-09-26,03:08:51,1895.43
```
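Since the first column is already an ISO datetime, pandas can parse it into a real datetime type at read time. A sketch using a few of the sample rows inline (the column names are the ones I use in the snippets below):

```python
import io
import pandas as pd

# A few sample rows in the log format described above, inlined for a
# self-contained example; normally this would be a path to a datc/datr file
raw = io.StringIO(
    "2019-09-26T02:57:53,2019-09-26,02:57:53,2249.48\n"
    "2019-09-26T03:07:55,2019-09-26,03:07:55,2221.33\n"
    "2019-09-26T03:08:51,2019-09-26,03:08:51,2119.35\n"
)
series = pd.read_csv(raw, names=['datestamp', 'date', 'time', 'weight'],
                     parse_dates=['datestamp'])
print(series['datestamp'].dtype)  # datetime64, not plain strings
```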
I want to see this as a line graph with time along the x-axis.
The datC files are already filtered, but they are not a time series, so I have to either:
- use the datR files and figure out how to filter the noise out of them. This seems like pandas work. I don’t know how to use pandas.
- use the datC files and figure out how to present them as a time series – interpolation of missing points perhaps, or some other built-in way to do this in matplotlib. I don’t know how to use matplotlib.
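For the second route, pandas can do the interpolation itself: put the samples on a DatetimeIndex, resample() to a fixed frequency, and interpolate the gaps. A sketch with the timestamps invented for illustration:

```python
import pandas as pd

# Irregular change-log style samples (timestamps made up for this example)
df = pd.DataFrame({
    'datestamp': pd.to_datetime(['2019-09-26T03:00:00',
                                 '2019-09-26T03:07:00',
                                 '2019-09-26T03:08:00']),
    'weight': [2249.48, 2221.33, 2119.35],
}).set_index('datestamp')

# Resample to one row per minute; empty minutes become NaN,
# then linear interpolation fills them in
regular = df.resample('1min').mean().interpolate(method='linear')
print(len(regular))  # one row per minute from 03:00 to 03:08 inclusive
```

The result is a true regular time series, at the cost of the interpolated minutes being guesses rather than measurements.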
Simple plot with pandas and matplotlib
I’ve started with the datC approach:
```python
from pandas import read_csv
from matplotlib import pyplot

column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datc20190926.csv', names=column_names)
series.plot()
pyplot.show()
```
Which renders a nice graph:
This shows what I expect, in a way. Two pots of coffee made, with about six cups being taken from each one.
There’s something wrong though. The first third is all a bit messy. Not sure what’s happening there, so let’s look at the time… and hm, there isn’t even a time axis. The x-axis is a count of samples, not a time series. I can’t tell if that first disorganised section is an hour or nine hours, or when the first pot of coffee was made.
Simple plot of raw, time-series data
Let’s try with the raw data (output/datr20190926.csv):
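This is presumably the same read_csv/plot pattern as before, just pointed at the datr file. As a self-contained sketch, here’s synthetic data standing in for the raw log, including the kind of out-of-range spikes the next section filters:

```python
import random
from pandas import DataFrame
from matplotlib import pyplot

random.seed(1)
# Synthetic stand-in for the raw log: steady readings plus occasional wild spikes
weights = [2250 + random.uniform(-5, 5) for _ in range(500)]
for i in range(0, 500, 50):
    weights[i] = random.choice([-9000, 15000])  # sensor-noise outliers

series = DataFrame({'weight': weights})
series.plot()  # the spikes dwarf the real signal, as in the real raw plot
pyplot.tight_layout()
pyplot.show()
```

On the real file this would be read_csv('../output/datr20190926.csv', names=column_names) as in the earlier snippets.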
Filter and remove out-of-range samples
Filtering out values that I _know_ are bad is easy and will help. See https://stackoverflow.com/questions/29594841/how-to-filter-out-values-by-values-in-pandas-dataframe.
```python
from pandas import read_csv
from matplotlib import pyplot

column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names)
series = series[(series['weight'] > -2000) & (series['weight'] < 10000)]
series.plot()
pyplot.show()
```
Which filters out weights less than -2000g and more than 10000g. This is better because I can see the overall shape of the values, and where they sit in the whole day’s worth of samples. I can see where the cups are being taken.
But there’s still a lot of noise that doesn’t hit those thresholds, and importantly, this approach simply throws away the samples that are outside the bounds. That means there is a time gap at those points, and if enough of them happen (I can only see two here), the time series becomes discontinuous.

So I think the approach is not to filter out and discard bad values, it’s to smooth them out instead.
Use a rolling() sample window to remove noise
There is a rolling_mean() function in older versions of pandas: https://www.programcreek.com/python/example/101378/pandas.rolling_mean.
https://stackoverflow.com/questions/43437657/rolling-mean-on-pandas-on-a-specific-column describes using pandas .rolling().mean() instead, which is closer. I assume there’s a .median() function too.
```python
from pandas import read_csv
from matplotlib import pyplot

column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names)
series['rolling'] = series['weight'].rolling(8).mean()
series.plot()
pyplot.show()
```
That’s a bit more like it. It also proves that mean() isn’t the right one – mean averages are strongly affected by outliers. So now I need to figure out how to plot only rolling, rather than both rolling and the raw weight. I think this is a job for plot()’s arguments.
```python
from pandas import read_csv
from matplotlib import pyplot

column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names)
series['rolling'] = series['weight'].rolling(8).median()
series.plot(x='time', y='rolling')

# Rotate the x ticks, and tight_layout fits it all on the page
pyplot.xticks(rotation='vertical')
pyplot.tight_layout()
pyplot.show()
```
That’s more like it. It uses .median() instead of .mean() to deal better with outliers, and I also figured out series.plot(x='time', y='rolling') to specify which columns to plot, rotated the time ticks so they don’t overlap, and tight_layout()’d it so they didn’t fall off the bottom of the page.
This calculates the median value over a rolling window of 8 samples, so that’s about sixteen seconds. I’m interested in filtering out some more of the noise with bigger windows.
See the effects of different sized rolling windows
```python
series['rolling4'] = series['weight'].rolling(4).median()
series['rolling8'] = series['weight'].rolling(8).median()
series['rolling12'] = series['weight'].rolling(12).median()
series['rolling24'] = series['weight'].rolling(24).median()
series.plot(x='time', y=['rolling4', 'rolling8', 'rolling12', 'rolling24'])
```
What was that weird blip?
This tells a story, though at first I thought it filtered out too much. It doesn’t show the sequence of events of lifting the coffee pot out and then returning it a bit lighter: because the pot is only out for a short time, that signature just disappears. I had thought this little signature would be important for recognising a cup of coffee being taken.
In fact, I think the heavily-filtered plot, with its simple chunky downward steps, tells the story more clearly. If I can spot a drop of around a cup’s weight, then that’s simple.
I’ve just realised that the
Adding more ticks to the x-axis
It doesn’t show very good ticks on the x-axis though – there aren’t enough. I’ll need to add some more.
```python
from pandas import read_csv
from matplotlib import pyplot

column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names)

small_window = ('rolling4', 4)
large_window = ('rolling36', 36)
series[small_window[0]] = series['weight'].rolling(small_window[1]).median()
series[large_window[0]] = series['weight'].rolling(large_window[1]).median()
series.plot(x='time', y=[small_window[0], large_window[0]])

# Build two lists: indices regularly spread out across the series,
# and the time labels found at those indices
count = series['time'].count()
no_of_ticks = 24
size_of_segment = count // no_of_ticks
indices = list()
tick_labels = list()
for i in range(0, no_of_ticks):
    # centre each tick in its segment (must be an int to index the series)
    position = (size_of_segment * i) + size_of_segment // 2
    indices.append(position)
    tick_labels.append(series['time'][position])

pyplot.grid(True, which='major')
pyplot.xticks(ticks=indices, labels=tick_labels, rotation='vertical')
pyplot.tight_layout()
pyplot.show()
```
That seems like a clunky way to do it. There must be a better one! That’s enough for today. Six hours work. Bit of Overwatch before bed.
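One possibly-cleaner route (a sketch, untested against the real files) is to make the datestamp column a real DatetimeIndex, so pandas and matplotlib choose sensible time ticks by themselves. Synthetic data stands in for the raw log here:

```python
import pandas as pd
from matplotlib import pyplot

# Synthetic stand-in for the raw log: one sample every 2 seconds for 20 minutes
index = pd.date_range('2019-09-26 03:00:00', periods=600, freq='2s')
series = pd.DataFrame({'weight': 2250.0}, index=index)
series['rolling36'] = series['weight'].rolling(36).median()

# With a DatetimeIndex, plot() labels the x-axis with real times automatically
series.plot(y='rolling36')
pyplot.tight_layout()
pyplot.show()
```

On the real files this would mean read_csv(..., parse_dates=['datestamp'], index_col='datestamp') instead of plotting against the time strings.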