I saw that I could use my ax (the axes object) to simply plot the positions of the indicators in two dimensions. I got this:
Which is pretty much exactly what I want right now. I did some fairly dirty mucking around with the data to get it to do this, essentially looking for where the row-to-row weight difference crosses a threshold from low to high.
# median filter with a rolling window: low pass filter
df['rolling4'] = df['weight'].rolling(4).median()
# normalise by looking for difference over 8 samples
df['diff'] = df['rolling4'].diff(periods=-8)
# Tag with True where the change is over 300g
threshold = 300.0
df['thresholded'] = (df['diff'] > threshold)
# Produce 'highlight' boolean where the threshold is True, AND
# the threshold for the previous row was False. This feels pretty clunky.
df['highlight'] = (df['thresholded'] == True) & (df['thresholded'].shift(1) == False)
# Now create a new dataframe with just the highlights in, and only the interesting columns
highlights = df[df['highlight']][['datetime', 'rolling4']]
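Putting those points onto the plot is then roughly this – a sketch of the idea rather than the exact code I ran (it assumes df still has its datetime column):
from matplotlib import pyplot

# draw the filtered weight line, then mark each detected drop on the same axes
fig, ax = pyplot.subplots()
ax.plot(df['datetime'], df['rolling4'])
ax.plot(highlights['datetime'], highlights['rolling4'], 'o')
pyplot.show()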
I’ve been trying to get a horizontal line on the plot to show the threshold. It never worked: it gave me a mean-spirited error message that I couldn’t understand, and I spent the last few days trying. This is the one I kept getting:
Traceback (most recent call last):
File "C:/Users/sandy_000/PycharmProjects/coffee_boss/viz/viz.py", line 65, in <module>
df.plot(y=['diff'], ax=ax3)
File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\pandas\plotting\_core.py", line 794, in __call__
return plot_backend.plot(data, kind=kind, **kwargs)
File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\pandas\plotting\_matplotlib\__init__.py", line 62, in plot
plot_obj.generate()
File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\pandas\plotting\_matplotlib\core.py", line 284, in generate
self._adorn_subplots()
File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\pandas\plotting\_matplotlib\core.py", line 472, in _adorn_subplots
sharey=self.sharey,
File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\pandas\plotting\_matplotlib\tools.py", line 316, in _handle_shared_axes
_remove_labels_from_axis(ax.xaxis)
File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\pandas\plotting\_matplotlib\tools.py", line 281, in _remove_labels_from_axis
for t in axis.get_majorticklabels():
File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\matplotlib\axis.py", line 1252, in get_majorticklabels
ticks = self.get_major_ticks()
File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\matplotlib\axis.py", line 1407, in get_major_ticks
numticks = len(self.get_majorticklocs())
File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\matplotlib\axis.py", line 1324, in get_majorticklocs
return self.major.locator()
File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\matplotlib\dates.py", line 1431, in __call__
self.refresh()
File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\matplotlib\dates.py", line 1451, in refresh
dmin, dmax = self.viewlim_to_dt()
File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\matplotlib\dates.py", line 1202, in viewlim_to_dt
.format(vmin))
ValueError: view limit minimum 0.0 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-datetime value to an axis that has datetime units
Yes. Same, but the hline happens after the plot. OK, I can make the intuitive leap for why this works and not be cross about it, but I wish I’d tried this a week ago.
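For the record, the ordering that works looks roughly like this – a sketch, since the real viz.py has a few stacked subplots (threshold is the 300g value from above):
from matplotlib import pyplot

fig, ax3 = pyplot.subplots()
df.plot(y=['diff'], ax=ax3)  # plot the data first, so the axis picks up its datetime limits
ax3.axhline(y=threshold, color='red')  # then draw the horizontal threshold line
pyplot.show()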
Finding an algorithm or treatment (I don’t know what the right word is… an analysis?) that will isolate significant changes to the weight of the machine, by:
- applying a high-cut filter using a median filter in a rolling window,
- then producing a diff to create something normalised,
- then thresholding that to produce some binary output
Matplotlib and pandas have a couple of fundamental principles that I’m not getting. There seems to be an odd mix of global and specific commands that go into expelling a graph and I’m not seeing the link.
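As far as I can tell, the split is between "global" pyplot calls, which act on whatever figure is current, and "specific" calls made on an explicit Axes object. A minimal sketch of the two styles, using the same df as above:
from matplotlib import pyplot

# global style: pyplot calls apply to whatever axes happens to be current
df['weight'].plot()
pyplot.title('global pyplot calls')
pyplot.show()

# specific style: calls apply only to the Axes object you name
fig, ax = pyplot.subplots()
df['weight'].plot(ax=ax)
ax.set_title('explicit Axes calls')
pyplot.show()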
Naturally this is causing me to bump into some awkward questions, the main one being “what am I actually trying to do?”. I thought this was simple, but it’s not quite. I sketched the following manually to capture what I’d like:
This chart shows the features I think I need to gather:
1. Weight of each cup of coffee. The height of the grey boxes shows this. This can be recognised by seeing a rapid drop in weight where the size of the drop is greater than can be explained by evaporation. I want to know this so that I can see the variance between the biggest and the smallest cups. Everyone pays the same.
2. Freshness of the pot of coffee (time since last pot). The first vertical line shows the start of a new pot. I can intuitively recognise this point as being where there is a sudden increase in weight of about 2kg. This is obvious in some cases (like the end of the figure below) where the weight is low and rapidly increases. It is less obvious in the refill from the beginning of the figure below, where the weight beforehand was high too, so there isn’t that clear jump from very low to very high. I assume in this case there was already a spare pot of water on top of the machine waiting to be used, so the weight drop (which is visible) only lasts the time between picking the pot up and pouring it into the machine. (There’s a rough sketch of detecting 1 and 2 after this list.)
3. Number of cups in each pot. This is a simple count of the number of events recognised in 1. I can’t see any way to determine if the last small drop before a refill is a cupful (ie someone’s taking it) or if it’s just waste. A combination of age of coffee and size of cup may form a heuristic for that, but I don’t know how to gather the data from the scales alone. I might add a button on the touchscreen for “discarded waste/reset pot”.

The width (or length) of the grey boxes is interesting (indicating the time between cups), but I’ve got no direct need for that data yet.
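A rough sketch of how the first two might fall out of the same diff-and-threshold trick as above (the 1500g figure is a placeholder for “much more than a cup, probably a pot”):
# diff(periods=-8) is (current value minus the value 8 samples later), so a positive
# number means the weight drops over the next 8 samples, and a negative number means it rises
change = df['rolling4'].diff(periods=-8)
cups = df[change > 300.0]       # drops of more than ~a cup: someone took a coffee
refills = df[change < -1500.0]  # big rises: a fresh pot going onto the machine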
# Make it time-series
Right now, the data is arranged in time sequence and has a fixed sampling frequency, so it is a complete time-series. However, pandas doesn’t know that yet: the labels for time are just strings. I’ll make it into a true time-series because pandas has a bunch of specific tools for working with time-series data (including resampling, and letting me specify the size of windows in seconds rather than samples), AND I want to be able to combine multiple days into one stream of data.
Remember too that the first tutorials assumed I was converting to datetimes during import, using parse_dates=True in the read_csv(...). That never worked for me, and I got errors I didn’t understand. Use df.info() to check whether the conversion has worked properly; it now looks like:
from pandas import read_csv
import pandas as pd

column_names = ['datestamp', 'date', 'time', 'weight']
df = read_csv('../output/datr20190923.csv', names=column_names, parse_dates=True, infer_datetime_format=True)
# parse_dates on its own never worked for me, so convert the datestamp column explicitly
df['datetime'] = pd.to_datetime(df['datestamp'])
df.index = df['datetime']
del df['datestamp']
del df['time']
del df['date']
print(df.info())
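Once that’s done, the time-series tools start to be available – a quick sketch of the kind of thing I mean (the '16s' window and the one-minute resample are just illustrations):
# with a DatetimeIndex, rolling windows can be sized in time rather than samples
df['rolling16s'] = df['weight'].rolling('16s').median()

# resampling also becomes possible, e.g. one median weight per minute
per_minute = df['weight'].resample('1min').median()
print(per_minute.head())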
I don’t know how to do this bit. Not that I don’t know technically – I mean I have no awareness of the nature of the tools and practices for looking for events in a data stream, categorising them, and presenting them.
My opening gambit is:
- Look through each weight sample, comparing it to the last (or the last few).
- If the current value is higher or lower (over a certain threshold) than it was, then:
  - record this as a significant event by putting it into another list with the same timestamp (events)
- Combine the events stream with the main data frame
- Present the raw weights data in a graph, and:
  - show the events overlaid
I can iterate through each row just using iterators and python loops, but that feels like a pandas anti-pattern. From reading around (how do I even describe this problem for Google?), it seems like it’s best to do things in pandas en masse rather than by examining each record individually. I think that’s what pandas is for.
That’s a bit like what I’m looking for. pct_change (percent change: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.pct_change.html?highlight=pct_change#pandas.Series.pct_change) will show the scale of changes. It’ll hover around 0, but you can see where the big jumps are – where the weight jumps, the percent change is big too.
A negative percent change means the coffee machine is lighter (ie the pot is lifted or a cup is taken). A positive percent change means the machine got heavier (ie pot replaced or water refilled).
I can look through those percent changes and spot ones bigger than [a certain value], and mark those cases on the plot or save them out somehow for further analysis.
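A minimal version of that idea, assuming the raw data is loaded into a dataframe df with a 'weight' column as in the other snippets (the 20% figure is just a placeholder threshold):
df['pct_change'] = df['weight'].pct_change()

# keep only the rows where the weight moved by more than 20% in either direction
events = df[df['pct_change'].abs() > 0.2]
print(events[['weight', 'pct_change']])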
datr<date>.csv which is a regular log of raw measurements from the scale. The frequency of this logging is set using the regularLogInterval variable in coffee_boss.ino. These values are unfiltered, and contain all of the weird noise that this circuit collects. They are, however, a true time-series. The R in datr stands for Regular. But Raw works too. These files end up big (~3Mb).
datc<date>.csv which is a log of changes greater than a particular threshold, which is designed to be just about the smallest thing that can happen with the machine. That threshold is specified in the changeThreshold variable in coffee_boss.ino. It’s currently 30, so this file will only log changes greater than 30 grams. This stream of measurements is also filtered, being a running median of the last 8 raw measurements. This filters out almost all of the noise. It uses the RunningMedian library for this. This is not a true time series, since the logging frequency is not constant. The C in datc stands for Change. I wish I’d thought of better names. These files are only a couple of Kb in size.
Both log files have the same format: CSVs with four columns – datestamp, date, time and weight.
I want to see this as a line graph with time along the X axis stretching from left to right, and weight on the Y-axis, bottom to top. Doing this with the datR files would be easiest, because they are naturally already time-series data, but they give a pretty awful output because they are so noisy. I’d have to do some filtering on them in Python. That’s not such a bad idea.
The datC files are already filtered, but they are not a time series, so I have to either:
- use the datR files and figure out how to filter the noise out of them (this seems like pandas work, and I don’t know how to use pandas), or
- use the datC files and figure out how to present them as a time series – interpolation of missing points perhaps, or some other built-in way to do this in matplotlib (I don’t know how to use matplotlib). There’s a rough sketch of this just below.
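That second option would presumably look something like this – a sketch I didn’t take any further (the 2-second grid is a guess at the raw sampling rate):
from pandas import read_csv
import pandas as pd
from matplotlib import pyplot

column_names = ['datestamp', 'date', 'time', 'weight']
df = read_csv('../output/datc20190926.csv', names=column_names)
df.index = pd.to_datetime(df['datestamp'])

# resample onto a regular 2-second grid and interpolate across the gaps
regular = df['weight'].resample('2s').mean().interpolate()
regular.plot()
pyplot.show()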
# Simple plot with pandas and matplotlib
I’ve started with the datC approach:
from pandas import read_csv
from matplotlib import pyplot
column_names = ['datestamp', 'date','time','weight']
series = read_csv('../output/datc20190926.csv', names=column_names)
series.plot()
pyplot.show()
Which renders a nice graph:
This shows what I expect, in a way. Two pots of coffee made, with about six cups being taken from each one.
There’s something wrong though. The first third is all a bit messy. Not sure what’s happening there, so we’ll look at the time… and, hm, there isn’t even a time shown. The x-axis is a count of samples, not a time series. I can’t tell if that first disorganised section is an hour or nine hours. I can’t tell if the first pot of coffee was drunk in an hour or in twelve. Given that this data covers a full day (from midnight to midnight), it looks like one pot of coffee lasted all day, and there was another one made late at night (7pm maybe?).
# Simple plot of raw, time-series data
Let’s try with the raw data (output/datr20190926.csv):
OK, that’s no better. There are a few bad samples in there that have obscured the good ones. Can I filter them out? It still hasn’t got the right X-axis either, measuring samples rather than time.
# Filter and remove out-of-range samples
Filtering out values that I _know_ are bad will help and is easy. See https://stackoverflow.com/questions/29594841/how-to-filter-out-values-by-values-in-pandas-dataframe.
from pandas import read_csv
from matplotlib import pyplot
column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names)
series = series[(series['weight'] > -2000) & (series['weight'] < 10000 )]
series.plot()
pyplot.show()
Which filters out weights less than -2000g or more than 10000g. This is better because I can see the overall shape of the values and where they sit within the whole day of samples. I can see that the cups being taken from each pot are not regularly spaced.
But there’s still a lot of noise that doesn’t hit those thresholds, and importantly, this approach simply throws away the samples that are outside the bounds. So that means there is a time gap at those points, and if enough of them happen (I can only see two here), then the time-series is discontinuous.
So I think the approach is not to filter out and discard bad values; it is to use the source log data to produce an entirely new stream of weight values using something like a moving window of averaging. That’s how the firmware does it, and that gives a decent result.
# Use a rolling() sample window to remove noise
There is a rolling_mean() function in pandas: https://www.programcreek.com/python/example/101378/pandas.rolling_mean. Oh it’s deprecated. And I don’t want a mean anyway, means are rubbish, I want the median. https://stackoverflow.com/questions/43437657/rolling-mean-on-pandas-on-a-specific-column describes using pandas.rolling().mean() instead, which is closer. I assume there’s a .median() function too.
from pandas import read_csv
from matplotlib import pyplot
column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names)
series['rolling'] = series['weight'].rolling(8).mean()
series.plot()
pyplot.show()
from pandas import read_csv
from matplotlib import pyplot
column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names)
series['rolling'] = series['weight'].rolling(8).median()
series.plot(x='time', y='rolling')
# Rotate x ticks and tight_layout fits it all on the page
pyplot.xticks(rotation='vertical')
pyplot.tight_layout()
pyplot.show()
That’s more like it. It uses .median() instead of .mean() to deal better with outliers, and I also figured out series.plot(x='time', y='rolling') to specify which columns go on which axis, rotated the time ticks so they don’t overlap, and tight_layout()’d it so they didn’t fall off the bottom of the page.
This calculates the median value in a rolling window of 8 samples, so that’s about sixteen seconds. I’m interested in filtering out some more of the jaggies, and would like to see the results of a few different window sizes plotted together. Doing it like the snippet below was a guess, and it worked. It makes me start to think I’m getting a bit of a clue about how to use this toolset. Four hours in.
# See the effects of different sized rolling windows
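Something like this – I can’t remember the exact set of window sizes I compared, but 24 was one of them:
# compare a few rolling-median window sizes on the same plot
for window in (4, 8, 24):
    series['rolling{}'.format(window)] = series['weight'].rolling(window).median()
series.plot(x='time', y=['rolling4', 'rolling8', 'rolling24'])
pyplot.xticks(rotation='vertical')
pyplot.tight_layout()
pyplot.show()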
What was that weird blip at 3am? So I know that 24 samples covers 48 seconds of activity and filters out almost all variance. It shows the gulp-gulp-gulp of the coffee going down cup by cup (8:45 to 11:41), which is really cool and really clear. But I realised that’s not quite what I thought I was looking for.
This tells a story, and at first I thought it filtered out too much. It doesn’t show the sequence of events of lifting the coffee pot out, then returning it a bit lighter – because the pot is only out for a short period of time, that whole sequence just disappears. I thought that little signature would be important for recognising a cup of coffee being taken.
In fact, I think the heavily-filtered plot, with its simple chunky downward steps, tells the story more clearly. If I can spot a drop of around a cup’s worth, then that’s simple.
I’ve just realised that the matplotlib viewer that pops up when I do pyplot.show() is really good. It can zoom in on a section! That’s exactly what I wanted Excel to do for me. Excel really is the wrong tool for this job.
# Adding more ticks to the x-axis
It does not show me very good ticks on the x-axis though, there aren’t enough. I’ll need to add some more.
from pandas import read_csv
from matplotlib import pyplot
column_names = ['datestamp', 'date', 'time', 'weight']
series = read_csv('../output/datr20190926.csv', names=column_names)
small_window = ('rolling4', 4)
large_window = ('rolling36', 36)
series[small_window[0]] = series['weight'].rolling(small_window[1]).median()
series[large_window[0]] = series['weight'].rolling(large_window[1]).median()
series.plot(x='time', y=[small_window[0], large_window[0]])
count = series['time'].count()
no_of_ticks = 24
size_of_segment = int(count / no_of_ticks)
# two lists: one of indices regularly spread out across the series,
# the other of the time labels found at those indices
indices = list()
tick_labels = list()
for i in range(0, no_of_ticks):
    position = int((size_of_segment * i) + 42/2)
    indices.append(position)
    tick_labels.append(series['time'][position])
pyplot.grid(True, which='major')
pyplot.xticks(labels=tick_labels, ticks=indices, rotation='vertical')
pyplot.tight_layout()
pyplot.show()
That seems like a clunky way to do it. There must be a better one! That’s enough for today. Six hours work. Bit of Overwatch before bed.
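(I suspect the better one is just the datetime-index conversion from the “Make it time-series” bit above – give pandas a real DatetimeIndex and let matplotlib’s date locator pick the ticks. A sketch, not something I’ve actually run here:)
import pandas as pd

# give the frame a real datetime index instead of plotting against the string 'time' column
series['datetime'] = pd.to_datetime(series['datestamp'])
series = series.set_index('datetime')
ax = series[[small_window[0], large_window[0]]].plot()  # date ticks get chosen automatically
ax.grid(True, which='major')
pyplot.tight_layout()
pyplot.show()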
So, this should be easy, right? I've used my [coffee_boss scales](https://github.com/euphy/coffee_boss) device to save a load of weight measurements into a couple of text files as csv. I want to show these values as a graph, and do some other things with them. I'm working on my home machine, which is Windows 10. I opened up Pycharm and checked a new project out of source control (https://github.com/euphy/coffee_boss.git).
I create a new directory "viz" to do some visualisation stuff in. Inside, I create viz.py, and look at https://machinelearningmastery.com/time-series-data-visualization-with-python/ which I think is a good place to start. I've never done anything visual with python before, never used pandas, barely used numpy.
# Diversion 1: Set up the environment
Oh right, let's switch to python3 - start on the right foot, eh? OK, that's in project settings, I remember that. (I haven't used Pycharm for anything for years.) Oh right, there's already loads of packages loaded in site-packages.
I want a fairly fresh Python environment so it's easier to debug problems if they happen. I'll create a virtualenv through pycharm's interface. Should I create the virtualenv somewhere near the coffee_boss project? I don't think that's a good idea because I don't want it to be checked into git. But I don't really want a random disconnected bundle of venvs either. I'm sure I already have a few of those floating around, with names that only made sense at the time.
So I'll do it anyway, create sandy_000/venv/coffee_boss. At least there's something concrete to connect it to this project.
# Back to work
Good, fresh environment. First line.
from pandas import read_csv
Ok, got the squiggles, that's good. Let's see if muscle memory serves me… what do my fingers do? Turns out alt+enter is still the combo for "solve the squiggles". Nice, I feel clever, all those years haven't abandoned me entirely. Alt+enter, then "install package pandas".
# Diversion 2: Update Pycharm
I've got an error: "AttributeError: module 'pip' has no attribute 'main'". Hm. Check that pip is the right version...
Google. This was a problem in 2018, when upgrading to v10.x of pip, maybe that's it. Ok, I'm on version 19.x, that's not it. 20 minutes of searching. Google. https://intellij-support.jetbrains.com/hc/en-us/community/posts/360000168364-Pycharm-Virenv-AttributeError-module-pip-has-no-attribute-main-occured- gives me the hint, and I notice that I'm running Pycharm v2017.
I start up the Jetbrains toolbox and hit the button to upgrade Pycharm. I decide to uninstall an ancient version of Webstorm while I'm there. I didn't even know I had that installed.
# Diversion 3: Make this into a blogpost.
I decide to write this down. I had this idea about stepping through my process of getting something from idea to openshift, and it's this kind of back-and-forth that reveals the dull complexity of trying to get anything done with programming. What a good idea. I'll just type it into an email in the first instance.
I open the empty draft and instantly remember I hate swapping between tabs in firefox. And I hate having multiple instances of firefox open, I always end up finding the wrong window and duplicating tabs in both.
I just opened notepad++ instead. That's good too - I've got up to here typed, and I'll just use markdown for code.
# So, back to work
So I've started and got this far. But hm. I'm writing this article in notepad++, and I want to paste in a picture of the squiggles. I use snipping tool to take it, but of course I can't paste it here. Man.
# Diversion 4: Updating WordPress
Obviously, why don’t I just switch to writing this on my dev blog. Yes, good plan. Go to www.euphy.co.uk, it’s a wordpress site. Figure out my password and log in and start a new post. Good, now we’re getting somewhere.
(Don’t want to spoil the tension, but you can tell this bit worked eventually.)
Copy the snipped image, … paste it. Paste it. PASTE IT. It’s not pasting. Doesn’t wordpress allow pasting into posts? I’m sure it used to. Oh, the installation is three years old. Maybe it just needs an update – I think I remember doing that on my other site recently.
So abandon that post and click “Get v5.2.3”. It says it can do it automatically. I know it says take a backup first… Well nothing has changed recently on this machine, so I’m sure the last backup is ok.
Downloading. Now unpacking the update… That’s taking a long time. Open a new tab. It’s not in maintenance mode. I wonder if I can see the logs somewhere? What version even is this one? It tells me this could be updated, but is it a minor or major bump? What should I expect?
Ok, this is v4.something. Try to start the update again. Hm, update already in progress… Gosh, I’m going to have to look at the logs. Let’s go to uk2.net, that’s my hosting company. Oh.
This WordPress installation was installed through “Softaculous”. That’s got a way to do upgrades outside of the WordPress UI. Try that. Great, the install to 5.1.2 seemed to go OK… or so it says. How did I check the version again? This is awful. Oh right, under the big W. It still says it’s at 4.9.11.
Why isn’t it upgrading? Ok, in the file manager I can see that the upgrade bundle has been downloaded and unpacked. What happens next? Can I do it manually?
Google. Found https://wordpress.org/support/topic/upgrade-stalls-at-unpacking-the-update/ which describes my issue. It points me to some plugins – “fix another update in progress” in particular. That wasn’t the error I had… but try it. Ok, manual install it is. Make a backup and remove the files, and copy across the new ones. Ok, test it – that’s trying to set up a fresh installation… I’ll copy wp-config.php and the wp-content directory over too. It works.
https://www.wordfence.com/learn/how-to-manually-upgrade-wordpress-themes-and-plugins/ showed how to update. It was pretty easy.
Now, where was I? That’s taken an hour and a half. Dave has just sent a message on steamchat asking if I’m ready for some rocket league while we’re waiting for Andy to come online. I haven’t even had my dinner yet! The dishes from lunch are still waiting in the kitchen! I had hotdogs for lunch. I’m not really hungry.
Oh right, just before I do that, let’s just see if I can paste inline images into a new post.
Oh yeah, it was all worth it.
# Back to work
Copy and paste the doc from notepad into wordpress. Oh I’ll do a clever thing where I show how the text started in notepad and do some stuff to indicate diversions.
This is getting long. Got to make dinner. I started this just before four, it’s half eight now and I’ve got one line of code written. And that doesn’t even work. I think this is a two-parter.