Monthly Archives: October 2019

Coffee Boss day 6.5: Combining a scatter with a line

So based on what I spotted in the source code of matplotlib’s _axes.py (https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/axes/_axes.py#L1469-L1493):

    def plot(self, *args, scalex=True, scaley=True, data=None, **kwargs):
        """
        Plot y versus x as lines and/or markers.

        Call signatures::

            plot([x], y, [fmt], *, data=None, **kwargs)
            plot([x], y, [fmt], [x2], y2, [fmt2], ..., **kwargs)

        The coordinates of the points or line nodes are given by *x*, *y*.

        The optional parameter *fmt* is a convenient way for defining basic
        formatting like color, marker and linestyle. It's a shortcut string
        notation described in the *Notes* section below.

        >>> plot(x, y)        # plot x and y using default line style and color
        >>> plot(x, y, 'bo')  # plot x and y using blue circle markers
        >>> plot(y)           # plot y using x as index array 0..N-1
        >>> plot(y, 'r+')     # ditto, but with red plusses

I saw that I could use my axe to simply plot the positions of the indicators on in two dimensions. I got this:

Which is pretty much perfectly what I want right now. I did some fairly dirty mucking around with the data to get it to do this, essentially looking for where the row-to-row weight difference crosses a threshold from low-to-high.

# median filter with a rolling window: low pass filter
df['rolling4'] = df['weight'].rolling(4).median()

# normalise by looking for difference over 8 samples
df['diff'] = df['rolling4'].diff(periods=-8)

# Tag with True where the change is over 300g
threshold = 300.0
df['thresholded'] = (df['diff'] > threshold)

# Produce 'highlight' boolean where the threshold is True, AND
# the threshold for the previous row was False. This feels pretty clunky.
df['highlight'] = (df['thresholded'] == True) & (df['thresholded'].shift(1) == False)

# Now create a new dataframe with just the highlights in, and only the interesting columns
highlights = df[df['highlight']][['datetime', 'rolling4']]

That’s good isn’t it?

Coffee Boss day 6: Horizontal line

I’ve been trying to get a horizontal line to show a threshold. It never worked. It gave me a mean-spirited error message that I couldn’t understand. I spent the last few days trying. I got this one:

Traceback (most recent call last):
  File "C:/Users/sandy_000/PycharmProjects/coffee_boss/viz/viz.py", line 65, in <module>
    df.plot(y=['diff'], ax=ax3)
  File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\pandas\plotting\_core.py", line 794, in __call__
    return plot_backend.plot(data, kind=kind, **kwargs)
  File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\pandas\plotting\_matplotlib\__init__.py", line 62, in plot
    plot_obj.generate()
  File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\pandas\plotting\_matplotlib\core.py", line 284, in generate
    self._adorn_subplots()
  File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\pandas\plotting\_matplotlib\core.py", line 472, in _adorn_subplots
    sharey=self.sharey,
  File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\pandas\plotting\_matplotlib\tools.py", line 316, in _handle_shared_axes
    _remove_labels_from_axis(ax.xaxis)
  File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\pandas\plotting\_matplotlib\tools.py", line 281, in _remove_labels_from_axis
    for t in axis.get_majorticklabels():
  File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\matplotlib\axis.py", line 1252, in get_majorticklabels
    ticks = self.get_major_ticks()
  File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\matplotlib\axis.py", line 1407, in get_major_ticks
    numticks = len(self.get_majorticklocs())
  File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\matplotlib\axis.py", line 1324, in get_majorticklocs
    return self.major.locator()
  File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\matplotlib\dates.py", line 1431, in __call__
    self.refresh()
  File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\matplotlib\dates.py", line 1451, in refresh
    dmin, dmax = self.viewlim_to_dt()
  File "C:\Users\sandy_000\venv\coffee_boss\lib\site-packages\matplotlib\dates.py", line 1202, in viewlim_to_dt
    .format(vmin))
ValueError: view limit minimum 0.0 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-datetime value to an axis that has datetime units

So what I did was change

ax3.axhline(threshold, linewidth=1, color='r')

df.plot(y=[small_window['name'], 'thresholded'], secondary_y=['thresholded'], ax=ax1)
df.plot(y=[small_window['name']], ax=ax2)
df.plot(y=['diff'], ax=ax3)

To

df.plot(y=[small_window['name'], 'thresholded'], secondary_y=['thresholded'], ax=ax1)
df.plot(y=[small_window['name']], ax=ax2)
df.plot(y=['diff'], ax=ax3)

ax3.axhline(threshold, linewidth=1, color='r')

Yes. Same, but the hline happens after the plot. OK, I can make the intuitive leap for why this works and not be cross about it, but I wish I’d tried this a week ago.

Coffee Boss day 5: Further adventures

I’ve been doing a bit more work which is about:

Analysis

Finding an algorithm or treatment (I don’t know what the right word is… an analysis?) that will isolate significant changes to the weight of the machine,

  1. applying a high-cut filter using a median filter in a rolling window,
  2. then producing a diff to create something normalised,
  3. then thresholding that to produce some binary output
column_names = ['datestamp', 'date', 'time', 'weight']
df = read_csv('../output/datr20190911.csv', names=column_names, parse_dates=True, infer_datetime_format=True)
df = df.append(read_csv('../output/datr20190912.csv', names=column_names, parse_dates=True, infer_datetime_format=True))
df = df.append(read_csv('../output/datr20190913.csv', names=column_names, parse_dates=True, infer_datetime_format=True))
df = df.append(read_csv('../output/datr20190914.csv', names=column_names, parse_dates=True, infer_datetime_format=True))

df['datetime'] = pd.to_datetime(df['datestamp'])
df.set_index('datetime')
del df['datestamp']
del df['time']
del df['date']

small_window = {'name': 'rolling4', 'size': 4}
df[small_window['name']] = df['weight'].rolling(small_window['size']).median()

df['diff'] = df[small_window['name']].diff(periods=8)

threshold = 600.0
df['thresholded'] = (df['diff'] > threshold) * 1

Displaying it to check the analysis

Find how to display the charts in a way that lets me see what I’m actually doing. This has evolved into what’s below.

fig1, (ax1, ax2, ax3) = plt.subplots(nrows=3, ncols=1, sharex='all')

ax1.grid(b=True, which='major', color='#666666', linestyle='-')
ax1.minorticks_on()
ax1.grid(b=True, which='minor', color='#999999', linestyle='-', alpha=0.2)

ax2.grid(b=True, which='major', color='#666666', linestyle='-')
ax2.minorticks_on()
ax2.grid(b=True, which='minor', color='#999999', linestyle='-', alpha=0.2)

ax3.grid(b=True, which='major', color='#666666', linestyle='-')
ax3.minorticks_on()
ax3.grid(b=True, which='minor', color='#999999', linestyle='-', alpha=0.2)


df.plot(x='datetime', y=[small_window['name'], 'thresholded'], secondary_y=['thresholded'], ax=ax1)
df.plot(x='datetime', y=[small_window['name']], ax=ax2, figsize=(8, 8))
df.plot(x='datetime', y=['diff'], ax=ax3)

plt.tight_layout()
plt.show()

Coffee Boss day 4: What am I actually trying to do

Matplotlib and pandas have a couple of fundamental principles that I’m not getting. There seems to be an odd mix of global and specific commands that go into expelling a graph and I’m not seeing the link.

Naturally this is causing me to bump into some awkward questions, the main one being “what am I actually trying to do?”. I thought this was simple, but it’s not quite. I sketched the following manually as capturing what I’d like:

This chart shows the features I think I need to gather:

  1. Weight of each cup of coffee. The height of the grey boxes show this. This can be recognised by seeing a rapid drop in weight where the size of the drop is greater than can be explained by evaporation. I want to know this so that I can see the variance between the biggest and the smallest cups. Everyone pays the same.
  2. Freshness of the pot of coffee (time since last pot). The first vertical line shows the start of a new pot. I can intuitively recognise this point as being where there is a sudden increase in weight of about 2kg. This is obvious in some cases (like the end of the figure below) where the weight is low and rapidly increases.
    It is less obvious in the refill from the beginning of the figure below, where the weight beforehand was high too so there isn’t that clear jump from very low to very high. I assume in this case, there was already a spare pot of water on top of the machine waiting to be used, and so the weight drop (that is visible) only lasts the time between picking the pot up and pouring it into the machine.
  3. Number of cups in each pot. This is a simple count of the number of events recognised in 1. I can’t see any way to determine if the last small drop before a refill is a cupful (ie someone’s taking it) or if it’s just waste. A combination of age of coffee and size of cup may form a heuristic for that but I don’t know how to gather the data from the scales alone. I might add a button on the touchscreen for “discarded waste/reset pot”.

The width (or length) of the grey boxes is interesting (indicating the time between cups), but I’ve got no direct need for that data yet.

Make it time-series

Right now, the data is arranged in time sequence, and has a fixed sampling frequency, so it is a complete time-series. However, pandas doesn’t know that yet, the labels for time are just strings. I’ll make it into a true time-series because Pandas has a bunch of specific tools for working with time-series data (including resampling and how I specify the size of windows in seconds rather than samples) AND I want to be able to combine multiple days into one stream of data.

Remember to that the first tutes assumed I was converting to datetimes during import using parse_dates=True in the read_csv(...). That never worked for me, and I got errors I didn’t understand. Use df.info() to check whether the conversion had worked properly, it now looks like:

column_names = ['datestamp', 'date', 'time', 'weight']
df = read_csv('../output/datr20190923.csv', names=column_names, parse_dates=True, infer_datetime_format=True)
df['datetime'] = pd.to_datetime(df['datestamp'])
df.index = df['datetime']
del df['datestamp']
del df['time']
del df['date']
print(df.info())

and gives me:

[41141 rows x 5 columns]
 
 RangeIndex: 41141 entries, 0 to 41140
 Data columns (total 5 columns):
 weight        41141 non-null float64
 datetime      41141 non-null datetime64[ns]
 rolling4      41138 non-null float64
 rolling36     41106 non-null float64
 pct_change    41105 non-null float64
 dtypes: datetime64ns, float64(4)
 memory usage: 1.6 MB

When I run it. There’s a datetime64 object in there which is good! I wonder why it didn’t work last week? Furthermore,

data = pd.Series(df['pct_change'])
print(data)

Now gives me a time-indexed series:

datetime
 2019-09-23 00:00:01         NaN
 2019-09-23 00:00:03         NaN
 2019-09-23 00:00:05         NaN
 2019-09-23 00:00:07         NaN
 2019-09-23 00:00:09         NaN
                          …   
 2019-09-23 23:59:51    0.000046
 2019-09-23 23:59:53    0.000000
 2019-09-23 23:59:55    0.000043
 2019-09-23 23:59:57   -0.000043
 2019-09-23 23:59:59    0.000000
 Name: pct_change, Length: 41141, dtype: float64

Coffee Boss day 3: Looking for events

I don’t know how to do this bit. Not I don’t know technically, I mean I have no awareness of the nature of the tools and practice to look for events in a data stream, catergorise them, and present them.

My opening gambit is:

  1. Look through each weight sample, comparing it to the last (or the last few).
  2. If the current value is higher or lower (over a certain threshold) than it was, then:
  3. Record this as a significant event by putting it into another list with the same timestamp (events)
  4. Combine the events stream with the main data frame
  5. Present the raw weights data in a graph, and:
  6. Show the events overlaid

I can iterate through each row just using iterators and python loops, but that feels like a pandas anti-pattern. From reading around (how do I even describe this problem for google?), it seems like it’s best to do things in pandas en masse rather than by examining each record individually. I think that’s what pandas does.

df['pct_change'] = df[large_window['name']].pct_change()
df.plot(x='time', y=[large_window['name'], 'pct_change'], secondary_y=['pct_change'])

That’s a bit like what I’m looking for. The pct_change (percent change: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.pct_change.html?highlight=pct_change#pandas.Series.pct_change) will detect the scale of changes. It’ll hover around 0, but you can see where the big jumps are, then the percent change is also big.

This also uses the second_y kwarg which seems barely documented and most guides suggest a different approach. Here is something: https://stackoverflow.com/questions/29685887/secondary-y-true-changes-x-axis-in-pandas.

A negative percent change means the coffee machine is lighter (ie the pot is lifted or a cup is taken). A positive percent change means the machine got heavier (ie pot replaced or water refilled).

I can look through those percent changes and spot ones bigger than [a certain value], and mark those cases on the plot or save them out somehow for further analysis.