Matplotlib and pandas have a couple of fundamental principles that I'm not getting. There seems to be an odd mix of global and specific commands that go into producing a graph, and I'm not seeing the link.
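As an aside, the "global vs specific" split I mean is matplotlib's pyplot state-machine interface versus its object-oriented interface. A minimal contrast (both draw the same line):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# "Global" style: pyplot keeps track of a current figure/axes behind the scenes
plt.plot([1, 2, 3], [2, 4, 1])
plt.title("state-machine style")

# "Specific" style: you hold explicit Figure and Axes objects and call methods on them
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 1])
ax.set_title("object-oriented style")
```

The pyplot calls are really just forwarding to the current Axes, which is the link between the two styles.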
Naturally this is causing me to bump into some awkward questions, the main one being “what am I actually trying to do?”. I thought this was simple, but it’s not quite. I sketched the following manually as capturing what I’d like:
This chart shows the features I think I need to gather:
- Weight of each cup of coffee. The height of the grey boxes shows this. This can be recognised by seeing a rapid drop in weight, where the size of the drop is greater than can be explained by evaporation. I want to know this so that I can see the variance between the biggest and the smallest cups. Everyone pays the same.
- Freshness of the pot of coffee (time since last pot). The first vertical line shows the start of a new pot. I can intuitively recognise this point as being where there is a sudden increase in weight of about 2kg. This is obvious in some cases (like the end of the figure below) where the weight is low and rapidly increases.
It is less obvious in the refill at the beginning of the figure below, where the weight beforehand was high too, so there isn't that clear jump from very low to very high. I assume that in this case there was already a spare pot of water on top of the machine waiting to be used, and so the weight drop (which is visible) only lasts the time between picking the pot up and pouring it into the machine.
- Number of cups in each pot. This is a simple count of the number of events recognised in 1. I can't see any way to determine whether the last small drop before a refill is a cupful (i.e. someone's taking it) or just waste. A combination of the age of the coffee and the size of the cup may form a heuristic for that, but I don't know how to gather the data from the scales alone. I might add a button on the touchscreen for "discarded waste/reset pot".
The width (or length) of the grey boxes is interesting (indicating the time between cups), but I’ve got no direct need for that data yet.
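A sketch of the detection logic above, on a toy weight trace. The kilogram thresholds are guesses I'd have to tune against real data, and the function name is just mine:

```python
import pandas as pd

# Hypothetical thresholds -- these would need tuning against real data
CUP_DROP_KG = -0.15    # a cup is a sudden drop bigger than evaporation can explain
REFILL_RISE_KG = 1.5   # a refill is a sudden rise of roughly 2 kg

def classify_events(weight: pd.Series) -> pd.DataFrame:
    """Label each sample-to-sample weight change as a cup, a refill, or neither."""
    step = weight.diff()  # change since previous sample
    events = pd.DataFrame({"weight": weight, "step": step})
    events["cup"] = step <= CUP_DROP_KG
    events["refill"] = step >= REFILL_RISE_KG
    return events

# Toy trace: a refill (+2 kg), a cup (-0.3 kg), slow drift, another cup
trace = pd.Series([0.5, 2.5, 2.5, 2.2, 2.19, 1.9])
events = classify_events(trace)
print("cups:", int(events["cup"].sum()))      # counting these gives cups-per-pot
print("refills:", int(events["refill"].sum()))
```

This per-sample diff is crude (a slow pour spread over several samples would be missed), so a rolling window would probably be needed in practice.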
Make it time-series
Right now, the data is arranged in time sequence and has a fixed sampling frequency, so it is a complete time-series. However, pandas doesn't know that yet: the labels for time are just strings. I'll make it into a true time-series, because pandas has a bunch of tools specifically for working with time-series data (including resampling, and letting me specify the size of windows in seconds rather than samples), AND I want to be able to combine multiple days into one stream of data.
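To illustrate what a real DatetimeIndex buys, here's a sketch on two toy "days" of data (the values and frequencies are made up, not my real readings):

```python
import pandas as pd

# Two toy "days" of weight readings, each already datetime-indexed
day1 = pd.Series([10.0, 9.7, 9.4],
                 index=pd.date_range("2019-09-23 08:00", periods=3, freq="2s"))
day2 = pd.Series([9.1, 8.8, 8.5],
                 index=pd.date_range("2019-09-24 08:00", periods=3, freq="2s"))

# Combining days is just concatenation along the time index
combined = pd.concat([day1, day2]).sort_index()

# With a DatetimeIndex, window sizes can be given in seconds, not sample counts
smoothed = combined.rolling("4s").mean()

# ...and resampling to a coarser frequency is one call
per_minute = combined.resample("1min").mean()
```

None of this works with string labels, which is why the conversion below matters.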
I use df.info() to check whether the conversion has worked properly. The code now looks like:
```python
import pandas as pd

column_names = ['datestamp', 'date', 'time', 'weight']
df = pd.read_csv('../output/datr20190923.csv', names=column_names,
                 parse_dates=True, infer_datetime_format=True)
df['datetime'] = pd.to_datetime(df['datestamp'])
df.index = df['datetime']
del df['datestamp']
del df['time']
del df['date']
print(df.info())
```
and gives me:
```
[41141 rows x 5 columns]
RangeIndex: 41141 entries, 0 to 41140
Data columns (total 5 columns):
weight        41141 non-null float64
datetime      41141 non-null datetime64[ns]
rolling4      41138 non-null float64
rolling36     41106 non-null float64
pct_change    41105 non-null float64
dtypes: datetime64[ns](1), float64(4)
memory usage: 1.6 MB
```
That's what I see when I run it, and there's a datetime64 column in there, which is good! (I wonder why it didn't work last week?) Furthermore,
```python
data = pd.Series(df['pct_change'])
print(data)
```
now gives me a time-indexed series:
```
datetime
2019-09-23 00:00:01         NaN
2019-09-23 00:00:03         NaN
2019-09-23 00:00:05         NaN
2019-09-23 00:00:07         NaN
2019-09-23 00:00:09         NaN
                         ...
2019-09-23 23:59:51    0.000046
2019-09-23 23:59:53    0.000000
2019-09-23 23:59:55    0.000043
2019-09-23 23:59:57   -0.000043
2019-09-23 23:59:59    0.000000
Name: pct_change, Length: 41141, dtype: float64
```