Statistical calculations on a set of data without storing all values

The problem:
I have a set of time readings taken at regular intervals. I dont want to store and sort all the values, but I'd like to form a sensible statistical representation of them as they come in.
The standard deviation shows they are not normally distributed.
Sample data below.
I guess this is easy but my brain is addled; I'm finding the max and min values like this:

    if (pingTime > pingMax) pingMax = pingTime;
    if (pingTime < pingMin) pingMin = pingTime;

what I'm looking for is a way to find the SECOND largest, SECOND smallest, and the median.
Presently they are floats, in milliseconds, but I'm happy to use ints.

They are ping times so the mean and standard deviation dont seem to be a sensible way to represent them.

pingstats pingTime;    23.87; jitter;  46.73 
pingstats pingTime;   105.69; jitter;  35.09 
pingstats pingTime;    23.08; jitter;  47.52 
pingstats pingTime;   104.32; jitter;  33.72 
pingstats pingTime;   116.13; jitter;  45.53 
pingstats pingTime;    34.76; jitter;  35.84 
pingstats pingTime;     4.32; jitter;  66.28 
pingstats pingTime;    26.82; jitter;  43.78 
pingstats pingTime;    96.89; jitter;  26.29 
pingstats pingTime;   118.99; jitter;  48.39 
pingMax is 216 pingMin is 4 
sumJitter;   429.15 ;pingSamples;    10.00 ;pJitter ;   42.92  
sumPing;   654.88 ;sumPing2;   62133.70 ;pingAvNew ;68.04 ;sTemp 42886.32 ;sTemp2; 1924.74 ;sigma; 43.87 ;jitter; 42.92 

dont worry about the reported computed results seeming "wrong":
here pingMax and pingMin are hourly values, and pingAv New is geometrically filtered.
the jitter is the difference between the actual ping time and pingAvNew.

The first two values are quite easy to get, the third, the "median", I don't think you can without storing the values, unless you meant "average".
Just to be clear, the main difference between average (mean) and median lies in how they are calculated and what they represent. Average is the sum of all values divided by the number of values while median is the middle value when the data is sorted.

So, if you need the average/mean value together with the first two, as "pingMin2" and "pingMax2" for the second smallest and largest, it's something like (untested, add it to your code):

    if (pingTime > pingMax) 
      pingMax = pingTime;
    else
      if (pingTime > pingMax2) pingMax2 = pingTime;
    if (pingTime < pingMin) 
      pingMin = pingTime;
    else
      if (pingTime < pingMin2) pingMax2 = pingTime;

Maybe you could determine that by counting the number of values n to get n/2 or n/2+1 then reading only that value again?

thanks Alex; yes that will do the job. I didnt mean average its too ambiguous.
The arithmetic mean I dont think is representative of this data set.
I'll try a the idea of using a geometric mean, but that may be even less meaningful.
The mode may be more useful, if I work with a set of n bins within a range nMin - nMax.
If they are all different (because of the precision) the mode may be meaningless.

In this case the old largest becomes the second largest.

    if (pingTime > pingMax) {
      pingMax2 = pingMax;
      pingMax = pingTime;
    }

Same for the minimum.

1 Like

Have a look at the interfaces of the following libs

2 Likes

Ok, I can accept that, but if we're talking about ping times, what counts are the min and max times while the average is just an additional indication but could help more than just min and max...

I.e. let's say if you have the sequence 10, 12, 16, 14, 11, 18, 11 and 20 ms, the min values are 10 and 11, the max 18 and 20, but the average isn't just the middle value (min+max)/2 but instead you have the averages of the 8 samples like:
(10+12+16+14+11+18+11+20)/8=112/8=14
making it a more realistic representation of the sequence values.

If you don't want to keep all the samples (for memory reason, I suppose) just keep adding the samples, so after 8 (or any interval you need to use) of them you'll have a variable containing the total and another with the number of samples. But you can't obviously indefinitely keep the total, so to avoid overflows you can decide to reset it after a certain number of readings, high enough to let you get a significative sample.
Another example of (untested) code, just to make things clearer:

...
unsigned long totalValue = 0;
unsigned long totalSamples = 0;

void AddSample(unsigned long lastValue) {
  totalValue += lastValue;
  ++totalSamples;
}

unsigned long getAverage() {
  if (totalSamples > 0)
    return (totalValue/totalSamples);
  else
    return 0;
}

void resetSamples() {
  totalValue = 0;
  totalSamples = 0;
}
...
  // Sum last value
  addSample(lastPing);
  // If at least 10 samples, print the average
  if (totalSamples >= 10) Serial.println(getAverage());
  // After 2000 samples reset the counters
  if (totalSamples >= 2000)  resetSamples();
...

Since the ping durations can't go to -∞, or even negative, they can't be normally distributed.

Cool -- It works like an insertion sort on a top-N (for N=2) list.

How many observations would occur during the hour?

I'd think @robtillaart's running median scheme could be a reasonable way forward. The average of the sample medians of a skewed variate would approach the median of the population.

My code to find the arithmetic mean (also the jitter and standard deviation) use thst approach with aggregator variables.

Thanks Rob, but sadly oop is a total mystery to me. If it wasnt I'd look at the ESPPing library which If i'm right sits on top of a ping library and cripples it.

Took me a while, but I see now. Neat. Reminds me of programming with LISP about 30 years ago!
Ah - it only works if the data are sorted first.

I wondered about that Dave but that only seems to apply after the data have been "normalised".

@docdoc I thought it was working, Alex, but not so.

pingstats pingTime, 31, jitter; 21; pingMax ,31, pingMax2 ,1, pingMin2 , 999, pingMin , 31 
pingstats pingTime, 92, jitter; 40; pingMax ,92, pingMax2 ,1, pingMin2 , 92, pingMin , 31 
pingstats pingTime, 22, jitter; 30; pingMax ,92, pingMax2 ,22, pingMin2 , 92, pingMin , 22 
pingstats pingTime, 94, jitter; 42; pingMax ,94, pingMax2 ,22, pingMin2 , 92, pingMin , 22 
pingstats pingTime, 5, jitter; 47; pingMax ,94, pingMax2 ,22, pingMin2 , 92, pingMin , 5 
pingstats pingTime, 34, jitter; 18; pingMax ,94, pingMax2 ,34, pingMin2 , 34, pingMin , 5 
pingstats pingTime, 4, jitter; 48; pingMax ,94, pingMax2 ,34, pingMin2 , 34, pingMin , 4 
pingstats pingTime, 24, jitter; 28; pingMax ,94, pingMax2 ,34, pingMin2 , 24, pingMin , 4 
pingstats pingTime, 96, jitter; 44; pingMax ,96, pingMax2 ,34, pingMin2 , 24, pingMin , 4 
pingstats pingTime, 12, jitter; 40; pingMax ,96, pingMax2 ,34, pingMin2 , 12, pingMin , 4 

note the pingMax of 96 is correct, but the second largest, 94 and the third largest , 92 are missed:
and for pingMin the lowest, of 4 is correct, but the second lowest, of 5 is missed.

I used your code, with values initialised each minute

 //these record minute by minute max & min so reset them
  pingMax = 0;
  pingMax2 = 1;
  pingMin = 990;  //the ping program has a default timeout of 1 sec
  pingMin2 = 999;
 //pingMax and pingMin   (10 samples per minute)
        if (pingTime > pingMax) 
      pingMax = pingTime;
    else
      if (pingTime > pingMax2) pingMax2 = pingTime;

    if (pingTime < pingMin) 
      pingMin = pingTime;
    else
      if (pingTime < pingMin2) pingMin2 = pingTime;

A few debugging could make it clear (I said the code was untested, right?), so let's start with a step-by-step analysis.

Based on your output, when receiving the first 31 value we have (this format is more readable than yours, please consider using it):

pingMax=31 pingMax2=1 pingMin2=999 pingMin=31 

On the next value, 92, you report a wrong assignment:

pingMax=92 pingMax2=1 pingMin2=92 pingMin=31 

instead of the expected output (without the correction from @oqibidipo we're ignoring but must be applied to fully update the variables):

pingMax=92 pingMax2=1 pingMin2=999 pingMin=31 

Following your code:

    //pingMax and pingMin   (10 samples per minute)
    if (pingTime > pingMax) 
      pingMax = pingTime;
    else
      if (pingTime > pingMax2) pingMax2 = pingTime;

    if (pingTime < pingMin) 
      pingMin = pingTime;
    else
      if (pingTime < pingMin2) pingMin2 = pingTime;

with the first variable values (pingMax=31 pingMax2=1 pingMin2=999 pingMin=31) and pinfTime=92, the first "if (pingTime > pingMax)" is true, so the code will set pingMax=92. Then, the second if "if (pingTime < pingMin)" is also true (92 < 999), and that's the culprit.
But this should have happened in the first case too, why the first line shows "pingMin2 , 999"?

Anyway, when designing an algorythm the very first step should be fully describe the desired results. The expected and correct output after the first 4 values IMO should be:

pingtime=31 pingMax=31 pingMax2=31 pingMin2=31 pingMin=31 
pingtime=92 pingMax=92 pingMax2=31 pingMin2=31 pingMin=31 
pingtime=22 pingMax=92 pingMax2=31 pingMin2=31 pingMin=22 
pingtime=94 pingMax=92 pingMax2=31 pingMin2=31 pingMin=31 

To avoid giving you a "ready-made" solution (you won't learn much), which changes you think are needed to get this exact output? Can you first make a flow diagram and then a newer and better code? Give it a try, then I'll try and help you more than that.

pingtime=31 pingMax=31 pingMax2=31 pingMin2=31 pingMin=31 : 
 I dont mind the values being incorrect for the first 2 lines; they are only significant after 10 samples
pingtime=92 pingMax=92 pingMax2=31 pingMin2=31 pingMin=31 
pingtime=22 pingMax=92 pingMax2=31 pingMin2=31 pingMin=22 
pingtime=94 pingMax=92 pingMax2=31 pingMin2=31 pingMin=31 
no, I'd expect
pingtime=94 pingMax=94 pingMax2=92 pingMin2=31 pingMin=22 

I'm guessing the "error" was to get me thinking about the process. I'd have used that "trick" with students myself!

Adding the @oqibidipo "trick" like this and it SEEMS to work

                     //pingMax and pingMin
    if (pingTime > pingMax) {
      pingMax2 = pingMax;
      pingMax = pingTime;
    } else  //its not the biggest but is it bigger than the present value for second largest?
      if (pingTime > pingMax2) pingMax2 = pingTime;

    if (pingTime < pingMin) {
      pingMin2 = pingMin;
      pingMin = pingTime;
    } else 
    if (pingTime < pingMin2) pingMin2 = pingTime;
pingTime	30	pingMax	30	pingMax2	0	pingMin2	990	pingMin	30
pingTime	102	pingMax	102	pingMax2	30	pingMin2	102	pingMin	30
pingTime	20	pingMax	102	pingMax2	30	pingMin2	30	pingMin	20
pingTime	92	pingMax	102	pingMax2	92	pingMin2	30	pingMin	20
pingTime	113	pingMax	113	pingMax2	102	pingMin2	30	pingMin	20
pingTime	82	pingMax	113	pingMax2	102	pingMin2	30	pingMin	20
pingTime	102	pingMax	113	pingMax2	102	pingMin2	30	pingMin	20
pingTime	125	pingMax	125	pingMax2	113	pingMin2	30	pingMin	20
pingTime	4	pingMax	125	pingMax2	113	pingMin2	20	pingMin	4
pingTime	5	pingMax	125	pingMax2	113	pingMin2	5	pingMin	4

Thanks all for your help

Comparing my suggested behaviour with your expected results you could build your version.
So, try this approach, and test the code below posting the results (or make the needed changes to the conditions):

int pingMax = 0;
int pingMax2 = 1;
int pingMin = 990;
int pingMin2 = 999;

int pingTime = 0;

void setup() {
  Serial.begin(9600);
  updateValues(31);  
  updateValues(92);
  updateValues(22);
  updateValues(94);
}

void updateValues(int v) {
  pingTime = v;
  if (pingTime > pingMax)  {
    pingMax2 = pingMax;
    pingMax = pingTime;
  }
  if (pingMax2 == 0) pingMin2 = pingTime;
  if (pingTime < pingMax && pingTime > pingMax2) pingMax2 = pingTime;
  if (pingTime < pingMin) {
    pingMin2 = pingMin;
    pingMin = pingTime;
  } 
  if (pingMin2 == 990) pingMin2 = pingTime;
  if (pingTime > pingMin && pingTime < pingMin2) pingMin2 = pingTime;
  printValues();
}

void printValues() {
  char buf[80];
  sprintf(buf, "pingTime=%d pingMax=%d pingMax2=%d pingMin2=%d pingMin=%d",
         pingTime, pingMax, pingMax2, pingMin2, pingMin);
  Serial.println(buf);
}

void loop() {
}

Thanks Alex.
I was having some residual problems when a full minute (10 readings) was incomplete, so the Min2 could still be 990, and your code fixed that, assigning the first (only) reading to all initial values.

Really, I feel it would be better to put all those times - and maybe the hourly results also - into a struct. However the code has kind of developed as my ideas have developed.
So for example, the readings are SUPPOSED to be at 6 second intervals, so I could count 10 readings for a minute; but the ping routine can take a lot longer if there is no response, so I've had to change to using millis() to time the up and down times.

I have a hunch you might be able to analyze this as a right-censored Poisson process, but seems you have found a solution that works for you.