Module ndarray_stats::histogram::strategies
Strategies used by GridBuilder to infer optimal parameters from data for building Bins and Grid instances.
The docs for each strategy have been taken almost verbatim from NumPy.
Each strategy specifies how to compute the optimal number of Bins or the optimal bin width.
For those strategies that prescribe the optimal number of Bins, the optimal bin width is computed by bin_width = (max - min)/n.
Since all bins are left-closed and right-open, an extra bin is added when necessary to include the maximum value of the given data, so that no data is discarded.
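The bin width formula and the extra-bin rule can be illustrated with a small standalone sketch. It is not the crate's implementation: bin_width and bin_edges are hypothetical helper names, and the sketch assumes plain f64 observations with finite min <= max.

```rust
/// Bin width for a strategy that prescribes `n` bins over `[min, max]`.
/// Hypothetical helper, not part of ndarray-stats.
fn bin_width(min: f64, max: f64, n: usize) -> f64 {
    assert!(n > 0, "at least one bin is required");
    (max - min) / n as f64
}

/// Edges of left-closed, right-open bins `[e_k, e_{k+1})` starting at `min`.
/// Edges are appended until the last one is strictly greater than `max`,
/// so `max` always falls inside the final bin.
/// Hypothetical helper, not part of ndarray-stats.
fn bin_edges(min: f64, max: f64, width: f64) -> Vec<f64> {
    assert!(min.is_finite() && max.is_finite() && min <= max);
    assert!(width > 0.0, "bin width must be positive");
    let mut edges = vec![min];
    let mut k = 1usize;
    loop {
        // Compute each edge from `min` directly to avoid accumulating error.
        let edge = min + k as f64 * width;
        edges.push(edge);
        if edge > max {
            break;
        }
        k += 1;
    }
    edges
}
```

With min = 0, max = 10 and n = 5, the width is 2 and the nominal last bin is [8, 10), which excludes 10 itself. The sketch therefore emits one more edge at 12, giving six bins in total.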
Strategies
Currently, the following strategies are implemented:
Auto: Maximum of the Sturges and FreedmanDiaconis strategies. Provides good all around performance.
FreedmanDiaconis: Robust (resilient to outliers) strategy that takes into account data variability and data size.
Rice: A strategy that does not take variability into account, only data size. Commonly overestimates number of bins required.
Sqrt: Square root (of data size) strategy, used by Excel and other programs for its speed and simplicity.
Sturges: R’s default strategy, only accounts for data size. Only optimal for gaussian data and underestimates number of bins for large non-gaussian datasets.
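As a rough guide to what each strategy computes, the sketch below writes out the estimator formulas as given in the NumPy histogram documentation. The function names are hypothetical, rounding the bin count up with ceil is an assumption, and the crate's own code may differ in details such as how quartiles are estimated.

```rust
/// Sqrt: number of bins = ceil(sqrt(n)).
fn sqrt_bins(n: usize) -> usize {
    (n as f64).sqrt().ceil() as usize
}

/// Rice: number of bins = ceil(2 * n^(1/3)).
fn rice_bins(n: usize) -> usize {
    (2.0 * (n as f64).cbrt()).ceil() as usize
}

/// Sturges: number of bins = ceil(log2(n)) + 1.
fn sturges_bins(n: usize) -> usize {
    (n as f64).log2().ceil() as usize + 1
}

/// Freedman-Diaconis: bin width = 2 * IQR / n^(1/3).
fn freedman_diaconis_width(iqr: f64, n: usize) -> f64 {
    2.0 * iqr / (n as f64).cbrt()
}

/// Auto: the larger of the Sturges and Freedman-Diaconis bin counts.
/// The Freedman-Diaconis count is derived from its width over `[min, max]`.
fn auto_bins(min: f64, max: f64, iqr: f64, n: usize) -> usize {
    let fd_bins = ((max - min) / freedman_diaconis_width(iqr, n)).ceil() as usize;
    sturges_bins(n).max(fd_bins)
}
```

Only the Freedman-Diaconis formula uses the spread of the data (through the IQR). The other three depend on the sample size n alone, which is why they cannot adapt to skewed or heavy-tailed data.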
Notes
In general, successful inference of the optimal bin width and number of bins relies on variability in the data. In other words, the provided observations should be neither empty nor constant.
In addition, Auto and FreedmanDiaconis require the interquartile range (IQR), i.e. the difference between the upper and lower quartiles, to be positive.
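The IQR requirement is stricter than "not constant", which the following sketch tries to show. It assumes linear interpolation between order statistics for the quartiles (one of several common conventions, not necessarily the one the crate uses), and the helper names are hypothetical.

```rust
/// Linear-interpolation quantile of a sorted, non-empty slice, `q` in [0, 1].
/// Hypothetical helper, not part of ndarray-stats.
fn quantile_sorted(sorted: &[f64], q: f64) -> f64 {
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    let frac = pos - lo as f64;
    sorted[lo] + frac * (sorted[hi] - sorted[lo])
}

/// Interquartile range (upper quartile minus lower quartile),
/// or `None` when there are no observations.
fn iqr(data: &[f64]) -> Option<f64> {
    if data.is_empty() {
        return None;
    }
    let mut sorted = data.to_vec();
    // `total_cmp` gives a total order on f64, so sorting cannot panic.
    sorted.sort_by(|a, b| a.total_cmp(b));
    Some(quantile_sorted(&sorted, 0.75) - quantile_sorted(&sorted, 0.25))
}

/// True when the data has a strictly positive IQR.
fn has_positive_iqr(data: &[f64]) -> bool {
    matches!(iqr(data), Some(v) if v > 0.0)
}
```

Under this convention a sample such as [1, 1, 1, 1, 1, 1, 1, 100] is not constant, yet both quartiles equal 1 and the IQR is zero. A Freedman-Diaconis width computed from it would be zero, so the positivity check fails.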
Structs
Auto: Maximum of the Sturges and FreedmanDiaconis strategies. Provides good all around performance.
FreedmanDiaconis: Robust (resilient to outliers) strategy that takes into account data variability and data size.
Rice: A strategy that does not take variability into account, only data size. Commonly overestimates number of bins required.
Sqrt: Square root (of data size) strategy, used by Excel and other programs for its speed and simplicity.
Sturges: R’s default strategy, only accounts for data size. Only optimal for gaussian data and underestimates number of bins for large non-gaussian datasets.
Traits
A trait implemented by all strategies to build Bins with parameters inferred from observations.