<no title>

class surpyval.univariate.nonparametric.nonparametric.NonParametric

Bases: NonParametricDistribution

Result of .fit() method for every non-parametric surpyval distribution. This means that each of the methods in this class can be called with a model created from the NelsonAalen, KaplanMeier, FlemingHarrington, or Turnbull estimators.

Hf(x: ArrayLike, interp: str = 'step') → NDArray

Cumulative hazard rate with the non-parametric estimates from the data. This is calculated using the relationship between the hazard function and the density:

\[H(x) = -\ln (R(x))\]

Parameters: x (array like or scalar) – The values of the random variables at which the function will be calculated.
Returns: Hf – The value(s) of the density function at x
Return type: scalar or numpy array

Examples

>>> from surpyval import NelsonAalen
>>> x = np.array([1, 2, 3, 4, 5])
>>> model = NelsonAalen.fit(x)
>>> model.Hf(2)
array([0.45])
>>> model.Hf([1., 1.5, 2., 2.5])
array([0.2 , 0.2 , 0.45, 0.45])

band(x: ArrayLike | None = None, method: str = 'hall-wellner', bound_type: str = 'exp', alpha_ci: float = 0.05, n_sims: int = 10000, random_state: int | None = 1) → NDArray

Simultaneous confidence band of the survival function.

The pointwise bounds from cb() cover the true value of the survival function at each single time with probability 1 - alpha_ci, but the probability that the whole true curve lies between them is lower, since the curve has many opportunities to escape. A confidence band is widened so that, with probability 1 - alpha_ci, the entire survival function lies within the band over the observed range. Use the band, not the pointwise bounds, to assess whether a hypothesised curve (e.g. a fitted parametric distribution) is consistent with the data as a whole.

Two classical bands are available:

“hall-wellner”: width proportional to \((1 + n\sigma^2(t))/\sqrt{n}\); tends to be relatively wider in the middle of the curve.
“nair” (equal precision): width proportional to the pointwise standard error, i.e. the band is the pointwise interval scaled by a larger critical value, so its width follows the pointwise bounds everywhere.

Critical values are computed by Monte Carlo simulation of the limiting Brownian bridge process, with a fixed default seed so results are reproducible.

The band is only defined between the first and last observed events (where the variance estimate is positive and finite); NaN is returned outside that range. The asymptotic theory for these bands is for right censored data; for Turnbull models with interval censoring prefer bootstrap_cb().

Parameters

x (array like or scalar, optional) – The values at which the band will be evaluated. Defaults to the observed values.
method (('hall-wellner', 'nair'), str, optional) – The type of band. Defaults to ‘hall-wellner’.
bound_type (('exp', 'normal'), str, optional) – As for cb(): ‘exp’ applies the band on the log(-log) scale, keeping it within [0, 1]. Defaults to ‘exp’.
alpha_ci (scalar, optional) – The level of significance of the band. Defaults to 0.05.
n_sims (int, optional) – Number of simulated paths for the critical value.
random_state (int or numpy.random.Generator, optional) – Seed for the critical value simulation. Defaults to a fixed seed for reproducibility.

Returns

band – Array of shape (len(x), 2) with the [lower, upper] band values for the survival function at each x.

Return type

numpy array

References

Hall, W. J. and Wellner, J. A. (1980), “Confidence bands for a survival curve from censored data”, Biometrika 67, 133-143.

Nair, V. N. (1984), “Confidence bands for survival functions with censored data: a comparative study”, Technometrics 26, 265-275.

Klein, J. P. and Moeschberger, M. L. (2003), “Survival Analysis”, 2nd ed., Section 4.4.

bootstrap_cb(x: ArrayLike, bound: str = 'two-sided', alpha_ci: float = 0.05, B: int = 200, random_state: int | None = None) → NDArray

Confidence bounds of the survival function computed with a non-parametric bootstrap: the data are resampled with replacement, the model is refitted with the same estimator, and the percentile interval across the refits is taken at each x.

This is the recommended way to compute bounds for the Turnbull estimator. The Greenwood-style bounds from cb() treat the expected (fractional) at risk and death counts from the Turnbull EM as if they were observed counts, which ignores the uncertainty in the EM allocation itself; the bootstrap does not.

Note that the Turnbull NPMLE only identifies the probability mass within each Turnbull interval, not how it is distributed inside one. Point estimates and bounds evaluated strictly inside an interval therefore reflect the step convention rather than an estimate of the underlying continuous survival function, and are best evaluated at the interval bounds.

Parameters

x (array like or scalar) – The values at which the confidence bounds will be calculated.
bound (('two-sided', 'upper', 'lower'), str, optional) – Compute either the two-sided, upper or lower confidence bound(s). Defaults to two-sided.
alpha_ci (scalar, optional) – The level of significance at which the bound will be computed. Defaults to 0.05.
B (int, optional) – The number of bootstrap resamples. Defaults to 200. Larger values give smoother bounds at a linear cost in runtime; note that refitting the Turnbull estimator is relatively expensive.
random_state (int or numpy.random.Generator, optional) – Seed or generator for reproducible resampling.

Returns

cb – For two-sided bounds an array of shape (len(x), 2) with [lower, upper] columns; otherwise an array of the requested bound at each x.

Return type

numpy array

cb(x: ArrayLike, on: str = 'sf', bound: str = 'two-sided', interp: str = 'step', alpha_ci: float = 0.05, bound_type: str = 'exp', dist: str = 'z') → NDArray

Confidence bounds of the on function at the alpha_ci level of significance. Can be the upper, lower, or two-sided confidence by changing value of bound. Can change the bound type to be regular (normal) or exponential using either the ‘t’ or ‘z’ statistic.

The variance used is the one appropriate to the estimator with which the model was fitted: Greenwood’s formula for Kaplan-Meier, Aalen’s (Poisson) variance for Nelson-Aalen, and the tie-corrected variance for Fleming-Harrington.

Parameters

x (array like or scalar) – The values of the random variables at which the confidence bounds will be calculated
on (('sf', 'ff', 'Hf'), optional) – The function on which the confidence bound will be calculated.
bound (('two-sided', 'upper', 'lower'), str, optional) – Compute either the two-sided, upper or lower confidence bound(s). Defaults to two-sided.
interp (('step', 'linear', 'cubic'), optional) – How to interpolate the values between observations. Survival statistics traditionally uses step functions, but can use interpolated values if desired. Defaults to step.
alpha_ci (scalar, optional) – The level of significance at which the bound will be computed.
bound_type (('exp', 'normal'), str, optional) – The method with which the bounds will be calculated. Using ‘normal’ (i.e. the plain Greenwood-style interval) will allow for the bounds to exceed 1 or be less than 0 and tends to undercover in small samples. Defaults to ‘exp’ (the log(-log) transformed interval) as this ensures the bounds are within 0 and 1 and has better coverage.
dist (('t', 'z'), str, optional) – The statistic to use in calculating the bounds (student-t or normal). Defaults to the normal (z). The ‘t’ option is a conservative heuristic which uses the at risk count minus one at each point as the degrees of freedom; it is not supported by asymptotic theory, widens (overcovers) as the at risk count falls, and is undefined when only one item remains at risk.

Returns

cb – The value(s) of the upper, lower, or both confidence bound(s) of the selected function at x. For two-sided bounds the result has one row per value of x with the columns being the [lower, upper] bounds of the on function.

Return type

scalar or numpy array

Notes

If the last observation is an event (i.e. there is no right censoring) the Kaplan-Meier variance is undefined at, and after, that point. The bounds there are filled with the last finite upper bound and 0 for the lower bound, rather than NaN, so that bounds can be drawn up to the last observation.

Examples

>>> from surpyval import NelsonAalen
>>> x = np.array([1, 2, 3, 4, 5])
>>> model = NelsonAalen.fit(x)
>>> model.cb([1., 1.5, 2., 2.5], bound='lower')
array([0.35485348, 0.35485348, 0.2345113 , 0.2345113 ])
>>> model.cb([1., 1.5, 2., 2.5])
array([[0.24175891, 0.97222045],
       [0.24175891, 0.97222045],
       [0.16288538, 0.89441253],
       [0.16288538, 0.89441253]])

References

http://reliawiki.org/index.php/Non-Parametric_Life_Data_Analysis

df(x: ArrayLike, interp: str = 'step') → NDArray

Density function with the non-parametric estimates from the data. This is calculated using the relationship between the hazard function and the density:

\[f(x) = h(x)e^{-H(x)}\]

Parameters: x (array like or scalar) – The values of the random variables at which the survival function will be calculated
Returns: df – The value(s) of the density function at x
Return type: scalar or numpy array

Examples

>>> from surpyval import NelsonAalen
>>> x = np.array([1, 2, 3, 4, 5])
>>> model = NelsonAalen.fit(x)
>>> model.df([1.5, 2.5, 3.5])
array([0.20468269, 0.15940704, 0.15229351])

ff(x: ArrayLike, interp: str = 'step') → NDArray

CDF (failure or unreliability) function with the non-parametric estimates from the data

Parameters: x (array like or scalar) – The values of the random variables at which the survival function will be calculated.
Returns: ff – The value(s) of the failure function at each x
Return type: scalar or numpy array

Examples

>>> from surpyval import NelsonAalen
>>> x = np.array([1, 2, 3, 4, 5])
>>> model = NelsonAalen.fit(x)
>>> model.ff(2)
array([0.36237185])
>>> model.ff([1., 1.5, 2., 2.5])
array([0.18126925, 0.18126925, 0.36237185, 0.36237185])

hf(x: ArrayLike, interp: str = 'step') → NDArray

Instantaneous hazard function with the non-parametric estimates from the data. This is calculated using simply the difference between consecutive H(x).

Parameters: x (array like or scalar) – The values of the random variables at which the survival function will be calculated
Returns: hf – The value(s) of the failure function at each x
Return type: scalar or numpy array

Examples

>>> from surpyval import NelsonAalen
>>> x = np.array([1, 2, 3, 4, 5])
>>> model = NelsonAalen.fit(x)
>>> model.hf([1.5, 2.5, 3.5])
array([0.25      , 0.25      , 0.33333333])

mean(tau: float | None = None) → float

The (restricted) mean survival time: the area under the estimated survival function from 0 to tau.

If the survival function reaches zero this is the mean of the estimated distribution. With right censoring the survival function does not reach zero and the unrestricted mean is undefined; the restricted mean up to tau (defaulting to the largest observed value) is reported instead, which is the standard restricted mean survival time (RMST).

Parameters: tau (scalar, optional) – The horizon up to which the survival function is integrated. Defaults to the largest observed value. If tau is beyond the last observation the survival function is extended at its final value.
Returns: mean – The restricted mean survival time.
Return type: float

Examples

>>> from surpyval import KaplanMeier
>>> x = np.array([1, 2, 3, 4, 5])
>>> model = KaplanMeier.fit(x)
>>> model.mean()
3.0

mean_cb(tau: float | None = None, alpha_ci: float = 0.05) → NDArray

Two-sided confidence interval of the (restricted) mean survival time, using the normal approximation with the standard variance estimate:

\[\widehat{Var}(\hat{\mu}) = \sum_{i: x_i \leq \tau} A_i^2 v_i\]

where \(A_i\) is the area under the survival function from \(x_i\) to \(\tau\) and \(v_i\) is the variance increment of the cumulative hazard at \(x_i\) (e.g. the Greenwood increment for the Kaplan-Meier estimator).

Parameters

tau (scalar, optional) – The horizon up to which the survival function is integrated. Defaults to the largest observed value.
alpha_ci (scalar, optional) – The level of significance at which the interval will be computed. Defaults to 0.05.

Returns

cb – The [lower, upper] interval of the (restricted) mean.

Return type

numpy array

property median: float: The median survival time; the smallest observed value at which the estimated CDF reaches, or exceeds, 0.5. NaN if the estimate never reaches 0.5 (e.g. due to right censoring).

plot(ax: Axes | None = None, **kwargs) → Any

Creates a plot of the survival function.

Two-sided confidence bounds are drawn as a shaded band in the same colour as the survival curve, and right censored observations are marked with ticks on the curve. Any keyword arguments not listed below (e.g. color or label) are passed to the matplotlib plotting call for the survival curve.

Parameters

ax (matplotlib axis, optional) – The axis on which the plot will be drawn. Defaults to the current axis.
plot_bounds (bool, optional) – Whether to draw the confidence bounds. Defaults to True.
show_censors (bool, optional) – Whether to mark right censored observations on the curve. Defaults to True.
interp (('step', 'linear', 'cubic'), optional) – How to draw the curve between observations.
bound (optional) – Passed to the confidence bound calculation; see cb().
alpha_ci (optional) – Passed to the confidence bound calculation; see cb().
bound_type (optional) – Passed to the confidence bound calculation; see cb().
dist (optional) – Passed to the confidence bound calculation; see cb().

Returns

ax

Return type

matplotlib axis

qf(p: ArrayLike) → NDArray

Quantile function of the non-parametric estimate. Returns the smallest observed value at which the estimated CDF reaches, or exceeds, the probability p.

Parameters: p (array like or scalar) – The probabilities at which the quantile will be computed. Values must be in (0, 1].
Returns: q – The value(s) of the quantile at each p. NaN where the estimated CDF never reaches p (e.g. due to right censoring).
Return type: numpy array

Examples

>>> from surpyval import KaplanMeier
>>> x = np.array([1, 2, 3, 4, 5])
>>> model = KaplanMeier.fit(x)
>>> model.qf(0.5)
array([3.])
>>> model.qf([0.1, 0.5, 0.9])
array([1., 3., 5.])

quantile_cb(p: ArrayLike, alpha_ci: float = 0.05, bound_type: str = 'exp', dist: str = 'z') → NDArray

Two-sided confidence interval of the quantile at each probability p using the Brookmeyer-Crowley method: the interval is the set of times at which the pointwise confidence interval of the survival function contains 1 - p.

Parameters

p (array like or scalar) – The probabilities at which the quantile interval will be computed. Values must be in (0, 1].
alpha_ci (scalar, optional) – The level of significance at which the interval will be computed. Defaults to 0.05.
bound_type (('exp', 'normal'), str, optional) – The method for the underlying survival function bounds.
dist (('t', 'z'), str, optional) – The statistic used in the underlying survival function bounds.

Returns

cb – Array of shape (len(p), 2) with the [lower, upper] interval of the quantile for each p. The upper limit is NaN where the relevant bound of the survival function never crosses 1 - p (i.e. the interval is open to the right).

Return type

numpy array

random(size: int) → NDArray: Draws random samples from the fitted distribution. Each observed value x is drawn with the probability mass the estimated survival function assigns to it. If the estimate does not reach zero (e.g. due to right censoring) the remaining mass is distributed over the observed values, i.e. sampling is conditional on an event occurring at one of the observed values.

sf(x: ArrayLike, interp: str = 'step') → NDArray

Surival (or Reliability) function with the non-parametric estimates from the data.

Parameters: x (array like or scalar) – The values of the random variables at which t he survival function will be calculated.
Returns: sf – The value(s) of the survival function at each x
Return type: scalar or numpy array

Examples

>>> from surpyval import NelsonAalen
>>> x = np.array([1, 2, 3, 4, 5])
>>> model = NelsonAalen.fit(x)
>>> model.sf(2)
array([0.63762815])
>>> model.sf([1., 1.5, 2., 2.5])
array([0.81873075, 0.81873075, 0.63762815, 0.63762815])

smoothed_hf(x: ArrayLike, bandwidth: float | None = None) → NDArray

Kernel smoothed estimate of the hazard rate, using an Epanechnikov kernel over the increments of the cumulative hazard estimate:

\[\hat{h}(t) = \frac{1}{b} \sum_{i} K\left ( \frac{t - x_i}{b} \right ) \Delta \hat{H}(x_i)\]

Contributions are renormalised near the boundaries of the observed range so the estimate is not biased downward where the kernel window extends past the data.

This is a better estimate of the hazard rate than hf(), which simply differences the cumulative hazard between the requested points.

Parameters

x (array like or scalar) – The values at which the hazard rate will be estimated.
bandwidth (scalar, optional) – The kernel bandwidth in the units of x. Defaults to a rough rule of thumb (one eighth of the observed range); for serious use choose by inspection or cross-validation.

Returns

hf – The estimated hazard rate at each x. NaN outside the observed range.

Return type

numpy array

References

Klein, J. P. and Moeschberger, M. L. (2003), “Survival Analysis”, 2nd ed., Section 6.2.