Utilities
Data Classes
- class surpyval.utils.surpyval_data.SurpyvalData(x: ArrayLike | None = None, c: ArrayLike | None = None, n: ArrayLike | None = None, t: ArrayLike | None = None, xl: ArrayLike | None = None, xr: ArrayLike | None = None, tl: ArrayLike | numbers.Number | None = None, tr: ArrayLike | numbers.Number | None = None, Z: ArrayLike | None = None, group_and_sort: bool = True, handle: bool = True)
Bases: object
Initialize a SurpyvalData instance for survival analysis.
Validates, sorts, and stores survival data in the xcnt format. Supports uncensored, right/left/interval-censored, and truncated observations. Can convert to xrd format and select subsets by censoring type.
- Parameters
x (array-like, optional) – The primary data array of failure/event times. When c is 2 the corresponding x entry is a 2-element array [left, right].
c (array-like, optional) – Censoring flags for each value in x: * 0 = uncensored * 1 = right censored * -1 = left censored * 2 = interval censored
n (array-like, optional) – Number of occurrences for each value in x.
t (array-like, optional) – 2D array of truncation bounds [left, right] for each value in x.
xl (array-like, optional) – Left interval bounds for interval censored data. Cannot be used with ‘x’. Must be paired with ‘xr’.
xr (array-like, optional) – Right interval bounds for interval censored data. Cannot be used with ‘x’. Must be paired with ‘xl’.
tl (array-like or scalar, optional) – Left truncation bounds. Cannot be used with ‘t’. Must be paired with ‘tr’.
tr (array-like or scalar, optional) – Right truncation bounds. Cannot be used with ‘t’. Must be paired with ‘tl’.
group_and_sort (bool, default=True) – Whether to group and sort the data. Set False when using covariates to maintain data order.
handle (bool, default=True) – Whether to validate and process the input data. Set False for pre-validated data.
Examples
Basic usage with uncensored data:
>>> import numpy as np
>>> from surpyval.utils.surpyval_data import SurpyvalData
>>> x = np.array([1, 2, 3])
>>> data = SurpyvalData(x)
Right censored data:
>>> x = np.array([1, 2, 3])
>>> c = np.array([0, 1, 1])  # 2 and 3 are censored
>>> data = SurpyvalData(x, c)
Interval censored data:
>>> xl = np.array([1, 2, 3])
>>> xr = np.array([2, 3, 4])
>>> data = SurpyvalData(xl=xl, xr=xr)
Interval censored data with nested 2-element arrays:
>>> x = [1, 2, [2, 5], 3, 6]
>>> c = [0, 1, 2, 0, 0]
>>> data = SurpyvalData(x=x, c=c)
With truncation:
>>> x = np.array([1, 2, 3])
>>> t = np.array([[0, 5], [0, 5], [0, 5]])
>>> data = SurpyvalData(x, t=t)
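The two interval-censored layouts above carry the same information. The following is a minimal pure-Python sketch of that correspondence, not the library's implementation; `pair_intervals` is a hypothetical helper name, and treating every pair as interval censored (even when both bounds are equal) is an assumption of this sketch.

```python
def pair_intervals(xl, xr):
    """Hypothetical helper: merge parallel bounds into the nested x / c layout."""
    if len(xl) != len(xr):
        raise ValueError("xl and xr must have the same length")
    x, c = [], []
    for left, right in zip(xl, xr):
        if left > right:
            raise ValueError("left bound exceeds right bound")
        x.append([left, right])
        c.append(2)  # 2 = interval censored
    return x, c


x, c = pair_intervals([1, 2, 3], [2, 3, 4])
print(x)  # [[1, 2], [2, 3], [3, 4]]
print(c)  # [2, 2, 2]
```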
- add_covariates(Z: ArrayLike) → None
Adds covariates to the data, storing them in the Z attribute of the object. For regression survival analysis, this keeps the covariates attached to the data in a consistent way so that both can be converted and passed to the fitters.
- Parameters
Z (numpy.ndarray) – The covariate array.
- classmethod from_json(source: str | pathlib.Path) → SurpyvalData
Create SurpyvalData instance from JSON string or file path.
- Parameters
source (str | Path) – Pass a Path to load from a file; pass a str to parse as JSON text directly.
- Returns
New instance created from JSON data
- Return type
SurpyvalData
- to_json(filepath: str | pathlib.Path | None = None) → str | None
Serialize SurpyvalData to JSON format.
- Parameters
filepath (str | Path, optional) – If provided, saves JSON to this file path
- Returns
JSON string if no filepath provided, None if saved to file
- Return type
str | None
- to_xrd(estimator='Nelson-Aalen') → tuple
Converts the data into the xrd format. If the data has right truncated observations or left or interval censored observations, the data is converted to the xrd format using the Turnbull estimator. The estimator parameter is used with, and only with, the Turnbull estimator.
- Parameters
estimator (str, optional) – The method for estimation if data requires the use of the Turnbull estimator to convert to xrd, defaults to “Nelson-Aalen”.
- Returns
The xrd data.
- Return type
tuple
- class surpyval.utils.recurrent_event_data.RecurrentEventData(x, i, c, n, e=None, tl=None, tr=None)
Bases: object
- Z: numpy.ndarray[tuple[Any, ...], numpy.dtype[numpy._typing._array_like._ScalarT]] | None = None
A class to handle and manipulate recurrent event data. Recurrent events are those that can occur more than once for each subject or item.
Examples
>>> import numpy as np
>>> from surpyval import RecurrentEventData
>>> x = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
>>> c = np.array([0, 0, 1, 1, 1, 0, 0, 0, 0, 1])
>>> n = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> i = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
>>> data = RecurrentEventData(x, i, c, n)
>>> data.to_xrd()
(array([1, 2, 3, 4, 5]), array([2, 2, 2, 2, 2]), array([2, 2, 1, 1, 0]))
>>> data[0:2]
RecurrentEventData( x=[1 2], i=[1 1], c=[0 0], n=[1 1] )
>>> data.get_times_to_first_events()
SurpyvalData( x=[1.], c=[0], n=[2], t=[[-inf inf]])
>>> data.get_interarrival_times()
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
- property event_types
The distinct event types (marks) present in the data, excluding the None mark used for censored / end-of-observation rows. Returns an empty list when the data carries no marks.
- get_events_for_item(item)
Get all events for a specific item or subject.
- Parameters
item (int or str) – The id of the item or subject.
- Returns
A tuple containing event times, censoring information and frequencies for the specified item.
- Return type
tuple
- get_interarrival_times()
Finds the interarrival times between events for each item. The class assumes that event times are cumulative, but it is sometimes necessary to know the interarrival times of the events. This method returns those interarrival times, aligned with the items attribute.
- Returns
An array of interarrival times.
- Return type
numpy.ndarray
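As a rough illustration of what "interarrival times from cumulative times" means, here is a minimal pure-Python sketch. It is not the library's code: `interarrival_times` is a hypothetical name, rows are assumed to be ordered by time within each item, and every item is assumed to start at time 0.

```python
def interarrival_times(x, i):
    """Per-item gaps between cumulative event times (illustrative sketch)."""
    last_seen = {}  # item id -> most recent cumulative time
    gaps = []
    for time, item in zip(x, i):
        previous = last_seen.get(item, 0)  # assumption: each item starts at 0
        gaps.append(time - previous)
        last_seen[item] = time
    return gaps


# Same x and i as the class example above.
x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
i = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
print(interarrival_times(x, i))  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```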
- get_previous_x(min_x=0)
Finds the previous event time for each event, which is useful for calculating the time since the last event. The result is aligned with the items attribute.
- Parameters
min_x (float, optional) – Fallback minimum for the first event of each item. The item’s left truncation bound is used instead when it is greater, so a delayed-entry item’s first interval begins at its entry time.
- Returns
An array of previous event times.
- Return type
numpy.ndarray
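The fallback rule for an item's first event can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `previous_x` is a hypothetical function, left truncation bounds are passed as a plain dict keyed by item id, and rows are assumed to be ordered by time within each item.

```python
def previous_x(x, i, tl=None, min_x=0):
    """Previous event time per row; first rows fall back to max(min_x, tl)."""
    tl = tl or {}  # assumption: item id -> left truncation (entry) time
    last_seen = {}
    out = []
    for time, item in zip(x, i):
        if item in last_seen:
            out.append(last_seen[item])
        else:
            # First row of the item: use its entry time when that is later.
            out.append(max(min_x, tl.get(item, float("-inf"))))
        last_seen[item] = time
    return out


result = previous_x([3, 6, 2, 7], ["a", "a", "b", "b"], tl={"b": 1.5})
print(result)  # [0, 3, 1.5, 2]
```

Item "b" enters observation at 1.5, so its first interval starts there rather than at `min_x`.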
- get_right_truncation_close()
Per-item integration bounds for the NHPP likelihood’s right window-close.
The NHPP integral runs from each item’s entry time (its left truncation tl, handled by get_previous_x()) to the time its observation window closes. Historically that close was only known from an explicit right-censoring (c=1) row, so the integral stopped at the item’s last recorded time. When an item instead carries a finite right-truncation time tr the window closes there, and the integral must be extended from the last in-window time out to tr.
Returns three aligned arrays (x_last, x_close, rep_idx) with one entry per item whose tr is finite: x_last is the item’s last in-window time (its last event or right-censoring row), x_close is its tr, and rep_idx is a representative row index for the item (so per-item covariates Z can be gathered). Items with the default tr = inf are omitted, so untruncated data yields empty arrays and contributes nothing to the integral. Adding cif(x_last) - cif(x_close) to the log-likelihood therefore extends the telescoped integral to tr (and is exactly zero when a c=1 row already sits at tr, so the two ways of closing the window never double-count).
- Returns
(x_last, x_close, rep_idx) as described above.
- Return type
tuple of numpy.ndarray
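The selection of per-item bounds described above can be sketched in plain Python. This is an assumption-laden illustration rather than the library's code: `right_truncation_close` is a hypothetical standalone function, `tr` is given per row, and the row holding the item's latest time is used as the representative index.

```python
import math


def right_truncation_close(x, i, tr):
    """Sketch: per-item (x_last, x_close, rep_idx) for items with finite tr."""
    last_row = {}  # item id -> index of the row holding its latest time
    for row, item in enumerate(i):
        if item not in last_row or x[row] >= x[last_row[item]]:
            last_row[item] = row
    x_last, x_close, rep_idx = [], [], []
    for item, row in last_row.items():
        if math.isfinite(tr[row]):  # items with tr = inf are omitted
            x_last.append(x[row])
            x_close.append(tr[row])
            rep_idx.append(row)
    return x_last, x_close, rep_idx


x = [2, 5, 3, 8]
i = ["a", "a", "b", "b"]
tr = [10, 10, math.inf, math.inf]
print(right_truncation_close(x, i, tr))  # ([5], [10], [1])
```

Only item "a" has a finite `tr`, so only its window is extended, from 5 out to 10. If its last row already sat at 10, `x_last` and `x_close` would coincide and the added term would vanish.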
- get_times_to_first_events()
Get the times to the first events for each item or subject. In the estimation of recurrent or renewal events it can be helpful to know the distribution of the times to the first event per item. This method returns the times to the first events for each item. It is aligned with the items attribute.
- Returns
A transformed dataset containing times to the first events.
- Return type
SurpyvalData
- to_cause_specific_xrd(cause)
Convert the recurrent event data to xrd format for a single event type (cause). The at-risk set r is shared across all causes (an item remains at risk for every cause until it leaves observation); only the event count d is restricted to the requested cause.
- Parameters
cause (object) – The event type to compute the cause-specific counts for. Must be one of self.event_types.
- Returns
A tuple (x_unique, r, d_cause) where d_cause counts only events of the requested cause and r is the shared at-risk set.
- Return type
tuple
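A minimal sketch of the shared at-risk set with cause-restricted event counts follows. It is not the library's implementation: `cause_specific_xrd` is a hypothetical standalone function, each row is assumed to count once, left truncation is ignored, and an item is assumed to stay at risk up to and including its last recorded time.

```python
def cause_specific_xrd(x, i, c, e, cause):
    """Sketch: shared risk set r, event counts d restricted to one cause."""
    last_time = {}  # item id -> last recorded time (end of observation)
    for time, item in zip(x, i):
        last_time[item] = max(time, last_time.get(item, time))
    x_unique = sorted(set(x))
    # r ignores the cause: every item under observation is at risk for all causes.
    r = [sum(1 for t in last_time.values() if t >= xu) for xu in x_unique]
    # d counts only uncensored rows (c == 0) whose mark equals the cause.
    d = [
        sum(1 for time, flag, mark in zip(x, c, e)
            if time == xu and flag == 0 and mark == cause)
        for xu in x_unique
    ]
    return x_unique, r, d


x = [1, 3, 5, 2, 4, 6]
i = [1, 1, 1, 2, 2, 2]
c = [0, 0, 1, 0, 0, 1]
e = ["wear", "shock", None, "shock", "wear", None]
print(cause_specific_xrd(x, i, c, e, "wear"))
# ([1, 2, 3, 4, 5, 6], [2, 2, 2, 2, 2, 1], [1, 0, 0, 1, 0, 0])
```

Calling the sketch with "shock" returns the same `r` and a different `d`, which is the point of sharing the at-risk set.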
- to_xrd(estimator='Nelson-Aalen')
Convert the recurrent event data to xrd format.
- Parameters
estimator (str, optional) – The estimator to use, defaults to “Nelson-Aalen”.
- Returns
A tuple containing unique event times, risk set sizes, and the event counts.
- Return type
tuple
Data Wrangling Utilities
- surpyval.utils.coerce_xcnt_x(x) → ndarray[tuple[Any, ...], dtype[_ScalarT]]
Coerce the x variable of xcnt-format data into a numpy array.
Accepts a 1D array of event values, or a 2D array / list-of-pairs of [left, right] interval bounds. Validates dimensionality, the interval ordering (left <= right) and the absence of NaNs. Shared by the univariate (xcnt_handler) and recurrent (handle_xicn) handlers.
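The three validations named above (dimensionality, interval ordering, no NaNs) can be sketched without numpy. This is an illustrative stand-in, not the library's function: `coerce_x_sketch` is a hypothetical name, it returns plain lists instead of a numpy array, and the error messages are invented.

```python
import math


def coerce_x_sketch(x):
    """Validate a 1D list of values or a list of [left, right] pairs."""
    rows = list(x)
    if all(isinstance(v, (int, float)) for v in rows):
        values = [float(v) for v in rows]
        if any(math.isnan(v) for v in values):
            raise ValueError("x contains NaN")
        return values
    pairs = []
    for row in rows:
        if not isinstance(row, (list, tuple)) or len(row) != 2:
            raise ValueError("each interval must be a [left, right] pair")
        left, right = float(row[0]), float(row[1])
        if math.isnan(left) or math.isnan(right):
            raise ValueError("x contains NaN")
        if left > right:
            raise ValueError("interval bounds must satisfy left <= right")
        pairs.append([left, right])
    return pairs


print(coerce_x_sketch([1, 2, 3]))         # [1.0, 2.0, 3.0]
print(coerce_x_sketch([[1, 2], [3, 4]]))  # [[1.0, 2.0], [3.0, 4.0]]
```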
- surpyval.utils.format_truncation(t, tl, tr, n_rows) → ndarray[tuple[Any, ...], dtype[_ScalarT]]
Build the (n_rows, 2) truncation array from either a t matrix or separate tl/tr bounds (scalars broadcast to all rows). The default window is the whole real line [-inf, inf]. Shared by xcnt_handler and handle_xicn.
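The broadcasting and default-window behaviour described above might look like the following in plain Python. This is a sketch under assumptions, not the library's code: `build_truncation` is a hypothetical name, it returns nested lists rather than a numpy array, and rejecting `t` combined with `tl`/`tr` follows the parameter notes elsewhere on this page.

```python
import math


def build_truncation(t=None, tl=None, tr=None, n_rows=0):
    """Sketch: return n_rows pairs [left, right] of truncation bounds."""
    if t is not None:
        if tl is not None or tr is not None:
            raise ValueError("use either t or tl/tr, not both")
        if len(t) != n_rows or any(len(row) != 2 for row in t):
            raise ValueError("t must have shape (n_rows, 2)")
        return [[float(lo), float(hi)] for lo, hi in t]

    def expand(bound, default):
        if bound is None:
            return [default] * n_rows       # default: unbounded on this side
        if isinstance(bound, (int, float)):
            return [float(bound)] * n_rows  # scalar broadcast to every row
        if len(bound) != n_rows:
            raise ValueError("bound length must equal n_rows")
        return [float(b) for b in bound]

    left, right = expand(tl, -math.inf), expand(tr, math.inf)
    return [[lo, hi] for lo, hi in zip(left, right)]


print(build_truncation(n_rows=2))              # [[-inf, inf], [-inf, inf]]
print(build_truncation(tl=0, tr=5, n_rows=2))  # [[0.0, 5.0], [0.0, 5.0]]
```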
- surpyval.utils.fs_to_xrd(f, s)
Converts the fs format to the xrd format.
- Parameters
f (array) – array of values for which the failure/death was observed
s (array) – array of right censored observation values
- Returns
x (array) – sorted array of values of variable for which observations were made.
r (array) – array of count of units/people at risk at time x (including if it had an event at ‘x’).
d (array) – array of the count of failures/deaths at each time x.
Examples
>>> from surpyval import fs_to_xrd
>>> f = [1, 4, 5]
>>> s = [2, 3]
>>> x, r, d = fs_to_xrd(f, s)
>>> x, r, d
(array([1, 2, 3, 4, 5]), array([5, 4, 3, 2, 1]), array([1, 0, 0, 1, 1]))
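To make the meaning of x, r and d concrete, here is a small pure-Python sketch of the counting. It is not the library's implementation; `fs_to_xrd_sketch` is a hypothetical name, and it returns lists rather than numpy arrays.

```python
def fs_to_xrd_sketch(f, s):
    """Count at-risk units r and failures d at each distinct time x."""
    x = sorted(set(f) | set(s))
    all_obs = list(f) + list(s)
    # A unit is at risk at time xi if its failure or censoring time is >= xi.
    r = [sum(1 for v in all_obs if v >= xi) for xi in x]
    # Only observed failures contribute to d.
    d = [sum(1 for v in f if v == xi) for xi in x]
    return x, r, d


print(fs_to_xrd_sketch([1, 4, 5], [2, 3]))
# ([1, 2, 3, 4, 5], [5, 4, 3, 2, 1], [1, 0, 0, 1, 1])
```

The result agrees with the documented output for the same inputs.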
- surpyval.utils.fsli_handler(f=None, s=None, l=None, i=None)
Takes in the fsli format and ensures that the data is correctly defined. Takes an assorted combination of f, s, l, and i and returns them in the correct format as numpy arrays.
- Parameters
f (array-like, optional (default: None)) – array of values for which the failure/death was observed
s (array-like, optional (default: None)) – array of right censored observation values
l (array-like, optional (default: None)) – array of left censored observation values
i (array-like, optional (default: None)) – array of length-2 arrays of interval censored data
- Returns
f (array) – array of values for which the failure/death was observed that have been checked for correctness
s (array) – array of right censored observation values that have been checked for correctness
l (array) – array of left censored observation values that have been checked for correctness
i (array) – array of interval censored data that have been checked for correctness
Examples
>>> from surpyval import fsli_handler
>>> f = [1, 2, 3, 4, 5, 6]
>>> s = [1, 2, 3]
>>> l = [4, 5, 6]
>>> i = [[1, 2], [3, 4]]
>>> fsli_handler(f, s, l, i)
(array([1., 2., 3., 4., 5., 6.]), array([1., 2., 3.]), array([4., 5., 6.]), array([[1., 2.], [3., 4.]]))
- surpyval.utils.fsli_to_xcnt(f=None, s=None, l=None, i=None)
Converts the fsli format to the xcnt format so that the data can be passed to one of the parametric or nonparametric fitters.
- Parameters
f (array) – array of values for which the failure/death was observed
s (array) – array of right censored observation values
l (array) – array of left censored observation values
i (array) – array of length-2 arrays of interval censored data
- Returns
x (array) – sorted array of values of variable for which observations were made.
c (array) – array of censoring values (-1, 0, 1, 2) corresponding to output array x.
n (array) – array of count of observations at each value of output array x and with censoring c.
t (ndarray) – ndarray of truncation values of observations at output array x and with censoring c.
Examples
>>> from surpyval import fsli_to_xcnt
>>> f = [1, 4, 5]
>>> s = [2, 3]
>>> l = []
>>> i = []
>>> x, c, n, t = fsli_to_xcnt(f, s, l, i)
>>> x
array([1, 2, 3, 4, 5])
>>> c
array([0, 1, 1, 0, 0])
>>> n
array([1, 1, 1, 1, 1])
>>> t
array([[-inf, inf], [-inf, inf], [-inf, inf], [-inf, inf], [-inf, inf]])
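The grouping step that turns separate f, s and l lists into x, c and n columns can be sketched as below. This is an illustration, not the library's code: `fsl_to_xcn_sketch` is a hypothetical name, interval censored data and the truncation column are left out for brevity, and ordering ties by censoring flag is an assumption.

```python
from collections import Counter


def fsl_to_xcn_sketch(f=(), s=(), l=()):
    """Group repeated observations into (value, censoring flag, count) rows."""
    counts = Counter()
    for values, flag in ((f, 0), (s, 1), (l, -1)):
        for v in values:
            counts[(v, flag)] += 1
    keys = sorted(counts)  # sort by value, then by flag (assumed tie order)
    x = [value for value, _ in keys]
    c = [flag for _, flag in keys]
    n = [counts[key] for key in keys]
    return x, c, n


print(fsl_to_xcn_sketch(f=[1, 4, 5, 4], s=[2, 3]))
# ([1, 2, 3, 4, 5], [0, 1, 1, 0, 0], [1, 1, 1, 2, 1])
```

The repeated failure at 4 is stored once with a count of 2, which is what the n column is for.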
- surpyval.utils.validate_float_array(arr, name)
Convert input to float array with better error handling.
- surpyval.utils.xcn_to_fsl(x, c=None, n=None)
Converts the xcn format to the fsl format.
- Parameters
x (array) – array of values of variable for which observations were made.
c (array, optional (default: None)) – array of censoring values (-1, 0, 1, 2) corresponding to x. If None, an array of 0s is created corresponding to each x.
n (array, optional (default: None)) – array of count of observations at each x and with censoring c. If None, an array of ones is created.
- Returns
f (array) – array of values for which the failure/death was observed
s (array) – array of right censored observation values
l (array) – array of left censored observation values
Examples
>>> import numpy as np
>>> from surpyval.utils import xcn_to_fsl
>>> x = np.array([1, 2, 3, 4, 5])
>>> c = np.array([0, 1, 1, 0, 0])
>>> n = np.array([1, 1, 1, 1, 1])
>>> f, s, l = xcn_to_fsl(x, c, n)
>>> f
array([1, 4, 5])
>>> s
array([2, 3])
>>> l
array([], dtype=float64)
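Going from grouped x, c, n rows back to flat lists amounts to repeating each value n times and routing it by its censoring flag. The sketch below illustrates this in plain Python and is not the library's implementation; `xcn_to_fsl_sketch` is a hypothetical name, and raising on interval censored rows (c = 2) is this sketch's choice, since how the library treats them is not stated here.

```python
def xcn_to_fsl_sketch(x, c=None, n=None):
    """Expand grouped rows into failure, right censored and left censored lists."""
    c = [0] * len(x) if c is None else c  # default: all uncensored
    n = [1] * len(x) if n is None else n  # default: one observation per row
    out = {0: [], 1: [], -1: []}
    for value, flag, count in zip(x, c, n):
        if flag not in out:
            raise ValueError("only flags -1, 0 and 1 fit the fsl format")
        out[flag].extend([value] * count)
    return out[0], out[1], out[-1]


print(xcn_to_fsl_sketch([1, 2, 3, 4, 5], [0, 1, 1, 0, 0], [1, 1, 1, 2, 1]))
# ([1, 4, 4, 5], [2, 3], [])
```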
- surpyval.utils.xcnt_handler(x=None, c=None, n=None, t=None, xl=None, xr=None, tl=None, tr=None, group_and_sort=True) → tuple[numpy.ndarray[tuple[Any, ...], numpy.dtype[_ScalarT]], numpy.ndarray[tuple[Any, ...], numpy.dtype[_ScalarT]], numpy.ndarray[tuple[Any, ...], numpy.dtype[_ScalarT]], numpy.ndarray[tuple[Any, ...], numpy.dtype[_ScalarT]]]
Main handler that ensures any input to a surpyval fitter meets the requirements to be used in one of the parametric or nonparametric fitters.
- Parameters
x (array) – array of values of variable for which observations were made.
c (array, optional (default: None)) – array of censoring values (-1, 0, 1, 2) corresponding to x
n (array, optional (default: None)) – array of count of observations at each x and with censoring c
t (array, optional (default: None)) – array of values with shape (?, 2) with the left and right value of truncation
xl (array or scalar, optional (default: None)) – array of the values of the left interval of interval censored data. Cannot be used with ‘x’ parameter, must be used with the ‘xr’ parameter
xr (array or scalar, optional (default: None)) – array of the values of the right interval of interval censored data. Cannot be used with ‘x’ parameter, must be used with the ‘xl’ parameter
tl (array or scalar, optional (default: None)) – array of values of the left value of truncation. If scalar, all values will be treated as left truncated by that value. Cannot be used with the ‘t’ parameter but can be used with the ‘tr’ parameter
tr (array or scalar, optional (default: None)) – array of values of the right value of truncation. If scalar, all values will be treated as right truncated by that value. Cannot be used with the ‘t’ parameter but can be used with the ‘tl’ parameter
group_and_sort (bool, optional (default: True)) – whether to group and sort the data. If False, the data will be returned in the order it was entered. This is useful for when validating survival data for which you also have covariates.
- Returns
x (array) – sorted array of values of variable for which observations were made.
c (array) – array of censoring values (-1, 0, 1, 2) corresponding to output array x. If c was None, defaults to an array of zeros the length of x.
n (array) – array of count of observations at output array x and with censoring c. If n was None, count array assumed to be all one observation.
t (array) – array of truncation values of observations at output array x and with censoring c.
Examples
>>> from surpyval import xcnt_handler
>>> x = [1, 2, 3, 4, 5]
>>> c = [0, 0, 1, 1, 1]
>>> n = [1, 1, 1, 1, 1]
>>> t = [[0, 5], [0, 5], [0, 5], [0, 5], [0, 5]]
>>> xcnt_handler(x, c, n, t)
(array([1., 2., 3., 4., 5.]), array([0, 0, 1, 1, 1]), array([1, 1, 1, 1, 1]), array([[0., 5.], [0., 5.], [0., 5.], [0., 5.], [0., 5.]]))
>>> xcnt_handler(x, c, n, tl=0, tr=5)
(array([1., 2., 3., 4., 5.]), array([0, 0, 1, 1, 1]), array([1, 1, 1, 1, 1]), array([[0., 5.], [0., 5.], [0., 5.], [0., 5.], [0., 5.]]))
>>> xl = [1, 2, 3, 4, 5]
>>> xr = [2, 3, 4, 5, 6]
>>> xcnt_handler(xl=xl, xr=xr)
(array([[1., 2.], [2., 3.], [3., 4.], [4., 5.], [5., 6.]]), array([2, 2, 2, 2, 2]), array([1, 1, 1, 1, 1]), array([[-inf, inf], [-inf, inf], [-inf, inf], [-inf, inf], [-inf, inf]]))
- surpyval.utils.xcnt_to_xrd(x, c=None, n=None, t=None, **kwargs)
Converts the xcnt format to the xrd format.
- Parameters
x (array) – array of values of variable for which observations were made.
c (array, optional (default: None)) – array of censoring values (-1, 0, 1, 2) corresponding to x. If None, an array of 0s is created corresponding to each x.
n (array, optional (default: None)) – array of count of observations at each x and with censoring c. If None, an array of ones is created.
kwargs – keywords for truncation; can be either 't' or a combination of 'tl' and 'tr'.
- Returns
x (array) – sorted array of values of variable for which observations were made.
r (array) – array of count of units/people at risk at time x (including if it had an event at ‘x’).
d (array) – array of the count of failures/deaths at each time x.
Examples
>>> import numpy as np
>>> from surpyval.utils import xcnt_to_xrd
>>> x = np.array([1, 2, 3, 4, 5])
>>> c = np.array([0, 1, 1, 0, 0])
>>> n = np.array([1, 1, 1, 1, 1])
>>> x, r, d = xcnt_to_xrd(x, c, n)
>>> x
array([1, 2, 3, 4, 5])
>>> r
array([5, 4, 3, 2, 1])
>>> d
array([1, 0, 0, 1, 1])
>>> # Using left truncated data
>>> x = np.array([1, 2, 3, 4, 5])
>>> tl = np.array([0, 1, 2, 3, 4])
>>> x, r, d = xcnt_to_xrd(x, tl=tl)
>>> x
array([1., 2., 3., 4., 5.])
>>> r
array([2, 2, 2, 2, 1])
>>> d
array([1, 1, 1, 1, 1])
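The left truncated example is easier to follow once the at-risk rule is spelled out. The sketch below is not the library's implementation: `risk_set_with_left_truncation` is a hypothetical name, all observations are assumed uncensored, and the entry rule `tl <= x` was chosen because it matches the documented output for this example.

```python
def risk_set_with_left_truncation(x, tl):
    """Sketch: a unit is at risk at time xu once it has entered and not yet failed."""
    x_unique = sorted(set(x))
    r = [
        sum(1 for xi, ti in zip(x, tl) if ti <= xu <= xi)  # entered, still under observation
        for xu in x_unique
    ]
    d = [sum(1 for xi in x if xi == xu) for xu in x_unique]
    return x_unique, r, d


print(risk_set_with_left_truncation([1, 2, 3, 4, 5], [0, 1, 2, 3, 4]))
# ([1, 2, 3, 4, 5], [2, 2, 2, 2, 1], [1, 1, 1, 1, 1])
```

Because units enter late, r does not start at the total sample size and need not decrease monotonically.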
- surpyval.utils.xrd_handler(x, r, d)
Takes a combination of ‘x’, ‘r’, and ‘d’ arrays and ensures that the data is feasible.
Does not check that r is monotonically decreasing, because r can legitimately increase in some cases, e.g. when there is left truncation (a.k.a. late entry).
- Parameters
x (array) – array of values of variable for which observations were made.
r (array) – array of at risk items at each value of x
d (array) – array of failures / deaths at each value of x
- Returns
x (array) – array of values of variable for which observations were made.
r (array) – array of at risk items at each value of x
d (array) – array of failures / deaths at each value of x
Examples
>>> from surpyval import xrd_handler
>>> x = [1, 2, 3, 4, 5]
>>> r = [5, 4, 3, 2, 1]
>>> d = [1, 1, 1, 1, 1]
>>> x, r, d = xrd_handler(x, r, d)
>>> x
array([1., 2., 3., 4., 5.])
>>> r
array([5, 4, 3, 2, 1])
>>> d
array([1, 1, 1, 1, 1])
- surpyval.utils.xrd_to_xcnt(x, r, d)
Converts the xrd format to the xcnt format. Assumes that there is no right truncation or left censoring.
Note: left truncation cannot be recovered from the xrd format because the at-risk count r collapses per-subject truncation times into a single scalar. Use xcnt format directly when left truncation is present.
- Parameters
x (array) – array of values of variable for which observations were made.
r (array) – array of at risk items at each value of x
d (array) – array of failures / deaths at each value of x
- Returns
x (array) – array of values of variable for which observations were made.
c (array) – array of censoring values (-1, 0, 1, 2) corresponding to x
n (array) – array of count of observations at each x and with censoring c
t (array) – array of values with shape (?, 2) with the left and right value of truncation
Examples
>>> import numpy as np
>>> from surpyval.utils import xrd_to_xcnt
>>> x = np.array([1, 2, 3, 4, 5])
>>> r = np.array([5, 4, 3, 2, 1])
>>> d = np.array([1, 0, 0, 1, 1])
>>> x, c, n, t = xrd_to_xcnt(x, r, d)
>>> x
array([1, 2, 3, 4, 5])
>>> c
array([0, 1, 1, 0, 0])
>>> n
array([1, 1, 1, 1, 1])
>>> t
array([[-inf, inf], [-inf, inf], [-inf, inf], [-inf, inf], [-inf, inf]])
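Under the stated assumptions (no truncation, no left censoring), the number of right censored units at each time can be recovered from how much the at-risk count drops beyond the failures. The following is a minimal sketch of that bookkeeping, not the library's implementation; `xrd_to_xcn_sketch` is a hypothetical name and the truncation column is omitted.

```python
def xrd_to_xcn_sketch(x, r, d):
    """Recover failures (c=0) and right censored counts (c=1) from x, r, d."""
    xs, cs, ns = [], [], []
    for j, (xj, rj, dj) in enumerate(zip(x, r, d)):
        r_next = r[j + 1] if j + 1 < len(r) else 0
        censored = rj - dj - r_next  # units leaving the risk set without failing
        if censored < 0:
            raise ValueError("r increases: late entry cannot be recovered from xrd")
        if dj > 0:
            xs.append(xj); cs.append(0); ns.append(dj)
        if censored > 0:
            xs.append(xj); cs.append(1); ns.append(censored)
    return xs, cs, ns


print(xrd_to_xcn_sketch([1, 2, 3, 4, 5], [5, 4, 3, 2, 1], [1, 0, 0, 1, 1]))
# ([1, 2, 3, 4, 5], [0, 1, 1, 0, 0], [1, 1, 1, 1, 1])
```

The `ValueError` branch reflects the note above: when r grows between rows, the per-subject entry times that caused the growth are no longer available.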