Utilities

Data Classes

Bases: object

Initialize a SurpyvalData instance for survival analysis.

Validates, sorts, and stores survival data in the xcnt format. Supports uncensored, right/left/interval-censored, and truncated observations. Can convert to xrd format and select subsets by censoring type.

Parameters

x (array-like, optional) – The primary data array of failure/event times. When c is 2 the corresponding x entry is a 2-element array [left, right].
c (array-like, optional) – Censoring flags for each value in x: 0 = uncensored, 1 = right censored, -1 = left censored, 2 = interval censored.
n (array-like, optional) – Number of occurrences for each value in x.
t (array-like, optional) – 2D array of truncation bounds [left, right] for each value in x.
xl (array-like, optional) – Left interval bounds for interval censored data. Cannot be used with ‘x’. Must be paired with ‘xr’.
xr (array-like, optional) – Right interval bounds for interval censored data. Cannot be used with ‘x’. Must be paired with ‘xl’.
tl (array-like or scalar, optional) – Left truncation bounds. Cannot be used with ‘t’. Must be paired with ‘tr’.
tr (array-like or scalar, optional) – Right truncation bounds. Cannot be used with ‘t’. Must be paired with ‘tl’.
group_and_sort (bool, default=True) – Whether to group and sort the data. Set False when using covariates to maintain data order.
handle (bool, default=True) – Whether to validate and process the input data. Set False for pre-validated data.

Examples

Basic usage with uncensored data:

>>> import numpy as np
>>> x = np.array([1, 2, 3])
>>> data = SurpyvalData(x)

Right censored data:

>>> x = np.array([1, 2, 3])
>>> c = np.array([0, 1, 1])  # 2 and 3 are censored
>>> data = SurpyvalData(x, c)

Interval censored data:

>>> xl = np.array([1, 2, 3])
>>> xr = np.array([2, 3, 4])
>>> data = SurpyvalData(xl=xl, xr=xr)

Interval censored data with nested two-element arrays:

>>> x = [1, 2, [2, 5], 3, 6]
>>> c = [0, 1, 2, 0, 0]
>>> data = SurpyvalData(x=x, c=c)

With truncation:

>>> x = np.array([1, 2, 3])
>>> t = np.array([[0, 5], [0, 5], [0, 5]])
>>> data = SurpyvalData(x, t=t)

add_covariates(Z: ArrayLike) → None

Add covariates to the data. The covariates are stored in the Z attribute of the object. For regression survival analysis, this keeps the covariates attached to the data in a consistent manner so that both can be passed to the fitters.

Parameters: Z (numpy.ndarray) – The covariate array.

classmethod from_json(source: str | pathlib.Path) → SurpyvalData

Create SurpyvalData instance from JSON string or file path.

Parameters: source (str | Path) – Pass a Path to load from a file; pass a str to parse as JSON text directly.
Returns: New instance created from JSON data
Return type: SurpyvalData

to_json(filepath: str | pathlib.Path | None = None) → str | None

Serialize SurpyvalData to JSON format.

Parameters: filepath (str | Path, optional) – If provided, saves JSON to this file path
Returns: JSON string if no filepath provided, None if saved to file
Return type: str | None

to_xrd(estimator='Nelson-Aalen') → tuple

Converts the data into the xrd format. If the data has right truncated observations, or left or interval censored observations, the conversion uses the Turnbull estimator. The estimator parameter is used with, and only with, the Turnbull estimator.

Parameters: estimator (str, optional) – The method for estimation if data requires the use of the Turnbull estimator to convert to xrd, defaults to “Nelson-Aalen”.
Returns: The xrd data.
Return type: tuple

class surpyval.utils.recurrent_event_data.RecurrentEventData(x, i, c, n, e=None, tl=None, tr=None)

Bases: object

Z: NDArray | None = None

A class to handle and manipulate recurrent event data. Recurrent events are those that can occur more than once for each subject or item.

Examples

>>> import numpy as np
>>> from surpyval import RecurrentEventData
>>> x = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
>>> c = np.array([0, 0, 1, 1, 1, 0, 0, 0, 0, 1])
>>> n = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> i = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
>>> data = RecurrentEventData(x, i, c, n)
>>> data.to_xrd()
(array([1, 2, 3, 4, 5]), array([2, 2, 2, 2, 2]), array([2, 2, 1, 1, 0]))
>>> data[0:2]
RecurrentEventData(
    x=[1 2],
    i=[1 1],
    c=[0 0],
    n=[1 1]
)
>>> data.get_times_to_first_events()
SurpyvalData(
x=[1.],
c=[0],
n=[2],
t=[[-inf  inf]])
>>> data.get_interarrival_times()
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

property event_types: The distinct event types (marks) present in the data, excluding the None mark used for censored / end-of-observation rows. Returns an empty list when the data carries no marks.

get_events_for_item(item)

Get all events for a specific item or subject.

Parameters: item (int or str) – The id of the item or subject.
Returns: A tuple containing event times, censoring information and frequencies for the specified item.
Return type: tuple

get_interarrival_times()

Finds the interarrival times between events for each item. The class assumes that event times are cumulative, but it is sometimes necessary to know the time between consecutive events. The returned array is aligned with the items attribute.

Returns: An array of interarrival times.
Return type: numpy.ndarray
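
The relationship between cumulative event times and interarrival times can be sketched in plain NumPy. This is a minimal illustration, not the library's implementation: interarrival_times is a hypothetical helper, and it assumes rows are time-ordered within each item and that the first gap is measured from 0.

```python
import numpy as np


def interarrival_times(x, i):
    """Hypothetical helper: gaps between consecutive cumulative times per item."""
    x = np.asarray(x, dtype=float)
    i = np.asarray(i)
    out = np.empty_like(x)
    for item in np.unique(i):
        mask = i == item
        # Assumes the rows of each item are already in time order;
        # the first gap is measured from 0 (an assumption of this sketch).
        out[mask] = np.diff(x[mask], prepend=0.0)
    return out


x = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
i = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
gaps = interarrival_times(x, i)  # every gap is 1.0 for this data
```

The result has one entry per input row, in the same order as the item ids.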

get_previous_x(min_x=0)

Finds the previous event time for each event, which is useful for calculating the time since the last event. The returned array is aligned with the items attribute.

Parameters: min_x (float, optional) – Fallback minimum for the first event of each item. The item’s left truncation bound is used instead when it is greater, so a delayed-entry item’s first interval begins at its entry time.
Returns: An array of previous event times.
Return type: numpy.ndarray
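
The min_x fallback described above can be sketched as follows. The helper previous_x and its tl argument are hypothetical names used for illustration only; the sketch assumes rows are time-ordered within each item and that tl is constant within an item.

```python
import numpy as np


def previous_x(x, i, tl=None, min_x=0.0):
    """Hypothetical helper: previous event time for every row."""
    x = np.asarray(x, dtype=float)
    i = np.asarray(i)
    # No left truncation means an entry time of -inf for every row.
    tl = np.full(x.shape, -np.inf) if tl is None else np.asarray(tl, dtype=float)
    prev = np.empty_like(x)
    for item in np.unique(i):
        rows = np.flatnonzero(i == item)
        # Later rows look back to the preceding row of the same item.
        prev[rows[1:]] = x[rows[:-1]]
        # The first row starts at min_x, or at the entry time if that is later.
        prev[rows[0]] = max(min_x, tl[rows[0]])
    return prev


x = np.array([3.0, 6.0, 9.0, 4.0, 7.0])
i = np.array([1, 1, 1, 2, 2])
tl = np.array([-np.inf, -np.inf, -np.inf, 2.0, 2.0])  # item 2 enters at time 2
prev = previous_x(x, i, tl)  # [0., 3., 6., 2., 4.]
```

Item 1 has no delayed entry, so its first interval starts at min_x; item 2 enters at time 2, so its first interval starts there.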

get_right_truncation_close()

Per-item integration bounds for the NHPP likelihood’s right window-close.

The NHPP integral runs from each item’s entry time (its left truncation tl, handled by get_previous_x()) to the time its observation window closes. Historically that close was only known from an explicit right-censoring (c=1) row, so the integral stopped at the item’s last recorded time. When an item instead carries a finite right-truncation time tr the window closes there, and the integral must be extended from the last in-window time out to tr.

Returns three aligned arrays (x_last, x_close, rep_idx) with one entry per item whose tr is finite: x_last is the item’s last in-window time (its last event or right-censoring row), x_close is its tr, and rep_idx is a representative row index for the item (so per-item covariates Z can be gathered). Items with the default tr = inf are omitted, so untruncated data yields empty arrays and contributes nothing to the integral. Adding cif(x_last) - cif(x_close) to the log-likelihood therefore extends the telescoped integral to tr (and is exactly zero when a c=1 row already sits at tr, so the two ways of closing the window never double-count).

Returns: (x_last, x_close, rep_idx) as described above.
Return type: tuple of numpy.ndarray
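
A minimal sketch of how the three aligned arrays could be assembled, under the assumption that tr is constant within an item. The function right_truncation_close below is a hypothetical stand-alone helper, not the method itself.

```python
import numpy as np


def right_truncation_close(x, i, tr):
    """Hypothetical helper returning (x_last, x_close, rep_idx) per item."""
    x = np.asarray(x, dtype=float)
    i = np.asarray(i)
    tr = np.asarray(tr, dtype=float)
    x_last, x_close, rep_idx = [], [], []
    for item in np.unique(i):
        rows = np.flatnonzero(i == item)
        close = tr[rows[0]]  # assumed constant within an item
        if np.isfinite(close):
            x_last.append(x[rows].max())  # last in-window time
            x_close.append(close)         # where the window closes
            rep_idx.append(rows[0])       # representative row for covariates
    return np.array(x_last), np.array(x_close), np.array(rep_idx, dtype=int)


x = np.array([2.0, 5.0, 3.0, 8.0])
i = np.array([1, 1, 2, 2])
tr = np.array([10.0, 10.0, np.inf, np.inf])  # only item 1 is right truncated
x_last, x_close, rep_idx = right_truncation_close(x, i, tr)
```

Only item 1 appears in the output; item 2 keeps the default tr = inf and is omitted.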

get_times_to_first_events()

Get the times to the first events for each item or subject. In the estimation of recurrent or renewal events it can be helpful to know the distribution of the times to the first event per item. The returned data is aligned with the items attribute.

Returns: A transformed dataset containing times to the first events.
Return type: SurpyvalData

to_cause_specific_xrd(cause)

Convert the recurrent event data to xrd format for a single event type (cause). The at-risk set r is shared across all causes (an item remains at risk for every cause until it leaves observation); only the event count d is restricted to the requested cause.

Parameters: cause (object) – The event type to compute the cause-specific counts for. Must be one of self.event_types.
Returns: A tuple (x_unique, r, d_cause) where d_cause counts only events of the requested cause and r is the shared at-risk set.
Return type: tuple
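
The idea of a shared at-risk set with cause-restricted event counts can be sketched in NumPy. This is an illustration under simplifying assumptions (no truncation, every item at risk from the start until its last recorded time); cause_specific_xrd is a hypothetical helper and the data below is made up.

```python
import numpy as np


def cause_specific_xrd(x, i, c, e, cause):
    """Hypothetical helper: shared r, event counts d for a single cause."""
    x = np.asarray(x, dtype=float)
    i = np.asarray(i)
    c = np.asarray(c)
    e = np.asarray(e, dtype=object)
    x_unique = np.unique(x)
    # Each item stays at risk, for every cause, up to its last recorded time.
    last = np.array([x[i == item].max() for item in np.unique(i)])
    r = np.array([(last >= t).sum() for t in x_unique])
    # Only uncensored rows carrying the requested mark count as events.
    is_cause = (c == 0) & np.array([mark == cause for mark in e])
    d = np.array([(is_cause & (x == t)).sum() for t in x_unique])
    return x_unique, r, d


x = [1, 3, 5, 2, 4]
i = [1, 1, 1, 2, 2]
c = [0, 0, 1, 0, 1]
e = ["a", "b", None, "a", None]  # None marks the censoring rows
xa, ra, da = cause_specific_xrd(x, i, c, e, "a")
xb, rb, db = cause_specific_xrd(x, i, c, e, "b")
```

Both calls return the same r; only d differs between causes, and the per-cause counts sum to the total number of events.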

to_xrd(estimator='Nelson-Aalen')

Convert the recurrent event data to xrd format.

Parameters: estimator (str, optional) – The estimator to use, defaults to “Nelson-Aalen”.
Returns: A tuple containing unique event times, risk set sizes, and the event counts.
Return type: tuple

Data Wrangling Utilities

surpyval.utils.coerce_xcnt_x(x) → NDArray

Coerce the x variable of xcnt-format data into a numpy array.

Accepts a 1D array of event values, or a 2D array / list-of-pairs of [left, right] interval bounds. Validates dimensionality, the interval ordering (left <= right) and the absence of NaNs. Shared by the univariate (xcnt_handler) and recurrent (handle_xicn) handlers.
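
The checks listed above can be sketched as follows. coerce_x is a hypothetical stand-in for illustration; the error messages and the exact order of the checks are assumptions.

```python
import numpy as np


def coerce_x(x):
    """Hypothetical sketch of the validation steps described above."""
    arr = np.asarray(x, dtype=float)
    if np.isnan(arr).any():
        raise ValueError("x must not contain NaN")
    if arr.ndim == 1:
        return arr  # plain event values
    if arr.ndim == 2 and arr.shape[1] == 2:
        # Each row is an interval [left, right].
        if np.any(arr[:, 0] > arr[:, 1]):
            raise ValueError("interval bounds must satisfy left <= right")
        return arr
    raise ValueError("x must be 1D, or 2D with two columns")


points = coerce_x([1, 2, 3])
intervals = coerce_x([[1, 2], [3, 4]])
```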

surpyval.utils.format_truncation(t, tl, tr, n_rows) → NDArray: Build the (n_rows, 2) truncation array from either a t matrix or separate tl/tr bounds (scalars broadcast to all rows). The default window is the whole real line [-inf, inf]. Shared by xcnt_handler and handle_xicn.
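
A minimal sketch of the broadcasting behaviour described above. build_truncation is a hypothetical helper written for illustration; its error handling is an assumption.

```python
import numpy as np


def build_truncation(t=None, tl=None, tr=None, n_rows=0):
    """Hypothetical sketch: build an (n_rows, 2) truncation array."""
    if t is not None:
        if tl is not None or tr is not None:
            raise ValueError("use either t or tl/tr, not both")
        t = np.asarray(t, dtype=float)
        if t.shape != (n_rows, 2):
            raise ValueError("t must have shape (n_rows, 2)")
        return t
    out = np.empty((n_rows, 2))
    # Scalars broadcast to every row; arrays must have n_rows entries.
    out[:, 0] = -np.inf if tl is None else tl
    out[:, 1] = np.inf if tr is None else tr
    return out


window = build_truncation(tl=0, tr=5, n_rows=3)  # three rows of [0., 5.]
default = build_truncation(n_rows=2)             # two rows of [-inf, inf]
```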

surpyval.utils.fs_to_xrd(f, s)

Converts the fs format to the xrd format.

Parameters

f (array) – array of values for which the failure/death was observed
s (array) – array of right censored observation values

Returns

x (array) – sorted array of values of variable for which observations were made.
r (array) – array of count of units/people at risk at time x (including if it had an event at ‘x’).
d (array) – array of the count of failures/deaths at each time x.

Examples

>>> from surpyval import fs_to_xrd
>>> f = [1, 4, 5]
>>> s = [2, 3]
>>> x, r, d = fs_to_xrd(f, s)
>>> x, r, d
(array([1, 2, 3, 4, 5]), array([5, 4, 3, 2, 1]), array([1, 0, 0, 1, 1]))
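
To make the meaning of r and d concrete, the same result can be reproduced with a short NumPy sketch. fs_to_xrd_sketch is a hypothetical helper for illustration, not the library's implementation.

```python
import numpy as np


def fs_to_xrd_sketch(f, s):
    """Hypothetical sketch of the fs -> xrd conversion."""
    f = np.asarray(f, dtype=float)
    s = np.asarray(s, dtype=float)
    times = np.concatenate([f, s])
    x = np.unique(times)
    # At risk at x: every unit observed at or after x.
    r = np.array([(times >= xi).sum() for xi in x])
    # Deaths at x: only the failure times count.
    d = np.array([(f == xi).sum() for xi in x])
    return x, r, d


x, r, d = fs_to_xrd_sketch([1, 4, 5], [2, 3])
```

For this input the sketch gives r = [5, 4, 3, 2, 1] and d = [1, 0, 0, 1, 1], matching the example above.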

surpyval.utils.fsli_handler(f=None, s=None, l=None, i=None)

Takes in the fsli format and ensures that the data is correctly defined. Takes an assorted combination of f, s, l, and i and returns them in the correct format as numpy arrays.

Parameters

f (array-like, optional (default: None)) – array of values for which the failure/death was observed
s (array-like, optional (default: None)) – array of right censored observation values
l (array-like, optional (default: None)) – array of left censored observation values
i (array-like, optional (default: None)) – array of length-2 arrays of interval censored data

Returns

f (array) – array of values for which the failure/death was observed that have been checked for correctness
s (array) – array of right censored observation values that have been checked for correctness
l (array) – array of left censored observation values that have been checked for correctness
i (array) – array of interval censored data that have been checked for correctness

Examples

>>> from surpyval import fsli_handler
>>> f = [1, 2, 3, 4, 5, 6]
>>> s = [1, 2, 3]
>>> l = [4, 5, 6]
>>> i = [[1, 2], [3, 4]]
>>> fsli_handler(f, s, l, i)
(array([1., 2., 3., 4., 5., 6.]),
array([1., 2., 3.]),
array([4., 5., 6.]),
array([[1., 2.],
        [3., 4.]]))

surpyval.utils.fsli_to_xcnt(f=None, s=None, l=None, i=None)

Converts the fsli format to the xcnt format so that the data can be passed to one of the parametric or nonparametric fitters.

Parameters

f (array) – array of values for which the failure/death was observed
s (array) – array of right censored observation values
l (array) – array of left censored observation values
i (array) – array of length-2 arrays of interval censored data

Returns

x (array) – sorted array of values of variable for which observations were made.
c (array) – array of censoring values (-1, 0, 1, 2) corresponding to output array x.
n (array) – array of count of observations at output array x and with censoring c.
t (ndarray) – ndarray of truncation values of observations at output array x and with censoring c.

Examples

>>> from surpyval import fsli_to_xcnt
>>> f = [1, 4, 5]
>>> s = [2, 3]
>>> l = []
>>> i = []
>>> x, c, n, t = fsli_to_xcnt(f, s, l, i)
>>> x
array([1, 2, 3, 4, 5])
>>> c
array([0, 1, 1, 0, 0])
>>> n
array([1, 1, 1, 1, 1])
>>> t
array([[-inf,  inf],
       [-inf,  inf],
       [-inf,  inf],
       [-inf,  inf],
       [-inf,  inf]])
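
The grouping step can be sketched in NumPy for the f, s and l inputs. fsl_to_xcnt_sketch is a hypothetical helper; it leaves out interval censored data, and the ordering of rows that share an x value but differ in c is an assumption.

```python
import numpy as np


def fsl_to_xcnt_sketch(f=(), s=(), l=()):
    """Hypothetical sketch: group f/s/l values into x, c, n, t."""
    xs, cs = [], []
    for values, flag in ((f, 0), (s, 1), (l, -1)):
        values = np.asarray(values, dtype=float)
        xs.append(values)
        cs.append(np.full(values.shape, flag))
    pairs = np.column_stack([np.concatenate(xs), np.concatenate(cs)])
    # Identical (x, c) pairs collapse into one row with a count.
    pairs, n = np.unique(pairs, axis=0, return_counts=True)
    x = pairs[:, 0]
    c = pairs[:, 1].astype(int)
    # No truncation information exists in this format: use the whole real line.
    t = np.tile([-np.inf, np.inf], (x.size, 1))
    return x, c, n, t


x, c, n, t = fsl_to_xcnt_sketch(f=[1, 4, 5], s=[2, 3])
```

With repeated values, for example f=[1, 1, 4] and s=[1, 3], the two failures at 1 collapse into a single row with n = 2.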

surpyval.utils.validate_float_array(arr, name): Convert input to float array with better error handling.

surpyval.utils.xcn_to_fsl(x, c=None, n=None)

Converts the xcn format to the fsl format.

Parameters

x (array) – array of values of variable for which observations were made.
c (array, optional (default: None)) – array of censoring values (-1, 0, 1, 2) corresponding to x. If None, an array of 0s is created corresponding to each x.
n (array, optional (default: None)) – array of count of observations at each x and with censoring c. If None, an array of ones is created.

Returns

f (array) – array of values for which the failure/death was observed
s (array) – array of right censored observation values
l (array) – array of left censored observation values

Examples

>>> import numpy as np
>>> from surpyval.utils import xcn_to_fsl
>>> x = np.array([1, 2, 3, 4, 5])
>>> c = np.array([0, 1, 1, 0, 0])
>>> n = np.array([1, 1, 1, 1, 1])
>>> f, s, l = xcn_to_fsl(x, c, n)
>>> f
array([1, 4, 5])
>>> s
array([2, 3])
>>> l
array([], dtype=float64)
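
When n holds counts greater than one, each value has to be expanded back into repeated observations. A minimal sketch of that expansion, using a hypothetical helper xcn_to_fsl_sketch (not the library's code):

```python
import numpy as np


def xcn_to_fsl_sketch(x, c=None, n=None):
    """Hypothetical sketch of the xcn -> fsl expansion."""
    x = np.asarray(x, dtype=float)
    c = np.zeros(x.shape, dtype=int) if c is None else np.asarray(c)
    n = np.ones(x.shape, dtype=int) if n is None else np.asarray(n)
    # Repeat each value n times within its censoring class.
    f = np.repeat(x[c == 0], n[c == 0])
    s = np.repeat(x[c == 1], n[c == 1])
    l = np.repeat(x[c == -1], n[c == -1])
    return f, s, l


f, s, l = xcn_to_fsl_sketch([1, 2, 3], c=[0, 1, -1], n=[2, 1, 3])
# f -> [1., 1.], s -> [2.], l -> [3., 3., 3.]
```

Rows flagged with c = 2 are ignored by this sketch because the fsl format has no slot for intervals.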

surpyval.utils.xcnt_handler(x=None, c=None, n=None, t=None, xl=None, xr=None, tl=None, tr=None, group_and_sort=True) → tuple[NDArray, NDArray, NDArray, NDArray]

Main handler that ensures any input to a surpyval fitter meets the requirements to be used in one of the parametric or nonparametric fitters.

Parameters

x (array) – array of values of variable for which observations were made.
c (array, optional (default: None)) – array of censoring values (-1, 0, 1, 2) corresponding to x
n (array, optional (default: None)) – array of count of observations at each x and with censoring c
t (array, optional (default: None)) – array of values with shape (?, 2) with the left and right value of truncation
xl (array or scalar, optional (default: None)) – array of the values of the left interval of interval censored data. Cannot be used with ‘x’ parameter, must be used with the ‘xr’ parameter
xr (array or scalar, optional (default: None)) – array of the values of the right interval of interval censored data. Cannot be used with ‘x’ parameter, must be used with the ‘xl’ parameter
tl (array or scalar, optional (default: None)) – array of values of the left value of truncation. If scalar, all values will be treated as left truncated by that value. Cannot be used with the ‘t’ parameter, but can be used with the ‘tr’ parameter
tr (array or scalar, optional (default: None)) – array of values of the right value of truncation. If scalar, all values will be treated as right truncated by that value. Cannot be used with the ‘t’ parameter, but can be used with the ‘tl’ parameter
group_and_sort (bool, optional (default: True)) – whether to group and sort the data. If False, the data will be returned in the order it was entered. This is useful for when validating survival data for which you also have covariates.

Returns

x (array) – sorted array of values of variable for which observations were made.
c (array) – array of censoring values (-1, 0, 1, 2) corresponding to output array x. If c was None, defaults to creating array of zeros the length of x.
n (array) – array of count of observations at output array x and with censoring c. If n was None, count array assumed to be all one observation.
t (array) – array of truncation values of observations at output array x and with censoring c.

Examples

>>> from surpyval import xcnt_handler
>>> x = [1, 2, 3, 4, 5]
>>> c = [0, 0, 1, 1, 1]
>>> n = [1, 1, 1, 1, 1]
>>> t = [[0, 5], [0, 5], [0, 5], [0, 5], [0, 5]]
>>> xcnt_handler(x, c, n, t)
(array([1., 2., 3., 4., 5.]),
array([0, 0, 1, 1, 1]),
array([1, 1, 1, 1, 1]),
array([[0., 5.],
        [0., 5.],
        [0., 5.],
        [0., 5.],
        [0., 5.]]))
>>> xcnt_handler(x, c, n, tl=0, tr=5)
(array([1., 2., 3., 4., 5.]),
array([0, 0, 1, 1, 1]),
array([1, 1, 1, 1, 1]),
array([[0., 5.],
        [0., 5.],
        [0., 5.],
        [0., 5.],
        [0., 5.]]))
>>> xl = [1, 2, 3, 4, 5]
>>> xr = [2, 3, 4, 5, 6]
>>> xcnt_handler(xl=xl, xr=xr)
(array([[1., 2.],
        [2., 3.],
        [3., 4.],
        [4., 5.],
        [5., 6.]]),
array([2, 2, 2, 2, 2]),
array([1, 1, 1, 1, 1]),
array([[-inf,  inf],
        [-inf,  inf],
        [-inf,  inf],
        [-inf,  inf],
        [-inf,  inf]]))

surpyval.utils.xcnt_to_xrd(x, c=None, n=None, t=None, **kwargs)

Converts the xcnt format to the xrd format.

Parameters

x (array) – array of values of variable for which observations were made.
c (array, optional (default: None)) – array of censoring values (-1, 0, 1, 2) corresponding to x. If None, an array of 0s is created corresponding to each x.
n (array, optional (default: None)) – array of count of observations at each x and with censoring c. If None, an array of ones is created.
kwargs – keywords for truncation; can be either 't' or a combination of 'tl' and 'tr'.

Returns

x (array) – sorted array of values of variable for which observations were made.
r (array) – array of count of units/people at risk at time x (including if it had an event at ‘x’).
d (array) – array of the count of failures/deaths at each time x.

Examples

>>> import numpy as np
>>> from surpyval.utils import xcnt_to_xrd
>>> x = np.array([1, 2, 3, 4, 5])
>>> c = np.array([0, 1, 1, 0, 0])
>>> n = np.array([1, 1, 1, 1, 1])
>>> x, r, d = xcnt_to_xrd(x, c, n)
>>> x
array([1, 2, 3, 4, 5])
>>> r
array([5, 4, 3, 2, 1])
>>> d
array([1, 0, 0, 1, 1])
>>> # Using left truncated data
>>> x = np.array([1, 2, 3, 4, 5])
>>> tl = np.array([0, 1, 2, 3, 4])
>>> x, r, d = xcnt_to_xrd(x, tl=tl)
>>> x
array([1., 2., 3., 4., 5.])
>>> r
array([2, 2, 2, 2, 1])
>>> d
array([1, 1, 1, 1, 1])
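
The at-risk counts in the left truncated example can be reproduced with a short sketch. at_risk_with_late_entry is a hypothetical helper; it counts a unit as at risk at time t when tl <= t <= x, a convention chosen because it matches the output shown above. Whether the library treats an entry exactly at t this way is an assumption.

```python
import numpy as np


def at_risk_with_late_entry(x, tl):
    """Hypothetical sketch: xrd counts for uncensored, left truncated data."""
    x = np.asarray(x, dtype=float)
    tl = np.asarray(tl, dtype=float)
    x_unique = np.unique(x)
    # A unit is at risk at t once it has entered (tl <= t) and not yet failed (x >= t).
    r = np.array([((tl <= t) & (x >= t)).sum() for t in x_unique])
    d = np.array([(x == t).sum() for t in x_unique])
    return x_unique, r, d


x_u, r, d = at_risk_with_late_entry([1, 2, 3, 4, 5], [0, 1, 2, 3, 4])
```

Here r is [2, 2, 2, 2, 1] rather than [5, 4, 3, 2, 1], because units that have not yet entered observation are not counted.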

surpyval.utils.xrd_handler(x, r, d)

Takes a combination of ‘x’, ‘r’, and ‘d’ arrays and ensures that the data is feasible.

Does not check that r is monotonically decreasing, because r can increase in some cases, e.g. when there is left truncation (a.k.a. late entry).

Parameters

x (array) – array of values of variable for which observations were made.
r (array) – array of at risk items at each value of x
d (array) – array of failures / deaths at each value of x

Returns

x (array) – array of values of variable for which observations were made.
r (array) – array of at risk items at each value of x
d (array) – array of failures / deaths at each value of x

Examples

>>> from surpyval import xrd_handler
>>> x = [1, 2, 3, 4, 5]
>>> r = [5, 4, 3, 2, 1]
>>> d = [1, 1, 1, 1, 1]
>>> x, r, d = xrd_handler(x, r, d)
>>> x
array([1., 2., 3., 4., 5.])
>>> r
array([5, 4, 3, 2, 1])
>>> d
array([1, 1, 1, 1, 1])
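
A sketch of what a feasibility check on xrd data might look like. The specific checks in check_xrd are assumptions made for illustration; the only behaviour taken from the description above is that a growing r is not rejected.

```python
import numpy as np


def check_xrd(x, r, d):
    """Hypothetical feasibility checks for xrd data."""
    x = np.asarray(x, dtype=float)
    r = np.asarray(r, dtype=int)
    d = np.asarray(d, dtype=int)
    if x.ndim != 1 or not (x.shape == r.shape == d.shape):
        raise ValueError("x, r and d must be 1D arrays of equal length")
    if np.any(r < 0) or np.any(d < 0):
        raise ValueError("counts cannot be negative")
    if np.any(d > r):
        raise ValueError("cannot have more deaths than units at risk")
    # Deliberately no monotonicity check on r: late entry can grow the risk set.
    return x, r, d


# The risk set grows from 2 to 3 at the second time; this is accepted.
x, r, d = check_xrd([1, 2, 3], [2, 3, 1], [1, 1, 1])
```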

surpyval.utils.xrd_to_xcnt(x, r, d)

Converts the xrd format to the xcnt format. Assumes that there is no right truncation or left censoring.

Note: left truncation cannot be recovered from the xrd format because the at-risk count r collapses per-subject truncation times into a single scalar. Use xcnt format directly when left truncation is present.

Parameters

x (array) – array of values of variable for which observations were made.
r (array) – array of at risk items at each value of x
d (array) – array of failures / deaths at each value of x

Returns

x (array) – array of values of variable for which observations were made.
c (array) – array of censoring values (-1, 0, 1, 2) corresponding to x
n (array) – array of count of observations at each x and with censoring c
t (array) – array of values with shape (?, 2) with the left and right value of truncation

Examples

>>> import numpy as np
>>> from surpyval.utils import xrd_to_xcnt
>>> x = np.array([1, 2, 3, 4, 5])
>>> r = np.array([5, 4, 3, 2, 1])
>>> d = np.array([1, 0, 0, 1, 1])
>>> x, c, n, t = xrd_to_xcnt(x, r, d)
>>> x
array([1, 2, 3, 4, 5])
>>> c
array([0, 1, 1, 0, 0])
>>> n
array([1, 1, 1, 1, 1])
>>> t
array([[-inf,  inf],
       [-inf,  inf],
       [-inf,  inf],
       [-inf,  inf],
       [-inf,  inf]])
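
The reverse conversion rests on one observation: units that leave the risk set at a time without failing must have been right censored there. A minimal sketch of that bookkeeping follows. xrd_to_xcnt_sketch is a hypothetical helper, and raising an error when the risk set grows is a choice made for this sketch, not documented library behaviour.

```python
import numpy as np


def xrd_to_xcnt_sketch(x, r, d):
    """Hypothetical sketch of the xrd -> xcnt conversion."""
    x = np.asarray(x, dtype=float)
    r = np.asarray(r, dtype=int)
    d = np.asarray(d, dtype=int)
    # Units leaving the risk set at x without failing are right censored.
    censored = r - d - np.append(r[1:], 0)
    if np.any(censored < 0):
        raise ValueError("risk set grows: left truncation cannot be recovered")
    xs = np.concatenate([x[d > 0], x[censored > 0]])
    cs = np.concatenate([np.zeros((d > 0).sum(), dtype=int),
                         np.ones((censored > 0).sum(), dtype=int)])
    ns = np.concatenate([d[d > 0], censored[censored > 0]])
    order = np.lexsort((cs, xs))  # sort by x, then by censoring flag
    xs, cs, ns = xs[order], cs[order], ns[order]
    t = np.tile([-np.inf, np.inf], (xs.size, 1))
    return xs, cs, ns, t


x, c, n, t = xrd_to_xcnt_sketch([1, 2, 3, 4, 5], [5, 4, 3, 2, 1], [1, 0, 0, 1, 1])
```

For this input the censored counts are [0, 1, 1, 0, 0], which gives c = [0, 1, 1, 0, 0] as in the example above.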