Survival Curves Projection and Benefit Time Points Estimation using a New Statistical Method

Abstract: Survival analysis concerns the analysis of time-to-event data and is essential in fields such as oncology, where the survival function, S(t), is usually calculated. In the presence of competing risks (competing events), however, other statistical concepts and methods must be introduced, such as the cumulative incidence function, CI(t), defined as the proportion of subjects with an event time less than or equal to t. The present study describes a methodology that makes it possible to obtain the shape of CI(t) curves numerically and to estimate the benefit time points (BTP), defined as the times t at which 90, 95 or 99% of the maximum value of CI(t) is reached. Once the numerical function of CI(t) has been obtained, it can be projected to infinite time, with all the limitations that this entails. For this task the R function Weibull.cumulative.incidence() is proposed. In a first step, this function transforms the survival function S(t), obtained using the Kaplan-Meier method, into CI(t). In a second step, the best-fit function of CI(t) is calculated in order to estimate the BTP using one of two procedures: 1) a parametric function, which estimates a four-parameter Weibull growth curve by means of a non-linear regression (nls) procedure, or 2) a non-parametric method, using Local Polynomial Regression (LPR) or LOESS fitting. Two examples are developed using the Weibull.cumulative.incidence() function in order to present the method. The methodology will be useful for tracking the evolution of diseases more closely (especially in the presence of competing risks) and for projecting curves to infinite time, and it may help to identify the causes of current trends in diseases such as cancer.
We think that BTP can be important in major diseases such as heart disease or cancer for locating the inflection point of the disease or of its associated treatment, for anticipating the course of the disease, and for changing treatments at those points. These points can also be important for making medical decisions.


INTRODUCTION
The branch of statistics that deals with the analysis of the time until one or more events happen, such as death in biological organisms and failure in mechanical systems [1], is known as survival analysis [2]. It concerns time-to-event data and is essential in fields such as oncology, where the survival function, S(t), is usually calculated.
In standard survival data, subjects experience only one type of event over follow-up, such as death from a disease (e.g. cancer). In practice, subjects can potentially experience more than one type of event (e.g. elderly patients in an oncology department could die from a heart attack, from breast cancer, or even in a traffic accident). When only one of these different types of event can occur, we refer to them as "competing events" [3]. In this case, the events compete with each other to deliver the event of interest (e.g. death due to illness), and the occurrence of one type of event prevents the occurrence of the others. As a result, we call the probabilities of these events "competing risks" [4,5], in the sense that the probability of each competing event is regulated by the other competing events, which gives an interpretation suitable for describing a survival process determined by multiple types of event [3].
*Address correspondence to this author at the GRBIO (Research Group in Biostatistics and Bioinformatics), BIOST3, Section of Statistics, Department of Genetics, Microbiology and Statistics, University of Barcelona, Spain; Tel: 003494.402. 15.60; E-mail: amonleong@ub.edu
In the presence of competing risks (competing events), it is necessary to introduce further statistical concepts and methods for the analysis of survival data. The cumulative incidence, CI(t), is defined as the proportion of subjects with an event time less than or equal to t [4].
In this field, the cumulative incidence function, CI(t), is defined as the probability that a particular time-related event, such as the occurrence of a particular disease, has occurred before a given time. It is equivalent to the incidence calculated over a period of time during which all the individuals in the population are considered to be at risk for the outcome, and it is sometimes also referred to as the incidence proportion. However, depending on the evolution of the disease [6], not all events occur at the same moment or at the same speed, so it is of interest to assess possible benefit time points (BTP) at which the disease could become stable or change.

Survival Analysis
The survival function S(t) describes a "time to event" outcome variable.
A time to event variable, t, reflects the time until a participant has an event of interest (e.g. heart attack, cancer remission, death, cure, etc.). Because of the unique features of time to event variables, their statistical analysis requires special techniques [1], different from those used for other types of outcomes. The statistical analysis of these variables is called time to event analysis or survival analysis [6], even though the outcome is not always death. What we mean by "survival" in this context is remaining free of a particular outcome over time.
The survival function, S(t), of an individual is the probability that the individual survives beyond time t:

$$S(t) = P(T > t)$$

where t is a time of interest and T is the time of the event.
The survival curve S(t) is non-increasing (the event may not reoccur for an individual) and is bounded within [0,1]. Note that the event might not happen within the period of study; this is called right-censoring (see Figure 1). Typical questions of interest in survival analysis are: What is the probability that a participant survives 10 or 20 years? Are there differences in survival between groups (e.g. between those assigned to a new versus a standard drug in a clinical trial)? How do certain personal, behavioural or clinical characteristics affect participants' chances of survival? [1]

Survival Time t as a Random Variable
The study of a random variable T, the "time until an event", is known as survival analysis. This analysis requires a specific methodology, since follow-up of T frequently ends before the event occurs and patients do not enter the study at the same time [7].
The event considered is not, for example, whether or not death occurs, but death related to the disease. If a death unrelated to the disease is counted as an event, an information bias occurs, so a patient who died from a cause unrelated to the event of interest should be considered censored, and the follow-up time should be computed as incomplete or lost.
The event studied must also be precisely defined so that the date of the event can be determined exactly. This event is almost always the death of the patient, but it does not have to be: it can also refer to the discharge date, the date of remission of the disease, the date on which a clinical event occurs (e.g. a cardiovascular event) [6], the date of relapse or failure, etc.
From the clinical point of view, survival can be defined as:
• Disease-free survival: time during which the patient is free of any evidence of illness. It is applicable to patients who undergo radical treatment, and it ends the moment a relapse occurs. If the patient presents advanced disease, the concept of disease-free survival is not applicable and the duration of the response is used instead.
• Event-free survival:
• Global survival: time from the start of the study treatment to death or to the last known data, in case of abandonment or loss to follow-up.
One of the objectives of these techniques is to infer the relationship between T and the explanatory variables X of the model, which are known and controlled by the researcher in the study. The variable T is not normally distributed and can follow, for example, an exponential, Weibull, log-normal or log-logistic distribution.
Differences between the factors studied in a survival analysis can be assessed using parametric and non-parametric techniques. A summary is:
• Parametric:
- Exponential distribution.
- Cox regression (semi-parametric method).
Time to event is measured in the real world, so there are practical constraints on the period of study and on how to treat individuals who fall outside that period. Censoring occurs when the event of interest (death, relapse, cure, failure, etc.) happens outside the study period, whereas truncation is due to the study design.
It is sometimes unknown whether or not the patient has presented the event studied (death, relapse, etc.). Such observations are known as censored data [7,8].
There are several types of censoring:
• Type I censoring: the most common. The study has a limited duration. If the time until the event occurs in the patient is less than the set time, the observed time is taken; otherwise the time until the end of the study is taken.
• Type II censoring: the study ends when the event has occurred in a given number of individuals.
• Random censoring: in type I censoring the time until the event is observed when it is less than or equal to a constant. In random censoring the limit is not a constant but a random variable d, which accounts for causes not considered in the experiment that produce the censoring. The failure time is observed when T < d.
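The censoring rules listed above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not part of the author's R code; the function name `censor` and the example times are hypothetical. Type I censoring is obtained by giving every subject the same limit, and random censoring by giving each subject its own limit d.

```python
def censor(event_times, censor_times):
    """Return (observed_time, event_indicator) pairs.

    event_times  : true time to event T for each subject
    censor_times : censoring limit for each subject (a constant for
                   type I censoring, a random value d for random censoring)
    Indicator is 1 if the event was observed, 0 if the time is censored.
    """
    observed = []
    for t, d in zip(event_times, censor_times):
        if t <= d:
            observed.append((t, 1))   # event seen before the limit
        else:
            observed.append((d, 0))   # only known that T exceeds d
    return observed


# Hypothetical true event times (weeks) and a study that ends at week 52.
true_times = [12, 30, 45, 70, 110]
type1 = censor(true_times, [52] * len(true_times))
print(type1)
```

Type II censoring would instead stop the study at the time of the r-th observed event, so the limit is determined by the data rather than fixed in advance.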
The survival function S(t) (Figure 1) is defined as the probability that a patient survives beyond time t, where T is the survival time variable. It is a non-increasing function that satisfies S(0) = 1 and S(t) → 0 as t → ∞.

Kaplan-Meier Method
The Kaplan-Meier method is a widely used non-parametric method for estimating the survival function [9,10]. It does not assume any probability distribution and is obtained by maximising the sample likelihood function. Right-censoring is allowed (but not truncation). We start with a random sample of size n drawn from a population, in which events are observed at k (k ≤ n) distinct times $t_1 < t_2 < \dots < t_k$. At each time $t_i$ there are $n_i$ individuals at risk and $d_i$ events are observed [1]. This gives a maximum-likelihood estimate of the survival function S(t), the Kaplan-Meier product-limit estimator [10], defined as

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right) \qquad (3)$$

where $d_i$ is the number of events of interest (e.g. deaths, cures, etc.) and $n_i$ the number of individuals at risk at time $t_i$. The cumulative product (equation 3) gives a non-increasing survival curve S(t): at any time t during the study, it is the estimated probability of surviving from the start to that time t. A frequently used summary of survival is the median survival time (half-life).
The Kaplan-Meier method recalculates survival each time an event of interest occurs:

$$\hat{S}(t_j) = \hat{S}(t_{j-1}) \left(1 - \frac{d_j}{n_j}\right), \qquad \hat{S}(t_0) = 1$$

where $t_1 < t_2 < \dots < t_k$ are the times at which the event of interest occurs, $n_j$ is the number of survivors just before $t_j$ and $d_j$ is the number of individuals presenting the event at time $t_j$.
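As an illustration of the product-limit calculation described above, the following minimal sketch computes the Kaplan-Meier estimate from a list of follow-up times and event indicators. It is a plain Python illustration, not the author's R implementation; the function name `kaplan_meier` and the small data set are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate of S(t).

    times  : follow-up time of each subject
    events : 1 if the event was observed, 0 if right-censored
    Returns one row (t_j, n_j, d_j, S(t_j)) per distinct event time.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv = 1.0
    table = []
    for tj in event_times:
        # n_j: subjects still under observation just before t_j
        n_j = sum(1 for t in times if t >= tj)
        # d_j: subjects presenting the event exactly at t_j
        d_j = sum(1 for t, e in zip(times, events) if t == tj and e == 1)
        surv *= 1.0 - d_j / n_j
        table.append((tj, n_j, d_j, surv))
    return table


# Hypothetical follow-up times (weeks) and event indicators.
times = [1, 3, 3, 6, 8, 9, 12, 15]
events = [1, 1, 0, 1, 0, 1, 1, 0]
table = kaplan_meier(times, events)
for tj, n_j, d_j, s in table:
    print(tj, n_j, d_j, round(s, 4), round(1.0 - s, 4))  # last column: 1 - S(t)
```

Censored subjects never trigger a step in the curve, but they do reduce the number at risk at later event times, which is how the estimator uses their partial information.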
The function S(t) is usually represented graphically, as in the example in Figure 1.
A survival table [8] is another way to compute S(t). It can present values such as time, number of individuals, proportion of survivors, cumulative survival rate, probability density, risk ratio, number of abandonments, number of individuals exposed to risk, number of terminal events, proportion of individuals who have completed, etc.
In a general sense, f(t) is the density function of the survival time T, which indicates the times at which the greatest number of events occur. h(t) is the instantaneous failure rate or risk (hazard) function and represents the probability, per unit of time, that an individual presents the event between the moment t and t + Δt, given that the individual has arrived alive at time t.
Many methods are associated with survival curves and are used to compare survival across the levels of a factor in an experimental design. The log-rank test is used to compare survival functions according to the assigned treatment or some other relevant factor.
In order to construct an explanatory model of the survival function and to explain the relationship between the survival time and the independent variables of the model (sex, age, treatment, stage of disease, tumour marker, etc.), we can use Cox regression. This methodology [1,7] allows us to estimate the survival function S(t) more accurately and to determine which variables best explain patients' survival. Cox regression is represented by a risk function:

$$h(t \mid X) = h_0(t)\, e^{\beta_1 X_1 + \dots + \beta_n X_n}$$

where $h_0(t)$ is the baseline risk and $e^{\beta_1 X_1 + \dots + \beta_n X_n}$ depends on the independent or explanatory variables (weight, age, treatment, concomitant factors, etc.).
In the Cox model, the coefficients β are estimated first, and the Wald test or the log-likelihood is used to determine whether or not they are significant for the model. Subsequently, $h_0(t)$ is estimated.
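To make the structure of the Cox risk function concrete, the sketch below evaluates the hazard for two covariate profiles and their hazard ratio. It is an illustration in Python and does not estimate anything: the coefficients, the baseline value and the covariates (treatment indicator, age) are hypothetical numbers chosen only for the example.

```python
import math


def cox_hazard(baseline_hazard, betas, covariates):
    """h(t | X) = h0(t) * exp(beta_1 * X_1 + ... + beta_n * X_n)."""
    linear_predictor = sum(b * x for b, x in zip(betas, covariates))
    return baseline_hazard * math.exp(linear_predictor)


def hazard_ratio(betas, x_a, x_b):
    """Ratio of the hazards of profiles x_a and x_b; h0(t) cancels out."""
    return math.exp(sum(b * (a - c) for b, a, c in zip(betas, x_a, x_b)))


# Hypothetical coefficients for (treatment, age) and a baseline hazard value.
betas = [0.7, 0.02]
h0 = 0.01
treated = [1, 60]
control = [0, 60]

hr = hazard_ratio(betas, treated, control)
print(round(hr, 4))
print(round(cox_hazard(h0, betas, treated) / cox_hazard(h0, betas, control), 4))
```

Because the baseline risk cancels in the ratio, the coefficients can be interpreted without knowing $h_0(t)$, which is why they are estimated first.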

A First Example of Survival Analysis with R
The first working example is a study of the survival time of a set of patients affected by two variants of tongue cancer. Its survival function is estimated using the Kaplan-Meier method described above, and the example will be used later for the purpose of this article. The data set comes from the R package KMsurv; see the R documentation [11] for more information.
The tongue data frame has 80 rows and 3 columns, and contains the following columns:
• Tumour DNA profile (1 = aneuploid tumour, 2 = diploid tumour)
• Time to death or on-study time, weeks
• Death indicator (0=alive, 1=dead)
The data were obtained from Klein and Moeschberger (1997) [8].
The estimated survival data are presented in Table 1 and in Figure 1.
The life table with the estimated survival function and its 95% confidence interval (CI95) is given in Table 2. Figure 1 shows S(t), with survival decreasing from 100% to 20% over 400 weeks; the lines above and below indicate the 95% confidence limits for the survival estimates.

Cumulative Incidence Curves CI(t) and Competing Risks (CR)
Competing risks (CR) are present in many medical articles dealing with survival analysis: about half of the Kaplan-Meier analyses in medical journals are susceptible to CR. The issue may become even more relevant in the future, e.g. for elderly patients, who are more likely to experience several potential disease endpoints, so that the occurrence of competing events increases [12]. The Kaplan-Meier method is often applied to estimate the cumulative incidence of an event through cumulative incidence curves CI(t), computed as 1 minus the Kaplan-Meier estimator. This method is appropriate for endpoints such as overall survival, and also for composite endpoints such as progression-free survival [12].
As a complement to the estimate of S(t), researchers frequently prefer to generate CI(t) (see Figure 2), which, as opposed to survival curves S(t), shows the cumulative probability of experiencing the event of interest. Cumulative incidence, or cumulative failure probability, is computed as 1 − S(t), and can be computed easily from the life table using the Kaplan-Meier approach. The cumulative incidence function, also referred to as the cause-specific failure probability [12], can be interpreted as the cumulative probability that a failure of type k occurs on or before time t [13]. It helps to determine patterns of failure and to assess the extent to which each component contributes to overall failure. For competing risks data one often wishes to estimate the cumulative incidence probability of failure from a specific cause, k, at time t, that is [9]:

$$CI_k(t) = P(T \le t,\ \varepsilon = k) = \int_0^t S(s)\,\lambda_k(s)\,ds \qquad (6)$$

where $\varepsilon$ indicates the cause or type of failure, S(s) is the overall survival probability, and $\lambda_k(s)$ is the cause-specific hazard for cause k [14]. For more information about the calculation of the cumulative incidence curve CI(t) see [9]. The cumulative incidence estimator can be expressed in terms of the Kaplan-Meier estimator as

$$\widehat{CI}(t) = \sum_{t_i \le t} \frac{d_i}{n_i}\, \hat{K}(t_{i-1})$$

where the $t_i$ are the distinct ordered observed times, $n_i$ is the number of patients at risk at $t_i$, $d_i$ is the number of events of interest at $t_i$, and $\hat{K}(t_i)$ is the Kaplan-Meier estimate of the probability of being free of all events at time $t_i$. Furthermore, competing risks are events that occur instead of the failure event of interest, and they cannot be treated as censored [15]. When there are competing events, the focus should be on cause-specific hazards rather than standard hazards.
When there are competing events, we want to focus on the cumulative incidence function CI(t) rather than the survival function S(t). Cox regression is adequate for cause-specific hazards, but modelling CI(t) requires considerably more work; competing-risks regression by the method of [16] is one possibility.
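The difference between treating competing events as censored and accounting for them can be seen in a short sketch. The code below is a Python illustration, not the author's code, and assumes the usual non-parametric form in which each event of the cause of interest at time t_i contributes (d_i / n_i) multiplied by the probability of being free of all events just before t_i. The function name and the data are hypothetical; cause 0 denotes censoring.

```python
def cumulative_incidence(times, causes, cause_of_interest):
    """Non-parametric cumulative incidence for one cause under competing risks.

    causes : 0 = censored, 1, 2, ... = type of event that occurred
    Returns (t_i, CI(t_i)) at each distinct time at which any event occurs.
    """
    event_times = sorted({t for t, c in zip(times, causes) if c != 0})
    free_of_all = 1.0   # Kaplan-Meier estimate for "no event of any type yet"
    ci = 0.0
    curve = []
    for ti in event_times:
        n_i = sum(1 for t in times if t >= ti)
        d_all = sum(1 for t, c in zip(times, causes) if t == ti and c != 0)
        d_k = sum(1 for t, c in zip(times, causes)
                  if t == ti and c == cause_of_interest)
        ci += free_of_all * d_k / n_i      # uses the value just before t_i
        free_of_all *= 1.0 - d_all / n_i   # update after all events at t_i
        curve.append((ti, ci))
    return curve


# Hypothetical data: cause 1 = death from the disease, cause 2 = other death.
times = [2, 4, 4, 5, 7, 9, 10, 12]
causes = [1, 2, 0, 1, 1, 2, 0, 1]
ci_1 = cumulative_incidence(times, causes, 1)
ci_2 = cumulative_incidence(times, causes, 2)
print(round(ci_1[-1][1], 4), round(ci_2[-1][1], 4))
```

In this example the two cause-specific curves add up to 1 minus the probability of being free of all events, whereas applying 1 − S(t) to cause 1 alone, with cause 2 treated as censored, would give a larger value for cause 1.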

METHOD PROPOSED TO PROJECT CUMULATIVE INCIDENCE CURVE
Byung Mook Weon and Jung Ho Je [17] describe a methodology that makes it possible to obtain separate measurements of scale and shape variances in survival curves. These authors showed that such measurements are useful for better tracking of ageing statistics and that the methodology may help to identify the causes of current trends in human ageing. The present work seeks a method that generalises this process to diseases in which survival analysis is used, such as cancer and heart disease.

The Cumulative Incidence Curve and its Shaping
We propose the use of a parametric method based on the Weibull growth function, or Weibull sigmoid model, inspired by a previous research model [18]. We think this method can estimate the times at which CI(t) reaches 90, 95 and 99% of its maximum; these points on the X axis (time) are of great clinical interest and are referred to as benefit time points (BTP).
The four-parameter Weibull function is an asymptotic growth function, W(x), which represents an approximation to CI(t) at each time x; a, b, c and m are parameters to be estimated and e is the base of the natural logarithms.
Parameter a is the upper asymptote or limiting value of the response variable W:

$$\lim_{x \to \infty} W(x) = a \qquad (9)$$

which represents the maximum of the modelled cumulative function 1 − S(t). b is the lower asymptote and c is the parameter governing the rate at which the response variable approaches its potential maximum a (growth rate). Finally, m is a parameter that controls the x-coordinate (time) of the point of inflection (allometric constant). The four-parameter Weibull growth model can easily be reduced to a 3-, 2- or 1-parameter Weibull model to fit the relation between the dependent variable CI(t) and time x.
When m = 1 the Weibull model is a simple exponential growth curve.
Finally, once a good estimate of the function parameters (a, b, c and m) is obtained, the desired points on the X axis can be calculated using the inverse function $CI^{-1}(t)$. If the Weibull curve represents the function correctly, it has the advantage that it can be projected over time; its immediate application is to determine whether or not the curve CI(t) has reached saturation and when it will reach this maximum limit.
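The inversion step can be sketched as follows. Since the explicit expression of W(x) is not reproduced in this text, the sketch assumes one common parameterisation of the four-parameter Weibull growth curve, W(x) = a − (a − b)·exp(−(c·x)^m), which has upper asymptote a, lower asymptote b and reduces to an exponential growth curve when m = 1; the form used by Weibull.cumulative.incidence() may differ. The code is a Python illustration, and the function names and parameter values are hypothetical.

```python
import math


def weibull_growth(x, a, b, c, m):
    """Assumed form: W(x) = a - (a - b) * exp(-(c * x) ** m)."""
    return a - (a - b) * math.exp(-((c * x) ** m))


def weibull_inverse(y, a, b, c, m):
    """Time x at which W(x) = y, valid for b <= y < a."""
    return (-math.log((a - y) / (a - b))) ** (1.0 / m) / c


def benefit_time_points(a, b, c, m, fractions=(0.90, 0.95, 0.99)):
    """Times at which W(x) reaches the given fractions of the asymptote a."""
    return {p: weibull_inverse(p * a, a, b, c, m) for p in fractions}


# Hypothetical parameter values, not estimates from the paper.
a, b, c, m = 0.8, 0.0, 0.01, 1.5
btp = benefit_time_points(a, b, c, m)
for p in sorted(btp):
    print(p, round(btp[p], 1), round(weibull_growth(btp[p], a, b, c, m), 4))
```

Because the curve only approaches a asymptotically, the fraction must be strictly below 1; this is why the BTP are defined at 90, 95 and 99% of the asymptote rather than at the maximum itself.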

Representing CI(t) using the Proposed Weibull Model and its Calculation
Here we present the calculation of the cumulative incidence CI(t) from S(t) obtained with the Kaplan-Meier method. For this task we have created the function Cumulative.Incidence.Curves(); see Appendix I.

The Cumulative.Incidence.Curves() function transforms the function S(t) into CI(t) = 1 − S(t); the results for the case study are presented in Table 3 and drawn in Figure 2.
In order to detect relevant clinical points on CI(t), we propose fitting it with the four-parameter Weibull growth model W(x) described above, where a is the upper asymptote or limiting value of the response variable W, b is the lower asymptote, c is the growth rate and m controls the point of inflection. We estimate these parameters in order to characterise the function CI(t) and to determine the benefit time points (BTP).
The specific function Weibull.cumulative.incidence() has been developed to estimate the parameters of this model by non-linear regression using the R function nls2(). Essentially, this function fits a Weibull growth model with 4 parameters (equation 10) using a non-linear regression procedure. Weibull.cumulative.incidence() is included in the library BDSbiost3 [Machine learning and advanced statistical methods for omic, categorical analysis and others] [19], which has been developed by the author and is available on GitHub at https://github.com/amonleong/BDSbiost3. The model accuracy (goodness of fit) was tested using Efron's pseudo R², Min.max.accuracy (minimum-maximum accuracy; higher values indicate a better fit, and a perfect fit is equal to 1) and the root mean square error (RMSE), which has the same units as the predicted values. The Weibull sigmoid model obtained the best scores and was selected as a function that fits and extrapolates the curve well.
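The three goodness-of-fit measures named above can be sketched with their usual definitions. This is a Python illustration under the assumption that Efron's pseudo R² is one minus the ratio of the residual to the total sum of squares and that Min.max.accuracy is the mean of min(observed, predicted) / max(observed, predicted); the exact definitions used inside BDSbiost3 are not given in the text. The observed and predicted values are hypothetical.

```python
import math


def efron_pseudo_r2(observed, predicted):
    """1 - sum((y - yhat)^2) / sum((y - mean(y))^2)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def min_max_accuracy(observed, predicted):
    """Mean of min/max for each pair; assumes strictly positive values."""
    ratios = [min(o, p) / max(o, p) for o, p in zip(observed, predicted)]
    return sum(ratios) / len(ratios)


def rmse(observed, predicted):
    """Root mean square error, in the units of the predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Hypothetical observed CI(t) values and fitted values from some model.
observed = [0.10, 0.25, 0.40, 0.55]
predicted = [0.12, 0.24, 0.41, 0.53]
r2 = efron_pseudo_r2(observed, predicted)
mma = min_max_accuracy(observed, predicted)
err = rmse(observed, predicted)
print(round(r2, 4), round(mma, 4), round(err, 4))
```

Pairs in which both the observed and the predicted value are zero, which can occur at the start of a CI(t) curve, would need to be excluded before computing the min-max ratio.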
The Weibull.cumulative.incidence() function also plots the estimated function, its 95% prediction interval (CI95%) and the BTP of interest at 90, 95 and 99% of the asymptote.
Table 4 shows the parameter estimates obtained for our case study in the estimation of the Weibull curve using Weibull.cumulative.incidence().
The goodness of fit computed for the model is:
• Efron's pseudo r-squared = 0.974
• Min.max.accuracy = 0.923
Figure 3 presents the curve fitted to CI(t), showing the Weibull model estimate (red line with its CI95%).
Figure 3 also shows the possible benefit time points (BTP), calculated as the times needed to reach 90%, 95% and 99% of the upper asymptote, a, using the inverse of the Weibull growth function, $W^{-1}(x)$. In this way we obtain the time needed to reach a given fraction of the upper asymptote, which we consider a possible benefit time point (BTP).

Local Polynomial Regression (LPR)
Another possibility for projecting the CI curve is to use a non-parametric method. After trying several, Local Polynomial Regression (LPR) fitting was chosen.
LPR is a family of flexible and robust non-parametric regression methods that fit smooth curves relating a response to one or more independent variables. These methods combine multiple regression models in a k-nearest-neighbour-based meta-model. The method used in the study was LOESS (locally estimated scatterplot smoothing), a generalisation of LOWESS (locally weighted scatterplot smoothing). The mathematical description of LPR and LOESS can be found in [20]. This method is used in the following example and has also been incorporated into the function Weibull.cumulative.incidence().
An R-squared for a LOESS fit can be obtained using:

$$\text{pseudo-}R^2 = 1 - \frac{\mathit{SS.resid}}{\mathit{SS.variable}} \qquad (11)$$

where SS.resid is the sum of squares of the residuals and SS.variable is the sum of squares of the variable of interest (Y).
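The idea behind local regression and the pseudo R² of equation 11 can be sketched in a few lines. The code below is a simplified Python illustration, not R's loess(): it fits a local straight line (degree 1) with tricube weights over the nearest neighbours of each point, assumes distinct x values, and takes SS.variable to be the sum of squares of Y about its mean, which is an assumption. The function names, the span and the data are hypothetical.

```python
import math


def local_linear_smooth(x, y, span=0.75):
    """Local linear regression with tricube weights, evaluated at each x."""
    n = len(x)
    k = min(n, max(3, int(math.ceil(span * n))))  # neighbours per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1]                           # local bandwidth
        w = [(1.0 - (abs(xi - x0) / h) ** 3) ** 3 if abs(xi - x0) < h else 0.0
             for xi in x]
        sw = sum(w)
        xm = sum(wi * xi for wi, xi in zip(w, x)) / sw
        ym = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(ym + slope * (x0 - xm))
    return fitted


def pseudo_r2(y, fitted):
    """Equation 11, with SS.variable taken about the mean of y (assumption)."""
    mean_y = sum(y) / len(y)
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_variable = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_resid / ss_variable


# Hypothetical CI(t) values observed at ten time points (weeks).
weeks = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
ci_obs = [0.05, 0.12, 0.22, 0.30, 0.41, 0.47, 0.55, 0.58, 0.62, 0.63]
ci_fit = local_linear_smooth(weeks, ci_obs, span=0.5)
print(round(pseudo_r2(ci_obs, ci_fit), 4))
```

Unlike the Weibull model, a local fit of this kind has no asymptote parameter, so any projection beyond the observed times depends on how the smoother is extrapolated.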

AN ALTERNATIVE CASE OF USE FOR WEIBULL.CUMULATIVE.INCIDENCE()
Here we present a further example of the use of the function Weibull.cumulative.incidence(), now using LPR.
The second example is shown in Figures 4, 5 and 6. The goodness of fit computed for the LOESS method (non-parametric) is:
• Pseudo r-squared = 0.981704

CONCLUSIONS
In this work we present the use of the cumulative function 1 − S(t) as a way to calculate benefit time points (BTP), defined as the times at which 90, 95 or 99% of the maximum of 1 − S(t) is reached. For this task we propose the R function Weibull.cumulative.incidence(). In a first step, this function takes the survival curve S(t) obtained using the Kaplan-Meier method and transforms it into the cumulative incidence curve CI(t) = 1 − S(t).
In a second step, Weibull.cumulative.incidence() estimates a four-parameter Weibull growth curve to characterise the best-fit function for CI(t), and uses its inverse $CI^{-1}(t)$ to estimate the BTP.
BTP can be important in diseases such as heart disease or cancer for locating the inflection point of the disease or of its associated treatment, for anticipating the course of the disease, and for changing treatments at those points. Finally, this model has many potential applications in situations in which there is great uncertainty and temporal projections are needed, such as microbiological growth models or epidemiological models, as in the coronavirus pandemic.