Margaux in biostatistical wonderland

1/3/2017

Fig.1: let's start with a bit of humour, shall we?

Relax and enjoy your flight on Biostatistics Airlines:

    The first thing I did when I crossed paths with medical biostatistics was… laugh. A mirthless laugh: those formulae were unintelligible.
What I did not understand then was that, at our modest level, the only thing we need is… to know what we want! Indeed, each situation described can be reduced to a formula.

    In the previous post, we saw that patients can be distributed into prognostic strata by comparing the mutation status of two genes across different tumour types and their combined impact on prognosis.
What is the prognosis here? Simply survival. What do we want? Simply to know whether Mr X, with mutated A but wild-type B (i.e. the gene's “normal” state), will live longer than Mrs Y, with both A and B wild-type.
So let’s not forget that statistics can be easy to grasp if the concepts are explained with the right handle [1], and let’s take one step after another to understand how survival analysis is done and how it can help us on our journey toward personalised medicine.

First step: datum datum datum, dati, dato dato, DATA, DATA, DATA, datorum, datis datis

    We are all familiar with the first step of every statistical analysis: data collection. As said in the introduction, we are interested in survival: at a given time, is a given patient dead or alive (yes, mathematics can be pretty cynical)?
    Those of you who already know a bit about statistics may ask why we do not use classical models such as the normal distribution (Gaussian curve [2]), which allow easy interpretation. It is because survival times do not follow classical distributions; events are often distributed as follows: several early events, some late events [3]. So survival analysis requires specific models.

    There are two “events” that can be observed: death or recurrence (i.e. the cancer comes back). Models use two variables linked to these events:
        i) overall survival: time between diagnosis and death or the last follow-up
        ii) progression-free survival: time between the response to treatment and recurrence or the last follow-up

    Here “follow-up” means the time when you examine all the patients in your cohort. Interestingly, this notion of follow-up is another point that sets survival analysis apart from other models. Patients who have experienced neither death nor recurrence by the last follow-up are considered “censored”: we only know that their event, if any, happens later than the time observed so far, so classical models, which expect a complete value for every patient, are not relevant here either. The only thing you need to keep in mind is that censoring follows specific rules [4].
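To make censoring a bit more concrete, here is a minimal sketch of how such data could be written down. The cohort, the patient labels and the function name are all made up for illustration; the only idea taken from the text is that each patient contributes a time and a flag saying whether the event was actually observed.

```python
# Hypothetical cohort: (patient, months since diagnosis, death observed?)
cohort = [
    ("P1", 6, True),    # died 6 months after diagnosis
    ("P2", 13, False),  # alive at last follow-up: censored at 13 months
    ("P3", 13, True),
    ("P4", 20, True),
    ("P5", 25, False),  # alive at last follow-up: censored at 25 months
]


def split_cohort(patients):
    """Separate patients with an observed event from censored ones."""
    observed = [p for p in patients if p[2]]
    censored = [p for p in patients if not p[2]]
    return observed, censored


observed, censored = split_cohort(cohort)
print(len(observed), "deaths observed,", len(censored), "patients censored")
```

A censored patient is not thrown away: the 13 or 25 months during which we know they were alive still count in the analysis.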

Second step: “All Curves Are Beautiful” (Yvan Cornut)

    Now that we have our data, we have to put them into functions to model what we want. Two functions are used [3]:
i) survival S(t), which represents the “cumulative non-occurrence”, i.e. the probability that a patient has not yet experienced the event at time t
ii) hazard h(t), which represents the risk of the event occurring at a given time t, knowing that the patient did not experience any event until t.
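For those who like symbols, these two functions are usually written as below. This is the standard textbook notation rather than something taken from the figures of this post; T (an assumed symbol) stands for the time at which the event happens for a given patient.

```latex
S(t) = P(T > t)
\qquad
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
```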

Those functions will be useful for plotting survival curves, a way to represent survival so that it can be analysed more easily. The most common way to do it is the Kaplan-Meier method:

Fig.2: Formula for the KM Method. Relax, it is simpler than most of the mathematical things we’ve done so far!

Fig.3: Kaplan-Meier curves, S(t) = f(t)

As initial conditions we have:

t(0) = 0
S(0) = 1
Events are considered independent

Then we plot S(t) against time to obtain our survival curves as here:

    If you’re a cynic, you can plot [1-S(t)] vs. time to obtain death curves but let’s stay positive, after all, you’re still alive after the previous explanations :)
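For the curious, the Kaplan-Meier estimate can be sketched in a few lines. In its usual form, S(t) is the product, over every event time t_i up to t, of (1 - d_i / n_i), where d_i is the number of deaths at t_i and n_i the number of patients still at risk just before t_i. The data and the function name below are hypothetical, and this is only an illustration, not a replacement for a proper statistics package.

```python
def kaplan_meier(durations, events):
    """Return [(time, S(time))], starting at S(0) = 1, one point per event time."""
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = [(0, 1.0)]  # initial conditions: t = 0, S(0) = 1
    for t in event_times:
        # patients still followed just before t (censored ones included)
        at_risk = sum(1 for d in durations if d >= t)
        # deaths observed exactly at t
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Made-up follow-up times in months; False means censored
durations = [6, 13, 13, 20, 25]
events = [True, False, True, True, False]

curve = kaplan_meier(durations, events)
for time, s in curve:
    print(f"t = {time:>2} months, S(t) = {s:.2f}")
```

Plotting these points as a staircase gives the survival curve; note how the censored patients lower the number at risk without making the curve drop.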

    Next step: comparing your curves. Why? Because not all differences are significant. Huh... what? Let me be clearer: when you cook, adding 10 more grams of butter makes a noticeable difference, whereas adding only one gram does not. Same thing for your survival curves:
How big must the difference be for you to conclude that, if Mr X with wt-B and mutated A lives longer, everyone with that profile is also more likely to live longer?
To answer that, we use a test named log rank [6], performed in the same way as a χ² [5].
No, no, don’t go! I’ll be happy to develop in the comments for those who may be interested but let’s stop there on survival curves and take another step.

Last-but-not-least step: what else?

In the first blogpost, we saw that cancer is a multifactorial disorder with internal and external causal factors. Methods like the ones above (Kaplan-Meier and log rank) are univariate. Hence the need for models that take into account the factors linked to the patient (remember: personalised approach), named confounders or covariates, and that make it possible to estimate clinically [7] (not statistically) the impact of the mutation status combined with those factors on the prognosis.
There are two categories of models, which differ in the way the impact on prognosis is conceived. In the Cox proportional hazards model, which we will see below, each factor has a weight that influences the prognosis, whereas in the Accelerated Failure Time model, which will not be detailed further, each factor can shrink or stretch the survival time along the time axis.

Briefly, the semi-parametric Cox model is a way to link event occurrences with the set of covariates. It uses the following function for the hazard h(t):
$$h(t) = h_0(t) \times \exp\left(\sum_{i=1}^{p} b_i x_i\right)$$
That can look absolutely barbaric until you know that p is the number of covariates you are considering (age at diagnosis, gender, other disorders…), xi is a given covariate and bi is its coefficient, which weights the hazard.

    Still following? Great! Here are a few clarifications:
i) h0(t) is a “baseline hazard”, which is very convenient because we do not have to assume that h(t) follows a given, known distribution
ii) This model can only be applied under the assumption that hazards are proportional, i.e. the hazards of any two patients are constant multiples of each other over time.
iii) exp(bi) is the risk ratio and helps to understand the model: if bi>0, h(t) increases as xi increases, and if the risk increases, the survival time decreases. Thus we have a negative correlation between risk ratios and survival
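The three points above can be illustrated with a small numerical sketch. Everything in it is assumed for the sake of the example: the baseline hazard value, the two covariates (age and smoking) and their coefficients are invented, not fitted on real data.

```python
import math


def cox_hazard(baseline_hazard, coefficients, covariates):
    """h(t) = h0(t) * exp(sum of bi * xi) at one fixed time t."""
    linear_predictor = sum(b * x for b, x in zip(coefficients, covariates))
    return baseline_hazard * math.exp(linear_predictor)


# Hypothetical coefficients: age in years, smoker (1) or non-smoker (0)
b = [0.03, 0.7]
h0 = 0.01  # hypothetical baseline hazard at some time t

smoker = [60, 1]
non_smoker = [60, 0]

ratio = cox_hazard(h0, b, smoker) / cox_hazard(h0, b, non_smoker)
print(f"risk ratio smoker vs non-smoker: {ratio:.2f}")
print(f"exp(b_smoking):                  {math.exp(b[1]):.2f}")
```

The two printed values are identical: the baseline hazard cancels out in the ratio, which is why we never need to know its shape, and the ratio does not depend on t, which is exactly the proportional hazards assumption.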

Now you know!

Congratulations, you have succeeded in following statistical explanations! What should you remember and tell your family during dinner so that they take you for some kind of wizard?
First, survival analysis is done in three steps: data collection, plotting and comparing survival curves, and estimating the impact of covariates on the prognostic groups defined by your curves.
Second, what matters in biostatistics is knowing what you want, because only then will you know what hypotheses to make and what test or model to use to answer your question. The tests and models presented here are “non-parametric” (or, for the Cox model, semi-parametric), which means that you do not assume your data fit a given distribution. This is convenient because survival distributions often differ from the classical ones, but parametric tests are considered more powerful when their assumptions hold, so looking at what can be done with parametric models is interesting too.
Finally, I really hope that you were able to follow everything above, because that would mean that I managed to simplify those notions and that you may have understood the most important take-home message: statistics can be easily understood and, even if we won’t be biostatisticians, it is important to understand what we read in papers. The only way to be legitimate in our criticism is by knowing what we’re talking about :)

References:
[1] Aberkane, I. (2016). Libérez votre cerveau ! 1st ed. Paris: Robert Laffont.
[2] Mathsisfun.com. (2017). Normal Distribution. [online] Available at: https://www.mathsisfun.com/data/standard-normal-distribution.html [Accessed 24 Feb. 2017].
[3] Clark, T., Bradburn, M., Love, S. and Altman, D. (2003). Survival Analysis Part I: Basic concepts and first analyses. British Journal of Cancer, 89(2), pp.232-238.
[4] Bradburn, M., Clark, T., Love, S. and Altman, D. (2003). Survival Analysis Part III: Multivariate data analysis – choosing a model and assessing its adequacy and fit. British Journal of Cancer, 89(4), pp.605-611.
[5] Chi Square Statistics. (2017). [online] Math.hws.edu. Available at: http://math.hws.edu/javamath/ryan/ChiSquare.html [Accessed 22 Feb. 2017].
[6] http://www.oxfordjournals.org/our_journals/tropej/online/ma_chap12.pdf
[7] Bradburn, M., Clark, T., Love, S. and Altman, D. (2003). Survival Analysis Part II: Multivariate data analysis – an introduction to concepts and methods. British Journal of Cancer, 89(3), pp.431-436.

Figures:

Fig.1: From: https://fr.pinterest.com/pin/89298005081195023/
Fig.2: Beautifully drawn by me
Fig.3: In Kishore, J., Goel, M. and Khanna, P. (2010). Understanding survival analysis: Kaplan-Meier estimate. International Journal of Ayurveda Research, 1(4), p.274.

6 Comments

Hortense

1/3/2017 07:26:11 am

Dear Margaux ;)
First, thank you for your clear explanation of statistics!
I have one question about the number of samples. In a lot of our research projects in FVD, we were looking for a lot of samples, for repeatability. But what about your kind of samples?
Thank you!

Margaux

2/3/2017 07:09:29 am

Dear Hortense,

Thanks for passing by :D

I have two things to point out in order to answer your question. First, the large number of samples we use in our FdV project serves two purposes:
1) with a large number of samples, you're more likely to reduce the uncertainty of your estimates and to get a more accurate picture of the statistical "truth".
Let's picture that with an example: if you take the height of one person in our class, you won't have any clue about the average height in our group, let alone in the world. If you measure every one of us, you can get the average height and maybe infer the average in a larger population.
2) and to me it's the most obvious one: we don't have the tools to quantify the significance of our results. We can quantify our results, we can evaluate the biases that may influence them, but we cannot say "ok, this is similar in my experiments so there is a similarity IRL". So if we take a large number of samples, or if we perform the protocol many times, we might say "ok, given the number of times this experiment has been done, if my results are similar or within a narrow range of variation, it is more likely due to a real thing than to an error I've made, because I cannot have the same bias repeated over such a number of times".

Second, the number of samples in statistics can be calculated. I'll explain briefly, but this can be a bit tricky to understand at first.
When you perform a statistical test (such as the log rank, which I'll explain to Aurélien afterwards), you have two hypotheses:
i) the null one, which you'll try to invalidate
ii) the one you're trying to validate BY INVALIDATING THE FIRST one (sorry 'bout the caps lock)
Given those two hypotheses, you can make two errors:
i) first type (alpha), which is the probability of rejecting your null hypothesis when it is true
alpha is also named the significance threshold and is often set to 0.05
ii) second type (beta), which is the probability of keeping your null hypothesis when it is false
Beta is useful because it gives us the power of your statistical test: P = 1 - beta

Remember what I said in my intro: for every situation, you have a formula. Well, for each of these formulae (= tests), you have a second formula that uses P as an input and allows you to calculate the number of samples required to reach P
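As an illustration of such a "second formula", here is a sketch for the log rank test, using Schoenfeld's approximation. This particular formula is not in the post or its sources, so take it as an assumption: it gives the number of events (deaths) to observe, not the number of patients, for a two-sided test. The function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist


def required_events(hazard_ratio, alpha=0.05, power=0.80, allocation=0.5):
    """Approximate number of events needed to detect a given hazard ratio.

    allocation is the fraction of patients in the first group.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)           # power P = 1 - beta
    log_hr = math.log(hazard_ratio)
    events = (z_alpha + z_beta) ** 2 / (allocation * (1 - allocation) * log_hr ** 2)
    return math.ceil(events)


print(required_events(0.5))              # risk halved, 80% power
print(required_events(0.5, power=0.90))  # same effect, higher power
print(required_events(0.8))              # smaller effect
```

Asking for more power, or trying to detect a smaller difference between the two groups, increases the number of events you need to observe.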

If you want to know more, you can read the paper below, which explains it quite well (without too many calculations, just the concepts!)
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2933537/

Hope I've been clear enough, don't hesitate to ask me for further information!

margaux

2/3/2017 07:15:50 am

I just found out how to explain the tricky thing about the two hypotheses
imagine (all the peopleeeeeee living life in peaaaaace... sorry) that you can only be one of two colours: blue or pink
in statistics, the best way to establish that you're blue is by showing that you are not pink
More formally:
The best way to validate the fact that there is A is by showing that there is not "non-A"
(draw potatoes, it will help!)

Aurélien DIEHL

1/3/2017 01:13:19 pm

Hi Margaux!

You wrote :
"How big must be the difference so you can conclude that if Mr X with wt-B and mutated A live longer, everyone is more likely to also live longer?
To do that, we use a test named log rank [6] and performed the same way as a χ² [5].
No, no, don’t go! I’ll be happy to develop in the comments for those who may be interested but let’s stop there on survival curves and take another step."
So I'll be happy to know more about this :)

Secondly, I was wondering whether the model that you described is only applicable to cancer?

Lastly, in this formula: $h(t) = h_0(t) \times \exp\left(\sum_{i=1}^{p} b_i x_i\right)$
I do not understand what "i" represents.

Thanks !

2/3/2017 08:02:03 am

I was going to answer your first question but, in my sources, you have this great website http://math.hws.edu/javamath/ryan/ChiSquare.html that has a huge advantage over my answer: you’ll find tables that are far more explanatory than any words I could try to write about it.
To sum up, a chi square test is used to see whether the difference between two groups reflects a real effect or is just due to random sampling.
For example, when you try a new drug, you can compare two groups of human guinea pigs: the one that takes the drug and the one that takes the placebo. With the chi square test, you can say whether the difference between your groups is significant or whether it could simply be due to chance.
The only difference between a chi square and a log rank is that in a chi square your denominator is the number of Expected events under your null hypothesis (see my answer to Hortense to decrypt that :p) whereas in the log rank you use the variance of your difference (Observed - Expected). Once again, check the website, it’s clearer :)
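For readers who prefer code to tables, the log rank statistic described above can be sketched as follows for two groups. The data are invented and the function name is hypothetical; the statistic is (Observed - Expected)² divided by the variance of that difference, compared with a χ² distribution with one degree of freedom.

```python
import math


def log_rank(durations_a, events_a, durations_b, events_b):
    """Two-group log rank test: returns (statistic, p_value)."""
    all_durations = durations_a + durations_b
    all_events = events_a + events_b
    event_times = sorted({t for t, e in zip(all_durations, all_events) if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for d in durations_a if d >= t)  # at risk in group A
        n_b = sum(1 for d in durations_b if d >= t)  # at risk in group B
        d_a = sum(1 for d, e in zip(durations_a, events_a) if d == t and e)
        d_b = sum(1 for d, e in zip(durations_b, events_b) if d == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n  # expected under the null hypothesis
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    statistic = (observed_a - expected_a) ** 2 / variance
    # survival function of a chi square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up follow-up times in months; every event is observed here
group_a = [1, 2, 3, 4, 5]
group_b = [10, 11, 12, 13, 14]
all_observed = [True] * 5

statistic, p_value = log_rank(group_a, all_observed, group_b, all_observed)
print(f"statistic = {statistic:.2f}, p = {p_value:.4f}")
```

If the p-value is below your alpha (often 0.05), you reject the null hypothesis that both groups share the same survival curve.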

From what I’ve read, this is mainly applied in cancer because cancer events do not fit the classical models that are traditionally used.
But this model or those tests could be used in other diseases that do not fit a known distribution, or in diseases where prognostic stratification could be useful.
I think that, in this blogpost, the model seems to be applicable only to cancer because of two things:
i) the combination of steps
ii) the choice of death as an event
But for example, if we take neurodegenerative diseases, you could pick “hard to swallow” as an event and use those tests to perform a prognostic stratification as explained in the post. The survival curves would then not be about how long your patients live but about how long it takes them to lose the capacity to swallow.

Oh, I should have been more explicit: i is just the index of your covariate. For example, let’s say that you have three covariates:
Age

Sorry for my long answer, I'm a bit enthusiastic when it comes to stats :D

2/3/2017 09:00:27 am

Oh, I should have been more explicit: i is just the index of your covariate. For example, let’s say that you have three covariates:
Age
Smoker/non-smoker
Red hair/white hair
Well, you’ll just index them: age 1, smoking 2, hair 3, so it is clearer in your formula and you won’t forget one. To sum up, i is just an integer between 1 and p (p is the index of the last covariate, btw)

Sorry for my long answer, I'm a bit enthusiastic when it comes to stats :D

(error in my copy-paste, sorry)
