A Generalization of the «lady-tasting-tea» Procedure to Link Qualitative and Quantitative Approaches in Psychiatric Research

In Fisher's "The Design of Experiments", a trial was designed to test a lady's claim to be able to discriminate whether the milk or the tea was added first to a cup. In this trial, eight cups are poured, four with milk first and four with tea first. They are then presented in random order to a subject who has to divide them into two sets of 4, according to his/her belief about the "treatment" received. The present paper generalizes this design so that a hypothesis concerning the existence of two subgroups in a set of psychiatric patient records (whether written, audiotaped or videotaped) can be tested rigorously from a statistical point of view. Tables are proposed to enable power and sample size calculations. A real example is presented; it shows that psycho-dynamically oriented professionals are able to discriminate seven healthy adults who have experienced a sibling's cancer during childhood or adolescence from seven matched controls. This method is particularly suited to small sample studies that explore elusive clinical hypotheses traditionally tackled with qualitative methodologies.


INTRODUCTION
It is perhaps in psychology that randomization was first used in an experiment [1,2]. This was in 1884; C.S. Peirce and his student Joseph Jastrow were the experimenters; their objective was to refute a hypothesis made by Gustav Fechner about the existence, for each sense, of a nonzero threshold of intensity below which two sensations cannot be distinguished [3]. The experiment was based on the repeated presentation of a pair of weights; the subjects had to determine whether or not the first weight was the heavier one. To facilitate the interpretation of results, the authors decided to randomize the order of presentation of the two weights in batches of 25.
Curiously, the practice of randomization does not seem to have spread after this seminal publication; it was only after the work of Sir Ronald Aylmer Fisher that it came to be considered standard [1]. Although Fisher's professional interests lay in genetics and agriculture, he also gave a considerable place to psychological experiment. The second chapter of R. A. Fisher's book The Design of Experiments [4] is entitled "The Principles of Experimentation, Illustrated by a Psycho-physical Experiment". This chapter begins with an anecdote which occurred at a university tea party in the late 1920s [5]: "A lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup. We will consider the problem of designing an experiment by means of which the assertion can be tested. Our experiment consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject in random order. [...] Her task is to divide the cups into two sets of 4, agreeing, if possible, with the treatments received." Fisher insists on the importance of randomization. This process is "the only point in the experimental procedure in which the laws of chance […] have been explicitly introduced". This is essential from a statistical point of view since "the simple precaution of randomisation will suffice to guarantee the validity of the test of significance".
*Address correspondence to this author at the INSERM U669, Maison de Solenn, 97 Bd du Port Royal, 7569 Paris cedex 14, France; Tel: (33) 1 58412850; Fax: (33) 1 58412946; E-mail: bruno.falissard@gmail.com
#Co-authors' E-mail: daniel.milman75@orange.fr, david.cohen@psl.aphp.fr
In the area of psychiatric research, a procedure of this type can be considered a precursor of single-subject randomised experimental designs, which are sometimes used in the evaluation of behavioral therapies [6,7]. Although it is often difficult, for theoretical and practical reasons, to give a single subject a random succession of two "treatments" as in the lady-tasting-tea experiment, it is possible to achieve randomization by choosing at random the point at which one treatment succeeds the other. A design of this sort is known as an "AB" design [7], where all A treatments are given first, then all B treatments. This design can be extended to "ABAB" designs or to multiple-baseline AB designs, where a small sample of subjects is studied instead of a single subject [8].
A particular interest of the lady-tasting-tea procedure is that it offers a very simple design to test hypotheses that would be rather difficult to tackle using, for example, psychometric scales or questionnaires. We propose here to develop and adapt it to the field of psychological research. The question of statistical power will be particularly important since low power has been identified as a major limitation of single-subject experiments in general [9]. A real example will be presented; it focuses on a question formulated in the field of psychodynamic psychopathology.

Objective
To test the hypothesis that two sets of records A and B (written, audiotaped or videotaped interviews) are distinguishable.

1. n records (n even) are collected: n/2 records belong to set A and n/2 to set B.

2. The n records are presented in random order to k independent raters (the order differs from rater to rater). The raters know that half of the records belong to A and the other half to B.

3. The raters are then asked to give their opinion about the likelihood that the records belong to A or B. In practice, for each record a given rater has to choose among 4 propositions: "it is certain that the record belongs to A", "it is plausible that the record belongs to A", "it is plausible that the record belongs to B", "it is certain that the record belongs to B". The raters can examine the records as long as they want, possibly several times in different orders. The raters work blind to one another's assessments.

4. Ratings are analyzed in the following way: for each record in group A, 2 is added to the score when a rater has considered that "A is certain", 1 is added if "A is plausible", 1 is subtracted if "B is plausible" and 2 is subtracted if "B is certain". The same is applied to each record in group B, but here 2 is added to the score if "B is certain", and so forth. A total score is then obtained from the responses given by all raters to all records. Finally, a permutation test compares this total score to 0 ([10], p. 202). It should be noted that since the raters know that half of the records belong to A and the other half to B, the ratings cannot be considered as independent realizations of a random variable, so the traditional Student t-test or Mann-Whitney test should not be used. In contrast, under the null hypothesis that A and B records are indistinguishable, all permutations of the scores obtained for each record are equiprobable, so that a one-sided p-value can be estimated as the proportion of permutations of the n records for which the total score is higher than or equal to the total score obtained in the experiment ([10], p. 208). When a two-sided p-value is preferred, some authors suggest doubling the one-sided p-value; this is also the option proposed in the International Conference on Harmonization ICH E9 guidelines [11]. Two-sided p-values will be preferred in the rest of the paper.
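The scoring and permutation steps above can be sketched in a few lines of code. The following Python example is only an illustration under stated assumptions, not the authors' R script: the function names, the rating labels and the toy data (4 records, 2 raters) are hypothetical, and the permutation of records is implemented by enumerating every way of labelling n/2 of the n records as "A".

```python
from itertools import combinations

# Ratings coded on a common scale: positive means "rated A", negative "rated B".
RATING = {"A certain": 2, "A plausible": 1, "B plausible": -1, "B certain": -2}


def total_score(record_sums, a_indices):
    """Per-record sums count positively for A records and negatively for B records."""
    a = set(a_indices)
    return sum(s if i in a else -s for i, s in enumerate(record_sums))


def permutation_test(ratings, a_indices):
    """ratings[r][i] is rater r's answer for record i; a_indices are the true A records.

    Returns (observed total score, one-sided p-value, two-sided p-value).
    """
    n = len(ratings[0])
    # Sum the coded ratings of all raters for each record.
    record_sums = [sum(RATING[rater[i]] for rater in ratings) for i in range(n)]
    observed = total_score(record_sums, a_indices)
    hits = total = 0
    # Enumerate every assignment of n/2 records to group A.
    for subset in combinations(range(n), n // 2):
        total += 1
        if total_score(record_sums, subset) >= observed:
            hits += 1
    p_one_sided = hits / total
    # Two-sided p-value obtained by doubling, capped at 1.
    return observed, p_one_sided, min(1.0, 2 * p_one_sided)


# Hypothetical toy data: records 0 and 1 belong to A, records 2 and 3 to B.
toy_ratings = [
    ["A certain", "A plausible", "B plausible", "B certain"],
    ["A plausible", "B plausible", "A plausible", "B certain"],
]
observed, p_one, p_two = permutation_test(toy_ratings, a_indices=[0, 1])
print(observed, p_one, p_two)
```

With only 4 records there are just 6 possible assignments, so no p-value below 1/6 (one-sided) can be reached; this illustrates why the number of records matters for power. Exhaustive enumeration is practical for small n (3432 assignments for n = 14); for larger n a random sample of permutations would be an alternative.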

Power
From a theoretical point of view, since in a randomization test the outcomes are regarded as fixed quantities and not as random variables with a given distribution, the concept of statistical power is in itself questionable. But since this point is discussed by several authors [6], we will not enter into this fundamental debate and will focus on very practical considerations.
We will consider here three alternative hypotheses: one where raters have a sensitivity and a specificity of 0.8 for correctly allocating subjects to groups A and B; one where sensitivity and specificity are equal to 0.7; and one where they are equal to 0.6. It should be noted that when sensitivity and specificity are equal to 1 the raters have perfect discriminating ability, and when sensitivity and specificity are equal to 0.5 their discriminating ability is null, or rather it is comparable to an equiprobable random scoring.
For the case where sensitivity and specificity are equal to 0.7 (for example), data sets are simulated in the following way. For records belonging to group A, the k × n/2 ratings are randomly chosen from the following distribution {"2" with a probability of 0.35, "1" with a probability of 0.35, "-1" with a probability of 0.15 and "-2" with a probability of 0.15}. For records belonging to group B, the k × n/2 ratings correspond to -1 multiplied by a random permutation of the ratings obtained for group A. This procedure guarantees that there will be as many negative as positive ratings, which is what each rater is supposed, by construction, to produce.
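As an illustration of this simulation step, here is a minimal Python sketch (not the authors' R code). The function name `simulate_ratings` and its parameters are hypothetical; the only assumption beyond the text is that the two "correct" ratings share the sensitivity equally and the two "incorrect" ratings share the remainder equally, which reproduces the 0.35/0.35/0.15/0.15 distribution for a sensitivity of 0.7.

```python
import random


def simulate_ratings(n, k, sens, rng):
    """Simulate the k*n/2 ratings of group A and of group B on the A-versus-B scale."""
    m = k * n // 2
    values = [2, 1, -1, -2]
    # Correct answers (2 or 1) share `sens`; incorrect ones (-1 or -2) share 1 - sens.
    weights = [sens / 2, sens / 2, (1 - sens) / 2, (1 - sens) / 2]
    ratings_a = rng.choices(values, weights=weights, k=m)
    # Group B: -1 times a random permutation of the group A ratings.
    ratings_b = [-r for r in rng.sample(ratings_a, m)]
    return ratings_a, ratings_b


rng = random.Random(2024)  # fixed seed so the sketch is reproducible
ratings_a, ratings_b = simulate_ratings(n=14, k=3, sens=0.7, rng=rng)
positives = sum(r > 0 for r in ratings_a + ratings_b)
negatives = sum(r < 0 for r in ratings_a + ratings_b)
print(positives, negatives)  # the two counts are equal by construction
```

Because every positive rating in group A has a mirrored negative rating in group B (and vice versa), the balance between positive and negative ratings holds over the whole data set whatever the random draw.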
For a series of n and k, 7000 random data sets are generated according to each alternative hypothesis. The statistical power is estimated as the proportion of random data sets for which the two-sided permutation test at the 5% level rejects the null hypothesis. The number of 7000 guarantees that for a power of 75%, the half span of the 95% confidence interval for the estimated statistical power is equal to 0.01. This precision in the estimated power will be greater for higher power values and lower for smaller power values (down to 0.50, where the half span is widest). All computations were performed using R software version 2.4.1 [12].
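The power estimation described above can be sketched as follows. This Python example is a hedged illustration, not the authors' R implementation: all function names are hypothetical, the simulation step is included in compact form so that the block runs on its own, and it assumes that consecutive blocks of k simulated ratings belong to the same record (the text does not specify how ratings are attached to records). The half span uses the usual normal approximation for a proportion, 1.96 × sqrt(p(1 - p)/N).

```python
import math
import random
from itertools import combinations


def simulate_record_sums(n, k, sens, rng):
    """Draw one data set and return one summed rating per record."""
    m = k * n // 2
    weights = [sens / 2, sens / 2, (1 - sens) / 2, (1 - sens) / 2]
    group_a = rng.choices([2, 1, -1, -2], weights=weights, k=m)
    group_b = [-r for r in rng.sample(group_a, m)]
    ratings = group_a + group_b
    # Assumption: each consecutive block of k ratings belongs to one record,
    # with the first n/2 records in group A and the rest in group B.
    return [sum(ratings[i * k:(i + 1) * k]) for i in range(n)]


def two_sided_p(record_sums):
    """Exhaustive permutation test; the first n/2 records are the true A records."""
    n = len(record_sums)
    grand = sum(record_sums)
    # Total score = (sum over A) - (sum over B) = 2 * (sum over A) - grand total.
    observed = 2 * sum(record_sums[: n // 2]) - grand
    hits = total = 0
    for subset in combinations(range(n), n // 2):
        total += 1
        if 2 * sum(record_sums[i] for i in subset) - grand >= observed:
            hits += 1
    return min(1.0, 2 * hits / total)


def estimate_power(n, k, sens, n_sim, alpha=0.05, seed=1):
    """Proportion of simulated data sets for which the test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = sum(
        two_sided_p(simulate_record_sums(n, k, sens, rng)) <= alpha
        for _ in range(n_sim)
    )
    return rejections / n_sim


def half_span(power, n_sim):
    """Half width of the 95% confidence interval of an estimated proportion."""
    return 1.96 * math.sqrt(power * (1 - power) / n_sim)


if __name__ == "__main__":
    # Small illustrative run; the paper uses 7000 data sets per design.
    print(estimate_power(n=8, k=3, sens=0.7, n_sim=200))
    print(round(half_span(0.75, 7000), 4))
```

The run above uses few data sets and a small n only to keep the sketch fast; reproducing the published table would require the same n, k and 7000 data sets per design, and the results would still depend on the record-assignment assumption stated above.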
Results are presented in Table 1. They show that even for a weak alternative hypothesis (sensitivity = specificity = 0.6) some designs can lead to substantial

EXAMPLE
In a recent study the authors were interested in the long-term psychological outcome of siblings of children with cancer. They hypothesized (1) that cancer would have a lasting traumatic effect on the siblings of children with the disease and (2) that personal psychodynamic experience enhances the ability to discriminate between these siblings and controls [13].
To test these two hypotheses, seven healthy adults who had experienced a sibling's cancer during childhood or adolescence and seven matched controls were asked to give a 5-minute spontaneous free-association speech sample following specific instructions designed to activate a buffer zone between fantasy and reality. Three psycho-dynamically oriented professionals and three non-experienced professionals were shown the videos in random order and asked to classify them blind according to possible traumatic history (i.e., being siblings of children with cancer) using a -2/-1/1/2 response pattern.
Psycho-dynamically oriented professionals (1) were able to recognize, beyond levels attributable to chance, healthy adults who had experienced a sibling's cancer, without explicit knowledge of this history (p=.002); and (2) discriminated better than inexperienced professionals (p=.003), who were unable to make such decisions beyond levels attributable to chance (p=.68). Of course, these results should be discussed in greater depth, but this is beyond the scope of the present paper, which focuses mainly on methodological considerations.
The R script used to perform these analyses is presented in the appendix.

DISCUSSION
The World Psychiatric Association (WPA) President's workplan for 2005-2008 was centered on "psychiatry for the person" which "promotes a contextualized and integrative perspective, seeking to articulate science and humanism in the service of the wholeness of the person who consults" [14].
However, when adapted to research, this laudable ambition to integrate science and humanism raises methodological issues. Although qualitative research can be interesting in this perspective, it has several drawbacks when used alone, among which are the questionable generalizability and refutability of its results [15]. The procedure proposed here thus appears as an example of a methodology that can enable the statistical testing of hypotheses that could be difficult to tackle using traditional tools like scales, questionnaires or cognitive tests.
Of course, the randomization of experimental materials is not a new idea in psychiatric or in psychological research [16]. We have even seen, in the introduction, that it goes back to a very early proposition. But, to our knowledge, there is no formal presentation of a method that enables power and sample size calculations. Like all methods, the proposed procedure has advantages and limitations. Among the advantages: it can deal with (very) small samples of records; the methodology is clear-cut, with straightforward statistical inferences; the possibilities for applications are wide. Two protocols based on this methodology are presently underway. The first is a randomized controlled trial among children with Attention Deficit Hyperactivity Disorder (ADHD). This trial compares "treatment as usual" (psychological treatment and medication) to "treatment as usual" plus psychodynamic therapy. The objective is to show that after one year of treatment, the two groups are distinguishable on the basis of a general 5-minute videotaped interview. The second protocol concerns psychological autopsies of state employees. The procedure will be used to show that records obtained from the last days of the lives of state employees who died by accident are distinguishable from the records obtained from the last days of the lives of state employees who committed suicide.
One limitation of the procedure presented here is that although it is based on a randomized procedure, it can be prone to problems of interpretation. Indeed, the randomization process makes it possible to say with a 5% error that records from groups A and B are distinguishable; but it does not say why these records are distinguishable. In the example presented above, if a subject from group A (siblings of children with cancer) says explicitly during the interview that he/she had this particular history, the experiment is of course no longer valid. This limitation can be tackled in several ways. First, the question introducing the interview needs to be as general as possible; in the present situation it was "[…] you could talk about the importance you assign to your dreams but also about how you relate to art, painting, music, or sculpture, and about how much room you give to all these feelings in your everyday life. […]". In addition, an experimenter needs to verify that the explicit content (verbatim) of the interview is not informative. Finally, if a second group of raters is used and if the hypothesis tested concerns a difference between the two groups of raters, then the experiment is less limited by the explicit content problem (since it is present in the same way in the two situations). It should be noted that a limitation of this kind is present in many other randomized experiments, for example in randomized controlled trials on medications when there is no blinding, or when the blinding is likely to be invalidated due to the presence of specific side-effects.
In any event, if there is a need to generate hypotheses about the process which enabled discrimination of subjects from groups A and B, this can be done through qualitative interviews with the raters and a content analysis of these interviews.
In conclusion, provided that 1) cases and controls are selected in a careful manner; 2) raters are experienced and motivated by the study so that they will take all the time necessary to provide carefully thought-out ratings; 3) the starting question addressed to the subjects is worded in a way that does not generate an explicit content bias; 4) written records are drafted by a person blinded to group membership (when written records are used); 5) the duration of the audiotaped or videotaped interviews is optimal (5 minutes seems to be sufficiently informative and acceptable in practice [17]); then the procedure can be powerful in producing results with a level of evidence that is difficult to achieve using alternative methodologies.