Inference Procedures on the Ratio of Modified Generalized Poisson Distribution Means: Applications to RNA_SEQ Data

: The Poisson and the Negative Binomial distributions are commonly used as analytic tools to model count data. The Poisson is characterized by the equality of mean and variance whereas the Negative Binomial has a variance larger than the mean and therefore is appropriate to model over-dispersed count data. The Generalized Poisson Distribution is becoming a popular alternative to the Negative Binomial. We have considered inference procedures on a modified form of this distribution when two samples are available from two independent populations and the target effect size of interest is the ratio of the two population means. The statistical objective is to construct confidence limits on the ratio. We first test the presence of over dispersion and derive several estimators in the single sample situation. When two samples are available, our interest is focused on the estimation of an effect size measured by the ratio of the respective population means. We have compared two methods; namely the Fieller’s and the delta methods in terms of coverage probabilities. We have illustrated the methodologies on published genomic datasets.


INTRODUCTION
The Poisson distribution is commonly used to model count data. However, a restriction of this distribution is that the response variable must have a mean equal to the variance. This restriction does not often hold for many biological and epidemiological data. In reality, the variance can be much larger than the mean, a phenomenon known as "overdispersion". This overdispersion may occur due to population heterogeneity, or the presence of outliers in the data [1]. An analysis of data with overly dispersed counts can lead to the underestimation of parameter standard error if overdispersion is ignored. A review of the issue of overdispersion in both binary and count data was reviewed by Hinde and Demetrio [2], and in a more recent review by Hayat and Higgins [3]. Diagnosing and accounting for overdispersion is not a simple issue and should be appropriately dealt with to avoid bias in interpreting the results.
When overdispersion is suspected, the Negative-Binomial (NB) distribution has been adopted as a common alternative to the Poisson distribution. The NB has two parameters and a variance that is a quadratic function of the mean and has therefore has been the model of choice to model count data that exhibit overdispersion. Since accounting for measured covariates is one of the methods used to address the issue of over dispersion by including them in a *Address correspondence to this author at the Department of Epidemiology and Biostatistics, Schulich of Medicine and Dentistry, University of Western Ontario, London, Ontario, Canada; Tel: +12269778651; E-mail: mmshoukr@uwo.ca, Shoukri.mohamed@gmail.com regression model, Hinde [4] reviewed the methodologies of NB regression. Joe and Zhu [5] drew a comparison between the NB and a mixture-based generalization of the Poisson distribution.
In this paper, we discuss several inferential statistical issues related to a modified form of the Generalized Poisson Distribution (GPD). The GPD distribution was introduced to the statistical literature by Consul and Jain [6] and a detailed account of its properties was given by Consul [7]. The distribution has two parameters and has variance larger than the mean. This makes the GPD an attractive competitor of the Negative Binomial Distribution (NBD). The distribution has been used to analyze data in the fields of genetics [8] as a queuing model [9,10,11] and genomics [12]. The modified form of the GPD, which we shall call "Modified Poisson Distribution" (MGPD)was first discussed in [9]. The modification was the result of a double parametric transformation on the original parameters of the GPD. The main purpose of the transformation is to achieve parameters orthogonality [13], which will improve the statistical properties of the maximum likelihood estimators, and make the MGPD a member of the "Generalized Linear Models" [14].
There are situations when the researchers have the opportunity to study count data under two experimental conditions. One of the questions of interest is to conduct statistical inference on the ratio of the mean counts. To the best of our knowledge, the issue of constructing a confidence interval on the ratio of means of two MGPD's has not been discussed before.
The paper is divided into 6 sections. In Section 1, we discuss the issue of parameters estimation in single samples. We study several estimators and evaluate their asymptotic efficiencies relative to the method of maximum likelihood. In Section 2 we introduce the score test for overdispersion. In Section 3 we consider the problem of testing for over dispersion, and in section 4 we deal with the problem of constructing confidence intervals on the ratio of means, where we compare two approaches; the Fieller's interval and the delta method. In section 5 illustrate the methodology on published data arising from reading counts of and RNA sequencing of gene expressions data. The conventional abbreviation is RNA_SEQ. General discussion will be given in section 6.

MODIFIED GENERALIZED POISSON DISTRIBUTION
The GPD was introduced by Consul and Jain [6] The GPD whose probability function is given in (1) reduces to the well-known Poisson distribution when ! ! = 0. Therefore the parameter ! ! with the above restriction on its range, is considered the dispersion parameter. Shoukri and Mian [9] employed the parametric transformations: Therefore, equation (1) reduces to: where For fixed !, the function g(.) is the natural parameter the transformation which renders the GPD a member of the linear family of exponential class (see; McCaullagh and Nelder [14]): We call the transformed GPD, the "Modified Generalized Poisson Distribution" or MGPD Shoukri and Mian [9] showed that a recurrence relation among the rth non-central moments ! ! ! is such that: Here, ! ! ! ≡ 1, ! ! ! ≡ ! = ! ! .
Using the recurrence relation (4) one can show that the higher central moments are given by:
The likelihood function is given by: The log-likelihood function: The first partial derivatives of the log-likelihood function with respect to the model parameters are: to zero we get, explicit unique solution for ! as ! = !. On the other hand; Consul and Shoukri [15] showed that equation (8) has a unique root in !, in (0,1) if and only if ! ! > !.
We can also show that: Equation (10) indicates that the model parameters are orthogonal. Moreover; Hence, the variance of the maximum likelihood estimators are: !! ! , and the two estimators (µμ, ! ) are stochastically independent because they are orthogonal to each other as shown in equation (10).

Moment Estimators
Equating the first two sample moments to their corresponding population moments and solving for the parameters we get: The variance of the ml estimators are (due to the parameters orthogonality) And for the moment estimator The relative efficiency of the moment estimator of the dispersion parameter is measured by the ratio of the variance of the maximum likelihood estimator to the variance of the corresponding estimator.  Table 1 for a few values of the model parameters. The relative efficiency (eff_moment) of the moment estimator is quite high for small values of the mean and the dispersion parameter but declines rapidly as both parameters increase.
Another type of estimator for the dispersion parameter (!) which has not been discussed before is the so-called mixed estimator. We consider this estimator in the next sub-section.

Mixed Estimators
Here we use two-sample statistics to estimate the model parameters. Let ! ! , ! ! , … ! ! be the outcomes of a simple random sample let ! ! denote the count of zeros. Clearly ! ! ! , ! are sufficient statistics for ! ! , ! , where Solving the equations: , for ! we get: To find the variance of ! ! , we use the delta method so that to the first order of approximation we have , however, the derivation of !ov log ! ! , ! is not straight forward. To derive the covariance between the sample mean and the fraction of zeros in the sample we proceed as follows: Since ! ! = !! ! has binomial distribution ! !~b in !, ! ! , ! ! ! = !! ! and ! the sample total has expected value !", one can show that the joint MGF of !, ! ! ; The function ! ! ! is the mgf of the MGPD, which does not have an explicit expression.
Differentiating ! ! ! , ! ! with respect to ! ! and ! ! , setting ! ! = ! ! = 0, we get Using the delta method, we can show that Direct substitution gives: Similar to the calculation of the efficiency of the moment estimator we measure the efficiency of the mixed estimator by the ratio: var ! /!"# ! ! Table 2 shows the relative efficiency of the mixed estimator (eff_mixed). The behavior of the relative efficiency of the mixed estimator is similar to that of the moment estimator in terms of variations in the parameter values. However, the relative efficiency of the moment estimator of the dispersion parameter is lower than that of the mixed estimator, for large values of the population mean and the dispersion parameter.

TESTING FOR OVERDISPERSION: SAMPLE SIZE REQUIREMENTS TO DETECT OVER-DISPERSION USING THE SCORE TEST
As mentioned, the MGPD reduces to the Poisson distribution when the dispersion parameter ! is set equal to zero. Therefore, to construct a goodness of fit test where the null hypothesis is that the available data is drawn from a Poisson distribution against an alternative in the direction of the MGPD, our best approach is to use the score-testing. The score function is obtained by differentiating the log-likelihood function with respect to the dispersion parameter, and setting the value of the dispersion parameter equal to zero. The advantage of the score test is that the test statistic is evaluated only under the null hypothesis [16]. We proceed as follows: Based on a !"!, ! ! , ! ! , … ! ! the score test on the null hypothesis ! ! : ! = 0 against one-sided alternative ! ! : ! > 0 is given by: In equation (16) ! ! and ! are respectively the sample variance and the sample mean.
The mean and variance of ! are given as: For Type I error rate !, and power 1 − ! the approximate sample size n to detect the departure from the Poisson (i.e. ! = 0) in the direction of MGPD is: In Table 3 we show the empirical power for small, moderate, and large samples at a 5% level of significance. The power increases the farther away ! from its null value, when the sample size is large and when the population means is large as well.
In the previous sections we dealt with the problem of statistical estimation in a single sample. This was inevitable to properly deal with the two samples situation.

Estimation of the Ratio of Two Means
Let Y1, Y2 be random variables with expected values E(Y1) and E(Y2 ). Of interest is the ratio of the two means; R= E (Y1)/E(Y2) = ! ! /! ! . We know that the unbiased maximum likelihood estimators of the population mean, are the sample means. We assume that we have two independent sample ! !! , ! !" , … . . ! !!! from a MGPD with parameters (! ! , ! ! ), and another independent sample ! !" , ! !! , … . . ! !!! from independent MGPD with parameters ! ! , ! ! . We shall discuss two methods for constructing confidence levels on the ratio of mean R.

FIELLER'S METHOD
The approach to construct confidence limits on the ratio of means is the Fieller's method [16,17], applied to independent samples with unequal variances as was shown for the normal distribution [18].
We denote the maximum likelihood estimator of R by ! ! = ! ! /! ! . Furthermore, we denote For a Type I error rate α, we have: The inequality in the square bracket in equation (18) may be written as: Solving the quadratic for ! ! we get: Simplifying we get: When Fieller's confidence set for the ratio is finite, then it is given by ! = ! − !, ! + ! .

We can establish bioequivalence using the Fieller's theorem. The confidence set is an interval if
Hence as shown in [19] and [20] the equivalence range is 0.8, 1.25 .

Delta Method
From [21], the variance of the ratio of two random variables is, to the first order of approximation given by: Therefore, an approximate (1 − !)100% confidencein the ratio of means is given by: In general, the estimated ratio of two means is biased. The magnitude of bias can be obtained again by using the Delta method and is given to the first order of approximation as: We simplify the above expression to get: In Table 4, we compare the empirical coverage probabilities of Feiller's to those of the delta method, for selected values of the population parameters, and nominal level of significance 5%. To simplify the table, we assume the homogeneity of the dispersion parameters of the two populations. As can be seen, for small values of the ratio and small values of the common dispersion parameter, the Fieller's theorem gives better coverage. However, for larger values of the dispersion parameter, both methods seem to have similar coverage probabilities.

APPLICATIONS (RNA-SEQ DATA)
Gene expression is the process by which information from a gene is used in the synthesis of a the functional gene product, which may be proteins. A gene is declared differentially expressed if an observed difference or change in reading counts or expression levels between two experimental conditions is statistically significant. To identify differentially expressed genes between two conditions, it is important to find the statistical distributional property of the data to approximate the nature of differential genes. As we have already indicated, the Poisson distribution is ubiquitous in the analysis of count data. It is usually assumed that the position-level read count follows a Poisson distribution with a rate !. But evidence from a large body of data do not support the Poisson assumption of the equality of mean and variance [22,23]. Robinson and Smythe [24] used the negative binomial distribution to analyze tag abundance to account for overdispersion in this type of data.
In the present study, the focus is mainly is on investigating the differential gene expression analysis for sequence data based on MGPD. This approach was applied in RNA-seq read count data [12] where the authors used the original form the model given in equation (1). Thus, fitting of appropriate distribution to gene expression data provides statistically sound cutoff values for identifying differentially expressed genes. One of the basic questions while analyzing genomic data is related to the identification of the appropriate distribution of the position level read counts. This distribution if proven appropriate allows: i.
Better estimation of gene expressions ii. Improving the identification of differentially expressed genes The proposed MGPD will be used to re-analyze the data (Sudeep & Chen [12]) for some highly expressed genes. The published data were downloaded from http://www.ncbi.nlm.nih.gov/sra/ as the fastq files: SRA010153 for the MAQC data, SRP000727 for the human data (the two low-coverage MAQC samples were excluded), SRX000559-SRX000564 for the yeast data.
We analyzed the read count of the Mice-Brain tissue data under two experimental conditions named (Chrom1, and Chrom9) using the MGPD .   Figures 2 and 3 show the histograms of the read counts for the Chrom1 and Chrom9 respectively.
As can be seen from Figures 1 and 2, the data are heavily skewed due to the presence of outliers. This may be one of the reasons for overdispersion in the data. In Tables 5 and 6 we present the summary statistics under the two experimental conditions.    Figure 3: Histogram of the read counts for the second sample from Mice-brain tissues.  The likelihood estimators of the dispersion parameters and their standard errors are given in Table 7.
The construction of the 95% confidence intervals on the ratio of means based on the delta method and Fieller's theorems show that: Because the sample sizes are large the two methods give almost the upper and lower limits for the same 95 % confidence level. Moreover, the Fieller's limits show the non-equivalence of the two population means as indicated in Figure 1.

DISCUSSION
In this paper, we demonstrated the applicability of the modified form of the generalized Poisson distribution. The modification is, in fact, a double transformation on the original model parameters. We used the score testing to assess the departure of the model from the Poisson distribution, and provided sample size justifications, and evaluated the power of this test. The inference procedure on the ratio of two means was evaluated by estimating the coverage probabilities using simulations.
There are situations however when data may be available from multiple samples. The two questions of interest are: i.
how to test the homogeneity of the dispersion parameters in two or more MGPD models, and ii. how to test the homogeneity of several MGPD means in the presence of covariates. This is equivalent to the ANCOVA model Both questions are under investigation by the authors of this paper.