Study of Population Structure and Genetic Prediction of Buffalo from Different Provinces of Iran using Machine Learning Method

Considering breeding livestock programs to milk production and type traits based on existence two different ecotypes of Iranian’s buffalo, a study carried out to investigate the population structure of Iranian buffalo and validate its classification accuracy according to different ecotypes from Iran (Azerbaijan and North) using data SNP chip 90K by means Support vector Machine (SVM), Random Forest (RF) and Discriminant Analysis Principal Component (DAPC) methods. A total of 258 buffalo were sampled and genotyped. The results of admixture, multidimensional scaling (MDS), and DAPC showed a close relationship between the animals of different provinces. Two ecotypes indicated higher accuracy of 96% that the Area Under Curve (AUC) confirmed the obtained result of the SVM approach while the DAPC and RF approach demonstrated lower accuracy of 88% and 80 %, respectively. SVM method proved high accuracy compared with DAPC and RF methods and assigned animals to their herds with more accuracy. According to these results, buffaloes distributed in two different ecotypes are one breed, and therefore the same breeding program should be used in the future. The water buffalo ecotype of the northern provinces of Iran and Azerbaijan seem to belong to the same population.


INTRODUCTION
Buffalo plays an essential role in the animal husbandry and agricultural economy of many countries all over the world [1]. Moreover, it influences the rural family's economy where it is bred due to its abilities for producing milk and meat. In some studies, the role of buffalo in the Socioeconomic Development of Rural Areas has been studied [2], and the reasons buffalos failure to contribute to livestock production is include lack of financial resources and interest in the private sector [3]. Asian buffalos are divided into two subspecies, including River buffalo and Swamp buffalo, which the Iranian buffalo belongs to the River type [4]. Furthermore, there are three buffalo ecotypes in Iran consist of Azerbaijan ecotype, which is living in East-Azerbaijan, West-Azerbaijan, and Ardabil provinces, North (Northern) ecotype found in Guilan and Mazandaran provinces, and Khuzestan ecotype observed in Khuzestan province.
Geographically, widespread species often exhibit considerable genetic diversity across the populations [5]. One of the interesting subjects in large scale studies is to study the existence of genetic differences among subdivided groups ascertained from various geographical locations. The Single Nucleotide method to describes clusters of genetically related samples [20]. In other words, DAPC identifies genetic clusters and optimizes the separation of individuals into pre-defined groups and provides group membership probabilities [21]. Clustering methods were used for the inference of population structure of human [22], Mongolian domestic camels [23], horse breeds [24], and Italian domestic pigeons [25].
The supervised learning approach is efficient when individuals are classified into pre-defined populations, particularly in quality control for large scale genomewide association studies (Bridges et al. 2011). Diverse methods of machine learning such as Random Forest (RF), Support Vector Machine (SVM) [26], and DAPC tackle the classification problems. Support Vector Machine is extensively being applied as a solution to classification issues [27], and dealing with the great dimensionality problem in a computationally flexible manner [28]. Random forest is a classification algorithm that constructs multiple decision trees, each of which is built on a bootstrap sample of the training data using a randomly selected subset of variables [29]. The random forest has excellent comparable performance to SVM in classification tasks and conducts both classification and regression precisely. Regression models and SVM were employed in the subpopulation assignment of German Warmblood horses [30]. Machine learning has been applied to proteomics tandem mass spectrometry data, classification, and biomarker identification in postgenomics biology. Also, it has been used for GWAS and genetic prediction of a discrete and complex trait [31][32][33][34][35][36].
In this research, diverse ecotypes were based on various climate conditions. Given the importance of the study population structure for decision-making in the implementation of breeding programs, the structure of the population was studied. Subsequently, the classification strategy in two ecotypes was assessed by SVM, RF, and DAPC methods.
Our hypothesis is that SVM, RF, and DAPC approaches may show a better prediction for identifying animals in distinctive native breeds and (sub) population, due to the use of prior knowledge of the membership of the populations. The aim of the research is to determine the most accurate methods of predicting animals, especially for recognizing ecotypes, subpopulation, and native breeds.

Animal Samples and Genotyping
Hair and blood samples were collected from flocks that had the registration and recording system of the National Animal Breeding Centre of Iran. The selection of the sampling regions was performed according to the registered farms by the National Animal Breeding Center and Promotion of Animal Products, the ministry of agriculture. Two factors were considered to select the samples: different geographical distribution and relativity of the breed in the pedigree. Animal sampling for the Azerbaijan ecotype was performed in West Azerbaijan province (3 cities), Ardabil province (2 cities), and East Azerbaijan province (5 cities) and for the North ecotype it was conducted in Guilan province (7 cities) (Figure 1). Totally, 262 samples were genotyped, including: 68 from East Azerbaijan, 65 from west Azerbaijan, 56 from Ardabil and 73 from Guilan provinces ( Table 1). Genomic DNA was extracted from the hair roots [37], and whole blood by applying a salting-out protocol [38]. Samples of DNA genotyped using the Axiom® Buffalo Genotyping 90 K Array (Affymetrix), and after quality control of the genotyped data, population structure analysis was performed

Data Quality Control
SNP genotypes were extracted from raw data by using the AffyPipe workflow [52] and applying default. Primary quality control and filtering, Initial Quality Control (QC) were carried out, and genotypes exported in PLINK. In the genotyping process, 4 samples (2 samples from Ardabil province and 2 samples from Guilan province) with more than 5% missing data were excluded from further analysis. The reason for the omitted data could be related to the DNA quality, which was not so high, so that likely to show more missing data and incorrect genotype calls [39]. In total, 19 SNPs were removed due owing to unknown position, 8855 SNPs were removed due to minor allele frequency (MAF<0.01), and 336 SNPs were deleted through Hardy-Weinberg disequilibrium at the 5% level. A total of 64750 SNPs passed QC steps. Quality control was performed by PLINK for the initial data to ensure the overall quality of genotyped samples. The samples with more than 1% missing data were excluded from the analysis. Then MAF and call percentages were calculated for each SNP. The SNPs that had a call rate lower than 95% and a MAF < 1% were discarded. Deviation from Hardy-Weinberg equilibrium (p < 10 -6 ) was estimated for the remaining SNPs to identify genotyping errors [40]. The Bonferroni Correction (β=α/n) was used to address the multiple testing comparison problems [41]. The number of tests was taken to be the number of SNPs (n = 64,000), being 10 -6 the corresponding value to ! = 0.05 experiment-wise error. We initiated the QC test with the edited Affymetrix data comprising 64750 SNPs. Then, 19 SNPs were removed due to unknown position, 7 SNPs were removed due to MAF<0.01 and 5 SNPs were not in Hardy-Weinberg equilibrium at the 5% level. Finally, a total of 64719 SNPs passed quality control steps.

Determining Population Structure
Admixture (Unsupervised Hierarchical Clustering) The model-based clustering algorithm was performed in ADMIXTURE v1.23 [42] to investigate genetic structure in the combined dataset, and the genetic share of populations were plotted. ADMIXTURE is a program for estimating ancestry from a large autosomal SNP genotype dataset where the individuals are unrelated (admixture). In admixture, k is a factor involved in determining the number of ancestral populations.

Multidimensional Scaling
Multidimensional scaling [43] was conducted using PLINK on the basis of the genome-wide average proportion of Identical by State (IBS) shared alleles between every two individuals to visualize substructure and provide quantitative indices of population genetic variation and furthermore, identify outlying individuals [18]. Multidimensional scaling was performed based on the matrix with i and j elements (average proportion of IBS alleles shared by i and j individuals) using cmdscale in R Software.

Discriminant Analysis Principal Component
Discriminant analysis principal component is consists of unsupervised (as clustering method) and supervised procedures (as a predictive model). Discriminant analysis principal component analysis is employed when groups are often unknown, and there is a need for identifying genetic clusters before describing them. The number of clusters obtained by means of the find.clusters function and the optimal number of clusters were determined with Bayesian Information Criterion (BIC) that the rate of decrease in BIC values was visually examined to identify values of k, after which BIC values decreased only subtly Discriminant analysis principal component would be like trying k means with different ks, calculating BIC for each k and choosing the best k and defined as: Where W (X) is the variance within groups, g is the number of groups, and n is the number of observations, then low BIC values are better than high ones [44]. The DAPC was performed using the adegenet package (function DAPC) for R software (Jombart, 2008). The supervised procedure of DAPC provides membership probabilities of each individual for the different groups that obtain indications of how clear-cut genetic clusters are.

Prediction Models
The supervised methods are used to predict unknown animals and determine the probability of membership populations. Hence, it requires that the data be prepared.

Labeled Samples for Classification
The labeling was based on two ecotypes in which Azerbaijan ecotype with North ecotype considered as two classes and analyzed simultaneously.

Support Vector Machine Classifiers
An SVM constructs a set of hyperplanes in a high or infinite-dimensional space, which can be used for classification, regression, or other tasks. A good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class. The region bounded by two hyperplanes is called the margin, and the maximum margin hyperplane is the hyperplane that lies halfway between them. The original optimal hyperplane algorithm was a linear classifier [27]. Nonlinear classifiers were created by applying the kernel trick to maximum-margin hyperplanes that allow the algorithm to fit the maximum-margin hyperplane in high-dimensional feature space [45].
The effectiveness of SVM depends on the choice of the kernel, the kernel parameters, and a soft margin parameter C. A common choice is the Gaussian kernel (Radial Basis Function), which has a single parameter ! and RBF kernel can be defined as: Where || x ! " x || 2 is squared Euclidean distance between the two feature vectors and ! is a parameter needed for kernel. The parameters of C and ! is often selected by a grid search with exponentially growing sequences of C and ! . Typically, each combination of parameter choices is checked using cross-validation, and the parameters with best cross-validation accuracy in the training set of each fold are picked [46]. In this study, the package of e1071 was used for SVM, and the function of tune () was used to set the parameters using R Software that we tune the parameters for each training set.

Random Forest Classifiers
Random forest is an assemble learning algorithm for classification developed by Leo Breiman, which uses an ensemble of unpruned decision trees, each of which is built on a bootstrap sample of the training data using a randomly selected subset of variables. Each tree gets a vote in classifying. Individual trees are constructed as follows from data having n animals and m SNP: Choose a training set by selecting n animals, with replacement, from the data.

2.
At each node in the tree, randomly select m SNP from the entire set of m SNP in the data.

3.
Choose the best split at that node from among the m SNP.

4.
Repeat the second and third steps until the tree is fully grown.
Random forest uses bagging and random variable selection for tree buildings that result in a low correlation of the individual trees. The algorithm makes an ensemble that can achieve both low bias and low variance [29]. For finding the relationship between x and y, RF, build a model for predicting the value of y for a new value of x. A RF classification model is a collection of classification tree predictors {h(x, ! k ), k = 1,..} Where the ! k is P×N matrix that P is a (p×1) vector of animals and N is a (1×n) vector representing the genotype of each animal (0, 1, or 2) for n SNP, to which k decision trees are built. An interesting feature in the RF is "Out of Bag" (OOB) that has been explained below.

Out of Bag Sample
Considering a single tree from a random forest, grow the tree on a bootstrap sample (the bag). About two-thirds of the cases are in the bag, and the remaining one-third data are out-of-bag. The out-of-bag data are like a test set for this tree [29]. The out of bag data accuracy is the accuracy of the RF predictor that gives an estimate of test set accuracy.
There are two components of randomness involved in the building of the Random Forest, and they need to be tuned. First, for constructing each tree, a random subsample of the total data set was selected to grow the tree (ntree). After that, at each node of the tree, a well-performing variable from a random subset of all variables (mtry) was chosen as a splitter variable.

Cross-Validation
Data were randomly divided into training and testing sets. The training set was used for the statistical model construction, i.e., learning the classifier. The testing set was applied to check the accurate estimation of the classifier. In k-fold cross-validation, firstly, the training set was divided into k subsets of equal size. Sequentially one subset was tested using the classifier trained on the remaining k-1 subsets. The crossvalidation procedure can prevent the overfitting problem, and estimations will be unbiased because each testing set was used only once to estimate the performance of a single classification model that was built by using training data exclusively. The accuracy was estimated as the average accuracy obtained after the k-fold cross-validation.

Confusion Matrix
The confusion matrix is a method to examine the performance of classifiers. A confusion matrix contains information about actual and predicted classifications done by a given classification method. The performance of such a system is commonly evaluated using data in the matrix. Table 2 shows the confusion matrix for two-class classifiers.  Table 2, Sensitivity, Fall-out, Specificity, and Accuracy were calculated based on the formula: Where TN is the number of correct negative predictions; FN is the number of incorrect positive predictions; FP is the number of incorrect negative predictions, and TP is the number of correct positive predictions.
The predictive accuracy of the classifier was estimated using sensitivity, specificity, accuracy, and phi coefficient correlation. The most important criterion for determining the performance of a classifier algorithm is its accuracy. In fact, it is the most famous and common criterion for calculating the efficiency of classifier algorithms.

Phi Coefficient Correlation
The phi coefficient correlation is identical to the Pearson that estimated for two binary variables.
In this study, we used the caret package and VCD package to calculate the confusion matrix and the phi coefficient (49,50).

Area under the Receiver Operating Characteristic (ROC) curve
The ROC graph provides information on the performance of the classification model [51]. Moreover, it is a plot with the False Positive Rate on the X-axis (FPR) and the True Positive Rate on the Y-axis (TPR). The area under the ROC curve (AUC) is widely used for the performance measurement in classification and diagnostic rules [52]. If AUC is close to 1, the result of the test is excellent. On the contrary, the closer the AUC to 0.5, the lesser accuracy of the test result. The AUC can be used as a model comparison criterion and can be interpreted as the probability that a given classifier assigns a higher probability to a correct label when the animals are randomly picked. Individuals with a true genetic susceptibility above or below the population average were assumed positive or negative cases, respectively. Models with higher values of AUC are desirable and are considered more robust [53]. The packages of ROCR and pROC were used to calculate the area under the ROC and plot corresponding graphs [54,55].

Multidimensional Scaling
The multidimensional scaling demonstrated the individual distribution of the different provinces. It presented separated clusters, but there is mixing and overlapping between individuals from the various provinces (Figure 2).

Discriminant Analysis Principal Component
The DAPC method showed the distribution of individuals of the different provinces that presented separated clusters, but there is a close relationship between individuals ( Figure  3). Overlapping distributions of genetic clusters on the ordination plot indicated a low degree of genetic differentiation. Furthermore, Figure 3 illustrated the density of different provinces and the close relationship between individuals. This method used k means for finding the number of clusters. The number of clusters (k) that can be inferred from the estimation of BIC plot by means of DAPC procedure is shown in Figure 4, and additionally, BIC had the lowest value of k=1 (Figure 4) that displayed well the actual number of populations and confirmed that the population is a member of a group.
The results of DAPC analysis demonstrated that 250 PCs correspond to about 99% of variance exhibiting in an ordination plot with the first two axes ( Figure 5). According to the Figure 5, explanation 90% of the variance needed more than 200 components, but retaining too many components with respect to the number of individuals can lead to overfitting in the  membership probabilities returned. The result of crossvalidation for optimization of the trade-off between maintenance of too few and too many PCs in the model with 30 replicates indicated that the number of retaining PCA should be 60 PCs.

Admixture
The model-based clustering shows ancestral and mixture proportions of the individuals of different provinces. According to the results obtained from the model-based method, the animals in different provinces belonged to one breed, and they could not be distinguished from each other. The distribution of colors in Figure 6 illustrates an admixture between the individuals of different provinces.
The different methods used for studying the structure of the population demonstrated that these populations belong to one group. It means the breeding programs should be the same for these populations. In the next step, the accuracy of the prediction of individuals from two ecotypes was investigated by a supervised method.

Parameter Regulation of SVM and RF
For SVM, the C-classification SVM with a radial kernel was applied for classification in the R package e1071 [56] with ! = 1.5×10 -5 and a regularization (cost) parameter =1 determined by a grid search. As well, for   RF, the optimal tree size was determined to be at ntree= 1000, and the optimal mtry=250 along with nodesize=5 were defined by an average OOB error of 31.47%.

Analysis of the Two Ecotypes
In this section of the analysis, data set gained from three provinces of Azerbaijan ecotypes (members of different provinces merged) were considered as the first class, and the ecotype of Guilan was considered as the second class where two classes were analyzed together. In SVM analysis, the results of crossvalidation with k=10 showed better accuracy of 98%, while DAPC and RF analysis represented an accuracy of 88% and 80%, respectively. The ROC curve displayed a good classification performance, and the area under the curve confirmed the classifier accuracy (Figure 7a, b, and c) representing a better performance of the classifier with a test set of SVM, DAPC, and RF methods. Table 3 shows the results of ROC curve components represented better accurate methods. SVM, DAPC, and RF methods predict the individuals of each province on the basis of their genomic data. The above procedures indicated the probabilities of each class and could provide the QC of our data sets by removing those individuals whose classifications show very low probability and affect predicted accuracy (result not shown). Hence, the individuals were  The results showed a better individual classification of the two ecotypes with the SVM method. Despite the short distance between the animals of the two ecotypes, they could be predicted more accurately based on the training data set. The animals of two ecotypes and their membership probability were distinguished according to the predictions achieved by these methods. In the present study, SVM acted better than DAPC and RF. Interestingly, these methods allow for a probabilistic assignment of individuals to each group.

DISCUSSION
Nowadays, animal breeding researches tend towards the understanding of genome structure, mechanisms of evolution, and finding loci under selection with the increasing use of genomic information. Understanding population genetic structure is valuable for better implementation of breeding programs and, most importantly, preservation of genetic resources. Recent large-scale genotyping and sequencing technologies, e.g., next-generation sequencing, are useful for the study of genetic livestock diversity and population structure. In these large scale genome-wide association studies, it is necessary to determine whether the animals included from different herds and regions belonged to one or more different breeds and (sub) populations. Challenges related to stratification in the studying of the populations can be considered as a problem for GWAS studies [57,58]. Hence, studying population structure is important. Multidimensional scaling and DAPC analysis of exploring the population structure of SE and SC African Bantu-speaking population showed that the populations from distant sampling localities could be clustered closely in the plot [59]. The population structure of the Korean cattle breeds was studied using a multivariate approach and model-based methods. The authors found that DAPC, PCA, and MDS result determined 20 separated clusters, and unsupervised hierarchical clustering was showed ancestry ratio and admixture of breeds [60]. The previous studies proposed that DAPC can be used as an efficient genetic clustering method [20,61]. In this study, the result of MDS, Admixture, and DAPC showed a close relationship between the animals of the different provinces. These results suggest that it is better the breeding goals and programs to be considered for one population that can deal with decreases in the costs of the breeding programs. When the genetic admixture is high, it requires the determination of the probability of membership or unknown animals, and supervised methods are accurate classification ones to detect the individuals from each other. The most widely used classifiers are SVM and RF, which are supervised learning methods [62]. Supervised classifiers are able to recognize patterns in different features and assign individuals into one population or another. The supervised learning methods are able to classify individuals from two populations within Scotland in comparison with PCA; also, they can be used for QC in large scale genome-wide association studies [63]. The results of our study showed that the accuracy obtained from SVM and DAPC approaches were better than the RF one. In comparison with RF and SVM for microarray-based cancer classification, the results presented that SVM offers advantageous classification performance [64]. Comparing the other classification methods using seven microarray gene expression data sets, SVM, and more sophisticated classifiers such as RF showed the best performance among all methods [65]. Support Vector Machine was used to infer recent genetic ancestry of the American population, which showed 86% accuracy [66]. Prediction of the population assignments using whole-genome regression models and SVM revealed high prediction accuracy for the classification of horses into four German Warmblood breeds [30]. Discriminant analysis principal component method used for the analysis of genetically structured populations showed correct assignment rates ranging from 80% to 97% [20]. The SVM, DAPC, and RF supplied explicit probabilities for the classification of each individual. Despite the short distance between populations of the different provinces, they can be separated more accurately based on the training data set. Consequently, an individual with a specified genotype can be attributed more accurately to a breed, a region or a province that it belongs to. Furthermore, SVM, DAPC, and RF can classify populations based on a large number of markers without the necessity for strong assumptions to determine the population structure. These methods, including SVM, DAPC, and FR, can predict and assign the individuals for each group according to the determination of the probability of membership.

CONCLUSION
In fact, buffaloes from different regions belong to various ecotypes, and it related to climate conditions. However, this study showed that different buffalo ecotypes in Iran belong to one population, and in breeding programs and GWAS, they can be analyzed in the form of one population. Although the breeds studied belonged to two groups, it seems that the utilized SNP density was unable to distinguish them completely according to their phenotype. Another reason could be the mismatch between provincial names and provincial divisions of water buffalo distributed regions. That is, there is a crossbreeding between local breeds of Azerbaijan and northern provinces. To ensure the correctness of individual grouping and prediction of new individuals, supervised methods such as SVM, RF, and DAPC were used. Among the suited methods, SVM can be used particularly for identifying animals belongs to different (sub) populations, breeds, ecotypes.