Study of Population Structure and Genetic Prediction of Buffalo from Different Provinces of Iran using Machine Learning Method


  • Zahra Azizi Department of Animal Science, Faculty of Agriculture, University of Tabriz, Tabriz, Iran
  • Hossein Moradi Shahrbabak Department of Animal Science, Faculty of Agricultural Science and Engineering, College of Agriculture and Natural Resources, University of Tehran, Iran
  • Seyed Abbas Rafat Department of Animal Science, Faculty of Agriculture, University of Tabriz, Tabriz, Iran
  • Mohammad Moradi Shahrbabak Department of Animal Science, Faculty of Agricultural Science and Engineering, College of Agriculture and Natural Resources, University of Tehran, Iran
  • Jalil Shodja Department of Animal Science, Faculty of Agriculture, University of Tabriz, Tabriz, Iran



Classification, Buffalo, Machine learning, SNP Chip data.


Because livestock breeding programs for milk production and type traits must account for the existence of two ecotypes of Iranian buffalo, this study investigated the population structure of Iranian buffalo and evaluated how accurately animals can be classified into the two ecotypes (Azerbaijan and North) from 90K SNP chip data, using Support Vector Machine (SVM), Random Forest (RF), and Discriminant Analysis of Principal Components (DAPC) methods. A total of 258 buffalo were sampled and genotyped. The results of admixture, multidimensional scaling (MDS), and DAPC analyses showed a close relationship between the animals of different provinces. The SVM approach classified the two ecotypes with the highest accuracy (96%), a result confirmed by the Area Under the Curve (AUC), while the DAPC and RF approaches showed lower accuracies of 88% and 80%, respectively. SVM was thus more accurate than DAPC and RF and assigned animals to their herds more accurately. According to these results, buffaloes distributed in the two ecotypes are one breed, and therefore the same breeding program should be used in the future. The water buffalo ecotypes of the northern provinces of Iran and of Azerbaijan seem to belong to the same population.
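As an illustration of the two evaluation measures named in the abstract, the following minimal Python sketch computes classification accuracy and AUC for a two-ecotype assignment. It is not the authors' pipeline: the function names, the 0.5 decision threshold, the label coding, and the toy scores are all assumptions made for illustration only.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of animals assigned to their true ecotype."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive animal
    receives a higher score than a randomly chosen negative one
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not from the study):
# 1 = Azerbaijan ecotype, 0 = North ecotype.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]    # made-up classifier scores
y_pred = [int(s >= 0.5) for s in scores]   # assumed threshold of 0.5

print(accuracy(y_true, y_pred))
print(auc(y_true, scores))
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two measures can complement each other when comparing classifiers such as SVM, RF, and DAPC.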




How to Cite

Azizi, Z., Shahrbabak, H. M., Rafat, S. A., Shahrbabak, M. M., & Shodja, J. (2020). Study of Population Structure and Genetic Prediction of Buffalo from Different Provinces of Iran using Machine Learning Method. Journal of Buffalo Science, 9, 48–59.