Free statistical software

Free statistical software is a practical alternative to commercial packages. Many free-to-use programs aim to match commercial packages in function: they are general statistical packages that perform a variety of statistical analyses. Many others were designed for particular tasks, such as factor analysis, power analysis for sample size calculations, classification and regression trees, or analysis of missing data.

Many of the free-to-use packages are menu-driven and fairly easy to learn. Many others are command-driven. Still others are meta-packages or statistical computing environments, which allow the user to code completely new statistical procedures. These packages come from a variety of sources, including governments, universities, and private individuals.

This article is primarily a review of the general statistical packages.

Brief history of free statistical software

SAS was among the first commercial statistical packages, released for mainframes in 1968.[1] SAS has since released free-to-use versions, the most recent of which is SAS Studio.[2] Epi Info, a free-to-use program from the Centers for Disease Control and Prevention, was developed in the 1980s.[3] One of the first completely free-to-use and open-source statistical packages was R, first released in 2000.[1]

Some of the free software packages come from governments; for example, Epi Info is from the CDC (Centers for Disease Control and Prevention).[4] Others come from smaller or independent organizations or universities; JASP, for instance, is supported by the University of Amsterdam.[5] Two other packages, R[6] and PSPP, are being developed as part of the GNU Project by a large group of individuals, many of them volunteers, all over the world. These packages are notable in that they are not just open source but also free software in the same sense that material written on Wikipedia is free: others can edit, use, and redistribute it at will.

OpenStat was developed as a teaching aid.[7] Other packages were developed for specific purposes but can be used more generally. One example is Epi Info, developed for public health. Several of the packages, namely PSPP, R and Osiris, do not appear to state why they were developed, other than for general statistical analysis.

These free software packages have been used in a number of scholarly publications. For example, OpenStat was used in a research letter to JAMA[8] and in several published studies.[9][10][11] Irristat was used in an agricultural report,[12] EasyReg is listed or used in several papers,[13][14][15] Epi Info was used in several papers,[16][17][18] R was used in a number of papers,[19][20][21] and WinIdams was used in other papers.[22][23]

While MicrOsiris does not appear to have been used in academic research, the author of the program was one of the original authors of OSIRIS,[24] which was the starting program from which WinIdams was developed.[25] The author of MicrOsiris has also contributed or co-contributed several components to WinIdams.[25]

Reviews of free statistical software

There are a few reviews of free statistical software. Two appeared in journals but were not peer reviewed: one by Zhu and Kuljaca[26] and another by Grant that consisted mainly of a brief review of R.[27] Zhu and Kuljaca outlined some useful characteristics of software, such as ease of use, a range of statistical procedures, and the ability to develop new procedures. They reviewed several programs and identified which ones, at that time, had the most functionality. At that time, several of the programs may not have had all of the desired capabilities for advanced statistics. Grant reviewed some of the programming features of R, and briefly mentioned the availability of other programs. One other paper reviewed statistical packages, mainly commercial, but included R.[28] One article reviewed EasyReg and included a discussion of its accuracy.[29]

Only two reviews have compared the output of various packages.[30][31] In the 2006 review, all of the packages read either CSV files or Microsoft Excel format. All of the packages gave exactly the same results for correlation and regression. The free software packages also gave the same regression results as Excel. One of the main differences among the packages was how they handled missing data. With the example data sets used in the review, and for the package versions available in November 2006 when the review was conducted, two packages, MicrOsiris and Epi Info, could read files with blanks for missing values. Two other programs, Stat4U and WinIdams, needed a placeholder for missing values, such as -9 or -9.99. The other packages could only handle data sets with no missing values. The more recent review, from 2022, compared output from a number of free-to-use statistical packages and found that they all gave largely the same results.

In contrast, there are various reviews of commercial statistical software, such as a comparison between several major packages[32] and a brief review of several packages.[33]

Using free statistical software

Before using any statistical package, it is generally a good idea to have a solid background in statistics. The package can then be used to the best advantage: for example, to choose the most appropriate test and to make sure all the necessary assumptions are met, so that the appropriate conclusions can be drawn.

Once the statistical issues are understood, the next step is to decide which package to use. Most of these packages are menu-driven and can be learned in a couple of hours at most. The exceptions are R, which is generally code-driven and requires a much longer time to learn, and to some extent CDC's Epi Info, which also takes some time to learn.

Several of the packages also have tutorials, which give a basic introduction to the programs. For example, CDC has tutorials about Epi Info.[34][35] The CDC page also lists a video slide show tutorial from the University of Nebraska,[36] and another site has online training classes.[37] R has a large number of tutorials and manuals, in English and other languages,[38][39][40] and an FAQ site.[41] PSPP has a particularly easy-to-follow tutorial and a rich set of statistical analyses, including t-tests, one-way and factorial ANOVA, linear and logistic regression, and principal components analysis. It also makes it easy to import data from many different file formats. A few of the packages have email discussion lists, including R[42] and PSPP.[43]

Most of the packages have online manuals, guides or help pages. These are useful when there are questions about specific procedures or statistical tests. Manuals or guides are available for, among others, R,[44] PSPP,[45] and Zelig.[46] The CDC Epi Info site itself does not have a manual, but one faculty member from Emory's School of Public Health has an introductory manual.[47]

Finally, there are a number of commercial packages such as SAS,[48] SPSS[49] and many others.[50] Most of the major commercial and free packages have many statistical procedures in common. The main reason to use free packages is probably the cost.

Many of the packages have an opening menu that is used to get or enter the data, manipulate the data, and select the statistical analysis. After the program is started, data are generally loaded either from a previously saved data set or by importing from some other format. For example, if the data are in CSV form (text with commas between values), the program recognizes the format and creates a data set from the CSV file. The analysis menu is then used to select the variables of interest, along with other options. Finally, the analysis is run and the results are obtained.
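The same import, select and analyze sequence can be sketched in code. The following is a minimal illustration in Python using only the standard library; it is not taken from any of the packages discussed, and the data, column names and function names (load_csv, describe) are hypothetical.

```python
import csv
import io
import statistics

# Hypothetical CSV content; a real session would read a file from disk instead.
CSV_TEXT = """name,age,income
Ann,34,52000
Bob,45,61000
Cy,29,48000
Dee,52,75000
"""

def load_csv(text):
    """Step 1: import the CSV text as a data set (a list of rows keyed by column name)."""
    return list(csv.DictReader(io.StringIO(text)))

def describe(rows, variable):
    """Steps 2 and 3: select one numeric variable and run a simple analysis on it."""
    values = [float(row[variable]) for row in rows]
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
    }

rows = load_csv(CSV_TEXT)
summary = describe(rows, "age")
print(summary)
```

A menu-driven package performs the equivalent of these three steps behind its import and analysis dialogs.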

Command driven packages

R can be used in a menu-driven way, as a programming language, or interactively through its interpreter.

Getting data

Most packages are able to import data from Excel or CSV (text with commas separating values).

One consideration is whether there are missing data. Some packages, like PSPP and MicrOsiris, can deal with missing data automatically. For example, suppose a data set looks like this:

Name   Age  Sex  Born in US  Degree
Joe    31   M    Yes         BA
Sam         M    No          MS
Sally  28   F                Ph.D.

In this data set, Sam is missing his age, and Sally is missing whether she was born in the USA. When some packages, like PSPP or MicrOsiris, read in or import the original data set, the packages will recognize that those values are missing, and do their calculations accordingly. MicrOsiris automatically assigns 1.5 or 1.6 billion to blanks as missing, and these values are excluded from analysis.[51]
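As a rough illustration of what reading blanks as missing involves, the sketch below parses the example data set as CSV in Python and excludes the blank fields from a calculation. It does not reproduce how PSPP or MicrOsiris work internally; the function names are hypothetical, and None is used here as the missing marker in place of any package-specific code.

```python
import csv
import io

# The example data set from the text, written as CSV; blank fields are missing.
CSV_TEXT = """Name,Age,Sex,Born in US,Degree
Joe,31,M,Yes,BA
Sam,,M,No,MS
Sally,28,F,,Ph.D.
"""

def read_with_blanks_as_missing(text):
    """Read CSV text, turning every empty field into None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({key: (value if value.strip() != "" else None)
                     for key, value in row.items()})
    return rows

def mean_excluding_missing(rows, variable):
    """Average a numeric column over only the rows where it is present."""
    values = [float(row[variable]) for row in rows if row[variable] is not None]
    return sum(values) / len(values)

rows = read_with_blanks_as_missing(CSV_TEXT)
mean_age = mean_excluding_missing(rows, "Age")
print(mean_age)  # 29.5, the mean of 31 and 28; Sam's blank age is excluded
```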

Other packages need a 'placeholder', such as '-9' where there are missing data.[52] Before the package is used to read the data, the data set has to be edited to put in a placeholder where there are missing data. So for example:

Name   Age  Sex  Born in US  Degree
Joe    31   M    Yes         BA
Sam    -9   M    No          MS
Sally  28   F    -9          Ph.D.

If the data set uses '-9' as a placeholder in this way, the program has to be told, when the data are read in, that -9 denotes missing data.
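The effect of declaring, or failing to declare, the placeholder can be sketched as follows. This is an illustrative Python example and not the mechanism of any particular package; the names MISSING_CODES, recode_missing and mean_of_present are hypothetical.

```python
# Placeholder value(s) that the user declares as meaning "missing".
MISSING_CODES = {-9}

# Age column from the example above, with -9 standing in for Sam's missing age.
ages_raw = [31, -9, 28]

def recode_missing(values, missing_codes):
    """Replace every declared placeholder value with None."""
    return [None if value in missing_codes else value for value in values]

def mean_of_present(values):
    """Average only the values that are not missing."""
    present = [value for value in values if value is not None]
    return sum(present) / len(present)

# Without the declaration, -9 is treated as a real age and distorts the result.
naive_mean = sum(ages_raw) / len(ages_raw)

# With the declaration, the placeholder is excluded from the calculation.
ages = recode_missing(ages_raw, MISSING_CODES)
correct_mean = mean_of_present(ages)

print(naive_mean, correct_mean)
```

For this reason a placeholder is normally chosen so that it cannot occur as a genuine observation of the variable.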

Limitations of packages

Most of the packages have limitations of some sort.

Several of the programs, including EasyReg, Epidata and Instat, do not appear to handle missing data or do not handle it well.[30] While Epi Info has many statistical procedures, correlation is not one of them. Rather, correlation is found by regression.[53] This means that Epi Info will not produce a single table showing correlations among multiple variables. According to the Zelig installation manual, use of Zelig requires that R and several of its libraries already be installed, and the installation also requires some degree of background in R.[46] One limitation of MicrOsiris is in handling the output. When calculations are complete, the output pages through the results, but various menu boxes also appear over the results, so the results cannot be accessed. The output can, however, be saved as a text file and then used.
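Obtaining a correlation from a regression, as mentioned above, relies on a standard identity: in a simple linear regression with one predictor, the Pearson correlation equals the square root of R-squared, taken with the sign of the slope. The Python sketch below illustrates that identity on made-up data; it does not reproduce Epi Info's procedure, and the function names are hypothetical.

```python
import math

def simple_regression(x, y):
    """Ordinary least squares fit of y on a single predictor x.

    Returns (slope, intercept, r_squared). Assumes x and y each vary,
    otherwise a division by zero occurs.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

def correlation_from_regression(slope, r_squared):
    """Pearson r recovered as sqrt(R-squared) with the sign of the slope."""
    return math.copysign(math.sqrt(r_squared), slope)

# Hypothetical data, used only to illustrate the identity.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept, r_squared = simple_regression(x, y)
r = correlation_from_regression(slope, r_squared)
print(round(slope, 4), round(intercept, 4), round(r, 4))
```

Because each regression yields only one pairwise correlation, a full correlation table for several variables requires repeating the regression for every pair.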

One limitation is specific to programs that were developed by individuals. Support for these programs is limited to the time that the author has available. While the authors may, and often do, respond fairly quickly when there are few people asking questions, if too many people ask questions or the author is otherwise busy, support would correspondingly be slower.

R is both written by and used by a large number of people all over the world, and many forums and other internet facilities can be used to get support from other users. While R is powerful, the learning curve can be rather steep for those not already familiar with other kinds of scientific programming.[54]

References

  1. The VSNi Team. "Evolution of statistical computing". VSNi. Retrieved 12 June 2022.
  2. "SAS on demand for academics". SAS. Retrieved 12 June 2022.
  3. "The Epi Info™ Story". Centers for Disease Control and Prevention. 16 September 2021. Retrieved 12 June 2022.
  4. "Epi Info". CDC. Retrieved 13 June 2022.
  5. "JASP".
  6. "The R Project".
  7. Bill Miller (2009). "OpenStat".
  8. Ebell, Mark (10 September 2008). "Future Salary and US Residency Fill Rate Revisited". JAMA. 300 (10): 1131–1132. doi:10.1001/jama.300.10.1131. PMID 18780840.
  9. Toscano, Christopher D; Prabhu, Vinaykumar V; Langenbach, Robert; Becker, Kevin G; Bosetti, Francesca (2007). "Differential gene expression patterns in cyclooxygenase-1 and cyclooxygenase-2 deficient mouse brain". Genome Biol. 8 (1): R14. doi:10.1186/gb-2007-8-1-r14. PMC 1839133. PMID 17266762.
  10. Bielaszewska, M; Sinha, B; Kuczius, T; Karch, H (2005). "Cytolethal Distending Toxin from Shiga Toxin-Producing Escherichia coli O157 Causes Irreversible G2/M Arrest, Inhibition of Proliferation, and Death of Human Endothelial Cells". Infection and Immunity. 73 (1): 552–562. doi:10.1128/iai.73.1.552-562.2005. PMC 538959. PMID 15618195.
  11. Toscano, C.D.; Kingsley, P.J.; Marnett, L.J.; Bosetti, F. (2008). "NMDA-induced Seizure Intensity is Enhanced in COX-2 Deficient Mice". Neurotoxicology. 29 (6): 1114–1120. doi:10.1016/j.neuro.2008.08.008. PMC 2587528. PMID 18834901.
  12. FAO Plant Production and Protection Paper No. 174, Rome, 2003, Genotype x environment interactions. Challenges and opportunities for plant breeding and cultivar recommendations, http://www.fao.org/DOCREP/005/Y4391E/y4391e00.htm
  13. Gambardella, A; Hall, Bronwyn H. (2006). "Proprietary versus public domain licensing of software and research products". Research Policy. 35 (6): 875–892. doi:10.1016/j.respol.2006.04.004. S2CID 14299896. Archived from the original on 2007-06-09.
  14. Liu, Wen-Chi; Chang, Tsangyao (2008). "Rational Bubbles in the Korea Stock Market? Further Evidence based on Nonlinear and Nonparametric Cointegration Tests" (PDF). Economics Bulletin. 3 (34): 1–12.
  15. Harumi Ito and Darin Lee, Journal of Economics and Business, Volume 57, Issue 1, January–February 2005, Pages 75-95. Assessing the impact of the September 11 terrorist attacks on U.S. airline demand. doi:10.1016/j.jeconbus.2004.06.003. Also available here http://www.brown.edu/Departments/Economics/Papers/Papers/2003/2003-16_paper.pdf
  16. Rahav G, Gabbay R, Ornoy A, Shechtman S, Arnon J, Diav-Citrini O. Primary versus nonprimary cytomegalovirus infection during pregnancy, Israel. Emerg Infect Dis [serial on the Internet]. 2007 Nov [May 15, 2009]. Available from https://www.cdc.gov/EID/content/13/11/1791.htm
  17. Chan P-C, Huang L-M, Wu Y-C, Yang H-L, Chang I-S, Lu C-Y, et al. Tuberculosis in children and adolescents, Taiwan, 1996–2003. Emerg Infect Dis [serial on the Internet]. 2007 Sep. Available from https://www.cdc.gov/EID/content/13/9/1361.htm
  18. Gyasi, ME; Amoaku, WMK; Adjuik, MA (December 2007). "Epidemiology of Hospitalized Ocular Injuries in the Upper East Region of Ghana". Ghana Med J. 41 (4): 171–175. PMC 2350113. PMID 18464900.
  19. Handcock, Mark S.; Hunter, David R.; Butts, Carter T.; Goodreau, Steven M.; Morris, Martina (2008). "statnet: Software Tools for the Representation, Visualization, Analysis and Simulation of Network Data". J Stat Softw. 24 (1): 1548–7660. doi:10.18637/jss.v024.i01. PMC 2447931. PMID 18618019.
  20. Hume, Michael E.; Scanlan, Charles M.; Harvey, Roger B.; Andrews, Kathleen; Snodgrass, James D.; Nalian, Armen G.; Martynova-Van Kley, Alexandra; Nisbet, David J. (2008). "Denaturing Gradient Gel Electrophoresis as a Tool To Determine Batch Similarity of Probiotic Cultures of Porcine Cecal Bacteria". Applied and Environmental Microbiology. 74 (16): 5241–5243. Bibcode:2008ApEnM..74.5241H. doi:10.1128/aem.02580-07. PMC 2519268. PMID 18586972.
  21. Bylesjö, Max; Nicholson, Jeremy K; Holmes, Elaine; Trygg, Johan (2008). "K-OPLS package: Kernel-based orthogonal projections to latent structures for prediction and interpretation in feature space". BMC Bioinformatics. 9: 106. doi:10.1186/1471-2105-9-106. PMC 2323673. PMID 18284666.
  22. Sapre, N. S.; Pancholi, N.; Gupta, S. (2008). "Computational Modeling of Substitution Effect on HIV–1 Non–Nucleoside Reverse Transcriptase Inhibitors with Kier–Hall Electrotopological State (E–state) Indices, Internet Electron". Internet Electronic Journal of Molecular Design. 7: 55–67.
  23. Chawla, Anju (2007). "Exploring project selection behavior of academic scientists in India". Research Evaluation. 16 (1): 35–45. doi:10.3152/095820207x196768.
  24. Data Sharing for Demographic Research Knowledge Base, question on OSIRIS, University of Michigan, http://dsdr-kb.psc.isr.umich.edu/answer.html?i=1076 Archived 2011-07-20 at the Wayback Machine
  25. IDAMS, Internationally Developed Data Analysis and Management Software Package. WinIDAMS Reference Manual (release 1.3) UNESCO, 2008. Preface. http://portal.unesco.org/ci/en/ev.php-URL_ID=25081&URL_DO=DO_TOPIC&URL_SECTION=-465.html
  26. "A Short Preview of Free Statistical Software Packages for Teaching Statistics to Industrial Technology Majors" Journal of Industrial Technology (Volume 21-2, April 2005), Ms. Xiaoping Zhu and Dr. Ognjen Kuljaca. http://www.nait.org/jit/current.html
  27. Felix Grant, "Free Statistics Software, Yours, Free to keep....", Scientific Computing World, Sept/Oct 2004, http://www.scientific-computing.com/scwsepoct04free_statistics.html
  28. Edward J. Wegman and Jeffrey L. Solka. 2005. Statistical Software for Today and Tomorrow. http://www.galaxy.gmu.edu/ (listed as "A Guide to Statistical Software").
  29. Choi, Hwan-sik; Kiefer, Nicholas M. (2005). "Software evaluation: EasyReg International". International Journal of Forecasting. 21 (3): 609–616. doi:10.1016/j.ijforecast.2005.02.003.
  30. Shackman, Gene. 2006. "Comparing free statistical software for data sets with no missing values" and "Comparing free statistical software, Handling missing data". Both available here "Free Software" http://gsociology.icaap.org/methods/soft.html
  31. Shackman, Gene (10 May 2022). "Free To Use Statistical Software: Comparing Statistical Analyses". SSRN. SSRN 4105959. Retrieved 12 June 2022.
  32. Acock, Alan C (2005). "SAS, Stata, SPSS: A Comparison". Journal of Marriage and Family. 67 (4): 1093–1095. doi:10.1111/j.1741-3737.2005.00196.x. Summarized in Hom, Willard. 2006. Choosing Between SAS, Stata, and SPSS. http://www.cccco.edu/SystemOffice/Divisions/TechResearchInfo/ResearchandPlanning/AbstractsofResearch/ResearchMethods/tabid/302/Default.aspx Archived 2009-05-26 at the Wayback Machine
  33. Wass, John. No date. Comparative Statistical Software Review. Tabulations and musings from your editor's biased perspective. Scientific Computing. http://www.scientificcomputing.com/comparative-statistical-software.aspx
  34. Epi Info™ Community Health Assessment Tutorial. The Epi Info™ Community Health Assessment Tutorial was produced by the collaborative efforts of the Centers for Disease Control and Prevention (CDC), the Assessment Initiative (AI), and the New York State Department of Health (NYSDOH). https://www.cdc.gov/epiinfo/communityhealth.htm
  35. Cholera Outbreak in Rwenshama: Using Epi Info for Windows in an Outbreak Investigation. Coordinating Office for Global Health - DGPHCD, https://www.cdc.gov/cogh/dgphcd/training/softwaretraining.htm
  36. Introduction to EPI2000. GPVEC Great Plains Veterinary Educational Center. University of Nebraska - Lincoln. http://gpvec.unl.edu/videos/epi-stats.asp
  37. The North Carolina Center for Public Health Preparedness Training Website http://nccphp.sph.unc.edu/training/index.html Archived 2010-06-16 at the Wayback Machine
  38. Contributed Documentation. https://cran.r-project.org/other-docs.html.
  39. William Revelle, Using R for psychological research: A simple guide to an elegant package, 2008, http://personality-project.org/r/
  40. Dong-Yun Kim, MAT 356 R Tutorial, Spring 2004. http://www.math.ilstu.edu/dhkim/Rstuff/Rtutor.html
  41. R FAQ. Frequently Asked Questions on R. Version 2.8.2009-03-18. ISBN 3-900051-08-9 http://lib.stat.cmu.edu/R/CRAN/doc/FAQ/R-FAQ.html
  42. R-help -- Main R Mailing List: Primary help. https://stat.ethz.ch/mailman/listinfo/r-help
  43. Pspp-users -- PSPP user discussion, http://lists.gnu.org/mailman/listinfo/pspp-users
  44. R Development Core Team. An Introduction to R. Version 2.8.1 (2008-12-22). ISBN 3-900051-12-7. https://cran.r-project.org/doc/manuals/R-intro.html
  45. Documentation, No Date Given. PSPP. https://www.gnu.org/software/pspp/documentation.html
  46. Imai, Kosuke, Gary King and Olivia Lau (2006). "Zelig: Everyone's Statistical Software".
  47. Kevin M. Sullivan. Mar 3 2008. Introduction to Epi Info (Version 3.4.1) Analyze Data Module. http://www.sph.emory.edu/~cdckms/ Archived 2011-07-19 at the Wayback Machine
  48. "Analytics, Business Intelligence and Data Management".
  49. "IBM SPSS Statistics - Overview".
  50. Statistics.com list of commercial software http://www.statistics.com/resources/software/commercial/fulllist.php3 Archived 2011-03-04 at the Wayback Machine
  51. Van Eck, Richard, Microsiris, Statistical and Data Management Software System. Version 9.1, 2006. Van Eck Computer Consulting. http://www.microsiris.com/MicrOsiris.htm
  52. Unesco, How to work with WinIDAMS. Section on Missing data values. http://www.unesco.org/webworld/idams/selfteaching/eng/emissing-data.htm
  53. CDC. Epi Info Training Session. Using Epi Info in an Outbreak Investigation. Advanced Analysis and Mapping. https://www.cdc.gov/cogh/dgphcd/training/softwaretraining.htm
  54. Gillian Raab, Susan Purdon, Kathy Buckner and Iona Waterston. The R Package. Napier University (Edinburgh) and the National Centre for Social Research (London). http://www2.napier.ac.uk/depts/fhls/peas/rpackage.asp

This article incorporates material from the Citizendium article "Free statistical software", which is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License but not under the GFDL.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.