No. 10. An Overview of Python Libraries for Data Science

Manuscript Received: 20 March 2023, Accepted: 12 May 2023, Published: 15 September 2023, ORCiD: 0000-0003-0873-3340


Ankush Joshi
Haripriya Tiwari

Abstract

Python is currently the most popular and in-demand language for data science, owing to the large number of libraries available for data processing, analysis, and visualization. This review paper gives an overview of the available libraries. We group 48 libraries into three categories: Data Collection, Data Analysis & Processing, and Data Visualization. For comparison, we use each library's GitHub community metrics (stars, forks, and commits) as well as its properties and functionality.

