I’ve been learning data science through the IBM Data Science Specialization on Coursera, and this is my second time going through it. Although it is a foundation course for beginners, it offers many curated lecture videos and practical tools with real-world applications. The specialization consists of 11 courses and includes quizzes, hands-on exercises and a final capstone project. You will also become familiar with IBM Watson Studio while working through the exercises. In this blog, I will share some open-source tools for data science that I learned about in Course 2: Digital Tools for Data Science.
There is a wide range of tools that you can use for your data science projects. You can select them based on the role you are working in and the task you are going to perform. Each tool has its pros and cons depending on the application and purpose. To cover a complete data science life cycle, you will need to know the whole process, from managing data to building and deploying models.
Many of the databases you will come across are relational databases, which store data in tabular form with defined rows and columns. With database management tools, databases can also be managed efficiently across different platforms, whether they are integrated with your web servers or run on your local machine.
Data Integration and Transformation
Extract, Transform and Load (ETL) or Extract, Load and Transform (ELT): this process is also known as data refinery and cleansing. Converting data formats and integrating with other databases are both performed in this data integration and transformation stage.
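To make the ETL idea more concrete, here is a minimal sketch in Python using only the standard library. It is not taken from the course: the sample CSV text, the `members` table and the `extract`, `transform` and `load` function names are all assumptions I made up for illustration. It simply shows the three steps in order: read raw rows, clean them, and write them into a relational table.

```python
import csv
import io
import sqlite3

# Hypothetical raw data standing in for an exported CSV file.
RAW_CSV = """name,joined,score
 Alice ,2021-01-05,81.5
Bob,2021-02-17,
carol,2021-03-02,92
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names, convert scores to floats, drop rows with no score."""
    cleaned = []
    for row in rows:
        score = row["score"].strip()
        if not score:  # skip incomplete rows
            continue
        cleaned.append((row["name"].strip().title(), row["joined"], float(score)))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS members (name TEXT, joined TEXT, score REAL)"
    )
    conn.executemany("INSERT INTO members VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
load(transform(extract(RAW_CSV)), conn)
result = conn.execute("SELECT name, score FROM members ORDER BY name").fetchall()
print(result)
```

In an ELT setup the order of the last two steps would be swapped: the raw rows would be loaded first and cleaned afterwards inside the database.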
If you don’t want to create visualizations through programming, you can use applications that provide colorful visualization features and additional plugins for your exploratory data analysis step. For data visualization, you can use either applications or programming languages. Since data visualization is useful in both the pre-processing and the post-processing stages, choose the visualization tool whose look and features suit your data.
After you have managed your data and explored its key features, you will build a model and consider how to deploy it so that it can be integrated into a web app, a mobile application or embedded electronic devices. For this stage, you can use open-source platforms to deploy your model.
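The deployment platforms themselves are beyond a short example, but the basic idea behind deploying a model can be sketched in plain Python: train once, save the model to a file, then load it somewhere else and answer requests with it. Everything below is an assumption for illustration only, including the tiny least-squares model, the `model.json` file name and the `handle_request` function, which stands in for whatever a web or mobile backend would call.

```python
import json
import os
import tempfile


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def save_model(model, path):
    """Persist the model parameters as JSON."""
    with open(path, "w") as f:
        json.dump(model, f)


def load_model(path):
    """Read the model parameters back from JSON."""
    with open(path) as f:
        return json.load(f)


def handle_request(model, request_body):
    """Hypothetical request handler: JSON text in, JSON text out."""
    x = json.loads(request_body)["x"]
    prediction = model["slope"] * x + model["intercept"]
    return json.dumps({"prediction": prediction})


# Training side: fit on toy data that follows y = 2x + 1 and save the result.
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(model, model_path)

# Serving side: load the saved model and answer a request.
served_model = load_model(model_path)
response = handle_request(served_model, '{"x": 10}')
print(response)
```

A real deployment would put `handle_request` behind a web framework or a model-serving platform, but the save, load and predict steps stay the same.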
Model Monitoring and Assessment
To track and monitor your model, you can use open-source APIs. Some time after you have developed your model, you will need to modify it to maintain or improve its accuracy.
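As a rough sketch of what monitoring means in practice, the snippet below keeps a rolling window of recent predictions and flags the model when its accuracy drops below a threshold. The `AccuracyMonitor` class, the window size and the 0.8 threshold are all hypothetical choices of mine, not part of any particular monitoring tool.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks accuracy over the most recent predictions (a rolling window)."""

    def __init__(self, window_size=5, threshold=0.8):
        self.outcomes = deque(maxlen=window_size)  # old outcomes fall off automatically
        self.threshold = threshold

    def record(self, predicted, actual):
        """Store whether a single prediction was correct."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Share of correct predictions in the window, or None if it is empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        """True when the rolling accuracy has dropped below the threshold."""
        acc = self.accuracy()
        return acc is not None and acc < self.threshold


monitor = AccuracyMonitor(window_size=5, threshold=0.8)

# Hypothetical (predicted, actual) pairs arriving over time.
for predicted, actual in [(1, 1), (0, 0), (1, 1), (1, 0), (0, 1)]:
    monitor.record(predicted, actual)

print(monitor.accuracy(), monitor.needs_attention())
```

When `needs_attention()` returns `True`, that is the signal to go back and retrain or adjust the model.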
Code Asset Management
Code asset management, also referred to as version control, tracks changes to a set of files so that you can modify or replace them safely. Git is the standard tool for handling your project with speed and efficiency. You can also use GitHub for hosting your project and for version control.
Data Asset Management
Data assets are files such as videos, images, documents, music and other media, like those created and shared across social media platforms such as YouTube and Facebook. Data asset management can be done locally or from server to server. It is a crucial part of managing a massive number of files and content so that the data can be accessed from a central location.
Development environments are where you build models, either from scratch or with third-party modules. Some have interactive consoles that let you see changes in real time.
Fully Integrated Visual Tools
Fully integrated visual tools provide all-in-one features for the whole data science cycle, from managing data to exporting your model. Therefore, you don’t need to worry about barriers between programming languages and other digital tools.
Now, I’m going to share popular programming languages for Data Science.
Python is one of the most popular programming languages and has ranked among the top ten in-demand programming languages over the past decade. It is an open-source, general-purpose language. Thanks to its high-level syntax and thousands of third-party modules, you can learn Python within weeks and carry out your own data science project. You don’t need to know many advanced programming architectures and data structures to do data science. The more you know, though, the more you will enjoy Python.
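To give a feel for how readable Python’s high-level syntax is, here is a small sketch that groups some measurements and averages them using only the standard library. The sample records are hypothetical values I made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (species, petal length in cm)
measurements = [
    ("setosa", 1.4),
    ("setosa", 1.6),
    ("virginica", 6.0),
    ("virginica", 5.0),
]

# Group the lengths by species.
by_species = defaultdict(list)
for species, length in measurements:
    by_species[species].append(length)

# Average each group with a dictionary comprehension.
averages = {species: mean(values) for species, values in by_species.items()}
print(averages)
```

Third-party libraries make this kind of task even shorter, but the point here is that the code reads almost like the description of the task.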
Popular libraries in Python for Data Science
R, often used through the RStudio IDE, is also popular among statisticians, mathematicians and data miners. R is free software that you can use for both private and commercial purposes. You don’t need a programming background to start with R. It is still widely used for data analysis and for developing statistical software.
SQL (Structured Query Language), which can be pronounced “sequel”, is an older language that was initially developed by IBM in 1974. It is designed for relational databases. If you aren’t familiar with relational databases, imagine Microsoft Excel or Google Sheets. In SQL, however, you have to use SQL syntax to manage your data, such as creating, reading, updating and deleting (CRUD). SQL can also be deployed with web servers and many cloud computing platforms. Many SQL databases are available, including Microsoft SQL Server, MySQL, IBM Db2, Apache Spark SQL, Oracle and more.
In addition to these three languages, you can use other programming languages for your data science projects. If you are already competent in Java, JavaScript (JS), Julia or C/C++, you can also do data science with those languages. You don’t need to learn every programming language on earth for your data science journey.
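The four CRUD operations mentioned above can be tried out without installing a database server, because Python ships with the `sqlite3` module. The sketch below is only an illustration: the `students` table and its rows are hypothetical, and other databases such as MySQL or Db2 use slightly different SQL dialects and need their own drivers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database
cur = conn.cursor()

# Create: define a table and insert some rows.
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, grade INTEGER)")
cur.executemany(
    "INSERT INTO students (name, grade) VALUES (?, ?)",
    [("Aye", 85), ("Ben", 72), ("Chit", 90)],
)

# Read: select the students with a grade of 80 or above.
top = cur.execute(
    "SELECT name FROM students WHERE grade >= 80 ORDER BY name"
).fetchall()

# Update: change one student's grade.
cur.execute("UPDATE students SET grade = ? WHERE name = ?", (78, "Ben"))

# Delete: remove one student.
cur.execute("DELETE FROM students WHERE name = ?", ("Chit",))
conn.commit()

remaining = cur.execute("SELECT name, grade FROM students ORDER BY id").fetchall()
print(top)
print(remaining)
```

The `?` placeholders let the database driver insert the values safely instead of building the SQL text by hand.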
- index | TIOBE - The Software Quality Company. (2021). Retrieved 14 May 2021, from https://www.tiobe.com/tiobe-index/
- Schwabish, J., 2021. Better data visualizations. New York Chichester, West Sussex: Columbia University Press, p.Appendix 1.
- IBM Data Science Specialization Course : Tools For Data Science