In this article, we cover the basics of machine learning with Python. It is impossible to cover every machine learning concept in a short article, so I focus on the basic concepts, keeping in mind readers who are beginners in this area.
What is Machine Learning
Human beings are the most intelligent species on Earth right now. We have evolved as a species because we have the ability to learn from our past experiences and act accordingly.
Machine Learning explores whether machines can learn from past experiences as humans do. In simple words, the term Machine Learning refers to the process by which machines learn from past experience for a given problem.
For example, a human can forecast the sales of the next quarter by analyzing historical data. Similarly, in Machine Learning, we try to build the ability to predict the sales of the next quarter by feeding historical data to the machine.
The Need for Machine Learning
The demand for Machine Learning with Python is increasing every day. Many businesses are exploring the option of Machine Learning for solving their business problems.
The need for Machine Learning comes from the fact that we want to make decisions that are data-driven. Machine Learning can process huge amounts of data that a human might not be able to process. A human might also make decisions based on their own perception or bias. Machine Learning can also help when there is a need to scale the business without much additional cost.
For example, a human might be able to predict sales for 10 products. However, when the number of products increases to 10,000, it becomes difficult for a single person to make predictions for so many products.
Hence, we need Machine Learning to help us make decisions that are data-driven and scalable.
How Machine Learning Differs from Traditional Programming
In traditional programming, we write a program where we specify a set of instructions which computers execute. The instructions given to computers are typically mathematical in nature.
For example, we provide a formula to calculate simple interest based on the principal amount, interest rate and duration. Then we provide the 3 input parameters and ask the computer to calculate the simple interest.
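The traditional approach can be sketched as below. This is only an illustration: the function name, parameter names and the input values are made up for this example, and the rate is assumed to be an annual percentage.

```python
def simple_interest(principal, annual_rate_percent, years):
    """Return simple interest using the fixed formula P * R * T / 100."""
    return principal * annual_rate_percent * years / 100


# Three inputs go in, and the formula written by the programmer gives the answer.
interest = simple_interest(principal=1000, annual_rate_percent=5, years=2)
print(interest)
```

The computer learns nothing here. It only executes the formula that we wrote.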
In Machine Learning, we do not give the formula to calculate anything. Instead, we provide historical data along with the results and ask the computer to find patterns in the data and create a model (a formula/equation). We can then use this model to predict the output for unknown values in the future.
For example, to detect fraud in banking transactions, we can provide historical transaction data to the machine and ask it to discover patterns within the data so that it can identify fraudulent transactions. Here, we are not giving the machine any instructions (i.e., a formula) about how to identify the fraudulent transactions.
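To make the contrast concrete, here is a minimal sketch that uses a simple numeric example rather than fraud detection. The data is made up and follows the rule result = 3 * input + 2, but that rule is never given to the model. The variable names and the choice of scikit-learn's LinearRegression are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Historical inputs and their known results. The rule behind them
# (result = 3 * input + 2) is never given to the model.
inputs = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
results = np.array([5.0, 8.0, 11.0, 14.0, 17.0])

# The model discovers the relationship from the data alone.
model = LinearRegression()
model.fit(inputs, results)

prediction = model.predict(np.array([[10.0]]))[0]
print("learned slope:", model.coef_[0])
print("learned intercept:", model.intercept_)
print("prediction for unseen input 10:", prediction)
```

The learned slope and intercept should come out very close to 3 and 2, even though we only supplied examples.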
Defining Machine Learning
According to Professor Mitchell,
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
In simple terms, Machine Learning is a set of algorithms that:
- improve the performance P
- at solving a problem / executing a task T
- with experience E
Defining the Task, T
A Machine Learning task ‘T’ can be defined as a real-world problem that needs to be solved and that is difficult to solve with a traditional programming approach.
A few examples of Machine Learning tasks are listed below:
- Classification – Where a machine learning algorithm is supposed to assign a class to each data sample. E.g., categorization of animals into cats, dogs etc.
- Regression – Where a machine learning algorithm is supposed to predict a numeric value for a data sample based upon the existing data, instead of a class. E.g., predicting sales based on marketing cost.
- Clustering – Where a machine learning algorithm is supposed to identify inherent patterns, relationships and similarities amongst the data samples and create different groups of these data samples. E.g., grouping different species of plants and animals in biology based upon their characteristics.
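The three task types above can be sketched with scikit-learn as follows. The data is a tiny made-up example, and the specific algorithms (LogisticRegression, LinearRegression, KMeans) are just one possible choice for each task, not the only one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

# One made-up feature with two clearly separated groups of values.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Classification: learn to assign one of two classes (0 or 1).
classes = np.array([0, 0, 0, 1, 1, 1])
classifier = LogisticRegression().fit(X, classes)
predicted_classes = classifier.predict(np.array([[2.5], [10.5]]))

# Regression: learn to predict a continuous value (here, twice the input).
values = np.array([2.0, 4.0, 6.0, 20.0, 22.0, 24.0])
regressor = LinearRegression().fit(X, values)
predicted_value = regressor.predict(np.array([[5.0]]))[0]

# Clustering: no labels are given; the algorithm groups similar samples.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = clusterer.labels_

print("predicted classes:", predicted_classes)
print("predicted value:", predicted_value)
print("cluster ids:", cluster_ids)
```

Note that classification and regression are given the answers (classes, values) during training, while clustering only sees the samples.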
Defining the Experience, E
In Machine Learning, experience can be defined as the process in which machine learning algorithms process input data samples and figure out the inherent patterns in them. This process of an algorithm learning the patterns in the data and gaining experience is called training the model. This experience can then be used to solve the problem at hand.
Defining the Performance, P
To understand whether a model is behaving the way we want, we need to specify how its performance is measured. This is where defining the performance ‘P’ comes into the picture. There are different performance metrics, such as accuracy, precision, recall, F1 score, error rate and many more.
We can also define our own performance metric if required.
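As a small illustration of such metrics, the sketch below computes a few of them for a classification problem using scikit-learn. The true and predicted labels are made up for this example.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up true labels and model predictions for 8 samples (1 = positive class).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)    # share of correct predictions
precision = precision_score(y_true, y_pred)  # correct positives / predicted positives
recall = recall_score(y_true, y_pred)        # correct positives / actual positives
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
error_rate = 1 - accuracy                    # share of wrong predictions

print(accuracy, precision, recall, f1, error_rate)
```

Here 5 of the 8 predictions are correct, 3 of the 5 predicted positives are truly positive, and 3 of the 4 actual positives are found.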
A few terms that are used in the rest of this article:
- Features / Attributes – Features are the individual independent variables in the dataset that act as the input to your system.
- Labels – The label is the value that we are trying to predict according to our problem statement.
- Training Dataset – The part of the input dataset that is used for training the model.
- Test Dataset – The part of the input dataset that is used for testing the trained model.
Check the iris dataset from Kaggle for understanding features and labels.
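If you prefer to explore it in code, scikit-learn bundles its own copy of the iris dataset. The sketch below uses that bundled copy (an assumption made for convenience, not the Kaggle file) to show which part is the features and which part is the labels.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data    # features: one row per flower, one column per measurement
y = iris.target  # labels: the species of each flower, encoded as 0, 1 or 2

print("feature names:", iris.feature_names)
print("label names:", list(iris.target_names))
print("features shape:", X.shape)
print("labels shape:", y.shape)
```

Each of the 150 flowers has 4 measured features and exactly one label.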
Classification of Machine Learning Methods
Based on the level of human supervision involved, Machine Learning methods can be classified as below. Covering them in detail is beyond the scope of this article.
- Supervised learning
- Unsupervised learning
- Semi-supervised learning
- Reinforcement learning
Basic Steps in Machine Learning
Now, let us walk through the basic steps followed in Machine Learning to solve a problem. We will use Python. The reader is expected to know how to install Python on their system and to understand some basic Python instructions.
Problem Statement: To predict the salary of a person based on their experience.
Dataset: I am using a dataset from Kaggle which contains experience and salary.
If you go through the csv file quickly, you will notice that there are 30 records and each record has 2 columns – ‘YearsExperience’ and ‘Salary’. In this example, ‘YearsExperience’ is Feature/Attribute and ‘Salary’ is Label.
Now, let’s see how to proceed to create a model for this problem.
1. Load Data
First, we will need to import a few libraries required for our processing.
Then load the dataset.
Check the first few records.
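These three steps can be sketched as below. The article does not name the CSV file, so this sketch reads a few made-up rows from in-memory text that stands in for the file. With the real download, you would pass the file's path to pd.read_csv instead.

```python
import io

import pandas as pd

# Stand-in for the downloaded CSV file. These rows are made up for illustration
# and only mimic the two columns described above.
csv_text = """YearsExperience,Salary
1.1,39000
2.0,43500
3.2,54000
4.5,61000
5.9,81000
"""

# Load the dataset into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Check the first few records.
print(df.head())
```

The head method shows the first 5 rows by default, which is a quick way to confirm that the columns were read correctly.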
2. Exploratory Data Analysis (EDA)
Let us try to understand more about the data that we have.
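A typical way to do this with pandas is sketched below: shape for the dimensions, info for column types and missing values, and describe for summary statistics. The data here is a synthetic stand-in with the same column names and the same number of records as the dataset described above, so the statistics it prints will differ from the figures quoted in this article.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset: 30 made-up records, 2 columns.
rng = np.random.default_rng(0)
years = np.round(np.linspace(1.0, 10.5, 30), 1)
salary = np.round(25000 + 9500 * years + rng.normal(0, 5000, size=30))
df = pd.DataFrame({"YearsExperience": years, "Salary": salary})

print(df.shape)       # (number of rows, number of columns)
df.info()             # column types and non-null counts
print(df.describe())  # count, mean, std, min, quartiles and max per column
```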
This means that we have 30 rows with 2 columns.
From the above information, we can see that we have 30 non-null values in both columns, i.e., we don’t have any missing values in the dataset.
From the above output, we can say that we have an average experience of 5.3 years and an average salary of 7603. We can also read the minimum and maximum values from the same output.
3. Data Preparation
Let us check if we need to do any processing on data before feeding it to the model, so that we don’t feed bad quality data to the model.
We already know that there are no missing values in the dataset. Now, let us check if we have any duplicate records.
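One way to check this is to drop duplicates and compare the number of records before and after. The sketch below does that on a synthetic stand-in for the dataset (made-up values, same column names and record count).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset: 30 made-up records, 2 columns.
rng = np.random.default_rng(0)
years = np.round(np.linspace(1.0, 10.5, 30), 1)
salary = np.round(25000 + 9500 * years + rng.normal(0, 5000, size=30))
df = pd.DataFrame({"YearsExperience": years, "Salary": salary})

print("before dropping duplicates:", df.shape)

# drop_duplicates returns a new DataFrame without fully identical rows.
deduplicated = df.drop_duplicates()
print("after dropping duplicates:", deduplicated.shape)
```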
As we can see, we have the same number of records after dropping duplicates. This means that we don’t have any duplicate records in the dataset.
As we don’t need to process data for missing / duplicate values, we can move to separate features and labels and assign them to X and y respectively.
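A minimal sketch of this separation, again on a synthetic stand-in for the dataset. X is kept as a two-dimensional DataFrame because scikit-learn models expect features in that form, while y is a one-dimensional Series.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset: 30 made-up records, 2 columns.
rng = np.random.default_rng(0)
years = np.round(np.linspace(1.0, 10.5, 30), 1)
salary = np.round(25000 + 9500 * years + rng.normal(0, 5000, size=30))
df = pd.DataFrame({"YearsExperience": years, "Salary": salary})

X = df[["YearsExperience"]]  # features: double brackets keep it 2-dimensional
y = df["Salary"]             # label: a 1-dimensional Series

print(X.shape)
print(y.shape)
```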
The above output gives the shapes of X and y.
4. Split Train and Test
Why is splitting data into Train and Test necessary? We train a model using the training data and then verify it using the testing data. The model learns from the training records, whose outcomes it is shown, and then tries to predict outcomes for the test records, which it has never seen. This way, we can build better models, and the chances are higher that the model will perform well when it is presented with unknown values in production.
Otherwise, a model that learns from all the input records can memorize them so well that it fails miserably when presented with unknown inputs in production. It is like memorizing the entire textbook for an exam instead of understanding it: when a question comes from outside the textbook, we cannot answer it because we have not really learned anything.
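With scikit-learn, the split can be sketched using train_test_split. The data is a synthetic stand-in for the dataset, and random_state=42 is an arbitrary value chosen here only to make the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 30 made-up records, 2 columns.
rng = np.random.default_rng(0)
years = np.round(np.linspace(1.0, 10.5, 30), 1)
salary = np.round(25000 + 9500 * years + rng.normal(0, 5000, size=30))
df = pd.DataFrame({"YearsExperience": years, "Salary": salary})
X = df[["YearsExperience"]]
y = df["Salary"]

# test_size=0.2 keeps 20% of the records aside for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("training records:", X_train.shape[0])
print("testing records:", X_test.shape[0])
```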
We have split the input dataset into a training dataset and a testing dataset, with a training-to-testing ratio of 80:20.
Now, we have 24 records in the training dataset and 6 records in the testing dataset.
5. Build a Model
In Machine Learning, we have many options to choose from at this step, because many algorithms can be used to build a model. Here, we are choosing the Linear Regression algorithm. Import the LinearRegression class from the linear_model module of the scikit-learn library.
Then call the fit method by supplying features and labels of training data. In this step, the model will learn the patterns in the dataset.
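Put together, the import and the call to fit look roughly like this. The sketch uses a synthetic stand-in for the dataset and an arbitrary random_state, so the learned numbers will not match those from the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 30 made-up records, 2 columns.
rng = np.random.default_rng(0)
years = np.round(np.linspace(1.0, 10.5, 30), 1)
salary = np.round(25000 + 9500 * years + rng.normal(0, 5000, size=30))
df = pd.DataFrame({"YearsExperience": years, "Salary": salary})
X_train, X_test, y_train, y_test = train_test_split(
    df[["YearsExperience"]], df["Salary"], test_size=0.2, random_state=42
)

# Create the model and let it learn from the training data only.
model = LinearRegression()
model.fit(X_train, y_train)

# The learned "formula" is: Salary = coef * YearsExperience + intercept
print("coefficient:", model.coef_[0])
print("intercept:", model.intercept_)
```

After fitting, coef_ and intercept_ hold the line that the model found, which is the learned equivalent of a hand-written formula.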
6. Predict the Output
Now, let’s use the trained model to predict the output for the test data.
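A minimal sketch of the prediction step, using the same synthetic stand-in data and arbitrary random_state as an assumption. The comparison table is an optional extra that places the actual and predicted salaries side by side.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 30 made-up records, 2 columns.
rng = np.random.default_rng(0)
years = np.round(np.linspace(1.0, 10.5, 30), 1)
salary = np.round(25000 + 9500 * years + rng.normal(0, 5000, size=30))
df = pd.DataFrame({"YearsExperience": years, "Salary": salary})
X_train, X_test, y_train, y_test = train_test_split(
    df[["YearsExperience"]], df["Salary"], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Predict salaries for the test records, which the model has not seen.
y_pred = model.predict(X_test)

comparison = pd.DataFrame({"Actual": y_test.to_numpy(), "Predicted": y_pred})
print(comparison)
```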
7. Evaluate Model
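Generally, the following 3 metrics are used for evaluating regression models:
- Mean Squared Error (MSE) – the average of the squared differences between the actual and predicted values.
- Mean Absolute Error (MAE) – the average of the absolute differences between the actual and predicted values.
- Root Mean Squared Error (RMSE) – the square root of the MSE, expressed in the same units as the label.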
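These metrics can be computed with scikit-learn as sketched below. RMSE is taken here as the square root of MSE using NumPy. The data is a synthetic stand-in and the random_state is arbitrary, so the values printed are not those of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 30 made-up records, 2 columns.
rng = np.random.default_rng(0)
years = np.round(np.linspace(1.0, 10.5, 30), 1)
salary = np.round(25000 + 9500 * years + rng.normal(0, 5000, size=30))
df = pd.DataFrame({"YearsExperience": years, "Salary": salary})
X_train, X_test, y_train, y_test = train_test_split(
    df[["YearsExperience"]], df["Salary"], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)   # average squared error
mae = mean_absolute_error(y_test, y_pred)  # average absolute error
rmse = np.sqrt(mse)                        # same units as Salary

print("MSE:", mse)
print("MAE:", mae)
print("RMSE:", rmse)
```

Lower values are better for all three. MAE and RMSE are easier to interpret because they are in the same units as the salary.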
8. Deploy the model and use it
Now, we can use the above model for prediction by supplying the values which are not part of either training or testing data.
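For instance, the sketch below asks the model about two experience values that lie outside the data it was built on. The data is a synthetic stand-in, and the values 12 and 15 years are made up for illustration. The new inputs are passed as a DataFrame with the same column name that was used during training.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 30 made-up records, 2 columns.
rng = np.random.default_rng(0)
years = np.round(np.linspace(1.0, 10.5, 30), 1)
salary = np.round(25000 + 9500 * years + rng.normal(0, 5000, size=30))
df = pd.DataFrame({"YearsExperience": years, "Salary": salary})
X_train, X_test, y_train, y_test = train_test_split(
    df[["YearsExperience"]], df["Salary"], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# New values that are in neither the training data nor the testing data.
new_data = pd.DataFrame({"YearsExperience": [12.0, 15.0]})
predicted_salaries = model.predict(new_data)

for experience, predicted in zip(new_data["YearsExperience"], predicted_salaries):
    print(f"{experience} years -> predicted salary {predicted:.0f}")
```

Keep in mind that predictions far outside the range of the training data rely on the linear trend continuing, which may not hold in practice.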
This brings us to the end of the blog on Machine Learning with Python. We hope that you have gained valuable insight into the concept. You can take up the free online Machine Learning with Python Course by Great Learning Academy and learn more.