I’ve spent the past week completing a crash course on Machine Learning. Here’s what I’ve learned.
Machine Learning (ML) is a subfield of Artificial Intelligence, alongside other popular subfields like perception; Deep Learning, often mentioned in the same breath, is in turn a subfield of ML itself. The goal of Machine Learning is to understand the structure of a set of data and to fit it to models that people can understand and use. ML fits data to models by training on data inputs and using statistical analysis to generate outputs within a specific range.

The data inputs are typically structured as a set of ‘features’. The output for each data record is normally called its classification label. For example, if the output label indicates whether a vehicle is a sedan, SUV, van, or motorcycle, its feature set might include datapoints like how many seats it has, how many doors it has, how big the engine is, and whether it is four-wheel drive.
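To make that concrete, here’s a minimal sketch of what one such record might look like in Python (the feature names and values are made up purely for illustration):

```python
# One training example: a feature set paired with its classification label.
# All field names and values here are hypothetical.
vehicle_features = {
    "num_seats": 7,
    "num_doors": 5,
    "engine_liters": 3.5,
    "four_wheel_drive": True,
}
vehicle_label = "van"  # the output the model should learn to predict
```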
There are a variety of statistical models used in Machine Learning, depending on the type of grouping you want to generate from your data. Although the models vary, the implementation approach is similar across them. A set of input feature-set data is used to train the statistical model, where each feature set in the training data is paired with its correct output classification label. The ML program then updates its statistical model (e.g., the weights for inputs and groupings) until it can generate accurate outputs on the training data.
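That train-then-predict pattern looks roughly the same whatever the model. Here’s a minimal sketch using scikit-learn’s decision tree (assuming scikit-learn is installed; the toy vehicle data is made up):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training data: each row is a feature set of
# [num_seats, num_doors, engine_liters, four_wheel_drive]
X_train = [
    [5, 4, 2.0, 0],
    [7, 5, 3.5, 1],
    [5, 5, 3.0, 1],
    [1, 0, 0.6, 0],
]
y_train = ["sedan", "van", "suv", "motorcycle"]  # correct label for each row

# Training adjusts the model's internal parameters to fit the examples.
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Once trained, the model can classify an unseen feature set.
print(model.predict([[7, 5, 3.6, 1]]))  # e.g. ['van']
```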
The statistical models in Machine Learning are typically called regressions or classifications. These differ based on whether you are trying to find a continuous output value (regression) or a discrete label (classification). Regressions are used for predicting continuous outputs, e.g., predicting sales based on demand, where the total sales number keeps growing as demand increases. Classification uses models called classifiers to find discrete values, e.g., whether a record denotes a sedan or a minivan.
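For the regression case, fitting a best-fit line to that sales-versus-demand example might look like this (the numbers are synthetic, and I’m using NumPy’s least-squares polynomial fit just as one convenient option):

```python
import numpy as np

# Synthetic data: total sales grows roughly linearly with demand.
demand = np.array([10, 20, 30, 40, 50])
sales = np.array([105, 210, 290, 400, 510])

# Fit a degree-1 polynomial (a straight line): sales ~ slope * demand + intercept
slope, intercept = np.polyfit(demand, sales, 1)

# Predict a continuous output for an unseen demand level.
predicted_sales = slope * 35 + intercept
print(round(predicted_sales, 1))
```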

Regression and Classification models fall under a category of ‘supervised’ machine learning models. Supervised models learn a mapping from inputs to outputs based on a series of input/output examples, which are used to train the statistical model. Common regression models include the well-known best-fit line of linear regression and its extensions. Common classifiers include Decision Trees, Random Forests, Neural Networks, and Naive Bayes models. A Random Forest is actually a set of Decision Tree models whose collective outputs are used to vote on the correct answer, as sketched below. Models like this, built from several sub-models whose outputs are combined into one collective output, are called ‘ensemble learning’ techniques.
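The voting idea behind a Random Forest can be sketched from scratch: train several decision trees on random resamples of the data and let them vote. This is a toy illustration of the concept, not how scikit-learn’s own RandomForestClassifier is implemented internally:

```python
import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, y, n_trees=5):
    """Train n_trees decision trees, each on a bootstrap resample of the data."""
    forest = []
    for _ in range(n_trees):
        indices = [random.randrange(len(X)) for _ in range(len(X))]
        X_sample = [X[i] for i in indices]
        y_sample = [y[i] for i in indices]
        forest.append(DecisionTreeClassifier().fit(X_sample, y_sample))
    return forest

def forest_predict(forest, features):
    """Each tree votes on a label; the majority label wins."""
    votes = [tree.predict([features])[0] for tree in forest]
    return Counter(votes).most_common(1)[0][0]

# Reusing the toy vehicle data from earlier:
X = [[5, 4, 2.0, 0], [7, 5, 3.5, 1], [5, 5, 3.0, 1], [1, 0, 0.6, 0]]
y = ["sedan", "van", "suv", "motorcycle"]
forest = train_forest(X, y)
print(forest_predict(forest, [7, 5, 3.6, 1]))
```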
‘Unsupervised’ machine learning models identify patterns in input data without reference to labeled outcomes; these models are not trained on a labeled dataset, so they draw their own inferences. The two major methods are clustering and dimensionality reduction. Clustering groups like datapoints together and identifies the distinct cluster sets. In image processing, for instance, this can separate dog photos from cat photos without ever being told the labels. A common clustering model is k-means, which assigns each datapoint to its nearest cluster center. Dimensionality reduction models reduce your feature set down to its key features, and the two main methods are feature elimination and feature extraction. A popular model of this kind is Principal Component Analysis.
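Both techniques are available off the shelf in scikit-learn. Here’s a minimal sketch of each, on made-up 2-D points (note neither call is given any labels):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic 2-D datapoints forming two loose groups.
points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                   [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# Clustering: group like datapoints together, no labels provided.
kmeans = KMeans(n_clusters=2, n_init=10).fit(points)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1] -- which cluster each point fell into

# Dimensionality reduction: project the 2-D features down to 1 key component.
pca = PCA(n_components=1)
reduced = pca.fit_transform(points)
print(reduced.shape)  # (6, 1)
```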

When actually training and generating your final model, there are a few different approaches used to boost its predictive accuracy. The first method is called ‘boosting’, which essentially builds an ensemble learning model from your chosen classifier. This method trains a sequence of models, each one reweighted to focus on the examples the previous models got wrong. It then takes their collective outputs, weighs each of them, and uses them to vote on a ‘true’ answer for the set of models. Another method is using genetic algorithms. A genetic algorithm is a training technique that generates a random set of models from your chosen classifier. It then sees which of these models are most accurate, and generates ‘child’ models from these top models as the next ‘generation’ set to test. It does this for multiple generations, which should ultimately produce an accurate ‘evolved’ model. Children can also be generated with things like trait swapping and mutations to ensure variety, testing for better solutions than what the parents offered.
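Boosting is available off the shelf (e.g., scikit-learn’s AdaBoostClassifier), but the genetic-algorithm idea is easy to sketch from scratch. In the sketch below, each ‘model’ is just a weight vector for a simple linear classifier; the fitness function, mutation rate, and toy data are all illustrative assumptions, not a standard recipe:

```python
import random

# Toy data: feature vectors and binary labels (hypothetical pattern).
DATA = [([0.2, 0.3], 0), ([0.9, 0.8], 1), ([0.1, 0.7], 0), ([0.8, 0.9], 1)]

def predict(weights, features):
    """A 'model' is just a weight vector; classify by thresholding a dot product."""
    score = sum(w * f for w, f in zip(weights, features)) - 1.0
    return 1 if score > 0 else 0

def fitness(weights):
    """How many training examples this model gets right."""
    return sum(predict(weights, x) == y for x, y in DATA)

def child(parent_a, parent_b, mutation_rate=0.1):
    """Trait swapping: each weight comes from one parent; mutations add variety."""
    weights = [random.choice(pair) for pair in zip(parent_a, parent_b)]
    return [w + random.gauss(0, 0.5) if random.random() < mutation_rate else w
            for w in weights]

# Start from a random population of candidate models.
population = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(20)]

for generation in range(30):
    # Keep the most accurate models as parents for the next generation.
    parents = sorted(population, key=fitness, reverse=True)[:4]
    population = parents + [child(random.choice(parents), random.choice(parents))
                            for _ in range(16)]

best = max(population, key=fitness)
print(best, fitness(best), "/", len(DATA))
```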
I’ll be deep-diving into Neural Networks next, including how to implement them in Python from scratch. Stay tuned for more!
