Machine learning and artificial intelligence are broad terms, something we often stress when we meet and talk with collaborators in the industry. Machine learning comprises many subcategories of algorithms, and each category has its own pros and cons that make it suitable for different tasks.
To visualize these categories and to exemplify which algorithms are suitable for which task, Evispot has created an algorithm fact sheet. Its purpose is to give you an overview of the different subcategories of machine learning. This is a simplified overview, so details have been left out. Algorithms can also be combined; such combinations are called ensemble models.
However, the first question to ask yourself before initiating a project is whether enough data is available. The amount matters because the patterns and insights found in the data need to be representative. Machine learning algorithms will find patterns in the given data whenever there are any. The question is whether these patterns are representative of the population you want to analyze.
Evispot’s algorithm sheet.
This is merely an overview of the most common algorithms used by Evispot, and our view of them.
Anomaly Detection Algorithms
Anomaly detection algorithms are used to identify items, events, and observations that do not follow the expected patterns in a given dataset. As shown in the fact sheet above, they apply when one category is rare in the data. The algorithms are suitable for tasks where the majority of the observations are similar and a few differ from the rest. Typical anomaly detection algorithms are used to identify issues or problems such as bank fraud, errors in a text, or medical problems. Anomalies are also referred to as outliers, novelties, noise, deviations, and exceptions. Examples of anomaly detection algorithms include Support Vector Machines (SVM), PCA-based anomaly detection, and cluster analysis-based outlier detection.
In some settings, however, the interesting objects are not rare objects but unexpected bursts of activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object, and many outlier detection methods (in particular unsupervised methods) will fail on such data unless it has been aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the microclusters formed by these patterns.
Example of usage: Identifying bank fraud, errors in text, and medical problems.
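To make the idea of "a few observations that differ from the rest" concrete, here is a minimal sketch. It does not use any of the algorithms named above; it swaps in a much simpler robust z-score based on the median absolute deviation. The payment amounts, the function name `find_anomalies`, and the threshold of 3.5 are illustrative assumptions, not part of Evispot's fact sheet.

```python
from statistics import median

def find_anomalies(values, threshold=3.5):
    """Return indices of values whose robust z-score exceeds the threshold."""
    center = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # Happens when more than half the values equal the median; the score
        # is undefined, so this sketch flags nothing.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - center) / mad > threshold]

# Hypothetical card payments; one amount is far from the rest.
payments = [42.0, 39.5, 41.2, 40.8, 43.1, 38.9, 950.0, 41.7]
anomalous = find_anomalies(payments)
print(anomalous)  # index of the 950.0 payment
```

A score like this only catches isolated extreme values. It would likely miss the bursts of activity described above, which is one reason real projects tend to need the more capable methods.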
Classification Algorithms
Classification algorithms are used to identify which of a set of categories a new observation belongs to. This is done on the basis of a training set of data containing observations whose category is known. Classification algorithms are used for pattern recognition. As can be seen in the figure, neural networks and random forest algorithms both belong to the classification category. As the figure shows, one difference between them is their transparency: neural networks most often achieve higher accuracy than random forests but are less transparent, which makes the two suitable for different kinds of tasks.
Example of usage: Categorising loan applications into a given number of risk categories and classifying images based on what is in them.
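As a rough illustration of learning from a training set with known categories, the sketch below classifies a new observation by majority vote among its nearest labelled neighbours (k-nearest neighbours). This is a deliberately simple stand-in, not the neural network or random forest from the fact sheet, and the loan features, values, and labels are invented for the example.

```python
from collections import Counter
from math import dist

def knn_classify(training_set, new_observation, k=3):
    """Label a new observation by majority vote among its k nearest neighbours."""
    nearest = sorted(training_set,
                     key=lambda item: dist(item[0], new_observation))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical loan applications:
# (debt-to-income ratio, late payments last year) -> known risk category.
training_set = [
    ((0.10, 0), "low"), ((0.15, 1), "low"), ((0.20, 0), "low"),
    ((0.55, 4), "high"), ((0.60, 5), "high"), ((0.70, 3), "high"),
]

print(knn_classify(training_set, (0.18, 1)))  # low
print(knn_classify(training_set, (0.65, 4)))  # high
```

The features here are not rescaled, so the late-payment count dominates the distance. In practice the features would usually be put on a comparable scale first.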
Clustering Algorithms
Clustering algorithms are used to group a set of objects in such a way that the objects in the same group (called a cluster) are more similar to each other than to those in other groups. Simply put, they group many objects into clusters of similar objects. The notion of a "cluster" cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms. Unlike classification algorithms, clustering algorithms are not suited to tasks with labelled data and learning by example. Clustering is instead used to find connections in the dataset.
Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties. Examples of clustering algorithms include KMeans, Spectral Clustering, and Agglomerative Clustering.
Example of usage: Identifying payment archetypes based on payment behaviours, identifying groups of genres in large collections of articles.
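The grouping idea can be sketched with a bare-bones version of k-means, one of the algorithms named above. This is a teaching sketch under stated assumptions, not a production implementation: the customer data is invented, the starting centroids are picked by hand, and the loop runs a fixed number of iterations instead of checking for convergence.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and moving centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    # The clusters returned are the assignments from the final iteration.
    return centroids, clusters

# Hypothetical customers: (average days until an invoice is paid, purchases per month).
points = [(2, 1), (3, 2), (2, 3), (28, 9), (30, 8), (31, 10)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
```

On this toy data the prompt payers and the late payers end up in separate clusters. With other starting centroids or another number of clusters the result can differ, which is the trial-and-failure aspect mentioned above.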
Regression Algorithms
When it comes to credit decisions, regression algorithms are well known, since many existing credit scoring models are based on them. Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression algorithms are used to estimate real (continuous) values based on one or more continuous variables. The models establish a relationship between independent and dependent variables by fitting the best line. This best-fit line is known as the regression line and is represented by the linear equation Y = a * X + b, where Y is the dependent variable, X the independent variable, a the slope, and b the intercept.
The best way to understand linear regression is to relive a childhood experience. Say you ask a child in fifth grade to arrange the people in their class in order of increasing weight, without asking them their weights. What do you think the child will do? He or she would likely look at (visually analyze) the height and build of each person and arrange them using a combination of these visible parameters. This is linear regression in real life. The child has figured out that height and build are correlated with weight through a relationship that looks like the equation above.
Example of usage: Forecasting the value of a house.
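Fitting the line Y = a * X + b can be sketched with ordinary least squares for a single independent variable. The house areas and prices below are invented for illustration, and `fit_line` is a hypothetical helper name. A real valuation model would use many more variables and far more data.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope a and intercept b in Y = a * X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b

# Hypothetical houses: living area in square metres, sale price in thousands.
areas = [50, 70, 90, 110, 130]
prices = [155, 205, 260, 300, 355]

a, b = fit_line(areas, prices)
price_100 = a * 100 + b  # forecast for a 100 square metre house
print(round(a, 3), round(b, 2), round(price_100, 2))
```

On this toy data the slope comes out at about 2.475 and the intercept at about 32.25, so the forecast for a 100 square metre house is roughly 279.75 thousand.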
Time Series Algorithms
Time series algorithms use a set of powerful statistical and machine learning tools to predict future events based on past data. Most commonly, a time series is a sequence of observations taken at successive, equally spaced points in time. By indexing events on a timeline, it is possible to forecast new events. This can be used to forecast when an invoice will be paid or to forecast cash flow based on accounting data. Time series algorithms include hidden Markov models (HMM) for discrete data and autoregressive (AR) models for continuous data.
Example of usage: Forecasting cash flow based on historical accounting data.
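As a small illustration of forecasting from past values, the sketch below fits the simplest autoregressive model, AR(1), in which each value depends linearly on the previous one: x[t] = c + phi * x[t-1]. The cash-flow figures are invented and were constructed to follow such a model exactly, so the fit is cleaner than anything real accounting data would give. The function names are hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    phi = (sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
           / sum((p - mean_prev) ** 2 for p in prev))
    c = mean_curr - phi * mean_prev
    return c, phi

def forecast(series, steps, c, phi):
    """Roll the fitted model forward from the last observed value."""
    predictions = []
    last = series[-1]
    for _ in range(steps):
        last = c + phi * last
        predictions.append(last)
    return predictions

# Hypothetical monthly net cash flow in thousands, equally spaced in time.
cash_flow = [100.0, 90.0, 82.0, 75.6, 70.48]

c, phi = fit_ar1(cash_flow)
next_months = forecast(cash_flow, steps=3, c=c, phi=phi)
print([round(x, 2) for x in next_months])
```

Each forecast feeds into the next one, so errors accumulate the further ahead the model is rolled.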
As mentioned earlier, this algorithm sheet is meant to give you a holistic overview of the algorithms we most commonly use and their characteristics.
These can be combined, and each category contains several other algorithms. If you have any questions regarding algorithms or machine learning in general, don’t hesitate to contact us at email@example.com
For more information on machine learning, please visit: