Common terms explained
Neural networks are a machine learning method that is very loosely based on neural connections in the brain. A neural network is a system of connected nodes segmented into layers: input, output, and hidden layers. The hidden layers (there can be many) are the heavy lifters used to make predictions. Values from one layer are weighted and passed along the connections to the next layer, until the final set of outputs is produced and a prediction is made.
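To make the flow of values through the layers concrete, here is a minimal sketch in Python using NumPy; the layer sizes, weights, and activation function are illustrative assumptions rather than anything specified in the text.

```python
import numpy as np

# Minimal sketch of one forward pass: 3 inputs -> 4 hidden nodes -> 2 outputs.
# The sizes and random weights are illustrative assumptions, not a trained model.
rng = np.random.default_rng(0)

x = rng.normal(size=3)                # values arriving at the input layer
W1 = rng.normal(size=(4, 3))          # connections from the input to the hidden layer
W2 = rng.normal(size=(2, 4))          # connections from the hidden to the output layer

hidden = np.tanh(W1 @ x)              # the hidden layer transforms the weighted inputs
output = W2 @ hidden                  # the output layer produces the final set of values

print(int(np.argmax(output)))         # the prediction: the most strongly activated output
```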
With supervised learning techniques, the data scientist gives the computer a well-defined set of data in which every example is already labelled with the answer to be predicted, so the computer knows exactly what it’s looking for. It’s similar to a teacher handing you a syllabus and telling you what to expect on the test.
In unsupervised learning techniques, the computer builds its own understanding of a set of unlabelled data. Unsupervised ML techniques look for patterns within data, and often deal with grouping items based on shared traits.
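The practical difference shows up in what the computer is given. As a hedged sketch (scikit-learn and the toy points are assumptions, not named in the text), a supervised learner is handed the labels while an unsupervised learner is handed the data alone:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Two obvious groups of made-up points.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [5.0, 5.2], [4.8, 5.1], [5.3, 4.9]])
y = np.array([0, 0, 0, 1, 1, 1])      # labels: only the supervised learner sees these

# Supervised: the data come with labels, so the model knows what it is looking for.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([[1.0, 0.9]]))      # -> [0]

# Unsupervised: no labels are given; the algorithm finds the groups by itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                     # group assignments it discovered on its own
```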
Classification is a supervised machine learning problem concerned with building models that separate data into distinct classes. Well-known classification schemes include decision trees and support vector machines. Classification deals with categorising a data point based on its similarity to other data points: you take a set of data where every item already has a category, look at the traits the items in each category share, and then use those shared traits as a guide to which category a new item might belong to.
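A minimal sketch of that idea, assuming scikit-learn and made-up measurements (neither comes from the text): a k-nearest-neighbours classifier places a new item in the category of the already-labelled items it most resembles.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: [height_cm, weight_kg], each item already categorised.
X = [[150, 50], [155, 55], [160, 54],   # category 0
     [180, 85], [185, 90], [190, 95]]   # category 1
y = [0, 0, 0, 1, 1, 1]

# Learn from the items that already have a category...
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# ...then categorise a new item by its similarity to them.
print(clf.predict([[183, 88]]))  # -> [1]
```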
Clustering is used for analysing data which does not include pre-labelled classes, or even a class attribute at all. Clustering techniques attempt to collect and categorise sets of points into groups that are “sufficiently similar”, or “close”, to one another, where “close” varies depending on how you choose to measure distance. Complexity increases as more features are added to the problem space. Because clustering does not require the pre-labelling of instance classes, it is a form of unsupervised learning, meaning that it learns by observation as opposed to learning by example.
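As an illustrative sketch (the library and the sample points are assumptions), agglomerative clustering groups unlabelled points by how close they are, and pairwise distance calculations show that “close” changes with the chosen distance measure:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances

# Unlabelled points: no class attribute at all, just raw features.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.1],
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])

# Group points that are "sufficiently similar" to one another.
clusters = AgglomerativeClustering(n_clusters=2).fit(X)
print(clusters.labels_)   # e.g. [0 0 0 1 1 1], learned by observation alone

# "Close" depends on how distance is measured.
print(pairwise_distances(X[:2], metric="euclidean"))
print(pairwise_distances(X[:2], metric="manhattan"))
```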
Cross-validation is a method for building and evaluating models that makes the most of a dataset. The dataset is divided into k segments, or folds; the model is trained on k-1 of the folds and tested on the remaining fold, and this process is repeated k times so that every fold is used once for testing. The individual prediction error results are then combined and averaged into a single estimate of the model’s performance. Because this estimate does not depend on any one particular train/test split, it gives a more reliable picture of how the model will behave on unseen data, with the goal of producing the most accurate predictive models possible.
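A minimal sketch of k-fold cross-validation with k = 5, assuming scikit-learn and its built-in iris dataset (both assumptions on my part):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, test on the held-out fold, repeat 5 times.
scores = cross_val_score(SVC(), X, y, cv=5)

print(scores)          # one prediction score per held-out fold
print(scores.mean())   # the fold results averaged into a single performance estimate
```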
Decision trees use a line of branching questions or observations about a given data set to predict a target value. Left unconstrained, they tend to over-fit the training data as the tree grows deep and detailed; random forests are an ensemble method that combines many decision trees specifically to reduce this over-fitting. Decision trees are top-down, recursive, divide-and-conquer classifiers, and building them generally involves two main tasks (a short sketch in code follows the two tasks below):
Tree induction is the task of taking a set of pre-classified instances as input, deciding which attributes are best to split on, splitting the dataset, and recursing on the resulting split datasets until all training instances are categorised.
Tree pruning is the process of removing unnecessary structure from a decision tree in order to make it more efficient, easier for humans to read, and more accurate as well. This increased accuracy is due to pruning’s ability to reduce overfitting.
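To make the two tasks concrete, here is a hedged sketch with scikit-learn (an assumed library): a tree is induced from pre-classified examples, a pruned version of the same tree is grown with cost-complexity pruning, and a random forest combines many trees to curb over-fitting.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tree induction: repeatedly pick the best attribute to split on until the
# training instances are fully categorised.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Tree pruning: cost-complexity pruning (ccp_alpha) removes structure that
# adds little, which usually reduces over-fitting.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

# A random forest averages many randomised trees to reduce over-fitting further.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

for name, model in [("full tree", full_tree), ("pruned tree", pruned_tree), ("forest", forest)]:
    print(name, model.score(X_test, y_test))
```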
Overfitting is when a model learns something too specific to its training data and therefore fails to generalise well, so it performs badly on unseen data. It is like a student memorising the answers to all 1000 questions in the textbook without understanding the material: they ace those exact questions but struggle with anything new. Overfitting happens when a model considers too much information. It’s like asking a person to read a sentence while looking at the page through a microscope: the patterns that enable understanding get lost in the noise.
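A small demonstration of the effect, assuming scikit-learn and synthetic noisy data (both assumptions): an unconstrained decision tree memorises its training set yet does worse on unseen data than a deliberately simple tree.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data: the label depends only weakly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The unconstrained tree memorises the training data (all 1000 answers)...
overfit = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(overfit.score(X_train, y_train), overfit.score(X_test, y_test))   # near-perfect vs much lower

# ...while the constrained tree typically generalises better to unseen data.
simple = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
print(simple.score(X_train, y_train), simple.score(X_test, y_test))
```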
Regression is another supervised machine learning problem, very closely related to classification. While classification is concerned with predicting discrete classes, regression is applied when the “class” to be predicted is a continuous numerical value. It focuses on how a target value changes as other values within a data set change; linear regression is a well-known example of a regression technique.
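A minimal sketch, assuming scikit-learn and made-up data: linear regression predicting a continuous target value from a single feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: the target is roughly 3 * feature + 2, plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X[:, 0] + 2 + rng.normal(scale=0.5, size=50)

# Regression predicts a continuous numerical value rather than a discrete class.
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # recovered slope and intercept, close to 3 and 2
print(model.predict([[4.0]]))          # how the target changes for a new feature value
```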
An algorithm is a set of instructions we give a computer so it can take values and manipulate them into a usable form. This can be as easy as finding and removing every comma in a paragraph, or as complex as building an equation that predicts how many goals a footballer will score in a year.
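As a trivial illustration of the easy end of that range (the function name and sample sentence are invented here), the comma-removal algorithm is only a few lines:

```python
def remove_commas(paragraph: str) -> str:
    """A tiny algorithm: take a value (some text) and return it in a usable, comma-free form."""
    return paragraph.replace(",", "")

print(remove_commas("First, second, and third."))  # -> "First second and third."
```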
Would you like anything added?
Please contact us at Inflaim@uea.ac.uk