
What Can Artificial Intelligence Do For Me? (Part 2)


“Computers are useless. They can only give you answers.”

-Pablo Picasso (1964)

In the last post on this topic, I finally started describing some concrete techniques and ways we could start using those at Vergent LMS. In hindsight, perhaps this was shortsighted – what if management reads this and says to themselves, “wow, we need some of that, yesterday!” Creating a bunch of new and urgent work is not going to earn many brownie points among the other developers. Even worse, I promised to continue describing even more techniques and possible applications in our products. Now, having realized the error of my ways, I am nevertheless honor-bound to continue in my exposition of techniques from data analysis, artificial intelligence, and machine learning. Last time, I surveyed some early approaches and found that they still offer attractive opportunities for improving the user experience. In this post, we’ll look at some more mathematical and algorithmic approaches to creating usable business intelligence from big piles of data. I’ve decided to hold off on discussing evolutionary computation and neural networks and to make this a three-part series, mostly to build suspense, but also because the post would have been too big if I had tried to cover everything here.

Regression analysis is a technique that predates machine learning, per se, but which can often be used to perform many of the same kinds of tasks and answer many of the same kinds of questions. Indeed, it can be viewed as an early approach to machine learning, in that it reduces to mechanical calculation the process of determining whether meaningful relationships exist in data. The basic idea of regression analysis is that you start with a bunch of data points and want to predict one attribute of those data points based on their other attributes. For instance, we might want to predict the amount of a loan a given customer might request at a particular time, whether some marketing strategy is likely to be effective, or other quantifiable aspects of the customer’s potential future behavior. Next, you choose a parameterized class of functions that relate the dependent variable (the thing you want to predict) to the independent variables (the other attributes of your data points). A common and useful class of functions, and one which can be used in the absence of more specific knowledge about underlying relationships in the data, is the class of linear functions of the form f(x) = a + b · x, where b · x is the dot product of b and x. Here, f is a function with scalar parameter a and vector parameter b which takes the vector x representing the independent variables belonging to a data point and maps that vector to the corresponding predicted value of the dependent variable. Once a parameterized class of functions has been chosen, the last step before performing the regression is to identify an appropriate distance metric to measure the error between the values predicted by the curve of best fit and the data on which that curve is trained. If we choose linear functions and the squared vertical difference between the line and the sample points, we get the ubiquitous least-squares linear regression technique. Other classes of functions – polynomial, logistic, sinusoidal, exponential – may be appropriate in some contexts, just as other distance metrics – such as absolute value rather than squared value – may give results that represent a better fit in some applications.
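To make the least-squares idea concrete, here is a minimal sketch for the simplest case of a single independent variable, so that f(x) = a + bx with both a and b scalars. The function name fit_line and the sample data (months as a customer versus loan amount requested) are made up for illustration and do not come from any Vergent LMS product.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing the sum of squared vertical errors for f(x) = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope of the best-fit line
    a = mean_y - b * mean_x  # intercept: the line passes through the mean point
    return a, b


# Hypothetical training data: months as a customer vs. loan amount requested.
months = [1, 2, 3, 4, 5]
amounts = [310, 480, 720, 890, 1100]

a, b = fit_line(months, amounts)


def predict(x):
    """Predicted loan amount for a customer of x months, using the fitted line."""
    return a + b * x


print(f"f(x) = {a:.1f} + {b:.1f}x; prediction for month 6: {predict(6):.1f}")
```

Swapping the squared error for an absolute error, or the line for a polynomial, changes the fitting step but not the overall recipe: pick a class of functions, pick a distance metric, then solve for the parameters.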

Once the hyperparameters (selection of dependent variable, class of functions, and distance metric) for the regression problem have been chosen, optimal parameter values can be solved for using a combination of manual analysis and computer calculation. These optimal parameters identify the particular function in the parameterized class which fits the available data points more closely than any other function in the class (according to the chosen distance metric). Measures of goodness of fit – such as the correlation coefficient and the chi-squared statistic – can help us answer not only how closely our curve matches the training data, but also whether we have “overfit” that data – that is, whether we should expect there to be simpler curves that provide nearly as good a fit as the one under consideration. We will take a closer look at regression analysis and how to apply it to scenarios of interest to Vergent LMS in future posts.
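As a rough illustration of one goodness-of-fit measure, the sketch below computes the Pearson correlation coefficient by hand. The function name pearson_r and both data sets are invented for the example. For a least-squares line with a single independent variable, the square of this coefficient equals the fraction of the variance in the dependent variable that the line explains.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: one nearly linear relationship, one much noisier.
xs = [1, 2, 3, 4, 5]
ys_nearly_linear = [2.1, 3.9, 6.2, 7.8, 10.1]
ys_noisy = [5, 1, 6, 2, 7]

r_linear = pearson_r(xs, ys_nearly_linear)
r_noisy = pearson_r(xs, ys_noisy)

print(f"nearly linear: r = {r_linear:.3f}, r^2 = {r_linear ** 2:.3f}")
print(f"noisy:         r = {r_noisy:.3f}, r^2 = {r_noisy ** 2:.3f}")
```

A value of r close to 1 or -1 suggests a line describes the data well. A value near 0 suggests that a straight line, at least, is the wrong tool.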

Often, the dependent variables we care about do not vary over a continuous range of values. For instance, we might be interested only in whether we should expect some new data point will or won’t have some characteristic. In other cases, we might want to label new data points with what we expect to be accurate labels from some relatively small, fixed set of labels – for instance, we might want to assign a customer to one of several processing queues depending on what we expect that customer’s needs to be. While regression analysis can still be used in these scenarios – by fitting some curves and assigning ranges of values of the dependent variable to fixed labels – so-called classification techniques can also be used. One benefit of using classification approaches, where possible, is that these techniques can find relationships that may not be analytically tractable – that is, relationships which could be hard to describe using parameterized classes of analytic functions.

One popular approach to classification involves constructing decision trees based on the training data which, at each stage of branching, seek to maximize the achieved information gain (in the information-theoretic sense). As a very simple example, suppose the training data set consists of data points that give a person’s name, whether they graduated from high school, and whether they are currently employed. Our training data set might look like (John, yes, yes), (Jane, yes, yes), (John, no, no). If we want to construct a decision tree to aid in determining whether new individuals are likely to be employed based on their name and high-school graduation status, we would choose to split first on the graduation status, because doing so splits the sample space into two groups which are most distinct concerning the dependent variable: one group has 100% yes and the other has 100% no. Had we branched on names first, we would have had one group with 50% yes and 50% no, and another with 100% yes; these groups are less distinct. In more complicated scenarios, branching would continue at each level as long as groups could still meaningfully be split into increasingly distinct subgroups, and would stop once they could not. The resulting decision tree would give a method according to which new samples could be classified – simply find where they fit in the tree according to their characteristics.
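The choice of the first split can be checked numerically. The sketch below computes the information gain (the reduction in Shannon entropy of the employment label) for each candidate attribute on the three-person data set above. The dictionary keys and function names are my own, chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target before the split minus the weighted entropy after it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


# The training data from the example: (name, graduated, employed).
people = [
    {"name": "John", "graduated": "yes", "employed": "yes"},
    {"name": "Jane", "graduated": "yes", "employed": "yes"},
    {"name": "John", "graduated": "no", "employed": "no"},
]

gain_graduated = information_gain(people, "graduated", "employed")
gain_name = information_gain(people, "name", "employed")

print(f"gain from splitting on graduation: {gain_graduated:.3f} bits")
print(f"gain from splitting on name:       {gain_name:.3f} bits")
```

Splitting on graduation status removes all of the uncertainty about employment in this tiny data set, while splitting on name removes only part of it, so a tree builder that maximizes information gain would pick graduation status first.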

Another approach to classification involves attempting to split the training dataset in two by finding a hyperplane (when there are only two independent variables, the hyperplane is an ordinary line in the plane) which best separates samples with different labels. For instance, suppose our training dataset consists of types of trees and coordinates in a large field where those trees grow. The data points might be (1, 1, apple), (2, 1, apple), (1, 2, apple), (4, 1, pear), (1, 4, pear) and (4, 4, pear). A line with equation y = 4 – x separates all the apple trees from all the pear trees (every apple tree has x + y of at most 3, and every pear tree has x + y of at least 5), and we could use that line to predict whether trees will be more likely to be apple or pear trees by checking which side of the line the tree is on. Finding the best hyperplane can be reduced to a quadratic programming problem and solved numerically.
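Here is a minimal sketch of the side-of-line check on the six trees above. It does not solve the quadratic programming problem. It only tests candidate lines of the form x + y = c (that is, y = c - x), and the helper names are assumptions made for the example. Two offsets are tried so the effect of the boundary position is visible: with c = 3 the apple trees at (2, 1) and (1, 2) sit exactly on the line, while c = 4 leaves every tree strictly on one side.

```python
trees = [
    (1, 1, "apple"), (2, 1, "apple"), (1, 2, "apple"),
    (4, 1, "pear"), (1, 4, "pear"), (4, 4, "pear"),
]


def side(x, y, c):
    """Signed position relative to the line x + y = c: negative below, positive above."""
    return x + y - c


def separates(points, c):
    """True if every apple is strictly below the line and every pear strictly above."""
    return all(
        side(x, y, c) < 0 if kind == "apple" else side(x, y, c) > 0
        for x, y, kind in points
    )


def predict(x, y, c):
    """Label a new tree by which side of the line x + y = c it falls on."""
    return "pear" if side(x, y, c) > 0 else "apple"


for c in (3, 4):
    print(f"x + y = {c} strictly separates the trees: {separates(trees, c)}")

print("new tree at (5, 5):", predict(5, 5, 4))
print("new tree at (0, 1):", predict(0, 1, 4))
```

Among lines with this slope, c = 4 sits midway between the nearest apple trees and the nearest pear trees, which is the kind of maximum-margin boundary the quadratic programming formulation looks for.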

The approaches to data analysis and data mining we’ve looked at so far can be considered examples of supervised machine learning: they are supervised in the sense that we (humans) label the training data set for the computer, and the computer can learn the relationships by trusting our labels. What kinds of approaches can be used for unsupervised machine learning – when we don’t know how to meaningfully label the data ourselves, for instance? Clustering is a useful way to uncover potentially useful relationships in data that we might not even know to look for. Given a bunch of data points, clustering seeks to divide the sample space into groups – or clusters – where members of each cluster are more similar to each other (based on their characteristics) than they are to members of other clusters. A bottom-up approach to clustering is to make every data element a cluster initially, and then iteratively combine the two closest clusters into a single cluster, until just one cluster remains. This creates a tree that defines sets of increasingly fine-grained clusters at lower levels of the hierarchy. A top-down approach might start with a single cluster and iteratively split the cluster by separating the data element that is most different from the average element in the cluster and moving the data points close to that point into the new cluster. Other approaches, such as k-means, work similarly and employ heuristics to improve the performance of the clustering process. (The similarly named k-nearest-neighbors method is a supervised classification technique rather than a clustering one.)
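The bottom-up approach can be sketched in a few lines. This version makes some simplifying assumptions of my own: it measures the distance between two clusters as the distance between their closest members (single linkage), it uses plain Euclidean distance, and it stops at a requested number of clusters instead of recording the full tree down to a single cluster. The sample points are invented.

```python
def distance(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def cluster_distance(c1, c2):
    """Single linkage: distance between the closest pair of members."""
    return min(distance(a, b) for a in c1 for b in c2)


def agglomerate(points, target_count):
    """Merge the two closest clusters repeatedly until target_count clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > target_count:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters


# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
groups = agglomerate(points, 2)

for group in groups:
    print(sorted(group))
```

Calling agglomerate with a target of 1 and recording each merge along the way would produce the full hierarchy described above. Choosing a different distance function changes which points end up together, which is exactly the kind of human choice that shapes the result.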

In this month’s thrilling installment of the Smart Money blog series, we’ve seen how traditional mathematical, statistical, and algorithmic techniques can be used to analyze data and derive useful information about the relationships in that data. All of these techniques, and many like them, are easily automated and take the human more or less out of the loop of figuring out the relationships of interest. These techniques, however, are still inherently constrained by the imagination and intelligence of the humans employing them: performing a linear regression will always give you the equation of a line, even if the relationships are non-linear; clustering will only cluster by the chosen distance metric, not by one that may be more natural for the given dataset; and so on. Next time, we will look at biologically inspired techniques of artificial intelligence and machine learning – including evolutionary computation and neural networks – which are not necessarily limited by our preconceived notions of what answers should look like for our datasets.