Grok technology builds cutting-edge machine learning techniques on top of biologically based algorithms.
Grok generates automated actions based on its predictions. Data is sent to Grok via an API; Grok builds a model of the data and returns a prediction for each record it receives. As Grok receives more data, it builds and refines its models. The predictions are then fed to a rules-based action generator that interfaces with control systems. This section describes the core learning algorithm, followed by an overview of the model creation and adaptive learning processes.
The Cortical Learning Algorithm (CLA) and the underlying data format, Sparse Distributed Representations (SDRs), are modeled after biological principles described in the neuroscience section. Neurons in the neocortex are grouped in columns. When an input is seen, neurons in specific columns fire. This process is emulated in a spatial pooler that identifies a pattern based on which columns contain active neurons. As the input is seen in different contexts, different neurons within the column fire. For example, if "ABC" and "NBC" are seen, the same columns are active when the C is seen, but different cells within the columns fire depending on whether they were preceded by "AB" or "NB." This process is emulated in a temporal pooler that identifies sequences based on which cells in the columns are activated. Finally, a CLA classifier takes the input state and generates a prediction N steps in the future, as well as the probability distribution of that prediction.
This process is described in more detail below.
First, a stream of data records is fed to Grok, one record at a time. The records must be in time order for Grok to find time-based patterns. The sensor region converts the input record into a binary vector representation (string of 1s and 0s).
For category input, this encoding process consists of assigning random bits in a sparse manner, i.e., with more 0s than 1s. For example, the vector may consist of 121 bits, where 21 are randomly selected to be 1s and the rest are 0s. Scalar (numerical) values are represented by selecting 21 adjacent bits (called "W") along a variable range of total bits (called "N"). For example, the minimum value is represented by the first 21 bits of the string being 1s, and the maximum value is represented by the last 21 bits of the string being 1s. The values between min and max are represented by selecting 21 adjacent bits in the middle of the string of N bits. Think of a slider widget of width W on a track with N increments. The total number of bits is a variable, which allows Grok to divide the data in different degrees of granularity. For example, if the minimum is 1 and the maximum is 10,000, the difference between 1 and 10 amounts to noise. As a result, from a pattern recognition perspective, it makes sense to represent 1 and 10 in the same "bucket," i.e., the same binary representation.
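To make the scalar encoding concrete, here is a minimal Python sketch. The function name and the total width (N = 400 bits) are illustrative assumptions; only W = 21 and the slider behavior come from the description above, and Grok's actual encoders are more elaborate.

```python
import numpy as np

def encode_scalar(value, min_val, max_val, n=400, w=21):
    # Place a block of w adjacent 1 bits whose position along n total bits
    # reflects where the value falls between min_val and max_val.
    value = max(min(value, max_val), min_val)            # clip to the range
    buckets = n - w + 1                                  # possible start positions
    start = int(round((value - min_val) / (max_val - min_val) * (buckets - 1)))
    bits = np.zeros(n, dtype=np.int8)
    bits[start:start + w] = 1
    return bits

# With a 1-10,000 range, 1 and 10 land in the same "bucket":
a = encode_scalar(1, 1, 10000)
b = encode_scalar(10, 1, 10000)
print(int(np.sum(a & b)))   # 21 shared bits: identical representations
```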
The spatial pooler treats each column of cells as an individual unit. Each column is exposed to a random subset (roughly 50%) of the bits in the binary encoding of each input record. This subset is called the column's "potential pool": the bits that could potentially activate cells in that column.
Each connection between the column and the input is called a "synapse," and each synapse has a scalar value between 0 and 1 associated with it called "permanence." The connection itself is binary: a synapse is considered connected only when its permanence exceeds a threshold. If a synapse's permanence is below the threshold, its input bit is ignored. If the permanence is above the threshold, the column responds to that input bit.
For each new input, Grok computes an "overlap score" for each column: the number of its connected synapses in the potential pool that fall on 1 bits. If a column is connected to a bit that is 0, or is not connected to a bit that is 1, nothing is added to the score. (Note that this differs from a weighting scheme.) The columns with the top 2% of overlap scores are selected as the "spatial representation" of the input. For example, if there are 2,048 columns, the spatial pooler describes the input in terms of the 40 columns with the highest overlap scores. It's important to note that similar inputs will generate similar representations.
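A simplified sketch of the overlap computation and top-2% selection might look like this (the assumed array shapes and the 0.5 connection threshold are illustrative; real implementations add mechanisms such as boosting and local inhibition):

```python
import numpy as np

def spatial_pool(input_bits, potential, permanence, threshold=0.5, active_frac=0.02):
    # potential:  (num_columns, num_bits) 0/1 mask of each column's potential pool
    # permanence: (num_columns, num_bits) scalar permanence values in [0, 1]
    connected = (permanence >= threshold) & (potential == 1)
    # Overlap score: connected synapses that land on 1 bits of the input
    overlap = connected.astype(np.int32) @ input_bits.astype(np.int32)
    k = max(1, int(len(overlap) * active_frac))       # top 2%, e.g. 40 of 2,048
    active = np.argsort(overlap)[-k:]                 # columns with highest overlap
    sdr = np.zeros(len(overlap), dtype=np.int8)
    sdr[active] = 1
    return sdr, active
```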
Next, Grok looks at all the active columns. If the active column connects to a 1 bit, then the permanence of that synapse is incremented by, say, 0.1. Conversely, if an active column is connected to a 0 bit, then the permanence is decremented by 0.1. This process reinforces the connection, making the synapses more closely match the inputs. As in human learning, patterns that are seen more frequently become remembered, and patterns that are seen infrequently are forgotten as less relevant. (Permanence of inactive columns is not adjusted.)
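Continuing the sketch above, the reinforcement step can be written as follows (the 0.1 learning rate is the example value from the text):

```python
def reinforce(input_bits, active_columns, permanence, potential, rate=0.1):
    # For each active column, raise permanence on synapses aligned with 1 bits
    # and lower it on synapses aligned with 0 bits; inactive columns are untouched.
    delta = np.where(input_bits == 1, rate, -rate)
    for col in active_columns:
        mask = potential[col] == 1
        permanence[col, mask] = np.clip(permanence[col, mask] + delta[mask], 0.0, 1.0)
```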
The output of the spatial pooler is a binary vector with 2048 bits, 40 of which are 1s and the rest are 0s. This is a sparse distributed representation (SDR).
The temporal pooler looks at each SDR and decides which cells in each column to activate. Each cell has three states: inactive, active, and predicted. As described in the neuroscience section, each cell learns the transitions between inputs as sequences are received over time. A cell remembers which cells were active just before it; when those cells activate again, it enters the predicted state, anticipating that it will activate next. In the brain, a physical connection is created between neurons that fire in sequence, so when one neuron fires, a neuron it is connected to predicts that it will fire next.
The first time an input is seen, a random cell in the column is selected. The input that preceded the activation is remembered. For example, if a "B" is triggered, the cell remembers if it saw an "A" beforehand (an A-B sequence). The next time a B is seen, if the previous input was an A, the same cell is activated. If a B is subsequently seen after a C, however (a C-B sequence), then the B column is activated but a different cell is selected. In other words, the same column is activated whenever a B is seen, but different cells in the B column are activated depending on which sequence the B is in (AB, CB, etc.). This is the reason there are multiple cells in a column. The cells add a dimension of context in time. The spatial pooler tells you that the input is a B, and the temporal pooler tells you which sequence the B belongs to, which in turn will help determine what follows B.
Inputs are not always predicted, however. If an input activates a column that contains no predicted cells (e.g., at the start of a new sequence), it is unclear which cell to activate. In this case, the entire column is activated, which is called a "burst." This phenomenon is also seen in the brain, where an entire column of neurons fires together.
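The cell-selection and bursting behavior can be sketched with a toy model (heavily simplified, with all names hypothetical; the real temporal pooler learns through dendritic segments on each cell):

```python
import random
from collections import defaultdict

class ToyTemporalMemory:
    # Same column for a given input; the cell within it depends on context.
    def __init__(self, cells_per_column=32):
        self.n = cells_per_column
        self.learned = defaultdict(dict)   # column -> {previous cell: cell}
        self.prev = None                   # (column, cell) active on the last step

    def step(self, column):
        contexts = self.learned[column]
        if self.prev in contexts:
            cell, burst = contexts[self.prev], False  # predicted cell fires alone
        else:
            burst = True                              # unpredicted: column "bursts"
            cell = random.randrange(self.n)           # learn a cell for this context
            contexts[self.prev] = cell
        self.prev = (column, cell)
        return cell, burst

tm = ToyTemporalMemory()
for sequence in ("AB", "CB", "AB", "CB"):
    tm.prev = None                                    # reset between sequences
    for column in sequence:
        cell, burst = tm.step(column)
        print(sequence, column, cell, "burst" if burst else "predicted")
# On the second pass, B is predicted, by a cell that depends on
# whether A or C preceded it.
```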
There are 32 cells in each column; multiplied by 2,048 columns, this generates a binary vector of 65,536 bits that is fed to the CLA classifier. Approximately 40 of those bits are 1s (one per active column), though there are typically a few more because of bursting.
The CLA classifier is not modeled after a biological process; it was added to help make multi-step predictions. Each cell has a lookup table with a histogram tracking how often that cell was active when a given input value was seen. To predict multiple steps into the future, a time-delayed sliding buffer the length of the prediction window is used. For example, suppose we wish to predict three steps into the future, and the current sequence we are viewing is ABCD. When the D input is seen, the histogram counts for D are incremented in the lookup tables of the cells that were active three steps earlier, when A was seen. The active bits of A are thus associated with D.
A lookup table is maintained for each length of prediction window. For example, you can predict three steps ahead with one lookup table. But to generate predictions for one and two steps ahead also, separate lookup tables must be maintained for each.
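A toy version of this lookup-table scheme might look like the following (names and structure are illustrative; the real classifier also maintains a probability distribution over values rather than a single winner):

```python
from collections import defaultdict, deque

class ToyClassifier:
    def __init__(self, steps=(1, 2, 3)):
        self.steps = steps
        self.history = deque(maxlen=max(steps) + 1)   # recent active-bit sets
        # One lookup table per prediction window: bit -> {input value: count}
        self.tables = {s: defaultdict(lambda: defaultdict(int)) for s in steps}

    def learn_and_predict(self, active_bits, actual_value):
        self.history.appendleft(set(active_bits))
        for s in self.steps:
            if len(self.history) > s:
                for bit in self.history[s]:           # bits active s steps ago
                    self.tables[s][bit][actual_value] += 1
        predictions = {}
        for s in self.steps:                          # vote over the histograms
            votes = defaultdict(int)
            for bit in active_bits:
                for value, count in self.tables[s][bit].items():
                    votes[value] += count
            predictions[s] = max(votes, key=votes.get) if votes else None
        return predictions
```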
The number of possible combinations of fields, encodings, and internal parameters (e.g., learning rates) can generate a very large number of possible models. Grok trains many models in parallel, perhaps up to 50 or 100, to see which produces the best predictions.
Grok automatically generates models using a uniquely optimized implementation of particle swarm optimization (PSO), an AI technique inspired by the behavior of schools of fish and flocks of birds. In a PSO, each "particle" is an instance of a possible model that moves around a search space, such as a range of parameter values. Each particle is initialized with a random position and velocity, and keeps track of both its own best position and the global best position found by any particle. Particles move in a direction that blends these local and global best values, and a random component helps explore the space more thoroughly and efficiently.
Particle Swarm Optimization. Particles start in different places and converge towards an optimal solution.
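A generic PSO update, before Grok's temporal and cost-function modifications, can be sketched in a few lines (the inertia and attraction coefficients w, c1, c2 are conventional textbook values, not Grok's):

```python
import numpy as np

def pso(objective, bounds, n_particles=15, iters=100, w=0.7, c1=1.5, c2=1.5):
    lo, hi = np.array(bounds, dtype=float).T
    pos = np.random.uniform(lo, hi, (n_particles, len(lo)))
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), np.array([objective(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(iters):
        r1, r2 = np.random.rand(*pos.shape), np.random.rand(*pos.shape)
        # Blend inertia, pull toward each particle's own best, and pull toward
        # the swarm's best; r1/r2 add the random exploratory component.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

# Example: minimize a toy error surface over two parameters
best, err = pso(lambda p: float(((p - 3.0) ** 2).sum()), bounds=[(0, 10), (0, 10)])
```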
Grok makes several important modifications to PSO to enable time-based (temporal) predictions and to adapt to changes over time (online learning). For example, Grok gives more weight to recent predictions, using a method similar to how the strength of synapses between neurons is modified. Grok also takes into account a user-defined cost function.
Moreover, Grok performs dozens of PSO runs in parallel. In each run, the pair of fields that generates the best results is selected, along with the associated parameters. For the best field pair, a best third field and its associated parameters are then determined. Grok eliminates models that perform relatively poorly and starts training new ones; some new models are chosen randomly from the set of all possible models, and others are derived from the current best model. Grok continues testing models in parallel until the improvement in model accuracy levels off and the process is stopped. You can think of this model selection process as evolution, discovering which set of senses works best.
You can define the number of particles (1, 5, or 15) to optimize processing time. Each particle runs as a separate parallel process on the Hadoop cluster. Note that this PSO implementation is algorithm independent.
Conventional machine learning techniques are static: the best statistical fit is calculated from a training data set, verified on a testing data set, and applied to real-world data. Some techniques are powerful enough to find a fit for nearly any training data set, although this introduces the possibility of "over-fitting." This is similar to how a specific set of economic variables can be selected that correlates with all past Presidential elections yet fails to predict the next one. More importantly, real-world data changes over time. When that happens, previously accurate models must be retrained with new data, repeating the time and expense of the original manual process.
Grok's automated learning, on the other hand, does not require conventional training and testing data sets. As described above, Grok returns a prediction for every data input value, and adjusts over time. Grok can automatically adjust predictions if the underlying field combinations and parameters remain valid. For example, if predicting energy usage in a school that is correlated to a regular class schedule, Grok will automatically adjust predictions if the class schedule changes. If a more dramatic change in underlying factors occurs, a new automated swarm can be initiated to explore new field combinations and parameters.
In addition to generating predictions, Grok can use the CLA algorithm to detect anomalies. Anomalies can be defined as events which have not been seen before, and therefore will not be predicted. Anomalies are in a sense the inverse of predictions. As a result, the bursting process described above can be used to identify anomalies. Since a bursting column represents an unpredicted input, the ratio of bursting columns to active columns indicates the degree of novelty of the input. Grok reports this ratio as an anomaly score. An anomaly score of 1.0 means that 100% of the columns are in a bursting state. When Grok is first exposed to new data, the anomaly score is 1.0 for all inputs, because they are all new and Grok needs to learn the "normal" state of the system. After an initial learning period, a high anomaly score indicates that Grok observed a state that had not previously been seen.
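In code, the anomaly score reduces to a simple ratio (a sketch; the function name is hypothetical):

```python
def anomaly_score(active_columns, bursting_columns):
    # Fraction of active columns that are bursting (unpredicted):
    # 0.0 = fully predicted input, 1.0 = completely novel input.
    if not active_columns:
        return 0.0
    return len(set(bursting_columns)) / len(set(active_columns))
```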
This method of detecting anomalies removes the need to define, measure, and classify all potential problems before you can detect them. Grok simply reports that the input it sees is novel. This can be particularly useful in applications such as fraud detection, where humans continuously attempt to present new profiles to avoid detection. On the other hand, you may also want Grok to remember rare events. Note that a rare event seen a second time is, by definition, no longer anomalous. In these cases, Grok's classification mechanism can be combined with anomaly detection to detect rare events.
There are many machine learning techniques that have been applied effectively to many problem domains. Generally, these techniques are used by data scientists and are highly tuned for a specific problem.
Grok’s approach is different. Grok is a general solution that can be applied to virtually any problem with standard data types, without requiring tuning for a specific domain. And because Grok is automatic, it is dramatically lower cost than an expert-driven system.
Deep neuroscience differentiates Grok from conventional machine learning. You don't have to understand the biology to learn how Grok works, or see what building Grok solutions involves. But if you're curious, the neuroscience answers questions like why Grok encodes data in 2,000-bit strings, or how grouping "cells" in "columns" enables sequence learning. An understanding of Grok's cortical learning algorithm can help explain the adaptive and automated aspects of Grok's modeling system.
Computers excel at performing precise tasks quickly, but can’t match the flexibility and learning abilities of the human brain, which processes high-velocity data streams every waking moment. For vision alone, your optic nerve contains an array of a million fibers, sending 10 million bits of data to your brain per second. Understanding how the brain processes these torrents of data provides the key to building an adaptive prediction system for streaming data.
At the core of every Grok model is the Cortical Learning Algorithm (CLA), a detailed and realistic model of a layer of cells in the neocortex. Contrary to popular belief, the neocortex is not a computing system; it is a memory system. When you are born, the neocortex has structure but virtually no knowledge. You learn about the world by building models from streams of sensory input, and from these models you make predictions, detect anomalies, and take actions.
In other words, the brain can best be described as a predictive modeling system that turns predictions into actions. Three key operating principles of the neocortex are described below.
Grok converts disparate data types into a common format, to learn patterns and see relationships. This format must be flexible enough to generalize and recognize "similar" patterns, an ability that has eluded computers. Artificial intelligence experts call this the problem of "representation." How do you represent and store information about the world? The brain's model of the world generates concepts like what a car is, what it does, and what its attributes are. We translate sensory input into representations so effortlessly that it's difficult to understand why computers struggle with it.
Computers store data in “dense” representations of 1s and 0s. For example, ASCII characters are stored in blocks of 8 bits. The letter "m" is represented by the string "01101101." Each 1 and 0 has no inherent meaning, and changing one bit will completely change the meaning of the entire string of bits (“vector”).
By contrast, data stored in the brain is very sparse. The human brain has between 30 and 100 billion neurons, but at any given time only a few percent are active. You can translate this into a data storage system called “Sparse Distributed Representations” (SDRs), where active neurons are represented by 1s and inactive neurons are 0s. SDRs have thousands of bits, but typically only about 2% are 1s and 98% are 0s.
Diagram: sparsity illustrated as two thousand circles, of which only a small number (shown in red) are active.
In SDRs, unlike computer data, each bit has meaning. This means that if two vectors have 1s in the same position they are semantically similar. Vectors can therefore be expressed in degrees of similarity rather than simply being identical or different. These large vectors can be stored accurately even using a subsampled index of, say, 10 of 2,000 bits. This makes SDR memory fault tolerant to gaps in data. SDRs also exhibit properties that reliably allow the neocortex to determine if a new input is unexpected. After understanding the benefits of SDRs, it is difficult to imagine that an intelligent system could be built without them.
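These properties are easy to demonstrate numerically. The sketch below shows that two random SDRs barely overlap, so even a 10-bit subsample identifies a 2,048-bit vector with near certainty (the sizes come from the text; everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_sdr(n=2048, on=40):
    # n bits with `on` active: about 2% sparsity
    sdr = np.zeros(n, dtype=np.int8)
    sdr[rng.choice(n, size=on, replace=False)] = 1
    return sdr

a, b = random_sdr(), random_sdr()
print(int(np.sum(a & b)))          # expected overlap is under 1 bit

# Subsampling: storing only 10 of a's 40 active bit positions still
# identifies it reliably, since a chance match on all 10 is vanishingly rare.
index = rng.choice(np.flatnonzero(a), size=10, replace=False)
print(bool(a[index].all()))        # True: a matches its own subsample
```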
For more details about SDRs, watch this excerpt from a talk given by Jeff Hawkins.
To make predictions with streaming data, Grok needs to identify how patterns change over time. This may seem obvious, but most machine learning techniques assume that each data point is statistically independent of the previous and next records. For example, the result of rolling a die has nothing to do with the previous roll. Such methods can be adapted to deal with sequences over time, but are usually not inherently "temporal."
By contrast, the primary memory function in the neocortex is sequence memory. Learning sequences allows you to predict what will happen next. Even vision and touch can be expressed as sequences of input rather than isolated data points. This process can be explained by examining the structure and operations of the neocortex, where all high-level intelligence resides. The neocortex contains layers of neurons arranged in columns that respond together, though almost all connections between neurons run horizontally. Each neuron has dendrites with spines that connect it to other neurons.
Sequences are represented in neurons by the way they connect to each other and become active based on sensory input. When a neuron becomes active, synapses are formed with a small subset of 10 or 20 neurons that were previously active (this is the subsampling property of SDRs in action). Once these connections are formed, if those cells subsequently become active again, the neuron anticipates that it may fire next. This is the basis of sequence memory: you learn to predict the next step.
A one-step ("first order") prediction system is not enough to learn complex patterns, however. For example, given the word "like," it is difficult to predict the next word. But if you heard "time flies like...," the extra context could help you predict the next words you hear will be "an arrow." Or if you heard, "try it, you'll like...," you might predict "it" instead. We can create a "variable order" memory system to learn longer sequences by adding cells to form a column. When an input is detected, one of the cells in a column is activated. If that same input is subsequently seen as part of a different sequence, a different cell in the same column is activated. This allows us to expand exponentially the representations of a given input in different contexts. When you hear "like," the same column activates, but you can distinguish the different meanings of "like" because different cells in the column fire, allowing you to make different predictions for what will follow.
The diagram below illustrates how this works. For a given SDR, instead of one cell, each bit is represented by one cell in a column of ten. The second representation shows a variation of the same input with different cell activations. With 40 active columns and 10 cells per column, there are 10^40 ways to represent the same input in different contexts.
Grok emulates the brain's variable order sequence memory. The resulting system has high capacity, is fault tolerant, and capable of "semantic generalization" (grouping similar inputs and patterns together).
For more details on sequence memory, watch this excerpt from a talk given by Jeff Hawkins.
Conventional predictive analytics involves gathering large amounts of data, spending time to figure out the correlations, and then deploying the model to make predictions. With high-velocity data, however, static models can become obsolete quickly. The data may change faster than you can repeat the process and rebuild your models.
The brain faces similar challenges. When sensory data enters the brain, you don’t have time to store it and figure it out later. Your brain must process every new input, whether it is useful or not. If the pattern repeats, you reinforce it, and if it does not, you forget it.
Neuroscientists once believed that learning occurred only through the strengthening and weakening of synapses. It turns out that synapses can also form and grow rapidly, rather than simply increasing or decreasing in weight.
This is modeled with two factors: the degree of growth ("permanence") and a binary connected state. The more often an input is seen, the higher the permanence value becomes; at a threshold level, the connection is established. If you continue to see a pattern, its permanence increases, which makes it harder to forget. Grok implements a simplified version of this system, enabling it to filter out noise and infrequent patterns while learning and remembering useful ones.
For more details on online learning, watch this excerpt from a talk given by Jeff Hawkins.
The Cortical Learning Algorithm is fully detailed in the following white paper, although it was written before Grok was envisioned. The document is available in several languages, thanks to the generosity of volunteer translators (Grok has not verified the accuracy of these translations).
Jeff Hawkins first described his theory of intelligence in the book On Intelligence, which was written with the help of Sandra Blakeslee.
On Intelligence is available in a number of languages, in print and as downloadable online translations.