fxai

Attempting the Numenta Anomaly Benchmark

2016-07-01T00:00:00+00:00

Being an objective test problem, NAB allows us to tease apart the contribution of various aspects of HTM models. It turns out the current best result on NAB can be achieved with only first-order transition memory. It seems the regular sampling rate of most time series data does not produce a directly meaningful sequence of transitions. Taking effective time steps only when sufficient change occurs does improve performance somewhat, but still shows no benefit from higher-order transitions. Ultimately, like most things in HTM, I think human-level anomaly detection will require robust temporal pooling.

The most frustrating thing about working on Hierarchical Temporal Memory has been the lack of any standard test problems. I’ve spent ages thinking and reading papers, trying to decide what kind of problems are suitable. Visual object recognition – or does that depend on understanding attention? Game playing – or does that require complex behaviour generation? Natural language processing – or does that need working memory? What even should an HTM system be trying to do and how can we measure it?

To its credit, Numenta has recently established one such standardised test problem: the Numenta Anomaly Benchmark (NAB). The task is anomaly detection in a stream of numeric values. It includes a corpus of artificial data, events with known causes, and other real-world data annotated with anomalies, each over a window of time. Currently there are 58 time series with a total of about 366,000 records and 116 anomaly windows.

Initial results have an HTM model coming out on top compared with several other methods. This is exciting because it is the first example I have seen of HTM doing something useful, and doing it better than the alternatives.

The HTM model listed on the NAB leaderboard was run in NuPIC with these parameter values. I set out to find out which aspects of the model were really important, by exploring the parameter space and algorithm variations. Also, since at last we have a standard test problem, this was an opportunity to verify my implementation of HTM, Comportex.

Headline result

I have not produced a much better score than the original Numenta model. Rather, my main result was to show that that score can be equalled with a much simpler model. This tells us something useful about the contribution of different aspects of these HTM models. I’ll discuss some below.

The headline result here comes from Comportex with mostly the same parameters as the Numenta model, but with the following differences:

first-order transition memory (1 cell per column, down from 32);
a potential receptive field sample of 16% (down from 80%);
only 16 segments per cell (down from 128);
sampled linear encoders;
timestamp given as distal input (not proximal / driving input);
a distal stimulus threshold of 18 (not 20).

If you’ve been following the HTM community you’ll recognise that this approach is based directly on the successful work of Marcus Lewis.

Additionally, some post-processing was applied:

Instead of the raw bursting rate, a delta anomaly score was calculated: it considers only newly active columns (ignoring any remaining active from the previous timestep). The bursting rate is calculated only within these new columns. To handle small changes, the number of columns considered – i.e. divided by – is kept from falling below 20% of the total number of active columns (20% of 40 = 8).
If an anomaly is detected (above some specified threshold), no further detections are reported for 40 time steps. This stops false positives from going crazy.
Note that Numenta’s Anomaly Likelihood filtering was not applied.
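The suppression rule in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the strict "greater than" comparison, and the choice to count the quiet period from the step after a detection are mine, not taken from nab-comportex.

```python
def suppress_detections(scores, threshold, quiet_steps=40):
    """Return the time steps at which a detection is reported.

    A detection is reported when a score exceeds `threshold`. After that,
    the next `quiet_steps` time steps are skipped, whatever their scores.
    """
    detections = []
    quiet_until = -1  # last time step that is still inside a quiet period
    for t, score in enumerate(scores):
        if t <= quiet_until:
            continue  # still suppressed by an earlier detection
        if score > threshold:
            detections.append(t)
            quiet_until = t + quiet_steps
    return detections
```

For example, a series that stays above the threshold for 100 steps would be reported at steps 0, 41 and 82 only.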

Anyway, here is the final result as scored by NAB:

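| model                                         | standard | low FP rate | low FN rate |
|--------------------------------------------------------------------------------------|
| original NuPIC model + anomaly likelihood     | 65.3     | 58.6        | 69.4        |
| original NuPIC model, raw bursting score      | 52.5     | 41.1        | 58.3        |
| selected Comportex model, delta anomaly score | 64.6     | 58.8        | 69.6        |
| selected Comportex model, raw bursting score  | 59.5     | 53.4        | 63.8        |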

A model variant that achieves a notably better low FP rate score of 59.8 is also described below under Effective time steps.

Discussion

This is a good result and confirms that Comportex is a competitive implementation of HTM. But the most interesting part is how this allows us to run experiments to tease apart the contribution of different aspects of the models. Let’s look at the most important ones:

First-order

The result shows that the reported performance of HTM on NAB can be explained with only first-order transition memory. That is surprising, given that NAB consists of many complex time series, and some were even deliberately constructed as artificial higher-order sequence problems. However, the result follows similar findings for prediction on the HotGym data (by Marcus Lewis), and on the New York Taxi data (by me).

Of course I did run experiments with 32 cells per column, the NuPIC standard. The result was a large decrease in score – around the same amount as when leaving out the timestamp input, or when leaving out the delta anomaly score (see Appendix).

That a first-order model gives an improved result suggests that the current design of HTM transition memory is incomplete. While it is possible that my own implementation has subtle problems, the fact remains that HTM’s higher order transition memory has not been demonstrated to have a benefit on any real problem, as far as I know. My feeling is that transition memory will work better when constrained by higher-level contexts, i.e. with temporal pooling.

Delta anomaly score

The delta anomaly score is defined as:

(number-of-newly-active-columns-that-are-bursting) /
max(0.2 * number-of-active-columns, number-of-newly-active-columns)

Note that this will not pick up cases when columns stay on unexpectedly, like a flat-line scenario. To catch this I also tried a variant with instead the number-of-newly-bursting-columns – i.e. counting the columns that changed into a bursting state, even if they were previously active. However, that gave worse results.
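The definition above translates directly into code. The following Python sketch assumes columns are identified by integers and that the sets of active, previously active and bursting columns are available at each time step; the function and argument names are illustrative and do not come from Comportex.

```python
def delta_anomaly_score(active, prev_active, bursting, min_fraction=0.2):
    """Bursting rate among newly active columns.

    The denominator is kept from falling below `min_fraction` of the
    number of active columns, so that small changes are not over-weighted.
    """
    active, prev_active, bursting = set(active), set(prev_active), set(bursting)
    newly_active = active - prev_active          # ignore columns still on
    newly_bursting = newly_active & bursting     # bursting among the new ones
    denominator = max(min_fraction * len(active), len(newly_active))
    if denominator == 0:
        return 0.0  # no active columns at all
    return len(newly_bursting) / denominator
```

With 40 active columns of which 4 are new and 2 of those are bursting, the denominator is floored at 8 and the score is 2 / 8 = 0.25, rather than the 2 / 4 = 0.5 an unfloored ratio would give.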

Here’s an example of an obvious anomaly in the realKnownCause/machine_temperature_system_failure series, around time step 4000:

Section of the machine_temperature_system_failure data. Anomaly windows are outlined in red. The middle plot is delta anomaly scores and the bottom plot is raw bursting scores, from a baseline HTM model.

Looking at the HTM columns with Sanity, we can see that the anomaly rolls in over several time steps, hiding the overall magnitude of the anomaly:

HTM columns over time (time steps go left to right) on the machine_temperature_system_failure data, highlighting time step 3982. Red columns are bursting.

That is why delta anomaly calculation helps. I like that this simple approach achieves good results without the need for a separate “anomaly likelihood” Gaussian estimation process.

Limited potential receptive field

It is curious that there is such a strong benefit in limiting the set of inputs that each column can connect to. I think this reflects that the learning process in HTM is too greedy, so it loses the ability to discriminate between values. A “boosting” mechanism is supposed to correct for this, but in practice does more harm than good. There is more to say on this but I’ll leave that for another time and a simpler test problem.

Compared to the roughly 5% increase in score going from 80% to 16% global connectivity, we get about two thirds of that benefit if the 16% connectivity is restricted to a local area of the input bits (80% connectivity within 20% area = 16%). Probably in the local case, while its scope is limited, greediness still applies within a local range of values.

Sampled linear encoders

The original Numenta model used a Random Distributed Scalar Encoder for the time series value, and a periodic scalar encoder for the hour. Instead I used Marcus Lewis’ sampled linear encoders for both - mainly for aesthetic reasons. Also I didn’t have an implementation of RDSE. So I didn’t test the impact of this decision. But just look at the beautiful encoding they produce - how can this not be good?

Value encoder. x-axis is numeric range in 5% steps. Active bits vertically.

Hour-of-day encoder. x-axis is hour. Active bits vertically.

Effective time steps

The time series data in NAB are sampled at different time scales and also change at very different rates. Some are smooth and wavelike and others are super spiky. As I mentioned in the context of the delta anomaly score, sometimes an apparently sharp peak would in fact roll in over many time steps. Because there is only an incremental change on each time step, the magnitude of the anomaly can be underestimated.

The sampling rate at which data happens to be available may not be the best rate for learning meaningful transitions.

One idea is to delay taking an effective time step in HTM until a sufficient change occurs to be registered. That is, wait until the set of active columns is, say, 20% different from the last effective time step. At that point, learn the transitions from the last effective time step.
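As a rough Python sketch of this idea, the function below picks out which raw time steps would count as effective time steps. It assumes that "20% different" means that at least 20% of the currently active columns were not active at the last effective step; that definition, and all names here, are assumptions for illustration rather than the nab-comportex implementation.

```python
def effective_time_steps(column_sets, min_change=0.2):
    """Return the indices of raw time steps taken as effective time steps."""
    effective = []
    last = None  # active columns at the last effective time step
    for t, columns in enumerate(column_sets):
        columns = frozenset(columns)
        if last is None:
            changed = 1.0  # the first step is always effective
        else:
            changed = len(columns - last) / max(len(columns), 1)
        if changed >= min_change:
            # An HTM model would learn the transition last -> columns here.
            effective.append(t)
            last = columns
    return effective
```

With 10 active columns drifting by one column per raw step, every second raw step becomes an effective step, so slowly arriving changes are accumulated into one larger transition.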

Interestingly, this gives an improved result on the low FP rate profile:

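| model                                        | standard | low FP rate | low FN rate |
|-------------------------------------------------------------------------------------|
| original NuPIC model + anomaly likelihood    | 65.3     | 58.6        | 69.4        |
| original NuPIC model                         | 52.5     | 41.1        | 58.3        |
| selected Comportex model (above)             | 64.6     | 58.8        | 69.6        |
| effective time steps when 20% columns change | 64.7     | 59.8        | 69.5        |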

Incidentally, I think the low FP rate profile is the most reasonable. It balances one correctly detected anomaly to about 10 false positives; whereas the standard profile balances one to about 20 false positives, which seems excessive. And low FN rate is more like one to 30.

Limitations

Anyway, this effective time steps approach is interesting but obviously not a final solution. It has no way to detect when an expected event does not occur, since there is no change of value involved.

Ideally I think the system would be entrained to a rhythm according to regular patterns in the data, and that would define effective time steps.

Further work

I’m keen to break down the effects of parameters and algorithm variations by file. Which files change their score order in each experiment? That could help to investigate what’s going wrong, in detail.

Looking forward to hearing your thoughts directly or on the HTM Forum.

–Felix

Code

The results here were produced with nab-comportex 0.1.0 using Comportex 0.0.14.

Appendix: Specific experiments

Results here are expressed as a change in score relative to a starting point “baseline” model. The baseline model is the same as the headline model except:

distal stimulus threshold 20
local receptive field 20% diameter and 80% fraction.

Note: The scoring here is not from official NAB, it is from my own implementation of the NAB scoring rules. And my scores are not quite the same as the official NAB scores, although they are usually closely correlated. I have not tracked down the cause of the inconsistency.

Scores shown as differences from baseline:

|   | settings                                      | standard | low FP rate | low FN rate |
|------------------------------------------------------------------------------------------|
| * | baseline                                      | =0.0     | =0.0        | =0.0        |
|   | without delta anomaly; raw bursting score     | -4.8     | -2.7        | -3.4        |
|   | global receptive field, 80% fraction          | -1.1     | -1.2        | -1.6        |
|   | global receptive field, 16% fraction          | +0.4     | +1.7        | +0.5        |
|   | no timestamp input                            | -3.8     | -3.5        | -3.0        |
|   | depth 32 cells per column                     | -2.6     | -4.6        | -2.3        |
|   | depth 32 cells per column, newly bursting     | -3.2     | -5.9        | -2.3        |
|   | depth 32 cells per column, raw bursting       | -5.5     | -5.8        | -5.7        |
|   | distal stimulus threshold 19                  | +0.3     | +1.0        | +0.8        |
|   | distal stimulus threshold 18                  | +1.1     | +1.5        | +1.9        |
|   | distal stimulus threshold 17                  | -0.1     | +0.9        | +2.0        |
| $ | global receptive field, 16% fraction, stim 18 | +1.3     | +2.5        | +1.4        |
|   | effective time steps when 20% columns change  | -0.8     | +0.5        | -1.3        |
|   | effective time steps when 25% columns change  | -1.9     | +0.1        | -2.4        |
|   | effective time steps when 15% columns change  | -1.3     | -0.7        | -1.5        |
|   | effective time steps, distal stimulus 18      | -0.7     | +1.1        | -0.9        |
| ! | effective time steps, 16% fraction, stim 18   | +1.8     | +3.1        | +2.0        |
|   | depth 32, effective time steps & stimulus 18  | -2.6     | -1.0        | -2.1        |

* baseline model

$ headline model

! headline effective time steps model

Where baseline model scores are (different from official NAB scoring):

|   | settings                                      | standard | low FP rate | low FN rate |
|------------------------------------------------------------------------------------------|
| * | baseline                                      | 66.1     | 60.5        | 70.5        |

Sequence replay in HTM

2015-12-01T00:00:00+00:00


The taming of the SDR

2015-09-15T00:00:00+00:00

In which I come to a way of visually understanding the activity of cells in HTM, which have Sparse Distributed Representations (SDRs). This new way is like a state transition diagram, but with fuzzy, evolving states. Several problems in my existing demos become immediately obvious.

Sparse Distributed Representations

Does the brain work like a computer program? No, and perhaps the deepest explanation of the difference is in their representation of information. Computers store data and instructions in a precise structure using every bit; in a digital memory address, one bad bit makes the whole thing useless. Brains use a sparse, distributed representation which needs vastly more bits (cells), but which by its very form captures meaning, in terms of a “halo” of related concepts. Read more.

Think of representing a particular person. A computer program might use their name or email address. But that in itself doesn’t tell you anything about the person, and if any bit was wrong it would fail. In contrast, a sparse distributed representation might include where they live, their friends, occupation, etc. If bits of that were missing, one would still have a general idea who it might be or could fall back to someone “similar”.

The problem with using SDRs is that they are hard to pin down. How do you work with something that is:

fuzzy – it can be active to varying degrees, mixed with noise or other SDRs;
evolving – it can change over time as it is learned from experience.

A sketch of a diagram

Earlier this year, Rob Freeman and I were exchanging emails on some HTM-related ideas. At one point he said,

I want to experiment with allowing distal excitation to spread over region-1 independently of time steps (or before each time step?)

For instance, if you take my earlier example abcabcabcxyzabc and plot all the subnetworks in a pooled representation […] it should look like (if my ascii art works out):
     abc
   /     \
abc   -   abc
   \     /
     xyz

Well, I may have missed the main point Rob was making at the time, but that little ascii-art diagram got me thinking. Why can’t we distil the activity in a HTM region down into a state transition diagram?

I spent the next couple of months, on and off, building and refining such a plot. In retrospect it seems like an obvious visualisation, and I’ve come to rely on it as an essential tool for understanding HTM activity.

The method

I’m going to talk about SDRs of cell activity in a single layer of a HTM network. Note, cell activity, not column activity, which means that these representations are context-specific. And because my main display in ComportexViz shows column activity, it doesn’t help much with seeing these context-specific cell SDRs.

This is as much an algorithm as it is a plot.

The main idea is to tame SDRs by recording them with a unique name when they appear in distinguishable form. Then we can recognise them when they appear again, even partially, and we can decide how to evolve their form.

Let’s call a named SDR a state (since it will play a role in something like a state transition diagram). A state keeps track of which cells have been active on its watch, and how often.

Every time step we check the set of active learning cells. If they overlap a known state sufficiently well, meeting a threshold parameter, that known state is said to match. It’s even possible that multiple states could match. Otherwise, the active learning cells define a new state. In any case, all these cells are counted towards the matching state (or states).

It’s not quite that simple because states are fuzzy. Some cells are more essential to a state than others. That’s why we keep running counts: the influence of a cell is weighted by its specificity to a state. So if a cell participates in states A and B an equal number of times, it will count only half as much to A as a cell fully specific to A. Code.
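To make the bookkeeping concrete, here is a small Python sketch of the matching procedure described above. It is a simplified illustration, not the ComportexViz code: the class and method names, the state naming scheme, the default threshold of 0.5, and the exact form of the match score (specificity-weighted fraction of a state's cells that are currently active) are all assumptions.

```python
from collections import Counter


class StateTracker:
    def __init__(self, match_threshold=0.5):
        self.match_threshold = match_threshold
        self.states = {}         # state name -> Counter of cell -> times seen
        self.totals = Counter()  # cell -> times seen, summed over all states

    def specificity(self, name, cell):
        """Share of a cell's recorded activity that belongs to this state."""
        total = self.totals[cell]
        return self.states[name][cell] / total if total else 0.0

    def match_score(self, name, cells):
        """Weighted fraction of the state's cells found in `cells`."""
        counts = self.states[name]
        size = sum(self.specificity(name, c) for c in counts)
        hits = sum(self.specificity(name, c) for c in cells if c in counts)
        return hits / size if size else 0.0

    def step(self, learning_cells):
        """Match the active learning cells to known states, or add a new one."""
        cells = set(learning_cells)
        matches = [name for name in self.states
                   if self.match_score(name, cells) >= self.match_threshold]
        if not matches:
            name = "s%d" % len(self.states)
            self.states[name] = Counter()
            matches = [name]
        for name in matches:
            # All current cells are counted towards every matching state.
            self.states[name].update(cells)
            self.totals.update(cells)
        return matches
```

After a cell has been counted once in each of two states, its specificity to either one is 0.5, so it contributes half as much to a match as a cell seen only in that state.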

Here we see there is a tension at the heart of the idea. When a state evolves and expands, as it appears mixed in various different contexts, is it still reasonable to think of it as a single coherent thing? Well, perhaps not, and it will often make sense to reset, building states anew after some time. But abstraction of concepts from fuzzy, messy reality is how we understand the world.

Labels from the input data are also counted on matching states, but this is only for display, it is not used to define states in any way.

So that’s basically it for taming SDRs. Now for the display, at one point in time. We can draw the states, but what about the transitions between them? Those we can derive from distal synapse connections. Easy: get all connections between cells, then map both the source and target cells into states, weighted by specificity. You end up with the connections between states, each with a score, being the weighted number of connected synapses between them. Code.
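The transition scores can be sketched in the same spirit. This Python example assumes states are given as a mapping from state name to per-cell counts, and synapses as (source cell, target cell) pairs for connected distal synapses; both representations and the function name are hypothetical simplifications, not the ComportexViz data structures.

```python
from collections import defaultdict


def state_transitions(states, synapses):
    """Score connections between states from cell-to-cell synapses.

    Each synapse adds (source specificity * target specificity) to the
    score of every (source state, target state) pair it links.
    """
    totals = defaultdict(int)  # cell -> count summed over all states
    for counts in states.values():
        for cell, n in counts.items():
            totals[cell] += n

    def memberships(cell):
        # (state name, specificity of the cell to that state) pairs
        return [(name, counts[cell] / totals[cell])
                for name, counts in states.items() if counts.get(cell)]

    scores = defaultdict(float)
    for source, target in synapses:
        for src_state, src_weight in memberships(source):
            for tgt_state, tgt_weight in memberships(target):
                scores[(src_state, tgt_state)] += src_weight * tgt_weight
    return dict(scores)
```

A cell shared equally between two states therefore splits its synapses between them, which is why thick lines in the diagram reflect connections between cells that are specific to their states.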

As well as drawing the matching states, it’s useful to show any partial matches of states by currently active or predicted cells.

Reading the diagram

Here’s an example where the inputs are single letters. This display shows the activity in a layer of cells after seeing the sequence*:

o n e o n e o n e t w o t w o t w o t h r e e

What does it tell us? Firstly, each of the letters of “one” and “two” generated distinct states when they were first seen, were re-activated on repetition, and the transitions between them were learned (blue lines). The start of “three”, “thr” also generated new states, but the “e” matched the previous “e” state from “one” instead of generating a new state. Incidentally, this is undesirable and indicates some bad parameters in this particular example. The current state, at the time this plot was drawn, was matching that “e” state (outlined black), and also extending it with other current cells that were not already in the state (the green part). Lastly, the initial “o” state is being predicted for the next time step (it is shaded blue).

More precisely, here are the rules:

States are drawn in order of appearance, by default.
If any of a state’s cells are currently active that fraction will be shaded red (whether active due to bursting or not).
Similarly, any predictive cells (predicting activation for the next time step) will be shaded blue.
If any of a state’s cells are the current learning cells that fraction will be outlined in black.
When a matching state will be extended to include new cells, those are shown in green.
Transitions are drawn as blue curves. Thickness corresponds to the number of connected synapses, weighted by specificity of both the source and target cells.
The height of a state corresponds to the (weighted) number of cells it represents.
The width of a state corresponds to the number of times it has matched.
Labels are drawn with horizontal spacing by frequency.

Putting it to work

Let’s see if the display is useful. Continuing the example from a previous post we’ll use the following sequence of words as input (with each sentence repeated).

Jane has eyes .
Jane has a head .
Jane has a mouth .
Jane has a brain .
Jane has a book .
Jane has no friend .
Chifung has eyes .
Chifung has a head .
Chifung has a mouth .
Chifung has a brain .
Chifung has no book .
Chifung has a friend .

Here we go.

Erm… ok. That’s not good. Every word seen comes up as a completely new state, even when repeated. The problem here was the “boosting” mechanism which is supposed to encourage fair, efficient use of columns by biasing activation towards neglected columns. It does settle down after a while but clearly this is pretty crazy. Let’s turn off boosting.

Running it again, repeated states are now re-activated, but we have a new problem. Here are two consecutive time steps (has, a):

The sequence Jane has eyes had already been learned, so “eyes” was predicted. When the actual input “a” appeared, it somehow ended up matching the state that was previously active for “eyes”. How?

The answer is that the predicted cells, with their distal excitation, were biased to become active, so the prediction became part of the actual representation. This is a non-standard part of the algorithm, sometimes called prediction-assisted recognition. However in this case the effect is clearly too strong. We can confirm the cause by looking at the sources of excitation on activated cells:

So indeed there were 7 cells which became active on the strength of their predictive (distal) excitation without having much direct (proximal) excitation.

When we revise the parameter :distal-vs-proximal-weight down from 1.0 to 0.2, the confusion we observed is resolved:

Moving on… when we let it run through, something odd happens in the second section:

The word “Chifung” ends up as a different state within each sentence. This happens because, the whole input being an unbroken sequence, “Chifung” appears after sequences that were recognised from previous learning:

has no friend . Chifung
has eyes . Chifung
has a head . Chifung
has a brain . Chifung

Coming up in these different learned contexts leads to distinct context-dependent representations. That feels frustrating, because it is obvious to us that those are not meaningful contexts, and do not alter the meaning of “Chifung”.

A better way to read sentences may be using multiple layers. A lower layer learns sequences of words in a sentence, but starts anew rather than continuing sequences across sentence boundaries. A higher layer could track transitions across sentences, which of course involves Temporal Pooling, a topic I won’t discuss here. This approach can be seen in the second-level motor online demo.

There are several online demos at http://nupic-community.github.io/comportexviz/

Looking forward to hearing your thoughts directly or on the NuPIC-theory mailing list.

–Felix

UPDATE 2015-09-18: Chetan, replying on NuPIC-theory, suggested running a higher-order sequence like
6874230
1874235
This is “higher-order” because it’s not enough to learn the transitions in isolation. The context of the first letter must be carried through the following 5 steps in order to unambiguously predict the final letter.

I ran it through my second-level motor demo, where each word is repeated until it is learned before moving on, and the sequence learning is reset when starting each new word. Learning follows a pattern where on each repetition the new pattern becomes progressively more distinct from the old pattern, until after about 6 repetitions there is a fully distinct representation of the new pattern.

You can run this yourself online, but I also recorded it as a video, in case that is easier.

Code

The results here were produced with Comportex 0.0.10 with ComportexViz 0.0.10.

* The first example comes from the second-level motor demo, so the input is not actually a simple sequence.

Local inhibition algorithm

2014-11-13T00:00:00+00:00

In which I propose and demonstrate a new local inhibition algorithm for HTM that seems more natural than existing methods. I attempt to model the waves of inhibition around firing neurons. A target level of sparsity is maintained by an adaptive control on the stimulus threshold parameter. One could imagine other formulations and I look forward to hearing other ideas.

A crucial step in HTM is the selection of neural columns to become active, given their feed-forward inputs. Essentially, only those with most excitation will be activated, forming a sparse representation. But the details could be important. Real neural columns, I am told, on firing trigger a wave of inhibition in the surrounding area.

Inhibition algorithms in NuPIC

In NuPIC the seemingly universal choice is global inhibition of columns. The top few columns with highest excitation are selected to become active, ignoring any spatial relationships between them. The justification for this is computational performance; a comment in the code warns of a 60x performance difference between the global vs local algorithms.

The local inhibition algorithm in NuPIC works as follows. Taking the target activation density—typically 2% (as with the global algorithm)—it keeps just that number of the most active columns in each sliding window over the column space. So inhibition is applied in local spatial areas, but within each area the spatial relationships between columns are still ignored.

The local window size is not a parameter but is calculated as the average receptive field size, mapped back from the input space to the column space.

Proposed algorithm

I have tried to model the inhibitory waves around active columns, and their relative timing to some extent. Keep in mind that my neuroscience knowledge is poor.

In a nutshell, starting with the most excited columns and working down, each becomes active and inhibits all neighbours within a base distance; it may also inhibit further neighbours within an outer radius if their excitation is sufficiently low. Specifically, if the neighbour’s excitation level is below a linear ramp from the original column’s level down to zero at the outer radius. That radius is defined by the average receptive field size. As soon as any column is inhibited in this process it is removed from consideration.
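A one-dimensional Python sketch of this procedure follows. It is an illustration only, not the Comportex inhibition code: it assumes distances are measured in column indices, and that the linear ramp starts at the active column's own position (rather than at the base distance) and reaches zero at the outer radius.

```python
def local_inhibition(excitation, base_distance, outer_radius):
    """Return the sorted indices of columns that become active."""
    n = len(excitation)
    # Visit columns from most to least excited (ties keep index order).
    order = sorted(range(n), key=lambda i: -excitation[i])
    inhibited = set()
    active = set()
    for i in order:
        if i in inhibited:
            continue  # removed from consideration by a stronger neighbour
        active.add(i)
        lo, hi = max(0, i - outer_radius), min(n, i + outer_radius + 1)
        for j in range(lo, hi):
            if j == i or j in active or j in inhibited:
                continue
            distance = abs(j - i)
            if distance <= base_distance:
                inhibited.add(j)  # always inhibited within the base distance
            else:
                # Linear ramp: full level at distance 0, zero at outer radius.
                ramp = excitation[i] * (1 - distance / outer_radius)
                if excitation[j] < ramp:
                    inhibited.add(j)
    return sorted(active)
```

For the excitation levels `[0, 0, 9, 10, 9, 0, 0, 0, 4, 0]` with a base distance of 1 and an outer radius of 4, the result is columns 3 and 8, whereas a global top-two selection would pick the adjacent columns 3 and 2.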

That is the inhibition algorithm itself, but notice that it does not enforce any particular activation density. Also it may be overly sensitive to noise in areas lacking any real signal (a criticism that probably applies to any local algorithm). So, in addition, I have made the stimulus threshold an adaptive parameter, changing slightly each time step according to the actual vs target activation density. Columns with excitation below the stimulus threshold can not become active. This is a global mechanism, but operates at a longer time scale; perhaps there is some neurochemical process like this?

UPDATE 2014-11-25: There is a problem with the adaptive stimulus threshold. After training on familiar input for a while, it becomes well recognised with many connected synapses so the stimulus threshold rises. When new input comes in, it has fewer connected synapses, i.e. a lower stimulus value. This gives rise to too few, even zero, active columns. Bad! Anyway, we can simply take the top N most excited columns after accounting for local inhibition. That is the same as having the threshold adapt immediately within one time step.

This algorithm is a “local algorithm” in the sense that it considers local spatial interactions in its processing. But it is not a “local algorithm” in the sense that its computation is distributed. Of course one could split up the column space and run inhibition on each chunk, but that is not my focus here. Rather I am interested in whether accounting for local interactions could have some effect on information processing.

Interactive demonstration

The plot below shows 200 columns along the x axis with some generated excitation levels on the y axis. Red columns are active. Hit the Step button a few times to see the actual activation level converge on the target activation level, and to contrast this kind of local inhibition with global inhibition (mirrored below the axis):


And a similar demonstration in 2D:


Finally here is a basic example of the algorithm running in HTM:

Isolated fixed sequences 2D.

Note: Google Chrome browser recommended.

Perf

Of course local inhibition is slower to compute than the simple global approach, which scales approximately linearly by the number of columns. But it is not as bad as (number of columns) X (inhibition radius), because as columns are inhibited they are removed and ignored. The performance depends on distributional properties of the input. In one (fairly arbitrary) test I ran on Comportex, the local algorithm was about 25X slower than a simple sort. However, usually the inhibition step is not the slowest part of a time step in Comportex; rather, learning on proximal synapses takes longer.

The code

The demonstrations here were compiled from ComportexViz 0.0.6 (local-inhibition-1d.cljs, local-inhibition-2d.cljs) with Comportex 0.0.6 (inhibition code is in inhibition.cljx and tuning the stimulus threshold is in cells.cljx).

UPDATE 2014-11-25: fixed version is Comportex 0.0.7 and ComportexViz 0.0.7. Inhibition code is in inhibition.cljx.

As always, I value your advice.

–Felix

HTM protocols

2014-11-05T00:00:00+00:00

In which I explain two of the central abstractions in Comportex (my HTM project). Firstly the decomposition of a time step into activation, learning, and depolarisation phases across an HTM network. Secondly the interface for working with synapse connection graphs. I need to know if my abstractions have problems, before I go further down the road of building on them. Also, they may help others to think about HTM in software.

In Clojure, protocols make abstractions. They are minimalist. Elegant weapons, for a more… civilized age ^[1]. Defining a protocol simply creates the named functions, but leaves their implementations to be supplied separately as needed.

Comportex’s protocols cover networks, regions, layers, synapses, sensory inputs, encoders and topologies in about 150 lines (er, thanks in part to the paucity of documentation), and can be read in full at protocols.cljc.

I would like to focus on two protocols.

Activate, Learn, Depolarise

I try to follow Rich Hickey’s design philosophy: design is about “taking things apart so you can put them back together”. He talks about making sure each component is just doing one thing, “decomplecting” concerns to make something simple.

In my early work on Comportex I had a step function for regions which activated cells, performed learning by updating synapses, and calculated depolarised cells. These concerns were complected.

Eventually I started thinking about networks of regions, where there are feed-forward inputs to regions from below, but also feedback connections from above, motor signal connections from below and maybe other connections. As I imagine it the feed-forward inputs should propagate through the whole network first, then feedback, motor, and other signals to distal synapses propagate afterwards. So the feedback passed down would reflect the state of higher regions after they received feed-forward input from the same time step.

This idea is behind the PHTM protocol (the prefix P is conventional). To satisfy the protocol, an object must implement its functions: htm-sense, htm-activate, htm-learn, htm-depolarise. Such an implementation is given in core.cljc.

UPDATE 2014-11-25: Originally I had sensory input on each time step arriving via input sources/channels embedded in an HTM model. Marcus Lewis helped me to realise the path of functional purity (and clojurescript compatibility) lies in passing the input value as an argument to the step function. Exactly the kind of feedback I needed.

UPDATE 2015-08-28: The activation step has been further decomposed into htm-sense to encode input senses and htm-activate to do the feed-forward pass through the network, activating cells and columns. See issue #19.

(defprotocol PHTM
  "A network of regions, forming Hierarchical Temporal Memory."
  (htm-sense [this in-value])
  (htm-activate [this])
  (htm-learn [this])
  (htm-depolarise [this]))

(defn htm-step
  [this in-value]
  (-> this
      (htm-sense in-value)
      (htm-activate)
      (htm-learn)
      (htm-depolarise)))

I define a function here too, htm-step, which takes an object satisfying the PHTM protocol and applies to it the functions in canonical order. This function returns the HTM network advanced one time step and is the centrepiece of the API.

Depolarisation comes after activation because we often want to use depolarised cells to make predictions of the next time step. One consequence of this design is that motor signals, which act to depolarise cells, should appear the timestep before a corresponding sensory (feed-forward) signal in order to be useful. I think that is reasonable.

Learning comes before depolarisation because it needs to know the previously depolarised cells (when “punishing” unfulfilled predictions); if learning was the last phase we would need to store the values for an extra step—not a big deal but slightly less elegant.
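To make the phase ordering concrete, here is a minimal sketch with a toy model that does nothing except record which phase ran. The protocol and htm-step are restated so the block is self-contained; TraceModel and its log field are hypothetical names for illustration only and are not part of Comportex.

```clojure
;; Restated here so the sketch is self-contained.
(defprotocol PHTM
  (htm-sense [this in-value])
  (htm-activate [this])
  (htm-learn [this])
  (htm-depolarise [this]))

(defn htm-step
  [this in-value]
  (-> this
      (htm-sense in-value)
      (htm-activate)
      (htm-learn)
      (htm-depolarise)))

;; Hypothetical toy model: each phase just appends its name to a log.
(defrecord TraceModel [log]
  PHTM
  (htm-sense [this in-value] (update this :log conj [:sense in-value]))
  (htm-activate [this] (update this :log conj :activate))
  (htm-learn [this] (update this :log conj :learn))
  (htm-depolarise [this] (update this :log conj :depolarise)))

;; Because htm-step is a pure function of (model, input),
;; running over a sequence of inputs is a plain reduce.
(def final-model (reduce htm-step (->TraceModel []) [:a :b]))

(:log final-model)
;; => [[:sense :a] :activate :learn :depolarise
;;     [:sense :b] :activate :learn :depolarise]
```

Since the input value is an argument rather than embedded state, every intermediate model is an ordinary immutable value that can be kept, inspected or replayed.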

Of course for all this to work it needs to call corresponding functions on individual regions, and within regions on layers of cells.

(defprotocol PRegion
  "Cortical regions need to extend this together with PTopological,
   PFeedForward, PTemporal, PParameterised."
  (region-activate [this ff-bits stable-ff-bits])
  (region-learn [this ff-bits])
  (region-depolarise [this distal-ff-bits distal-fb-bits]))

Here the ff-bits argument is the set of active bits coming in through a feed-forward pathway, and distal-fb-bits is the set of bits coming in through a feedback pathway. In the depolarise function there is the odd “distal-ff-bits” argument: this will be motor commands from below and maybe sensory too. I separate ff from fb in the depolarise function because we might want to selectively enable or disable feedback.

And similarly within a region there are layer-activate, layer-learn, layer-depolarise functions.

Synapse Graphs

The information content and computational load of HTMs is in the synapse connections. Usually, proximal synapses are represented and used very differently to distal synapses (e.g. in Numenta’s CLA White Paper). Potential proximal synapses are represented as a fixed explicit list, whereas distal synapses start empty and grow and die over time. I wanted to try the latter implicit approach for proximal synapses too. This led to a protocol encompassing both cases.

Synapse graphs as presented here represent the connections from a set of sources to a set of targets. In the case of proximal synapses the sources will be input bits and the targets will be columns. In the case of distal synapses the sources will be cells (typically in the same layer but not necessarily) and the targets will be distal dendrite segments.

The choice of how to represent the set of potential synapses, explicitly or implicitly, can be made separately.

UPDATE 2015-08-28: Originally I had defined functions here that operated on a single synaptic target at a time, reinforcing its synapses or adding or removing synapses. So the learning algorithms involved looping over a lot of targets to update each one. Essentially, procedural programming.

I realised a better way is to build a list representing the updates, before applying those updates in one go: bulk-learn. Like the way Datomic works, you build up transaction data structures before submitting them. This is faster because it allows transient mutation to be used (we don’t need intermediate states); allows the updates to be inspected easily; and is just cleaner.
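As a rough illustration of the build-then-apply idea, the sketch below applies a list of update maps to a vector of synapse maps in a single pass over a transient. The update format (:target-id, :grow, :die) and the function name bulk-apply are assumptions made up for this example; the real SegUpdate records and bulk-learn arguments in Comportex differ.

```clojure
;; Hypothetical update format:
;;   {:target-id t, :grow {source-id permanence}, :die [source-id ...]}
(defn bulk-apply
  "Applies a sequence of updates to `syns`, a vector indexed by target
   id whose elements are maps of source id to permanence."
  [syns updates]
  (persistent!
   (reduce (fn [acc {:keys [target-id grow die]}]
             ;; merge in new synapses, then remove the dead ones
             (assoc! acc target-id
                     (apply dissoc (merge (get acc target-id) grow) die)))
           (transient syns)
           updates)))

(def syns [{0 0.3} {1 0.2, 2 0.1}])

;; The updates are plain data, so they can be printed or filtered
;; before anything is changed.
(def updates
  [{:target-id 0 :grow {5 0.16}}
   {:target-id 1 :die [2]}])

(bulk-apply syns updates)
;; => [{0 0.3, 5 0.16} {1 0.2}]
```

The transient is created and made persistent again inside the function, so callers only ever see immutable values.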

Also, I moved excitations, the function calculating activity in a whole layer of cells given its input bits, into the protocol. As with bulk-learn, abstracting the process improved performance by allowing internal use of transients.

(defprotocol PSynapseGraph
  "The synaptic connections from a set of sources to a set of targets.
   Synapses have an associated permanence value between 0 and 1; above
   some permanence level they are defined to be connected."
  (in-synapses [this target-id]
    "All synapses to the target. A map from source ids to permanences.")
  (sources-connected-to [this target-id]
    "The collection of source ids actually connected to target id.")
  (targets-connected-from [this source-id]
    "The collection of target ids actually connected from source id.")
  (excitations [this active-sources stimulus-threshold]
    "Computes a map of target ids to their degree of excitation -- the
    number of sources in `active-sources` they are connected to -- excluding
    any below `stimulus-threshold`.")
  (bulk-learn [this seg-updates active-sources pinc pdec pinit]
    "Applies learning updates to a batch of targets. `seg-updates` is
    a sequence of SegUpdate records, one for each target dendrite
    segment."))
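To make the excitations docstring concrete, here is a minimal sketch over a plain nested map instead of a real synapse graph. The representation, the name excitations* and the extra perm-connected argument are assumptions for illustration; treating a synapse as connected at or above that permanence is also an assumption.

```clojure
;; Assumed representation: {target-id {source-id permanence}}.
;; `active-sources` must be a set.
(defn excitations*
  [syns active-sources perm-connected stimulus-threshold]
  (into {}
        (keep (fn [[target-id in-syns]]
                (let [exc (count
                           (filter (fn [[source-id perm]]
                                     (and (>= perm perm-connected)
                                          (contains? active-sources source-id)))
                                   in-syns))]
                  ;; targets below the threshold are left out entirely
                  (when (>= exc stimulus-threshold)
                    [target-id exc]))))
        syns))

(def toy-syns
  {:col-a {0 0.25, 1 0.30, 2 0.10}
   :col-b {1 0.25, 3 0.25}})

(excitations* toy-syns #{0 1 2} 0.20 2)
;; => {:col-a 2}
```

Here :col-a counts sources 0 and 1 (source 2 is active but not connected), while :col-b has only one active connected source and so falls below the threshold.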

A question arises of how to look up the target dendrite segments by cell in this model (since target ids refer to segments, not cells). This can be solved with another protocol which is extended only to distal synapse graphs:

(defprotocol PSegments
  (cell-segments [this cell-id]
    "A vector of segments on the cell, each being a synapse map."))

Alternative backends

Protocols leave their implementation open, so as long as we program to protocols we can write and use alternative backends. This will be important. The demos I run in Comportex today are at a tiny toy scale. But Fergal Byrne has been designing a scalable architecture for running HTMs in Clojure, in Clortex. I hope that with the right protocols our efforts can be made to work together.

As always, I value your thoughts.

–Felix

Hackathon demo: cortical.io encoder

2014-10-27T00:00:00+00:00

Last weekend I joined Numenta’s Fall 2014 Hackathon. A fantastic event. It underscores Numenta’s approach of being totally open with their work and supportive of the community.

It feels like we are at the cusp of a revolution, where a few more good ideas will really make this thing fly. So it was inspiring to hear from Jeff Hawkins about the current challenges, and to see Chetan and Yuwei’s brilliant demonstration of temporal pooling in action.

I was particularly glad to meet fellow functional programmers including system designer and Clojurist Fergal Byrne, Clojurist / Clojurescripter Marcus Lewis, Racketeer Rian Shams and Lisper Eric McCarthy.

My hack was a cortical.io encoder, for semantic representation of words as input to HTM. In Comportex. The approach was to make requests to the cortical.io REST API and store the results in a cache used by the encoder itself. I did this in both Clojure (JVM) and Clojurescript (Javascript) implementations. Since cortical.io produces two dimensional bit arrays, I also implemented two dimensional field visualisations in ComportexViz.

Here it is: interactive demo of cortical.io encoder.

Note: may take up to a minute to initialise. Maximise browser window before loading page. Google Chrome browser recommended.

Er, there is also a video of me presenting this… but I didn’t present it well. I was so focused on getting something working that I put zero minutes of preparation into the talk. I did not even try to address why I use Clojure. But Rian Shams did give a nice introduction to the joys of functional programming. Good on him.

The code

The demo here was compiled from Comportex 0.0.5 with ComportexViz 0.0.5.

The Clojure version of the encoder is just this:

(ns org.nfrac.comportex.cortical-io
  (:require [org.nfrac.comportex.protocols :as p]
            [org.nfrac.comportex.topology :as topology]
            [org.nfrac.comportex.util :as util]
            [clojure.string :as str]
            [clj-http.client :as http]))

(def base-uri "http://api.cortical.io/rest")

(def retina-size [128 128])

(def query-params {:retina_name "en_associative"})

(defn get-fingerprint
  [api-key term]
  (http/post (str base-uri "/expressions")
             {:query-params query-params
              :content-type :json
              :as :json
              :form-params {:term term}
              :with-credentials? false
              :throw-exceptions false
              :headers {"api-key" api-key}}))

(defn get-similar-terms
  [api-key bits max-n]
  (http/post (str base-uri "/expressions/similar_terms")
             {:query-params (assoc query-params
                              :get_fingerprint true
                              :max_results max-n)
              :content-type :json
              :as :json
              :form-params {:positions bits}
              :with-credentials? false
              :throw-exceptions false
              :headers {"api-key" api-key}}))

(defn apply-offset
  [xs offset]
  (->> xs
       (map #(+ % offset))
       (into (empty xs))))

(defn random-sdr
  []
  (let [size (apply * retina-size)]
   (set (repeatedly (* size 0.02)
                    #(util/rand-int 0 (dec size))))))

(defn look-up-fingerprint
  "Returns a fingerprint for the term, being a set of active indices.
   If the term is not found in the cache, makes a synchronous call to
   cortical.io REST API and stores the result in the cache (an atom).
   If the request to cortical.io fails, the term is assigned a new
   random SDR."
  [api-key cache term]
  (let [term (str/lower-case term)]
    (or (get @cache term)
        (get (swap! cache assoc term
                    (let [result (get-fingerprint api-key term)]
                      (if (http/success? result)
                        (set (get-in result [:body :positions]))
                        (do (println "cortical.io lookup of term failed:" term)
                            (println result)
                            (random-sdr)))))
             term))))

(defn cortical-io-encoder
  [api-key cache min-votes]
  (let [topo (topology/make-topology retina-size)]
    (reify
      p/PTopological
      (topology [_]
        topo)
      p/PEncodable
      (encode
        [_ offset term]
        (if (seq term)
          (cond->
           (look-up-fingerprint api-key cache term)
           (not (zero? offset)) (apply-offset offset))
          #{}))
      (decode
        [_ bit-votes n]
        (let [bits (keep (fn [[i votes]]
                           (when (>= votes min-votes) i))
                         bit-votes)]
          (if (empty? bits)
            []
            (let [result (get-similar-terms api-key bits n)]
              (if (http/success? result)
                (->> (:body result)
                     (map (fn [item]
                            {:value (get item :term)})))))))))))

If used directly, the encoder above makes synchronous calls to the REST API, which will slow things down. A better way is to run a separate thread to do the API calls while the main thread takes care of the HTM algorithm. Code for that is given in cortical_io_channel.clj.
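The general shape of that idea can be sketched as follows, assuming a JVM. This is not the code in cortical_io_channel.clj: prefetch! and fetch-fn are hypothetical names, and fetch-fn stands in for whatever function performs the REST lookup.

```clojure
(defn prefetch!
  "Starts a daemon thread that fills `cache` (an atom holding a map)
   for every term not already present. Returns the Thread."
  [cache fetch-fn terms]
  (let [^Runnable task
        (fn []
          (doseq [term terms
                  :when (not (contains? @cache term))]
            (swap! cache assoc term (fetch-fn term))))]
    (doto (Thread. task)
      (.setDaemon true)
      (.start))))

(def cache (atom {}))

;; Stand-in for the REST call: returns a dummy set per term.
(def worker
  (prefetch! cache (fn [term] #{(count term)}) ["the" "poor" "little"]))

;; The main thread would normally carry on with the HTM steps;
;; joining here is only to make the example deterministic.
(.join worker)

@cache
;; => {"the" #{3}, "poor" #{4}, "little" #{6}}
```

With this arrangement the encoder finds terms already in the cache and only falls back to a blocking lookup when it gets ahead of the worker.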

Results?

Not really.

Just as the hackathon was wrapping up I fed in some of these children’s stories repeated a couple of times. I ended up just feeding them in as a continuous stream of words without sentence breaks. Then I fed in the start of a sentence and looked up the top prediction of the next word. I fed that in as actual input, then asked for the next word, etc. I call this a stream of associations. It was not scientific at all but here are a couple of the more interesting samples:

> (submit "the poor little")
> (stream-of-associations 20)
red
riding
hood
said
it
is
little
hoops
arranged
gone
left-handed
and

> (submit "what a good")
> (stream-of-associations 20)
time
at
mother
duck
said
he
sure

Issues

As Jeff mentioned during Francisco Webber’s talk at the hackathon, the spatial clustering in cortical.io fingerprints may actually be undesirable for use in HTM. Because there is local inhibition of column activation, if there are only a few small clusters of input bits, this may produce a too-sparse activation pattern, losing information.

Also, the lookup of similar terms usually seems to produce unlikely suggestions. This may be resolved by refining the proximal synapse fields after long training. But for quick usage it seems we get more reasonable results by comparing the predicted bits directly against the words we have seen before.

And of course performance. In particular, my Clojurescript implementation is very very slow.

–Felix

Learning Simple Sentences

2014-10-15T00:00:00+00:00

In which I make an interactive demo of word sequence learning with HTM, with an eye to how generalisation might happen. I find some generalisation through word representations mixing their feed-forward receptive fields. This occurs because I bias column activations to depolarised cells. Of course this is only a superficial start at looking at generalisation.

On Rob Freeman’s insistence I’ve made up a demo of word sequence learning in HTM. I was resistant because I thought generalisation in language was inextricably tied up with the semantic content of the concepts involved. Rob suggested just “babbling” with some arbitrary words and looking at how generalisation might happen nonetheless. I am still not entirely convinced, but it is intriguing.

So I made up some simple sentences which share context. Each sentence is presented as a sequence of words, where each word is given a unique representation completely unrelated to the other words. This is in contrast to the approach of cortical.io, which represents semantic overlap between words in their encoding for input to HTM (not that I know anything much about it).

> Jane has eyes .
> Jane has a head .
> Jane has a mouth .
> Jane has a brain .
> Jane has a book .
> Jane has no friend .
> Chifung has eyes .
> Chifung has a head .
> Chifung has a mouth .
> Chifung has a brain .
> Chifung has no book .
> Chifung has a friend .

Although these sentences sound as if they come from a logic system, remember that HTM is seeing just a sequence of meaningless tokens. The words are to help us think about what kinds of generalisation might be reasonable. As an input stream, the above is exactly equivalent to:

V X Y Z O
V X Y A B O
V X Y A C O
V X Y A D O
V X Y A E O
V X Y F G O
V H Y Z O
V H Y A B O
V H Y A C O
V H Y A D O
V H Y F E O
V H Y A G O

To me it seems reasonable to generalise on these sequences such that when it gets to “Chifung has a”, before brain, then brain and book could be predicted as possible options (along with head and mouth). This is generalisation because it would never have seen the exact sequence “Chifung has a brain” before.

Some technical details with the input. I present each sentence 3 times so that synapses can learn enough to become connected. I start by presenting the words “Jane” and “Chifung” on their own to stabilise their feed-forward receptive fields. Sentences are separated by a gap (a time step with no input at all), which allows the next sequence to start fresh, without continuous context. It is useful to include a start token (“>”) and end token (“.”) on each, so that words can have a specific representation for starting a sentence, and so the end of a sentence can be predicted.
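A minimal sketch of building such an input stream is given below. It assumes that nil stands for the gap step and that the repeats of a sentence are presented back to back; the post does not say whether repeats were consecutive, and the function names are hypothetical.

```clojure
(require '[clojure.string :as str])

(defn sentence->steps
  "Splits a sentence into word tokens and appends nil as the gap step."
  [sentence]
  (conj (str/split (str/trim sentence) #"\s+") nil))

(defn input-stream
  "Presents each sentence `n-repeats` times in a row, each presentation
   followed by a gap."
  [sentences n-repeats]
  (mapcat (fn [s]
            (apply concat (repeat n-repeats (sentence->steps s))))
          sentences))

(def stream
  (input-stream ["> Jane has eyes ." "> Jane has a head ."] 3))

(take 6 stream)
;; => (">" "Jane" "has" "eyes" "." nil)
```

The start and end tokens are already part of each sentence string, so they pass through as ordinary words.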

Predictions and votes

How can we extract predictions from HTM in terms of the source input words? Start with the set of cells in the predictive state. Through their columns, trace back their proximal synapses connected to the encoded input bit array. This gives a number of votes (number of connected synapses) for each input bit. Going over each possible word, work out the percentage of votes falling in that word’s bit-set, and the average number of votes over the word’s complete bit-set. (These would only give different orderings if the inputs were of different sizes).
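The last step, scoring words from per-bit votes, can be sketched like this. The data shapes (a map of bit index to votes, and a map of word to its set of bits) and the names are assumptions for illustration; fractions are returned rather than percentages.

```clojure
(defn word-scores
  "bit-votes: {bit-index votes}; word-bits: {word #{bit-index ...}}.
   Assumes every word has a non-empty bit-set."
  [bit-votes word-bits]
  (let [total (reduce + (vals bit-votes))]
    (into {}
          (map (fn [[word bits]]
                 (let [v (reduce + (map #(get bit-votes % 0) bits))]
                   [word {;; share of all votes falling in this word's bits
                          :votes-frac (if (pos? total) (/ v total) 0)
                          ;; average votes over the word's complete bit-set
                          :votes-per-bit (/ v (count bits))}])))
          word-bits)))

(word-scores {0 2, 1 1, 5 1}
             {"brain" #{0 1 2 3}
              "book"  #{4 5 6 7}})
;; => {"brain" {:votes-frac 3/4, :votes-per-bit 3/4},
;;     "book"  {:votes-frac 1/4, :votes-per-bit 1/4}}
```

Both words here have four bits, so the two measures rank them identically; they would only diverge for bit-sets of different sizes.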

Play with it

Here’s the interactive demo. You also have the option to enter your own input!

Simple sentences demo

Note: Maximise browser window before loading page. Google Chrome browser recommended.

Results

Here are some highlights of the above demo using my default parameter values.

First, a very basic sort of generalisation can be seen as a consequence of bursting. Columns burst when they are activated by input they didn’t predict. In that case all cells in the columns become active, and consequently, predictions are made from that input in any previous context. For example, when first presented with “Chifung has”, that “has” is bursting and so opens up the previously-learned associations (see Predictions at the bottom left):

However, that generalisation is short-lived, since as soon as the transition “Chifung has” is learned, it gets its own representation and is no longer bursting (note no predictions this time):

A curious thing happens a little later on. Some generalisation appears to happen, specifically brain and book come up as predictions when they haven’t been seen in the context before:

Note that the predictions are fairly light, at only 1 to 2 votes per bit, so not enough to stop the transition from bursting on first exposure to an actual input of “brain”.

These predictions are a result of the columns representing “mouth” overlapping—and thus sharing feed-forward synapses with—those representing “brain”, “head” and “book”:

So, how did that arise? Well, a recently added feature in my code is to bias columns containing predicted (depolarised) cells to become active; an idea I got from Fergal Byrne. When the representation of “mouth” was first formed in “Jane has a mouth”, the columns/cells for “head” were being predicted, and consequently some became active. Since active columns adapt their input fields to the current input, this led to the overlap in representations. Similarly the later inputs “brain” and “book” appeared when “mouth” was predicted and so ended up overlapping with it.

I tested this by turning off the biasing behaviour (proximal-vs-distal-weight=10000, global-inhibition=true), and sure enough the phenomenon did not occur.

Here is another example of this phenomenon, this time generalising the prediction of “book” to “mouth” and “brain”:

Parameters

While all parameters are listed in the code, I’ve reproduced the descriptions of the relevant ones here, together with their default values in the demo.

You can change them in the interactive demo and of course I encourage you to do so.

Proximal synapses and columns

column-dimensions = [1000] - size of column field as a vector, one dimensional [size] or two dimensional [width height].
ff-potential-radius = 1.0 - range of potential feed-forward synapse connections, as a fraction of the longest single dimension in the input space.
ff-potential-frac = 0.3 - fraction of inputs within range that will be part of the potentially connected set.
ff-perm-inc = 0.05 - amount to increase a synapse’s permanence value by when it is reinforced.
ff-perm-dec = 0.01 - amount to decrease a synapse’s permanence value by when it is not reinforced.
ff-perm-connected = 0.20 - permanence value at which a synapse is functionally connected. Permanence values are defined to be between 0 and 1.
ff-stimulus-threshold = 3 - minimum number of active input connections for a column to be overlapping the input (i.e. active prior to inhibition).

Distal synapses and sequence memory

depth = 8 - number of cells per column.
max-segments = 5 - maximum number of segments per cell.
seg-max-synapse-count = 18 - maximum number of synapses per segment.
seg-new-synapse-count = 12 - number of synapses on a new dendrite segment.
seg-stimulus-threshold = 9 - number of active synapses on a dendrite segment required for it to become active.
seg-learn-threshold = 7 - number of active synapses on a dendrite segment required for it to be reinforced and extended on a bursting column.
distal-perm-inc = 0.05 - amount by which to increase synapse permanence when reinforcing dendrite segments.
distal-perm-dec = 0.01 - amount by which to decrease synapse permanence when reinforcing dendrite segments.
distal-perm-connected = 0.20 - permanence value at which a synapse is functionally connected. Permanence values are defined to be between 0 and 1.
distal-perm-init = 0.16 - permanence value for new synapses on dendrite segments.
distal-punish? = false - whether to negatively reinforce synapses on segments incorrectly predicting activation.
global-inhibition = false - whether to use the faster global algorithm for column inhibition (just keep those with highest overlap scores), or to apply inhibition only within a column’s neighbours.
inhibition-base-distance = 4 - the distance in columns within which a cell inhibits all neighbouring cells with lower excitation.
inhibition-speed = 2 - controls effective inhibition distance. For every multiple of this distance away a cell is, its excitation must be exceeded by one extra active synapse for it to be inhibited. E.g. if this is 2, a cell X, 6 columns away from Y, will be inhibited by Y if exc(Y) > exc(X) + 3.
activation-level = 0.03 - fraction of columns that can be active (either locally or globally); inhibition kicks in to reduce it to this level.
proximal-vs-distal-weight = 2 - scaling to apply to the number of active proximal synapses before adding to the number of active distal synapses (on the winning segment), when selecting active cells.
spontaneous-activation? = false - if true, cells may become active with sufficient distal synapse excitation, even in the absence of any proximal synapse excitation.
alternative-learning? = false - if true, an extra learning step happens. Alternative predictions (i.e. depolarised cells) are carried forward an extra time step (as if the predicted cells were active); these forward-predicted cells learn on distal segments in the current context (as if they were active).
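The inhibition-speed rule can be read as the small predicate below. This is only a sketch of the rule as worded above: it assumes the "multiples of this distance" are counted with integer division, it does not model inhibition-base-distance, and the function name is hypothetical.

```clojure
(defn inhibited-by?
  "True if cell X (excitation exc-x) is inhibited by cell Y
   (excitation exc-y) lying `distance` columns away."
  [exc-x exc-y distance inhibition-speed]
  (> exc-y (+ exc-x (quot distance inhibition-speed))))

;; The example from the description: speed 2, distance 6,
;; so Y needs more than exc(X) + 3.
(inhibited-by? 5 9 6 2) ;; => true  (9 > 5 + 3)
(inhibited-by? 5 8 6 2) ;; => false (8 is not greater than 8)
```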

Anyway, I’m not sure how generally desirable the behaviour I described above is. I am sure that this is only a very superficial start at looking at generalisation.

As always, I value your advice.

–Felix

The code

The demo here was compiled from Comportex 0.0.4 with ComportexViz 0.0.4.

Temporal Pooling of Isolated Sequences

2014-09-29T00:00:00+00:00

In which I fall back to testing temporal pooling on the simplest possible problem, that of fixed sequences presented in isolation from each other. I find that simple sequences do get a temporal pooling representation that is both sensitive and specific. Longer sequences are not fully covered due to the decay rate in temporal pooling scores; this needs rethinking. Sequences which start the same and later diverge give rise to pooled representations over them both together, as well as each unique sub-sequence.

In my recent post I tested a temporal pooling algorithm on an input stream consisting of fixed sequences randomly mixed together, with poor results. On Rob Freeman’s suggestion I went on to test it with the simpler problem of fixed sequences in isolation. I should have done that first, of course…

See it run

The problem uses the same fixed sequences as last time but presented randomly in series, separated by a small gap. As before this problem comes with an interactive demo.

Sequences in isolation

Note: Maximise browser window before loading page. Google Chrome browser recommended.

The sequences are well predicted after a few hundred time steps, and consequently temporal pooling emerges in the higher-level region.

As before, let’s try to measure this.

Measurement

So I “warmed up” the system for 4000 time steps, keeping the following 2000 time steps. I filtered them down to where the given pattern occurs (excluding the first step it appears on, which is unpredictable), and selected the top 3 most frequently active temporal pooling cells. These are my candidates.

It is then a simple matter to compute sensitivity, specificity and precision for each candidate cell. The results are shown in the following table.

pattern	candidate	in-steps	out-steps	active-in	active-out	sensitivity	specificity	precision
run-0-5	0	154	925	154	0	1.00	1.00	1.00
run-0-5	1	154	925	154	0	1.00	1.00	1.00
run-0-5	2	154	925	154	0	1.00	1.00	1.00
rev-5-1	0	96	990	96	0	1.00	1.00	1.00
rev-5-1	1	96	990	96	0	1.00	1.00	1.00
rev-5-1	2	96	990	96	0	1.00	1.00	1.00
run-6-10	0	132	945	132	104	1.00	0.89	0.56
run-6-10	1	132	945	132	116	1.00	0.88	0.53
run-6-10	2	132	945	132	116	1.00	0.88	0.53
jump-6-12	0	116	965	116	132	1.00	0.86	0.47
jump-6-12	1	116	965	116	132	1.00	0.86	0.47
jump-6-12	2	116	965	116	128	1.00	0.87	0.48
twos	0	231	846	198	0	0.86	1.00	1.00
twos	1	231	846	198	0	0.86	1.00	1.00
twos	2	231	846	198	0	0.86	1.00	1.00
saw-10-15	0	203	879	203	0	1.00	1.00	1.00
saw-10-15	1	203	879	174	0	0.86	1.00	1.00
saw-10-15	2	203	879	174	0	0.86	1.00	1.00

The results are encouraging. The first two patterns are pooled over perfectly by several candidate cells. The next two—run-6-10 and jump-6-12—show good sensitivity but lower specificity (precision only 50%). The last two patterns—twos and saw-10-15—show good specificity but reduced sensitivity at 86% (although one cell managed to cover saw-10-15 completely).
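For reference, the three rates in the table follow directly from the four count columns. The sketch below is not the analysis code used for the post; the function name is hypothetical and it assumes none of the denominators is zero.

```clojure
(defn pooling-metrics
  "Computes the rates for one candidate cell from the table's counts."
  [{:keys [in-steps out-steps active-in active-out]}]
  {;; active on what fraction of the pattern's steps
   :sensitivity (double (/ active-in in-steps))
   ;; inactive on what fraction of the other steps
   :specificity (double (/ (- out-steps active-out) out-steps))
   ;; what fraction of the cell's activations fell inside the pattern
   :precision   (double (/ active-in (+ active-in active-out)))})

;; Row "run-6-10", candidate 0, from the table above.
(pooling-metrics {:in-steps 132 :out-steps 945
                  :active-in 132 :active-out 104})
;; rounds to sensitivity 1.00, specificity 0.89, precision 0.56
```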

Ad-lib hypotheses

Patterns run-6-10 and jump-6-12 share their first few elements so are probably pooled over by the same cells (thus 50%-50% precision). I hope there will be other cells which distinguish them by pooling over the parts which make them unique.
The last two patterns are the longest sequences. The temporal pooling scores decay over time and so towards the end of the sequence they may start to be dominated by current feed-forward activation, even in the absence of bursting.

Sensitivity over long sequences

Let’s look over the same plot as last time, showing candidate cell activations over each pattern instance. This display helps to diagnose problems with sensitivity (coverage of the target sequence) so it only makes sense to look at those with imperfect sensitivity, twos and saw-10-15. Here we show cells in red for active and in black for inactive.

[Figure: two panels, one for pattern "twos" and one for pattern "saw-10-15", each showing the activation of 15 candidate cells against pattern instance.]

The plot shows that indeed some cells falter at the end of the sequence, while others start later, missing the first part but reaching to the end. The logical OR of these cells would be enough to fully cover the sequence.

I ran a simulation with another even higher-level region, and the statistics are essentially the same. The decay rate of temporal pooling scores is the same in the higher region so the same thing is happening. I think the mechanism needs reworking here.

Specificity to distinct subsequences

Two patterns had problems with specificity, run-6-10 and jump-6-12. Their values are as follows:

`run-6-10`	`[6 7 8 9 10]`
`jump-6-12`	`[6 7 8 11 12]`

Since the first step of all sequences is unpredictable it is ignored for pooling. The remainder consists of 2 shared values and 2 unique values each. If we look down the list of candidate cells there are in fact some cells which cover 50% (2 steps) of each pattern with 100% specificity:

pattern	candidate	in-steps	out-steps	active-in	active-out	sensitivity	specificity	precision
run-6-10	0	132	945	132	104	1.00	0.89	0.56
run-6-10	1	132	945	132	116	1.00	0.88	0.53
run-6-10	…	…	…	…	…	…	…	…
run-6-10	13	132	945	66	0	0.50	1.00	1.00
run-6-10	14	132	945	66	0	0.50	1.00	1.00

jump-6-12	0	116	965	116	132	1.00	0.86	0.47
jump-6-12	1	116	965	116	132	1.00	0.86	0.47
jump-6-12	…	…	…	…	…	…	…	…
jump-6-12	16	116	965	58	0	0.50	1.00	1.00
jump-6-12	17	116	965	58	0	0.50	1.00	1.00

So that’s good—we have some cells representing “something starting with [6 7 8]” and other cells specific to each which appear at their point of divergence.

Thoughts

The mechanism I am using (which, by the way, was first articulated by Jake Bruce) produces a stream of temporal pooling cells over the life of a sequence. That has some nice-seeming properties and some potential problems.

If we see a sequence starting from the second element, it will activate some of these pooling cells but not the ones which are normally activated on the first element. It seems to me that these original pooling cells should be reactivated somehow from observing later parts of the sequence. I have raised this question before. One way this could work is by lateral activation between temporal pooling cells; as they stay active they continue to learn lateral connections to each other. Then any subset of them could complete the missing ones, provided lateral activation alone were enough to activate cells even without feed-forward input.

As always, I value your advice.

–Felix

The code

The demo here was compiled from Comportex 0.0.3-SNAPSHOT with ComportexViz 0.0.3-SNAPSHOT.

The extra analysis code is here: temporal_pooling_experiments.clj.

Temporal Pooling Mechanism

2014-09-26T00:00:00+00:00

In which I implement a temporal pooling mechanism and test it with fancy statistics, only to realise a more fundamental problem: my sequence memory algorithm isn’t predicting mixed sequences well enough (which is a prerequisite to temporal pooling over them).

A central idea in Hierarchical Temporal Memory is that activity in the cortex becomes more stable as one looks further up in the hierarchy of cortical regions. Taking vision as an example, the first regions respond to small parts of the retina and change rapidly. Higher regions will have stable representations of, say, an elephant as we look at it from any angle. This process of abstracting up the hierarchy is referred to as temporal pooling. You can read Jeff Hawkins’ new ideas about temporal pooling.

There should be specific cells which become active in response to, say, an elephant and not for anything else. We can measure this on labelled input data with standard classification metrics. Given a candidate cell:

Sensitivity is the fraction of elephant observations on which the cell was (correctly) active.
Specificity is the fraction of non-elephant observations on which the cell was (correctly) inactive.
Precision is the fraction of the cell’s activations on which an elephant was in fact being observed.

I have attempted to implement a mechanism for temporal pooling. For now, the implementation does not cover the full sensori-motor scheme (only Layer 3 not Layer 4), so the experiments I describe here use only sensory inputs.

Mechanism

EDIT: this idea was first articulated by Jake Bruce.

My mechanism extends the standard spatial pooling (also known as pattern memory), selecting which columns in a region become active. The input data are the active cells from the region below, as well as the subset of those correctly predicted by the region below, which I call signal cells. My mechanism sets up persistent temporal pooling scores which are used like column overlap scores (i.e. to select active columns). Temporal pooling scores are generated for a column when it overlaps with signal cell inputs; they decay over time; they are interrupted when dominated by other columns’ overlap scores, or when bursting (non-signal) input comes in to the column itself.

Notice that I do not explicitly turn off pooling when bursting inputs appear nearby. Rather I assume that bursting inputs will generate dominant overlap scores because bursting column activation is dense compared to the sparse (one cell per column) activation from predicted columns. Whether that is a reasonable assumption in general remains to be seen. Further insights from the biology may help to clarify it. Anyway, the mechanism step-by-step:

1. The overlap score of each column with the input is computed as usual, but any existing temporal pooling scores (see step 5) replace the current overlap scores.
2. The set of active columns is derived from overlap scores as usual by lateral inhibition. Any columns which did not become active lose their temporal pooling status.
3. If any column has a current overlap score above its previous temporal pooling score, the previous temporal pooling score is deleted. However, the column may remain in a temporal pooling state if the new overlap comes from signal cells (step 5).
4. Temporal pooling scores on continuing active columns are reduced by a decay factor.
5. If any active columns overlap with signal-cell inputs this overlap is multiplied by a signal boosting factor and becomes the column’s new temporal pooling score.
6. (Sequence Memory / Temporal Memory): In continuing temporal pooling columns, the previously active cell remains active. In newly active temporal pooling columns, a cell is chosen to become the active temporal pooling cell in the usual way (depending on whether the column itself is bursting or predicted).
7. The temporal pooling cell is active and thus learns by growing lateral synapses.
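
The column-level part of this mechanism (steps 1 to 5) can be sketched roughly as follows. This is a simplified illustration, not the Comportex code: the function `tp-step` and the parameter names `:n-active`, `:decay` and `:signal-boost` are hypothetical, lateral inhibition is replaced by simply keeping the top-scoring columns, and the decay is assumed to be multiplicative. The cell-level steps 6 and 7 are left out.

```clojure
(defn tp-step
  "One simplified temporal pooling update.
   tp-scores:       map of column id -> persistent temporal pooling score
   overlaps:        map of column id -> overlap with all active inputs
   signal-overlaps: map of column id -> overlap with signal-cell inputs
   Returns {:active-columns set, :tp-scores updated map}."
  [tp-scores overlaps signal-overlaps {:keys [n-active decay signal-boost]}]
  (let [;; step 1: existing pooling scores replace the current overlaps
        effective (merge overlaps tp-scores)
        ;; step 2: stand-in for lateral inhibition (assumption): top n columns
        active (->> effective
                    (filter (comp pos? val))
                    (sort-by val >)
                    (take n-active)
                    (map key)
                    set)
        ;; only active columns can hold a pooling score afterwards
        new-scores
        (reduce
         (fn [m col]
           (let [prev    (get tp-scores col)
                 ov      (get overlaps col 0)
                 sig     (get signal-overlaps col 0)
                 ;; step 3: a dominant new overlap deletes the previous score
                 kept    (when (and prev (<= ov prev)) prev)
                 ;; step 4: continuing scores decay (assumed multiplicative)
                 decayed (when kept (* kept (- 1.0 decay)))
                 ;; step 5: signal overlap, boosted, becomes the new score
                 fresh   (when (pos? sig) (* sig signal-boost))
                 score   (or fresh decayed)]
             (if score (assoc m col score) m)))
         {}
         active)]
    {:active-columns active
     :tp-scores new-scores}))

;; Column 0 receives signal input, then keeps winning on its decaying score:
(def params {:n-active 2 :decay 0.25 :signal-boost 2.0})
(def s1 (tp-step {} {0 5, 1 4, 2 1} {0 3} params))
(def s2 (tp-step (:tp-scores s1) {1 4, 2 5} {} params))
```

In this toy run column 0 stays active on the second step with no input of its own, which is the pooling behaviour the steps describe. A large bursting overlap on the same column would delete its score through step 3.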

See it run

To test my mechanism, I made an input stream consisting of six fixed one-dimensional sequences, each repeating at random intervals, mixed together. See section The data below. You can run it online and observe temporal pooling in the higher region after about 300 time steps:

Sequences mixed with variable gaps

Note: Maximise browser window before loading page. Google Chrome browser recommended.

In the visualisation, columns are coloured as follows:

red columns are bursting (active without being predicted);
blue columns had predicted cells but did not become active;
purple columns were simultaneously active and predicted;
green columns are in a temporal pooling state and thus are remaining active over multiple time steps. (May be green or brown according to whether the column is itself predicted or bursting).

This screenshot gives an example of temporal pooling:

The block on the left shows the input bits, arranged in time steps from left to right. The middle block shows columns of the first region, in corresponding time steps from left to right. Yes it is confusing that the “columns” (neural minicolumns) of a region are themselves arranged in a column (tabular column) for this visualisation. The next block shows columns of the second region, which receives as input the active cells from the first region. One time step is highlighted across all panels, and one column is selected in the highest region, showing its constituent cells and their distal dendrite segments.

The input shows a simple repeated pattern over 6 time steps. The first time step was not predicted since it occurs randomly, but the following 5 time steps were predicted, as indicated by the purple columns in the first region in those time steps. (Later steps of the sequence occur further down in the input field, activating columns further down in the region, so they are not visible here). Predictive columns are also mapped back onto the input bits so we can see which part of the input field was predicted.

The correctly predicted columns in the first region generate temporal pooling scores in the next region, visible as green and brown columns. While the input sequence is correctly predicted, some of these temporal pooling cells remain active.

But let’s try to measure this.

Measurement

To measure the sensitivity, specificity and precision of temporal pooling over a given input pattern, we need to come up with some candidate cells. I did this by “warming up” the system for 5000 time steps, then keeping the following 2000 time steps. I filtered them down to where the given pattern occurs (excluding the first step it appears on, which is unpredictable), and selected the top 3 most frequently active temporal pooling cells. These are my candidates.
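
The candidate selection could look something like the sketch below, assuming the retained time steps have already been filtered down to the pattern and that each step is represented as a set of active temporal pooling cell ids. The function name `top-candidates` and the `[column cell-index]` id format are illustrative assumptions.

```clojure
(defn top-candidates
  "Returns the n cell ids that are active on the most time steps.
   steps: sequence of sets of active temporal pooling cell ids,
          one set per retained time step."
  [steps n]
  (->> steps
       (mapcat seq)        ; all activations, one entry per cell per step
       frequencies         ; cell id -> number of steps active
       (sort-by val >)     ; most frequently active first
       (take n)
       (map key)))

;; Hypothetical cell ids written as [column cell-index]:
(top-candidates [#{[1 0] [2 3]} #{[1 0]} #{[1 0] [2 3] [5 1]} #{}] 2)
;; => ([1 0] [2 3])
```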

It is then a simple matter to compute sensitivity, specificity and precision for each candidate cell. The results are shown in the following table and plot.

pattern	candidate	in-steps	out-steps	active-in	active-out	sensitivity	specificity	precision
run-0-5	0	230	1724	112	27	0.49	0.98	0.81
run-0-5	1	230	1724	123	46	0.53	0.97	0.73
run-0-5	2	230	1724	108	21	0.47	0.99	0.84
rev-5-1	0	192	1760	109	76	0.57	0.96	0.59
rev-5-1	1	192	1760	99	50	0.52	0.97	0.66
rev-5-1	2	192	1760	91	17	0.47	0.99	0.84
run-6-10	0	204	1745	116	151	0.57	0.91	0.43
run-6-10	1	204	1745	114	115	0.56	0.93	0.50
run-6-10	2	204	1745	111	133	0.54	0.92	0.45
jump-6-12	0	204	1745	117	128	0.57	0.93	0.48
jump-6-12	1	204	1745	120	149	0.59	0.91	0.45
jump-6-12	2	204	1745	113	151	0.55	0.91	0.43
twos	0	280	1680	101	11	0.36	0.99	0.90
twos	1	280	1680	89	4	0.32	1.00	0.96
twos	2	280	1680	101	44	0.36	0.97	0.70
saw-10-15	0	268	1694	157	112	0.59	0.93	0.58
saw-10-15	1	268	1694	148	88	0.55	0.95	0.63
saw-10-15	2	268	1694	138	76	0.51	0.96	0.64

[Plot: specificity and sensitivity of each candidate cell, by pattern.]

The results are disappointing. While the specificity is reasonably high, sensitivity is low—under 60%—meaning that none of the candidate cells can reliably indicate the presence of its pattern on its own.

Ad-lib hypotheses

The visual display suggested that cells were indeed staying active over any one pattern instance, but not necessarily the same cells for different instances. So the selection of active columns in the higher region may not be consistent across instances of the pattern.
Perhaps this is all we should expect from one region and we should look at an even higher region for more complete temporal pooling.
Perhaps the feed-forward synapses are continuing to learn and change and this is causing the inconsistency in column activation. Specifically it may be the learning on temporal pooling columns that is weakening their receptive field.
Perhaps feed-forward synapses should be reinforced more strongly when they carry input from signal cells, to make it more likely for those columns to become active again. (I got this idea from nupic.research.)
Perhaps column activation should be biased by predicted (depolarised) cells in the column. (I got this idea from Fergal Byrne.) This seems to promise more consistent column activations, rather than the current approach of randomly choosing between any columns with equal receptive field overlaps.

Consistency between pattern instances

Is the problem just randomness in the choice of active columns when each pattern instance appears? If so, I would expect a few candidate cells, taken together, to cover all instances of the pattern.

Since I am concerned with the sensitivity measure, I filtered the time steps down to just those where the pattern is occurring, grouped into each instance of its occurrence. Below is a plot of activations over time of 25 candidate cells, being the top ones by overall sensitivity. I did this for three different patterns.

[Plots: activations of 25 candidate cells across pattern instances, for patterns "rev-5-1", "run-6-10" and "twos".]

The picture is not what I expected. Different candidate cells do not seem to be complementary in their activations; they seem remarkably similar. Some pattern instances see widespread pooling while others do not see pooling at all (i.e. if you look down one bar in the plot, there are no horizontal lines indicating pooling). This suggests that the problem is not spatial pooling (selection of columns) but rather sequence memory (prediction of sequences).

I’d better check this out…

Observing carefully

Hi again. I’ve just been watching the simulation more carefully. Well, now I feel foolish—I see what is going wrong and it is more fundamental than any of the hypotheses I threw up above. The first region is just not predicting the input sequences well in many instances, so of course they are not being pooled over.

While pattern instances that appear in isolation are generally fully predicted, when two patterns occur together the predictions seem to fail.

Take this example, after a generous training period of 4000 time steps. Four pattern instances are visible (rev-5-1, run-0-5, jump-6-12, run-6-10), and all but the third have interference problems.

The clearest case is the last pattern: its initial time step is predicted based on the final step of the previous pattern, but this is a spurious prediction since its occurrence is random. So, only one cell per column is active instead of the usual full bursting, and it is evidently not the one that usually predicts the next step of the pattern. The next step appears unexpectedly and bursts the columns. After that the sequence prediction recovers. This suggests that failing predictions should be more harshly punished?

The first pattern is missed on its final timestep, probably due to the mixed input on the second-last time step activating different columns from the ones which usually predict the final time step. This suggests that columns should be preferentially activated if they contain predictive cells?

A different kind of prediction failure is shown in the screenshot below. The pattern sequence prediction breaks down after correctly predicting the first 4 time steps. Inspecting the dendrite segments shown we can see there is one that has fallen just below the activation threshold (it has an activation level of 8 and the threshold is 9). This would have been caused by prior incorrect predictions being punished, i.e. the segment synapses being weakened. This suggests that failing predictions should be less harshly punished??
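
To make the threshold failure concrete, here is a small sketch of how a segment's activation level can be counted. It assumes a segment is a map from source cell id to permanence and that a synapse counts only when its permanence is at or above `:connected-perm` and its source cell is active; the function names are illustrative, not the Comportex API.

```clojure
(defn segment-activation
  "Number of synapses on a segment that are both connected and
   driven by an active source cell.
   segment: map of source cell id -> permanence."
  [segment active-cells connected-perm]
  (count (filter (fn [[source perm]]
                   (and (>= perm connected-perm)
                        (contains? active-cells source)))
                 segment)))

(defn segment-active?
  "True when the activation level reaches the activation threshold."
  [segment active-cells {:keys [connected-perm activation-threshold]}]
  (>= (segment-activation segment active-cells connected-perm)
      activation-threshold))

;; Nine synapses, one of which has been weakened to just below connected:
(def params {:connected-perm 0.20 :activation-threshold 9})
(def weakened (assoc (zipmap (range 9) (repeat 0.25)) 8 0.19))

(segment-activation weakened (set (range 9)) 0.20) ; => 8
(segment-active? weakened (set (range 9)) params)  ; => false
```

With an activation threshold of 9, a single synapse dropping below the connected permanence is enough to silence a segment that has only nine relevant synapses.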

The curse of parametricality

I’ll need to go back to the basic algorithm and confront the many implementation details and parameter values I have chosen. For reference here are the parameter values I used in the simulations described above. You can also view and modify them in the online simulation.

(def spec
  {:ncol 1000
   :potential-radius-frac 0.1
   :activation-level 0.02
   :global-inhibition false
   :stimulus-threshold 3
   :sp-perm-inc 0.05
   :sp-perm-dec 0.01
   :sp-perm-signal-inc 0.05
   :sp-perm-connected 0.20
   :duty-cycle-period 100000
   :max-boost 2.0
   ;; sequence memory:
   :depth 8
   :max-segments 5
   :max-synapse-count 18
   :new-synapse-count 12
   :activation-threshold 9
   :min-threshold 7
   :connected-perm 0.20
   :initial-perm 0.16
   :permanence-inc 0.05
   :permanence-dec 0.01
   })

The code

The demo here was compiled from Comportex 0.0.2 with ComportexViz 0.0.2.

The extra analysis code is here: temporal_pooling_experiments.clj.

The data

The (toy) problem domain I have constructed here is an attempt to test temporal pooling and sequence memory in a simple but meaningful way. I would like others to try simulating the same problem domain, or to suggest any other that can serve as a kind of benchmark or shared example.

The input is made up from 6 different fixed patterns, named as follows:

`run-0-5`	`[0 1 2 3 4 5]`
`rev-5-1`	`[5 4 3 2 1]`
`run-6-10`	`[6 7 8 9 10]`
`jump-6-12`	`[6 7 8 11 12]`
`twos`	`[0 2 4 6 8 10 12 14]`
`saw-10-15`	`[10 12 11 13 12 14 13 15]`

These patterns are fed into the input stream, each instance separated from its next repeat by a gap with random duration of between 1 and 75 time steps. As such, the input on each timestep is a set of (up to 6) integers in the range [0 15]. These are encoded by simply dividing up the input array (of 400 bits) into 16 non-overlapping blocks and activating the block corresponding to the integer. The encoded bits from each integer are ORed together. The code is mixed_gaps_1d.cljx.
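
A rough sketch of how such a mixed stream can be generated is given below. It is not the code from mixed_gaps_1d.cljx: the function names are illustrative, and it assumes each pattern starts with a random gap and that `rand-int` is an acceptable source of randomness.

```clojure
(def patterns
  {:run-0-5   [0 1 2 3 4 5]
   :rev-5-1   [5 4 3 2 1]
   :run-6-10  [6 7 8 9 10]
   :jump-6-12 [6 7 8 11 12]
   :twos      [0 2 4 6 8 10 12 14]
   :saw-10-15 [10 12 11 13 12 14 13 15]})

(defn pattern-stream
  "Infinite lazy sequence for one pattern: a random gap of 1 to 75
   empty steps (nil), then the pattern values, repeated forever."
  [values]
  (apply concat
         (repeatedly
          (fn [] (concat (repeat (inc (rand-int 75)) nil) values)))))

(defn mixed-stream
  "Infinite lazy sequence of input sets, one per time step.
   Each set holds the current value of every pattern that is
   mid-instance on that step, so it has at most 6 members."
  []
  (apply map
         (fn [& xs] (set (remove nil? xs)))
         (map pattern-stream (vals patterns))))

;; Inspect the first 20 time steps of one random run:
(take 20 (mixed-stream))
```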

I have generated a CSV data file containing 10,000 time steps of the input stream: mixed_fixed_1d_10k.csv (70kb).

The file has 6 columns containing either integers or blanks. The set of integers from each row should be encoded with a scalar encoder of range [0 15], bit width 400, and 25 active bits. The final input set is the union of the active bits from each encoded integer.
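
For anyone reproducing the encoding, a minimal sketch follows. It assumes the non-overlapping block scheme described above, where integer i activates bits 25i to 25i+24 of the 400-bit array; the function and option names are illustrative.

```clojure
(defn encode-int
  "Set of active bit indices for one integer: the i-th of n-values
   equal, non-overlapping blocks in an array of n-bits."
  [i {:keys [n-bits n-values]}]
  (let [width (quot n-bits n-values)]
    (set (range (* i width) (* (inc i) width)))))

(defn encode-step
  "Union of the active bits of every integer present on one time step."
  [ints opts]
  (reduce into #{} (map #(encode-int % opts) ints)))

(def encoder-opts {:n-bits 400 :n-values 16})

(count (encode-int 0 encoder-opts))        ; => 25
(count (encode-step #{0 15} encoder-opts)) ; => 50
```

A blank row in the CSV corresponds to an empty set of integers, which encodes to no active bits.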

Thanks for reading this. I would appreciate your advice.

–Felix

P.S. I’m loving charting with variancecharts.com.

Visualization Driven Development of the Cortical Learning Algorithm

2014-07-11T00:00:00+00:00

I’ve spent the last few months trying to build the Cortical Learning Algorithm, the core of HTM, from scratch. Partly to understand it deeply myself, and partly to help others understand it. My building material is Clojure. I have something that looks reasonable now, but it has certainly exposed a lot of intricacies in the algorithm that are not clear, and some which are probably open research questions.

To understand an algorithm, one must of course grasp the idea of it, and also follow the procedure in code. (I don’t like pseudo-code as it often skips important details or even turns out to be incorrect.) But that will not be enough to know what the algorithm will do when played out over time. To get there, interactive graphic displays are your friend. They help you to gain an intuitive understanding, to answer specific questions, and to diagnose problems quickly. This is known as Visualization Driven Development.

In my case I was able to run the algorithm directly in a web browser (since pure Clojure can be compiled to JavaScript), which makes it easier to build an accessible interactive visualization.

I have started with a one-dimensional input array and a one-dimensional region of mini-columns. This allows the states to be lined up over time on the horizontal axis. Color is fundamental to the display:

red represents active states (of input bits, columns, cells, segments)
blue represents predicted states (of input bits, columns, cells)
accordingly, purple represents simultaneously active and predicted states.

The simulation can be run and paused for detailed investigation. A typical example would be to notice some feature of the model, say a bursting (unpredicted) column where one hoped the input had already been learned; at this point stop the simulation and click on the column to reveal its constituent cells and their dendrite segments. Use the arrow keys to step forwards and backwards in time, and check which cells, segments and synapses are involved and how they are being modified. Full details are given in a text box.

The algorithm parameters are shown and can be changed either during a run, or after a reset. Please refer to the annotated source code of the algorithm for descriptions of each of the parameters.

Here it is with four different input streams, presented on separate pages.

Note: Maximise browser window before loading page. Google Chrome browser recommended.

The code

The demos here are compiled from Comportex 0.0.1 with ComportexViz 0.0.1.

Unresolved questions

I should emphasise that I am not sure what I have implemented here is true to Numenta’s algorithm. And it does not include temporal pooling yet in any form. My reference was primarily the Numenta CLA White Paper but ignoring the temporal pooling aspects (since theory has moved on since it was written). But it leaves some details unclear.* Some specific questions:

1. How frequently does boosting in the spatial pooler apply? If we apply it continuously over a rolling 1000-timestep window, the same set of columns may be repeatedly boosted hundreds of times before the long-term average is affected and boosting turns off. On the other hand if we only apply boosting once every 1000 timesteps (as currently implemented) this generates a sharp change where all active columns switch off, and inactive columns switch on, every 1000 steps.
2. How do lateral synapses (on dendrite segments) refine their predictions? The White Paper implies – if I understand correctly – that if a cell stops predicting, without first becoming active, then the last active dendrite segments are punished (their active synapses have their permanence reduced). I have tried to implement this. But I couldn’t find code for it in NuPIC.
3. How exactly are the learning cell and segment chosen from a bursting column? And from a predicted column with multiple active segments? Do we choose segments based on their activation level from “learning” cells? If there are too few, do we fall back to activation from all active cells? How should we break ties?
4. Can multiple cells per column be in a predictive state? If so, can there be multiple learning cells per column? Otherwise, how are they resolved? If one active cell is chosen to be reinforced, should other active cells be reinforced too, or punished to differentiate them, or ignored?
5. Should we reinforce active synapses only to learning cells, or to all active cells?
6. Should a cell learn only when it changes state, or every time step?

UPDATE: Chetan very helpfully answered my questions on the mailing list here. My implementation matches his description except for point 6 (I had learning every time step even on continuing cells).

Pressing ahead

The most obvious major features to do next are:

add noise to the inputs, with interactive controls;
two-dimensional input fields and column arrangements;
hierarchy – regions feeding forward into other regions;
temporal pooling (the new theory).

I want to get some more interesting examples going to motivate further development.

Looking forward to hearing your thoughts on the Numenta mailing lists or by direct email.

–Felix

* I also looked through the NuPIC source code but could not find answers to all my questions. It has a mix of different methods, many targeting classifier accuracy rather than general learning.

The General Idea

2014-05-27T00:00:00+00:00

Greetings!

Like many people, it seems, my passion for artificial intelligence has been re-kindled recently. Douglas Hofstadter and Jeff Hawkins were the main re-kindlers for me. Now I’ve set off exploring.

I’m hoping to push the boundaries of artificial intelligence, creating something that can learn, explore, generalise, theorise, and surprise.

Why?

Because it would be so cool. Also, it seems self-evident that the tools we build should have some basic intelligence. Software and games use tricks to appear intelligent, but they are mindless zombies, unable to deal with even slightly new situations, and will happily bang their heads against a brick wall until, um, they’re taken out by a zombie hunter, or something. Computational intelligence has the potential to make technology do what we want, with robust adaptability. The cost is giving up precisely defined behaviour, as well as vastly more computation and memory requirements. Not to mention the difficulty of developing all this. But smart people are working on it. And it is fun.

How?

I’m convinced that Jeff Hawkins’ Hierarchical Temporal Memory (HTM) is the proper basis for such computational intelligence. Of course, HTM is a general theory and has not yet been worked out at a level of detail applicable to most interesting problems. My strategy will be to focus on examples. To attack specific problems as a way to inspire development of the theory. I’ll design simulations with opportunities for learning and watch what goes wrong.

But…

HTM is based on a memory system, arranged in a hierarchy of increasing abstraction, where the memory is simultaneously a predictive function. So it can recognise things, or situations or whatever, and infer what is happening and what may happen next. There is no emotion, no desire, no fear. Just recognition, generalisation, analogy. Is that scary? Of course every technology will be put to nasty uses, but I think it’s hard to argue that more intelligence is a bad thing.

Anyway, looks like we’ve got a lot of work to do. Better get on with it.

–Felix