# Limitations to Estimating Mutual Information in Large Neural Populations


*Keywords:* sensory coding; information theory; entropy; sampling bias


Queensland Brain Institute & School of Mathematics and Physics, The University of Queensland, St. Lucia, QLD 4072, Australia

Author to whom correspondence should be addressed.

Received: 2 March 2020 / Revised: 22 April 2020 / Accepted: 22 April 2020 / Published: 24 April 2020

(This article belongs to the Special Issue Information Theory in Computational Neuroscience)

Information theory provides a powerful framework to analyse the representation of sensory stimuli in neural population activity. However, estimating the quantities involved such as entropy and mutual information from finite samples is notoriously hard and any direct estimate is known to be heavily biased. This is especially true when considering large neural populations. We study a simple model of sensory processing and show through a combinatorial argument that, with high probability, for large neural populations any finite number of samples of neural activity in response to a set of stimuli is mutually distinct. As a consequence, the mutual information when estimated directly from empirical histograms will be equal to the stimulus entropy. Importantly, this is the case irrespective of the precise relation between stimulus and neural activity and corresponds to a maximal bias. This argument is general and applies to any application of information theory, where the state space is large and one relies on empirical histograms. Overall, this work highlights the need for alternative approaches for an information theoretic analysis when dealing with large neural populations.

The neural code is inherently stochastic, so that even when the same stimulus is repeated, we observe a considerable variability in the neural response. Thus, an important question when considering the interplay between sensory stimuli and neural activity, and when trying to understand how information is represented and processed in neural systems, is how much an observer can actually infer about the stimulus from only looking at its representation in the neural activity. This problem can be formulated and quantitatively studied in the general framework of information theory.

With neural information processing happening at the level of populations, any such analysis necessarily has to consider the activity of an entire population [1,2]. The number of neurons which can be recorded simultaneously has increased in recent years, driven by technologies such as calcium fluorescence imaging [3] and high-density electrode arrays [4]. These methods allow recording of population activity from thousands of neurons simultaneously. However, at these scales, information theoretic analyses of neural activity become considerably harder. Critically, any information theoretic analysis depends at some point on the precise knowledge of the joint probability distribution of the states of the stimuli and the neural population.

However, estimating this distribution from data, and thus entropy and mutual information, is notoriously difficult. This problem is well known, in particular, when the state space for the neural activity becomes large, as is the case when considering large neural populations [5,6,7,8,9]. This problem is generally addressed by applying corrections derived from asymptotic considerations [10], or shuffling/bootstrap procedures together with extrapolations [11,12,13,14,15,16,17]. A crucial result in this context is the inconsistency theorem, which implies that under some circumstances estimators that are frequently employed will yield arbitrarily poor estimates [18]. More recently, and, in particular, with regard to entropy estimation, there have been advances towards new estimators or model-based approaches for estimating the aforementioned joint probability distributions [19,20,21,22,23,24].

In the following we revisit the direct (and model-independent) estimate of entropy and mutual information from experimental histograms. Rhee et al. [7] noted that estimated probability distributions from data samples appear “thinner” and that this leads to what has become known as the sampling bias. Moreover, according to these authors, ensuring the state space is adequately sampled becomes proportionally more difficult with the size of the neural population. The aim of this article is to make this intuition precise, and to show that, under certain assumptions, with increasing population size any such direct estimate of mutual information will with high probability only yield the stimulus entropy. This emphasises the problems of an information theoretic analysis on the basis of experimental histograms and highlights the importance of new approaches when it comes to estimating information theoretic quantities for large neural populations.

In a typical experiment investigating sensory processing, one studies the activity from a neural population of interest in response to a defined set of stimuli. Due to the stochastic nature of neural activity, every stimulus is presented separately multiple times, yielding a finite set of samples of neural activity for every stimulus [25]. From these samples across all stimuli, one can then estimate the relative frequency of patterns of neural activity in response to different stimuli in order to, for example, calculate the dependence in terms of mutual information between the stimuli and the neural activity [26,27].

To make this paradigm more concrete, suppose we are interested in the representation of the visual field in the primary visual cortex. In order to probe this representation, we simultaneously record the spiking activity from a neural population in the respective brain area when exposed to a visual stimulus whose position we vary in the course of the experiment. We can sample the neural activity for every presentation of the stimulus, by measuring the neural activity, for example, in terms of the number of spikes. In order then to quantify the representation of the different stimulus values in their entirety, we can collect the samples of neural activity that we observed, compute empirical histograms and perform an information theoretic analysis.

The central objects of interest in information theory are random variables and their distributions. At its core one defines a measure of their uncertainty, the entropy [28]. If X is a discrete random variable taking values in an at most countable set, its entropy $H\left(X\right)$ is defined as

$$H\left(X\right):=-\sum _{x}\mathbb{P}\left[X=x\right]\log \mathbb{P}\left[X=x\right],$$

where $\mathbb{P}\left[X=\,\cdot\,\right]$ denotes the probability distribution of X, and therefore $\mathbb{P}\left[X=x\right]$ the probability that X attains the value x. In this definition, we adopt the customary convention that $0\log 0:=0$. For the purpose of this paper, we leave the base of the logarithm unspecified; however, we note that in the context of information theory the base is generally chosen to be 2 so that entropy will be measured in units of bits.

If ${X}^{\prime}$ is a second random variable, the conditional entropy of X given ${X}^{\prime}$, $H\left(X|{X}^{\prime}\right)$, is defined as

$$H\left(X|{X}^{\prime}\right):=-\sum _{x,{x}^{\prime}}\mathbb{P}\left[X=x\wedge {X}^{\prime}={x}^{\prime}\right]\log \mathbb{P}\left[X=x\,|\,{X}^{\prime}={x}^{\prime}\right],$$

where $\mathbb{P}\left[X=\,\cdot\,\wedge {X}^{\prime}=\,\cdot\,\right]$ and $\mathbb{P}\left[X=\,\cdot\,|\,{X}^{\prime}=\,\cdot\,\right]$ denote the joint probability distribution of X and ${X}^{\prime}$ and the conditional probability distribution of X given ${X}^{\prime}$, respectively. The conditional entropy is a measure for the residual uncertainty in a random variable given observation of another random variable. Specifically, we have that $H\left(X|{X}^{\prime}\right)\le H\left(X\right)$ with equality if X and ${X}^{\prime}$ are independent.

An important quantity in information theory is the mutual information between two random variables, X and ${X}^{\prime}$, which is defined as

$$MI(X;{X}^{\prime})=H\left(X\right)-H\left(X|{X}^{\prime}\right).$$

Mutual information quantifies the amount of entropy of X that knowledge of ${X}^{\prime}$ annihilates or, in other words, the information that ${X}^{\prime}$ holds about X. Furthermore, as the mutual information is symmetric in its arguments it likewise also quantifies the information X holds about ${X}^{\prime}$. Mutual information is non-negative and vanishes if and only if X and ${X}^{\prime}$ are independent. Because of this, mutual information is frequently used as a measure of the independence of two random variables. The mutual information between X and ${X}^{\prime}$ can be written in terms of the relative entropy (Kullback–Leibler divergence) between their joint probability distributions and the product of the corresponding marginal distributions, which defines a premetric on the probability distribution.

Importantly, we note that, in contrast to what the notation suggests, both the entropy and the mutual information do not depend on the random variables and their values themselves, but are rather functionals of their distributions [29]. As a consequence of that, we refrain from specifying the random variables’ codomains in the following and if a random variable X attains the value x, x may simply be regarded as a label in an abstract alphabet.
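The definitions above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the joint distribution is assumed to be given as a dictionary mapping pairs of labels to probabilities, and base 2 is chosen for the logarithm although the text leaves the base unspecified.

```python
from math import log2

def entropy(dist):
    """Entropy in bits of a distribution given as {label: probability}."""
    # Terms with p == 0 are skipped, matching the convention 0 log 0 := 0.
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marginal(joint, axis):
    """Marginal of a joint distribution {(x, x_prime): probability}."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

def conditional_entropy(joint):
    """H(X | X') computed from the joint distribution of (X, X')."""
    p_xp = marginal(joint, 1)
    return -sum(p * log2(p / p_xp[xp]) for (x, xp), p in joint.items() if p > 0)

def mutual_information(joint):
    """MI(X; X') = H(X) - H(X | X')."""
    return entropy(marginal(joint, 0)) - conditional_entropy(joint)

# X is a fair coin and X' is an exact copy of X.
copy = {("a", "a"): 0.5, ("b", "b"): 0.5}
# X and X' are independent fair coins.
indep = {(x, xp): 0.25 for x in "ab" for xp in "ab"}
print(mutual_information(copy), mutual_information(indep))
```

The values "a" and "b" enter only as dictionary keys, which reflects the point that entropy and mutual information depend on the distributions alone and not on the values the random variables take.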

Building on the experimental paradigm we outlined above, we devise and analyse a simple computational model. We assume the stimulus to be modelled by a discrete, almost surely non-constant random variable S and the neural population’s activity by the discrete vector-valued random variable $X\equiv \bigotimes_{n=1}^{N}{X}^{\left(n\right)}$, where N is the size of the population. As the stimulus is determined by the experimental protocol, we assume perfect knowledge of its statistics. In addition, we assume that for every stimulus value s there exists a subset of the whole neural population ${U}_{\perp\!\!\!\perp s}\subseteq \left\{1,\dots ,N\right\}$ such that, for all stimulus values s and ${s}^{\prime}$, the overlap ${N}_{s,{s}^{\prime}}:=\left|{U}_{\perp\!\!\!\perp s}\cap {U}_{\perp\!\!\!\perp {s}^{\prime}}\right|=\omega \left(\ln N\right)$ as $N\to \infty $, i.e., ${N}_{s,{s}^{\prime}}$ asymptotically grows faster than the logarithm, and the components ${X}^{\left(n\right)}\,|\,S\in \{s,{s}^{\prime}\}$ for $n\in {U}_{\perp\!\!\!\perp s}\cap {U}_{\perp\!\!\!\perp {s}^{\prime}}$ are independent and identically distributed and almost surely non-constant (Figure 1). In an experimental setting, one might think of ${U}_{\perp\!\!\!\perp s}$ as the set of neurons that are not receptive to a stimulus value s and are therefore activated independently according to a common noise profile. Importantly, these sets differ in general from stimulus to stimulus. However, for every two stimulus values, they overlap on a sufficiently large common set. Moreover, in contrast to what one might initially think, these sets cannot be excluded since one can imagine a scenario where every neuron in the population is receptive to at least one of the stimulus values and is simultaneously an element of at least one independent set for other stimulus values.

Next, suppose that for every stimulus value s we are given ${K}_{s}$ independent samples from $\mathbb{P}\left[X=\,\cdot\,|\,S=s\right]$, which we denote as ${x}^{\left(s\right)}:={\left\{{x}_{k}^{\left(s\right)}\right\}}_{k=1,\dots ,{K}_{s}}$. Based on these samples the empirical estimate for $\mathbb{P}\left[X=\,\cdot\,|\,S=\,\cdot\,\right]$, $\widehat{\mathbb{P}}\left[X=\,\cdot\,|\,S=\,\cdot\,\right]$, is

$$\widehat{\mathbb{P}}\left[X=x\,|\,S=s\right]=\frac{1}{{K}_{s}}\sum _{k=1}^{{K}_{s}}{\delta}_{x,{x}_{k}^{\left(s\right)}}$$

for any sample x and stimulus value s. The Kronecker delta here attains the value 1 whenever its arguments coincide, and otherwise vanishes. Again, in the experimental setting, one might think of ${x}^{\left(s\right)}$ as the samples of neural activity that were recorded in response to a stimulus value s.

Consistent with the intuition of Rhee et al. [7], with high probability, the samples for every stimulus value are mutually different, as we are considering larger and larger neural populations and, moreover, these samples even become uniquely associated with one of the stimulus values, i.e., ${x}^{\left(s\right)}\cap {x}^{\left({s}^{\prime}\right)}=\emptyset$ for $s\ne {s}^{\prime}$. This simplifies the empirical estimate for $\mathbb{P}\left[X=\,\cdot\,|\,S=\,\cdot\,\right]$ from above so that, for a sample ${x}_{k}^{\left(s\right)}$ and a stimulus value ${s}^{\prime}$, $\widehat{\mathbb{P}}\left[X={x}_{k}^{\left(s\right)}\,|\,S={s}^{\prime}\right]=\frac{1}{{K}_{{s}^{\prime}}}{\delta}_{s,{s}^{\prime}}$, using that every sample is uniquely associated with a stimulus value and occurred only once for that stimulus value. As we will show, the probability for this event is at least $1-\mathcal{O}\left({N}^{-\infty}\right)$ as $N\to \infty $. Note that for a sequence ${\left({z}_{N}\right)}_{N\in \mathbb{N}}$ we say ${z}_{N}=\mathcal{O}\left({N}^{-\infty}\right)$ if ${z}_{N}=\mathcal{O}\left({N}^{-m}\right)$ for every $m\ge 0$, i.e., ${\lim}_{N\to \infty}{N}^{m}{z}_{N}=0$. This not only implies that in the limit of an infinitely large population, the probability is 1, but also makes a statement about the dependence on the size of the population. In fact, the probability approaches 1 eventually faster than any polynomial, so that it will be close to 1 even for moderately sized populations.

For any two stimuli there exists by assumption a subset of the components of X which is independent and identically distributed when conditioned on either of the two stimuli. As it suffices that the samples differ in these components for them to be mutually different, the probability for this event is a lower bound on the probability that the samples are mutually different. Therefore, the probability for all samples corresponding to stimuli s and ${s}^{\prime}$ to be mutually different is at least $1-\mathcal{O}\left({N}^{-\infty}\right)$, for N large. As we show in the next section, this is the probability that a finite number of independent samples from a random vector of length ${N}_{s,{s}^{\prime}}$ with ${N}_{s,{s}^{\prime}}=\omega \left(\ln N\right)$ are mutually different, provided the components are independent and identically distributed and almost surely non-constant, which is the case by assumption.

From this, a union bound over all pairs of stimulus values shows that the probability that this event occurs for all stimulus values is at least $1-{\sum}_{s,{s}^{\prime}}\left(1-\left(1-\mathcal{O}\left({N}^{-\infty}\right)\right)\right)=1-\mathcal{O}({N}^{-\infty})$.

In that event, when all the samples are mutually different, we compute the entropy and conditional entropy, $\widehat{H}(X)$ and $\widehat{H}(X|S)$. We use the law of total probability to express both in terms of the distribution of S and the empirical estimate of the distribution of X given S, the fact that the latter vanishes away from the samples, and that, as we have seen above, for a sample ${x}_{k}^{\left(s\right)}$ and a stimulus value ${s}^{\prime}$, it takes the form $\widehat{\mathbb{P}}\left[X={x}_{k}^{\left(s\right)}\,|\,S={s}^{\prime}\right]=\frac{1}{{K}_{{s}^{\prime}}}{\delta}_{s,{s}^{\prime}}$. This yields

$$\begin{array}{rl}\widehat{H}(X)& =-\sum _{x}\widehat{\mathbb{P}}\left[X=x\right]\log \widehat{\mathbb{P}}\left[X=x\right]\\ & =-\sum _{s}\mathbb{P}\left[S=s\right]\sum _{x}\widehat{\mathbb{P}}\left[X=x\,|\,S=s\right]\log \sum _{{s}^{\prime}}\mathbb{P}\left[S={s}^{\prime}\right]\widehat{\mathbb{P}}\left[X=x\,|\,S={s}^{\prime}\right]\\ & =-\sum _{s}\mathbb{P}\left[S=s\right]\sum _{k=1}^{{K}_{s}}\widehat{\mathbb{P}}\left[X={x}_{k}^{\left(s\right)}\,|\,S=s\right]\log \sum _{{s}^{\prime}}\mathbb{P}\left[S={s}^{\prime}\right]\widehat{\mathbb{P}}\left[X={x}_{k}^{\left(s\right)}\,|\,S={s}^{\prime}\right]\\ & =-\sum _{s}\mathbb{P}\left[S=s\right]\log \left(\mathbb{P}\left[S=s\right]\frac{1}{{K}_{s}}\right)\\ & =H\left(S\right)+\mathbb{E}\left[\log {K}_{S}\right]\end{array}$$

and

$$\begin{array}{rl}\widehat{H}(X|S)& =-\sum _{s,x}\widehat{\mathbb{P}}\left[X=x\wedge S=s\right]\log \widehat{\mathbb{P}}\left[X=x\,|\,S=s\right]\\ & =-\sum _{s}\mathbb{P}\left[S=s\right]\sum _{x}\widehat{\mathbb{P}}\left[X=x\,|\,S=s\right]\log \widehat{\mathbb{P}}\left[X=x\,|\,S=s\right]\\ & =-\sum _{s}\mathbb{P}\left[S=s\right]\sum _{k=1}^{{K}_{s}}\widehat{\mathbb{P}}\left[X={x}_{k}^{\left(s\right)}\,|\,S=s\right]\log \widehat{\mathbb{P}}\left[X={x}_{k}^{\left(s\right)}\,|\,S=s\right]\\ & =-\sum _{s}\mathbb{P}\left[S=s\right]\log \frac{1}{{K}_{s}}\\ & =\mathbb{E}\left[\log {K}_{S}\right].\end{array}$$

Thus, $\widehat{MI}(X;S)=\widehat{H}(X)-\widehat{H}(X|S)=H(S)$, and in addition we also get from the above that $\widehat{H}(X,S)=\widehat{H}(X)$ and that $\widehat{H}(S|X)=0$. As for mutual information the classical bound $MI\left(X;S\right)\le min(H\left(X\right),H\left(S\right))$ holds, mutual information is in fact estimated to be maximal.

Importantly, in this analysis we did not make any assumptions about the precise dependence between S and X apart from the existence of a sufficiently large subpopulation ${U}_{\perp\!\!\!\perp s}$ for every stimulus value s. Therefore, the conclusion holds also if S and X were, in fact, independent and mutual information should have vanished.
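The effect can be reproduced in a small simulation. The following is a minimal sketch under assumptions that are not part of the model above: four equiprobable stimuli, 20 samples per stimulus, and responses consisting of fair coin flips that ignore the stimulus entirely, so that the true mutual information is 0. The function names `plugin_mi` and `simulate` are hypothetical, and base 2 is used for the logarithm.

```python
import random
from collections import Counter, defaultdict
from math import log2

def plugin_mi(samples, p_s):
    """Direct (histogram-based) estimate of MI(X; S) in bits.

    samples: {s: list of hashable responses}; p_s: {s: known P[S = s]}.
    """
    # Empirical conditional distributions, one histogram per stimulus value.
    cond = {s: {x: c / len(xs) for x, c in Counter(xs).items()}
            for s, xs in samples.items()}
    # Law of total probability for the estimated marginal of X.
    marg = defaultdict(float)
    for s, dist in cond.items():
        for x, p in dist.items():
            marg[x] += p_s[s] * p
    h_x = -sum(p * log2(p) for p in marg.values())
    h_x_given_s = -sum(p_s[s] * p * log2(p)
                       for s, dist in cond.items() for p in dist.values())
    return h_x - h_x_given_s

def simulate(n_neurons, n_stimuli=4, k=20, seed=0):
    """Plug-in MI for responses that are independent of the stimulus."""
    rng = random.Random(seed)
    p_s = {s: 1 / n_stimuli for s in range(n_stimuli)}
    samples = {s: [tuple(rng.getrandbits(1) for _ in range(n_neurons))
                   for _ in range(k)] for s in p_s}
    return plugin_mi(samples, p_s)

for n in (2, 8, 64):
    print(n, round(simulate(n), 3))
```

For a population of 64 such units, the 80 samples are distinct with overwhelming probability, and the estimate then coincides with the stimulus entropy of 2 bits even though the responses carry no information about the stimulus.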

In the previous section, we were interested in the probability that a finite set of independent and identically distributed (discrete) random variables yields mutually different realisations. This probability is trivially 0 by the pigeonhole principle if there are more random variables in any such set than the number of values these random variables attain with non-vanishing probability. However, once the number of values that are attained exceeds the number of random variables this is not the case anymore, and for any arbitrary distribution of the random variable it is not obvious how to obtain the exact probabilities.

We first take a step back and consider generally series of the form

$$\sum _{\genfrac{}{}{0.0pt}{}{{n}_{1},\dots {n}_{K}\in \mathbb{N}}{{n}_{1}\ne {n}_{2}\ne \dots {n}_{K}}}{a}_{{n}_{1}}\cdots {a}_{{n}_{K}}$$

for sequences ${\left({a}_{n}\right)}_{n\in \mathbb{N}}$ such that the corresponding series ${\sum}_{n\in \mathbb{N}}{a}_{n}$ is absolutely convergent. This specifically guarantees that the series in question will also converge. We will show how to evaluate such a series and from a closed-form expression derive the probabilities for any finite set of random variables to yield mutually different realisations. From this it will follow, in particular, that large random vectors yield mutually different values with high probability.

Let ${\left({a}_{n}\right)}_{n\in \mathbb{N}}$ be a sequence with ${a}_{n}\in \mathbb{R}$ for every $n\in \mathbb{N}$ such that ${\sum}_{n\in \mathbb{N}}{a}_{n}$ is absolutely convergent and define ${Q}_{m}:={\sum}_{n\in \mathbb{N}}{a}_{n}^{m}$ for $m\in \mathbb{N}$. Then, for any $K\in \mathbb{N}$ we have that

$$\sum _{\genfrac{}{}{0.0pt}{}{{n}_{1},\dots {n}_{K}\in \mathbb{N}}{{n}_{1}\ne {n}_{2}\ne \dots {n}_{K}}}\prod _{k=1}^{K}{a}_{{n}_{k}}=\sum _{\alpha \in {\mathbb{N}}_{0}^{K}:{\langle \alpha \rangle}_{K}=K}{\left(-1\right)}^{K-\left|\alpha \right|}\frac{K!}{\alpha !}\prod _{m=1}^{K}{\left(\frac{{Q}_{m}}{m}\right)}^{{\alpha}_{m}}.$$

Here and in the following we use the conventional notation for multi-indices $\alpha \in {\mathbb{N}}_{0}^{K}$, so that, e.g., $\left|\alpha \right|={\sum}_{k=1}^{K}{\alpha}_{k}$ and $\alpha !={\prod}_{k=1}^{K}{\alpha}_{k}!$. In addition, we set ${\langle \alpha \rangle}_{k}={\sum}_{m=1}^{k}{\alpha}_{m}m$ for $1\le k\le K$.

We will prove this assertion in two steps. We will first derive a recursion relation for the series in question and then in a second step conclude with an induction argument.

We consider a family of operators ${\mathrm{\Gamma}}_{1},\dots {\mathrm{\Gamma}}_{K}$, and for $1\le k\le K$, formally define the corresponding operator ${\mathrm{\Gamma}}_{k}$ on monomials of ${a}_{{n}_{k}}$ via

$${\mathrm{\Gamma}}_{k}:{a}_{{n}_{k}}^{m}\mapsto \sum _{{n}_{k}\in \mathbb{N}\backslash \{{n}_{1},\dots {n}_{k-1}\}}{a}_{{n}_{k}}{a}_{{n}_{k}}^{m}={Q}_{m+1}-\sum _{r=1}^{k-1}{a}_{{n}_{r}}^{m+1}$$

and extend it linearly to $\mathrm{span}\left\{1,{a}_{{n}_{k}},{a}_{{n}_{k}}^{2},\dots \right\}$. Using these operators we find that

$${A}_{K}:=\sum _{\genfrac{}{}{0.0pt}{}{{n}_{1},\dots {n}_{K}\in \mathbb{N}}{{n}_{1}\ne {n}_{2}\ne \dots {n}_{K}}}\prod _{k=1}^{K}{a}_{{n}_{k}}=\sum _{{n}_{1}\in \mathbb{N}}{a}_{{n}_{1}}\sum _{{n}_{2}\in \mathbb{N}\backslash \left\{{n}_{1}\right\}}{a}_{{n}_{2}}\cdots \sum _{{n}_{K}\in \mathbb{N}\backslash \{{n}_{1},\dots {n}_{K-1}\}}{a}_{{n}_{K}}=\left({\mathrm{\Gamma}}_{1}\circ {\mathrm{\Gamma}}_{2}\circ \cdots {\mathrm{\Gamma}}_{K}\right)(1).$$

In particular, we have for every $1\le {k}^{\prime}\le k\le K$ and $m\in {\mathbb{N}}_{0}$ that $\left({\mathrm{\Gamma}}_{1}\circ {\mathrm{\Gamma}}_{2}\circ \cdots {\mathrm{\Gamma}}_{k}\right)\left({a}_{{n}_{{k}^{\prime}}}^{m}\right)={\sum}_{\genfrac{}{}{0.0pt}{}{{n}_{1},\dots {n}_{k}\in \mathbb{N}}{{n}_{1}\ne {n}_{2}\ne \dots {n}_{k}}}{a}_{{n}_{1}}\cdots {a}_{{n}_{{k}^{\prime}}}^{m+1}\cdots {a}_{{n}_{k}}={\sum}_{\genfrac{}{}{0.0pt}{}{{n}_{1},\dots {n}_{k}\in \mathbb{N}}{{n}_{1}\ne {n}_{2}\ne \dots {n}_{k}}}{a}_{{n}_{1}}\cdots {a}_{{n}_{k}}^{m+1}=\left({\mathrm{\Gamma}}_{1}\circ {\mathrm{\Gamma}}_{2}\circ \cdots {\mathrm{\Gamma}}_{k}\right)\left({a}_{{n}_{k}}^{m}\right)$, where we have relabelled the indices and the interchangeability of the sums is guaranteed by the absolute convergence of every individual series. Hence, the value does not depend on which of the indices carries the additional power.

Now, using the linearity of the operators ${\mathrm{\Gamma}}_{1},\dots {\mathrm{\Gamma}}_{K}$ we have that

$$\begin{array}{cc}\hfill \left({\mathrm{\Gamma}}_{1}\circ {\mathrm{\Gamma}}_{2}\circ \cdots {\mathrm{\Gamma}}_{K}\right)\left({a}_{{n}_{K}}^{m}\right)& =\left({\mathrm{\Gamma}}_{1}\circ {\mathrm{\Gamma}}_{2}\circ \cdots {\mathrm{\Gamma}}_{K-1}\right)\left({Q}_{m+1}-\sum _{r=1}^{K-1}{a}_{{n}_{r}}^{m+1}\right)\hfill \\ \hfill \phantom{\rule{1.em}{0ex}}& ={Q}_{m+1}\left({\mathrm{\Gamma}}_{1}\circ {\mathrm{\Gamma}}_{2}\circ \cdots {\mathrm{\Gamma}}_{K-1}\right)\left(1\right)-\sum _{r=1}^{K-1}\left({\mathrm{\Gamma}}_{1}\circ {\mathrm{\Gamma}}_{2}\circ \cdots {\mathrm{\Gamma}}_{K-1}\right)\left({a}_{{n}_{r}}^{m+1}\right)\hfill \\ \hfill \phantom{\rule{1.em}{0ex}}& ={Q}_{m+1}{A}_{K-1}-\left(K-1\right)\left({\mathrm{\Gamma}}_{1}\circ {\mathrm{\Gamma}}_{2}\circ \cdots {\mathrm{\Gamma}}_{K-1}\right)\left({a}_{{n}_{K-1}}^{m+1}\right)\hfill \\ \hfill \phantom{\rule{1.em}{0ex}}& =\sum _{k=1}^{K}{\left(-1\right)}^{k+1}\frac{\left(K-1\right)!}{\left(K-k\right)!}{Q}_{m+k}{A}_{K-k}.\hfill \end{array}$$

In the last step we explicitly resolved the recursion, which can be verified by a simple inductive argument. In particular, for $m=0$ we obtain an expression for ${A}_{K}$ so that we conclude after reordering the sum and rearranging the terms that

$$\frac{{A}_{K}}{{\left(-1\right)}^{K}K!}=-\frac{1}{K}\sum _{k=0}^{K-1}{Q}_{K-k}\frac{{A}_{k}}{{\left(-1\right)}^{k}k!}.$$

As ${A}_{K}$ is precisely the sum we intend to compute, the last expression yields a recursion relation for it with initial datum ${A}_{0}=1$. In particular, we note that this equation is reminiscent of a discrete Volterra equation of convolution type. Thus, this concludes the first step of the argument and in the second step we will show that the expression stated in the beginning actually solves this recursion relation.
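As a sanity check, the recursion can be evaluated numerically and compared with a direct summation. The sketch below assumes a finite sequence (equivalently, a sequence that is zero beyond its last listed entry) and uses exact rational arithmetic; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial, prod

def power_sum(a, m):
    """Q_m = sum_n a_n^m for a finite sequence a."""
    return sum(Fraction(x) ** m for x in a)

def distinct_sum_recursive(a, K):
    """A_K from the recursion for B_k = A_k / ((-1)^k k!), with A_0 = 1."""
    B = [Fraction(1)]
    for n in range(1, K + 1):
        B.append(-Fraction(1, n)
                 * sum(power_sum(a, n - k) * B[k] for k in range(n)))
    return (-1) ** K * factorial(K) * B[K]

def distinct_sum_brute_force(a, K):
    """A_K by summing over ordered K-tuples of pairwise distinct indices."""
    return sum(prod(Fraction(a[i]) for i in idx)
               for idx in permutations(range(len(a)), K))

a = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
for K in range(4):
    print(K, distinct_sum_recursive(a, K), distinct_sum_brute_force(a, K))
```

The brute-force sum has a cost that grows like the number of ordered K-tuples, whereas the recursion only needs the power sums $Q_1,\dots,Q_K$, which is what makes it usable for large state spaces.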

For the induction argument we assume now that the assertion holds for $1\le k\le K-1$ in order to perform the step $K-1\to K$ and we use the recursion relation derived above to relate ${A}_{K}$ to ${A}_{0},{A}_{1},\dots {A}_{K-1}$. In addition, we denote with ${e}_{k}$ the multi-index that is 0 in every component except the kth one, where it is 1.

$$\begin{array}{cc}\hfill \frac{{A}_{K}}{{\left(-1\right)}^{K}K!}& =-\frac{1}{K}\sum _{k=0}^{K-1}{Q}_{K-k}\frac{1}{{\left(-1\right)}^{k}k!}\left\{\sum _{\alpha \in {\mathbb{N}}_{0}^{k}:{\langle \alpha \rangle}_{k}=k}{\left(-1\right)}^{k-\left|\alpha \right|}\frac{k!}{\alpha !}\prod _{m=1}^{k}{\left(\frac{{Q}_{m}}{m}\right)}^{{\alpha}_{m}}\right\}\hfill \\ \hfill \phantom{\rule{1.em}{0ex}}& =-\frac{1}{K}\sum _{k=0}^{K-1}\sum _{\alpha \in {\mathbb{N}}_{0}^{K}:{\langle \alpha \rangle}_{K}=k}\frac{{\left(-1\right)}^{-\left|\alpha \right|}}{\alpha !}{Q}_{K-k}\prod _{m=1}^{K}{\left(\frac{{Q}_{m}}{m}\right)}^{{\alpha}_{m}}\hfill \\ \hfill \phantom{\rule{1.em}{0ex}}& =-\frac{1}{K}\sum _{k=1}^{K}\sum _{\alpha \in {\mathbb{N}}_{0}^{K}:{\langle \alpha \rangle}_{K}=K-k}\frac{{\left(-1\right)}^{-\left|\alpha \right|}}{\alpha !}{Q}_{k}\prod _{m=1}^{K}{\left(\frac{{Q}_{m}}{m}\right)}^{{\alpha}_{m}}\hfill \\ \hfill \phantom{\rule{1.em}{0ex}}& =\sum _{k=1}^{K}\sum _{\alpha \in {\mathbb{N}}_{0}^{K}:{\langle \alpha \rangle}_{K}=K-k}\frac{\left({\alpha}_{k}+1\right)k}{K}\frac{{\left(-1\right)}^{-|\alpha +{e}_{k}|}}{\left(\alpha +{e}_{k}\right)!}\prod _{m=1}^{K}{\left(\frac{{Q}_{m}}{m}\right)}^{{\left(\alpha +{e}_{k}\right)}_{m}}\hfill \\ \hfill \phantom{\rule{1.em}{0ex}}& =\sum _{\alpha \in {\mathbb{N}}_{0}^{K}:{\langle \alpha \rangle}_{K}=K}\frac{{\left(-1\right)}^{-\left|\alpha \right|}}{\alpha !}\prod _{m=1}^{K}{\left(\frac{{Q}_{m}}{m}\right)}^{{\alpha}_{m}}\hfill \end{array}$$

In the last step, we used that, as we will show, for any multi-index functional $\mathrm{\Phi}$,

$$\sum _{k=1}^{K}\sum _{\alpha \in {\mathbb{N}}_{0}^{K}:{\langle \alpha \rangle}_{K}=K-k}\frac{\left({\alpha}_{k}+1\right)k}{K}\mathrm{\Phi}\left(\alpha +{e}_{k}\right)=\sum _{\alpha \in {\mathbb{N}}_{0}^{K}:{\langle \alpha \rangle}_{K}=K}\mathrm{\Phi}\left(\alpha \right).$$

Indeed, we first observe that the terms of $\mathrm{\Phi}$ that appear in both sums are identical. Therefore, we only have to carefully evaluate the coefficients.

$$\begin{array}{rl}& \sum _{k=1}^{K}\sum _{\alpha \in {\mathbb{N}}_{0}^{K}:{\langle \alpha \rangle}_{K}=K-k}\frac{\left({\alpha}_{k}+1\right)k}{K}\mathrm{\Phi}\left(\alpha +{e}_{k}\right)\\ & =\sum _{\alpha \in {\mathbb{N}}_{0}^{K}:{\langle \alpha \rangle}_{K}=K}\sum _{k=1}^{K}\sum _{{\alpha}^{\prime}\in {\mathbb{N}}_{0}^{K}:{\langle {\alpha}^{\prime}\rangle}_{K}=K-k}\frac{\left({\alpha}_{k}^{\prime}+1\right)k}{K}\mathrm{\Phi}\left({\alpha}^{\prime}+{e}_{k}\right){\delta}_{\alpha ,{\alpha}^{\prime}+{e}_{k}}\\ & =\sum _{\alpha \in {\mathbb{N}}_{0}^{K}:{\langle \alpha \rangle}_{K}=K}\mathrm{\Phi}\left(\alpha \right)\sum _{k=1}^{K}\frac{{\alpha}_{k}k}{K}\sum _{{\alpha}^{\prime}\in {\mathbb{N}}_{0}^{K}:{\langle {\alpha}^{\prime}\rangle}_{K}=K-k}{\delta}_{\alpha ,{\alpha}^{\prime}+{e}_{k}}\\ & =\sum _{\alpha \in {\mathbb{N}}_{0}^{K}:{\langle \alpha \rangle}_{K}=K}\mathrm{\Phi}\left(\alpha \right)\sum _{k=1}^{K}\frac{{\alpha}_{k}k}{K}\\ & =\sum _{\alpha \in {\mathbb{N}}_{0}^{K}:{\langle \alpha \rangle}_{K}=K}\mathrm{\Phi}\left(\alpha \right)\end{array}$$

Here, the innermost sum over ${\alpha}^{\prime}$ equals 1 whenever ${\alpha}_{k}\ge 1$, while terms with ${\alpha}_{k}=0$ vanish because of the prefactor, and the final step uses ${\sum}_{k=1}^{K}{\alpha}_{k}k={\langle \alpha \rangle}_{K}=K$.

This concludes the argument, so that as claimed
$$\sum _{\genfrac{}{}{0.0pt}{}{{n}_{1},\dots {n}_{K}\in \mathbb{N}}{{n}_{1}\ne {n}_{2}\ne \dots {n}_{K}}}\prod _{k=1}^{K}{a}_{{n}_{k}}=\sum _{\alpha \in {\mathbb{N}}_{0}^{K}:{\langle \alpha \rangle}_{K}=K}{\left(-1\right)}^{K-\left|\alpha \right|}\frac{K!}{\alpha !}\prod _{m=1}^{K}{\left(\frac{{Q}_{m}}{m}\right)}^{{\alpha}_{m}}.$$
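To make the structure of the right-hand side more tangible, the following sketch enumerates the admissible multi-indices for small K and lists the coefficient in front of each product of power sums. The helper names are hypothetical, and the dice example is an added illustration using a uniform distribution on six values, for which $Q_m = 6^{1-m}$.

```python
from fractions import Fraction
from math import factorial, prod

def multi_indices(K):
    """All alpha in N_0^K with sum_m m * alpha_m == K."""
    def rec(m, remaining):
        if m > K:
            if remaining == 0:
                yield ()
            return
        for count in range(remaining // m + 1):
            for tail in rec(m + 1, remaining - count * m):
                yield (count,) + tail
    return list(rec(1, K))

def expansion(K):
    """Coefficient of prod_m Q_m^alpha_m for every admissible alpha."""
    coeffs = {}
    for alpha in multi_indices(K):
        # alpha! * prod_m m^alpha_m collects the two denominators of the formula.
        denom = prod(factorial(c) * m ** c for m, c in enumerate(alpha, start=1))
        coeffs[alpha] = (-1) ** (K - sum(alpha)) * Fraction(factorial(K), denom)
    return coeffs

def evaluate(K, Q):
    """Closed-form value of the sum, with Q a function m -> Q_m."""
    return sum(c * prod(Q(m) ** e for m, e in enumerate(alpha, start=1))
               for alpha, c in expansion(K).items())

print(expansion(3))
# Chance that three rolls of a fair die are all different.
print(evaluate(3, lambda m: Fraction(1, 6) ** (m - 1)))
```

For K = 3 the enumeration reproduces the expansion $Q_1^3 - 3Q_1Q_2 + 2Q_3$. The admissible multi-indices correspond to the integer partitions of K, so the number of terms grows with the partition function rather than with the size of the state space.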

In order to apply this result now in a probabilistic context to answer the question about the probability that any finite set of (discrete) random variables yields mutually different realisations, consider a discrete random variable X that takes values in the set $\left\{{x}_{1},{x}_{2},\dots \right\}$, which we assume without loss of generality to be countably infinite. Then, applying the above result to the sequence ${\left(\mathbb{P}\left[X={x}_{n}\right]\right)}_{n\in \mathbb{N}}$ we can compute the probability that K independent copies of this random variable, ${X}_{1},\dots {X}_{K}$, all yield mutually different realisations. Specifically, we have that

$$\begin{array}{cc}\hfill \mathbb{P}\left[{X}_{1}\ne {X}_{2}\ne \cdots {X}_{K}\right]=\mathbb{E}\left[{\mathbf{1}}_{{X}_{1}\ne {X}_{2}\ne \cdots {X}_{K}}\right]& =\sum _{{x}_{1}\ne {x}_{2}\ne \cdots {x}_{K}}\prod _{k=1}^{K}\mathbb{P}\left[X={x}_{k}\right]\hfill \\ \hfill \phantom{\rule{1.em}{0ex}}& =\sum _{\alpha \in {\mathbb{N}}_{0}^{K}:{\langle \alpha \rangle}_{K}=K}{\left(-1\right)}^{K-\left|\alpha \right|}\frac{K!}{\alpha !}\prod _{m=1}^{K}{\left(\frac{{Q}_{m}}{m}\right)}^{{\alpha}_{m}}\hfill \end{array}$$

with ${Q}_{m}={\sum}_{n\in \mathbb{N}}\mathbb{P}{\left[X={x}_{n}\right]}^{m}$.

Now, consider random vectors of increasing length, $X\equiv \bigotimes_{n=1}^{f\left(N\right)}{X}^{\left(n\right)}$, with independent components and assume that there exists $\epsilon >0$ such that ${Q}_{2}^{\left(n\right)}={\sum}_{x}\mathbb{P}{[{X}^{\left(n\right)}=x]}^{2}\le 1-\epsilon $ for every $n\in \mathbb{N}$. For the latter a necessary condition is that ${X}^{\left(n\right)}$ is almost surely non-constant and in particular it is satisfied if the components are identically distributed and almost surely non-constant. Then, if $f\left(N\right)=\omega \left(\ln N\right)$ is increasing, we will show that in the limit $N\to \infty $

$$\mathbb{P}\left[{X}_{1}\ne {X}_{2}\ne \cdots {X}_{K}\right]=1-\mathcal{O}\left({N}^{-\infty}\right),$$

where we recall that for some sequence ${\left({z}_{N}\right)}_{N\in \mathbb{N}}$ we have that ${z}_{N}=\mathcal{O}\left({N}^{-\infty}\right)$ if ${\lim}_{N\to \infty}{N}^{m}{z}_{N}=0$ for every $m\ge 0$.

Indeed, $0<Q_m=\prod_{n=1}^{f\left(N\right)}Q_m^{\left(n\right)}\le\left(\sup_{n\in\mathbb{N}}Q_m^{\left(n\right)}\right)^{f\left(N\right)}\le\left(\sup_{n\in\mathbb{N}}Q_2^{\left(n\right)}\right)^{f\left(N\right)}\le\left(1-\epsilon\right)^{f\left(N\right)}$ for every $m\ge 2$, where we used that $Q_m^{\left(n\right)}\le Q_2^{\left(n\right)}$ for $m\ge 2$ because all probabilities are at most one. Since $f\left(N\right)/\ln N\to\infty$, we conclude that $Q_m\le\mathrm{e}^{-\left|\ln\left(1-\epsilon\right)\right|f\left(N\right)}=\mathcal{O}\left(N^{-\infty}\right)$. This implies the claim since $\mathbb{P}\left[X_1\ne X_2\ne \cdots X_K\right]$ is a polynomial in terms of $Q_1,\dots Q_K$ with the only asymptotically non-vanishing term being $Q_1^K=1$.
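To get a feeling for the orders of magnitude involved, the following Python sketch evaluates a simple upper bound on the probability that at least two of $K$ samples coincide. It uses the elementary union bound $\binom{K}{2}Q_2$ over all pairs of samples rather than the exact polynomial above. The choice of i.i.d. Bernoulli components with $f(N)=N$, the function name and the parameter values are assumptions made purely for illustration.

```python
from math import comb


def collision_bound(num_neurons, num_samples, p=0.5):
    """Union bound on the probability that at least two of `num_samples`
    i.i.d. binary vectors of length `num_neurons` coincide.

    Each component is assumed to be Bernoulli(p), so the per-component
    power sum is Q_2 = p^2 + (1 - p)^2 and the vector-level Q_2 is its
    `num_neurons`-th power. Any given pair coincides with probability Q_2.
    """
    q2_single = p ** 2 + (1 - p) ** 2
    q2 = q2_single ** num_neurons
    return min(1.0, comb(num_samples, 2) * q2)


if __name__ == "__main__":
    # Hypothetical experiment sizes: 1000 recorded samples in total.
    for n in (10, 50, 100, 200):
        print(n, collision_bound(n, 1000))
```

For very small populations the bound is vacuous (it is clipped at one), whereas already for a few hundred components it falls far below any practically relevant probability.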

Besides the immediate application to random vectors that we required in the last section, we remark that the expression for the probability of random variables being mutually different can also be used to derive a closed-form expression for the Stirling numbers of the first kind. Briefly, let $X$ be a uniform random variable on $\left\{1,\dots L\right\}$ for some $L\in\mathbb{N}$, so that $Q_m=L^{1-m}$ in the above statement. The probability that $K\le L$ independent copies of this random variable are all mutually different is then given by the above expression. On the other hand, this probability can also be computed purely combinatorially to be $\frac{\left(L\right)_K}{L^K}$, where $\left(\,\cdot\,\right)_K$ denotes the falling factorial [30]. By definition, $\left(L\right)_K=\sum_{n=0}^{K}s\left(K,n\right)L^n$ with $s\left(K,n\right)$ the (signed) Stirling numbers of the first kind. Therefore, comparing the two expressions we find the Stirling numbers to be given as (cf. [31])

$$s\left(K,n\right)={\left(-1\right)}^{K-n}\left\{\begin{array}{cc}\sum _{\alpha \in {\mathbb{N}}_{0}^{K}:{\langle \alpha \rangle}_{K}=K}{\delta}_{\left|\alpha \right|,n}\frac{K!}{\alpha !{\prod}_{m=1}^{K}{m}^{{\alpha}_{m}}}\hfill & 1\le n\le \phantom{\rule{3.33333pt}{0ex}}K\hfill \\ 0\hfill & \mathrm{otherwise}\hfill \end{array}\right..$$
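The closed-form expression can be verified for small $K$ by comparing it against the defining relation $\left(L\right)_K=\sum_n s\left(K,n\right)L^n$. The Python sketch below again assumes $\langle\alpha\rangle_K=\sum_m m\,\alpha_m$, $|\alpha|=\sum_m\alpha_m$ and $\alpha!=\prod_m\alpha_m!$; the helper names are hypothetical.

```python
from math import factorial


def cycle_types(K):
    """Yield all alpha = (alpha_1, ..., alpha_K) with sum(m * alpha_m) == K."""
    def rec(m, remaining):
        if m > K:
            if remaining == 0:
                yield ()
            return
        for a in range(remaining // m + 1):
            for rest in rec(m + 1, remaining - m * a):
                yield (a,) + rest
    yield from rec(1, K)


def stirling_first_kind(K, n):
    """Signed Stirling number s(K, n) from the closed-form sum over alpha."""
    if not 1 <= n <= K:
        return 0
    total = 0
    for alpha in cycle_types(K):
        if sum(alpha) != n:  # Kronecker delta on |alpha| = n
            continue
        denom = 1
        for m, a in enumerate(alpha, start=1):
            denom *= factorial(a) * m ** a
        total += factorial(K) // denom  # exact: counts permutations of this cycle type
    return (-1) ** (K - n) * total


def falling_factorial(L, K):
    """(L)_K = L (L - 1) ... (L - K + 1)."""
    result = 1
    for j in range(K):
        result *= L - j
    return result


if __name__ == "__main__":
    K = 4
    print([stirling_first_kind(K, n) for n in range(1, K + 1)])
```

Each summand $K!/(\alpha!\prod_m m^{\alpha_m})$ is the number of permutations of $K$ elements with cycle type $\alpha$, which is why integer division is exact here.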

In this work, we have analysed a concrete, but general, model of sensory processing and demonstrated the extent of the sampling bias when directly estimating information theoretic quantities from experimental histograms. We have shown that as we consider larger and larger neural populations, with high probability any estimate of entropy or mutual information will only depend on the stimulus entropy.
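The effect can be reproduced in a few lines of simulation. The following Python sketch is not the model analysed in this work but a deliberately extreme toy case: the responses are binary vectors of independent fair coin flips that do not depend on the stimulus at all, so the true mutual information is zero. All names and parameter values are assumptions chosen for illustration.

```python
import random
from collections import Counter
from math import log2


def entropy(counts):
    """Plug-in entropy (in bits) of an empirical histogram of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts)


def plugin_mutual_information(stimuli, responses):
    """Direct estimate I = H(S) + H(R) - H(S, R) from empirical histograms."""
    h_s = entropy(Counter(stimuli).values())
    h_r = entropy(Counter(responses).values())
    h_sr = entropy(Counter(zip(stimuli, responses)).values())
    return h_s + h_r - h_sr


def simulate(num_neurons, num_stimuli=4, trials_per_stimulus=50, seed=0):
    """Draw stimulus-independent responses and return the plug-in estimate.

    Each response is an integer whose `num_neurons` bits play the role of
    independent Bernoulli(1/2) neurons (toy assumption).
    """
    rng = random.Random(seed)
    stimuli, responses = [], []
    for s in range(num_stimuli):
        for _ in range(trials_per_stimulus):
            stimuli.append(s)
            responses.append(rng.getrandbits(num_neurons))
    return plugin_mutual_information(stimuli, responses)


if __name__ == "__main__":
    for n in (2, 5, 10, 20, 100):
        print(n, simulate(n))
```

With four equiprobable stimuli the stimulus entropy is 2 bits. For small populations the estimate stays close to the true value of zero, whereas once all sampled responses are mutually distinct the empirical response entropy and joint entropy coincide and the estimate equals the stimulus entropy, even though the responses carry no information.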

We found that the issue lies in the fact that the samples of neural activity in response to the presentation of different stimulus values turn out to be mutually distinct. One way this can happen is through the existence of a subpopulation of neurons that only contributes independent noise to the population’s activity. Importantly, the composition of the subpopulation can be stimulus-dependent, so that it is impossible to exclude this subpopulation from the analysis from the outset. A plausible origin for these noisy subpopulations is sufficiently localised, compactly supported receptive fields. In a simplified scenario, imagine a neural population whose neurons are receptive to any one of three stimuli. For each of the stimuli, the neurons receptive to one of the other two stimuli constitute such a noisy subpopulation. As mentioned earlier, none of the three subpopulations can be excluded on the grounds that it contributes only noise across all stimuli, yet their activity lets samples recorded from the population appear more distinctive than they are. While for small neural populations the effect of noisy subpopulations can be counteracted by increasing the number of samples that one considers, this becomes infeasible for even moderately sized populations due to the combinatorially large repertoire of population activity. As we have shown, as we consider larger and larger populations, and therefore also larger noisy subpopulations, the possibility of sampling even one activity pattern twice is exponentially suppressed.

In reality, receptive fields are localised, yet at least in computational models their range often extends indefinitely. This is the case, for example, for Gaussian receptive fields: the activity of a neuron then depends strongly on some stimulus values, while for others the dependence is vanishingly small. While the assumptions of our model are then clearly not met, we argue that for all practical purposes the consequence is the same, although weakened. This is because one will only ever consider a finite number of samples. The smaller the dependence between the stimulus and the activity, the more weakly it manifests itself, in particular when only considering those finitely many samples, so that the corresponding neurons appear essentially independent.

In the model we have proposed, we assumed that for any two stimulus values there exist sufficiently large subpopulations such that their components contribute independent and identically distributed noise when exposed to either of the two stimulus values. Now, this is certainly a simplification and a more realistic assumption would be that in the absence of sensory stimulation the activity in these subpopulations is governed by low-dimensional dynamics in addition to some individual noise [32,33]. In principle though, our conclusions should hold true regardless, following the same general argument that we formalised in this work: in a large neural population, stochastic variability, even if it is only in a relatively small subpopulation, is sufficient to produce unique samples of neural activity in an experiment. In turn, this then manifests itself in a maximal bias when estimating mutual information and other information theoretic quantities.

Without any further assumptions on the statistical relation between the different neurons within the population, this work shows that, despite being a powerful framework, in general, information theoretic analyses become essentially intractable when considering larger neural populations, because of the difficulties of accurately estimating the joint probability distribution between activity and stimulus states from experimental histograms. Therefore, this is where modelling approaches come into play. These approaches frequently employ maximum entropy Ising models that were fit initially only to low-order [34] and later also higher-order statistics of the data [35,36,37]. Alternative approaches include the cascaded logistic model which utilises Dirichlet processes at its core [38] or more recently the population tracking model [24]. However, for an information theoretic analysis it is in addition important to be able to compute the involved quantities in an efficient way. While most models include the possibility to sample states, this is not an option for large populations again because of the issues presented in this work. In the context of maximum entropy Ising models, thermodynamic considerations turned out to be useful to efficiently compute quantities such as the entropy [37,39,40]. In the population tracking model, on the other hand, one derives a reduced, low-dimensional model, from which those quantities can be computed likewise [24].

Altogether, while for many statistics of neural activity it might be sufficient to consider experimental histograms, this is not the case for information theoretic quantities when considering large neural populations. Given the many insights that information theoretic analyses can bring [41,42,43], this motivates new approaches for studying experimental recordings from ever larger neural populations. Furthermore, it also inspires consideration of how the brain handles the informational constraints we have identified.

Conceptualization, J.M. and G.J.G.; Investigation, J.M.; Supervision, G.J.G.; Writing—original draft, J.M.; Writing—review & editing, J.M. and G.J.G. All authors have read and agreed to the published version of the manuscript.

This work was supported by an Australian Government Research Training Program (RTP) Scholarship awarded to Jan Mölter and the Australian Research Council Discovery Grant DP170102263 awarded to Geoffrey J. Goodhill.

We thank Marcus A. Triplett for very helpful discussions and comments on the manuscript.

The authors declare no conflict of interest.

- Pouget, A.; Dayan, P.; Zemel, R. Information processing with population codes. Nat. Rev. Neurosci. **2000**, 1, 125–132.
- Sakurai, Y. Population coding by cell assemblies—What it really is in the brain. Neurosci. Res. **1996**, 26, 1–16.
- Scanziani, M.; Häusser, M. Electrophysiology in the age of light. Nature **2009**, 461, 930–939.
- Jun, J.J.; Steinmetz, N.A.; Siegle, J.H.; Denman, D.J.; Bauza, M.; Barbarits, B.; Lee, A.K.; Anastassiou, C.A.; Andrei, A.; Aydın, Ç.; et al. Fully integrated silicon probes for high-density recording of neural activity. Nature **2017**, 551, 232–236.
- Quian Quiroga, R.; Panzeri, S. Extracting information from neuronal populations: Information theory and decoding approaches. Nat. Rev. Neurosci. **2009**, 10, 173–185.
- Kinney, J.B.; Atwal, G.S. Equitability, mutual information, and the maximal information coefficient. Proc. Natl. Acad. Sci. USA **2014**, 111, 3354–3359.
- Rhee, A.; Cheong, R.; Levchenko, A. The application of information theory to biochemical signaling systems. Phys. Biol. **2012**, 9, 045011.
- Dorval, A.D. Estimating Neuronal Information: Logarithmic Binning of Neuronal Inter-Spike Intervals. Entropy **2011**, 13, 485–501.
- Macke, J.H.; Murray, I.; Latham, P.E. How biased are maximum entropy models? Adv. Neural Inf. Process. Syst. **2011**, 24, 2034–2042.
- Panzeri, S.; Senatore, R.; Montemurro, M.A.; Petersen, R.S. Correcting for the Sampling Bias Problem in Spike Train Information Measures. J. Neurophysiol. **2007**, 98, 1064–1072.
- Treves, A.; Panzeri, S. The Upward Bias in Measures of Information Derived from Limited Data Samples. Neural Comput. **1995**, 7, 399–407.
- Panzeri, S.; Treves, A. Analytical estimates of limited sampling biases in different information measures. Network **1996**, 7, 87–107.
- Adibi, M.; McDonald, J.S.; Clifford, C.W.G.; Arabzadeh, E. Adaptation Improves Neural Coding Efficiency Despite Increasing Correlations in Variability. J. Neurosci. **2013**, 33, 2108–2120.
- Takaguchi, T.; Nakamura, M.; Sato, N.; Yano, K.; Masuda, N. Predictability of Conversation Partners. Phys. Rev. X **2011**, 1, 011008.
- Pachitariu, M.; Lyamzin, D.R.; Sahani, M.; Lesica, N.A. State-Dependent Population Coding in Primary Auditory Cortex. J. Neurosci. **2015**, 35, 2058–2073.
- Lopes-dos Santos, V.; Panzeri, S.; Kayser, C.; Diamond, M.E.; Quian Quiroga, R. Extracting information in spike time patterns with wavelets and information theory. J. Neurophysiol. **2015**, 113, 1015–1033.
- Montgomery, N.; Wehr, M. Auditory Cortical Neurons Convey Maximal Stimulus-Specific Information at Their Best Frequency. J. Neurosci. **2010**, 30, 13362–13366.
- Paninski, L. Estimation of Entropy and Mutual Information. Neural Comput. **2003**, 15, 1191–1253.
- Zhang, Z. Entropy Estimation in Turing’s Perspective. Neural Comput. **2012**, 24, 1368–1389.
- Yu, Y.; Crumiller, M.; Knight, B.; Kaplan, E. Estimating the amount of information carried by a neuronal population. Front. Comput. Neurosci. **2010**, 4, 10.
- Archer, E.W.; Park, I.M.; Pillow, J.W. Bayesian entropy estimation for binary spike train data using parametric prior knowledge. Adv. Neural Inf. Process. Syst. **2013**, 26, 1700–1708.
- Vinck, M.; Battaglia, F.P.; Balakirsky, V.B.; Vinck, A.J.H.; Pennartz, C.M.A. Estimation of the entropy based on its polynomial representation. Phys. Rev. E **2012**, 85.
- Xiong, W.; Faes, L.; Ivanov, P.C. Entropy measures, entropy estimators, and their performance in quantifying complex dynamics: Effects of artifacts, nonstationarity, and long-range correlations. Phys. Rev. E **2017**, 95, 062114.
- O’Donnell, C.; Gonçalves, J.T.; Whiteley, N.; Portera-Cailliau, C.; Sejnowski, T.J. The Population Tracking Model: A Simple, Scalable Statistical Model for Neural Population Data. Neural Comput. **2017**, 29, 50–93.
- Victor, J.D. Approaches to Information-Theoretic Analysis of Neural Activity. Biol. Theory **2006**, 1, 302–316.
- Timme, N.M.; Lapish, C. A Tutorial for Information Theory in Neuroscience. eNeuro **2018**, 5.
- Pregowska, A.; Szczepanski, J.; Wajnryb, E. Mutual information against correlations in binary communication channels. BMC Neurosci. **2015**, 16, 32.
- Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. **1948**, 27, 379–423.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2005.
- Stanley, R.P. Enumerative Combinatorics, 2nd ed.; Cambridge Studies in Advanced Mathematics; Cambridge University Press: Cambridge, UK, 2011.
- Malenfant, J. Finite, closed-form expressions for the partition function and for Euler, Bernoulli, and Stirling numbers. arXiv **2011**, arXiv:1103.1585.
- Stringer, C.; Pachitariu, M.; Steinmetz, N.; Reddy, C.B.; Carandini, M.; Harris, K.D. Spontaneous behaviors drive multidimensional, brainwide activity. Science **2019**, 364, eaav7893.
- Triplett, M.A.; Pujic, Z.; Sun, B.; Avitan, L.; Goodhill, G.J. Model-based decoupling of evoked and spontaneous neural activity in calcium imaging data. bioRxiv **2019**.
- Schneidman, E.; Berry, M.J., II; Segev, R.; Bialek, W. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature **2006**, 440, 1007–1012.
- Granot-Atedgi, E.; Tkačik, G.; Segev, R.; Schneidman, E. Stimulus-dependent Maximum Entropy Models of Neural Population Codes. PLoS Comput. Biol. **2013**, 9, e1002922.
- Tkačik, G.; Marre, O.; Mora, T.; Amodei, D.; Berry, M.J., II; Bialek, W. The simplest maximum entropy model for collective behavior in a neural network. J. Stat. Mech. **2013**, 2013, P03011.
- Tkačik, G.; Marre, O.; Amodei, D.; Schneidman, E.; Bialek, W.; Berry, M.J., II. Searching for Collective Behavior in a Large Network of Sensory Neurons. PLoS Comput. Biol. **2014**, 10, e1003408.
- Park, I.M.; Archer, E.W.; Latimer, K.; Pillow, J.W. Universal models for binary spike patterns using centered Dirichlet processes. Adv. Neural Inf. Process. Syst. **2013**, 26, 2463–2471.
- Tkačik, G.; Schneidman, E.; Berry, M.J., II; Bialek, W. Ising models for networks of real neurons. arXiv **2006**, arXiv:q-bio/0611072.
- Tkačik, G.; Schneidman, E.; Berry, M.J., II; Bialek, W. Spin glass models for a network of real neurons. arXiv **2009**, arXiv:0912.5409.
- Stevens, C.F.; Zador, A.M. Information through a Spiking Neuron. Adv. Neural Inf. Process. Syst. **1996**, 8, 75–81.
- Strong, S.P.; Koberle, R.; de Ruyter van Steveninck, R.R.; Bialek, W. Entropy and Information in Neural Spike Trains. Phys. Rev. Lett. **1998**, 80, 197–200.
- Borst, A.; Theunissen, F.E. Information theory and neural coding. Nat. Neurosci. **1999**, 2, 947–957.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).