### Definition of magnetic field and relaxation time

The magnetic field \(\mathbf H\) relates to the energy density \(\varepsilon\) as \(\mathbf H=-\partial \varepsilon /\partial (M \mathbf m)\), and therefore, is obtained from Eq. (1) as

$$\begin{aligned} \mathbf H=\begin{pmatrix} -4\pi M N_x m_x \\ -4\pi M N_y m_y \\ \left[ \left( 2K_1/M \right) -4\pi M N_z \right] m_z + \left( 4K_2/M \right) \left( 1-m_z^2 \right) m_z \end{pmatrix}. \end{aligned}$$

(9)

We should note that the magnetization dynamics described by the LLG equation are unchanged by adding a term proportional to \(\mathbf m\) to \(\mathbf H\) because the LLG equation conserves the magnitude of \(\mathbf m\). Adding such a term corresponds to shifting the origin of the energy density \(\varepsilon\) by a constant. In the present case, we added the term \(4\pi M N_x \mathbf m\) to \(\mathbf H\) and obtained Eq. (2); recall that \(N_x=N_y\) because we assume a cylinder-shaped MTJ. The added term shifts the origin of the energy density \(\varepsilon\) by the constant \(-2\pi M^2 N_x \mathbf m^2=-2\pi M^2 N_x\) and makes it depend on \(m_z\) only.
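As a minimal numerical sketch of Eq. (9) and of the invariance argument above (the parameter values here are illustrative assumptions, not those of the paper):

```python
import numpy as np

def demag_anisotropy_field(m, M, N, K1, K2):
    """Magnetic field H = -d(eps)/d(M m) of Eq. (9).

    m      : unit magnetization vector (mx, my, mz)
    M      : saturation magnetization
    N      : demagnetization coefficients (Nx, Ny, Nz)
    K1, K2 : first- and second-order anisotropy energy densities
    """
    mx, my, mz = m
    Nx, Ny, Nz = N
    return np.array([
        -4 * np.pi * M * Nx * mx,
        -4 * np.pi * M * Ny * my,
        (2 * K1 / M - 4 * np.pi * M * Nz) * mz
        + (4 * K2 / M) * (1 - mz**2) * mz,
    ])

# Adding a term 4*pi*M*Nx*m to H leaves the torque m x H, and hence the
# LLG dynamics, unchanged (illustrative parameter values):
m = np.array([0.3, 0.4, np.sqrt(1 - 0.25)])
M, N, K1, K2 = 1000.0, (0.1, 0.1, 0.8), 1e5, 5e4
H = demag_anisotropy_field(m, M, N, K1, K2)
H_shifted = H + 4 * np.pi * M * N[0] * m
assert np.allclose(np.cross(m, H), np.cross(m, H_shifted))
```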

The LLG equation in the present system can be integrated as

$$\begin{aligned} t=\frac{1+\alpha^2}{\alpha \gamma \left( H_{\mathrm{K}1}+H_{\mathrm{K}2}\right)} \left[ \log \left( \frac{\cos \theta_{\mathrm{f}}}{\cos \theta_{\mathrm{i}}}\right) -\frac{H_{\mathrm{K}1}+H_{\mathrm{K}2}}{H_{\mathrm{K}1}}\log \left( \frac{\sin \theta_{\mathrm{f}}}{\sin \theta_{\mathrm{i}}}\right) +\frac{H_{\mathrm{K}2}}{2H_{\mathrm{K}1}}\log \left( \frac{H_{\mathrm{K}1}+H_{\mathrm{K}2} \sin^2\theta_{\mathrm{f}}}{H_{\mathrm{K}1}+H_{\mathrm{K}2} \sin^2\theta_{\mathrm{i}}}\right) \right], \end{aligned}$$

(10)

where \(\theta_{\mathrm{i}}\) and \(\theta_{\mathrm{f}}\) are the initial and final values of \(\theta =\cos^{-1}m_z\). Equation (10) provides the relaxation time from \(\theta =\theta_{\mathrm{i}}\) to \(\theta =\theta_{\mathrm{f}}\). Note that the relaxation time is scaled by \(\alpha \gamma H_{\mathrm{K}1}/(1+\alpha^2)\) and \(H_{\mathrm{K}2}/H_{\mathrm{K}1}\), which can be manipulated by the voltage control of magnetic anisotropy. We also note that Eq. (10) has a logarithmic divergence due to the asymptotic behavior of the relaxation dynamics.
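A direct transcription of Eq. (10) can be sketched as follows; the damping constant, gyromagnetic ratio, and field values below are illustrative assumptions, not the paper's parameters:

```python
import numpy as np

def relaxation_time(theta_i, theta_f, HK1, HK2, alpha=0.005, gamma=1.764e7):
    """Relaxation time from theta_i to theta_f per Eq. (10).

    HK1, HK2 : first- and second-order anisotropy fields (Oe)
    alpha    : Gilbert damping constant (assumed value)
    gamma    : gyromagnetic ratio in rad/(Oe s) (assumed value)
    """
    pref = (1 + alpha**2) / (alpha * gamma * (HK1 + HK2))
    term1 = np.log(np.cos(theta_f) / np.cos(theta_i))
    term2 = -(HK1 + HK2) / HK1 * np.log(np.sin(theta_f) / np.sin(theta_i))
    term3 = (HK2 / (2 * HK1)) * np.log(
        (HK1 + HK2 * np.sin(theta_f)**2) / (HK1 + HK2 * np.sin(theta_i)**2))
    return pref * (term1 + term2 + term3)

# Relaxation toward theta = 0 for an easy-axis case (illustrative fields);
# note the logarithmic divergence if theta_i -> 0, pi/2, or pi is substituted.
t = relaxation_time(0.5, 0.1, HK1=700.0, HK2=500.0)
assert t > 0 and np.isfinite(t)
```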

### Role of spin-transfer torque

We neglected spin-transfer torque in the main text because the current magnitude in a typical MTJ used for the voltage control of magnetic anisotropy effect is usually small. For example, using typical values^{47,49} for the voltage (0.4 V), the resistance (60 k\(\Omega\)), and a cross section of \(\pi \times 60^2\) nm\(^2\), the current density is about 0.06 MA/cm\(^2\) (6.7 \(\mu\)A in terms of current). Such a value is sufficiently small compared with those used in spin-transfer torque switching experiments^{54}. To verify the argument, we performed numerical simulations in which spin-transfer torque, \(-H_{\mathrm{s}} \mathbf m \times (\mathbf p \times \mathbf m)\), is added to the right-hand side of Eq. (3). We fix the values \(H_{\mathrm{K}2}=500\) Oe and \(H_{\mathrm{K}1}=-0.1 H_{\mathrm{K}2}=-50\) Oe. The unit vector \(\mathbf p\) along the magnetization direction of the reference layer points in the positive *z* direction. The spin polarization *P* in the spin-transfer torque strength, \(H_{\mathrm{s}}=\hbar P j/(2eMd)\), is assumed to be 0.5. Figure 4a shows the time evolution of \(\mathbf m\) for a current density *j* of 0.06 MA/cm\(^2\). Although the magnetization moves slightly from the initial (stable) state due to spin-transfer torque, the change of the magnetization direction is small compared with that shown in Fig. 1b. Therefore, we do not consider that spin-transfer torque plays a major role in physical reservoir computing, although the current cannot be completely zero in experiments.
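The order-of-magnitude estimate above can be checked directly from the quoted voltage, resistance, and cross section:

```python
import math

# Order-of-magnitude check of the current density quoted in the text
V = 0.4             # applied voltage (V)
R = 60e3            # MTJ resistance (Ohm)
radius = 60e-9      # free-layer radius (m); cross section pi * (60 nm)^2
area = math.pi * radius**2

I = V / R                  # current (A) -> about 6.7 uA
j = I / area               # current density (A/m^2)
j_MA_per_cm2 = j / 1e10    # 1 MA/cm^2 = 1e10 A/m^2 -> about 0.06

print(f"I = {I * 1e6:.1f} uA, j = {j_MA_per_cm2:.2f} MA/cm^2")
```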

For comprehensiveness, however, we also show the magnetization dynamics when the current density *j* is increased by one order of magnitude. Figure 4b shows the dynamics for \(j=0.6\) MA/cm\(^2\), where magnetization switching by spin-transfer torque is observed. We note that this current density is still small compared with those used in typical MTJs in nonvolatile memory^{54}. Nevertheless, the switching occurs because of the small magnetic anisotropy field in the present system. We assume that \(H_{\mathrm{K}2}\) is finite and \(|H_{\mathrm{K}1}|<H_{\mathrm{K}2}\) to make a tilted state of the magnetization [\(m_z=\pm\sqrt{1-(|H_{\mathrm{K}1}|/H_{\mathrm{K}2})}\)] stable, for the following reason. Recall that there are other stable states, such as \(m_z=\pm 1\) for \(H_{\mathrm{K}1}>0\) and \(m_z=0\) for \(H_{\mathrm{K}1}<0\), when \(H_{\mathrm{K}2}=0\). Note that these states (\(m_z=\pm 1\) or \(m_z=0\)) are always local extrema of the energy landscape. Accordingly, once the magnetization saturates to one of these states, it cannot change its direction even if another input is injected. This conclusion can be understood in a different way: the relaxation time given by Eq. (10) diverges when \(\theta_{\mathrm{i}}=0\) (\(m_z=+1\)), \(\pi\) (\(m_z=-1\)), or \(\pi/2\) (\(m_z=0\)) is substituted. On the other hand, for a finite \(H_{\mathrm{K}2}\), the magnetization can move from the state \(m_z=\pm\sqrt{1-(|H_{\mathrm{K}1}|/H_{\mathrm{K}2})}\) when an input signal changes the value of \(H_{\mathrm{K}1}\) and makes the state no longer an extremum. We note that the assumption \(|H_{\mathrm{K}1}|<H_{\mathrm{K}2}\) restricts the magnitude of the magnetic field. In fact, the magnitude of \(\mathbf H\) is small due to the small value of \(H_{\mathrm{K}2}=500\) Oe found in experiments^{47,48} and the restriction \(|H_{\mathrm{K}1}|<H_{\mathrm{K}2}\). Since the critical current density destabilizing the magnetization by the spin-transfer effect is proportional to the magnitude of the magnetic field, a small \(\mathbf H\) implies that even the small current mentioned above might induce large-amplitude magnetization dynamics.
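A quick numerical check of the tilted equilibrium, assuming the effective field of the form \(H_z=[H_{\mathrm{K}1}+H_{\mathrm{K}2}(1-m_z^2)]m_z\) implied by Eq. (9) (the energy expression below is derived from that assumption, not quoted from the paper):

```python
import numpy as np

HK1, HK2 = -50.0, 500.0   # Oe, the values used in the text

def energy(mz):
    """Energy density per M (up to a constant) consistent with
    H_z = [HK1 + HK2 (1 - mz^2)] mz via H = -d(eps)/d(M m)."""
    return -(HK1 * mz**2 / 2 + HK2 * (mz**2 / 2 - mz**4 / 4))

# Numerical minimum of the energy landscape vs. the tilted-state formula
mz = np.linspace(-1.0, 1.0, 200001)
mz_min = mz[np.argmin(energy(mz))]
mz_tilted = np.sqrt(1 - abs(HK1) / HK2)   # sqrt(0.9), since |HK1| < HK2

assert abs(abs(mz_min) - mz_tilted) < 1e-3
```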

In summary, the magnitude of the current density is sufficiently small, and the magnetization dynamics are mainly driven by the voltage control of magnetic anisotropy effect. The condition to stabilize a tilted state, however, might make the magnitude of the magnetic field, as well as the critical current density of spin-transfer torque switching, small. Thus, even a small current may cause nonnegligible dynamics. At the same time, it is practically difficult to increase the current magnitude by one order of magnitude, and therefore, in the present study, we still consider the voltage control of magnetic anisotropy effect to be the main driving force of the magnetization dynamics.

### Evaluation method of memory capacity

The memory capacity corresponds to the number of past input data that can be reproduced from the output data, as mentioned in the main text. The evaluation of the memory capacity consists of two processes. During the first process, called training (or learning), weights are determined to reproduce the target data from the output data. In the second process, the reproducibility of the target data defined from other input data is evaluated.

Let us first describe the training process. We inject the random binary input \(b_k=0\) or 1 into the MTJ as a voltage pulse. The number of random inputs is *N*. The input \(b_k\) is converted to the first-order magnetic anisotropy field through the voltage control of magnetic anisotropy, which is described by Eq. (6). We choose \(m_z\) as the output data, which can be measured experimentally through the magnetoresistance effect. Figure 5a shows an example of the time evolution of \(m_z\) in the presence of several random binary inputs, where the values of the parameters are those at the maximum STM-capacity condition, i.e., the pulse width and the first-order magnetic anisotropy field are \(t_{\mathrm{p}}=69\) ns and \(H_{\mathrm{K}1}^{(1)}=-430\) Oe. As can be seen, the injection of the random input drives the dynamics of \(m_z\).

The dynamical response \(m_z(t)\) during the presence of the *k*th input \(b_k\) is divided into nodes, where the number of nodes is \(N_{\mathrm{node}}\). We denote the \(i(=1,2,\ldots,N_{\mathrm{node}})\)th output with respect to the *k*th input as \(u_{k,i}=m_z(t_0+(k-1)t_{\mathrm{p}}+i(t_{\mathrm{p}}/N_{\mathrm{node}}))\), where \(t_0\) is the washout time. The output \(u_{k,i}\) is regarded as the status of the *i*th neuron at a discrete time *k*. Figure 5b shows an example of the time evolution of \(m_z\) with respect to an input pulse, whereas the dots in the inset of the figure are the nodes \(u_{k,i}\) defined from \(m_z\). The method to define such virtual neurons is called the time-multiplexing method^{15,20,21}. We also introduce a bias term \(u_{k,N_{\mathrm{node}}+1}=1\). In the training process, we introduce the weight \(w_{D,i}\) and evaluate its value to minimize the error,

$$\begin{aligned} \sum_{k=1}^{N}\left( \sum_{i=1}^{N_{\mathrm{node}}+1}w_{D,i}u_{k,i} -y_{k,D}\right)^2, \end{aligned}$$

(11)

where \(y_{k,D}\) are the target data defined by Eqs. (4) and (5). For simplicity, we omit superscripts such as “STM” and “PC” in the target data because the difference in the evaluation methods of the STM and PC capacities lies merely in the definition of the target data. In the following, we add superscripts or subscripts, such as “STM” and “PC”, when it is necessary to distinguish quantities related to these capacities. A weight should be introduced for each set of target data. Accordingly, we denote the weight to evaluate the STM (PC) capacity as \(w_{D,i}^{\mathrm{STM(PC)}}\), when necessary. Also, we note that the weights are different for each delay *D*.
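The time-multiplexing construction of the nodes \(u_{k,i}\) described above can be sketched as follows; the trace, time step, and array sizes are synthetic stand-ins, not the paper's data:

```python
import numpy as np

def extract_nodes(mz_trace, dt, t0, tp, N_node, N_inputs):
    """Time-multiplexing: sample N_node virtual-neuron outputs per pulse.

    mz_trace : m_z sampled every dt seconds
    t0       : washout time, tp : pulse width
    Returns u[k, i] = m_z(t0 + (k-1)*tp + i*(tp/N_node)) plus a bias column.
    """
    u = np.empty((N_inputs, N_node + 1))
    for k in range(1, N_inputs + 1):
        for i in range(1, N_node + 1):
            t = t0 + (k - 1) * tp + i * tp / N_node
            u[k - 1, i - 1] = mz_trace[int(round(t / dt))]
    u[:, -1] = 1.0   # bias term u_{k, N_node+1} = 1
    return u

# Tiny synthetic example: the trace equals the sample index, so the node
# positions can be read off directly.
trace = np.arange(100, dtype=float)
u = extract_nodes(trace, dt=1.0, t0=0.0, tp=10.0, N_node=5, N_inputs=2)
assert list(u[0, :5]) == [2.0, 4.0, 6.0, 8.0, 10.0] and u[0, 5] == 1.0
```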

Once the weights are determined, we inject other random binary inputs \(b_k^\prime\) into the reservoir, where the number of input data is \(N^\prime\). Note that \(N^\prime\) is not necessarily the same as *N*. Here, we use the prime symbol to distinguish the input data from those used in training. Similarly, we denote the output and target data with respect to \(b_k^\prime\) as \(u_{n,i}^\prime\) and \(y_{n,D}^\prime\), respectively, where \(n=1,2,\ldots,N^\prime\). From the output data \(u_{n,i}^\prime\) and the weight \(w_{D,i}\), we define the system output \(v_{n,D}^\prime\) as

$$\begin{aligned} v_{n,D}^\prime =\sum_{i=1}^{N_{\mathrm{node}}+1}w_{D,i}u_{n,i}^\prime. \end{aligned}$$

(12)
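The training step of Eq. (11) and the evaluation step of Eq. (12) amount to a linear least-squares fit followed by a matrix-vector product. A minimal sketch, using random stand-in node data (not actual LLG outputs) and a target that is an exact linear combination of the nodes so the fit can be verified:

```python
import numpy as np

rng = np.random.default_rng(0)
N, N_prime, N_node = 1000, 1000, 250

# Stand-in node outputs u_{k,i} (training) and u'_{n,i} (evaluation),
# each with the bias column appended.
u_train = np.column_stack([rng.standard_normal((N, N_node)), np.ones(N)])
u_eval = np.column_stack([rng.standard_normal((N_prime, N_node)),
                          np.ones(N_prime)])

# Synthetic target y_{k,D}: an exact linear function of the nodes.
w_true = rng.standard_normal(N_node + 1)
y_train = u_train @ w_true

# Training: the weights minimizing Eq. (11) via linear least squares.
w_D, *_ = np.linalg.lstsq(u_train, y_train, rcond=None)

# Evaluation: the system output v'_{n,D} of Eq. (12).
v_eval = u_eval @ w_D

assert np.allclose(w_D, w_true)
```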

Figure 5c shows an example of the comparison between the target data \(y_{n,D}^\prime\) (red line) and the system output \(v_{n,D}^\prime\) (blue dots) of the STM task with \(D=1\). It is shown that the system output reproduces the target data well. The reproducibility of the target data is quantified by the correlation coefficient \(\mathrm{Cor}(D)\) between \(y_{n,D}^\prime\) and \(v_{n,D}^\prime\), defined as

$$\begin{aligned} \mathrm{Cor}(D)\equiv \frac{\sum_{n=1}^{N^\prime}\left( y_{n,D}^\prime - \langle y_{n,D}^\prime \rangle \right) \left( v_{n,D}^\prime - \langle v_{n,D}^\prime \rangle \right)}{\sqrt{\sum_{n=1}^{N^\prime}\left( y_{n,D}^\prime - \langle y_{n,D}^\prime \rangle \right)^2 \sum_{n=1}^{N^\prime}\left( v_{n,D}^\prime - \langle v_{n,D}^\prime \rangle \right)^2}}, \end{aligned}$$

(13)

where \(\langle \cdots \rangle\) represents the averaged value. Note that the correlation coefficients are defined for each delay *D*. We also note that the correlation coefficients are defined for each kind of capacity, as in the case of the weights and target data. In general, \([\mathrm{Cor}(D)]^2 \le 1\), where \([\mathrm{Cor}(D)]^2=1\) holds only when the system output completely reproduces the target data. Figure 5d shows an example of the dependence of \([\mathrm{Cor}(D)]^2\) for the STM task on the delay *D*. The result implies that the reservoir reproduces the target data well up to \(D=3\), whereas the reproducibility drastically decreases as the delay *D* increases. The STM and PC capacities, \(C_{\mathrm{STM}}\) and \(C_{\mathrm{PC}}\), are defined as

$$\begin{aligned} C=\sum_{D=1}^{D_{\mathrm{max}}}\left[ \mathrm{Cor}(D)\right]^2. \end{aligned}$$

(14)
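Equations (13) and (14) can be sketched directly; the data below are synthetic, with the system output chosen as an affine function of the target so that every \([\mathrm{Cor}(D)]^2\) equals 1:

```python
import numpy as np

def squared_correlation(y, v):
    """[Cor(D)]^2 between target y'_{n,D} and output v'_{n,D}, per Eq. (13)."""
    dy, dv = y - y.mean(), v - v.mean()
    return (dy @ dv)**2 / ((dy @ dy) * (dv @ dv))

def capacity(y_all, v_all):
    """C = sum over delays of [Cor(D)]^2, per Eq. (14); columns index D."""
    return sum(squared_correlation(y_all[:, D], v_all[:, D])
               for D in range(y_all.shape[1]))

# Perfect (affine) reproduction gives [Cor(D)]^2 = 1 for each of the
# D_max = 20 delays, so the capacity equals 20.
rng = np.random.default_rng(1)
y = rng.standard_normal((1000, 20))
assert np.isclose(capacity(y, 2 * y + 3), 20.0)
```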

Note that the definition of the memory capacity follows, for example, Refs.^{18,20,21,25}, where the memory capacity in Eq. (14) is defined by the correlation coefficients starting from \(D=1\). In some papers, such as Refs.^{15,30}, however, the square of the correlation coefficient at \(D=0\) is added to the right-hand side of Eq. (14).

In the present study, we introduce \(N_{\mathrm{node}}=250\) nodes and use \(N=1000\) and \(N^\prime=1000\) random binary pulses for the training of the weights and the evaluation of the memory capacity, respectively. The number of nodes is chosen so that the value of the capacity saturates as the number of nodes increases; see the inset of Fig. 5d. We also use 300 random binary pulses before the training and between the training and the evaluation for washout. The maximum delay \(D_{\mathrm{max}}\) is 20. Note that the value of each node should be sampled within a few hundred picoseconds: specifically, in the example shown in Fig. 2c, it is necessary to sample data within \(t_{\mathrm{p}}/N_{\mathrm{node}}=69\ \mathrm{ns}/250 \simeq 276\) ps. We emphasize that it is experimentally possible to sample data within such a short time. For example, in Ref.^{21}, \(t_{\mathrm{p}}=20\) ns and \(N_{\mathrm{node}}=200\) were used, where the sampling step is 100 ps.

### NARMA task

The evaluation procedure of the NMSE in the NARMA task is similar to that of the memory capacity. The binary input data, \(b_k=0\) or 1, in the evaluation of the memory capacity are replaced by uniform random numbers \(r_k\) in (0, 1). The variable \(z_k\) in Eq. (7) is generally defined as \(z_k=\mu +\sigma r_k\)^{30}, where the parameters \(\mu\) and \(\sigma\) are determined to make \(z_k\) lie in (0, 0.2)^{15}. As in the case of the evaluation of the memory capacity, the evaluation of the NMSE consists of two procedures. The first procedure is the training, where the weight is determined to reproduce the target data from the output data \(u_{k,i}\). Secondly, we evaluate the reproducibility of another set of the target data from the system output \(v_n^{\mathrm{NARMA2}}\) defined from the weight and the output data. Then, the NMSE can be evaluated. Note that some papers^{13,27,30} define the NMSE in a slightly different way, where \(\sum_{n=1}^{N^\prime}\left( y_n^{\mathrm{NARMA2}} \right)^2\) in the denominator of Eq. (8) is replaced by \(\sum_{n=1}^{N^\prime}\left( y_n^{\mathrm{NARMA2}} - \overline{y^{\mathrm{NARMA2}}} \right)^2\), where \(\overline{y^{\mathrm{NARMA2}}}\) is the average of the target data \(y_n^{\mathrm{NARMA2}}\). In this work, we use the definition given by Eq. (8), which is used, for example, in Refs.^{12,15,18}.
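A minimal sketch of the target generation and error measure; since Eqs. (7) and (8) are not reproduced here, the recurrence below assumes the standard second-order NARMA (NARMA2) form, and `nmse` implements the stated definition with \(\sum_n (y_n)^2\) in the denominator:

```python
import numpy as np

def narma2_target(z):
    """Standard NARMA2 recurrence (assumed form of Eq. (7)'s task)."""
    y = np.zeros(len(z))
    for k in range(1, len(z) - 1):
        y[k + 1] = 0.4 * y[k] + 0.4 * y[k] * y[k - 1] + 0.6 * z[k]**3 + 0.1
    return y

def nmse(y_target, v_output):
    """NMSE = sum_n (y_n - v_n)^2 / sum_n y_n^2 (definition of Eq. (8))."""
    return np.sum((y_target - v_output)**2) / np.sum(y_target**2)

# Uniform random input r_k in (0, 1), scaled with mu = 0, sigma = 0.2
# so that z_k lies in (0, 0.2).
rng = np.random.default_rng(2)
r = rng.random(500)
z = 0.2 * r
y = narma2_target(z)
assert nmse(y, y) == 0.0   # perfect reproduction gives zero error
```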

### Evaluation of Lyapunov exponent

We evaluated the conditional Lyapunov exponent as follows^{58}. The LLG equation was solved by the fourth-order Runge-Kutta method with a time increment of \(\Delta t=1\) ps. We added perturbations \(\delta\theta\) and \(\delta\varphi\) with \(\varepsilon =\sqrt{\delta\theta^2+\delta\varphi^2}=10^{-5}\) to \(\theta(t)\) and \(\varphi(t)\) at time *t*. Let us denote the perturbed \(\theta(t)\) and \(\varphi(t)\) as \(\theta^\prime(t)\) and \(\varphi^\prime(t)\), respectively. Solving the LLG equation from time *t* to \(t+\Delta t\), the time evolution of the perturbation is obtained as \(\varepsilon^\prime(t)=\sqrt{[\theta^\prime(t+\Delta t)-\theta(t+\Delta t)]^2+[\varphi^\prime(t+\Delta t)-\varphi(t+\Delta t)]^2}\). A temporal Lyapunov exponent is obtained as \(\lambda(t)=(1/\Delta t)\log[\varepsilon^\prime(t)/\varepsilon]\). Repeating the procedure, the temporal Lyapunov exponent is averaged as \(\lambda(\mathcal N)=(1/\mathcal N)\sum_{i=1}^{\mathcal N}\lambda(t_i)=[1/(\mathcal N \Delta t)]\sum_{i=1}^{\mathcal N}\log\{\varepsilon^\prime[t_0+(i-1)\Delta t]/\varepsilon\}\), where \(t_0\) is the time at which the first random input is injected, whereas \(\mathcal N\) is the number of averaging steps. The Lyapunov exponent is given by \(\lambda =\lim_{\mathcal N \rightarrow \infty}\lambda(\mathcal N)\). In the present study, we used the same time range as that used in the evaluations of the memory capacity and the NMSE and added uniform random input. Hence, notice that \(\mathcal N=\mathcal M t_{\mathrm{p}}/\Delta t\) depends on the pulse width, where \(\mathcal M\) is the total number of random inputs, including washout, training, and evaluation. We confirmed that \(\lambda(\mathcal N)\) monotonically saturates to zero; at least, \(|\lambda(\mathcal N)|\) is one or two orders of magnitude smaller than \(1/t_{\mathrm{p}}\). Thus, the expansion time of the perturbation, \(1/|\lambda(\mathcal N)|\), is much longer than the injection interval of the input signal.
Considering these facts, we concluded that the largest Lyapunov exponent can be regarded as zero, and therefore, chaos is absent. Note that the absence of chaos in the present system relates to the fact that the free layer is axially symmetric and the applied voltage modifies the perpendicular anisotropy only. When there are factors breaking the symmetry, such as spin-transfer torque with an in-plane spin polarization, chaos will appear^{30}.
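The perturb-evolve-renormalize procedure described above can be sketched generically as follows; the `step` function here is a simple relaxation ODE standing in for the Runge-Kutta LLG solver, and the step count, time increment, and decay rate are illustrative:

```python
import numpy as np

def lyapunov_exponent(step, x0, n_steps, dt, eps=1e-5, rng=None):
    """Averaged temporal Lyapunov exponent by the perturbation method.

    step : function advancing the state by dt (a generic stand-in for
           the fourth-order Runge-Kutta LLG integration)
    eps  : size of the perturbation added at each step
    """
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    lam_sum = 0.0
    for _ in range(n_steps):
        d = rng.standard_normal(x.shape)
        d *= eps / np.linalg.norm(d)      # perturbation of size eps
        x_next, xp_next = step(x, dt), step(x + d, dt)
        eps_prime = np.linalg.norm(xp_next - x_next)
        lam_sum += np.log(eps_prime / eps) / dt   # temporal exponent
        x = x_next
    return lam_sum / n_steps              # average over n_steps

# Toy check: pure relaxation dx/dt = -a x has exponent -a (non-chaotic).
a = 2.0
step = lambda x, dt: x * np.exp(-a * dt)
lam = lyapunov_exponent(step, [1.0, 0.5], n_steps=200, dt=0.01)
assert abs(lam + a) < 1e-6
```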