1. Introduction
Deep Neural Networks (DNNs) have become a popular choice for tasks such as image classification, face recognition, and Natural Language Processing. This has, however, come at the cost of massive computations on von Neumann architectures, with high energy and area requirements (et al., 2012a). The emergence of novel devices and special-purpose architectures has called for a shift away from conventional digital hardware for implementing neural algorithms (et al., 2013c). Attempts have been made towards dedicated hardware designs that realize the synaptic weights (and neurons) of a Neural Network (NN) using CMOS transistors in an analog fashion (et al., 2010b), but these have met with challenges of scalability and volatility. Parallel research has focused on post-CMOS devices such as memristors, which are non-volatile devices with a variable resistance (et al., 2015e). However, the fabrication of multi-level memristors with stable states is still a challenge (et al., 2016d). Another choice is the Magnetic Tunnel Junction (MTJ), an emerging binary device (it has 2 stable states) which has shown its potential as a storage element and is a promising candidate for replacing CMOS in memory chips (et al., 2013b). Its non-volatility and scalability make it a particularly attractive choice for logic-in-memory architectures for neural networks. MTJs and memristors can be connected in a crossbar configuration, which allows greater scalability and higher performance due to their inherent parallelism (et al., 2015e, 2016c, a). Several studies have investigated how crossbar arrays with memristors (et al., 2013a, 2015f), MTJs (et al., 2016d, 2015a) and domain-wall ferromagnets (et al., 2016a, 2015f) can implement Spiking Neural Networks (SNNs) trained using Spike-Timing Dependent Plasticity (STDP). Both Hasan et al. (et al., 2014) and Soudry et al. (et al., 2015b) have implemented multi-layer NNs on memristive crossbars trained on-chip using the backpropagation algorithm and demonstrated them on supervised learning tasks.
Continuous-weight networks can be simplified into discrete-weight networks without significant degradation in classification accuracy while achieving substantial power benefits (et al., 2016f). The use of discrete-weight networks, such as BinaryConnect (et al., 2015d) and the networks in (et al., 2016e), also stems from the need to address the high storage and computational demands of a large number of full-precision weights. The existence of only 2 stable states in MTJs makes them a good candidate for realizing binary-weight networks. One way of training such NNs is to perform weight updates stochastically, which is justifiable from evidence that learning in the human brain also involves some stochasticity (et al., 2013c). That such a method can lead to convergence with high probability in finite time has been shown in (et al., 2005b). Obtaining optimal weights for a binary network in software can be impractical because its discrete nature requires integer programming. Also, when physically realizing an NN in hardware, the underlying device variations can have a substantial impact on model accuracy and need to be accounted for in the training process. Merely characterizing the variations of the hardware platform is not sufficient to overcome this issue.
In this paper, we explore the use of MTJ crossbars for the hardware implementation of the synaptic weight matrices of a neural network. We propose in-situ training of such an MTJ crossbar NN, which allows us to exploit its inherent parallelism for significantly faster training and also accounts for device variations. We advocate a probabilistic way of updating the MTJ synaptic weights through the gradient descent algorithm by exploiting the stochasticity in their switching. We experiment with two crossbar structures: with and without access transistors. The latter poses the additional challenge of sneak-path currents during programming, which makes in-situ training the only way to achieve satisfactory performance. Finally, we support our proposed techniques by modeling device and circuit properties and running simulations.
2. Background
In this section we describe the basics of neural networks and the parallelism offered by the crossbar architecture, and introduce the characteristics of Magnetic Tunnel Junctions.
2.1. Neural Networks
The computation performed by any layer of an NN during the inference (forward-propagation) phase essentially comprises a matrix-vector multiplication. Say $x$ is the input to a layer and $W$ represents the synaptic weight matrix; then the output is

$y = f(Wx)$  (1)
where $f(\cdot)$ is an activation function. Training of the NN can be done by backpropagation using the gradient-descent optimization method. The weight update of the synapse connecting the $i$-th input to the $j$-th output is given as

$\Delta W_{ji} = \eta\, \delta_j x_i$  (2)

where $E$ is the cost function for the presented input sample $x$, $\eta$ is the learning rate and $\delta_j$ is the error calculated at the $j$-th output using the actual output $y$ and the desired output. It is worth noting that such a weight update is local in nature, in that it depends only on the information available at the synapse: the input to it and the error at its output. The weight update of the entire matrix can thus be written as

$\Delta W = \eta\, \delta x^{T}$  (3)
The major computational cost of this algorithm comes from the complexity of eqns. (1) and (3), whose implementation on general-purpose hardware requires time and memory of the same order, thereby discouraging their use for large-scale applications. Fortunately, the nature of the computation in eqn. (1) and the locality of the weight update enable the design of highly parallel hardware that reduces the overall complexity to O(1).
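As a concrete software illustration of eqns. (1)-(3), the following minimal NumPy sketch shows the forward pass and the local outer-product weight update (function names and the toy values are ours; the sign convention assumes $\delta$ already absorbs the negative gradient of the cost):

```python
import numpy as np

def forward(W, x):
    """Forward pass of one layer, eqn (1): y = f(Wx), here with f = tanh."""
    return np.tanh(W @ x)

def weight_update(W, x, delta, eta):
    """Gradient-descent update of eqn (3): Delta W = eta * outer(delta, x).

    Entry Delta W[j, i] = eta * delta[j] * x[i] uses only the input to that
    synapse and the error at its output, i.e. the update is purely local.
    """
    return W + eta * np.outer(delta, x)

W = np.zeros((2, 3))
x = np.array([1.0, -0.5, 0.0])
delta = np.array([0.2, -0.4])
W_new = weight_update(W, x, delta, eta=1.0)
```

The locality of the update is what lets a crossbar perform the whole of eqn. (3) in parallel, as discussed below.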
2.2. The Crossbar Architecture
The physical realization of a synaptic weight matrix is possible using the grid-like crossbar structure, where each junction has a resistance corresponding to one synapse. Fig. 4 (crossbar_general) shows a simplified crossbar with each row corresponding to an input and each column to an output neuron. Let $V_i$ be the voltage applied at the $i$-th input terminal and $G_{ij}$ be the conductance of the synapse connecting it to the $j$-th output. By Ohm's law, the current through that synapse is $G_{ij}V_i$, and by Kirchhoff's current law the total current at the $j$-th output is

$I_j = \sum_i G_{ij} V_i$  (4)
which bears similarity to the dot products in (1). This can then be fed to suitable analog circuits for implementing the activation function.
Since the outputs are obtained almost instantaneously after the inputs are applied, the matrix-vector multiplication of eqn. (1) is performed in parallel with constant time complexity. As for the update phase, the crossbar resistances can be modified by suitably modeling the required change as the product of 2 physical quantities derivable from the inputs and the errors. In this way, the update operations can also be done in parallel across the synapses.
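The crossbar read of eqn. (4) is just a conductance-weighted dot product per column; a sketch with illustrative (not device-measured) values:

```python
import numpy as np

# Conductances G[i, j] of the synapse joining input row i to output column j
# (values are illustrative placeholders, not device data).
G = np.array([[1e-5, 2e-5],
              [3e-5, 4e-5]])
V = np.array([0.1, 0.2])  # read voltages proportional to the inputs

# Eqn (4): by Kirchhoff's current law each output current is a dot product;
# on hardware all columns settle simultaneously, giving the O(1) read.
I_out = G.T @ V
```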
2.3. Magnetic Tunnel Junction
The Magnetic Tunnel Junction (MTJ) is a 2-terminal spintronic device consisting primarily of 2 ferromagnetic layers separated by a thin tunnel barrier (typically MgO). The magnetic orientation of one of the magnetic layers is fixed, whereas that of the other is free, as shown in fig. 4 (MTJ_states). MTJs possess 2 stable states, in which the relative magnetic orientations of the free and fixed layers are Parallel (P) and Anti-Parallel (AP) respectively, with the P state exhibiting a lower resistance than the AP state ($R_P < R_{AP}$).
It is possible to switch the state of the MTJ by passing a spin-polarized current of appropriate polarity, which flips the magnetization of the free layer through the mechanism of spin-transfer torque (STT) (et al., 2003). The time required to switch depends heavily on the magnitude of the switching current. Moreover, this switching process is stochastic, in the sense that a pulse of given amplitude and duration has only a certain probability of successfully changing the state. This stochasticity is due to thermal fluctuations in the initial magnetization angle and is an intrinsic property of STT switching (et al., 2003).
Depending on the magnitude $I$ of the current and the critical current $I_{c0}$ (et al., 2016d), the switching probability in the high-speed precessional regime is expressed as

$P_{sw} = 1 - \exp(-t/\tau)$  (5)

where $t$ is the pulse width, $\Delta$ is the thermal stability factor, and $\tau$ is the mean switching time (which is dependent on $I$ and $\Delta$) (et al., 2011).


The spin-transfer efficiency of an MTJ is different for the 2 switching directions, with one direction having a smaller value than the other (et al., 2012b). This makes the critical currents of the two directions unequal, which means that the same magnitude and duration of current will correspond to different switching probabilities for the 2 switching directions. Fig. 7 shows the dependence of the switching probability on pulse width and switching current for one of the transitions. Observe the similar nature of the variation with $t$ and $I$. The opposite transition depicts the same kind of behavior, albeit with different values of $I_{c0}$ and $\tau$.
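A simple functional form in the spirit of eqn. (5) is sketched below; the critical current and time-constant values are placeholders of ours, not fitted device parameters:

```python
import math

def p_switch(I, t, I_c0=50e-6, tau0=1e-9):
    """Illustrative stochastic-switching model in the spirit of eqn (5):
    P_sw = 1 - exp(-t / tau), with the mean switching time tau shrinking
    as the overdrive I / I_c0 grows (precessional regime). I_c0 and tau0
    are placeholder values, not measured device data.
    """
    if I <= I_c0:
        return 0.0  # below the critical current: effectively no switching
    tau = tau0 / (I / I_c0 - 1.0)
    return 1.0 - math.exp(-t / tau)
```

Both a longer pulse and a larger current raise $P_{sw}$, matching the trends of fig. 7; the asymmetric critical currents of the two transition directions would simply correspond to two parameter sets.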
3. MTJ Crossbar based Neural Networks
The stochastic switching nature of MTJs has necessitated the use of high write currents or long write durations in memory applications to ensure low write-error rates. Alternatively, one can use them to implement the synaptic weights in a crossbar, where each cross-point would be an MTJ in one of its 2 states. They can be programmed at high speed and exhibit very high endurance in terms of write cycles. However, the inherently binary nature of MTJs implies that such synapses can represent only 2 weight values and hence can implement only binary networks. Although some continuous behavior is possible with the inclusion of a domain wall in the free layer (et al., 2016a), the maturity of such technology is not at par with that of the binary version (et al., 2015f).
Training Binary Networks: Obtaining optimal binary weights for an NN is an NP-hard problem with exponential time complexity, and hence a solution must involve some form of training of the binary network itself. This prompts the use of a probabilistic learning technique, since the required weight update is continuous whereas any possible change in the conductance of the MTJ can only be discrete, in fact binary. As stated in (et al., 2013c), stochastic update of binary weights is computationally equivalent to deterministic update of multi-level weights at the system level.
In (et al., 2015a), Vincent et al. exploit the stochastic switching behavior of MTJs to propose their use as a "stochastic memristive synapse" in an SNN taught using a simplified STDP rule. However, there is no theoretical guarantee of the convergence of STDP for general inputs (et al., 2005a). We instead propose a probabilistic learning approach that trains using the gradient descent method (which requires weight updates of the form in eqn. (2)), as demonstrated in section 4.2.
3.1. The Motivation for In-situ Training
There are 2 primary ways in which MTJs in the crossbar can be connected to their respective input and output terminals:


With selector devices (1T1R): Each MTJ synapse is connected in series with a MOS transistor (as in fig. 4 (crossbar_1t1r_general)), resulting in one transistor per cross-point in the crossbar.

Without selector devices (1R): Synapses are directly connected to the crossbar terminals; there are no transistors within the crossbar, as in fig. 4 (crossbar_general). While a 1R structure provides greater scalability, it does so at the cost of reduced control of, and access to, individual synapses.
Stochastic learning can be done (simulated) offline and the final weights can then be programmed onto the crossbar deterministically. But since MTJs have an inherently stochastic switching behavior, deterministically programming them on a crossbar requires currents of high magnitude and duration to guarantee successful write operations. The ability to select the synapses to be written in the 1T1R architecture ensures that this method has no side-effects stemming from alternate current paths (because there would be none). But despite circumventing this issue, the 1T1R architecture can suffer from performance degradation due to intrinsic device variations, which only aggravate with scaling. In a 1R architecture, on the other hand, such high programming currents, when they sneak through alternate paths, are bound to cause unwanted changes in neighboring synapses, owing to which the weights may never converge. This necessitates in-situ training of the crossbar in a probabilistic way for both 1T1R and 1R configurations, as only training on the hardware can account for both alternate paths and device variability.
3.2. Network Binarization
Simply using $\pm 1$ as the binary weight values, represented by the 2 states of an MTJ, is naive; estimating a good scaling factor $\alpha$ is essential for overall network performance. An appropriate way to determine a suitable $\alpha$ is to minimize the L2 loss between the real-valued weights W and the quantized ones, as was done in (et al., 2016f). This yields $\alpha = \overline{|W|}$ (the mean of the absolute values of W). Thus an MTJ in one state signifies a weight of $+\alpha$ and in the other a weight of $-\alpha$.

4. In-situ Training of MTJ Crossbars
We first provide a high-level understanding of how an MTJ synaptic crossbar implementing an NN should work. For the sake of simplicity, all operations are described for a single-layer NN and can easily be scaled to multiple layers (more details subsequently). We then illustrate how the gradient descent method can be used for the stochastic weight update of MTJs, and finally describe the in-situ training procedure for the 2 crossbar architectures.
4.1. Overview of Operations
The training process is carried out as follows.
Read Phase: Upon receiving a training input $x$, the input terminals are driven with voltages proportional to $x$, whereas the output terminals are maintained at ground potential. Current flows through the synapses, and the total current at each output terminal is suitably converted to the output $y$.
Write Phase: Using $y$ and the desired output, calculate the error $\delta$. Table 1 lists the 4 possible cases of weight update depending on the signs of the input and the error. The gradient descent algorithm requires a weight update of the form of eqn. (2). An appropriate way to realize this, as suggested in (et al., 2007), is to set switching probabilities proportional to the magnitude of the update calculated in (2). Our way of achieving this is explained next.
The process of read and write is carried out for each input sample and repeated for several iterations until convergence is achieved.
Table 1. The four weight-update cases. Since $\Delta W_{ji} \propto x_i \delta_j$, the signs of the input and the error determine the switching direction:

Input $x_i$ | Error $\delta_j$ | $\Delta W_{ji}$ | Switch
+ | + | + | towards $+\alpha$
+ | − | − | towards $-\alpha$
− | + | − | towards $-\alpha$
− | − | + | towards $+\alpha$
4.2. Stochastic Learning of an MTJ Synapse
We now describe how the stochasticity of MTJ switching can be used to perform weight updates with the gradient descent method. Just as the weight update in eqn. (2) is a function of 2 variables (the input $x_i$ and the error $\delta_j$), the probabilistic switching of MTJs can be controlled by 2 physical quantities: the magnitude and the duration of the programming current. We choose the magnitude of the write current to depend on the input and the pulse duration to depend on the error. However, as evidenced by eqn. (5) and fig. 7, the switching probability is a highly nonlinear function of the current $I$ and pulse width $t$ (recall that $\tau$ depends on $I$), whereas the desired probability, being proportional to $|x_i \delta_j|$, is a linear function of $|x_i|$ and $|\delta_j|$. Further, the switching probability does not immediately rise with the pulse width and the write current as they increase from 0, indicating a kind of soft threshold. Note that the direction of switching can be decided by the polarity of the write current.
We therefore model the switching probabilities by a linear mapping of $|\delta_j|$ and $|x_i|$ to write duration and current, respectively, as follows. Henceforth assume for simplicity that $|x_i| \le 1$ and $|\delta_j| \le 1$ (this can be ensured by normalizing the inputs and absorbing constants into the learning rate $\eta$). The pulse width is set at a minimum of $t_{min}$ and increases linearly with $|\delta_j|$ (since the switching probability needs to increase irrespective of the sign of $\delta_j$) as

$t_j = t_{min} + c_t |\delta_j|$  (6)

Similarly, the write current magnitude is a minimum of $I_{min}$ and increases linearly with $|x_i|$ as

$I_i = I_{min} + c_I |x_i|$  (7)
We now wish to find coefficients $c_t$ and $c_I$ that yield MTJ switching probabilities ($P_{sw}$) close to the desired probabilities of weight update. A given switching probability can be obtained for different combinations of $I$ and $t$, as is evident from fig. 7. We first fix the range of pulse widths by choosing suitable $t_{min}$ and $t_{max}$ (refer to table 3). We want a nearly zero switching probability whenever $\delta_j = 0$, irrespective of the value of $x_i$, because $\Delta W_{ji} = 0$ for $\delta_j = 0$ regardless of $x_i$. We thus choose the maximum write current (which is $I_{min} + c_I$) to be the value of $I$ at which the plot of $P_{sw}$ against $t$ just starts rising at $t = t_{min}$. That is,

$P_{sw}(I_{min} + c_I,\; t_{min}) \le \epsilon$  (8)

where $\epsilon$ is a small value. So even if $|x_i|$ is (as high as) 1, $P_{sw} \approx 0$ when $\delta_j = 0$. In our experiments, $\epsilon$ was chosen to be small.
A symmetric argument holds when $x_i = 0$: we want $P_{sw} \approx 0$ even if $|\delta_j| = 1$ (because $\Delta W_{ji} = 0$ for $x_i = 0$), but $P_{sw}$ should start increasing as soon as $|x_i|$ increases. That is,

$P_{sw}(I_{min},\; t_{min} + c_t) \le \epsilon$  (9)
Fig. 8 shows how well the linear model approximates the required switching probabilities (a similar curve fit holds for the opposite transition as well). Table 2 shows the write currents and durations for boundary values of $x_i$ and $\delta_j$, and table 3 lists the values of the coefficients in eqns. (6) and (7). One could use nonlinear mappings of $|\delta_j|$ and $|x_i|$ to $t$ and $I$, respectively, to better fit the desired switching probabilities; however, that would complicate the analog circuit responsible for the conversion. Owing to this, and to the closeness with which the linear model replicates the stochastic switching characteristics, we stick with the linear version.
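The linear mapping of eqns. (6)-(7) can be sketched as follows; all coefficient values are placeholders of ours (not the table-3 parameters), assuming the normalization $|x_i| \le 1$, $|\delta_j| \le 1$:

```python
def write_signals(x, delta, I_min=40e-6, c_I=40e-6, t_min=1e-9, c_t=4e-9):
    """Sketch of eqns (6)-(7). The coefficient values are placeholders,
    not the paper's fitted parameters; |x| <= 1 and |delta| <= 1 assumed.
    """
    t = t_min + c_t * abs(delta)   # pulse width follows |delta|, eqn (6)
    I = I_min + c_I * abs(x)       # current magnitude follows |x|, eqn (7)
    polarity = 1 if x * delta >= 0 else -1  # switching direction from sign(x * delta)
    return polarity * I, t
```

Note that the pulse width is sign-independent, as required, and only the current polarity encodes the direction of the desired update.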
Next, we describe the 1T1R and 1R crossbar architectures implementing the NN. We show how these can be trained in-situ using the stochastic learning technique described above.
4.3. The 1T1R Architecture
This is the conventional architecture for memory applications, where each cell has a selection transistor. One major advantage of being able to selectively turn off certain cells is that it prevents undesired sneak currents, which at a minimum lead to unnecessary power consumption. Fig. 11 (crossbar_1t1r) shows a 1T1R crossbar where each MTJ synapse is connected in series with an NMOS transistor. Input and output terminals are interfaced with the necessary Control Logic (CL). All the transistors in a single column have a common gate voltage, since the corresponding synapses are connected to the same neuron output and hence always have the same error $\delta_j$ and write pulse width $t_j$.
Fig. 11 (signals_1t1r) plots the signals during both the read and write phases. During the read phase, all transistors are turned on so that all columns (neuron outputs) are read simultaneously. Inputs are provided to their respective input CLs, which convert them to read voltages. Output currents are processed by the output CLs.


Updating the crossbar:
Decide the write currents to be provided to each input row and the pulse widths for each output column as described in sec. 4.2. Recall that the former depend on the inputs $x_i$ and the latter on the errors $\delta_j$. The direction of the currents depends on the sign of the desired weight update. Apply suitable write voltages at the input terminals while grounding the output terminals.

For a given synapse, the write pulse width depends only on $\delta_j$, and the write current magnitude only on $x_i$. But the direction of switching depends on the signs of both $x_i$ and $\delta_j$ (see Table 1) and has to be set by the polarity of the current. For example, two MTJ synapses belonging to the same row but different columns may see opposite signs of $\delta_j$. Thus, despite having the same input $x_i$, they are required to switch in opposite directions and hence need write voltages of opposite sign. This requires us to split the write phase into two parts, as explained next.

Since the transistor gate control signals are connected to the output CLs, we can select or deselect a certain column based on the information at its respective CL, namely the error $\delta_j$. We therefore program the crossbar sequentially in 2 stages, with the columns updated in a given stage depending on the sign of $\delta_j$. Each phase has a duration of at most the maximum pulse width of eqn. (6). The voltage signals in each phase are plotted in fig. 11 (signals_1t1r) and detailed below:


Phase 1: Update the weights of the columns which had $\delta_j > 0$. The transistor control signal of a column is asserted (for its pulse width $t_j$) if $\delta_j > 0$ and held low otherwise (10). The write voltages applied at the input terminals follow the inputs, with magnitude set by eqn. (7) and polarity by the sign of $x_i$ (11).

Phase 2: Update the weights of those columns which had $\delta_j < 0$. Here the signals are opposite to those in phase 1, as shown in fig. 11 (signals_1t1r).
Here the write voltage applied to switch an MTJ in a given direction can be obtained from eqn. (7) and the device resistance; it still depends on $x_i$, but for brevity explicit mention is omitted henceforth. Let the MTJs in the crossbar be arranged such that positive (negative) current from the input terminal to the output terminal switches them in one (the other) direction. The parameters in table 3 then give the corresponding write-voltage magnitudes.
Thus the read and update operations are completed in constant time, i.e. O(1) in the crossbar dimensions. Due to limitations on the scalability of the 1T1R architecture, it is worth exploring the feasibility of transistor-less crossbars to achieve an even higher density of integration.
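The 2-stage column gating above can be sketched as two boolean masks over the error vector (which sign is written first is an arbitrary convention of ours):

```python
import numpy as np

def two_phase_column_masks(delta):
    """Sketch of the 2-stage 1T1R write: columns with positive error are
    gated on in one phase, columns with negative error in the other."""
    return delta > 0, delta < 0

delta = np.array([0.3, -0.2, 0.0])
phase1, phase2 = two_phase_column_masks(delta)
# Columns with delta == 0 are selected in neither phase: no update needed.
```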
4.4. The 1R Architecture
Eliminating the need for an access transistor at every synapse in the crossbar allows for compact designs with a much higher integration density. But the inability to select the synapses to be updated during programming results in leakage currents through alternate paths that not only waste energy but can also lead to undesirable changes in synaptic conductance. We first examine the effect of such currents with the previously proposed write strategy, and then suggest a modified strategy (and circuit) for the 1R architecture.
4.4.1. Two-phase update:
Let us analyze the impact of sneak paths on the 1R crossbar with the 2-phase update strategy used previously. We first demonstrate the presence of sneak paths with a small example. Fig. 16 (sneak_2phase_demo) shows a crossbar with transistors only at the output terminals (to choose the columns to be written in any particular phase). Assume, without loss of generality, that a certain input produced errors of different signs at the outputs. The equivalent circuit during write phase 1 is drawn in fig. 16 (sneak_2phase_equiv). It depicts the currents through the synapses, some of which are undesired and may falsely switch the MTJs they pass through, depending on their states.

We now state a worst-case scenario for a crossbar with many inputs. If the number of inputs is large, analysis using Kirchhoff's Current Law shows that the potential difference across an unselected MTJ synapse can approach a substantial fraction of the full write voltage. The resulting current through such an MTJ can be high enough to falsely switch it, in either direction depending on its state.

It is also worth considering the average (expected) case. Here the sneak currents reduce to about half of those found in the worst case, but still have some probability of switching MTJs, because they remain of the same order as the nominal write currents. Thus, the chances of unwanted flips of MTJs are quite significant, which calls for a modification of the circuit and/or the programming method.
Table 4. The 4-phase update scheme: each phase selects the rows with one sign of input and the columns with one sign of error, one phase per sign combination (phase ordering shown is illustrative):

Phase | Input sign ($x_i$) | Error sign ($\delta_j$)
1 | + | +
2 | + | −
3 | − | +
4 | − | −
4.4.2. Four-phase update:
The large sneak currents of the 2-phase writing strategy, which can result in false switching, are due to the high potential difference between input terminals driven with inputs of different signs. One simple way to mitigate this issue is to further split the 2 phases of weight update so that, in a given phase, only rows having the same sign of input are updated at a time. This is equivalent to first clustering the columns according to the sign of $\delta_j$, and then further clustering the rows according to the sign of $x_i$. The proposed 4-phase writing scheme requires additional transistors to choose the rows to be updated in a given phase, as shown in fig. 16 (crossbar_1r). It is summarized in Table 4; each phase has the same duration, so the total time for updating the crossbar is doubled relative to the 2-phase scheme. Note that this is still O(1) time.

Let us now see how severe the sneak-path leakage is with this strategy. Fig. 16 (sneak_4phase_equiv) shows the equivalent circuit for the crossbar under the same set of assumptions (only synapses providing alternate current paths are shown). In the worst-case scenario, sneak currents can still be large enough to cause false switching; this follows intuition, since the potential difference between an input terminal and an output terminal can still approach the write voltage. However, in the average case, the sneak currents are found to be small and do not have the potential to cause undesired switching, as is evident from the parameters listed in table 3 and the range of values of the write currents. Hence, the 4-phase writing scheme significantly reduces the incidence of undesired switching at the small cost of an increase in the duration of the write phase. As we shall see, this trade-off is not only worthwhile but also necessary for satisfactory performance of the training process.
4.5. Multi-Layer NNs
Multi-layer NNs can be implemented on cascaded crossbars (each representing one layer), with the output of one fed as the input to the next. It is straightforward to implement the backpropagation algorithm on such a structure. Consider a 2-layer NN with weight matrices $W_1$ and $W_2$. For an input $x$, the final output is given as

$y = f(W_2 f(W_1 x))$  (12)

If $\delta_2$ is the error of the second (output) layer, then that of the first (hidden) layer is $\delta_1 = f'(h) \odot (W_2^{T} \delta_2)$, where $h$ is the hidden-layer output, $f'$ is the derivative of the activation function, and $\odot$ represents a component-wise product. This operation can be done on the crossbar of the output layer itself by reversing the roles of its input and output terminals: $\delta_2$ is fed as the input and out comes $W_2^{T}\delta_2$, which, when multiplied component-wise by $f'(h)$, gives the error $\delta_1$ to be used for updating the weights of the hidden layer.
For the MTJ crossbar NN we described, the total duration of the read phase during forward propagation scales with the number of layers of the NN. Backpropagation of errors to the hidden layers requires an extra reversed read phase for each such layer, during which the error at (the output of) a layer is fed as an input to its crossbar to obtain the error at the preceding layer. Lastly, all the layers can be updated simultaneously (in 2-phase or 4-phase write time, as per the architecture).

Further, a large layer of an NN can be split across multiple crossbars, some of which share inputs or outputs. All these crossbars can still be read and written in parallel, thanks to the locality of the weight-update operations.
5. Experimental Setup and Results
To see how successfully the MTJ crossbar NNs can be trained in-situ, we performed system-level simulations by modeling the functionality of the crossbar architecture in MATLAB and training it on several datasets with supervised learning. To capture the MTJ device parameters, we used an HSPICE model (et al., 2015c) and included thermal fields in its LLG equations to obtain the stochastic switching characteristics (et al., 2016b). Device parameters obtained from this model were then incorporated into the simulations of the crossbar.
The performance of the neural network was evaluated in the following scenarios (codenamed for further reference). All training processes used the Mean Square Error cost function and neurons had the tanh activation function.


RV: We first train and evaluate a neural network with real-valued weights in MATLAB. The binary quantization step ($\alpha$) is obtained from this trained network as shown in sec. 3.2.

DP: Suitable binary weights are obtained by performing probabilistic learning in software on a binary network. A 1T1R crossbar and a 1R crossbar are then deterministically programmed to these weights. We observe the effect of device variations on the former, and of alternate current paths and the resulting false switching on the latter.

ST: An MTJ synaptic crossbar is modeled and stochastically trained in-situ using the linear model of stochastic weight update described in sec. 4.2, for both the 1T1R and 1R architectures.

DV: Device variations of different extents are introduced into the stochastic training of both the 1T1R and 1R crossbars. Variation manifests in the resistances of the P and AP states, and experiments show it to be limited in practice (et al., 2010a).
We use the following datasets for evaluation.
SONAR, Rocks vs. Mines (Lichman, 2013): Three different NN architectures are considered: one with a single layer (1L), and two with 2 layers having 15 and 25 hidden neurons, named 2L15 and 2L25 respectively. They were trained, and then tested on 104 samples of the test dataset.
MNIST Digit Recognition (et al., 1998): Three 2-layer networks of 50, 100 and 150 hidden units respectively (2L50, 2L100, 2L150) and a 3-layer network with 50+25 hidden units (3L) were evaluated on the 10000 images of the test dataset.
Wisconsin Breast Cancer (Diagnostic) (WBCD) (Lichman, 2013): A single-layer network (1L) and 2 two-layer networks (2L10 and 2L20) were considered, and the test dataset had 200 samples.
Table 5 summarizes the misclassification rates obtained with these networks under the different training scenarios mentioned above. The effect of device variations of different extents on the in-situ stochastic training is highlighted for some of the networks in table 6, with fig. 19 plotting the mean square error as the training progresses for the 1R crossbar. Additionally, fig. 22 compares the error for the two write strategies: it does not converge with the 2-phase writing scheme, due to the higher incidence of undesired weight changes, but does so with 4 phases.
Table 5. Misclassification rates (%) under the different training scenarios.

|          | SONAR |      |      | MNIST |       |       |       | WBCD |      |      |
| Network  | 1L    | 2L15 | 2L25 | 2L50  | 2L100 | 2L150 | 3L    | 1L   | 2L10 | 2L20 |
| RV       | 16.4  | 12.8 | 11.9 | 9.87  | 7.34  | 6.44  | 7.25  | 8.35 | 7.40 | 7.10 |
| DP 1T1R  | 19.2  | 15.2 | 14.3 | 13.50 | 10.89 | 9.55  | 10.45 | 9.85 | 8.30 | 8.55 |
| DP 1R    | 46.8  | 41.4 | 42.7 | 39.42 | 36.10 | 37.92 | 40.48 | 24.95 | 27.60 | 23.65 |
| ST 1T1R  | 18.4  | 14.2 | 13.6 | 12.69 | 10.18 | 8.96  | 9.71  | 9.20 | 7.70 | 8.05 |
| ST 1R    | 18.3  | 14.5 | 14.0 | 12.72 | 10.20 | 9.03  | 9.66  | 9.40 | 7.85 | 7.95 |
Table 6. Misclassification rates (%) of in-situ stochastically trained crossbars under different extents of device variation.

| Dataset   | SONAR 1L  |      | SONAR 2L15 |      | MNIST 2L100 |       | MNIST 3L |      | WBCD 2L20 |      |
| Variation | 1T1R | 1R  | 1T1R | 1R   | 1T1R  | 1R    | 1T1R  | 1R    | 1T1R | 1R   |
| 2%        | 18.5 | 18.4 | 14.4 | 14.7 | 10.27 | 10.22 | 9.67  | 9.73  | 8.10 | 8.05 |
| 5%        | 18.7 | 18.7 | 14.7 | 14.8 | 10.28 | 10.29 | 9.78  | 9.80  | 8.25 | 8.30 |
| 10%       | 19.0 | 19.1 | 15.1 | 15.1 | 10.33 | 10.43 | 9.86  | 9.91  | 8.30 | 8.40 |
| 20%       | 19.3 | 19.5 | 16.0 | 15.9 | 10.42 | 10.72 | 10.15 | 10.28 | 8.60 | 8.75 |
It is evident from these results that:


When an MTJ synaptic crossbar without access transistors is stochastically trained in-situ (ST 1R), it shows classification accuracy only slightly lower than when the same network is trained in software with real-valued weights (RV, which can be considered the best achievable). Moreover, it brings a significant improvement in accuracy over a deterministically programmed crossbar (DP 1R), since the latter suffers from undesired weight changes arising from alternate current paths.

In-situ training also benefits the crossbar with transistors (ST 1T1R versus DP 1T1R) in the presence of device variations, slightly improving accuracy.

It is possible to compensate for the loss in accuracy due to the use of a binary network by increasing the size of the network (adding more hidden layers and/or neurons).

Further, the trained crossbar is robust in the face of device variations, owing primarily to the fault-tolerant nature of NNs and their learning algorithms. As can be seen in table 6, the increase in misclassification rate remains small even at 20% variation.
The accuracy degradation we incur (in going from RV to ST) is comparable to that reported in (et al., 2016d) and in (et al., 2015a). However, it must be emphasized that any such comparison is fair only when made on the same dataset and network architecture. The benefit of in-situ training can also be seen when we compare our work with that of (et al., 2016c), which performs offline learning: on the MNIST 2L100 network, our error rate is much lower than theirs on the same network, although it must be mentioned that the latter was at a disadvantage due to its linear activation units.




6. Conclusion
In this work, we show how MTJ crossbars representing the weights of an ANN can be trained in-situ by exploiting the stochastic switching properties of MTJs and performing weight updates in a manner akin to gradient descent. We demonstrate how the learning algorithm can be implemented on crossbars with and without access transistors. Results show that these stochastically trained binary networks can achieve classification accuracy almost as good as that of networks trained in software and implemented on conventional processors. This paves the way for highly scalable neural systems capable of performing complex applications.
References
A. F. Vincent et al. 2015a. Spin-transfer torque magnetic memory as a stochastic memristive synapse for neuromorphic systems. IEEE Transactions on Biomedical Circuits and Systems 9, 2 (2015), 166–174.
A. Sengupta et al. 2016a. Hybrid Spintronic-CMOS Spiking Neural Network with On-Chip Learning: Devices, Circuits, and Systems. Physical Review Applied 6 (2016).
A. Sengupta et al. 2016b. Probabilistic deep spiking neural systems enabled by magnetic tunnel junction. IEEE Transactions on Electron Devices 63 (2016), 2963–2970.
D. C. Worledge et al. 2010a. Switching distributions and write reliability of perpendicular spin torque MRAM. In Electron Devices Meeting (IEDM), 2010 IEEE International.
D. Querlioz et al. 2013a. Immunity to device variations in a spiking neural network with memristive nanodevices. IEEE Transactions on Nanotechnology 12, 3 (2013).
D. Soudry et al. 2015b. Memristor-based multilayer neural networks with online gradient descent training. IEEE Transactions on Neural Networks and Learning Systems 26, 10 (2015), 2408–2421.
D. Zhang et al. 2016c. All spin artificial neural networks based on compound spintronic synapse and neuron. IEEE Transactions on Biomedical Circuits and Systems 10, 4 (2016), 828–836.
D. Zhang et al. 2016d. Stochastic spintronic device based synapses and spiking neurons for neuromorphic computation. In Nanoscale Architectures (NANOARCH), 2016 IEEE/ACM International Symposium on. IEEE, 173–178.
F. Li et al. 2016e. Ternary weight networks. arXiv preprint arXiv:1605.04711 (2016).
H. Tomita et al. 2011. High-speed spin-transfer switching in GMR nanopillars with perpendicular anisotropy. IEEE Transactions on Magnetics 47, 6 (2011), 1599–1602.
J. Dean et al. 2012a. Large scale distributed deep networks. In Advances in Neural Information Processing Systems (NIPS). 1223–1231.
J. H. Lee et al. 2007. Defect-tolerant nanoelectronic pattern classifiers. International Journal of Circuit Theory and Applications 35, 3 (2007), 239–264.
J. Kim et al. 2015c. A technology-agnostic MTJ SPICE model with user-defined dimensions for STT-MRAM scalability studies. In Custom Integrated Circuits Conference (CICC), 2015 IEEE. IEEE, 1–4.
J. Misra et al. 2010b. Artificial neural networks in hardware: A survey of two decades of progress. Neurocomputing 74, 1 (2010), 239–255.
K. L. Wang et al. 2013b. Low-power non-volatile spintronic memory: STT-RAM and beyond. Journal of Physics D: Applied Physics 46, 7 (2013), 074003.
M. Courbariaux et al. 2015d. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in NIPS. 3123–3131.
M. Prezioso et al. 2015e. Training and operation of an integrated neuromorphic network based on metal-oxide memristors. Nature 521, 7550 (2015), 61–64.
M. Rastegari et al. 2016f. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision. Springer.
M. Suri et al. 2013c. Bio-inspired stochastic computing using binary CBRAM synapses. IEEE Transactions on Electron Devices 60, 7 (2013), 2402–2409.
R. Hasan et al. 2014. Enabling back propagation training of memristor crossbar neuromorphic processors. In Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE, 21–28.
R. Legenstein et al. 2005a. What can a neuron learn with spike-timing-dependent plasticity? Neural Computation 17, 11 (2005), 2337–2382.
S. Saïghi et al. 2015f. Plasticity in memristive devices for spiking neural networks. Frontiers in Neuroscience 9 (2015).
W. Senn et al. 2005b. Convergence of stochastic learning in perceptrons with binary synapses. Physical Review E 71, 6 (2005), 061907.
Y. LeCun et al. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
Y. Zhang et al. 2012b. Asymmetry of MTJ switching and its implication to STT-RAM designs. In Proceedings of the Conference on Design, Automation and Test in Europe.
Z. Li et al. 2003. Magnetization dynamics with a spin-transfer torque. Physical Review B 68, 2 (2003), 024404.
M. Lichman. 2013. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml