Training Products of Experts by Minimizing Contrastive Divergence

Geoffrey E. Hinton. DOI: 10.1162/089976602760128018. Citations: 390
It is possible to combine multiple latent-variable models of the same data by multiplying their probability distributions together and then renormalizing. This way of combining individual "expert" models makes it hard to generate samples from the combined model but easy to infer the values of the latent variables of each expert, because the combination rule ensures that the latent variables of different experts are conditionally independent when given the data. A product of experts (PoE) is therefore an interesting candidate for a perceptual system in which rapid inference is vital and generation is unnecessary. Training a PoE by maximizing the likelihood of the data is difficult because it is hard even to approximate the derivatives of the renormalization term in the combination rule. Fortunately, a PoE can be trained using a different objective function called "contrastive divergence" whose derivatives with regard to the parameters can be approximated accurately and efficiently. Examples are presented of contrastive divergence learning using several types of expert on several types of data.
Journal: Neural Computation (NECO), vol. 14, no. 8, pp. 1771-1800, 2002
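
For reference, the combination rule and the objective described in the abstract can be written compactly. In the paper's notation, d is a data vector, θ_m are the parameters of expert m, c ranges over all possible data vectors, Q^0 is the data distribution, Q^1 is the distribution after one full step of Gibbs sampling starting from the data, and Q^∞ is the equilibrium distribution of the product model:

    p(\mathbf{d} \mid \theta_1, \ldots, \theta_n) = \frac{\prod_m p_m(\mathbf{d} \mid \theta_m)}{\sum_{\mathbf{c}} \prod_m p_m(\mathbf{c} \mid \theta_m)}

    \mathrm{CD} = \mathrm{KL}(Q^0 \,\|\, Q^\infty) - \mathrm{KL}(Q^1 \,\|\, Q^\infty)

Maximum likelihood learning requires derivatives of the log of the normalizing sum over c, which is intractable. In the contrastive divergence objective those intractable terms cancel between the two divergences, leaving a gradient estimate that only needs samples from Q^0 and Q^1, up to a small remaining term (involving how Q^1 changes with the parameters) that the paper argues can be safely ignored.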
Citation contexts from citing publications (a minimal CD-1 code sketch follows the list):
    • ...Since RBMs are in the intersection between Boltzmann machines and product of experts models, they can be trained using contrastive divergence as described in [67]...

    George E. Dahl et al. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabular...

    • ...We extend the conventional shallow MLPs used in [12] to deep neural networks (DNNs) [13], which have been shown to have very good theoretical properties [14] and demonstrated superior performance for both phone [16][17] and word recognition [15][13][18][19]. A popular trick is to initialize the parameters of each layer greedily and generatively by treating each pair of layers in DNNs as a restricted Boltzmann machine (RBM) before doing a joint optimization of all the layers [14]. Because the expectation under the model distribution is extremely expensive to compute exactly, the contrastive divergence (CD) approximation to the gradient is used, where this expectation is replaced by running the Gibbs sampler initialized at the data for one full step [14]...

    Dong Yu et al. Boosting attribute and phone estimation accuracy with deep neural netw...

    • ...A popular trick is to initialize the parameters of each layer greedily and generatively by treating each pair of layers in DNNs as a restricted Boltzmann machine (RBM) before joint optimization of all the layers [11][12]...

    Dong Yu et al. Factorized Deep Neural Networks for Adaptive Speech Recognition

    • ...Given a collection of randomly sampled fixations, the first-layer RBM weights W and biases d, b can be trained using contrastive divergence (Hinton, 2002)...

    Misha Denil et al. Learning Where to Attend with Deep Architectures for Image Tracking

    • ...Due to this, SFA differs from both many well-known unsupervised feature extractors (Abut, 1990; Jolliffe, 1986; Comon, 1994; Lee & Seung, 1999; Kohonen, 2001; Hinton, 2002), which ignore dynamics, and other UL systems that both learn and apply features to sequences (Schmidhuber, 1992a, 1992b, 1992c; Lindstädt, 1993; Klapper-Rybicka, Schraudolph, & Schmidhuber, 2001; Jenkins & Matarić, 2004; Lee, Largman, Pham, & Ng, 2010; Gisslen, Luciw, Graziano, & Schmidhuber, 2011), thus assuming that the state of the system itself can depend on past information...

    Varun Raj Kompella et al. Incremental Slow Feature Analysis: Adaptive Low-Complexity Slow Featur...
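
Several of the contexts above describe training an RBM with one-step contrastive divergence (CD-1): sample the hidden units from the data, take one full Gibbs step, and use the difference between the data-driven and reconstruction-driven statistics as the gradient estimate. The following is a minimal NumPy sketch of that update, not code from the paper or any of the citing publications; the variable names, learning rate, and random binary data in the usage example are illustrative only.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b_v, b_h, v0, lr=0.01):
    """One CD-1 update for a binary RBM.

    W   : (n_visible, n_hidden) weight matrix
    b_v : (n_visible,) visible biases
    b_h : (n_hidden,)  hidden biases
    v0  : (batch, n_visible) batch of binary data vectors
    """
    # Positive phase: hidden probabilities and samples given the data.
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # One full Gibbs step: reconstruct the visibles, then recompute hidden probabilities.
    pv1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b_h)

    # CD-1 gradient estimate: data statistics minus one-step reconstruction statistics.
    batch = v0.shape[0]
    dW = (v0.T @ ph0 - v1.T @ ph1) / batch
    db_v = (v0 - v1).mean(axis=0)
    db_h = (ph0 - ph1).mean(axis=0)

    return W + lr * dW, b_v + lr * db_v, b_h + lr * db_h

# Illustrative usage on random binary data.
n_visible, n_hidden = 6, 4
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v = np.zeros(n_visible)
b_h = np.zeros(n_hidden)
data = (rng.random((32, n_visible)) < 0.5).astype(float)
for _ in range(100):
    W, b_v, b_h = cd1_update(W, b_v, b_h, data)

In the deep-network contexts quoted above, an update of this kind would be applied greedily to each pair of adjacent layers in turn, treating each pair as an RBM, before a joint fine-tuning of all layers.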
