Keywords (9): High Dimensional Data, Probabilistic Model, Probability Distribution, Satisfiability, Product of Experts, Conditional Independence, Latent Variable, Latent Variable Model, Objective Function
Related Publications (52)
Exponential Family Harmoniums with an Application to Information Retrieval
Fields of Experts: A Framework for Learning Image Priors
Energy-Based Models for Sparse Overcomplete Representations
A Fast Learning Algorithm for Deep Belief Nets
A learning algorithm for Boltzmann machines
Training Products of Experts by Minimizing Contrastive Divergence
(Citations: 390)
Geoffrey E. Hinton
It is possible to combine multiple latent-variable models of the same data by multiplying their probability distributions together and then renormalizing. This way of combining individual "expert" models makes it hard to generate samples from the combined model but easy to infer the values of the latent variables of each expert, because the combination rule ensures that the latent variables of different experts are conditionally independent when given the data. A product of experts (PoE) is therefore an interesting candidate for a perceptual system in which rapid inference is vital and generation is unnecessary. Training a PoE by maximizing the likelihood of the data is difficult because it is hard even to approximate the derivatives of the renormalization term in the combination rule. Fortunately, a PoE can be trained using a different objective function called "contrastive divergence," whose derivatives with regard to the parameters can be approximated accurately and efficiently. Examples are presented of contrastive divergence learning using several types of expert on several types of data.
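The combination rule described in the abstract (multiply the experts' distributions, then renormalize) can be illustrated for the discrete case. The two "experts" below are random stand-in distributions, not models from the paper; the state count and seed are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical "expert" distributions over the same 6 discrete states.
p1 = rng.dirichlet(np.ones(6))
p2 = rng.dirichlet(np.ones(6))

# Product of experts: multiply pointwise, then renormalize
# so the combined distribution sums to one.
poe = p1 * p2
poe /= poe.sum()
```

Each expert can veto states it considers implausible (near-zero probability), which is what makes the product much sharper than any mixture of the same experts.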
Journal: Neural Computation (NECO), vol. 14, no. 8, pp. 1771–1800, 2002
DOI: 10.1162/089976602760128018
Full text available from: dx.doi.org, www.cs.toronto.edu, www.cse.msu.edu, www.learning.cs.toronto.edu, www.cs.utoronto.ca, neco.mitpress.org, www.informatik.uni-trier.de, www.mitpressjournals.org
Citation Context (252)
...Since RBMs are in the intersection between Boltzmann machines and product of experts models, they can be trained using contrastive divergence as described in [67]...
George E. Dahl, et al.
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabular...
...We extend the conventional shallow MLPs used in [12] to deep neural networks (DNNs) [13], which have been shown to have very good theoretical properties [14] and demonstrated superior performance for both phone [16][17] and word recognition [15][13][18][19]. A popular trick is to initialize the parameters of each layer greedily and generatively by treating each pair of layers in DNNs as a restricted Boltzmann machine (RBM) before doing a joint optimization of all the layers [14]. Because the model-distribution expectation is extremely expensive to compute exactly, the contrastive divergence (CD) approximation to the gradient is used, where it is replaced by running the Gibbs sampler initialized at the data for one full step [14]...
Dong Yu, et al.
Boosting attribute and phone estimation accuracy with deep neural netw...
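The one-step Gibbs approximation quoted in these citation contexts can be sketched for a small binary RBM. This is an illustrative sketch only: the layer sizes, learning rate, and variable names are assumptions, not values from the paper or the citing works:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tiny RBM with assumed sizes: 6 visible units, 4 hidden units.
n_v, n_h = 6, 4
W = 0.01 * rng.standard_normal((n_v, n_h))
b = np.zeros(n_v)  # visible biases
c = np.zeros(n_h)  # hidden biases

def cd1_update(v0, lr=0.1):
    """One CD-1 step: positive statistics at the data, negative
    statistics after a single full Gibbs step started at the data."""
    global W, b, c
    # Positive phase: hidden probabilities and a binary sample.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(n_h) < ph0).astype(float)
    # One Gibbs step: reconstruct visibles, then hidden probabilities.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(n_v) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Approximate gradient: <v h>_data - <v h>_reconstruction.
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)

v = (rng.random(n_v) < 0.5).astype(float)  # one random binary "data" vector
cd1_update(v)
```

The key saving is that the intractable model expectation is replaced by statistics from the one-step reconstruction, so no equilibrium sampling is needed.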
...A popular trick is to initialize the parameters of each layer greedily and generatively by treating each pair of layers in DNNs as a restricted Boltzmann machine (RBM) before joint optimization of all the layers [11][12]...
Dong Yu, et al.
Factorized Deep Neural Networks for Adaptive Speech Recognition
...Given a collection of randomly sampled fixations, the first-layer RBM weights W and biases d, b can be trained using contrastive divergence (Hinton, 2002)...
Misha Denil, et al.
Learning Where to Attend with Deep Architectures for Image Tracking
...Due to this, SFA differs from both many well-known unsupervised feature extractors (Abut, 1990; Jolliffe, 1986; Comon, 1994; Lee & Seung, 1999; Kohonen, 2001; Hinton, 2002), which ignore dynamics, and other UL systems that both learn and apply features to sequences (Schmidhuber, 1992a, 1992b, 1992c; Lindstädt, 1993; Klapper-Rybicka, Schraudolph, & Schmidhuber, 2001; Jenkins & Matarić, 2004; Lee, Largman, Pham, & Ng, 2010; Gisslen, Luciw, Graziano, & Schmidhuber, 2011), thus assuming that the state of the system itself can depend on past information...
Varun Raj Kompella, et al.
Incremental Slow Feature Analysis: Adaptive Low-Complexity Slow Featur...
References (21)
Connectionist Learning of Belief Networks (Citations: 179)
Radford M. Neal
Journal: Artificial Intelligence (AI), vol. 56, no. 1, pp. 71–113, 1992

Using Generative Models for Handwritten Digit Recognition (Citations: 90)
Michael Revow, Christopher K. I. Williams, Geoffrey E. Hinton
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 18, no. 6, pp. 592–606, 1996

Attractor Dynamics in Feedforward Neural Networks (Citations: 7)
Lawrence K. Saul, Michael I. Jordan
Journal: Neural Computation (NECO), vol. 12, no. 6, pp. 1313–1335, 2000

Bias/Variance Decompositions for Likelihood-Based Estimators (Citations: 43)
Tom Heskes
Journal: Neural Computation (NECO), vol. 10, no. 6, pp. 1425–1433, 1998

A Maximum Entropy Approach to Natural Language Processing (Citations: 1333)
Adam L. Berger, Stephen Della Pietra, Vincent J. Della Pietra
Journal: Computational Linguistics (COLI), vol. 22, no. 1, pp. 39–71, 1996
Citations (390)
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition (Citations: 13)
George E. Dahl, Dong Yu, Li Deng, Alex Acero
Journal: IEEE Transactions on Audio, Speech & Language Processing (TASLP), vol. 20, no. 1, pp. 30–42, 2012

Boosting attribute and phone estimation accuracy with deep neural networks for detection-based speech recognition (Citations: 2)
Dong Yu, Sabato Marco Siniscalchi, Li Deng, Chin-Hui Lee
Published in 2012.

Factorized Deep Neural Networks for Adaptive Speech Recognition (Citations: 1)
Dong Yu, Xin Chen, Li Deng
Published in 2012.

Learning Where to Attend with Deep Architectures for Image Tracking
Misha Denil, Loris Bazzani, Hugo Larochelle, Nando de Freitas
Journal: Neural Computation (NECO), vol. 24, no. 8, pp. 2151–2184, 2012

Incremental Slow Feature Analysis: Adaptive Low-Complexity Slow Feature Updating from High-Dimensional Input Streams
Varun Raj Kompella, Matthew Luciw, Jürgen Schmidhuber
Journal: Neural Computation (NECO), vol. 24, no. 11, pp. 2994–3024, 2012