
gmdistribution.fit

Class: gmdistribution

Gaussian mixture parameter estimates

    Note:   fit will be removed in a future release. Use fitgmdist instead.
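
A minimal migration sketch, assuming the same data matrix X and component count k used below. Note that some option names differ in fitgmdist (for example, 'CovType' becomes 'CovarianceType'):

% Equivalent call with the replacement function:
obj = fitgmdist(X,k);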

Syntax

obj = gmdistribution.fit(X,k)
obj = gmdistribution.fit(...,param1,val1,param2,val2,...)

Description

obj = gmdistribution.fit(X,k) uses an Expectation Maximization (EM) algorithm to construct an object obj of the gmdistribution class containing maximum likelihood estimates of the parameters in a Gaussian mixture model with k components for data in the n-by-d matrix X, where n is the number of observations and d is the dimension of the data.

gmdistribution.fit treats NaN values as missing data. Rows of X with NaN values are excluded from the fit.
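
As a minimal sketch with synthetic data, the following compares a fit on data containing a NaN row with a fit on the same data after deleting that row; a one-component fit is used so the estimate does not depend on the random start:

X = randn(500,2);                 % synthetic bivariate data
Xnan = X;
Xnan(10,1) = NaN;                 % corrupt one entry of row 10

obj1 = gmdistribution.fit(Xnan,1);               % row 10 excluded automatically
obj2 = gmdistribution.fit(X([1:9 11:end],:),1);  % row 10 removed by hand
max(abs(obj1.mu - obj2.mu))       % difference is numerically negligible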

obj = gmdistribution.fit(...,param1,val1,param2,val2,...) provides control over the iterative EM algorithm. Parameters and values are listed below; a combined example follows the list.

'Start'

Method used to choose initial component parameters. One of the following:

  • 'randSample' — Select k observations from X at random as the initial component means. The mixing proportions are uniform, and the initial covariance matrices for all components are diagonal, where element j on the diagonal is the variance of X(:,j). This is the default.

  • S — A structure array with fields mu, Sigma, and PComponents. See gmdistribution for descriptions of values.

  • s — A vector of length n containing an initial guess of the component index for each point.

'Replicates'

A positive integer giving the number of times to repeat the EM algorithm, each time with a new set of initial parameters. The solution with the largest likelihood is returned. A value larger than 1 requires the 'randSample' start method. The default is 1.

'CovType'

'diagonal' if the covariance matrices are restricted to be diagonal; 'full' otherwise. The default is 'full'.

'SharedCov'

Logical true if all the covariance matrices are restricted to be the same (pooled estimate); logical false otherwise. The default is false.

'Regularize'

A nonnegative regularization number added to the diagonal of covariance matrices to make them positive-definite. The default is 0.

'Options'

Options structure for the iterative EM algorithm, as created by statset. gmdistribution.fit uses the parameters 'Display' with a default value of 'off', 'MaxIter' with a default value of 100, and 'TolFun' with a default value of 1e-6.
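
For example, a call combining several of these parameters might look like the following sketch; the particular values are illustrative, not recommendations:

options = statset('Display','iter','MaxIter',200,'TolFun',1e-6);
obj = gmdistribution.fit(X,2, ...
    'Replicates',5, ...          % repeat EM from five random starts
    'CovType','diagonal', ...    % restrict covariances to diagonal
    'Regularize',1e-5, ...       % small ridge on each covariance diagonal
    'Options',options);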

In some cases, gmdistribution.fit may converge to a solution where one or more of the components has an ill-conditioned or singular covariance matrix.

The following issues may result in an ill-conditioned covariance matrix:

  • The number of dimensions of your data is relatively high and there are not enough observations.

  • Some of the features (variables) of your data are highly correlated.

  • Some or all of the features are discrete.

  • You tried to fit the data to too many components.

In general, you can avoid getting ill-conditioned covariance matrices by using one of the following precautions (see the sketch after this list):

  • Pre-process your data to remove correlated features.

  • Set 'SharedCov' to true to use an equal covariance matrix for every component.

  • Set 'CovType' to 'diagonal'.

  • Use 'Regularize' to add a very small positive number to the diagonal of every covariance matrix.

  • Try another set of initial values.
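
Sketches of the covariance-related precautions as calls, assuming an n-by-d data matrix X; the regularization value is illustrative:

% Pooled (shared) covariance estimate across all components:
obj = gmdistribution.fit(X,2,'SharedCov',true);

% Diagonal covariance matrices:
obj = gmdistribution.fit(X,2,'CovType','diagonal');

% Small positive ridge added to every covariance diagonal:
obj = gmdistribution.fit(X,2,'Regularize',1e-5);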

In other cases, gmdistribution.fit may pass through an intermediate step where one or more of the components has an ill-conditioned covariance matrix. Trying another set of initial values may avoid this issue without altering your data or model.
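
Two ways to supply different initial values through the 'Start' parameter, sketched for a two-component, two-dimensional fit (the numeric values are placeholders):

% Seed the components from a k-means partition (vector of indices):
idx = kmeans(X,2);
obj = gmdistribution.fit(X,2,'Start',idx);

% Or pass an explicit parameter structure:
S.mu = [1 2; -3 -5];               % k-by-d initial means
S.Sigma = repmat(eye(2),[1 1 2]);  % d-by-d-by-k initial covariances
S.PComponents = [0.5 0.5];         % 1-by-k initial mixing proportions
obj = gmdistribution.fit(X,2,'Start',S);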

Examples

Generate data from a mixture of two bivariate Gaussian distributions using the mvnrnd function:

MU1 = [1 2];           % mean of the first component
SIGMA1 = [2 0; 0 .5];  % covariance of the first component
MU2 = [-3 -5];         % mean of the second component
SIGMA2 = [1 0; 0 1];   % covariance of the second component
X = [mvnrnd(MU1,SIGMA1,1000); mvnrnd(MU2,SIGMA2,1000)];

scatter(X(:,1),X(:,2),10,'.')
hold on

Next, fit a two-component Gaussian mixture model:

options = statset('Display','final');
obj = gmdistribution.fit(X,2,'Options',options);
10 iterations, log-likelihood = -7046.78

h = ezcontour(@(x,y)pdf(obj,[x y]),[-8 6],[-8 6]);
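
To mark the estimated component means on the same axes, one possible addition (assuming obj from the fit above):

plot(obj.mu(:,1),obj.mu(:,2),'kx','LineWidth',2,'MarkerSize',10)
hold off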

Among the properties of the fit are the parameter estimates:

ComponentMeans = obj.mu
ComponentMeans =
    0.9391    2.0322
   -2.9823   -4.9737

ComponentCovariances = obj.Sigma
ComponentCovariances(:,:,1) =
    1.7786   -0.0528
   -0.0528    0.5312
ComponentCovariances(:,:,2) =
    1.0491   -0.0150
   -0.0150    0.9816

MixtureProportions = obj.PComponents
MixtureProportions =
    0.5000    0.5000

The Akaike information criterion (AIC) is minimized by the two-component model:

AIC = zeros(1,4);
obj = cell(1,4);
for k = 1:4
    obj{k} = gmdistribution.fit(X,k);
    AIC(k) = obj{k}.AIC;
end

[minAIC,numComponents] = min(AIC);
numComponents
numComponents =
     2

model = obj{2}
model = 
Gaussian mixture distribution
with 2 components in 2 dimensions
Component 1:
Mixing proportion: 0.500000
Mean:     0.9391    2.0322
Component 2:
Mixing proportion: 0.500000
Mean:    -2.9823   -4.9737

Both the Akaike and Bayes information criteria are penalized negative log-likelihoods for the data, with penalty terms for the number of estimated parameters. They are often used to determine an appropriate number of components for a model when that number is unspecified.
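
The Bayes information criterion is available through the BIC property of the fitted object, so the AIC loop above carries over directly; a sketch:

BIC = zeros(1,4);
obj = cell(1,4);
for k = 1:4
    obj{k} = gmdistribution.fit(X,k);
    BIC(k) = obj{k}.BIC;   % 2*NlogL + (number of parameters)*log(n)
end
[minBIC,numComponents] = min(BIC);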


See Also

fitgmdist | gmdistribution
