datasample

Randomly sample from data, with or without replacement

Syntax

y = datasample(data,k)
y = datasample(data,k,dim)
[y,idx] = datasample(data,k,...)
[y,...] = datasample(s,data,k,...)
[y,...] = datasample(data,k,Name,Value)
[y,...] = datasample(data,k,dim,Name,Value)

Description

y = datasample(data,k) returns k observations sampled uniformly at random, with replacement, from the data in data.

y = datasample(data,k,dim) returns a sample taken along dimension dim of data.

[y,idx] = datasample(data,k,...) also returns idx, an index vector indicating which elements datasample selected from data.

[y,...] = datasample(s,data,k,...) uses the random number stream s to generate random numbers.

[y,...] = datasample(data,k,Name,Value) or [y,...] = datasample(data,k,dim,Name,Value) samples with additional options specified by one or more Name,Value pair arguments.

Input Arguments

data

Vector, matrix, N-dimensional array, table, or dataset array representing the data from which to sample. By default, datasample regards the rows of a data matrix, or the first nonsingleton dimension of a data array, as data elements. Change this behavior with the dim argument.

k

Positive integer, the number of samples.

dim

Integer specifying the dimension on which to take samples. For example, if data is a matrix and dim is 2, y contains a selection of columns in data. If data is a table or dataset array and dim is 2, y contains a selection of variables in data. Use dim to ensure sampling along a specific dimension regardless of whether data is a vector, matrix or N-dimensional array.

Default: 1
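
As a small illustration of dim (the matrix X below is a made-up example, not part of the reference), sampling along dimension 2 selects whole columns, so the number of rows is unchanged:

```matlab
X = magic(4);                 % hypothetical 4-by-4 data matrix
Yrows = datasample(X,2);      % default dim = 1: two rows, size 2-by-4
Ycols = datasample(X,2,2);    % dim = 2: two columns, size 4-by-2
size(Yrows)
size(Ycols)
```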

s

Random number stream. Create s using RandStream.

Default: The global random number stream
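
A minimal sketch of passing a stream, assuming the 'mt19937ar' generator and an arbitrarily chosen seed of 1 (both are illustrative choices, not requirements). Two streams created in the same state should yield the same sample:

```matlab
s1 = RandStream('mt19937ar','Seed',1);   % seed value chosen arbitrarily
s2 = RandStream('mt19937ar','Seed',1);   % second stream in the same initial state
y1 = datasample(s1,1:100,5);
y2 = datasample(s2,1:100,5);
isequal(y1,y2)                           % same stream state, so same sample
```

Sampling from a private stream like this leaves the global random number stream untouched.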

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

'Replace'

Select the sample with replacement if Replace is true, or without replacement if Replace is false. If Replace is false, k must not be larger than the number of data elements in data.

Default: true
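
For instance, with a hypothetical 10-element vector, drawing k = 10 samples without replacement returns every element exactly once, in random order:

```matlab
data = 1:10;                                % hypothetical data vector
y = datasample(data,10,'Replace',false);    % k equals the number of elements
isequal(sort(y),data)                       % y is a permutation of data
```

Requesting k = 11 from this vector with Replace set to false would raise an error, because only 10 data elements are available.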

'Weights'

Vector with the same number of elements as data elements in data, and with nonnegative elements. Sample with probability proportional to the elements of Weights.

Default: ones(datasize,1), where datasize is the number of data elements in data
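
The following sketch, using made-up data and weights, illustrates that the weights only need to be proportional to the desired probabilities, and that an element with zero weight is never drawn:

```matlab
data = [10 20 30];                        % hypothetical data
w = [0 1 3];                              % hypothetical weights; need not sum to 1
y = datasample(data,1000,'Weights',w);
any(y == 10)                              % zero weight: 10 is never drawn
mean(y == 30)                             % roughly 0.75, up to sampling noise
```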

Output Arguments

y

  • If data is a vector, y is a vector containing k elements selected from data.

  • If data is a matrix, y is a matrix containing k rows selected from data. Or, if dim is 2, y is a matrix containing k columns selected from data.

  • If data is an N-dimensional array, datasample samples along its first nonsingleton dimension. Or, if you specify the dim argument, datasample samples along dimension dim.

When the sample is taken with replacement (default), y can contain repeated observations from data. Set the Replace name-value pair to false to sample without replacement.

idx

Vector of indices indicating which elements datasample chose from data to create y. For example:

  • If data is a vector, y = data(idx).

  • If data is a matrix, y = data(idx,:).
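
The relationship between y and idx can be checked directly. The sketch below uses a hypothetical random matrix X and also covers the dim = 2 case, where idx selects columns:

```matlab
X = rand(6,3);                     % hypothetical data matrix
[Y,idx] = datasample(X,4);
isequal(Y,X(idx,:))                % rows of Y are the rows of X listed in idx
[Yc,idc] = datasample(X,2,2);
isequal(Yc,X(:,idc))               % with dim = 2, the indices select columns
```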

Examples

Draw five unique values from the integers 1:10.

y = datasample(1:10,5,'Replace',false)

y =
     6     3     7     8     5
 

Generate a random sequence of the characters ACGT, with replacement, according to specified probabilities.

seq = datasample('ACGT',48,'Weights',[0.15 0.35 0.35 0.15])

seq =
CTTCGACTGTGAGTGGGCGCGACAAGGCTACCGGCCCGGGCGGCACTC
 

Select a random subset of columns from a data matrix.

X = randn(10,1000);
Y = datasample(X,5,2,'Replace',false)

Y =
    0.7007    0.3382    2.1298   -0.1891    0.5026
    0.6520   -0.6693   -0.1961   -0.9915    1.9107
    0.1785    0.6640    2.3247   -1.1735   -1.0020
    1.6760    2.6102   -0.8902   -0.7735    1.8676
   -0.3251   -0.6415   -0.2572   -0.1629   -1.0523
    0.1011    0.9323   -1.3088   -0.4477    0.8036
   -0.5767   -0.5778   -0.8556    0.8672   -0.0727
   -0.0615   -0.9084    0.9020   -0.4185   -1.9520
    0.7256   -1.1228    0.7558    1.2691    2.4997
   -1.2273    0.5754   -0.8755   -0.8224   -1.2066
 

Resample observations from a dataset array to create a bootstrap replicate dataset.

load hospital
y = datasample(hospital,size(hospital,1));
 

Use the second output to sample "in parallel" from two data vectors.

x1 = randn(100,1);
x2 = randn(100,1);
[y1,idx] = datasample(x1,10);
y2 = x2(idx);

Alternatives

You can use randi or randperm to generate indices for random sampling with or without replacement, respectively. However, datasample can be more convenient because it samples directly from your data. datasample also allows weighted sampling.
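
A minimal sketch of the index-based alternative, assuming a hypothetical column vector data. These lines draw from the same distributions as datasample(data,k) and datasample(data,k,'Replace',false), though not necessarily the same values for a given generator state:

```matlab
data = (101:110)';                     % hypothetical data vector
n = numel(data);
k = 4;
yWith    = data(randi(n,k,1));         % with replacement, indices from randi
yWithout = data(randperm(n,k));        % without replacement, indices from randperm
```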

More About

Tips

  • To sample random integers with replacement from a range, use randi.

  • To sample random integers without replacement, use randperm or datasample.

  • To randomly sample from data, with or without replacement, use datasample.

Algorithms

datasample uses randperm, rand, or randi to generate random values. Therefore, unless you supply a random number stream s, datasample changes the state of the MATLAB® global random number generator. Control the global random number generator using rng.
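
For example, resetting the global generator with rng before each call reproduces the same sample. The seed value 42 is an arbitrary choice:

```matlab
rng(42);                          % seed chosen arbitrarily
a = datasample(1:100,5);
rng(42);                          % restore the same generator state
b = datasample(1:100,5);
isequal(a,b)                      % identical state gives an identical sample
```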

For selecting weighted samples without replacement, datasample uses the algorithm of Wong and Easton [1].
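
To make the idea of weighted sampling without replacement concrete, here is a naive sequential sketch. It is not the Wong and Easton algorithm and is not taken from the datasample implementation. It draws one item at a time with probability proportional to the remaining weights, then removes that item. The weights and k are made-up values.

```matlab
% Naive sequential draw; NOT the Wong-Easton method used by datasample.
w = [0.5 0.3 0.2 0];              % hypothetical weights; last item is never eligible
k = 3;                            % must not exceed the number of nonzero weights
idx = zeros(1,k);
for i = 1:k
    c = cumsum(w);
    c = c / c(end);               % cumulative distribution over remaining items
    idx(i) = find(rand < c, 1);   % inverse-CDF draw
    w(idx(i)) = 0;                % remove the chosen item from later draws
end
idx
```

Each draw in this sketch costs time linear in the number of items, which is one reason a more efficient method is attractive for large data.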

References

[1] Wong, C. K. and M. C. Easton. An Efficient Method for Weighted Sampling Without Replacement. SIAM Journal on Computing 9(1), pp. 111–113, 1980.
