BioMap class

Superclasses: BioRead

Contain sequence, quality, alignment, and mapping data

Description

The BioMap class contains data from short-read sequences, including sequence headers, read sequences, quality scores for the sequences, and data about how each sequence aligns to a given reference. This data is typically obtained from a high-throughput sequencing instrument.

Construct a BioMap object from short-read sequence data. Each element in the object has a sequence, header, quality score, and alignment/mapping information associated with it. Use the object properties and methods to explore, access, filter, and manipulate all or a subset of the data, before analyzing or viewing the data.

Construction

BioMapobj = BioMap constructs BioMapobj, which is an empty BioMap object.

BioMapobj = BioMap(File) constructs BioMapobj, a BioMap object, from File, a SAM- or BAM-formatted file whose reads are ordered by start position in the reference sequence. The data remains in the source file, and the BioMap object accesses it using one or two auxiliary index files. For a SAM-formatted file, MATLAB® uses or creates one index file that must have the same name as the source file, but with an .idx extension. For a BAM-formatted file, MATLAB uses or creates two index files that must have the same name as the source file, but with *.bai and *.linearindex extensions. If the index files are not found in the same folder as the source file, the BioMap constructor function creates the index files in that folder.

When you pass in an unordered BAM-formatted file, the constructor automatically orders the reads and writes them to a new file that has the same base name and extension, with “.ordered” inserted before the extension. The constructor then indexes the new file and uses it to instantiate the BioMap object.

Note

Because the data remains in the source file and is accessed using the index files:

  • Do not delete the source file (SAM or BAM).

  • Do not delete the index files (*.idx, *.bai, or *.linearindex).

  • You cannot modify BioMapobj properties.

Tip

To determine the number of reference sequences included in your source file, use the saminfo or baminfo function. Use SAMtools to check if the reads in your source file are ordered by position in the reference sequence, and also to reorder them, if needed.
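The tip above can be sketched as follows for the ex1.sam file that ships with the toolbox. This is a minimal sketch that assumes the 'ScanDictionary' option of saminfo and the ScannedDictionary field of its output structure, as described in the saminfo reference page; the variable names are illustrative only.

```matlab
% Scan the SAM file for the reference names that its reads are mapped to.
% Assumption: 'ScanDictionary' and the ScannedDictionary field behave as
% described in the saminfo reference page.
info = saminfo('ex1.sam', 'ScanDictionary', true);

refNames = info.ScannedDictionary;   % names of the references found in the file
numRefs  = numel(refNames);          % number of reference sequences

fprintf('Found %d reference sequence(s).\n', numRefs);
```

For a BAM-formatted file, baminfo can be used in the same way.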

BioMapobj = BioMap(Struct) constructs BioMapobj, a BioMap object, from Struct, a MATLAB structure containing sequence and alignment information, such as returned by the samread or bamread function. The data from Struct remains in memory, which lets you modify the BioMapobj properties.

BioMapobj = BioMap(___,'Name',Value) constructs the BioMap object using any of the previous input arguments and additional options, specified as name-value pair arguments as follows.

BioMapobj = BioMap(___,'SelectReference',SelectRefValue) selects one or more references when the source data contains sequences mapped to more than one reference. By default, the constructor includes all of the references in the header dictionary of the source file. When the header dictionary is not available, the constructor defaults to including all reference names found in the source data. SelectRefValue is a character vector, string, string vector, or cell array of character vectors. By using this option, you can prevent the BioMap constructor from creating auxiliary index files for references that you will not use in your analysis. If any reads mapped to selected references are paired and BioMapobj is written to a file, the reference sequences of the mates are also included in the file header.

BioMapobj = BioMap(File,'InMemory',InMemoryValue) specifies whether to place the data in memory or leave the data in the source file. Leaving the data in the source file and accessing via an index file is more memory efficient, but does not let you modify properties of BioMapobj. Choices are true or false (default). If the first input argument is not a file name, then this name-value pair argument is ignored, and the data is automatically placed in memory.

Tip

Set the 'InMemory' name-value pair argument to true if you want to modify the properties of BioMapobj.
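As a minimal sketch of the tip above, the following loads ex1.sam into memory and then changes one mapping quality score with setMappingQuality. The value 60 and the choice of the first read are arbitrary and only for illustration.

```matlab
% Load the reads into memory so that the object's properties can be modified.
bm = BioMap('ex1.sam', 'InMemory', true);

% Overwrite the mapping quality of the first read (60 is an arbitrary value).
% BioMap is a value class, so the modified object must be assigned back.
bm = setMappingQuality(bm, uint8(60), 1);

% Read the value back to confirm the change.
mq1 = getMappingQuality(bm, 1);
```

With the default 'InMemory' value of false, the same object would be file indexed and its properties could not be modified.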

BioMapobj = BioMap(___,'IndexDir',IndexDirValue) specifies the path to the folder where the index files (*.idx, *.bai, or *.linearindex) either exist or will be created.

Tip

Use the 'IndexDir' name-value pair argument if you do not have write access to the folder where the source file is located.
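For instance, the index file can be directed to the system temporary folder. This sketch uses the MATLAB tempdir function as a stand-in for any folder where you have write access; that choice is an assumption made for illustration.

```matlab
% Folder in which the index file is looked up or created.
% tempdir is used here only as an example of a writable location.
indexFolder = tempdir;

% The source file stays where it is; only the index file goes to indexFolder.
bm = BioMap('ex1.sam', 'IndexDir', indexFolder);

nReads = bm.NSeqs;   % number of reads accessible through the index
```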

BioMapobj = BioMap(___,'Sequence',SequenceValue) constructs BioMapobj, a BioMap object, from SequenceValue that contains the letter representations of nucleotide sequences. This name-value pair works only if the data is read into memory.

BioMapobj = BioMap(___,'Header',HeaderValue) constructs BioMapobj, a BioMap object, from HeaderValue that contains header text for nucleotide sequences. This name-value pair works only if the data is read into memory.

BioMapobj = BioMap(___,'Quality',QualityValue) constructs BioMapobj, a BioMap object, from QualityValue that contains the ASCII representation of per-base quality scores for nucleotide sequences. This name-value pair works only if the data is read into memory.

BioMapobj = BioMap(___,'Reference',ReferenceValue) constructs BioMapobj, a BioMap object, and sets the Reference property to ReferenceValue that contains the names of the reference sequences. This name-value pair works only if the data is read into memory.

BioMapobj = BioMap(___,'Signature',SignatureValue) constructs BioMapobj, a BioMap object, from SignatureValue that contains information describing the alignment of each read sequence with the reference sequence. This name-value pair works only if the data is read into memory.

BioMapobj = BioMap(___,'Start',StartValue) constructs BioMapobj, a BioMap object, from StartValue, a vector of positive integers specifying the position in the reference sequence where the alignment of each read sequence starts. This name-value pair works only if the data is read into memory.

BioMapobj = BioMap(___,'Flag',FlagValue) constructs BioMapobj, a BioMap object, from FlagValue, a vector of positive integers indicating the bit-wise information for the status of the 11 flags specified by the SAM format specification. These flags describe different sequencing and alignment aspects of the read sequences. This name-value pair works only if the data is read into memory.

BioMapobj = BioMap(___,'MappingQuality',MappingQualityValue) constructs BioMapobj, a BioMap object, from MappingQualityValue, a vector of positive integers specifying the mapping quality for each read sequence. This name-value pair works only if the data is read into memory.

BioMapobj = BioMap(___,'MatePosition',MatePositionValue) constructs BioMapobj, a BioMap object, from MatePositionValue, a vector of nonnegative integers specifying the mate position for each read sequence. This name-value pair works only if the data is read into memory.

Input Arguments

File

Character vector or string specifying a SAM- or BAM-formatted file that contains only one reference sequence and whose reads are ordered by start position in the reference sequence.

Struct

MATLAB structure containing sequence and alignment information, such as returned by the samread or bamread function. The structure must have a one-based start position.

SelectRefValue

Character vector, string, string vector, or cell array of character vectors specifying the name of the reference sequences in File or Struct. Use saminfo or baminfo to see a complete list of reference sequences in File.

InMemoryValue

Logical specifying whether to place the data in memory or leave the data in the source file. Leaving the data in the source file and accessing it via an index file is more memory efficient, but does not let you modify properties of the BioMap object. If the first input argument is not a file name, then this name-value pair argument is ignored, and the data is automatically placed in memory.

Default: false

IndexDirValue

Character vector or string specifying the path to the folder where the index files either exist or will be created.

Default: Folder where File is located

SequenceValue

String vector or cell array of character vectors containing the letter representations of nucleotide sequences. This information populates the BioMap object's Sequence property. The samread and bamread functions return this information in the Sequence field of the output structure.

QualityValue

String vector or cell array of character vectors containing the ASCII representation of per-base quality scores for nucleotide sequences. This information populates the BioMap object's Quality property. The samread and bamread functions return this information in the Quality field of the output structure.

HeaderValue

String vector or cell array of character vectors containing header text for nucleotide sequences. This information populates the BioMap object's Header property. The samread and bamread functions return this information in the QueryName field of the return structure.

NameValue

Character vector or string describing the BioMap object. This information populates the object's Name property.

Default: '', an empty character vector

ReferenceValue

String vector or cell array of character vectors containing the names of the reference sequences. This information populates the object's Reference property. The samread function returns this information in the ReferenceName field of the SAMStruct output argument. The bamread function returns this information in the Reference field of the HeaderStruct output structure.

SignatureValue

String vector or cell array of character vectors containing information describing the alignment of each read sequence with the reference sequence. The samread and bamread functions return this information in the CigarString field of the return structure. This information populates the object's Signature property.

StartValue

Vector of positive integers specifying the position in the reference sequence where the alignment of each read sequence starts. This information populates the object's Start property. The samread and bamread functions return this information in the Position field of the output structure.

FlagValue

Vector of positive integers indicating the bit-wise information for the status of the 11 flags specified by the SAM format specification. These flags describe different sequencing and alignment aspects of the read sequences. This information populates the object's Flag property. The samread and bamread functions return this information in the Flag field of the output structure.

MappingQualityValue

Vector of positive integers specifying the mapping quality for each read sequence. This information populates the object's MappingQuality property. The samread and bamread functions return this information in the MappingQuality field of the output structure.

MatePositionValue

Vector of nonnegative integers specifying the mate position for each read sequence. This information populates the object's MatePosition property. The samread and bamread functions return this information in the MatePosition field of the output structure.

Properties

Flag

Flags associated with all read sequences represented in the BioMap object.

Vector of positive integers such that there is an integer for each read sequence in the object. Each integer indicates the bit-wise information that specifies the status of the 11 flags described by the SAM format specification. These flags describe different sequencing and alignment aspects of a read sequence. A one-to-one relationship exists between the number and order of elements in Flag and Sequence, unless Flag is an empty vector.
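To illustrate how one integer encodes several flags, the sketch below decodes a few hand-picked flag values with bitget. The bit meanings (0x1 paired, 0x4 unmapped, 0x10 reverse strand) come from the SAM format specification; the sample values are made up for the example and do not come from any file.

```matlab
% Example flag values (illustrative, not taken from a real file).
flags = uint16([99 147 16 4]);

% bitget uses one-based bit positions: bit 1 is 0x1, bit 3 is 0x4, bit 5 is 0x10.
isPaired   = logical(bitget(flags, 1));   % read is paired in sequencing
isUnmapped = logical(bitget(flags, 3));   % read itself is unmapped
isReverse  = logical(bitget(flags, 5));   % read maps to the reverse strand

disp(table(flags', isPaired', isUnmapped', isReverse', ...
    'VariableNames', {'Flag', 'Paired', 'Unmapped', 'Reverse'}));
```

The same decoding applies to the vector returned by getFlag. For common queries, the filterByFlag method avoids manual bit arithmetic.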

Header

Headers associated with all read sequences represented in the BioMap object.

Cell array of character vectors, such that there is a header for each read sequence in the object. Headers can be empty. A one-to-one relationship exists between the number and order of elements in Header and Sequence, unless Header is an empty cell array.

MatePosition

Positions of the mates for all read sequences represented in the BioMap object.

Vector of nonnegative integers such that there is an integer for each read sequence in the object. Each integer indicates the position of the corresponding mate sequence, relative to the reference sequence. A one-to-one relationship exists between the number and order of elements in MatePosition and Sequence, unless MatePosition is an empty vector.

Not all values in the MatePosition vector represent valid mate positions. For example, a mate that maps to a different reference sequence or a mate that does not map has no valid position. To determine if a mate position is valid, use the filterByFlag method with the 'pairedInMap' flag.
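A minimal sketch of that check, using ex1.sam: filterByFlag returns a logical index with one element per read, which can then be used to keep only the mate positions of reads flagged as mapped in a proper pair. The data is loaded into memory here so that MatePosition is an ordinary numeric vector; the variable names are illustrative.

```matlab
% Load into memory so that property values are plain MATLAB arrays.
bm = BioMap('ex1.sam', 'InMemory', true);

% Logical index of reads whose 'pairedInMap' flag is set.
isProperPair = filterByFlag(bm, 'pairedInMap', true);

% Keep only the mate positions that belong to those reads.
matePos      = bm.MatePosition;
validMatePos = matePos(isProperPair);

fprintf('%d of %d reads have the pairedInMap flag set.\n', ...
    nnz(isProperPair), bm.NSeqs);
```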

MappingQuality

Mapping quality scores associated with all read sequences represented in the BioMap object.

Vector of integers, such that there is a mapping quality score for each read sequence in the object. A one-to-one relationship exists between the number and order of elements in MappingQuality and Sequence, unless MappingQuality is an empty vector.

Name

Description of the BioMap object.

Character vector describing the BioMap object.

Default: '', an empty character vector

NSeqs

Number of sequences in the BioMap object.

This information is read-only.

Quality

Per-base quality scores associated with all read sequences represented in the BioMap object.

Cell array of character vectors, such that there is a quality for each read sequence in the object. Each quality is an ASCII representation of per-base quality scores for a read sequence. Quality can be an empty character vector. A one-to-one relationship exists between the number and order of elements in Quality and Sequence, unless Quality is an empty cell array.

Reference

Reference sequences in the BioMap object.

BioMapobj.NSeqs-by-1 cell array of character vectors specifying the names of the reference sequences.

The reference sequences are the sequences against which the read sequences are aligned.

Sequence

Read sequences in the BioMap object.

Cell array of character vectors containing the letter representations of the read sequences.

SequenceDictionary

Cell array of character vectors that catalogs the names of the references available in the BioMap object.

This information is read-only.

Signature

Alignment information associated with all read sequences represented in the BioMap object.

Cell array of CIGAR-formatted character vectors, such that there is alignment information for each read sequence in the object. Each character vector represents how a read sequence aligns to the reference sequence. Signatures can be empty character vectors. A one-to-one relationship exists between the number and order of elements in Signature and Sequence, unless Signature is an empty cell array.

Start

Start positions of all aligned read sequences represented in the BioMap object.

Vector of integers, such that there is a start position for each read sequence in the object. Each integer specifies the start position of the aligned read sequence with respect to the position numbers in the reference sequence. A one-to-one relationship exists between the number and order of elements in Start and Sequence, unless Start is an empty vector.
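The Start and Signature properties together determine where an alignment ends on the reference. The following sketch works through that arithmetic for a single made-up read, using the SAM specification rule that only the M, D, N, = and X operations consume reference positions. It illustrates the idea only and is not the implementation of the getStop method; the CIGAR string and start position are invented.

```matlab
% Made-up alignment: start position and CIGAR string.
startPos = 100;
cigar    = '3S10M2I5M1D4M';

% Split the CIGAR string into (length, operation) pairs.
tokens = regexp(cigar, '(\d+)([MIDNSHP=X])', 'tokens');
lens   = cellfun(@(t) str2double(t{1}), tokens);
ops    = cellfun(@(t) t{2}, tokens);           % char row vector, one op per pair

% Sum the lengths of the operations that advance along the reference.
refLen  = sum(lens(ismember(ops, 'MDN=X')));   % 10 + 5 + 1 + 4 = 20
stopPos = startPos + refLen - 1;               % 119
```

Soft clips (S) and insertions (I) appear in the read but not in the reference, so they do not move the stop position.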

Methods

filterByFlag           Filter sequence reads by SAM flag
getAlignment           Construct alignment represented in BioMap object
getBaseCoverage        Return base-by-base alignment coverage of reference sequence in BioMap object
getCompactAlignment    Construct compact alignment represented in BioMap object
getCounts              Return count of read sequences aligned to reference sequence in BioMap object
getFlag                Retrieve read sequence flags from BioMap object
getIndex               Return indices of read sequences aligned to reference sequence in BioMap object
getInfo                Retrieve information for single element of BioMap object
getMappingQuality      Retrieve sequence mapping quality scores from BioMap object
getReference           Retrieve reference sequence from BioMap object
getSignature           Retrieve signature (alignment information) from BioMap object
getStart               Retrieve start positions of aligned read sequences from BioMap object
getStop                Compute stop positions of aligned read sequences from BioMap object
getSummary             Print summary of BioMap object
setFlag                Set read sequence flags for BioMap object
setMappingQuality      Set sequence mapping quality scores for BioMap object
setReference           Set name of reference sequence for BioMap object
setSignature           Set signature (alignment information) for BioMap object
setStart               Set start positions of aligned read sequences in BioMap object
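
As a brief, hedged illustration of how a few of these methods fit together, the sketch below queries coverage, read counts, and stop positions for ex1.sam. The range 1 to 100 is an arbitrary choice for the example.

```matlab
bm = BioMap('ex1.sam');

% Per-base coverage for reference positions 1 through 100.
cov = getBaseCoverage(bm, 1, 100);

% Number of reads that align within that same range.
nReads = getCounts(bm, 1, 100);

% Stop position of every aligned read in the object.
stops = getStop(bm);

fprintf('Maximum coverage in 1..100: %d (from %d reads).\n', max(cov), nReads);
```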

Inherited Methods

combine           Combine two objects
get               Retrieve property of object
getHeader         Retrieve sequence headers from object
getQuality        Retrieve sequence quality information from object
getSequence       Retrieve sequences from object
getSubsequence    Retrieve partial sequences from object
getSubset         Retrieve subset of elements from object
set               Set property of object
setHeader         Update header information of reads
setQuality        Update quality information
setSequence       Update read sequences
setSubsequence    Update partial sequences
setSubset         Update elements of object
write             Write contents of BioRead or BioMap object to file
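
The inherited methods work on BioMap objects in the same way. A minimal sketch, assuming ex1.sam and an arbitrary choice of the first ten reads:

```matlab
bm = BioMap('ex1.sam');

% Extract the first ten reads into a new BioMap object.
firstTen = getSubset(bm, 1:10);

% Retrieve the headers of the reads in the subset.
headers = getHeader(firstTen);

fprintf('Subset contains %d reads.\n', firstTen.NSeqs);
```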

Copy Semantics

Value. To learn how value classes affect copy operations, see Copying Objects in the MATLAB Programming Fundamentals documentation.

Indexing

BioMap objects support dot . indexing to extract, assign, and delete data.

Examples

This example shows how to construct a BioMap object from a SAM file and from a structure.

Construct a BioMap object from a SAM-formatted file that is provided with Bioinformatics Toolbox™ and set the Name property.

BMObj1 = BioMap('ex1.sam', 'Name', 'MyObject')
BMObj1 = 
  BioMap with properties:

    SequenceDictionary: 'seq1'
             Reference: [1501x1 File indexed property]
             Signature: [1501x1 File indexed property]
                 Start: [1501x1 File indexed property]
        MappingQuality: [1501x1 File indexed property]
                  Flag: [1501x1 File indexed property]
          MatePosition: [1501x1 File indexed property]
               Quality: [1501x1 File indexed property]
              Sequence: [1501x1 File indexed property]
                Header: [1501x1 File indexed property]
                 NSeqs: 1501
                  Name: 'MyObject'


Construct a structure containing information from a SAM file.

SAMStruct = samread('ex1.sam');

Construct a BioMap object from this structure.

BMObj2 = BioMap(SAMStruct)
BMObj2 = 
  BioMap with properties:

    SequenceDictionary: {'seq1'}
             Reference: {1501x1 cell}
             Signature: {1501x1 cell}
                 Start: [1501x1 uint32]
        MappingQuality: [1501x1 uint8]
                  Flag: [1501x1 uint16]
          MatePosition: [1501x1 uint32]
               Quality: {1501x1 cell}
              Sequence: {1501x1 cell}
                Header: {1501x1 cell}
                 NSeqs: 1501
                  Name: ''