BioIndexedFile Class

Allow quick and efficient access to large text files with nonuniform-size entries

Description

The BioIndexedFile class allows access to text files with nonuniform-size entries, such as sequences, annotations, and cross-references to data sets. It lets you quickly and efficiently access this data without loading the source file into memory.

This class lets you access individual entries or a subset of entries when the source file is too big to fit into memory. You can access entries using indices or keys. You can read and parse one or more entries using provided interpreters or a custom interpreter function.

Construction

BioIFobj = BioIndexedFile(Format,SourceFile) returns a BioIndexedFile object BioIFobj that indexes the contents of SourceFile following the parsing rules defined by Format, where SourceFile and Format specify the names of a text file and a file format, respectively. It also constructs an auxiliary index file to store information that allows efficient, direct access to SourceFile. The index file by default is stored in the same location as the source file and has the same name as the source file, but with an IDX extension. The BioIndexedFile constructor uses the index file to construct subsequent objects from SourceFile, which saves time.

BioIFobj = BioIndexedFile(Format,SourceFile,IndexDir) returns a BioIndexedFile object BioIFobj by specifying the relative or absolute path to a folder to use when searching for or saving the index file.

BioIFobj = BioIndexedFile(Format,SourceFile,IndexFile) returns a BioIndexedFile object BioIFobj by specifying a file name, optionally including a relative or absolute path, to use when searching for or saving the index file.

BioIFobj = BioIndexedFile(___,Name,Value) returns a BioIndexedFile object BioIFobj by using any input arguments from the previous syntaxes and additional options, specified as one or more Name,Value pair arguments.
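
The index-location syntaxes above can be sketched as follows. This is a minimal illustration, not a definitive recipe: it assumes Bioinformatics Toolbox is installed and that the example file `yeastgenes.sgd` is on the MATLAB path. The index file name `yeastgenes_custom.idx` and the use of `tempdir` are choices made for this sketch only.

```matlab
% Locate the example source file (assumed to be on the MATLAB path).
sourcefile = which('yeastgenes.sgd');

% IndexDir syntax: keep the index file in a folder of your choice,
% for example when the folder holding the source file is read-only.
objA = BioIndexedFile('mrtab', sourcefile, tempdir, ...
    'KeyColumn', 3, 'HeaderPrefix', '!', 'Verbose', false);

% IndexFile syntax: name the index file explicitly (hypothetical name).
idxFile = fullfile(tempdir, 'yeastgenes_custom.idx');
objB = BioIndexedFile('mrtab', sourcefile, idxFile, ...
    'KeyColumn', 3, 'HeaderPrefix', '!', 'Verbose', false);

% Confirm where each index file was stored.
disp(objA.IndexFile)
disp(objB.IndexFile)
```

Running the same commands a second time should reuse the stored index files rather than rebuild them, which is where the time saving comes from.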

Input Arguments

`Format`	Character vector or string specifying a file format. Choices are:
`'SAM'`: SAM-formatted file
`'FASTQ'`: FASTQ-formatted file
`'FASTA'`: FASTA-formatted file
`'TABLE'`: Tab-delimited table with multiple columns. Keys can be in any column. Rows with the same key are considered separate entries.
`'MRTAB'`: Tab-delimited table with multiple columns. Keys can be in any column. Contiguous rows with the same key are considered a single entry. Noncontiguous rows with the same key are considered separate entries.
`'FLAT'`: Flat file with concatenated entries separated by a character vector, typically `'//'`. Within an entry, the key is separated from the rest of the entry by a white space.
Note: For all file formats, the file contents must use only ASCII text characters. Non-ASCII characters might not be indexed properly.
`SourceFile`	Character vector or string specifying the name of a text file. It can include a relative or absolute path.
`IndexDir`	Character vector or string specifying the relative or absolute path to a folder to use when searching for or saving the index file.
`IndexFile`	Character vector or string specifying a file name, optionally including a relative or absolute path, to use when searching for or saving the index file.
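
The difference between `'TABLE'` and `'MRTAB'` is easiest to see on a small file. The sketch below assumes Bioinformatics Toolbox is installed; the file names and data are hypothetical and exist only for illustration. Each object gets its own index file so that the two formats do not share an index.

```matlab
% Write a small tab-delimited file (hypothetical data). The key geneA
% appears on two contiguous rows and again on a later, noncontiguous row.
src = fullfile(tempdir, 'demo_keys.txt');
fid = fopen(src, 'w');
fprintf(fid, 'geneA\t1\ngeneA\t2\ngeneB\t3\ngeneA\t4\n');
fclose(fid);

% Index the same file under both formats, using separate index files.
tObj = BioIndexedFile('table', src, ...
    fullfile(tempdir, 'demo_keys_table.idx'), 'Verbose', false);
mObj = BioIndexedFile('mrtab', src, ...
    fullfile(tempdir, 'demo_keys_mrtab.idx'), 'Verbose', false);

% Compare how many entries each format reports.
fprintf('TABLE entries: %d\n', tObj.NumEntries);
fprintf('MRTAB entries: %d\n', mObj.NumEntries);
```

Going by the format descriptions above, the `'MRTAB'` object should report fewer entries than the `'TABLE'` object, because the two contiguous `geneA` rows are merged into one entry while the later `geneA` row stays separate.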

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

`IndexedByKeys`	Specifies if you can access the object `BioIFobj` using keys. Choices are `true` or `false`. Tip Set the value to `false` if you do not need to access entries in the object using keys. Doing so saves time and space when creating the object. Default: `true`
`MemoryMappedIndex`	Specifies whether the constructor stores the indices in the auxiliary index file and accesses them via memory maps (`true`) or loads the indices into memory at construction time (`false`). Tip If memory is not an issue and you want to maximize performance when accessing entries in the object, set the value to `false`. Default: `true`
`Interpreter`	Handle to a function that the `read` method uses when parsing entries from the source file. The interpreter function must accept a character vector of one or more concatenated entries and return a structure or an array of structures containing the interpreted data. When `Format` is a general-purpose format such as `'TABLE'`, `'MRTAB'`, or `'FLAT'`, the default is `[]`, which means that `read` returns the entries unchanged, as if the interpreter were an identity function. When `Format` is an application-specific format such as `'SAM'`, `'FASTQ'`, or `'FASTA'`, the default is a function handle appropriate for that file type, and you typically do not need to change it.
`Verbose`	Controls the display of the status of the object construction. Choices are `true` or `false`. Default: `true`
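
As an illustration of these options, the sketch below builds an index-only object with a custom interpreter. It assumes Bioinformatics Toolbox is installed. The file name, the data, and the `parseRows` handle are hypothetical, and the interpreter shown is just one possible way to satisfy the "character vector in, structure out" contract.

```matlab
% Write a small tab-delimited file (hypothetical data).
src = fullfile(tempdir, 'demo_scores.txt');
fid = fopen(src, 'w');
fprintf(fid, 'geneA\t0.5\ngeneB\t1.5\ngeneC\t2.5\n');
fclose(fid);

% Hypothetical interpreter: parse one or more concatenated rows into a
% structure with the fields Name and Score.
parseRows = @(txt) cell2struct( ...
    textscan(txt, '%s%f', 'Delimiter', '\t'), {'Name', 'Score'}, 2);

% Skip key indexing and load the indices into memory at construction.
obj = BioIndexedFile('table', src, ...
    fullfile(tempdir, 'demo_scores.idx'), ...
    'IndexedByKeys', false, 'MemoryMappedIndex', false, ...
    'Interpreter', parseRows, 'Verbose', false);

raw = getEntryByIndex(obj, 2);   % unparsed text of the second entry
parsed = read(obj, [1 3]);       % entries 1 and 3, parsed by parseRows
```

With `IndexedByKeys` set to `false`, only index-based access is available, so the sketch uses numeric indices throughout.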

Note

The following name-value pair arguments apply only when both of the following are true:

There is no pre-existing index file associated with your source file.
Your source file has a general-purpose format such as 'TABLE', 'MRTAB', or 'FLAT'.

For source files with application-specific formats, the following name-value pairs are pre-defined and you cannot change them.

`KeyColumn`	Positive integer specifying the column in the `'TABLE'` or `'MRTAB'` file that contains the keys. Default: `1`
`KeyToken`	Character vector or string that occurs in each entry before the key, for `'FLAT'` files that contain keys. If the value is `' '`, it indicates the key is the first character vector (or string) in each entry and is delimited by blank spaces. Default: `' '`
`HeaderPrefix`	Character vector or string specifying a prefix that denotes header lines in the source file so the constructor ignores them when creating the object. If the value is `[]`, it means the constructor does not check for header lines in the source file. Default: `[]`
`CommentPrefix`	Character vector or string specifying a prefix that denotes comment lines in the source file so the constructor ignores them when creating the object. If the value is `[]`, it means the constructor does not check for comment lines in the source file. Default: `[]`
`ContiguousEntries`	Specifies whether the entries in the source file are on contiguous lines, that is, not separated by empty lines or comment lines. Choices are `true` or `false`. Tip Set the value to `true` when entries are not separated by empty lines or comment lines. Doing so saves time and space when creating the object. Default: `false`
`TableDelimiter`	Character vector or string specifying a delimiter symbol to use as a column separator for `SourceFile` when `Format` is `'TABLE'` or `'MRTAB'`. Choices are `'\t'` (horizontal tab), `' '` (blank space), or `','` (comma). Default: `'\t'`
`EntryDelimiter`	Character vector or string specifying a delimiter symbol to use as an entry separator for `SourceFile` when `Format` is `'FLAT'`. Default: `'//'`
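
The options for general-purpose formats can be combined as in the following sketch, which indexes a small comma-delimited file whose keys are in the second column. It assumes Bioinformatics Toolbox is installed; the file name and data are hypothetical. Because these options apply only when no index file exists yet, rerunning the code with different option values would require removing the old index file first.

```matlab
% Write a small comma-delimited file with one comment line
% (hypothetical data).
src = fullfile(tempdir, 'demo_table.csv');
fid = fopen(src, 'w');
fprintf(fid, '# score,gene\n');
fprintf(fid, '0.5,geneA\n1.5,geneB\n');
fclose(fid);

% Keys are in column 2, columns are separated by commas, and lines
% starting with '#' are treated as comments.
obj = BioIndexedFile('table', src, tempdir, ...
    'KeyColumn', 2, 'CommentPrefix', '#', ...
    'TableDelimiter', ',', 'Verbose', false);

keys = getKeys(obj);                   % keys found in column 2
entry = getEntryByKey(obj, 'geneB');   % raw text of the matching entry
```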

Properties

`FileFormat`	File format of the source file. This information is read only. Possible values are:
`'SAM'`: SAM-formatted file
`'FASTQ'`: FASTQ-formatted file
`'FASTA'`: FASTA-formatted file
`'TABLE'`: Tab-delimited table with multiple columns. Keys can be in any column. Rows with the same key are considered separate entries.
`'MRTAB'`: Tab-delimited table with multiple columns. Keys can be in any column. Contiguous rows with the same key are considered a single entry. Noncontiguous rows with the same key are considered separate entries.
`'FLAT'`: Flat file with concatenated entries separated by a character vector, typically `'//'`. Within an entry, the key is separated from the rest of the entry by a white space.
`IndexedByKeys`	Whether or not the entries in the source file can be indexed by an alphanumeric key. This information is read only.
`IndexFile`	Path and file name of the auxiliary index file. This information is read only. Use this property to confirm the name and location of the index file associated with the object.
`InputFile`	Path and file name of the source file. This information is read only. Use this property to confirm the name and location of the source file from which the object was constructed.
`Interpreter`	Handle to a function used by the `read` method to parse entries in the source file. This interpreter function must accept a character vector of one or more concatenated entries and return a structure or an array of structures containing the interpreted data. Set this property when your source file has a `'TABLE'`, `'MRTAB'`, or `'FLAT'` format. When your source file is an application-specific format such as `'SAM'`, `'FASTQ'`, or `'FASTA'`, then the default is a function handle appropriate for that file type and typically does not require you to change it.
`MemoryMappedIndex`	Whether the indices to the source file are stored in a memory-mapped file or in memory.
`NumEntries`	Number of entries indexed by the object. This information is read only.

Methods

getDictionary	Retrieve reference sequence names from SAM-formatted source file associated with BioIndexedFile object
getEntryByIndex	Retrieve entries from source file associated with BioIndexedFile object using numeric index
getEntryByKey	Retrieve entries from source file associated with BioIndexedFile object using alphanumeric key
getIndexByKey	Retrieve indices from source file associated with BioIndexedFile object using alphanumeric key
getKeys	Retrieve alphanumeric keys from source file associated with BioIndexedFile object
getSubset	Create object containing subset of elements from BioIndexedFile object
read	Read one or more entries from source file associated with BioIndexedFile object
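
The methods listed above can be chained as in the following sketch, which looks up entries by key, retrieves their text by index, and creates a smaller object from them. It assumes Bioinformatics Toolbox is installed; the file name and data are hypothetical.

```matlab
% Write a small tab-delimited file (hypothetical data).
src = fullfile(tempdir, 'demo_methods.txt');
fid = fopen(src, 'w');
fprintf(fid, 'geneA\t1\ngeneB\t2\ngeneA\t3\n');
fclose(fid);

% Index the file, storing the index file in tempdir.
obj = BioIndexedFile('table', src, tempdir, 'Verbose', false);

keys = getKeys(obj);                  % keys of the indexed entries
idx  = getIndexByKey(obj, 'geneA');   % indices of entries keyed geneA
txt  = getEntryByIndex(obj, idx);     % raw text of those entries
sub  = getSubset(obj, idx);           % new object limited to those entries

fprintf('Entries in subset: %d\n', sub.NumEntries);
```

Working with indices returned by `getIndexByKey` is one way to move between key-based and index-based access without rereading the source file.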

Copy Semantics

Value. To learn how value classes affect copy operations, see Copying Objects in the MATLAB® Programming Fundamentals documentation.

Examples

Construct a BioIndexedFile object and access its gene ontology (GO) terms

This example shows how to construct a BioIndexedFile object and access its gene ontology (GO) terms.

Create a variable containing the full absolute path of the source file.

sourcefile = which('yeastgenes.sgd');

Copy the file to the current working directory.

copyfile(sourcefile,'yeastgenes_copy.sgd');

Construct a BioIndexedFile object from the source file that is a tab-delimited file, considering contiguous rows with the same key as a single entry. Indicate that keys are located in column 3 and that header lines are prefaced with '!'.

gene2goObj = BioIndexedFile('mrtab','yeastgenes_copy.sgd','KeyColumn',3,'HeaderPrefix','!');

Source File: yeastgenes_copy.sgd
   Path: /tmp/Bdoc24a_2528353_1556503/tpd005c5f0/bioinfo-ex58973989
   Size: 21455392 bytes
   Date: 15-Mar-2018 17:45:16
Creating new index file ...
Indexer found 36266 entries after parsing 111912 text lines.
Index File: yeastgenes_copy.sgd.idx
   Path: /tmp/Bdoc24a_2528353_1556503/tpd005c5f0/bioinfo-ex58973989
   Size: 494723 bytes
   Date: 13-Feb-2024 00:26:48
Mapping object to yeastgenes_copy.sgd.idx ... 
Done.

To return the GO terms from all entries associated with the gene YAT2, first access the entries that have a key of YAT2.

YAT2_entries = getEntryByKey(gene2goObj,'YAT2');

Adjust the object's interpreter so that it returns only the GO terms found in each entry.

gene2goObj.Interpreter = @(x) regexp(x,'GO:\d+','match');

Parse the entries with a key of YAT2 and return all GO terms from those entries.

GO_YAT2_entries = read(gene2goObj, 'YAT2')

GO_YAT2_entries = 1x14 cell
    {'GO:0004092'}    {'GO:0006066'}    {'GO:0006066'}    {'GO:0009437'}    {'GO:0005829'}    {'GO:0005737'}    {'GO:0004092'}    {'GO:0016740'}    {'GO:0016746'}    {'GO:0006629'}    {'GO:0016746'}    {'GO:0005737'}    {'GO:0006631'}    {'GO:0005737'}