How to sort and open up text files titled in a range of numbers

Hi Matlab Community,
I have hundreds, perhaps thousands, of text files from an experiment. These files are titled by the time they were captured. I was wondering if there is a method to capture groups of the files at certain points in time. Since the title of each file was determined in microseconds, there is no pattern in how the files were titled. For example, I would like to open text files titled between uc#6_123000 and uc#6_321000.
Maybe there is an iterative process for this record keeping?
Thanks for the help!

Answers (4)

dpb on 24 Jul 2014
Use the time stamp as it was generated to create a list and iterate over it. I presume there's a meaning for 123000 and 321000 above and the letters and other digit create a grosser segregation of some sort.
Now if there's no specific time interval between the two integer values above, your list will include files that don't exist; use exist to test first or put the attempt to open the file in a try...catch block to just skip the non-existent ones.
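A minimal sketch of the probing idea described above, assuming the file names consist of a fixed prefix followed by the integer timestamp and a .txt extension. The scratch folder, the prefix variable and the two demo files are illustrative assumptions, not part of the original data set.

```matlab
% Sketch: probe every candidate timestamp in [lo, hi] and keep the files that exist.
% The folder, prefix and demo files below are assumptions made for illustration.
folder = tempname;  mkdir(folder);            % scratch folder for the demo
prefix = 'uc#6_';                             % assumed fixed part of the file name
fclose(fopen(fullfile(folder,[prefix '123004.txt']),'w'));  % create two empty demo files
fclose(fopen(fullfile(folder,[prefix '123010.txt']),'w'));

lo = 123000;  hi = 123020;                    % candidate timestamp range
found = {};
for t = lo:hi
    fname = fullfile(folder, sprintf('%s%d.txt', prefix, t));
    if exist(fname,'file') == 2               % 2 means the name refers to a file
        found{end+1} = fname;                 %#ok<AGROW>
    end
end
nFound = numel(found);                        % number of files actually present
rmdir(folder,'s');                            % remove the scratch folder
```

The loop visits every integer in the range, so its cost grows with the width of the range rather than with the number of files.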

Michael Haderlein on 24 Jul 2014
You can use the dir function to get all the files, then extract the time stamp and compare the time stamps of the files with a min/max value:
allfiles=dir('*.m');
d=[allfiles.datenum];
selectedfiles=allfiles(d>=datenum([2014 05 01 0 0 0]) & d<datenum([2014 06 01 0 0 0]));
Should give you all m-files in your folder from this year's May. Then you can start reading selectedfiles.
Best regards,
Michael
  1 Comment
ALEX on 24 Jul 2014
Hi Michael,
Thanks for this strategy. I think it worked for the most part. The only trouble I have is that each file name is formatted a little different from your example.
Here is how mine looks with an arbitrary range:
selectedfiles=allfiles(d>=datenum([735598.475717593]) & d<datenum([735597.813472222]))
results:
the returned selectedfiles = 0x1 struct
actual file names are between 'c#6_6_1_3_9_472319296.txt' and 'c#6_6_1_2_100_483853850.txt'
In this case the last group of numbers in the title is the time of test, so between:
472319296 and 483853850
I hope this sounds familiar. Thanks for your help!
Alex



dpb on 24 Jul 2014
datenum([735598.475717593])
Don't need datenum; 735598.475717593 is already a serial date number.
The expression returns a 0-sized structure because your lower limit is greater than the upper -- you asked for >735598+ and simultaneously <735597+. This is impossible.
selectedfiles=allfiles(d>=735597.813472222 & d<735598.475717593);
It would seem better to keep the actual date strings or y,m,d,... vector than the absolute numeric values, though.
Again, you'll have to parse the numeric values from the file names themselves as suggested earlier in order to pick individual specific files as the name string date and the file system date stamp don't correlate as you outlined in the original question.
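The effect of swapped limits can be reproduced on a few made-up serial date numbers. The vector d below is an assumption for illustration; only the two limits come from the thread.

```matlab
% Sketch: a lower limit above the upper limit can never be satisfied.
d  = [735597.5 735597.9 735598.2 735598.6];   % made-up serial date numbers
lo = 735597.813472222;                        % earlier limit from the thread
hi = 735598.475717593;                        % later limit from the thread

swapped = d>=hi & d<lo;    % limits reversed: every element is false
correct = d>=lo & d<hi;    % limits in order: selects the 2nd and 3rd entries
```

Indexing a struct array with an all-false mask such as swapped is what produces the empty structure reported above.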
  1 Comment
ALEX on 24 Jul 2014
Hi dpb,
I see there was a mistake in how I posted the previous text. I fixed it and ran it, but the results were not as expected. It seems like there is no order to 'datenum'. I am limited in my ability to go into y,m,d. I cannot name the files that are being produced by the other commercial software; it records instances in microseconds. I see that the 'date' field of each struct presents the information as '29-Dec-2013 19:31:34'. I think this might help. When you say parse, does that mean to isolate a section of that field? For example, we want just 19:31:34. I think that might be better because the 'datenum' didn't seem to have any logical order.
The block you were explaining in your first post, is that the equivalent of a loop that will execute even if the result doesn't exist?
Thanks much!
Alex



dpb on 25 Jul 2014
Edited: dpb on 25 Jul 2014
datenum returns a Matlab serial date number for the given date/time input as a double. The whole number is the number of days since the reference point, the fractional part is fraction of day. It has precision of roughly msec, and the OS doesn't have better than that anyway so, yes, Virginia, you can't find a file to the microsecond that way.
datenums, however, do have a very well defined order but they will be returned from a call to datenum in the order of the dates presented. If you don't enter a chronological set of values, then the returned values won't be in chronological order, either. This behavior is fully consistent with other matrix operations in Matlab.
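A short sketch of both points, using the date string quoted earlier in the thread as input. The variable names are illustrative.

```matlab
% Sketch: structure of a serial date number and the order of datenum output.
t         = datenum([2013 12 29 19 31 34]);   % serial date number (a double)
wholeDays = floor(t);                         % whole days since the reference point
dayFrac   = t - wholeDays;                    % fraction of the day elapsed
secs      = dayFrac*86400;                    % seconds into the day, 19*3600+31*60+34
s         = datestr(t);                       % back to a readable date string

% datenum returns values in the order the dates are given, not sorted:
v = datenum([2014 6 1 0 0 0; 2014 5 1 0 0 0]);   % June first, then May
inOrder = issorted(v);                           % false, because June precedes May here
vSorted = sort(v);                               % chronological order on request
```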
ADDENDUM
It is, however, true that the output of dir is generally sorted alphanumerically by name(*), not by file system date; hence the returned datenum value in the directory structure will not necessarily be sequential unless the alpha order and the date order coincide. As noted in later response, you can sort it to process them in order. That still doesn't get those within some preselected range of sample times, however, as also noted.
Just came to me what the likely cause of the order confusion is/was...
(*) This is still dependent on the OS default and any options in play on the particular platform.
ENDADDENDUM
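Sorting the directory listing by file system date can be sketched as follows. The scratch folder and the three demo files are assumptions for illustration.

```matlab
% Sketch: reorder the struct array returned by dir by its datenum field.
folder = tempname;  mkdir(folder);            % scratch folder for the demo
names  = {'b.txt','a.txt','c.txt'};           % demo file names
for k = 1:numel(names)
    fclose(fopen(fullfile(folder,names{k}),'w'));   % create empty files
end

d = dir(fullfile(folder,'*.txt'));            % listing, typically ordered by name
[~,idx] = sort([d.datenum]);                  % ascending modification time
d = d(idx);                                   % same entries, now in date order
nFiles = numel(d);
rmdir(folder,'s');                            % remove the scratch folder
```

Files written within the same clock tick share a datenum, so their relative order after sorting is not meaningful.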
I don't quite follow the naming convention other than the last 9(?) digits--is the c#6_6_1_3_9_ portion a month/day pattern or somesuch?
Again, if you want a set of files within some range of microseconds with a given one of these preceding patterns, you'll need to separate out those microsecond values from the file names and then operate on them to glean out those within some range. That separation process is, indeed, "parsing".
If the file names were as the two examples above, if one were to use dir on the directory as
d=dir('c#6_6*.txt');
one would get a return that would look for the name field something like--
>> d.name
ans =
'c#6_6_1_3_9_472319296.txt'
ans =
'c#6_6_1_2_100_483853850.txt'
>>
From there it's not too tough to get the values for all of the various fields--defining a function handle that can parse the names to numeric values as
f=@(x) cell2mat(textscan(char(x),['c#' repmat('%d_',1,5) '%9d.txt'],'collectoutput',1));
where the input x is the name for each file, we can apply that to each entry in the directory with
>> dtvals=reshape(cell2mat(cellfun(f,{d.name},'uniformoutput',false)),6,[]).'
dtvals =
6 6 1 3 9 472319296
6 6 1 2 100 483853850
Now from this you can use
isOK=iswithin(dtvals(:,6),lo,hi);
to return values within a lo and hi range of microsec's.
iswithin is a helper utility function of mine that looks like
function flg=iswithin(x,lo,hi)
% returns T for values within range of input
% SYNTAX:
% [log] = iswithin(x,lo,hi)
% returns T for x between lo and hi values, inclusive
flg= (x>=lo) & (x<=hi);
It's just "syntactic sugar" but it moves the complexity of the actual comparisons to a lower level for ease in reading the top-level code.
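If only the trailing timestamp is needed, an alternative sketch is to pull it out with regexp rather than textscan. This swaps in a different technique from the one shown above. The third file name is made up for illustration, and in practice names would come from {d.name}.

```matlab
% Sketch: select file names whose trailing timestamp lies within [lo, hi].
names = {'c#6_6_1_3_9_472319296.txt', ...
         'c#6_6_1_2_100_483853850.txt', ...
         'c#6_6_1_2_7_500000001.txt'};        % last name is a made-up example

% Capture the digits between the final underscore and '.txt'.
% A name that does not match yields an empty token and would need handling.
tok    = regexp(names,'_(\d+)\.txt$','tokens','once');
stamps = cellfun(@(c) str2double(c{1}), tok); % numeric timestamps

lo = 472319296;  hi = 483853850;              % range in microseconds
keep     = stamps>=lo & stamps<=hi;           % inclusive range test
selected = names(keep);                       % names to open and read
```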
  2 Comments
ALEX on 25 Jul 2014
Thanks dpb,
I'm not familiar with all of the techniques just described. All of the naming stuff is related to the type of test and channel of the instrument that is being activated, Ch 2 or Ch 3.
For example, in c#6_6_1_3_9_472319296.txt:
c#6_6_1 = specimen
3 = channel
9 = waveform recorded within the series (in this example, 9 and 100)
472319296 = microseconds when the wave was created
Creating a function handle seems like a good method. I did run everything that you described but received an error.
"Function definitions are not permitted in this context"
Is there a bug in this code?
lo = 472319296;
hi = 483853850;
d = dir('G:\AE DATA FILES\c#6_6_1\Subgroup\*.txt');
f=@(x) cell2mat(textscan(char(x),['c#' repmat('%d_',1,5) '%9d.txt'],'collectoutput',1));
dtvals=reshape(cell2mat(arrayfun(f,[d(:).name],'uniformoutput',false)),6,[]).'
isOK=iswithin(dtvals(:,6),lo,hi);
function flg=iswithin(x,lo,hi)
flg= (x>=lo) & (x<=hi)
All of this seems above my head. As pointed out by a colleague, I could just rename all of the files sequentially, for example test1(1), test1(2), test1(3), etc. Then I could open up all of the files and find the time of test within the header, and if it agreed with the logical expression then we could keep that file. This doesn't seem as efficient but I don't need to break any land speed records. It might be easier on my end too.
Comments, Suggestions?
Thanks again for this help!
dpb on 25 Jul 2014
Edited: dpb on 25 Jul 2014
Yes, the bug is that the code for the function iswithin must, like all other functions, reside in an m-file named iswithin.m
The error is telling you you can't define a function in a script file or at the command line.
On the question of approach -- if there is a header inside the file as well that has the time, you could certainly read it, too. You wouldn't need to rename the files to do so, simply iterate over the returned directory structure. Just return the directory structure from dir for files that have the other suitable characteristics of the desired type of test and/or channels via the wild card name in the search. If, despite the microsecond resolution in the file naming convention there's at least a few milliseconds between the different file creation dates, you could then sort the returned names on the datenum field and process in the sorted order. Then, as you say, you can simply open each in sequence, check the date from the header and accept/reject based on the desired timestamp range. How inefficient this will be depends entirely on just how many files there are as compared to the number desired to be processed and the size of the files. If the header is short and easily parsed, it shouldn't be too bad to simply read a line but it is another step that could be avoided by the above logic to select the files from the name timestamps.
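A rough sketch of the header-based approach. The header layout is not given in the thread, so the sketch assumes a hypothetical first line of the form 'TIME <microseconds>' and writes its own demo files. The folder, file names and header format are all assumptions.

```matlab
% Sketch: read the first line of each file and accept it by header timestamp.
% The 'TIME <value>' header line is a hypothetical format, not the real one.
folder = tempname;  mkdir(folder);            % scratch folder for the demo
stampsDemo = [470000000 475000000 490000000]; % made-up header times
for k = 1:numel(stampsDemo)
    fid = fopen(fullfile(folder, sprintf('demo%d.txt',k)),'w');
    fprintf(fid,'TIME %d\n', stampsDemo(k));  % hypothetical header line
    fprintf(fid,'0.1 0.2 0.3\n');             % placeholder data line
    fclose(fid);
end

lo = 472319296;  hi = 483853850;              % desired range in microseconds
d = dir(fullfile(folder,'*.txt'));
[~,idx] = sort([d.datenum]);  d = d(idx);     % process in file-date order
accepted = {};
for k = 1:numel(d)
    fid = fopen(fullfile(folder,d(k).name),'r');
    hdr = fgetl(fid);                         % read only the first line
    fclose(fid);
    t = sscanf(hdr,'TIME %f');                % empty if the line does not match
    if ~isempty(t) && t>=lo && t<=hi
        accepted{end+1} = d(k).name;          %#ok<AGROW>
    end
end
rmdir(folder,'s');                            % remove the scratch folder
```

Every file is opened once, so the cost scales with the total number of files in the folder rather than with the number selected.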
The original idea of creating a list of timestamps within the range is admittedly inefficient as there are likely far more possible timestamps in the range than files and since the timestamp is possible to be any value you can't skip by twos or other difference but must check every one. OTOH, it has the advantage over the above that it would check whether the file exists or not before trying to open/read it so that shouldn't add too much overhead even though the loop could be sizable.
Only if there are duplicate timestamps would the above fail to keep the tests sequential in processing, that is, if it were possible for the system to write subsequent files within the resolution of the OS clock. I doubt this would happen, but I suppose that if the system were multi-core or otherwise cached its output, it theoretically could. If that does occur, then the filename parsing is certainly easier than sorting those out.
It seems to me that with the above to parse the filename timestamps you should be on your way ... the error you had is, as noted, simply that you didn't put the utility function in its own file. As an aside, I suggest creating a directory for such code--mine is called 'utilities' and has a plethora of these little snippets. Just create the new subdirectory and add it to the matlabpath so it will be accessible.
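Setting up such a folder can be sketched as below. The location under tempdir and the folder name utilities_demo are assumptions; a permanent folder would normally be used, followed by savepath to keep it on the path across sessions. Releases from R2016b onward also allow local functions at the end of a script file, but a separate iswithin.m works in any release.

```matlab
% Sketch: put iswithin.m in its own folder and add that folder to the path.
utilDir = fullfile(tempdir,'utilities_demo'); % hypothetical location for the demo
if ~exist(utilDir,'dir')
    mkdir(utilDir);
end

% Write the helper into its own file, named after the function.
fid = fopen(fullfile(utilDir,'iswithin.m'),'w');
fprintf(fid,'function flg=iswithin(x,lo,hi)\n');
fprintf(fid,'flg=(x>=lo) & (x<=hi);\n');
fclose(fid);

addpath(utilDir);                             % make the folder visible to MATLAB
rehash;                                       % refresh the function cache
% savepath;                                   % uncomment to keep the path change

flg = iswithin([1 5 10], 2, 10);              % inclusive test, as in the helper
```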

