Thread Subject:
How to remove unwanted text from a .txt file?

Subject: How to remove unwanted text from a .txt file?

From: Cy abd

Date: 23 Sep, 2008 22:35:03

Message: 1 of 20

HI,
I work with .txt files that receive streaming data in a uniform field format. However, whenever there is a connection problem, the program writes a 3- or 4-line message to the text file saying that data is not available. How can I programmatically remove those unwanted lines from the file without having to open it manually first?

thanks

Subject: How to remove unwanted text from a .txt file?

From: Cy abd

Date: 23 Sep, 2008 22:58:01

Message: 2 of 20

I use textscan to read the data. When it finds text that does not match the field setup (%f etc.), textscan stops loading the file and, worst of all, gives no warning. So I guess that by checking the input matrix size on a timer one could tell that there is text present in the file, at which point it would be nice to be able to remove the unwanted text lines. But HOW?

I can think of messy code to do it, but how can this be done elegantly?
thanks

Subject: How to remove unwanted text from a .txt file?

From: Walter Roberson

Date: 23 Sep, 2008 23:01:24

Message: 3 of 20

Cy abd wrote:

> I work with .txt files that receive streaming data which is set up in uniform
> fields format. However, whenever there is a connection problem the program sends
> a 3 or 4 line message to the text fil regarding data not available! How can
> I programmatically remove those unwanted lines from the file without having
> to manually open it first?

perl, sed, ed, awk, grep, C, matlab, ...

Subject: How to remove unwanted text from a .txt file?

From: Cy abd

Date: 23 Sep, 2008 23:33:01

Message: 4 of 20

how about just the matlab way, since I'm posting in the matlab forum? care to maybe hint on that?

>
> perl, sed, ed, awk, grep, C, matlab, ...

Subject: How to remove unwanted text from a .txt file?

From: Walter Roberson

Date: 24 Sep, 2008 00:18:42

Message: 5 of 20

Cy abd top-posted:

Please do not post your reply above the material you are commenting on: it makes
it difficult to hold a discussion.

>> perl, sed, ed, awk, grep, C, matlab, ...

> how about just the matlab way, since I'm posting in the matlab forum? care to
> maybe hint on that?

The matlab way that -I- would use would be to write a short perl script to do
the work. Matlab arrives with perl installed, accessible via the perl() command.

For example, put this in file allbut.pl

$refuse = shift @ARGV; while (<>) { print unless /(?:$refuse)/o; }


Then to use it, in matlab call with (e.g.)

filteredtext = perl('allbut.pl', 'Lost data connection|Reconnecting', 'XYZ.txt');

where XYZ.txt is the file name of the file to have the lines removed,
and the lines to be deleted are any lines that contain either the string
'Lost data connection' or the string 'Reconnecting' anywhere on the line.

The result variable, filteredtext, would probably be a char vector
(with embedded end of line characters) containing all the -other- lines.
You could write that to a file if you wanted, or it might be more convenient
to textscan() the string without writing it out to a file.


The same task can certainly be done without calling out to perl, but
it is more of a nuisance.
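
A minimal sketch of what a MATLAB-only version of the same filter might look like, for comparison with the perl call. The file name, the sample contents and the variable names are invented for this illustration; only the two message strings come from the example above.

```matlab
% Sketch: drop every line that matches a pattern, using only MATLAB.
% Write a small demo file first (file name and contents are made up).
fname = fullfile(tempdir, 'allbut_demo.txt');
fid = fopen(fname, 'w');
fprintf(fid, '1,2,3\nLost data connection\n4,5,6\nReconnecting...\n7,8,9\n');
fclose(fid);

txt   = fileread(fname);                     % whole file as one char vector
lines = regexp(txt, '\r?\n', 'split');       % one cell per line
lines = lines(~cellfun('isempty', lines));   % drop empty lines (e.g. after the final newline)

refuse = 'Lost data connection|Reconnecting';                 % lines to reject
isBad  = ~cellfun('isempty', regexp(lines, refuse, 'once'));  % true where the pattern matches
kept   = lines(~isBad);

filteredtext = sprintf('%s\n', kept{:});     % char vector with embedded newlines
C = textscan(filteredtext, '%f %f %f', 'Delimiter', ',', 'CollectOutput', true);
data = C{1};                                 % numeric matrix of the surviving lines
```

This reads the whole file into memory, which should be acceptable for files of a few megabytes but is an assumption worth checking for much larger ones.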

Subject: How to remove unwanted text from a .txt file?

From: Cy abd

Date: 24 Sep, 2008 00:35:03

Message: 6 of 20

Walter Roberson <roberson@hushmail.com> wrote in message <VGfCk.562$Cl1.66@newsfe01.iad>...
> Cy abd top-posted:
>
> Please do not post your reply above the material you are commenting on: it makes
> it difficult to hold a discussion.
>
> >> perl, sed, ed, awk, grep, C, matlab, ...
>
> > how about just the matlab way, since I'm posting in the matlab forum? care to
> > maybe hint on that?
>
> The matlab way that -I- would use would be to write a short perl script to do
> the work. Matlab arrives with perl installed, accessible via the perl() command.
>
> For example, put this in file allbut.pl
>
> $refuse = shift @ARGV; while (<>) { print unless /(?:$refuse)/o; }
>
>
> Then to use it, in matlab call with (e.g.)
>
> filteredtext = perl('allbut.pl', 'Lost data connection|Reconnecting', 'XYZ.txt');
>
> where XYZ.txt is the file name of the file to have the lines removed,
> and the lines to be deleted are any lines that contain either the string
> 'Lost data connection' or the string 'Reconnecting' anywhere on the line.
>
> The result variable, filteredtext, would probably be a char vector
> (with embedded end of line characters) containing all the -other- lines.
> You could write that to a file if you wanted, or it might be more convenient
> to textscan() the string without writing it out to a file.
>
>
> The same task can certainly be done without calling out to perl, but
> it is more of a nuisance.

Thank you for the explanation. I will get working on it and hopefully will be able to resolve my problem.

Sorry about the top-posting; I always used to think that was the right way. It won't happen again.

One would have thought that ML would have implemented a straightforward solution within textscan for such a common problem, but unfortunately NOT! It can ignore strings, but not complete lines starting with a string. Although I may be wrong: I'm also looking into commentStyle, which might do just that. Does it?
thanks again

Subject: How to remove unwanted text from a .txt file?

From: Cy abd

Date: 24 Sep, 2008 00:47:02

Message: 7 of 20


> One would have thought that ML would have implemented an straight forward solution for such a popular problem within the textscan but unfortunately NOT! It can ignore strings but not complete lines stating with a string! Although I maybe wrong, I'm also looking into commentStyle which might just do that, does it?
> thanks again

Well, I'll be darned, the solution was sure enough 'commentStyle' :) .
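
A minimal sketch of the 'CommentStyle' approach, assuming every unwanted line begins with the same marker. The marker 'Lost' and the sample data are made up for illustration. textscan should treat everything from the marker to the end of that line as a comment and skip it.

```matlab
% Sketch: skip message lines by declaring their first word a comment marker.
% The marker 'Lost' and the sample text are assumptions for this example.
str = sprintf('1,2,3\nLost data connection\n4,5,6\n');
C = textscan(str, '%f %f %f', 'Delimiter', ',', ...
             'CommentStyle', 'Lost', 'CollectOutput', true);
data = C{1};   % numeric rows only; the message line is ignored
```

As far as the documentation describes it, 'CommentStyle' takes a single marker (or an opening/closing pair), so lines that start with several different strings need another approach.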

Subject: How to remove unwanted text from a .txt file?

From: Cy abd

Date: 24 Sep, 2008 01:19:02

Message: 8 of 20

"Cy abd" <gringoven@gmail.com> wrote in message <gbc2m6$de7$1@fred.mathworks.com>...
>
> > One would have thought that ML would have implemented an straight forward solution for such a popular problem within the textscan but unfortunately NOT! It can ignore strings but not complete lines stating with a string! Although I maybe wrong, I'm also looking into commentStyle which might just do that, does it?
> > thanks again
>
> Well, I'll be darned, the solution was sure enough 'commentStyle' :) .

OK, I'm having problem again, I can get the line to be skipped if it starts with the same single string but how can I get lines starting with different strings to be skipped?
like any lines that start with either 'data' or 'lost'?
thanks

Subject: How to remove unwanted text from a .txt file?

From: Cy abd

Date: 24 Sep, 2008 05:31:01

Message: 9 of 20

The following code does what I need, but it is very slow: the .txt file is about 500,000 lines, and the third line after the else takes up 85% of the total time. How can that line be optimized, please?

% the line consuming the most time (85%):
B(r,:) = [data1 data2 data3 data4 data5 data6 data7];


% the complete code:
r=1;
fid=fopen('Test.txt') ;
while ~feof(fid) ;
  tline=fgets(fid) ;
     if isletter(tline(1))==1 ;
     else
        A = textscan(tline,'%f %f %f %f %f %f %f','delimiter',',');
        [data1 data2 data3 data4 data5 data6 data7] = A{:};
        B(r,:) = [data1 data2 data3 data4 data5 data6 data7];
        r=r+1;
     end
end
fclose(fid); % close only this file rather than every open file


% sample text from 'Test.txt' (the file averages about 10 MB):

13,14,02,1212.75,332,11
13,14,02,1212.75,374,11
13,14,03,1212.75,5,22
13,14,03,1212.75,5,22

Subject: How to remove unwanted text from a .txt file?

From: Walter Roberson

Date: 24 Sep, 2008 05:46:45

Message: 10 of 20

Cy abd wrote:
> The following code does what I need to do but is very slow since the .txt file is about 500,000 lines and the 3rd. line after else takes up 85% of the total time! How can that line be optimized please?
>
> // the line consuming the most time 85%.
> B(r,:) = [data1 data2 data3 data4 data5 data6 data7];

That line is taking most of the time because you are not pre-allocating the matrix,
so it is re-sizing the matrix for every line.

Two days ago, Steve Lord and I each explained some pre-allocation strategies
that can be used when the file size is not fixed. The thread was
"How to create a variable array n*2"

> if isletter(tline(1))==1 ;
> else
> A = textscan(tline,'%f %f %f %f %f %f %f','delimiter',',');
> [data1 data2 data3 data4 data5 data6 data7] = A{:};
> B(r,:) = [data1 data2 data3 data4 data5 data6 data7];
> r=r+1;
> end

That can be optimized slightly to

  if ~isletter(tline(1))
    A = textscan(tline, '%f %f %f %f %f %f %f','delimiter',',');
    B(r,:) = [A{:}];
    r = r + 1;
  end

The removal of the ==1 and the elimination of the empty branch will likely
measurably speed up execution of the routine. The other change will speed
up the code measurably for sure (though with your current code, your
time for that line is being overwhelmed by the re-allocations you are doing.)
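
A minimal sketch of one pre-allocation strategy for when the number of rows is not known in advance: allocate in blocks and double the capacity whenever it runs out. This is only an illustration and not necessarily what the other thread proposed; the sample lines, the starting capacity and the use of sscanf are assumptions.

```matlab
% Sketch: grow B in large blocks instead of one row at a time.
% Sample lines stand in for the lines read from the file.
lines = {'13,14,02,1212.75,332,11', 'Lost data connection', ...
         '13,14,02,1212.75,374,11', '13,14,03,1212.75,5,22'};
nCols = 6;
B = NaN(1, nCols);      % deliberately tiny starting capacity, to exercise the growth step
r = 0;                  % number of rows filled so far
for k = 1:numel(lines)
    tline = lines{k};
    if isempty(tline) || isletter(tline(1))
        continue        % skip blank lines and text messages
    end
    vals = sscanf(tline, '%f,', [1 nCols]);   % parse one comma-separated line
    if r == size(B, 1)
        B = [B; NaN(size(B, 1), nCols)];      % double the capacity when full
    end
    r = r + 1;
    B(r, :) = vals;
end
B = B(1:r, :);          % trim the unused rows
```

With doubling, the matrix is re-allocated only a handful of times even for hundreds of thousands of lines, instead of once per line. In real use the starting capacity would be much larger, for example an estimate based on the file size.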

Subject: How to remove unwanted text from a .txt file?

From: Cy abd

Date: 24 Sep, 2008 06:36:02

Message: 11 of 20

Thank you so very much, the code is a lot more efficient now, although the B(r,:) = [A{:}]; line now accounts for 96% of the time consumed. It is amazing how efficient textscan is, at only 4%!
I’ll go and work on the matrix pre-allocation now. Will report back.

Subject: How to remove unwanted text from a .txt file?

From: Cy abd

Date: 24 Sep, 2008 06:48:01

Message: 12 of 20

Couldn't find the "How to create a variable array n*2" thread!
A link would be appreciated.
thanks

Subject: How to remove unwanted text from a .txt file?

From: Walter Roberson

Date: 24 Sep, 2008 07:10:40

Message: 13 of 20

Cy abd wrote:
> Couldn't find the "How to create a variable array n*2" thread!
> A link would be appreciated.

You may have to manually splice this line together into a single URL:

http://groups.google.ca/group/comp.soft-sys.matlab/browse_frm/thread/4453100f1bcc6aea/ca74053a3ccedc2f

Subject: How to remove unwanted text from a .txt file?

From: Cy abd

Date: 26 Sep, 2008 02:27:01

Message: 14 of 20

Walter Roberson <roberson@hushmail.com> wrote in message <5JlCk.9617$tp1.8665@newsfe06.iad>...
> Cy abd wrote:
> > Couldn't find the "How to create a variable array n*2" thread!
> > A link would be appreciated.
>
> You may have to manually splice this line together into a single URL:
>
> http://groups.google.ca/group/comp.soft-sys.matlab/browse_frm/thread/4453100f1bcc6aea/ca74053a3ccedc2f

Subject: How to remove unwanted text from a .txt file?

From: Cy abd

Date: 26 Sep, 2008 03:11:02

Message: 15 of 20

I seem to have found a reasonable solution to my problem, but I can’t get it to work. I would appreciate a review of my code below.
With the following code I’m hoping to get textscan to resume after it stops at a text line.

fid = fopen('test.txt');
[C, position] = textscan(fid(position+1:end),'%f %f %f %f %f %f ','delimiter',',');
% % % % % fid = fclose(fid);
C = [C{:}];

Sample data:

10,13,40,1214.25,5,22
20,13,40,1214.25,84,22
30,13,40,1214.25,30,22
40,13,40,1214.25,2,22
j
50,13,42,1214.00,1,11
k
60,13,43,1214.00,1,11
20,13,43,1214.00,1,11
20,15,59,1214.50,1,23
20,15,59,1214.50,1,23
20,15,59,1214.50,1,23

Subject: How to remove unwanted text from a .txt file?

From: Walter Roberson

Date: 26 Sep, 2008 06:19:05

Message: 16 of 20

Cy abd wrote:
> I seem to have found a reasonable solution to my problem but however I can’t
> get it to work, Will appreciate reviewing my code below please.
> Through the following code I’m hoping to get the textscan to resume.

> fid = fopen('test.txt');

fopen returns a file identifier, which is a scalar integer (3 or greater for a file
you open yourself; I will use 2 below purely as an example value). A scalar is,
of course, just a 1-by-1 array as far as Matlab is concerned.

> [C, position] = textscan(fid(position+1:end),'%f %f %f %f %f %f ','delimiter',',');

You do not show any initialization for position. Presuming you initialized it to 0,
then you would be indexing the vector containing the 2 at positions 1 through
the end of the vector. The end of the vector is position one, so that would be
indexing the vector containing the 2 at positions 1 through 1, which is just
going to be the scalar 2. So provided you initialized position to 0, the first
time around, the integer file identifier is going to be passed as the first
argument of the textscan() call, resulting in data being read. And then the
position variable is going to be overwritten with the position that was reached in
the file. That might be, for example, 103.

As you did not specify any repeat count for textscan, it is going to default
to reading as much as it can, stopping at an error or at end of file.

You do not show any kind of loop, but we can infer one based upon the fact that
you expect the code to resume reading. It won't resume reading just on a single
textscan call so you have presumably branched back up in code not shown.

The code you do show has C as the output for each textscan call, and
has C as the variable at the end that is expected to hold all of the values.
Unfortunately, the next textscan() call would overwrite C. So you are going to need
another variable, such as AllC, initialized to {}, and at each textscan() call
you are going to have to test C to see whether you got any data, and if you
did get data then AllC(end+1:end+length(C)) = C; and it would be AllC that you
would convert to an array at the end.

The second time through the loop, you would be attempting to index the vector
containing the file identifier at location 'position' through to the end. In our
earlier example we said that position might be (say) 103, so you would be attempting
to index the scalar vector containing 2 at locations 103+1 to the end of the vector.
There is, however, no location 104 in the single-element vector, so the indexing
would fail before the textscan() routine was actually called the second time.

If you were to replace the fid(position+1:end) with just fid so as to get rid
of this obvious indexing error, then textscan() would start the new scan from
the current file position. Unfortunately, the current file position would be
at the non-numeric character in the text: when the first textscan() call read
the non-numeric character and found that it did not match the numeric format
it wanted, textscan() automatically put that non-matching character "back in the
buffer", ready to be read by the next call. And since the next call would be
attempting a numeric format, it too would fail... What you do to fix this is
to determine, after the textscan() call, whether you are at end of file;
if you are not at end of file, then fgetl(fid) to read the non-matching line.
Throw that line away (for your purposes) and loop back to try again. When do
you stop trying textscan() calls? When you detect you are at end of file via feof().
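
A minimal sketch of the loop described above: call textscan, and if it stopped before end of file, discard the offending line with fgetl and try again. The demo file, its contents and the variable names are assumptions for illustration; the chunks are kept in a cell array and joined once at the end.

```matlab
% Sketch: let textscan read as much as it can, skip the line it choked on, repeat.
% Write a small demo file first (file name and contents are made up).
fname = fullfile(tempdir, 'resume_demo.txt');
fid = fopen(fname, 'w');
fprintf(fid, '10,13,40\n20,13,40\nLost data connection\n30,13,42\nReconnecting\n40,13,43\n');
fclose(fid);

chunks = {};                       % one numeric block per successful textscan call
fid = fopen(fname, 'r');
while ~feof(fid)
    C = textscan(fid, '%f %f %f', 'Delimiter', ',', 'CollectOutput', true);
    if ~isempty(C{1})
        chunks{end+1} = C{1};      % keep the block only if data was read
    end
    if ~feof(fid)
        fgetl(fid);                % read and discard the non-matching line
    end
end
fclose(fid);
data = vertcat(chunks{:});         % join the blocks into one matrix
```

The number of textscan calls is tied to the number of text lines rather than to the number of data lines, so the cost of the loop should stay small when interruptions are rare.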

> % % % % % fid = fclose(fid);
> C = [C{:}];


Please next time read the documentation more carefully. The code you show,
textscan(fid(position+1:end),...) is obviously just a small variation on the
sample line shown in the textscan documentation talking about resuming, which has
textscan(str(position+1:end),...)
However, if you had read slightly more carefully you would have seen that
that only applies when you are using textscan() to scan a string, not when
you are using textscan to scan a file. It should have been obvious to you that
indexing the file identifier would never work. Instead, I had to write this
long posting describing all the things wrong with your code :(

Subject: How to remove unwanted text from a .txt file?

From: Cy abd

Date: 26 Sep, 2008 07:06:02

Message: 17 of 20

Mr. Roberson,

This was a huge lesson for me, and for the many future users who will benefit from your contribution here. I sincerely thank you and appreciate it.

Trust me, I did read the documentation for many hours, and I did note that the code was for the str case; I was hoping to find the file alternative, which you have explained very clearly and comprehensively above. Thanks again.

Through further reading I had actually come to a very simple conclusion similar to yours: place checks in the code to determine whether textscan has reached the eof or has stopped because it encountered text.
If it has stopped, call textscan again, but this time with the 'headerlines' parameter set to 1 instead of the previous 0, so that the next line is skipped automatically. Repeat until the text lines are passed and textscan can continue to the eof.

Please correct me if I am wrong to conclude that this approach is the fastest and most elegant way to resolve this problem once and for all. The issue has come up in other threads, where the conclusion was to sift through the file one line at a time, which is very time consuming. This approach only deals with a single line when textscan has stopped, and even then the same textscan function handles it, without the user needing to code any further line-by-line loops.

Beginners like myself need people like you to write manuals and books on ML that explain things as clearly and comprehensively as you have.

Cy

Subject: How to remove unwanted text from a .txt file?

From: Andres

Date: 26 Sep, 2008 07:09:02

Message: 18 of 20

"Cy abd" <gringoven@gmail.com> wrote in message <gbhjs6$nr8$1@fred.mathworks.com>...
> I seem to have found a reasonable solution to my problem but however I can’t get it to work, Will appreciate reviewing my code below please.
> Through the following code I’m hoping to get the textscan to resume.
>
> fid = fopen('test.txt');
> [C, position] = textscan(fid(position+1:end),'%f %f %f %f %f %f ','delimiter',',');
> % % % % % fid = fclose(fid);
> C = [C{:}];
>
> Sample code:
>
> 10,13,40,1214.25,5,22
> 20,13,40,1214.25,84,22
> 30,13,40,1214.25,30,22
> 40,13,40,1214.25,2,22
> j
> 50,13,42,1214.00,1,11
> k
> 60,13,43,1214.00,1,11
> 20,13,43,1214.00,1,11
> 20,15,59,1214.50,1,23
> 20,15,59,1214.50,1,23
> 20,15,59,1214.50,1,23


I'm sorry for possibly diminishing the learning effect ;-), but Cy contacted me via e-mail for support, so here's a code proposal:

%----% some settings
% file name
fn = 'c:\data\CyXmpl01.txt';
% number of values per data line (_must_ be constant!)
numItemsPerLine = 6;
% estimated number of characters per line
estNumCharPerLine = 23;

%---% preallocate output matrix somehow
% get file info
file = dir(fn);
% guess the number of lines from the file size
estNumLines = round(file.bytes / estNumCharPerLine);
% initialize the output matrix
A = NaN(estNumLines,numItemsPerLine);

%---% import data
fid = fopen(fn);
idx = 1; % line index
tline = 'dummyString'; % just to enter the loop
while ischar(tline)
    % import with textscan as much as possible
    priceBb = textscan(fid,...
               repmat('%f ',1,numItemsPerLine),...
               'delimiter',',',...
               'CollectOutput',true);
    priceBb = priceBb{1};
    numLines = size(priceBb,1);
    A(idx:idx+numLines-1,:) = priceBb;
    idx = idx + numLines;
    % try to read in the next non-data line and ...
    % a) set the file pos indicator to the next line
    % b) or tline=-1 will make us exit the loop
    tline = fgets(fid);
end
fclose(fid);
A = A(1:idx-1,:);

Subject: How to remove unwanted text from a .txt file?

From: Cy abd

Date: 26 Sep, 2008 20:03:02

Message: 19 of 20

163,300 lines of a 3.8 MB file, cleaned of 40 lines of text and blank lines (including all of this post), in 0.857841 seconds! Elegantly, and using only the power of textscan!

NOT BAAAD at all for a not-so-modest beginner, hey?!

tic; % start the timer that toc reports at the end
fileName='C:\test.txt';
formatString='%f%f%f%f%f%f';
numHeaderLines=0;
fid=fopen(fileName,'rt'); % Open file for reading ('r') in text mode ('t').
data=textscan(fid,formatString,'headerlines',numHeaderLines,'delimiter',',');
data = [data{:}];
fileInfo=dir([fileName]); % File stats
fileSize=ceil(fileInfo.bytes); % File size
position = ftell(fid);
eof = fileSize-position;
cnt = 0; %count of skipped lines
% If file conversion was Not stopped due txt presence or even blank lines
% then the while will be skipped.
while eof > 0;
%also if no. of txt lines is known, then 1 can be replaced with any no. of
%lines desired to be skipped.
numHeaderLines=1; %setting header to 1 will skip txt line.
data1=textscan(fid,formatString,'headerlines',numHeaderLines,'delimiter',',');
data =vertcat(data,[data1{:}]);
position = ftell(fid);
cnt = cnt+1;
eof = fileSize-position; %check eof reached?
end
status=fclose(fid);
toc

Subject: How to remove unwanted text from a .txt file?

From: Cy abd

Date: 27 Sep, 2008 07:11:01

Message: 20 of 20

Replacing the line:
data =vertcat(data,[data1{:}]);
with the line below makes the code about 30% faster still:
data(length(data)+1:length(data)+length([data1{:,1}]),1:6) = [data1{:}];
Obviously the lengthy line above can be shortened by using variables.
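
A minimal sketch of how that indexed append might look with the row counts held in variables. The sample values and the names newRows, firstRow and lastRow are made up for illustration. It uses size(data,1) rather than length(data), because length returns the larger dimension and so only equals the row count once there are at least as many rows as columns.

```matlab
% Sketch: the same indexed append, with the row counts held in variables.
data  = [10 13 40; 20 13 40];             % rows collected so far (illustrative values)
data1 = {[30; 40], [13; 13], [42; 43]};   % stand-in for a textscan result, one cell per column

newRows  = [data1{:}];                    % combine the columns into one block
firstRow = size(data, 1) + 1;             % first row index to write
lastRow  = size(data, 1) + size(newRows, 1);
data(firstRow:lastRow, :) = newRows;      % append the block below the existing rows
```

This still enlarges the matrix once per block, so it is probably best suited to files with only a few interruptions.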
