Overview - University of Western Australia · 10/7/2009
Lecture 20: String Processing
Overview
• Reading lines from text files
• Processing strings
  – Breaking a string into tokens
  – Comparing strings for equality
  – Converting strings to numerals
Reading lines from text files
• Matlab's implementation of the fscanf function is of limited use for mixed data: all the values read have to be stored as data of the same type within the returned array.
• This limitation somewhat defeats the idea of being allowed to specify a format string.
• The alternative approach to reading a text file is
– to read each line of the file as a long string,
– to break the string into tokens (the words, numbers, or punctuation in the string)
– to process the tokens individually.
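The same-type limitation can be seen without a file. The sketch below uses sscanf, which applies the same format rules as fscanf but reads from a string; the input text and variable name are made up for illustration.

```matlab
% Mixed format: a word followed by an integer.
% Because the result must be a single numeric array, the characters
% of the word are converted to their character codes.
A = sscanf('abc 5', '%s %d');

disp(A')   % codes for 'a', 'b', 'c', followed by the number 5
```

The word and the number cannot be told apart in the result, which is why reading whole lines and tokenising them is usually easier for mixed data.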
Reading lines of text
• Matlab provides two functions for reading lines of text from a file:
% Read a line from a file *excluding* the end of
% line characters.
>> line = fgetl(fid)
% Read a line from a file *including* the end of
% line characters.
>> line = fgets(fid)
• If the end of file is encountered, -1 is returned.
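A minimal sketch of the difference between the two functions, assuming a temporary two-line file created just for this illustration (the file contents and variable names are not from the lecture):

```matlab
% Create a small temporary text file to read back.
fname = tempname;
fid = fopen(fname, 'wt');
fprintf(fid, 'first line\nsecond line\n');
fclose(fid);

fid = fopen(fname, 'rt');
a = fgetl(fid);   % first line, end of line characters removed
b = fgets(fid);   % second line, end of line characters kept
c = fgetl(fid);   % nothing left to read, so -1 is returned
fclose(fid);
delete(fname);

fprintf('fgetl gave %d characters, fgets gave %d characters\n', ...
        length(a), length(b));
```

Since fgets keeps the end of line characters, a line read with fgets can be printed with '%s' alone, whereas a line read with fgetl needs '%s\n'.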
An example
• For example:
% DISPLAYFILE: A function to display the contents
% of a file.
%
% Usage: displayFile(filename)
%
% Arguments: filename - The name of the file to
% display.
%
% Returns: Nil.
%
% Author: Lecturers
% Date: Continuously Updated
Example (cont.)
function displayFile(filename)
    [fid, msg] = fopen(filename, 'rt');
    error(msg);                 % Does nothing if msg is empty.

    line = fgets(fid);          % Get the first line from the file.
    while ischar(line)          % fgets returns -1 at the end of file.
        fprintf('%s', line);    % Print the line on the screen.
        line = fgets(fid);      % Get the next line from the file.
    end

    fclose(fid);
Processing Strings
• Text: Chapter 6.2.
• Matlab's string processing functions are also closely modelled on their C equivalents.
Breaking a string into tokens
• Tokens are bits of a string - for example, the numbers, words, and punctuation.
• Which “bits” are important depends on the context or application.
• E.g. if you are parsing an English sentence you may just be interested in the words and not the punctuation (such as full stops):
  – Given “Life wasn’t meant to be easy. But should it be this hard?”
  – The tokens of interest are: Life wasn’t meant to be easy But should it be this hard
• On the other hand, if you are parsing numbers, you may want to keep the full stops:
  – Given “3.245, 7.683, -24.7”
  – The desired tokens are: 3.245 7.683 -24.7, not: 3 245 7 683 24 7
Delimiters
• The things that are used to break up the text into tokens are called delimiters.
E.g. if the delimiters are “, ” (comma and space) we get
3.245 7.683 -24.7
whereas if they are “, .-” (or all punctuation characters) we get
3 245 7 683 24 7
• Tokens are typically delimited by spaces, tabs, or commas.
e.g. Given a string 'A string of 5 tokens + three more',
tokens are
'A' 'string' 'of' '5' 'tokens' '+' 'three' 'more'
• Example: reading and writing text files with Excel...
The strtok function
• The strtok function is used to extract tokens from a string.
• The general syntax of the strtok function is:
[token, remainder] = strtok(string, delim)
– The variable string is the string to be "tokenised".
– Returns the first token delimited by one of the characters in the delim string (the delimiters).
• If delim is omitted, the tokens are delimited by any white space character by default.
• After extracting the first token of string, the remainder of the string is returned as the separate string array remainder.
• The extracted token is stored in the output variable token.
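To illustrate how the choice of delimiters changes the first token, here is a short sketch using the number string from the earlier slide. The variable names are chosen for this example only.

```matlab
str = '3.245, 7.683, -24.7';

% Delimiters are comma and space: the full stop stays inside the token.
[token1, remainder1] = strtok(str, ', ');

% Delimiters also include the full stop and minus sign: the first
% token stops at the first full stop.
[token2, remainder2] = strtok(str, ', .-');

fprintf('token1 = "%s", remainder1 = "%s"\n', token1, remainder1);
fprintf('token2 = "%s", remainder2 = "%s"\n', token2, remainder2);
```

Note that the remainder starts at the delimiter that ended the token; strtok skips leading delimiters the next time it is called on the remainder.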
An example
• For example, imagine breaking a line of text into a cell array of tokens:
line = 'A string of 5 tokens + three more';   % Initialise or read a
                                              % line from a file.
token = {};
ii = 1;
while any(line)
    [token{ii}, line] = strtok(line);         % Repeatedly apply the
    ii = ii + 1;                              % strtok function.
end
• The any function returns True if any element of a vector is non-zero.
• At the end of the while loop, the token cell array contains all the tokens in the string.
>> token
token =
    'A'    'string'    'of'    '5'    'tokens'    '+'    'three'    'more'
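One case worth checking is a line that ends in white space (for example a line read with fgets, which keeps the end of line characters). The last call to strtok then returns an empty token. The sketch below is a variation on the loop above, not the lecture's code, that skips such empty tokens; the input string is invented for illustration.

```matlab
line = 'alpha beta   ';        % note the trailing spaces
token = {};
while any(line)
    [t, line] = strtok(line);
    if ~isempty(t)             % ignore the empty token left by
        token{end+1} = t;      % trailing delimiters
    end
end

fprintf('%d tokens found\n', length(token));
```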
Comparing strings for equality
• Two strings should never be compared for equality using the == operator.
• The == operator will return an error if its two arguments are not of the same length (unless one of them is a single character).
• Even if the strings are the same length, the == operator will return an array of logical values depending on how individual characters match, not a single true/false answer.
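A small sketch of both problems, using strings made up for this illustration:

```matlab
a = 'hello';
b = 'help!';
matches = (a == b);    % one logical value per character position
disp(matches)

% Strings of different lengths cannot be compared with ==.
failed = false;
try
    tmp = ('cat' == 'cats');
catch
    failed = true;     % the comparison raised an error
end
disp(failed)
```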
String Comparison Functions
• Strings in Matlab should be compared using the following functions:
>> c = strcmp(str1, str2)        % Returns True if str1 is
                                 % identical to str2.
>> c = strcmpi(str1, str2)       % Compares strings ignoring
                                 % case differences.
>> c = strncmp(str1, str2, N)    % Compares the first N
                                 % characters of the strings.
>> c = strncmpi(str1, str2, N)   % Compares the first N
                                 % characters of the strings
                                 % ignoring case differences.
• Note that the corresponding C functions (such as strcmp) and Java's compareTo method differ in that they return 0 if the strings are identical, a negative value if the first string is alphabetically less than the second string, and a positive value otherwise.
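The four functions can be tried on a few sample strings. The strings below are chosen for illustration only.

```matlab
c1 = strcmp('beam', 'Beam');                          % case differs
c2 = strcmpi('beam', 'Beam');                         % case ignored
c3 = strncmp('section_area', 'section_length', 8);    % first 8 characters only
c4 = strncmpi('MATERIAL', 'mat', 3);                  % first 3, case ignored
c5 = strcmp('cat', 'cats');                           % different lengths, no error

fprintf('%d %d %d %d %d\n', c1, c2, c3, c4, c5);
```

Unlike ==, each call returns a single logical value, so the results can be used directly in an if or while condition.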
Converting strings to numerals
• The function str2num will evaluate a string as a Matlab expression, converting it to a number.
• For example:
>> x = str2num('21')
x =
    21
>> x = str2num('1+2')
x =
    3
Str2num and eval
• Note however, that spaces can be significant.
>> x = str2num('1 +2')   % This is interpreted as
                         % an array containing two
                         % numbers: 1 and +2.
x =
    1 2
• The eval function is similar, but perhaps more predictable.
>> x = eval('1 +2')
x =
    3
>> x = eval('[1 2]')
x =
    1 2
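The effect of spacing can be checked side by side. This sketch only varies the white space around the + sign; the variable names are arbitrary.

```matlab
a = str2num('1+2');      % no spaces: a single value
b = str2num('1 + 2');    % spaces on both sides: still a single value
c = str2num('1 +2');     % space before the + only: two values, 1 and +2
d = eval('1 +2');        % eval treats this as the sum

fprintf('%d value(s), %d value(s), %d value(s), %d value(s)\n', ...
        numel(a), numel(b), numel(c), numel(d));
```

The behaviour of str2num follows from the string being evaluated as the contents of an array, where white space separates elements.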
Summary
• The operations described in this lecture form the basic building blocks that allow you to process text files:
– Reading values from files.
– Reading lines from files.
– Breaking lines into tokens.
– Comparing strings.
– Converting strings to values.
Putting it all together in one example
• Here is a text datafile that specifies the properties of 3 beams.
CITS1005BEAM   % Magic ID value
%
% This is a datafile to define a series of beams
% File must start with the appropriate Magic value
% Comments are indicated by `%'
% Number of beams is specified by the keyword `Nbeams'
% followed by a value.
%
% Each beam definition is started by the keyword 'beam'
% Fields of `length', `section_area' and `material'
% must be specified,
% but can appear in any order
Example (cont.)
Nbeams 3   % Number of beams
beam   % Start of beam definition
length 100
section_area .5
material steel
beam   % Start of beam definition
length 8
section_area .04
material wood
beam   % Start of beam definition
% Oddly formatted, but valid, beam specification
material rubber length 1 section_area .3
Design your solution
• The structure of this file has a couple of features:
  – The data file starts with a Magic ID value which is used to uniquely identify the type of data file. Any program written to read these kinds of `beam files' should check that the first line starts with this Magic ID. This way the program can reject non-beam files.
  – The file contains some keywords that establish the context as to how subsequent groups of data should be read. This allows errors to be detected in the reading process.
• The reading states we expect are:
  1. Start with expecting to see a Magic ID.
  2. Then expect to see Nbeams specified.
  3. Then expect to see a `beam' keyword.
  4. Then expect to have length, section_area and material specified.
  5. Then expect to see another `beam' keyword etc., or the end of file.
Operations to use
• We use the basic operations of:
  – Reading lines from files
  – Breaking lines into tokens
  – Processing tokens by string comparison or conversion into values
% readbeamfile - reads textfile specifying beam structures
%
% Usage: beam = readbeamfile(filename)
%
% Returns an array of beam structures each having fields of
% `length', `section_area', and `material'
Example (cont.)
function beam = readbeamfile(filename)

    % Define some values to indicate reading states - this allows one
    % to structure the code logically

    waitingForID     = 0;   % These numbers are arbitrary
    waitingForNbeams = 1;   % - they just have to be unique.
    waitingForBeam   = 2;
    waitingForData   = 3;

    % State transitions can only be as follows:
    %
    % waitingForID -> waitingForNbeams -> waitingForBeam -> waitingForData --
    %                                           ^                            /
    %                                           |___________________________/

    [fid, msg] = fopen(filename, 'rt');
    error(msg);
Example (cont.)
    state = waitingForID;    % Initialise state
    beamNo = 0;              % Initialise beam index
    line = fgets(fid);       % Get first line

    while ischar(line)       % While there is still data to read
                             % (fgets returns -1 at the end of file)
        remainder = line;

        while any(remainder)
            [token, remainder] = strtok(remainder);

            if isempty(token)               % No tokens left on this line
                break;                      % break out of while loop and
                                            % go to next line

            elseif strncmp(token, '%', 1)   % This is a comment, skip what is left
                break                       % and go to next line.

            elseif state == waitingForID
Example (cont.)
                if ~strcmpi(token, 'CITS1005BEAM')   % Check for file type
                    fclose(fid);
                    error('This is not a CITS1005BEAM data file');
                end
                state = waitingForNbeams;            % State transition

            elseif state == waitingForNbeams

                if strcmpi(token, 'Nbeams')
                    [token, remainder] = strtok(remainder);   % Next token is the value
                    Nbeams = str2num(token);

                    % Allocate memory for struct array
                    beam = struct('length',       cell(1,Nbeams), ...
                                  'section_area', cell(1,Nbeams), ...
                                  'material',     cell(1,Nbeams));

                    state = waitingForBeam;          % State transition
                else
                    fclose(fid);
                    error('Unexpected data in file');
                end
Example (cont.)
            elseif state == waitingForBeam

                if strcmpi(token, 'beam')
                    beamNo = beamNo + 1;       % Increment beam count
                    state = waitingForData;    % State transition
                else
                    fclose(fid);
                    error('Unexpected data in file');
                end

            elseif state == waitingForData     % Fill in the beam data fields

                if strcmpi(token, 'length')
                    [token, remainder] = strtok(remainder);   % Next token is the value
                    beam(beamNo).length = str2num(token);

                elseif strcmpi(token, 'section_area')
                    [token, remainder] = strtok(remainder);   % Next token is the value
                    beam(beamNo).section_area = str2num(token);

                elseif strcmpi(token, 'material')
                    [token, remainder] = strtok(remainder);   % Next token is the value
                    beam(beamNo).material = token;

                else
                    fclose(fid);
                    error('Incomplete beam data, or unexpected data in file');
                end
Example (cont.)
                if beamSpecified(beam(beamNo))   % We have all the data
                    state = waitingForBeam;      % Go onto next beam
                else
                    state = waitingForData;      % Keep looking for data for this beam
                end

            end
        end

        line = fgets(fid);   % Get next line
    end

    % Check we got all the beams

    if beamNo ~= Nbeams
        fprintf('Data for %d beams were read, %d were expected\n', beamNo, Nbeams);
    end

    fclose(fid);

% Internal function to check that all fields of a beam structure have been set

function v = beamSpecified(b)
    v = ~isempty(b.length) && ~isempty(b.section_area) && ~isempty(b.material);