PERL for Biologists

Course by Kurt Stüber

Part 5

Reading from files:

Any kind of file can be read by a PERL program, and the data in it interpreted, compared, reformatted, used for calculations, etc. First the file has to be opened: the program is told the name of the file and attaches a filehandle to it. The filehandle is a special identifier that stands for the open file in all subsequent read and write operations.

$myfile = "valuable_data.txt";
open( INFILE, $myfile ); 

The name INFILE is the filehandle. When the statement above is executed, one of two things happens. Either the statement succeeds: the file named valuable_data.txt is found, opened, and associated with the filehandle. Or the statement fails, for instance because no file named valuable_data.txt exists. PERL continues even if the file could not be opened, but any attempt to read data from it will of course fail. To check whether the statement succeeded, the following variant can be used:

open( INFILE, $myfile ) || die "The file $myfile could not be opened\n";

The two bars (||) signify the logical OR: the PERL program tries to open the file $myfile, and only if that does not succeed is the right-hand side executed. The function die then writes a helpful error message and ends the program.
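The behaviour of a failed open can be tried out with a minimal sketch like the following. The file name is hypothetical and chosen so that the open is expected to fail. Instead of die, the sketch uses an if statement so that the program keeps running and can report what happened. The special variable $! contains the reason for the failure as reported by the operating system.

```perl
use strict;
use warnings;

# Hypothetical file name, assumed not to exist in the current directory.
my $myfile = "no_such_file_for_this_example.txt";
my $status = "";

if( open( INFILE, $myfile ) )
   {
   $status = "opened";
   close INFILE;
   }
else
   {
   # $! holds the reason for the failure,
   # for instance "No such file or directory".
   $status = "failed";
   print "The file $myfile could not be opened: $!\n";
   }

print "Status of the open statement: $status\n";
```

Adding $! to the message of a die statement in the same way usually makes the error message more helpful than a fixed text.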

After the file is opened data can be read from it. The following code reads all lines from a file and writes them to the output window:

while( <INFILE> )
   {
   $current_line = $_;
   chomp $current_line;
   print "$current_line\n";
   }

The special variable $_ holds the current line read from INFILE. You may think of a kind of pointer that always points to the start of the current line. When this line is read, the pointer jumps to the start of the next line. The function chomp removes the newline character at the end of a string. If the string does not end with a newline, nothing is removed. When the current line is printed to the screen, the newline has to be added again with \n.
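The effect of chomp can be checked on plain strings without reading a file. The following is a small sketch; the variable names and the example sequence are made up for illustration.

```perl
use strict;
use warnings;

# One string that ends with a newline and one that does not.
my $with_newline    = "ATGCCG\n";
my $without_newline = "ATGCCG";

my $length_before = length( $with_newline );

# chomp returns the number of characters it removed.
my $removed_first  = chomp( $with_newline );
my $removed_second = chomp( $without_newline );

my $length_after = length( $with_newline );

print "Length before chomp: $length_before, after chomp: $length_after\n";
print "Characters removed: $removed_first and $removed_second\n";
```

The first string shrinks from 7 to 6 characters, while the second string is left unchanged.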

Writing to a file:

Again, the file to be used for writing has to be opened beforehand:

open( OUTFILE, "> $myfile" ) || die "The file $myfile could not be opened for writing\n";

The only difference from the open statement for the input file is the greater-than symbol (>) in front of the file name. If you open an existing file for writing in this way, then the data in this file will be lost and overwritten. With the forms of open shown here, a file is opened either for input or for output, not for both at the same time. A variant is to open a file for output and append the written data to the end of the file:

open( OUTFILE, ">> $myfile" ) || die "The file $myfile could not be opened for appending\n";

Here the two greater-than symbols (>>) indicate that the data are to be appended. The file need not exist before the program runs; if it does not exist, it is created. If it exists, it need not be empty: any data in the file are preserved and the new data are added at the end.

Writing to files is done with print statements. The filehandle is placed directly after print, before the string to be printed:

print OUTFILE "Hello World\n";

Note that there are no commas before or after OUTFILE.

Closing a file:

At the end of the program any open file should be closed.

close INFILE;
close OUTFILE;

If files are not closed at the end of the program run, they are closed automatically. Explicitly closing a file allows you to open another file during the same program run using the same filehandle. You might also write to a file, close it, and open it again for reading.
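The following sketch puts writing, appending, closing, and re-opening together. It reuses one filehandle, here called MYFILE, for all three open statements. The file name and the filehandle name are hypothetical, and the file is deleted again at the end.

```perl
use strict;
use warnings;

# Hypothetical file name used only for this example.
my $myfile = "append_example.txt";

# ">" creates the file, or empties it if it already exists.
open( MYFILE, "> $myfile" ) || die "Cannot open $myfile for writing: $!\n";
print MYFILE "first line\n";
close MYFILE;

# ">>" keeps the existing content and adds the new data at the end.
open( MYFILE, ">> $myfile" ) || die "Cannot open $myfile for appending: $!\n";
print MYFILE "second line\n";
close MYFILE;

# After closing, the same filehandle can be used to read the file.
my $line_count = 0;
open( MYFILE, $myfile ) || die "Cannot open $myfile for reading: $!\n";
while( <MYFILE> )
   {
   $line_count++;
   }
close MYFILE;

print "The file $myfile contains $line_count lines.\n";

# Remove the example file again.
unlink $myfile;
```

If the second open used > instead of >>, the first line would be overwritten and only one line would be counted.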

Splitting strings:

When you try to interpret the contents of a file, you have to inspect the individual lines and split the items on each line into separate substrings. Look at the following table, which contains the results of measurements from five days, two measurements per day. The individual values are separated by semicolons (;).


1;34.5;17.2
2;25.4;19.1
3;31.7;18.4
4;28.3;16.9
5;32.8;17.6

A file of this kind can be produced by EXCEL when you save the table as text with the items separated by semicolons. Commas are often used instead of semicolons. Assuming that this table is stored in a file called measurements.txt, the following PERL code reads the individual lines and splits them at the semicolons to separate the values from each other:


my @no_of_measurement = ();
my @first_measurement = ();
my @second_measurement = ();
my @item = ();
my $input_line = "";
my $n = 0;

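open( INFILE, "measurements.txt" ) || die "The file measurements.txt could not be opened\n";
while( <INFILE> )
	{
	$input_line = $_;
	# remove the newline, otherwise it stays attached to the last value
	chomp $input_line;
	@item = split( /;/, $input_line );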
	$no_of_measurement[ $n ] = $item[ 0 ];
	$first_measurement[ $n ] = $item[ 1 ];
	$second_measurement[ $n ] = $item[ 2 ];
	$n++;
	}
	
print "There are $n measurements in the input file.\n";
close INFILE;

Note the function split. Between forward slashes a pattern is defined, in this case a semicolon. The string variable $input_line is split at every occurrence of this pattern, and the three resulting values are stored in the array @item. From this array the values can then be transferred to other arrays (@no_of_measurement, @first_measurement, @second_measurement). Note again that the numbering of elements in PERL arrays starts with zero (0) by default.
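The splitting of a single line can be tried out without a file, as in the following sketch. It uses the third line of the table above as a string; the variable names $day, $first, $second, and $mean are chosen only for this illustration.

```perl
use strict;
use warnings;

# One line as it would be read from the file, including the newline.
my $input_line = "3;31.7;18.4\n";

# Without chomp the newline would remain attached to the last item.
chomp $input_line;

my @item = split( /;/, $input_line );
my $number_of_items = scalar( @item );

my $day    = $item[ 0 ];
my $first  = $item[ 1 ];
my $second = $item[ 2 ];

# The items are strings, but PERL converts them to numbers
# when they are used in a calculation.
my $mean = ( $first + $second ) / 2;

print "The line was split into $number_of_items items.\n";
print "Day $day: mean of the two measurements is $mean\n";
```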

The patterns are also called regular expressions. They can be defined in a very flexible way by using metacharacters, which stand for classes of characters or for symbols that cannot be typed directly, such as the tabulator. There is a list of metacharacters or escape characters in this course. Further examples will follow.
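As a small illustration of such patterns, the following sketch splits three made-up strings with different separators. The example strings and array names are assumptions for this illustration only.

```perl
use strict;
use warnings;

# A character class in square brackets matches any one of the
# listed characters, here a semicolon or a comma.
my @by_either = split( /[;,]/, "1;34.5,17.2" );

# \t stands for the tabulator character.
my @by_tab = split( /\t/, "1\t34.5\t17.2" );

# \s+ stands for one or more whitespace characters.
my @by_space = split( /\s+/, "1   34.5 17.2" );

print "Items found: ", scalar( @by_either ), ", ",
      scalar( @by_tab ), ", ", scalar( @by_space ), "\n";
```

Each of the three split statements yields the same three items: 1, 34.5, and 17.2.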

Exercises:

Write a program that reads a DNA sequence file and returns the length of the sequence and the base composition. Write the result to another file.

Solutions:
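One possible approach to the exercise is sketched below. It is not the official solution of the course. It assumes that the sequence file is plain text containing only the sequence, possibly spread over several lines, with no header line. The file names are hypothetical, and the sketch first writes a small example input file so that it can be run as it is.

```perl
use strict;
use warnings;

# Hypothetical file names.
my $seqfile    = "example_sequence.txt";
my $resultfile = "example_result.txt";

# Create a small example input file (two lines of sequence).
open( OUTFILE, "> $seqfile" ) || die "Cannot open $seqfile for writing: $!\n";
print OUTFILE "ATGCGTAC\n";
print OUTFILE "GGTTAACC\n";
close OUTFILE;

# Counters for the four bases, and the total length.
my %count  = ( "A" => 0, "C" => 0, "G" => 0, "T" => 0 );
my $length = 0;

open( INFILE, $seqfile ) || die "Cannot open $seqfile for reading: $!\n";
while( <INFILE> )
   {
   my $line = $_;
   chomp $line;
   # An empty pattern splits the line into single characters;
   # uc converts lowercase letters to uppercase first.
   foreach my $base ( split( //, uc( $line ) ) )
      {
      # Every character counts towards the length, but only
      # A, C, G and T are counted in the base composition.
      $length++;
      $count{$base}++ if exists $count{$base};
      }
   }
close INFILE;

# Write the result to another file.
open( OUTFILE, "> $resultfile" ) || die "Cannot open $resultfile for writing: $!\n";
print OUTFILE "Length: $length\n";
foreach my $base ( "A", "C", "G", "T" )
   {
   print OUTFILE "$base: $count{$base}\n";
   }
close OUTFILE;

print "Result written to $resultfile\n";
```

For a file with a header line, for instance in FASTA format, the header would have to be skipped before counting.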


© 2001-2007, by Kurt Stüber.