The following example is a complete awk program, which prints the number of occurrences of each word in its input. It illustrates the associative nature of awk arrays by using strings as subscripts. It also demonstrates the `for x in array' construction. Finally, it shows how awk can be used in conjunction with other utility programs to do a useful task of some complexity with a minimum of effort. Some explanations follow the program listing.
    awk '
    # Print list of word frequencies
        { for (i = 1; i <= NF; i++)
              freq[$i]++ }

    END { for (word in freq)
              printf "%s\t%d\n", word, freq[word] }'
The first thing to notice about this program is that it has two rules. The first rule, because it has an empty pattern, is executed on every line of the input. It uses awk's field-accessing mechanism (see section Examining Fields) to pick out the individual words from the line, and the built-in variable NF (see section Built-in Variables) to know how many fields are available. For each input word, an element of the array freq is incremented to reflect that the word has been seen an additional time.
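The field mechanism the first rule relies on can be seen in isolation; the sample line here is invented for illustration, not taken from the manual:

```shell
# NF holds the number of whitespace-separated fields on the line;
# $i selects field number i, so $2 is the second word.
echo 'one two three' | awk '{ print NF, $2 }'
# prints: 3 two
```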
The second rule, because it has the pattern END, is not executed until the input has been exhausted. It prints out the contents of the freq table that has been built up inside the first action.
Note that this program has several problems that would prevent it from being useful by itself on real text files:

   * The program relies on the awk convention that fields are separated by whitespace and that other characters in the input (except newlines) don't have any special meaning to awk. This means that punctuation characters count as part of words.

   * The awk language considers upper and lower case characters to be distinct. Therefore, `foo' and `Foo' are not treated by this program as the same word. This is undesirable, since in normal text words are capitalized if they begin sentences, and a frequency analyzer should not be sensitive to that.
The way to solve these problems is to use other system utilities to process the input and output of the awk script. Suppose the script shown above is saved in the file `frequency.awk'. Then the shell command:

    tr A-Z a-z < file1 | tr -cd 'a-z\012' \
        | awk -f frequency.awk \
        | sort +1 -nr

produces a table of the words appearing in `file1' in order of decreasing frequency.
The first tr command in this pipeline translates all the upper case characters in `file1' to lower case. The second tr command deletes all the characters in the input except lower case characters and newlines. The second argument to the second tr is quoted to protect the backslash in it from being interpreted by the shell. The awk program reads this suitably massaged data and produces a word frequency table, which is not ordered.
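The effect of the two tr stages can be checked on a single token (assuming a POSIX tr; the input word is made up for the demonstration):

```shell
# Stage one folds upper case to lower case; stage two (-cd) deletes every
# character outside the set a-z plus newline, including punctuation.
echo 'Foo-Bar!' | tr A-Z a-z | tr -cd 'a-z\012'
# prints: foobar
```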
The awk script's output is now sorted by the sort command and printed on the terminal. The options given to sort in this example specify to sort by the second field of each input line (skipping one field), that the sort keys should be treated as numeric quantities (otherwise `15' would come before `5'), and that the sorting should be done in descending (reverse) order.
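The `+1 -nr' form is the historic field syntax, and many modern sort implementations no longer accept it. The POSIX spelling of the same key, shown here as an equivalent rather than what the manual's pipeline uses, is `-k 2,2nr':

```shell
# -k 2,2nr: the sort key is field 2 only, compared numerically (n),
# in reverse order (r), so larger counts come first.
printf 'foo\t5\nbar\t15\n' | sort -k 2,2nr
# prints the `bar 15' line before the `foo 5' line
```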
See the general operating system documentation for more information on how to use the tr and sort commands.