The following example is a complete awk program, which prints the number of occurrences of each word in its input. It illustrates the associative nature of awk arrays by using strings as subscripts. It also demonstrates the `for x in array' construction. Finally, it shows how awk can be used in conjunction with other utility programs to do a useful task of some complexity with a minimum of effort. Some explanations follow the program listing.
    awk '
    # Print list of word frequencies
        { for (i = 1; i <= NF; i++)
              freq[$i]++ }

    END { for (word in freq)
              printf "%s\t%d\n", word, freq[word] }'
The first thing to notice about this program is that it has two rules. The first rule, because it has an empty pattern, is executed on every line of the input. It uses awk's field-accessing mechanism (see section Examining Fields) to pick out the individual words from the line, and the built-in variable NF (see section Built-in Variables) to know how many fields are available. For each input word, an element of the array freq is incremented to reflect that the word has been seen an additional time.
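The field mechanism the first rule relies on can be seen in isolation; the sample line here is invented for illustration, not taken from the manual:

```shell
# NF holds the number of whitespace-separated fields on the line;
# $i selects field number i, so $2 is the second word.
echo 'one two three' | awk '{ print NF, $2 }'
# prints: 3 two
```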
The second rule, because it has the pattern END, is not executed until the input has been exhausted. It prints out the contents of the freq table that has been built up inside the first action.
Note that this program has several problems that would prevent it from being useful by itself on real text files:

   * The program relies on the awk convention that fields are separated by whitespace and that other characters in the input (except newlines) don't have any special meaning to awk. This means that punctuation characters count as part of words.

   * The awk language considers upper and lower case characters to be distinct. Therefore, `foo' and `Foo' are not treated by this program as the same word. This is undesirable, since in normal text words are capitalized if they begin sentences, and a frequency analyzer should not be sensitive to that.
The way to solve these problems is to use other system utilities to process the input and output of the awk script. Suppose the script shown above is saved in the file `frequency.awk'. Then the shell command:

    tr A-Z a-z < file1 | tr -cd 'a-z\012' \
        | awk -f frequency.awk \
        | sort +1 -nr

produces a table of the words appearing in `file1' in order of decreasing frequency.
The first tr command in this pipeline translates all the upper case characters in `file1' to lower case. The second tr command deletes all the characters in the input except lower case characters and newlines. The second argument to the second tr is quoted to protect the backslash in it from being interpreted by the shell. The awk program reads this suitably massaged data and produces a word frequency table, which is not ordered.
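The effect of the two tr stages can be checked on a single token (assuming a POSIX tr; the input word is made up for the demonstration):

```shell
# Stage one folds upper case to lower case; stage two (-cd) deletes every
# character outside the set a-z plus newline, including punctuation.
echo 'Foo-Bar!' | tr A-Z a-z | tr -cd 'a-z\012'
# prints: foobar
```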
The awk script's output is now sorted by the sort command and printed on the terminal. The options given to sort in this example specify to sort by the second field of each input line (skipping one field), that the sort keys should be treated as numeric quantities (otherwise `15' would come before `5'), and that the sorting should be done in descending (reverse) order.
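The `+1 -nr' form is the historic field syntax, and many modern sort implementations no longer accept it. The POSIX spelling of the same key, shown here as an equivalent rather than what the manual's pipeline uses, is `-k 2,2nr':

```shell
# -k 2,2nr: the sort key is field 2 only, compared numerically (n),
# in reverse order (r), so larger counts come first.
printf 'foo\t5\nbar\t15\n' | sort -k 2,2nr
# prints the `bar 15' line before the `foo 5' line
```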
See the general operating system documentation for more information on how to use the tr and sort commands.