awk学习笔记之基本使用

2014-09-28

awk基本使用方式

awk 'pattern1{action1}pattern2{action2}...' filename
awk -f progfile optional list of input files

If the program is long, however, it is more convenient to put it into a separate file, say progfile, and type the command line
awk自定义函数

function name( args ) { statements }
- awk中的函数与shell中的函数使用有类似之处，函数里面的参数默认是全局的，要使用局部变量只能在参数里面声明，在函数中对变量的修改将同时影响主程序，shell函数中可以使用local关键字来定义局部变量；
- awk中的函数定义位置更灵活（shell中必须在调用之前定义）；
- $0,$1,$2…默认是全部脚本使用，无论在函数内还是函数外，可以认为是全局的；
- function max(a,b,c) 如果在传入参数的时候未传入第三个参数那么默认第三个参数为函数内部的变量，即是局部变量；
- 函数返回值使用 res = max(a,b,c)；
- 函数的参数如果是标量则是传值，数组则是传引用，函数中改变数组的值可以改变全局数组中的值；

For example, give file seq

# seq - print sequences of integers
# input: arguments q, p q, or p q r; q >= p; r > 0
# output: integers 1 to q, p to q, or p to q in steps of r
BEGIN {
    if (ARGC == 2)
        for (i = 1; i <= ARGV[1]; i++)
            print i
    else if (ARGC == 3)
        for (i = ARGV[1]; i <= ARGV[2]; i++)
            print i
    else if (ARGC == 4)
        for (i = ARGV[1]; i <= ARGV[2]; i += ARGV[3])
            print i
}

awk -f seq 10
awk -f seq 1 10
awk -f seq 1 10 1

all generate the integers one through ten.

# File - insert_sort
#insertion sort
{A[NR] = $0}
END{ isort(A, NR)
     for (i = 1; i <= NR; i++)
         print A[i]
    }

# isort - sort A[1..n] by insertion

function isort(A,n,i,j,k){
    for (i = 2; i <= n; i++){
        for ( j = i; j > 1 && A[j-1] > A[j]; j--){
            # swap A[j-1] and A[j]
            t = A[j-1]
            A[j-1]=A[j]
            A[j]=t
        }
    }
}

# number.txt
12
10
2
23
18
23
9
99
88
56
28

执行 awk -f insert_sort number.txt 查看结果.

Awk的一些常用变量

NF 表示字段数量，在执行过程中相当于当前行的字段数
NR 表示记录数量，在执行过程中相当于当前行号
FS 分隔字符，默认是空格键
$0 当前行的文本内容
$1 当前行第一个字段的内容（默认以空格分隔字段）

COMPARISON OPERATORS

OPERATOR	MEANING
<	less than
<=	less than or equal to
==	equal to
!=	not equal to
>=	greater than or equal to
>	greater than
~	matched by
!~	not matched by

Summary of Patterns

BEGIN { statements }

The statements are executed once before any input has been read.
END { statements }

The statements are executed once after all input has been read.
expression { statements }

The statements are executed at each input line where the expression is true, that is,
nonzero or nonnull.
/regular expression/ { statements }

The statements are executed at each input line that contains a string matched by the regular expression.
The regular expression metacharacters are:\ ^ $ . [] | () * + ?

alternation: A|B matches A or B.
concatenation: AB matches A immediately followed by B.
closure: A* matches zero or more A's.
positive closure: A+ matches one or more A's.
zero or one: A? matches the null string or A.
parentheses: (r) matches the same strings as r does.
.: which matches any single character.
a character class: [ABC] matches any of the characters A, B, or C.
compound pattern { statements }

A compound pattern combines expressions with && (AND), || (OR), ! (NOT), and
parentheses; the statements are executed at each input line where the compound
pattern is true.
pattern1 , pattern2 { statements }

A range pattern matches each input line from a line matched by pattern 1 to the next
line matched by pattern 2, inclusive; the statements are executed at each matching
line.

BEGIN and END do not combine with other patterns. A range pattern cannot be part of
any other pattern. BEGIN and END are the only patterns that require an action.

If { statements } is omitted, default print $0.

String Match Patterns

/regexpr/
Matches when the current input line contains a substring matched by regexpr.
expression ~ /regexpr/
Matches if the string value of expression contains a substring matched by regexpr.
expression !~ /regexpr/
Matches if the string value of expression does not contain a substring matched by regexpr.

Some Examples

awk '$1 !~ /^[0-9]+/' countries

文件countries的第一个字段不能以数字开始
/^[0-9]+$/

matches any input line that consists of only digits
/^[0-9][0-9][0-9]$/

exactly three digits
/^(\+|-)?[0-9]+\.?[0-9]+$/

a decimal number with an optional sign and optional fraction
/^[+-]?[0-9]+[.]?[0-9]+$/

also a decimal number with an optional sign and optional fraction
for example: echo "-12.5" | awk '/^[+-]?[0-9]+[.]?[0-9]+$/' will get -12.5
/^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/

a floating point number with optional sign and optional exponent
/^[A-Za-z][A-Za-z0-9]*$/

a letter followed by any letters or digits (e.g., awk variable name)
/(^[A-Za-z]$|^[A-Za-z][0-9]$)/

a letter or a letter followed by a digit (e.g., variable name in Basic) - awk多选结构
/^[A-Za-z][0-9]?$/

also a letter or a letter followed by a digit

A Handful of Useful "One-liners" - from The AWK Programming Language

Print the total number of input lines:
END { print NR }
Print the tenth input line:
NR == 10
Print the last field of every input line:
{ print $NF }
Print the last field of the last input line:

{ field = $NF}
END { print field }
Print every input line with more than four fields:
NF > 4
Print every input line in which the last field is more than 4:
$NF > 4
Print the total number of fields in all input lines:

{ nf = nf + NF }
END { print nf }
Print the total number of lines that contain Beth:

/Beth/ { nlines = nlines + 1 }
END { print nlines }
Print the largest first field and the line that contains it (assumes some $1 is positive):

$1 > max { max = $1; maxline = $0 }
END { print max, maxline }
Print every line that has at least one field:
NF > 0
Print every line longer than 80 characters:
length($0) > 80
Print the number of fields in every line followed by the line itself:
{ print NF, $0 }
Print the first two fields, in opposite order, of every line:
{ print $2, $1 }
Exchange the first two fields of every line and then print the line:
{ temp = $1; $1 = $2; $2 = temp; print }
Print every line with the first field replaced by the line number:
{ $1 = NR; print }
Print every line after erasing the second field:
{ $2 = ""; print }
Print in reverse order the fields of every line:

for (i = NF; i > 0; i = i - 1) printf("%s ", $i)
printf ( "\n" )
Print the sums of the fields of every line:

{ sum= 0
for (i = 1; i <= NF; i = i + 1) sum = sum + $i
print sum
}
Add up all fields in all lines and print the sum:

{ for (i = 1; i <= NF; i = i + 1) sum = sum + $i }
END { print sum }
对于浮点类型的计算, 最好使用精度表示, 自动四舍五入
awk -F, '{x += $1; y += $2}END{printf "%.2f, %.2f\n", x, y}' file
Print every line after replacing each field by its absolute value:

{ for (i = 1; i <= NF; i = i + 1) if ($i < 0) $i = -$i
print
}