Scripts and Utilities -- Cat/Sort/At/Grep/Find lecture



This lecture is a bit of a hodgepodge. It will cover cat, sort, at, grep, and find.

Setup


cat

Cat is probably the most basic Unix command. If no input files are specified, cat copies standard input to standard output. If input files are specified, then cat copies them, in order, to standard output. Thus, cat is ideal for concatenating two files into a third:
UNIX> cat f1
And she's buying
UNIX> cat f2
a stairway to heah ven
UNIX> cat f1 f2
And she's buying
a stairway to heah ven
UNIX> cat f1 f2 > f3
UNIX> cat f3
And she's buying
a stairway to heah ven
UNIX> 
Remember, when using either the Bourne shell or csh, you cannot redirect standard output to a file that you're also using as input. Why? Because both shells create (and truncate) the output file before running the command. Thus, the contents of the input file are lost before the command gets executed.

NOTE:
Most of you running csh have the variable 'noclobber' set. This makes it "safe" to use the same file as input and output: it tells csh not to destroy an existing file, but to give a warning and terminate the command instead. BUT this "safety" may lull you into a false sense of security, because it does NOT affect the operation of any scripts you may use. I would like you either to run the Bourne shell in a window to test the examples, or to do the following in a window:

UNIX> unset noclobber
which will make csh act like sh.
UNIX> cat f3
And she's buying
a stairway to heah ven
UNIX> cat f3 > f3
cat: input/output files  'f3' identical 
UNIX> cat f3
UNIX> cat f1 f2 > f3
UNIX> cat f3
And she's buying
a stairway to heah ven
UNIX> cat < f3 > f3
cat: input/output files '-' identical
UNIX> cat f3
UNIX> 
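If you really do want a command's output to replace one of its own input files, the usual workaround is to write to a temporary file and then rename it over the original. Here is a minimal sketch of both the problem and the workaround. It runs in a scratch directory, and the names victim and f3.tmp are just placeholders for illustration (mktemp is assumed to be available, which it is on most current systems):

```shell
#!/bin/sh
# Work in a scratch directory so nothing real gets clobbered.
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1

printf '%s\n' "And she's buying" > f1
printf '%s\n' "a stairway to heah ven" > f2
cat f1 f2 > f3

# Wrong: the shell truncates the output file before cat ever reads it,
# so the file ends up empty. (Some versions of cat also complain on
# standard error; that message is discarded here.)
cp f3 victim
cat victim > victim 2>/dev/null || true
victim_size=$(wc -c < victim | tr -d ' ')
echo "victim is now $victim_size bytes"

# Right: write to a temporary file, then rename it over the original.
cat f3 f1 > f3.tmp && mv f3.tmp f3
result=$(cat f3)
printf '%s\n' "$result"
```

The && matters: if the cat fails, the mv is skipped and the original f3 is left alone.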

cat options

There are a few command line arguments to cat that make it a more useful command than it may first appear. Read the man page for a full description. First, -ve displays non-printing characters, and a $ for each newline, which sometimes tells you surprising things:
UNIX> cat sth
And when we wind on down the road
Our shadow's taller than our souls
There walks a lady we all know
Who shines bright lights and wants to know
Why all that glitters is not gold
These lyrics are all old as mold
UNIX> cat -ve sth
And when we wind on down the road$
Our shadow's taller than our souls$
There walks a lady we all know$
Who shines subliminal message^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^Hbright lights and wants to know                   $
Why all that glitters is not gold$
These lyrics are all old as mold$
UNIX> 
Second, -n prepends the line number to each line. E.g.
UNIX> cat md
See Saw, Margery Daw
 Johnny will have a new Master
He will get but a penny a day
 Because this poem is a disaster!
UNIX> cat -n md
   1  See Saw, Margery Daw
   2     Johnny will have a new Master
   3  He will get but a penny a day
   4     Because this poem is a disaster!
UNIX> 
Third, if you intermix - with the filenames, cat will use standard input when it reaches the -:
UNIX> echo "This is a tiresome song" | cat f1 - f2
And she's buying
This is a tiresome song
a stairway to heah ven
UNIX> 
Irritatingly, even cat has different implementations on different machines. Thus, -e works in one way on one machine, and another on another. For example, try -e on SunOS and Solaris machines. Or -s. This is very frustrating if you are trying to write portable shell scripts. My philosophy on this is to not use options which are not portable. This means that before you write a shell script that you are going to send to someone you should test it out on many machines and operating systems. A good way to break a shell script that runs on a BSD machine (like SunOS or Ultrix) is to try it on a System V machine (like Solaris or HPUX). It is frustrating, but you must do it.

Sort

Like cat, sort is a straightforward program: it sorts files. If you give it no command line options, it sorts lexicographically by lines. Note that initial spaces are included, which is sometimes not what you want:
UNIX> cat md
See Saw, Margery Daw
 Johnny will have a new Master
He will get but a penny a day
 Because this poem is a disaster!
UNIX> sort md
 Because this poem is a disaster!
 Johnny will have a new Master
He will get but a penny a day
See Saw, Margery Daw
UNIX> 
Use -r to sort in reverse order, and -n to sort numerically rather than lexicographically:
UNIX> cat gstats
9   8  8.99  6.99 Steve Elkington
14 10 12.28  7.19 Brad "Chicken Neck" Faxon
1   1 10.87 10.87 Tom Kalinowski
12 10 10.13  7.16 Tom Lehman
7   6  9.46  6.87 Greg Norman
14 11 10.02  5.94 Jesper Parnevik
9   9  5.10  5.10 Nick Price
10 10  5.63  5.63 Tiger Woods
UNIX> sort gstats
1   1 10.87 10.87 Tom Kalinowski
10 10  5.63  5.63 Tiger Woods
12 10 10.13  7.16 Tom Lehman
14 10 12.28  7.19 Brad "Chicken Neck" Faxon
14 11 10.02  5.94 Jesper Parnevik
7   6  9.46  6.87 Greg Norman
9   8  8.99  6.99 Steve Elkington
9   9  5.10  5.10 Nick Price
UNIX> sort -n gstats
1   1 10.87 10.87 Tom Kalinowski
7   6  9.46  6.87 Greg Norman
9   8  8.99  6.99 Steve Elkington
9   9  5.10  5.10 Nick Price
10 10  5.63  5.63 Tiger Woods
12 10 10.13  7.16 Tom Lehman
14 10 12.28  7.19 Brad "Chicken Neck" Faxon
14 11 10.02  5.94 Jesper Parnevik
UNIX> sort -r gstats
9   9  5.10  5.10 Nick Price
9   8  8.99  6.99 Steve Elkington
7   6  9.46  6.87 Greg Norman
14 11 10.02  5.94 Jesper Parnevik
14 10 12.28  7.19 Brad "Chicken Neck" Faxon
12 10 10.13  7.16 Tom Lehman
10 10  5.63  5.63 Tiger Woods
1   1 10.87 10.87 Tom Kalinowski
UNIX> sort -nr gstats
14 11 10.02  5.94 Jesper Parnevik
14 10 12.28  7.19 Brad "Chicken Neck" Faxon
12 10 10.13  7.16 Tom Lehman
10 10  5.63  5.63 Tiger Woods
9   9  5.10  5.10 Nick Price
9   8  8.99  6.99 Steve Elkington
7   6  9.46  6.87 Greg Norman
1   1 10.87 10.87 Tom Kalinowski
UNIX> 
Sort lets you break up the input into "fields" and sort on a particular field. The default delimiter for fields is white space. You specify the sorting field with -k n, where n is the number of the field; the first field is 1. You can also specify a range of fields with -k n,m. If you want to sort (numerically) using the third column of gstats, you do:
UNIX> sort -n -k 3 gstats
9   9  5.10  5.10 Nick Price
10 10  5.63  5.63 Tiger Woods
9   8  8.99  6.99 Steve Elkington
7   6  9.46  6.87 Greg Norman
14 11 10.02  5.94 Jesper Parnevik
12 10 10.13  7.16 Tom Lehman
1   1 10.87 10.87 Tom Kalinowski
14 10 12.28  7.19 Brad "Chicken Neck" Faxon
UNIX> 
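One subtlety worth knowing: -k 3 on its own means "from field 3 to the end of the line", so later fields can act as tie-breakers. Writing -k 3,3 limits the key to that one field, and a letter such as n or r can be attached to an individual key. Here is a small sketch, assuming a POSIX-style sort and a cut-down copy of gstats, that sorts on the second column in descending order and breaks ties with the third column in ascending order:

```shell
#!/bin/sh
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT
cat > "$tmp/gstats" <<'EOF'
9   8  8.99  6.99 Steve Elkington
14 10 12.28  7.19 Brad "Chicken Neck" Faxon
12 10 10.13  7.16 Tom Lehman
14 11 10.02  5.94 Jesper Parnevik
10 10  5.63  5.63 Tiger Woods
EOF

# Primary key: field 2 only, numeric, reversed.
# Secondary key: field 3 only, numeric, ascending.
sorted=$(sort -k 2,2nr -k 3,3n "$tmp/gstats")
printf '%s\n' "$sorted"
```

Parnevik (11) comes out first, then the three golfers with 10 ordered by their third column (Woods, Lehman, Faxon), and Elkington (8) last.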
And if you want to sort by the golfers' first names, you do:
UNIX> sort -k 5 gstats
14 10 12.28  7.19 Brad "Chicken Neck" Faxon
7   6  9.46  6.87 Greg Norman
14 11 10.02  5.94 Jesper Parnevik
9   9  5.10  5.10 Nick Price
9   8  8.99  6.99 Steve Elkington
10 10  5.63  5.63 Tiger Woods
1   1 10.87 10.87 Tom Kalinowski
12 10 10.13  7.16 Tom Lehman
UNIX> 
When sorting lexicographically, sort includes white space, which seems odd. To have it ignore leading white space in a field, you use the -b option. Thus, to sort a file lexicographically, and ignore leading white space, you do:
UNIX> sort -b -k 1 md
 Because this poem is a disaster!
He will get but a penny a day
 Johnny will have a new Master
See Saw, Margery Daw
UNIX> 
NOTE:
On SunOS and some other systems the method of specifying a field is +n. This is supported by the sort on Solaris, but the man page indicates that it is obsolete. This means that a future version of the operating system distribution may not support it at all. Again, portability may require that you do lots of research or use the simplest forms of the utilities.

The last two important options of sort are -u, which strips out duplicates, and -f which ignores the distinction between upper and lower case. As always, read the man page for more info and more options.
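Here is a quick sketch of -u and -f, using a made-up word list (the file name words is hypothetical). LC_ALL=C is set as an assumption to pin down the ordering, since the default order of upper and lower case letters varies with the locale:

```shell
#!/bin/sh
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT
printf '%s\n' banana Apple apple cherry Banana apple > "$tmp/words"

# LC_ALL=C pins the collating order so the results do not depend on
# the locale of the machine running the sketch.
LC_ALL=C; export LC_ALL

# -u: exact duplicates collapse, but Apple and apple stay distinct.
unique=$(sort -u "$tmp/words")
# -f with -u: lines that differ only in case count as duplicates too.
folded=$(sort -f -u "$tmp/words")

printf '%s\n' "$unique"
echo "---"
printf '%s\n' "$folded"
```

The first sort leaves five lines and the second leaves three. Which spelling survives from each group in the second case (Apple or apple, say) is up to the implementation, so don't rely on it.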


At

Please note that at behaves somewhat differently on Debian; the original Unix lecture notes below have not been changed to reflect this. Checking the man page on the operating system you are actually using is always a good idea. Please also note the use of echo in the examples. On Debian, the equivalent of mailx is nail.

The examples I used in class are:

>echo 'cat md' | at +1minute
>echo 'cat md' | at +2minute
>echo 'cat md' | at 2pm
>atq
14      2005-09-26 12:45 a ytang
15      2005-09-26 12:46 a ytang
16      2005-09-26 14:00 a ytang
>atrm 16
>atq
15      2005-09-26 12:46 a ytang
Job 14 is no longer displayed because it has already executed. If you don't use echo, you will probably receive error messages. Try it out. Here is the original UNIX class note:

******************************
At is a command that lets you submit a job to be executed sometime in the future. The syntax is:
at [ -c | -k | -s ] [-f file] [-m] timespec  
By default, at uses the SHELL variable to determine which shell to use to execute the commands. You can change this with -s, which says to treat the script as a Bourne shell script. The other two options, -c and -k, are for csh and ksh respectively. If you don't specify a file, then at reads the commands from standard input.

When the proper time comes about, the operating system (specifically, the cron daemon) will execute a process under your ownership that first cd's to the directory in which you made the at call, and then runs the script. If the job produces any output, at will email it to you. A note here: if the directory is only mounted while YOU are logged in to that particular machine (like your home area), then you cannot log out and expect at to be able to work. Use cron instead.

At is a great program. The main problem with it is that there is no real standard at syntax. Some versions don't support the -s command line argument. Some let you specify relative times (like "at now + 1 hour"). Some don't. However, most of the time, it doesn't matter. If you give at a time without a date, it assumes that you mean the nearest date with that time.

For example, suppose it is 4:00 PM, on June 17, 1997. The following at commands will send mail to your boss at 2:00 AM on June 18:

UNIX> cat bossmail
Hi Boss,

See how late I'm working?  I think I deserve a raise!!!

Me
UNIX> at 2am
at> mailx -s 'an idea' your_boss_email < bossmail
at> 
warning: commands will be executed using /bin/csh
job 866613600.a at Wed Jun 18 02:00:00 1997
UNIX> at 2am June 18
at> mailx -s 'an idea' your_boss_email < bossmail
at> 
warning: commands will be executed using /bin/csh
job 866613601.a at Wed Jun 18 02:00:00 1997
UNIX> echo "mailx -s 'an idea' your_boss_email < bossmail" | at 2am June 18
warning: commands will be executed using /bin/csh
job 866613602.a at Wed Jun 18 02:00:00 1997
UNIX>
To see what jobs you have currently queued up, do atq or at -l:
UNIX> atq
Rank     Execution Date     Owner     Job         Queue   Job Name
1st   Jun 18, 1997 02:00   jplank  866613600.a     a     stdin
2nd   Jun 18, 1997 02:00   jplank  866613601.a     a     stdin
3rd   Jun 18, 1997 02:00   jplank  866613602.a     a     stdin
UNIX> at -l
866613600.a     Wed Jun 18 02:00:00 1997
866613601.a     Wed Jun 18 02:00:01 1997
866613602.a     Wed Jun 18 02:00:02 1997
UNIX> 
And if you have a change of heart and want to remove the jobs, use at -r or atrm. Note that not all systems support all of these -- sometimes you have to hunt around to figure out how at and related commands work. As always, read the man page.
UNIX> at -r 866613600.a
UNIX> at -l
866613601.a     Wed Jun 18 02:00:01 1997
866613602.a     Wed Jun 18 02:00:02 1997
UNIX> atrm 866613601.a 866613602.a
866613601.a: removed
866613602.a: removed
UNIX> at -l
If at -s does not work on your system and you want to run a Bourne shell script, do
UNIX> echo 'sh scriptname' | at time
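For jobs longer than one line, piping echo into at gets awkward. One alternative is to put the commands in a file and use the -f option from the syntax above. The following is only a sketch: SUBMIT is a switch I made up so that the script does nothing but print by default, and the relative time "now + 1 minute" is one of those forms that, as noted earlier, not every version of at accepts.

```shell
#!/bin/sh
# Put the commands for the job in a file (a here-document keeps the
# quoting simple), then hand the file to at with -f.
job=$(mktemp) || exit 1
trap 'rm -f "$job"' EXIT
cat > "$job" <<'EOF'
date
cat md
EOF

# SUBMIT is a made-up switch for this sketch: by default the script
# only shows what it would do, so it is safe to run anywhere.
if [ "${SUBMIT:-no}" = yes ] && command -v at >/dev/null 2>&1; then
    at -f "$job" now + 1 minute
else
    echo "would run: at -f $job now + 1 minute"
    cat "$job"
fi
```

Run it as SUBMIT=yes sh scriptname to queue the job for real, then check it with atq.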

grep

The name grep comes from the ed command g/re/p, which globally searches for a regular expression and prints the matching lines. Its syntax is
UNIX> grep pattern [ files ]
If you don't specify files on the command line, then it will use standard input. It prints out all lines in the specified files that contain the pattern. If you specify more than one file on the command line, then it will prefix each line with the name of the file that it came from. Examples:
UNIX> grep penny md
He will get but a penny a day
UNIX> grep penny < md
He will get but a penny a day
UNIX> grep all md sth
sth:Our shadow's taller than our souls
sth:There walks a lady we all know
sth:Why all that glitters is not gold
sth:These lyrics are all old as mold
UNIX> 
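The pattern is a ``regular expression.'' While they're not exactly the same as regular expressions in something like CS380, they're pretty close. The basics, paraphrased from the grep and ed man pages, are:
  1. Any ordinary character matches itself.
  2. A period (.) matches any single character.
  3. A string of characters in square brackets matches any one of those characters: [abc] matches a, b or c. A range may be written with a dash, as in [a-z], and if the first character after the [ is a ^, the set matches any character that is NOT listed.
  4. A ^ at the beginning of a pattern matches the beginning of a line, and a $ at the end of a pattern matches the end of a line.
Ok, so this means that if you want to grep for any digit, you can use [0-9]. If you want all lower case letters, use [a-z], and for all lower and upper case letters, use [a-zA-Z]. It's always best to use single quotes when you're specifying patterns. Here are some examples: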


UNIX> cat greptest
Jim Plank
This string contains no numbers
This string does though (1)
-9.00
G0 V0LS
UNIX> grep '[Gg]' greptest
This string contains no numbers
This string does though (1)
G0 V0LS
UNIX> grep '[0-9]' greptest
This string does though (1)
-9.00
G0 V0LS
UNIX> grep '[A-Z]' greptest
Jim Plank
This string contains no numbers
This string does though (1)
G0 V0LS
UNIX> grep '[^A-Za-z ]' greptest
This string does though (1)
-9.00
G0 V0LS
UNIX> 
So, to grep for lines with exactly 9 characters (note that the newline doesn't count), do:
UNIX> grep '^.........$' greptest 
Jim Plank
UNIX> 
To grep for lines with at least 9 characters, do:
UNIX> grep '.........' greptest 
Jim Plank
This string contains no numbers
This string does though (1)
UNIX>
To grep for lines that end with two numbers, do:
UNIX> grep '[0-9][0-9]$' greptest
-9.00
UNIX>
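The sequences \< and \> match the beginning and the end of a word, respectively. Examples (don't forget the quotes when using the greater-than and less-than signs, or the shell will treat them as redirection):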


UNIX> grep all sth
Our shadow's taller than our souls
There walks a lady we all know
Why all that glitters is not gold
These lyrics are all old as mold
UNIX> grep '\<.ll\>' sth
There walks a lady we all know
Why all that glitters is not gold
These lyrics are all old as mold
UNIX> grep 'dow\>' sth
Our shadow's taller than our souls
UNIX> grep '\<.\>' sth
Our shadow's taller than our souls       (matching the s in "shadow's")
There walks a lady we all know
UNIX> 
A * following a single-character expression matches zero or more occurrences of that expression. Note that this includes zero. So, the following will match all lines, even though none have Z's:
UNIX> grep 'Z*' md
See Saw, Margery Daw
 Johnny will have a new Master
He will get but a penny a day
 Because this poem is a disaster!
UNIX> 
Here are some more examples. The first greps for two words separated by a space (actually, since * can match zero, this will also match a single space, or a word before or following a single space). The second greps for a period followed by any number of zeros, and then the end of line. The last greps for any line with two zeros somewhere.
UNIX> grep '^[^ ]* [^ ]*$' greptest
Jim Plank
G0 V0LS
UNIX> grep '\.0*$' greptest
-9.00
UNIX> grep '0.*0' greptest
-9.00
G0 V0LS
UNIX> 
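Finally, a single-character expression followed by \{m\} matches exactly m occurrences of it, \{m,\} matches at least m, and \{m,n\} matches between m and n. So, some more examples. The first is equivalent to grepping for 0: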
UNIX> grep '0\{1\}' greptest
-9.00
G0 V0LS
This is equivalent to grepping for 000*:
UNIX> grep '0\{2,\}' greptest
-9.00
Here we grep for 5-letter words containing just lower case letters, then 5-letter words, then words of at least 5 letters:
UNIX> grep '\<[a-z]\{5\}\>' greptest
UNIX> grep '\<[A-Za-z]\{5\}\>' greptest
Jim Plank
UNIX> grep '\<[A-Za-z]\{5,\}\>' greptest
Jim Plank
This string contains no numbers
This string does though (1)
UNIX> 
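To tie the pieces together, here is a sketch that picks out lines consisting of nothing but a decimal number with two digits after the point and an optional leading minus sign. It is written twice, once with * only and once with \{m,n\} intervals, and it recreates greptest in a scratch directory so that it can be run anywhere:

```shell
#!/bin/sh
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT
cat > "$tmp/greptest" <<'EOF'
Jim Plank
This string contains no numbers
This string does though (1)
-9.00
G0 V0LS
EOF

# Using only * : "-*" is zero or more minus signs, and "[0-9][0-9]*"
# is the usual way to say "one or more digits".
with_star=$(grep '^-*[0-9][0-9]*\.[0-9][0-9]$' "$tmp/greptest")

# Using intervals: at most one minus sign, at least one digit,
# a literal period, then exactly two digits.
with_interval=$(grep '^-\{0,1\}[0-9]\{1,\}\.[0-9]\{2\}$' "$tmp/greptest")

printf '%s\n' "$with_star"
printf '%s\n' "$with_interval"
```

Both patterns select only the -9.00 line. The interval version is stricter, since the * version would also accept a line like --9.00.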
If you want to make sure that grep prints out the file name of the file that the line comes from, include /dev/null on the command line. Then you'll have at least two files on the command line, and grep will be sure to print the file name:
UNIX> grep '\<.ld\>' sth
These lyrics are all old as mold
UNIX> grep '\<.ld\>' sth /dev/null
sth:These lyrics are all old as mold
UNIX> 
grep can do far more than this -- you need to read the man page to figure it all out. Also, you should read about egrep and fgrep. The newest man page for /usr/bin/grep does not really tell you WHICH regular expressions it uses. Assume 'basic' regular expressions and read the man pages for regex(5) and regexp(5). You may still get it wrong, but it's better than the alternative.
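As a taste of what is in the man page, here are four options that I believe are safe to rely on, since they are part of the POSIX description of grep: -i, -v, -n and -c. The sketch recreates md in a scratch directory:

```shell
#!/bin/sh
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT
cat > "$tmp/md" <<'EOF'
See Saw, Margery Daw
 Johnny will have a new Master
He will get but a penny a day
 Because this poem is a disaster!
EOF

nocase=$(grep -i 'master' "$tmp/md")    # -i: ignore case
inverted=$(grep -v '^ ' "$tmp/md")      # -v: lines that do NOT match
numbered=$(grep -n 'penny' "$tmp/md")   # -n: prefix each line with its number
count=$(grep -c '^ ' "$tmp/md")         # -c: print only a count of matching lines

printf '%s\n' "$nocase" "$inverted" "$numbered" "$count"
```

Here -i finds the "Master" line, -v prints the two lines that do not start with a space, -n prints 3:He will get but a penny a day, and -c prints 2.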

find

Find is a command that does recursive directory traversal. It is most useful when you need to do one of three things:
  1. Find a file with a specific name.
  2. Grep through all of your files to find a specific word or pattern in one.
  3. Remove a bunch of files with specific names in all your directories.
I'll go over these three examples. Read the man page to figure out how to do other cool things with find.

First, to find a file with a specific name in any directory reachable from a directory x, do:

find x -name name -print
For example, to find all .c files reachable from your home directory, do:
UNIX> find $HOME -name '*.c' -print
/mahogany/homes/plank/src/jgraph/work/draw.c
/mahogany/homes/plank/src/jgraph/work/edit.c
/mahogany/homes/plank/src/jgraph/work/exit.c
/mahogany/homes/plank/src/jgraph/work/jgraph.c
/mahogany/homes/plank/src/jgraph/work/libmalloc.c
/mahogany/homes/plank/src/jgraph/work/list.c
/mahogany/homes/plank/src/jgraph/work/printline.c
/mahogany/homes/plank/src/jgraph/work/prio_list.c
/mahogany/homes/plank/src/jgraph/work/process.c
/mahogany/homes/plank/src/jgraph/work/show.c
...
UNIX>
Note that you should put the '*' in single quotes. Also, note that find will not traverse '..', nor will it normally traverse soft links to other directories. Sometimes this is a drag, but it's pretty much for the best. Suppose you wanted to find all the core files reachable from the current directory. Then you do:
UNIX> find . -name core -print
Note that -name matches file names using the shell's wildcard syntax, not regular expressions. In other words, if you want to find all your files with two-letter names, do:
UNIX> find $HOME -name '??' -print
If you want to print all files reachable from your home directory, do
UNIX> find $HOME -print
Using find to find filenames is something I do so much that I have a shell script called jf that finds all files reachable from the current directory whose names contain a given string:

UNIX> jf md
find . -name '*md*' -print
./md
UNIX> jf g
find . -name '*g*' -print
./greptest
./gstats
UNIX>
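The jf script itself is not shown here, but judging from the output above, something like the following would behave the same way. This is a guess at its shape written as a shell function, not the original script:

```shell
#!/bin/sh
# jf: list files under the current directory whose names contain $1.
jf() {
    # Show the command first, as in the transcript, then run it.
    echo "find . -name '*$1*' -print"
    find . -name "*$1*" -print
}

# Try it in a scratch directory holding a few of the lecture's file names.
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1
touch md sth greptest gstats
jf md
jf g
```

The double quotes around "*$1*" keep the shell from expanding the stars itself while still substituting the argument. An argument containing wildcard characters will be treated as part of the pattern.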
The second use of find is to grep through all files to find a certain string. For example, suppose I want to find if the word "no" exists in any files reachable from . (current directory). Then I'd do:


UNIX> find . -type f -exec grep no {} \;
This string contains no numbers
1   1 10.87 10.87 Tom Kalinowski
There walks a lady we all know
Who shines bright lights and wants to know                 
Why all that glitters is not gold
UNIX>
Great, so it exists. Here's how you find out what file it's in:
UNIX> find . -type f -exec grep no {} /dev/null \;
./greptest:This string contains no numbers
./gstats:1   1 10.87 10.87 Tom Kalinowski
./sth:There walks a lady we all know
./sth:Who shines bright lights and wants to know                   
./sth:Why all that glitters is not gold
UNIX> 
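If all you want is the names of the files, grep -l prints each matching file name once. Ending the -exec clause with + instead of \; hands many file names to a single grep rather than running grep once per file. Both are in the POSIX descriptions of grep and find, but older systems may not have the + form, so treat this as a sketch and check your man page. The files are cut-down copies made in a scratch directory:

```shell
#!/bin/sh
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1
printf '%s\n' 'This string contains no numbers' 'G0 V0LS' > greptest
printf '%s\n' '1   1 10.87 10.87 Tom Kalinowski' '9   9  5.10  5.10 Nick Price' > gstats
printf '%s\n' 'See Saw, Margery Daw' > md

# -l prints just the names of files with at least one match;
# "{} +" passes many file names to a single grep.
# The sort is only there to make the order predictable.
matches=$(find . -type f -exec grep -l no {} + | LC_ALL=C sort)
printf '%s\n' "$matches"
```

This lists ./greptest and ./gstats, and leaves out md, which has no "no" in it.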
You'll find that you end up doing things like this more than you'd imagine, especially if you're kind of forgetful, like I am.

Lastly, suppose that you want to remove all of your core files. Then you do (I put the -i in so that you'll be prompted -- leave it off if you don't want to be prompted):

UNIX> find $HOME -name core -exec rm -i {} \;
rm: remove /mahogany/homes/plank/src/ohhell/core? y
rm: remove /mahogany/homes/plank/src/jgammon/core?  y
...
etc
There's a lot going on here, but this is the simple way to do it. You should read the man page for find to figure out exactly what's going on. Some people find find to be quite confusing, especially the -exec part, where you have to put a backslash before the semicolon. Give it a read. You will also see that find on Solaris systems has an option that lets it traverse soft links. I have tried this and it was really ugly. If you ever try to do this, make SURE you know what is going on first. And it is not portable.
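Since a mistyped find can remove a lot of files very quickly, a habit worth forming is to run the expression with -print first and only then swap in the rm. Here is a sketch in a scratch directory; the directory names echo the ones above, but the files are empty stand-ins:

```shell
#!/bin/sh
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1
mkdir -p src/ohhell src/jgammon
touch src/ohhell/core src/jgammon/core src/jgammon/core.c

# Step 1: look before you leap. -type f skips directories named core.
doomed=$(find . -name core -type f -print | LC_ALL=C sort)
printf '%s\n' "$doomed"

# Step 2: the same expression, with the action changed to rm.
find . -name core -type f -exec rm -f {} \;

left=$(find . -type f -print)
printf '%s\n' "$left"
```

Only the two files named exactly core are removed. core.c survives, because -name core matches the whole file name rather than a prefix.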