Linux/BSD command line wizardry: Learn to think in sed, awk, and grep


Karlston

"Do people really write these long, convoluted commands?" In a word: yes.

As a relatively isolated junior sysadmin, I remember seeing answers on Experts Exchange and later Stack Exchange that baffled me. Authors and commenters might chain 10 commands together with pipes and angle brackets—something I never did in day-to-day system administration. Honestly, I doubted the real-world value of that. Surely, this was just an exercise in e-braggadocio, right?

 

Trying to read the man pages for the utilities most frequently seen in these extended command chains didn't make them seem more approachable, either. For example, the sed man page weighs in at around 1,800 words alone without ever really explaining how regular expressions work or the most common uses of sed itself.

 

If you find yourself in the same boat, grab a beverage and buckle in. Instead of giving you encyclopedic listings of every possible argument and use case for each of these ubiquitous commands, we're going to teach you how to think about them—and how to easily, productively incorporate them in your own daily command-line use.

Redirection 101

Before we can talk about sed, awk, and grep, we need to talk about something a bit more basic—command-line redirection. Again, we're going to keep this very simple:

 

Operator: ;
Function: Process the command on the right after you're done processing the command on the left.
Example: echo one ; echo two

Operator: >
Function: Place the output of the thing on the left in the file named on the right, creating the file if needed and emptying it first if it already exists.
Example: ls /home/me > myfilesonce.txt ; ls /home/me > myfilesonce.txt

Operator: >>
Function: Append the output of the thing on the left to the end of the existing file on the right.
Example: ls /home/me > myfilestwice.txt ; ls /home/me >> myfilestwice.txt

Operator: <
Function: Use the file on the right as the standard input of the command on the left.
Example: cat < sourcefile > targetfile

Operator: |
Function: Pipe the standard output of the thing on the left into the standard input of the thing on the right.
Example: echo "test123" | mail -s "subjectline" emailaddress

 

Understanding these redirection operators is crucial to understanding the kinds of wizardly command lines you're presumably here to learn. They make it possible to treat individual, simple utilities as part of a greater whole.

 

And that last concept—breaking one complex task into several simpler tasks—is equally necessary to learning to think in complex command-line invocations in the first place!
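The operators in the table above can be tried safely in a scratch directory. The following is a minimal sketch, not part of the original article: the file names once.txt and twice.txt and the variable names are made up for illustration.

```shell
# Work in a throwaway directory so nothing important gets overwritten.
scratch=$(mktemp -d)

# ">" creates the file, or empties it first if it already exists,
# so running the same command twice still leaves a single line.
echo one > "$scratch/once.txt"
echo one > "$scratch/once.txt"

# ">>" appends to whatever is already in the file.
echo one > "$scratch/twice.txt"
echo one >> "$scratch/twice.txt"

# "<" feeds a file to a command's standard input; wc -l counts lines.
once_count=$(wc -l < "$scratch/once.txt")
twice_count=$(wc -l < "$scratch/twice.txt")

# "|" feeds one command's output into the next command's input.
piped_count=$(cat "$scratch/once.txt" "$scratch/twice.txt" | wc -l)

echo "once.txt lines: $once_count"
echo "twice.txt lines: $twice_count"
echo "both files through a pipe: $piped_count"

# ";" runs the next command regardless; here it tidies up afterward.
echo "done" ; rm -r "$scratch"
```

The counts should come out as one, two, and three lines respectively, which is the difference between > and >> in a nutshell.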

Grep finds strings

When first learning about tools like grep, I find it helps to think of them as far simpler than they truly are. In that vein, grep is the tool you use to find lines that contain a particular string of text.

 

For example, let's say you're interested in finding which ports the Apache web server has open on your system. Many utilities can accomplish this goal; netstat is one of the older and better-known options. Typically, we'd invoke netstat using the -anp arguments: -a for all sockets, -n for numeric display, and -p for displaying the owning PID of each socket.

 

Unfortunately, this produces a lot of output—frequently, several tens of pages. You could just pipe all that output to a pager, so you can read it one page at a time, with netstat -anp | less. Or, you might instead redirect it to a file to be opened with a text editor: netstat -anp > netstat.txt.

 

But there's a better option. Instead, we can use grep to return only the lines we really want. In this case, what we want to know about is the apache webserver. So:

me@banshee:~$ sudo netstat -anp | head -n5
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 192.168.188.1:53        0.0.0.0:*               LISTEN      5128/dnsmasq        
tcp        0      0 192.168.254.1:53        0.0.0.0:*               LISTEN      5057/dnsmasq        
tcp        0      0 192.168.122.1:53        0.0.0.0:*               LISTEN      4893/dnsmasq        

me@banshee:~$ sudo netstat -anp | wc -l
1694

me@banshee:~$ sudo netstat -anp | grep apache
tcp6       0      0 :::80                   :::*                    LISTEN      4011/apache2  

me@banshee:~$ sudo netstat -anp | head -n2 ; sudo netstat -anp | grep apache
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp6       0      0 :::80                   :::*                    LISTEN      4011/apache2  
 

We introduced some new commands above. The first is head, which passes along only the first n lines of its input and discards the rest. There's also wc, which, with the argument -l, tells you how many lines of text hit its standard input.

 

So we can translate the four commands above into plain English:

 

  1. sudo netstat -anp | head -n5 : "Find all the open network sockets, but limit output to the first five lines."
  2. sudo netstat -anp | wc -l : "Find all the open network sockets, then tell me how many total lines of text you'd have used to tell me."
  3. sudo netstat -anp | grep apache : "Find all the open network sockets, but only show me the results that include the word 'apache.'"
  4. sudo netstat -anp | head -n2 ; sudo netstat -anp | grep apache : "Find all the open network sockets, but only show me the two header lines—then do it again, but only show me the 'apache' results."

 

By thinking of grep as something much simpler than it actually is, we can jump immediately to finding productive ways to use it—and we can chain these simple uses together to easily describe more complex tasks!

 

Once you're comfortable with using grep to find simple strings as seen above, it can do far more complex tasks. These include but are not limited to: case-insensitive use, more complex patterns (including full regular expressions), exclusion (only show me lines that don't include the pattern), and much, much more. But don't worry about that until after you're familiar with simple grep uses. Once you start, it's truly hard to imagine life without grep anymore!
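When you do get there, case-insensitive matching and exclusion are the two options you'll likely reach for first. Here is a small sketch using the standard -i and -v flags; the sample lines are made up for illustration (including the capitalized "Apache2") and stand in for real command output.

```shell
# Made-up sample data standing in for real command output.
sample='tcp 0 0 0.0.0.0:22 LISTEN 812/sshd
tcp6 0 0 :::80 LISTEN 4011/apache2
tcp6 0 0 :::443 LISTEN 4012/Apache2
udp 0 0 0.0.0.0:68 - 702/dhclient'

# Plain grep is case-sensitive: only the lowercase "apache2" line matches.
printf '%s\n' "$sample" | grep apache

# -i ignores case, so the "Apache2" line matches as well.
printf '%s\n' "$sample" | grep -i apache

# -v inverts the match: show only the lines that do NOT contain the pattern.
printf '%s\n' "$sample" | grep -i -v apache
```

The flags combine freely, so grep -i -v apache reads as "show me everything that isn't Apache, however it's capitalized."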

Sed replaces strings

Now that you know how to limit output to matching (or nonmatching) lines, the next step is learning how to change that output on the fly. For this, sed—the Stream EDitor—will be your tool of choice.

 

In order to use sed, you need to understand at least a little about regular expressions (regexes). We are once again going to ignore the vast majority of what regular expressions can do and focus on the most immediately intuitive and useful: simple pattern replacement.

 

Let's say that you want to change all instances of dog to snake in a bunch of text:

me@banshee:~$ echo "I love my dog, dogs are great!"
I love my dog, dogs are great!

me@banshee:~$ echo "I love my dog, dogs are great!" | sed 's/dog/snake/'
I love my snake, dogs are great!

me@banshee:~$ echo "I love my dog, dogs are great!" | sed 's/dog/snake/g'
I love my snake, snakes are great!
 

We can translate these three commands into plain English:

 

  1. say "I love my dog, dogs are great!"
  2. say "I love my dog, dogs are great!" but change the first instance of dog to snake.
  3. say "I love my dog, dogs are great!" but change all instances of dog to snake.

 

Although we're really just working with plain text, sed actually thinks in regular expressions. Let's unpack the sed expression s/dog/snake/g: the s means substitute, dog is the pattern to search for (a regex, albeit a very literal one), snake is the replacement, and the trailing g means to do so globally. Without the g on the end, sed only replaces the first match on each line of text, as we see in command #2.
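The "per line" part is easy to miss with single-line input. A quick sketch with two made-up lines of text shows that sed starts over on every line:

```shell
# Two lines of input, each containing "dog" twice.
text='dog one, dog two
dog three, dog four'

# Without g: only the first "dog" on EACH line is replaced.
first_only=$(printf '%s\n' "$text" | sed 's/dog/snake/')

# With g: every "dog" on every line is replaced.
all=$(printf '%s\n' "$text" | sed 's/dog/snake/g')

printf '%s\n\n%s\n' "$first_only" "$all"
```

In the first result, each line keeps its second "dog"; in the second, none survive.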

 

Alright, now that we understand the simplest possible regular expressions, what might we use sed for on a real-world command line? Let's return to our first example, in which we looked for open network sockets belonging to apache. This time, let's say we want to know which program opened a socket on port 80:

me@banshee:~$ sudo netstat -anp | grep ::80
tcp6       0      0 :::80                   :::*                    LISTEN      4011/apache2  

me@banshee:~$ sudo netstat -anp | grep ::80 | sed 's/.*LISTEN *//'
4011/apache2  
 

In the first command, we look for any line containing the string ::80, which on this machine limits us to the program listening on the standard HTTP port (on a busier system, the same string would also match ports such as 8080). In the second, we do the same thing, but we discard all the information prior to the PID and program name of the process that owns that socket.

 

In regex language, . is a special character that matches any single character, and * is a special character that matches zero or more repetitions of the preceding character. So .* means "match any number of any characters," and " *" (a space followed by an asterisk) means "match any number of spaces."
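To experiment with these two characters without needing sudo or a running web server, you can feed sed a canned copy of the netstat line. This is only a sketch; the variable names are made up, and the second substitution is an extra step beyond what the article does.

```shell
# A canned copy of the netstat line shown earlier.
line='tcp6       0      0 :::80                   :::*                    LISTEN      4011/apache2'

# ".*LISTEN" matches everything up to and including LISTEN, and " *"
# matches the run of spaces after it. Replacing all of that with nothing
# leaves only the PID/program name.
owner=$(printf '%s\n' "$line" | sed 's/.*LISTEN *//')
echo "$owner"

# The same trick strips the PID: ".*" followed by an escaped slash (\/)
# matches everything up to and including the slash.
name=$(printf '%s\n' "$owner" | sed 's/.*\///')
echo "$name"
```

The slash has to be escaped with a backslash here because sed is already using unescaped slashes to separate the pattern from the replacement.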

 

This kind of preliminary processing can make reading a text file full of tons of output much easier later—or it can serve to parse "human friendly" command output down to something that can be passed to another utility as an argument later.

 

Again, there is far, far more to both sed and regular expressions than we see here—but just like grep, I recommend getting comfortable with the most basic use of sed until it feels natural. Wait to go man-page diving until after you're solid on basic use. Only then should you try to slowly, steadily expand your repertoire!

Awk finds columns

Once you get comfortable with sed and grep, you'll start to feel like a superhero, until you realize how hard it is to get only the relevant information out of a single column in the middle of a line. That's where awk comes in. It's worth noting that awk is potentially even more complex (and capable) than either sed or grep. In fact, if you're enough of an awk wizard, you could technically replace both sed and grep in most use cases.

 

That's because awk is actually an entire scripting language, not just a tool—but that's not how we're going to use it, or think of it, as relative newbies. Instead, we're just going to think of awk as a column extractor. Once again, we'll return to our netstat example. What if we want to find out which port Apache is running on?

me@banshee:~$ sudo netstat -anp | grep apache
tcp6       0      0 :::80                   :::*                    LISTEN      4011/apache2    

me@banshee:~$ sudo netstat -anp | grep apache | awk '{print $4}'
:::80

me@banshee:~$ sudo netstat -anp | grep apache | awk '{print $4, $7}'    
:::80 4011/apache2
 

Once again, we'll translate our examples into plain English:

 

  1. Find all open sockets and the programs that own them, but limit output to the ones with the text 'apache' in them.
  2. Find all open sockets and the programs that own them, but limit output to the ones with the text 'apache' in them—and limit that output to the fourth tabular column only.
  3. Find all open sockets and the programs that own them, but limit output to the ones with the text 'apache' in them—and limit that output to the fourth and seventh tabular columns.
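The column extraction described above can be practiced on canned input, with no sudo required. In this sketch the sshd line is made up for illustration; the apache2 line is the one from the netstat output earlier.

```shell
# Canned netstat-style lines; the sshd entry is invented for illustration.
sockets='tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      812/sshd
tcp6       0      0 :::80                   :::*                    LISTEN      4011/apache2'

# Column 4 is the local address; column 7 is the PID/program name.
printf '%s\n' "$sockets" | grep apache | awk '{print $4}'
printf '%s\n' "$sockets" | grep apache | awk '{print $4, $7}'

# Columns can be printed in any order; the comma becomes a single space.
printf '%s\n' "$sockets" | awk '{print $7, $4}'
```

Note that awk doesn't care how many spaces separate the columns in the input, and that the output order is whatever order you list the columns in.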

 

Since awk is an entire language, its syntax may feel slightly tortured. You need to encapsulate its command arguments in single quotes, then in curly brackets, and you have to use the keyword print in addition to your column numbers. But that syntax will feel like second nature before you know it—and the seemingly overcomplex syntax makes it possible later to use awk for more complex tasks, like calculating running sums and even averages:

me@banshee:~$ cat awktext.txt
1 2 3
4 5 6
7 8 9

me@banshee:~$ cat awktext.txt | awk '{SUM+=$2}END{print SUM}'
15

me@banshee:~$ cat awktext.txt | awk '{SUM+=$2}END{print SUM/NR}'
5
 

In the above examples, we add the value of the specified column (the second column, specified with $2) to a variable we name SUM. After adding the value of column 2 in each row to SUM, we can either output SUM directly, or we can divide it by a built-in special variable, NR, which stands for "Number of Records." By default each line of input is one record, so here NR is simply the number of rows.
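If NR feels abstract, it may help to see that in the END block it holds the same count wc -l would report. This sketch recreates the three-line sample inline instead of reading awktext.txt; the variable name rows is made up.

```shell
# The same three rows as the sample file, held in a variable instead.
rows='1 2 3
4 5 6
7 8 9'

# In the END block, NR holds the number of lines read so far: all of them.
printf '%s\n' "$rows" | awk 'END{print NR}'

# Sum and average of column 2, printed together in a single pass.
printf '%s\n' "$rows" | awk '{SUM+=$2} END{print "sum:", SUM, "average:", SUM/NR}'
```

The code inside the first set of curly brackets runs once per line, while the END block runs once after the last line has been read.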

 

And yes, if you're wondering, awk handles decimals fine, as well as varying amounts of whitespace in between columns:

me@banshee:~$ cat awktext.txt
1.0 2.1 3.2 4.3
5.4 6.5 7.6 8.7
9.8 0.9 1.0 2.1

me@banshee:~$ cat awktext.txt | awk '{SUM+=$2}END{print SUM/NR}'
3.16667

me@banshee:~$ cat awktext.txt
1.0     2.1  3.2  4.3
5.4 6.5      7.6       8.7
9.8 0.9 1.0 2.1

me@banshee:~$ cat awktext.txt | awk '{SUM+=$2}END{print SUM/NR}'
3.16667
 

As always, I strongly encourage you to get comfortable with the most basic use of awk—finding data by which column it's in rather than trying to hunt for it by identifying text before and after it—before worrying about its fancier, more complex uses.
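Once each tool feels natural on its own, chaining all three is a small step. The sketch below is one possible combination, not a canonical recipe; the socket lines other than the apache2 one are made up for illustration.

```shell
# Canned netstat-style lines; only the apache2 entry comes from the article.
sockets='tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      812/sshd
tcp        0      0 192.168.0.5:22          192.168.0.9:51234       ESTABLISHED 900/sshd
tcp6       0      0 :::80                   :::*                    LISTEN      4011/apache2'

# 1. grep keeps only the listening sockets.
# 2. awk keeps the local address and PID/program name columns.
# 3. sed strips everything up to the last colon, leaving just the port.
listening=$(printf '%s\n' "$sockets" | grep LISTEN | awk '{print $4, $7}' | sed 's/.*://')
echo "$listening"
```

In plain English: "Show me the listening sockets, but only the address and owner columns, and trim each address down to its port number."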

Conclusion

With any luck, now that you've seen the most common and least complex uses of these three iconic tools, you'll be ready to begin thinking in them. Now that you know, you'll quickly realize their utility comes up time and time again in the real world!

 

Before learning to think in sed, awk, and grep, I generally accomplished the same tasks by hand-editing large volumes of command output in a text editor, then writing simple shell or perl scripts to process the hand-edited output. Learning these tools increased my productivity enormously—and it can increase yours, too.

 

 
