From: L38076@beta.ist.utl.pt (Carlos Jorge G.duarte)
Newsgroups: comp.editors
Subject: do-it-with-sed (long)
Date: 24 Sep 1996 17:18:28 GMT
Organization: Instituto Superior Tecnico
Message-ID: <529554$sc4@ci.ist.utl.pt>

Hi everyone,

this is a little (~50k) document on how to use sed, with some examples
at the end.

Here it is now, after my name

--
Carlos

----

:r! sed -ne '/^-----/{;n;h;n;/^----/{;g;/^.\{72\}$/s/ */ /;p;};}' %

Introduction
Regular expressions
Using sed
Sed resume
Sed commands
Examples
  Squeezing blank lines (like cat -s)
  Centering lines
  Delete comments on C code
  Increment a number
  Get make targets
  Rename to lower case
  Print environ of bash
  Reverse chars of lines
  Reverse lines of files
  Transform text into a C "printf"able string
  Prefix non blank lines with their numbers (cat -b)
  Prefix lines by their number (cat -n)
  Count chars of input (wc -c)
  Count lines of input (wc -l)
  Count words of input (wc -w)
  Print the filename component of a path (basename)
  Print directory component of a path (dirname)
  Print the first few (=10) lines of input
  Convert a sed script to a bash-command-line command
  Print last few (=10) lines of input
  The tee(1) command in sed
  Print uniq lines of input (uniq)
  Print duplicated lines of input (uniq -d)
  Print only non-duplicated lines (uniq -u)
Index of sed commands
Author and credits and date etc...
========================================================================

------------
Introduction
------------

This is a little document to help people use sed. It is not very fancy,
but better than nothing :-)

There are several uses for sed, some of them totally exotic.

Most of the scripts that appear throughout the text are useless, as
there are (Unix) utilities that do the same job (and more) faster and
better. They are intended to show real examples of sed, and also to
show the power of sed, as well as its weaknesses.

========================================================================

-------------------
Regular expressions
-------------------

To know how to use sed, people should know regular expressions (RE for
short). This is a brief summary of the regular expressions used in sed.

c        a single char, if not special, is matched against the text
*        matches a sequence of zero or more of the previous char,
         grouped RE, or class
\+       as *, but matches one or more
\?       as *, but matches only zero or one
\{i\}    as *, but matches exactly i sequences (i is a number between
         0 and some limit -- in Henry Spencer's regexp(3) library this
         limit is 255)
\{i,j\}  matches i through j, inclusive, sequences
\{i,\}   matches i or more sequences
\{,j\}   matches at most j sequences
\(RE\)   groups RE as a whole; this is used to:
         - apply postfix operators: `\(abcd\)*' searches for zero or
           more whole sequences of "abcd", whereas `abcd*' searches
           for "abc" followed by zero or more "d"s
         - use back references (see below)
.        matches any character
^        matches the null string at the beginning of the line, i.e.
         what follows the ^ must appear at the beginning of the line:
         `^#include' matches only lines where "#include" is the first
         thing on the line; if there are one or two spaces before it,
         the match fails
$        the same as ^, but refers to the end of the line
\c       matches the character `c' -- used to match the special chars
         referred to above (and some more below)
[list]   matches any single char in list: `[aeiou]' matches any vowel
[^list]  matches any single char NOT in list

         a list may contain ranges of the form c1-c2, meaning all
         chars between c1 and c2 (inclusive)
         to include `]' in the list, make it the first char
         to include `-' in the list, make it the first or last char
RE1\|RE2 matches either RE1 or RE2
\1 \2 \3 \4 \5 \6 \7 \8 \9
         \i matches the text matched by the i-th \(\) group of the RE;
         this is called a back reference, and it is usually (very) slow

Notes:
------

  - some implementations of sed may not have all the REs mentioned,
    notably `\+', `\?' and `\|'
  - the RE is greedy: of all possible matches, the one that starts
    first in the text is selected; if two or more matches start at
    that same place, the longest one is selected

Examples:
---------

`abcdef'        matches "abcdef"
`a*b'           matches zero or more "a"s followed by a single "b",
                like "b" or "aaaaaab"
`a\?b'          matches "b" or "ab"
`a\+b\+'        matches one or more "a"s followed by one or more "b"s;
                the minimum match is "ab", but "aaaab" or "abbbbb" or
                "aaaaaabbbbbbb" also match
`.*'            all chars on the line, for all lines (including empty
                ones)
`.\+'           all chars on the line, but only on lines containing at
                least one char (i.e. empty lines are not matched)
`^main.*(.*)'   searches for a line with "main" as the first thing on
                the line; that line must also contain an opening and a
                closing parenthesis, with any number of chars
                (including none) before the opening one and between
                the two
`^#'            all lines beginning with "#" (shell and make comments)
`\\$'           all lines ending with a `\' (there are two of them to
                escape the `\') -- line continuation in C, make, shell,
                etc...
`[a-zA-Z_]'     any letter or underscore
`[^ ]\+'        (the brackets hold a tab and a space) -- a sequence of
                one or more chars that are neither a space nor a tab;
                usually this means a word
`^.*A.*$'       matches a whole line that contains an "A" anywhere,
                with any number of chars before and after it
`A.\{9\}$'      matches an "A" that is exactly the tenth character
                counting from the end of the line
`^.\{,15\}A'    matches the last "A" within the first 16 chars of the
                line

========================================================================

---------
Using sed
---------

The usual format of sed is:

  sed [-e script] [-f script-file] [-n] [files...]

files...   the files to read; if a "-" appears, read from stdin; if no
           files are given, also read from stdin
-n         by default, sed writes each line to stdout when it reaches
           the end of the script (whatever is on the line by then);
           this option prevents that, i.e. there is no output unless a
           command specifically orders sed to produce it (like p)
-e         an in-line script, i.e. a script for sed to execute, given
           on the command line; multiple command line scripts can be
           given, each with its own -e; in fact, -e is only needed
           when more than one script is present (specified by a
           previous -e or -f option)
-f         read the script from the specified file; several -f options
           can appear

  - Scripts are concatenated as they appear, forming one big script.
  - That script is compiled into a sed program.
  - That program is then applied to each line of the given files (the
    script itself can change this behavior).
  - The results are always written to stdout, although some commands
    can send stuff to specific files.
  - Input files are seen as one by sed, i.e.

      sed -n '$=' *

    gives the number of lines of ALL *, something like `cat * | wc -l'

I usually use (sorry for the pleonasm!)
sed in the following ways:

---- in shell scripts, invoking sed like this

  #!/bin/sh
  sed [-n] '
  whole script
  '

---- as an executable itself, like

  #!/usr/bin/sed -f

     or

  #!/usr/bin/sed -nf

---- on the command line, as part of a shell script, or in an alias
     (tcsh), or in a function (bash)

For the command line there are two things to know. There is no need to
use one -e for each command, although that can be done. Commands may be
separated by semi-colons `;', with some exceptions.

Example:

  sed '/^#/d;/^$/d;:b;/\\$/{;N;s/\\\n//;bb;}'

this would

  /^#/d         delete all lines beginning with `#' (comments?)
  /^$/d         delete all empty lines (/./!d could be used instead)
  :b
  /\\$/{
  N
  s/\\\n//
  bb
  }             join all lines ending with `\', after deleting the `\'
                itself

The format of this explained script (leaving out the descriptions
themselves) could be used in a script file, but it can also be given to
sed on one line, without using lots of '-e's.

There are, though, exceptions to this `;' ending rule: the direct text
handling and the read/write commands.

There are functions that handle user text directly (insert, append,
change). The format of that text is

  command\
  first line\
  second line\
  ...\
  last line

with no ending \ on the last line.

Example of a sed script file:

  /#include /{
  i\
  #ifdef SYSV
  a\
  #else\
  #include \
  #endif
  }

That would search for lines `#include ' and then would write

  #ifdef SYSV
  #include 
  #else
  #include 
  #endif

Now, to write the same script on one line, the -e mechanism is
needed... What follows each -e can be considered as one input line from
a sed script file, so nothing keeps us from doing

  sed -e '/#include /{' \
      -e 'i\' \
      -e '#ifdef SYSV' \
      -e 'a\' \
      -e '#else\' \
      -e '#include \' \
      -e '#endif' \
      -e '}'

on the command line. Of course the trailing `\'s could be omitted if we
wrote all of this on one line, thus getting a fast edit-and-test cycle.
And of course, lines that don't need to be alone can be joined with the
`;' mechanism...
rewriting the above, we could get something like:

  sed -e '/#include /{;i\' -e '#ifdef SYSV' -e 'a\' -e '#else\' \
      -e '#include \' -e '#endif' -e '}'

NOTE that this fancy work on the shell command line can be a real pain
due to the quoting mechanisms of the shells. For [ba]sh the above
should be fine, but for [t]csh, for instance, the '...\' would quote
the ' and mess everything up.

--

Generally speaking, we can put the above in the following manner:

  1. sed commands are usually on one line
  2. if we want more (multi-line commands), then we must end the first
     line with a `\' -- this is not the same as the classic trailing
     `\' in C or make, etc... this one says: "Hey sed! This command
     has more than one line.", whereas the one in C, make, etc. says:
     "Hey make, (g)cc, etc... this line is so huge that I wrote its
     continuation on the next line!"
  3. if a command takes one line only, it can be separated from the
     next one by a `;'
  4. if it is a multi-line command, then each of its lines (except the
     first) must be on a line by itself

     ...and...

  5. on the command line, what follows a `-e' is like a whole line of
     a sed script

--

The insert etc. commands deal with text, so obviously they are
multi-line commands by default, i.e. at least two lines: one for the
command, and another for the text (which can be empty). But any other
command is a potential multi-liner.

The read/write commands are exceptions: they need a whole (last) line
for themselves, i.e. after the `r' or `w' the rest of the line is
treated as a filename. So after these, nothing more can appear on the
line (but before them it can).

========================================================================

----------
Sed resume
----------

Input
-----

Sed input consists of files (stdin by default), which are seen as a
whole. For instance,

  sed -f some_script /etc/passwd /etc/passwd

is exactly the same as

  ( cat /etc/passwd; cat /etc/passwd ) | sed -f some_script

or

  cat /etc/passwd > foo
  cat /etc/passwd >> foo
  cat foo | sed -f some_script

or yet

  sed -f some_script foo

i.e.
lines from the files are read, but no information is kept about which
file they came from.

Description
-----------

Sed reads lines from its input and applies some actions (or commands,
or functions -- a matter of choice) to them.

By default, the print command is applied before the next line is read.

So

  sed '' /etc/passwd

will be like

  cat /etc/passwd

i.e. each line of /etc/passwd is written after being read.

An equivalent form is

  sed -n 'p' /etc/passwd

The general format of an action/function/command is

  [first_address][,second_address]function [arguments] [\]

first_address
        specifies that function should be executed only on the lines
        at that address (more on addresses below)
        by default, function is executed on ALL lines
first_address,second_address
        when second_address is specified, the first must also exist,
        and the format is as above
        function is applied to all lines that match the formed range
        (including the bounds)
function
        see the list of them below
arguments
        are particular to each function; some functions don't even
        take arguments
\       a sed function is a one-line function, but there are some
        exceptions -- in that case, a `\' must be at the end of the
        line to tell sed that the specified function is made up of
        more than one line
        note that this is not the classical `\' that we are used to
        seeing in C, make, sh, etc... this is not continuation on the
        next line -- a sed command is read until a line which does not
        end in a `\' is found; usually the line that contains the
        command satisfies this, but if a command extends itself across
        lines, all of them, except the last, must end in `\' (more
        about this on the i(nsert), a(ppend), c(hange) and
        s(ubstitute) commands)

Applying commands
-----------------

The commands are gathered into one big command buffer. They are fetched
in the order they appear in the script input, whether they come from
the command line or from files. All leading space is ignored (more
about this on i(nsert), and company).
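
The points made so far (default printing, the effect of -n, and scripts
being fetched in the order they appear) can be tried out with a small
sketch like the one below. The sample text and the variable names are
made up for illustration only.

```shell
#!/bin/sh
# Sketch: default printing, -n, and concatenation of -e scripts.
# The two-line sample input is an assumption made for this example.
input='one
two'

# An empty script changes nothing: each line is printed once (like cat).
same=$(printf '%s\n' "$input" | sed '')

# With -n nothing is printed unless a command asks for it; p asks for it.
same_n=$(printf '%s\n' "$input" | sed -n 'p')

# Without -n, p prints each line in addition to the default print.
twice=$(printf '%s\n' "$input" | sed 'p')

# Two -e scripts are concatenated in the order given, so the second
# substitution sees the result of the first one.
chained=$(printf '%s\n' "$input" | sed -e 's/one/two/' -e 's/two/three/')

printf '%s\n' "$same" "$same_n" "$twice" "$chained"
```

Swapping the two -e options in the last command gives a different
result, which is one way to see that the order of the scripts matters.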

Then, the big command buffer is compiled into a sed program. This sed
program runs very fast (it is a kind of byte code), which is why sed is
a fast and convenient program.

Each command of the program is applied to the current line, unless
something prevents it (like an address that does not match the current
line).

Commands are applied one by one, sequentially, and the (possible)
transformations made to the line are in effect before the next command
is executed.

This sequential order can be changed with some commands (more on this
below -- b(ranch) and t(est)).

Pattern space
-------------

Well, I have been referring to the input of each sed command as
"lines". Actually this is not correct, because a sed command can be
applied to more than one line, or even to some parts of several lines.

The input of each sed command is called the "pattern space". Usually
the pattern space is the current line (that is also the default), but
this behavior can obviously be changed with sed commands (N, n, x, g
and G).

Addresses
---------

There are two kinds of addresses: line addrs and context addrs.

Each line read is counted, and one can use this information to select
absolutely which lines commands should be applied to.

For instance:

30=     will write "30" if there are at least 30 lines of input,
        because the `=' command (print current line number) is only
        executed on line 30
30,60=  will write "30", "31"... "60" under the same conditions as
        above, i.e. the input must contain at least N lines for the
        number N to be written
$=      will write the number of the last line, a kind of `wc -l'

So, summarizing:

1       first line
2       second line
...
$       last line
i,j     from the i-th to the j-th line, inclusive; j can be $

The second kind of addresses are the context, or RE, ones. They are a
regular expression, and commands will be executed on all pattern spaces
matched by that RE.
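
Both kinds of addresses can be tried out with a small sketch like this
one. The helper function `numbers' is hypothetical; it only generates
eleven numbered lines to have something to select from.

```shell
#!/bin/sh
# Sketch: line addresses versus context (RE) addresses.
# numbers() is a made-up helper that writes the lines 10, 11, ... 20.
numbers() {
    printf '%s\n' 10 11 12 13 14 15 16 17 18 19 20
}

# Line address: = writes the current line number, here only on line 3.
third=$(numbers | sed -n '3=')

# Range of line addresses: print input lines 2 through 4.
range=$(numbers | sed -n '2,4p')

# $ is the last line, so this counts the input lines (like wc -l).
count=$(numbers | sed -n '$=')

# Context address: print only the lines matched by the RE /5$/.
context=$(numbers | sed -n '/5$/p')

printf '%s\n' "$third" "$range" "$count" "$context"
```

Note that the line address selects by position in the input, while the
context address selects by what the line contains.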

Examples:

/.\{73,\}/d     will delete all lines that have more than 72 characters
/^$/d           will delete all empty lines
/^$/,/^$/d      will delete from the first empty line seen to the next
                empty one, eating everything that appears in the middle
                (not very useful)

The context addresses can be mixed with line addrs, so:

1,/^$/d         will delete everything from line 1 up to and including
                the first empty line after it

Summary:
--------

  - commands may take 0, 1 or 2 addrs
  - if no addr is given, the command is applied to all pattern spaces
  - if 1 addr is given, the command is applied to all pattern spaces
    that match that addr
  - if 2 addrs are given, the command is applied to all pattern spaces
    between the pattern space that matched the first addr and the next
    pattern space matched by the second addr
    if pattern spaces are single lines all the time, this can be put
    as: if 2 addrs are given, the command is executed on all lines
    between the first addr and the second (inclusive)
    if the second addr is a RE, the search for it starts only on the
    line after the one that matched the first addr; that's why things
    like /foo/,/foo/ work!

========================================================================

------------
Sed commands
------------

The following description is arranged in this way:

(arg-number)command -- mnemonic, short description
  full description

where arg-number is the maximum number of addresses the command
accepts.

At the end of the file (after the examples) there is an index of all
commands, sorted by name (i.e. letter), with the short description and
mnemonic.

Line oriented commands
----------------------

(2)d -- d(elete), delete lines
  - delete (i.e. don't write) the specified lines
  - execution restarts at the beginning of the script, on the next
    line of input
    this is somewhat like
      s/.*//
      b
    except that `d' writes nothing at all, not even an empty line

(2)n -- n(ext), next line
  - jumps to next line, i.e.
    the pattern space is replaced with the contents of the next line
    (the old pattern space is written first, unless -n was given)
  - execution continues with the command following the `n' command

Text commands
-------------

(1)a\ -- a(ppend), append lines
  - adds text after the specified line (if no addr is given, the text
    is added after EACH line of input that executes this, of course)
  - the text can have any number of lines; the general format is
      a\
      1st line\
      2nd\
      ...\
      last line
      `next command'
  - suppose that we have
      sed -e '$a\' -e 'the end'
    then a single line containing "the end" is appended to the file
    if we add -e 's/.*//' as the first cmd, then the only thing we see
    on output is "the end", after a bunch of blank lines, i.e. the
    text is written after the line has been processed, but this
    doesn't mean that the line itself will be written; usually that is
    what happens, but nothing imposes it

(1)i\ -- (i)nsert, insert lines
  - works like the append command, but the text is written before the
    specified line

(2)c\ -- (c)hange, change lines
  - this deletes the current pattern space and replaces it by the text
  - this is roughly the same as insert then delete, or append then
    delete, or yet
      s/.*/text/
      b

note : sed doesn't honor leading spaces, so text beginning with spaces
       will have them removed
       to avoid this behavior, a `\' can be placed before the first
       space that one wants to see written; that way the space is
       conveniently escaped and is treated like a normal char
       GNU sed (as of version 2.05) doesn't follow this
       ignoring-leading-space procedure
note2: the text is not processed by the sed program, i.e. we
       insert/change/append raw text directly into the output

Substitution
------------

This command is used so much that it deserves a whole section!

(2)s/RE/replacement/[flags] -- (s)ubstitute, substitute
  - on the specified lines, the text matched by RE, if any, is
    replaced by replacement
  - if a replacement is done, the flag that lets the `test' command
    act is set (more about this on the `t' command)
  - the `/' separator could in fact be ANY character; usually it is
    `/' because almost every program with regular expression abilities
    uses it; notable exceptions are grep and lex, which don't use any
    char as a delimiter
  - replacement is raw text; the only exceptions are:
      &   is replaced by all the text matched by RE
          that being so,
            s/RE/&/
          is a null op, whatever RE is, except as far as the test flag
          is concerned
      \d  where `d' is a digit (see below for more), is replaced by
          the text matched by the d-th grouped \(\) sub-RE
          some implementations of sed (more precisely, some
          implementations of the regex(3) library that some
          implementations of sed use) limit `d' to a single digit
          (1-9); others, like gnu sed (2.05 at least), accept any
          valid number
          gnu sed also accepts and understands `\0' as `&', i.e. the
          whole matched RE. I don't know if this behavior is standard
          if there is no d-th grouped \(\), then \d is replaced by the
          null string (other implementations, current GNU sed among
          them, reject the script instead)
      \c  where `c' is any char except a digit, quotes `c'
    note that besides the above, _every_ other text is raw, so `\n' or
    `\t' don't work as one might expect; to insert a newline, for
    instance, one must do
      s/foo/bar-on-this-line\
      foo-on-next/
  - flags are optional, and several can be given
      g       replace all occurrences of RE by replacement (the
              default is to replace only the first)
      p       write the pattern space, but only if the substitution
              was successful
      w file  works like the `p' flag, but the pattern space is
              written to file
      d       where `d' is a digit: replace the d-th occurrence, if
              any, of RE by replacement

Output and files
----------------

(2)p -- (p)rint, print
  - writes the specified lines to output

(2)l -- (l)ist, list
  - this works more or less like vi's :list, i.e.
    it prints the specified lines, but shows some special characters
    in \c format, like \n and \t
  - useful to debug sed scripts :-)
  note: the list command is present in gnu sed 2.05 (actually, the
        only reason I know about its existence is that I read the GNU
        sed source); it is not a GNU extension, though, as POSIX sed
        specifies it too

(2)w file -- w(rite), write to file
  - writes the specified lines to file

(1)r file -- (r)ead, read the contents of file
  - inserts the contents of file after the specified line
  - there is no way of adding the contents of file before the first
    line, but if someone wants that, then include file before the
    other input
  - if file can not be opened, sed goes on as if the command didn't
    exist, i.e. it fails silently

Multiple lines
--------------

(2)N -- (N)ext, (add) next line
  - the next line of input is added to the current pattern space, and
    a `\n' gets embedded in the pattern space between the two

(2)D -- (D)elete, delete first part of the pattern space
  - deletes everything up to (and including) the first newline and
    then jumps to the beginning of the script; POSIX and GNU sed do
    this without reading a new line, working on what is left in the
    pattern space
  - if just one line is being edited, then `D' is the same as `d'

(2)P -- (P)rint, print first part of the pattern space
  - writes everything up to (and including) the first newline
  - if the pattern space is a single line, then `P' is the same as `p'

Hold buffer
-----------

Sed contains one buffer, where it can keep temporary stuff to work on
later.

(2)h -- (h)old, hold pattern space
  - copies the current pattern space to the hold buffer, overwriting
    whatever was in it

(2)H -- (H)old, hold pattern space -- append
  - adds a newline and the current pattern space to the _end_ of the
    hold buffer (the newline is added even if the hold buffer is
    empty, so this is not quite like `h' in that case)

(2)g -- (g)et, get contents of hold area
  - copies the contents of the hold buffer to the current pattern
    space
  - the previous contents of the pattern space are lost

(2)G -- (G)et, get contents of hold area -- append
  - adds a newline and the contents of the hold buffer to the _end_ of
    the current pattern space

(2)x -- e(x)change, exchange
  - exchanges the current pattern space with the hold buffer

Control flow
------------

(2)!
-- Don't
  - negates the address specification of the next command
  - note that if we omit the address, we mean ALL lines, and the
    negation of all is nothing, i.e.
      sed '!s/foo/bar/'
    is as good as nothing, whereas
      sed '/./!d'
    has a different meaning: it deletes all empty lines. Why? Because
    `/./' matches any line with at least one char, therefore `/./!'
    selects the lines with no char at all
  - this can be applied to negate 0, 1 or 2 addresses; negating 0
    doesn't make much sense (as indicated above), but negating 1 or 2
    addresses proves to be highly useful: sometimes it is easier to
    construct a RE that does not match what we want than the other way
    around

(2){ -- {} as in C or sh(1), Grouping
  - groups a set of commands, which are executed on the specified
    lines
  - the first command of the group may appear right after the `{'
    (i.e. on the same line) -- usually it is kept on the next line
  - the closing `}' must appear on a line by itself
  - `{...}' can be nested

  addr1,addr2{
  cmds...
  }

  can be replaced by

  addr1,addr2 first_grouped_cmd
  addr1,addr2 second_grouped_cmd
  ...
  addr1,addr2 last_grouped_cmd

(0):