How Grep Prevents a Feedback Loop

The blog post grep your way to freedom considers "what happens when you grep a file and append the contents to the same file".

Before moving on to investigate BSD grep, the author observes that GNU grep will stop if it detects that a grep process starts grepping its own output.

Indeed, if I try:

touch test.txt
grep something test.txt > test.txt

I get:

grep: input file ‘test.txt’ is also the output

I was once caught out by a grep feedback loop with a version which obviously didn't have this check. A script I wrote to parse some test results overnight caught its own output in the glob. The result was a 1 TB output file, a temporarily unusable development machine and some displeased colleagues.

Because of this mishap I was interested in understanding how the developers of GNU grep now prevent this.

The Source

I grabbed the source code for grep 3.1 (the version installed on my system) and began grepping.

Finding the check responsible for the loop detection was painless thanks to the printed warning text. The check can be seen below along with the verbose comment that precedes it.

/* If there is a regular file on stdout and
   the current file refers to the same i-node,
   we have to report the problem and skip it.
   Otherwise when matching lines from some
   other input reach the disk before we open
   this file, we can end up reading and
   matching those lines and appending them to
   the file from which we're reading.  Then
   we'd have what appears to be an infinite
   loop that'd terminate only upon filling the
   output file system or reaching a quota.
   However, there is no risk of an infinite
   loop if grep is generating no output, i.e.,
   with --silent, --quiet, -q.

   Similarly, with any of these:
     --max-count=N (-m) (for N >= 2)
     --files-with-matches (-l)
     --files-without-match (-L)

   there is no risk of trouble.

   For --max-count=1, grep stops after
   printing the first match, so there is no
   risk of malfunction.  But even
   --max-count=2, with input==output, while
   there is no risk of infloop, there is a
   race condition that could result in
   "alternate" output.  */

   if (!out_quiet && 
       list_files == LISTFILES_NONE &&
       1 < max_count && 
       S_ISREG (st.st_mode) && 
       SAME_INODE (st, out_stat))
     {
       if (! suppress_errors)
       {
         error (0, 0, 
                _("input file %s"
                  "is also the output"),
                quote (input_filename ()));
       }
       errseen = true;
       goto closeout;
     }

In the if statement this condition is the most important:

SAME_INODE (st, out_stat)) 

which expands to the following:

#define SAME_INODE(a, b) \
((a).st_ino == (b).st_ino \
 && (a).st_dev == (b).st_dev)

In this scope st and out_stat are structures of type struct stat. They are passed to the fstat() function and populated for the file descriptors of the file to be searched (st) and STDOUT_FILENO (st_out).

With these (sizeable) breadcrumbs and some manpage searching it was not too difficult to determine how grep detects when there is a risk of a feedback loop.

The Answer

When a C program runs there are three standard input/output streams that are available via file descriptors. They are:

  • the stdin stream with file descriptor value STDIN_FILENO (0)
  • the stdoutstream with file descriptor value STDOUT_FILENO (1)
  • the stderr stream with file descriptor value STDERR_FILENO (2)

Typically these steams all map to the terminal you are running your program in, be it a tty, pty, etc. However this does not always have to be the case.

When using the Bash shell for instance, if you perform a redirection such as ./program 1> stdout.txt, Bash will prepare the correct file descriptors for you prior to executing your program. If you are interested, there is a summary of how it does this here.

Each file in Linux has an inode (man 7 inode) associated with it. This is a collection of metadata about the file, such as the user id of the file's owner and last access time.

This inode information can be obtained from a given file descriptor with the fstat() system call (see man 2 fstat) which will populate a structure of type struct stat.

Importantly for us, the inode / struct stat also contains:

  • an inode number, which is a unqiue identifier to an inode in a filesystem. This is held in stuct stat.st_ino.
  • the device containing the file. This can be found in struct stat.st_dev.

The (stat.st_ino,stat.st_dev) pair will uniquely identify a file across all available file systems.

To prevent feedback loops Grep compares the inode information of the STDOUT_FILENO file descriptor with the inode information of the file to be searched. If they are the same (and certain other conditions are met) it prints an error message and exits.

Example Code

The following code will print to stderr the device id and inode number for the STDIN_FILENO, STDOUT_FILENO and STDERR_FILENO file descriptors. Additionally if an optional file is provided as a command line argument it's device id and inode number will also be printed.

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdbool.h>

#define VERBOSE                 false
#define PRINT(FMT, ...)         \
    fprintf(stderr, FMT "\n", __VA_ARGS__)
#define PRINT_STAT(STAT, FMT)   \
     PRINT(#STAT ": " FMT, buf.STAT)

static int print_fstat(int fd, char *fdname)
{
    int rtn = 0; 
    struct stat buf;

    PRINT("==== name: %s fd: %d ====",
          fdname, fd);

    memset(&buf, 0, sizeof(buf));

    if ((rtn = fstat(fd, &buf)))
    {
        return rtn;
    }

    PRINT_STAT(st_dev,      "%lu");
    PRINT_STAT(st_ino,      "%lu");

    if (!VERBOSE)
    {
        return rtn;
    }

    PRINT_STAT(st_mode,     "0x%x");
    PRINT_STAT(st_nlink,    "%lu");
    PRINT_STAT(st_uid,      "%u");
    PRINT_STAT(st_gid,      "%u");
    PRINT_STAT(st_rdev,     "%lu");
    PRINT_STAT(st_size,     "%ld");
    PRINT_STAT(st_blksize,  "%ld");
    PRINT_STAT(st_blocks,   "%ld");

    return rtn;
}

int main(int argc, char **argv)
{
    int user_file = -1;

    print_fstat(STDIN_FILENO, "STDIN_FILENO");
    print_fstat(STDOUT_FILENO, "STDOUT_FILENO");
    print_fstat(STDERR_FILENO, "STDERR_FILENO");

    if (argc != 2)
    {
        return 0;
    }

    user_file = open(argv[1], 
                     O_RDONLY | O_CREAT,
                     S_IRUSR | S_IWUSR);

    if (user_file < 0)
    {
        PRINT("Open of (%s) failed", argv[1]);
        return user_file;
    }

    print_fstat(user_file, argv[1]);

    return close(user_file);
}

The following is an example of invoking the above program with the current terminal as the optional file to inspect:

./redirect_test.exe $(tty)

The output of this can be seen below. Note that the (st_dev, st_ino) combination for stdin, stdout, stderr and the current terminal are all identical.

==== name: STDIN_FILENO fd: 0 ====
st_dev: 21
st_ino: 10
==== name: STDOUT_FILENO fd: 1 ====
st_dev: 21
st_ino: 10
==== name: STDERR_FILENO fd: 2 ====
st_dev: 21
st_ino: 10
==== name: /dev/pts/7 fd: 3 ====
st_dev: 21
st_ino: 10

The following command redirects stdout to test.txt, with test.txt also being the optional file to inspect:

./redirect_test.exe test.txt  > test.txt

The output below shows that stdin and stderr reference the same file (the terminal), while stdout and test.txt both reference a second file (test.txt).

==== name: STDIN_FILENO fd: 0 ====
st_dev: 21
st_ino: 10
==== name: STDOUT_FILENO fd: 1 ====
st_dev: 51
st_ino: 8656017
==== name: STDERR_FILENO fd: 2 ====
st_dev: 21
st_ino: 10
==== name: test.txt fd: 3 ====
st_dev: 51
st_ino: 8656017