Saturday, May 4, 2013

[SED]: Remove repeated/duplicate words from a file in Linux

In this post we will see how to delete repeated words. When we write fast, we tend to repeat words, and and when we review the text later we find the same word twice, side by side. Notice that I wrote "and" two times in the previous sentence. This happens because the mind runs ahead of the word we are actually writing. It is hard to spot duplicate words by reading through a big file, and skimming makes it easy to miss some of them. A better approach is to use a tool such as sed, Perl or Python together with regular expressions.

I have a file abc.txt with the following data.

cat abc.txt
Output:

This is is how it works buddy
What else else you want

Remove the repeated words with sed as shown below.

sed -ri 's/(.*\ )\1/\1/g'  abc.txt

cat abc.txt

Output:

This is how it works buddy
What else you want
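
The command above rewrites abc.txt in place, so it is worth trying it on a scratch copy first. The following is a minimal sketch, assuming GNU sed; the temporary directory and the .bak suffix are only illustrative choices. It first previews the result without -i, then edits in place while keeping a backup of the original.

```shell
#!/bin/sh
# Sketch only: work in a temporary directory so no real file is touched.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '%s\n' 'This is is how it works buddy' 'What else else you want' > abc.txt

# Preview: without -i, sed prints the result and leaves abc.txt unchanged.
sed -r 's/(.*\ )\1/\1/g' abc.txt

# In-place edit that keeps the original as abc.txt.bak (GNU sed suffix form).
sed -r -i.bak 's/(.*\ )\1/\1/g' abc.txt

# Show what changed; diff exits non-zero when the files differ, hence "|| true".
diff abc.txt.bak abc.txt || true
```

If the diff looks wrong, the original can be restored by copying abc.txt.bak back over abc.txt.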

Let me explain the sed command we used.

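-r enables Extended Regular Expressions, which allow grouping with plain ( ) parentheses.
-i edits the original file in place. Be careful with this option, as you cannot get your original file back once it is modified. GNU sed also accepts a suffix, for example -i.bak, to keep a backup copy.
(.*\ ) matches any run of characters that ends with a space, and \1 requires the same text to appear again immediately after it. This concept is called a back reference: \1 holds whatever was matched by the first ( ) group. The doubled text matched by (.*\ )\1 is then replaced by a single copy, \1. Note that the pattern knows nothing about word boundaries. In "this is what", the "is " at the end of "this " followed by the word "is " counts as a repeat, so the word "is" is deleted. It also requires a space after the second copy, so a duplicate at the very end of a line is not removed.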
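
One way to restrict the match to whole words, so that the tail of "this" is not treated as a repeat of "is", is to anchor the group with word boundaries. The following is a minimal sketch, not the command from this post: it assumes GNU sed, whose \b escape is a GNU extension, and dedupe_words is a hypothetical helper name chosen for illustration.

```shell
#!/bin/sh
# Sketch only: assumes GNU sed, which supports \b (word boundary) in regexes.
# dedupe_words is a hypothetical helper name, not part of sed.
dedupe_words() {
    # \b([[:alnum:]_]+)  capture one whole word
    # ( +\1\b)+          followed by one or more repeats of that same word
    # The whole run is replaced with a single copy of the word.
    sed -E 's/\b([[:alnum:]_]+)( +\1\b)+/\1/g'
}

printf '%s\n' 'this is what goes to to paris' | dedupe_words
# -> this is what goes to paris

printf '%s\n' 'This is is how it works buddy' | dedupe_words
# -> This is how it works buddy
```

Because the second copy is closed by \b rather than by a space, this variant should also catch a duplicate at the end of a line. It reads from standard input and does not use -i, so the original file is left untouched unless the output is redirected.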

3 comments:

  1. There is a problem with this code. It makes the mistake of deleting parts of words, i.e. "this is" gets affected, not just "is is". How would you change the code so that doesn't occur, so that it only affects words with blank spaces around them?

  2. Hi Vijay,

    The code below should work for you.

    sed -ri 's/\ (.*\ )\ \1/\ \1/g' abc.txt

  3. Thanks for the quick reply, but it doesn't appear to do anything. I checked this phrase: this is what goes to to paris

    The first code gets rid of both the "is" and one "to"

    The second code does nothing.
