How to quickly compare lists of words using the Terminal


Mac Tip #555, 10 October 2012

Checking two lists of words to see if they’re the same or how they differ is massively tedious if you do it manually. With a little know-how you can get your Mac to do the chore in seconds. Here’s how to make it happen.

Actually I’m being overly generous with the word ‘seconds’ there. When I used this technique on files with hundreds of email addresses it was finished before I could even look away.

I help look after the members of a website, answering their queries about subscriptions, newsletters, and website access.

The website I do this work for is making some big changes behind the scenes, and I needed to compare lists of email addresses gathered from PayPal, the website itself, and the newsletter service.

Note: for the purposes of the screenshots in this post I’ve created a list of dummy email addresses. They are intended to be fictitious. Would you like to try this at home? Here are my test files (3KB zipped file) to save you some work.

Comparing files is easy

You’d think such a comparison would be easy. Take a list of email addresses, sort it into alphabetical order, and compare it with another sorted list.

That kind of comparison is extremely simple with a good text editor such as BBEdit. Make 2 files, sort them alphabetically, and choose Compare Two Front Documents from the Search menu.

Compare Two Front Documents. These two files should be sorted first.


That works nicely, but BBEdit simply goes through and highlights lines that aren’t the same. What I needed was a table showing me which addresses were in both files, which appeared in only one file, and which file that was.

For example, had Jo Bloggs paid a sub but failed to register on the website? Had Chris Smith registered on the website but missed being signed up for the newsletter? It wasn’t enough for me just to know who wasn’t on both lists. I had to know how the lists differed.

The name problem

An additional problem lies in how people sign up for these three separate services: payment, website and newsletter. At the extreme, a member may use a ‘name’ in PayPal of Acme Inc. connected to an email address of dorothy@myisp.com.au, an address on the website of happydays@anotherexample.net with a username of greenfrog, and a mailing list address of justanotherday@myexample.edu.

Later that person may change their email address in any or all of those 3 locations too.

That is a perfectly legitimate thing to do, but it makes it very hard for us to see all those different pieces of information as belonging to one single person. We need to be able to connect a website sign up and a mailing list sign up with the person who paid for the service.

While we store that information in a Numbers spreadsheet, we needed to download fresh copies of the lists to be sure we had the most up-to-date information, and then match that data to our records too.

The Unix comm command compares files

After a few moments with Google I discovered the incredibly useful UNIX comm command. It produces a superb table that shows three columns of data:

Unique in File 1      Unique in File 2      The same in both
                                            miraz@firstbite.co.nz
miraz@mactips.info
                      miraz@example.com

Or, as the help file says:

The comm utility reads file1 and file2, which should be sorted lexically, and produces three text columns as output: lines only in file1; lines only in file2; and lines in both files.
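
Here’s a small sketch you can paste into the Terminal to watch comm build those three columns. It’s only an illustration: the scratch folder and the file names list1.txt, list2.txt and result.txt are made up for the demo, and the LC_ALL=C line is my assumption for keeping sort and comm agreed on the ordering.

```shell
# Sketch only: list1.txt, list2.txt and result.txt are made-up names for this demo.
export LC_ALL=C                 # make sort and comm agree on the ordering
demo=$(mktemp -d "${TMPDIR:-/tmp}/commdemo.XXXXXX")   # scratch folder, no real files touched

# Two tiny lists, sorted as comm expects.
printf '%s\n' miraz@mactips.info miraz@firstbite.co.nz | sort > "$demo/list1.txt"
printf '%s\n' miraz@firstbite.co.nz miraz@example.com | sort > "$demo/list2.txt"

# Column 1: only in list1. Column 2: only in list2. Column 3: in both.
# comm puts a tab character in front of a line for each column it skips.
comm "$demo/list1.txt" "$demo/list2.txt" > "$demo/result.txt"
cat "$demo/result.txt"
```

The address that is in both lists is pushed two tabs to the right, the one that is only in list2.txt is pushed one tab, and the one that is only in list1.txt starts at the left edge.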

The problem of case

Sometimes the same email address or name can be written several ways, mixing upper and lower case letters. It’s not helpful to view miraz and MIRAZ or Miraz as different.

We can add an instruction for the comm command to ignore case by including -i:

-i   Case insensitive comparison of lines.
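
The -i option belongs to the comm that ships with the Mac. Some other versions of comm don’t have it (GNU comm on Linux is one), so here’s a hedged alternative rather than the -i option itself: turn everything into lower case before comparing. The file names mixed.txt and lower.txt are hypothetical.

```shell
# Sketch only: mixed.txt and lower.txt are hypothetical file names.
export LC_ALL=C
demo=$(mktemp -d "${TMPDIR:-/tmp}/commdemo.XXXXXX")

# The same address written three different ways.
printf '%s\n' MIRAZ@example.com Miraz@Example.com miraz@example.com > "$demo/mixed.txt"

# tr swaps every capital letter for its lower-case twin; sort -u then
# sorts the lines and drops the exact duplicates that are left.
tr '[:upper:]' '[:lower:]' < "$demo/mixed.txt" | sort -u > "$demo/lower.txt"
cat "$demo/lower.txt"
```

The three spellings collapse into a single lower-case line. If both lists are treated this way first, plain comm with no -i can compare them.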

Get ready to use the comm command

The following steps look very long and complicated but that’s only because I’ve written them out in detail. Actually it’s very easy: create 2 files, sort them, save them and then open Terminal, type the command, drag the files in, and the task is done.

  1. Sort each file into alphabetical order (this step is crucial). Handy tip: include a unique item in each file to help you distinguish them. In my real-world example I added a dummy email address in ALL CAPS to each: MIRAZ@MAILINGLIST and MIRAZ@PAYPAL. The caps made the text easy to see.
  2. Save the files. Tip for beginners: make a separate folder and save both files inside it. My folder’s named Demo, and my files are named compare1.txt and compare2.txt.
  3. Open Terminal.app. It’s inside the Utilities folder, which is inside the Applications folder. A more-or-less blank window appears.
  4. Open the Finder window containing the files to compare and place it next to the Terminal window.

The Terminal window on the left and the Finder window on the right. My screenshot was made after using the comm command.
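
If you’d rather do the sorting in the Terminal too, the sort command can handle it, and it can also check whether a file is already in order. This is a sketch under my own assumptions, not part of the steps above: unsorted.txt and sorted.txt are hypothetical names, and LC_ALL=C is there so that sort and comm use the same ordering.

```shell
# Sketch only: unsorted.txt and sorted.txt are hypothetical names.
export LC_ALL=C
demo=$(mktemp -d "${TMPDIR:-/tmp}/commdemo.XXXXXX")
printf '%s\n' zoe@example.com adam@example.com mia@example.com > "$demo/unsorted.txt"

# Write a sorted copy rather than changing the original file.
sort "$demo/unsorted.txt" > "$demo/sorted.txt"

# sort -c only checks the order: it succeeds quietly if the file is sorted
# and fails if it finds a line out of place.
if sort -c "$demo/sorted.txt" 2>/dev/null; then
  echo "sorted.txt is in order"
fi
if ! sort -c "$demo/unsorted.txt" 2>/dev/null; then
  echo "unsorted.txt is NOT in order"
fi
```

Running the check before comm is a cheap way to avoid comparing files that weren’t sorted properly.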

Use the comm command

  1. Type the letters cd followed by a space into the Terminal window. The cd command means change directory. We’re telling the Terminal to look in a particular folder (directory) in the Finder.
  2. Beside the name of the Finder window, at the very top, is a proxy icon. It looks like a folder. Drag the proxy icon into the Terminal window. The path for the folder appears in the Terminal window. In my screenshot the path is /Users/miraz/Demo.
  3. Press Return. The Terminal is now ‘working’ in that folder.
  4. Type comm -i followed by a space (that’s comm, space, hyphen, i, space).
  5. Drag one file from the Finder window into the Terminal window. As in Step 2, the path is inserted into the Terminal window.
  6. Press the Spacebar to create a space.
  7. Drag the other file from the Finder window into the Terminal window. Again, the path is inserted into the Terminal window.
  8. Press the Spacebar to create a space.
  9. Type a >.
  10. Press the Spacebar to create a space.
  11. Type a file name for the output file. The command creates 3 columns of text. If you tell Terminal to save those 3 columns into a file then you can look at it later using any software you like, for example, Numbers.app or your text editor.
  12. Press Return. The comm command does its work, compares the two files and creates a third file that shows the differences. It saves the file with the filename you entered in the previous step.
  13. Type exit and press Return to finish the Terminal session and quit Terminal if you don’t intend to use it for anything else at the moment.

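Here’s what my command looked like. I’ve spread it over a few lines to fit in this post, and to make it easier to read. The backslash at the end of a line tells the Terminal that the command carries on to the next line, so you can type it exactly as shown. If you’d rather type it as one long line, leave the backslashes out and don’t press Return until the very end:

comm -i \
 /Users/miraz/Demo/compare1.txt \
 /Users/miraz/Demo/compare2.txt \
 > compare-results.txt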

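Sometimes you only want one of the three columns, for example just the people who paid but never registered. comm can hide columns with the options -1, -2 and -3. The sketch below uses made-up addresses and hypothetical file names (paypal.txt, website.txt and the three output files); it isn’t the command I ran on my real lists.

```shell
# Sketch only: paypal.txt, website.txt and the output names are hypothetical.
export LC_ALL=C
demo=$(mktemp -d "${TMPDIR:-/tmp}/commdemo.XXXXXX")
printf '%s\n' chris@example.com jo@example.com sam@example.com > "$demo/paypal.txt"
printf '%s\n' chris@example.com pat@example.com sam@example.com > "$demo/website.txt"

# -2 and -3 hide columns 2 and 3, leaving lines found only in the first file.
comm -23 "$demo/paypal.txt" "$demo/website.txt" > "$demo/paid-only.txt"
# -1 and -3 leave lines found only in the second file.
comm -13 "$demo/paypal.txt" "$demo/website.txt" > "$demo/website-only.txt"
# -1 and -2 leave lines found in both files.
comm -12 "$demo/paypal.txt" "$demo/website.txt" > "$demo/both.txt"

cat "$demo/paid-only.txt"
```

Here paid-only.txt holds jo@example.com, website-only.txt holds pat@example.com, and both.txt holds the two addresses the lists share.
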
Open the comparison table


The comparison table with columns that are too narrow. (Oops — I hadn’t sorted the files correctly before I began. I started over and the next screenshot shows better results.)

In Step 11 above you entered a name for the file that the comm command would create. In my screenshot that name is: compare-results.txt.

If you simply open that file with your text editor, you’ll probably get a fright. It’ll look terrible.

Instead, open it with Numbers.app or Excel, or another spreadsheet.

If it still looks terrible, make the columns wider.

The wider columns make the table clearer. Note how my dummy text helps show which column reflects which file.


As the screenshot shows, email addresses that are only in file 1 are in the first column, email addresses that are unique to file 2 are in the second column, and email addresses that are the same in both files are in column three. Because the first line of each file was my own special unique text I can easily tell which column is which.
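
If you’d like the columns labelled without relying on dummy text, one possible approach is to turn comm’s tab-separated output into a CSV file with a heading row before opening it in a spreadsheet. This is only a sketch: the headings and the name compare-results.csv are my own choices, the lists are tiny stand-ins created in a scratch folder, and it assumes none of the lines contain commas.

```shell
# Sketch only: the headings and compare-results.csv are made up for this demo.
export LC_ALL=C
demo=$(mktemp -d "${TMPDIR:-/tmp}/commdemo.XXXXXX")
printf '%s\n' miraz@firstbite.co.nz miraz@mactips.info > "$demo/compare1.txt"
printf '%s\n' miraz@example.com miraz@firstbite.co.nz > "$demo/compare2.txt"

# comm puts a tab before a line for each column it skips, so splitting every
# line on tabs gives the three columns. awk writes them back out separated
# by commas, underneath a heading row.
comm "$demo/compare1.txt" "$demo/compare2.txt" |
  awk 'BEGIN { FS = "\t"; OFS = ","; print "Only in file 1,Only in file 2,In both" }
       { print $1, $2, $3 }' > "$demo/compare-results.csv"
cat "$demo/compare-results.csv"
```

Each row of the CSV file has its address in exactly one of the three cells, under the heading that says where it was found.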

How else you might use this

Let’s imagine you have a bunch of files in a folder on your Mac. There’s another folder on a backup disc that probably has the same files in it. Or does it? After all, you dimly remember there was a problem when you copied them across.

So you look at each folder, and sure enough: one folder contains 427 files while the other contains 562. It’s time to compare.

Open each folder in the Finder and Select All then Copy. All the filenames are copied.

Paste the copied lists into separate text documents and then use the technique listed above to compare them.
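
The lists of filenames can also be made in the Terminal instead of by copying and pasting. Here’s a sketch that assumes two folders I’ve invented for the demo, original and backup, each holding a few empty files. It compares only the names at the top level of each folder, not what’s inside the files.

```shell
# Sketch only: the folders "original" and "backup" and their files are made up.
export LC_ALL=C
demo=$(mktemp -d "${TMPDIR:-/tmp}/commdemo.XXXXXX")
mkdir "$demo/original" "$demo/backup"
touch "$demo/original/a.txt" "$demo/original/b.txt" "$demo/original/c.txt"
touch "$demo/backup/a.txt" "$demo/backup/c.txt"

# ls -1 prints one name per line; sort puts the names in the order comm expects.
ls -1 "$demo/original" | sort > "$demo/original-list.txt"
ls -1 "$demo/backup" | sort > "$demo/backup-list.txt"

# Column 1: only in original. Column 2: only in backup. Column 3: in both.
comm "$demo/original-list.txt" "$demo/backup-list.txt" > "$demo/folder-results.txt"
cat "$demo/folder-results.txt"
```

In this demo b.txt ends up alone in the first column, which shows it never made it to the backup folder.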

It’s easy when you know how.


One Comment

  1. Miraz Jordan said:

    Jenine emailed:

    “Great tip, Miraz! Thank you so much! I was wanting to do this (compare the contents of a directory on my home drive and a directory on my backup drive) not too long ago. And now I need to compare attendee lists for conferences in past years and this year, this is going to be so helpful. Thanks!”
