How to quickly compare lists of words using the Terminal
Mac Tip #555, 10 October 2012
Checking two lists of words to see if they’re the same or how they differ is massively tedious if you do it manually. With a little know-how you can get your Mac to do the chore in seconds. Here’s how to make it happen.
Actually I’m being overly generous with the word ‘seconds’ there. When I used this technique on files with hundreds of email addresses it was finished before I could even look away.
I help look after the members on a website. I help them with queries about subscriptions, newsletters, and website access.
The website I do this work for is making some big changes behind the scenes, and I needed to compare lists of email addresses gathered from PayPal, the website itself, and the newsletter service.
Note: for the purposes of the screenshots in this post I’ve created a list of dummy email addresses. They are intended to be fictitious. Would you like to try this at home? Here are my test files (3KB zipped file) to save you some work.
Comparing files is easy
You’d think such a comparison would be easy. Take a list of email addresses, sort it into alphabetical order, and compare it with another sorted list.
That kind of comparison is extremely simple with a good text editor such as BBEdit. Make 2 files, sort them alphabetically, and choose Compare Two Front Documents from the Search menu.
Compare Two Front Documents. These two files should be sorted first.
That works nicely, but BBEdit only goes through and highlights lines that aren’t the same. What I needed was a table showing me which addresses were in both files, which appeared in only one, and which file that was.
For example, had Jo Bloggs paid a sub but failed to register on the website? Had Chris Smith registered on the website but missed being signed up for the newsletter? It wasn’t enough for me just to know who wasn’t on both lists. I had to know how the lists differed.
The name problem
An additional problem lies in how people sign up for these three separate services: payment, website and newsletter. At the extreme, a member may use a ‘name’ in PayPal of Acme Inc. connected to an email address of dorothy@myisp.com.au, an address on the website of happydays@anotherexample.net, a username of greenfrog, and a mailing list address of justanotherday@myexample.edu.
Later that person may change their email address in any or all of those 3 locations too.
That is a perfectly legitimate thing to do, but it makes it very hard for us to see all those different pieces of information as belonging to one single person. We need to be able to connect a website sign up and a mailing list sign up with the person who paid for the service.
While we store that information in a Numbers spreadsheet, we needed to download fresh copies of the lists to be sure we had the most up-to-date information, and then match that data to our records too.
The Unix comm command compares files
After a few moments with Google I discovered the incredibly useful UNIX comm command. It produces a superb table that shows three columns of data:
| Unique in File 1 | Unique in File 2 | The same in both |
|---|---|---|
| miraz@firstbite.co.nz | | |
| | miraz@mactips.info | |
| | | miraz@example.com |
Or, as the help file says:
The comm utility reads file1 and file2, which should be sorted lexically, and produces three text columns as output: lines only in file1; lines only in file2; and lines in both files.
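As a rough illustration of that three-column layout, here is a minimal sketch that builds two tiny sorted lists in a temporary folder and runs comm on them. The file names and addresses are made up for the demonstration.

```shell
# Work in a throwaway folder so nothing real is touched (hypothetical demo data).
demo_dir=$(mktemp -d)

# Two small lists, already in sorted order, one address per line.
printf '%s\n' alice@example.com bob@example.com carol@example.com > "$demo_dir/list1.txt"
printf '%s\n' bob@example.com carol@example.com dave@example.com > "$demo_dir/list2.txt"

# Column 1 (no indent): only in list1.
# Column 2 (one tab in): only in list2.
# Column 3 (two tabs in): in both files.
comm_output=$(comm "$demo_dir/list1.txt" "$demo_dir/list2.txt")
printf '%s\n' "$comm_output"
```

The columns are separated by tab characters, which is why the result looks ragged in a text editor but lines up in a spreadsheet.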
The problem of case
Sometimes the same email address or name can be written several ways, mixing upper and lower case letters. It’s not helpful to view miraz and MIRAZ or Miraz as different.
We can add an instruction for the comm command to ignore case by including -i:
-i Case insensitive comparison of lines.
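The -i option belongs to the comm that comes with macOS; the GNU version found on most Linux systems doesn't offer it. If you ever need the same effect elsewhere, one possible workaround (a swapped-in technique, not what this tip uses) is to fold both lists to lower case before sorting and comparing. The file names and addresses below are invented for the example.

```shell
demo_dir=$(mktemp -d)

# The same person appears with different capitalisation in each list.
printf '%s\n' Miraz@example.com zoe@example.com > "$demo_dir/paypal.txt"
printf '%s\n' MIRAZ@EXAMPLE.COM adam@example.com > "$demo_dir/newsletter.txt"

# Fold to lower case, then sort, so both files share the same case and order.
tr '[:upper:]' '[:lower:]' < "$demo_dir/paypal.txt" | sort > "$demo_dir/paypal-lower.txt"
tr '[:upper:]' '[:lower:]' < "$demo_dir/newsletter.txt" | sort > "$demo_dir/newsletter-lower.txt"

# -12 hides columns 1 and 2, leaving only the lines found in both files.
in_both=$(comm -12 "$demo_dir/paypal-lower.txt" "$demo_dir/newsletter-lower.txt")
printf '%s\n' "$in_both"
```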
Get ready to use the comm command
The following steps look very long and complicated but that’s only because I’ve written them out in detail. Actually it’s very easy: create 2 files, sort them, save them and then open Terminal, type the command, drag the files in, and the task is done.
- Sort each file into alphabetical order (this step is crucial). Handy tip: include a unique item in each file to help you distinguish them. In my real-world example I added a dummy email address in ALL CAPS to each: `MIRAZ@MAILINGLIST` and `MIRAZ@PAYPAL`. The caps made the text easy to see.
- Save the files. Tip for beginners: make a separate folder and save both files inside it. My folder’s named `Demo`, and my files are named `compare1.txt` and `compare2.txt`.
- Open `Terminal.app`. It’s inside the `Applications` > `Utilities` folder. A more-or-less blank window appears.
- Open the Finder window containing the files to compare and place it next to the Terminal window.
The Terminal window on the left and the Finder window on the right. My screenshot was made after using the comm command.
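If you'd rather do the sorting in Terminal as well, a minimal sketch follows. It assumes your unsorted list is a plain text file with one address per line; the file names are hypothetical. The -f option asks sort to ignore case while ordering, which I'd expect to suit a case-insensitive comparison better than a plain sort, though treat that as an assumption to check against your own data.

```shell
demo_dir=$(mktemp -d)

# An unsorted list with mixed capitalisation (made-up addresses).
printf '%s\n' zoe@example.com Bob@example.com alice@example.com > "$demo_dir/unsorted.txt"

# -f folds case while sorting, so "Bob" lands between "alice" and "zoe".
# A plain byte-order sort would put the capitalised "Bob" first.
sort -f "$demo_dir/unsorted.txt" > "$demo_dir/compare1.txt"

sorted_list=$(cat "$demo_dir/compare1.txt")
printf '%s\n' "$sorted_list"
```

Repeat the sort for the second list, and both files are ready for comm.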
Use the comm command
- Type the letters `cd` followed by a space into the Terminal window. The `cd` command means change directory. We’re telling the Terminal to look in a particular folder (directory) in the Finder.
- Beside the name of the Finder window, at the very top, is a proxy icon. It looks like a folder. Drag the proxy icon into the Terminal window. The path for the folder appears in the Terminal window. In my screenshot the path is `/Users/miraz/Demo`.
- Press Return. The Terminal is now ‘working’ in that folder.
- Type `comm -i` (that’s comm, space, hyphen, i), followed by a space.
- Drag one file from the Finder window into the Terminal window. As in the step above, the path is inserted into the Terminal window.
- Press the Spacebar to create a space.
- Drag the other file from the Finder window into the Terminal window. As in the step above, the path is inserted into the Terminal window.
- Press the Spacebar to create a space.
- Type a `>`.
- Press the Spacebar to create a space.
- Type a file name for the output file. The command creates 3 columns of text. If you tell Terminal to save those 3 columns into a file then you can look at it later using any software you like, for example, `Numbers.app` or your text editor.
- Press Return. The comm command does its work, compares the two files and creates a third file that shows the differences. It saves the file with the filename you entered in the previous step.
- Type `exit` and press Return to finish the Terminal session, then quit Terminal if you don’t intend to use it for anything else at the moment.
Here’s what my command looked like. It may wrap over a few lines on this page, but when you type the command, don’t press Return until the very end:
comm -i /Users/miraz/Demo/compare1.txt /Users/miraz/Demo/compare2.txt > compare-results.txt
Open the comparison table
The comparison table with columns that are too narrow. (Oops — I hadn’t sorted the files correctly before I began. I started over and the next screenshot shows better results.)
In the steps above you entered a name for the file that the comm command would create. In my screenshot that name is: compare-results.txt.
If you simply open that file with your text editor, you’ll probably get a fright. It’ll look terrible.
Instead, open it with Numbers.app or Excel, or another spreadsheet.
If it still looks terrible, make the columns wider.
The wider columns make the table clearer. Note how my dummy text helps show which column reflects which file.
As the screenshot shows, email addresses that are only in file 1 are in the first column, email addresses that are unique to file 2 are in the second column, and email addresses that are the same in both files are in column three. Because the first line of each file was my own special unique text I can easily tell which column is which.
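If three columns in one file feel awkward, comm can also hide columns: -1, -2 and -3 each suppress the matching column. Here is a minimal sketch, using made-up file names and addresses, that writes each group to its own file instead.

```shell
demo_dir=$(mktemp -d)

# Two small sorted lists (hypothetical data).
printf '%s\n' alice@example.com bob@example.com > "$demo_dir/compare1.txt"
printf '%s\n' bob@example.com carol@example.com > "$demo_dir/compare2.txt"

# Hide two columns at a time so each output file holds a single group.
comm -23 "$demo_dir/compare1.txt" "$demo_dir/compare2.txt" > "$demo_dir/only-in-1.txt"  # column 1 only
comm -13 "$demo_dir/compare1.txt" "$demo_dir/compare2.txt" > "$demo_dir/only-in-2.txt"  # column 2 only
comm -12 "$demo_dir/compare1.txt" "$demo_dir/compare2.txt" > "$demo_dir/in-both.txt"    # column 3 only

only_in_1=$(cat "$demo_dir/only-in-1.txt")
only_in_2=$(cat "$demo_dir/only-in-2.txt")
in_both=$(cat "$demo_dir/in-both.txt")
printf 'only in 1: %s\nonly in 2: %s\nin both: %s\n' "$only_in_1" "$only_in_2" "$in_both"
```

Each output file is then a plain list with no tabs, which reads fine in any text editor.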
How else you might use this
Let’s imagine you have a bunch of files in a folder on your Mac. There’s another folder on a backup disc that probably has the same files in it. Or does it? After all, you dimly remember there was a problem when you copied them across.
So you look at each folder, and sure enough: one folder contains 427 files while the other contains 562. It’s time to compare.
Open each folder in the Finder and Select All then Copy. All the filenames are copied.
Paste the copied lists into separate text documents and then use the technique listed above to compare them.
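The copy-and-paste step can also be skipped by letting ls write the file names for you. This is a minimal sketch, assuming two hypothetical folders that are created here only so the example runs on its own; it also assumes ordinary file names without line breaks in them.

```shell
demo_dir=$(mktemp -d)
mkdir "$demo_dir/original" "$demo_dir/backup"

# Stand-in files: report.txt never made it to the backup.
touch "$demo_dir/original/notes.txt" "$demo_dir/original/report.txt"
touch "$demo_dir/backup/notes.txt"

# One file name per line, sorted, for each folder.
ls "$demo_dir/original" | sort > "$demo_dir/original-list.txt"
ls "$demo_dir/backup" | sort > "$demo_dir/backup-list.txt"

# -23 leaves only the names that are in the original folder but not in the backup.
missing_from_backup=$(comm -23 "$demo_dir/original-list.txt" "$demo_dir/backup-list.txt")
printf '%s\n' "$missing_from_backup"
```

With real folders you would replace the two paths given to ls, for instance by dragging each folder into the Terminal window.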
It’s easy when you know how.
Jenine emailed:
“Great tip, Miraz! Thank you so much! I was wanting to do this (compare the contents of a directory on my home drive and a directory on my backup drive) not too long ago. And now I need to compare attendee lists for conferences in past years and this year, this is going to be so helpful. Thanks!”