Find mismatches in one variable based on another variable

Moderators: statman, Analyst Techy, andris, Fierce, GerineL, Smash

Posts: 6
Joined: Mon Jun 11, 2012 3:51 pm

Find mismatches in one variable based on another variable

Postby wtrfilesx » Mon Jun 11, 2012 3:59 pm

I have a dataset in which multiple rows correspond to the same individual. For validation purposes, I want to look at all the duplicate rows of data for each individual, and make sure that they all have matching ID numbers.

Is it possible to match rows based on Name, and then identify the ones that have different IDs within the various rows?

For example, say the dataset contained the following rows:

1. John Smith #212
2. John Smith #212
3. John Smith # 212
4. Jane Doe #202
5. Jane Doe #210
6. Jane Doe #202
7. John Doe #200
8. John Doe #200

In this example, I would want to find out that Jane Doe has 2 different ID numbers. Since both John Smith and John Doe have matching IDs throughout their rows of data, I want to ignore their records. Is there any way to accomplish this other than manually reviewing the dataset (since I have nearly 600,000 rows of data)? Thank you for your assistance!

Posts: 473
Joined: Thu Apr 05, 2012 5:58 pm

Re: Find mismatches in one variable based on another variabl

Postby Penguin_Knight » Mon Jun 11, 2012 11:09 pm

Since the mean of ID will be difference if one of the serial id within a person is off, you may try aggregating the mean as a new variable, and then flag the cases that are different. Suppose you name is a string variable called name and your id number is a numeric variable called id. Here is the syntax:

Code: Select all


COMPUTE MultiID = (id NE mean_id) .

Who is online

Users browsing this forum: No registered users and 2 guests