Here are the problems:
"Dataset A" has a whole spread of respondents many who filled out the psychometric test I am interested in and many who didn't. I only want to look at the participants who filled out the psychometric test I am interested in.
"Dataset B" has multiple responses for every ID number (each one being a separate Tweet) but 6 columns of each of these responses are the same per participant. These columns are the overall output of the text analysis program per participant. The output of the text analysis program is the data I am interested in. So I will have something like 200 rows per participant when I only need 1.
Step 1 is to make it so in "Dataset B" every participant or ID number only occupies 1 row. I need to delete the duplicates by ID number.
This is not too difficult, there is a procedure for this called "identify duplicate cases":
- data -> identify duplicate cases
In the "define matching cases by" box you identify the variables that make a case unique. In your case, it might just be ID number.
However, someimes you need 2 or more (e.g.: "school" and "student").
Depending on what option you choose, you will have a new variable called primaryfirst or priamry last, which has a value of 1 for all unique (non-duplicate) cases, and a 0 for all duplicates.
Simply use selecte cases -> delete unselected cases to remove the duplicates.
Step 2 is to use the ID numbers from "Dataset B" after the processing in Step 1 to delete all the participants with an ID number that is not in "Dataset B" from "Dataset A".
Assuming you have the same ID numbers in datasat A as in dataset B, you might not need to do this.
What you can do is go to "dataset A", sort on ID-number, and merge with "dataset B"(also sorted on ID number), using option 2, "nonactive dataset is keyed table".
Then, all cases in "dataset b" will be added to the appropriate case in "dataset A".
If "dataset A" has no match in "dataset b", the variables will just be empty for the "twittervariables" you added. You can then delete the cases that have no information on this variable.
If you are worried you might have too many cases and it will go wrong, alternative way is to create a syntax in excel (this is not the most elegant way but it will work):
simply copy and paste all ID-numbers from "dataset b" into column B of your excel file.
In excel, do the following
- in column A, type: if (ID-number =
- in column B, you have all case numbers from dataset B
- in column C, type: ) keep = 1.
"Drag" the arrow down for column A and C so the text is repeated for every cell that has a number in column b.
Copy paste the whole thing and type under it:
SELECT IF (keep = 1).
Then dataset A will just contain ID numbers that are also in dataset B, then merge.
If I have been unclear and you don't quite get what I am talking please let me know.
unless I misunderstood this was one of the best formulated questions in the last couple of years...