So I've got a conundrum that I'm hoping someone else can take a crack at and show me up...
I have one large data set (n ≈ 500K) with survey responses, each with a unique respondent ID.
We recently received new variables for this same population (e.g. test scores, anthropometric data) in a separate data set, but we are under strict guidelines to analyze ONLY the original 500K participants. The problem is that the new data set (test/anthros) has about 4 million more rows, because it includes respondents who are not in our original data AND multiple rows for individuals who took the new test more than once.
So I need to add the new variables to this original data set for those 500K, matching by nearest date, while leaving out the extras.
For example, if respondent ABC1001 took the original survey on 6/1/2014, and the new data set has 4 rows for ABC1001 (one for each time he took the new tests, in 2013, 2014, 2015, and 2016), I need to match his original survey responses with his 2014 test/anthro data. I need to do this on a large scale, though, so the code has to not only combine cases but combine the RIGHT cases, by finding the smallest difference between the survey date and the test date.
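To make the goal concrete, here is a rough sketch of the logic in plain Python. It is only an illustration on made-up toy rows, not something I'd expect to run well on 4 million rows as written. The field names (`id`, `survey_date`, `test_date`, `score`), the function name `match_nearest`, and all the dates and scores are placeholders I invented for the example.

```python
from datetime import date


def match_nearest(surveys, tests):
    """For each survey row, attach the same respondent's test row
    whose test date is closest to the survey date."""
    # Keep only test rows for respondents in the original survey data,
    # grouped by respondent ID.
    wanted = {s["id"] for s in surveys}
    tests_by_id = {}
    for t in tests:
        if t["id"] in wanted:
            tests_by_id.setdefault(t["id"], []).append(t)

    merged = []
    for s in surveys:
        candidates = tests_by_id.get(s["id"], [])
        if candidates:
            # Smallest absolute gap in days wins; on a tie, min() keeps
            # whichever candidate appears first in the test data.
            best = min(
                candidates,
                key=lambda t: abs((t["test_date"] - s["survey_date"]).days),
            )
            merged.append({**s, **best})
        else:
            # No test record at all: keep the survey row unchanged.
            merged.append(dict(s))
    return merged


# Toy data (hypothetical values).
surveys = [
    {"id": "ABC1001", "survey_date": date(2014, 6, 1)},
    {"id": "ABC1002", "survey_date": date(2015, 3, 15)},
]
tests = [
    {"id": "ABC1001", "test_date": date(2013, 5, 20), "score": 71},
    {"id": "ABC1001", "test_date": date(2014, 5, 28), "score": 75},
    {"id": "ABC1001", "test_date": date(2015, 6, 3), "score": 78},
    {"id": "ABC1001", "test_date": date(2016, 6, 10), "score": 80},
    {"id": "XYZ9999", "test_date": date(2014, 6, 1), "score": 60},  # not in the survey
]

result = match_nearest(surveys, tests)
```

In this toy run ABC1001 ends up with his 2014 test row, XYZ9999 is dropped because he isn't in the original data, and ABC1002 stays in the output with no test variables. What I'm missing is a way to do the same thing efficiently on the full data sets.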
I realize that's a lot to ask of a forum, but does anyone have any tips? Anything AT ALL would be HUGELY appreciated, and I'm happy to clarify anything that isn't clear.