I have a problem that is quite easy to solve in theory but becomes really hard in practice: I have to calculate a clinical score based on patients' discharge diagnoses, which are coded with the International Classification of Diseases (ICD) coding scheme.
The first difficulty is that there are up to 170 diagnostic codes per patient, and each patient has a different number of diagnostic codes.
In its original form, the dataset has one record for each diagnostic code of each admission of each patient. A patient who had, let's say, 15 diagnoses for a particular admission will have 15 records: each record repeats the patient's name, medical record number, and admission and discharge dates, and the only variable that changes from record to record is the diagnostic code. In this original form, the dataset has 184K records.
ID  Name       Date of birth  Gender  Date of admission  Date of discharge  Diagnostic_code
1   Doe, John  01-01-1950     M       10-12-2009         01-01-2010         443.9
1   Doe, John  01-01-1950     M       10-12-2009         01-01-2010         V56.8
1   Doe, John  01-01-1950     M       10-12-2009         01-01-2010         221.02
1   Doe, John  01-01-1950     M       10-12-2009         01-01-2010         428
ID  Name       Date of birth  Gender  Date of admission  Date of discharge  Diagnostic_code.1  Diagnostic_code.2  Diagnostic_code.3  Diagnostic_code.4
1   Doe, John  01-01-1950     M       10-12-2009         01-01-2010         443.9              V56.8              221.02             428
B18.x, K70.0--K70.3, K70.9, K71.3--K71.5, K71.7, K73.x, K74.x, K76.0, K76.2--K76.4, K76.8, K76.9, Z94.4
Here "x" is a wildcard meaning "any value after the '.' sign", and "--" indicates a range. I imagine that coding this is not going to be easy.
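To make the wildcard rule concrete, here is a rough sketch of the matching logic as I understand it. It is written in Python rather than SPSS syntax, purely to illustrate the idea. The names `SPEC`, `parse_spec` and `code_matches` are made up for this example, and I am assuming that both ends of a range are written to the same number of digits, that a code with extra trailing digits still counts as a match, and that "B18.x" also covers a bare "B18".

```python
# The code list from above, written exactly as it appears in the post.
SPEC = ("B18.x, K70.0--K70.3, K70.9, K71.3--K71.5, K71.7, K73.x, K74.x, "
        "K76.0, K76.2--K76.4, K76.8, K76.9, Z94.4")


def _clean(text):
    """Drop the dot and a trailing wildcard 'x', then upper-case."""
    text = text.strip().replace(".", "")
    if text.endswith("x"):
        text = text[:-1]
    return text.upper()


def parse_spec(spec):
    """Turn the comma-separated list into (low, high) pairs of code prefixes."""
    rules = []
    for item in spec.split(","):
        if not item.strip():
            continue
        low, sep, high = item.partition("--")
        low = _clean(low)
        # A single code (no "--") is treated as a range of one.
        high = _clean(high) if sep else low
        rules.append((low, high))
    return rules


def code_matches(code, rules):
    """True if the patient's code falls under any rule.

    Only as many characters as the rule specifies are compared, so
    'B18' (from 'B18.x') matches 'B18.2', and 'K700'..'K703' matches 'K70.1'.
    Assumption: low and high of a range have the same length.
    """
    c = code.strip().upper().replace(".", "")
    return any(low <= c[:len(low)] <= high for low, high in rules)


rules = parse_spec(SPEC)
print(code_matches("K70.2", rules))   # inside the range K70.0--K70.3
print(code_matches("K70.4", rules))   # outside every rule
```

The point is only that each list of codes can be stored once as data and checked by one small routine, instead of writing out every individual code by hand.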
So you can imagine that if I have to write the code in the traditional way (for each combination of diagnostic code variable, item and ICD diagnostic code), it's going to take me forever.
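What I would like to avoid is one statement per variable. As an illustration only (Python, not SPSS syntax), the check could in principle loop over every diagnostic code variable of a wide record. The function name `has_any_code`, the column prefix and the simple prefix test are assumptions for this sketch.

```python
def has_any_code(row, prefixes, column_prefix="Diagnostic_code."):
    """True if any diagnostic code variable in this record starts with one of the prefixes.

    `prefixes` is a string or a tuple of strings; empty or missing cells are skipped.
    """
    for name, value in row.items():
        if not name.startswith(column_prefix):
            continue  # not a diagnostic code variable
        if value is None or not str(value).strip():
            continue  # this patient has fewer codes than there are variables
        if str(value).strip().upper().startswith(prefixes):
            return True
    return False


# The example admission from above, plus one empty variable.
row = {
    "ID": 1,
    "Name": "Doe, John",
    "Diagnostic_code.1": "443.9",
    "Diagnostic_code.2": "V56.8",
    "Diagnostic_code.3": "221.02",
    "Diagnostic_code.4": "428",
    "Diagnostic_code.5": "",
}
print(has_any_code(row, ("428",)))
print(has_any_code(row, ("K70", "K73")))
```

Written this way, the number of diagnostic code variables (4 or 170) does not change the amount of code.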
The other option that I thought of was to calculate the score on the original dataset (the one where each record corresponds to a diagnostic code). That would avoid the problem of having 170 separate diagnostic code variables, but we'd still have to figure out how to calculate the score on a per-admission basis. For example, I don't know how to tell SPSS to identify an admission that spans multiple records, add points only for the set of records that represents that admission, and then move on to the set of records for the next admission.
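To show what I mean by scoring per admission on the long dataset, here is a minimal sketch in Python (again, not SPSS syntax). It assumes that patient ID plus admission date identifies an admission. The conditions, prefixes and point values in `RULES` are placeholders and not the real score, and the second admission in the sample data is invented for the example.

```python
from collections import defaultdict

# Placeholder rules: condition name -> (code prefixes, points). Not the real score.
RULES = {
    "condition_a": (("428",), 1),
    "condition_b": (("443.9", "443.8"), 1),
}


def score_by_admission(records, rules=RULES):
    """Return {(ID, admission date): score} from one-record-per-code data."""
    # Step 1: collect the set of codes that belong to each admission.
    codes = defaultdict(set)
    for rec in records:
        key = (rec["ID"], rec["admission"])
        codes[key].add(rec["code"].strip().upper())

    # Step 2: each condition adds its points at most once per admission,
    # no matter how many of its codes appear.
    scores = {}
    for key, admission_codes in codes.items():
        total = 0
        for prefixes, points in rules.values():
            if any(c.startswith(prefixes) for c in admission_codes):
                total += points
        scores[key] = total
    return scores


records = [
    {"ID": 1, "admission": "10-12-2009", "code": "443.9"},
    {"ID": 1, "admission": "10-12-2009", "code": "V56.8"},
    {"ID": 1, "admission": "10-12-2009", "code": "221.02"},
    {"ID": 1, "admission": "10-12-2009", "code": "428"},
    # Hypothetical second admission: two codes for the same condition.
    {"ID": 1, "admission": "03-05-2010", "code": "428"},
    {"ID": 1, "admission": "03-05-2010", "code": "428.0"},
]
scores = score_by_admission(records)
print(scores)
```

I suppose the SPSS counterpart would be to flag each record and then use something like AGGREGATE with the ID and admission date as break variables, but that is exactly the part I have not been able to work out.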
I hope I have explained myself clearly and, above all, that someone can help me out...
Thank you very much!