Page 1 of 1


Posted: Tue Mar 13, 2018 8:08 am
by Giocar87
Good morning everyone, I'm using IBM SPSS statstics 21 with Windows 7 PC.
I have a big dataset (2500 patients) with different entries, my question is: how can I transform different words (antibiotics) within a string into different numbers in other columns?
My data looks like this:

String2 Column1: AMOXI
(and so on...)

I would like to have, as an end result, different new columns (Therapy1, Therapy2, Therapy 3), each containing a number which represents each different antibiotic (e.g. CEFTRIAX=1, AMOXI=2,COTRIMOX=3,CIPROFLOX=4).

So far I only managed to select only the first occurrence of one therapy in each string, select it and put it into a separate column through:
DO IF CHAR.INDEX(Column1, 'CEFTRIAX') GT 0. COMPUTE Therapy1 = 1.
(and so on)

How can I select also the other therapies within the same string?

Thank you very much in advance for your help,
Best regards,



Posted: Mon Mar 19, 2018 5:28 pm
by a5000b
There are a few approaches to a problem like this.

First and foremost, if there are a limited number of the possible antibiotic combinations, it might be worth hardcoding your conditions- but if this exceeds more than a few of these and it will change overtime, then it would require a more programmatic approach.

One important question for my suggested approach - does your data always come back in the same predictable pattern ("+" delimited), where you're looking to make n columns where each column represents ONE antibiotic?

antibiotic1 + antibiotic2 + ....antibiotic n?

If so, I think the following might work:

Develop a macro to peel your data apart- something to the effect of:
On each iteration of the macro, compute one variable to hold the first part of the string up until the first plus sign, then another variable to house the rest of it. Repeat this until no plus signs remain in the string. The resulting output should be one row for each patient with one column for each antibiotic in the original text- your max number of columns will be determined by the max number of antibiotics in any one patient case.

Next, you should have a whole bunch of antibiotics, in no particular order. You'll want to run a varstocases statement to get multiple rows for each patient. This will allow you to sort on patient, antibiotic to get them in a uniform order.

Next you'll want to create a lookup file for all possible antibiotics- a simple aggregate on that column should do the trick. You'll want to use this dataset in combination with your list of patients to create a dataset where you have one row for each patient/antibiotic combination.

Next, match the files (using a file to table match where this most recent dataset is your file, and the previous file which only has the antibiotics prescribed to your patients is your table). This will indicate will allow you to determine which (if any) of the antibiotics your patients saw. You'll want all other cases to indicate 0 (or an indicator of your choice).

Finally, you can run a casestovars statement to get it back in the shape you desire- you should see indicators for each of your antibiotics as variables.

However, again, I'm assuming you have some way to reliably split your data. If the data is allowed to flow in freeform, then you're looking at a lot of other possibilities (accounting for typos, missing delimiters etc...).

Hope that helps.