How does weighting my cases (with the "WEIGHT BY..." syntax) affect the results of logistic regression? Might it cause the procedure to report a higher confidence level than is justified, because it thinks the number of input cases is larger than it really is?
Background: I'm building a response model for a catalog mailing to prospects. Typical response rates are about 1%, so if I mail 1,000,000 catalogs, I get about 10,000 purchasers. I've typically handled this by creating a data set that includes all 10,000 responders but only a random 1% or 2% of non-responders. (FYI: The model is then actually built on only half of this sample, as I always reserve some of the data for model validation.)
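To make the skew concrete, here is a rough sketch of the sample composition under the figures above. It assumes a response rate of exactly 1% and a non-responder sample of exactly 1% or 2%; the function name `sample_composition` is only illustrative.

```python
def sample_composition(mailed, response_rate, nonresponder_fraction):
    """Return (responders kept, non-responders kept, response rate in the sample)."""
    responders = round(mailed * response_rate)          # all responders are kept
    non_responders = mailed - responders
    sampled_non = round(non_responders * nonresponder_fraction)
    sample_rate = responders / (responders + sampled_non)
    return responders, sampled_non, sample_rate


if __name__ == "__main__":
    # Assumed figures: 1,000,000 mailed, 1% response, 1% or 2% of non-responders kept.
    for fraction in (0.01, 0.02):
        kept_resp, kept_non, rate = sample_composition(1_000_000, 0.01, fraction)
        print(f"fraction={fraction:.0%}: {kept_resp} responders, "
              f"{kept_non} non-responders, sample response rate {rate:.1%}")
```

So the modeling sample has a response rate of roughly one third to one half, compared with about 1% in the mailing itself.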
When I've done this without weighting, I get models that do a good job of ordering prospects by their likelihood of responding, but I have to adjust the predicted probabilities of response way down from what the model reports. I'm also concerned that oversampling responders is distorting the model in other ways that I'm not even aware of.
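For reference, one common way to make that kind of downward adjustment is to rescale the predicted odds by the fraction of non-responders that was kept. This is only a sketch: it assumes all responders are kept, non-responders are sampled purely at random at a known rate, and the model is an ordinary logistic regression. The function name `correct_probability` is hypothetical, and this is not necessarily the adjustment I should be using.

```python
def correct_probability(p_sample, nonresponder_fraction):
    """Map a probability predicted on the oversampled data back to the full mailing.

    Assumes every responder was kept and non-responders were kept at random
    with probability `nonresponder_fraction`, so sample odds = true odds / fraction.
    """
    if not 0.0 <= p_sample < 1.0:
        raise ValueError("p_sample must be in [0, 1)")
    sample_odds = p_sample / (1.0 - p_sample)
    true_odds = sample_odds * nonresponder_fraction   # undo the oversampling
    return true_odds / (1.0 + true_odds)


if __name__ == "__main__":
    # A prospect scored at 50% on a sample that kept 1% of non-responders.
    print(round(correct_probability(0.5, 0.01), 4))
```

Because the correction is monotonic, it changes the level of the predicted probabilities but not the rank ordering of prospects.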