Hullo. (If you are in a hurry, just read the last paragraph)
My workplace (an accommodation listing web site) has a problem with scammers. They use stolen credit cards to sign up for accounts on the site, then spam users with fraudulent messages persuading them to send money via Western Union. They are depressingly successful. Given the proportion of users who have their password set to "password" this perhaps isn't surprising, but I digress.
It has occurred to me that we have a potentially massive dataset that could help us predict whether a person is a scammer or not. There are a number of criteria I know how to check against (time it takes for them to send their first message, the time of day they send these messages, etc.) but one big one is the text content of the ads themselves.
I'd like to run an analysis of which words appear in scammer ads, but not in genuine ads. The text is stored as a (MS SQL) varchar field; my current plan is to use a cursor (...urgh) to go through each row, splitting the string and UPDATE-ing a count of each word's occurrences in a separate table as I go. This sounds hellish, performance-wise. Does anyone have any ideas of a better way to do this?
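For what it's worth, if exporting the text out of SQL Server is an option, the counting itself is cheap in a scripting language. Here's a rough sketch of the kind of thing I mean in Python; the tokenizer, the `min_scam_hits` threshold, and the toy data are all placeholders, not a real implementation:

```python
import re
from collections import Counter

def word_counts(ads):
    """Count how many ads each word appears in (document frequency)."""
    counts = Counter()
    for text in ads:
        # Use a set so a word repeated within one ad counts once.
        words = set(re.findall(r"[a-z']+", text.lower()))
        counts.update(words)
    return counts

def suspicious_words(scam_ads, genuine_ads, min_scam_hits=2):
    """Words appearing in several scam ads but in no genuine ad."""
    scam = word_counts(scam_ads)
    genuine = word_counts(genuine_ads)
    return sorted(
        (word for word, n in scam.items()
         if n >= min_scam_hits and word not in genuine),
        key=lambda w: -scam[w],
    )

# Toy data standing in for the exported varchar column:
scam_ads = [
    "Send deposit via western union before viewing",
    "Pay by western union, keys posted to you",
]
genuine_ads = ["Sunny two-bed flat, viewings this weekend"]

print(suspicious_words(scam_ads, genuine_ads))
```

The same set-based idea (count word-per-document, compare the two classes) should carry over to whatever tool ends up doing the work; the point is to avoid the row-by-row cursor entirely.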