Hullo. (If you are in a hurry, just read the last paragraph)
My workplace (an accommodation listing web site) has a problem with scammers. They use stolen credit cards to sign up for accounts on the site, then spam users with fraudulent messages persuading them to send money via Western Union. They are depressingly successful. Given the proportion of users who have their password set to "password" this perhaps isn't surprising, but I digress.
It has occurred to me that we have a potentially massive dataset that could help us predict whether a person is a scammer or not. There are a number of criteria I know how to check against (time it takes for them to send their first message, the time of day they send these messages, etc.) but one big one is the text content of the ads themselves.
I'd like to run an analysis of which words appear in scammer ads, but not in genuine ads. The text is stored as a (MS SQL) varchar field; my current plan is to use a cursor (...urgh) to go through each row, splitting the string and UPDATE-ing a count of each word's occurrences in a separate table as I go. This sounds hellish, performance-wise. Does anyone have any ideas of a better way to do this?
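For what it's worth, if exporting the text out of SQL Server is an option, the counting itself is cheap in a scripting language. Here's a rough sketch of the kind of thing I mean in Python; the tokenizer, the `min_scam_hits` threshold, and the toy data are all placeholders, not a real implementation:

```python
import re
from collections import Counter

def word_counts(ads):
    """Count how many ads each word appears in (document frequency)."""
    counts = Counter()
    for text in ads:
        # Use a set so a word repeated within one ad counts once.
        words = set(re.findall(r"[a-z']+", text.lower()))
        counts.update(words)
    return counts

def suspicious_words(scam_ads, genuine_ads, min_scam_hits=2):
    """Words appearing in several scam ads but in no genuine ad."""
    scam = word_counts(scam_ads)
    genuine = word_counts(genuine_ads)
    return sorted(
        (word for word, n in scam.items()
         if n >= min_scam_hits and word not in genuine),
        key=lambda w: -scam[w],
    )

# Toy data standing in for the exported varchar column:
scam_ads = [
    "Send deposit via western union before viewing",
    "Pay by western union, keys posted to you",
]
genuine_ads = ["Sunny two-bed flat, viewings this weekend"]

print(suspicious_words(scam_ads, genuine_ads))
```

The same set-based idea (count word-per-document, compare the two classes) should carry over to whatever tool ends up doing the work; the point is to avoid the row-by-row cursor entirely.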