Text data analysis

 

 From:  Ally  
 To:  ALL
37670.1 

Hullo. (If you are in a hurry, just read the last paragraph)

 

My workplace (an accommodation listing web site) has a problem with scammers. They use stolen credit cards to sign up for accounts on the site, then spam users with fraudulent messages persuading them to send money via Western Union. They are depressingly successful. Given the proportion of users who have their password set to "password" this perhaps isn't surprising, but I digress.

 

It has occurred to me that we have a potentially massive dataset that could help us predict whether a person is a scammer or not. There are a number of criteria I know how to check against (time it takes for them to send their first message, the time of day they send these messages, etc.) but one big one is the text content of the ads themselves.

 

I'd like to run an analysis of which words appear in scammer ads but not in genuine ads. The ad text is stored in a (MS SQL) varchar field. My current plan is to use a cursor (...urgh) to go through each row, splitting the string and UPDATE-ing a count of each word's occurrences in a separate table as I go. This sounds hellish, performance-wise. Does anyone have any ideas for a better way to do this?


 From:  99% of gargoyles look like (MR_BASTARD)  
 To:  Ally     
37670.2 In reply to 37670.1 
I assume that your scripting language is ASP (or a variant thereof), and I cheerfully admit to knowing SFA about ASP. But is there an equivalent to preg_replace? I was just thinking that you could use your blacklist as an array for the pattern and the scammer's text as the subject, then count the number of replacements. Robert, your mother's brother, is.
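
Since I can't write it in ASP, here is roughly what I mean as a minimal Python sketch. The blacklist terms and the name count_blacklist_hits are made up for illustration; re.subn plays the role of preg_replace with its count argument, returning the number of replacements alongside the new string.

```python
import re

# Hypothetical blacklist, for illustration only; a real one would come
# from your own data.
BLACKLIST = ["western union", "money order", "wire transfer"]

# Build one case-insensitive alternation, longest terms first, matched
# on word boundaries so "union" inside "reunion" does not count.
PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(t) for t in sorted(BLACKLIST, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def count_blacklist_hits(text):
    """Return how many blacklisted terms occur in text."""
    # re.subn returns (new_text, number_of_replacements); only the
    # count is of interest here.
    _, hits = PATTERN.subn("", text)
    return hits
```

A message with a high hit count could then be flagged for a human to look at, rather than blocked outright.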

bastard by name, bastard by nature


 From:  Kenny J (WINGNUTKJ)  
 To:  Ally     
37670.3 In reply to 37670.1 
What version/level of SQL Server are you using? That's the sort of thing that you can do in SSIS. If you've got Enterprise Edition, you get all kinds of fuzzy logic and data mining goodness.

Kenny

The Wisdom of IMDB:

Shooter
Revealing mistakes: When the crippled sniper shoots himself under the chin, no bullet is actually fired from the gun. If you watch the scene frame-by-frame, a jet of air comes out of the gun and the actor's head cocks back. His neck is intact.

 From:  Ally  
 To:  99% of gargoyles look like (MR_BASTARD)     
37670.4 In reply to 37670.2 
We already do that, to an extent. But in this case I don't already have a blacklist; I want to assemble one from the data we already have.

 From:  Ally  
 To:  Kenny J (WINGNUTKJ)     
37670.5 In reply to 37670.3 
We're on 2005, but not Enterprise. Just Standard, or whatever it's called. I suppose it won't be a critical problem to do it manually; it's not going to run often, after all.

 From:  99% of gargoyles look like (MR_BASTARD)  
 To:  Ally     
37670.6 In reply to 37670.5 
quote: Ally
it's not going to run often, after all.

<employs legions of scammers to bust Ally's database />


 From:  Kenny J (WINGNUTKJ)  
 To:  Ally     
37670.7 In reply to 37670.5 

In that case, you won't have the fun stuff for doing pattern matching, etc. No matter! You can still do what you want in SQL rather than .NET code. A good place to start is:

http://stackoverflow.com/questions/881913/sql-server-function-for-displaying-word-frequency-in-a-column
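
To make the idea concrete, here is a minimal sketch of the counting logic in Python rather than T-SQL, purely for illustration. It assumes the ad text has already been pulled out as two lists of strings; word_doc_counts, scam_only_words and the min_scam_ads threshold are names and choices of mine, not anything from the linked answer. It counts the number of ads containing each word, so one ad repeating a word fifty times doesn't skew the result.

```python
import re
from collections import Counter

# Crude tokenizer: runs of lowercase letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def word_doc_counts(ads):
    """Count, for each word, how many ads contain it at least once."""
    counts = Counter()
    for ad in ads:
        # set() so a word repeated within one ad is counted once.
        counts.update(set(WORD_RE.findall(ad.lower())))
    return counts


def scam_only_words(scam_ads, genuine_ads, min_scam_ads=2):
    """Words found in at least min_scam_ads scam ads and in no genuine ad."""
    scam = word_doc_counts(scam_ads)
    genuine = word_doc_counts(genuine_ads)
    return sorted(
        w for w, n in scam.items() if n >= min_scam_ads and w not in genuine
    )
```

Requiring a word to be completely absent from genuine ads is strict; comparing the two frequencies against a threshold would probably hold up better on real data.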


Kenny


 From:  Ally  
 To:  99% of gargoyles look like (MR_BASTARD)     
37670.8 In reply to 37670.6 
Hah. It'll run every 24 hours to establish a watch list of words. That watch list will then be used constantly.

 From:  Ally  
 To:  Kenny J (WINGNUTKJ)     
37670.9 In reply to 37670.7 
Aha. Looks perfect, thanks.