Tuesday, March 27, 2012

General issues concerning Data Mining with SQL server 2005

Hi

I am just starting to learn how to use data mining with SQL server 2005. I have some general questions which I hope someone out there can help me with.

Each month I have approx. 100.000 customers who buy a product.

I want to contact 10.000 of these with a letter for cross-selling some additional products.

Currently I choose these 10.000 randomly from the 100.000 customers. I now want to use SQL server 2005 data mining to choose these 10.000 better, to receive a better response rate. How do I transfer this knowledge to the 100.000 (potential cross-selling customers)

I understand that it could be set up in SQL server like the following:

case table: customer_id, gender, age, city, etc.

nested table: customer_id, product

Is this the way it should be approached and is there a special model/algorithm for this purpose?

Regards,

Dennis

You have the right idea. I don't think you have to take a random sample of customers - you should be able to easily use them all. In general you would use an association rules algorithm for this kind of problem (cross sales based on products). You would mark the nested table as Input and Predict (i.e. PREDICT instead of PREDICT ONLY). To get the best results, you would need to tweak the MINIMUM_SUPPORT and MINIMUM_PROBABILITY parameters to ensure you get enough, but not too many rules. "Enough" generally means that you feel the dataset is represented. "Too many" is generally just a performance issue. If you set your parameters too low, you could run out of memory, potentially. Also, setting MINIMUM_PROBABILITY too low you could get meaningless rules - ones that simply don't really add value, e.g. "people who buy peanut butter also buy gas grills 0.01% of the time...."

Given that you want to include demographic information as well, you may want to try Decision Trees as well. Decision Trees is limited by default to 255 inputs and 255 outputs, but you can change this by setting the algorithm parameters. Note that there is a direct performance relationship with the inputs and outputs with trees, and since everytime you add an output (product) you add an input (product), so the relationship is n squared in products. I have seen very successful implementations with up to 3000 products using DT, and it will definately work wih more, just that the training will take a bit longer.

In either case, to get the results you would use a simple prediction join statement. Depending on what you want to do with the results, you may want them flattened.

For example

SELECT FLATTENED t.CustomerID, Predict(Products, 5)
FROM MyCrossSalesModel
PREDICTION JOIN
SHAPE (.....) // There's a DMX template in management studio that has this syntax(using the query tool will also generate correct syntax)
as t
ON
<mapping>


For trees, you may want to use Predict(Products, 5, $AdjustedProbability) which applies a heuristic weighting to the probabilities such that items that are too popular get downgraded - e.g. if everyone buys milk, there's no point in recommending it to anyone. I believe that in AR, probability and adjusted probability are the same (but I may be remembering incorrectly)

You can also set up single ton queries you could execute through an immediate web interface. You can generate the queries using the query design tools and also see examples at www.sqlserverdatamining.com

|||

Thanks for the answer - is it possible to have following setup in one model:

- case table: input: customer profile (age, gender, residence type, etc.)

- nested: input: products in period t+0

- nested: predicut / input: products in period t+1

(2 nested tables where first table are products which the customer had before Event X and second table are products which the customer had after Event X)

BTW: Can you recommended some books / pages on the net to learn how to use the SQL server 2005 for data mining?

Regards,

Dennis

|||

Yes you can - this is a strategy I recommend to many people to solve the "what products bought in April predict products in May" scenario. It's up to you to set up the nested tables appropriately - you can likely use the Named Query functionality of the Data Source View, or physically partition the data into tables, or however you choose. You would make the first nested table input and the second PREDICT ONLY.

I recommend my book Data Mining with SQL Server 2005 . Joy Mundy's (et al) book also has a good section on data mining The Microsoft Data Warehouse Toolkit: With SQL Server 2005 and the Microsoft Business Intelligence Toolset

Also in my blog post I linked to a chapter of John Hancock's (et al) book, Practical Business Intelligence with SQL Server 2005

|||

Great help mate! Appreciate it.

I got a pretty good idea how to structure the data now for running the data mining

regards,

Dennis

No comments:

Post a Comment