My third blog post introduces data mining: why it is useful and what we can do with it. I will use the examples from my third CA in Data Management and Analytics to demonstrate what association rules are in data mining.
First of all, what is data?
Data is unorganized information: a collection of facts such as numbers, words, observations, objects, etc.
When data is processed, organized, structured and presented in a meaningful way, it becomes information. Information can in turn be converted into knowledge.
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. We are living in what is often referred to as the information age, where data is collected and stored from every part of life: personal data, medical data, business data, scientific data, you name it. Everything is stored on computers. Retailers, banks, manufacturers, telecommunications providers and insurers, among others, are using data mining to discover relationships among everything from pricing, promotions and demographics to how the economy, risk, competition and social media are affecting their business models, revenues, operations and customer relationships (http://www.sas.com/en_us/insights/analytics/data-mining.html).
Association Rules in data mining:
- Association rule mining is a technique to uncover how items are associated with each other. The strength of an association is most commonly measured in three ways.
- Support: how popular an item is, i.e. the fraction of transactions in the data set in which it appears.
- Confidence: how likely item X is to appear when another item Y appears in the same transaction.
- Lift: Support(X,Y) / (Support(X) * Support(Y)).
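To make the three measures concrete, here is a minimal sketch in Python. The baskets are toy data I made up for illustration (they are not from the CA), and the function names `support`, `confidence` and `lift` are my own. Confidence is written for a rule "antecedent -> consequent", i.e. how often the consequent appears in transactions that contain the antecedent.

```python
# Toy baskets, invented purely for illustration (not from the CA data).
baskets = [
    {"burger", "chips"},
    {"burger", "chips", "cola"},
    {"burger"},
    {"chips"},
    {"cola"},
]

def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """How often `consequent` appears in transactions containing `antecedent`."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(x, y, transactions):
    """Support(X,Y) / (Support(X) * Support(Y))."""
    joint = support(set(x) | set(y), transactions)
    return joint / (support(x, transactions) * support(y, transactions))

print("support(burger)          =", support({"burger"}, baskets))
print("confidence(burger->chips) =", confidence({"burger"}, {"chips"}, baskets))
print("lift(burger, chips)      =", lift({"burger"}, {"chips"}, baskets))
```

In this toy data burgers appear in 3 of 5 baskets and burgers with chips in 2 of 5, so the lift ends up slightly above 1.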
Q1: Lift Analysis
Please calculate the following lift values for the table correlating burger and chips below:
- Lift(Burger, Chips)
- Lift(Burgers, ^Chips)
- Lift(^Burgers, Chips)
- Lift(^Burgers, ^Chips)
Please also indicate if each of your answers would suggest independent, positive correlation, or negative correlation?
| Chips | ^Chips | Total Row |
Burgers | 600 | 400 | 1000 |
^Burgers | 200 | 200 | 400 |
Total Column | 800 | 600 | 1400 |
A lift value greater than 1 means that there is a positive correlation between X and Y: the occurrence of X has a positive effect on the occurrence of Y.
A lift smaller than 1 means X and Y occur together less often than expected, which is a negative correlation. A lift equal to 1 means X and Y are independent.
Q1 answers, using
Lift(X,Y) = Support(X,Y) / (Support(X) * Support(Y))
A,
lift(Burgers, Chips)
s(B,C)=600/1400=0.4286
s(B)=1000/1400=0.7143
s(C)=800/1400=0.5714
lift(B,C)=0.4286/(0.7143*0.5714)=1.05
lift(B,C)>1, which means there is a positive correlation
B,
lift(Burgers, ^Chips)
s(B,^C)=400/1400=0.2857
s(B)=1000/1400=0.7143
s(^C)=600/1400=0.4286
lift(B,^C)=0.2857/(0.7143*0.4286)=0.933 (exactly 14/15)
lift(B,^C)<1, which means there is a negative correlation
C,
lift(^Burgers, Chips)
s(^B,C)=200/1400=0.1429
s(^B)=400/1400=0.2857
s(C)=800/1400=0.5714
lift(^B,C)=0.1429/(0.2857*0.5714)=0.875
lift(^B,C)<1, which means there is a negative correlation
D,
lift(^Burgers, ^Chips)
s(^B,^C)=200/1400=0.1429
s(^B)=400/1400=0.2857
s(^C)=600/1400=0.4286
lift(^B,^C)=0.1429/(0.2857*0.4286)=1.167 (exactly 7/6)
lift(^B,^C)>1, which means there is a positive correlation
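The four hand calculations above can be double-checked with a short script. This is only a sketch: the function names `lift_table` and `interpret` and the cell labels are my own, and it relies on the fact that lift simplifies to cell * total / (row total * column total).

```python
def lift_table(n11, n10, n01, n00):
    """Lift for each cell of a 2x2 table of counts.

    n11: both X and Y, n10: X without Y, n01: Y without X, n00: neither.
    """
    total = n11 + n10 + n01 + n00
    x, not_x = n11 + n10, n01 + n00   # row totals
    y, not_y = n11 + n01, n10 + n00   # column totals
    # (cell/total) / ((row/total) * (col/total)) == cell * total / (row * col)
    return {
        "X,Y": n11 * total / (x * y),
        "X,^Y": n10 * total / (x * not_y),
        "^X,Y": n01 * total / (not_x * y),
        "^X,^Y": n00 * total / (not_x * not_y),
    }

def interpret(value, tol=1e-9):
    """Map a lift value to the wording used in the answers."""
    if abs(value - 1) <= tol:
        return "independent"
    return "positive correlation" if value > 1 else "negative correlation"

# Burgers (X) and Chips (Y) counts from the Q1 table.
burger_chips = lift_table(600, 400, 200, 200)
for cell, value in burger_chips.items():
    print(f"{cell}: {value:.4f} ({interpret(value)})")
```

Working from the raw counts avoids the small rounding differences that creep in when the supports are rounded to four decimals first.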
Q2: Lift Analysis
Please calculate the following lift values for the table correlating shampoo and ketchup below:
- Lift(Ketchup, Shampoo)
- Lift(Ketchup, ^Shampoo)
- Lift(^Ketchup, Shampoo)
- Lift(^Ketchup, ^Shampoo)
Please also indicate if each of your answers would suggest independent, positive correlation, or negative correlation?
| Shampoo | ^Shampoo | Total Row |
Ketchup | 100 | 200 | 300 |
^Ketchup | 200 | 400 | 600 |
Total Column | 300 | 600 | 900 |
A,
lift(Ketchup, Shampoo)
s(K,S)=100/900=0.1111
s(K)=300/900=0.3333
s(S)=300/900=0.3333
lift(K,S)=0.1111/(0.3333*0.3333)=1
lift(K, S)=1 independent
B,
lift(Ketchup, ^Shampoo)
s(K, ^S)=200/900=0.2222
s(K)=300/900=0.3333
s(^S)=600/900=0.6667
lift(K, ^S)=0.2222/(0.3333*0.6667)=1
lift(K,^S)=1 independent
C,
lift(^Ketchup, Shampoo)
s(^K,S)=200/900=0.2222
s(^K)=600/900=0.6667
s(S)=300/900=0.3333
lift(^K, S)=0.2222/(0.6667*0.3333)=1
lift(^K,S)=1 independent
D,
lift(^Ketchup, ^Shampoo)
s(^K,^S)=400/900=0.4444
s(^K)=600/900=0.6667
s(^S)=600/900=0.6667
lift(^K,^S)=0.4444/(0.6667*0.6667)=1
lift(^K,^S)=1 independent
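Every lift in Q2 comes out at 1 because each observed count is exactly what independence would predict. Here is a minimal sketch of that check, assuming the expected count for a cell is row total * column total / grand total; the function name `expected_counts` is my own.

```python
def expected_counts(table):
    """Expected cell counts if rows and columns were independent.

    `table` is a list of rows of observed counts; the expected count for a
    cell is row_total * column_total / grand_total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

observed = [
    [100, 200],  # Ketchup:  Shampoo, ^Shampoo
    [200, 400],  # ^Ketchup: Shampoo, ^Shampoo
]
expected = expected_counts(observed)

# The lift of a cell is simply observed / expected.
lifts = [
    [o / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
print("expected:", expected)
print("lifts:   ", lifts)
```

Since observed and expected match in every cell, all four lifts are 1 and ketchup and shampoo are independent.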
Q3: Chi Squared Analysis
Please calculate the following chi squared values for the table correlating burger and chips below (Expected values in brackets).
- Burgers & Chips
- Burgers & Not Chips
- Chips & Not Burgers
- Not Burgers and Not Chips
For the above options, please also indicate if each of your answer would suggest independent, positive correlation, or negative correlation?
| Chips | ^Chips | Total Row |
Burgers | 900 (800) | 100 (200) | 1000 |
^Burgers | 300 (400) | 200 (100) | 500 |
Total Column | 1200 | 300 | 1500 |
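The question gives the expected values in brackets without saying where they come from. Assuming they are the usual expectations under independence (row total times column total, divided by the grand total), they can be reproduced as follows. The symbols E, R, C and N are my own notation.

```latex
E_{ij} = \frac{R_i \, C_j}{N}, \qquad N = 1500

\begin{aligned}
E_{\text{Burgers},\,\text{Chips}}       &= \frac{1000 \times 1200}{1500} = 800 \\
E_{\text{Burgers},\,\lnot\text{Chips}}  &= \frac{1000 \times 300}{1500}  = 200 \\
E_{\lnot\text{Burgers},\,\text{Chips}}  &= \frac{500 \times 1200}{1500}  = 400 \\
E_{\lnot\text{Burgers},\,\lnot\text{Chips}} &= \frac{500 \times 300}{1500} = 100
\end{aligned}
```

These match the bracketed values in the table.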
Calculating chi squared:
X^{2} = Σ (Observed - Expected)^{2} / Expected
Burgers & Chips Correlation:
X^{2}=(900-800)^{2} /800+(100-200)^{2}/200+(300-400)^{2}/400+(200-100)^{2}/100
=10000/800+10000/200+10000/400+10000/100
=12.5+50+25+100=187.5
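X^{2} is never negative, and it is 0 only when every observed value equals its expected value. Here X^{2}=187.5 is far above 0 (and far above 3.84, the 5% critical value for 1 degree of freedom), so we can say Chips and Burgers are correlated.
X^{2} on its own does not give the direction of the correlation; for that we compare the observed and expected values in each cell.
Burgers & Chips: expected 800, observed 900, so positively correlated.
Burgers & ^Chips: expected 200, observed 100, so negatively correlated.
Chips & ^Burgers: expected 400, observed 300, so negatively correlated.
^Burgers & ^Chips: expected 100, observed 200, so positively correlated.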
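The same calculation can be sketched in plain Python. The function names `chi_squared` and `direction` are my own, and the expected counts are derived from the row and column totals rather than typed in, assuming the usual independence formula.

```python
def chi_squared(observed):
    """Pearson chi-squared statistic for a table of observed counts.

    Returns the statistic together with the expected counts, which are
    derived from the row and column totals.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return statistic, expected

def direction(observed_count, expected_count):
    """Direction of the association for a single cell."""
    if observed_count > expected_count:
        return "positive"
    if observed_count < expected_count:
        return "negative"
    return "independent"

burgers_chips = [
    [900, 100],  # Burgers:  Chips, ^Chips
    [300, 200],  # ^Burgers: Chips, ^Chips
]
stat, expected = chi_squared(burgers_chips)
print("chi squared =", stat)
for obs_row, exp_row in zip(burgers_chips, expected):
    for o, e in zip(obs_row, exp_row):
        print(f"observed {o}, expected {e:.0f}: {direction(o, e)}")
```

The statistic says how strongly the table departs from independence, while the per-cell comparison gives the sign for each combination.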
Q4: Chi Squared Analysis
Please calculate the following chi squared values for the table correlating burger and sausages below (Expected values in brackets).
- Burgers & Sausages
- Burgers & Not Sausages
- Sausages & Not Burgers
- Not Burgers and Not Sausages
For the above options, please also indicate if each of your answer would suggest independent, positive correlation, or negative correlation?
| Sausages | ^Sausages | Total Row |
Burgers | 800 (800) | 200 (200) | 1000 |
^Burgers | 400 (400) | 100 (100) | 500 |
Total Column | 1200 | 300 | 1500 |
X^{2}=(800-800)^{2}/800+(200-200)^{2}/200+(400-400)^{2}/400+(100-100)^{2}/100
=0^{2}/800+ 0^{2}/200+0^{2} /400+0^{2} /100=0
X^{2}=0 because every observed value equals its expected value, so the result is independent for all four combinations.
Q5:
Under what conditions would Lift and Chi Squared analysis prove to be a poor algorithm to evaluate correlation/dependency between two events?
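- When there are many null transactions, i.e. transactions that contain neither B nor C. Both measures depend on the total number of transactions, so a large number of null transactions can distort the result.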
- It means the observed value is equal to the predicted value.
Please suggest another algorithm that could be used to rectify the flaw in Lift and Chi Squared?
- AllConf(A,B)
- Jaccard(A,B)
- Cosine(A,B)
- Kulczynski(A,B)
- MaxConf(A,B)
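To see why these five measures help, here is a small sketch that computes them next to lift, using the textbook definitions as I understand them. The function name `measures` is my own, and the 10,000 extra null transactions are a hypothetical number chosen only to make the effect visible.

```python
from math import sqrt

def measures(n_ab, n_a, n_b, n_total):
    """Lift plus five null-invariant measures from raw transaction counts.

    n_ab: transactions containing both A and B
    n_a, n_b: transactions containing A (resp. B)
    n_total: all transactions, including null transactions
    """
    conf_a_b = n_ab / n_a  # share of A-transactions that also contain B
    conf_b_a = n_ab / n_b  # share of B-transactions that also contain A
    return {
        "lift": n_ab * n_total / (n_a * n_b),
        "all_conf": n_ab / max(n_a, n_b),
        "jaccard": n_ab / (n_a + n_b - n_ab),
        "cosine": n_ab / sqrt(n_a * n_b),
        "kulczynski": 0.5 * (conf_a_b + conf_b_a),
        "max_conf": max(conf_a_b, conf_b_a),
    }

# Burger/chips counts from Q1, then the same data padded with 10,000
# hypothetical null transactions that contain neither item.
base = measures(600, 1000, 800, 1400)
padded = measures(600, 1000, 800, 1400 + 10_000)
for name in base:
    print(f"{name}: {base[name]:.4f} -> {padded[name]:.4f}")
```

Only lift uses the total number of transactions, so it is the only value that moves when null transactions are added; the other five never look at the total and stay the same.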