Association Analysis


My third blog post introduces data mining: why it is useful and what we can do with it. I will use the examples from my third CA in Data Management and Analytics to demonstrate what association rules are in data mining.

First of all, what is data?

Data is a collection of unorganized facts such as numbers, words, observations and objects.

When data is processed, organized, structured and presented in a meaningful way, it becomes information. Information can be converted into knowledge.

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. We are living in what is often referred to as the information age, where data is collected and stored from every part of life: personal data, medical data, business data, scientific data, you name it. Everything is stored on computers. Retailers, banks, manufacturers, telecommunications providers and insurers, among others, are using data mining to discover relationships among everything from pricing, promotions and demographics to how the economy, risk, competition and social media are affecting their business models, revenues, operations and customer relationships (http://www.sas.com/en_us/insights/analytics/data-mining.html).


Association Rules in data mining:

  • Association rule mining is a technique to uncover how items are associated with each other. The strength of an association is most commonly measured in three ways.
  • Support: how popular an item is, i.e. the fraction of transactions in the data set that contain it.
  • Confidence: how likely item Y is to appear in a transaction that contains item X, i.e. Support(X+Y) / Support(X).
  • Lift: Support(X+Y) / (Support(X) * Support(Y)).
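To make the three measures concrete, here is a minimal Python sketch. The five transactions are invented purely for illustration (they are not from the CA), and the function names support, confidence and lift are my own choices rather than any library's API.

```python
# Toy data set, made up for illustration only: each transaction is a set of items.
transactions = [
    {"burger", "chips"},
    {"burger", "chips", "cola"},
    {"burger"},
    {"chips"},
    {"cola"},
]

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)

def confidence(transactions, x, y):
    """Confidence of the rule X -> Y: Support(X+Y) / Support(X)."""
    both = set(x) | set(y)
    return support(transactions, both) / support(transactions, x)

def lift(transactions, x, y):
    """Lift: Support(X+Y) / (Support(X) * Support(Y))."""
    both = set(x) | set(y)
    return support(transactions, both) / (
        support(transactions, x) * support(transactions, y)
    )

s_burger = support(transactions, {"burger"})
c_burger_chips = confidence(transactions, {"burger"}, {"chips"})
l_burger_chips = lift(transactions, {"burger"}, {"chips"})
print(f"support(burger)          = {s_burger:.4f}")
print(f"confidence(burger->chips) = {c_burger_chips:.4f}")
print(f"lift(burger, chips)      = {l_burger_chips:.4f}")
```

Note that the denominator of lift is the product of the two supports, so the parentheses around it matter.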

 

Q1: Lift Analysis

Please calculate the following lift values for the table correlating burger and chips below:

  • Lift(Burger, Chips)
  • Lift(Burgers, ^Chips)
  • Lift(^Burgers, Chips)
  • Lift(^Burgers, ^Chips)

Please also indicate if each of your answers would suggest independent, positive correlation, or negative correlation?

               Chips   ^Chips   Total Row
Burgers          600      400        1000
^Burgers         200      200         400
Total Column     800      600        1400

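A lift value greater than 1 means that there is a positive correlation between X and Y: the occurrence of X has a positive effect on the occurrence of Y.

A lift value smaller than 1 means X and Y occur together less often than they would if they were independent, which is a negative correlation.

A lift value equal to 1 means X and Y are independent.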

Q1 answers, using:

Lift(X, Y) = Support(X+Y) / (Support(X) * Support(Y))

A,

lift(Burger, Chips)

s(B,C) = 600/1400 =0.4285

s(B)=1000/1400=0.7142

s(C)=800/1400=0.5714

lift(B,C)= 0.4285/(0.7142*0.5714)=1.05

lift(B,C)> 1  which means there is a positive correlation

B,

lift(Burger , ^ Chips)

s(B,^C)=400/1400=0.2857

s(B)=1000/1400=0.7142

s(^C)=600/1400=0.4285

lift(B,^C)=0.2857/(0.7142*0.4285)=0.9333

lift(B,^C)<1  which means there is a negative correlation

C,

lift(^Burgers, Chips)

s(^B,C)=200/1400=0.1428

s(^B)=400/1400=0.2857

s(C)=800/1400=0.5714

lift(^B,C)=0.1428/(0.2857*0.5714)=0.875

lift(^B,C)<1 which means there is a negative correlation

D,

lift(^Burger, ^Chips)

s(^B, ^C)=200/1400=0.1428

s(^B)=400/1400=0.2857

s(^C)=600/1400=0.4285

lift(^B, ^C)=0.1428/(0.2857*0.4285)=1.1667

lift(^B, ^C)>1 which means there is a positive correlation
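The four Q1 answers can be double-checked with a few lines of Python. This is only a sketch: the helper name lift_table and its argument order are my own choices for illustration. It takes the four cell counts of the burger and chips table and works out the row and column totals itself.

```python
def lift_table(n_xy, n_x_noty, n_notx_y, n_notx_noty):
    """Return a 2x2 grid of lift values for a 2x2 contingency table.

    Rows are X / not X, columns are Y / not Y.
    """
    cells = [[n_xy, n_x_noty], [n_notx_y, n_notx_noty]]
    row_totals = [sum(row) for row in cells]
    col_totals = [sum(col) for col in zip(*cells)]
    total = sum(row_totals)
    return [
        [
            (cells[i][j] / total)
            / ((row_totals[i] / total) * (col_totals[j] / total))
            for j in range(2)
        ]
        for i in range(2)
    ]

# Counts from the burger and chips table in Q1.
q1 = lift_table(600, 400, 200, 200)
labels = [["lift(B, C)", "lift(B, ^C)"], ["lift(^B, C)", "lift(^B, ^C)"]]
for i in range(2):
    for j in range(2):
        value = q1[i][j]
        verdict = "positive" if value > 1 else "negative" if value < 1 else "independent"
        print(f"{labels[i][j]:<13} = {value:.4f}  ({verdict})")
```

Working from the raw counts avoids the small rounding differences that creep in when the supports are truncated to four decimal places first.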

Q2: Lift Analysis

 

Please calculate the following lift values for the table correlating shampoo and ketchup below:

  • Lift(Ketchup, Shampoo)
  • Lift(Ketchup, ^Shampoo)
  • Lift(^Ketchup, Shampoo)
  • Lift(^Ketchup, ^Shampoo)

Please also indicate if each of your answers would suggest independent, positive correlation, or negative correlation?

               Shampoo   ^Shampoo   Total Row
Ketchup            100        200         300
^Ketchup           200        400         600
Total Column       300        600         900

A,

lift(Ketchup, Shampoo)

s(K,S)=100/900=0.1111

s(K)=300/900=0.3333

s(S)=300/900=0.3333

lift(K,S)=0.1111/(0.3333*0.3333)=1

lift(K, S)=1 independent

B,

lift(Ketchup, ^Shampoo)

s(K, ^S)=200/900=0.2222

s(K)=300/900=0.3333

s(^S)=600/900=0.6666

lift(K, ^S)=0.2222/(0.3333*0.6666)=1

lift(K,^S)=1 independent

C,

lift(^Ketchup, Shampoo)

s(^K,S)=200/900=0.2222

s(^K)=600/900=0.6666

s(S)=300/900=0.3333

lift(^K, S)=0.2222/(0.6666*0.3333)=1

lift(^K,S)=1 independent

 

D,

lift(^Ketchup, ^Shampoo)

s(^K,^S)=400/900=0.4444

s(^K)=600/900=0.6666

s(^S)=600/900=0.6666

lift(^K,^S)=0.4444/(0.6666*0.6666)=1

lift(^K,^S)=1 independent
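Because the supports above are rounded, it is worth confirming that the lift values are exactly 1 and not just close to 1. Dividing everything through by the total N, lift simplifies to count(X,Y) * N / (count(X) * count(Y)), which can be evaluated with exact fractions. The helper name lift_exact is hypothetical, chosen only for this sketch.

```python
from fractions import Fraction

def lift_exact(n_xy, n_x, n_y, total):
    """Exact lift from raw counts: n_xy * total / (n_x * n_y)."""
    return Fraction(n_xy * total, n_x * n_y)

# Counts from the ketchup and shampoo table in Q2 (900 transactions in total).
q2 = {
    "lift(K, S)": lift_exact(100, 300, 300, 900),
    "lift(K, ^S)": lift_exact(200, 300, 600, 900),
    "lift(^K, S)": lift_exact(200, 600, 300, 900),
    "lift(^K, ^S)": lift_exact(400, 600, 600, 900),
}
for name, value in q2.items():
    print(name, "=", value)
```

All four fractions reduce to exactly 1, so ketchup and shampoo really are independent in this table.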

 

Q3: Chi Squared Analysis

Please calculate the following chi squared values for the table correlating burger and chips below (Expected values in brackets).

  • Burgers & Chips
  • Burgers & Not Chips
  • Chips & Not Burgers
  • Not Burgers and Not Chips

For the above options, please also indicate if each of your answer would suggest independent, positive correlation, or negative correlation?

               Chips        ^Chips      Total Row
Burgers        900 (800)    100 (200)        1000
^Burgers       300 (400)    200 (100)         500
Total Column   1200         300              1500

Calculating chi squared:

χ² = Σ (Observed - Expected)² / Expected

Burgers & Chips Correlation:

χ² = (900-800)²/800 + (100-200)²/200 + (300-400)²/400 + (200-100)²/100

= 10000/800 + 10000/200 + 10000/400 + 10000/100

= 12.5 + 50 + 25 + 100 = 187.5

χ² = 187.5 is far greater than 0, so Chips and Burgers are not independent; they are correlated.

χ² is never negative, so on its own it does not say whether a correlation is positive or negative. For that we compare the observed and expected values in each cell.

Burgers & Chips: observed 900, expected 800, so they are positively correlated.

Burgers & ^Chips: observed 100, expected 200, so they are negatively correlated.

Chips & ^Burgers: observed 300, expected 400, so they are negatively correlated.

^Burgers & ^Chips: observed 200, expected 100, so they are positively correlated.
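The expected values in brackets and the chi squared total for Q3 can be reproduced in Python. In this sketch I assume the usual rule that an expected cell value is row total * column total / grand total; the function name chi_squared is my own and no statistics library is used.

```python
def chi_squared(observed):
    """Return (chi squared, expected table) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return chi2, expected

# Observed counts from the burger and chips table in Q3.
observed = [[900, 100], [300, 200]]
chi2, expected = chi_squared(observed)
print("expected:", expected)
print("chi squared:", chi2)

# The sign of (observed - expected) gives the direction of each cell.
labels = [["B & C", "B & ^C"], ["^B & C", "^B & ^C"]]
directions = {}
for i in range(2):
    for j in range(2):
        diff = observed[i][j] - expected[i][j]
        directions[labels[i][j]] = (
            "positive" if diff > 0 else "negative" if diff < 0 else "independent"
        )
print(directions)
```

Running the same function on a table where every observed count equals its expected count gives a chi squared of 0.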

Q4: Chi Squared Analysis

Please calculate the following chi squared values for the table correlating burger and sausages below (Expected values in brackets).

  • Burgers & Sausages
  • Burgers & Not Sausages
  • Sausages & Not Burgers
  • Not Burgers and Not Sausages

For the above options, please also indicate if each of your answer would suggest independent, positive correlation, or negative correlation?

               Sausages     ^Sausages   Total Row
Burgers        800 (800)    200 (200)        1000
^Burgers       400 (400)    100 (100)         500
Total Column   1200         300              1500

χ² = (800-800)²/800 + (200-200)²/200 + (400-400)²/400 + (100-100)²/100

= 0²/800 + 0²/200 + 0²/400 + 0²/100 = 0

χ² = 0 because every observed value equals its expected value, so Burgers and Sausages are independent in all four cases.

Q5:

Under what conditions would Lift and Chi Squared analysis prove to be a poor algorithm to evaluate correlation/dependency between two events?

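  • Null transactions, i.e. transactions that contain neither B nor C. Both Lift and Chi Squared depend on the total number of transactions, so a large number of null transactions can distort the result.
  • It means the observed value is equal to the predicted (expected) value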

Please suggest another algorithm that could be used to rectify the flaw in Lift and Chi Squared?

  • AllConf(A,B)
  • Jaccard(A,B)
  • Cosine(A,B)
  • Kulczynski(A,B)
  • MaxConf(A,B)
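To see why these measures help, the sketch below computes them next to lift for the Q1 burger and chips counts, and then again after padding the data with 10,000 extra null transactions. The padding figure is made up purely for illustration, and the formulas are the textbook definitions as I understand them (for example Kulczynski as the average of the two confidences), so treat this as a sketch rather than a reference implementation.

```python
from math import sqrt

def measures(n_ab, n_a, n_b, total):
    """Lift plus five measures that only use n_ab, n_a and n_b."""
    return {
        "lift": n_ab * total / (n_a * n_b),
        "all_conf": n_ab / max(n_a, n_b),
        "max_conf": n_ab / min(n_a, n_b),
        "kulczynski": 0.5 * (n_ab / n_a + n_ab / n_b),
        "cosine": n_ab / sqrt(n_a * n_b),
        "jaccard": n_ab / (n_a + n_b - n_ab),
    }

# Q1 counts: 1000 burger transactions, 800 chips transactions, 600 with both.
base = measures(600, 1000, 800, 1400)
# Same counts plus 10,000 hypothetical transactions containing neither item.
padded = measures(600, 1000, 800, 1400 + 10_000)

for name in base:
    print(f"{name:<11} base={base[name]:.4f}  padded={padded[name]:.4f}")
```

Only the lift column changes between the two runs. The other five measures never use the total number of transactions, which is why null transactions cannot affect them.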

 

 

 

 

 
