Sunday, March 13, 2016

Recommender Systems and q-value Potts model


Introduction



A recommender system (RS) can be described, in general terms, as a class of applications used to predict the set of items with the highest probability of acceptance by a given user, based on the users' responses. The RS connects users with items and is usually described from the point of view of the performance of the algorithms used in the system.
In this note I would like to raise the following question: what can we learn by applying a $q$-valued Potts model to datasets created by a recommender system? The Potts model describes the interaction of a set of spins distributed on a lattice. In contrast to the Ising model, spins in the $q$-valued Potts formulation can take $q$ possible states (values).

Analysis


Let us consider a database created by a given RS, where each item is characterized by a set $\{rating, tag\}$. Both the $rating$ and $tag$ values are determined by independent users. From the $tag$ value one can extract the set of users who tagged the item; those users form the $y$ coordinate. The $x$ coordinate is formed by the users who created the $rating$ instances. Both coordinates can be used to create an $x$-$y$ lattice with $z$ values corresponding to the $rating$ variable.
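
As a toy illustration of this construction (with hypothetical user IDs, not taken from the dataset), the rating matrix can be assembled with reshape2::acast; the full version used in the analysis is the GetMatrix() function in the script [2]:

# Toy sketch of the x-y lattice construction (hypothetical user IDs).
library(reshape2)

records <- data.frame(
  x = c(11, 11, 12, 13),        # users who rated (x coordinate)
  y = c(201, 202, 201, 203),    # users who tagged (y coordinate)
  z = c(4, 5, 3, 4)             # rating value used as the spin state
)

lattice <- acast(records, x ~ y, value.var = "z")  # cast into an x-y matrix
lattice[is.na(lattice)] <- 0                       # empty lattice points -> 0
lattice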
It is obvious that the available items and users evolve with time in some way. Therefore it is not possible to determine the close neighbours of a given spin placed at a lattice point $\{x,y\}$. In order to avoid this problem I assume that all spins on the lattice can interact with each other, with the same coupling constant equal to 1. Under this assumption the data can be described by the long-range $q$-valued Potts model, with the parameter $q$ corresponding to the number of voting options. In mathematical notation, for a given time $t$ we have an $x$-$y$ lattice occupied by spins $\sigma_{i} = 1,\dots,q$. The lattice (graph) state is described by the energy $E$ due to the interaction between spins $\sigma_{i}$: \[ E = - \sum_{i<j} \delta_{\sigma_{i},\sigma_{j}} \] and by the order parameter (also known as the magnetisation) $M$ defined as \[ M = \frac{q \max_{i}\{n_{i}\} - N}{q-1} . \] In the formulas above, $\delta$ is the Kronecker symbol, $N$ is the total number of spins in the lattice, and $n_{i} \le N$ denotes the number of spins with orientation $\sigma_{i}$.
Each graph is described by the per-graph quantities $e = E/N$ and $m = M/N$. The maximum order in a graph corresponds to $m = 1$ and is equivalent to all spins having the same orientation. Perfect disorder corresponds to the case when all orientations are equally numerous, $\max_{i}\{n_{i}\} = N/q$, which gives $m = 0$. As data for the model I used the movie dataset [1]. The database has 10 voting options ($q = 10$) with possible voting values $V = \{0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5\}$.
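
As a worked example (a toy spin configuration, not taken from the dataset), the per-graph quantities can be computed in a few lines of R; the counting of equal-spin pairs with choose(n_i, 2) is the same approach as in the script [2]:

# Toy example: e = E/N and m = M/N for a small spin configuration with q = 10.
q0 <- 10
spins <- c(3, 3.5, 4, 4, 4, 4.5, 5, 5, 5, 5)   # toy ratings used as spins
N <- length(spins)

counts <- table(spins)                     # n_i: number of spins in each state
E <- -sum(choose(counts, 2))               # every pair of equal spins adds -1
M <- (q0 * max(counts) - N) / (q0 - 1)     # order parameter

c(e = E / N, m = M / N)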
In the analysis, the $y$ users (determined from the $tag$ values) were selected globally for the entire time range available in the data (9/17/1997 - 1/5/2009). The $x$ users were determined for each time step, which was chosen as 1 day. For each time step I calculated the energy $e$ and the order parameter $m$. Details of the calculations are available in the R code listed in [2]. The calculations were performed for different numbers of voting options $q$ and available values $V$. The selected values of $q$ and $V$ are as follows (a short R sketch encoding these subsets is given after the list):

  1. $q = 3$, $V = \{2, 3, 4 \}$
  2. $q = 4$, $V = \{2, 3, 4, 5 \}$
  3. $q = 5$, $V = \{1, 2, 3, 4, 5 \}$
  4. $q = 6$, $V = \{1, 1.5, 2, 3, 3.5, 4\}$
  5. $q = 7$, $V = \{1, 1.5, 2, 2.5, 3, 3.5, 4\}$
  6. $q = 10$, $V = \{0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5\}$
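
For convenience, these subsets can be written down in R as a named list (this is just a bookkeeping sketch, not part of the scripts in [2] and [3], where q0 and the allowed values are set by hand for each run):

# Named list of the analysed subsets; q is the number of allowed values.
subsets <- list(
  q3  = c(2, 3, 4),
  q4  = c(2, 3, 4, 5),
  q5  = c(1, 2, 3, 4, 5),
  q6  = c(1, 1.5, 2, 3, 3.5, 4),
  q7  = c(1, 1.5, 2, 2.5, 3, 3.5, 4),
  q10 = c(0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5)
)
sapply(subsets, length)   # returns 3 4 5 6 7 10, i.e. the q values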
For each subset defined by $q$ I smoothed the calculated time evolution of the variables $e$ and $m$ using local polynomial regression fitting (loess) with the following parameters [3]:
loess(y ~ x, family="gaussian", span=.75, degree=1)

All results are presented in the plots below (the R source code for the plots is listed in [3]). The energy $e$ (normalized to the same initial value) shows a behaviour typical for the Potts model:
But the most unexpected time evolution is seen in the plot of the order parameter $m$:
The relaxation of the order parameter $m$ shows a two-stage process, well distinguished for $q < 6$. The same behaviour is also visible for $q \ge 6$; however, the steady-state value has not yet been reached by the system. The final value of the order parameter also depends on the number of voting options. The stable value of $m$ is higher for models with $q < 6$ ($\approx 0.6$) than for $q \ge 6$ ($\approx 0.4$). However, this result should be checked more carefully on other datasets, and it is too early to draw conclusions from this observation.

Conclusions


After some analysis work with the dataset from the RS, let me gather the two main points:
  1. The two-stage relaxation process of the order parameter $m$ could be related to two different dynamical mechanisms: one leads to exponential decay, the other to self-organizing phenomena. In terms of user behaviour, this suggests the existence of two classes of users: those using the RS randomly, who could be responsible for the fast decay of the parameter $m$, and visitors of the RS with more stable opinions about the voting subject. The stabilizing effect due to the non-random visitors will be specific to rating webpages that have been running for a longer time. Due to this feature, I expect different performance of the same RS on a newly created rating website and on a rating site that has been working for a longer time.
  2. The time scale of the decay of the order parameter $m$, after which the stable stage is reached, is shorter for a smaller $q$ parameter. For the dataset used in the analysis, the decays can be grouped into models with $q < 6$ and $q \ge 6$. This observation implies that better performance of an RS can be achieved for a system with a smaller number of voting options.

Any comments and opinions greatly appreciated, thanks.


Bogdan Lobodzinski

Data Analytics For All Consulting Service


References


[1] As data for the model I used the movie dataset created by:
  • MovieLens from the GroupLens research group (http://www.grouplens.org),
  • the IMDb website (http://www.imdb.com),
  • the Rotten Tomatoes website (http://www.rottentomatoes.com),
built by Ivan Cantador in collaboration with Alejandro Bellogin and Ignacio Fernandez-Tobias, members of the Information Retrieval group at Universidad Autonoma de Madrid. The data can be downloaded from http://ir.ii.uam.es/hetrec2011//datasets.html.

[2] The source code in R for the calculation of the energy $e$ and the order parameter $m$ for 4 voting options ($q = 4$) with available values $V = \{2,3,4,5\}$:
// Code
library(readr)
library(rgl)
library(MASS)
library(reshape2)

# Set a random seed for reproducibility
set.seed(1)

# set the q value of the model
q0 <- 4
# set the value options:
options <- c(2,3,4,5)
# set the file name with output:
Output <- "Params_q4.csv" 

cat("reading the data\n")
prefix <- "../moviedata/"
user_rated_input <- paste0(prefix,"user_ratedmovies.dat")
userrate<- read.table(user_rated_input, header=TRUE)

user_tagged_input <- paste0(prefix,"user_taggedmovies.dat")
usertag <- read.table(user_tagged_input, header=TRUE)

cat("Data manipulation\n") 

# 1. convert the time stamps to numeric values (seconds since epoch)
userrate$timeinDays <- as.numeric(as.POSIXlt(userrate$time, format="%m/%d/%Y"))
usertag$timeinDays <- as.numeric(as.POSIXlt(usertag$time, format="%m/%d/%Y"))

# 2. rearrange the data in ascending time order:
sort.userrate <- userrate[order(userrate$timeinDays) , ]
sort.usertag <- usertag[order(usertag$timeinDays) , ]

GetMatrix <- function(inputX,inputY, inputZ) {
        x <- NULL
        y <- NULL
        z <- NULL
        x <- inputX
        y <- inputY
        z <- inputZ
        
        # check duplications:
        d1<-NULL
        d1<-data.frame(x,y)
        indexes<-NULL
        indexes<-which(duplicated(d1))
        if (length(indexes)) {
                d1<-d1[!duplicated(d1),]
                z<-z[-indexes]
        }
        mydf <- NULL
        mydf <- data.frame(d1,z)
        mat <- NULL
        mat<-as.matrix(acast(mydf, x~y, value.var="z"))
        mat[is.na(mat)] <- 0
        mat
}

lowerLimit <- q0

EnergyAv <- NULL
MagAv <- NULL
items <- NULL 

for (item in unique(sort.userrate$time)) {  # <- Day's timescale                
                tx <- NULL
                tx <- subset(data.frame(sort.userrate),time == item)

                if ( min(dim(tx)) > 0 ) {
                   merged <- NULL                        
                   merged<-merge(tx,sort.usertag,by=c("movieID"))                
                   if (min(dim(merged)) > 0 ) {
                      myMat <- NULL
                      myMat <- GetMatrix(merged$userID.x, merged$userID.y, merged$rating)
                      myMat1<-NULL
                      myMat1<-myMat[!(myMat == 0)]

                      # keep only the rating states allowed in the q-valued
                      # model (here: 2,3,4,5 for q = 4):
                      myMat1 <- myMat1[myMat1 %in% options]
                      nrOfSpins <- length(myMat1)
                      if (nrOfSpins > lowerLimit) {
                          # Energy: count the pairs of equal spins
                          Energy <- 0
                          for (elem in table(myMat1)) {
                              Energy <- Energy - choose(elem, 2)
                          }
                          EnergyAv <- c(EnergyAv, Energy/nrOfSpins)

                          # Order parameter Mag:
                          Mag <- (q0*max(table(myMat1)) - nrOfSpins)/(q0 - 1)
                          MagAv <- c(MagAv, Mag/nrOfSpins)

                          # time:
                          items <- c(items, item)
                      }
                   }        
                }
}

cat("Saving results ...\n")
dfq<-NULL
dfq<-data.frame(Time=items,Mag=MagAv,En=EnergyAv)
write_csv(dfq, Output )

[3] R source code used for creation of the plots:
// Code
library(readr)
require(ggplot2)
library(rgl)
library(MASS)
library(lattice)

# read data
prefix <- "./"
q3_input <- paste0(prefix,"Params_q3.csv")
q4_input <- paste0(prefix,"Params_q4.csv")
q5_input <- paste0(prefix,"Params_q5.csv")
q6_input <- paste0(prefix,"Params_q6.csv")
q7_input <- paste0(prefix,"Params_q7.csv")
q10_input <- paste0(prefix,"Params_q10.csv")

q3 <- read.table(q3_input, sep=",", header=TRUE)
q4 <- read.table(q4_input, sep=",", header=TRUE)
q5 <- read.table(q5_input, sep=",", header=TRUE)
q6 <- read.table(q6_input, sep=",", header=TRUE)
q7 <- read.table(q7_input, sep=",", header=TRUE)
q10 <- read.table(q10_input, sep=",", header=TRUE)

# fits & plots:
elems <- list(q3,q4,q5,q6,q7,q10)

# fit of the order parameter:
myx1 <- NULL
mypred1 <- NULL

for (ind in 1:length(elems)) {
        tmp <- NULL
        tmp <- data.frame(elems[ind])

        x <- as.numeric(as.POSIXlt(tmp$Time, format="%m/%d/%Y"))
        myx1 <- c(myx1,list(x))
        
        y <- with(tmp, Mag)
        eval.length <- dim(tmp)[1]
        fit.loess2= loess(y ~ x, family="gaussian", span=.75, degree=1)
        
        pred <- predict(fit.loess2, data.frame(x=x))
        fac<-1/pred[1]
        pred <- pred*fac
        mypred1<- c(mypred1,list(pred))
}

# energy:
# fit of the energy parameter:
myx2 <- NULL
mypred2 <- NULL

for (ind in 1:length(elems)) {
        tmp <- NULL
        tmp <- data.frame(elems[ind])
        
        x <- as.numeric(as.POSIXlt(tmp$Time, format="%m/%d/%Y"))
        myx2 <- c(myx2,list(x))
        
        y <- with(tmp, En)
        eval.length <- dim(tmp)[1]
        fit.loess2= loess(y ~ x, family="gaussian", span=.75, degree=1)
        
        pred <- predict(fit.loess2, data.frame(x=x))
        fac<-abs(1/pred[1])
        pred <- pred*fac
        mypred2<- c(mypred2,list(pred))
}

# plots:
xpos <- c(8.94,8.99,9.04,9.09) # x axis: log10(time) position
x01 <- log10(myx1[[1]])
origin_date <- "1970-01-01"
t1 <- substr(as.POSIXct(10^(x01[which.min(abs(x01-xpos[1]))]), origin = origin_date),1,10)
t2 <- substr(as.POSIXct(10^(x01[which.min(abs(x01-xpos[2]))]), origin = origin_date),1,10)
t3 <- substr(as.POSIXct(10^(x01[which.min(abs(x01-xpos[3]))]), origin = origin_date),1,10)
t4 <- substr(as.POSIXct(10^(x01[which.min(abs(x01-xpos[4]))]), origin = origin_date),1,10)
TimeLabels <- c(t1,t2,t3,t4)

colors <- terrain.colors(6)

# order parameter: plot the last curve first to set up the axes,
# then add the remaining curves
last <- length(elems)
plot(log10(myx1[[last]]), mypred1[[last]], type="l", lwd=6, col = colors[last],
     xlab = "Log10(Time)", ylab = "Normalized Order parameter",
     ylim=c(0.2,1.0), xaxt = "n")
for (i in seq(last-1, 1, -1)) {
        lines(log10(myx1[[i]]), mypred1[[i]], lwd=6, col = colors[i])
}
legend("topright", inset=.05, title="q-values:",c("3","4","5","6","7","10"), fill=colors[1:6], 
horiz=TRUE)
axis(side = 1, at = xpos, labels = TimeLabels)#, tck=-.05)

# energy:
plot(log10(myx2[[1]]), mypred2[[1]], type="l", lwd=6, col = colors[1],
     xlab = "Log10(Time)", ylab = "Normalized Energy", xaxt = "n")
for (i in seq(2, last, 1)) {
        lines(log10(myx2[[i]]), mypred2[[i]], lwd=6, col = colors[i])
}
legend("bottomleft", inset=.05, title="q-values:",c("3","4","5","6","7","10"), fill=colors[1:6], 
horiz=TRUE)
axis(side = 1, at = xpos, labels = TimeLabels)#, tck=-.05)

How can Data Analytics help small companies?

Trying to operate in the market of micro, small and medium-sized enterprises as a data analyst offering solutions based on machine learning techniques, I have to say: it is a difficult business.

In my opinion, the difficulties arise from two main, entangled reasons:
  1. completely missing knowledge about the potential of data analytics. Most company owners do not even accept the thought that someone from outside the firm could suggest how to improve their business productivity,
  2. if someone does see the need for data analytics, a second problem appears: the cost of such a project. It can be a serious problem if it is seen as a single payment, without noticing the new business landscape opened up by the results of the data analysis.
Let us start with the second point: data analytics project costs.

First of all, the usual thinking is that data analytics requires an enormously complex hardware and software infrastructure and will therefore cost far too much. The truth is just the opposite. All the tools needed for the data analytics tasks of a small company can be set up within a few working days without spending a penny. Actually, that is not entirely true: you do need a good, powerful notebook with network access. Having a computer, what else do you need at the beginning?

  1. Linux as a free operating system, and the software:
  2. Python and/or R (in the latter case also RStudio).
All of the above is available for free and can be downloaded from the Internet as open source software. The most important part of the whole collection is your business database. Even if you collect data using commercial software specific to your activity, it is possible to convert most of the available formats into forms more accessible to R or Python. Sometimes that work might require some effort, but in general it is doable.
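
For example (a minimal sketch; the file names below are hypothetical), a CSV or Excel export from such software can be loaded into R with a few lines:

# Minimal sketch, assuming the commercial software can export CSV or Excel
# files; the file names are hypothetical.
library(readr)     # read_csv()
library(readxl)    # read_excel()

sales_csv  <- read_csv("exports/sales_2016.csv")
sales_xlsx <- read_excel("exports/sales_2016.xlsx")

str(sales_csv)     # quick check of the imported structure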
More problematic is the situation when the company data exist on paper only. What can be done in such a case?
Maybe your accountant can help somehow - for communication with the tax office some form of digital data has to be used anyway.
Another solution is to start gathering data from scratch. It can be realized using open source databases installed in a cloud. The cloud is not free, but its management costs are really negligible when one compares the cost of cloud computing and storage with other business expenses. In the case of cloud usage, we have to add a note about the security of your data, especially in terms of unauthorized external access. The solution is very simple: the data analysis does not need the real values of your database entries.
It is enough to replace, for example, the real addresses of the company's clients with short unique strings. Similar masking can be done with numbers, using a common factor for all values in a given row. The real meaning of the unique strings and numerical factors remains in the company's hands.
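
As an illustration of this idea (a minimal sketch with a hypothetical client table; the lookup table and the scaling factor stay with the company), the masking can be done in a few lines of R, here applied to one numeric column:

# Minimal sketch: mask a hypothetical client table before handing it over
# for analysis. The lookup table and the factor remain with the company,
# so the original values can always be recovered.
clients <- data.frame(
  address = c("Main St 1, Berlin", "Oak Ave 5, Hamburg", "Main St 1, Berlin"),
  revenue = c(1200, 840, 560),
  stringsAsFactors = FALSE
)

# 1. replace each distinct address by a short unique string
lookup <- data.frame(
  address = unique(clients$address),
  id = paste0("C", seq_along(unique(clients$address))),
  stringsAsFactors = FALSE
)
clients$address <- lookup$id[match(clients$address, lookup$address)]

# 2. rescale the numeric values by a common (secret) factor
secret_factor <- 0.37
clients$revenue <- clients$revenue * secret_factor

clients   # masked data that can be shared for analysis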
If needed, one can use already existing datasets available for free.

Now we get to the point where a data expert becomes a crucial person. The data scientist can merge the entrepreneur's business knowledge with mathematics and practical machine learning solutions. This can lead to a better understanding of your business experience. But the final move belongs to the company managers: data analytics will lead to nothing if the results are not put into practice.

The work of the data expert(s) usually includes:
  1. consultations, in which points like these are discussed:
    • what the company would like to develop, explain, understand or predict,
    • the availability of business data and their compatibility with the company's goal,
    • the formulation of a concluding question, i.e. the final goal of the data analysis,
  2. a pilot solution: the creation of a test model which can be validated on the existing data and tested on a new set of data,
  3. the final realization: preparation of software tools which can be used on demand in daily business operations, or on a regular basis using old and newly created data.
Between the steps listed above, a number of interactive actions take place in which all obstacles or unexpected behaviours are discussed. All this is repeated until the desired result is achieved. The expert's working time can be estimated at between 50 hours (a week) for micro projects and several hundred hours of work.

Now one can ask how much a data analytics project might cost per working hour. The hourly rate can be estimated at 30 - 100 Euros, so the final cost can vary between 1,500 and 50,000 Euros or more.

How can the efficiency of a data analytics project be measured?
A study by D. Barton and D. Court, based on 165 large publicly traded firms, shows that data-driven decision-making can increase output and productivity by 3-5% beyond traditional improvement steps (Dominic Barton and David Court, Making advanced analytics work for you, Harvard Business Review, October 2012, Volume 90, Number 10, pp. 78-83). Using this estimate and assuming the smallest increase, 3% of a company's output, an output of about 50,000 Euro is enough to fully recoup the cost of the smallest data analytics project: 3% of 50,000 Euro is 1,500 Euro, the lower end of the cost range estimated above.

I have treated the subject very generally, so some company owners will not be convinced why they should start to use data-driven decisions instead of relying on intuition well grounded in experience. The numbers (a 3-5% increase in output beyond the classical improvement actions, as estimated by the authors of the paper cited above) seem too small to serve as a solid argument in a discussion with small-firm owners.

Yet, let me add a few arguments which, I hope, can change a bit the dominant view of data analytics in small business.
  1. The most important step, the collection of data, is difficult, especially in small companies, due to the lack of knowledge and spare manpower. However, that task can easily be automated and (what is not negligible) the tool can be fully owned by the company. Collecting a sufficient amount of data takes time; it can be quite a long process in the small-firm segment. Therefore, the sooner the data collection starts, the sooner the analytical process becomes beneficial.
  2. A data-driven management system requires a more formalized and structured approach. This can be a difficult point in the transformation to the data-driven manner. But by answering simple questions (sometimes completely neglected and treated as not important at all) one can achieve the biggest goals. Just a few examples of such questions:
    Who is buying a given flavour of bread?
    How are bread purchases correlated with time?
    Which cookies do young, mature and older customers prefer? etc.
    All these answers can be provided by data analytics. And it is straightforward to notice that knowing the answers, and following them in the real world, means clearly higher revenue.
  3. What if the results of data analytics are consistent with the firm owner's intuition? In my opinion this is the wrong question: even if the data-driven development suggestions are compatible with the well-established experience of the company leader, we are talking about specific numbers: how many cookies should be delivered to the shop, how many loaves of bread should be baked, etc. Do you believe that any intuition, even the best, would provide the same numbers? Obviously not. Therefore, the initial question should be replaced by the statement: data-driven results can increase confidence in the manager's intuition.
  4. And finally, the non-BigData conclusion: new technologies in a company's business open the firm to new ideas, and tedious work becomes more interesting.
So, what do you think, dear Reader? Are you ready for innovations in your view of the company's management style?

Bogdan Lobodzinski
Data Analytics For All Consulting Service