Sunday, March 13, 2016

Recommender Systems and q-value Potts model


Introduction



Recommendation system (RS) can be described in a general way as a class of applications used to predict a set of items with highest probability of acceptance by a given user and is based on user's responses. The RS connects users with items and it is usually described from the point of view of the performance of algorithms used in the system.
In this note I would like to raise the following question: what can we learn by applying a q-valued Potts model to datasets created by recommender system ? The Potts model describes interaction of a set of spins distributed on a lattice. In contrary to Ising model, spins in the $q$-valued Potts formulation can take $q$ possible states (values).

Analysis


Let us consider a database created by a given RS where each item is characterized by a set $\{rating,tag\}$. Both, the $rating$ and $tag$ values are determined by independent users. From the $tag$ value one can extract a set of users who tagged the item. Those users make an $y$ coordinate. The $x$ coordinate is created by the users who made $rating$ instances. Both coordinates can be used for creation of $x-y$ lattice with $z$ values corresponding to the $rating$ variable.
It is obvious that the available items and users evolve with time in some way. Therefore it is not possible to determine close neighbours of a given spin placed in a lattice point $\{x,y\}$. In order to avoid this problem I assume that all spins on the lattice can interact with each other with the same coupling constant equal 1. After such assumptions the data can be described as the long-ranged $q$-value Potts model with parameter $q$ corresponding to the number of voting options. Using mathematical notations, for a given time $t$ we have $x-y$ lattice occupied by spins $\sigma_{i} = 1,..,q$ . The lattice (graph) state is described by the energy $E$ due to the interaction between spins $\sigma_{i}$: \[ E = - \sum_{i,j} \delta_{\sigma_{i},\sigma_{j}} \] and by the order parameter (also known as a magnetisation) $M$ defined as \[ M = \frac{\left( q \max\{n_{i}\} - N \right)}{\left(q-1\right)} . \] In the formulas above: $\delta$ is the Kronecker symbol, $N$ is the total number of spins in the lattice, $n_{i} <= N$ denotes the number of of spins with orientations $\sigma_{i}$.
Each graph is described by the per-graph quantities: $e = E/N$ and $m = M/N$ . The maximum order in a graph corresponds to $m = 1$ and is equivalent to all spins with the same orientation. The perfect disorder is specific for a case when all orientations are equally numerous: $\max\{n_{i}\} = N/q$, this state corresponds to $m = 0$. As a data for the model I used the movie dataset [1]. The database has 10 voting options ($q = 10$) with possible voting values $V = \{0.5,1,1.5,2,2.5,3,3.5,4,4.5,5\}$.
In the analysis, $y$ users (determined from $tag$ values) were selected globally for entire time range available in the data (9/17/1997 - 1/5/2009 ). The $x$ users were determined for each time step which was chosen as 1 day. For each time step I calculated the energy $e$ and the order parameter $m$. Details of calculations are available in R language listed in [2]. The calculations were performed for different numbers of voting options $q$ and available values $V$. Selected values of $q$ and $V$ are following

  1. $q = 3$, $V = \{2, 3, 4 \}$
  2. $q = 4$, $V = \{2, 3, 4, 5 \}$
  3. $q = 5$, $V = \{1, 2, 3, 4, 5 \}$
  4. $q = 6$, $V = \{1, 1.5, 2, 3, 3.5, 4\}$
  5. $q = 7$, $V = \{1, 1.5, 2, 2.5, 3, 3.5, 4\}$
  6. $q = 10$, $V = \{0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5\}$
For each subset defined by $q$ I made an averaging fit of calculated time evolution of variables $e$ and $m$ using local polynomial regression fitting smoothing (loess) with parameters [3]:
loess(y ~ x, family="gaussian", span=.75, degree=1)

All results are presented on plots below (the R source code for plots is listed in [3]). The energy $e$ (normalized to the same initial value) shows a behaviour typical for the Potts model:
But the most unexpected time evolution is seen on plot of the order parameter $m$:
The relaxation of the order parameter $m$ shows a two-staged process, well distinguished for $q < 6$ . The same behaviour is also visible for $q >=6$ however the steady-state value is not reached by the system yet. The final value of the order parameter also depends on the number of voting options. The stable value of $m$ parameter is higher for models with $q < 6$ ($\approx 0.6$) than for $q >=6$ ($\approx 0.4$). But this result should be checked more carefully on other datasets and it is too early to conclude this observation.

Conclusions


After some analysis work with the dataset from the RS, let me gather 2 main points:
  1. two-staged relaxation process of the order parameter $m$ could be related to the different dynamical mechanisms. One of them leads to the exponential decay and the second one to the self-organizing phenomena. In terms of users behaviour, it suggests existence of two classes of users: randomly using the RS which could be responsible for the fast decay of the parameter $m$ and visitors of the RS with more stable opinions about voting subject. The stabilizing effects due to the non-random visitors will be specific rather for longer working rating webpages. Due to this feature I expect different performance of the same RS on a newly created rating website and on the rating site working for longer time .
  2. The time scale of the order parameter $m$ decay after which the stable stage is reached is faster for smaller $q$ parameter. For the dataset used for the analysis the decays can be grouped for models with $q < 6$ and $q >=6$. This observation implies that a better performance of RS can be achieved for a system with smaller number of voting options.

Any comments and opinions greatly appreciated, thanks.


Bogdan Lobodzinski

Data Analytics For All consulting Service


References


[1] As a data for the model I used the movie dataset created by:
  • [url=http://www.grouplens.org]Movielens from GroupLens research group[/url],
  • [url=http://www.imdb.com]IMDb website[/url],
  • [url=http://www.rottentomatoes.com] Rotten Tomatoes website[/url],
build by Ivan Cantador with the collaboration of Alejandro Bellogin and Ignacio Fernandez-Tobias, members of the Information Retrieval group at Universidad Autonoma de Madrid. The data can be downloaded from http://ir.ii.uam.es/hetrec2011//datasets.html site. [2] the source code in R for calculation of energy $e$ and order parameter $m$ for 4 voting options ($q = 4$) with available values $V = \{2,3,4,5\}$.
// Code
library(readr)
library(rgl)
library(MASS)
library(reshape2)

# Set a random seed for reproducibility
set.seed(1)

# set the q value of the model
q0 <- 4
# set the value options:
options <- c(2,3,4,5)
# set the file name with output:
Output <- "Params_q4.csv" 

cat("reading the data\n")
prefix <- "../moviedata/"
user_rated_input <- paste0(prefix,"user_ratedmovies.dat")
userrate<- read.table(user_rated_input, header=TRUE)

user_tagged_input <- paste0(prefix,"user_taggedmovies.dat")
usertag <- read.table(user_tagged_input, header=TRUE)

cat("Data manipulation\n") 

# 1. replace time by secs
userrate$timeinDays <- as.numeric(as.POSIXlt(userrate$time, format="%m/%d/%Y"))
usertag$timeinDays <- as.numeric(as.POSIXlt(usertag$time, format="%m/%d/%Y"))

# 2. rearrange data first in descending time order:
sort.userrate <- userrate[order(userrate$timeinHours) , ]
sort.usertag <- usertag[order(usertag$timeinDays) , ]

GetMatrix <- function(inputX,inputY, inputZ) {
        x <- NULL
        y <- NULL
        z <- NULL
        x <- inputX
        y <- inputY
        z <- inputZ
        
        # check duplications:
        d1<-NULL
        d1<-data.frame(x,y)
        indexes<-NULL
        indexes<-which(duplicated(d1))
        if (length(indexes)) {
                d1<-d1[!duplicated(d1),]
                z<-z[-indexes]
        }
        mydf <- NULL
        mydf <- data.frame(d1,z)
        mat <- NULL
        mat<-as.matrix(acast(mydf, x~y, value.var="z"))
        mat[is.na(mat)] <- 0
        mat
}

lowerLimit <- q0

EnergyAv <- NULL
MagAv <- NULL
items <- NULL 

for (item in unique(sort.userrate$time)) {  # <- Day's timescale                
                tx <- NULL
                tx <- subset(data.frame(sort.userrate),time == item)

                if ( min(dim(tx)) > 0 ) {
                   merged <- NULL                        
                   merged<-merge(tx,sort.usertag,by=c("movieID"))                
                   if (min(dim(merged)) > 0 ) {
                      myMat <- NULL
                      myMat <- GetMatrix(merged$userID.x,merged$userIDy,merged$rating)
                      myMat1<-NULL
                      myMat1<-myMat[!(myMat == 0)]

                      # selection of states:2,3,4,5 for q = 4 valued model:
        myMat1 <- myMat1[myMat1 %in% options]
               nrOfSpins <- length(myMat1)
        if ( (nrOfSpins > lowerLimit) ) {
     # Energy: 
                          Energy <- 0
                          # counting nr of pairs:
                          for (elem in table(myMat1)) {
                               Energy <- Energy -(choose(elem, 2))
                          }
                          EnergyAv <- c(EnergyAv,Energy/nrOfSpins)
                                        
     # Order parameter Mag:       
                          Mag <- 0
                          Mag <- (q0*max(table(myMat1))-nrOfSpins)/(q0-1)
                          MagAv <- c(MagAv, Mag/nrOfSpins)
                                 
                   # time:
                          items <- c(items, item)
                      }
                   }        
                }
}

cat("Saving results ...\n")
dfq<-NULL
dfq<-data.frame(Time=items,Mag=MagAv,En=EnergyAv)
write_csv(dfq, Output )

[3] R source code used for creation of the plots:
// Code
library(readr)
require(ggplot2)
library(rgl)
library(MASS)
library(lattice)

# read data
prefix <- "./"
q3_input <- paste0(prefix,"Params_q3.csv")
q4_input <- paste0(prefix,"Params_q4.csv")
q5_input <- paste0(prefix,"Params_q5.csv")
q6_input <- paste0(prefix,"Params_q6.csv")
q7_input <- paste0(prefix,"Params_q7.csv")
q10_input <- paste0(prefix,"Params_q10.csv")

q3 <- read.table(q3_input, sep=",", header=TRUE)
q4 <- read.table(q4_input, sep=",", header=TRUE)
q5 <- read.table(q5_input, sep=",", header=TRUE)
q6 <- read.table(q6_input, sep=",", header=TRUE)
q7 <- read.table(q7_input, sep=",", header=TRUE)
q10 <- read.table(q10_input, sep=",", header=TRUE)

# fits & plots:
elems <- list(q3,q4,q5,q6,q7,q10)

# fit of the order parameter:
myx1 <- NULL
mypred1 <- NULL

for (ind in 1:length(elems)) {
        tmp <- NULL
        tmp <- data.frame(elems[ind])

        x <- as.numeric(as.POSIXlt(tmp$Time, format="%m/%d/%Y"))
        myx1 <- c(myx1,list(x))
        
        y <- with(tmp, Mag)
        eval.length <- dim(tmp)[1]
        fit.loess2= loess(y ~ x, family="gaussian", span=.75, degree=1)
        
        pred <- predict(fit.loess2, data.frame(x=x))
        fac<-1/pred[1]
        pred <- pred*fac
        mypred1<- c(mypred1,list(pred))
}

# energy:
# fit of the energy parameter:
myx2 <- NULL
mypred2 <- NULL

for (ind in 1:length(elems)) {
        tmp <- NULL
        tmp <- data.frame(elems[ind])
        
        x <- as.numeric(as.POSIXlt(tmp$Time, format="%m/%d/%Y"))
        myx2 <- c(myx,list(x))
        
        y <- with(tmp, En)
        eval.length <- dim(tmp)[1]
        fit.loess2= loess(y ~ x, family="gaussian", span=.75, degree=1)
        
        pred <- predict(fit.loess2, data.frame(x=x))
        fac<-abs(1/pred[1])
        pred <- pred*fac
        mypred2<- c(mypred2,list(pred))
}

# plots:
xpos <- c(8.94,8.99,9.04,9.09) # x axis: log10(time) position
x01 <- log10(myx1[[1]])
origin_date <- "1970-01-01"
t1 <- substr(as.POSIXct(10^(x01[which.min(abs(x01-xpos[1]))]), origin = origin_date),1,10)
t2 <- substr(as.POSIXct(10^(x01[which.min(abs(x01-xpos[2]))]), origin = origin_date),1,10)
t3 <- substr(as.POSIXct(10^(x01[which.min(abs(x01-xpos[3]))]), origin = origin_date),1,10)
t4 <- substr(as.POSIXct(10^(x01[which.min(abs(x01-xpos[4]))]), origin = origin_date),1,10)
TimeLabels <- c(t1,t2,t3,t4)

colors <- terrain.colors(6)

plot(log10(myx1[[ind]]), mypred1[[ind]],type="l",lwd=6, col = colors[ind], xlab = "Log10(Time)", 
     ylab = "Normalized Order parameter", 
     ylim=c(0.2,1.0), xaxt = "n" )
for (i in seq(ind-1,1,-1)) {
        lines(log10(myx1[[i]]), mypred1[[i]],lwd=6, col = colors[i], xlab = "Time (s)", 
        ylab = "Counts")
}
legend("topright", inset=.05, title="q-values:",c("3","4","5","6","7","10"), fill=colors[1:6], 
horiz=TRUE)
axis(side = 1, at = xpos, labels = TimeLabels)#, tck=-.05)

# energy:
plot(log10(myx2[[1]]), mypred2[[1]],type="l",lwd=6, col = colors[1], xlab = "Log10(Time)", 
     ylab = "Normalized Energy", xaxt = "n")#, ylim=c(-,1.0) )
for (i in seq(2,ind,1)) {
        lines(log10(myx2[[i]]), mypred2[[i]],lwd=6, col = colors[i], xlab = "Time (s)", 
        ylab = "Counts")
}
legend("bottomleft", inset=.05, title="q-values:",c("3","4","5","6","7","10"), fill=colors[1:6], 
horiz=TRUE)
axis(side = 1, at = xpos, labels = TimeLabels)#, tck=-.05)

No comments:

Post a Comment