Wednesday, June 22, 2016

Sell - buy signals for financial instruments - a multivariate survival analysis approach (in R).

Survival analysis can be understood as a forecasting method created for studying how long an analyzed subject will exist. In the survival framework we calculate the survival probability of a subject at particular points in time.
It is therefore a natural choice to apply the survival method to forecasting financial markets, especially for predicting the future price behaviour of a given financial instrument and, as a consequence, for generating buy-sell signals. Searching for examples of similar analyses, I have found only a single work addressing the subject [1]. From this publication I adopted the definition of the daily revenue given below.

In this note I would like to show how to use the survival analysis method to generate buy/sell signals for a chosen financial product. As data I use Brent Oil futures; however, the method can be adapted to other instruments as well.

The analysis is presented in R.
As input I use historical data of Brent Oil futures available on the portal stooq.pl. The data contain several features: Open and Close prices, Volume, the highest (High) and lowest (Low) prices, and Open Interest (OpenInt). Operating on a daily basis, my goal is to predict whether the Close value will increase or decrease on the next day.
Let me begin with the preparation of libraries and data engineering.
set.seed(1)
library(ggplot2)    # plotting
library(lubridate)  # date handling
library(readr)
library(MASS)
library(survival)   # Surv, coxph, survfit
library(xts)        # try.xts, reclass (used inside VEMA)
library(rms)

# reading the data:
myDir <- "/analysis/forecasting/data_stooq.pl/"
myFile <- "sc_f_d_2015_2016.csv" 
input <- read.csv(paste(myDir, myFile, sep=""), header=TRUE)

names(input)
[1] "Date"          "Open"          "High"          "Low"           
[5]"Close"         "Volume"        "OpenInt"
In the next step I introduce new features:
  1. $corp = Close - Open$ : it corresponds to the body (corp) of a candlestick,
    input$corp <- 0 
    input$corp <- input$Close - input$Open
    
  2. Exponential Moving Average - $EMAV$. For this purpose I use functions for calculation of the Volume Weighted Exponential Moving Average (see volume-weighted-exponential-moving-average)
    VEMA.num <- function(x, volumes, ratio) {
            # numerator of the volume weighted EMA:
            # exponentially smoothed sum of price * volume
            ret <- c()
            s <- 0
            for(i in 1:length(x)) { 
                s <- ratio * s + (1-ratio) * x[i] * volumes[i];
                ret <- c(ret, s); 
            }
            ret
    }
    
    VEMA.den <- function(volumes, ratio) {
            # denominator of the volume weighted EMA:
            # exponentially smoothed sum of the volumes
            ret <- c()
            s <- 0
            for(i in 1:length(volumes)) { 
                s <- ratio * s + (1-ratio) * volumes[i]; 
                ret <- c(ret, s); 
            }
            ret
    }
    
    VEMA <- function(x, volumes, n = 10, wilder = F, ratio = NULL, ...)
    {
            x <- try.xts(x, error = as.matrix)
            if (n < 1 || n > NROW(x))
                    stop("Invalid 'n'")
            if (any(nNonNA <- n > colSums(!is.na(x))))
                    stop("n > number of non-NA values in column(s) ", 
                    paste(which(nNonNA),collapse = ", "))
            x.na <- xts:::naCheck(x, n)
            if (missing(n) & !missing(ratio)) {
                    n <- trunc(2/ratio - 1)
            }    
            if (is.null(ratio)) {
                    if (wilder)
                            ratio <- 1/n
                    else ratio <- 2/(n + 1)
            }
            
            foo <- cbind(x[,1], volumes, VEMA.num(as.numeric(x[,1]), volumes, ratio), 
                         VEMA.den(volumes, ratio))
            (foo[,3] / foo[,4]) -> ma
            
            ma <- reclass(ma, x)
            if (!is.null(dim(ma))) {
                    colnames(ma) <- paste(colnames(x), "VEMA", n, sep = ".")
            }
            return(ma)
    }
    
    Vol <- rep(1, nrow(input))   # unit volumes reduce the VEMA to a plain EMA
    input$EMAV <- VEMA(input$Close, Vol, ratio=0.95)
    plot(input$Close, type="l"); lines(input$EMAV, col="red")
    
  3. Daily Revenue - $DailyRevenue$. The feature is the percentage change of the $Close$ value between the present and the previous day [1]:
    $DailyRevenue = \frac{( Close[k] - Close[k-1] )}{Close[k-1]} \cdot 100$
    input$DailyRevenue <- 0
    for (k in 2:length(input$Close)) {
         input$DailyRevenue[k] <- (input$Close[k] - input$Close[k-1]) * 100 / input$Close[k-1]
    }
    
  4. Censors:
    I consider two series of censored data, corresponding to increasing and decreasing values of $DailyRevenue$. An increasing or decreasing event is detected by checking whether the daily revenue is larger or smaller than a given threshold value. The threshold can be used as a parameter of the model, so its value may differ from $0$ ($0$ is a good initial choice here). The censoring used in the model is as follows (a short sanity check is shown after this list):
    1. censors specific for an increasing daily revenue, $censorUp$: if the daily revenue is higher than $threshold$, then $censorUp = 0$ (a censored point in time); if the daily revenue is smaller than or equal to $threshold$, then $censorUp = 1$.
      Threshold <- 0
      input$censorUp <- 0
      input$censorUp[input$DailyRevenue <= Threshold] <- 1
      
    2. censors specific for a decreasing daily revenue, $censorDown$: the events are opposite to those of the $censorUp$ feature. If the daily revenue is larger than $threshold$, then $censorDown = 1$; if the daily revenue is smaller than or equal to $threshold$, then $censorDown = 0$.
      input$censorDown <- 0
      input$censorDown[input$DailyRevenue > Threshold] <- 1
      
  5. Times to event: measured as the time distance between the last local extremum of $DailyRevenue$ and an event defined by the censoring series (censored or NOT censored).
    1. $TimeToEventUp$:
      the time measured from the last local minimum of $DailyRevenue$ to a $censorUp = 1$ event, or the time distance between the last local maximum of $DailyRevenue$ and a $censorUp = 0$ point.
      locmylastmin <- 1
      locmylastmax <- 1
      input$TimeToEventUp <- 0   # the column must exist before the loop fills it
      
      for (k in (2:(dim(input)[1]))) {
           tmp <- input$DailyRevenue[1:k]
           if (input$censorUp[k] == 0) {
                 # find all local minima:
                 locmymin <- which(diff(sign(diff(tmp)))==+2)+1
                 if (length(locmymin) == 0) {locmymin <- which.min(tmp)}
                 locmylastmin <- locmymin[length(locmymin)]
                 dateMin <- input[locmylastmin,1]
                 input$TimeToEventUp[k] <- as.integer(as.Date(input[k,1])-
                                                      as.Date(dateMin))
           }
           if (input$censorUp[k] == 1) {
                 # find all local maxima:
                 locmymax <- which(diff(sign(diff(tmp)))==-2)+1
                 if (length(locmymax)==0) {locmymax <- which.max(tmp)}
                 locmylastmax <- locmymax[length(locmymax)]
                 dateMax <- input[locmylastmax,1]
                 input$TimeToEventUp[k] <- as.integer(as.Date(input[k,1])-
                                                      as.Date(dateMax))
           }
      }
      
    2. $TimeToEventDown$:
      calculated in the opposite way to $TimeToEventUp$. Details of the determination of the $TimeToEventDown$ feature are shown in Fig. 1.
      locmylastmin <- 1
      locmylastmax <- 1
      input$TimeToEventDown <- 0   # the column must exist before the loop fills it
      
      for (k in (2:(dim(input)[1]))) {
           tmp <- input$DailyRevenue[1:k]
           if (input$censorDown[k] == 1) {
                 # find all local minima:
                 locmymin <- which(diff(sign(diff(tmp)))==+2)+1
                 if (length(locmymin) == 0) {locmymin <- which.min(tmp)}
                 locmylastmin <- locmymin[length(locmymin)]
                 dateMin <- input[locmylastmin,1]
                 input$TimeToEventDown[k] <- as.integer(as.Date(input[k,1])-
                                                        as.Date(dateMin))
           }
           if (input$censorDown[k] == 0) {
                 # find all local maxima:
                 locmymax <- which(diff(sign(diff(tmp)))==-2)+1
                 if (length(locmymax)==0) {locmymax <- which.max(tmp)}
                 locmylastmax <- locmymax[length(locmymax)]
                 dateMax <- input[locmylastmax,1]
                 input$TimeToEventDown[k] <- as.integer(as.Date(input[k,1])-
                                                        as.Date(dateMax))
           }
      }
      
      Fig. 1: Definition of the variable $TimeToEventDown$, demonstrated on a part of the $DailyRevenue$ time series. It is a sequence of time distances $T_{Down}^{censored}$ and $T_{Down}^{event}$, where $T_{Down}^{censored}$ is the time measured from the last local maximum of $DailyRevenue$ to a $censorDown = 0$ point and $T_{Down}^{event}$ is the time distance between the last local minimum of $DailyRevenue$ and a $censorDown = 1$ event.
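Before fitting anything, it is worth verifying the engineered columns. The following check is my addition (a minimal sketch assuming the data frame built above): the two censor series should be exact complements, the loop-based $DailyRevenue$ should agree with a vectorized computation, and the times to event should be non-negative.
# censorUp and censorDown are complementary by construction:
# only the (0,1) and (1,0) cells of the table should be populated
table(up = input$censorUp, down = input$censorDown)

# DailyRevenue can also be computed without the explicit loop:
stopifnot(all.equal(input$DailyRevenue[-1],
                    diff(input$Close) * 100 / head(input$Close, -1)))

# times to event are calendar-day distances, so they must be >= 0
summary(input$TimeToEventUp)
summary(input$TimeToEventDown)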

In the next step I build the signal creator.
The basic part is the standard calculation of the survival probabilities corresponding to the two series $censorUp$ and $censorDown$. The prediction is calculated by the function survfit without a newdata object; in that case the predicted survival probabilities are based on the mean values of all covariates included in the formula used to fit the Cox proportional hazards regression model. I use all available covariates in the formula, except the $Volume$ feature, which is not always a reliable number in this data set.
formula = Surv(TimeToEventDown, censorDown) ~  
               High + Low + Close + corp + EMAV + DailyRevenue + OpenInt
The code of the function $SurvivalProb$:
SurvivalProb <- function(myinput,What="Down") {        
    train <- myinput
    if (What == "Down") {
        coxphCorp <- coxph(Surv(TimeToEventDown, censorDown) ~  
                           High + Low + Close + corp + EMAV + DailyRevenue + OpenInt, 
                           data=train, method="breslow")
    }
    if (What == "Up") {
        coxphCorp <- coxph(Surv(TimeToEventUp, censorUp) ~  
                           High + Low + Close + corp + EMAV + DailyRevenue + OpenInt, 
                           data=train, method="breslow")
    }
    # survival curve evaluated at the mean values of the covariates
    # (no newdata object is passed to survfit)
    mysurffit <- survfit(coxphCorp, se.fit = F, conf.int = F)

    # survival probability at the first time point of the fitted curve
    tmp <- summary(mysurffit)
    myprob <- tmp$surv[1]

    myprob
}
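For illustration, the function can be called directly on a window of the prepared data; this usage sketch is my addition (the window of 50 rows is an arbitrary choice):
# survival probabilities for the "Down" and "Up" series on the first 50 rows
probDown <- SurvivalProb(input[1:50, ], "Down")
probUp   <- SurvivalProb(input[1:50, ], "Up")
c(up = probUp, down = probDown)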
The function predicting the sell/buy signal ($SignalPredictor$) calculates it as a ratio between the probabilities calculated for the next day for an increased and a decreased daily revenue, respectively. The ratio is defined as: $ratio = \frac{P_{Up}}{P_{Down}} - \left(1+\epsilon\right)$, where $P_{Up}$, the survival probability that the daily revenue will be higher on the next day (denoted $ProbUp$ in the code), and $P_{Down}$, the survival probability that the daily revenue will be smaller on the next day ($ProbDown$ in the code), are calculated by the function $SurvivalProb$. The constant $1+\epsilon$ corresponds to the $ratioValue$ inside the code.
The code of the function $SignalPredictor$ :
SignalPredictor <- function(train,ratioValue) { 

    # Survival-Probability-that-daily-revenue-will-be-smaller-next-day  
    ProbDown <- SurvivalProb(train,"Down")  
           
    # Survival-Probability-that-daily-revenue-will-be-higher-next-day
    ProbUp <- SurvivalProb(train,"Up")      
        
    # signal: + <- buy; - <- sell
    signal <- ((ProbUp/ProbDown)-ratioValue)
    signal 
}
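As a one-off illustration (my addition, assuming at least 50 prepared rows), the sign of the returned value is the signal itself:
# positive -> buy, negative -> sell
sig <- SignalPredictor(input[1:50, ], ratioValue = 1.02)
if (sig > 0) "buy" else "sell"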
As a practical test of how well the model creates buy-sell signals, I run the signal predictor in a loop over the available dates in the historical Brent Oil futures data. In the code below the $Dist$ variable denotes the size of the training sample. For a given row $k$ the model uses $Dist$ rows of the data as a training sample and makes a prediction for the day $k + Dist + 1$ ($SignalPredictor$). In the next iteration the model predicts the signal for the following day, $k + Dist + 2$, using the data of the previous day, $k + Dist + 1$, as part of the training sample.
# daily step:
Dist <- 50 
ratioValue <- 1.02
Bias <- 0 
startInd <- Dist
results <- NULL        
results$Signal <- 0

for (k in seq((startInd+Bias+1),length(input[,1]),by=1)) {
     print(as.Date(input[k,1]))
     train <- input[(k-Dist):k,]
     train <- Optimize(train,ratioValue)

     ratio <- SignalPredictor(train,ratioValue)        
     results$Signal <- c(results$Signal,ratio)
}
Inside the loop I use another function, $Optimize$. The $Optimize$ function takes the training sample and selects its last row as test data. It then shrinks the training sample until the resulting model correctly predicts the sign of the $DailyRevenue$ value in the test row (i.e. the sell/buy signal agrees with the sign of the $DailyRevenue$ variable). The resulting training sample can be smaller than or equal to the initial one. The $Optimize$ function:
Optimize <- function(inputData, ratioValue) {
    L <- dim(inputData)[1]
    mydist <- 1
    mycheck <- -1

    repeat {
            mytest <- inputData[L,]               # the last row serves as test data
            mytrain <- inputData[mydist:(L-1),]   # shrinking training window

            probDown <- SurvivalProb(mytrain, "Down")
            probUp <- SurvivalProb(mytrain, "Up")

            ratio <- (probUp/probDown) - ratioValue

            # accept the window if the signal sign matches the test-day revenue
            if (sign(ratio)*mytest$DailyRevenue >= 0) {mycheck <- 1}
            if (mycheck == 1) {break}
            # fall back when the window would become too small
            if (mydist >= (L-10)) {
                 mycheck <- 1; mydist <- 2; break
            }
            mydist <- mydist + 1
           }
    mydist <- mydist - 1
    inputData[max(mydist, 1):L,]   # guard against a zero start index
}
The results (sell/buy signals) are shown in the next plot (Fig. 2) for the initial parameter values:
Dist <- 50 
ratioValue <- 1.02
Threshold <- 0
and for the date period 2015-03-13 to 2016-06-16.
Fig. 2
In Fig. 2 the green parts correspond to positive values of the ratio, where the calculated probability of an increasing $Close$ price is larger than that of a decreasing one. The red parts show the opposite behaviour.
The best buy/sell signal is generated when the ratio changes sign:
from red to green: buy; from green to red: sell.
The black curve presents the $Close$ values, the blue line the values of $EMAV$.
By manipulating the above parameters ($Dist$, $ratioValue$, $Threshold$ and the $EMAV$ smoothing ratio) one can create different strategies. Using an initial capital equal to the $Close$ value at the initial date of the game, the total income generated by the buy and sell signals for the above period and parameters is $16.47$ USD, which corresponds to a profit of $41.1$ % over the considered period of time.
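How such a profit figure can be computed is sketched below. This helper is my addition, not part of the original script; it assumes trades are executed at the Close price on the day the ratio changes sign, with no transaction costs, and the function name is hypothetical.
# hypothetical P&L helper: buy on a red->green sign change,
# sell on green->red; fills at that day's Close price
BacktestSignals <- function(close, signal) {
    pos <- 0      # 1 while holding the instrument, 0 otherwise
    entry <- 0
    pnl <- 0
    for (k in 2:length(signal)) {
        if (signal[k] > 0 && signal[k-1] <= 0 && pos == 0) {   # buy signal
            pos <- 1; entry <- close[k]
        }
        if (signal[k] < 0 && signal[k-1] >= 0 && pos == 1) {   # sell signal
            pnl <- pnl + (close[k] - entry); pos <- 0
        }
    }
    pnl
}

# align the Close series with the generated signals (same offset as in Fig. 2)
closes <- input$Close[(startInd+Bias):nrow(input)]
BacktestSignals(closes, results$Signal)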

Conclusions:


If you place capital in the financial market, it is always better to use several independent models for predicting the behaviour of financial instruments. If most of the methods come to the same conclusions, one can decide to take action: to sell or to buy the instrument(s).
The presented model can be treated as an additional source of such knowledge. From a technical point of view the script is easy to adapt to any financial data and leaves a lot of room for improvement. The most important part of the algorithm is the definition of the censors ($censorUp$, $censorDown$) and the times to event ($TimeToEventUp$, $TimeToEventDown$). I would appreciate any hints concerning improvement of the algorithm.

I hope this note can serve as an introduction to survival analysis in financial markets. Comments (and links) are very welcome.


Thanks for reading !
Bogdan Lobodzinski
Data Analytics For All Consulting Service

[1] Guangliang Gao, Zhan Bu, Lingbo Liu, Jie Cao, Zhiang Wu: A survival analysis method for stock market prediction. BESC 2015: 116-122

Code for generating Fig. 2:
shift <- 47       # vertical offset placing the signal band near the price curve
factorSig <- 21   # visual scaling of the +/-1 signal
Sig <- sign(results$Signal)
L <- length(Sig)
Sigdf <- data.frame(x=input[(startInd+Bias):length(input[,1]),1], Sig=Sig*factorSig)
JustLine <- data.frame(xTime=input[(startInd+Bias):length(input[,1]),1], Line=rep(shift, L))
titleStr <- paste("Brent Oil Future: period from ",as.Date(input[(startInd+Bias+1),1]),
                  " to", as.Date(input[(length(input[,1])),1]))

ggplot(Sigdf, aes(x=as.Date(x), y=Sig+shift)) +
     geom_line() + 
     geom_ribbon(aes(ymin=shift, ymax=ifelse(Sig+shift>shift,Sig+shift,shift)),
          fill = "green", alpha = 0.5,colour = "green") +
     geom_ribbon(aes(ymin=ifelse(Sig+shift < shift,Sig+shift,shift),ymax=shift),
          fill = "red", alpha = 0.5,colour="red") +        
     geom_line(data = JustLine, aes(x=as.Date(xTime), y=Line)) + 
     geom_line(data = input[(startInd+Bias+1):length(input[,1]),], 
          aes(x=as.Date(input[(startInd+Bias+1):length(input[,1]),1]), y=Close)) +
     geom_line(data = input[(startInd+Bias+1):length(input[,1]),], 
          aes(x=as.Date(input[(startInd+Bias+1):length(input[,1]),1]), y=EMAV, colour=4)) +        
     ggtitle(titleStr) +
     labs(x="Date", y="Price in USD") +guides(colour=FALSE)
