And so, it is a natural choice to use the survival method in forecasting of financial market. Especially, for prediction of future price behaviour of a given financial instance and as a consequence for generating buy-sell signals. Searching for some examples of similar analysis I have found only a single work addressing the subject [1]. From this publication I adopted a definition of a daily revenue - given below.
In this note I would like to show how to use the survival analysis method for generation of buy/sell signals for a chosen financial product. As data I use Brent Oil Future options, however the method could be adopted to other instruments as well.
The analysis is presented in R language.
As an input I use historical data of Brent Oil Future options available on the portal stooq.pl. The data contain several features like Open and Close prices, Volume, highest (High) and lowest (Low) prices and Open Interest (OpenInt). Operating on a daily basis, my goal is to predict whether for the next day we expect increase or decrease of the Close value.
Let me begin with preparation of libraries and data engineering.
set.seed(1)
library(ggplot2)
library(lubridate)
library(readr)
library(MASS)
library(survival)
library(xts)
library(rms)
# reading the data:
myDir <- "/analysis/forecasting/data_stooq.pl/"
myFile <- "sc_f_d_2015_2016.csv"
input <- NULL
input <- read.csv(paste(myDir,myFile,sep=""),header=TRUE)
names(input)
[1] "Date" "Open" "High" "Low"
[5]"Close" "Volume" "OpenInt"
In a next step I will introduce new features:- corp=Close−Open : it corresponds to the definition of the corp of the candlesticks,
input$corp <- 0 input$corp <- input$Close - input$Open
- Exponential Moving Average - EMAV. For this purpose I use functions for calculation of the
Volume Weighted Exponential Moving Average (see volume-weighted-exponential-moving-average)
VEMA.num <- function(x, volumes, ratio) { ret <- c() s <- 0 for(i in 1:length(x)) { s <- ratio * s + (1-ratio) * x[i] * volumes[i]; ret <- c(ret, s); } ret } VEMA.den <- function(volumes, ratio) { ret <- c() s <- 0 for(i in 1:length(x)) { s <- ratio * s + (1-ratio) * volumes[i]; ret <- c(ret, s); } ret } VEMA <- function(x, volumes, n = 10, wilder = F, ratio = NULL, ...) { x <- try.xts(x, error = as.matrix) if (n < 1 || n > NROW(x)) stop("Invalid 'n'") if (any(nNonNA <- n > colSums(!is.na(x)))) stop("n > number of non-NA values in column(s) ", paste(which(nNonNA),collapse = ", ")) x.na <- xts:::naCheck(x, n) if (missing(n) & !missing(ratio)) { n <- trunc(2/ratio - 1) } if (is.null(ratio)) { if (wilder) ratio <- 1/n else ratio <- 2/(n + 1) } foo <- cbind(x[,1], volumes, VEMA.num(as.numeric(x[,1]), volumes, ratio), VEMA.den(volumes, ratio)) (foo[,3] / foo[,4]) -> ma ma <- reclass(ma, x) if (!is.null(dim(ma))) { colnames(ma) <- paste(colnames(x), "VEMA", n, sep = ".") } return(ma) } x <- ts(input$Close) input$EMAV <- 0 Vol <- rep(1, length(input[,1])) input$EMAV <- VEMA(input$Close, Vol,ratio=0.95) plot(input$Close,type="l");lines(input$EMAV,col="red")
- Daily Revenue - DailyRevenue . The feature is calculated as a percentage of Close values between present and previous days [1]
DailyRevenue=(Close[k]−Close[k−1])Close[k−1]⋅100input$DailyRevenue <- 0 input$DailyRevenue[1] <- 0 for (k in (2:(length(input$Close)))) { input$DailyRevenue[k] <- (input$Close[k]-input$Close[k-1])*100/(input$Close[k-1]) }
- Censors:
In this case I consider two series of censored data corresponding to the increasing and decreasing values of DailyRevenue. The increasing or decreasing event is detected by checking whether the Daily Revenue is larger or smaller than a given threshold value. The threshold value could be used as a parameter in our model. Therefore, the value of the threshold could be different from 0 (0 is here a good initial choice). The censoring of data used in the model is following:- censors specific for increasing daily revenue: censorUp:
if the daily revenue is higher than threshold: censorUp=0 (censored point of time).
In case when the daily revenue being smaller or equal to threshold: censorUp=1 .
Threshold <- 0 input$censorUp <- 0 input[input$DailyRevenue <= Threshold,12] <- 1
- censors specific for decreasing daily revenue: censorDown. In this case the
events are opposite to the censorUp feature.
The daily revenue is smaller or equal to threshold: censorDown=1 .
the daily revenue is larger than threshold: censorDown=0 .
input$censorDown <- 0 input[input$DailyRevenue > Threshold,11] <- 1
- censors specific for increasing daily revenue: censorUp:
if the daily revenue is higher than threshold: censorUp=0 (censored point of time).
In case when the daily revenue being smaller or equal to threshold: censorUp=1 .
- Time to events:
are measured as a time distance between last local extremum of the DailyRevenue and an event defined by censoring series (censored or NOT censored).
- TimeToEventUp:
it is a time measured from a last local minimum of the DailyRevenue to the censorUp=1 events or the time distance between the last local maximum of the DailyRevenue to the censorUp=0.locmylastmin <- 1 locmylastmax <- 1 for (k in (2:(dim(input)[1]))) { tmp <- input$DailyRevenue[1:k] if (input$censorUp[k] == 0) { # find all local minima: locmymin <- which(diff(sign(diff(tmp)))==+2)+1 if (length(locmymin) == 0) {locmymin <- which.min(tmp)} locmylastmin <- locmymin[length(locmymin)] dateMin <- input[locmylastmin,1] input$TimeToEventUp[k] <- as.integer(as.Date(input[k,1])- as.Date(dateMin)) } if (input$censorUp[k] == 1) { # find all local maxima: locmymax <- which(diff(sign(diff(tmp)))==-2)+1 if (length(locmymax)==0) {locmymax <- which.max(tmp)} locmylastmax <- locmymax[length(locmymax)] dateMax <- input[locmylastmax,1] input$TimeToEventUp[k] <- as.integer(as.Date(input[k,1])- as.Date(dateMax)) } }
- TimeToEventDown:
is calculated in an opposite way to the feature TimeToEventUp. Details of determination of the TimeToEventDown feature is shown on Fig. 1 .locmylastmin <- 1 locmylastmax <- 1 for (k in (2:(dim(input)[1]))) { tmp <- input$DailyRevenue[1:k] if (input$censorDown[k] == 1) { # find all local minima: locmymin <- which(diff(sign(diff(tmp)))==+2)+1 if (length(locmymin) == 0) {locmymin <- which.min(tmp)} locmylastmin <- locmymin[length(locmymin)] dateMin <- input[locmylastmin,1] input$TimeToEventDown[k] <- as.integer(as.Date(input[k,1])- as.Date(dateMin)) } if (input$censorDown[k] == 0) { # find all local maxima: locmymax <- which(diff(sign(diff(tmp)))==-2)+1 if (length(locmymax)==0) {locmymax <- which.max(tmp)} locmylastmax <- locmymax[length(locmymax)] dateMax <- input[locmylastmax,1] input$TimeToEventDown[k] <- as.integer(as.Date(input[k,1])- as.Date(dateMax)) } }
Definition of the variable TimeToEventDown demonstrated on a part of the DailyRevenue time series. It is a sequence of time distances TcensoredDown and TeventDown, where the time TcensoredDown is a time measured from a last local maximum of the DailyRevenue to the censorDown=0 and TeventDown - is the time distance between the last local DailyRevenue minimum to the censorDown=1.
- TimeToEventUp:
In the next step I build the signal creator.
The basic part is realized in the standard way of calculating of the survival probabilities corresponding to the both series censorUp and censorDown. The prediction is calculated by the function survfit without newdata object. In such a case the predicted survival probabilities are calculated on a base of mean values of all covariates included in the formula used for creation of the Cox hazards regression model fit. In our case I use all available covariates in implemented formula, except the Volume feature which is not already a solid number.
formula = Surv(TimeToEventDown, censorDown) ~
High + Low + Close + corp + EMAV + DailyRevenue + OpenInt
The code of the function SurvivalProb:
SurvivalProb <- function(myinput,What="Down") {
train <- myinput
if (What == "Down") {
coxphCorp <- coxph(Surv(TimeToEventDown, censorDown) ~
High + Low + Close + corp + EMAV + DailyRevenue + OpenInt,
data=train, method="breslow")
}
if (What == "Up") {
coxphCorp <- coxph(Surv(TimeToEventUp, censorUp) ~
High + Low + Close + corp + EMAV + DailyRevenue + OpenInt,
data=train, method="breslow")
}
mysurffit<-survfit(coxphCorp, se.fit = F, conf.int = F)
tmp<-summary(mysurffit)
myprob <- tmp$surv[1]
myprob
}
The function predicting the signal sell/buy (SignalPredictor) calculates it as a ratio between probabilities calculated for the next days related to the increased and decreased daily revenue accordingly. The ratio is defined as:
ratio=Survival−Probability−that−daily−revenue−will−be−higher−next−daySurvival−Probability−that−daily−revenue−will−be−smaller−next−day−(1+ϵ)
where Survival−Probability−that−daily−revenue−will−be−higher−next−day (denoted as probUp in the code) and
Survival−Probability−that−daily−revenue−will−be−smaller−next−day (probDown in the code) are calculated by the function SurvivalProb.
The constant 1+ϵ corresponds to the ratioValue inside the code. The code of the function SignalPredictor :
SignalPredictor <- function(train,ratioValue) {
# Survival-Probability-that-daily-revenue-will-be-smaller-next-day
ProbDown <- SurvivalProb(train,"Down")
# Survival-Probability-that-daily-revenue-will-be-higher-next-day
ProbUp <- SurvivalProb(train,"Up")
# signal: + <- buy; - <- sell
signal <- ((ProbUp/ProbDown)-ratioValue)
signal
}
As a practical test how well the model creates buy-sell signals I run the signal predictor in a loop over
available dates in the set of historical data of Brent Oil Future.
In the code below, the Dist variable denotes the size of the training sample.
For a given row k the model use Dist rows in the data as a train sample and makes a prediction for the day: k+Dist+1 (SignalPredictor).
In the next iteration the model predicts signal for the next day k+Dist+2, using data of the previous day k+Dist+1 as a part of the train
sample. # daily step:
Dist <- 50
ratioValue <- 1.02
Bias <- 0
startInd <- Dist
results <- NULL
results$Signal <- 0
for (k in seq((startInd+Bias+1),length(input[,1]),by=1)) {
print(as.Date(input[k,1]))
train <- input[(k-Dist):k,]
train <- Optimize(train,ratioValue)
ratio <- SignalPredictor(train,ratioValue)
results$Signal <- c(results$Signal,ratio)
}
Inside the loop I use another function Optimize. The Optimize function takes the train sample and selects its last
row as test data. In the next steps the function selects such a size of the train sample that properly predicts the
sign of the DailyRevenue feature from the test sample (e.g. the signal sell/buy corresponds to the sign of the DailyRevenue variable).
The resulting size of the train sample can be smaller or equal to the initial train size. The Optimize function:Optimize <- function(inputData,ratioValue) {
L <- dim(inputData)[1]
mydist <- 1
mycheck <- -1
repeat {
mytest <- inputData[L,]
mytrain <- inputData[mydist:(L-1),]
probDown <- SurvivalProb(mytrain,"Down")
probUp <- SurvivalProb(mytrain,"Up")
ratio <- (probUp/probDown)-ratioValue
if ( (sign(ratio)*mytest$DailyRevenue >= 0) ) {mycheck <- 1}
if (mycheck == 1) {break}
if (mydist >= (L-10)) {
mycheck <- 1;mydist <- 2;break
}
mydist <- mydist + 1
}
mydist <- mydist - 1
inputData[mydist:L,]
}
The results (sell=buy signals) are shown on the next plot (Fig.2) for an initial values of parameters:
Dist <- 50
ratioValue <- 1.02
Threshold <- 0
and for the date period: 2015-03-13 - 2016-06-16
Fig. 2In the Fig. 2 the green parts correspond to the positive values of the ratio where the calculated probability for increasing Close price is larger than for decreasing one. The red part shows the opposite behaviour.
The best signal buy/sell is generated when the ratio changes the sign:
from red to green : buy, from green to red: sell .
The black curve presents the Close values, the blue line represents the values of EMAV.
By manipulating with the above parameters (Dist,ratioValue, Threshold and EMAV) one can create different strategies. Using an initial capital equal to the Close value at the initial date of the game, the total income generated by buy and sell signals for the above period and parameters is 16.47 USD what corresponds to the 41.1 % of profit from the considered period of time (2013-03-13 - 2016-06-17) .
Conclusions:
If you place capital into financial market it is always better to use several independent models for prediction of behaviour of financial instruments. If most of the methods are coming to the same conclusions, one can decide to take action: to sell or to buy the instrument(s).
The presented model could be treated as an additional source of such a knowledge. From technical point of view the script is easy to adopt to any financial data and has a lot of room for improvement. The most important part of the algorithm is the definition of censors (censorUp,censorDown) and "time to event" (TimeToEventUp,TimeToEventDown). I would appreciate any hints concerning improvement of the algorithm.
Hope, the note could be used as an introduction for survival analysis in a financial market. Comments (and links) are very welcome.
Thanks for reading !
Bogdan Lobodzinski
Data Analytics For All Consulting Service
[1] Guangliang Gao, Zhan Bu, Lingbo Liu, Jie Cao, Zhiang Wu: A survival analysis method for stock market prediction. BESC 2015: 116-122
Code for generation Fig. 2:
shift <- 47
factorSig <- 21
Sig <- sign(results$Signal)
L<-length(Sig)
Sigdf <- data.frame(x=input[(startInd+Bias):length(input[,1]),1],Sig=Sig*factorSig)
JustLine <- data.frame(xTime=input[(startInd+Bias):length(input[,1]),1],Line=0*c(1:L)+shift)
titleStr <- paste("Brent Oil Future: period from ",as.Date(input[(startInd+Bias+1),1]),
" to", as.Date(input[(length(input[,1])),1]))
ggplot(Sigdf, aes(x=as.Date(x), y=Sig+shift)) +
geom_line() +
geom_ribbon(aes(ymin=shift, ymax=ifelse(Sig+shift>shift,Sig+shift,shift)),
fill = "green", alpha = 0.5,colour = "green") +
geom_ribbon(aes(ymin=ifelse(Sig+shift < shift,Sig+shift,shift),ymax=shift),
fill = "red", alpha = 0.5,colour="red") +
geom_line(data = JustLine, aes(x=as.Date(xTime), y=Line)) +
geom_line(data = input[(startInd+Bias+1):length(input[,1]),],
aes(x=as.Date(input[(startInd+Bias+1):length(input[,1]),1]), y=Close)) +
geom_line(data = input[(startInd+Bias+1):length(input[,1]),],
aes(x=as.Date(input[(startInd+Bias+1):length(input[,1]),1]), y=EMAV, colour=4)) +
ggtitle(titleStr) +
labs(x="Date", y="Price in USD") +guides(colour=FALSE)