Thursday, July 15, 2021

Global Real Estate market: a non-expert view

Most analyses of real estate prices compare their behavior over time with other economic indicators, but this is done for a specific country or independently for a group of countries. This text proposes a comparison of real estate prices between different countries by calculating the correlation between them.

I came across some data on real estate prices ( This data contains values of 4 quantities (with short descriptions found in wikipedia and other sources):
  1. the house price index (HPI):
    measures the price changes of residential housing as a percentage change from some specific start date (starting in 1975).
  2. the house price index expressed in real terms (RHPI):
    the deflated house price index (or real house price index) is the ratio between the house price index (HPI)
  3. the personal disposable income index (PDI):
    measures the after-tax income of persons and nonprofit corporations.
  4. It is calculated by subtracting personal tax and nontax payments from personal income.
  5. the personal disposable income expressed in real terms index (RPDI) :
    the deflated PDI.

As input we have time series with specified quantity $Q$ (HPI, RHPI, PDI or RPDI) for N (N=24) countries ( 'Australia', 'Belgium', 'Canada', 'Switzerland', 'Germany', 'Denmark', 'Spain', 'Finland', 'France', 'UK', 'Ireland', 'Italy', 'Japan', 'S. Korea', 'Luxembourg', 'Netherlands', 'Norway', 'New Zealand', 'Sweden', 'US', 'S. Africa', 'Croatia', 'Israel', 'Slovenia'). In order to calculate correlations between countries I do the following calculations:
  1. for a given quantity $Q$ I normalize all data independently to the range [0,1],
  2. I determine the two-site correlation function for each timestamp $t$ \begin{equation} \label{1} Corr_{country, another\_country} \left(t \right) = Q_{country}\left(t \right) Q_{another\_country}\left(t \right) \end{equation} which is used finaly, for calculation of the global correlation for each country: \begin{equation} \label{2} C_{country}\left(t \right) = \frac{\sum_{another\_country=1}^{N} Corr_{country, another\_country} \left(t \right) }{N} \end{equation}

The dynamics of the thus calculated correlation function $C_{country}\left(t \right)$ for the quantity $Q=$RHPI and for all countries is shown in Figure 1. For the quantity RHPI, the correlations are most apparent.
The financial crisis of 2008 is very well visible in this figure (the yellow cylinder between 2007 and 2008).

A couple of observations for the time period 2007-2008:
The longest increase of the value of RHPI is seen for the US, lasting since about 1995. Other countries behave in a weakly correlated way during this period. A strong correlation between countries starts to be visible from about 2005 and quickly increases until the crash around 2007-2008. It looks as if most countries joined the global real estate market at the same time (around 2005) and at a given signal decided to crash - "ready, steady, crash!". The market seems to be too well orchestrated. I know that some people will say that this is a normal behavior because it is a global market, all markets are interconnected, etc. However, please note that it is hard to see any trace of the dot-com crisis of 2000-2003 (dot-com bubble) in this picture. Another comment on the picture concerns the number of countries participating in the crash. There are some countries excluded (Japan, S. Korea, Israel) or weakly participating in the process (Australia, Canada, Switzerland, Germany, New Zealand, Sweden, Croatia). Altogether, 13 countries out of 24 are affected by the crash.

Observations for the time period 2008-now:
The first observation is that more countries are now correlated (and not because of COVID). Still outside the correlated market are Japan, S. Korea and Croatia. Spain, Italy are not correlated (accident at work?).

the management of the real estate market is becoming more and more concise (only one player? ): currently 19 countries out of 24 (crisis 2008: 13 countries).
Question: when will this player decide to make another crash?

If anyone has knowledge of more data I would be grateful for providing it.

Friday, May 14, 2021

Semantic Search: Too many or too few matching pairs ? Dynamically determined selection threshold for matched query pairs

In my recent projects on applying Natural language processing (NLP) methods, a large part is based or contains parts based on semantic search. In a nutshell, we have certain queries (phrases or sentences) on one side and a set of other texts on the other side and our goal is to find the best matching texts to our query. Simply writing, we need to perform semantic search on our data set.

For those who are less familiar with semantic search, let me define the term as:
a kind of lexical comparison of two texts with dominant part of understanding the content of words and phrases, and relations between words or phrases in queries being compared.

While working with semantic search, I encountered a problem with defining the acceptance threshold for my findings. This problem becomes significant when the texts being compared are of significantly different lengths and/or contain significantly different degrees of content. In other words, the problem becomes serious when we deal with the so-called asymmetric semantic search

In the following, I would like to share a method which allows to dynamically determine the acceptance threshold of found pairs of matched entries. This method may determine the final solution or be a prelude to a more modified version. The project code is available on my Github account "".

Let's start describing the method.
  • The data:
    As an experiment I will use reuters data, known as reuters-21578 ( While searching for an answer to our query, we should try to be as precise as possible in formulating the questions. However, sometimes it is not possible. For the purpose of this mini-project, let's formulate our queries in a general form.
    'Behavior of the precious metals market',
    'What is the situation in metal mines',
    'Should fuel prices expect to rise ?',
    'Will food prices rise in the near future ?',
    'I am looking for information about food crops.',
    'Information on the shipbuilding industry'
  • Generation of matched pairs between the queries and the Reuter's texts:
    Our goal is to perform a semantic search. First, we need to generate matching text pairs. In the following I will use the code that is part of the sentence-transformers package "".
  • Similarities and the threshold calculation:
    Having calculated the similarity values, we can move to the main point - choosing the similarity threshold. First, let's look at the similarity plot in the test function for a fixed query ('What is the situation in metal mines ?')
    It is obvious that not all matches shown in the Figure are good (acceptable). So how to choose the threshold value of similarity ?
    The proposed method is fully heuristic and is based on the calculation of the elbow point of the curve of the similarity as a function of the matched text. If we take a look at the examined functional relationship, we can see that this curve (almost always) has an elbow point beyond which the similarity between the found texts and our query changes very slowly. To calculate the "cut off" point (elbow point) I used the KneeLocator package (""). The function KneeLocator ("") contains a sensitivity parameter S which can be used to better select our elbow point.
    The following code and its output shows the details of the calculation and its results. For details, please check "".

    This part, for each query reads all matched sentences gathered from the Reuters data together with calculated similarities. The threshold is calculated by the function KneeLocator, this part is denoteb by bold text in the code below.
    # loop over our list of queries:
    for query in result_df['query'].unique():
    	sentences_ = result_df[(result_df['query'] == query)]['sentence'].values
        x = []
        for y_ in sentences_:
        # similarities between the query and the matched Reuter's texts:
        y1 = result_df[(result_df['query'] == query)]['score'].values
        # determne elbow value:  
        x0 = list(range(len(y1)))
        kn = KneeLocator(x0, y1, S=1., curve='convex', direction='decreasing') 
        elbow_1 = kn.knee
        print ('Elbow point values:\n tekst_id=', elbow_1, \
                		'; threshold value=',y1[elbow_1])
    Resulting value (the threshold point) is presented on the next Figure :

    So, for our query ('What is the situation in metal mines ?'), we found 14 texts in the Reuter's set. Below I have copied the first 3 and last 2 texts from the set of accepted texts (the whole set is too long to present here). The reader can judge for themselves the similarity between the query and the text.
    For comparison, I have also added the text which is not accepted (15), which is not accepted by this method.
    Accepted texts:
    SIX KILLED IN SOUTH AFRICAN GOLD MINE ACCIDENT Six black miners have been killed and two injured in a rock fall three km underground at a South African gold mine, the owners said on Sunday. lt Rand Mines Properties Ltd>, one of South Africa s big six mining companies, said in a statement that the accident occurred on Saturday morning at the lt East Rand Proprietary Mines Ltd> mine at Boksburg, 25 km east of Johannesburg. A company spokesman could not elaborate on the short statement.
    NORANDA BEGINS SALVAGE OPERATIONS AT MURDOCHVILLE lt Noranda Inc> said it began salvage operations at its Murdochville, Quebec, mine, where a fire last week killed one miner and caused 10 mln dlrs in damage. Another 56 miners were trapped underground for as long as 24 hours before they were brought to safety. Noranda said the cause and full extent of the damage is still unknown but said it does know that the fire destroyed 6,000 feet of conveyor belt. Noranda said work crews have begun securing the ramp leading into the zone where the fire was located. The company said extreme heat from the fire caused severe rock degradation along several ramps and drifts in the mine. Noranda estimated that the securing operation for the zone will not be completed before the end of April. Noranda said the Quebec Health and Safety Commission, the Quebec Provincial Police and Noranda itself are each conducting an investigation into the fire. Production at the mine has been suspended until the investigations are complete. The copper mine and smelter produced 72,000 tons of copper anodes in 1986 and employs 680 people. The smelter continues to operate with available concentrate from stockpiled supplies, Noranda said. Reuter
    NORTHGATE QUEBEC GOLD WORKERS END STRIKE Northgate Exploration Ltd said hourly paid workers at its two Chibougamau, Quebec mines voted on the weekend to accept a new three year contract offer and returned to work today after a one month strike. It said the workers, represented by United Steelworkers of America, would receive a 1.21 dlr an hour pay raise over the life of the new contract and improved benefits. Northgate, which produced 23,400 ounces of gold in first quarter, said that while the strike slowed production, We are still looking forward to a very satisfactory performance. The Chibougamau mines produced 81,500 ounces of gold last year.
    NORANDA BRUNSWICK MINERS VOTE MONDAY ON CONTRACT Noranda Inc said 1,100 unionized workers at its 63 pct owned Brunswick Mining and Smelter Corp lead zinc mine in New Brunswick would start voting Monday on a tentative contract pact. Company official Andre Fortier said We are hopeful that we can settle without any kind of work interruption. Fortier added that Brunswick s estimated 500 unionized smelter workers were currently meeting about a Noranda contract proposal and would probably vote next week. The mine s contract expires July 1 and the smelter s on July 21. The Brunswick mine produced 413,800 tonnes of zinc and 206,000 tonnes of lead last year at a recovery rate of 70.5 pct zinc and 55.6 pct lead. Concentrates produced were 238,000 tonnes of zinc and 81,000 tonnes of lead.
    COMINCO lt CLT> SETS TENTATIVE TALKS ON STRIKE Cominco Ltd said it set tentative talks with three striking union locals that rejected on Saturday a three year contract offer at Cominco s Trail and Kimberley, British Columbia lead zinc operations. The locals, part of United Steelworkers of America, represent 2,600 production and maintenance workers. No date has been set for the talks, the spokesman replied to a query. The spokesman said talks were still ongoing with the two other striking locals, representing 600 office and technical workers. Production at Trail and Kimberley has been shut down since the strike started May 9. Each of the five locals has a separate contract that expired April 30, but the main issues are similar. The Trail smelter produced 240,000 long tons of zinc and 110,000 long tons of lead last year, while the Sullivan mine at Kimberley produced 2.2 mln long tons of ore last year, most for processing at Trail. Revenues from Trail s smelter totaled 356 mln Canadian dlrs in 1986.

    Not Accepted texts:
    VESSEL LOST IN PACIFIC WAS CARRYING LEAD The 37,635 deadweight tonnes bulk carrier Cumberlande, which sank in the South Pacific last Friday, was carrying a cargo which included lead as well as magnesium ore, a Lloyds Shipping Intelligence spokesman said. He was unable to confirm the tonnages involved. Trade reports circulating the London Metal Exchange said the vessel, en route to New Orleans from Newcastle, New South Wales, had been carrying 10,000 tonnes of lead concentrates. Traders said this pushed lead prices higher in early morning trading as the market is currently sensitive to any fundamental news due to its finely balanced supply demand position and low stocks. Trade sources said that 10,000 tonnes of lead concentrates could convert to around 5,000 tonnes of metal, although this depended on the quality of the concentrates. A loss of this size could cause a gap in the supply pipeline, particularly in North America, they noted. Supplies there have been very tight this year and there is a strike at one major producer, Cominco, and labour talks currently being held at another, Noranda subsidiary Brunswick Mining and Smelting Ltd.
    LTV lt QLTV> TO NEGOTIATE WITH STEELWORKERS LTV Corp s LTV Steel Corp said it agreed to resume negotiations with the United Steelworkers of America at the local plant levels, to discuss those provisions of its proposal that require local implementation. The local steelworker union narrowly rejected a tentative agreement with the company on May 14, it said. LTV also said it agreed to reopen its offer contained in the tentative agreement reached with the union s negotiating committee as part of a plan to resolve problems through local discussions.

    As you can see, the unaccepted texts are not directly related to mining, which is what we are asking about in our query.

Thanks for reading, please feel free to comment and ask questions if anything is unclear.

Wednesday, April 7, 2021

Weird aspects of ARIMA - how to increase the accuracy of predictions by exogenous data

ARIMA models are commonly used to predict time series. Most importantly, if the ARIMA model is properly chosen, the prediction error is often so small that it is a very difficult task to find better predictions using more sophisticated methods. Since this is not a note about an introduction to ARIMA models I replace the typical introduction to the models with this link which describes the method much better: Forecasting: Principles and Practice" (2nd ed) by Rob J Hyndman and George Athanasopoulos (
One of the variants of ARIMA models is a version using exogeneous data, to which this note is dedicated.
It is not widely known that this version of ARIMA models is strongly dependent on the factor by which we multiply the exogeneous data. Generalizing, we can say that the larger the factor, the smaller the prediction error of the model.

To begin with, let us start with a heuristic proof that the factor determining the ratio between the exogeneous data and the target, can influence the prediction error of the algorithm.

For simplicity, let us omit the part of the time series description which in ARIMA models is responsible for the differentiable part, i.e. we assume that the parameter $d=0$. In other words, we will use the formulations of the ARMAX model.
Then, a given time series X, in ARMAX, can be expressed generally as: \begin{equation} \label{eq1} X_{t}= c + \epsilon_{t} + AR_{t} + MA_{t} + exog_{t} \end{equation} where $c$: constant, $\epsilon_{t}$: white noise, $AR_{t}$: Autoregression part, $MA_{t}$: Moving average part, $exog_{t}$: exogeneous variable(s).
Now, let's define the exogeneous part $exog_{t}$ as: \begin{equation} \label{eq2} exog_{t} = X_{t} \alpha + \gamma_{t} \end{equation} where $\alpha$ is some proportionality factor, $\gamma_{t}$ - white nose.
Therefore, introducing eq. \ref{eq2} to eq. \ref{eq1} we will get: \begin{equation} \label{eq3} X_{t} = c + \epsilon_{t} + AR_{t} + MA_{t} + X_{t} \alpha + \gamma_{t} \end{equation} And after rearranging some terms \begin{equation} \label{eq4} X_{t}\left(1-\alpha\right) = c + \left(\epsilon_{t} + \gamma_{t}\right) + AR_{t} + MA_{t} \end{equation} But the part $AR_{t}$ can be written as $AR_{t} = \sum_{i}^{p} \phi_{i} X_{t-i}$ and similarly the $MA_{t}$ component $MA_{t} = \langle X \rangle + \beta_{t} + \sum_{i}^{q} \theta_{i} \beta_{t-i}$ with: $\beta_{n}$ as a white noise, $ \langle X \rangle $ - the expectation value of the $X$, $\theta_{i}$ - parameters of the model.
With the above in mind, the eq \ref{eq4} becomes : \begin{equation} \label{eq5} X_{t} = \frac{c + \epsilon_{t} + \gamma_{t} + \alpha \langle X \rangle}{1-\alpha} + \sum_{i} \hat{\phi}_{i} X_{t-i} + \langle X \rangle + \hat{\beta}_{t} + \sum_{i} \theta_{i} \hat{\beta}_{t-i} \end{equation} where I introduced notation: $\hat{\beta}_{t-i} = \frac{\beta_{t-i}}{1-\alpha}$, $\hat{\phi}_{i} = \frac{\phi_{i}}{1-\alpha}$ and $\hat{\theta}_{i} = \frac{\theta_{i}}{1-\alpha}$.
Using definitiona of the $AR_{t}$ and $MA_{t}$ we can rewrite eq \ref{eq5} into the final form: \begin{equation} \label{eq6} X_{t} = \frac{c + \epsilon_{t} + \gamma_{t} + \alpha \langle X \rangle}{1-\alpha} + \hat{AR}_{t} + \hat{MA}_{t} \end{equation} where components $\hat{AR}_{t}$ and $\hat{MA}_{t}$ correspond to the definitions $AR_{t}$ and $MA_{t}$ but with $\hat{\phi}$, $\hat{\theta}$ and $\hat{\beta}$ coefficients.
The first component of Equation \ref{eq6} is the most interesting !.
In the case of ARIMA without exogeneous data, the forecasting error is determined by the $\epsilon$ . Now, this error is replaced by the expression \begin{equation} \label{eq7} \epsilon_{t} \longrightarrow \frac{\epsilon_{t} + \gamma_{t}}{1-\alpha} \end{equation}
This is how we reached our final conclusions:
  1. use exogenous variables that are highly correlated ($\alpha \approx 1.$) or anti-correlated ($\alpha \approx -1.$) with the target is equivalent to a model without exogenous variables (but with changed model parameters).
  2. the use of exog data, scaled by the ratio $\frac{exogenous data}{target}$ allows for a significant modification of the final prediction error of the model. The error is now scaled by the factor $\frac{1}{1-\alpha}$. So
    1. we have the error explosions for $\left|\alpha\right| \approx 1$,
    2. for $\left|\alpha\right| < 1 $: error with exog data $>$ error without exogenous variable,
    3. for $\left|\alpha\right| > 1 $: error with exog data $<$ error without exogenous variable.
The above derivation was made for the simplified case without the differential term (i.e., when $d,D=0$). I leave it to the readers to generalize the above formalism into a model with a differential parts.

The next step will be a verification of these hypotheses in practice. A practical example of the implementation of the discussed hypothesis is available in the form of a jupyter-notebook script:
Here, we just present the dependence of the MAPE error as a function of the exog data factor (in the log10 scale). As you can see, by using an appropriate value of the factor we are able to reduce the prediction error by almost half ! The error reaches its minimum value at a factor value of 4000000. Then the error increases. The increasing part after the minimum is reached is not directly visible in the theoretical proof. I will try to explain it in the next part of the article.
All other details are available in the code:

Thank you for the reading !

Wednesday, June 22, 2016

Sell - buy signals for financial instruments - a multivariate survival analysis approach (in R).

Survival analysis could be understand as a kind of a forecasting method created for studying how long an analyzed subject will exist. In a survival framework we calculate survival probability of a thing at particular points of time.
And so, it is a natural choice to use the survival method in forecasting of financial market. Especially, for prediction of future price behaviour of a given financial instance and as a consequence for generating buy-sell signals. Searching for some examples of similar analysis I have found only a single work addressing the subject [1]. From this publication I adopted a definition of a daily revenue - given below.

In this note I would like to show how to use the survival analysis method for generation of buy/sell signals for a chosen financial product. As data I use Brent Oil Future options, however the method could be adopted to other instruments as well.

The analysis is presented in R language.
As an input I use historical data of Brent Oil Future options available on the portal The data contain several features like Open and Close prices, Volume, highest (High) and lowest (Low) prices and Open Interest (OpenInt). Operating on a daily basis, my goal is to predict whether for the next day we expect increase or decrease of the Close value.
Let me begin with preparation of libraries and data engineering.

# reading the data:
myDir <- "/analysis/forecasting/"
myFile <- "sc_f_d_2015_2016.csv" 
input <- NULL
input <- read.csv(paste(myDir,myFile,sep=""),header=TRUE)

[1] "Date"          "Open"          "High"          "Low"           
[5]"Close"         "Volume"        "OpenInt"
In a next step I will introduce new features:
  1. $corp = Close - Open$ : it corresponds to the definition of the corp of the candlesticks,
    input$corp <- 0 
    input$corp <- input$Close - input$Open
  2. Exponential Moving Average - $EMAV$. For this purpose I use functions for calculation of the Volume Weighted Exponential Moving Average (see volume-weighted-exponential-moving-average)
    VEMA.num <- function(x, volumes, ratio) {
            ret <- c()
            s <- 0
            for(i in 1:length(x)) { 
                s <- ratio * s + (1-ratio) * x[i] * volumes[i];
                ret <- c(ret, s); 
    VEMA.den <- function(volumes, ratio) {
            ret <- c()
            s <- 0
            for(i in 1:length(x)) { 
                s <- ratio * s + (1-ratio) * volumes[i]; 
                ret <- c(ret, s); 
    VEMA <- function(x, volumes, n = 10, wilder = F, ratio = NULL, ...)
            x <- try.xts(x, error = as.matrix)
            if (n < 1 || n > NROW(x))
                    stop("Invalid 'n'")
            if (any(nNonNA <- n > colSums(!
                    stop("n > number of non-NA values in column(s) ", 
                    paste(which(nNonNA),collapse = ", "))
   <- xts:::naCheck(x, n)
            if (missing(n) & !missing(ratio)) {
                    n <- trunc(2/ratio - 1)
            if (is.null(ratio)) {
                    if (wilder)
                            ratio <- 1/n
                    else ratio <- 2/(n + 1)
            foo <- cbind(x[,1], volumes, VEMA.num(as.numeric(x[,1]), volumes, ratio), 
                         VEMA.den(volumes, ratio))
            (foo[,3] / foo[,4]) -> ma
            ma <- reclass(ma, x)
            if (!is.null(dim(ma))) {
                    colnames(ma) <- paste(colnames(x), "VEMA", n, sep = ".")
    x <- ts(input$Close)
    input$EMAV <- 0
    Vol <- rep(1, length(input[,1]))   
    input$EMAV <- VEMA(input$Close, Vol,ratio=0.95)
  3. Daily Revenue - $DailyRevenue$ . The feature is calculated as a percentage of $Close$ values between present and previous days [1]
    $DailyRevenue = \frac{( Close[k] - Close[k-1] )}{Close[k-1]} \cdot 100$
    input$DailyRevenue <- 0
    input$DailyRevenue[1] <- 0
    for (k in (2:(length(input$Close)))) {
         input$DailyRevenue[k] <- (input$Close[k]-input$Close[k-1])*100/(input$Close[k-1])
  4. Censors:
    In this case I consider two series of censored data corresponding to the increasing and decreasing values of $DailyRevenue$. The increasing or decreasing event is detected by checking whether the Daily Revenue is larger or smaller than a given threshold value. The threshold value could be used as a parameter in our model. Therefore, the value of the $threshold$ could be different from $0$ ($0$ is here a good initial choice). The censoring of data used in the model is following:
    1. censors specific for increasing daily revenue: $censorUp$: if the daily revenue is higher than $threshold$: $censorUp = 0$ (censored point of time). In case when the daily revenue being smaller or equal to $threshold$: $censorUp = 1$ .
      Threshold <- 0
      input$censorUp <- 0
      input[input$DailyRevenue <= Threshold,12] <- 1
    2. censors specific for decreasing daily revenue: $censorDown$. In this case the events are opposite to the $censorUp$ feature. The daily revenue is smaller or equal to $threshold$: $censorDown = 1$ . the daily revenue is larger than $threshold$: $censorDown = 0$ .
      input$censorDown <- 0
      input[input$DailyRevenue > Threshold,11] <- 1
  5. Time to events: are measured as a time distance between last local extremum of the $DailyRevenue$ and an event defined by censoring series (censored or NOT censored).
    1. $TimeToEventUp$:
      it is a time measured from a last local minimum of the $DailyRevenue$ to the $censorUp = 1$ events or the time distance between the last local maximum of the $DailyRevenue$ to the $censorUp = 0$.
      locmylastmin <- 1
      locmylastmax <- 1
      for (k in (2:(dim(input)[1]))) {
           tmp <- input$DailyRevenue[1:k]
           if (input$censorUp[k] == 0) {
                 # find all local minima:
                 locmymin <- which(diff(sign(diff(tmp)))==+2)+1
                 if (length(locmymin) == 0) {locmymin <- which.min(tmp)}
                 locmylastmin <- locmymin[length(locmymin)]
                 dateMin <- input[locmylastmin,1]
                 input$TimeToEventUp[k] <- as.integer(as.Date(input[k,1])-
           if (input$censorUp[k] == 1) {
                 # find all local maxima:
                 locmymax <- which(diff(sign(diff(tmp)))==-2)+1
                 if (length(locmymax)==0) {locmymax <- which.max(tmp)}
                 locmylastmax <- locmymax[length(locmymax)]
                 dateMax <- input[locmylastmax,1]
                 input$TimeToEventUp[k] <- as.integer(as.Date(input[k,1])-
    2. $TimeToEventDown$:
      is calculated in an opposite way to the feature $TimeToEventUp$. Details of determination of the $TimeToEventDown$ feature is shown on Fig. 1 .
      locmylastmin <- 1
      locmylastmax <- 1
      for (k in (2:(dim(input)[1]))) {
           tmp <- input$DailyRevenue[1:k]
           if (input$censorDown[k] == 1) {
                 # find all local minima:
                 locmymin <- which(diff(sign(diff(tmp)))==+2)+1
                 if (length(locmymin) == 0) {locmymin <- which.min(tmp)}
                 locmylastmin <- locmymin[length(locmymin)]
                 dateMin <- input[locmylastmin,1]
                 input$TimeToEventDown[k] <- as.integer(as.Date(input[k,1])-
           if (input$censorDown[k] == 0) {
                 # find all local maxima:
                 locmymax <- which(diff(sign(diff(tmp)))==-2)+1
                 if (length(locmymax)==0) {locmymax <- which.max(tmp)}
                 locmylastmax <- locmymax[length(locmymax)]
                 dateMax <- input[locmylastmax,1]
                 input$TimeToEventDown[k] <- as.integer(as.Date(input[k,1])-
      Definition of the variable $TimeToEventDown$ demonstrated on a part of the $DailyRevenue$ time series. It is a sequence of time distances $T_{Down}^{censored}$ and $T_{Down}^{event}$, where the time $T_{Down}^{censored}$ is a time measured from a last local maximum of the $DailyRevenue$ to the $censorDown = 0$ and $T_{Down}^{event}$ - is the time distance between the last local $DailyRevenue$ minimum to the $censorDown = 1$.

In the next step I build the signal creator.
The basic part is realized in the standard way of calculating of the survival probabilities corresponding to the both series $censorUp$ and $censorDown$. The prediction is calculated by the function survfit without newdata object. In such a case the predicted survival probabilities are calculated on a base of mean values of all covariates included in the formula used for creation of the Cox hazards regression model fit. In our case I use all available covariates in implemented formula, except the $Volume$ feature which is not already a solid number.
formula = Surv(TimeToEventDown, censorDown) ~  
               High + Low + Close + corp + EMAV + DailyRevenue + OpenInt
The code of the function $SurvivalProb$:
SurvivalProb <- function(myinput,What="Down") {        
    train <- myinput
    if (What == "Down") {
        coxphCorp <- coxph(Surv(TimeToEventDown, censorDown) ~  
                           High + Low + Close + corp + EMAV + DailyRevenue + OpenInt, 
                           data=train, method="breslow")
    if (What == "Up") {
        coxphCorp <- coxph(Surv(TimeToEventUp, censorUp) ~  
                           High + Low + Close + corp + EMAV + DailyRevenue + OpenInt, 
                           data=train, method="breslow")
    mysurffit<-survfit(coxphCorp, = F, = F)
    myprob <- tmp$surv[1]
The function predicting the signal sell/buy ($SignalPredictor$) calculates it as a ratio between probabilities calculated for the next days related to the increased and decreased daily revenue accordingly. The ratio is defined as: $ratio = \frac{Survival-Probability-that-daily-revenue-will-be-higher-next-day}{Survival-Probability-that-daily-revenue-will-be-smaller-next-day} - \left(1+\epsilon\right)$ where $Survival-Probability-that-daily-revenue-will-be-higher-next-day$ (denoted as $probUp$ in the code) and $Survival-Probability-that-daily-revenue-will-be-smaller-next-day$ ($probDown$ in the code) are calculated by the function $SurvivalProb$. The constant $1+\epsilon$ corresponds to the $ratioValue$ inside the code.
The code of the function $SignalPredictor$ :
SignalPredictor <- function(train,ratioValue) { 

    # Survival-Probability-that-daily-revenue-will-be-smaller-next-day  
    ProbDown <- SurvivalProb(train,"Down")  
    # Survival-Probability-that-daily-revenue-will-be-higher-next-day
    ProbUp <- SurvivalProb(train,"Up")      
    # signal: + <- buy; - <- sell
    signal <- ((ProbUp/ProbDown)-ratioValue)
As a practical test how well the model creates buy-sell signals I run the signal predictor in a loop over available dates in the set of historical data of Brent Oil Future. In the code below, the $Dist$ variable denotes the size of the training sample. For a given row $k$ the model use $Dist$ rows in the data as a train sample and makes a prediction for the day: $k + Dist + 1$ ($SignalPredictor$). In the next iteration the model predicts signal for the next day $k + Dist + 2$, using data of the previous day $k + Dist + 1$ as a part of the train sample.
# daily step:
Dist <- 50 
ratioValue <- 1.02
Bias <- 0 
startInd <- Dist
results <- NULL        
results$Signal <- 0

for (k in seq((startInd+Bias+1),length(input[,1]),by=1)) {
     train <- input[(k-Dist):k,]
     train <- Optimize(train,ratioValue)

     ratio <- SignalPredictor(train,ratioValue)        
     results$Signal <- c(results$Signal,ratio)
Inside the loop I use another function $Optimize$. The $Optimize$ function takes the train sample and selects its last row as test data. In the next steps the function selects such a size of the train sample that properly predicts the sign of the $DailyRevenue$ feature from the test sample (e.g. the signal sell/buy corresponds to the sign of the $DailyRevenue$ variable). The resulting size of the train sample can be smaller or equal to the initial train size. The $Optimize$ function:
Optimize <- function(inputData,ratioValue) {
    L <- dim(inputData)[1]
    mydist <- 1
    mycheck <- -1
    repeat {
            mytest <- inputData[L,]
            mytrain <- inputData[mydist:(L-1),]
            probDown <- SurvivalProb(mytrain,"Down")
            probUp <- SurvivalProb(mytrain,"Up")
            ratio <- (probUp/probDown)-ratioValue

            if ( (sign(ratio)*mytest$DailyRevenue >= 0) ) {mycheck <- 1}    
            if (mycheck == 1) {break} 
            if (mydist >= (L-10)) {
                 mycheck <- 1;mydist <- 2;break
            mydist <- mydist + 1
    mydist <- mydist - 1   
The results (sell=buy signals) are shown on the next plot (Fig.2) for an initial values of parameters:
Dist <- 50 
ratioValue <- 1.02
Threshold <- 0
and for the date period: 2015-03-13 - 2016-06-16
Fig. 2
In the Fig. 2 the green parts correspond to the positive values of the ratio where the calculated probability for increasing $Close$ price is larger than for decreasing one. The red part shows the opposite behaviour.
The best signal buy/sell is generated when the ratio changes the sign:
from red to green : buy, from green to red: sell .
The black curve presents the $Close$ values, the blue line represents the values of $EMAV$.
By manipulating with the above parameters ($Dist$,$ratioValue$, $Threshold$ and $EMAV$) one can create different strategies. Using an initial capital equal to the $Close$ value at the initial date of the game, the total income generated by buy and sell signals for the above period and parameters is $16.47$ USD what corresponds to the $41.1$ % of profit from the considered period of time (2013-03-13 - 2016-06-17) .


If you place capital into financial market it is always better to use several independent models for prediction of behaviour of financial instruments. If most of the methods are coming to the same conclusions, one can decide to take action: to sell or to buy the instrument(s).
The presented model could be treated as an additional source of such a knowledge. From technical point of view the script is easy to adopt to any financial data and has a lot of room for improvement. The most important part of the algorithm is the definition of censors ($censorUp$,$censorDown$) and "time to event" ($TimeToEventUp$,$TimeToEventDown$). I would appreciate any hints concerning improvement of the algorithm.

Hope, the note could be used as an introduction for survival analysis in a financial market. Comments (and links) are very welcome.

Thanks for reading !
Bogdan Lobodzinski
Data Analytics For All Consulting Service

[1] Guangliang Gao, Zhan Bu, Lingbo Liu, Jie Cao, Zhiang Wu: A survival analysis method for stock market prediction. BESC 2015: 116-122

Code for generation Fig. 2:
shift <- 47
factorSig <- 21
Sig <- sign(results$Signal)
Sigdf <- data.frame(x=input[(startInd+Bias):length(input[,1]),1],Sig=Sig*factorSig) 
JustLine <- data.frame(xTime=input[(startInd+Bias):length(input[,1]),1],Line=0*c(1:L)+shift) 
titleStr <- paste("Brent Oil Future: period from ",as.Date(input[(startInd+Bias+1),1]),
                  " to", as.Date(input[(length(input[,1])),1]))

ggplot(Sigdf, aes(x=as.Date(x), y=Sig+shift)) +
     geom_line() + 
     geom_ribbon(aes(ymin=shift, ymax=ifelse(Sig+shift>shift,Sig+shift,shift)),
          fill = "green", alpha = 0.5,colour = "green") +
     geom_ribbon(aes(ymin=ifelse(Sig+shift < shift,Sig+shift,shift),ymax=shift),
          fill = "red", alpha = 0.5,colour="red") +        
     geom_line(data = JustLine, aes(x=as.Date(xTime), y=Line)) + 
     geom_line(data = input[(startInd+Bias+1):length(input[,1]),], 
          aes(x=as.Date(input[(startInd+Bias+1):length(input[,1]),1]), y=Close)) +
     geom_line(data = input[(startInd+Bias+1):length(input[,1]),], 
          aes(x=as.Date(input[(startInd+Bias+1):length(input[,1]),1]), y=EMAV, colour=4)) +        
     ggtitle(titleStr) +
     labs(x="Date", y="Price in USD") +guides(colour=FALSE)

Friday, May 6, 2016

Mainstream media and shaping social opinions

Inspired by a question "What can we learn about social behaviour from recommender database ?" which arises from my previous note Recommender Systems and q-value Potts model I spent some time considering a following question:
The question:
how to control opinions in a social group by creation of an environment with whom the group can interact ?

Let us start to define the toy model.
In a given social group we have a set of opinions $K$ distributed among all group members. The group collaborates with external world through an interaction with some set of opinions stated in an external environment. We assume that the environment is too big to be manipulated by the social group, but the environment can modify somehow the distribution of opinions inside the group.
For a better transparency we consider 2 different opinions described as $A$ and $B$. Opinions $A$ and $B$ are defined as states "+1" and "-1" correspondingly. Members are distributed among two subgroups according to a shared opinion and interact with the same environment but with different strength: $\gamma_{A}$ and $\gamma_{B}$ respectively. The $\gamma_{A|B}$ coefficients correspond to the level of group members acceptance of noise created by the environment. Or, in other words, the coupling constants $\gamma_{A|B}$ are weights which characterize how much a given group relies on opinions supported by the environment. The level of believe in a group is proportional to the value of $\gamma_{i}$ couplings. The scheme is depicted in the picture 1.

fig. 1. The environmental $M$ levels are coupled to 2 separate systems $A$ and $B$ with a given coupling constants $\gamma_{A}$ and $\gamma_{B}$. For simplicity each group $A$ and $B$ has the same number of levels $N$.

We assume that the social model will fulfill the principle of least action so the entire set will tend to minimize the total energy. Therefore the Hamiltonian approach should make a frame for our analysis. The model can be described by the following Hamiltonian: \begin{equation} \label{Hamiltonian0} H = H_{A} + H_{B} + H_{E} + H_{Int} \end{equation} where
\begin{equation} \label{Hamiltonian1} H_{A|B} = \sum_{k_{A|B} = 1}^{N_{A|B}} E_{A|B} \left| A|B_{k_{A|B}} \right> \left< A|B_{k_{A|B}} \right| \end{equation} describes the state of subgroups $A$ and $B$ with energies $E_{A}$ ($E_{A}$) and $N_{A}$ ($N_{A}$) discrete states $\left| A_{k_{A}} \right>$ ($\left| B_{k_{B}} \right>$). \begin{equation} \label{Hamiltonian2} H_{E} = \sum_{n = 1}^{M} E_{n} \left| E_{n} \right> \left< E_{n} \right| \end{equation} is the basic energy of the environment with $M$ discrete states $\left| E_{n} \right>$ labeled by index n. The interaction between the groups and the environment has a form: \begin{equation} \label{Hamiltonian3} H_{Int} = \sum_{n=1}^{M} \left[ \left( \sqrt{ \gamma_{A} } \sum_{ k_{A} = 1}^{ N_{A} } VA_{k_{A}}^{n} \left| A_{k_{A}} \right> \left< E_{n} \right| + h.c. \right) + \left( \sqrt{ \gamma_{B} } \sum_{ k_{B} = 1}^{ N_{B} } VB_{k_{B}}^{n} \left| B_{k_{B}} \right> \left< E_{n} \right| + h.c. \right) \right ] \end{equation} where $VA$ and $VB$ are matrix elements describing couplings between group states $\left| A \right>$, $\left| B \right>$ and environmental levels $\left< E_{n} \right|$.
Because the environment cannot be modified by any group we can use the Markovian approximation and eliminate the environment states from the Hamiltonian above. The result of operation can be written in the following form: \begin{equation} \label{Hamiltonian4} H_{eff} = H_{A} + H_{B} - i V \cdot V^{+} \end{equation} where \begin{equation} \label{Hamiltonian5} V = \left( \begin{array}{} \sqrt{ \gamma_{A} } \cdot VA \\ \sqrt{ \gamma_{B} } \cdot VB \end{array} \right) \end{equation} creates a dissipative part of the effective Hamiltonian $H_{eff}$ which is an $N \times N$ dimensional matrix and matrix $V$ has a dimension $N \times M$ . Our analysis of the dynamics of the system can be reduced to the determination of the eigenvalues of the effective Hamiltonian $H_{eff}$: $\Lambda = x - iy$ which are complex. An imaginary part describes how fast (time $\tau$) a given eigenvalue dissipates in the system: $ \tau \approx \frac{1}{y}$.
In other words, the $\tau$ describes the lifetime of a given opinion values by the real value of the eigenvalue $x$ . Let us stress that we allow to create a continuous number of opinions, not only those which are defined in the assumptions of the issue: $E_{A|B}$ . Why can the eigenvalues of the Hamiltonian be used as a determinant of the social opinion distribution ?
A standard model of opinion evolution is modeled using (in a simplest case) a set of linear equations: \begin{equation} \label{Hamiltonian6} \vec{x} \left( t + 1 \right) = W \vec{x} \left( t \right) \end{equation} where $W$ is some $N\times N$ matrix describing an information exchange with weights between participants of the social network and the vector $\vec{x}$ is an opinion profile in the network calculated for the time $t$. Thus, the correspondence between the opinion profile $\vec{x}\left( t \right)$ and our approach can be done by calculation of a power spectrum of the $\vec{x}\left( t \right)$. Positions of maximums in the power spectrum can be understood as equivalent to the real values of eigenvalues of the Hamiltonian $H_{eff}$ while the time scale of change of the profile $\vec{x}\left( t \right)$ corresponds to the imaginary part of the eigenvalues (to be more specific to the inverse of the imaginary part).
For any numerical calculations we have to define values $M$, $N$, $E_{A}$, $E_{B}$, $N_{A}$, $N_{B}$, $\gamma_{A}$, $\gamma_{B}$. The matrices $VA$ and $VB$ are calculated as Gaussian unitary ensembles (GUE random matrices).
Below we present a variety of plots of numerically determined eigenvalues of the Hamiltonian.
On all plots the dots are a result of numerical calculations of the Hamiltonian $H_{eff}$. Numerical simulations have been done for 30 GUE matrices $V$ with $N = 100$ and $M = 60$. Remaining values of parameters used for simulations:
  1. energies: $E_{A} = -1/2$, $E_{B} = 1/2$,
  2. number of states: for state $A$: $N_{A} = 50$, for state $B$: $N_{B}=50$,
  3. interaction couplings are described for each plot separately.

At the beginning, for a comparison with existing models we shows on Fig. 2,3 and 4 a situation where $\gamma_{A} = \gamma_{B}$ and different values of $\gamma_{B}$. A similar plots one can find in the work [1].
For a better visibility we used for scale $y$ logarithmic scale. In our case $y \leftarrow -log( \left| y \right| )$, so the value $y=-3$ corresponds to $y = 10^{-3}$.

Fig. 2. Simulation for $\gamma_{A} = \gamma_{B} = 0.0025$ in energy units. The left vertical plot shows the integrated density profile of the width of calculated eigenvalues. The upper picture presents the integrated spectral density showing the width of energy states.

Fig. 3. Simulation for $\gamma_{A} = \gamma_{B} = 0.01$. An creation of two timescales of the eigenvectors is visible.

Fig. 4. Simulation for $\gamma_{A} = \gamma_{B} = 0.5$.

The next Figs (5,6,7 and 8) shows asymmetric situations where $\gamma_{A} = \gamma_{B}/10$ and different values of $\gamma_{B}$. Detailed values of other parameters are noted in the figure description.

Fig. 5. The situation with $\gamma_{B} = 0.01$ and $\gamma_{A} = \gamma_{B}/10$.

Fig. 6. The eigenvalue spectrum calculated for $\gamma_{B} = 0.05$ and $\gamma_{A} = \gamma_{B}/10$.

Fig. 7. The spectrum calculated for $\gamma_{B} = 0.1$ and $\gamma_{A} = \gamma_{B}/10$.

Fig. 8. The spectrum calculated for $\gamma_{B} = 0.5$ and $\gamma_{A} = \gamma_{B}/10$.


The landscape presented on all plots of eigenvalues of the Hamiltonian $H_{eff}$ is quite straightforward for those who are familiar with atomic physics, especially interactions of multi-level atomic transitions with resonant light. We see clear existence of so called coupled and uncoupled level (states) combinations . The increase of the coupling constant $\gamma_{A|B}$ leads to a creation of grouping of eigenvalues.
The eigenfunctions in the Hamiltonian can be formed in such a way that some combinations of levels cancel the interaction part with the environment - they are called uncoupled (or trapped) states. The number of such states is $N-M$.
Other level eigenfunctions are coupled with the environment's states - such an eigenfunction we call coupled (or un-trapped) states ($M$ decaying states). Populations of trapped states can survive longer time (smaller values of y) in comparison to the un-trapped states, where the populations is exchanged between coupled states by the interaction term proportional to $\sqrt{\gamma_{A|B}}$.
Let us try to translate this physical picture into a social language.
  1. The coupled (un-trapped) states: configurations of members (eigenfunction) which change a common opinion (eigenvalue) faster due to the interaction with a noisy environment represented by a different set of environmental opinions. Such an exchange of opinion is proportional to the interaction strength $\gamma_{i}$ ($i=A|B$).
  2. The uncoupled (trapped) states: ideally, such configurations of members (eigenfunction) which create a stable combinations of a group members. They do not interact with the environment.
If we define an opinion power as a parameter proportional to the number of eigenfunctions which share the same range of opinions we see that in steady-state conditions the opinion created by the trapped states (i.e. weakly coupled with the environment) has a greater power of persuasion than an opposite social group characterized by the un-trapped member's configurations (i.e. strongly coupled to the environment).
A higher level of noise (larger value of $\gamma_{A|B}$ and different values of couplings $\gamma_{A} < \gamma_{B}$) doesn't lead to reduction of a significance of unwanted opinions but creates just the opposite action:
it sharpens the polarization of existing opinions and increases the importance of resistance against the environment in a society. Such a situation can be well identified nowadays. The noisy environment is created by mainstream media. What one can find in such an environment: an increasing number of different subjects, news focused on accidents, preferred shorter forms and propagation of a number of opposite opinions and many other similar, as well as different ways of distraction of reader's attention.
Groups of people strongly coupled to such a media stream are not able to define a private opinion and they are easy to manipulate. It also means that the number of people in such a chaotic environment will migrate rather to more stable social groups (trapped states), not so strongly dominated by the mainstream news.
The creation of people without a strong, private opinions is a goal of the present liberal governments. But, as it is presented in this note the chosen way of social control leads just to opposite behaviour. You can see it around yourself, don't you ?
In the model, we used the condition where the number of intermediate states in the environment $M$ is smaller that the number of states in considered social groups $N$: $M < N$.
The opposite case leads to a bit different picture than described in this note, but it is a subject for an another post.

Bogdan Lobodzinski
Data Analytics For All Consulting Service


  1. E.Gudowska-Nowak, G. Papp and J. Brickmann, Two-Level System with Noise: Blue's Function Approach, Chem. Phys., 220 120-135 (1997)

Sunday, April 10, 2016

Small and medium-size companies: any chance to grow in the Big Data dominated environment ?

Today I watched a video presenting European Truck Platooning Challenge 2016.
For those who are not familiar with the subject, it is about a test of self-driving trucks across Europe. The test was organized by the Dutch government. Main truck companies from Europe participated in running the event. You can watch the video here here and here.
It is a really great proof-of-concept project driven by the Big Data technologies.

However, in this note I don't like to write about details of the project. Instead, I would like to develop wider my first reflections after watching the video. I think everyone's conclusion will be that with this event all smaller transportation enterprises have got a Big problem. The problem can be verbalize by a question:
how to survive on the market dominated by big transportation players and their self-driving trucks across Europe ?
In the big transport companies the drivers will be replaced by the lawyers, but what about smaller transport businesses ?

Of course the problem is not only related to the transport industry. Similar situation appears in each branch of commerce. Therefore, the more general question, is:
how smaller enterprises can find themselves in the business environment dominated by the Big Data technologies governed by big parties ?
It is a bit similar to the competition between members inside a social group:
how to survive in a group dominated by set of rules profitable mainly for those who created those regulations.
Each of us can find clear example of such an environment from his own experience.

In my opinion, a possible solution of the problem can be found in a grouping of small companies into a fair-shared societies glued by common goals and business activity. Such a group has a bigger chance of survival than a single member, especially if all members can coordinate all individual actions, in such a way that only these processes which are profitable for the group and for the participant should be realized. It may sound like a socialism utopia - so it may mean - forget it, not possible for realization.

However, I am more and more convinced that such a procedure of business activity is possible if all human emotional behavior can be removed from a decision making processes. Remember, we are talking about methods to lead business in a most effective way. Such a decision creator can be realized by a Big Data driven decision making system using machine learning algorithms . If we have a group of a few enterprises and each company makes available his business data (like details of contacts, details of realized tasks and preferences, etc. ) for the decision-making system, a proper Big Data framework can create suggestions how to move forward using well established probabilities.
A client can look for a doer by communications with a group accessing the representative portal providing specification of task(s), or asking directly a chosen company.
In opposite, each group member can act fully autonomously, looking for customers individually or can use a group representative hints created by the decision-making system. In both cases the group representative framework acts as a Data driven decision creator:
  1. for customer: suggesting proper companies from the group
  2. for companies in the group: informing chosen entities about requested tasks.
The structure of the group reminds a bit the franchise business model. In contrary to a franchiser a group member don't have to follow any business concept and there is no need for any kind of standardizations in terms of provided services. The only payment which have to be provided by the group participants is a cost of the service and the development of a proper "Big Data"/Machine Learning infrastructure.

Summarizing this short note, the goal of described above business association is creation and support of strategies for sustainable growth of small sized enterprises in a Big Data dominated environment. The common features among all participating parties in the group should be community of the type of business activity and goals.

Unfortunately, details are usually much mode complicated, but my intention was to sketch a general picture of a structure with (in my opinion) a highest chance to become profitable in the Big Data business era.

I will appreciate any comment and opinion about the idea sketched above.

Bogdan Lobodzinski
Data Analytics For All Consulting Service