tag:blogger.com,1999:blog-44126892762794525012024-02-21T00:13:03.556+01:00Data Analytics For All presenting my ideas and views . Ready for a new challenge !
Bogdan Lobodzinskihttp://www.blogger.com/profile/09289422592105592639noreply@blogger.comBlogger13125tag:blogger.com,1999:blog-4412689276279452501.post-66987907492321680482023-11-11T23:22:00.032+01:002023-11-14T12:50:01.402+01:00Pobieżna analiza wyborów do Parlamentu Polskiego 2023<!-- MathJax usage example -->
<script type="text/x-mathjax-config">MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}});</script>
<script type="text/x-mathjax-config">MathJax.Hub.Config({TeX: { equationNumbers: { autoNumber: "AMS" }}});</script>
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"> </script>
<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"></script>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4QyWsB2k8MubNwdgli664tilf39NYGSxM1Bu8VxxUXDI8uDPvzGsf4i8A5dlExh9jx8Lup16bwtk0P6DnO2DbLUU4ID6dXvZSyT-qHVHSyEJrX17r59OhCEIkHEl_QksE674HjBYW9IffwHyVoAhaytKp3auC8vJ33O2rkCquLbutMUlsLemmrD6-IHmY/s1024/frekwencja-sejm-2023.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="400" data-original-height="1024" data-original-width="1024" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4QyWsB2k8MubNwdgli664tilf39NYGSxM1Bu8VxxUXDI8uDPvzGsf4i8A5dlExh9jx8Lup16bwtk0P6DnO2DbLUU4ID6dXvZSyT-qHVHSyEJrX17r59OhCEIkHEl_QksE674HjBYW9IffwHyVoAhaytKp3auC8vJ33O2rkCquLbutMUlsLemmrD6-IHmY/s400/frekwencja-sejm-2023.png"/></a></div>
[mapa ze strony https://wbdata.pl/wybory-2023-mapy/]<br><br>
Zadałem sobie pytanie czy ostatnie wybory 15.10.2023 były uczciwe i czy można to sprawdzić statystycznie przy pomocy prostej analizy wspomaganej intuicją.<br>
<i>Zatem, nie będzie to twardy dowód na oszustwa wyborcze</i>.<br>
Dane do analizy można odnaleźć na stronie <a href="https://wybory.gov.pl/sejmsenat2023/pl/dane_w_arkuszach">
<b>https://wybory.gov.pl/sejmsenat2023/pl/dane_w_arkuszach</b></a>), w części 'Wyniki głosowania na listy Sejmowe'. Plik
<i>'po okręgach Sejm CSV XLSX'</i> zawiera dane ze wszystkich obwodow wyborczych.
<br /><br>
<h3><b>Wprowadzenie </b></h3>
Zacznijmy od paru defnicji. W dalszej czesci bede opieral sie na paru nowych zmiennych, jak napisalem analiza jest bardzo uproszczona.
Ze wzgledu na wielość partii, wprowadzam następujące grupy:
<ol>
<li><b>
'OPOZYCJA' = 'KOALICYJNY KOMITET WYBORCZY TRZECIA DROGA POLSKA 2050 SZYMONA
HOŁOWNI - POLSKIE STRONNICTWO LUDOWE'+<br />
'KOALICYJNY KOMITET WYBORCZY KOALICJA
OBYWATELSKA PO .N IPL ZIELONI']+<br />'KOMITET WYBORCZY NOWA LEWICA'</b>
<li><b>
'INNE PATRIE' = 'KOMITET WYBORCZY BEZPARTYJNI SAMORZĄDOWCY'+<br />
'KOMITET WYBORCZY WYBORCÓW MNIEJSZOŚĆ NIEMIECKA'+<br />
'KOMITET WYBORCZY KONFEDERACJA WOLNOŚĆ I NIEPODLEGŁOŚĆ'+<br />
'KOMITET WYBORCZY POLSKA JEST JEDNA'+<br />
'KOMITET WYBORCZY WYBORCÓW RUCHU DOBROBYTU I POKOJU'+<br />
'KOMITET WYBORCZY NORMALNY KRAJ'+<br />
'KOMITET WYBORCZY ANTYPARTIA'+<br />
'KOMITET WYBORCZY RUCH NAPRAWY POLSKI'</b>
<li><b>'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ' - jako osobna grupa</b>
</ol>
Ponieważ 'KOMITET WYBORCZY RUCH NAPRAWY POLSKI' i 'OPOZYCJA' są głównymi graczami, dlatego w dalszej części będę analizował 3 grupy danych:
<ol>
<li><b>wszystkie obwody: bez rozróżnienia</b>
<li><b>a) OPOZYCJA > KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ: gdy suma głosów na 'OPOZYCJA' w obwodzie jest większa od liczby głosów na 'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'</b>
<li><b>b) OPOZYCJA < KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ: gdy suma głosów na 'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ' w obwodzie jest większa od liczby głosów na 'OPOZYCJA'</b>
</ol>
<br>
Jako zmienną do porównywania rozkładów grup partyjnych używam stosunku liczby głosów oddanych na grupę partyjną w obwodzie do całkowitej liczby głosów w tym obwodzie.<br>
<br>
<h3><b>Rozkłady oddanych głosów </b></h3>
Rozkłady wyglądają tak:<br>
dla grupy <b>wszystkie obwody</b>:<br>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjy1snR28Ezz6ZFLehtg6WH7WRvp7EWyFcOVXKmzoOFNZe_2GwW-LhyGCqxd_yt8Yh2cCOufPJvNbSK1_nFmMDhqTFDc4cYsmBvnGNYjIFSBWzbpHk41Y2tyYbh_6bgqfALTMFY6M8oYnTpXlWAjObpCAhuFGYJhuTZLvS4P91sjrEoKx6HvbQgr1ISDj0z/s1571/rozklady1.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="600" data-original-height="663" data-original-width="1571" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjy1snR28Ezz6ZFLehtg6WH7WRvp7EWyFcOVXKmzoOFNZe_2GwW-LhyGCqxd_yt8Yh2cCOufPJvNbSK1_nFmMDhqTFDc4cYsmBvnGNYjIFSBWzbpHk41Y2tyYbh_6bgqfALTMFY6M8oYnTpXlWAjObpCAhuFGYJhuTZLvS4P91sjrEoKx6HvbQgr1ISDj0z/s600/rozklady1.png"/></a></div>
Rys. 1<br><br>
<br>
Jak widać, rozkłady <i>'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'</i> i <i>'OPOZYCJA'</i> są niemal lustrzanymi odbiciami względem wartości $\approx 0.5$.
Rozkłady dla następnych grup danych:<br>
dla grupy <b>a) OPOZYCJA > KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ</b><br>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiN8_Abmv93rwIZa7vvqy2ajk90-Ce7hCF1IphgwfR5_ePat71Yp5ja-_-RAyOJw7H-sc6UKgl2Uvtc7eCEbNTkcva0yiAhubiVrm1kDYlyGjXAqNHxaJTHFoYGNbHSKixdn4O6TEnoCz3trXtAmOeRx4NDJomiKw531MDiUm6E3IwhSNl-iIc_TzY7rfKG/s1571/rozklady_a.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="600" data-original-height="667" data-original-width="1571" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiN8_Abmv93rwIZa7vvqy2ajk90-Ce7hCF1IphgwfR5_ePat71Yp5ja-_-RAyOJw7H-sc6UKgl2Uvtc7eCEbNTkcva0yiAhubiVrm1kDYlyGjXAqNHxaJTHFoYGNbHSKixdn4O6TEnoCz3trXtAmOeRx4NDJomiKw531MDiUm6E3IwhSNl-iIc_TzY7rfKG/s600/rozklady_a.png"/></a></div>
Rys. 2<br><br>
<br>
i dla grupy <b>b) OPOZYCJA < KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ</b><br><br>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDrpNdXs9slDxvSgyDSYnOLdlR71hlxvR_PUAG5clSjbc95lTwMnkCIQlZFSkI1_OTg4sjCYm0U7kO5LiWJ1EbwsuJ3TV6rboLB7MM1eE8Q97rfBOIYnlulhDaK8W-aYRswl-XegFre3Mi2guGG52dsFIIFxzJK-Vu7HXYQXiLdj9a_DbwVabCp6KVG01U/s1571/rozklady_b.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="600" data-original-height="667" data-original-width="1571" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDrpNdXs9slDxvSgyDSYnOLdlR71hlxvR_PUAG5clSjbc95lTwMnkCIQlZFSkI1_OTg4sjCYm0U7kO5LiWJ1EbwsuJ3TV6rboLB7MM1eE8Q97rfBOIYnlulhDaK8W-aYRswl-XegFre3Mi2guGG52dsFIIFxzJK-Vu7HXYQXiLdj9a_DbwVabCp6KVG01U/s600/rozklady_b.png"/></a></div>
Rys. 3<br><br>
<br>
W analizie danych takich jak ta , zwykle mamy do czynienia z rozkładami w przybliżeniu symetrycznymi, tak jak dla grup: <b>a) OPOZYCJA > KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ</b> i <b>wszystkie obwody</b>.
W przypadku ostatniej grupy (<b>b) OPOZYCJA < KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ</b>) mamy asymetryczny podział pomiędzy grupami politycznymi <i>'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'</i> i <i>'OPOZYCJA'</i> w pobliżu <i>'stosunek glosow na grupe partyjna do sumy wszystkich oddanych głosow - na obwod'</i>$\approx 0.45$.
Poparcie dla <i>'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'</i> bardzo ostro maleje do 0. rozkład poparcia dla <i>'OPOZYCJA'</i>, także wydaje się opada stromiej, ale nie jest to tak dramatyczne jak dla <i>'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'</i>.<br>
Ponieważ ten aspekt wygląda dziwnie, zatem w dalszym ciągu będę analizował głosy, w tych obwodach wyborczych dla których wartosc <i>'stosunek glosow na grupe partyjna do sumy wszystkich oddanych glosow - na obwod'</i> wynosi $0.3 - 0.6$.<br><br>
<h3><b>W poszukiwaniu manipulacji </b></h3>
W tym celu wybieram przedzial grupę danych <i>'b) OPOZYCJA < KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'</i> dla których wartosc <i>'stosunek glosow na grupe partyjna do sumy wszystkich oddanych głosów - na obwod'</i> wynosi $0.3 - 0.6$ i wyliczam rozkład 2giej cyfry z wartości glosowań na <i>'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'</i> i <i>'OPOZYCJA'</i>. Jesli głosy nie zostały zmanipulowane, to <b>otrzymany rozkład powinien być zgodny z prawem Benforda dla 2giej cyfry</b> (<a href="https://en.wikipedia.org/wiki/Benford%27s_law"><b>Benford\'s law - Wikipedia</b></a>)<br>
Z reguły, sprawdzanie rozkładu 2giej cyfry ze zbioru wartości jest mniej zależne na dodatkowe czynniki mogące powodowa przekłamania, np. różnice w wielkości obwodów wyborczych, rozdzielone mocno rozkłady analizowanych zmiennych. Tego typu czynniki powodują, że rozkład 1szej cyfry wartosci zmiennych nie jest wiarygodny, nawet jeśli taki rozkład mocno odbiega od oczekiwanego rozkładu wedle praw Bendforda dla 1szej liczby.<br><br>
Do analizy rozkładu 2giej cyfry wybrałem te zmienne ze skopiowanych danych z obwodow wyborczych, które powinny najbardziej bezpośrednio ukazywać ewentualne manipulacje:
<ol>
<li><b> 'Liczba głosów ważnych oddanych łącznie na wszystkie listy kandydatów'</b>,
<li><b> 'Liczba głosów nieważnych'</b>,
<li><b> 'W tym z powodu postawienia znaku „X” obok nazwiska dwóch lub większej liczby kandydatów z różnych list'</b>
</ol>
a także liczbę oddanych głosów na <i>'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'</i>, <i>'OPOZYCJA'</i> i <i>'INNE PATRIE'</i>.
Jako bląd dopasowania jest wyliczany Chi2 (oznaczony na wykresach poniżej jako <i>CHI2</i>).
Otrzymane rozkłady:<br>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaq2n_TEfFuUKkOTosaFuQzS6k_Z4NLWrhP_qSJ7j0e3Sg5A6gdvIyVzvgvBadTmCnXJSWzx-OMTO7R3gIB3Oa-9xp42i0nPpqV-sXvWohwjpqdCU1OftibprkDUyOraK8j7zVAGDIk0xIE487PFdWlh3CFk3JgaufbWQLTXPMFAwIM1VosptGgw7xTkKz/s2479/benford_1.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="600" data-original-height="2479" data-original-width="2340" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaq2n_TEfFuUKkOTosaFuQzS6k_Z4NLWrhP_qSJ7j0e3Sg5A6gdvIyVzvgvBadTmCnXJSWzx-OMTO7R3gIB3Oa-9xp42i0nPpqV-sXvWohwjpqdCU1OftibprkDUyOraK8j7zVAGDIk0xIE487PFdWlh3CFk3JgaufbWQLTXPMFAwIM1VosptGgw7xTkKz/s600/benford_1.png"/></a></div>
Rys. 4<br><br>
<br>
Na wykreach powyżej, przez $N$ oznaczam liczbe wartości z których wyliczono rozkład.<br>
Dla przypadku zmiennej <i>'Liczba głosów ważnych oddanych łącznie na wszystkie listy kandydatów'</i> (wykres 1) powyżej): wynik sugeruje wiekszą manipulację danych dla obwodów z kategorii <i>'a) OPOZYCJA > KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'</i>, cyfry 0 i 1 można interpretować jako dołożone.<br>
Dla zmiennej <i>'Liczba głosów nieważnych'</i> (wykres 2) powyżej): błędy dopasowania są podobne dla obu grup danych <i>'a) OPOZYCJA > KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'</i> i <i>'b) OPOZYCJA < KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'</i>. <i>'Liczba głosów nieważnych'</i> jest większa dla przypadku <i>'a) OPOZYCJA > KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'</i>.<br>
Zmienna <i>'W tym z powodu postawienia znaku „X” obok nazwiska dwóch lub większej liczby kandydatów z różnych list'</i> pokazuje jeszcze bardziej znaczące statystycznie różnice dla obu grup danych (<i>a)</i> i <i>b)</i>).<br>
Wykresy 2) i 3) powyżej sugerują manipulacje związane ze zwiększaniem liczby głosów nieważnych.<br><br>
Na następnym rysunku pokazuję analizę liczby głosow poparcia na grupy polityczne.<br>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfMOAVvgXQ0PVW-40xPzubOxFqO2Ej3jNX3Lu1xwlAUV3GveeqaMVbC1_-IYTPO6nIcHb7PfX8ydeCHqs3P5jZ3hYn6SGUyjSGSxrqlgbOMUEof8AJGhcQ2aYeA_JrTQzlu6mAB59gGFiz1wjKA93XefkFqoSq0puEiES6XWqiaJ0QwnxtNlmmSJSgL_OJ/s2479/benford_2.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="600" data-original-height="2479" data-original-width="2340" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfMOAVvgXQ0PVW-40xPzubOxFqO2Ej3jNX3Lu1xwlAUV3GveeqaMVbC1_-IYTPO6nIcHb7PfX8ydeCHqs3P5jZ3hYn6SGUyjSGSxrqlgbOMUEof8AJGhcQ2aYeA_JrTQzlu6mAB59gGFiz1wjKA93XefkFqoSq0puEiES6XWqiaJ0QwnxtNlmmSJSgL_OJ/s600/benford_2.png"/></a></div>
Rys. 5<br><br>
<br>
Poparcie dla grupy <i>'OPOZYCJA'</i> (wykres 1) powyżej) ma mały błąd dopasowania (CHI2 < 1.) dla obu grup danych (<i>a)</i> i <i>b)</i>). Trudno tu wskazać na manipulacje. <br>
Poparcie dla <i>'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'</i> (wykres 2) powyżej) też jest obarczone małym błędem dopasowania - CHI2 < 1. dla grupy <i>b)</i> i CHI2 $\approx$ 2. dla grupy <i>a)</i>. W grupie <i>a)</i> cyfry 0, 1 rozkładu poparcia są poniżej oczekiwanego rozkładu.<br>
W przypadku <i>'INNE PATRIE'</i>, widać również większy błąd dopasowania dla grupy danych <i>a)</i> niż dla grupy <i>b)</i>.<br><br>
<h3><b>Podsumowanie </b></h3>
W przeprowadzonej analizie pokazuje potencjalne miejsce, w którym mogło dojść do manipulacji głosow poparcia na grupy polityczne. Są to obwody wyborcze, w których poparcie <i>'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'</i> i <i>'OPOZYCJA'</i> jest zrównoważone (wartosc <i>'stosunek glosow na grupe partyjna do sumy wszystkich oddanych głosów - na obwod'</i> wynosi $0.3 - 0.6$).
Wykresy na Rys. 5 pokazują, że do pewnych manipulacji dochodziło częściej w obwodach należących do grupy danych <i>'a) OPOZYCJA > KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'</i> na niekorzyść <i>'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'</i>. <br>
Czy ta analiza udowadnia, że wybory zostały sfałszowane ? Nie, tylko sugeruje statystycznie znalezione sygnatury manipulacji, które są małe. Jak napisałem we wstępie, niniejsza analiza nie jest dowodem na oszustwa wyborcze. <br>
<br><br>
Dziękuje za przeczytanie !
Bogdan Lobodzinskihttp://www.blogger.com/profile/09289422592105592639noreply@blogger.com0tag:blogger.com,1999:blog-4412689276279452501.post-52663804804668548672022-06-13T17:11:00.002+02:002022-06-13T18:45:28.680+02:00A word on the development and implementation of machine learning techniques for IoT data processing<!-- MathJax usage example -->
<script type="text/x-mathjax-config">MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}});</script>
<script type="text/x-mathjax-config">MathJax.Hub.Config({TeX: { equationNumbers: { autoNumber: "AMS" }}});</script>
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"> </script>
<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"></script>
IoT data analytics typically involves creating processes to predict failures, situations or find anomalies online. By process I mean here a model created based on Machine learning or/and statistics and its full implementation in the production cycle. <br />
It has been reported that up to <b>80% of IoT data projects fail</b> (the project was not completed or the company gained nothing from its implementation). If we look at the highlighted reasons for failures, we find mostly data-related topics, hazy descriptions of problems with Machine Learning methods without going into details, vaguely defined goals, and many other reasons. <br />
<br />
<b>In this note, I would like to focus on choosing an algorithm for IoT data analysis purposes.
Before working with IoT data, it's a good idea to determine earlier which analytical solution should be implemented: based on supervised or unsupervised methods.
</b>
<br />
Usually creating a good unsupervised algorithm is a difficult task, more difficult than a supervised algorithm. However, building the model itself is only part of the project. The other part is its application in the production process.
So I propose to look at the whole thing: the creation of an analytical method and its productization. <br />
<br />
<h4><b><i>Solution 1 - Supervised Model (SM):</i></b></h4>
<ol>
<li><b>analytical model (R&D)</b>: this is a Data Scientist (DS) standard job:
<ul>
<li> selection of features (data and feature engineering),
<li> model creation,
<li> precise validation procedure with KPIs.
</ul>
<li><b>Productization</b>: we have to include a Data Engineer here also. Tasks to do:
<ul>
<li> data & feature engineering, preparation of computing environment,
<li> framework for SM automatization including:
<ul>
<li> monitoring of the data shifts (features + target),
<li> data selection for creating new models,
<li> data labeling,
<li> parameter hypertuning,
<li> model retraining,
<li> selection of the best model(s).
</ul>
</ul>
</ol>
<br />
<h4><b><i>Solution 2 - Unsupervised Model (UM): </i></b></h4>
<ol>
<li><b>R&D</b>: DS job:
<ul>
<li> selection of features (data and feature engineering),
<li> model creation,
<li> precise validation procedure with KPIs.
</ul>
Tasks are mosty the same as for the SM case, but obviously the goal is more complex.
<li><b>Productization</b>: it obeys DE + DS tasks. Main tasks to do:
<ul>
<li> data & feature engineering, preparation of computing environment.
</ul>
</ol>
<br />
<b><i>Comparison of both solutions:</i></b><br />
As you can see, creating a well-functioning production cycle for the SM is a very difficult task. In my opinion more difficult than creating a proper supervised model and much more time consuming. <br />
<br />
On the other hand, for solution 2 based on the UM, putting the solution into production is very simple.
The opposite picture we have for building analytical models.
<br />
Given the above considerations, the summary statement might look like this image:
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtWyQ_HspVOmvjCQtOygv70YwgtTmeQkU7UjWrKrXr_aG1VHRBPsXIIUwsQWOFSrDObwODXNkGb5z0GNiJ47Iw0npESVjIxz0hl0fSauVQMiAUEOYthuB6ShFlZFO-W_3JsKWT0negVa5LOldSXnr5_W8FBhkHitD_uMGiLk1taqRWItsJaA22WfsYxg/s822/overview.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="600" data-original-height="216" data-original-width="822" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtWyQ_HspVOmvjCQtOygv70YwgtTmeQkU7UjWrKrXr_aG1VHRBPsXIIUwsQWOFSrDObwODXNkGb5z0GNiJ47Iw0npESVjIxz0hl0fSauVQMiAUEOYthuB6ShFlZFO-W_3JsKWT0negVa5LOldSXnr5_W8FBhkHitD_uMGiLk1taqRWItsJaA22WfsYxg/s600/overview.png"/></a></div>
<i>The difficulties associated with the productization of the SM model are the main place where failures are to be expected and there are a lot of them.
Maybe sometimes it is worth simplifying this part of the project and focus on building the UM model.</i><br />
<br />
I hope you find my findings helpful. Thanks for reading !
Bogdan Lobodzinskihttp://www.blogger.com/profile/09289422592105592639noreply@blogger.com0tag:blogger.com,1999:blog-4412689276279452501.post-88662842819118241772022-01-03T11:15:00.001+01:002022-01-03T11:36:13.397+01:00Semantic text clustering: testing homogeneity of text clusters using Shannon entropy.<!-- MathJax usage example -->
<script type="text/x-mathjax-config">MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}});</script>
<script type="text/x-mathjax-config">MathJax.Hub.Config({TeX: { equationNumbers: { autoNumber: "AMS" }}});</script>
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"> </script>
<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"></script>
Many natural language processing (NLP) projects are based on the task of semantic text clustering. In short, we have a set of statements (texts, phrases, or sentences), and our goal is to group them semantically. Nowadays, this is quite a classical problem with a rich literature on clustering methods. <br />
<br />
In spite of the simple formulation of the task, there are many problems during its realization. To make things more difficult, let's assume <i>we are dealing with unlabeled data, which means we must rely on unsupervised techniques</i>. <br/>
<h4><b>The first problem:</h4></b>
as we know, we can find many categories according to which we can try to group the given set of texts.
In order to do this task well, we need to know the business case better. In other words, we need to know the question we want to answer using clustering.
This problem is not the purpose of this note, so we will skip it. <br />
<br />
<h3><b>The second problem:</b></h3>
suppose we have successfully performed text clustering. And the question arises: <i><b>how do we know that all clusters contain the correct texts ?</i></b>
<i>Answering this question is the purpose of my note </i>.<br /><br />
Let us assume that a given method has generated $N$ groups/clusters , each of which contains a set of similar texts .
It is known that every cluster of similar texts is more or less homogeneous. There are clusters that are completely wrong in the sense of similarity of texts.
So to answer our original question <br />
<i><b>How do we know that all clusters contain correct texts ?</i></b><br />
is to find those clusters that contain erroneous texts.<br /><br />
<h3><b>How to find the worst clusters ? </b></h3>
In this note, I would like to propose the following method for determining the worst clusters:
<ol>
<li> checking intra-cluster similarity by computing the Shanon entropy of the set of texts belonging to a given cluster.
<li> using dynamically determined threshold entropy to select the worst clusters (based on the method presented in my blog
<a href="https://dataanalyticsforall.blogspot.com/2021/05/semantic-search-too-many-or-too-few.html">
<b>https://dataanalyticsforall.blogspot.com/2021/05/semantic-search-too-many-or-too-few.html</b></a>).
</ol>
As data to illustrate our task I use data known as reuters-21578 (https://paperswithcode.com/dataset/reuters-21578).<br />
Since the clustering stage is not our goal, this part of the work was done using the fast clustering method based on Agglomerative Clustering
(<a href="https://www.sbert.net/examples/applications/clustering/README.html#fast-clustering">
<b>https://www.sbert.net/examples/applications/clustering/README.html#fast-clustering</b></a>,
code: <a href="https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/fast_clustering.py"
<b>https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/fast_clustering.py</b></a>).<br />
<br />
Each set of texts is characterized by a different degree of mutual homogeneity.
The general idea of the method is to calculate the Shanon entropy for each identified text cluster, and then using the elbow rule to determine the homogeneity threshold (maximum entropy value) below which the clusters satisfy the homogeneity condition (are accepted). All clusters with entropy above the threshold should be discarded, used to find a better clustering algorithm or reanalyzed.
<br />
In any clustering method, we will get multiple single element clusters.
Since in our method we check for homogeneity between sentences in a cluster, therefore I calculate entropy only for multi-element clusters.
<br />
<br />
The only parameters that needs to be introduced in the proposed method are:
<ol>
<li>the 'approximation_order' which defines a number of of consecutive characters (known as ngrams) from which we create the probability distribution used later to calculate the Shanon entropy.
<li>The sensitivity parameter S (to adjust the aggressiveness in kneed detection [<a href="https://kneed.readthedocs.io/en/stable/parameters.html">
<b>kneed</b></a>]) used in the Python kneed library.
</ol>
The complete code is available at<br />
<a href="https://github.com/Lobodzinski/Semantic_text_clustering__testing_homogeneity_of_text_clusters_using_Shannon-entropy">
<b>https://github.com/Lobodzinski/Semantic_text_clustering__testing_homogeneity_of_text_clusters_using_Shannon-entropy</b></a>.
<br />
<br />
To illustrate the method I show two pictures with the final results for two values of the parameter used to determine the Shannon entropy (approximation_order: 2 and 3)<br />
<br />
All the values of the Shannon entropy approximations are ordered from smallest to largest (y-axis). The kneed library, given a parameter S (in this case S = 1), determines the best inflection point and this entropy value defines our maximum entropy value for the accepted sentence clusters. For a given approximation_order parameter (=2), we obtain 14 clusters which are inappropriately homogeneous.
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEivOKJqL-l8Q4V2tr3PgJ_C5FIsCAxgtc7m5pRX0k51HfAIJz2Kq0JHYEiuardp5kTR5hvZgcrBbg1dS39SaymiDuRWgdC4uSo4Xf-Rv53QL3_6TAdVunV_1t9mK5aiZ1D5pQ4vZSXCxSEXc4AplEGfC7_5h5dlUoPcTlFzp_QdXOixL9iBTi9zzHuWqA=s968" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="600" data-original-height="354" data-original-width="968" src="https://blogger.googleusercontent.com/img/a/AVvXsEivOKJqL-l8Q4V2tr3PgJ_C5FIsCAxgtc7m5pRX0k51HfAIJz2Kq0JHYEiuardp5kTR5hvZgcrBbg1dS39SaymiDuRWgdC4uSo4Xf-Rv53QL3_6TAdVunV_1t9mK5aiZ1D5pQ4vZSXCxSEXc4AplEGfC7_5h5dlUoPcTlFzp_QdXOixL9iBTi9zzHuWqA=s600"/></a></div>
Questionable clusters are:
<pre class="prettyprint"><code>Cluster 13, #10 Elements
Cluster 23, #8 Elements
Cluster 26, #8 Elements
Cluster 28, #8 Elements
Cluster 49, #6 Elements
Cluster 56, #6 Elements
Cluster 59, #5 Elements
Cluster 62, #5 Elements
Cluster 127, #4 Elements
Cluster 137, #3 Elements
Cluster 145, #3 Elements
Cluster 152, #3 Elements
Cluster 247, #3 Elements
Cluster 275, #3 Elements
</code></pre>
For comparison, the next picture is calculated for approximation_order=3.
In this case, we get 15 incorrect clusters:
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEihMzchiJyFIrhD5c6WNPzccEiywV-CkyS9ndXyPOQyxW1uDlhVbspd_iLs67Q7f9cS9cYuNvVE2TNX9yLQD2aXcCleKJXStn8flcyYuT_qWR3eBORGGKFL-wZb0F-PeQX470DOOmwX7rcS4MyoGMOFet4YyaqukMJiUl6N3niBz0FScSBI_ohMzOPp5Q=s968" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="600" data-original-height="354" data-original-width="968" src="https://blogger.googleusercontent.com/img/a/AVvXsEihMzchiJyFIrhD5c6WNPzccEiywV-CkyS9ndXyPOQyxW1uDlhVbspd_iLs67Q7f9cS9cYuNvVE2TNX9yLQD2aXcCleKJXStn8flcyYuT_qWR3eBORGGKFL-wZb0F-PeQX470DOOmwX7rcS4MyoGMOFet4YyaqukMJiUl6N3niBz0FScSBI_ohMzOPp5Q=s600"/></a></div>
Rejected clusters:
<pre class="prettyprint"><code>Cluster 0, #396 Elements
Cluster 13, #10 Elements
Cluster 23, #8 Elements
Cluster 26, #8 Elements
Cluster 28, #8 Elements
Cluster 56, #6 Elements
Cluster 59, #5 Elements
Cluster 62, #5 Elements
Cluster 109, #4 Elements
Cluster 127, #4 Elements
Cluster 137, #3 Elements
Cluster 145, #3 Elements
Cluster 152, #3 Elements
Cluster 179, #3 Elements
Cluster 275, #3 Elements
</code></pre>
Comparing the list of unacceptable clusters may give more information about the occurring heterogeneities in our text set.
<br />
<br />
I encourage you to test the method for yourself. In my case it turned out to be very useful.
<br />
<br />
The code:<br />
<a href="https://github.com/Lobodzinski/Semantic_text_clustering__another_way_to_study_cluster_homogeneity">
<b>https://github.com/Lobodzinski/Semantic_text_clustering__another_way_to_study_cluster_homogeneity</b></a>.
The outstanding advantage of the method is that it works fully autonomously without the need for human intervention.<br />
<br />
<br />
Thanks for reading, please feel free to comment and ask questions if anything is unclear.
Bogdan Lobodzinskihttp://www.blogger.com/profile/09289422592105592639noreply@blogger.com0tag:blogger.com,1999:blog-4412689276279452501.post-37348053311497256522021-08-12T21:51:00.012+02:002021-08-25T10:58:04.584+02:00Are solar power plants really green energy ? Continuation<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-4793NQQM5T"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-4793NQQM5T');
</script>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjePRTHtDH3sgK-f-mLjrfdorGM1OTKoJPUO-t-2i2YxgBdMW4DqGxHScEDEZgn4PMnZdSHD_rcforTP40Lg2NzPtdVizTg8nm4WJ_xGhUiFp7Htcd46J_Ek_d6qngHwRyicySAnhzvSb7U/s455/f30.jpg" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="320" data-original-height="455" data-original-width="431" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjePRTHtDH3sgK-f-mLjrfdorGM1OTKoJPUO-t-2i2YxgBdMW4DqGxHScEDEZgn4PMnZdSHD_rcforTP40Lg2NzPtdVizTg8nm4WJ_xGhUiFp7Htcd46J_Ek_d6qngHwRyicySAnhzvSb7U/s320/f30.jpg"/><div class="bottom-right">image copied from https://dziennikzarazy.pl</div></a></div>
This text is a continuation of my reflections on the impact of photovoltaic power plants on climate.
In the part <a href="https://dataanalyticsforall.blogspot.com/2021/08/are-solar-power-plants-really-green.html"><i>Are solar power plants really green energy ?</i></a> I presented some figures which show that the heat generated by the currently working solar power plants (they produce more than 4 times more thermal energy than they produce in the form of electricity) is such a big part of the heat emitted by humans, that after converting this heat into CO2 amounts, it is more than a half of the effect generated in 2020 on the whole Earth!</b><br />
This amount of heat energy is unlikely to have no effect on global warming.</b><br /></b><br />
<h3>More concrete signals that the development of a solar power plant infrastructure could lead to climate disruption and rising temperatures. </h3>
</b><br />
<ol>
<li>
<h4>Urban Heat Island effect (UHI)</h4>
The UHI is a real phenomenon.<br />
The paper
<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4555247/"><i>"The Effect of Urban Heat Island on Climate Warming in the Yangtze River Delta Urban Agglomeration in China"</i></a>
presents the effect of UHI on climate warming based on an analysis of the effects of urbanization rate, urban population and land use change on the warming rate of mean, minimum (night) and maximum (day) air temperature in the Yangtze River Delta (YRD) using observational data from 41 meteorological stations.
In conclusion, the authors found that observations of daily mean, minimum, and maximum air temperature atmeasurement stations in the YRDUA from 1957 to 2010 showed significant long-term warming due to background warming and UHI. The warming rate of 0.108 to 0.483°C/decade for mean air temperature is generally consistent with the warming trend in other urban regions in China and other urban areas in the world. <br />
<b>Thus, the authors showed that urbanization significantly enhanced local climate warming. </b><br />
The solar power plants based on photovoltaic panels are even hotter islands of heat than highly urbanized agglomerations. During the period of most intense sunlight, the temperature near a solar power plant can be up to 3 degrees Celsius higher than the temperature in a similar environment without solar panels and similar solar conditions. </b><br />
<h4>Suggestion:</h4>
The similarity in heat generation between the two cases - densely populated metropolitan areas and solar power plants - suggests the same effect - <b>warming the air over a larger area</b>. </b><br />
<li>
<h4>Correlation between Urban Heat Island (UHI) effect and number of heat waves (HW)</h4>
Due to lack of access to data, I have to rely on visual comparisons (if anyone knows data to analyze or can make it common, please contact me).</b><br />
<b>Below I illustrate 2 cases: USA and Europe (Germany in particular).</b></b><br />
<ul>
<li>
<h4>USA case</h4>
Let's visually compare the 2 images to each other.
The first, showing the density of solar power plants in the USA [<a href="https://openinframap.org"><i>https://openinframap.org</i></a>] - Figure 1:
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjX4H4EgVhN2TN4A3AGSLlfdSYI0ERxJIYkzymf0H3UDyI-oumKBwN_OXwXmRpzwJ2nk-u810bRrHW3J-ldQ2xAOnMfIm-C2T69k9wOLQb0-t_aU1aG-LFdVrGLBNcwiMi6emNwgA_cCJnL/s1574/usa_PV_plants.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="700" data-original-height="611" data-original-width="1574" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjX4H4EgVhN2TN4A3AGSLlfdSYI0ERxJIYkzymf0H3UDyI-oumKBwN_OXwXmRpzwJ2nk-u810bRrHW3J-ldQ2xAOnMfIm-C2T69k9wOLQb0-t_aU1aG-LFdVrGLBNcwiMi6emNwgA_cCJnL/s400/usa_PV_plants.png"/>
<div class="bottom-right">Figure 1: density of Solar Power plants in USA</div></a></div>
The second image, from the paper [<a href="https://www.globalchange.gov/browse/indicators/us-heat-waves"><i>U.S. heat wave frequency and length are increasing</i></a>] showing the increase in the number of HW in the US - Figure 2:
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCpSlocjDdxEnYG0lPX-PcGzfTwYtSVx70IYWm5B9lIq-I8zxxyCJuaFbihcKhDEa10wI2EXVYz-aD3iboYpjiYYgX3zlII6p_FvZKpsOdFcjekZ99iIvk_Vup7DSGfBiGLPvvJXkMeo5T/s2048/heat_waves_us.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="700" data-original-height="1540" data-original-width="2048" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCpSlocjDdxEnYG0lPX-PcGzfTwYtSVx70IYWm5B9lIq-I8zxxyCJuaFbihcKhDEa10wI2EXVYz-aD3iboYpjiYYgX3zlII6p_FvZKpsOdFcjekZ99iIvk_Vup7DSGfBiGLPvvJXkMeo5T/s600/heat_waves_us.png"/>
<div class="bottom-right">Figure 2: Changes in the number of heat waves per year (frequency) and the number of days between the first and last heat wave of the year (season length)</div></a></div>
Finally, Figure 3: shows the growth of US electricity generation as a function of years (data from
<a href="https://en.wikipedia.org/wiki/Solar_power_in_the_United_States"><i>https://en.wikipedia.org/wiki/Solar_power_in_the_United_States</i></a>).
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgac0WsjSF69WiFvd1eqcWqHLLBSJqOFP0faSeebIv7Thop7_TBhy4JphoNsCvmbhStK8hYgc4QbW62P1rRfM8ZQ_Y-0pW_bDMJONgdc4tEgH9aCnGW3FIc0SCR-k54bevyWwblsNtESM37/s2048/USA_PV_.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="500" data-original-height="1397" data-original-width="2048" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgac0WsjSF69WiFvd1eqcWqHLLBSJqOFP0faSeebIv7Thop7_TBhy4JphoNsCvmbhStK8hYgc4QbW62P1rRfM8ZQ_Y-0pW_bDMJONgdc4tEgH9aCnGW3FIc0SCR-k54bevyWwblsNtESM37/s400/USA_PV_.png"/>
<div class="bottom-right">Figure 3: Trends in commercial solar electrical power generation in the top five states, 1990–2012 (U.S. EIA data)</div></a></div>
<h3>
As can be seen from Figure 2 (and Figure 3 and Figure 1), the histogram showing the frequency of HW occurrence, the increase in the number of cases from 2000 to 2010 is greater compared to earlier periods. </b><br />
There is even a geographical correlation (Figure 1 and 2) -
which, of course, can also be presented as a fact that power plants are built close to large agglomerations. </b><br />
In general, the UHI effect should be considered as the sum of the UHI + Solar Heat Island . </b><br />
I therefore postulate the following hypothesis: </b><br />
- Until 2000, urban heat islands were mainly responsible for heat waves. </b><br />
- After 2000, the increase in heat waves can be further attributed to solar power plants.</b><br />
</h3>
</b><br />
</b><br />
<li>
<h4>Germany case</h4>
How does it look like in Europe ?</b><br />
The density of solar power plants in Europe [<a href="https://openinframap.org"><i>https://openinframap.org</i></a>] - Figure 4 (below) shows that the largest solar farm infrastructure is in Germany (Let's forget about the UK for a while).
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgA-lLzQOXPs9WybuEjw5pv7lPC4GK2oVrNj3dlXVzcY6InCIIjoH-m2jY_PWZjn9gQJdyPbVr2nEWHy0Fb7H15lHMfzZR4DOpmuTJQvTSggP2fUExvfNrznMZAA39D9SVM-xnbZLJaSSIB/s1070/europe_PV_plants.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="700" data-original-height="599" data-original-width="1070" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgA-lLzQOXPs9WybuEjw5pv7lPC4GK2oVrNj3dlXVzcY6InCIIjoH-m2jY_PWZjn9gQJdyPbVr2nEWHy0Fb7H15lHMfzZR4DOpmuTJQvTSggP2fUExvfNrznMZAA39D9SVM-xnbZLJaSSIB/s400/europe_PV_plants.png"/>
<div class="bottom-right">Figure 4: density of Solar Power plants in Europe</div></a></div>
Now, numbers of Heat Waves (HW) in Germany:</b><br />
I was not able to find detailed numbers for heat waves in entire Germany. Instead, from the report
<a href="https://www.dwd.de/DE/leistungen/nationalerklimareport/download_report_auflage-4.pdf;jsessionid=9E85A870FF67B45095ED7A811C40B5E4.live21074?__blob=publicationFile&v=11"><i>Nationaler Klimareport Klima ‒ Gestern, heute und in der Zukunft</i></a>
you can decipher the following number of heat waves counted for 5 major cities in Germany (Hamburg, Dresden, Frankfurt am Main, Mannheim and Muenchen).
It shows that (<b>Table 1</b>):</b><br />
</b><br />
Year period:<d style="padding-left:4em;"></d>Nr of Heat weaves:</b><br />
1950-1960<d style="padding-left:4em;"></d>5</b><br />
1960-1970<d style="padding-left:4em;"></d>4</b><br />
1970-1980<d style="padding-left:4em;"></d>5</b><br />
1980-1990<d style="padding-left:4em;"></d>3</b><br />
1990-2000<d style="padding-left:4em;"></d>17</b><br />
2000-2010<d style="padding-left:4em;"></d>15</b><br />
2010-2019<d style="padding-left:4em;"></d>19</b><br />
</b><br />
It shows a huge increase in Heat Waves since the 1990s.
Followed by a small stabilization for the periods 1990-2010 and another increase for the last years after 2010.
The amount of electricity generated by photovoltaics has been increasing since 2000. And since 2010, there is almost a jump in the growth of solar electricity in Germany (see Figure 5).<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh93DpcuYNp7_z-DOQ5dmmlvqR0qWmg6U9CnOmq_S1hqh_nitfJhck_9iZyzA6My39IzrBJRGkzrsIvrG1wJDamr2bk3hmkLjMCQzDi8zXm3opPLmcI-DQ5IZwiqmYIoDckAYPeBgtCKWfv/s1079/Germany_PV_.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="700" data-original-height="700" data-original-width="1079" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh93DpcuYNp7_z-DOQ5dmmlvqR0qWmg6U9CnOmq_S1hqh_nitfJhck_9iZyzA6My39IzrBJRGkzrsIvrG1wJDamr2bk3hmkLjMCQzDi8zXm3opPLmcI-DQ5IZwiqmYIoDckAYPeBgtCKWfv/s400/Germany_PV_.png"/>
<div class="bottom-right">Figure 5: Trends in commercial solar electrical power generation in Germany</div></a></div>
<h3>
As can be seen from Table 1 and Figures: 4 and 5, the increase in the number of Heat Waves in Germany after 2000 (after 2010 especially), is correlated with the generated Solar Energy.
</h3>
</b><br />
</b><br />
</ul>
</ol>
<h2>Conclusions:</h2>
<h3>
Without access to detailed data, it is difficult to conduct a more detailed analysis of the correlation between the number of the HW and the increase in electricity produced by the growing number of solar plants. </b><br />
However, the suggestion given by the available data presented above is at least worth a closer analysis.</b><br />
</b><br />
The ideas in the European document <a href="https://www.solarpowereurope.org/fit-for-a-solar-future-commission-climate-package-is-landmark-achievement-but-more-ambition-is-possible/">
<i><b>
"Fit for in the a Solar Future: Commission climate package is landmark achievement but more ambition is possible"
</b></i></a> could prove devastating .
</h3>
</b><br />
</b><br />
Take careBogdan Lobodzinskihttp://www.blogger.com/profile/09289422592105592639noreply@blogger.com2tag:blogger.com,1999:blog-4412689276279452501.post-47935145012083546162021-08-08T18:40:00.021+02:002021-08-25T11:01:59.005+02:00Are solar power plants really green energy ?<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-4793NQQM5T"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-4793NQQM5T');
</script>
<!-- MathJax usage example -->
<script type="text/x-mathjax-config">MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}});</script>
<script type="text/x-mathjax-config">MathJax.Hub.Config({TeX: { equationNumbers: { autoNumber: "AMS" }}});</script>
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"> </script>
<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"></script>
<h3>
Are solar power plants really a good solution for energy production ? <br /><br />
Everywhere you hear it's a clean way to get energy. So let's see if it really is.<br /><br />
</h3>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTETU0T-fkKf_OC_dimFInUdVKoJN41DzMFOUc6eK-_KW_g5brR7rJfsB960wl8-xCfNhXeALGujGm3KOTbFb-0v5aElgSFb0QEXxw2XuU_7ivSF5vH8rYXDTE5Yjb_d5yjkhisuv68w4F/s1024/f16.jpg" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="400" data-original-height="719" data-original-width="1024" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTETU0T-fkKf_OC_dimFInUdVKoJN41DzMFOUc6eK-_KW_g5brR7rJfsB960wl8-xCfNhXeALGujGm3KOTbFb-0v5aElgSFb0QEXxw2XuU_7ivSF5vH8rYXDTE5Yjb_d5yjkhisuv68w4F/s400/f16.jpg"/><div class="bottom-right">image copied from https://dziennikzarazy.pl</div></a></div>
<h3>Introduction:</h3></b><br />
As an introduction, a few words about how the sun heats the earth and how a solar panel works. </b><br />
The infrared part of the solar spectrum (wavelength > 700 nm, about 50% of the energy) is directly responsible for heating the Earth's surface and air.
This kind of solar radiation exposure on the Earth is considered normal.</b><br />
Photovoltaic (PV) panels operate in the visible part of the solar spectrum: from 350 nm to 750 nm (approximately). It is this part of the solar spectrum that does not normally heat the environment (the part between 700 and 750 nm does). Energy from this range of radiation is partially (14-22%) converted into electrical energy and the rest (78-86%) is converted into thermal energy.</b><br />
PVs thus act as a converter of the visible part of the sunlight spectrum (not infrared) to heat (infrared). In other words, they increase the amount of heat compared to the transmitted infrared portion of sunlight.
To simplify the estimation, let's assume that 16% of the energy consumed by PV is converted into electrical energy.
The rest is dissipated in a form of heat.
Data about the operational power produced by solar panels are given in units of electric power generated by Photovoltaic (PV) systems.
I.e. 84% of the energy dissipated into heat is not included in these values.
<br /><br />
<h3>Story nr 1 - local:</h3></b><br />
The first bad effect, fully local one, is a local increase of temperature near solar power plants
<a href="https://www.nature.com/articles/srep35070"><i>The Photovoltaic Heat Island Effect: Larger solar power plants increase local temperatures</i></a>.
Another interesting article about a super solar power plant in the Sahara, taking into account the local effects of large heat dissipation around solar panels was written by Jack Marley
<a href="https://theconversation.com/solar-panels-in-sahara-could-boost-renewable-energy-but-damage-the-global-climate-heres-why-153992"><i>Solar panels in Sahara could boost renewable energy but damage the global climate – here’s why</i></a>.
<br /><br />
<h3>Story nr 2 - global:</h3></b><br />
Now let's try to look at things globally. <br />
the number of solar panels on earth is growing almost exponentially every year.
According to the
<a href="https://www.irena.org/publications/2021/March/Renewable-Capacity-Statistics-2021"><i>Renewable Capacity Statistics 2021</i></a> website,
at 12.2.2021 the world had 714 GW of operational Photovoltaic (PV) systems.
Let's try to translate this value into a carbon footprint by treating all operating PV systems as one.<br />
<b>Some assumptions</b> at the beginning:
<ol>
<li>80% of initial radiation is dissipated by the solar panel into heat.<br />
<li>1 kW of Solar Panel System covers an area about 8 m2 .<br />
<li>Solar irradiance: the averaged over the year and the day, the Earth's atmosphere receives radiation 340 W/m2 from the sun
<a href="https://en.wikipedia.org/wiki/Solar_irradiance"><i>https://en.wikipedia.org/wiki/Solar_irradiance</i></a>.
The PV systems are distributed across the earth, so I assume that the average solar radiation used in the calculations is: 150 W/m2<br />
<li>Average equivalent of the carbon footprint of the 1 kWh as 0.5 . Obviously, we have different CO2 emission intensity for different countries per 1 kwh. The value 0.5 corresponds to the average values over sunny countries.<br /> More detailed data by country and region is available on the website
<a href="https://www.carbonfootprint.com/"><i>https://www.carbonfootprint.com/</i></a>.<br />
<li>Conversion from W to kWh: 1 W == 0.001 kWh<br />
</ol>
<br /><br />
The 714 GW of operational PV systems creates a total surface (St) equal to:
<b><i>St = 5712000000000 m2 = 5 712 000 km2 </i></b>.<br />
It corresponds to the area size between India (3 287 263 km2) and Australia (7 741 220 km2). Going furthermore, considering the surface of the earth, the surface of the solar panels is 1.1% of its surface.
Since we are talking about energy produced by the operational PV systems, we can assume that this is 16% of the energy converted into electricity.
Therefore, the dissipated energy into the heat energy (Ht) produced at the same time is: <br /><br />
<b><i>Ht = 714 [GW] (84 [%]/16 [%]) = 3748.5 [GW]</i></b>.<br />
Now let's calculate the carbon footprint of this amount of heat. Using the conversion from W to kWh (1 W == 0.001 kWh), our amount of heat energy (Ht) is equivalent to
<b><i>3 748 500 000 kWh ~ 3.75 GWh </i></b>.<br /><br />
This value corresponds to the carbon footprint (assumption that the carbon footprint of the 1 kWh is 0.5):</b>.<br />
<b><i>1874250000 kg CO2 /hour</i></b>.<br />
or<br />
<b><i>16 418 430 000 000 kg CO2 / year ~ 16.4 GT /year</i></b>.<br />
This is a huge value and corresponds to the <b><i>52%</i></b> (!) of total CO2 emission in 2020 (<b><i>31.5 GT / year</i></b>) !
Thus, we have an unexpected situation because it looks like solar panels are far worse at producing energy than fossil fuels.
Let's see a comparison of the increase in operational energy of PV systems to the change in global temperature anomaly as a function of years.
Temperature data from
<a href="https://climate.nasa.gov/vital-signs/global-temperature/"><i>https://climate.nasa.gov/vital-signs/global-temperature/</i></a>.<br />
Please note the different scales of the data presented in the figure. The operational power generated by solar panels is shown in red, the temperature anomalies in blue.
Almost perfect correlation !
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5UoE9zVSvTFPfAVK7cxo5ymbWMhHK0ZmeDpIQCJS2DPeTdWjXqngIodXMiGJMQMY8gp-h0NVD_Prs30REPmKAmXE2OKMx2Trtm6STI6HaOmnt8q_4R3hT4J_lLbBR_0DrV6bM7eOt3rnC/s1108/f1.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="600" data-original-height="444" data-original-width="1108" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5UoE9zVSvTFPfAVK7cxo5ymbWMhHK0ZmeDpIQCJS2DPeTdWjXqngIodXMiGJMQMY8gp-h0NVD_Prs30REPmKAmXE2OKMx2Trtm6STI6HaOmnt8q_4R3hT4J_lLbBR_0DrV6bM7eOt3rnC/s600/f1.png"/></a></div>
<h3>Summary:</h3></b><br />
<ol>
<li>Is CO2 really responsible for warming the earth ?
<li>The correlation between temperature anomalies and the amount of energy produced by PV systems is surprising to say the least !
<li>By building PV systems, we create smaller or larger heat islands around them, disturbing the natural energy balance in such an area.
The PV systems produce more than 4 times more thermal energy than they produce in the form of electricity.
<li>By producing solar panels we pollute our environment (+ the need to recycle).
</ol>
<br /><br />
The final conclusions rather indicate that solar power plants do more damage than conventional ones.<br /><br />
Now, the natural question is whether we are already seeing a correlation of climate change with the increase in heat islands around solar power plants.
<ol>
<li> Is there a correlation between the heat energy produced by the increasing number of solar plants and the increase in air temperature via the Heat Islands effect ?
<li> Is there a correlation between the frequency of Heat Waves and the thermal energy produced by solar power plants ?
</ol>
About this there is the next text <a href="https://dataanalyticsforall.blogspot.com/2021/08/are-solar-power-plants-really-green_12.html"><i>Are solar power plants really green energy ? Continuation </i></a> <br /><br />
I would be grateful if someone could point out to me the error I am making in the above approximations.<br /><br />
Take care Bogdan Lobodzinskihttp://www.blogger.com/profile/09289422592105592639noreply@blogger.com0tag:blogger.com,1999:blog-4412689276279452501.post-57552531037351882002021-07-15T12:58:00.004+02:002021-07-15T13:40:10.999+02:00Global Real Estate market: a non-expert view<!-- MathJax usage example -->
<script type="text/x-mathjax-config">MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}});</script>
<script type="text/x-mathjax-config">MathJax.Hub.Config({TeX: { equationNumbers: { autoNumber: "AMS" }}});</script>
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"> </script>
<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"></script>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihw0g3K_U6nrkxnsbBd1drxNO24YtNUMsTXc9pdeJ7FKtJEr_uPKLtEvxNS-Ex2Kgx00UOrszC0yAFqiJwG6FRWr3kkIpgpGzczfxJlsKgBZkevHnE6kaUj_bU9lO3nRyU1L5dNj_5EdSO/s790/orchestra.webp" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="310" data-original-width="790" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihw0g3K_U6nrkxnsbBd1drxNO24YtNUMsTXc9pdeJ7FKtJEr_uPKLtEvxNS-Ex2Kgx00UOrszC0yAFqiJwG6FRWr3kkIpgpGzczfxJlsKgBZkevHnE6kaUj_bU9lO3nRyU1L5dNj_5EdSO/s320/orchestra.webp"/></a></div>
<b>About:</b><br />
Most analyses of real estate prices compare their behavior over time with other economic indicators, but this is done for a specific country or independently for a group of countries.
This text proposes a comparison of real estate prices between different countries by calculating the correlation between them.
<br /><br />
<b>Data:</b><br />
I came across some data on real estate prices (https://data.world/finance/international-house-price-database).
This data contains values of 4 quantities (with short descriptions found in wikipedia and other sources):
<ol>
<li> the house price index (HPI): <br />
measures the price changes of residential housing as a percentage change from some specific start date (starting in 1975).</li>
<li> the house price index expressed in real terms (RHPI):<br />
the deflated house price index (or real house price index) is the ratio between the house price index (HPI)</li>
<li> the personal disposable income index (PDI):<br />
measures the after-tax income of persons and nonprofit corporations.</li>
It is calculated by subtracting personal tax and nontax payments from personal income.
<li> the personal disposable income expressed in real terms index (RPDI) :<br />
the deflated PDI. </li>
</ol>
<br /><br />
<b>Analysis:</b><br />
As input we have time series with specified quantity $Q$ (HPI, RHPI, PDI or RPDI) for N (N=24) countries (
'Australia', 'Belgium', 'Canada', 'Switzerland', 'Germany', 'Denmark', 'Spain', 'Finland', 'France',
'UK', 'Ireland', 'Italy', 'Japan', 'S. Korea', 'Luxembourg', 'Netherlands', 'Norway', 'New Zealand',
'Sweden', 'US', 'S. Africa', 'Croatia', 'Israel', 'Slovenia').
In order to calculate correlations between countries I do the following calculations:
<ol>
<li> for a given quantity $Q$ I normalize all data independently to the range [0,1],</li>
<li> I determine the two-site correlation function for each timestamp $t$
\begin{equation}
\label{1}
Corr_{country, another\_country} \left(t \right) = Q_{country}\left(t \right) Q_{another\_country}\left(t \right)
\end{equation}
which is used finaly, for calculation of the global correlation for each country:
\begin{equation}
\label{2}
C_{country}\left(t \right) = \frac{\sum_{another\_country=1}^{N} Corr_{country, another\_country} \left(t \right) }{N}
\end{equation}
</li>
</ol>
<br />
The dynamics of the thus calculated correlation function $C_{country}\left(t \right)$ for the quantity $Q=$RHPI and for all countries is shown in Figure 1.
For the quantity RHPI, the correlations are most apparent.
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLvEqw3zKRmcC8XjYnjStLV82ZsQ0IlUfhxQwMdfGdadQRlfkw_goqrWcr7we0V2CK5HSlIlV6iITOQgB_gqFbm5P_ODyuB-ZACkPSKGio3hYu9mWRz5aLfKHFgQXFSFl-9rDwXoVaUxxN/s1436/rhpi.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="600" data-original-height="1399" data-original-width="1436" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLvEqw3zKRmcC8XjYnjStLV82ZsQ0IlUfhxQwMdfGdadQRlfkw_goqrWcr7we0V2CK5HSlIlV6iITOQgB_gqFbm5P_ODyuB-ZACkPSKGio3hYu9mWRz5aLfKHFgQXFSFl-9rDwXoVaUxxN/s600/rhpi.png"/></a></div>
The financial crisis of 2008 is very well visible in this figure (the yellow cylinder between 2007 and 2008).
<br /><br />
<b>A couple of observations for the time period 2007-2008:</b><br />
The longest increase of the value of RHPI is seen for the US, lasting since about 1995. Other countries behave in a weakly correlated way during this period. A strong correlation between countries starts to be visible from about 2005 and quickly increases until the crash around 2007-2008.
It looks as if most countries joined the global real estate market at the same time (around 2005) and at a given signal decided to crash -
<i>"ready, steady, crash!"</i>. The market seems to be too well orchestrated.
I know that some people will say that this is a normal behavior because it is a global market, all markets are interconnected, etc.
However, please note that it is hard to see any trace of the dot-com crisis of 2000-2003 (dot-com bubble) in this picture.
Another comment on the picture concerns the number of countries participating in the crash.
There are some countries excluded (Japan, S. Korea, Israel) or weakly participating in the process (Australia, Canada, Switzerland, Germany, New Zealand, Sweden, Croatia). Altogether, 13 countries out of 24 are affected by the crash.
<br /><br />
<b>Observations for the time period 2008-now:</b><br />
The first observation is that more countries are now correlated (and not because of COVID). Still outside the correlated market are Japan, S. Korea and Croatia.
Spain, Italy are not correlated (accident at work?).
<br /><br />
<b>Summary:</b><br />
the management of the real estate market is becoming more and more concise (only one player? ): currently 19 countries out of 24 (crisis 2008: 13 countries).<br />
<b><i>Question: when will this player decide to make another crash?</i></b>
<br /><br />
If anyone has knowledge of more data I would be grateful for providing it. Bogdan Lobodzinskihttp://www.blogger.com/profile/09289422592105592639noreply@blogger.com0tag:blogger.com,1999:blog-4412689276279452501.post-10006004884018636122021-05-14T15:17:00.002+02:002021-05-14T21:50:35.532+02:00Semantic Search: Too many or too few matching pairs ? Dynamically determined selection threshold for matched query pairs<!-- MathJax usage example -->
<script type="text/x-mathjax-config">MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}});</script>
<script type="text/x-mathjax-config">MathJax.Hub.Config({TeX: { equationNumbers: { autoNumber: "AMS" }}});</script>
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"> </script>
<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"></script>
In my recent projects on applying Natural language processing (NLP) methods, a large part is based or contains parts based on <i><b>semantic search</i></b>. In a nutshell, we have certain queries (phrases or sentences) on one side and a set of other texts on the other side and our goal is to find the best matching texts to our query. Simply writing, we need to perform semantic search on our data set. <br />
<br />
For those who are less familiar with semantic search, let me define the term as: <br />
<i><b>a kind of lexical comparison of two texts with dominant part of understanding the content of words and phrases, and relations between words or phrases in queries being compared.
</i></b><br />
<br />
While working with semantic search, I encountered a problem with defining the acceptance threshold for my findings. This problem becomes significant when the texts being compared are of significantly different lengths and/or contain significantly different degrees of content. In other words, the problem becomes serious when we deal with the so-called asymmetric semantic search
<a href="https://www.sbert.net/examples/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search"><b>https://www.sbert.net/examples/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search</b></a>.<br />
<br />
In the following, I would like to share a method which allows to dynamically determine the acceptance threshold of found pairs of matched entries. This method may determine the final solution or be a prelude to a more modified version.
The project code is available on my Github account
<a href="https://github.com/Lobodzinski/Semantic_Search__dynamical_estimation_of_the_cut_of_for_selected_matches"><b>"https://github.com/Lobodzinski/Semantic_Search__dynamical_estimation_of_the_cut_of_for_selected_matches"</b></a>.<br />
<br />
Let's start describing the method.
<ul>
<li> <b>The data:</b><br />
As an experiment I will use reuters data, known as reuters-21578 (https://paperswithcode.com/dataset/reuters-21578).
While searching for an answer to our query, we should try to be as precise as possible in formulating the questions.
However, sometimes it is not possible. For the purpose of this mini-project, let's formulate <b>our queries</b> in a general form.<br />
<i>'Behavior of the precious metals market'</i>,<br />
<i>'What is the situation in metal mines'</i>,<br />
<i>'Should fuel prices expect to rise ?'</i>,<br />
<i>'Will food prices rise in the near future ?'</i>,<br />
<i>'I am looking for information about food crops.'</i>,<br />
<i>'Information on the shipbuilding industry'</i><br />
<li> <b>Generation of matched pairs between the queries and the Reuter's texts:</b><br />
Our goal is to perform a semantic search. First, we need to generate matching text pairs. In the following I will use the code that is part of the sentence-transformers package
<a href="https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/semantic-search"><b>"https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/semantic-search"</b></a>.<br />
<li> <b>Similarities and the threshold calculation:</b><br />
Having calculated the similarity values, we can move to the main point - choosing the similarity threshold.
First, let's look at the similarity plot in the test function for a fixed query (<i><b>'What is the situation in metal mines ?'</b></i>)
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6XTLGVGv70oRovEmJFmrTguHkmrtO7cyM6uiDgaBXN8uOga1CnHZZ3fJUg_dirIXLAVzEnze-YnMnPAZ0gRN5E2q49RHlMCc3KQPLrWJsTaZ1xQMoTUCOOJWlCTjFv4nnKkAVK2cuH6cr/s0/fig1.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" data-original-height="898" data-original-width="747" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6XTLGVGv70oRovEmJFmrTguHkmrtO7cyM6uiDgaBXN8uOga1CnHZZ3fJUg_dirIXLAVzEnze-YnMnPAZ0gRN5E2q49RHlMCc3KQPLrWJsTaZ1xQMoTUCOOJWlCTjFv4nnKkAVK2cuH6cr/s0/fig1.png"/></a></div>
It is obvious that not all matches shown in the Figure are good (acceptable). So how to choose the threshold value of similarity ? <br />
The proposed method is fully heuristic and is based on the calculation of the <i><b>elbow point</b></i> of the curve of the similarity as a function of the matched text.
If we take a look at the examined functional relationship, we can see that this curve (almost always) has an elbow point beyond which the similarity between the found texts and our query changes very slowly. To calculate the "cut off" point (elbow point) I used the KneeLocator package
(<a href="https://pypi.org/project/kneed/"><b>"https://pypi.org/project/kneed/"</b></a>).
The function KneeLocator (<a href="https://kneed.readthedocs.io/en/stable/parameters.html#s"><b>"https://kneed.readthedocs.io/en/stable/parameters.html#s"</b></a>) contains a sensitivity parameter <b>S</b> which can be used to better select our elbow point.<br />
The following code and its output shows the details of the calculation and its results. For details, please check <a href="https://github.com/Lobodzinski/Semantic_Search__dynamical_estimation_of_the_cut_of_for_selected_matches"><b>"https://github.com/Lobodzinski/Semantic_Search__dynamical_estimation_of_the_cut_of_for_selected_matches"</b></a>.<br />.<br />
<br />
This part, for each query reads all matched sentences gathered from the Reuters data together with calculated similarities.
The <b>threshold</b> is calculated by the function <b>KneeLocator</b>, this part is denoteb by bold text in the code below.<br />
<pre class="prettyprint"><code>
# loop over our list of queries:
for query in result_df['query'].unique():
sentences_ = result_df[(result_df['query'] == query)]['sentence'].values
x = []
for y_ in sentences_:
x.append(y_[:60]+'...')
<b>
# similarities between the query and the matched Reuter's texts:
y1 = result_df[(result_df['query'] == query)]['score'].values
# determne elbow value:
x0 = list(range(len(y1)))
kn = KneeLocator(x0, y1, S=1., curve='convex', direction='decreasing')
elbow_1 = kn.knee
print ('Elbow point values:\n tekst_id=', elbow_1, \
'; threshold value=',y1[elbow_1])
</b>
</code></pre>
Resulting value (the threshold point) is presented on the next Figure :
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuKSDzYT7ToLratKqRLK_8igXOjJ8Bmifk9-W2db2ncDM5B1V8ihHtmWfo567fnFN0MXQbpF7Jjz8tyoSxAFPBQT71IGROCXzX9IXVG2VygW3t8DkEX4X7RYGptFcOef7Gqi0W8nBc70_N/s0/fig2.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" data-original-height="898" data-original-width="747" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuKSDzYT7ToLratKqRLK_8igXOjJ8Bmifk9-W2db2ncDM5B1V8ihHtmWfo567fnFN0MXQbpF7Jjz8tyoSxAFPBQT71IGROCXzX9IXVG2VygW3t8DkEX4X7RYGptFcOef7Gqi0W8nBc70_N/s0/fig2.png"/></a></div>
<br />
So, for our query (<i><b>'What is the situation in metal mines ?'</b></i>), we found 14 texts in the Reuter's set.
Below I have copied the first 3 and last 2 texts from the set of accepted texts (the whole set is too long to present here). The reader can judge for themselves the similarity between the query and the text.<br />
For comparison, I have also added the text which is not accepted (15), which is not accepted by this method.<br />
Accepted texts:<br />
<i>
1 <br /> SIX KILLED IN SOUTH AFRICAN GOLD MINE ACCIDENT Six black miners have been killed and
two injured in a rock fall three km underground at a South African gold mine, the
owners said on Sunday. lt Rand Mines Properties Ltd>, one of South Africa s big six
mining companies, said in a statement that the accident occurred on Saturday morning
at the lt East Rand Proprietary Mines Ltd> mine at Boksburg, 25 km east of Johannesburg.
A company spokesman could not elaborate on the short statement.<br />
2 <br /> NORANDA BEGINS SALVAGE OPERATIONS AT MURDOCHVILLE lt Noranda Inc> said it began salvage
operations at its Murdochville, Quebec, mine, where a fire last week killed one miner
and caused 10 mln dlrs in damage. Another 56 miners were trapped underground for as long
as 24 hours before they were brought to safety. Noranda said the cause and full extent
of the damage is still unknown but said it does know that the fire destroyed 6,000 feet
of conveyor belt. Noranda said work crews have begun securing the ramp leading into the
zone where the fire was located. The company said extreme heat from the fire caused
severe rock degradation along several ramps and drifts in the mine. Noranda estimated that
the securing operation for the zone will not be completed before the end of April. Noranda
said the Quebec Health and Safety Commission, the Quebec Provincial Police and Noranda
itself are each conducting an investigation into the fire. Production at the mine has been
suspended until the investigations are complete. The copper mine and smelter produced
72,000 tons of copper anodes in 1986 and employs 680 people. The smelter continues to
operate with available concentrate from stockpiled supplies, Noranda said. Reuter <br />
3 <br /> NORTHGATE QUEBEC GOLD WORKERS END STRIKE Northgate Exploration Ltd said hourly paid workers at its
two Chibougamau, Quebec mines voted on the weekend to accept a new three year contract offer and
returned to work today after a one month strike. It said the workers, represented by United Steelworkers
of America, would receive a 1.21 dlr an hour pay raise over the life of the new contract and improved
benefits. Northgate, which produced 23,400 ounces of gold in first quarter, said that while the strike
slowed production, We are still looking forward to a very satisfactory performance. The Chibougamau
mines produced 81,500 ounces of gold last year. <br />
....<br />
13 <br /> NORANDA BRUNSWICK MINERS VOTE MONDAY ON CONTRACT Noranda Inc said 1,100 unionized workers at its
63 pct owned Brunswick Mining and Smelter Corp lead zinc mine in New Brunswick would start voting
Monday on a tentative contract pact. Company official Andre Fortier said We are hopeful that we can
settle without any kind of work interruption. Fortier added that Brunswick s estimated 500 unionized
smelter workers were currently meeting about a Noranda contract proposal and would probably vote next
week. The mine s contract expires July 1 and the smelter s on July 21. The Brunswick mine produced
413,800 tonnes of zinc and 206,000 tonnes of lead last year at a recovery rate of 70.5 pct zinc and
55.6 pct lead. Concentrates produced were 238,000 tonnes of zinc and 81,000 tonnes of lead. <br />
14 <br /> COMINCO lt CLT> SETS TENTATIVE TALKS ON STRIKE Cominco Ltd said it set tentative talks with three
striking union locals that rejected on Saturday a three year contract offer at Cominco s Trail and
Kimberley, British Columbia lead zinc operations. The locals, part of United Steelworkers of America,
represent 2,600 production and maintenance workers. No date has been set for the talks, the spokesman
replied to a query. The spokesman said talks were still ongoing with the two other striking locals,
representing 600 office and technical workers. Production at Trail and Kimberley has been shut down
since the strike started May 9. Each of the five locals has a separate contract that expired April
30, but the main issues are similar. The Trail smelter produced 240,000 long tons of zinc and 110,000
long tons of lead last year, while the Sullivan mine at Kimberley produced 2.2 mln long tons of ore
last year, most for processing at Trail. Revenues from Trail s smelter totaled 356 mln Canadian dlrs in 1986.
</i>
<br />
<br />
Not Accepted texts:<br />
<i>
15<br />
VESSEL LOST IN PACIFIC WAS CARRYING LEAD The 37,635 deadweight tonnes bulk carrier Cumberlande,
which sank in the South Pacific last Friday, was carrying a cargo which included lead as well as
magnesium ore, a Lloyds Shipping Intelligence spokesman said. He was unable to confirm the tonnages
involved. Trade reports circulating the London Metal Exchange said the vessel, en route to New Orleans
from Newcastle, New South Wales, had been carrying 10,000 tonnes of lead concentrates. Traders said
this pushed lead prices higher in early morning trading as the market is currently sensitive to any
fundamental news due to its finely balanced supply demand position and low stocks. Trade sources
said that 10,000 tonnes of lead concentrates could convert to around 5,000 tonnes of metal, although
this depended on the quality of the concentrates. A loss of this size could cause a gap in the supply
pipeline, particularly in North America, they noted. Supplies there have been very tight this year
and there is a strike at one major producer, Cominco, and labour talks currently being held at
another, Noranda subsidiary Brunswick Mining and Smelting Ltd. <br />
16 <br /> LTV lt QLTV> TO NEGOTIATE WITH STEELWORKERS LTV Corp s LTV Steel Corp said it agreed to resume
negotiations with the United Steelworkers of America at the local plant levels, to discuss those
provisions of its proposal that require local implementation. The local steelworker union narrowly
rejected a tentative agreement with the company on May 14, it said. LTV also said it agreed to
reopen its offer contained in the tentative agreement reached with the union s negotiating committee
as part of a plan to resolve problems through local discussions.<br />
</i>
<br />
<br />
As you can see, the unaccepted texts are not directly related to mining, which is what we are asking about in our query.
</ul>
<br />
<br />
Thanks for reading, please feel free to comment and ask questions if anything is unclear.
Bogdan Lobodzinskihttp://www.blogger.com/profile/09289422592105592639noreply@blogger.com0tag:blogger.com,1999:blog-4412689276279452501.post-16822155957666762262021-04-07T12:22:00.008+02:002021-05-14T21:50:13.052+02:00Weird aspects of ARIMA - how to increase the accuracy of predictions by exogenous data<!-- MathJax usage example -->
<script type="text/x-mathjax-config">MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}});</script>
<script type="text/x-mathjax-config">MathJax.Hub.Config({TeX: { equationNumbers: { autoNumber: "AMS" }}});</script>
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"> </script>
ARIMA models are commonly used to predict time series. Most importantly, if the ARIMA model is properly chosen, the prediction error is often so small that it is a very difficult task to find better predictions using more sophisticated methods.
Since this is not a note about an introduction to ARIMA models I replace the typical introduction to the models with this link which describes the method much better:
<a href="https://otexts.com/fpp2/"><b>Forecasting: Principles and Practice"</b> (2nd ed)</a> by Rob J Hyndman and George Athanasopoulos (<i>https://otexts.com/fpp2/</i>).<br />
One of the variants of ARIMA models is a version using exogeneous data, to which this note is dedicated.<br />
<i><b>
It is not widely known that this version of ARIMA models is strongly dependent on the factor by which we multiply the exogeneous data.
Generalizing, we can say that the larger the factor, the smaller the prediction error of the model.
</b>
</i>
<br />
<br />
To begin with, let us start with a heuristic proof that the factor determining the ratio between the exogeneous data and the target, can influence the prediction error of the algorithm.<br />
<br />
For simplicity, let us omit the part of the time series description which in ARIMA models is responsible for the differentiable part, i.e. we assume that the parameter $d=0$. In other words, we will use the formulations of the ARMAX model. <br />
Then, a given time series X, in ARMAX, can be expressed generally as:
\begin{equation}
\label{eq1}
X_{t}= c + \epsilon_{t} + AR_{t} + MA_{t} + exog_{t}
\end{equation}
where $c$: constant, $\epsilon_{t}$: white noise, $AR_{t}$: Autoregression part, $MA_{t}$: Moving average part, $exog_{t}$: exogeneous variable(s).<br/>
Now, let's define the exogeneous part $exog_{t}$ as:
\begin{equation}
\label{eq2}
exog_{t} = X_{t} \alpha + \gamma_{t}
\end{equation}
where $\alpha$ is some proportionality factor, $\gamma_{t}$ - white nose. <br/>
Therefore, introducing eq. \ref{eq2} to eq. \ref{eq1} we will get:
\begin{equation}
\label{eq3}
X_{t} = c + \epsilon_{t} + AR_{t} + MA_{t} + X_{t} \alpha + \gamma_{t}
\end{equation}
And after rearranging some terms
\begin{equation}
\label{eq4}
X_{t}\left(1-\alpha\right) = c + \left(\epsilon_{t} + \gamma_{t}\right) + AR_{t} + MA_{t}
\end{equation}
But the part $AR_{t}$ can be written as $AR_{t} = \sum_{i}^{p} \phi_{i} X_{t-i}$
and similarly the $MA_{t}$ component
$MA_{t} = \langle X \rangle + \beta_{t} + \sum_{i}^{q} \theta_{i} \beta_{t-i}$
with:
$\beta_{n}$ as a white noise, $ \langle X \rangle $ - the expectation value of the $X$, $\theta_{i}$ - parameters of the model. <br/>
With the above in mind, the eq \ref{eq4} becomes :
\begin{equation}
\label{eq5}
X_{t} = \frac{c + \epsilon_{t} + \gamma_{t} + \alpha \langle X \rangle}{1-\alpha} +
\sum_{i} \hat{\phi}_{i} X_{t-i} +
\langle X \rangle + \hat{\beta}_{t} + \sum_{i} \theta_{i} \hat{\beta}_{t-i}
\end{equation}
where I introduced notation: $\hat{\beta}_{t-i} = \frac{\beta_{t-i}}{1-\alpha}$, $\hat{\phi}_{i} = \frac{\phi_{i}}{1-\alpha}$ and
$\hat{\theta}_{i} = \frac{\theta_{i}}{1-\alpha}$.<br/>
Using definitiona of the $AR_{t}$ and $MA_{t}$ we can rewrite eq \ref{eq5} into the final form:
\begin{equation}
\label{eq6}
X_{t} = \frac{c + \epsilon_{t} + \gamma_{t} + \alpha \langle X \rangle}{1-\alpha} + \hat{AR}_{t} + \hat{MA}_{t}
\end{equation}
where components $\hat{AR}_{t}$ and $\hat{MA}_{t}$ correspond to the definitions $AR_{t}$ and $MA_{t}$ but with $\hat{\phi}$,
$\hat{\theta}$ and $\hat{\beta}$ coefficients. <br/>
The first component of Equation \ref{eq6} is the most interesting !. <br/>
In the case of ARIMA without exogeneous data, the forecasting error is determined by the $\epsilon$ .
Now, this error is replaced by the expression
\begin{equation}
\label{eq7}
\epsilon_{t} \longrightarrow \frac{\epsilon_{t} + \gamma_{t}}{1-\alpha}
\end{equation}
<br/>
<b>
This is how we reached our final conclusions:
<ol>
<li>use exogenous variables that are highly correlated ($\alpha \approx 1.$) or anti-correlated ($\alpha \approx -1.$) with the target
is equivalent to a model without exogenous variables (but with changed model parameters).</li>
<li>the use of exog data, scaled by the ratio $\frac{exogenous data}{target}$
allows for a significant modification of the final prediction error of the model.
</b>
The error is now scaled by the factor $\frac{1}{1-\alpha}$. So
<ol>
<li> we have the error explosions for $\left|\alpha\right| \approx 1$,</li>
<li> for $\left|\alpha\right| < 1 $: error with exog data $>$ error without exogenous variable,</li>
<li> for $\left|\alpha\right| > 1 $: error with exog data $<$ error without exogenous variable.</li>
</li>
</ol>
</ol>
The above derivation was made for the simplified case without the differential term (i.e., when $d,D=0$).
I leave it to the readers to generalize the above formalism into a model with a differential parts. <br/><br/><br/>
The next step will be a verification of these hypotheses in practice.
A practical example of the implementation of the discussed hypothesis is available in the form of a jupyter-notebook script:
<a href="https://github.com/Lobodzinski/ARIMA_with_exogenous_data"><b>https://github.com/Lobodzinski/ARIMA_with_exogenous_data</b></a>.
<br/>
Here, we just present the dependence of the MAPE error as a function of the exog data factor (in the log10 scale).
As you can see, by using an appropriate value of the factor we are able to reduce the prediction error by almost half ! The error reaches its minimum value at a factor value of 4000000. Then the error increases. The increasing part after the minimum is reached is not directly visible in the theoretical proof. I will try to explain it in the next part of the article. <br/>
All other details are available in the code:
<a href="https://github.com/Lobodzinski/ARIMA_with_exogenous_data"><b>https://github.com/Lobodzinski/ARIMA_with_exogenous_data</b></a>.
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRqR6mqLq23AYq9v6ZZNcOMssxqlGvzGV2mcIs7fBz3_FM_ScTSb0ACEi1J7RDCbj6BHJDy0lioh0OzQIt9st_r31wXHExCAS-ckRnLg3DDtSlkgfNmhWD7EbMCCmpwhx8fLklv5P8hade/s717/mape.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="600" data-original-height="387" data-original-width="717" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRqR6mqLq23AYq9v6ZZNcOMssxqlGvzGV2mcIs7fBz3_FM_ScTSb0ACEi1J7RDCbj6BHJDy0lioh0OzQIt9st_r31wXHExCAS-ckRnLg3DDtSlkgfNmhWD7EbMCCmpwhx8fLklv5P8hade/s600/mape.png"/></a></div>
<br/><br/>
Thank you for the reading !
Bogdan Lobodzinskihttp://www.blogger.com/profile/09289422592105592639noreply@blogger.com0tag:blogger.com,1999:blog-4412689276279452501.post-15218648749262247382016-06-22T14:50:00.000+02:002021-05-14T21:50:59.637+02:00Sell - buy signals for financial instruments - a multivariate survival analysis approach (in R).<!-- MathJax usage example -->
<script type="text/x-mathjax-config">MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}});
</script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"> </script>
<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"></script>
Survival analysis could be understand as a kind of a forecasting method created for studying
how long an analyzed subject will exist. In a survival framework we calculate
survival probability of a thing at particular points of time. <br />
And so, it is a natural choice to use the survival method in forecasting of financial market.
Especially, for prediction of future price behaviour of a given financial instance and as a consequence for
generating buy-sell signals. Searching for some examples of similar analysis I have found only a single work addressing
the subject [1]. From this publication I adopted a definition of a daily revenue - given below. <br />
<br />
In this note I would like to show how to use the survival analysis method for generation of <b>buy/sell signals</b>
for a chosen financial product.
As data I use <b>Brent Oil Future</b> options, however the method could be adopted
to other instruments as well. <br />
<br />
The analysis is presented in R language.<br />
As an input I use historical data of Brent Oil Future options available on the portal
<a href="http://stooq.com/q/d/?s=sc.f&d1=20150101&d2=20160617&c=0">stooq.pl</a>. The data contain several features like
<b>Open</b> and <b>Close</b> prices, <b>Volume</b>, highest (<b>High</b>) and lowest (<b>Low</b>) prices and Open Interest (<b>OpenInt</b>).
Operating on a <b>daily basis</b>, my goal is to predict whether for the next day we expect increase
or decrease of the <b>Close</b> value.<br />
Let me begin with preparation of libraries and data engineering. <br />
<pre class="prettyprint"><code>set.seed(1)
library(ggplot2)
library(lubridate)
library(readr)
library(MASS)
library(survival)
library(xts)
library(rms)
# reading the data:
myDir <- "/analysis/forecasting/data_stooq.pl/"
myFile <- "sc_f_d_2015_2016.csv"
input <- NULL
input <- read.csv(paste(myDir,myFile,sep=""),header=TRUE)
names(input)
[1] "Date" "Open" "High" "Low"
[5]"Close" "Volume" "OpenInt"
</code></pre>
In a next step I will introduce new features:<br />
<ol>
<li> $corp = Close - Open$ : it corresponds to the definition of the corp of the candlesticks,
<pre class="prettyprint"><code>input$corp <- 0
input$corp <- input$Close - input$Open
</code></pre>
</li>
<li> Exponential Moving Average - $EMAV$. For this purpose I use functions for calculation of the
Volume Weighted Exponential Moving Average (see <a href="http://www.r-bloggers.com/volume-weighted-exponential-moving-average/">volume-weighted-exponential-moving-average</a>)
<pre class="prettyprint"><code>VEMA.num <- function(x, volumes, ratio) {
ret <- c()
s <- 0
for(i in 1:length(x)) {
s <- ratio * s + (1-ratio) * x[i] * volumes[i];
ret <- c(ret, s);
}
ret
}
VEMA.den <- function(volumes, ratio) {
ret <- c()
s <- 0
for(i in 1:length(x)) {
s <- ratio * s + (1-ratio) * volumes[i];
ret <- c(ret, s);
}
ret
}
VEMA <- function(x, volumes, n = 10, wilder = F, ratio = NULL, ...)
{
x <- try.xts(x, error = as.matrix)
if (n < 1 || n > NROW(x))
stop("Invalid 'n'")
if (any(nNonNA <- n > colSums(!is.na(x))))
stop("n > number of non-NA values in column(s) ",
paste(which(nNonNA),collapse = ", "))
x.na <- xts:::naCheck(x, n)
if (missing(n) & !missing(ratio)) {
n <- trunc(2/ratio - 1)
}
if (is.null(ratio)) {
if (wilder)
ratio <- 1/n
else ratio <- 2/(n + 1)
}
foo <- cbind(x[,1], volumes, VEMA.num(as.numeric(x[,1]), volumes, ratio),
VEMA.den(volumes, ratio))
(foo[,3] / foo[,4]) -> ma
ma <- reclass(ma, x)
if (!is.null(dim(ma))) {
colnames(ma) <- paste(colnames(x), "VEMA", n, sep = ".")
}
return(ma)
}
x <- ts(input$Close)
input$EMAV <- 0
Vol <- rep(1, length(input[,1]))
input$EMAV <- VEMA(input$Close, Vol,ratio=0.95)
plot(input$Close,type="l");lines(input$EMAV,col="red")
</code></pre>
</li>
<li> Daily Revenue - $DailyRevenue$ . The feature is calculated as a percentage of $Close$ values between present and previous days [1]<br />
$DailyRevenue = \frac{( Close[k] - Close[k-1] )}{Close[k-1]} \cdot 100$
<pre class="prettyprint"><code>input$DailyRevenue <- 0
input$DailyRevenue[1] <- 0
for (k in (2:(length(input$Close)))) {
input$DailyRevenue[k] <- (input$Close[k]-input$Close[k-1])*100/(input$Close[k-1])
}
</code></pre>
</li>
<li> Censors: <br />
In this case I consider two series of censored data corresponding to the increasing and decreasing values of $DailyRevenue$.
The increasing or decreasing event is detected by checking whether the Daily Revenue is larger or smaller than a given threshold value.
The threshold value could be used as a parameter in our model. Therefore, the value of the $threshold$ could be different from $0$ ($0$ is here a good initial choice).
The censoring of data used in the model is following:
<ol>
<li>censors specific for increasing daily revenue: $censorUp$:
if the daily revenue is higher than $threshold$: $censorUp = 0$ (censored point of time).
In case when the daily revenue being smaller or equal to $threshold$: $censorUp = 1$ .
<pre class="prettyprint"><code>Threshold <- 0
input$censorUp <- 0
input[input$DailyRevenue <= Threshold,12] <- 1
</code></pre>
</li>
<li>censors specific for decreasing daily revenue: $censorDown$. In this case the
events are opposite to the $censorUp$ feature.
The daily revenue is smaller or equal to $threshold$: $censorDown = 1$ .
the daily revenue is larger than $threshold$: $censorDown = 0$ .
<pre class="prettyprint"><code>input$censorDown <- 0
input[input$DailyRevenue > Threshold,11] <- 1
</code></pre>
</li>
</ol>
</li>
<li> Time to events:
are measured as a time distance between last local extremum of the $DailyRevenue$ and an event defined by censoring series (censored or NOT censored).
<ol>
<li> $TimeToEventUp$: <br />
it is a time measured from a last local minimum of the $DailyRevenue$ to the $censorUp = 1$ events or
the time distance between the last local maximum of the $DailyRevenue$ to the $censorUp = 0$.
<pre class="prettyprint"><code>locmylastmin <- 1
locmylastmax <- 1
for (k in (2:(dim(input)[1]))) {
tmp <- input$DailyRevenue[1:k]
if (input$censorUp[k] == 0) {
# find all local minima:
locmymin <- which(diff(sign(diff(tmp)))==+2)+1
if (length(locmymin) == 0) {locmymin <- which.min(tmp)}
locmylastmin <- locmymin[length(locmymin)]
dateMin <- input[locmylastmin,1]
input$TimeToEventUp[k] <- as.integer(as.Date(input[k,1])-
as.Date(dateMin))
}
if (input$censorUp[k] == 1) {
# find all local maxima:
locmymax <- which(diff(sign(diff(tmp)))==-2)+1
if (length(locmymax)==0) {locmymax <- which.max(tmp)}
locmylastmax <- locmymax[length(locmymax)]
dateMax <- input[locmylastmax,1]
input$TimeToEventUp[k] <- as.integer(as.Date(input[k,1])-
as.Date(dateMax))
}
}
</code></pre>
</li>
<li>$TimeToEventDown$: <br />is calculated in an opposite way to the feature $TimeToEventUp$.
Details of determination of the $TimeToEventDown$ feature is shown on Fig. 1 .
<pre class="prettyprint"><code>locmylastmin <- 1
locmylastmax <- 1
for (k in (2:(dim(input)[1]))) {
tmp <- input$DailyRevenue[1:k]
if (input$censorDown[k] == 1) {
# find all local minima:
locmymin <- which(diff(sign(diff(tmp)))==+2)+1
if (length(locmymin) == 0) {locmymin <- which.min(tmp)}
locmylastmin <- locmymin[length(locmymin)]
dateMin <- input[locmylastmin,1]
input$TimeToEventDown[k] <- as.integer(as.Date(input[k,1])-
as.Date(dateMin))
}
if (input$censorDown[k] == 0) {
# find all local maxima:
locmymax <- which(diff(sign(diff(tmp)))==-2)+1
if (length(locmymax)==0) {locmymax <- which.max(tmp)}
locmylastmax <- locmymax[length(locmymax)]
dateMax <- input[locmylastmax,1]
input$TimeToEventDown[k] <- as.integer(as.Date(input[k,1])-
as.Date(dateMax))
}
}
</code></pre>
<!--
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7oMdXuHntRXsOWHPxWcBG1LCqhaaNqd4Tersr-sHpDgNl4RQrK8wnXF4tF3BroDhZJC8ULmup-VWPUMQJ9uemUPZwx7sI0AlelmD-mFvtA3mkCLNAYLctkJYCpEQYe5qkHRXuwFroVt0H/s1600/censoring.bmp" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7oMdXuHntRXsOWHPxWcBG1LCqhaaNqd4Tersr-sHpDgNl4RQrK8wnXF4tF3BroDhZJC8ULmup-VWPUMQJ9uemUPZwx7sI0AlelmD-mFvtA3mkCLNAYLctkJYCpEQYe5qkHRXuwFroVt0H/s1600/censoring.bmp" /></a>
<b>Fig. 1</b>
-->
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7oMdXuHntRXsOWHPxWcBG1LCqhaaNqd4Tersr-sHpDgNl4RQrK8wnXF4tF3BroDhZJC8ULmup-VWPUMQJ9uemUPZwx7sI0AlelmD-mFvtA3mkCLNAYLctkJYCpEQYe5qkHRXuwFroVt0H/s1600/censoring.bmp" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7oMdXuHntRXsOWHPxWcBG1LCqhaaNqd4Tersr-sHpDgNl4RQrK8wnXF4tF3BroDhZJC8ULmup-VWPUMQJ9uemUPZwx7sI0AlelmD-mFvtA3mkCLNAYLctkJYCpEQYe5qkHRXuwFroVt0H/s640/censoring.bmp" /></a>
Definition of the variable $TimeToEventDown$ demonstrated on a part of the $DailyRevenue$ time series.
It is a sequence of time distances $T_{Down}^{censored}$ and $T_{Down}^{event}$, where
the time $T_{Down}^{censored}$ is a time measured from a last local maximum of the $DailyRevenue$ to the $censorDown = 0$ and
$T_{Down}^{event}$ - is the time distance between the last local $DailyRevenue$ minimum to the $censorDown = 1$.
<br />
</li>
</ol>
</li>
</ol>
<br />
In the next step I build the signal creator. <br />
The basic part is realized in the standard way of calculating of the survival probabilities corresponding
to the both series $censorUp$ and $censorDown$.
The prediction is calculated by the function survfit without newdata object.
In such a case the predicted survival probabilities are calculated on a base of mean values of all covariates included in the
formula used for creation of the Cox hazards regression model fit. In our case I use all available covariates in implemented formula, except the $Volume$ feature which is not already a solid number.
<pre class="prettyprint"><code>formula = Surv(TimeToEventDown, censorDown) ~
High + Low + Close + corp + EMAV + DailyRevenue + OpenInt
</code></pre>
The code of the function $SurvivalProb$:
<pre class="prettyprint"><code>SurvivalProb <- function(myinput,What="Down") {
train <- myinput
if (What == "Down") {
coxphCorp <- coxph(Surv(TimeToEventDown, censorDown) ~
High + Low + Close + corp + EMAV + DailyRevenue + OpenInt,
data=train, method="breslow")
}
if (What == "Up") {
coxphCorp <- coxph(Surv(TimeToEventUp, censorUp) ~
High + Low + Close + corp + EMAV + DailyRevenue + OpenInt,
data=train, method="breslow")
}
mysurffit<-survfit(coxphCorp, se.fit = F, conf.int = F)
tmp<-summary(mysurffit)
myprob <- tmp$surv[1]
myprob
}
</code></pre>
The function predicting the signal sell/buy ($SignalPredictor$) calculates it as a ratio between probabilities calculated for the next days related to the increased and decreased daily revenue accordingly. The ratio is defined as:
$ratio = \frac{Survival-Probability-that-daily-revenue-will-be-higher-next-day}{Survival-Probability-that-daily-revenue-will-be-smaller-next-day} - \left(1+\epsilon\right)$
where $Survival-Probability-that-daily-revenue-will-be-higher-next-day$ (denoted as $probUp$ in the code) and
$Survival-Probability-that-daily-revenue-will-be-smaller-next-day$ ($probDown$ in the code) are calculated by the function $SurvivalProb$.
The constant $1+\epsilon$ corresponds to the $ratioValue$ inside the code. <br />
The code of the function $SignalPredictor$ :
<pre class="prettyprint"><code>SignalPredictor <- function(train,ratioValue) {
# Survival-Probability-that-daily-revenue-will-be-smaller-next-day
ProbDown <- SurvivalProb(train,"Down")
# Survival-Probability-that-daily-revenue-will-be-higher-next-day
ProbUp <- SurvivalProb(train,"Up")
# signal: + <- buy; - <- sell
signal <- ((ProbUp/ProbDown)-ratioValue)
signal
}
</code></pre>
As a practical test how well the model creates buy-sell signals I run the signal predictor in a loop over
available dates in the set of historical data of Brent Oil Future.
In the code below, the $Dist$ variable denotes the size of the training sample.
For a given row $k$ the model use $Dist$ rows in the data as a train sample and makes a prediction for the day: $k + Dist + 1$ ($SignalPredictor$).
In the next iteration the model predicts signal for the next day $k + Dist + 2$, using data of the previous day $k + Dist + 1$ as a part of the train
sample. <br />
<pre class="prettyprint"><code># daily step:
Dist <- 50
ratioValue <- 1.02
Bias <- 0
startInd <- Dist
results <- NULL
results$Signal <- 0
for (k in seq((startInd+Bias+1),length(input[,1]),by=1)) {
print(as.Date(input[k,1]))
train <- input[(k-Dist):k,]
train <- Optimize(train,ratioValue)
ratio <- SignalPredictor(train,ratioValue)
results$Signal <- c(results$Signal,ratio)
}
</code></pre>
Inside the loop I use another function $Optimize$. The $Optimize$ function takes the train sample and selects its last
row as test data. In the next steps the function selects such a size of the train sample that properly predicts the
sign of the $DailyRevenue$ feature from the test sample (e.g. the signal sell/buy corresponds to the sign of the $DailyRevenue$ variable).
The resulting size of the train sample can be smaller or equal to the initial train size. The $Optimize$ function:<br />
<pre class="prettyprint"><code>Optimize <- function(inputData,ratioValue) {
L <- dim(inputData)[1]
mydist <- 1
mycheck <- -1
repeat {
mytest <- inputData[L,]
mytrain <- inputData[mydist:(L-1),]
probDown <- SurvivalProb(mytrain,"Down")
probUp <- SurvivalProb(mytrain,"Up")
ratio <- (probUp/probDown)-ratioValue
if ( (sign(ratio)*mytest$DailyRevenue >= 0) ) {mycheck <- 1}
if (mycheck == 1) {break}
if (mydist >= (L-10)) {
mycheck <- 1;mydist <- 2;break
}
mydist <- mydist + 1
}
mydist <- mydist - 1
inputData[mydist:L,]
}
</code></pre>
The results (sell=buy signals) are shown on the next plot (Fig.2) for an initial values of parameters:
<pre class="prettyprint"><code>Dist <- 50
ratioValue <- 1.02
Threshold <- 0
</code></pre>
and for the date period: 2015-03-13 - 2016-06-16
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzrSA43PNSAdgyeZ_vdoK5ihIR4nS3_EPfQnJb1I1DcZJYQnEUEXwnez5hb4J6fPUj7uN_pVkl6uQYwJr-gh9iYEL5dQqnh424zy6jh_1d_iM9gp6wdrt2PCxoHjtNq_tIJQWkSX-zBCxH/s1600/Predictions.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzrSA43PNSAdgyeZ_vdoK5ihIR4nS3_EPfQnJb1I1DcZJYQnEUEXwnez5hb4J6fPUj7uN_pVkl6uQYwJr-gh9iYEL5dQqnh424zy6jh_1d_iM9gp6wdrt2PCxoHjtNq_tIJQWkSX-zBCxH/s1600/Predictions.jpeg" /></a></div>
<b>Fig. 2</b><br />
<!--
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhU1HDLSWlYw_1rYP3acDYvp0DOjx8LpIVEoYWNRXXauN0mNg6OPNwd3ECDTs5eSf62ukoD7u-xEOWa0soCqiwoTVALTj7QilxEeF-a6r8lAf1Klq4BnrI1xFLyY2Fgb0Ly7c-DGA6-ImqC/s1600/Predictions.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhU1HDLSWlYw_1rYP3acDYvp0DOjx8LpIVEoYWNRXXauN0mNg6OPNwd3ECDTs5eSf62ukoD7u-xEOWa0soCqiwoTVALTj7QilxEeF-a6r8lAf1Klq4BnrI1xFLyY2Fgb0Ly7c-DGA6-ImqC/s400/Predictions.jpeg" /></a></div>
-->
In the Fig. 2 the green parts correspond to the positive values of the ratio where the calculated probability for increasing $Close$ price
is larger than for decreasing one. The red part shows the opposite behaviour. <br />
The best signal buy/sell is generated when the ratio changes the sign:<br />
<b>from red to green : buy, from green to red: sell</b> .<br />
The black curve presents the $Close$ values, the blue line represents the values of $EMAV$. <br />
By manipulating with the above parameters ($Dist$,$ratioValue$, $Threshold$ and $EMAV$) one can create different strategies.
Using an initial capital equal to the $Close$ value at the initial date of the game, the total income generated by buy and sell signals for
the above period and parameters is $16.47$ USD what corresponds to
the $41.1$ % of profit from the considered period of time (2013-03-13 - 2016-06-17) .<br />
<br />
<h4>Conclusions:</h4>
<br />
If you place capital into financial market it is always better to use several independent models for prediction of
behaviour of financial instruments. If most of the methods are coming to the same conclusions, one can decide to take action: to sell or to buy
the instrument(s). <br />
The presented model could be treated as an additional source of such a knowledge.
From technical point of view the script is easy to adopt to any financial data and has a lot of room for improvement.
The most important part of the algorithm is the definition of censors ($censorUp$,$censorDown$) and "time to event" ($TimeToEventUp$,$TimeToEventDown$).
I would appreciate any hints concerning improvement of the algorithm. <br />
<br />
Hope, the note could be used as an introduction for survival analysis in a financial market.
Comments (and links) are very welcome. <br /><br /><br />
Thanks for reading ! <br />
Bogdan Lobodzinski<br />
<a href="http://dataanalytics4all.site/">Data Analytics For All Consulting Service</a><br />
<br />
[1] Guangliang Gao, Zhan Bu, Lingbo Liu, Jie Cao, Zhiang Wu: A survival analysis method for stock market prediction. BESC 2015: 116-122<br />
<br />
Code for generation Fig. 2:<br />
<pre class="prettyprint"><code>shift <- 47
factorSig <- 21
Sig <- sign(results$Signal)
L<-length(Sig)
Sigdf <- data.frame(x=input[(startInd+Bias):length(input[,1]),1],Sig=Sig*factorSig)
JustLine <- data.frame(xTime=input[(startInd+Bias):length(input[,1]),1],Line=0*c(1:L)+shift)
titleStr <- paste("Brent Oil Future: period from ",as.Date(input[(startInd+Bias+1),1]),
" to", as.Date(input[(length(input[,1])),1]))
ggplot(Sigdf, aes(x=as.Date(x), y=Sig+shift)) +
geom_line() +
geom_ribbon(aes(ymin=shift, ymax=ifelse(Sig+shift>shift,Sig+shift,shift)),
fill = "green", alpha = 0.5,colour = "green") +
geom_ribbon(aes(ymin=ifelse(Sig+shift < shift,Sig+shift,shift),ymax=shift),
fill = "red", alpha = 0.5,colour="red") +
geom_line(data = JustLine, aes(x=as.Date(xTime), y=Line)) +
geom_line(data = input[(startInd+Bias+1):length(input[,1]),],
aes(x=as.Date(input[(startInd+Bias+1):length(input[,1]),1]), y=Close)) +
geom_line(data = input[(startInd+Bias+1):length(input[,1]),],
aes(x=as.Date(input[(startInd+Bias+1):length(input[,1]),1]), y=EMAV, colour=4)) +
ggtitle(titleStr) +
labs(x="Date", y="Price in USD") +guides(colour=FALSE)
</code></pre>Bogdan Lobodzinskihttp://www.blogger.com/profile/09289422592105592639noreply@blogger.com0tag:blogger.com,1999:blog-4412689276279452501.post-91842621551610703362016-05-06T15:33:00.002+02:002021-05-14T21:51:20.981+02:00Mainstream media and shaping social opinions<!-- MathJax usage example -->
<script type="text/x-mathjax-config">MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}});
</script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"> </script>
Inspired by a question <b>"What can we learn about social behaviour from recommender database ?"</b> which arises from my previous note
<a href="http://dataanalyticsforall.blogspot.de/2016/03/recommender-systems-and-q-value-potts.html">Recommender Systems and q-value Potts model</a>
I spent some time considering a following question: <br />
<b>The question:</b><br />
how to control opinions in a social group by creation of an environment with whom the group can interact ?<br />
<br />
Let us start to define the toy model. <br />
In a given social group we have a set of opinions $K$ distributed among all group members. The group collaborates with external world through
an interaction with some set of opinions stated in an external environment. We assume that the environment is too big to be
manipulated by the social group, but the environment can modify somehow the distribution of opinions inside the group.<br />
For a better transparency we consider 2 different opinions described as $A$ and $B$. Opinions $A$ and $B$ are defined as states
<b>"+1"</b> and <b>"-1"</b> correspondingly. Members are distributed among two subgroups according to a shared opinion and interact with the same
environment but with different strength: $\gamma_{A}$ and $\gamma_{B}$ respectively.
The $\gamma_{A|B}$ coefficients correspond to the level of group members acceptance of noise created by the environment.
Or, in other words, the coupling constants $\gamma_{A|B}$ are weights which characterize how much a given group relies on
opinions supported by the environment. The level of believe in a group is proportional to the value of $\gamma_{i}$ couplings.
The scheme is depicted in the picture 1.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlB2NRufFbHYgA5avVvi4BRv3cUk6g4gFQp6M_WlgrffhB364qZI4_mTRp8NNzpICsca2FLOEs7uVrGBFuUbyXTIrkvKFbbxhlifNCFp_pTHwsMlYqi2IkIB-znc_UFrqMH4TcXZE267Dq/s1600/fig1.bmp" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlB2NRufFbHYgA5avVvi4BRv3cUk6g4gFQp6M_WlgrffhB364qZI4_mTRp8NNzpICsca2FLOEs7uVrGBFuUbyXTIrkvKFbbxhlifNCFp_pTHwsMlYqi2IkIB-znc_UFrqMH4TcXZE267Dq/s320/fig1.bmp" /></a></div>
<br />
fig. 1. The environmental $M$ levels are coupled to 2 separate systems $A$ and $B$ with a given coupling constants $\gamma_{A}$ and $\gamma_{B}$.
For simplicity each group $A$ and $B$ has the same number of levels $N$. <br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
We assume that the social model will fulfill the principle of least action so the entire set will tend to minimize the total
energy. Therefore the Hamiltonian approach should make a frame for our analysis.
The model can be described by the following Hamiltonian:
\begin{equation}
\label{Hamiltonian0}
H = H_{A} + H_{B} + H_{E} + H_{Int}
\end{equation}
where <br />
\begin{equation}
\label{Hamiltonian1}
H_{A|B} = \sum_{k_{A|B} = 1}^{N_{A|B}} E_{A|B} \left| A|B_{k_{A|B}} \right> \left< A|B_{k_{A|B}} \right|
\end{equation}
describes the state of subgroups $A$ and $B$ with energies $E_{A}$ ($E_{A}$) and $N_{A}$ ($N_{A}$) discrete states $\left| A_{k_{A}} \right>$
($\left| B_{k_{B}} \right>$).
\begin{equation}
\label{Hamiltonian2}
H_{E} = \sum_{n = 1}^{M} E_{n} \left| E_{n} \right> \left< E_{n} \right|
\end{equation}
is the basic energy of the environment with $M$ discrete states $\left| E_{n} \right>$ labeled by index n.
The interaction between the groups and the environment has a form:
\begin{equation}
\label{Hamiltonian3}
H_{Int} = \sum_{n=1}^{M} \left[ \left(
\sqrt{ \gamma_{A} } \sum_{ k_{A} = 1}^{ N_{A} } VA_{k_{A}}^{n} \left| A_{k_{A}} \right> \left< E_{n} \right| + h.c.
\right) +
\left(
\sqrt{ \gamma_{B} } \sum_{ k_{B} = 1}^{ N_{B} } VB_{k_{B}}^{n} \left| B_{k_{B}} \right> \left< E_{n} \right| + h.c.
\right)
\right ]
\end{equation}
where $VA$ and $VB$ are matrix elements describing couplings between group states $\left| A \right>$, $\left| B \right>$ and
environmental levels $\left< E_{n} \right|$. <br />
Because the environment cannot be modified by any group we can use the Markovian approximation and eliminate the environment states
from the Hamiltonian above. The result of operation can be written in the following form:
\begin{equation}
\label{Hamiltonian4}
H_{eff} = H_{A} + H_{B} - i V \cdot V^{+}
\end{equation}
where
\begin{equation}
\label{Hamiltonian5}
V =
\left(
\begin{array}{}
\sqrt{ \gamma_{A} } \cdot VA \\
\sqrt{ \gamma_{B} } \cdot VB
\end{array}
\right)
\end{equation}
creates a dissipative part of the effective Hamiltonian $H_{eff}$ which is
an $N \times N$ dimensional matrix and matrix $V$ has a dimension $N \times M$ .
Our analysis of the dynamics of the system can be reduced to the determination of the eigenvalues of the
effective Hamiltonian $H_{eff}$: $\Lambda = x - iy$ which are complex. An imaginary part describes how fast (time $\tau$) a given eigenvalue
dissipates in the system: $ \tau \approx \frac{1}{y}$. <br />
In other words, the $\tau$ describes the lifetime of a given opinion values by the real value of the eigenvalue $x$ .
Let us stress that we allow to create a continuous number of opinions, not only those which are defined in the assumptions of the
issue: $E_{A|B}$ .
Why can the eigenvalues of the Hamiltonian be used as a determinant of the social opinion distribution ?<br />
A standard model of opinion evolution is modeled using (in a simplest case) a set of linear equations:
\begin{equation}
\label{Hamiltonian6}
\vec{x} \left( t + 1 \right) = W \vec{x} \left( t \right)
\end{equation}
where $W$ is some $N\times N$ matrix describing an information exchange with weights between participants of the social network
and the vector $\vec{x}$ is an opinion profile in the network calculated for the time $t$.
Thus, the correspondence between the opinion profile $\vec{x}\left( t \right)$ and our approach can be
done by calculation of a power spectrum of the $\vec{x}\left( t \right)$.
Positions of maximums in the power spectrum
can be understood as equivalent to the real values of eigenvalues of the Hamiltonian $H_{eff}$ while the time scale of change of the
profile $\vec{x}\left( t \right)$ corresponds to the imaginary part of the eigenvalues (to be more specific to the inverse of the imaginary part).<br />
For any numerical calculations we have to define values $M$, $N$, $E_{A}$, $E_{B}$, $N_{A}$, $N_{B}$, $\gamma_{A}$, $\gamma_{B}$.
The matrices $VA$ and $VB$ are calculated as Gaussian unitary ensembles (GUE random matrices). <br />
Below we present a variety of plots of numerically determined eigenvalues of the Hamiltonian. <br />
On all plots the dots are a result of numerical calculations of the Hamiltonian $H_{eff}$. Numerical simulations have been done
for 30 GUE matrices $V$ with $N = 100$ and $M = 60$. Remaining values of parameters used for simulations: <br />
<ol>
<li> energies: $E_{A} = -1/2$, $E_{B} = 1/2$,
<li> number of states: for state $A$: $N_{A} = 50$, for state $B$: $N_{B}=50$,
<li> interaction couplings are described for each plot separately.
</ol>
<br />
At the beginning, for a comparison with existing models we shows on Fig. 2,3 and 4 a situation where
$\gamma_{A} = \gamma_{B}$ and different values of $\gamma_{B}$. A similar plots one can find in the work [1]. <br />
For a better visibility we used for scale $y$ logarithmic scale. In our case $y \leftarrow -log( \left| y \right| )$, so the value $y=-3$ corresponds to
$y = 10^{-3}$.
<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi32t-XYE17DTmHGV3X_K6UjEcZOINmVyPA4FAeGK0OiF7nIIi-F6oTaCEda9KCtjIocp465P-RlWV06pbVZQFv97dd68m5Jnz-8CTwkYn0iCyT74sgrotoy8Ml0C7NBCdj4NTMsQv3irjz/s1600/fig_Re8_Im8.jpeg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi32t-XYE17DTmHGV3X_K6UjEcZOINmVyPA4FAeGK0OiF7nIIi-F6oTaCEda9KCtjIocp465P-RlWV06pbVZQFv97dd68m5Jnz-8CTwkYn0iCyT74sgrotoy8Ml0C7NBCdj4NTMsQv3irjz/s320/fig_Re8_Im8.jpeg" /></a></div>
<br />
Fig. 2. Simulation for $\gamma_{A} = \gamma_{B} = 0.0025$ in energy units. The left vertical plot shows the integrated density profile of the width of calculated eigenvalues. The upper picture presents the integrated spectral density showing the width of energy states. <br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_7lSS0lrIzm1cJ1iX2mDtYgASQZt9FdIPtViJMzk3yIVFy-YtImAUJxPZOw7JdHk8LkpFf0ZmCIs5tmPXX6gMleBdESgOBCDO7UCA_8XyVwhzNW0cOI4KbxiGn1fdBRs9YBLAqRJABmn0/s1600/fig_Re7_Im7.jpeg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_7lSS0lrIzm1cJ1iX2mDtYgASQZt9FdIPtViJMzk3yIVFy-YtImAUJxPZOw7JdHk8LkpFf0ZmCIs5tmPXX6gMleBdESgOBCDO7UCA_8XyVwhzNW0cOI4KbxiGn1fdBRs9YBLAqRJABmn0/s320/fig_Re7_Im7.jpeg" /></a></div>
<br />
Fig. 3. Simulation for $\gamma_{A} = \gamma_{B} = 0.01$. An creation of two timescales of the eigenvectors is visible.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfLHEMb_zNOrUHy6iHRLIQeY2VEKBar_o11Actw5yIHtEGBwlqkLUFQNL6G2LIFRC9TdnJeYwCuffpVBoPSiOnL6ZlyvYiDYWm5i1Ph7zZcmjK4UHqMPFDSBi5qcbDLVj0MazrfJV6dB4v/s1600/fig_Re6_Im6.jpeg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfLHEMb_zNOrUHy6iHRLIQeY2VEKBar_o11Actw5yIHtEGBwlqkLUFQNL6G2LIFRC9TdnJeYwCuffpVBoPSiOnL6ZlyvYiDYWm5i1Ph7zZcmjK4UHqMPFDSBi5qcbDLVj0MazrfJV6dB4v/s320/fig_Re6_Im6.jpeg" /></a></div>
<br />
Fig. 4. Simulation for $\gamma_{A} = \gamma_{B} = 0.5$. <br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
The next Figs (5,6,7 and 8) shows asymmetric situations where $\gamma_{A} = \gamma_{B}/10$ and different values of $\gamma_{B}$.
Detailed values of other parameters are noted in the figure description. <br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiIsgS-PemVXA7cuL0ggIzRxHY0-cmOf_JbwbuPKvnIJHz-mVtMVT8twswi_UegaMyKx4_dxab9z2khzc5_or-aXJltMVgz1VFTjs4Y9H4t7u35nlYeh1Cxg0aJlk-oQar51YLFvWneCx3/s1600/fig_Re2_Im2.jpeg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiIsgS-PemVXA7cuL0ggIzRxHY0-cmOf_JbwbuPKvnIJHz-mVtMVT8twswi_UegaMyKx4_dxab9z2khzc5_or-aXJltMVgz1VFTjs4Y9H4t7u35nlYeh1Cxg0aJlk-oQar51YLFvWneCx3/s320/fig_Re2_Im2.jpeg" /></a></div>
<br />
Fig. 5. The situation with $\gamma_{B} = 0.01$ and $\gamma_{A} = \gamma_{B}/10$. <br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjS9QShwmy34MImBcP7z14OYqgt8U9STlrvb_Tl_t6p_IpTFQwJ5emqEnLdGf8rZkHVavxANhf6y_1ZQYCXzUOAHPN8caA9Uz3UsPw94UfnYrBp_qA1jXIO_FQom33HRl6WWEdXs2Lw0Ync/s1600/fig_Re3_Im3.jpeg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjS9QShwmy34MImBcP7z14OYqgt8U9STlrvb_Tl_t6p_IpTFQwJ5emqEnLdGf8rZkHVavxANhf6y_1ZQYCXzUOAHPN8caA9Uz3UsPw94UfnYrBp_qA1jXIO_FQom33HRl6WWEdXs2Lw0Ync/s320/fig_Re3_Im3.jpeg" /></a></div>
<br />
Fig. 6. The eigenvalue spectrum calculated for $\gamma_{B} = 0.05$ and $\gamma_{A} = \gamma_{B}/10$. <br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgI0bwaULrY-W-lF5a_lB-Rlkpqs3IyN_Yrl4DsnhTZPSojONXJMr-pn3cifMZq5qypPJwSoCVnTqfGNUguiuE6-SrUy5buXkv9u6AVfMelmrPrj-UW80gEZroyvtcg8ihtpbps8CUrc_Tg/s1600/fig_Re_Im_gamma_0.1.jpeg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgI0bwaULrY-W-lF5a_lB-Rlkpqs3IyN_Yrl4DsnhTZPSojONXJMr-pn3cifMZq5qypPJwSoCVnTqfGNUguiuE6-SrUy5buXkv9u6AVfMelmrPrj-UW80gEZroyvtcg8ihtpbps8CUrc_Tg/s320/fig_Re_Im_gamma_0.1.jpeg" /></a></div>
<br />
Fig. 7. The spectrum calculated for $\gamma_{B} = 0.1$ and $\gamma_{A} = \gamma_{B}/10$. <br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiO3sefb0eRiUzJqLQF6bkRo1Oq-APDWbP0iIXz4dUWc6J6l9n0ldNFOPkkZU_78krM9ms2foRkC7VupmMsC7wgmXDi1feL-QZn2x85UXss_o0KMLgapagPRZ-LWq_wTSrTf6hiVgTo6AWj/s1600/fig_Re4_Im4.jpeg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiO3sefb0eRiUzJqLQF6bkRo1Oq-APDWbP0iIXz4dUWc6J6l9n0ldNFOPkkZU_78krM9ms2foRkC7VupmMsC7wgmXDi1feL-QZn2x85UXss_o0KMLgapagPRZ-LWq_wTSrTf6hiVgTo6AWj/s320/fig_Re4_Im4.jpeg" /></a></div>
<br />
Fig. 8. The spectrum calculated for $\gamma_{B} = 0.5$ and $\gamma_{A} = \gamma_{B}/10$. <br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<h4>
Conclusions
</h4>
<br />
The landscape presented on all plots of eigenvalues of the Hamiltonian $H_{eff}$ is quite straightforward for those who are familiar with
atomic physics, especially interactions of multi-level atomic transitions with resonant light. We see clear existence of
so called coupled and uncoupled level (states) combinations . The increase of the coupling constant $\gamma_{A|B}$ leads to a creation of grouping
of eigenvalues. <br />
The eigenfunctions in the Hamiltonian can be formed in such a way that
some combinations of levels cancel the interaction part with the environment - they are called uncoupled (or trapped) states. The number of
such states is $N-M$. <br />
Other level eigenfunctions are coupled with the environment's states - such an eigenfunction we call coupled (or un-trapped) states ($M$ decaying states).
Populations of trapped states can survive longer time (smaller values of y) in comparison to the un-trapped states, where
the populations is exchanged between coupled states by the interaction term proportional to $\sqrt{\gamma_{A|B}}$. <br />
Let us try to translate this physical picture into a social language. <br />
<ol>
<li> The coupled (un-trapped) states: configurations of members (eigenfunction) which change a common opinion (eigenvalue) faster due to the interaction with a noisy environment represented by a different set of environmental opinions. Such an exchange of opinion is
proportional to the interaction strength $\gamma_{i}$ ($i=A|B$).
<li> The uncoupled (trapped) states: ideally, such configurations of members (eigenfunction) which create a stable combinations of a group members. They do not interact with the environment.
</ol>
If we define an opinion power as a parameter proportional to the number of eigenfunctions which share the same range of opinions we see that
in steady-state conditions the opinion created by the trapped states (i.e. weakly coupled with the environment) has a greater power of persuasion than an opposite social group characterized by the un-trapped member's configurations (i.e. strongly coupled to the environment). <br />
A higher level of noise (larger value of $\gamma_{A|B}$ and different values of couplings $\gamma_{A} < \gamma_{B}$) doesn't lead to reduction of
a significance of unwanted opinions but creates just the opposite action: <br />
it sharpens the polarization of existing opinions and increases the importance of resistance against the environment in a society.
Such a situation can be well identified nowadays. The noisy environment is created by mainstream media.
What one can find in such an environment: an increasing number of different subjects, news focused on accidents, preferred shorter forms and
propagation of a number of opposite opinions and many other similar, as well as different ways of distraction of reader's attention.<br />
Groups of people strongly coupled to such a media stream are not able to
define a private opinion and they are easy to manipulate. It also means that the number of people in such a chaotic environment will migrate rather to more stable social groups (trapped states), not so strongly dominated by the mainstream news. <br />
The creation of people without a strong, private opinions is a goal of the present liberal governments. But, as it is presented in this note the chosen way
of social control leads just to opposite behaviour. You can see it around yourself, don't you ? <br />
In the model, we used the condition where the number of intermediate states in the environment $M$ is smaller that the
number of states in considered social groups $N$: $M < N$. <br />
The opposite case leads to a bit different picture than described in this note, but it is a subject for
an another post.<br />
<br />
<br />
Bogdan Lobodzinski<br />
<a href="http://dataanalytics4all.site">Data Analytics For All Consulting Service</a><br />
<br />
<h4>
References:
</h4>
<ol>
<li> E.Gudowska-Nowak, G. Papp and J. Brickmann, Two-Level System with Noise: Blue's Function Approach, Chem. Phys., 220 120-135 (1997)
</ol>
Bogdan Lobodzinskihttp://www.blogger.com/profile/09289422592105592639noreply@blogger.com0tag:blogger.com,1999:blog-4412689276279452501.post-82908904585894345182016-04-10T22:09:00.001+02:002020-01-21T21:04:49.791+01:00Small and medium-size companies: any chance to grow in the Big Data dominated environment ?Today I watched a video presenting European Truck Platooning Challenge 2016.<br />
For those who are not familiar with the subject, it is about a test of self-driving trucks across Europe.
The test was organized by the Dutch government. Main truck companies from Europe participated in running the event.
You can watch the video here
<a href="http://www.engadget.com/2016/04/07/european-truck-platooning-challenge-video/">here</a>
and <a href="https://www.eutruckplatooning.com/default.aspx">here</a>.
<br />
It is a really great proof-of-concept project driven by the Big Data technologies.
<br />
<br />
However, in this note I don't like to write about details of the project.
Instead, I would like to develop wider my first reflections after watching the video.
I think everyone's conclusion will be that with this event all smaller transportation enterprises have got a Big problem.
The problem can be verbalize by a question:<br />
<i>how to survive on the market dominated by big transportation players and their self-driving trucks across Europe ?</i><br />
In the big transport companies the drivers will be replaced by the lawyers, but what about smaller transport businesses ?
<br />
<br />
Of course the problem is not only related to the transport industry. Similar situation appears in each branch of commerce.
Therefore, the more general question, is:<br />
<b><i>how smaller enterprises can find themselves in the business environment dominated by the Big Data technologies governed by big parties ?</i></b><br />
It is a bit similar to the competition between members inside a social group:<br />
<i>how to survive in a group dominated by set of rules profitable mainly for those who created those regulations</i>. <br />
Each of us can find clear example of such an environment from his own experience.
<br />
<br />
In my opinion, a possible solution of the problem can be found in a grouping of small companies into a fair-shared societies
glued by common goals and business activity.
Such a group has a bigger chance of survival than a single member, especially if all members can coordinate all individual actions,
in such a way that only these processes which are profitable for the group and for the participant should be realized.
It may sound like a socialism utopia - so it may mean - forget it, not possible for realization.
<br />
<br />
However, I am more and more convinced that such a procedure of business activity is possible if all human emotional behavior
can be removed from a decision making processes. Remember, we are talking about methods to lead business in a most effective way.
Such a decision creator can be realized by a Big Data driven decision making system using machine learning algorithms .
If we have a group of a few enterprises and each company makes available his business data (like details of contacts, details of realized tasks
and preferences, etc. ) for the decision-making system, a proper Big Data framework can create suggestions how to move forward using
well established probabilities.<br />
A client can look for a doer by communications with a group accessing the representative portal providing specification of task(s), or asking directly
a chosen company.<br />
In opposite, each group member can act fully autonomously, looking for customers individually or can use a group representative hints
created by the decision-making system.
In both cases the group representative framework acts as a Data driven decision creator:
<br />
<ol>
<li> for customer: suggesting proper companies from the group
<li> for companies in the group: informing chosen entities about requested tasks.
</ol>
The structure of the group reminds a bit the franchise business model. In contrary to a franchiser a group member don't have to follow any
business concept and there is no need for any kind of standardizations in terms of provided services. The only payment which have to be
provided by the group participants is a cost of the service and the development of a proper "Big Data"/Machine Learning infrastructure.
<br />
<br />
Summarizing this short note, the goal of described above business association is creation and support of strategies for
sustainable growth of small sized enterprises in a Big Data dominated environment.
The common features among all participating parties in the group should be community of the type of business activity and goals.
<br />
<br />
Unfortunately, details are usually much mode complicated, but my intention was to sketch a general picture of a structure
with (in my opinion) a highest chance to become profitable in the Big Data business era.
<br />
<br />
I will appreciate any comment and opinion about the idea sketched above.
<br />
<br />
<br />
Bogdan Lobodzinski <br />
<a href="http://dataanalytics4all.site/">Data Analytics For All Consulting Service</a>
Bogdan Lobodzinskihttp://www.blogger.com/profile/09289422592105592639noreply@blogger.com0tag:blogger.com,1999:blog-4412689276279452501.post-34288504385618041952016-03-13T19:46:00.005+01:002021-05-14T21:51:35.498+02:00Recommender Systems and q-value Potts model<!-- MathJax usage example -->
<script type="text/x-mathjax-config"> MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}}); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"> </script>
<br />
<h4>
Introduction</h4>
<br />
<br />
Recommendation system (RS) can be described in a general way as a class of applications used to predict a set of items with highest probability of acceptance by a given user and is based on user's responses.
The RS connects users with items and it is usually described from the point of view of the performance of algorithms used in the system.
<br />
In this note I would like to raise the following question:
<b>what can we learn by applying a q-valued Potts model to datasets created by recommender system ?</b>
The Potts model describes interaction of a set of spins distributed on a lattice. In contrary to Ising model, spins in the $q$-valued Potts formulation can take $q$ possible states (values). <br />
<br />
<h4>
Analysis</h4>
<br />
Let us consider a database created by a given RS where each item is characterized by a set $\{rating,tag\}$. Both, the $rating$ and $tag$ values are determined by independent users. From the $tag$ value one can extract a set of users who tagged the item. Those users make an $y$ coordinate.
The $x$ coordinate is created by the users who made $rating$ instances. Both coordinates can be used for creation of $x-y$ lattice with $z$ values corresponding to the $rating$ variable.
<br />
It is obvious that the available items and users evolve with time in some way.
Therefore it is not possible to determine close neighbours of a given spin placed in a lattice point $\{x,y\}$. In order to avoid this problem I assume that all spins on the lattice can interact with each other with the same coupling constant equal 1.
After such assumptions the data can be described as the long-ranged $q$-value Potts model with parameter $q$ corresponding to the number of voting options.
Using mathematical notations, for a given time $t$ we have $x-y$ lattice occupied by spins $\sigma_{i} = 1,..,q$ . The lattice (graph) state is described by the energy $E$ due to the interaction between spins $\sigma_{i}$:
\[
E = - \sum_{i,j} \delta_{\sigma_{i},\sigma_{j}}
\]
and by the order parameter (also known as a magnetisation) $M$ defined as
\[
M = \frac{\left( q \max\{n_{i}\} - N \right)}{\left(q-1\right)} .
\]
In the formulas above: $\delta$ is the Kronecker symbol, $N$ is the total number of spins in the lattice, $n_{i} <= N$ denotes the number of of spins with orientations $\sigma_{i}$.
<br />
Each graph is described by the per-graph quantities: $e = E/N$ and $m = M/N$ .
The maximum order in a graph corresponds to $m = 1$ and is equivalent to all spins with the same orientation. The perfect disorder is specific for a case when all orientations are equally numerous: $\max\{n_{i}\} = N/q$, this state corresponds to $m = 0$.
As a data for the model I used the movie dataset [1]. The database has 10 voting options ($q = 10$) with possible voting values $V = \{0.5,1,1.5,2,2.5,3,3.5,4,4.5,5\}$.
<br />
In the analysis, $y$ users (determined from $tag$ values) were selected globally for entire time range available in the data (9/17/1997 - 1/5/2009 ). The $x$ users were determined for each time step which was chosen as 1 day. For each time step I calculated the energy $e$ and the order parameter $m$. Details of calculations are available in R language listed in [2].
The calculations were performed for different numbers of voting options $q$ and available values $V$.
Selected values of $q$ and $V$ are following <br />
<br />
<ol>
<li> $q = 3$, $V = \{2, 3, 4 \}$
</li>
<li> $q = 4$, $V = \{2, 3, 4, 5 \}$
</li>
<li> $q = 5$, $V = \{1, 2, 3, 4, 5 \}$
</li>
<li> $q = 6$, $V = \{1, 1.5, 2, 3, 3.5, 4\}$
</li>
<li> $q = 7$, $V = \{1, 1.5, 2, 2.5, 3, 3.5, 4\}$
</li>
<li> $q = 10$, $V = \{0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5\}$
</li>
</ol>
For each subset defined by $q$ I made an averaging fit of calculated time evolution of variables $e$ and $m$ using local polynomial regression fitting smoothing (loess) with parameters [3]:
<br />
<pre class="brush: csharp">
loess(y ~ x, family="gaussian", span=.75, degree=1)
</pre>
<br />
All results are presented on plots below (the R source code for plots is listed in [3]). The energy $e$ (normalized to the same initial value) shows a behaviour typical for the Potts model:
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgk8L-GA_FPsv2JFd7br0dmekkGmqpuwDH1GWu5FJoi4d8DB-y0hv6CyNQm4R-oEh7plugluFhpr7sDAZnEguLYP3sD1xjG-uZTGm3CLe_oY2p2WcpXEWDTevwVhyphenhypheneGWl4gNyEcQ8LC7hnu/s1600/EnergyPar.jpeg" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgk8L-GA_FPsv2JFd7br0dmekkGmqpuwDH1GWu5FJoi4d8DB-y0hv6CyNQm4R-oEh7plugluFhpr7sDAZnEguLYP3sD1xjG-uZTGm3CLe_oY2p2WcpXEWDTevwVhyphenhypheneGWl4gNyEcQ8LC7hnu/s320/EnergyPar.jpeg" /></a>
<br />
But the most unexpected time evolution is seen on plot of the order parameter $m$:
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsm23ZOa2WMwK47tfrKcTdV6lr3420BjPIuqr4dK-v5WUjJczpJyEaLEe520hg35JijFOX3LW2-Ycple_ye2XHkIPl7fgoZE4g7u_-1BRz2JD9ekn6b5e18MvlCs87gRuEIrFfRf_mwhBn/s1600/OrderPar.jpeg" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsm23ZOa2WMwK47tfrKcTdV6lr3420BjPIuqr4dK-v5WUjJczpJyEaLEe520hg35JijFOX3LW2-Ycple_ye2XHkIPl7fgoZE4g7u_-1BRz2JD9ekn6b5e18MvlCs87gRuEIrFfRf_mwhBn/s320/OrderPar.jpeg" /></a>
<br />
The relaxation of the order parameter $m$ shows a two-staged process, well distinguished for $q < 6$ . The same behaviour is also visible for $q >=6$ however the steady-state value is not reached by the system yet. The final value of the order parameter also depends on the number of voting options. The stable value of $m$ parameter is higher for models with $q < 6$ ($\approx 0.6$) than for $q >=6$ ($\approx 0.4$). But this result should be checked more carefully on other datasets and it is too early to conclude this observation.<br />
<br />
<h4>
Conclusions</h4>
<br />
After some analysis work with the dataset from the RS, let me gather 2 main points:
<br />
<ol>
<li> two-staged relaxation process of the order parameter $m$ could be related to the different dynamical mechanisms. One of them leads to the exponential decay and the second one to the self-organizing phenomena. In terms of users behaviour, it suggests existence of two classes of users: randomly using the RS which could be responsible for the fast decay of the parameter $m$ and visitors of the RS with more stable opinions about voting subject. The stabilizing effects due to the non-random visitors will be specific rather for longer working rating webpages. Due to this feature I expect different performance of the same RS on a newly created rating website and on the rating site working for longer time .
</li>
<li> The time scale of the order parameter $m$ decay after which the stable stage is reached is faster for smaller $q$ parameter. For the dataset used for the analysis the decays can be grouped for models with $q < 6$ and $q >=6$. This observation implies that a better performance of RS can be achieved for a system with smaller number of voting options.
</li>
</ol>
<br />
Any comments and opinions greatly appreciated, thanks. <br />
<br />
<br />
Bogdan Lobodzinski <br />
<br />
<a href="http://dataanalytics4all.site/">Data Analytics For All consulting Service</a><br />
<br />
<br />
<h4>
References</h4>
<br />
[1] As a data for the model I used the movie dataset created by:
<br />
<ul>
<li> [url=http://www.grouplens.org]Movielens from GroupLens research group[/url],
</li>
<li> [url=http://www.imdb.com]IMDb website[/url],
</li>
<li> [url=http://www.rottentomatoes.com] Rotten Tomatoes website[/url],
</li>
</ul>
build by Ivan Cantador with the collaboration of Alejandro Bellogin and Ignacio Fernandez-Tobias, members of the Information Retrieval group at
<a href="http://ir.ii.uam.es/">Universidad Autonoma de Madrid</a>. The data can be downloaded from
<a href="http://ir.ii.uam.es/hetrec2011//datasets.html">http://ir.ii.uam.es/hetrec2011//datasets.html site</a>.
[2] the source code in R for calculation of energy $e$ and order parameter $m$ for 4 voting options ($q = 4$) with available values $V = \{2,3,4,5\}$.<br />
<pre class="brush: csharp">// Code
library(readr)
library(rgl)
library(MASS)
library(reshape2)
# Set a random seed for reproducibility
set.seed(1)
# set the q value of the model
q0 <- 4
# set the value options:
options <- c(2,3,4,5)
# set the file name with output:
Output <- "Params_q4.csv"
cat("reading the data\n")
prefix <- "../moviedata/"
user_rated_input <- paste0(prefix,"user_ratedmovies.dat")
userrate<- read.table(user_rated_input, header=TRUE)
user_tagged_input <- paste0(prefix,"user_taggedmovies.dat")
usertag <- read.table(user_tagged_input, header=TRUE)
cat("Data manipulation\n")
# 1. replace time by secs
userrate$timeinDays <- as.numeric(as.POSIXlt(userrate$time, format="%m/%d/%Y"))
usertag$timeinDays <- as.numeric(as.POSIXlt(usertag$time, format="%m/%d/%Y"))
# 2. rearrange data first in descending time order:
sort.userrate <- userrate[order(userrate$timeinHours) , ]
sort.usertag <- usertag[order(usertag$timeinDays) , ]
GetMatrix <- function(inputX,inputY, inputZ) {
x <- NULL
y <- NULL
z <- NULL
x <- inputX
y <- inputY
z <- inputZ
# check duplications:
d1<-NULL
d1<-data.frame(x,y)
indexes<-NULL
indexes<-which(duplicated(d1))
if (length(indexes)) {
d1<-d1[!duplicated(d1),]
z<-z[-indexes]
}
mydf <- NULL
mydf <- data.frame(d1,z)
mat <- NULL
mat<-as.matrix(acast(mydf, x~y, value.var="z"))
mat[is.na(mat)] <- 0
mat
}
lowerLimit <- q0
EnergyAv <- NULL
MagAv <- NULL
items <- NULL
for (item in unique(sort.userrate$time)) { # <- Day's timescale
tx <- NULL
tx <- subset(data.frame(sort.userrate),time == item)
if ( min(dim(tx)) > 0 ) {
merged <- NULL
merged<-merge(tx,sort.usertag,by=c("movieID"))
if (min(dim(merged)) > 0 ) {
myMat <- NULL
myMat <- GetMatrix(merged$userID.x,merged$userIDy,merged$rating)
myMat1<-NULL
myMat1<-myMat[!(myMat == 0)]
# selection of states:2,3,4,5 for q = 4 valued model:
myMat1 <- myMat1[myMat1 %in% options]
nrOfSpins <- length(myMat1)
if ( (nrOfSpins > lowerLimit) ) {
# Energy:
Energy <- 0
# counting nr of pairs:
for (elem in table(myMat1)) {
Energy <- Energy -(choose(elem, 2))
}
EnergyAv <- c(EnergyAv,Energy/nrOfSpins)
# Order parameter Mag:
Mag <- 0
Mag <- (q0*max(table(myMat1))-nrOfSpins)/(q0-1)
MagAv <- c(MagAv, Mag/nrOfSpins)
# time:
items <- c(items, item)
}
}
}
}
cat("Saving results ...\n")
dfq<-NULL
dfq<-data.frame(Time=items,Mag=MagAv,En=EnergyAv)
write_csv(dfq, Output )
</pre>
[3] R source code used for creation of the plots:
<pre class="brush: csharp">// Code
library(readr)
require(ggplot2)
library(rgl)
library(MASS)
library(lattice)
# read data
prefix <- "./"
q3_input <- paste0(prefix,"Params_q3.csv")
q4_input <- paste0(prefix,"Params_q4.csv")
q5_input <- paste0(prefix,"Params_q5.csv")
q6_input <- paste0(prefix,"Params_q6.csv")
q7_input <- paste0(prefix,"Params_q7.csv")
q10_input <- paste0(prefix,"Params_q10.csv")
q3 <- read.table(q3_input, sep=",", header=TRUE)
q4 <- read.table(q4_input, sep=",", header=TRUE)
q5 <- read.table(q5_input, sep=",", header=TRUE)
q6 <- read.table(q6_input, sep=",", header=TRUE)
q7 <- read.table(q7_input, sep=",", header=TRUE)
q10 <- read.table(q10_input, sep=",", header=TRUE)
# fits & plots:
elems <- list(q3,q4,q5,q6,q7,q10)
# fit of the order parameter:
myx1 <- NULL
mypred1 <- NULL
for (ind in 1:length(elems)) {
tmp <- NULL
tmp <- data.frame(elems[ind])
x <- as.numeric(as.POSIXlt(tmp$Time, format="%m/%d/%Y"))
myx1 <- c(myx1,list(x))
y <- with(tmp, Mag)
eval.length <- dim(tmp)[1]
fit.loess2= loess(y ~ x, family="gaussian", span=.75, degree=1)
pred <- predict(fit.loess2, data.frame(x=x))
fac<-1/pred[1]
pred <- pred*fac
mypred1<- c(mypred1,list(pred))
}
# energy:
# fit of the energy parameter:
myx2 <- NULL
mypred2 <- NULL
for (ind in 1:length(elems)) {
tmp <- NULL
tmp <- data.frame(elems[ind])
x <- as.numeric(as.POSIXlt(tmp$Time, format="%m/%d/%Y"))
myx2 <- c(myx,list(x))
y <- with(tmp, En)
eval.length <- dim(tmp)[1]
fit.loess2= loess(y ~ x, family="gaussian", span=.75, degree=1)
pred <- predict(fit.loess2, data.frame(x=x))
fac<-abs(1/pred[1])
pred <- pred*fac
mypred2<- c(mypred2,list(pred))
}
# plots:
xpos <- c(8.94,8.99,9.04,9.09) # x axis: log10(time) position
x01 <- log10(myx1[[1]])
origin_date <- "1970-01-01"
t1 <- substr(as.POSIXct(10^(x01[which.min(abs(x01-xpos[1]))]), origin = origin_date),1,10)
t2 <- substr(as.POSIXct(10^(x01[which.min(abs(x01-xpos[2]))]), origin = origin_date),1,10)
t3 <- substr(as.POSIXct(10^(x01[which.min(abs(x01-xpos[3]))]), origin = origin_date),1,10)
t4 <- substr(as.POSIXct(10^(x01[which.min(abs(x01-xpos[4]))]), origin = origin_date),1,10)
TimeLabels <- c(t1,t2,t3,t4)
colors <- terrain.colors(6)
plot(log10(myx1[[ind]]), mypred1[[ind]],type="l",lwd=6, col = colors[ind], xlab = "Log10(Time)",
ylab = "Normalized Order parameter",
ylim=c(0.2,1.0), xaxt = "n" )
for (i in seq(ind-1,1,-1)) {
lines(log10(myx1[[i]]), mypred1[[i]],lwd=6, col = colors[i], xlab = "Time (s)",
ylab = "Counts")
}
legend("topright", inset=.05, title="q-values:",c("3","4","5","6","7","10"), fill=colors[1:6],
horiz=TRUE)
axis(side = 1, at = xpos, labels = TimeLabels)#, tck=-.05)
# energy:
plot(log10(myx2[[1]]), mypred2[[1]],type="l",lwd=6, col = colors[1], xlab = "Log10(Time)",
ylab = "Normalized Energy", xaxt = "n")#, ylim=c(-,1.0) )
for (i in seq(2,ind,1)) {
lines(log10(myx2[[i]]), mypred2[[i]],lwd=6, col = colors[i], xlab = "Time (s)",
ylab = "Counts")
}
legend("bottomleft", inset=.05, title="q-values:",c("3","4","5","6","7","10"), fill=colors[1:6],
horiz=TRUE)
axis(side = 1, at = xpos, labels = TimeLabels)#, tck=-.05)
</pre>
Bogdan Lobodzinskihttp://www.blogger.com/profile/09289422592105592639noreply@blogger.com2tag:blogger.com,1999:blog-4412689276279452501.post-24010242575088621282016-03-13T19:39:00.001+01:002021-05-14T21:51:54.977+02:00How Data Analytics can help small companies ?Trying to operate on the market of micro, small and medium-sized enterprises as a data analyst serving solutions based on the machine learning techniques, I have to say - it is a difficult business. <br />
<br />
In my opinion, the difficulties arise due to two main, entangled reasons:
<br />
<ol>
<li> completely missing knowledge about a potential of data analytics.
Most company owners don't even accept a though that someone from outside of the firm can suggest how to improve his business productivity,
</li>
<li> if someone sees the need of a data analytics, the second problem appears: it is the cost of such a project. It can be a serious problem if it is seen as a single payment without noticing a new business landscape which gets opened by the results of the data analysis.
</li>
</ol>
Let us start with the second point: data analytics project costs.<br />
<br />
First of all, a usual thinking is that a data analytics requires enormously complex hardware and software infrastructure, therefore it will cost much too much. The truth is just the opposite.
All tools which are needed for data analytics tasks of small companies can be completed during a few working days without spending a penny.
Actually, it is not fully true, you have to have a good, powerful notebook with access to the network.
Already having a computer, what else do you need at the beginning ?<br />
<br />
<ol>
<li> Linux as a free operational system,
and software:
</li>
<li> Python or/and R (in this case also Rstudio).
</li>
</ol>
All of the above is available for free and downloadable from the Internet as an open source software.
The most important part of the whole collection is your business database. Again, even if you collect data using commercial software specific for your activity, it is possible to convert most of the available formats into more accessible ones for the R or Python form. Sometimes, that work might require some effort but in general it is doable.<br />
More problematic is the situation when the company data are on the paper only. What to do in such a case ?<br />
Maybe your accountant can help somehow - in communication with the tax office some forms of digital data have to be used anyway.<br />
Another solution is to initialize the data gathering from the scratch. It can be realized using open source databases installed on a cloud. The cloud is not a free choice but the management costs are really negligible if one compares the cost of cloud computing and storage with other business expenses.
In case of a cloud usage, we have to add a note about security of your data, especially in terms of unauthorized external access to your data.
Solution is very simple: the data analysis doesn't need real values in your database entries. <br />
It is enough to replace, for example, real addresses of the company clients by short unique strings.
Similar encryption can be done with numbers, using a common factor for all values in a given row. The real meaning of the unique strings and numerical factors remains in a company's hands.<br />
If needed, one can use already existing datasets available for
<a href="http://www.datasciencecentral.com/profiles/blogs/big-data-sets-available-for-free">free</a>.<br />
<br />
Now, we get to the point where a data expert becomes a crucial person.
The data scientist can merge the entrepreneur business knowledge with mathematics and practical machine learning solutions. This can lead to a better understanding of your business experience. But the final move belongs to the company managers, data analytics would lead to nothing if the results were not introduced into practical realizations. <br />
<br />
Work of the data expert(s), usually includes
<br />
<ol>
<li> consultations: when points like these are discussed:
<ul>
<li> what the company would like to develop, explain, understand or predict,
</li>
<li> availability of business data and how they are compatible with the company's goal,
</li>
<li> preparation of a concluding question e.g. the final goal of the data analysis,
</li>
</ul>
</li>
<li> pilot solution: a creation of a test model which can be validated on the existing data and tested on a new set of data,
</li>
<li> final realization: preparation of software tools which can be used on demand in daily business operations or on a regular basis using old and newly created data.
</li>
</ol>
Between the lines written above a number of interactive actions is included where all obstacles or unpredicted behaviours are discussed. All that is performed until the desired result is achieved.
The usual working time of the expert can be estimated between 50 hours (a week) for a micro projects up to several hundreds hours of work. <br />
<br />
Now, one can ask how much data analytics products per working hour might cost ?
The hourly rate can be estimated as 30 - 100 Euros. So, the final cost can vary between 1500 - 50000 Euro and more. <br />
<br />
How to measure the efficiency of a data analytics project ?<br />
A research performed by D. Barton and D. Court which was based on 165 large publicly traded firms shows that Data-driven decision-making can increase the output and productivity by 3-5% beyond the traditional improvement steps (Dominic Barton and David Court, <i>Making advanced analytics work for you</i>, Harvard Business Review, October 2012, Volume 90, Number 10, pp. 78–83).
Using this estimation and assuming the smallest increase of 3% of a company output, a minimum enterprise outcome would be about 50 000 Euro for full recompensation of a smallest Data analytics project cost. <br />
<br />
We treat the subject very generally, therefore some company owners will not be convinced why they should start to use data-driven decisions instead of relaying on the well grounded on the experience intuition . The numbers: 3-5% of outcome increase (set by the classical actions estimated by authors in cited above research paper) seem to be too small to be used as a solid argument in a discussion with small-firm owners.<br />
<br />
Yet, let me add a few arguments which, I hope, can change a bit the dominated view of data analytics in small business.
<br />
<ol>
<li> The most important step: collection of data, especially in small companies is difficult due to the lack of knowledge and free manpower. However, that issue can be easily automated and (what is not negligible) the tool can be fully owned by the company. Collecting sufficient amounts of data takes time, it could be a quite long process in case of small-firm segments. Therefore, the sooner the data collection process will start, the sooner analytical process will become beneficial.
</li>
<li> The data-driven management system requires more formalized and structured approach. It could be a difficult point in the transformation to the data-driven manner. But by answering simple questions (sometimes completely neglected and treated as a not important at all) one can achieve the biggest goals. Just a few examples of questions: <br />
Who is buying a given flavour of a bread ?<br />
How is the buying of bread correlated with time ?<br />
Which cookies prefer young, mature and older customers ? etc. <br />
All these answers can be provided by data analytics. And it is straightforward to notice that knowing the answers and following them in real world means clearly higher revenue. <br />
</li>
<li> What if the results of data analytics are consistent with the firm-owner intuition ?
In our opinion it is a wrong question: even if data-driven development suggestions are compatible with the well established experience of the company leader, we are talking about specific numbers: how many cookies should be delivered to the shop, how many breads should be produced, etc.
Do you believe that any, even the best intuition would provide the same numbers ?.
Obviously not. Therefore, the initial question should be replaced by the statement:
The data-driven results are capable to increase assurance of the manager's intuition. <br />
</li>
<li> and finally the non BigData conclusion:
new technologies in a company business opens the firm to new ideas and tedious work becomes more interesting.
</li>
</ol>
So, what do you think Dear Reader ?
Are you ready for innovations in your view of the company's management style ? <br />
<br />
Bogdan Lobodzinski <br />
<a href="http://dataanalytics4all.site/">Data Analytics For All Consulting Service</a>
Bogdan Lobodzinskihttp://www.blogger.com/profile/09289422592105592639noreply@blogger.com0