Saturday, November 11, 2023

A cursory analysis of the 2023 Polish parliamentary elections

[map from https://wbdata.pl/wybory-2023-mapy/]

I asked myself whether the last elections, held on 15.10.2023, were fair and whether this can be checked statistically with a simple analysis supported by intuition.
This will therefore not be hard proof of electoral fraud.
The data for the analysis can be found at https://wybory.gov.pl/sejmsenat2023/pl/dane_w_arkuszach, in the section 'Wyniki głosowania na listy Sejmowe' ('Results of voting for the Sejm lists'). The file 'po okręgach Sejm CSV XLSX' contains data from all electoral precincts.

Introduction

Let's start with a few definitions. In what follows I will rely on a few new variables; as I wrote, the analysis is very simplified. Because of the multitude of parties, I introduce the following groups:
  1. 'OPOZYCJA' = 'KOALICYJNY KOMITET WYBORCZY TRZECIA DROGA POLSKA 2050 SZYMONA HOŁOWNI - POLSKIE STRONNICTWO LUDOWE' +
    'KOALICYJNY KOMITET WYBORCZY KOALICJA OBYWATELSKA PO .N IPL ZIELONI' +
    'KOMITET WYBORCZY NOWA LEWICA'
  2. 'INNE PARTIE' = 'KOMITET WYBORCZY BEZPARTYJNI SAMORZĄDOWCY' +
    'KOMITET WYBORCZY WYBORCÓW MNIEJSZOŚĆ NIEMIECKA'+
    'KOMITET WYBORCZY KONFEDERACJA WOLNOŚĆ I NIEPODLEGŁOŚĆ'+
    'KOMITET WYBORCZY POLSKA JEST JEDNA'+
    'KOMITET WYBORCZY WYBORCÓW RUCHU DOBROBYTU I POKOJU'+
    'KOMITET WYBORCZY NORMALNY KRAJ'+
    'KOMITET WYBORCZY ANTYPARTIA'+
    'KOMITET WYBORCZY RUCH NAPRAWY POLSKI'
  3. 'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ' - as a separate group
Since 'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ' and 'OPOZYCJA' are the main players, in the rest of the text I will analyze 3 data groups:
  1. all precincts: no distinction
  2. a) OPOZYCJA > KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ: precincts where the total vote for 'OPOZYCJA' exceeds the vote for 'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'
  3. b) OPOZYCJA < KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ: precincts where the vote for 'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ' exceeds the total vote for 'OPOZYCJA'

As the variable for comparing the distributions of the party groups, I use the ratio of the number of votes cast for a party group in a precinct to the total number of votes in that precinct.
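With e.g. pandas, the grouping and the comparison variable can be computed as below. This is only a sketch with made-up numbers and shortened column names; the real CSV uses the full committee names as headers, so they would have to be adjusted.

```python
import pandas as pd

# Toy precinct data; in the real file the columns carry the full Polish
# committee names, shortened here for readability.
df = pd.DataFrame({
    "PIS":    [510, 320, 150],  # 'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'
    "TD":     [120,  90,  60],  # Trzecia Droga
    "KO":     [300, 280, 200],  # Koalicja Obywatelska
    "LEWICA": [ 70,  60,  40],  # Nowa Lewica
    "KONF":   [ 80,  50,  30],  # one of the 'INNE PARTIE' committees
})

# Party groups as defined above
df["OPOZYCJA"] = df[["TD", "KO", "LEWICA"]].sum(axis=1)
df["TOTAL"] = df[["PIS", "TD", "KO", "LEWICA", "KONF"]].sum(axis=1)

# The comparison variable: votes for a party group / all votes in the precinct
df["share_PIS"] = df["PIS"] / df["TOTAL"]
df["share_OPOZYCJA"] = df["OPOZYCJA"] / df["TOTAL"]

# Data groups a) and b)
group_a = df[df["OPOZYCJA"] > df["PIS"]]  # a) OPOZYCJA > PIS
group_b = df[df["OPOZYCJA"] < df["PIS"]]  # b) OPOZYCJA < PIS
```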

Distributions of the votes cast

The distributions look like this:
for the all-precincts group:
Fig. 1


As can be seen, the distributions of 'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ' and 'OPOZYCJA' are nearly mirror images of each other around the value $\approx 0.5$. The distributions for the next data groups:
for group a) OPOZYCJA > KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ
Fig. 2


and for group b) OPOZYCJA < KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ

Fig. 3


In the analysis of data such as these, we usually deal with approximately symmetric distributions, as for the groups a) OPOZYCJA > KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ and all precincts. In the case of the last group (b) OPOZYCJA < KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ), there is an asymmetric split between the political groups 'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ' and 'OPOZYCJA' near a value of the 'ratio of votes for a party group to all votes cast in the precinct' of $\approx 0.45$. Support for 'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ' drops very sharply to 0; the support distribution for 'OPOZYCJA' also seems to fall steeply, but not as dramatically as for 'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'.
Since this aspect looks odd, in the rest of the text I will analyze the votes in those precincts for which the 'ratio of votes for a party group to all votes cast in the precinct' lies in the range $0.3 - 0.6$.

In search of manipulation

To this end, I take the data group 'b) OPOZYCJA < KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ', restricted to precincts where the 'ratio of votes for a party group to all votes cast in the precinct' lies in $0.3 - 0.6$, and compute the distribution of the 2nd digit of the vote counts for 'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ' and 'OPOZYCJA'. If the votes were not manipulated, the resulting distribution should agree with Benford's law for the 2nd digit (see Benford's law - Wikipedia).
As a rule, checking the distribution of the 2nd digit of a set of values is less sensitive to additional factors that can introduce distortions, e.g. differences in the sizes of electoral precincts or strongly separated distributions of the analyzed variables. Factors of this kind make the distribution of the 1st digit unreliable, even if that distribution deviates strongly from the distribution expected under Benford's law for the 1st digit.
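For reference, the expected second-digit probabilities under Benford's law and a chi-squared fit error can be computed as below. The Benford formula itself is standard; the exact CHI2 normalization used in the plots is not stated in the post, so dividing by the number of digit bins is my assumption.

```python
import math
from collections import Counter

def benford_second_digit():
    """Expected probabilities of the 2nd significant digit under Benford's law:
    P(d2) = sum over d1=1..9 of log10(1 + 1/(10*d1 + d2))."""
    return [sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
            for d2 in range(10)]

def second_digit(n):
    """2nd digit of a positive integer, or None if it has only one digit."""
    s = str(n)
    return int(s[1]) if len(s) >= 2 else None

def chi2_per_digit(values):
    """Chi-squared of the observed 2nd-digit counts against Benford,
    divided by the 10 digit bins as a rough per-bin fit error."""
    digits = [d for d in (second_digit(v) for v in values) if d is not None]
    n = len(digits)
    observed = Counter(digits)
    expected = benford_second_digit()
    chi2 = sum((observed.get(d, 0) - n * expected[d]) ** 2 / (n * expected[d])
               for d in range(10))
    return chi2 / 10

# values = [...]  # e.g. 'Liczba głosów nieważnych' per precinct, share in 0.3-0.6
```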

For the analysis of the 2nd-digit distribution, I chose those variables from the precinct data that should most directly reveal possible manipulations:
  1. 'Liczba głosów ważnych oddanych łącznie na wszystkie listy kandydatów' (the number of valid votes cast for all candidate lists),
  2. 'Liczba głosów nieważnych' (the number of invalid votes),
  3. 'W tym z powodu postawienia znaku „X” obok nazwiska dwóch lub większej liczby kandydatów z różnych list' (of which, because an 'X' was placed next to the names of two or more candidates from different lists),
as well as the numbers of votes cast for 'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ', 'OPOZYCJA' and 'INNE PARTIE'. As the fit error, I compute chi-squared (denoted CHI2 in the plots below). The resulting distributions:
Fig. 4


In the plots above, $N$ denotes the number of values from which the distribution was computed.
For the variable 'Liczba głosów ważnych oddanych łącznie na wszystkie listy kandydatów' (plot 1) above): the result suggests stronger manipulation of the data for precincts in category 'a) OPOZYCJA > KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'; digits 0 and 1 can be interpreted as having been added.
For the variable 'Liczba głosów nieważnych' (plot 2) above): the fit errors are similar for both data groups 'a) OPOZYCJA > KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ' and 'b) OPOZYCJA < KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'. The 'Liczba głosów nieważnych' is larger for case 'a) OPOZYCJA > KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'.
The variable 'W tym z powodu postawienia znaku „X” obok nazwiska dwóch lub większej liczby kandydatów z różnych list' shows even more statistically significant differences between the two data groups (a) and b)).
Plots 2) and 3) above suggest manipulations involving an increase in the number of invalid votes.

In the next figure, I show the analysis of the numbers of votes for the political groups.
Fig. 5


Support for the 'OPOZYCJA' group (plot 1) above) has a small fit error (CHI2 < 1.) for both data groups (a) and b)). It is hard to point to manipulation here.
Support for 'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ' (plot 2) above) also carries a small fit error - CHI2 < 1. for group b) and CHI2 $\approx$ 2. for group a). In group a), digits 0 and 1 of the support distribution lie below the expected distribution.
In the case of 'INNE PARTIE', the fit error is also visibly larger for data group a) than for group b).

Summary

The analysis points to a potential place where manipulation of the votes for the political groups could have occurred: the electoral precincts in which support for 'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ' and 'OPOZYCJA' is balanced (the 'ratio of votes for a party group to all votes cast in the precinct' lies in $0.3 - 0.6$). The plots in Fig. 5 show that certain manipulations occurred more often in precincts belonging to data group 'a) OPOZYCJA > KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ', to the disadvantage of 'KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ'.
Does this analysis prove that the elections were rigged? No; it only points to statistically detected signatures of manipulation, and they are small. As I wrote in the introduction, this analysis is not proof of electoral fraud.


Thank you for reading!

Monday, June 13, 2022

A word on the development and implementation of machine learning techniques for IoT data processing

IoT data analytics typically involves creating processes to predict failures or situations, or to find anomalies online. By a process I mean here a model built with machine learning and/or statistics, together with its full implementation in the production cycle.
It has been reported that up to 80% of IoT data projects fail (the project is not completed, or the company gains nothing from its implementation). If we look at the highlighted reasons for failure, we find mostly data-related topics, hazy descriptions of problems with machine-learning methods that never go into detail, vaguely defined goals, and many other reasons.

In this note, I would like to focus on choosing an algorithm for IoT data analysis. Before working with IoT data, it is a good idea to decide early which analytical solution should be implemented: one based on supervised or on unsupervised methods.
Creating a good unsupervised algorithm is usually a difficult task, more difficult than creating a supervised one. However, building the model itself is only part of the project; the other part is its application in the production process. So I propose to look at the whole: the creation of an analytical method and its productization.

Solution 1 - Supervised Model (SM):

  1. analytical model (R&D): this is a Data Scientist (DS) standard job:
    • selection of features (data and feature engineering),
    • model creation,
    • precise validation procedure with KPIs.
  2. Productization: we have to include a Data Engineer here also. Tasks to do:
    • data & feature engineering, preparation of computing environment,
    • framework for SM automatization including:
      • monitoring of the data shifts (features + target),
      • data selection for creating new models,
      • data labeling,
      • parameter hypertuning,
      • model retraining,
      • selection of the best model(s).
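To make the first item of the automation framework (monitoring of data shifts) concrete, here is a minimal sketch of a feature-drift check. The function name and the toy data are mine, and using a two-sample Kolmogorov-Smirnov test is just one common choice, not something the post prescribes:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(reference, live, alpha=0.01):
    """Compare each live feature distribution against the training-time
    (reference) distribution with a two-sample KS test; return the names
    of features whose distributions differ significantly."""
    drifted = []
    for name in reference:
        stat, p_value = ks_2samp(reference[name], live[name])
        if p_value < alpha:
            drifted.append(name)
    return drifted

# Toy IoT-style features: 'temp' has shifted in production, 'vib' has not.
rng = np.random.default_rng(0)
reference = {"temp": rng.normal(20, 2, 5000), "vib": rng.normal(0, 1, 5000)}
live      = {"temp": rng.normal(23, 2, 5000), "vib": rng.normal(0, 1, 5000)}
print(detect_feature_drift(reference, live))  # 'temp' should be flagged
```

Flagged features would then feed the next framework steps: selecting fresh data, relabeling, and retraining.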

Solution 2 - Unsupervised Model (UM):

  1. R&D: DS job:
    • selection of features (data and feature engineering),
    • model creation,
    • precise validation procedure with KPIs.
    Tasks are mostly the same as in the SM case, but the goal is obviously more complex.
  2. Productization: it involves DE + DS tasks. Main tasks to do:
    • data & feature engineering, preparation of computing environment.

Comparison of both solutions:
As you can see, creating a well-functioning production cycle for the SM is a very difficult task - in my opinion more difficult, and much more time-consuming, than creating a proper supervised model.

On the other hand, for solution 2 based on the UM, putting the solution into production is very simple. The picture is reversed for building the analytical models.
Given the above considerations, the summary statement might look like this image:
The difficulties associated with the productization of the SM are the main place where failures are to be expected, and there are a lot of them. Sometimes it may be worth simplifying this part of the project and focusing on building the UM instead.

I hope you find my findings helpful. Thanks for reading !

Monday, January 3, 2022

Semantic text clustering: testing homogeneity of text clusters using Shannon entropy.

Many natural language processing (NLP) projects are based on the task of semantic text clustering. In short, we have a set of statements (texts, phrases, or sentences), and our goal is to group them semantically. Nowadays, this is quite a classical problem with a rich literature on clustering methods.

In spite of the simple formulation of the task, there are many problems during its realization. To make things more difficult, let's assume we are dealing with unlabeled data, which means we must rely on unsupervised techniques.

The first problem:

as we know, we can find many categories according to which we can try to group the given set of texts. In order to do this task well, we need to know the business case better. In other words, we need to know the question we want to answer using clustering. This problem is not the purpose of this note, so we will skip it.

The second problem:

suppose we have successfully performed text clustering. Then the question arises: how do we know that all clusters contain the correct texts? Answering this question is the purpose of this note.

Let us assume that a given method has generated $N$ groups/clusters, each of which contains a set of similar texts. Every cluster of similar texts is more or less homogeneous, and some clusters are completely wrong in the sense of text similarity. So answering our original question
How do we know that all clusters contain correct texts?
amounts to finding those clusters that contain erroneous texts.

How to find the worst clusters ?

In this note, I would like to propose the following method for determining the worst clusters:
  1. checking intra-cluster similarity by computing the Shannon entropy of the set of texts belonging to a given cluster.
  2. using dynamically determined threshold entropy to select the worst clusters (based on the method presented in my blog https://dataanalyticsforall.blogspot.com/2021/05/semantic-search-too-many-or-too-few.html).
As data to illustrate our task I use data known as reuters-21578 (https://paperswithcode.com/dataset/reuters-21578).
Since the clustering stage is not our goal, this part of the work was done using the fast clustering method based on Agglomerative Clustering ( https://www.sbert.net/examples/applications/clustering/README.html#fast-clustering, code: https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/fast_clustering.py).

Each set of texts is characterized by a different degree of mutual homogeneity. The general idea of the method is to calculate the Shannon entropy for each identified text cluster, and then use the elbow rule to determine the homogeneity threshold (maximum entropy value) below which clusters satisfy the homogeneity condition (are accepted). All clusters with entropy above the threshold should be discarded, used to find a better clustering algorithm, or reanalyzed.
Any clustering method will produce multiple single-element clusters. Since the method checks homogeneity between sentences within a cluster, I calculate entropy only for multi-element clusters.

The only parameters that need to be introduced in the proposed method are:
  1. the 'approximation_order', which defines the number of consecutive characters (known as ngrams) from which we create the probability distribution used later to calculate the Shannon entropy.
  2. The sensitivity parameter S (to adjust the aggressiveness of knee detection) used in the Python kneed library.
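The two steps can be sketched as follows. This is my condensed reading of the method, not the repository code: the ngram entropy follows the 'approximation_order' definition above, and the threshold uses kneed's KneeLocator; the `curve`/`direction` settings assume the entropies are sorted in ascending order.

```python
import math
from collections import Counter

def shannon_entropy(texts, approximation_order=2):
    """Shannon entropy of the character-ngram distribution of a cluster."""
    joined = " ".join(texts).lower()
    n = approximation_order
    ngrams = [joined[i:i + n] for i in range(len(joined) - n + 1)]
    counts = Counter(ngrams)
    total = len(ngrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def reject_clusters(clusters, approximation_order=2, S=1.0):
    """Rank multi-element clusters by entropy and cut at the knee.
    `clusters` maps cluster id -> list of texts; returns rejected ids."""
    from kneed import KneeLocator  # pip install kneed

    multi = {cid: t for cid, t in clusters.items() if len(t) > 1}
    ranked = sorted((shannon_entropy(t, approximation_order), cid)
                    for cid, t in multi.items())
    y = [e for e, _ in ranked]
    knee = KneeLocator(list(range(len(y))), y, S=S,
                       curve="convex", direction="increasing").knee
    if knee is None:  # no clear elbow -> accept everything
        return []
    return [cid for _, cid in ranked[knee + 1:]]
```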
The complete code is available at
https://github.com/Lobodzinski/Semantic_text_clustering__testing_homogeneity_of_text_clusters_using_Shannon-entropy.

To illustrate the method, I show two pictures with the final results for two values of the parameter used to determine the Shannon entropy (approximation_order: 2 and 3).

All the values of the Shannon entropy approximations are ordered from smallest to largest (y-axis). The kneed library, given a parameter S (in this case S = 1), determines the best inflection point, and this entropy value defines our maximum entropy for the accepted sentence clusters. For a given approximation_order parameter (=2), we obtain 14 clusters that are insufficiently homogeneous.
Questionable clusters are:
Cluster 13, #10 Elements 
Cluster 23, #8 Elements 
Cluster 26, #8 Elements 
Cluster 28, #8 Elements 
Cluster 49, #6 Elements 
Cluster 56, #6 Elements 
Cluster 59, #5 Elements 
Cluster 62, #5 Elements 
Cluster 127, #4 Elements 
Cluster 137, #3 Elements 
Cluster 145, #3 Elements 
Cluster 152, #3 Elements 
Cluster 247, #3 Elements 
Cluster 275, #3 Elements 
For comparison, the next picture is calculated for approximation_order=3. In this case, we get 15 incorrect clusters:
Rejected clusters:
Cluster 0, #396 Elements 
Cluster 13, #10 Elements 
Cluster 23, #8 Elements 
Cluster 26, #8 Elements 
Cluster 28, #8 Elements 
Cluster 56, #6 Elements 
Cluster 59, #5 Elements 
Cluster 62, #5 Elements 
Cluster 109, #4 Elements 
Cluster 127, #4 Elements 
Cluster 137, #3 Elements 
Cluster 145, #3 Elements 
Cluster 152, #3 Elements 
Cluster 179, #3 Elements 
Cluster 275, #3 Elements 
Comparing the lists of unacceptable clusters may give more information about the heterogeneities present in our text set.

I encourage you to test the method for yourself. In my case it turned out to be very useful.

The code:
https://github.com/Lobodzinski/Semantic_text_clustering__another_way_to_study_cluster_homogeneity. The outstanding advantage of the method is that it works fully autonomously without the need for human intervention.


Thanks for reading, please feel free to comment and ask questions if anything is unclear.

Thursday, August 12, 2021

Are solar power plants really green energy ? Continuation

This text is a continuation of my reflections on the impact of photovoltaic power plants on climate. In the part Are solar power plants really green energy ? I presented figures showing that the heat generated by currently operating solar power plants (they produce more than 4 times more thermal energy than electricity) is such a large part of the heat emitted by humans that, after converting this heat into CO2 amounts, it exceeds half of the effect generated in 2020 on the whole Earth!
This amount of heat energy is unlikely to have no effect on global warming.

Here are more concrete signals that the development of solar power plant infrastructure could lead to climate disruption and rising temperatures.


  1. Urban Heat Island effect (UHI)

    The UHI is a real phenomenon.
    The paper "The Effect of Urban Heat Island on Climate Warming in the Yangtze River Delta Urban Agglomeration in China" presents the effect of UHI on climate warming, based on an analysis of the effects of urbanization rate, urban population and land-use change on the warming rate of the mean, minimum (night) and maximum (day) air temperature in the Yangtze River Delta (YRD), using observational data from 41 meteorological stations. In conclusion, the authors found that observations of daily mean, minimum, and maximum air temperature at measurement stations in the YRD urban agglomeration from 1957 to 2010 showed significant long-term warming due to background warming and UHI. The warming rate of 0.108 to 0.483°C/decade for mean air temperature is generally consistent with the warming trend in other urban regions in China and in other urban areas around the world.
    Thus, the authors showed that urbanization significantly enhanced local climate warming.
    The solar power plants based on photovoltaic panels are even hotter islands of heat than highly urbanized agglomerations. During the period of most intense sunlight, the temperature near a solar power plant can be up to 3 degrees Celsius higher than the temperature in a similar environment without solar panels and similar solar conditions.

    Suggestion:

    The similarity in heat generation between the two cases - densely populated metropolitan areas and solar power plants - suggests the same effect - warming the air over a larger area.
  2. Correlation between Urban Heat Island (UHI) effect and number of heat waves (HW)

    Due to lack of access to data, I have to rely on visual comparisons (if anyone knows of data to analyze or can make some available, please contact me).
    Below I illustrate 2 cases: USA and Europe (Germany in particular).

Conclusions:

Without access to detailed data, it is difficult to conduct a more detailed analysis of the correlation between the number of the HW and the increase in electricity produced by the growing number of solar plants.
However, the suggestion given by the available data presented above is at least worth a closer analysis.

The ideas in the European document "Fit for a Solar Future: Commission climate package is landmark achievement but more ambition is possible" could prove devastating.



Take care

Sunday, August 8, 2021

Are solar power plants really green energy ?

Are solar power plants really a good solution for energy production ?

Everywhere you hear it's a clean way to get energy. So let's see if it really is.

Introduction:


As an introduction, a few words about how the sun heats the earth and how a solar panel works.
The infrared part of the solar spectrum (wavelength > 700 nm, about 50% of the energy) is directly responsible for heating the Earth's surface and air. This kind of solar radiation exposure on the Earth is considered normal.
Photovoltaic (PV) panels operate in the visible part of the solar spectrum: from approximately 350 nm to 750 nm. It is this part of the solar spectrum that does not normally heat the environment (the part between 700 and 750 nm does). Energy from this range of radiation is partially (14-22%) converted into electrical energy, and the rest (78-86%) is converted into thermal energy.
PVs thus act as a converter of the visible part of the sunlight spectrum (not infrared) into heat (infrared). In other words, they increase the amount of heat compared to the transmitted infrared portion of sunlight. To simplify the estimate, let's assume that 16% of the energy absorbed by a PV panel is converted into electrical energy; the rest is dissipated in the form of heat. Data about the operational power produced by solar panels are given in units of electric power generated by PV systems, i.e. the 84% of the energy dissipated as heat is not included in these values.

Story nr 1 - local:


The first bad effect, a fully local one, is a local increase of temperature near solar power plants ("The Photovoltaic Heat Island Effect: Larger solar power plants increase local temperatures"). Another interesting article, about a super solar power plant in the Sahara that takes into account the local effects of large heat dissipation around solar panels, was written by Jack Marley: "Solar panels in Sahara could boost renewable energy but damage the global climate – here's why".

Story nr 2 - global:


Now let's try to look at things globally.
The number of solar panels on Earth is growing almost exponentially every year. According to the Renewable Capacity Statistics 2021 website, as of 12.2.2021 the world had 714 GW of operational Photovoltaic (PV) systems. Let's try to translate this value into a carbon footprint by treating all operating PV systems as one.
Some assumptions at the beginning:
  1. 84% of the incident radiation is dissipated by the solar panel into heat (16% becomes electricity, as assumed above).
  2. 1 kW of a solar panel system covers an area of about 8 m2.
  3. Solar irradiance: averaged over the year and the day, the Earth's atmosphere receives 340 W/m2 of radiation from the sun (https://en.wikipedia.org/wiki/Solar_irradiance). The PV systems are distributed across the Earth, so I assume that the average solar radiation used in the calculations is 150 W/m2.
  4. Average CO2-equivalent footprint of 1 kWh: 0.5 kg CO2/kWh. Obviously, the CO2 emission intensity per 1 kWh differs between countries; the value 0.5 corresponds to the average over sunny countries.
    More detailed data by country and region is available on the website https://www.carbonfootprint.com/.
  5. Conversion from W to kWh: 1 W sustained for one hour == 0.001 kWh


The 714 GW of operational PV systems corresponds to a total panel surface (St) of: St = 714 000 000 kW × 8 m2/kW = 5 712 000 000 m2 = 5 712 km2.
That is slightly less than the area of Bali (5 780 km2), or roughly 0.001% of the Earth's surface. Since we are talking about energy produced by the operational PV systems, we can assume that this is the 16% of the energy converted into electricity. Therefore, the energy dissipated as heat (Ht) produced at the same time is:

Ht = 714 [GW] × (84 [%] / 16 [%]) = 3748.5 [GW].
Now let's calculate the carbon footprint of this amount of heat. Using the conversion from W to kWh (1 W for one hour == 0.001 kWh), our amount of heat energy (Ht) is equivalent to 3 748 500 000 kWh ≈ 3 748.5 GWh every hour.

This value corresponds to the following carbon footprint (assuming 0.5 kg CO2 per kWh):
1 874 250 000 kg CO2 / hour,
or
16 418 430 000 000 kg CO2 / year ≈ 16.4 Gt/year.
This is a huge value and corresponds to 52% (!) of the total CO2 emissions in 2020 (31.5 Gt/year)! Thus, we have an unexpected situation, because it looks as if solar panels are far worse at producing energy than fossil fuels. Let's see a comparison of the increase in operational energy of PV systems with the change in the global temperature anomaly as a function of years. Temperature data from https://climate.nasa.gov/vital-signs/global-temperature/.
Please note the different scales of the data presented in the figure. The operational power generated by solar panels is shown in red, the temperature anomalies in blue. Almost perfect correlation !
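The whole back-of-envelope chain above fits in a few lines, so readers can re-run it themselves. Every input here is one of the listed assumptions, not a measurement:

```python
# Back-of-envelope reproduction of the heat / CO2 estimate above.
# All inputs are the post's assumptions, not measured quantities.
PV_ELECTRIC_GW = 714.0   # operational PV capacity (electric output)
EFF = 0.16               # assumed fraction of absorbed energy -> electricity
CO2_PER_KWH = 0.5        # assumed kg CO2-equivalent per kWh
HOURS_PER_YEAR = 8760

# Heat released alongside 714 GW of electric output: 84/16 of it
heat_gw = PV_ELECTRIC_GW * (1 - EFF) / EFF        # -> 3748.5 GW
heat_kwh_per_hour = heat_gw * 1e9 * 0.001         # GW -> W -> kWh per hour
co2_kg_per_hour = heat_kwh_per_hour * CO2_PER_KWH
co2_gt_per_year = co2_kg_per_hour * HOURS_PER_YEAR / 1e12

print(round(heat_gw, 1))          # 3748.5
print(round(co2_gt_per_year, 1))  # 16.4
```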

Summary:


  1. Is CO2 really responsible for warming the earth ?
  2. The correlation between temperature anomalies and the amount of energy produced by PV systems is surprising to say the least !
  3. By building PV systems, we create smaller or larger heat islands around them, disturbing the natural energy balance in such an area. The PV systems produce more than 4 times more thermal energy than they produce in the form of electricity.
  4. By producing solar panels we pollute our environment (+ the need to recycle).


The final conclusions rather indicate that solar power plants do more damage than conventional ones.

Now, the natural question is whether we are already seeing a correlation of climate change with the increase in heat islands around solar power plants.
  1. Is there a correlation between the heat energy produced by the increasing number of solar plants and the increase in air temperature via the Heat Islands effect ?
  2. Is there a correlation between the frequency of Heat Waves and the thermal energy produced by solar power plants ?
This is the subject of the next text: Are solar power plants really green energy ? Continuation.

I would be grateful if someone could point out to me the error I am making in the above approximations.

Take care

Thursday, July 15, 2021

Global Real Estate market: a non-expert view

About:
Most analyses of real estate prices compare their behavior over time with other economic indicators, but this is done for a specific country or independently for a group of countries. This text proposes a comparison of real estate prices between different countries by calculating the correlation between them.

Data:
I came across some data on real estate prices (https://data.world/finance/international-house-price-database). The data contains values of 4 quantities (with short descriptions found in Wikipedia and other sources):
  1. the house price index (HPI):
    measures the price changes of residential housing as a percentage change from some specific start date (starting in 1975).
  2. the house price index expressed in real terms (RHPI):
    the deflated house price index (or real house price index), i.e. the ratio between the house price index (HPI) and a price deflator.
  3. the personal disposable income index (PDI):
    measures the after-tax income of persons and nonprofit corporations; it is calculated by subtracting personal tax and nontax payments from personal income.
  4. the personal disposable income expressed in real terms index (RPDI):
    the deflated PDI.


Analysis:
As input we have time series with specified quantity $Q$ (HPI, RHPI, PDI or RPDI) for N (N=24) countries ( 'Australia', 'Belgium', 'Canada', 'Switzerland', 'Germany', 'Denmark', 'Spain', 'Finland', 'France', 'UK', 'Ireland', 'Italy', 'Japan', 'S. Korea', 'Luxembourg', 'Netherlands', 'Norway', 'New Zealand', 'Sweden', 'US', 'S. Africa', 'Croatia', 'Israel', 'Slovenia'). In order to calculate correlations between countries I do the following calculations:
  1. for a given quantity $Q$ I normalize all data independently to the range [0,1],
  2. I determine the two-site correlation function for each timestamp $t$: \begin{equation} \label{1} Corr_{country, another\_country} \left(t \right) = Q_{country}\left(t \right) Q_{another\_country}\left(t \right) \end{equation} which is finally used to calculate the global correlation for each country: \begin{equation} \label{2} C_{country}\left(t \right) = \frac{\sum_{another\_country=1}^{N} Corr_{country, another\_country} \left(t \right) }{N} \end{equation}
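The two steps reduce to a couple of pandas operations, since averaging the pairwise products over countries is the same as multiplying each normalized series by the cross-country mean at each timestamp. The sketch below uses synthetic data in place of the real time series, and reads the sum in the formula literally (i.e. as including the country itself):

```python
import numpy as np
import pandas as pd

# Hypothetical input: rows = timestamps, columns = countries, values = one
# quantity Q (e.g. RHPI); the real series come from the database linked above.
rng = np.random.default_rng(1)
q = pd.DataFrame(rng.random((40, 4)).cumsum(axis=0),
                 columns=["US", "UK", "Germany", "Japan"])

# 1. normalize each country's series independently to [0, 1]
q_norm = (q - q.min()) / (q.max() - q.min())

# 2. C_country(t) = (1/N) * sum over countries of Q_country(t) * Q_other(t),
#    i.e. each normalized series times the cross-country mean at time t
C = q_norm.mul(q_norm.mean(axis=1), axis=0)
```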

The dynamics of the correlation function $C_{country}\left(t \right)$ calculated in this way, for the quantity $Q=$RHPI and for all countries, is shown in Figure 1. For the quantity RHPI, the correlations are most apparent.
The financial crisis of 2008 is very clearly visible in this figure (the yellow band between 2007 and 2008).

A couple of observations for the time period 2007-2008:
The longest increase of the value of RHPI is seen for the US, lasting since about 1995. Other countries behave in a weakly correlated way during this period. A strong correlation between countries starts to be visible from about 2005 and quickly increases until the crash around 2007-2008. It looks as if most countries joined the global real estate market at the same time (around 2005) and at a given signal decided to crash - "ready, steady, crash!". The market seems to be too well orchestrated. I know that some people will say that this is a normal behavior because it is a global market, all markets are interconnected, etc. However, please note that it is hard to see any trace of the dot-com crisis of 2000-2003 (dot-com bubble) in this picture. Another comment on the picture concerns the number of countries participating in the crash. There are some countries excluded (Japan, S. Korea, Israel) or weakly participating in the process (Australia, Canada, Switzerland, Germany, New Zealand, Sweden, Croatia). Altogether, 13 countries out of 24 are affected by the crash.

Observations for the time period 2008-now:
The first observation is that more countries are now correlated (and not because of COVID). Japan, S. Korea and Croatia are still outside the correlated market. Spain and Italy are not correlated (a slip-up?).

Summary:
the management of the real estate market is becoming more and more consolidated (only one player?): currently 19 countries out of 24 (2008 crisis: 13 countries).
Question: when will this player decide to make another crash?

If anyone has knowledge of more data I would be grateful for providing it.