Regression effect and regression fallacy

If two variables are normally distributed and positively correlated, so that their scatter plot is football shaped, then x values that are above average tend to be paired with y values that are also above average, but on average by less (measured in standard deviations). Likewise, x values that are below average tend to be paired with y values that are, on average, closer to the mean. This is seen whenever there is spread around the regression line.

This is known as the regression effect. It does not need any external cause; ordinary variability is enough to explain it.

The regression fallacy is attributing an external cause to the regression effect. For example, let's say a newbie in cricket does extremely well and becomes a star overnight. He gets a lot of endorsements and good media coverage. The next year, however, he does not perform as well. His performance may even be below average. This could be due to pure variability. Attributing it to some external cause, for example saying that all the success has gone to his head, is wrong. This is the regression fallacy.
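A small simulation makes the regression effect concrete. The sketch below is only an illustration: it assumes standardized scores with a correlation of 0.5 between year one (x) and year two (y), and the helper name simulate_pairs is made up for this example.

```python
import math
import random
import statistics


def simulate_pairs(n, r, seed=0):
    """Draw n (x, y) pairs of standard normal variables with correlation r."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        # y shares a fraction r of x; the rest is independent chance variation
        y = r * x + math.sqrt(1 - r * r) * noise
        pairs.append((x, y))
    return pairs


pairs = simulate_pairs(n=50_000, r=0.5)

# "Star" performers: more than one SD above average in year one (x)
stars = [(x, y) for x, y in pairs if x > 1.0]
mean_x = statistics.mean(x for x, _ in stars)
mean_y = statistics.mean(y for _, y in stars)

print(f"Year-one average of the stars: {mean_x:.2f} SDs above the mean")
print(f"Year-two average of the stars: {mean_y:.2f} SDs above the mean")
```

The stars are still above average in year two, but by roughly half as much, even though nothing in the simulation "goes to their head". The drop comes from chance variation alone.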

School Ranking Analysis

This is a continuation of my previous blog post (Are the school rankings biased?). In this post I analyze various categories of schools, viz. International vs. Non-International and Residential vs. Day cum Boarding vs. Day schools. The source of the data is the school rankings published by Education World in 2014.

I processed the raw data on school scores from Education World. Only the following parameters were retained, and the overall score was calculated as the average of the scores on these parameters.

  • Value for Money
  • Sports
  • ParentsInvolvement
  • IndividualAttention
  • FacultyCompetence
  • AcadsReputation
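As a minimal sketch of this scoring step, the overall score is just the mean of the six retained parameters. The example school and its scores below are made up for illustration, and overall_score is a hypothetical helper name, not something taken from the Education World data.

```python
from statistics import mean

# The six parameters retained for the overall score
RETAINED = [
    "Value for Money", "Sports", "ParentsInvolvement",
    "IndividualAttention", "FacultyCompetence", "AcadsReputation",
]


def overall_score(school):
    """Average the retained parameter scores of one school."""
    return mean(school[name] for name in RETAINED)


# Made-up scores for a hypothetical school, for illustration only
example_school = {
    "Value for Money": 70, "Sports": 80, "ParentsInvolvement": 60,
    "IndividualAttention": 75, "FacultyCompetence": 85, "AcadsReputation": 90,
    "LifeSkills": 50,  # not retained, so it does not affect the result
}

print(round(overall_score(example_school), 2))
```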

Analysis reveals the following key points.

  1. Boys schools, Day cum Boarding schools, Residential schools and International schools have scored better overall.
  2. Most schools focus well on academics and score well on that front. In contrast, very few schools focus on Special Needs Education and Life Skills, and most schools score below average on them.
  3. Day Coed schools far outnumber schools in any other category. All the International and Day cum Boarding schools are Coed.
  4. Fees of Residential International schools are far higher than those of any other category. Girls Residential schools have higher fees while Boys Residential schools have lower fees. International schools have higher fees on average.
  5. Non-International Residential and Day cum Boarding schools provide better value for money. International Day cum Boarding schools score worst on Value for Money.
  6. Boys schools typically do better in sports and Girls schools score well on academics. However, Residential Girls schools and Boys Day schools score well on both sports and academics.
  7. The largest number of Non-International Day schools is in Delhi/NCR. The largest number of Residential schools is in Uttarakhand. Maharashtra has the largest number of International schools.
  8. Schools in Himachal Pradesh score very well across all the evaluation parameters. Interestingly, schools in Jharkhand score well on academics but poorly on all other fronts.

Check out the link below to find out how your state scores. It shows the distribution of schools across various locations in India.

https://public.tableau.com/views/SchoolRankingAnalysis/Top5SchoolsinIndia

The graphs and charts below depict the visual analysis summarized above.

schools-stack-bar
Number of schools in various categories
schools-temporal
Number of schools Established per year

The chart shows that the total number of schools has been increasing ever since India gained independence. Most of these schools were day schools. Since 1967, however, the number of Day Coed schools has increased rapidly while the number of Boys and Girls schools has remained steady. International schools are a recent phenomenon, starting in 2003, as are Day cum Boarding schools.

school-box
Average Fees/ Year across various categories of Schools

Average fees across the various categories show that the fees of Residential International schools are far higher than those of any other category. Girls Residential schools have higher fees while Boys Residential schools have lower fees. International schools have higher fees on average.

2x2
Scores of various Schools on Sports and Academic front

This chart shows how the various schools are placed with respect to academics and sports.

We can see that Boys Day schools and Girls Residential schools lie in the top right quadrant, which means that they score high on both academics and sports. This suggests that girls are more sincere in academics as such. Their focus on sports increases if they are in residential schools, and they excel in both.

Boys, on the other hand, do better in sports as such. If they are in Day schools, however, pressure from parents may increase their focus on academics, and they do better in academics as well.

slope
How various types of schools score on various parameters

This graph shows how the various types of schools do on certain other parameters. Clearly, Day cum Boarding International schools offer the least value for money. Residential schools are better for Special Needs Education.

Are the school rankings biased?

Every year parents wait eagerly for the school rankings to be published in leading dailies, to get an idea of the top schools for their child's admission. These rankings are based on surveys of reputed schools that wish to participate and let themselves be evaluated on various parameters.

But do the rankings and scores of the schools that come out of the rigorous analysis really give the correct picture? Is the analysis truly statistically correct?

In this post I try to find out whether the school ranking is truly unbiased. The data has been taken from the 2014 school rankings published by Education World (educationworld.in).

The bar chart below shows the coefficient of determination between the Total Score and each of the various parameters.

schools_coeff

It is evident that Teachers Welfare and Development, Leadership Quality Management and Life Skills account for more than 70% of the variation in the Final Score, while an important parameter like AcadReputation contributes the least to the Final Score.
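For a single parameter, the coefficient of determination is the square of its Pearson correlation with the Total Score. The sketch below shows that calculation on made-up numbers; the parameter names ParamA and ParamB and all the scores are placeholders, not values from the Education World data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def r_squared(xs, ys):
    """Share of the variation in ys explained by a straight-line fit on xs."""
    return pearson_r(xs, ys) ** 2


# Made-up scores for five hypothetical schools
total_score = [10, 20, 30, 40, 50]
parameters = {
    "ParamA": [1, 2, 3, 4, 5],  # moves in lockstep with the total
    "ParamB": [2, 1, 4, 3, 5],  # follows the total only loosely
}

for name, scores in parameters.items():
    print(name, round(r_squared(scores, total_score), 2))
```

One caution when reading such a chart: when the parameters are correlated with each other, their individual coefficients of determination overlap and cannot simply be added up.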

The table below shows the correlation coefficients between the various parameters. Green means high correlation while red means low correlation.

schools_cor

Looking at r between the various parameters, we see that the following pairs are highly correlated:

  • TeachersWD vs FacultyCompetence: r = 0.703
  • TeachersWD vs LifeSkills: r = 0.74
  • TeachersWD vs Internationalism: r = 0.729
  • LeadMgmt vs LifeSkills: r = 0.706
  • Cocurrics vs Sport: r = 0.726

Since Teachers Welfare and Development is correlated with at least three other parameters, the scores are biased in favor of schools that score high on Teachers Welfare and Development. Unbiased scores can be calculated only when the various parameters are not highly correlated among themselves.
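To see how often each parameter shows up in a strongly correlated pair, the pairs listed above can simply be tallied. This is a minimal sketch: the r values are the ones quoted in this post, while the 0.7 cut-off is my assumption about what counts as "high".

```python
from collections import Counter

# Pairwise correlations quoted in the list above
strong_pairs = {
    ("TeachersWD", "FacultyCompetence"): 0.703,
    ("TeachersWD", "LifeSkills"): 0.74,
    ("TeachersWD", "Internationalism"): 0.729,
    ("LeadMgmt", "LifeSkills"): 0.706,
    ("Cocurrics", "Sport"): 0.726,
}
THRESHOLD = 0.7  # assumed cut-off for "highly correlated"

# Count, for each parameter, how many strong partners it has
partners = Counter()
for (a, b), r in strong_pairs.items():
    if abs(r) >= THRESHOLD:
        partners[a] += 1
        partners[b] += 1

for name, count in partners.most_common():
    print(name, count)
```

TeachersWD comes out on top with three strong partners, which is why its influence is effectively counted several times over in a plain average.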

In order to remove this bias in the ranking, we retained only the parameters that have low correlation coefficients among themselves. The table below shows these parameters. They are:

  • Value for Money
  • Sports
  • ParentsInvolvement
  • IndividualAttention
  • FacultyCompetence
  • AcadsReputation

schools-cor2

We recommend that the overall score be based on parameters that have relatively low correlation among themselves, so that the overall score of any school is unbiased.

My next blog post analyzes the various categories of schools based on their scores.

Geospatial and temporal visualization of AIDS poster data

Project Description

AIDS has radically transformed the world and become the focus of interdisciplinary study and research from a medical, cultural, and media-historical perspective. Over the past 30 years, the German Hygiene Museum in Dresden has collected numerous items (predominantly posters) which have been used in the media campaign to combat the epidemic. It is the world’s largest collection of AIDS posters, with over 9,000 specimens from 147 countries.
The goal of the project is to visualize the distribution of symbols, gestures, and topics addressed in the posters through space and time so that other researchers and members of the public can understand the development of the cultural response to the AIDS epidemic.

Data Description:
In the course of the fellowship project “AIDS as a global media event”, 2,715 posters from the German Hygiene Museum’s collection were classified and codified using the ICONCLASS system. Each record captures the date, language, ICONCLASS classification number, keywords, and geographic data for a given media object.
ICONCLASS is an iconographic classification system which assigns codes (combinations of numbers and letters) to common subjects in Western art. ICONCLASS is a hierarchical classification that uses alphanumeric codes divided among 10 major classes.

Insights
After evaluating various visualizations, the following inferences have been drawn:
The poster campaign against AIDS started in 1983, after the detection of the first case in the US. The USA was the first country to start the campaign, which was swiftly picked up by other countries. The US and Europe lead the world in the number of posters created.
Most posters were created in the first decade (1983-1993), after which their numbers gradually went down. Although the US was the first to start the campaign, Europe has consistently outpaced other regions in the number of posters created. Posters created in Africa generally use more themes or symbols per poster, making them somewhat more complex than those from other regions.
Human Being has been the most popular symbol or theme in posters, followed by Society, Civilization and Culture, and then Nature. These three symbols together account for more than 80% of posters.
While most of the Religion and Abstract posters were made in Europe, most of the posters with mythological and historical themes were made in the US. Most of the Red and Green Ribbon posters were made in Africa.
Most of the posters created in the early years called on people to come forward and seek help or information on AIDS by calling the given numbers. Early posters also used explicit sexual themes to educate people about transmission of the disease through unprotected sex. A good number of posters have used condoms to focus on ways of preventing the disease. In later years the posters acquired a somewhat subtler, more implicit tone, with themes becoming more symbolic in nature.
Each poster uses one or more themes or symbols to drive home a message about AIDS. An analysis of the combinations of symbols used in posters shows that Human Being and Society, Civilization and Culture have been used together most widely. This is followed by Human Being and condoms being used together. AIDS posters have frequently made use of the nude figure of a man and of relationships between individuals. Things that stand out in the posters include arms, hands, heads, postures and gestures.
These insights are based on the graphs and analysis provided below.

Analysis

image1-pie

image2-bar

Africa, Europe and North America are well represented in the data, each accounting for around 30% of it, in contrast to Asia, South America and other regions, which have less than 5% of the data available.

Posters have been classified into 10 basic categories, each having multiple subcategories. ‘Human Being’, ‘Society, Civilization and Culture’ and ‘Nature’ put together have been used in more than 80% of posters, as seen in the graph.
Average life span of posters = 10 years

Geospatial Distribution of Posters

image3-geo

This graph shows the geographic distribution of poster counts along with their start and end years. The circle size depicts the poster count, while the interior and exterior colors show the dates of the posters as given in the legend. We can see that the US has the largest share of posters, which were published early, and that the poster campaign still continues there.

image4-stacked

This graph shows what percentage of the posters in a given symbol category were generated in a particular region. It shows that most of the Abstract posters and the Religion and Magic posters were made in Europe.
America accounts for the largest fraction of posters in the Mythological and Historical category.

image5-geo2

This graph shows the average number of themes used per poster. Clearly, some nations in Africa have more complex AIDS posters that try to communicate many things at once.

Temporal Distribution of Posters

image6-bar

The poster time chart shows that the poster campaign against AIDS was at its peak from 1985 to 1995, with 1990 being the median year. Most of the posters were made during this period. This is in line with the fact that the first cases of AIDS were detected in the 1980s, and a massive campaign was undertaken around that time. With time the campaign gradually tapered off as awareness increased.

image7-area

The use of symbols over time shows that ‘Human Beings’ were the first to be used in posters and have remained the most popular throughout.

image8-line

The distribution of the posters across continents and time shows that the campaign started in America, followed by Europe and later by Africa.

Burst Analysis
Burst analysis of the text part of the ICONCLASS classification is shown below:

image9-burst

Burst on ‘telephone’: In the early years, due to the lack of awareness about AIDS and the stigma associated with it, people may not have been very forthcoming. Hence telephones were used in the campaigns to prompt them to come forward and call the given numbers to get the help they needed.

Red Ribbon: The Red Ribbon is universally used as a symbol of AIDS awareness. It was created in 1991, which probably explains why it figures in the burst analysis in 1991 and continuously from 1995 onwards.

The pattern of the burst analysis shows that during the early years any given symbol or gesture was used very widely, but only for a short period. In the later years, in contrast, the burst periods became longer. This probably means that the campaign has matured and stabilized, and the symbols or gestures perceived to be most effective have been kept in use for longer. So the focus has shifted from creativity and variety in posters to fewer, more effective themes.

Also, the symbols or gestures in the early days were more sexually explicit, e.g. the image of a phallus or buttocks, while in the later years the themes were more symbolic and figurative.

Network Analysis
Word Co-occurrence Network from themes

image10-net1
A word co-occurrence network has been created for the themes, i.e. the text portion of the ICONCLASS classification. The network shows which words co-occur in posters. The words in the themes form the nodes, while their frequency of co-occurrence is shown by the weight of the edges.
The network has various communities, shown in different colors. The biggest community is shown in blue. It comprises a variety of words that figure in themes. Arm, Head, Hand, Condom, Human etc. are the most widely used terms. The community in yellow is smaller and has terms that are more sexually explicit.

Co-occurrence Network of symbols in themes

image11-net2

This network has been generated from the hierarchy of symbols in the ICONCLASS classification. The various ICONCLASS categories and subcategories form the nodes, and the edges represent the linkages, i.e. the number of times they have occurred together in themes. The network shows that Human Being, depicted either as Man in the biological sense or as a nude human being, is the most commonly used theme in posters.
Condom, though the most commonly used term in themes, stands alone as it is not a part of the ICONCLASS classification.

Co-occurrence Network of Symbols in Posters

image12-net3
Each poster has one or more themes associated with it. The objective is to merge the themes by poster and visualize the co-occurrence of the higher-level symbols (ICONCLASS classes) used in posters. The network shows which symbols (nodes) co-occur in posters, with their frequency of co-occurrence shown by the weight of the edges.
The network shows which symbols have been used together in posters in the campaign against AIDS. Human Being and Society, Civilization and Culture have frequently been used together. Abstract Ideas have only been used with Human Being and with Society, Civilization and Culture.
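The edge weights of such a co-occurrence network can be computed by counting, for every pair of symbols, the number of posters in which both appear. The sketch below shows the idea on four made-up posters; it is not the actual workflow used to build the network above, and the poster data is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical posters, each tagged with the top-level symbols it uses
posters = [
    {"Human Being", "Society, Civilization, Culture"},
    {"Human Being", "Society, Civilization, Culture", "Abstract Ideas"},
    {"Human Being", "Nature"},
    {"Human Being"},  # a single symbol contributes no edge
]

edge_weights = Counter()
for symbols in posters:
    # sorted() gives every unordered pair one canonical key
    for a, b in combinations(sorted(symbols), 2):
        edge_weights[(a, b)] += 1

for (a, b), weight in edge_weights.most_common():
    print(f"{a} -- {b}: {weight}")
```

The resulting pairs and weights are the edge list of the network; the symbols themselves are the nodes.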

Co-occurrence network of Language

image13-net4
Many posters have been published in more than one language simultaneously. The diagram shows how the languages are related as far as the AIDS posters are concerned. The languages form the nodes, while the edges represent the frequency with which two languages have been used together in posters.
The node size represents the poster count, while the node color shows the start year of the posters: dark color for recent years, light color for earlier years.
The largest number of posters was co-published in English and Spanish. English and Spanish are the most strongly linked pair, followed by (English, Swahili), (English, French) and (French, Arabic).

Bipartite Network of Region and Language

image14-net5
This network is meant to show which languages are most commonly used across the continents in AIDS posters. Nodes with a degree of less than 3 have been filtered out, as we are interested only in languages that are common across multiple (3 or more) continents. The nodes are sized by poster count and colored by node type (region or language).
The network diagram shows that English has been used in posters across all the regions of the world. Portuguese is the next most common language; the only region where it has not been used is the Middle East. Arabic and Spanish are the next most common languages across regions.
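The degree filter can be sketched in a few lines: in a region-language bipartite network, the degree of a language node is the number of distinct regions it is linked to. The records below are hypothetical, and the sketch filters the language nodes only.

```python
from collections import defaultdict

# Hypothetical (region, language) pairs, one per poster
poster_records = [
    ("Africa", "English"), ("Europe", "English"), ("North America", "English"),
    ("Africa", "Portuguese"), ("Europe", "Portuguese"),
    ("South America", "Portuguese"),
    ("Africa", "Swahili"),
    ("Europe", "German"),
]

# Degree of a language node = number of distinct regions it connects to
regions_by_language = defaultdict(set)
for region, language in poster_records:
    regions_by_language[language].add(region)

MIN_DEGREE = 3  # keep languages used in 3 or more regions
common_languages = {
    language: sorted(regions)
    for language, regions in regions_by_language.items()
    if len(regions) >= MIN_DEGREE
}

print(common_languages)
```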

Relevant Links
http://www.dhmd.de/
http://www.kulturstiftung-des-bundes.de/cms/en/programme/fellowship_internationales_museum/aids_als_globales_medienereignis.html
http://www.iconclass.nl/contents-of-iconclass

Visualization of Global Human Development indicators of various countries over the years

This is an Exploratory Analysis of the Global Human Development datasets downloaded from http://hdr.undp.org/en.

Following datasets were used in the analysis:

– HDI: Human Development Index and its components: The Human Development Index (HDI) is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable and having a decent standard of living.

– GII: Gender Inequality Index: exposes differences in the distribution of achievements between women and men. It measures the human development costs of gender inequality; thus, the higher the GII value, the greater the disparities between females and males. GII values vary tremendously across countries, ranging from 2.1 percent to 73.3 percent.

– MPI: Multidimensional Poverty Index: This index is a measure of overlapping deprivations suffered by people at the same time. It identifies deprivations across the same three dimensions as the HDI.

Following components of HDI were also analyzed:

– GNI: Gross National Income: The standard of living dimension is measured by gross national income per capita.

– MSY: Mean Schooling Year: The education component of the HDI is measured by the mean years of schooling for adults.

– LE: Life expectancy in years: The health dimension of the HDI is assessed by life expectancy at birth.

Indices are not available for some countries and years. Missing data is ignored in this analysis.

We see that Australia had the highest HDI from 1980 to 1990. Norway overtook Australia in terms of HDI in 2000 and has maintained its top spot.

The latest data is from 2013 and the earliest is from 1980. The lowest HDI ever recorded was 0.146, for Liberia in 2003. The highest was for Norway in 2013. The median HDI over the years is 0.64.

The minimum GII recorded is 0.021, for Slovenia in 2013. The highest was 0.885, for Yemen in 1995. The median GII is 0.446. We notice that the countries with the highest GNI are not the ones with the highest HDI.
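Summary figures like these can be computed while skipping the missing values. Here is a minimal sketch, assuming the data is held as (country, year, value) records with None for missing entries. Apart from the Liberia 2003 figure quoted above, the records are placeholders.

```python
from statistics import median

# (country, year, HDI); None marks a missing value.
# Only the Liberia 2003 figure comes from the text; the rest is made up.
records = [
    ("Liberia", 2003, 0.146),
    ("Country A", 1980, None),
    ("Country A", 2013, 0.640),
    ("Country B", 2013, 0.900),
]

# Ignore missing data, as done throughout this analysis
available = [rec for rec in records if rec[2] is not None]

lowest = min(available, key=lambda rec: rec[2])
highest = max(available, key=lambda rec: rec[2])
median_hdi = median(rec[2] for rec in available)

print("Lowest:", lowest)
print("Highest:", highest)
print("Median:", median_hdi)
```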

The geospatial variation of HDI shows that America, the western countries of the European Union, Australia and New Zealand have higher HDI, while Africa has the nations with the lowest HDI.

The variation of HDI and its components with time shows that the average HDI of nations has been increasing constantly since 1980. Average GNI decreased from 1983 to 1994 before gradually increasing until 2008, after which it started declining again. Mean years of schooling and average life expectancy have increased since 1980.

The heat map below shows the standing of nations by their HDI and GII. Norway has the highest HDI while Niger has the lowest. Yemen stands out for its high GII, i.e. gender inequality. Qatar seems to fare better in terms of HDI but, in contrast, has a poor GII.

The scatter plot of HDI, GII and GNI shows that there is a clear relation between HDI and GII. Nations that have poor human development records have higher gender-based inequality. These are also the nations that have poor GNI (Gross National Income). HDI is not determined by GNI alone: some nations, such as Qatar, have higher GNI but still fare relatively poorly on HDI.

The next scatter plot shows the variation of HDI with its components. The nations with high HDI might not have the highest GNI, but they score well on life expectancy and education.

The scatter plot of the dimensions of the Multidimensional Poverty Index shows that, for nations with high multidimensional poverty, the main contributor is the poor living conditions of people there. Health and education are not the primary contributors. For nations faring well on MPI, in contrast, the primary contributors are education and health.

The time series shows that Cambodia has had the highest rate of improvement in HDI (130%), followed by Afghanistan (108%).
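A rate of improvement like this is a percentage change between two points of a country's HDI series. The sketch below assumes the comparison is between the earliest and the latest year with data available for that country; the series is made up, and percent_improvement is a hypothetical helper name.

```python
def percent_improvement(series):
    """Percent change in HDI from the earliest to the latest available year.

    series maps year -> HDI value, with None for missing years.
    """
    available = {year: v for year, v in series.items() if v is not None}
    first = available[min(available)]
    last = available[max(available)]
    return (last - first) / first * 100


# Made-up series for a hypothetical country
example = {1980: 0.30, 1990: None, 2000: 0.36, 2013: 0.45}

print(round(percent_improvement(example), 1))
```

Because the change is measured relative to the starting value, countries that start from a very low HDI can show large percentage improvements even when their absolute HDI is still low.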

Dynamic co-occurrence network of authors and their co-authors

Co-occurrence networks are undirected networks that connect entities of the same type.

Given below is a dynamic co-occurrence network of Katy Borner and her co-authors. Records with Katy Borner as an author were downloaded from the NSF database at http://sdb.cns.iu.edu.

The .csv file was loaded into the Sci2 tool and the co-occurrence network was extracted. The starting year was defined as one of the node attributes. The starting year is the year in which the first publication co-authored by Katy and another author was published. The extracted network was visualized using Gephi.

In the Data Laboratory tab, the ‘Merge Columns’ option was used to convert the Start Year column to a time interval. The network was thus rendered as a dynamic network, with nodes gradually appearing and the network growing over time.
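The starting year attribute can be derived from the records before they go into a network tool. The sketch below is not the Sci2 workflow itself, only an illustration of the logic; the records and co-author names are made up.

```python
# Hypothetical records: (year, list of authors)
publications = [
    (2005, ["Katy Borner", "Author A"]),
    (2003, ["Katy Borner", "Author A", "Author B"]),
    (2010, ["Katy Borner", "Author C"]),
]
FOCAL_AUTHOR = "Katy Borner"

# Start year of a co-author = year of the earliest record shared with Katy
start_year = {}
for year, authors in publications:
    if FOCAL_AUTHOR not in authors:
        continue
    for author in authors:
        if author == FOCAL_AUTHOR:
            continue
        start_year[author] = min(year, start_year.get(author, year))

for author, year in sorted(start_year.items()):
    print(author, year)
```

Each co-author node appears in the dynamic network from its start year onwards.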

dynamic-network

Nodes are colored based on the number of grants awarded and are sized based on their degree.

Blue: 13 awards

Red: 2 awards

Green: 1 award

Bipartite network linking Institute names and research programs

Bipartite networks connect entities of different types. Given below is a bipartite network showing the linkages between the institute names and the research program names with which Katy Borner is associated.

Records of the various research programs with which Katy Borner is associated were downloaded from the NSF database at http://sdb.cns.iu.edu.

The .csv file was loaded into the Sci2 tool and the bipartite network was extracted. The network was visualized as a Bipartite Network Graph in Sci2.

bipartite