Wednesday, May 27, 2015

How good is our education system?

-- edit -- After publishing this, I read about the central limit theorem in more depth and realized that the bell curve is not demanded by statistical law, because exam results are strongly influenced by factors like the amount of effort students put in. The bell curve is still desirable, though, since it spreads students evenly across marks. -- edit --

Exactly how good is our education system? It is touted to be a very good one, supported by a few shining gems, but what does the data actually say?

Over the course of this article you will find some pleasant but unexpected things, and some expected ones. As for the absolutely shocking, I can make no promises.

First up is my home state, Rajasthan. The Rajasthan Board of Secondary Education has been put under the microscope. The choice was simple: this examination board asked for nothing more than a roll number to retrieve a student's result. Fair enough. I obliged and wrote a Python script to automatically get as many results as possible. After about 30,000 records my Internet connection was ready to die, so I stopped mining for the day.
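The script amounted to looping over roll numbers and scraping the total out of each page. A minimal sketch of the idea (the URL, page markup and "Total:" pattern below are made up for illustration; the board's real site differed):

```python
import re
from urllib.request import urlopen

# Hypothetical endpoint: the board site's real URL and form differ.
RESULT_URL = "http://example.com/result?roll={roll}"

def fetch_result(roll):
    """Download one result page (needs the network, so not run here)."""
    with urlopen(RESULT_URL.format(roll=roll)) as resp:
        return resp.read().decode("utf-8", errors="replace")

def parse_total(html):
    """Pull the total marks out of a page.

    Assumes a 'Total: NNN' string appears somewhere in the page; the
    real markup needed a slightly hairier regular expression.
    """
    m = re.search(r"Total:\s*(\d+)", html)
    return int(m.group(1)) if m else None

# Offline demonstration on a made-up page.
page = "<html><body>Roll 1001<br>Total: 412</body></html>"
print(parse_total(page))  # 412
```

The real loop simply iterated fetch_result over a range of roll numbers and collected whatever parsed.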

While I was processing the results of my mining, I realized that the results of the CBSE class 12 board examinations had been declared. With wild glee I prayed that they too ask for nothing more than a roll number. My prayers were granted. After a little tinkering, the Python script was good to go for the CBSE website. Again I put my computer to use and started mining. After about 50,000 results my Internet connection became really slow and my computer became really hot. That got my attention and I stopped mining.

Now for the data analysis. What should I do with this data? A simple answer was to check whether it follows the normal curve. What is the normal curve, you ask? In simple words:
In any large population, if you count the frequency of a trait, you will often observe a bell-like curve.
This is a powerful prediction with roots in solid mathematics. A strong deviation from it suggests a bias in the real-world process that generated the data. With this in mind I obtained a histogram.
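For the curious, the check itself is only a few lines of Python. The sketch below builds a text histogram from synthetic Gaussian marks standing in for the mined data (the mean of 54, the spread, and the bin width are all illustrative):

```python
import random
from collections import Counter

random.seed(42)

# Synthetic stand-in for the mined marks: Gaussian, clipped to 0-100.
# The real data came from the board's website, not from this generator.
marks = [min(100, max(0, round(random.gauss(54, 12)))) for _ in range(30000)]

# Bucket into 5-mark-wide bins and print a crude text histogram.
bins = Counter(m // 5 * 5 for m in marks)
for lo in sorted(bins):
    print(f"{lo:3d}-{lo + 4:3d} {'#' * (bins[lo] // 200)}")
```

A clean bell shows up here; the interesting part is precisely where the real data refuses to look like this.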

This is the plot for the Rajasthan Board results. The sample size was 30,000 students. Marks obtained vs frequency.

For simplicity, all students who failed were left out of the plot so we can concentrate on those who passed.

As expected, a bell curve is observed. There are spikes at regular intervals, but these can be explained by examiners tending to award 30.5 rather than 30.49, and so on. This is excusable.

There are some interesting things about this plot though.
  1. The peak of the bell is not at 50, as expected, but at approximately 54. This is indicative of a system which inherently passes students who appear for the paper.
  2. The slope on the left is steeper than the one on the right. Fewer students seem to be scoring below the average.
Now for the elephant in the room. Yes, the mammoth spike at 60 surprised me as well. As of now I have been unable to find an explanation for it. It is, however, indicative of something really biased. Some things that caught my eye about it:
  • It is HUGE! It is higher than the peak of the curve.
  • It sits to the right of the curve's peak.
  • There is no other peak like it.

A spike like that indicates that the system is inherently rigged to give most students marks around 60. This is extremely fishy. What is the significance of this number? It is not the pass mark; if it were, examiners nudging students over the failing threshold could have explained such a spike.

This is a skyscraper right in the middle of our marking scheme. One explanation may be that it is some grade demarcation (this is not confirmed).
Another may be that the sample size was too small. That is less likely, as 30k samples for an entire state are enough in my opinion.

We will talk of this later, as I have a conspiracy theory to cook. As soon as the CBSE sample size is noteworthy (around 1 lakh students) I will plot its histogram and update this article. Stay tuned.

-- Update -- 1
It seems someone already did this on Quora for ICSE. I will continue with CBSE and pray I have enough data before they do.

-- Update -- 2
After a grueling evening of snail-paced mining, I turned to Twisted and asyncio. I rewrote the code and, behold, the program was faster than The Flash. After a while I got bored and could not wait to see the plots, so I stopped. End result: 42,294 usable student records obtained.
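The speed-up from asyncio comes from overlapping the waiting. A toy version of the concurrent miner (asyncio.sleep stands in for the real HTTP requests, and the record format is invented):

```python
import asyncio
import time

async def fetch_result(roll):
    """Stand-in for one HTTP request: each 'download' takes 0.1 s."""
    await asyncio.sleep(0.1)
    return {"roll": roll, "marks": 60}  # dummy record

async def mine(rolls):
    # Launch every request at once instead of one after another.
    return await asyncio.gather(*(fetch_result(r) for r in rolls))

start = time.perf_counter()
records = asyncio.run(mine(range(50)))
elapsed = time.perf_counter() - start

# 50 sequential 0.1 s fetches would need ~5 s; overlapped, roughly 0.1 s.
print(f"{len(records)} records in {elapsed:.2f} s")
```

The real miner swapped the sleep for an actual request per roll number; the structure is otherwise the same.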

Maybe someone with more time and better resources can conduct a better analysis.

There is plenty of data now and I have plotted the results. They are at best troubling. Have a look. The data is available here.

Some subjects have a low sample size and you should keep that in mind while looking at the graphs. Some however have a large sample size and still show some very "creative trends".

Things to note:
  • CBSE seems to have the policy, "When in doubt give 95"
  • The overall trend is to pass students.
I believe that examinations are a means of distributing students across marks in order to make it easier to judge them.
A mark of 100 is considered good only because a small number of students should be able to get it.
In such a model a bell curve is the best thing that could happen. The curve makes us realize that most people are just average; the best and worst are rare cases. To evaluate a person, you ask their marks and can judge how many such people are about.
All of this goes out of the window if the normal distribution is removed from the picture. You no longer have any idea how good a student is. All you know is that they obtained such and such marks. The marks themselves have no meaning. All that is left is an empty number.

The current battle of high cutoffs in colleges is entirely due to the non-normal distribution of marks, if this data is to be believed. Since everyone is in the 95 bracket, the cutoff needs to be higher. A normal distribution ensures that everyone is judged properly and cutoffs are reasonable.

Then again I might be rambling, and my sample size is definitely small. (Maybe someone with more experience in stats can help?) My Internet connection is simply too slow to support such heavy usage of these services.

Until there is someone willing to support this study, this is the best I can manage.
So go ahead, have a look at the graphs and enjoy. There were some subjects I did not know existed (no really, I swear).

Thursday, May 21, 2015

Trai and their blunders

Once TRAI had published the list of emails it had received (read about it here), along with the senders' addresses, I was very skeptical of what measures would be taken to correct this blunder and how effective they would be.

After a little time TRAI decided that, in order to discourage spam bots from using the list as a source, they must do something. A decision with its heart in the right place. Then came the blow to my intelligence.

The measure TRAI undertook was to replace "@" with "( at )" and "." with "(dot)" in every email.
This was unexpected. If you have ever typed an address into GMail with the same replacements, you will have noticed that it does not matter whether you use @ or (at).

Another thing of note: I expected it to be relatively easy to extract the emails from the website and compile a list of them. So I sat down with my friend's 2G Aircel connection and began to download the web pages containing the emails. There were 18 web pages of note.

With this in mind, I fired up Vim (a text editor) and typed out a Python script to do the extraction for me. An easy enough job: after letting it run for 192.55 seconds (I timed it) I had a list of 8,90,537 emails. Not quite the 1 million claimed, but substantially close.
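A sketch of such an extraction script (the obfuscated sample text and the regular expressions are illustrative; TRAI's actual pages may have differed in spacing and markup):

```python
import re

def extract_emails(page_text):
    """Undo the (at)/(dot) obfuscation and pull out the addresses.

    Assumes addresses were published as 'user ( at ) example ( dot ) com';
    the real markup may have varied.
    """
    # Reverse the substitutions, tolerating stray spaces around them.
    text = re.sub(r"\s*\(\s*at\s*\)\s*", "@", page_text)
    text = re.sub(r"\s*\(\s*dot\s*\)\s*", ".", text)
    # A deliberately simple email pattern; good enough for a survey.
    return re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)

sample = "contact alice ( at ) example ( dot ) com or bob(at)mail(dot)in"
print(extract_emails(sample))  # ['alice@example.com', 'bob@mail.in']
```

Run over all 18 downloaded pages, a function like this is the whole script.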

All in all, the efforts TRAI made to keep our data private were commendable, even though it only took a student with a slow Internet connection and a little knowledge of Python to extract the emails.

As expected, my email was also among the ones found.

Thursday, May 14, 2015

Programming in multiple languages.

English has been the language of progress for a long time. A lot of the best things in the world have come from English speakers.

For people who do not natively speak English, programming is subject to first learning the language of the world. For many, that is a big step, and not an easy one if their native language has a different structure along with a different script.

PyTongue to the rescue! This little piece of code makes writing programs in your native language a lot easier by translating Python's keywords. Sadly, the entire Python ecosystem is not supported, but on the bright side, people can understand the code being written.

For example:-
# HI
छाप('नमस्ते दुनिया!')
के_लिए i में range(10):
    जब सच:
        छाप('नमस्ते दुनिया!')
This is code written in Hindi and produces completely understandable results. The second-last line evaluates to while True: and so produces an infinite loop (there is no running from those things, even in a perfect world).

For proof of concept I tried out Arabic and Hindi programs and they ran flawlessly on my computer. Although I have pretty much no idea what the Arabic program says, I am sure it makes sense.

Since I know Hindi, I know that the program makes sense, although in a broken manner. This should not matter, as constructs in programming languages are made up of only a handful of expressions. The loops are in a specific format and so are the conditional statements.

The downside is that, since the software provides only word-for-word translation, the generated programs still have to make sense in the English way of sentence construction. The program in Hindi makes absolutely no sense under Hindi grammar, but is still more readable than a program in English.
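The heart of such a tool is surprisingly small. A toy version of keyword translation (the Hindi-to-Python table below is my own, not PyTongue's actual mapping, and a real tool would tokenize rather than do naive string replacement):

```python
# Toy keyword translator in the spirit of PyTongue.
# The Hindi-to-Python table is illustrative, not PyTongue's real one.
HINDI_TO_PY = {
    'छाप': 'print',
    'के_लिए': 'for',
    'में': 'in',
    'जब': 'while',
    'सच': 'True',
}

def translate(source):
    """Replace mapped Hindi tokens with their Python equivalents.

    Naive: a keyword appearing inside an identifier would also be
    replaced, which proper tokenization would avoid.
    """
    for hindi, py in HINDI_TO_PY.items():
        source = source.replace(hindi, py)
    return source

code = "के_लिए i में range(3):\n    छाप(i)"
exec(translate(code))  # prints 0, 1, 2
```

The translated text is ordinary Python, so the standard interpreter runs it unchanged.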

Also, the programs need to be run, which implies interacting with the terminal/OS in some manner. This itself is a point of failure, as the OS supports English most of the time (I know of no systems which have commands in other languages, so `ls` still remains `ls`). Sigh, some day I am going to make a language which is language independent. May that day dawn quickly. (I know about Lisp, so do not even start. :)

Wednesday, May 13, 2015

Corruption: India's best friend

Why I can say this.

After a sequence of years unearthing scandalous corruption scams one after another, I wanted to know if India would ever be corruption free. Wishful thinking says it will; pessimism says it will not. Those two impostors are not where I put my faith. I only believe what is backed by data, and since data is only available from the past, I had nothing to work with.

With that being the start of this explorer's trail, I began to look for methods that might satisfy my curiosity. Finally it hit me: I simply needed to simulate populations. With that idea and Python in hand, I began to write code. This article is about the code, the results and the idea.

-- Update --
After running about 160 more simulations I discovered that when the police are paid at least twice the bribe they receive, corruption does not spread to the entire society and stays below 60% in all cases.
-- End of Update --

The method of evaluation

First we create the people who will populate the society we want to study.
We give each person some characteristics:
  • They are all born with "initial_money" coins.
  • They all have a value "stoicity", from the word stoic, which is a measure of how honest they are.
In case that got you thinking: the "stoicity" values are selected so that a histogram of them follows a Gaussian distribution.

In order to study behavior we must have behavior to study. Hence we add some more attributes to a person:
  • Any person may be "police".
  • Any person may be "criminal".
How does this society operate? Everyone goes about saying hi to everyone else (in round-robin tournament fashion). Whenever two people meet:
  • One of them is a policeman
    • We ask the other person whether they want to bribe the policeman.
    • If they say yes, we ask the policeman whether he accepts.
  • Both of them are policemen
    • We randomly ask one of them whether they want to bribe the other.
    • We then ask the other whether they accept the bribe.
  • Neither of them is a policeman
    • We transfer a random amount of coins from one of them to the other.
    • This provides a statistically even distribution of people earning money.
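The rules above can be condensed into a small sketch. The attribute names follow the article, but the probabilities, thresholds and amounts are my own placeholders, not the repository's values:

```python
import random

class Person:
    def __init__(self, initial_money=100.0):
        self.money = initial_money
        # Honesty drawn from a Gaussian, as described above.
        self.stoicity = random.gauss(0.5, 0.15)
        self.police = random.random() < 0.1  # placeholder fraction

def meet(a, b, bribe=10.0):
    """One encounter between two people, following the rules above."""
    if a.police or b.police:
        cop, other = (a, b) if a.police else (b, a)
        if a.police and b.police:
            cop, other = random.sample([a, b], 2)
        # The other offers a bribe if dishonest enough; the cop likewise.
        if random.random() > other.stoicity and random.random() > cop.stoicity:
            other.money -= bribe
            cop.money += bribe
            return True  # a bribe changed hands
    else:
        # Random transaction to spread money evenly.
        amount = random.uniform(0, 5)
        a.money -= amount
        b.money += amount
    return False

random.seed(0)
people = [Person() for _ in range(50)]
# One round-robin pass over every pair.
bribes = sum(meet(a, b) for i, a in enumerate(people) for b in people[i + 1:])
print(f"bribes exchanged in one round-robin: {bribes}")
```

The actual simulations also track punishments, police pay and the criminal attribute; the repository holds the full version.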

With these things in mind I conducted simulations with the parameters of interest: how much to bribe, how heavy the punishment for the criminal, and so on.

    Source code

    All the source code for this was written in Python 3.4.0 and is available on my github repository.

    Results of interest

All graphs show fractions of the population. Hence if policemen are shown at 0.8, it means 80% of the population is policing by behaviour.

During the first simulation run I found that although in some scenarios the criminals themselves died out, corruption never did. The people who accepted and gave bribes became policemen, and overall the policemen dominated the society.

    Other simulations also provided similar results. (Most of the plots are available on my github page.)

No matter how we reward policemen or punish crime, once more than 50% of the population indulges in bribery, police or not, bribing quickly saturates the population.

For those interested, here are some more graphs. (I will post more as the simulations keep completing.)

    Friday, May 8, 2015

    Django Gunicorn and Nginx

On the few occasions I have had access to an actual machine that had to serve Django, I have never been failed by the DUNG stack: Django, Unix, Nginx, Gunicorn.

Of course, I have replaced Unix with Linux; there is nothing like a bit of Mint fresh air. This tutorial is carried out on Linux Mint but should work fine on Ubuntu and the like.

This tutorial assumes that you have pip and virtualenv set up and working. At the time of writing, Django 1.8 and Gunicorn 19.0 were the latest versions.

    First things first. We have to set things up. Open up your beloved terminal and start typing.

    1. We have to create a directory for our tutorial. Let us call it test_django
      1. mkdir test_django
      2. cd test_django 
    2. Now we must create a virtualenv to work in and activate it. If you do not know what that is head over here.
      1. virtualenv -p python3 env
      2. source env/bin/activate
    3. Now we must install django and gunicorn.
      1. pip install django
      2. pip install gunicorn
    4. After waiting for the above commands to finish we create an empty django project. Do not forget the dot at the end of the command.
      1. django-admin startproject dung_test .
  2. python manage.py makemigrations
  3. python manage.py migrate
5. Our directory now looks like this.
  1. .
     |-- manage.py
     `-- dung_test
         |-- __init__.py
         |-- settings.py
         |-- urls.py
         `-- wsgi.py
    6. That done we now want to make sure that our dummy website is working. 
  1. python manage.py runserver
      2. Open up a browser and check localhost:8000/admin
      3. It should be the standard Django Administration login screen.
7. To begin deploying the website with nginx and gunicorn, we first have to install nginx. Issue the second command in case nginx fails to start.
      1. sudo apt-get install nginx
      2. sudo service nginx start
8. Once nginx is installed it will start automatically. You can check by navigating to localhost in your browser; it should say "Welcome to nginx".
    9. We now have to configure nginx to serve our website.
      1.  We navigate to /etc/nginx/sites-available and create a new file there. Let us call it dung_website.
      2. cd /etc/nginx/sites-available
      3. Open in sudo mode. sudo gedit dung_website
      4. Now insert the following text into the file.
  5. upstream app_server_djangoapp {
         server unix:/home/ghost/dev/temp/gunicorn_socket fail_timeout=0;
     }

     server {
         listen 80;
         keepalive_timeout 5;

         # path for static files
         root /path/to/your/app/media; # make sure this exists

         location / {
             proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
             proxy_set_header Host $http_host;
             proxy_redirect off;

             if (!-f $request_filename) {
                 proxy_pass http://app_server_djangoapp;
                 break;
             }
         }
     }
  6. Remember to replace /home/ghost/dev/temp/gunicorn_socket with your own path. The temp folder is the one which contains the test_django folder.
  7. Save and exit. Now navigate one step up into sites-enabled and create a link to the file just created.
        1. cd ..
        2. cd sites-enabled 
    3. ln -s ../sites-available/dung_website .
      8. Now reload nginx with sudo service nginx reload
    10. To configure Gunicorn nothing much is needed. Navigate to where your project was. In my case cd ~/dev/temp
  1. Now add "gunicorn" to INSTALLED_APPS in dung_test/settings.py.
    1. INSTALLED_APPS = (
           'django.contrib.admin',
           'django.contrib.auth',
           'django.contrib.contenttypes',
           'django.contrib.sessions',
           'django.contrib.messages',
           'django.contrib.staticfiles',
           'gunicorn',
       )
  2. We were earlier testing the website by running python manage.py runserver
      3. This will be replaced by gunicorn dung_test.wsgi:application --bind=unix:/home/ghost/dev/temp/gunicorn_socket --log-file=-
  4. The address given to --bind is the same as the one written in the nginx config. Unix sockets are faster than using loopback; we could replace unix:/path/to/socket with localhost:8000 in both the nginx file and the gunicorn command.
11. With that done we can open our browser and navigate to localhost/admin to see the admin login page, without the CSS. That is solved by running python manage.py collectstatic from the folder where manage.py is kept, after setting up static files in settings.py.
12. Your website can be accessed from other computers by typing in your IP address. To find your IP address, run ifconfig in a terminal while connected to the network.
With those steps one can conclude the setup of a Django website using Gunicorn and Nginx. As opposed to the Apache model of concurrency, Nginx is a reverse proxy server: Gunicorn handles requests one at a time while Nginx keeps the traffic from overwhelming it.

    The gunicorn documentation explains it much better. Hopefully this will help out some people.

    Wednesday, May 6, 2015

    Computer Vision with Python and openCV

    Of late I have been obsessed with computer vision. This is in part due to my ambition of creating my own butler and the 3d scanner project. What this led to was a long and extensive study of the mathematics involved behind computer vision.

After some days of searching I discovered the git repository of OpenCV, a wonderful library full of interesting mathematical features. Since there was no simple pip install, as is the case with most non-trivial installations, I spent quite some time building and installing this piece of code.

Once installed, I was at a complete loss because every piece of documentation I could find was for C/C++ (partly because I was not using Google; I use DuckDuckGo). After a while I did find some Python documentation and it was quite fun.

After about half an hour of understanding the math, I moved on to the code, beginning with getting the webcam feed to show up.

import cv2

def get_live_feed():
    cap = cv2.VideoCapture(0)

    # calibrate the camera
    # required to adjust for lighting
    for i in range(10):
        cap.read()

    # capture and show the feed
    while True:
        ret, img = cap.read()
        if ret:
            cv2.imshow('feed', img)
        c = cv2.waitKey(10)
        if c == 27:  # Esc quits
            break

    cap.release()
    cv2.destroyAllWindows()

if __name__ == '__main__':
    get_live_feed()
With this I had a live feed working. Now came the part where I had to detect my face in the frames obtained. With a few documentation snippets and code from here and there, I had the following.

import cv2
import sys

casc = sys.argv[1]
faceCascade = cv2.CascadeClassifier(casc)

video_capture = cv2.VideoCapture(0)

while True:
    ret, frame = video_capture.read()
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = faceCascade.detectMultiScale(
        gray,
        scaleFactor=1.1,
        minNeighbors=5,
        minSize=(30, 30),
    )

    # Draw a rectangle around the faces
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x+w, y+h), (0, 255, 0), 2)

    # Display the resulting frame
    cv2.imshow('Video', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# When everything is done, release the capture
video_capture.release()
cv2.destroyAllWindows()
That led to the following video being created. The vision is still far off from what the Butler must see; I will probably teach it to recognize other objects like keys. After face detection comes the task of face recognition, so expect a post on that soon. The detection is a little off at times but works fine generally speaking.

    Tuesday, May 5, 2015

    O lovely thing

    O lovely thing of god's creation,
    how long I have fed my eyes,
    yet with every passing sun I,
    can not quench this thirst of lies.

    Your mirth gives birth,
    to a smile which so,
    makes my mind forget all dearth,
    of love, I have no more.

    Leave my side I beg of you,
    for you were never here,
    yet O godly thing on earth,
    the yearning is mine to bear.

    I have loved you, or have i not?
    from a universe away,
    so why do I still feel so near,
    while I keep you at bay.

    O godly thing, are you not?
    a beauty so angelic it shines,
    I am none so special
    not angel nor demon but
    measly human in your eyes.

    Courage I have none,
    not even the strength,
    all i have is the pine.
    Is that why you have been aloof,
    for I never made you mine?

    Saturday, May 2, 2015

    3D scanning with blender and python

For my physics project this year I arrived at the conclusion that I needed to make a 3D scanner. It was a gutsy move, since I knew that if I committed to this I could not buy one anywhere in the market and would have to build it myself. This burnt any white flags I may have had and made sure I built my own project. Let me tell you, it was horrible: what I had in mind was something awesome, what I obtained was something organic. Graciously, Anurag helped me with this herculean task.

First was the problem of the laser. It was too damn expensive. I bought a laser diode from a friend and quickly burnt it out. What I did not realize was that the intensity of the laser depended on the current provided, not on the voltage.

(Image: laser and camera setup)

Then came the problem of making a line laser out of a point laser. The first method I stumbled upon was a rotating mirror placed in front of the point laser, which would sweep out a circle of laser light. This failed miserably, as the required RPM could not be reached. Then I came upon the idea of using a cylindrical lens. A glass stirrer cut to size (not cut in prototype 1) was nimbly attached to the front of the laser pointer and, lo and behold, we had a laser line.

Now to tackle the problem of holding my mobile phone upright. This was indeed a messy one. After some time I gave up and replaced it with a friend's DSLR. Problem solved. The camera now sits perfectly on its own body.

The rotating mechanism was simplicity itself. A little DIY (or jugaad, for that matter) and we had a rotating pedestal.

What came next was the mathematics. After digging around a lot I still could not understand exactly how this thing was supposed to work. Then came a moment of truth and everything became a walk in the park. Using Blender I managed to extract frames from the video and ended up with about a thousand frames to work with.

Next came the cleaning of the frames. A simple blur, increased contrast and greyscale conversion gave me a very good image of the laser. Then the brightest point in every row of the image was selected and marked as the laser line.

Next came the reconstruction. With a simple Python script I managed to get the cylindrical coordinates of every point in the picture. Knowing that the object was rotated through 360 degrees, and assuming the rate of rotation did not change much, I recreated the scene.
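The slice math can be sketched as follows, with an idealized camera geometry (a real script must also correct for the laser-camera angle and for scale):

```python
import math

def reconstruct(frames, total_rotation=2 * math.pi):
    """Convert per-frame laser points into a 3D point cloud.

    Each frame is a list of (row, offset) pairs, where offset is the
    horizontal distance of the brightest pixel from the rotation axis.
    The object is assumed to rotate uniformly through total_rotation.
    Note: everything here is in radians; mixing in degrees was exactly
    the kind of bug described in the article.
    """
    points = []
    n = len(frames)
    for i, frame in enumerate(frames):
        theta = total_rotation * i / n  # rotation angle of this slice
        for row, offset in frame:
            r = offset              # radius within the slice
            x = r * math.cos(theta)
            y = r * math.sin(theta)
            z = row                 # image row maps to height
            points.append((x, y, z))
    return points

# Two slices of a radius-5 cylinder, a quarter turn apart.
cloud = reconstruct([[(0, 5.0), (1, 5.0)], [(0, 5.0)]],
                    total_rotation=math.pi / 2)
print(cloud)
```

Each frame contributes one angular slice, so a thousand frames give a thousand slices around the axis.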

A collection of points was created and saved as a scene. This was then put into Blender to create a 3D representation of the object, which was very, very wrong. What had happened was that the glow of the laser had reflected off nearby surfaces and created data points where there should have been none. This produced a scan with a lot of errors.

I later realized that I was calculating angles in degrees, while Python (like any good program) implicitly uses radians. Recalculating the slices led to a new plot which fared a lot better than the previous ones. A lot of the reconstruction was noise, but I could make out the nose and ears of the Buddha statue. It was a magical moment.
Finally, Anurag scanned another object, an emergency flashlight. The results were good and funny at the same time. The flashlight had a small volume, so the point cloud was very dense. During the scan, the strap attached to the flashlight was also captured; it was pleasing to note that the reconstruction included the strap too. Due to the dense point cloud it was not easy to make out the rest of the flashlight's geometry.
To see the geometry, we moved the point of view inside the flashlight and could see the surfaces clearly.

The next problem to be tackled was mesh regeneration from the point cloud. The problem was that our cloud had non-uniform density, which led to some algorithms being discarded. The Ball Pivoting Algorithm and Poisson Surface Reconstruction are what caught my eye; I will be writing about them soon. All the source code is available on my GitHub page.