I had another itch to create a fun Data Visualization. This time, I explore baby names. The Social Security Administration has a great webpage to look at popular baby names by year. I knew this data would be perfect for a Tableau visualization. The problem was, the CGI interface only allowed you to view one year of data at a time and didn't have a data export link. So, I was faced with two options: Manually go to each of the 135 pages and copy/paste the data into Excel, or write a script.
I knew the social security web page was a CGI program (meaning all processing and HTML generation happens on the server). It allowed the user to POST to it and pass in a "year" parameter, and in return would get an HTML page with a table of baby names for that year. I knew I could write a script to loop through the years 1880-2015 and call the page and somehow parse the HTML results into a table I could easily turn into a tableau data set.
I've been getting re-acquainted with coding and have been learning (and loving) Python. It has turned into a swiss-army-knife of sorts. There is an http request extension which allows you to generate a POST request and receive the resulting HTML as a string. So, looping through the years and generating POST requests and receiving HTML output was the easy part. The hard part was turning that into a data set (either excel or csv) that could be imported.
I could have parsed the HTML to create a csv file by looking for the <TD> and <TR> tags, but I knew for this purpose it was over-engineering. However, I knew that in a web browser, I could copy and paste the contents of an HTML table and paste them into Excel directly. Since this was a one-time-use of this data, I thought it was a good approach.
I created a python script to generate an HTML table with 2,700 rows: one row for each rank (1-20) for each year (1880-2015). To find the "relevant" part of the HTML, I just ham-fisted a find() function to find the start of where their HTML table began. Of course, this method is not flexible, and would break if they changed their CGI code. But, again, for a one-time-use, I was happy with it. I also, for good measure, included a short sleep() at the bottom of each loop just so I didn't accidentally simulate a Denial-of-Service attack.
The code is below, but I AM NOT CERTIFYING THAT IT WORKS. If you use it it is AT YOUR OWN RISK. I claim NO RESPONSIBILITIES for what you use this code to do. That said, it worked for me as a quick one-time-use hack. I think Python is a great tool to have in the toolkit, and it has a wide array of extensions for things just like this.
So, sorry for geeking out for a bit. Below you'll see the final result. My goal was a clean, intuitive, interactive and fun visualization.
Click on the thumbnail below to see it on Tableau Public!
Some fun facts I discovered while playing with it:
Did you know Bertha was once a popular baby name?
Did you know John is like the Michael Jordan of Baby Names (persistently #1). Ironically, Michael is the Kobe Bryant of Baby Names, the more recent #1.
I'm looking forward to all of the college-aged Jacob's entering the workforce in about 10 years.
Emma has got to be the "comeback name of the year." It only took 100 years for it to come back.
Again, please let me know what you think!