Thursday, September 29, 2011

Simple Way to Analyze and Present Historical Data from Different Sources

I had historical data for 8 different indexes designed to measure economic, social, and market development across countries. I wanted to conduct a case study on Argentina to determine how the country performed across all 8 indexes in the last decade. Here is how I approached the case study:


First of all, information was not always available for the 2001-2011 time period so for some indexes I have 2006-2011, 2004-2011, etc. What I did was "normalize" the rankings by dividing Argentina's ranking over the total number of countries ranked for that particular index. I then came up with a score between 0 and 1. Note that in the "*Methodological Note" I write that a higher score indicates inferior performance, which seems counter-intuitive.

This is a rather simple case study but it is powerful in that it identifies improvement in only 2 out of 8 indexes and gives us a broad overview of Argentina's recent macroeconomic performance.

Monday, September 12, 2011

The "encode" command and the beauty of loops in STATA

I recently had to do some analysis on a data set I imported from Excel. When I opened the data editor in STATA my screen looked like this:


Now, if you want to do some data analysis in STATA chances are you can't do much if your data is all red or full of "string variables" (like they are called in STATA jargon). Here is what I did to first destring my string or red variables and then to automate the destringing of 9 variables.

If you manually wanted to destring your 9 variables you would have to write 9 lines of code that would look something like this:

encode specialty, generate (_specialty)
encode ref_reason, generate (_ref_reason)
encode icd9_catg1, generate (_icd9_catg1)
encode dt_req_sc , generate (_dt_req_sc)
encode dt_response, generate (_dt_response)
encode dt_appt, generate (_dt_appt)
encode rejection_reas_sc1, generate (_rejection_reas_sc1)
encode scheduled_status_sc1, generate (_scheduled_status_sc1)
encode insurance_type, generate (_insurance_type)

The encode statement is telling STATA to destring the variable while the generate ("") statement tells STATA to give the new destringed variable the name of the expression inside the parentheses.

This is great because instead of having a red screen you'll now have a blue screen, which means that you can do some data crunching with your newly destringed or encoded data. Your data editor in STATA will now look like this:



Now, this is already fascinating. However, there is another process that will make your life easier when it comes to writing those 9 lines of code to destring your 9 string variables. This process is called a loop and it is designed to save you time while it also reduces the chances of human error (aka "typo") since you are typing 157 characters instead of the 378 if you had typed the 9 lines of code. It looks like this:

foreach var of varlist specialty ref_reason  icd9_catg1  dt_req_sc dt_response dt_appt rejection_reas_sc1 scheduled_status_sc1 insurance_type {
encode `var', generate (_`var')
}  


What you are doing is instructing STATA to create a variable list made up of the 9 variables you want to destring. Once you type or paste your variables to the variable list you type a left "curly bracket" and then type the procedure you want the loop to conduct. Caveat # 1: after you enter the left curly bracket you need to hit ENTER so your cursor is placed on the next line, which is where you'll type the command you want to loop.

the command  {
encode `var', generate (_`var')

looks very similar to the code in one of the 9 lines of code you wrote to destring your variables. However, here the `var' expression is acting as placeholder and it's telling STATA to generate destringed the variables and give them the same name preceded by an "underscore" _  

Caveat #2: after you enter the command that you want STATA to conduct in your loop, you hit ENTER again so your cursor is placed on the next line and then you close the expression with a "right curly bracket" which will sit by itself at the bottom of your loop looking like this:
}

Wednesday, September 7, 2011

VLOOKUP to automate the creation of a table in Excel

How do you go from this data set containing about 2,000 observations over a period of 10 years (aka panel data)


To this neat looking table?


The answer is: VLOOKUP. And it looks like this:

Let's look at cell C22 for Argentina. What this VLOOKUP command is telling Excel to do is to find the value in B22 (Argentina) in the 'INDEX OF ECON FREEDOM DATA 2000' tab (which is the tab where the data is stored) and then look at the range B1860-O2014 and give you the value on the 14th column. The FALSE statement at the end indicates that you want the EXACT value. If you look at the 1st snapshot on this blog entry you will notice that the O column or the 14th column contains the value "Rank". Then the neat looking table  or the 2nd snapshot in this entry presents 11 yearly rankings for 11 countries. Once you have done cell C22 for Argentina just copy and paste the VLOOKUP command, making sure that you press the "F4" key to anchor your data range.

This is it. To do this for subsequent years just go back to the tab where you have the data, apply a filter for the year you want to do and repeat the process. This data is publicly available at:

http://www.heritage.org/index/Explore.aspx?view=by-region-country-year