Wednesday, November 23, 2011


SIMPLE BUT POWERFUL QUERIES IN SQL AND INTERESTING GLOBAL GEOGRAPHY QUIZ



Let’s assume we have a database with the following basic characteristics:


Here, bbc is the name of the table, and name, region, area, population, and gdp are the variables or columns where the data is stored within the table. This table has a relatively low number of observations, because there are fewer than 200 countries in the world. However, it serves as a great case study for practicing SQL queries.

I am going to show 3 different queries that retrieve different groups or points of data. While this is nothing more than a mental exercise (and a fun geography quiz), the same logic for retrieving data can be applied to solve important business needs.
 

1.       List each country name where the population is larger than 'Russia'

SELECT name FROM bbc WHERE population>
(SELECT population FROM bbc WHERE name='Russia')
The outer query tells SQL that you want to select the values of the variable "name" from the table "bbc" where the population is greater than some comparison value. The inner query (the subquery in parentheses) identifies that comparison value: it looks in the "population" variable of the table "bbc" and picks out the row where "name" = 'Russia'. Notice that we use single quotation marks ('…') around the word Russia because it is the specific text value within that column of data ("name") that we want SQL to match.
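If you also wanted to see the populations themselves and rank them, the same pattern extends naturally. This is just a sketch against the same bbc table, not part of the original exercise:

SELECT name, population FROM bbc WHERE population >
(SELECT population FROM bbc WHERE name='Russia')
ORDER BY population DESC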
Curious to know what the final answer is?

2.       Show the countries in Europe with a per capita GDP greater than 'United Kingdom'.
SELECT name FROM bbc WHERE gdp/population >
(SELECT gdp/population FROM bbc WHERE name='United Kingdom')
AND region='Europe'

We observe a similar pattern in the outer query: it asks for the names of countries from the table bbc where GDP per capita is greater than some value. In this case SQL calculates GDP per capita itself, since it is not one of the original variables in this table; it simply divides the gdp variable by the population variable. The subquery then computes the same calculated value (gdp/population) from the table bbc for the country with the name 'United Kingdom'. We use the single quotation marks again to tell SQL to look for that particular set of characters. There is an additional condition here: the AND command requires that, on top of the comparison set up by the subquery, the countries be located in the "region" = 'Europe'.
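As a side note, you can give the calculated value a readable name with AS and return it next to each country. This is a sketch under the same assumptions; the alias per_capita_gdp is my own label, not something in the bbc table:

SELECT name, gdp/population AS per_capita_gdp FROM bbc
WHERE gdp/population >
(SELECT gdp/population FROM bbc WHERE name='United Kingdom')
AND region='Europe'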
Curious to know what the final answer is?

3.       Which country has a population that is more than Canada but less than Algeria?

SELECT name FROM bbc WHERE population >
(SELECT population FROM bbc WHERE name='Canada')
AND population < (SELECT population FROM bbc WHERE name='Algeria')
My exposure to the world of command writing, whether in Stata, Access, EViews, Excel, or SQL, has shown me that it is all about patterns. The 3 queries in this blog entry have been increasing in complexity (all 3 are still at the 'novice' level), yet the outer query follows the same pattern every time: it tells SQL to look for the name of a country in the bbc table where the population is greater than some value. The first subquery tells SQL to find the population value in the bbc table for the entry with the name = 'Canada'. The interesting addition here is the AND command together with the less-than (<) sign: the second subquery is very similar to the first, but it bounds the population from above and is exactly the item we need to solve our geography challenge!
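For what it is worth, SQL also has a BETWEEN operator that can express this double condition more compactly, but note that BETWEEN is inclusive at both ends, so the sketch below would also return Canada and Algeria themselves; the strict > and < comparisons above are what the quiz question actually asks for:

SELECT name FROM bbc WHERE population BETWEEN
(SELECT population FROM bbc WHERE name='Canada')
AND (SELECT population FROM bbc WHERE name='Algeria')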
Curious to know what the final answer is?


Thursday, September 29, 2011

Simple Way to Analyze and Present Historical Data from Different Sources

I had historical data for 8 different indexes designed to measure economic, social, and market development across countries. I wanted to conduct a case study on Argentina to determine how the country performed across all 8 indexes in the last decade. Here is how I approached the case study:


First of all, information was not always available for the entire 2001-2011 period, so for some indexes I have 2006-2011, 2004-2011, etc. What I did was "normalize" the rankings by dividing Argentina's ranking by the total number of countries ranked in that particular index, which yields a score between 0 and 1 (for example, a ranking of 50th out of 180 countries gives a score of 50/180 ≈ 0.28). Note that, as I write in the "*Methodological Note," a higher score indicates inferior performance, which seems counter-intuitive.

This is a rather simple case study but it is powerful in that it identifies improvement in only 2 out of 8 indexes and gives us a broad overview of Argentina's recent macroeconomic performance.

Monday, September 12, 2011

The "encode" command and the beauty of loops in STATA

I recently had to do some analysis on a data set I imported from Excel. When I opened the data editor in STATA my screen looked like this:


Now, if you want to do some data analysis in STATA, chances are you can't do much if your data is all red, i.e., full of "string variables" (as they are called in STATA jargon). Here is what I did to first destring my string (red) variables and then to automate the destringing of all 9 of them.

If you manually wanted to destring your 9 variables you would have to write 9 lines of code that would look something like this:

encode specialty, generate(_specialty)
encode ref_reason, generate(_ref_reason)
encode icd9_catg1, generate(_icd9_catg1)
encode dt_req_sc, generate(_dt_req_sc)
encode dt_response, generate(_dt_response)
encode dt_appt, generate(_dt_appt)
encode rejection_reas_sc1, generate(_rejection_reas_sc1)
encode scheduled_status_sc1, generate(_scheduled_status_sc1)
encode insurance_type, generate(_insurance_type)

The encode command tells STATA to destring the variable by converting it into a labeled numeric variable, while the generate() option tells STATA to give the new destringed variable the name inside the parentheses.
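One detail worth knowing: encode does not simply strip the text, it creates a numeric variable whose values carry labels that map back to the original strings. If you are curious about that mapping, STATA's built-in label list command will show it (by default encode names the value label after the new variable, so _specialty here):

* Show the numeric codes that encode assigned to each specialty string
label list _specialty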

This is great because instead of having a red screen you'll now have a blue screen, which means that you can do some data crunching with your newly destringed or encoded data. Your data editor in STATA will now look like this:



Now, this is already fascinating. However, there is another technique that will save you from writing those 9 lines of code to destring your 9 string variables. It is called a loop, and it is designed to save you time while also reducing the chance of human error (aka a "typo"), since you are typing 157 characters instead of the 378 it takes to write the 9 lines of code. It looks like this:

foreach var of varlist specialty ref_reason icd9_catg1 dt_req_sc dt_response dt_appt rejection_reas_sc1 scheduled_status_sc1 insurance_type {
encode `var', generate(_`var')
}


What you are doing is instructing STATA to create a variable list made up of the 9 variables you want to destring. Once you type or paste your variables into the variable list, you type a left "curly bracket" and then the procedure you want the loop to carry out. Caveat #1: after you enter the left curly bracket you need to hit ENTER so your cursor is placed on the next line, which is where you'll type the command you want to loop.

The command inside the braces, encode `var', generate(_`var'), looks very similar to one of the 9 lines of code you would have written to destring your variables one by one. However, here the `var' expression acts as a placeholder: it tells STATA to encode each variable in the list in turn and to give each new variable the same name preceded by an underscore (_).

Caveat #2: after you enter the command that you want STATA to conduct in your loop, you hit ENTER again so your cursor is placed on the next line and then you close the expression with a "right curly bracket" which will sit by itself at the bottom of your loop looking like this:
}
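If you do not even want to type out the 9 variable names, STATA can build the variable list for you. Here is a minimal sketch, assuming the only string variables in memory are the ones you want to encode (ds and its has(type string) option are built into STATA):

* Collect every string variable currently in memory into r(varlist)
ds, has(type string)
* Loop over that list, encoding each variable with an underscore prefix
foreach var of varlist `r(varlist)' {
encode `var', generate(_`var')
}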

Wednesday, September 7, 2011

VLOOKUP to automate the creation of a table in Excel

How do you go from this data set containing about 2,000 observations over a period of 10 years (aka panel data)


To this neat looking table?


The answer is: VLOOKUP. And it looks like this:

Let's look at cell C22 for Argentina. This VLOOKUP formula tells Excel to find the value in B22 (Argentina) in the 'INDEX OF ECON FREEDOM DATA 2000' tab (the tab where the data is stored), look within the range B1860:O2014, and return the value in the 14th column of that range. The FALSE argument at the end indicates that you want an EXACT match. If you look at the 1st snapshot in this blog entry, you will notice that column O, the 14th column of the range, contains the "Rank" value. The neat looking table in the 2nd snapshot then presents 11 yearly rankings for 11 countries. Once you have done cell C22 for Argentina, just copy and paste the VLOOKUP formula, making sure that you press the F4 key to anchor your data range with absolute references.
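Since the original screenshot of the formula is not reproduced here, this is a reconstruction of what cell C22 would contain based on the description above (the $ signs are the anchors that pressing F4 adds):

=VLOOKUP(B22,'INDEX OF ECON FREEDOM DATA 2000'!$B$1860:$O$2014,14,FALSE)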

This is it. To do this for subsequent years, just go back to the tab where you have the data, apply a filter for the year you want, and repeat the process. This data is publicly available at:

http://www.heritage.org/index/Explore.aspx?view=by-region-country-year

Tuesday, August 30, 2011

Professional Writing Sample

Abstract: the following writing sample is part of an assignment for an applied econometrics course. In this assignment I make use of advanced graphing techniques and pivot tables in Excel. I also use STATA statistical software to conduct a "synthetic control" analysis and produce the accompanying graph.



The objective of this assignment is to determine whether or not primary seat belt laws have an effect on the level of fatal car accidents in state TU (an imaginary state that is calculated by averaging out the values of Connecticut, Iowa, New Mexico and Texas). For additional clarification, primary seat belt laws stipulate that a law enforcement officer can ticket a driver for not wearing a seat belt, regardless of whether or not the driver has broken any other laws. The primary seat belt law was implemented in 1986 in state TU.
This dataset covers the period 1981-2003. The next table compares the pre-period (1981-1985) log of fatalities per capita for state TU and the 44 control states. We can observe that the treatment state (TU) has a higher level of ln(fatalities per capita) during this period. In fact, TU's ln(fatalities per capita) is approximately 1.17 times that of the control states.

From the graph below we can observe the same pattern as in the table above: TU has a higher level of ln(fatalities per capita) during the 1981-1985 time period. What this implies is that we need to assign a higher weight to the states whose pattern most closely resembles the one exhibited by TU in this time period.
 
When we sort the values of ln(fatalities per capita) in 1985 we find that the state that most closely resembles TU is Alabama:
 
A graphical presentation of this can be seen below:
To attempt to predict the level of fatalities per capita in a regression framework I also use the variables per capita beer consumption, rain, and snow in addition to primary seat belt laws.
   
We can observe below that the values of all these variables differ significantly between the two states. This implies that using Alabama as a counterfactual state would not be an "apples to apples" comparison, and would therefore be problematic. The two states do show similar values of the pre-intervention (1981-1985) dependent variable, ln(fatalities per capita). However, the difference in explanatory variables implies that there would be unobserved determinants that can change over time and affect the level of fatalities.
 
Synthetic Control Method:
The synthetic control method puts heavier weight on the states that more closely resemble TU. This is done to obtain a counterfactual or a comparable control state. This synthetic control state is obtained by calculating the weighted average of all states in the control group. The predictor variables per capita beer consumption, rain, and snow are averaged over the entire period (1981-1985) before the primary seat belt laws were implemented in 1986 in state TU. 
The graph above was obtained using STATA statistical software. From it we can observe that traffic fatalities per capita declined in the imaginary state TU after 1986, when the primary seat belt law was implemented. Again, the synthetic TU is obtained as a weighted average of the control states that match TU in the pre-primary seat belt law period (1981-1985). Because the synthetic control shows what happened in states that did not implement a primary seat belt law, we can conclude that the decrease in the level of fatalities in state TU after 1988 can be attributed to the implementation of primary seat belt laws.
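This entry does not show the code behind the figure, but in STATA this kind of graph is typically produced with the user-written synth package by Abadie, Diamond, and Hainmueller (installed with ssc install synth). Here is a minimal sketch with hypothetical variable names (lnfatal, beer, rain, snow) and a made-up panel ID of 99 for state TU, not the actual code used for the assignment:

* Declare the panel structure: state is the unit, year is the time variable
tsset state year
* Build the synthetic control; predictors are averaged over the
* pre-intervention period by default, and treatment starts in 1986
synth lnfatal beer rain snow, trunit(99) trperiod(1986) fig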

Friday, August 26, 2011

Nice Graphics in Excel 2007

I designed these graphs for a research proposal to quantify the economic impacts of a massive wind farm project in Oaxaca, Mexico. As part of the exploratory research phase, I wanted to get an idea of the size of the wind energy generation market in Mexico. The data is courtesy of Mexico's Energy Department. When I finalized the graphs I realized Excel 2007 has some very neat graphing and data presentation capabilities. Conclusion: despite representing only a small share of overall gross electricity production in Mexico, wind energy has experienced a phenomenal growth rate in the last decade.


Master's Thesis Defense

I defended my master's thesis in front of a panel of 6 economics professors on April 28, 2011. The outcome: unanimous pass with honors. In the first picture I am introducing the topic by talking about innovative approaches to measuring subjective well-being. The examples on the slide include: the United Nations' Human Development Report, Gallup-Healthways' Well-Being Index, the New Economic Foundation's Happy Planet Index, and the Legatum Institute's Legatum Prosperity Index Report.

In the second picture I am presenting the empirical methodology I employed for my research: household fixed effects, instrumental variables, and ordered probit. If you are able to connect the dots, the household fixed effects and instrumental variables equations correspond to the regression output I posted in my Welcome! entry.

Tuesday, August 23, 2011

Example of STATA's Analytical and Graphing Power

When I saw the graph below, I really wanted to reproduce it by combining many graphs into one using STATA statistical software. However, even if you look at a really large version of this 6x5 graph, you won't be able to see the details unless you have better than 20/20 eyesight, which most of us don't.

After looking at some STATA manuals I came up with this graph below. I think it is really neat how you can write a simple piece of code in STATA to merge as many graphs as you want. Here is the simple piece of code I used to make the graphs happen:

* Scatter of NPL listing status in 2000 against the 1982 HRS score
twoway scatter npl2000 hrs_82
graph save g1.gph, replace
* Scatter of mean 1980 housing prices against the 1982 HRS score, with a fitted line
twoway (scatter meanhs8 hrs_82) (lfit meanhs8 hrs_82), ytitle(mean housing prices in 1980)
graph save g2.gph, replace
* Merge the two saved graphs into a single figure
gr combine g1.gph g2.gph

And here is the actual explanation of what the graphs mean:

3b) From the graph on the left below we can observe that an HRS score higher than 28.5 accounts for some, but not all, of the variation in whether or not a census tract contains a hazardous waste site listed on the NPL by the year 2000.
3c) The graph on the right shows a linear regression estimate with a mildly positive relationship between the 1982 HRS score and mean 1980 housing prices. Additionally, from this graph we can conclude that the 1982 HRS score serves as a good predictor of housing values in 1980.
And here is the graph: 


Sunday, August 21, 2011

The Power of Web Analytics

Web analytics is a field with the potential to experience exponential growth in the near future. Why? Because an increasingly large number of people use the internet to carry out more and more transactions (aka purchases). Just think about Amazon.com and Borders: Amazon is thriving because people choose to purchase books online instead of going to a traditional establishment like Borders, and as a result traditional establishments like Borders are going out of business. The bottom line is that businesses need to adapt to this new environment, and web analytics is an excellent tool because it dissects key metrics about what is going on on the internet. Here is a snapshot of one of the slides from a presentation I put together recently.


Saturday, August 20, 2011

Welcome!

Hi Everyone,

Welcome and thank you for taking the time to check out my blog. This is a space to share my ideas, thoughts, and interesting content with the world.

The table below shows regression results the way they are presented in an academic economics journal. Columns 1-4 present four specifications of an Ordinary Least Squares model with fixed effects, and columns 5-8 also present an Ordinary Least Squares model but use Sponsorship Eligibility as an Instrumental Variable.

This is the way researchers attempt to identify a cause-and-effect relationship. The asterisks to the right of some of the coefficients indicate that those coefficients are statistically significant, i.e., that they help explain variation in your "Y" variable of interest...

Oh, and by the way, I am proud to announce that I did put together this table myself.