Monday, September 12, 2011

The "encode" command and the beauty of loops in STATA

I recently had to do some analysis on a data set I imported from Excel. When I opened the data editor in STATA my screen looked like this:


Now, if you want to do some data analysis in STATA chances are you can't do much if your data is all red or full of "string variables" (like they are called in STATA jargon). Here is what I did to first destring my string or red variables and then to automate the destringing of 9 variables.

If you manually wanted to destring your 9 variables you would have to write 9 lines of code that would look something like this:

encode specialty, generate (_specialty)
encode ref_reason, generate (_ref_reason)
encode icd9_catg1, generate (_icd9_catg1)
encode dt_req_sc , generate (_dt_req_sc)
encode dt_response, generate (_dt_response)
encode dt_appt, generate (_dt_appt)
encode rejection_reas_sc1, generate (_rejection_reas_sc1)
encode scheduled_status_sc1, generate (_scheduled_status_sc1)
encode insurance_type, generate (_insurance_type)

The encode statement is telling STATA to destring the variable while the generate ("") statement tells STATA to give the new destringed variable the name of the expression inside the parentheses.

This is great because instead of having a red screen you'll now have a blue screen, which means that you can do some data crunching with your newly destringed or encoded data. Your data editor in STATA will now look like this:



Now, this is already fascinating. However, there is another process that will make your life easier when it comes to writing those 9 lines of code to destring your 9 string variables. This process is called a loop and it is designed to save you time while it also reduces the chances of human error (aka "typo") since you are typing 157 characters instead of the 378 if you had typed the 9 lines of code. It looks like this:

foreach var of varlist specialty ref_reason  icd9_catg1  dt_req_sc dt_response dt_appt rejection_reas_sc1 scheduled_status_sc1 insurance_type {
encode `var', generate (_`var')
}  


What you are doing is instructing STATA to create a variable list made up of the 9 variables you want to destring. Once you type or paste your variables to the variable list you type a left "curly bracket" and then type the procedure you want the loop to conduct. Caveat # 1: after you enter the left curly bracket you need to hit ENTER so your cursor is placed on the next line, which is where you'll type the command you want to loop.

the command  {
encode `var', generate (_`var')

looks very similar to the code in one of the 9 lines of code you wrote to destring your variables. However, here the `var' expression is acting as placeholder and it's telling STATA to generate destringed the variables and give them the same name preceded by an "underscore" _  

Caveat #2: after you enter the command that you want STATA to conduct in your loop, you hit ENTER again so your cursor is placed on the next line and then you close the expression with a "right curly bracket" which will sit by itself at the bottom of your loop looking like this:
}

No comments:

Post a Comment