699lec030930

Uploaded from authorPOINTLite
Views:
 
Category: Education
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Sociology 699c (Martin) Lecture 5: September 30, 2003: 

Sociology 699c (Martin) Lecture 5: September 30, 2003 Working with your data data labels creating variables using SAS functions IF-THEN statements subsetting data

This week: Modifying SAS data files: 

This week: Modifying SAS data files Now that you have data in your SAS data set, you almost certainly will want to change it to suit your needs. labeling variables creating new variables dropping variables changing the values of variables changing values of one variable conditional on the value of another changing variables from character to numeric (and back)

Labeling variables: example: 

Labeling variables: example libname home ‘c:\socy699\data\’; data home.file1; input name $ test1 test2 test3; label name = ‘first name of student’ test1 = ‘test score for Oct 23’ test2 = ‘test score for Nov 20’ test3 = ‘student’s final exam score’; datalines; Bill 187 97 103 John 156 76 74 Joe 99 102 129 ; run; proc contents data = file1; run;

Labeling variables: another example: 

Labeling variables: another example Assume you have a premade SAS data file: c:\socy699\data\file1.sd2 This set of statements will show you the contents of the data file: libname home ‘c:\socy699\data\’; proc contents data = home.file1; run;

Creating a new variable with an assignment statement: 

Creating a new variable with an assignment statement The standard statement for creating a new variable is newvar = expression; libname home ‘c:\socy699\data\’; data home.file1; input name $ test1 test2 test3; improve = test3 – test1; datalines; Bill 187 97 103 John 156 76 74 Joe 99 102 129 ; run;

Types of assignment statements: 

Types of assignment statements with variables improve=test3 – test1; with constants improve=15; as characters improve=‘satisfactory’; with numerical operators improve=test3 + test1; improve=test3 * test1; improve=test3 / test1; improve=test3 ** test1; Note that the variable receives the value of the expression after the ‘=‘ sign (not the other way).

Assignment rules for missing values:: 

Assignment rules for missing values: missing numeric values are represented by a period with no single quotes. improve=.; missing numeric values will propagate if the missing value is part of an expression. ‘improve’ will be missing if either ‘test3’ or ‘test1’ is missing. improve=test3 – test1; missing character values are represented by a space in single quotes. religion=‘ ‘;

Some useful SAS functions:: 

Some useful SAS functions: SAS has over 300 functions – here are examples:

General note on SAS functions:: 

General note on SAS functions: SAS functions operate on expressions for the values in the current buffer only. If we want to create a variable x1_ave that contains the mean value for x1 across all observations, we need to use a different procedure. (Next time) Additional SAS functions are listed in your manual and in the little SAS book.

Working with logical operators: 

Working with logical operators example of a SAS data step with IF-THEN statements: data scores; input name $ test1 test2 test3; if test3 > test1 then impr_yn = 1; if test3 <= test1 then impr_yn = 0; datalines; common logical operators: try one of these: ‘equal to’ = EQ ‘not equal to’ ~= NE ‘greater than’ > GT ‘less than’ < LT ‘greater than or equal to’ >= GE ‘less than or equal to’ <= LE

Complicated logical grammar: IF-THEN-ELSE statements: 

Complicated logical grammar: IF-THEN-ELSE statements The following reads each observation more efficiently than a series of if-then statements. data scores; input name $ test1 test2 test3; if test3 = . then impr_yn = .; else if test1 = . then impr_yn = .; else if test3 > test1 then impr_yn = 1; else impr_yn = 0; datalines; Beware: in all IF-THEN statements the order of the statements can affect your results!

Nested ‘if-then’ statements: 

Nested ‘if-then’ statements SAS can handle many levels of if-then statements, but a programmer’s mind can usually only handle two levels. if educ > 96 then do; if occup = ‘lawyer’ then nohsgrad = 0; else nohsgrad = .; end; else if educ <= 11 then nohsgrad = 1; else nohsgrad = 0; The ‘do’ and ‘end’ statements are not necessary for SAS, but they are a big help for you.

Variable self-reference:: 

Variable self-reference: Q: Can a variable operation refer to itself? example: if educ > 96 then educ = .; A: Yes, that will work, but I really don’t recommend doing that. If you make mistakes, you will erase the evidence.

Additional complexity: ‘and’ and ‘or’ statements: 

Additional complexity: ‘and’ and ‘or’ statements Each field must be specified in order if educ > 96 then hsgrad = .; else if (educ < 11) or (educ = 12 and yearcomp = 0) then hsgrad = 0; else if (educ = 12 and yearcomp = 1) or (educ > 12) then hsgrad = 1; else hsgrad = 99; The above code contains a deliberate mistake! Can you find it? Why use cpu-wasting code like ‘else hsgrad = 99’? If you wish, you can use ‘&’ instead of ‘and’, and ‘|’ instead of ‘or’.

Special features of character variables:: 

Special features of character variables: Unexpected truncation of character values: * first attempt; data aircode; set air.depart; if usgate = ‘San Francisco’ then usairpt=‘SFO’; else if usgate = ‘Honolulu’ then usairpt=‘HNL’; else if usgate = ‘New York’ then usairpt=‘JFK or EWR’; run; proc print data=aircode; var country usgate usairpt; title ‘Unexpected Truncation of Character Values’; run;

Special features of character variables:: 

Special features of character variables: Output for truncation of character values: OBS COUNTRY USGATE USAIRPT 1 Japan San Francisco SFO 2 Italy New York JFK 3 Australia Honolulu HNL Problem: the first instance of ‘USAIRPT’ had a character length of 3 bytes, so SAS truncates all future instances to 3 bytes.

Slide17: 

More on special features of character variables: Solution for unexpected truncation of character values: * second attempt; data aircode2; length usairpt $ 10; set air.depart; if usgate = ‘San Francisco’ then usairpt=‘SFO’; else if usgate = ‘Honolulu’ then usairpt=‘HNL’; else if usgate = ‘New York’ then usairpt=‘JFK or EWR’; run; proc print data=aircode; var country usgate usairpt; title ‘No more truncation of character values’; run;

Recoding variables 1:: 

Recoding variables 1: 1.) If –Then – Else Statements DATA group1; INPUT age; IF age>=0 & age< 10 then agegroup = 0; ELSE IF age>=10 & age< 20 then agegroup = 10; ELSE IF age>=20 & age< 30 then agegroup = 20; ELSE IF age>=30 & age< 40 then agegroup = 30; datalines; proc print; title ‘data set group1’; run;

Recoding variables 2:: 

Recoding variables 2: 1.) Output using If –Then – Else Statements DATA SET GROUP1 OBS AGE AGEGROUP 1 . . 2 1 0 3 5 0 4 16 10 5 32 30 6 43 .

Recoding variables 3:: 

Recoding variables 3: 2.) SELECT and WHEN statements DATA group1; INPUT age; SELECT; WHEN (age>=0 & age< 10) agegroup = 0; WHEN (age>=10 & age< 20) agegroup = 10; WHEN (age>=20 & age< 30) agegroup = 20; WHEN (age>=30 & age< 40) agegroup = 30; OTHERWISE; datalines; proc print; title ‘data set group1’; run;

Slide21: 

Recoding variables 4: 3.) Bounded expressions with If –Then – Else Statements DATA group1; INPUT age; IF 0<=age< 10 then agegroup = 0; ELSE IF 10<=age< 20 then agegroup = 10; ELSE IF 20<=age< 30 then agegroup = 20; ELSE IF 30<=age< 40 then agegroup = 30; datalines; proc print; title ‘data set group1’; run;

Slide22: 

Recoding variables 5: 4.) Descending checks DATA group1; INPUT age; IF age< 40 then agegroup = 30; IF age< 30 then agegroup = 10; IF age< 20 then agegroup = 20; IF age< 10 then agegroup = 0; datalines; Note: this recode will deal with missing values differently than recode form #1. Note also: this recode wastes cpu time.

Slide23: 

Recoding variables 6: 5.) Inclusive logical expression DATA group1; INPUT age; AGEGROUP = 10*(10<=AGE<20)+ 20*(20<=AGE<30)+ 10*(30<=AGE<40); datalines; Note: this recode will deal with missing values differently than recode form #1. Note also: while the most efficient form, this recode can confuse you, so you should assist yourself with comments and check your results.

Subsetting observations 1: removing observations with DELETE: 

Subsetting observations 1: removing observations with DELETE To remove from the data set all vacations of 10 days or longer we would write: libname save 'c:\MyDocuments\classes\699f01\week04\'; data save.travel01; infile 'h:\Smartin\699c01\week04\travel.txt'; input country $ 1-15 tourtype $ 16-30 nights 31-35 landcost 36-40 vendor $ 41-50; if nights >=10 then delete; run;

Subsetting observations 2: retaining observations with OUTPUT: 

Subsetting observations 2: retaining observations with OUTPUT Conversely, to retain only scenic vacations of 10 days or longer we would write: libname save 'c:\MyDocuments\classes\699f01\week04\'; data save.travel01; infile 'h:\Smartin\699c01\week04\travel.txt'; input country $ 1-15 tourtype $ 16-30 nights 31-35 landcost 36-40 vendor $ 41-50; if nights >=10 & tourtype=‘scenery’ then output; run;

Subsetting observations 3: retaining observations with WHERE: 

Subsetting observations 3: retaining observations with WHERE To restrict the data set only for the current proc step, we can use the WHERE statement: proc print data=save.travel01; where nights>=10; title 'data set travel01: long vacations only'; run;

Subsetting variables 1: removing variables with DROP: 

Subsetting variables 1: removing variables with DROP To remove the information about cost of vacation we would write libname save 'c:\MyDocuments\classes\699f01\week04\'; data save.travel01; infile 'h:\Smartin\699c01\week04\travel.txt'; input country $ 1-15 tourtype $ 16-30 nights 31-35 landcost 36-40 vendor $ 41-50; drop landcost; run;

Subsetting variables 2: keeping variables with KEEP: 

Subsetting variables 2: keeping variables with KEEP To keep only the information about cost of vacation and country we would write: libname save 'c:\MyDocuments\classes\699f01\week04\'; data save.travel01; infile 'h:\Smartin\699c01\week04\travel.txt'; input country $ 1-15 tourtype $ 16-30 nights 31-35 landcost 36-40 vendor $ 41-50; keep landcost country; run;

Subsetting variables 3: miscellaneous other statements that work: 

Subsetting variables 3: miscellaneous other statements that work Figure out what these statements will do (this is NOT a single working SAS program: data save.travel01(drop=vendor); data save.travel01(keep=country tourtype nights landcost); data save.travel01(drop=vendor); set fitness(drop=name); proc print data=save.travel01 (drop=vendor);

Summary:: 

Summary: We have focused on data manipulation for numeric variables. There are many many more numeric functions for you to explore We have only minimally discussed data manipulation for character variables. You must always balance efficiency against clarity. I prefer to be clear at a cost to cpu efficiency.