Entering Data into a Computer.

Dan Tandberg, MD

Statistical analysis of your research data is generally best carried out with special purpose statistical software.  These programs are generally expensive and most have steep learning curves.  Thus, most researchers will use a statistical consultant who has purchased such programs and uses them regularly.  Your consultant should make sure that your data is entered in a way that it can easily be converted to the correct format for analysis by their particular analysis programs.  They should also help you decide which variables you should be measuring, how they should be coded, and which statistical methods are likely to be used.  Finally, as you enter your data into one of your own computer programs you may find the following hints useful:


Which Computer Program to Use.

You and your consultant should sit together and choose an appropriate data entry program.  If you are already familiar with spreadsheet software, this is often adequate for small to medium projects.  Larger projects may benefit from using database software.  Students can even use a word processor to enter data if this is how they are most comfortable.  This can be set up using the Table menus or with Tabs.


The Layout.

Most analyses will require that the data be entered into a single two-dimensional table.  A common example would be to have each subject’s data entered into a row.  Columns would then be labeled with all the variable names (e.g. Subject, Age, Gender, Sodium, Treatment, . . )


Column Titles Are Variable Names.

These should be informative, but not too long.  Some statistical packages will truncate titles at 15 or even 10 characters.  Titles should be in “Title Case” and should contain only letters, numbers, and underscores.  Spaces may cause trouble with some statistical packages.  Always avoid having non-alphanumeric symbols such as ~!@#$%^&*()+={}[]\| in your titles!


Keeping Column Headings Visible.

As you enter rows of data, your column headings will eventually scroll off the top of the page.  If you are using a spreadsheet, the top row of headings can be kept on the screen by “splitting” the sheet into two parts.  In Excel, this is done by putting the cursor into cell A2 and invoking Window>Split from the menus.  If you are using a word processor, copying the titles every 10 rows or so works well.


Navigation Aids.

After you have entered several screenfulls of data, it can become difficult to scroll long distances with the mouse.  In many programs, you can use the keyboard to move the cursor around more efficiently.  In Windows programs these are called something like “keyboard shortcuts.”  Combinations of the Ctrl, Shift, Arrow, End, and Home keys can be used to do things like move the cursor long distances or to select large blocks of data.


Avoid Multiple Entries.

Statistical programs cannot make much sense out of multiple entries in a single column.  So avoid having a column titled ‘Drugs’ filled with several entries such as ‘penicillin, gentamicin, digoxin’.  Instead, you need a separate column for each likely item. 


Missing Values.

Management of missing values is a bread and butter issue for statistical consultants.  How these are coded may change the way your results are calculated.  I think it is best to leave missing values blank, not ‘space’, ‘n/a’ or the like.  You should at least discuss this with your consultant before you start entering data.


Thoughtful Coding Can Reduce Errors.

Always make a list of all of your variables and clearly explain how each will be coded.  Each variable should be coded in a way that minimizes data entry work, but at the same time minimizes data entry mistakes.  For example, the variable ‘Gender’ might be coded as ‘male’ or ‘female’ but typing these words is more work than  using ‘f’ or ‘m’.  You might also code the variable ‘Gender’ as a numeric 0 or 1, but this would likely lead to increased entry errors since you have to memorize which is which.  Most statistical programs will think ‘Female’ and ‘female’ are different genders, so pay attention to case.  I avoid using any capitals in text data entries as it saves a keystroke.


Note to Apple users: Microsoft Excel supports two different date systems: the 1900 date system and the 1904 date system. This article describes the two date systems and the problems that you may encounter when you use workbooks that use different date systems.  This is important if you are switching between Apple and PC versions of Excel.  (Search for "1900 1904 problem" for more links.)       http://support.microsoft.com/kb/180162

Data Validation.

Most spreadsheet programs allow you to set limits on the data you can enter into a column of cells. For example you might only want to allow whole number ages in the range of 18 to 130 years. Or you might force entry from a list of the letters 'm' and 'f' for gender. Excel calls this process "data validation"; Open Office calls it data validity. It is worth doing in many cases, especially for larger data entry tasks. 

Saving You Files.

Lost data can be a tragedy!  Never save your data file “over” an older version.  When working with data sets that are bigger than the computer screen, it is possible (and fairly easy) to make a mistake and erase a large part of the data without even knowing that you have done so.  Once you save, it may be gone forever.  A power surge or outage during a save can have the same disastrous effect.

Instead, name the data file something informative like ‘Diabetes_Exercise_Smith_01.’  Then make sure to save the file using “Save As” and increment the digits at the end of the name each time you save, (_02, _03, _04.)  I recommend saving the file under a new name every 15 minutes or so.  Then you are unlikely to loose more that this amount of work, even if something dreadful happens.


Automated Backups.

Many programs used for entering data allow for automated backups.  In Microsoft programs, these may be found under the Tools>Options menus.  You will have to turn on the automated "Save" or “Autorecover” function and specify the "File Location" (the folder where you want backups to be stored.)  This setup procedure is slightly different in different versions of spreadsheet and other programs, so you may need to look at Help screens.  I have my files automatically backed up every 10 minutes.


Do Not Ever Sort Your Data File.

In spreadsheets and other data entry programs, it may be possible to sort your data.  Do not be tempted to do this.  It is quite easy to sort part of the data and scramble the relationships that you are in fact trying to explore.  There is no way to fix a partially sorted file once it has been saved.  Just say ‘no’ to sorting.


Use Multiple Redundancy

When you are done with each session of entering data, make sure that it is backed up at to least two other places besides your hard drive.  At least one of these should be off site.  Good choices are floppy disk, memory stick, or offsite server, and then email to yourself and one other person as attachments.  A useful test is “What would happen if my two computers both broke and there was also a fire?”  The answer should always be “I’ll still be OK.”


Viruses and Other Malware Threats.

Since you will likely be opening your data file on several different computers (yours, the library’s, your mentor’s, your statistical consultant’s), be extra careful to avoid virus infection.  Make sure that your own computer’s virus checking software is current.  The same holds for any other computer that your use to open your file.  Most virus checking software will allow you to scan an individual file and you should do this periodically.


Additional Error Reduction Tips.

Keeping data entry errors to a minimum is important.  In medium to large sized data sets, there will almost always be a few typos, but you should make every effort to reduce these.

1. Only enter data for an hour or so per session.  Most people make more errors with longer entry sessions.

2. Be careful not to use ‘space’ for ‘missing value’, capital ‘O’ for a zero, or lower case “L” for a 1.

3.  When you think all data is entered, have a friend proofread each value with you.  Switch off between looking at the computer screen and the original data source (often paper).

4.  Modern statistical software is very helpful for finding outliers and typos.  Plan an early session with your statistical consultant that will focus on finding these.  Be sure to have your original (hard copy) data available at this meeting.