Documentation and Methodology for Frequently Occurring Names in the U.S.–1990

From Wikisource
Jump to: navigation, search

The Census Bureau has a primary obligation to protect the confidentiality of individual responses to the Census. As part of this confidentiality commitment, the Census Bureau does not currently release individual census questionnaires (or any other information that could identify an individual) until 72 years after a Decennial Census was taken. In 1992, the Census Bureau released the 1920 Census schedules to National Archives. In fact, the Census Bureau is so concerned about confidentiality that name has not been entered into the basic internal electronic data used to tabulate census results.

However, there have been numerous demands for sunmary data on the frequency of surnames for genealogical reasons. Similar interest has arisen for the frequency of first names by sex. This data set attempts to satisfy these demands while still providing utmost confidentiality of individual results.


In the summer of 1990, immediately following the 1990 Decennial Census, the United States Census Bureau conducted a large scale survey to measure undercount in the 1990 Census. This independent post Census operation (the 1990 Post-Enumeration Survey--PES) collected items of demographic data (race, sex, age and NAME) from 377,000 persons living in 165,000 housing units in 5,300 predefined blocks (or block clusters).

The information acquired from this independent (PES) operation was matched against actual 1990 Census records for persons living in those same 5300 blocks plus additional surrounding ring blocks. The PES blocks plus the surrounding ring--"the Search Area"--contained 7.2 million census records replete with name. It is this Search Area data set that provides the impetus for the three name files at this internet site.

File formats[edit]

In July 1995, the Census Bureau placed abridged summary information from the Search Area on its internet site. Selected data from these files have appeared in the print media with the citation "source--Census Bureau". Since the documentation accompanying the original release of this information was sketchy, we are supplying additional explanatory material about the limitations of these data.

Each of the three files, (dist.all.last), (dist. male.first), and (dist female.first) contain four items of data. The four items are:

  1. A "Name"
  2. Frequency in percent
  3. Cumulative Frequency in percent
  4. Rank

In the file (dist.all.last) one entry appears as:

        MOORE       0.312       5.312       9

In our Search Area sample, MOORE ranks 9th in terms of frequency. 5.312 percent of the sample population is covered by MOORE and the 8 names occurring more frequently than MOORE. The surname, MOORE, is possessed by 0.312 percent of our population sample.


Producing that summary line for the name MOORE required a great deal of program editing. For example, we immediately realized that it was necessary to convert the entries MOORE JR, MOORE SR, and MOORE III in the last name field to MOORE. For purposes of consistency we also converted entries such as MOORE JONES or MOORE-JONES to MOORE.

In addition to those rather simplistic edits, we also examined each name entry for the possibility of an inversion. (eg: a first name appearing in the last name field and a last name placed in the first name area). Consider a 2 person household with the entries MOORE ROBERT, and MOORE CAROLYN in the name fields. From our sample name universe, we can empirically determine the probability that the inversion (ROBERT MOORE and CAROLYN MOORE) as a far greater probability of being "right" than the keyed entry. When the probability that the odds of an inversion attained odds of 10,000 to 1, the inversion was done.

Many names can be inverted and sound absolutely right. For example, there is absolutely no reason to suspect that HENRY THOMAS is wrong and THOMAS HENRY is preferable. However, if HENRY THOMAS had a spouse listed as HENRY MARTHA and a female child named HENRY SUSAN, that additional information suffices to invert the name field for the entire family.

For first names, we considered concatenating entries but finally decided against it. Among males the combinations JOHN PAUL and JOSE LUIS in the first name field were far more frequent than any other set of spaced names. We could possibly have formed them as JOHNPAUL and JOSELUIS. As a result the male first names JOHN and JOSE may be marginally overstated.

The one name that is most affected by our decision not to concatenate is the grand old name of MARY. The entries (MARY ANN, MARY BETH, MARY CATHERINE, MARY ELLEN, MARY FRANCES, MARY GRACE etc) wind up as MARY. MARY may or may be the most common first name among American women, but our decision to avoid concatenation did add a significant number of MARY's.

Finally, we came to the conclusion that the existence of a single letter (an initial perhaps?) appearing in either the first or last name field would not qualify as a name; but an initial in one field would not disqualify the other name field. For example the 19th century financier (J P MORGAN) has a valid last name but the letter J does not meet these standards for a first name. MUHAMMED X is an example of a acceptable first name with an unusable surname.

Missing data[edit]

Although the search area contains 7.2 million persons, almost 15 percent of those persons do not provide enough information to form a name. In the previous paragraph we provided the situation where we decided that a single a single letter would not constitute a name. Other situation are listed below.

  1. The respondents did not enter a "name" at the top of page 2 of the 1990 Census form, even though names might have appeared on the roster of page 1. A name must appear at the top of page 2 for the name to be keyed.
  2. The respondent may have inadvertently left sex (gender) off his or her Census record. In that instance we accept the last name, but we have no "certain" way of placing the first name in the male or the female file. We do not assume that JENNIFER without a sex designator would be female even though common sense suggests that this is indeed the case.
  3. A family may have put down a last name for the householder but not for any other household member. We may have the following family JOHN SMITH, MARY (blank), JOHN JR (blank), ROBERT (blank), JENNIFER (blank), SUSAN (blank). In that family we have a first and last name for householder John, but first names only for the remaining 5 family members.
  4. The keyed name may not follow acceptable form. Some examples of invalid entries in either the first or last name field are: BABY GIRL, MR JONES, DR BROWN, FILIPINO FEMALE.

Each of these situations are responsible for limiting our original sample of 7.2 million person records down to its present size of 6.3 million. The actual number of person records making up the unabridged files are:

     File Name             Valid Records      Unique Names  
  1. dist.all.last          6.290,251            88,799 
  2. dist.female.first      3,184,399             4,275
  3. dist.male.first.       3,003,954             1.219 

For purposes of both confidentiality and elimination of data noise we restricted the number of unique names available at this internet site to the minimum number of entries that contain 90 percent of the population in that data file. There is an extremely small chance that an individual with a truly "unique" name could have been captured in sample, and is far more likely for surnames than for first names. A second basis for limiting entries is that a smattering of entries exist because of the combination of bad handwriting coupled with poor typing. Consider the entry JOSEHP in dist.male.first. Although JOSEHP may be a name, it is much more likely that all of the JOSEHP entries are really miskeys of JOSEPH.


For the names at the top of the distribution, (SMITH, JOHNSON, WILLIAMS, JONES etc), or (MARY, PATRICIA, LINDA, BARBARA etc) or (JAMES. JOHN, ROBERT, MICHAEL etc) the data speaks for themselves. However as the sample thins, one might draw conclusions about frequency that are not warranted.

The PES sample intentionally over sampled both Blacks and Hispanics, and it is likely that the Search Area also contains an excess of these two groups. Thus the frequently occurring surnames: GARCIA, RODRIGUEZ, GOMEZ and WASHINGTON as well as first names: JUAN, JOSE, GUADALUPE and WILLIE might attain higher rankings than their actual population numbers within the United States would warrant.

But the limitations due to sampling are much more noticeable when looking at rarely occurring names--especially surnames. Consider a surname appearing 63 (out of 6.3 million entries) times in the file dist.all.last. Here the frequency would appear as 0.001 percent, but it is possible that that sample frequency may not be close to "truth".

Ignoring clustering, (persons in the same household usually have the same surname) the coefficient of variation on a number of that magnitude would be approximately 12 percent. But most people who do not live alone share a last name with other people in that household. Thus the 63 persons with that rare name may be the result of 16 households, which would raise the coefficient of variation to approximately 25 percent.

But we are not done. Even in the last years of the 20th century, families tend to live close to each other, and it is not impossible to conceive a situation where all Americans with a certain surname appear in sample. Were that situation to occur it would be possible to overstate the frequency of that surname name by a factor of 40. The number 40 arises because the number of Census records in the sample (6,290,251) is approximately one fortieth of the United States population.

The fact that a name doesn't appear in these three files does not mean that it is non existent, only that it is reasonably rare.

In conclusion we do realize that misleading frequencies are much less likely in the files (dist.female first) and (dist.male.first). Although fathers and one son may share a first name, brothers almost never share the same first name.

Additional information[edit]

Persons wanting or needing more information about the contents of these three files can contact David L. Word ( 301-457-2103 or Randy M. Klear ( 301-457-1727.

This work is in the public domain in the United States because it is a work of the United States federal government (see 17 U.S.C. 105).