Data Visualization and its Role in the Practice of Statistics

An Undergraduate Summer Program
in Statistics

June 19-24, 2005



Mark Hansen (UCLA)
Vijay Nair (University of Michigan)
Deborah Nolan (UC Berkeley)
Duncan Temple Lang (UC Davis)
Bin Yu (UC Berkeley)


Today, almost every aspect of our lives is "rendered" in data. New data collection technologies have made it easy to record continuous, high-resolution measurements of our physical environment (weather patterns, seismic events, the human genome). We're also constantly monitoring our movements through and interactions with our physical surroundings (automobile and air traffic, large-scale land use, advanced manufacturing facilities). In computer-mediated settings, our activities either depend crucially on or consist entirely of complex digital data (networked games, peer-to-peer technologies, Web site and Internet usage).

As a reflection of the diversity and variety of the "systems" under study, these data-based descriptions of our world tend to be massive in size, dynamic in character, and replete with rich structures. The advent of these enormous repositories of information presents us with an interesting challenge: how can we represent and interpret such complex, abstract and often socially important data?

This workshop is designed to introduce undergraduates to the exciting work being done in statistics, the science of data.

Akhtar, Syed     Williams
Bircan, Cagatay Williams
Chan, Chun Hung UCLA
Christenson, Erica Berkeley
Fortin, Dan UW
Goldman, Megan Pitt
Gracien, Katina NCSU
Hodgson, Laura UW
Horvath, Zsuzsanna U Utah
Kelly, Megan U Chicago
Kenaga, Margaret Berkeley
Lee, Meng-Ju Purdue
Lee, Tammy Berkeley
Liggonah, Sayi LSU
Nathan, Sandy Berkeley
Neff, Christopher UC Davis
Nguyen, Vinh UC Irvine
Omidiran, Chris Rice
Palm, Yvonne Grinnell
Reyes, Cherene UW
Rosario, Ryan UCLA
Stefanski, Doug NCSU
Tsang, Terri UCLA
Weiler, Khela Berkeley
Wong, Ka Lok UCLA
Speakers and Guests
Dacumos, Dean    UCLA
Estrin, Deborah    UCLA
James, David    Bell Labs
Kaiser, Bill    UCLA
Lambert, Diane    Bell Labs
Nychka, Doug    NCAR
Rice, John    Berkeley
Speed, Terry    Berkeley
Swayne, Deborah    AT&T Labs
Wu, Yingnian    UCLA
TAs and Graduate Students
Ahn, Soyeon    Berkeley
Barr, Chris UCLA
Brodsky, Jae UCLA
Farzinnia, Neda UCLA
Lenderman, Jason UCLA
Tong, Frances Berkeley
Tranbarger, Katie UCLA
Computer Wizardry
Hales-Garcia, Jose    UCLA
Photo Credits
Hansen, Mark    UCLA
Nolan, Deborah    Berkeley
Rosario, Ryan    UCLA

This program was made possible by grants from the Institute for Pure and Applied Mathematics at UCLA and from the American Statistical Association.

Sunday, June 19
From the beginning, participants were given a mix of context, concept and computation: applications provided the context, and statistical concepts guided our investigations, which were ultimately shaped by computing.

Our first meeting began with a gentle introduction to the R computing environment and the ggobi data visualization system. Deborah Swayne, one of the principal authors of ggobi, was on hand to give us a tour of its capabilities.

We were also treated to an excellent talk by Deborah Estrin, head of the Center for Embedded Network Sensing at UCLA. Deborah highlighted the statistical challenges in sensor network design, deployment and analysis.

Monday, June 20
The statistics of freeway traffic
John Rice, University of California, Berkeley

John guided us through data collected by the Freeway Performance Measurement System (PeMS), a real-time monitoring system tracking flows of traffic in California. We explored the relationship between flow (a measure of "throughput" of the freeway system) and occupancy (a measure of congestion) and how these vary over time, across lanes, and in response to "events." We also looked at forecasting travel times and evaluated predictions for trips with different start times.

Using data from the PeMS web site we finished the day with some questions:

· Taxi drivers claim that when traffic breaks down, the fast lane breaks down first, so they move immediately to the right lane. Can you see any such phenomenon in the data?
· How closely is flow or occupancy at one station related to that at a nearby station? Can you find time lags at which this relationship is strongest? Can you explain what you see?
· Choose one day, and look at the detail of the CHP incident data. Choose an interesting incident. Can you see whether this incident affected traffic flow in both directions? Is there any evidence of rubbernecking?
· Construct an animation showing the relation between flow and occupancy in time.
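
The time-lag question above can be illustrated with a small lagged-correlation sketch. It is written in Python rather than the R and ggobi used at the workshop, and it makes several assumptions: the two series are synthetic (not PeMS data), and the function names `pearson` and `lagged_correlation` are hypothetical helpers introduced only for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def lagged_correlation(upstream, downstream, max_lag):
    """Correlate upstream[t] with downstream[t + lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        x = upstream[: len(upstream) - lag]
        y = downstream[lag:]
        result[lag] = pearson(x, y)
    return result

# Synthetic series (an assumption, not PeMS data): the downstream station
# repeats the upstream signal three time steps later.
upstream = [math.sin(0.3 * t) + 0.1 * math.cos(1.7 * t) for t in range(200)]
downstream = [0.0] * 3 + upstream[:-3]

correlations = lagged_correlation(upstream, downstream, max_lag=10)
best_lag = max(correlations, key=correlations.get)
print(best_lag)
```

With real detector data, the lag at which the correlation peaks might be read as a rough travel time between the two stations, although noise and daily cycles would make the peak much less sharp than in this toy example.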

We worked in groups of three and presented our analyses; some groups worked in R, others preferred ggobi displays.

We finished the day with a dinner sponsored by the UCLA Statistics Department.

Tuesday, June 21
Terry Speed, University of California, Berkeley

The day started with a short lecture from Terry to define some terms. A Single Nucleotide Polymorphism (SNP) is a variant form of DNA at a well-defined position on a chromosome. Researchers are genotyping thousands of individuals at thousands of SNPs, in the hope of finding associations between SNP genotypes and diseases with genetic components, as well as other genetic traits. While there are many ways to determine a person's genotype at an SNP, our data came from an Affymetrix SNP chip. We looked at data for one SNP. Terry provided us with a data frame consisting of Affymetrix probe intensities for 90 individuals; for each person in the study, we were given intensities for 40 probes.

Two graduate students from UC Berkeley, Frances Tong and Soyeon Ahn, were on hand to help us analyze the data. As with the previous day, we worked in groups of 3. We started with simple exploratory analyses of the probe intensities, forming boxplots and heat diagrams, summarizing the 40 probes across individuals.

We next reduced the data to a pair of relative allele signals (RASs) and applied a simple cluster analysis to these derived measurements. We explored how well cluster membership agreed with genotype at the SNP. At the end of the day, our group presentations focused on repeating these analyses on different SNPs and commenting on what we found.

We ended the day with an open discussion led by five statistics graduate students: Neda Farzinnia, Katie Tranbarger, Jae Brodsky, Soyeon Ahn and Frances Tong. We had lots of questions about how life as a graduate student differs from life as an undergraduate, the process of selecting an advisor, and student "community."
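
The reduce-then-cluster idea can be sketched as follows. This is a simplified Python illustration, not the analysis carried out at the workshop. It assumes the commonly used definition RAS = A / (A + B), where A and B are summarized intensities for the two alleles. It uses a single RAS per person rather than a pair, substitutes a plain one-dimensional k-means for the unspecified "simple cluster analysis," and runs on made-up intensities. The function names are hypothetical.

```python
def relative_allele_signal(a_intensity, b_intensity):
    """RAS = A / (A + B): near 1 for AA, near 0.5 for AB, near 0 for BB."""
    return a_intensity / (a_intensity + b_intensity)

def kmeans_1d(values, centers, iterations=20):
    """Plain 1-D k-means from the given starting centers; returns (centers, labels)."""
    centers = list(centers)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            for v in values
        ]
        # Move each center to the mean of its members (empty clusters stay put).
        for k in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return centers, labels

# Made-up (A, B) summary intensities for six individuals, for illustration only.
intensities = [(900, 100), (850, 120), (500, 480), (520, 510), (90, 880), (110, 950)]
ras = [relative_allele_signal(a, b) for a, b in intensities]
centers, labels = kmeans_1d(ras, centers=[0.0, 0.5, 1.0])
genotype_calls = [("BB", "AB", "AA")[lab] for lab in labels]
print(genotype_calls)
```

Comparing `genotype_calls` with known genotypes is one way to check how well cluster membership agrees with genotype at the SNP.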

Wednesday, June 22
Diane Lambert, Bell Labs

The signal from wireless devices like PDAs and mobile phones can be picked up by various fixed receivers or access points. If you are near an access point, the (received) signal strength is large; if you are distant, it is small. Diane posed a simple question: Can measurements of signal strength at various wireless access points be used to identify a person's location? She brought a data set of signal strengths recorded at a handful of access points as a researcher roamed the corridors of one floor of an office building.
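
One simple way to approach Diane's question is nearest-neighbor "fingerprinting": store the signal strengths observed at known positions, then assign a new reading to the stored position whose signal vector is closest. The Python sketch below shows the idea under stated assumptions. The positions, the dBm values, and the function name `nearest_fingerprint` are all hypothetical, and this is not necessarily the method the groups used.

```python
import math

def nearest_fingerprint(observed, fingerprints):
    """Return the location whose stored signal vector is closest (Euclidean) to `observed`."""
    def distance(entry):
        _, signals = entry
        return math.sqrt(sum((o - s) ** 2 for o, s in zip(observed, signals)))
    location, _ = min(fingerprints, key=distance)
    return location

# Hypothetical training data: (x, y) position in metres paired with the
# signal strength (dBm) received at three access points.
fingerprints = [
    ((0.0, 0.0), (-40, -70, -80)),
    ((10.0, 0.0), (-60, -50, -75)),
    ((20.0, 0.0), (-75, -45, -60)),
    ((20.0, 10.0), (-85, -60, -42)),
]

estimate = nearest_fingerprint((-62, -52, -73), fingerprints)
print(estimate)
```

Averaging the positions of the few closest fingerprints, rather than taking only the single closest one, is a natural refinement.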

By this point in the program, we had enough experience with R to explore the data in a very open-ended way. Some groups created plots of signal strength, overlaying these measurements onto a map of the building, while others looked at the relationship between signal strength and distance to the access points. Time flew by on these projects; we were so absorbed that it was tough to break for lunch! Today, Jan de Leeuw, Chair of the UCLA Statistics Department, joined us for lunch at the faculty center. After lunch, we returned to work, ultimately presenting our findings through a series of group presentations.

We ended the day with a discussion led by Diane Lambert and Doug Nychka (our speaker for Thursday) about statistical research being conducted outside a university setting.

Thursday, June 23
Case study for precipitation on Colorado's Front Range

Doug Nychka, National Center for Atmospheric Research

In addition to Doug, we were joined by David James, a researcher from Bell Laboratories. Our project dealt with extreme precipitation events. The frequency of large amounts of rainfall is important for flood planning and affects how land and roadways should be developed. Doug provided us with climatic data from 56 weather stations located along Colorado's Front Range (FR).

The FR consists of relatively flat plains with a transition to high mountains. Because of this diversity, it is a useful study area for testing methods. The variable of interest is the total amount of precipitation in a 24-hour period during the "summer." Doug guided us through the following questions:

· What is the distribution of big precipitation events and how does this distribution vary over space?
· How can irregular station observations be extrapolated to locations where measurements are not made?
· What is the degree to which large precipitation events cluster in time?
· Given the answers to the previous questions, how well does a climate model simulation reproduce the features in the observed meteorology?
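
As a rough illustration of the first and third questions, the Python sketch below computes two purely empirical summaries for a single station: how often the daily total exceeds a threshold, and how many separate runs of consecutive exceedance days occur. The daily totals, the threshold, and the function names are made up for this example and are not the Front Range data or the methods Doug presented.

```python
def exceedance_summary(daily_totals, threshold):
    """Empirical probability that a day's total exceeds `threshold`, and the
    implied average number of days between exceedances."""
    count = sum(1 for x in daily_totals if x > threshold)
    probability = count / len(daily_totals)
    return_period_days = (1 / probability) if count else float("inf")
    return probability, return_period_days

def count_clusters(daily_totals, threshold):
    """Number of runs of consecutive days above `threshold`."""
    clusters, in_run = 0, False
    for x in daily_totals:
        above = x > threshold
        if above and not in_run:
            clusters += 1
        in_run = above
    return clusters

# Made-up daily totals (mm) for one station, for illustration only.
daily_totals = [0, 2, 0, 31, 40, 0, 0, 5, 0, 28, 0, 0, 1, 0, 0, 0, 35, 0, 0, 0]
probability, return_period = exceedance_summary(daily_totals, threshold=25)
clusters = count_clusters(daily_totals, threshold=25)
print(probability, return_period, clusters)
```

If exceedance days were spread independently through the season, the number of runs would be close to the number of exceedance days. Noticeably fewer runs than exceedance days suggests that large events cluster in time.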

Next, we took a quick field trip to Bill Kaiser's lab; Bill is part of the CENS group and works with robotic data collection devices.

We ended the day with another group discussion, this time on the process of applying to graduate school. Deb Nolan was joined by Dean Dacumos, the Student Affairs Officer for the Statistics Department at UCLA, for an open discussion about what graduate programs are looking for.

Friday, June 24

Computer Vision
Yingnian Wu, UCLA

This was the last day of our program, and it was divided into two sections. In the first, Yingnian gave us a basic introduction to some of the problems in computer vision. We learned about R's capabilities for working with image data, and applied some simple filters to digital photographs taken from the first five days of our workshop. Yingnian then described some basic statistical properties of natural images, including so-called "scaling" behavior (what happens as you zoom in and out of a scene).
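
A simple image filter of the kind mentioned above can be sketched as a 3x3 mean (box) filter. The example is in Python rather than R, treats a grayscale image as a list of rows, and uses a tiny made-up image. The function name `box_blur` and its handling of border pixels are assumptions made for this illustration.

```python
def box_blur(image):
    """3x3 mean filter on a grayscale image given as a list of rows.
    Border pixels average over whichever neighbours exist."""
    height, width = len(image), len(image[0])
    blurred = []
    for r in range(height):
        row = []
        for c in range(width):
            neighbours = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
            ]
            row.append(sum(neighbours) / len(neighbours))
        blurred.append(row)
    return blurred

# Tiny made-up 3x3 "image" with one bright pixel in the centre.
image = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
blurred = box_blur(image)
print(blurred[1][1])
```

Each output pixel is the average of its neighbourhood, so the single bright pixel is spread over the surrounding pixels and sharp detail is smoothed away.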

After lunch, we had a series of presentations. Some students spoke in groups, others flew solo. In each case, they were asked to dig a little deeper into some of the data they had seen during the week and talk about an "analysis" that they found particularly satisfying. For some, this was a plot that they were proud of, while for others it was an investigation, a thought process, that led them to some interesting conclusions.

We ended the program with one final group dinner at the Westwood Brewing Company. Far too much food was ordered, but no one went away hungry!