Data Visualization and its Role in the Practice of Statistics
An Undergraduate Summer Program in Statistics
June 19-24, 2005
Organizers
Mark Hansen (UCLA)
Vijay Nair (University of Michigan)
Deborah Nolan (UC Berkeley)
Duncan Temple Lang (UC Davis)
Bin Yu (UC Berkeley)
Overview
Today, almost every aspect of our lives is "rendered"
in data. New data collection technologies have made it easy to record
continuous, high-resolution measurements of our physical
environment (weather patterns, seismic events,
the human genome).
We're also constantly monitoring
our movements through and interactions with
our physical surroundings (automobile and air traffic, large-scale
land use, advanced manufacturing facilities).
In computer-mediated settings, our activities either depend
crucially on or consist entirely of complex digital data (networked games,
peer-to-peer technologies, Web site and Internet usage).
As a reflection of the
diversity and variety of the "systems" under study, these data-based
descriptions of our world tend to be massive in size, dynamic in
character, and replete with rich structures. The advent of these
enormous repositories of information presents us with an
interesting challenge: how can we represent and interpret such
complex, abstract and often socially important data?
This workshop is designed to introduce undergraduates to the
exciting work being done in statistics, the science of data.
| Participants | |
| | |
| Akhtar, Syed | | Williams |
| Bircan, Cagatay | | Williams |
| Chan, Chun Hung | | UCLA |
| Christenson, Erica | | Berkeley |
| Fortin, Dan | | UW |
| Goldman, Megan | | Pitt |
| Gracien, Katina | | NCSU |
| Hodgson, Laura | | UW |
| Horvath, Zsuzsanna | | U Utah |
| Kelly, Megan | | U Chicago |
| Kenaga, Margaret | | Berkeley |
| Lee, Meng-Ju | | Purdue |
| Lee, Tammy | | Berkeley |
| Liggonah, Sayi | | LSU |
| Nathan, Sandy | | Berkeley |
| Neff, Christopher | | UC Davis |
| Nguyen, Vinh | | UC Irvine |
| Omidiran, Chris | | Rice |
| Palm, Yvonne | | Grinnell |
| Reyes, Cherene | | UW |
| Rosario, Ryan | | UCLA |
| Stefanski, Doug | | NCSU |
| Tsang, Terri | | UCLA |
| Weiler, Khela | | Berkeley |
| Wong, Ka Lok | | UCLA |
| | |
| Speakers and Guests | |
| | |
| Dacumos, Dean | | UCLA |
| Estrin, Deborah | | UCLA |
| James, David | | Bell Labs |
| Kaiser, Bill | | UCLA |
| Lambert, Diane | | Bell Labs |
| Nychka, Doug | | NCAR |
| Rice, John | | Berkeley |
| Speed, Terry | | Berkeley |
| Swayne, Deborah | | AT&T Labs |
| Wu, Yingnian | | UCLA |
| | |
| TAs and Graduate Students | |
| | |
| Ahn, Soyeon | | Berkeley |
| Barr, Chris | | UCLA |
| Brodsky, Jae | | UCLA |
| Farzinnia, Neda | | UCLA |
| Lenderman, Jason | | UCLA |
| Tong, Frances | | Berkeley |
| Tranbarger, Katie | | UCLA |
| | |
| Computer Wizardry | |
| | |
| Hales-Garcia, Jose | | UCLA |
| | |
| Photo Credits | |
| | |
| Hansen, Mark | | UCLA |
| Nolan, Deborah | | Berkeley |
| Rosario, Ryan | | UCLA |
This program was made possible by grants from the Institute for
Pure and Applied Mathematics at UCLA and from the American Statistical
Association.
Sunday, June 19
From the beginning, participants were given a mix
of context, concept and computation: applications provided
the context, and statistical concepts guided
our investigations, which were ultimately shaped
by computing.
Our first meeting began with a gentle
introduction to the R
computing environment and the ggobi
data visualization system. Deborah Swayne, one
of the principal authors of ggobi, was on hand to give us a tour of
its capabilities.
We were also treated to an excellent talk by
Deborah Estrin,
head of the Center for Embedded Network
Sensing at UCLA. Deborah highlighted the statistical challenges
in sensor network design, deployment and analysis.
Monday, June 20
The statistics of freeway traffic
John Rice, University of California, Berkeley
John guided us through data collected by the
Freeway Performance Measurement
System (PeMS), a real-time monitoring system tracking flows of traffic
in California. We explored the relationship between flow (a measure
of "throughput" of
the freeway system) and occupancy (a measure of congestion) and how
these vary over time, across lanes, and in response to "events."
We also looked at forecasting travel times and evaluated predictions
for trips with different start times.
Using data from the PeMS web site we finished the day with some questions:
· Taxi drivers claim that when traffic breaks down, the fast lane breaks down first, so they move immediately to the right lane. Can you see any such phenomenon in the data?
· How closely is flow or occupancy at one station related to that at a nearby station? Can you find the time lags at which this relationship is strongest? Can you explain what you see?
· Choose one day, and look at the detail of the CHP incident data. Choose an interesting incident. Can you see whether this incident affected traffic flow in both directions? Is there any evidence of rubbernecking?
· Construct an animation showing the relation between flow and occupancy over time.
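One way to approach the question about time lags between nearby stations is a lagged correlation. The following is only a minimal sketch in Python, not the analysis any group actually carried out: the two series are synthetic stand-ins rather than PeMS data, and the function name `lagged_correlations`, the built-in lag of 3 time steps, and the noise levels are all made up for illustration.

```python
import numpy as np

def lagged_correlations(upstream, downstream, max_lag):
    """Correlate upstream[t] with downstream[t + lag] for lag = 0..max_lag."""
    upstream = np.asarray(upstream, dtype=float)
    downstream = np.asarray(downstream, dtype=float)
    result = {}
    for lag in range(max_lag + 1):
        x = upstream[: len(upstream) - lag]  # drop the last `lag` points
        y = downstream[lag:]                 # drop the first `lag` points
        result[lag] = float(np.corrcoef(x, y)[0, 1])
    return result

# Synthetic flow series (assumption: NOT real PeMS measurements).
# The downstream station repeats the upstream pattern `true_lag` steps later.
rng = np.random.default_rng(0)
true_lag = 3
n = 500
t = np.arange(n + true_lag)
base = 100 + 30 * np.sin(2 * np.pi * t / 96) + rng.normal(0, 15, t.size)
upstream = base[true_lag:]
downstream = base[:-true_lag] + rng.normal(0, 2, n)

correlations = lagged_correlations(upstream, downstream, max_lag=10)
best_lag = max(correlations, key=correlations.get)
print("lag with strongest correlation:", best_lag)
```

With real detector data the peak is usually much less sharp, so plotting the whole correlation-versus-lag curve is more informative than reporting a single best lag.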
We worked in groups of three and presented our analyses; some groups
worked in R, others preferred ggobi displays.
We finished the day with a dinner sponsored by
the UCLA Statistics Department.
Tuesday, June 21
Genotyping
Terry Speed, University of California, Berkeley
The day started with a short lecture from Terry to define some terms:
a single nucleotide polymorphism (SNP) is a variant form of DNA at
a well-defined position on a chromosome.
Researchers are genotyping thousands of individuals at thousands of SNPs,
in the hope of finding associations between SNP genotypes and diseases with
genetic components, as well as other genetic traits.
While there are many ways
to determine a person's genotype at an SNP, our data came from
an Affymetrix SNP chip.
We looked at data for one SNP. Terry provided us with a data frame
consisting of Affymetrix probe intensities for 90 individuals;
for each person in the study, we were given intensities for 40
probes.
Two graduate students from UC Berkeley, Frances Tong and Soyeon Ahn, were
on hand to help us analyze the data. As on the previous day, we
worked in groups of three. We started with simple exploratory
analyses of the probe intensities, forming boxplots and heat diagrams
and summarizing the 40 probes across individuals.
We next reduced the data
to a pair of relative allele signals (RASs) and ran a simple
cluster analysis on these derived measurements. We explored how well
cluster membership agreed with genotype at the SNP. At the end of the
day, our group presentations focused on repeating these analyses on
different SNPs and commenting on what we found.
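The reduce-then-cluster step can be sketched roughly as follows. This is an illustration under stated assumptions, not the workshop's actual code: it assumes the commonly used definition RAS = A / (A + B) for allele A and allele B intensities, it assumes the "pair" means one RAS per strand, it uses a hand-written k-means as the "simple cluster analysis", and all intensities are synthetic.

```python
import numpy as np

def relative_allele_signal(a, b):
    """RAS = A / (A + B), from allele A and allele B intensities (assumed definition)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return a / (a + b)

def kmeans(points, initial_centers, n_iter=20):
    """Plain k-means: assign each point to its nearest center, then re-average."""
    centers = np.array(initial_centers, dtype=float)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    # Final assignment against the last set of centers.
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1), centers

# Synthetic data for 90 individuals (assumption: NOT the Affymetrix data).
# Genotype codes: 0 = BB, 1 = AB, 2 = AA.
rng = np.random.default_rng(0)
genotype = np.repeat([0, 1, 2], 30)
expected_fraction = np.array([0.1, 0.5, 0.9])[genotype]

def simulate_strand():
    """Return made-up (A, B) summary intensities for one strand."""
    total = rng.uniform(800, 1200, genotype.size)
    frac = np.clip(expected_fraction + rng.normal(0, 0.05, genotype.size), 0.01, 0.99)
    return total * frac, total * (1 - frac)

a1, b1 = simulate_strand()
a2, b2 = simulate_strand()
ras_pair = np.column_stack([relative_allele_signal(a1, b1),
                            relative_allele_signal(a2, b2)])

# RAS lies in [0, 1], so start the three centers along the diagonal.
labels, centers = kmeans(ras_pair, [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]])
agreement = float(np.mean(labels == genotype))
print("fraction of individuals whose cluster matches genotype:", agreement)
```

Starting the centers along the diagonal makes the cluster numbering line up with the genotype codes, which keeps the agreement check simple; with random starts the clusters would need to be matched to genotypes first.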
We ended the day with an open discussion led by five statistics graduate
students: Neda Farzinnia, Katie Tranbarger, Jae Brodsky,
Soyeon Ahn and Frances Tong. We had lots of questions about
the differences between life as an undergraduate and life as a
graduate student, the process of selecting an advisor, and student "community."
Wednesday, June 22
Geolocation
Diane Lambert, Bell Labs
The signal from wireless devices like PDAs and mobile phones can
be picked up by various fixed receivers or access points. If you
are near an access point, the (received) signal strength is large; if
you are distant, it is small. Diane
posed a simple question: Can measurements of signal strength
at various wireless access points be used to identify a person's
location? She brought a data set of signal strengths recorded at
a handful of access points as a researcher roamed the corridors
of one floor of an office building.
By this point in the program, we had enough experience with R to
explore the data in a very open-ended way. Some groups created plots
of signal strength, overlaying these measurements onto a map of the
building, while others looked at the relationship between signal strength
and distance to the access points. Time flew by on these projects; we
were so absorbed that it was tough to break for lunch!
Jan de Leeuw, Chair
of the UCLA Statistics Department, joined us for lunch that day at the faculty
center. After lunch, we returned to work, ultimately sharing
our findings in a series of group presentations.
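The relationship between signal strength and distance that some groups examined can be illustrated with a simple regression. The sketch below assumes a log-distance model, signal = intercept + slope * log10(distance), which is a common textbook choice and not something Diane's data or talk is known to have prescribed. The measurements are synthetic, and the names `distance_m`, `signal_dbm` and `estimate_distance` are hypothetical.

```python
import numpy as np

# Synthetic measurements (assumption: NOT the office-building data).
rng = np.random.default_rng(1)
distance_m = rng.uniform(1.0, 50.0, size=200)
signal_dbm = -40.0 - 25.0 * np.log10(distance_m) + rng.normal(0, 2.0, size=200)

# Least-squares fit of signal strength against log10(distance).
slope, intercept = np.polyfit(np.log10(distance_m), signal_dbm, deg=1)

def estimate_distance(signal, slope, intercept):
    """Invert the fitted line to turn a signal strength into a distance."""
    return 10 ** ((signal - intercept) / slope)

print("fitted slope (dB per decade of distance):", round(float(slope), 2))
print("fitted intercept (dB at 1 m):", round(float(intercept), 2))
```

Inverting the fit for several access points gives a set of distance estimates that could then be combined into a location guess. Walls and corridors make real indoor signals far noisier than this model suggests.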
We ended the day with a discussion led by Diane Lambert and Doug
Nychka (our speaker for Thursday) about statistical research
being conducted outside a university setting.
Thursday, June 23
Case study for precipitation on Colorado's Front Range
Doug Nychka, National Center for Atmospheric Research
In addition to Doug, we were joined by David James, a researcher
from Bell Laboratories. Our project dealt with extreme precipitation
events.
The frequency of large amounts of rainfall is important for planning
for floods and affects how land and roadways should be developed.
Doug provided us with climatic data from 56 weather stations located
along Colorado's Front Range (FR).
The FR consists of relatively flat
plains with a transition to high mountains. Because of this diversity,
it is a useful study area for testing methods.
The variable of interest is the total amount
of precipitation in a 24-hour period during the "summer."
Doug guided us through the following questions:
· What is the distribution of big precipitation events and how does this distribution vary over space?
· How can irregular station observations be extrapolated to locations where measurements are not made?
· To what degree do large precipitation events cluster in time?
· Given the answers to the previous questions, how well does a climate model simulation reproduce the features in the observed meteorology?
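For the question about extending irregular station observations to unobserved locations, one very simple baseline is inverse-distance weighting. The sketch below is only a stand-in to make the idea concrete. It is not the spatial method Doug presented, and the station coordinates, precipitation values, and the function name `idw_predict` are all invented for illustration.

```python
import numpy as np

def idw_predict(station_xy, station_values, target_xy, power=2.0):
    """Inverse-distance-weighted average of station values at one target point."""
    station_xy = np.asarray(station_xy, dtype=float)
    station_values = np.asarray(station_values, dtype=float)
    target_xy = np.asarray(target_xy, dtype=float)
    dist = np.linalg.norm(station_xy - target_xy, axis=1)
    if np.any(dist == 0):
        # The target coincides with a station: return that station's value.
        return float(station_values[dist == 0].mean())
    weights = 1.0 / dist ** power
    return float(np.sum(weights * station_values) / np.sum(weights))

# Made-up stations (x, y in km) and 24-hour precipitation totals (mm).
stations = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
precip_mm = [20.0, 40.0, 60.0]

between = idw_predict(stations, precip_mm, (5.0, 0.0))
at_station = idw_predict(stations, precip_mm, (0.0, 0.0))
print("estimate between stations:", round(between, 2))
print("estimate at a station:", at_station)
```

A weighted average like this can never exceed the largest observed value, which is a real limitation when the quantity of interest is extreme precipitation in terrain as varied as the Front Range.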
Next, we took a quick field trip to Bill Kaiser's
lab; Bill is part of the CENS group and works with robotic data
collection devices.
We ended the day with another group discussion, this time
on the process of applying to graduate school. Deb Nolan was joined by
Dean Dacumos, the Student Affairs Officer for the Statistics Department at UCLA,
for an open discussion about what graduate programs are looking for.
Friday, June 24
Computer Vision
Yingnian Wu, UCLA
This was the last day of our program, and it was divided
into two sections. In the first, Yingnian gave us a basic
introduction to some of the problems in computer vision.
We learned about R's capabilities for working with image
data, and applied some simple filters to digital photographs
taken from the first five days of our workshop. Yingnian then
described some basic statistical properties of natural images,
including so-called "scaling" behavior (what happens as you
zoom in and out of a scene).
After lunch, we had a series of presentations. Some students
spoke in groups, others flew solo. In each case, they were
asked to dig a little deeper into some of the data they
had seen during the week and talk about an "analysis" that they
found particularly satisfying. For some, this was a plot
that they were proud of, while for others it was an investigation,
a thought process, that led them to some interesting conclusions.
We ended the program with one final group dinner
at the Westwood Brewing Company. Far too much
food was ordered, but no one went away hungry!