Archive for May, 2005

May 31 2005

Reporting exploratory data

Published by matt under Uncategorized

A nice summary re: the writing process for reporting exploratory data.

Comments Off

May 29 2005

Mix tapes

Published by matt under Uncategorized

I’m just not focused this morning. Don’t know what it is.

Yesterday, I dug my minidisc out. This tool was essential during the early stages of my research—for recording interviews, lectures (mine), and those of classes I was taking at the time. A Sony MZ-R700, lime green. Of course, it was a frustrating tool, as transferring the audio to the PC could only be done in analog real-time, and as a result very few things were transferred from the MD to the Powerbook.

The minidisc has come out a number of times for train trips across England, but carrying it around on a daily basis wasn’t something I did. My iPod Shuffle tends to get more mileage now, simply because it has almost zero mass, and doubles as a USB flash drive. However, there are still a few minidiscs that are important to me that I’ll have to digitize at some point.

I have countless lectures and research discs; these will probably never play a role in my academic work at this point, despite the hours invested in them. I have a few I’ve recorded at holiday time around my home, my grandparents’, and a few other random places. The recordings aren’t of anything in particular—they’re just ambient, picking up conversations and laughter of family and friends. And then, I have two mix discs.

Before leaving the States, I transferred two mix tapes to minidisc. One was made in 1994 for me by Carrie before I headed off on my first Chasers tour. As it turns out, I had a lot of time to listen to that tape: I made a bad navigation choice, and ended up driving through the mountains of Virginia during a really awful blizzard. It was, by far, the worst weather I’ve ever driven in, and I am lucky that the two times I lost control of my vehicle I brought it to a stop on the road, not over a cliff. What should have been a 7-hour drive became 15 hours of slow, blind hell as I wound my way up, over, and down the switchbacks that are characteristic of state roads in that part of the country.

The second tape, also made by Carrie, did not get put to the same dramatic use, but kept me company during her year in Wales. Now, I think that it has kept me company from time-to-time while—again—she was in Wales. It was just for three years, this time.

11 years is a long life for a mix tape.

May 29 2005

PayPal scam emails

Published by matt under Uncategorized

They really do a good job.

If you recently noticed one or more attempts your account while traveling, the unusual log in attempts may have been initiated by you. However, if your are rightful holder of the account, click on the link below to log into your account and fallow the intrusctions.

Of course, fallow and intrusctions usually have slightly less creative spellings.

Thanks for your patiance as we work together to protect your account.

 
Sincerly,
PayPal Account Review Department
PayPal, an ebay Company

Sincerly is a nice touch as well. Still, I looked twice.

May 27 2005

Stats, continued

Published by matt under Uncategorized

I think there’s a bit of context required here.

The statistical part of my dissertation is not the most important; while collecting this data, the gross statistical investigations that I carried out previously, while inexpert, were descriptive in nature, and sufficiently clear to guide further investigation.

As I begin assembling this data, I’ve become… concerned with how this data is presented. This concern stems from a number of things; first and foremost, I want to do a thorough job of presenting the data; second, I think the data needs to be seen and understood by more teachers of programming, as I think it presents part of a story of how students learn to program that hasn’t been told in this way before. Third, but unrelated, is the fact that I don’t like not understanding things: in this case, I am not a statistician, and it is frustrating knowing there are tools out there that I don’t know how to use.

I’ve come to appreciate that statistics involves as much process (design and implementation) as architecting software. Exploratory data analysis might be likened to refactoring a system that already exists: something is there for you to touch and explore, but making sense of it and restructuring it so that you and others can understand it is the goal.

Or perhaps not. Regardless, I’m going to focus on doing the kinds of explorations I understand, and I’ll leave the inferential statistics and modeling until later. For now, I want to get a sense for the breadth and depth of my full data set, and I’ll circle back around to fill in holes later.

May 27 2005

Comments on my (bad) stats

Published by matt under Uncategorized

I have no problem admitting that I lack expertise… in things that I lack expertise in. :)

As a result of these comments, I’m going to try to get someone who really understands exploratory data analysis and statistics to adopt me. The truth is that I’m sitting in the middle of a complex data set, with powerful tools for analyzing it, and no idea what I’m doing. It’s like giving a 3-year-old a chainsaw and telling them to be careful…

From Noel:

To compare fit between different distributions, you can
compare the probability of each distribution generating the
observed data.

The normal distribution is an approximation to the binomial
distribution when the number of samples is large and the
probability of success is not too small. Your data is not
normal because it is too close to the y-axis
(i.e. probability of success is too small) but a binomial
distribution might still fit it well.

I haven’t read your posts in detail so you may answer this:

You need to consider the model of the process that generates
the data. For example, the binomial assumes a process with
a fixed probability of success which generates the samples.
There is no notion of time. Poisson does consider time and
so may be more appropriate.

You have to consider at least: time between compilation,
number of sessions recorded for each student, and that each
student may be generating data by a different process. So
fitting a distribution to just number of compilations per
student may not be too meaningful.

All good points. All things I didn’t cover in my qualitative research methods courses. There are times when you wish you could go back and do things differently—right now, I wish I had made a point of taking some stats classes along the way…
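Noel's first suggestion (compare how probable the observed data is under each candidate distribution) can be sketched in R. This is only an illustration under loud assumptions: the counts are simulated, not my data, and the two candidates (Poisson and geometric) are picked because both are one-parameter count models available in base R, not because either is known to be the right model here.

```r
# Hypothetical counts standing in for "compilations per student".
# Simulated for illustration; NOT the real data set.
set.seed(42)
compiles <- rgeom(200, prob = 0.02)   # spike near zero, long tail

# Maximum-likelihood estimates for two candidate count models.
lambda_hat <- mean(compiles)            # Poisson rate
p_hat      <- 1 / (1 + mean(compiles))  # geometric success probability

# Log-likelihood: log probability of the observed data under each model.
loglik_pois <- sum(dpois(compiles, lambda = lambda_hat, log = TRUE))
loglik_geom <- sum(dgeom(compiles, prob = p_hat, log = TRUE))

cat("Poisson log-likelihood:  ", loglik_pois, "\n")
cat("Geometric log-likelihood:", loglik_geom, "\n")
```

The model with the higher (less negative) log-likelihood makes the observed data more probable. Because both models here have a single fitted parameter, the raw comparison is fair; models with different numbers of parameters would need a penalized comparison such as AIC. None of this addresses Noel's larger point about choosing a model of the process that generates the data.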

From Robin:

I was reading your blog of May 25. A couple of your green-boxed
points made me wonder… so a quick note seemed apropos.

Not sure what you mean by your first point. Transformations transform
the data — that’s most anybody’s definition of “changing the data”.
For instance, I could multiply all the values by zero. That would
change it from “interesting data” into “boring data”.

Taking the log of both variables (i.e., x and y) is a transformation
designed to identify power laws. If you have data where y ~ A*x^p,
then a log-log transformation (x’=log(x), y’=log(y)) transforms it into

y’ ~ p*x’ + log(A)

So, if your initial data has a good linear fit, then it means y and x
are linearly related. If your log-log data has a good linear fit,
then it means they’re related via a power law. In your case, it looks
like the exponent is p~1.25. Which is pretty close to 1, which is why
your initial data looks pretty linear too.

Sorry if you knew all that. It just wasn’t obvious from Mr. Weblog.

2nd. A good fit doesn’t always imply causality. Ask Meg for a rant
about how often people assume that it does when it really doesn’t.
Now, you say “A good fit in this case implies causality”, which might
be true. I don’t see it, though. Maybe both are caused by the same
thing — the student with lots of compiles and sessions has epilepsy,
and twitches a lot, and therefore accidentally pushes keys. I dunno.

Anyway, I’m just suggesting that before you assert causality to (e.g.)
your thesis defense board, you run that argument by a really hardass
statistician and make sure he/she agrees. ‘Cause as I understand it,
unfounded assertions of causality are a faux pas.

To which Meg adds:

Re: causality. The ONLY way that you can make causal inferences is if you can observe a relationship between _something_ that was randomly assigned and your outcome of interest. Otherwise, you can only say they’re correlated, and while that may give you good predictive power, you can’t claim causality. There are some tricks you can do, but I won’t bore you with details unless you need them.

I definitely need to be adopted by a statistician. In a hurry.
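Robin's power-law point is easy to check on made-up data. The sketch below assumes a hypothetical relationship y = A*x^p with multiplicative noise (the values of A and p are invented for the example) and shows that a straight-line fit to the log-log data recovers p as the slope and log(A) as the intercept.

```r
# Simulated power-law data; A and p are hypothetical, chosen for illustration.
set.seed(1)
A <- 11
p <- 1.2
x <- runif(200, min = 1, max = 80)          # e.g. number of sessions
y <- A * x^p * exp(rnorm(200, sd = 0.3))    # multiplicative noise

# Linear fit on the log-log scale: log(y) ~ p * log(x) + log(A)
fit <- lm(log(y) ~ log(x))

slope_hat     <- unname(coef(fit)[2])       # estimate of p
intercept_hat <- unname(coef(fit)[1])       # estimate of log(A)

cat("estimated p:", slope_hat, " true p:", p, "\n")
cat("estimated A:", exp(intercept_hat), " true A:", A, "\n")
```

A slope near 1 on the log-log scale means the relationship is close to plain proportionality, which is why data with an exponent only a little above 1 also looks fairly linear on the original scale.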

May 26 2005

Transferring funds

Published by matt under Uncategorized

The most tedious thing about living abroad is finances: you have two bank accounts, two sets of credit cards, etc. Now, it is kinda nice, as I have purchasing power (and a verified mailing address) in two countries, and that’s especially nice for purchasing cheap goods in the US (the GBP goes a long way against the US dollar right now). However, paying for the US credit card means getting money from the UK to the US.

Enter PayPal. The setup is kinda slick:

  1. I send cash from my UK account to Carrie, who has a US account linked to our US bank.
  2. We pay a 3% fee for the currency exchange.
  3. Carrie receives an email indicating she has received money from me
  4. She logs onto PayPal to accept the funds
  5. She initiates the transfer from PayPal to the US bank account (free)
  6. Done.

The 3% fee is a very good rate; I think that for an electronic wire transfer from a UK bank to a US bank I’m going to pay something like $50 in fees. So, for small, infrequent transfers ($500 or less), this is an excellent way to move money from one place to another.
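The arithmetic behind "small transfers" can be made explicit. This sketch takes the 3% PayPal fee from the steps above and treats the $50 wire fee as exactly what it is in the text: a rough guess, not a quoted bank rate.

```r
paypal_rate <- 0.03   # percentage fee on the amount sent
wire_fee    <- 50     # ASSUMPTION: the rough flat-fee guess from the text

amounts <- c(100, 500, 1000, 2000)
fees <- data.frame(
  amount     = amounts,
  paypal_fee = amounts * paypal_rate,
  wire_fee   = wire_fee
)
print(fees)

# Amount at which the percentage fee equals the flat fee.
break_even <- wire_fee / paypal_rate
cat("Break-even transfer size:", round(break_even, 2), "\n")
```

Under those assumptions the percentage fee is cheaper for anything below the break-even amount, and a flat-fee wire only wins for larger transfers.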

Of course, if I don’t pay my Visa bill, then I pay interest…

May 25 2005

Compilation and sessions

Published by matt under Uncategorized

Update 20050528
From Meg:

Robin sent me the address for your blog, so I’ve got a slightly better idea of what you’re doing. Your basic recollection about R^2 is correct, but let me elaborate: the R^2 value tells you what percentage of the variation in the observations is accounted for by your model. Your statistical package attempts to fit a line through the data, and the difference between its prediction and the actual observation is the “residual” for that observation.

The t-test results in your output tell you whether the coefficient on each of your variables is significantly different from zero. The F-test statistic at the bottom tells you whether the model as a whole is significant. You’re right to be cautious about these statistics, because there’s a fundamental assumption of OLS regression that, if violated, will make your confidence intervals too small (and therefore suggest you have significant results even if you don’t really): the residuals must be random (not correlated with any of the variables) and normally distributed. We can test the latter.
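A minimal sketch of what Meg describes, on simulated data rather than my own: compute R^2 by hand from the residuals, then test the residuals for normality. The variable names and the deliberately skewed error term are assumptions made for the example; `shapiro.test` is one common normality test in base R, not necessarily the one Meg had in mind.

```r
# Simulated stand-in for the sessions/compiles data, with skewed errors.
set.seed(7)
sessions <- sample(1:70, 200, replace = TRUE)
compiles <- 27 * sessions + rexp(200, rate = 1 / 200)

fit <- lm(compiles ~ sessions)
res <- residuals(fit)    # observed value minus the model's prediction

# R^2: share of the variation in the observations accounted for by the model.
r2_by_hand <- 1 - sum(res^2) / sum((compiles - mean(compiles))^2)
cat("R^2 by hand:     ", r2_by_hand, "\n")
cat("R^2 from summary:", summary(fit)$r.squared, "\n")

# Shapiro-Wilk test: a small p-value is evidence AGAINST normal residuals.
normality <- shapiro.test(res)
cat("Shapiro-Wilk p-value:", normality$p.value, "\n")
```

With errors this skewed the test should reject normality, which is the situation where the confidence intervals and significance stars from the regression deserve suspicion. A normal Q-Q plot of the residuals (`qqnorm(res); qqline(res)`) shows the same thing visually.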

Duh! I forgot some very simple things:

  • A transformation applied, uniformly, across the data does not change the data. (eg. the log-log transform I perform)
  • The R2 statistic tells me how good a fit I have; if it is 1, it’s a perfect fit. Close to 1 is a good fit. Therefore, the R2 value of .82 or so from my log-log plot is a good fit.
  • A good fit in this case implies causality, and therefore we have a causal relation between the number of sessions and number of compiles. This, I think, is reasonable. That is, more sessions should yield more compilation events.

So, it’s true: taking breaks and eating are good things to do when you’re working hard on things that require careful thought. Asking people who know more than you is a good strategy as well.

I’ve been exploring issues of timing in my analysis of my data, but something that I had to move past is worth noting here, mostly so I can point people at it when I ask them questions.

In looking at the number of compilations students engaged in over the course of a semester, I also looked at the number of sessions involved in generating that data. I’m defining a session as:

  1. A student opening BlueJ
  2. Writing some code (or editing, or whatever), generating one or more compilation events
  3. Closing BlueJ

I can’t detect sessions where they open BlueJ and don’t compile anything; this, however, doesn’t particularly matter.
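The session definition above can be sketched as a small counting exercise. The event log here is entirely hypothetical: the column names, the open/compile/close event labels, and the assumption that events arrive in order per student are invented for illustration and say nothing about how the real data is stored.

```r
# Hypothetical, time-ordered event log (NOT the real data format).
events <- data.frame(
  student = c("s1", "s1", "s1", "s1", "s1", "s1", "s2", "s2", "s2"),
  event   = c("open", "compile", "compile", "close",
              "open", "close",
              "open", "compile", "close"),
  stringsAsFactors = FALSE
)

# Session id: running count of "open" events within each student.
events$session <- ave(as.integer(events$event == "open"),
                      events$student, FUN = cumsum)

# Only compile events are observable, so a session with no compiles vanishes.
compile_rows <- events[events$event == "compile", ]
per_session  <- aggregate(event ~ student + session,
                          data = compile_rows, FUN = length)
names(per_session)[3] <- "compiles"

sessions_per_student <- tapply(per_session$session, per_session$student, length)
compiles_per_student <- tapply(per_session$compiles, per_session$student, sum)
print(cbind(sessions = sessions_per_student, compiles = compiles_per_student))
```

In this toy log, student s1 opens BlueJ twice but only one session is counted, because the second one contains no compilation events.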

Now, the number of compilation events generated and the number of sessions are reasonably similar distributions; they both have a spike near zero, and a fairly long tail. So, just looking at the distributions, you might think that we have a fairly clear correspondence between the number of times a student opens BlueJ to do some programming, and the number of compilation events they are responsible for.

So, I decided to plot the two against each other. I took the number of sessions a student engaged in, and the number of events they generated over the course of those sessions. Looking at the Fall 2003 data, the Spring 2004 data, and the 2004-2005 data, they look like slightly different distributions, at a glance anyway (Figure 1).

Figure 1: Compilation events vs. the number of sessions students engaged in for three populations.

There’s a clear clustering of points where students have taken part in a small number of sessions and, therefore, generated a small number of compilation events. However, the spread outwards and upwards is interesting, as it becomes a source of high variability in the population.

Figure 2 overlays these three plots on top of each other.

Figure 2: The session vs. compilation data for three populations, overlaid.

The spread seen in the 04-05 data is reinforced, if you will, by the partial data from the Fall of 2003 and Spring of 2004; of course, the Fall data was only collected when students were in-class, which limits the amount of data we have available. To be honest, I don’t know how I should analyze this particular spread. In the 1200-compilation-event range, there are three students; one generated their 1200 events in 20 sessions, another in roughly 38, and a third student in nearly 65 sessions (I’m eyeballing this from Figure 2, and am not quoting source data). Likewise, in the 60-65 session range, there is one student from Spring 2004 who generated 800 events (a half-year, mind), while over the course of the entire year, I have three students who generated 1200, 1500, and 2500 compilation events.

Put another way, some students are very prolific in a single session, while others only compile a few times. Why? What does it mean? I don’t know yet; something for the interviews, no doubt.
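One crude way to put a number on "prolific" is compilation events per session. The figures below are the eyeballed values for the three students near 1200 events quoted above, not source data, so the result is only as good as that eyeballing.

```r
# Eyeballed values from the text (three students near 1200 events).
events_generated <- c(1200, 1200, 1200)
sessions         <- c(20, 38, 65)

compiles_per_session <- events_generated / sessions
print(round(compiles_per_session, 1))
```

The same total is reached at very different rates, roughly a threefold difference between the first and third student. A single ratio hides how events are spread across sessions, so it is a starting point for questions rather than an answer.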

Figure 3: Log-log plot of the same data

If I plot the log of the data (Figure 3), it pulls it out to something that is almost, but not quite, linear in shape. Granted, it’s noisy as all get-out, but I don’t necessarily expect my students to have a narrow range of behaviors. Unfortunately, I don’t know where to go next at the moment; what does it mean that the log-log plot of the data is roughly linear? Does that tell me anything about the relationship between the number of compilation events a student generated and the number of sessions they engaged in? Can I use this to determine who is more or less “prolific” than other students?

Off to the stats desk. Perhaps I’ll be given more reading material…

Update, a few minutes later

Another problem is that I don’t know how to interpret basic statistical information. For example, I can ask R to perform a linear regression analysis over my data; here, I’m working with the 04-05 dataset.

> summary(lm(compilesvssessionsdeltas0405$compiles ~
             compilesvssessionsdeltas0405$session))
Call:
lm(formula = compilesvssessionsdeltas0405$compiles ~
             compilesvssessionsdeltas0405$session)

Residuals:
    Min      1Q  Median      3Q     Max
-442.87 -175.21  -50.84   87.01 1171.39 

Coefficients:
                                     Estimate Std. Error
(Intercept)                           -59.600     54.970
compilesvssessionsdeltas0405$session   27.257      2.266
                                     t value Pr(>|t|)
(Intercept)                           -1.084    0.282
compilesvssessionsdeltas0405$session  12.027   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Residual standard error: 278.6 on 66 degrees of freedom
Multiple R-Squared: 0.6867,    Adjusted R-squared: 0.6819
F-statistic: 144.6 on 1 and 66 DF,  p-value: < 2.2e-16

Something that I do know is that significance in statistics is a technical term; I cannot casually accept the fact that R seems to think that there is some kind of significance in the F test, which tests the hypothesis that the regression coefficient is zero. I know that, because I looked it up. What I don’t know is whether the test is relevant, which further brings into question whether the significance measure is… significant?
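One thing that can be said with certainty about the output above: with a single predictor, the F statistic is just the square of the slope's t value (12.027 squared is about 144.6), so the two tests are the same test. The sketch below shows this on simulated data; the variable names and the noise level are assumptions, and only the 68-observation size mirrors the 66 residual degrees of freedom in the real output.

```r
# Simulated stand-in data with 68 observations (66 residual df).
set.seed(3)
sessions <- sample(1:70, 68, replace = TRUE)
compiles <- 27 * sessions + rnorm(68, sd = 250)

s <- summary(lm(compiles ~ sessions))

t_slope <- s$coefficients["sessions", "t value"]
f_stat  <- s$fstatistic          # named vector: value, numdf, dendf

cat("t value squared:", t_slope^2, "\n")
cat("F statistic:    ", f_stat["value"], "\n")

# The p-value printed at the bottom of summary() comes from the F distribution.
p_value <- pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
              lower.tail = FALSE)
cat("p-value:", p_value, "\n")
```

This says nothing about whether the test's assumptions hold for the real data; it only shows where the numbers in the summary come from.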

If I try the same thing on my log-log data:

> summary(lm(log(compilesvssessionsdeltas0405$compiles) ~
                 log(compilesvssessionsdeltas0405$session)))
Call:
lm(formula = log(compilesvssessionsdeltas0405$compiles) ~
                 log(compilesvssessionsdeltas0405$session))

Residuals:
    Min      1Q  Median      3Q     Max
-2.1409 -0.2973 -0.1420  0.4197  1.4623 

Coefficients:
                                          Estimate
(Intercept)                                2.40890
log(compilesvssessionsdeltas0405$session)  1.19836
                                          Std. Error t value
(Intercept)                                  0.26781   8.995
log(compilesvssessionsdeltas0405$session)    0.09624  12.452
                                          Pr(>|t|)
(Intercept)                               4.44e-13 ***
log(compilesvssessionsdeltas0405$session)  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Residual standard error: 0.6608 on 66 degrees of freedom
Multiple R-Squared: 0.7014,    Adjusted R-squared: 0.6969
F-statistic: 155.1 on 1 and 66 DF,  p-value: < 2.2e-16

Again, the same question; when I use the log-log data, both the intercept and the fit to the data seem to be better, but… so? I can fit a line to the data (Figure 5), but again… so?

Figure 5: log-log 04-05 with fitted line.
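One partial answer to "so?" is to carry the fitted line back to the original scale. A straight line on the log-log plot, log(compiles) = a + b*log(sessions), is the curve compiles = exp(a) * sessions^b. The sketch below plugs in the two coefficients from the regression output above; the function name is made up for the example, and back-transforming like this ignores the noise around the line, so treat the numbers as a typical value rather than a precise prediction.

```r
# Coefficients copied from the log-log regression output above.
intercept <- 2.40890
slope     <- 1.19836

scale_factor <- exp(intercept)   # multiplier on the original scale

# Hypothetical helper: typical compile count for a given number of sessions.
predict_compiles <- function(sessions) scale_factor * sessions^slope

cat("scale factor:", round(scale_factor, 2), "\n")
print(round(predict_compiles(c(10, 20, 40)), 0))

# Doubling the number of sessions multiplies the prediction by 2^slope.
cat("effect of doubling sessions:", round(2^slope, 2), "\n")
```

Read this way, the slope says that doubling the number of sessions goes with roughly 2.3 times as many compilation events, slightly more than proportional. Whether that describes individual students or just the population as a whole is a separate question.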

I think I need lunch.

May 24 2005

Chihuly in the Royal Botanic Gardens, Kew

Published by matt under Uncategorized

Via my father, my mother made mention of the fact that Dale Chihuly’s work is being exhibited in Kew Gardens up in London. I think that I need to head up to London some Saturday or Sunday when the weather is supposed to be good.

While I’m at it, I should probably go up in the Eye, and revisit the V&A. I also like the look of the Open Systems exhibition in the Tate Modern, running June to Sept.

May 23 2005

Microsoft, the Innovation Company

Published by matt under Uncategorized

Michael Kölling wrote up a nice comparison between a new, innovative Visual Studio feature and BlueJ. The screenshots are compelling, and I think the most important point he makes is in his concluding remarks:

So have they copied us? I don’t know. It could all be a great coincidence. And what if they have? Is it illegal? No, it isn’t. We don’t have software patents. (We don’t want to have software patents – we don’t believe in them.) Is it unethical – I don’t know. It’s business, and business sometimes is.

Do I care? I don’t care that they copied BlueJ – good on them, and good luck to them. But I care about attribution.

I work at a university, and I strongly believe in honest attribution of sources. Microsoft does not have a good track record on this. So I decided to post these screenshots here so that people can at least see and make up their own minds.

It would be nice to see this spread in the blogosphere, as attribution for years of hard work is only right. Please spread the word and the links in this post as far and wide as you like.

May 23 2005

A moment, living

Published by matt under Uncategorized

Today was a rare sort of day. I woke, I cleaned my apartment, and brought some measure of closure to a project started two days before. I then crossed the street to cheer on my friends as they played hockey. I could not have predicted the rest of the day.

My intention was to watch; instead, a few minutes later, their—now, our—team was called to play. It was an end-of-year, open round-robin tournament; we were and are the Carnival Legends. As the entire day’s festivities were intended to be carried out in fancy dress, I felt the viking helmet I had been given was not enough. So with little fanfare, I removed my jeans, donned my jumper over my boxers as one would a kilt, and set out to play a game I had never played before.

Ten minutes later, we were defeated. However, we organized ourselves for the next game, which was then cut unnecessarily short—unfortunate, as we were evenly matched with the Magnificent Seven. My third game (the team’s fourth), we played Team Origins, many of whom were clearly experienced hockey players. Our team of seven—Damian, Cookie, Maggie, myself, Patrick, Phil, and Wiebke—with but two experienced players, held them to zero-zero. Despite their aggressive play, we held our ground, and gave them what may have been their only non-scoring game. In our next match, we went one-one. We played superbly as a team, and I was surprised and pleased to have scored our goal for the day.

We barbecued, relaxed, and went our separate ways. As I sit in my room in the quiet hours of the morning, thinking of the time I’ve been here in England, I think of the things I miss—family, friends far away, and my wife who is too often too far away.

But there was a moment, where I looked out from my place on the field, down the long green hill, its grass shaggy and unkempt, into the valley and the city of Canterbury. The cathedral, rising up from the town and into the mottled blue-and-grey skies, gave me pause; here I was with friends, running free under the sun, living and loving life, concerned only with the moment.
