Simply Statistics

Non-tidy data

During the discussion that followed the ggplot2 posts from David and me last week, we started talking about tidy data. The man himself noted that matrices are often useful instead of "tidy data", and I mentioned there might be other data that are usefully "non-tidy". Here I will be using tidy/non-tidy according to Hadley's definition. So

When it comes to science - it's the economy, stupid.

I read a lot of articles about what is going wrong with science: the reproducibility/replicability crisis; lack of jobs for PhDs; the pressure on the families (or potential families) of scientists; hype around specific papers and a more general abundance of BS; consortia and their potential evils; peer review not working well; research parasites; not enough room for applications/public

Not So Standard Deviations Episode 9 - Spreadsheet Drama

For this episode, special guest Jenny Bryan (@jennybryan) joins us from the University of British Columbia! Jenny, Hilary, and I talk about spreadsheets and why some people love them and some people despise them. We also discuss blogging as part of scientific discourse. Subscribe to the podcast on iTunes. Show notes: Jenny's Stat 545 Coding

Why I don't use ggplot2

Some of my colleagues think of me as super data-sciencey compared to other academic statisticians. But one place I lose tons of street cred in the data science community is when I talk about ggplot2. For the 3 data type people on the planet who still don't know what that is, ggplot2 is an R package/phenomenon

Data handcuffs

A few years ago, if you had asked me which skills I was asked about most for students going into industry, I'd definitely have said things like data cleaning, data transformation, database pulls, and other non-traditional statistical tasks. But as companies have progressed from the point of storing data to actually wanting to do something with

Leek group guide to reading scientific papers

The other day on Twitter, Amelia requested a guide for reading papers: "I love @jtleek’s github guides to reviewing papers, writing R packages, giving talks, etc. Would love one on reading papers, for students." (Amelia McNamara, @AmeliaMN, February 5, 2016) So I came up with a guide which you can find here: Leek

A menagerie of messed up data analyses and how to avoid them

Update: I realize this may seem like I'm picking on people. I really don't mean to; I have for sure made all of these mistakes and many more. I can give many examples, but the one I always remember is the time Rafa saved me from "I got a big one here" when I made

Exactly how risky is breathing?

This article by George Johnson in the NYT describes a study by Kamen P. Simonov and Daniel S. Himmelstein that examines the hypothesis that people living at higher altitudes experience lower rates of lung cancer than people living at lower altitudes. All of the usual caveats apply. Studies like this, which compare whole populations, can

Not So Standard Deviations Episode 8 - Snow Day

Hilary and I were snowed in over the weekend, so we recorded Episode 8 of Not So Standard Deviations. In this episode, Hilary and I talk about how to get your foot in the door with data science, the New England Journal's view on data sharing, Google's "Cohort Analysis", and trying to predict a movie's

Parallel BLAS in R

I'm working on a new chapter for my R Programming book and the topic is parallel computation. So, I was happy to see this tweet from David Robinson (@drob) yesterday:

How fast is this #rstats code?
x <- replicate(5e3, rnorm(5e3))
x %*% t(x)
For me, w/Microsoft R Open, 2.5sec. Wow. https://t.co/0SbijNxxVa
(David Robinson, @drob)

Profile of Hilary Parker

If you've ever wanted to know more about my Not So Standard Deviations co-host (and Johns Hopkins graduate) Hilary Parker, you can go check out the great profile of her on the American Statistical Association's This Is Statistics web site. What advice would you give to high school students thinking about majoring in statistics? It’s

Not So Standard Deviations Episode 7 - Statistical Royalty

The latest episode of Not So Standard Deviations is out, and boy does Hilary have a story to tell. We also talk about Theranos and the pitfalls of diagnostic testing, Spotify's Discover Weekly playlist generation algorithm (and the need for human product managers), and of course, a little Star Wars. Also, Hilary and I start a

Jeff, Roger and Brian Caffo are doing a Reddit AMA at 3pm EST Today

Jeff Leek, Brian Caffo, and I are doing a Reddit AMA TODAY at 3pm EST. We're happy to answer questions about...anything...including our roles as Co-Directors of the Johns Hopkins Data Science Specialization as well as the Executive Data Science Specialization. This is one of the few pictures of the three of us together.

Not So Standard Deviations: Episode 6 - Google is the New Fisher

Episode 6 of Not So Standard Deviations is now posted. In this episode Hilary and I talk about the analytics of our own podcast, and analyses that seem easy but are actually hard. If you haven't already, you can subscribe to the podcast through iTunes. This will be our last episode for 2015 so see you

Instead of research on reproducibility, just do reproducible research

Right now reproducibility, replicability, false positive rates, biases in methods, and other problems with science are the hot topic. As I mentioned in a previous post, pointing out a flaw with a scientific study is way easier to do correctly than generating a new scientific study. Some folks have noticed that right now there is

By opposing tracking, well-meaning educators are hurting disadvantaged kids

An unfortunate fact about the US K-12 system is that the education gap between poor and rich is growing. One manifestation of this trend is that we rarely see US kids from disadvantaged backgrounds become tenure track faculty, especially in the STEM fields. In my experience, the ones that do make it, when asked how they overcame the

Thinking like a statistician: the importance of investigator-initiated grants

A substantial amount of scientific research is funded by investigator-initiated grants. A researcher has an idea, writes it up and sends a proposal to a funding agency. The agency then elicits help from a group of peers to evaluate competing proposals. Grants are awarded to the most highly ranked ideas. The percent awarded depends on how
