User Reviews
Rating: really liked it
When I am not writing witty and informative reviews on Goodreads/Amazon, my day job is as a government statistician. Therefore, when offered the opportunity to read this book, I thought it would be useful for me to do so. And I do believe it is helping me in my work: I am thinking more about how best to present my statistics and what analytical techniques I could use. So this book works from that perspective.
This book takes real-world questions and shows you how they've been answered, introducing various statistical techniques as it does so. It does this whilst aiming to avoid "getting embroiled in technical details". The questions picked are quite interesting subjects, like "why do old men have big ears?", "how many trees are there on this planet?" (an estimated 3.04 trillion, if you must know), or what height a son/daughter will be given their parents' heights, and so on, with some of the questions being based on work the author has been involved in during his career. Relating the problems to real life helps make the text appeal not only to statisticians (to whom this book is dedicated) but also to non-technical readers "who want to be more informed about the statistics they encounter both in their work and in everyday life."
Some of this is not new stuff, e.g. the early bits on presentation of data, such as 3D pie charts not being useful for comparing proportions. But the book does get more involved as you work through it, going deeper into statistical techniques, making it harder to understand and requiring more concentration. The author is aware of this, at one point asking if it is "all clear? If it isn't then please be reassured that you have joined generations of baffled students". The conclusion also congratulates you for getting to the end.
Useful stuff in here for me was the chapter on regression (which I use more commonly than much of the rest), and the last couple of chapters after the hard stuff were good reading too, showing bad and good examples of statistics from journals and the like and explaining why (offering learning points).
Technical stuff is relegated to the technical glossary so this book is readable (which is good for a book about statistics), although still hard in places. For my work it has been useful and I'm glad I read it and have it for future reference.
Rating: really liked it
Pretty good, but there are a few chapters where the author basically goes "I'm not explaining this very well, but I know you won't get it so let's just move on". I also wish there were a few more "digital" / web analytics cases, but that's just because it would help me.
Overall, an interesting and useful read.
Rating: really liked it
This amazing piece can somewhat be seen as the equivalent of Angrist & Pischke's "Mastering Metrics" for bread-and-butter statistical problems instead of intuitive econometrics. It covers everything one has to know when it comes to scientific studies that rely on data. All aspects and elements are touched upon, but maths and formulas are relegated to an appendix. Thus the book is well suited for experts with years of experience, college students of all fields, but especially science writers or people who want to be well equipped when discussing or questioning the newest "study x found that y prevents cancer" headline.
Explains concepts with easy-to-grasp real-world examples, appealing to the intuition of the reader. Touches upon all topics, from basic proportions, regression, and classification/"big data", up to Bayesian approaches, also covering common misconceptions and fallacies on the fly ("how to lie with stats"). Everything in a very coherent and readable way. A truly joyful read!
Can be assigned as a companion text for a stats undergrad course across all disciplines in order to show students, sometimes drowning in pure formula memorisation, the beauty of stats and numbers and data. Also suited for AP stats people, and for skilled professionals as a revision.
A big plus is the companion code for the open-source software R, which together with Python is going to be the future of (statistical) programming.
The last part of the book explains the so-called "statistical crisis in science" (or "replication crisis"), how it came about and, most importantly, what to do about it. Communication chains are analysed to understand how exaggerated newspaper headlines are created. Most importantly, the author provides checklists for readers to be able to judge for themselves whether, or how much, a certain study or headline should be trusted.
Rating: really liked it
I didn't like the first 60% of the book. It was too dumbed down even for me, and there was not enough original storytelling for explaining concepts to non-maths students. I even gave this feedback to the author. The last 1/3 of the book was much better, getting into p-hacking, data quality, and data ethics.
Rating: really liked it
Very nice overall: not much algebra, but a focus on the reasoning behind it, with interesting examples. Good for non-scientists.
Rating: really liked it
Q:
A classic example of how alternative framing can change the emotional impact of a number is an advertisement that appeared on the London Underground in 2011, proclaiming that ‘99% of young Londoners do not commit serious youth violence’. These ads were presumably intended to reassure passengers about their city, but we could reverse its emotional impact with two simple changes. First, the statement means that 1% of young Londoners do commit serious violence. Second, since the population of London is around 9 million, there are around 1 million people aged between 15 and 25, and if we consider these as ‘young’, this means there are 1% of 1 million or a total of 10,000 seriously violent young people in the city. This does not sound at all reassuring, (c)
Q:
But these are generally reported as the ‘average house price’, which is a highly ambiguous term. Is this the average-house price (that is, the median)? Or the average house-price (that is, the mean)? A hyphen can make a big difference. (c)
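The hyphen distinction in the quote is easy to see with a toy example. The prices below are invented purely for illustration:

```python
# Invented house prices in £1000s; one expensive outlier skews the mean.
prices = [150, 160, 170, 180, 2000]

mean = sum(prices) / len(prices)           # the "average house-price"
median = sorted(prices)[len(prices) // 2]  # the "average-house price"

print(mean, median)  # 532.0 vs 170: two very different stories
```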
Rating: really liked it
Long ago I worked for a consulting firm and was assigned to look at the help desk tickets of a large government agency. They wanted to see if computer-related productivity losses could be reduced through better training, different procedures, or automated solutions to common problems.
I analyzed three years’ worth of data and saw a clear trend: this was a one-tail distribution to the right. In other words, some tickets closed within minutes (like password resets or “is the printer even turned on?”), but many took longer, especially if an Admin needed to get involved or a tech had to visit the user, and a few stayed open for weeks or months because there was no solution or the solution was more expensive than the productivity loss it caused.
With this information in hand I got on the client’s calendar. My recommendation was to divide the data into quintiles and focus on the middle three, because the first one didn’t need any help and the last one required custom, one-off solutions. He didn’t know what a quintile was and had no intention of learning. He said my job was simple: just take the average ticket closure time and move it to the left. “The average,” I said, “mean, median, or mode?” He thought I was being a smartass.
I went back to my manager and raised the alarm. No matter how much we improved the middle three quintiles, the overall average closure time was not going to be affected much if we also had to include tickets open for weeks or months. He replied with the Mantra of Mediocre Managers: “just do the best you can.”
We did good work on that contract, lowering the closure times of the second, third, and fourth quintiles by 8-14% and saving the client thousands of productive man-hours per year. The overall average across the entire data set, however, when the fifth quintile was included, was reduced by only three minutes. This was cited as a factor when our contract was not renewed. As Kurt Vonnegut would say, "And so it goes…."
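The reviewer's frustration can be sketched numerically. The closure times below are invented, but they reproduce the shape of the problem: a long right tail that dominates the mean no matter what happens to the middle quintiles.

```python
import statistics

# Invented closure times in hours: quick fixes, typical tickets,
# and a long right tail of tickets open for weeks.
tickets = [0.1] * 200 + [2] * 600 + [500] * 200
before = statistics.mean(tickets)

# Cut the middle 60% of closure times by a hefty 15%...
improved = [0.1] * 200 + [2 * 0.85] * 600 + [500] * 200
after = statistics.mean(improved)

# ...and the overall mean barely moves: the tail contributes almost all of it.
print(round(before, 2), round(after, 2))  # 101.22 vs 101.04
```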
The Art of Statistics uses real world examples to help the reader understand how to make sense of raw data. The first thing to understand is that, if the devil is in the details, statistics has a lot of devils mucking up the work. Just defining the problem can turn out to be fiendishly difficult. The book includes a case study on mortality rates among children who underwent heart surgery at various hospitals in Britain, and simply deciding which cases to count required difficult, and subjective, decisions. What is the upper age limit to define a child? Which of the many procedures get included when deciding what to count as heart surgery? If the child dies, how do you decide if the surgery was the primary cause, a contributing factor, or a coincidental event, and how could you ever convince the parents that it was not the surgery that killed their child?
In Gina Kolata’s Flu, about the 1918 influenza pandemic, she cites a discussion that occurred during the Swine Flu scare of 1976:
Dr. Hans H. Neumann, who was director of preventive medicine at the New Haven Department of Health, explained the problem in a letter to the New York Times. He wrote that if Americans have flu shots in the numbers predicted, as many as 2,300 will have strokes and 7,000 will have heart attacks within two days of being immunized. “Why? Because that is the number statistically expected, flu shots or no flu shots,” he wrote. “Yet can one expect a person who received a flu shot at noon and who that same night had a stroke not to associate somehow the two in his mind? Post hoc, ergo propter hoc,” he added. (p. 161)
Another difficulty occurs when asking people their opinions, because how the question is asked can steer responses one way or another. We live in an age of fierce partisan politics, so it is not uncommon to see polls that deliberately attempt to skew answers, but it can also happen completely by accident if the questions are not given proper consideration. This book cites an example of this kind of framing: when asked if people would support or oppose giving 16-17 year olds the vote, the majority approved, but when the question was asked in the form of whether the voting age should be reduced, most disapproved.
There is also an excellent discussion on how we can be led astray by assumptions of accuracy. Take for instance a 95% accurate drug test given to 1000 athletes, 20 of whom are doping and the other 980 not. All but one of those actually doping will be detected (95% = 19 of 20), but 49 who are not doping will also be flagged (5% of 980 = 49). There will be a total of 68 positive tests (19 + 49), of whom only 19 are actually doping. Therefore, when someone tests positive there is only a 19/68 (28%) chance that they are guilty – the rest are false positives.
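The arithmetic in that example is worth stepping through; a minimal sketch using the review's own numbers:

```python
# 1000 athletes, 20 doping; the test is 95% accurate both ways.
athletes, doping, accuracy = 1000, 20, 0.95
clean = athletes - doping  # 980

true_positives = doping * accuracy        # 19 dopers correctly flagged
false_positives = clean * (1 - accuracy)  # 49 clean athletes wrongly flagged
total_positives = true_positives + false_positives  # 68

# Chance that a flagged athlete is actually doping:
print(true_positives / total_positives)  # ≈ 0.28
```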
The author was part of the team which examined the case of Dr. Harold Shipman, who was found guilty of murdering fifteen of his patients in 2000, but may have killed between 215 and 260 by injecting them with lethal drugs and then altering the records to make their deaths appear to have been from natural causes. The team’s task was to see if there could have been a way to detect what Shipman was doing before he killed so many people. As it happened, there was indeed statistical evidence that might have convicted him fifteen years earlier, preventing perhaps 175 deaths, but this required an exhaustive review of the circumstances around the deaths of Shipman’s patients, including a comparison with the outcomes of thousands of cases from other doctors, and even looking at the time of day most deaths occurred. The end result of this investigation was the creation of a data collection system on patient mortality that makes it easier to identify statistical anomalies, but even these must be examined with care, since doctors who work primarily with elderly patients will have higher death rates, and social factors like patient income and education can affect outcomes.
The book uses case studies like these to examine statistical reasoning and how it can be useful to non-statisticians when examining data. There is a look at the decision trees involved in deciding who the “luckiest” survivor from the Titanic was, and a truly disturbing look at how Bayesian analysis could have prevented many unnecessary cancer surgeries that resulted from physicians not understanding how to differentiate between true and false positives and negatives. There is also a good discussion on regression analysis, a powerful tool which can easily be misused to project false or misleading trends.
I enjoyed this book, and learned some useful things from it. It is written in a clear, non-technical style that anyone can follow, and the case studies were well chosen to be illuminating and informative. This is a good place to start for anyone who sometimes needs to extract meaning from numbers.
Rating: really liked it
I really wanted to like this book, but at times it felt like it was trying to cover too much ground, with a lot of it not deep enough. Oftentimes more technical detail would have aided proper understanding of the subject.
It was also quite surprising to see supervised learning being defined as classification, which seems incorrect and also doesn’t explain what supervised learning actually is.
Rating: really liked it
Statisticians study patterns in data to help us answer questions about the world. When reported accurately, statistical research can enrich storytelling and inform the public about important issues. Unfortunately, there are a great many distorting filters that research has to pass through before it reaches the public, including scientific journals and the media. As statistical data creeps into our lives more and more, there is a growing need for us all to improve our data literacy so we can appropriately assess the findings.
Actionable advice: Don’t take statistics at face value.
View statistical information the way you might view your friends: they’re the source of some great stories, but they’re not always the most accurate. Statistical information should be treated with the same skepticism you apply to other kinds of claims, facts and quotes. And, where possible, you should examine the sources of statistics behind the headlines so you can assess how accurately the information has been reported.
----
What’s in it for me? Improve your data literacy and learn to see the agenda behind the numbers.
You might think that, with the growing availability of data and user-friendly statistical software that does the mathematical heavy lifting for you, there’s less need to be trained in statistical methods.
But the ease with which data can now be accessed and analyzed has led to a rise in the use of statistical figures and graphics as a means of furnishing supposedly objective evidence for claims. Today, it’s not just scientists who make use of statistics as evidence, but also political campaigns, advertisements, and the media. As statistics are separated from their scientific basis, their role is changing to persuade rather than to inform.
And the people generating such statistical claims are not necessarily trained in statistical methods. An increasingly diverse range of sources produces and distributes statistics with very little oversight to ensure their reliability. Even when data is produced by scientists undertaking research, errors and distortions of statistical claims can occur at any point in the cycle – from flaws in the research to misrepresentations by the media and the public.
So, in today’s world, data literacy has become invaluable in order to accurately evaluate the credibility of the myriad news stories, social media posts, and arguments that use statistics as evidence. These blinks will give you all the tools you need to better assess the statistics you encounter on a daily basis.
In this book, you’ll learn:
how statistics can be used to catch serial killers;
whether drinking alcohol is good for your health or not; and
which remarkable creature can respond to human emotions even after it has died.

----
Statistics can help us answer questions about the world.
Have you ever wondered what statisticians actually do?
To many, statistics is an esoteric branch of mathematics, only slightly more interesting than the others because it makes use of pictures.
But today, the mathematical side of statistics is considered only one component of the discipline. Statistics deals with the entire lifecycle of data, which has five stages that can be summarized by the acronym PPDAC: Problem, Plan, Data, Analysis, and Conclusion. The job of a statistician is to identify a problem, design a plan to solve it, gather the relevant data, analyze it, and interpret an appropriate conclusion.
Let’s illustrate how this process works by considering a real-life case that the author was once involved in: the case of the serial killer Harold Shipman.
With 215 definite victims and 45 probable ones, Harold Shipman was the United Kingdom’s most prolific serial killer. Before his arrest in 1998, he used his position of authority as a doctor to murder many of his elderly patients. His modus operandi was to inject his patients with a lethal dose of morphine and then alter their medical records to make their deaths look natural.
The author was on the task force set up by a public inquiry to determine whether Shipman’s murders could have been detected earlier. This constitutes the first stage of the investigative cycle – the problem.
The next stage – the plan – was to collect information regarding the deaths of Shipman’s patients and compare this with information regarding other patient deaths in the area to see if there were any suspicious incongruities in the data.
The third stage of the cycle – data – involves the actual process of collecting data. In this case, that meant examining hundreds of physical death certificates from 1977 onwards.
In the fourth stage, the data was entered into software, analyzed, and compared using graphs. The analysis brought to light two things: First, Shipman’s practice recorded a much higher number of deaths than average for his area. Second, whereas patient deaths for other general practices were dispersed throughout the day, Shipman’s victims tended to die between 1:00 p.m. and 5:00 p.m. – precisely when Shipman undertook his home visits.
The final stage is the conclusion. The author’s report concluded that if someone had been monitoring the data, Shipman’s activities could have been discovered as early as 1984 – 15 years earlier – which could have saved up to 175 lives.
So, what do statisticians do? They look at patterns in data to solve real-world problems.
---
What to read next:
How to Lie with Statistics, by Darrell Huff

We’ve seen how statistical claims can be distorted in their passage from research to the public ear. Usually, these distortions of the data are unintentional and arise from a misunderstanding of statistical methods. Sometimes, however, these distortions are quite deliberate.
The blinks to How to Lie with Statistics, by author Darrell Huff, deal with this darker side of statistics. They introduce the techniques that media and advertisements use to alter how data is perceived and interpreted. They also go deeper into some familiar themes, such as the difficulty of truly random sampling, the error of inferring cause from correlation, and the misuse of averages. To avoid getting fooled, head on over to our blinks on How to Lie with Statistics.
Ref: blinkist.com
Rating: really liked it
I never really got statistics when I did Maths when I was younger. The most esoteric parts of pure maths were a breeze, but statistics never clicked, in large part because nobody was able to explain to me what some of the core concepts actually mean. Chief villain of the piece is standard deviation, something I considered to be the height of charlatanism. Fast forward 20 years, and I am working in a role that actually requires me to know statistics, and I'm regretting my youthful intransigence.
This book has, to a large part, undone the damage. This book is NOT a practical guide on how to do statistics. It IS a guide, something that shows you what statistics is good for, what it is not, the good and bad ways to practice it, and what each concept means. I can go and read any number of articles about how to do statistics, how to apply a particular technique, but all of them presuppose I know when I should and in what circumstances. That's where this book closes the gap. I suspect I'll need to return to this many times.
But this book goes beyond just helping specialists to do statistics. It also helps people to interpret statistics. It gives you a good grounding in the various principles of statistics, without getting bogged down in calculation. It also includes a significant section critiquing how statistics are communicated to the public, and I think this would be of interest to anyone.
All in all, this is a very good book. I can't recommend it enough. If you have any interest in statistics, this should be on your shelf.
Rating: really liked it
Do statins reduce heart attacks and strokes?
Do speed cameras reduce accidents?
Is prayer effective?
Why do old men have big ears?
Are more boys born than girls?
Does the Higgs boson exist?
Was Richard III buried in a Leicester parking lot?
The Art of Statistics is a nicely packaged introductory course in statistical reasoning, in which a Cambridge professor and president of the Royal Statistical Society tries to teach some subtle and important theories without making the reader do too much math.
So this is a book about statistics for the layman, and you can hear the author in every chapter pleading for people (politicians, journalists, scientists, and the general public) to be more informed because this shit matters. But as much as the author hand-holds the reader through his examples, you are going to have to look at some numbers, and even do a little math. But if you care enough to read this book, you should know enough math to get through it.
The first few chapters talk about elementary concepts, and why statistics matter. He starts each chapter with some intriguing, sometimes silly examples of questions you can answer with statistical reasoning.
One of his introductory examples is Harold Shipman, Britain's most prolific serial killer. He was a family doctor who between 1975 and 1998 murdered hundreds of elderly patients before he was caught. Afterwards, investigators wanted to find out if he could have been detected earlier had anyone been paying attention to the death rate among his patients.
Answer: yes, and in fact he probably could have been caught in the first few years of his career, if the sort of forensic analysis of patient deaths that's done now had been performed then. But just looking at a chart showing that Dr. Shipman's patients died at a higher rate than other GPs' is obviously not enough - there are all kinds of confounders and other factors that need to be measured to express a degree of certainty that he's losing patients at a frequency that should really be considered alarming, and Spiegelhalter walks us through the numbers and the data visualizations to show us how it's done.
From there, he goes into many other measurements, from coin flips to number of sexual partners to predicting a child's height based on the heights of their parents. Very obvious ideas like "correlation is not causation" are covered in depth, of course, with some examples that aren't obvious at first glance. Regression models, probability theory, classification trees, bootstrapping, confidence intervals, p-values, Bayes' Theorem, the Law of Large Numbers, the Central Limit Theorem — does that sound a little scary? Strap in and read up; if Spiegelhalter had his way this would be basic education at least for anyone who's graduated college, and the world would be a better place and journalists might not write stories with alarming headlines like "Threefold Variation in UK Bowel Cancer Death Rates" or "Going to university makes you more likely to die of a brain tumor." Also politicians might make decisions with some basic numeracy. Well, we can dream, right?
Two of my favorites:
The Prosecutor's Fallacy

The probability of innocence given the evidence is not the same as the probability of the evidence given innocence. I.e., "If the accused is innocent, there is only a 1 in a billion chance that their DNA would match the evidence at the crime scene" is wrongly interpreted as "Given the DNA evidence, there is only a 1 in a billion chance that the accused is innocent." Spiegelhalter likens this to "If you're the Pope, you're Catholic" being interpreted as meaning the same thing as "If you're Catholic, you're the Pope."
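Why the two probabilities differ can be sketched with a toy calculation. The suspect-pool sizes below are assumptions for illustration (the quote gives only the 1-in-a-billion match rate), and the model deliberately ignores lab error and related suspects:

```python
# P(evidence | innocent) is not P(innocent | evidence): the pool size matters.
p_match_given_innocent = 1e-9  # the quote's 1-in-a-billion random match rate

def p_guilty_given_match(pool_size):
    # Assume exactly one guilty person (who matches for certain) and that
    # each innocent member of the pool matches independently at random.
    expected_innocent_matches = (pool_size - 1) * p_match_given_innocent
    return 1 / (1 + expected_innocent_matches)

# With a modest pool a match is damning; with a vast one, far less so.
print(p_guilty_given_match(10_000_000))      # ≈ 0.99
print(p_guilty_given_match(10_000_000_000))  # ≈ 0.09
```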
Simpson's Paradox

The direction of association between two variables can reverse when adjusted for a confounding factor. For example, admission rates that show women being admitted at a lower rate than men — obvious sexism! — turn out to mean the opposite when factoring in the actual programs men and women applied for: more women apply to selective programs with a higher overall rate of rejection, but adjusting for the admission rate of each program, women are overall more likely to be accepted than men. This plays out in many other scenarios.
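The reversal can be reproduced with department-level numbers. The figures below are invented, loosely modelled on the famous Berkeley admissions example the review describes:

```python
# Invented admissions data: department A is easy to get into, B is selective.
#        (applicants, admitted)
men   = {"A": (800, 480), "B": (200, 40)}   # overall: 520/1000 = 52%
women = {"A": (200, 140), "B": (800, 200)}  # overall: 340/1000 = 34%

def overall_rate(groups):
    applicants = sum(a for a, _ in groups.values())
    admitted = sum(x for _, x in groups.values())
    return admitted / applicants

# Overall, men are admitted more often...
assert overall_rate(men) > overall_rate(women)

# ...yet within EACH department women have the higher admission rate,
# because far more women applied to the selective department.
for dept in ("A", "B"):
    (m_apps, m_adm), (w_apps, w_adm) = men[dept], women[dept]
    assert w_adm / w_apps > m_adm / m_apps
```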
There's some discussion of communicating data, and data visualization, and of course there's every data science student's favorite problem, predicting which Titanic passengers should survive and which ones shouldn't.
Bayes' Theorem (and the dispute between the rival frequentist and Bayesian schools of statistical inference) gets its own chapter. If you think statistics is just hard math with provable right and wrong answers, well, it's more complicated.
Finally, Spiegelhalter talks about the so-called "replication crisis" (in which a large number of scientific papers have been found to have results that cannot be reproduced, leading many to suspect incompetence, fraud, and/or lazy research across many fields), and from there, a discussion of how bias affects statistics, and some proposed principles for ethical data science.
I have done a fair amount of machine learning and data science, so very few ideas in this book were new to me. But I found it very readable, with just enough math to require you to be comfortable with numbers, but not so much that I was straining my brain to remember how to calculate derivatives and integrals. And really, the world would be a better place if everyone knew this much, especially around election time.
Rating: really liked it
As a data scientist, I enjoyed the non-technical aspects of this book more than the technical (though the review was welcome). Statistical training should include more courses and resources like this that remind us there is more to the practical use of statistics than just the mathematics. Publication, ethics, review, interpretation and communication all play a vital role in how studies benefit society at large. These concepts are more useful and accessible to the general population than, say, the formula for determining the p-value of a test.
Rating: really liked it
Great reference read. More entertaining than you might expect, with lots of interesting applications of statistics, e.g. in predicting Harold Shipman's murders, discussion of the (lack of) use of stats in courts of law, probability and politics, etc. I think having a basis from my degree helped, so I wasn't overwhelmed, but it also taught me some new things and was more interesting than revision. I particularly enjoyed the section where he describes "Scandinavian countries are an epidemiologist's dream"... bodes well for my masters.
Rating: really liked it
This book was just okay - I can't help but feel that if Spiegelhalter did one of the things he wanted to accomplish in this book it would have been great, but he tried to make this book all things to all people and it ended up being too shallow on both fronts.
I'm beating around the bush a bit but essentially Spiegelhalter wanted to 1) teach the audience about statistics and how they can make life better and 2) present some cool scenarios where statistics can get us an approximate answer to something - like how likely someone would have been to survive the Titanic, if ovarian cancer screening is good, whether busier hospitals have higher survival rates, and so on.
I found that Spiegelhalter had sections that were conversational and easy to read, and then I got whiplash going into other sections that were incredibly dense and required intense engagement from the reader. Ultimately this made it difficult to determine the context in which to read the book - was it a casual commute read, or something for which I wanted to have pen and paper ready to take notes?
If you're looking for a good foundational stats book, I would recommend picking up Charles Wheelan's Naked Statistics rather than this one.
Rating: really liked it
I read a lot of pop-maths books and enjoy them (Hannah Fry, Du Sautoy, Simon Singh, and previous books by Spiegelhalter). This one is a bit more chewy. Where Sex by Numbers uses statistics to tell you things, this book is much closer to a textbook on how statistics should be done and what can be learned from it.
I have learned a great deal from this and his discussions of Harold Shipman and of 95% accuracy tests giving far more false positives than accurate responses (inter alia) have been really eye-opening. The technicality of p and t tests has got a bit beyond me and one or two graphs could be clearer (though my preview copy is not coloured and so perhaps this is unfair).
Certainly one comes away from the book knowing why statistics and significance testing are becoming ever more central in subjects such as Psychology, where a replication crisis is at work (and even at A-Level stats is becoming more prevalent), and understanding his clear desire that the journalists reporting cases (he often cites examples of poor reporting) would understand the data they use and not confuse themselves and their readers.
Huge amounts to learn, but perhaps too technical in places for most of us.