<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Another way to mislead with statistics</title>
	<atom:link href="http://xyfu.bumblebeelabs.com/another-way-to-lie-with-statistics/feed/" rel="self" type="application/rss+xml" />
	<link>http://xyfu.bumblebeelabs.com/another-way-to-lie-with-statistics/</link>
	<description>Michael's Personal Design Blog</description>
	<lastBuildDate>Thu, 09 Feb 2012 13:50:00 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
	<item>
		<title>By: Statistical vindication &#171; Bumblebee Labs Blog</title>
		<link>http://xyfu.bumblebeelabs.com/another-way-to-lie-with-statistics/comment-page-1/#comment-8506</link>
		<dc:creator>Statistical vindication &#171; Bumblebee Labs Blog</dc:creator>
		<pubDate>Thu, 18 Jun 2009 03:02:45 +0000</pubDate>
		<guid isPermaLink="false">http://blog.bumblebeelabs.com/?p=866#comment-8506</guid>
		<description>[...] few days ago, I wrote about a case of a seemingly fascinating graph which I felt was used inappropriately. I was rightfully castigated in the comments for being too [...]</description>
		<content:encoded><![CDATA[<p>[...] few days ago, I wrote about a case of a seemingly fascinating graph which I felt was used inappropriately. I was rightfully castigated in the comments for being too [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Hang</title>
		<link>http://xyfu.bumblebeelabs.com/another-way-to-lie-with-statistics/comment-page-1/#comment-7991</link>
		<dc:creator>Hang</dc:creator>
		<pubDate>Fri, 12 Jun 2009 22:15:14 +0000</pubDate>
		<guid isPermaLink="false">http://blog.bumblebeelabs.com/?p=866#comment-7991</guid>
		<description>OK, y&#039;all have inspired me to crack open my statistics textbook again. Unfortunately, my statistics textbook is pretty useless so I&#039;m going to wing it. As far as I can remember, the more hypotheses you test, the higher the confidence level has to be for any one hypothesis, to avoid fishing for significance. Given k elements, if you want to test whether any set of elements is biased towards a particular search engine, there are 2^k possible hypotheses, so your confidence level has to be something like 1-(1/2^k), which, of course, is a ridiculously high standard that clearly none of the datapoints match.

As such, what you&#039;re presenting is the null hypothesis graph, except in a form which I, at least, was unused to seeing. Is it right to present a null hypothesis graph? Clearly opinions differ, but to me it&#039;s perhaps about as serious an error as presenting data with too many significant figures. Not a grave sin, but something a good statistician should be conscientious about. The only reason I wrote about it was that I was surprised that even I, as a reasonably trained statistics guy, was momentarily caught off guard by it. Clearly, you meant nothing malicious by it, but it&#039;s a technique that could be used for malicious purposes, so I wrote about it.

I&#039;ve amended the title to tone down the rhetoric.</description>
		<content:encoded><![CDATA[<p>OK, y&#8217;all have inspired me to crack open my statistics textbook again. Unfortunately, my statistics textbook is pretty useless so I&#8217;m going to wing it. As far as I can remember, the more hypotheses you test, the higher the confidence level has to be for any one hypothesis, to avoid fishing for significance. Given k elements, if you want to test whether any set of elements is biased towards a particular search engine, there are 2^k possible hypotheses, so your confidence level has to be something like 1-(1/2^k), which, of course, is a ridiculously high standard that clearly none of the datapoints match.</p>
<p>As such, what you&#8217;re presenting is the null hypothesis graph, except in a form which I, at least, was unused to seeing. Is it right to present a null hypothesis graph? Clearly opinions differ, but to me it&#8217;s perhaps about as serious an error as presenting data with too many significant figures. Not a grave sin, but something a good statistician should be conscientious about. The only reason I wrote about it was that I was surprised that even I, as a reasonably trained statistics guy, was momentarily caught off guard by it. Clearly, you meant nothing malicious by it, but it&#8217;s a technique that could be used for malicious purposes, so I wrote about it.</p>
<p>I&#8217;ve amended the title to tone down the rhetoric.</p>
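<p>A minimal sketch of the multiple-testing point above, assuming a Bonferroni-style correction (a swapped-in standard technique, not necessarily the rule from the textbook). The function name and the value of k are hypothetical; the idea is only that the per-test cutoff shrinks as the number of hypotheses grows.</p>

```python
def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test p-value cutoff that keeps the family-wise error rate at or below alpha."""
    if num_tests <= 0:
        raise ValueError("num_tests must be positive")
    return alpha / num_tests


k = 10  # hypothetical number of queries in the graph

# Testing each query on its own means k hypotheses.
per_query_cutoff = bonferroni_threshold(0.05, k)

# Testing every possible subset of queries means 2^k hypotheses.
all_subsets_cutoff = bonferroni_threshold(0.05, 2 ** k)

print(f"per-query cutoff: {per_query_cutoff:.4g}")
print(f"all-subsets cutoff: {all_subsets_cutoff:.4g}")
```

<p>With one hypothesis per query the cutoff is alpha/k; counting all 2^k subsets makes it far stricter, which is the &#8220;ridiculously high standard&#8221; described above.</p>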
]]></content:encoded>
	</item>
	<item>
		<title>By: Bob Carpenter</title>
		<link>http://xyfu.bumblebeelabs.com/another-way-to-lie-with-statistics/comment-page-1/#comment-7979</link>
		<dc:creator>Bob Carpenter</dc:creator>
		<pubDate>Fri, 12 Jun 2009 19:46:10 +0000</pubDate>
		<guid isPermaLink="false">http://blog.bumblebeelabs.com/?p=866#comment-7979</guid>
		<description>I think Lukas is right that this post&#039;s title is out of line.  Let&#039;s all play nicely and constructively.

I like Dolores Labs&#039; results visualization.  I questioned the same issue of whether they were just random.  When I ran the outlying  queries side by side, they sure looked random.  And what you&#039;re seeing for all those queries in  the middle of the graph is a very close vote.

Growing up Bayesian, I&#039;m rather allergic to these kinds of significance tests.  What I&#039;d like to do is run &lt;a href=&quot;http://lingpipe-blog.com/2008/09/05/hierarchical-bayesian-models-of-categorical-data-annotation/&quot; rel=&quot;nofollow&quot;&gt;my models of voted inference&lt;/a&gt; to estimate annotator bias, randomness, and overall prevalence of preferences.   Hmm, maybe Lukas&#039;ll share the raw data.

One reason is that &quot;significance&quot; depends on the test.  Paired t-tests vs. grouped t-tests, one-sided vs. two-sided, replication adjusted or not.  Another reason is that they are just as freighted with assumptions about how the data&#039;s generated as Bayesian priors are.  Yet another is that significant doesn&#039;t mean important; with more queries and evaluators, a 50.1 vs. 49.9 preference could be significant, even though a typical user would never notice it.

 At least bootstrap variance estimation (on Google &gt; Bing) would be reasonably easy to interpret.

I believe what this post is suggesting is to test vs. the null hypothesis of &quot;was generated by a Binomial(0.5) distribution&quot;.  I&#039;m not very well classically trained, but I&#039;d hope that&#039;d be close to a two-sided t-test given the sample size.</description>
		<content:encoded><![CDATA[<p>I think Lukas is right that this post&#8217;s title is out of line.  Let&#8217;s all play nicely and constructively.</p>
<p>I like Dolores Labs&#8217; results visualization.  I questioned the same issue of whether they were just random.  When I ran the outlying  queries side by side, they sure looked random.  And what you&#8217;re seeing for all those queries in  the middle of the graph is a very close vote.</p>
<p>Growing up Bayesian, I&#8217;m rather allergic to these kinds of significance tests.  What I&#8217;d like to do is run <a href="http://lingpipe-blog.com/2008/09/05/hierarchical-bayesian-models-of-categorical-data-annotation/" rel="nofollow">my models of voted inference</a> to estimate annotator bias, randomness, and overall prevalence of preferences.   Hmm, maybe Lukas&#8217;ll share the raw data.</p>
<p>One reason is that &#8220;significance&#8221; depends on the test.  Paired t-tests vs. grouped t-tests, one-sided vs. two-sided, replication adjusted or not.  Another reason is that they are just as freighted with assumptions about how the data&#8217;s generated as Bayesian priors are.  Yet another is that significant doesn&#8217;t mean important; with more queries and evaluators, a 50.1 vs. 49.9 preference could be significant, even though a typical user would never notice it.</p>
<p> At least bootstrap variance estimation (on Google &gt; Bing) would be reasonably easy to interpret.</p>
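<p>A rough sketch of what such a bootstrap estimate could look like, assuming the raw data is a flat list of votes (1 for Google, 0 for Bing). The vote counts, the function name, and the number of resamples are all made up for illustration; the real data set is not reproduced here.</p>

```python
import random
import statistics


def bootstrap_standard_error(votes, num_resamples=2000, seed=0):
    """Bootstrap estimate of the standard error of the share of votes equal to 1."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(votes)
    shares = []
    for _ in range(num_resamples):
        resample = rng.choices(votes, k=n)  # draw n votes with replacement
        shares.append(sum(resample) / n)
    return statistics.stdev(shares)


# Hypothetical data: 1 = evaluator preferred Google, 0 = preferred Bing.
votes = [1] * 55 + [0] * 45
standard_error = bootstrap_standard_error(votes)
print(f"Google share: {sum(votes) / len(votes):.2f} +/- {standard_error:.3f}")
```

<p>The spread of the resampled shares gives a direct sense of how much the Google share would move around if the experiment were repeated, without committing to a particular significance test.</p>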
<p>I believe what this post is suggesting is to test vs. the null hypothesis of &#8220;was generated by a Binomial(0.5) distribution&#8221;.  I&#8217;m not very well classically trained, but I&#8217;d hope that&#8217;d be close to a two-sided t-test given the sample size.</p>
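<p>For a single query, the test against a Binomial(0.5) null can also be computed exactly rather than approximated. This is a minimal sketch; the function name and the vote counts are hypothetical.</p>

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int) -> float:
    """Exact two-sided p-value under the null Binomial(trials, 0.5)."""
    # With p = 0.5 the distribution is symmetric around trials / 2, so the
    # two-sided p-value adds up every outcome at least as far from the
    # centre as the observed count.
    distance = abs(successes - trials / 2)
    extreme_outcomes = sum(
        comb(trials, k)
        for k in range(trials + 1)
        if abs(k - trials / 2) >= distance
    )
    return extreme_outcomes / 2 ** trials


# Hypothetical query: 8 of 10 evaluators preferred one engine.
print(binomial_two_sided_p(8, 10))
```

<p>Even an 8-to-2 vote on ten evaluators does not clear the usual 0.05 bar, which fits the impression that the individual queries look like noise.</p>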
]]></content:encoded>
	</item>
	<item>
		<title>By: John</title>
		<link>http://xyfu.bumblebeelabs.com/another-way-to-lie-with-statistics/comment-page-1/#comment-7867</link>
		<dc:creator>John</dc:creator>
		<pubDate>Wed, 10 Jun 2009 19:26:33 +0000</pubDate>
		<guid isPermaLink="false">http://blog.bumblebeelabs.com/?p=866#comment-7867</guid>
		<description>Hang  - I think you&#039;re off base. The presentation of the data Lukas gives tells you something informative, namely the distribution of individual differences.  If he just reported means and standard errors, I would have no idea whether the difference was driven by a few outliers or by, say, a small but consistent superiority on every query term.</description>
		<content:encoded><![CDATA[<p>Hang  &#8211; I think you&#8217;re off base. The presentation of the data Lukas gives tells you something informative, namely the distribution of individual differences.  If he just reported means and standard errors, I would have no idea whether the difference was driven by a few outliers or by, say, a small but consistent superiority on every query term.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Hang</title>
		<link>http://xyfu.bumblebeelabs.com/another-way-to-lie-with-statistics/comment-page-1/#comment-7860</link>
		<dc:creator>Hang</dc:creator>
		<pubDate>Wed, 10 Jun 2009 17:38:35 +0000</pubDate>
		<guid isPermaLink="false">http://blog.bumblebeelabs.com/?p=866#comment-7860</guid>
		<description>Lukas: I apologize if you interpreted my post to mean I ascribe intent to your actions. Perhaps &quot;mislead&quot; would have been more appropriate. The graphs that show the aggregate differences between search engines are, I think, an appropriate representation of the data because, indeed, as you point out, there are aggregate differences in the data. However, because there are no statistically significant individual differences, I don&#039;t agree that it was appropriate to present the individual queries. All they do is mislead people into seeing patterns where they don&#039;t exist. If you want to present the dataset of queries, I would do it in table form so that there&#039;s no suggestion of a pattern.

Again, I&#039;m sorry if this post came across as overly critical. I&#039;ve done the same thing many times myself so I&#039;m very sympathetic to the reasons behind why you made the choices you did. I simply wanted to provide an alternative presentation of the data.</description>
		<content:encoded><![CDATA[<p>Lukas: I apologize if you interpreted my post to mean I ascribe intent to your actions. Perhaps &#8220;mislead&#8221; would have been more appropriate. The graphs that show the aggregate differences between search engines are, I think, an appropriate representation of the data because, indeed, as you point out, there are aggregate differences in the data. However, because there are no statistically significant individual differences, I don&#8217;t agree that it was appropriate to present the individual queries. All they do is mislead people into seeing patterns where they don&#8217;t exist. If you want to present the dataset of queries, I would do it in table form so that there&#8217;s no suggestion of a pattern.</p>
<p>Again, I&#8217;m sorry if this post came across as overly critical. I&#8217;ve done the same thing many times myself so I&#8217;m very sympathetic to the reasons behind why you made the choices you did. I simply wanted to provide an alternative presentation of the data.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Lukas Biewald</title>
		<link>http://xyfu.bumblebeelabs.com/another-way-to-lie-with-statistics/comment-page-1/#comment-7859</link>
		<dc:creator>Lukas Biewald</dc:creator>
		<pubDate>Wed, 10 Jun 2009 17:25:07 +0000</pubDate>
		<guid isPermaLink="false">http://blog.bumblebeelabs.com/?p=866#comment-7859</guid>
		<description>It&#039;s nice to see such thoughtful criticism - but the fact that you generated a similar shape using a random process doesn&#039;t mean that there&#039;s no statistical significance in our data.  If you put your graph and our graph side-by-side, you will notice that your graph is somewhat more symmetric.  A p-value of 0.04 means that, if there were no real difference, only about one time in twenty-five would you get a mean greater than or equal to ours.

You say, &quot;The blog entry claims that there was a minor but significant (p &lt; 0.04) difference in overall quality but it’s obvious from the null graph that no individual query is statistically different in quality (I’d unfortunately have to dig out my stats textbook to figure out what test I would need to run to verify this but I’m pretty confident on my eyeball estimate).&quot; -- I&#039;m not sure why you&#039;re surprised that there can be a statistically significant difference in aggregate but not in individual queries.

BTW - We work hard to present data honestly.  I think it&#039;s somewhat over the top to call your blog post &quot;Another way to lie with statistics&quot;.  I&#039;m sorry that our graph misled you into thinking there were patterns that may be due to noise.  I think the graph does a nice job of laying out exactly what our data set consists of.</description>
		<content:encoded><![CDATA[<p>It&#8217;s nice to see such thoughtful criticism &#8211; but the fact that you generated a similar shape using a random process doesn&#8217;t mean that there&#8217;s no statistical significance in our data.  If you put your graph and our graph side-by-side, you will notice that your graph is somewhat more symmetric.  A p-value of 0.04 means that, if there were no real difference, only about one time in twenty-five would you get a mean greater than or equal to ours.</p>
<p>You say, &#8220;The blog entry claims that there was a minor but significant (p &lt; 0.04) difference in overall quality but it’s obvious from the null graph that no individual query is statistically different in quality (I’d unfortunately have to dig out my stats textbook to figure out what test I would need to run to verify this but I’m pretty confident on my eyeball estimate).&#8221; &#8212; I&#8217;m not sure why you&#8217;re surprised that there can be a statistically significant difference in aggregate but not in individual queries.</p>
<p>BTW &#8211; We work hard to present data honestly.  I think it&#8217;s somewhat over the top to call your blog post &#8220;Another way to lie with statistics&#8221;.  I&#8217;m sorry that our graph misled you into thinking there were patterns that may be due to noise.  I think the graph does a nice job of laying out exactly what our data set consists of.</p>
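<p>To illustrate how an aggregate difference can be significant when no single query is, here is a small sketch with invented numbers (100 queries, 10 votes each, a steady 6-to-4 split). It uses the normal approximation to the binomial, which is crude for 10 votes but good enough to show the effect; none of these figures come from the actual data set.</p>

```python
from math import erfc, sqrt


def two_sided_p_normal(successes, trials):
    """Two-sided p-value against a fair 50/50 split, via the normal approximation."""
    # Under the null the count has mean trials / 2 and variance trials / 4.
    z = (successes - trials / 2) / sqrt(trials / 4)
    return erfc(abs(z) / sqrt(2))


# Hypothetical data: every one of 100 queries gets 6 of 10 votes for the same engine.
per_query_p = two_sided_p_normal(6, 10)
pooled_p = two_sided_p_normal(600, 1000)

print(f"single query: p = {per_query_p:.2f}")
print(f"all queries pooled: p = {pooled_p:.1e}")
```

<p>Each query on its own is nowhere near significant, yet pooling the same small edge over all the votes gives a very small p-value.</p>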
]]></content:encoded>
	</item>
</channel>
</rss>

