Two-stage least squares (2SLS) has poor properties if instruments are exogenous but weak. But how strong do instruments need to be for 2SLS estimates and test statistics to exhibit acceptable properties? A common standard is that first-stage F ≥ 10. This is adequate to ensure that two-tailed t-tests have modest size distortions. But other problems persist: in particular, we show that 2SLS standard errors are artificially small in samples where the estimate is most contaminated by the OLS bias. Hence, if the bias is positive, the t-test has little power to detect true negative effects, and inflated power to find positive effects. This phenomenon, which we call a "power asymmetry," persists even if first-stage F is in the thousands. Robust tests like Anderson–Rubin perform better, and should be used in lieu of the t-test even with strong instruments. We also show that 2SLS test statistics typically suffer from very low power if first-stage F is only 10, leading us to suggest a higher standard of instrument strength in empirical practice.
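The link between the 2SLS estimate and its standard error can be illustrated with a small Monte Carlo sketch. This is not the simulation design used in the paper: the function name `simulate_iv`, the single-instrument model without an intercept, the homoskedastic standard error formula, and all parameter values (sample size, first-stage coefficient `pi`, error correlation `rho`) are assumptions chosen only for illustration. With one instrument, 2SLS reduces to the simple IV estimator, and a positive `rho` implies a positive OLS bias.

```python
import numpy as np


def simulate_iv(n=500, reps=2000, pi=0.4, rho=0.8, beta=0.0, seed=0):
    """Draw `reps` samples from a one-instrument IV model and return the
    2SLS (IV) estimates and their conventional standard errors.

    Hypothetical design: y = beta*x + u, x = pi*z + v, corr(u, v) = rho.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((reps, n))
    u = rng.standard_normal((reps, n))
    # v is correlated with u, which makes x endogenous (OLS bias > 0 if rho > 0)
    v = rho * u + np.sqrt(1.0 - rho**2) * rng.standard_normal((reps, n))
    x = pi * z + v
    y = beta * x + u

    zx = np.sum(z * x, axis=1)
    zy = np.sum(z * y, axis=1)
    zz = np.sum(z * z, axis=1)

    beta_hat = zy / zx                       # just-identified 2SLS = IV
    resid = y - beta_hat[:, None] * x        # structural residuals
    sigma2 = np.sum(resid**2, axis=1) / (n - 1)
    se = np.sqrt(sigma2 * zz) / np.abs(zx)   # homoskedastic IV standard error
    return beta_hat, se


beta_hat, se = simulate_iv()

# Correlation between the estimate and its standard error across samples
corr = np.corrcoef(beta_hat, se)[0, 1]

# Two-tailed 5% t-test of the true null beta = 0
t_stat = beta_hat / se
reject = np.abs(t_stat) > 1.96
share_pos = np.mean(beta_hat[reject] > 0)    # share of rejections with beta_hat > 0

print(f"corr(beta_hat, se) = {corr:.3f}")
print(f"rejection rate = {reject.mean():.3f}, share positive = {share_pos:.3f}")
```

Under these assumed settings the population first-stage F is roughly 80, well above 10. The estimate and its standard error are nonetheless negatively correlated: samples whose estimates are pulled toward OLS report smaller standard errors, so rejections of the true null fall mostly on the positive side.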