Research is, to a large extent, a media endeavor, so there is a strong incentive to chase fame with a surprise breakthrough or with confirmation of the audience's most extraordinary dreams and biases. As a result, common problems riddle published work, such as unfair or irrelevant comparisons and misunderstandings about what is actually measured. To avoid wasting time on unlikely or irrelevant results, consider using the diligence methods below to analyze a claim’s credibility (correctness, completeness, and importance), because even valid results may take you nowhere in your business territory.
The Main Contribution: Relevance, Impact, and Open Questions
The primary questions to ask when looking at a piece of research are:
- Insightful? What is the main idea in one sentence? Usually, there is just one main idea that can be summarized into one sentence, even if you are not an expert in the research area. The lack of a central point is suspicious.
- Actionable? What impact could this have, and why is this important and relevant?
- Trustworthy? What are the unknowns and open questions? What does not fit well?
Expanding on Doomberg’s 5 Question Framework
The most exciting questions I encountered were Doomberg’s 5 questions shared on a podcast. Doomberg is an independently thinking industry research group. This post is an expansion of the 5 questions. There is a paywalled article called Conducting Diligence with all the details, but I don’t have full access to the text as of now.
- Author Credibility: What is the reputation and history of the researchers and the institutions they are affiliated with? Do they have at least 3 successes within the domain?
- Follow the money (Who benefits?): Who funded it? Are the researchers biased, e.g., are their livelihoods, views, or identities tied to particular results? For example, some research is funded by a for-profit company, which may introduce a goal-oriented bias into the study. (Try investing in a company and then tell me what it does to your views.) Is this a disgruntled employee powered by resentment (an axe to grind)?
- Connections: Are they known, accessible, or anonymous? Who is the author connected to, and who helped them in the past?
- Outsider: Outsiders may shed new light on the subject despite having less authority. What do potential critics stand to lose? (Follow the anti-money.)
- Publisher Credibility: Who published these results in their publication? What is the history and reputation of this publisher?
- Reviewers and Critics: Who peer-reviewed it, or is it a self-published pre-print? Was the peer reviewer familiar enough with the domain to judge the research? Whom can you ask about this?
- Republishers: Who are the original primary sources, and are they publicly accessible and non-anonymous? Who is republishing the research, and what are their biases, intentions, and incentives? Do people from multiple sides publish this?
- Audience bias: Too on the nose? Too aesthetic to be true? Too memetic (made to replicate)? Is there an existing bias in the audience that may have been encouraging or preventing wider publication? Why was this published or surfaced now? Is the current environment related to the results?
- Scientific Process and Evidence: At what stage is this research in the scientific process of validation and publishing?
- How often are similar papers at a similar stage later found invalid or retracted? Does this paper look more like valid or invalid past research?
- Burden of Evidence: What is the specific evidence for the claims, and does it really imply the conclusion, or is the conclusion a non sequitur? Are the metrics and their comparisons valid and relevant? Or are there hidden effects? Is this a randomized double-blind placebo-controlled (RDBPC) trial? Who else applied the results successfully?
- Transparent: Is the process published and reliable? Are the numbers published, and do they show a reasonable statistical distribution? If you wanted to hide something, where would you hide it? Is it dogmatic?
- Scientific Context: What do existing research and the history of this field suggest about these findings?
- The idea: Is there a clear main finding, or is there an attempt to create confusion by adding unnecessary detail instead? Is the main message coherent? Does the title match the content?
- Shoulders of Giants: What previous research does this paper cite and stand on? Do the cited sources actually say what this paper claims they say? What are the assumptions, and were they proven?
- Cross-check: Are the findings logically consistent with other verified research results? Can they be derived from first principles (axioms or postulates that are accepted without being deduced from anything else)? Is the explanation for the claims tested or correctly deduced? How does it compare to similar related research? Do the existing trends and order-of-magnitude estimates match the results?
- Future Expectations and Impact: What is the outlook for this finding?
- How large is the impact of the finding? Extraordinary claims require extraordinary proof. How large are the rewards for the successful implementation of this method?
- When and how will these results be reproduced (replicated)? What industries are interested in applying this? Who wants to put this into production?
- What are the risks, unknowns, and hidden problems that may invalidate the results or the impact, and how likely are they? What is the downside if you experiment with this method?
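The base-rate question and the rule that extraordinary claims require extraordinary proof can both be read as applications of Bayes' rule. The sketch below is only an illustration of that reading, not part of Doomberg's framework: the function name `posterior` and every probability in it are made-up assumptions, chosen to show how the same evidence moves an ordinary claim and an extraordinary claim very differently.

```python
def posterior(prior: float, p_evidence_if_true: float, p_evidence_if_false: float) -> float:
    """Bayes' rule: probability that a claim is true after seeing the evidence."""
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1.0 - prior) * p_evidence_if_false
    return numerator / denominator


# Hypothetical numbers: the same study quality applied to two different claims.
# The study reports a positive result 80% of the time when the claim is true
# and 5% of the time when it is false.
ordinary = posterior(prior=0.50, p_evidence_if_true=0.8, p_evidence_if_false=0.05)
extraordinary = posterior(prior=0.01, p_evidence_if_true=0.8, p_evidence_if_false=0.05)

# An independent replication of the extraordinary claim: yesterday's posterior
# becomes today's prior.
replicated = posterior(prior=extraordinary, p_evidence_if_true=0.8, p_evidence_if_false=0.05)

print(f"ordinary claim:      {ordinary:.2f}")       # 0.94
print(f"extraordinary claim: {extraordinary:.2f}")  # 0.14
print(f"after replication:   {replicated:.2f}")     # 0.72
```

With these assumed numbers, one positive study is nearly enough for the ordinary claim, while the extraordinary claim stays unlikely until it is independently replicated. The prior can be estimated from how often similar papers at a similar stage held up.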
Common Problems in Machine Learning Research
In machine learning, the following problems are common in nice-sounding research papers that nevertheless offer no valuable insight into how to improve production systems in the industry.
- Comparing architectures using models with different parameter counts: Bigger models tend to outperform smaller models, so the gain may come from size rather than from the architecture.
- Outperforming on unknown or invalid benchmarks: Evaluating general architecture ability but using obscure benchmark datasets with only irrelevant competing architectures. Recommendation systems often lack large-scale datasets to compare results on.
- Production-irrelevant metrics: Commonly, in recommendation systems mean squared error was used for comparison, whereas metrics like precision, recall, and NDCG are more useful for production deployments.
- Seed tuning, hyperparameter tuning, or training longer: Comparing with previous results, but spending more compute on finding the best random neural network initialization, on tuning hyperparameters, or on longer training. This will inflate the results, creating invalid comparisons. Training time is part of the cost calculation for deployment.
- Test set leak: Evaluating generalization while evaluation set samples leak into the training set. This is a difficult problem with web-scale datasets, where it is hard to filter test sets out of the training data.
- Different preprocessing or training sets: Preprocessing and filtering of the training set may have a much more significant impact than any model.
- Not reproducible: Not providing code and not providing all details about the path towards the results.
- Not citing research: There is a potential bias not to cite research from competing companies to avoid advertising them. However, missing citations may also result from limited research resources.
- Authors who had a large impact on ML in the past (incomplete): Schmidhuber, Hinton, Karpathy, Bengio, LeCun
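Two of the pitfalls above, production-irrelevant metrics and test set leaks, can be illustrated in a few lines. This is a minimal sketch with made-up data and hypothetical helper names (`mse`, `ndcg_at_k`, `leaked_examples`). The NDCG variant uses linear gain, and the leak check only catches exact duplicates after whitespace and case normalization, so it is a lower bound on real leakage.

```python
import math


def mse(predicted, actual):
    """Mean squared error between predicted and actual ratings."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(predicted, actual, k):
    """NDCG@k: quality of the ranking induced by the predicted scores."""
    order = sorted(range(len(actual)), key=lambda i: predicted[i], reverse=True)
    ranked = [actual[i] for i in order[:k]]
    ideal_dcg = dcg(sorted(actual, reverse=True)[:k])
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


def normalize(text):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


def leaked_examples(train, test):
    """Test samples that also appear in the training set (exact match after normalization)."""
    train_set = {normalize(sample) for sample in train}
    return [sample for sample in test if normalize(sample) in train_set]


# Made-up ratings: model A is close in value but ranks items badly,
# model B is far off in value but ranks items perfectly.
actual = [5.0, 3.0, 1.0]
model_a = [3.0, 3.2, 3.1]
model_b = [10.0, 9.0, 8.0]

mse_a, mse_b = mse(model_a, actual), mse(model_b, actual)
ndcg_a, ndcg_b = ndcg_at_k(model_a, actual, k=3), ndcg_at_k(model_b, actual, k=3)
print(f"MSE:  A={mse_a:.2f}  B={mse_b:.2f}")    # A wins on MSE
print(f"NDCG: A={ndcg_a:.2f}  B={ndcg_b:.2f}")  # B wins on ranking quality

# Made-up text samples: one test sample is a near-verbatim copy of a training sample.
train = ["The cat sat on the mat.", "Dogs bark loudly."]
test = ["the cat  sat on the mat.", "Birds sing at dawn."]
leaks = leaked_examples(train, test)
print(f"leaked test samples: {len(leaks)} of {len(test)}")
```

In this toy case the model with the lower error is the worse recommender, which is why a paper reporting only mean squared error says little about production ranking quality. For web-scale data, near-duplicate detection would be needed on top of the exact-match check shown here.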
Other Claim Evaluation Tests
There are various tests:
- Ray Dalio’s Believability: “People who have repeatedly and successfully accomplished the thing in question and have great explanations when probed are most believable.”
- The CRAAP Test: This easy-to-remember acronym stands for Currency, Relevance, Authority, Accuracy, and Purpose, and it guides users to assess the quality and relevance of information sources.
- SIFT Method: The SIFT (Stop, Investigate the source, Find better coverage, Trace claims) method is a simple four-step process for evaluating information online.
- TRAAP Test: Similar to the CRAAP test, the TRAAP (Timeliness, Relevance, Authority, Audience, and Purpose) test is used to assess the reliability of a source and the information it provides.
- RADAR Framework: This stands for Rationale, Authority, Date, Accuracy, and Relevance. It essentially helps users to assess the quality, reliability, and usefulness of a resource.
- PICO Framework: Used in evidence-based practice to formulate a searchable clinical question. It stands for Patient problem or Population, Intervention, Comparison, and Outcome.
- The Pragmatist’s Guide to Life - Standards for Evidence: Logical Consistency, Personal Experience, Personal Emotional Experience, Cultural Consensus, Expert Consensus, Scientific Method, Doctrine. This book talks about a lot more than just this topic. And there are other very unusual books in the series Pragmatist’s Guide.
Battle Testing the Method
It is great to have a handy set of questions to ask to assess credibility quickly. But you will only know the usefulness once you apply the method correctly and completely yourself in your particular problem terrain. Depending on your business territory, this may or may not be worth your time.
Compressing Knowledge
Reality cannot be losslessly compressed into a text. No book, conversation, or person is comprehensively and usably true. Simplifications and empirical statistics about the here and now have merit and utility because they let you pursue your goals with less cognitive effort. Our subconscious must operate in a dynamic environment that is not fully known and not entirely understandable.
Instead of building a theory immediately, gather examples (stories, memes) first to build up problem statistics. A more general rule can be conjectured or verified afterward.