Aaron Gowins is a data scientist and research fellow in the Laboratory of Biological Modeling at the National Institute of Diabetes and Digestive and Kidney Diseases, which is part of the National Institutes of Health. He and his colleagues develop mathematical models to simulate energy use and mass deposition during childhood growth. Clinicians and researchers use these simulations to investigate the development of childhood obesity. These models provide insights that facilitate the development of strategies for interventions and increase their long-term benefit. Their efforts are also an integral part of an interdisciplinary effort sponsored by the Bill and Melinda Gates Foundation to address childhood growth stunting worldwide.
Though Mr. Gowins’ educational background is in mathematics and the sciences, his data expertise is largely self-taught. He frequently writes tutorials for other rising analytics professionals for DataScience+. Mr. Gowins holds a bachelor’s degree in mathematics with a minor in chemistry. He has also earned more than a dozen professional certifications in areas like R programming, statistical inference, machine learning, and reproducible research.
[OnlineEducation.com] As a data scientist, you use your skills in a research capacity. What is data science and how is it applied to research activities? For those unacquainted with the field, what is the difference between data science and data analytics? For instance, how do data scientists’ goals and methods differ from those of predictive or quantitative analysts? Do you use different skills and technologies?
[Mr. Gowins] Many of the techniques and methods people describe now as data science have been around for a long time in traditional scientific fields. The idea of a predictive mathematical model has been pervasive in physics from its inception, and more recently bioinformatics and genetics have come to rely on sophisticated statistical methods applied to large datasets to formulate results and make predictions. In these scientific settings, people who routinely analyze large datasets and apply statistical models and machine learning techniques refer to themselves not as data scientists, but as neuroscientists, astronomers, biologists, etc. Developing a hypothesis, and then testing that hypothesis by collecting and analyzing data in a principled way, is the fundamental paradigm that defines science.
Data science is generally the application of these traditionally scientific methodologies and mathematical modeling techniques to understand and draw inferences from data outside the hard sciences. Applying these strategies to problems in retail, for instance, allows a company like Netflix to recommend movies, or Google to choose the best website for your search words.
In the financial sector, quantitative and predictive analysts have been an integral part of developing tools for traders and investors for many years. Today they are fundamentally indistinguishable from some data scientists in terms of the principles and methods they use. For instance, a data scientist may be interested in the seasonal fluctuation in sales at Amazon, and might use the same software and modeling strategy that a financial analyst uses to make predictions about Amazon’s stock price. With the recent explosion of data, there is substantial overlap in all data-driven fields.
Data scientists have an unusually large breadth of knowledge. They are database administrators, mathematicians, and programmers. They are skilled at explaining complicated topics to non-experts. They have keen insight when it comes to recognizing patterns and applying abstract concepts to real-world problems and specific questions. Thanks to Cyrus Lentin for this witty quote: “Data Scientist (noun): Person who is better at statistics than any software engineer and better at software engineering than any statistician.” I have adopted this definition for now.
[OnlineEducation.com] You presently serve as a research fellow for the National Institutes of Health, in cooperation with the Bill and Melinda Gates Foundation. To the extent that you can, would you describe the nature of your job and what it entails? How does data science inform our understanding of human health and improve its outcomes?
[Mr. Gowins] We are developing a mathematical simulation of the changes in body composition and energy use associated with childhood growth. At the NIH, we are focused on the development of obesity, particularly during early childhood. For the Gates Foundation, we are more interested in applying our model to under-nutrition, especially in infancy. In developing parts of the world, poor nutrition and sanitation put children at risk of experiencing stunted growth. Researchers have associated this early childhood growth faltering with an increased risk of cognitive deficiencies later in life. This puts an additional burden on populations already facing many challenges. Alongside experts from a variety of fields, we hope to provide insights into the optimal timing and nature of interventions, as well as identify specific gaps in data that can be addressed by future studies.
More technically, our group is developing a mechanistic model that uses a variety of relevant physiological data to make predictions about the energy absorption and expenditure that gave rise to a particular trajectory of fat and lean mass deposition. Inversely, we can also make predictions about the changes in mass associated with a certain energy uptake. This mechanistic approach allows us to examine the interplay between individual physiological processes, like changes in brain glucose uptake and the effect of infection on energy absorption and growth. From a strictly analytical standpoint, this effort is hampered by the lack of useful data. Many studies have implemented caloric and sanitation interventions, and although this data is useful, it’s the children who are experiencing the effects of the poorest conditions that we want to know the most about.
Mathematical models are critical tools in situations where cost or logistics make it difficult to collect accurate data. By measuring the changes in fat and lean mass, and having data on the energy content of tissue and typical metabolism, we can bypass expensive and time-consuming data collection and make fairly accurate predictions of the amount of energy absorbed by the child during that timespan. This is useful information for a clinician prescribing a caloric intervention, who would have a hard time measuring a patient’s energy absorption rate by any other means. Direct measurements of human caloric intake are subject to error and are highly variable. Mathematical models have become the gold standard for measuring intake because they are based on the change in body mass that resulted from that individual’s intake. For the Gates project, this means we can simulate the conditions and responses that lead to cognitive impact, and allow efforts in the field to focus on providing care. It’s really a privilege to work on such an important and impactful project.
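The energy-balance reasoning described above can be illustrated with a short Python sketch. This is not the group's Matlab model. It only shows the basic bookkeeping idea that energy absorbed is roughly energy expended plus energy stored in new tissue. The function name, the energy densities (approximate values often quoted for fat and lean tissue), and the sample numbers are all assumptions made for illustration.

```python
# Illustrative energy-balance sketch; NOT the NIH model.
# Assumed energy densities of tissue, in kcal per kg (approximate values).
RHO_FAT = 9400.0
RHO_LEAN = 1800.0


def estimate_daily_intake(delta_fat_kg, delta_lean_kg, days,
                          expenditure_kcal_per_day):
    """Estimate mean energy absorbed per day over an observation period.

    intake = expenditure + (energy stored in new fat and lean tissue) / days
    """
    if days <= 0:
        raise ValueError("days must be positive")
    stored_kcal = RHO_FAT * delta_fat_kg + RHO_LEAN * delta_lean_kg
    return expenditure_kcal_per_day + stored_kcal / days


# Hypothetical child: gains 0.5 kg fat and 1.0 kg lean mass over 100 days
# while expending an assumed 1000 kcal per day.
print(estimate_daily_intake(0.5, 1.0, 100, 1000.0))
```

A mechanistic model of the kind discussed here would track how expenditure itself changes with body composition over time, so this single-period calculation should be read only as the simplest possible instance of the idea.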
[OnlineEducation.com] Data science publications often discuss the various programming languages data scientists use in the field. As a contributor to the popular blog DataScience+, you also tend to write about advanced programming. Which programming languages would you consider essential for prospective data scientists to learn? Can you briefly describe how each is used and its importance to your work?
[Mr. Gowins] A data scientist must be a proficient programmer. R and Python have rightfully emerged as the frontrunners in programming languages, and choosing between them is somewhat a matter of preference and largely a matter of the nature of the analysis.
Python is gaining traction as the go-to language for big data science. I find Python to be easier to learn, more flexible, and faster than R. Currently it provides better packages for Natural Language Processing and text mining tasks. Python as an application backend is highly scalable, meaning it can handle many users and large datasets. The recent advances in Python interactive publishing are the state of the art.
R was written with statistics in mind and comes with vast libraries of built-in functionality. For smaller datasets and more traditional statistical analyses, R provides the analyst with a breadth of options. R has tremendous documentation, chiefly in the form of online learning resources. To some extent, however, that is because R requires more documentation, since it can be opaque and counterintuitive. Without bulky workarounds, R runs entirely in memory. For this reason, Python dominates when it comes to out-of-memory analysis of large datasets, and it integrates seamlessly with modern databases. Using R makes it simple to build a very nice app that employs R’s sophisticated algorithms, but it will perform best with a smaller dataset and lighter traffic.
At the NIH, our model is written in Matlab, which is proprietary software, as is another popular choice, SAS. In contrast, Python and R are open source and free to download and use. Python and R are continually being developed by thousands of experts from around the world and are easily joined with other software and visualization tools. Matlab and SAS exist virtually in a bubble and are most competitive as standalone tools. I should add that for our task and others, Matlab has advantages over Python and R, and that SAS is very intuitive and easy to use. For a student interested in large-scale analyses, Python looks like it will be the future.
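To make the idea of out-of-memory analysis concrete, here is a minimal Python sketch that computes a column mean while holding only one row in memory at a time. It uses only the standard library. The function name and the `weight_kg` column are hypothetical, and the in-memory sample stands in for a file too large to load at once.

```python
import csv
import io


def streaming_mean(lines, column):
    """Return the mean of one numeric CSV column, reading row by row.

    `lines` is any iterable of text lines (for example an open file),
    so the full dataset never has to fit in memory.
    """
    total = 0.0
    count = 0
    for row in csv.DictReader(lines):
        value = row[column]
        if value == "":
            continue  # skip missing values
        total += float(value)
        count += 1
    if count == 0:
        raise ValueError("no numeric values found")
    return total / count


# Small in-memory stand-in for a large file on disk (hypothetical data).
sample = io.StringIO("id,weight_kg\n1,10.0\n2,12.0\n3,\n4,14.0\n")
print(streaming_mean(sample, "weight_kg"))
```

The same pattern applies to a real file by passing an open file object instead of the sample. Libraries such as pandas offer a chunked variant of this idea through the `chunksize` argument of `read_csv`.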
[OnlineEducation.com] Professionals’ interests and aptitudes can impact their work significantly no matter what they do for a living. Can you describe some of the strengths or qualities of a successful data scientist? What about habits that serve them well in the field?
[Mr. Gowins] Dealing with large amounts of data can be very challenging. Datasets that are too big to fit in memory require specialized tools and techniques to analyze effectively. These tools are largely under development, have gaps in performance, and are notoriously unstable. The techniques required can be complex and demanding. Runtimes can be very long, and debugging a program that runs for a week on a supercomputer can be daunting. These difficult problems seldom come with a ready-made method, and they require the analyst to be continually learning and exploring new approaches, often designing their own solutions.
Research inherently takes place at the limit of current knowledge, and this is true for the traditional sciences as well as data science. A successful data scientist must combine a love of learning with a high tolerance for frustrating and time-consuming problems. Once a successful solution is found, a brand new challenge awaits. Good scientists specialize in solving problems, and the description above appeals to them.
I have always put personal growth and curiosity first in my career choices. For me, there is no substitute for pursuing something that engages me. I enjoy challenges, and there’s nothing quite like the challenge of learning something new. I’ve learned that bringing an optimistic and adventurous outlook is a strength. People count on me to extract meaning from seemingly lifeless data and develop insights that can answer questions, solve problems, and make useful predictions. To me, that’s about as good as it gets.
[OnlineEducation.com] Recent reports suggest the number of data professionals has more than doubled nationally over the last four years, and universities are adding new programs accordingly. What parting advice would you offer readers who would like to follow in your footsteps? Is there anything they can do to prepare for or succeed in data science? Or get their first job?
[Mr. Gowins] Collecting and storing data has become much easier and cheaper, and this has introduced a new paradigm in most quantitative fields, because many of the stalwart techniques of statisticians no longer apply at scale. However, since analyses typically entail either some sort of manipulation that can be traced back to statistics or a machine learning algorithm, probably also inspired by statistics, at least fundamental training in statistics is essential.
Advanced degrees are often a requirement of employment, but online courses are highly affordable and, in my opinion, offer the best option for quickly developing relevant and useful skills. There is great demand for data scientists, and relatively few quality advanced degrees have been offered. Therefore, the current trend is toward preferring experience and demonstrated ability, which online courses have done a remarkable job providing. In data science, methods and tools are constantly being updated and optimized. Developing effective online learning skills, for those with or without an advanced degree, is both a necessity and a marketable trait.
It remains to be seen whether universities will develop programs that become essential for employment as data scientists, or whether online courses and self-teaching will be the future of data science training, as in actuarial science. For aspiring students, a strong math background and solid programming skills will look good on a resume for the foreseeable future.
There is no reason that someone who is curious about data science should ever be bored. The internet is full of data, projects, and tutorials on any topic you can imagine. If you’re willing to put in the effort, online learning and data “boot camps” can provide all the skills you need to succeed. Check out these resources and learn as much as you can. Not only is it a fun diversion, but the analyses and products you can create represent the type of enthusiasm and initiative that make a job-seeker stand out.
Getting involved in the data science community should be a priority. Anyone even casually interested in data science absolutely must visit datasciencecentral.com. It’s a terrific site packed with valuable information and resources. While you’re there, check out Vincent Granville’s blog posts and literature. He is a highly respected authority on developing data science talent and is a treasure trove of solid insight and advice for data scientists at every level.