Answer: A data engineer is a high-level computer programmer, computer scientist, and/or computer engineer tasked with designing, building, and managing large databases, processing systems, and other components of big-data infrastructure. The role of the data engineer typically overlaps with that of the data scientist, and data engineers often have foundational training and/or experience in data science and analytics processes, such as data mining, data modeling, and data systems programming. There is no formal division or distinction between data engineering, data science, and other types of highly technical work in the field of analytics. However, data engineers are generally regarded as professionals whose primary responsibilities pertain to the construction and maintenance of the channels or pipelines that collect and supply usable data for scientific analysis, predictive and prescriptive modeling, and other analytics functions.
Data science projects often require a team or teams of specialists with specific roles, functions, and areas of expertise. While data scientist and data analyst may be used as broad designations for individuals who perform the various functions related to the complex processes of collecting, sorting, cleaning, organizing, storing, modeling, and interpreting large data sets, it has become increasingly common to differentiate various members of a data science team based on their roles and their areas of expertise.
Data engineer is one of several more specialized and somewhat narrower designations for a subset of professionals in the field of data science and analytics. It is typically used to distinguish data science team members whose primary function is to design, construct, and/or maintain the big data systems used in analytics from team members who build algorithms, construct probability models, and provide analyses of the results. While data engineers engage in many of the core elements of data science, they have greater responsibilities in areas related to the initial collection of raw data, and in the process of sorting, cleansing, storing, securing, and moving that data, than in the analytical procedures that characterize the later stages of a data science project.
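To make the collect, cleanse, store, and move responsibilities described above more concrete, here is a minimal sketch of a small pipeline in Python. It uses only the standard library, and everything in it is an assumption made for illustration: the sample CSV data, the `readings` table, and the function names `extract`, `clean`, `load`, and `run_pipeline` are hypothetical and do not come from any particular employer's stack or tool.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: a small CSV with messy values (illustrative only).
RAW_CSV = """id,city,temperature_c
1, Seattle ,11.5
2,Tacoma,
3,seattle,12.0
3,seattle,12.0
"""


def extract(text):
    """Collect raw rows from a CSV string."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Drop rows with missing readings, normalize city names, remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        if not row["temperature_c"].strip():
            continue  # skip rows with no temperature reading
        record = (
            int(row["id"]),
            row["city"].strip().title(),
            float(row["temperature_c"]),
        )
        if record in seen:
            continue  # skip exact duplicates
        seen.add(record)
        cleaned.append(record)
    return sorted(cleaned)


def load(records, conn):
    """Store the cleaned records in a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(id INTEGER PRIMARY KEY, city TEXT, temperature_c REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(conn):
    """Run extract, clean, and load, then hand a summary to downstream analysis."""
    load(clean(extract(RAW_CSV)), conn)
    query = "SELECT city, AVG(temperature_c) FROM readings GROUP BY city"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_pipeline(connection))
```

In this sketch, the data engineer's work ends once clean records are in the table; the final query stands in for the analysis that a data scientist or analyst might run on top of it.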
It is important to note, however, that there is no clear line separating the role of the data engineer from that of the data scientist and other professionals involved in big data systems operations. This is evident in the mission statement of the Institute of Electrical and Electronics Engineers (IEEE) Technical Committee on Data Engineering, which lists the following as primary concerns in data engineering:
At a minimum, data engineers should be proficient in basic programming languages, as well as common computer operating systems, networking protocols, database systems, and analytics tools. That knowledge base may then serve as a foundation for exploring the deeper intricacies of information technology processing systems and of big-data storage and warehousing systems, as well as for learning how to manage large, unstructured datasets. This usually includes developing a knowledge of SQL and NoSQL databases, cloud computing platforms, and parallel processing tools like Apache’s Hadoop and Spark.
Other data pipeline, storage, and management tools used by data engineers include the Amazon Web Services (AWS) cloud platform, SAP ERP software, Oracle and Apache database systems, the open-source MySQL relational database management system, and the object-relational database management system PostgreSQL. Data engineers may also benefit from cultivating professional communication skills in order to work more effectively in cross-functional data science teams and to explain technical information to non-technical members of an organization.
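As a rough illustration of the parallel processing idea behind tools like Hadoop and Spark, the sketch below counts words using a map step over separate partitions followed by a reduce step that merges the partial results. It is a stand-in written with Python's standard library thread pool, not the Hadoop or Spark API, and the function names `map_chunk` and `word_count` and the `partitions` parameter are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count the words in one partition of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def word_count(lines, partitions=2):
    """Split the input into partitions, count each concurrently, then merge."""
    # Round-robin split into roughly equal partitions.
    chunks = [lines[i::partitions] for i in range(partitions)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    # Reduce step: merge the per-partition counts into one result.
    return reduce(lambda a, b: a + b, partial_counts, Counter())


if __name__ == "__main__":
    sample = ["big data", "data pipeline", "big pipeline data"]
    print(word_count(sample))
```

Distributed frameworks apply the same split, process, and merge pattern across many machines and add scheduling, fault tolerance, and storage management, which this single-process sketch does not attempt.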
There are many pathways to becoming a data engineer, but the most common routes begin with a bachelor’s and/or a master’s degree in computer science, computer/software/IT engineering, or data science/analytics. Designated degree programs in data engineering are not common. However, there are master’s in data science and master’s in computer science and engineering programs that may offer elective coursework in one or more areas pertaining to data engineering. There are also non-degree options for training in data engineering, including massive open online courses (MOOCs), professional data engineering training programs and boot camps run by private organizations, and certificate programs in data engineering like the University of Washington’s Certificate in Big Data Technologies program, which is offered online and at two of the school’s campuses and features an Introduction to Data Engineering course and a course on Building the Data Pipeline.
Data engineer or database engineer is typically not an entry-level position. Many employers require new hires to have two or more years of experience working in a data technology field, and it is not uncommon for professionals to advance into a data engineering role after several years of work in analytics, IT management, computer programming, or a related field. There are also several professional credentials that may be helpful for career advancement in the field of data engineering, including the Google Cloud Certified Professional Data Engineer credential, the Cloudera Certified Professional Data Engineer credential, and IBM’s Certified Data Engineer – Big Data certification.
The steps below illustrate a potential pathway to becoming a data engineer:
The International Data Engineering and Data Science Association (IDEAS), formerly the Data Science Association, is a professional organization that can provide additional resources and information about data engineering. IDEAS also offers certificate programs in Natural Language Processing, Python for Data Science, R for Data Analyst, and SQL for Data Analyst.