
How to Become a Data Engineer



https://towardsdatascience.com/who-is-a-data-engineer-how-to-become-a-data-engineer-1167ddc12811

A simple guide on how to ride the waves of Data Engineering and not let them pull you under.

It seems like everybody wants to be a Data Scientist these days. But what about Data Engineering? At its heart, it is a hybrid of sorts between a data analyst and a data scientist: a Data Engineer is typically in charge of managing data workflows, pipelines, and ETL processes. Given these important functions, it is the next hottest buzzword, and it is actively gaining momentum.

A high salary and huge demand are only a small part of what makes this job so hot. If you want to be such a hero, it's never too late to start learning. In this post, I have put together all the information you need to take the first steps.

So, let’s get started!

What is Data Engineering?

Frankly speaking, there is no better explanation than this:

“A scientist can discover a new star, but he cannot make one. He would have to ask an engineer to do it for him.”
–Gordon Lindsay Glegg

So, the role of the Data Engineer is really valuable.

As the title suggests, data engineering is all about data: its delivery, storage, and processing. Accordingly, the main task of data engineers is to provide a reliable infrastructure for data. If we look at the AI Hierarchy of Needs, data engineering occupies the first two to three stages: Collect, Move & Store, and Data Preparation.

Hence, for any data-driven organization, it is vital to employ data engineers to stay on top.

What does a data engineer do?

With the advent of "big data," the area of responsibility has changed dramatically. Where these experts once wrote large SQL queries and moved data with tools such as Informatica ETL, Pentaho ETL, or Talend, the requirements for data engineers have since advanced.

Most companies with open positions for the Data Engineer role have the following requirements:

  • Excellent knowledge of SQL and Python
  • Experience with cloud platforms, in particular, Amazon Web Services
  • Preferred knowledge of Java / Scala
  • Good understanding of SQL and NoSQL databases (data modeling, data warehousing)

Keep in mind that these are only the essentials. From this list, we can assume that data engineers are specialists from the fields of software engineering and backend development.

For example, if a company starts generating a large amount of data from different sources, your task as a Data Engineer is to organize the collection of that information and its processing and storage.

The list of tools used in a given case may differ; everything depends on the volume of the data, the speed of its arrival, and its heterogeneity. The majority of companies have no big data at all, so as a centralized repository (the so-called Data Warehouse) you can use a SQL database (PostgreSQL, MySQL, etc.) with a small number of scripts that load data into the repository.
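That small-scale setup can be sketched in a few lines of Python. The snippet below uses the standard-library `sqlite3` module as a stand-in for a real PostgreSQL or MySQL warehouse, and the table name and rows are invented for illustration:

```python
import sqlite3

def load_events(db_path, rows):
    """Insert a batch of (user_id, event, ts) rows into a warehouse table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (user_id INTEGER, event TEXT, ts TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    conn.close()
    return count

# two sample records loaded into an in-memory database
print(load_events(":memory:", [(1, "signup", "2019-09-01"),
                               (2, "login", "2019-09-02")]))  # → 2
```

In a real setup you would swap `sqlite3` for a driver like `psycopg2` and run the script on a schedule, but the shape of the job is the same.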
IT giants like Google, Amazon, Facebook or Dropbox have higher requirements:

  • Knowledge of Python, Java or Scala
  • Experience with big data: Hadoop, Spark, Kafka
  • Knowledge of algorithms and data structures
  • Understanding the basics of distributed systems
  • Experience with data visualization tools like Tableau or ElasticSearch will be a big plus

That is, there is a clear bias toward big data, namely processing it under high load. These companies also have increased requirements for system resilience.

Data Engineers Vs. Data Scientists


Okay, that was a simple and humorous explanation (nothing personal), but in truth everything is much more complicated.

You should know first that there is a lot of ambiguity between the data scientist and data engineer roles and skills, so you can easily get confused about which skills are essential to be a successful data engineer. Of course, certain skills overlap between the two roles. But there is also a whole slew of diametrically different skills.

Data science is a real thing — but the world is moving to a functional data science world where practitioners can do their own analytics. You need data engineers, more than data scientists, to enable the data pipelines and integrated data structures.

Is a data engineer more in demand than data scientists?

Yes, because before you can make a carrot cake you first need to harvest, clean and store the carrots!

A Data Engineer understands programming better than any data scientist, but when it comes to statistics, it is exactly the opposite.

But here is the data engineer's advantage: without him or her, the value of a prototype model (most often a piece of terrible-quality Python code that came from a data scientist and somehow produces a result) tends toward zero.

Without the data engineer, that code will never become a project, and no business problem will be solved effectively. The data engineer turns it into a product.

Essential Things Data Engineer Should Know


So, if this job sparks a light in you and you are full of enthusiasm, you can learn it, master all the needed skills, and become a real data engineering rock star. And yes, you can do it even without a programming or other technical background. It's hard, but it's possible!

What are the first steps?

You should have a general understanding of what is what.

First of all, Data Engineering is primarily related to computer science. To be more specific, you should have an understanding of efficient algorithms and data structures. Secondly, since data engineers deal with data, an understanding of the operation of databases and the structures underlying them is a necessity.

For example, conventional SQL databases are based on the B-Tree structure, while modern distributed repositories use LSM-Trees and other hash table modifications.

These steps are based on a great article by Adil Khashtamov. So, if you know Russian, please support this writer and read his post too.

1. Algorithms and Data Structures

Using the right data structure can drastically improve the performance of an algorithm. Ideally, we should all learn data structures and algorithms in our schools, but it’s rarely ever covered. Anyway, it’s never too late.
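To make the point concrete, here is a toy example of how the right data structure changes an algorithm's complexity. The classic "two sum" problem (the input array and target are invented for illustration) goes from quadratic to linear time once a hash table is involved:

```python
def two_sum_naive(nums, target):
    # O(n^2): check every pair of indices
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return (i, j)
    return None

def two_sum_hash(nums, target):
    # O(n): a hash table trades a little memory for a lot of speed
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return (seen[target - x], i)
        seen[x] = i
    return None

print(two_sum_hash([2, 7, 11, 15], 9))  # → (0, 1)
```

Both functions return the same answer; only the data structure behind the lookup differs.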

So, here are my favorite free courses to learn data structures and algorithms:

Plus, do not forget about the classic work on algorithms by Thomas Cormen, Introduction to Algorithms. It is the perfect reference when you need to refresh your memory.

To improve your skills, use Leetcode.

You can also dive into the world of databases with the awesome videos from Carnegie Mellon University on YouTube:

2. Learn SQL

Our whole life is data. And in order to extract this data from the database, you need to “speak” with it in the same language.

SQL (Structured Query Language) is the lingua franca of the data world. No matter what anyone says, SQL is alive and will live for a very long time.

If you have been in development for a long time, you probably noticed that rumors about the imminent death of SQL appear periodically. The language was developed in the early 70s and is still wildly popular among analysts, developers, and just enthusiasts.

There is no getting anywhere in data engineering without SQL knowledge, since you will inevitably have to construct queries to extract data. All modern big data warehouses support SQL:

  • Amazon Redshift
  • HP Vertica
  • Oracle
  • SQL Server

…and many others.

To analyze large layers of data stored in distributed systems like HDFS, SQL engines were invented: Apache Hive, Impala, etc. See, SQL is not going anywhere.

How do you learn SQL? Just practice it.

For this purpose, I would recommend getting acquainted with an excellent tutorial, which is free by the way, from Mode Analytics.

A distinctive feature of these courses is an interactive environment where you can write and execute SQL queries directly in the browser. The Modern SQL resource will not be superfluous either. And you can apply this knowledge to the Leetcode tasks in the Databases section.
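You can also practice locally without installing anything: Python's built-in `sqlite3` gives you a SQL engine in a few lines. The table and data below are invented for illustration, but the query shape (aggregate, group, sort) is the bread and butter of analytical SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 30.0), ("bob", 15.0), ("alice", 20.0)])

# A typical analytical query: total spend per customer, largest first
query = """
SELECT customer, SUM(amount) AS total
FROM orders
GROUP BY customer
ORDER BY total DESC
"""
for customer, total in conn.execute(query):
    print(customer, total)
# → alice 50.0
# → bob 15.0
```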

3. Programming in Python and Java / Scala

Why it is worth learning the Python programming language, I already wrote in the article Python vs R. Choosing the Best Tool for AI, ML & Data Science. As for Java and Scala, most of the tools for storing and processing huge amounts of data are written in these languages. For example:

  • Apache Kafka (Scala)
  • Hadoop, HDFS (Java)
  • Apache Spark (Scala)
  • Apache Cassandra (Java)
  • HBase (Java)
  • Apache Hive (Java)

To understand how these tools work, you need to know the languages in which they are written. Scala's functional approach lets you solve parallel data processing problems effectively. Python, unfortunately, cannot boast of speed or parallel processing. On the whole, knowing several languages and programming paradigms broadens your approaches to solving problems.
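Python's GIL does limit thread-based parallelism for CPU-bound work, but the standard-library `multiprocessing` module can sidestep it by using separate processes. A minimal sketch (the `square` workload is invented for illustration):

```python
from multiprocessing import Pool

def square(x):
    # a stand-in for some CPU-bound transformation
    return x * x

if __name__ == "__main__":
    # four worker processes, each with its own interpreter and GIL
    with Pool(processes=4) as pool:
        print(pool.map(square, range(10)))
# → [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

This is fine for embarrassingly parallel batch jobs; for serious distributed processing you reach for Spark or similar.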

To plunge into the Scala language, you can read Programming in Scala by the language's author. Twitter has also published a good introductory guide, Scala School.

As for Python, I consider Fluent Python to be the best intermediate level book.

4. Big Data Tools

Here is a list of the most popular tools in the big data world:

  • Apache Spark
  • Apache Kafka
  • Apache Hadoop (HDFS, HBase, Hive)
  • Apache Cassandra

You can find more information on big data building blocks in this awesome interactive environment. The most popular tools are Spark and Kafka. They are definitely worth exploring, preferably to the point of understanding how they work from the inside. Jay Kreps (co-author of Kafka) published a monumental work in 2013, The Log: What every software engineer should know about real-time data's unifying abstraction; core ideas from this work, by the way, were used in the creation of Apache Kafka.
For an introduction to Hadoop, see A Complete Guide to Mastering Hadoop (free).
For me, the most comprehensive guide to Apache Spark is Spark: The Definitive Guide.
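The central idea of Kreps's log essay, an append-only sequence of records with offsets that consumers read independently, can be sketched in a few lines. This is a toy model, not how Kafka is actually implemented:

```python
class Log:
    """A toy append-only log: each record gets a sequential offset,
    and every consumer reads from its own offset independently."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # the new record's offset

    def read(self, offset):
        # return everything from `offset` onward; the log is never mutated
        return self.records[offset:]

log = Log()
log.append("user_1 signed up")
log.append("user_2 logged in")
print(log.read(0))  # a new consumer replays the whole history
print(log.read(1))  # another consumer resumes from offset 1
```

Because the log is immutable and ordered, replaying it always reproduces the same state, which is why it works so well as the backbone of data pipelines.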

5. Cloud Platforms


Knowledge of at least one cloud platform is among the requirements for a Data Engineer position. Employers give preference to Amazon Web Services, with Google Cloud Platform in second place and Microsoft Azure rounding out the top three.

You should be well oriented in Amazon EC2, AWS Lambda, Amazon S3, and DynamoDB.

6. Distributed Systems

Working with big data implies clusters of independently operating computers that communicate over a network. The larger the cluster, the greater the likelihood that some of its member nodes will fail. To become a cool data expert, you need to understand the problems of distributed systems and the existing solutions to them. This area is old and complex.
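The "bigger cluster, more failures" observation is easy to quantify. Assuming independent node failures with some per-node probability (the 1% figure below is an illustrative assumption, not a measured rate), the chance that at least one node fails grows quickly with cluster size:

```python
def cluster_failure_probability(node_failure_p, nodes):
    """Probability that at least one node fails,
    assuming failures are independent: 1 - (1 - p)^n."""
    return 1 - (1 - node_failure_p) ** nodes

# With an assumed 1% failure rate per node, large clusters
# see at least one failure almost surely.
for n in (10, 100, 1000):
    print(n, round(cluster_failure_probability(0.01, n), 3))
```

This is exactly why distributed systems treat failure as the normal case and build in replication and retries.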

Andrew Tanenbaum is considered a pioneer in this realm. For those who are not afraid of theory, I recommend his book Distributed Systems; it may seem difficult for beginners, but it will really help you brush up your skills.

I consider Designing Data-Intensive Applications by Martin Kleppmann to be the best introductory book. By the way, Martin has a wonderful blog. His work will help you systematize your knowledge about building a modern infrastructure for storing and processing big data.

For those who like watching videos, there is a course Distributed Computer Systems on Youtube.

7. Data Pipelines

Data pipelines are something you can’t live without as a Data Engineer.

Much of the time, a data engineer builds so-called data pipelines, that is, processes for delivering data from one place to another. These can be custom scripts that call an external service API or run a SQL query, enrich the data, and put it into centralized storage (a data warehouse) or storage for unstructured data (a data lake).
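The extract-transform-load shape of such a script can be sketched in plain Python. Everything here is invented for illustration: the "API response" is a hard-coded list, and a dict stands in for the warehouse table:

```python
def extract():
    # in a real pipeline this would call an external API or run a SQL query
    return [{"user": "alice", "amount": "30.5"},
            {"user": "bob", "amount": "bad_value"},
            {"user": "alice", "amount": "20.0"}]

def transform(rows):
    # clean and enrich: drop malformed records, cast types
    clean = []
    for row in rows:
        try:
            clean.append({"user": row["user"], "amount": float(row["amount"])})
        except ValueError:
            continue  # in production you would log or quarantine these rows
    return clean

def load(rows, warehouse):
    # aggregate into the "warehouse" (a dict standing in for a real table)
    for row in rows:
        warehouse[row["user"]] = warehouse.get(row["user"], 0.0) + row["amount"]
    return warehouse

warehouse = load(transform(extract()), {})
print(warehouse)  # → {'alice': 50.5} (bob's only row was malformed and dropped)
```

Orchestrators like Airflow exist precisely to schedule, retry, and monitor chains of steps like these.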

Summing it up: The Ultimate Data Engineer Checklist

To sum up, get a good understanding of:

  • Information systems;
  • Software engineering (Agile, DevOps, design techniques, SOA);
  • Distributed systems and parallel programming;
  • Database fundamentals: planning, design, operation, and troubleshooting;
  • Experiment design: A/B tests to prove the reliability and performance of "Proof of Concept" systems, and good pipelines for delivering solutions on the fly.

These are just a few of the requirements for becoming a data engineer. The more you study and understand about data systems, information systems, continuous delivery/deployment/integration, programming languages, and other computer science topics (not every domain area, of course), the better your skill set will be.

And finally, the last but very important thing I want to say.

The journey of becoming a Data Engineer is not as easy as it might seem. It is unforgiving and frustrating, and you have to be ready for that. Some moments on this journey will push you to throw in the towel. But this is true work and a true learning process.

Just don't sugarcoat it from the beginning. The whole point of the journey is to learn as much as you can and be prepared for new challenges.

Here’s a great visual I came across that illustrates this point really well:


And yes, don't forget to avoid burnout and get some rest. That is essential, too. Good luck!

