Engineering and Technology
Learn how to manipulate data and create machine learning feature sets in Spark using SQL in Python.
Deepen your knowledge of Apache Spark with this course on Spark SQL. Designed for people who already know SQL and want to put it to work in Apache Spark, this four-hour course covers advanced SQL features, including window functions, across four chapters. You will learn how to analyze time series data, extract the most common words from text documents, create feature sets from natural language text, and use logistic regression to predict the last word in a sentence.
The course begins with creating and querying a SQL table in Spark. You will then use SQL window functions to compute running sums, running differences, and other windowed operations. Next, you will apply window functions in Spark SQL to natural language processing, using a moving window analysis to find common word sequences.
Chapter 3 focuses on performance: caching DataFrames and SQL tables effectively, and using the Spark UI's SQL tab. You will also learn best practices for logging in Spark. In the final chapter, you will apply everything covered so far to load and tokenize raw text, extract word sequences, and train a logistic regression text classifier on raw natural language data. By the end of the course, you will understand how Spark SQL combines distributed computing with the simplicity of Python and SQL.
by DataCamp
Learn how to run big data analysis using Spark and the sparklyr package in R, and explore Spark MLlib...
by DataCamp
Learn the fundamentals of data visualization using spreadsheets.
by DataCamp
Master the basics of data analysis in R, including vectors, lists, and data frames, and practice R w...
by DataCamp
Master the basics of data analysis with Python in just four hours. This online course will introduce...
by DataCamp
Learn A/B testing, including hypothesis testing, experimental design, and confounding variables.
by DataCamp
Learn how to implement and schedule data engineering workflows.
by DataCamp
Learn statistical tests for identifying outliers and how to use sophisticated anomaly scoring algori...
by DataCamp
Learn about AWS Boto and harnessing cloud technology to optimize your data workflow.
by DataCamp
Bash scripting allows you to build analytics pipelines in the cloud and work with data stored across...