Introduction to Spark SQL in Python

Duration: 4 hours

Learn how to manipulate data and create machine learning feature sets in Spark using SQL in Python.

This four-hour course is for people who already know SQL and want to put it to work in Apache Spark. Across four chapters, it covers advanced SQL features, including window functions, and applies them to practical problems: analyzing time series data, extracting common words from text documents, creating feature sets from natural language text, and using logistic regression to predict the last word in a sentence.

The course begins with creating and querying an SQL table in Spark. You will then use SQL window functions to compute running sums, running differences, and other operations.

Next, you will apply window functions in Spark SQL to natural language processing, using a moving window analysis to identify common word sequences.

Chapter 3 focuses on performance: caching DataFrames and SQL tables effectively, with the help of the Spark UI and its SQL tab. You will also learn best practices for logging in Spark.

Finally, you will apply the skills from the earlier chapters to load and tokenize raw text and extract word sequences, then train a logistic regression text classifier on raw natural language data.

By the end of the course, you will understand how Spark SQL combines distributed computing with the simplicity of Python and SQL.

