Spark Interview Questions and Answers


Top Apache Spark Interview Questions and Answers that you should prepare for in 2017 to nail your next Apache Spark developer job interview.


Top 50 Spark Interview Questions and Answers

INTRODUCTION:

DeZyre has curated a list of the top 50 Apache Spark interview questions and answers that will help students and professionals nail a big data developer interview and bridge the talent supply gap for Spark developers across various industry segments.

Compare Spark vs Hadoop MapReduce:

Memory: Hadoop MapReduce does not leverage the memory of the Hadoop cluster to the maximum; Spark lets you keep data in memory with the use of RDDs.
Disk usage: MapReduce is disk-oriented; Spark caches data in-memory and ensures low latency.
Processing: MapReduce supports only batch processing; Spark also supports real-time processing through Spark Streaming.
Installation: MapReduce is bound to Hadoop; Spark is not bound to Hadoop.
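
To make the memory row concrete, here is a minimal sketch (not from the original slides; the local-mode setup and input path are assumptions) showing how caching lets Spark reuse an RDD in memory across actions instead of re-reading it from disk:

```scala
import org.apache.spark.sql.SparkSession

// Assumed local setup for illustration
val spark = SparkSession.builder.appName("cache-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Hypothetical input file
val logs = sc.textFile("/tmp/app.log")

// cache() marks the filtered RDD to be kept in memory
val errors = logs.filter(_.contains("ERROR")).cache()

// The first action materializes and caches the RDD; the second reuses it from memory
println(errors.count())
errors.take(5).foreach(println)

spark.stop()
```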

List some use cases where Spark outperforms Hadoop in processing. :

Sensor data processing: Apache Spark's in-memory computing works best here, as data needs to be retrieved and combined from different sources.
Real-time querying: Spark is preferred over Hadoop for real-time querying of data.
Stream processing: for processing logs and detecting fraud in live streams for alerts, Apache Spark is the best solution (a minimal streaming sketch follows this list).
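
As an illustration of the stream-processing use case, here is a minimal Spark Streaming (DStream) sketch; the socket source, host, port, and word count are assumptions standing in for a real log or transaction stream with its own detection rules:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stream-demo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// Placeholder source: lines arriving on a TCP socket
val lines = ssc.socketTextStream("localhost", 9999)

// A real fraud-detection job would apply alerting rules here;
// a word count stands in as the simplest stateless computation
lines.flatMap(_.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)
     .print()

ssc.start()
ssc.awaitTermination()
```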

What is a Sparse Vector? :

A sparse vector has two parallel arrays: one for indices and one for values. These vectors store only the non-zero entries, to save space.

For more Spark interview questions and answers: https://www.dezyre.com/article/top-50-spark-interview-questions-and-answers-for-2017/208
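
A short sketch of the two-parallel-array idea using MLlib's Vectors factory (the values are made up for illustration):

```scala
import org.apache.spark.mllib.linalg.Vectors

// Dense representation of (1.0, 0.0, 0.0, 3.0) stores every entry
val dense = Vectors.dense(1.0, 0.0, 0.0, 3.0)

// Sparse representation stores only the non-zero entries:
// size 4, the indices array (0, 3), and the parallel values array (1.0, 3.0)
val sparse = Vectors.sparse(4, Array(0, 3), Array(1.0, 3.0))
```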

Explain about transformations and actions in the context of RDDs.:

Transformations are lazily evaluated functions that produce a new RDD from an existing one; they do not execute until an action is called. Some examples of transformations include map, filter, and reduceByKey. Actions trigger the computation of an RDD and return results to the driver program on the local machine. Some examples of actions include reduce, collect, first, and take.

To read more about Spark RDDs: https://www.dezyre.com/article/working-with-spark-rdd-for-fast-data-processing/273
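
A minimal sketch (the local-mode setup and numbers are assumptions) showing that transformations only describe a new RDD, while actions trigger the computation and bring results back to the driver:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("rdd-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val nums = sc.parallelize(1 to 10)

// Transformations: lazily define new RDDs, nothing executes yet
val evens   = nums.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)

// Actions: trigger execution and return results to the driver
println(squares.collect().mkString(", "))  // 4, 16, 36, 64, 100
println(squares.reduce(_ + _))             // 220

spark.stop()
```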

Explain about the major libraries that constitute the Spark Ecosystem :

Spark MLlib: machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.
Spark Streaming: this library is used to process real-time streaming data.
Spark GraphX: Spark API for graph-parallel computations, with basic operators like joinVertices, subgraph, aggregateMessages, etc.
Spark SQL: helps execute SQL-like queries on Spark data, and integrates with standard visualization or BI tools (see the sketch after this list).

Read more about the Spark ecosystem and Spark components: https://www.dezyre.com/article/apache-spark-ecosystem-and-spark-components/219
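
As a quick illustration of the Spark SQL entry above, here is a sketch that registers a made-up DataFrame as a temporary view and queries it with SQL (the data and names are assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("sql-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val people = Seq(("Alice", 34), ("Bob", 45), ("Carol", 29)).toDF("name", "age")
people.createOrReplaceTempView("people")

// SQL-like query executed by Spark SQL
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

spark.stop()
```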

What are the common mistakes developers make when running Spark applications? :

Developers often make the mistake of:
Hitting the web service several times by using multiple clusters.
Running everything on the local node instead of distributing it.
Developers need to be careful with this, as Spark makes heavy use of memory for processing.

What is the advantage of a Parquet file? :

Parquet is a columnar format file that helps to:
Limit I/O operations.
Consume less space.
Fetch only the required columns.
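
A hedged sketch of writing and reading Parquet (the path and sample data are assumptions); selecting a single column on read is where the columnar format pays off, since Parquet can skip the other columns on disk:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("parquet-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("Alice", 34, "NY"), ("Bob", 45, "SF")).toDF("name", "age", "city")

// Write in the columnar Parquet format (hypothetical path)
df.write.mode("overwrite").parquet("/tmp/people.parquet")

// Reading only "name" lets Parquet fetch just that column, limiting I/O
spark.read.parquet("/tmp/people.parquet").select("name").show()

spark.stop()
```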

Is Apache Spark a good fit for Reinforcement learning? :

No. Apache Spark works well for simple machine learning algorithms like clustering, regression, and classification, but it is not a good fit for reinforcement learning.

What is the difference between persist() and cache()? :

persist() allows the user to specify the storage level, whereas cache() uses the default storage level.

For more Spark interview questions and answers: https://www.dezyre.com/article/top-50-spark-interview-questions-and-answers-for-2017/208
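
A minimal sketch of the difference (the local setup and data are assumptions); cache() is shorthand for persist() with the default storage level, while persist() accepts an explicit StorageLevel:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder.appName("persist-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// cache() uses the default storage level (MEMORY_ONLY for RDDs)
val cached = sc.parallelize(1 to 1000000).cache()

// persist() lets the user pick the storage level explicitly,
// e.g. spilling partitions that do not fit in memory to disk
val persisted = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_AND_DISK)

println(cached.count())
println(persisted.count())

spark.stop()
```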
