Which of the following code blocks reads JSON file imports.json into a DataFrame?
Correct Answer: D
Explanation Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/25.html, https://bit.ly/sparkpracticeexams_import_instructions)
Question 52
Which of the following statements about Spark's configuration properties is incorrect?
Correct Answer: D
Explanation The default number of partitions to use when shuffling data for joins or aggregations is 300.
No, the default value of the applicable property spark.sql.shuffle.partitions is 200.
The maximum number of tasks that an executor can process at the same time is controlled by the spark.executor.cores property.
Correct, see below.
The maximum number of tasks that an executor can process at the same time is controlled by the spark.task.cpus property.
Correct. The maximum number of tasks that an executor can process in parallel depends on both properties, spark.task.cpus and spark.executor.cores, because the number of available slots is calculated by dividing the number of cores per executor by the number of cores per task. For more information on this point specifically, check out Spark Architecture | Distributed Systems Architecture.
More info: Configuration - Spark 3.1.2 Documentation
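The slot calculation described above can be sketched in plain Python. This is only an illustration of the arithmetic, not Spark code: the function name slots_per_executor and the example settings are hypothetical, and integer division is assumed because a task cannot run on a fraction of its required cores.

```python
def slots_per_executor(executor_cores, task_cpus=1):
    """Return how many tasks one executor can run at the same time.

    executor_cores stands for spark.executor.cores and task_cpus for
    spark.task.cpus (whose default is 1). The function itself is a
    hypothetical helper, not part of any Spark API.
    """
    if executor_cores < 1 or task_cpus < 1:
        raise ValueError("both values must be positive integers")
    # Cores left over after integer division cannot host another task.
    return executor_cores // task_cpus


# Hypothetical example settings, keyed by the two property names.
conf = {"spark.executor.cores": 4, "spark.task.cpus": 2}
print(slots_per_executor(conf["spark.executor.cores"], conf["spark.task.cpus"]))  # 2
```

Changing either value changes the result, which is why both properties count as controlling the maximum number of parallel tasks.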
Question 53
Which of the following describes a valid concern about partitioning?
Correct Answer: A
Explanation A shuffle operation returns 200 partitions if not explicitly set.
Correct. 200 is the default value for the Spark property spark.sql.shuffle.partitions. This property determines how many partitions Spark uses when shuffling data for joins or aggregations.
The coalesce() method should be used to increase the number of partitions.
Incorrect. The coalesce() method can only be used to decrease the number of partitions.
Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
No. For narrow transformations, fewer partitions usually result in a longer overall runtime if more executors are available than partitions. A narrow transformation does not include a shuffle, so no data needs to be exchanged between executors. Shuffles are expensive and can be a bottleneck for executing Spark workloads. Narrow transformations, however, are executed on a per-partition basis, occupying one executor per partition. So it matters how many executors are available to perform work in parallel relative to the number of partitions. If the number of executors is greater than the number of partitions, some executors are idle while others process the partitions. On the flip side, if the number of executors is smaller than the number of partitions, the operation can only finish after some executors have processed multiple partitions, one after the other. To minimize the overall runtime, one would want the number of partitions to equal the number of executors (but not exceed it). So, for the scenario at hand, increasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
No data is exchanged between executors when coalesce() is run.
No. While coalesce() avoids a full shuffle, it may still cause a partial shuffle, resulting in data exchange between executors.
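The runtime argument for narrow transformations can be made concrete with a toy model. The sketch below is plain Python, not Spark, and rests on simplifying assumptions that are not in the original text: partitions are evenly sized, each executor processes one partition at a time, and scheduling overhead is ignored. The function name estimated_runtime and the numbers are hypothetical.

```python
import math


def estimated_runtime(total_work_seconds, num_partitions, num_executors):
    """Toy estimate of the wall-clock time of a narrow transformation.

    Assumptions (simplifications): evenly sized partitions, one
    partition per executor at a time, no scheduling overhead.
    """
    seconds_per_partition = total_work_seconds / num_partitions
    # Executors work through the partitions in "waves".
    waves = math.ceil(num_partitions / num_executors)
    return waves * seconds_per_partition


# 120 seconds of total work on 8 executors, with varying partition counts.
for partitions in (4, 8, 12):
    print(partitions, estimated_runtime(120, partitions, num_executors=8))
# 4 30.0   (4 executors idle, each partition is large)
# 8 15.0   (every executor busy for exactly one wave)
# 12 20.0  (a second wave is needed for the last 4 partitions)
```

Under these assumptions, going from 4 to 8 partitions halves the estimated runtime, which matches the reasoning above that more partitions help when executors would otherwise sit idle.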
Short partition processing times are indicative of low skew.
Incorrect. Data skew means that data is distributed unevenly over the partitions of a dataset. Low skew therefore means that data is distributed evenly. Partition processing time, the time that executors take to process partitions, can be indicative of skew if some executors take a long time to process a partition while others do not. However, a short processing time is not in itself indicative of low skew: it may simply be short because the partition is small. Low skew may be indicated when all executors finish processing their partitions in about the same time. High skew may be indicated by some executors taking much longer to finish their partitions than others. The answer does not make any such comparison, so by itself it does not provide enough information to make any assessment about skew.
More info: Spark Repartition & Coalesce - Explained and Performance Tuning - Spark 3.1.2 Documentation
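To illustrate why skew is a question of comparison rather than of absolute duration, here is a small sketch in plain Python. The ratio of the slowest partition to the mean is only one hypothetical indicator chosen for this example; it is not a metric defined by Spark, and the timings are made up.

```python
from statistics import mean


def skew_ratio(partition_millis):
    """Ratio of the slowest partition time to the mean partition time.

    A hypothetical indicator for illustration: values near 1 suggest
    evenly distributed work, larger values suggest skew.
    """
    return max(partition_millis) / mean(partition_millis)


# Made-up processing times in milliseconds, one entry per partition.
long_but_even = [9000, 9000, 9000, 9000]
short_but_skewed = [100, 100, 100, 500]

print(skew_ratio(long_but_even))     # 1.0
print(skew_ratio(short_but_skewed))  # 2.5
```

All timings in the second list are short in absolute terms, yet the comparison between partitions reveals the imbalance.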
Question 54
Which of the following code blocks creates a new one-column, two-row DataFrame dfDates with column date of type timestamp?
Correct Answer: C
Explanation This question is tricky. Two things are important to know here.
First, the syntax for createDataFrame: you need a list of tuples, like so: [(1,), (2,)]. To define a tuple with a single item in Python, you must put a comma after the item so that Python interprets it as a tuple and not as an ordinary parenthesized expression.
Second, you should understand the to_timestamp syntax. You can find out more about it in the documentation linked below.
For good measure, let's examine in detail why the incorrect options are wrong:
dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
This code snippet does everything the question asks for, except that the data type of the date column is string and not timestamp. When no schema is specified, Spark infers the column types from the data, and Python strings are inferred as the string type.
dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
dfDates = dfDates.withColumn("date", to_timestamp("dd/MM/yyyy HH:mm:ss", "date"))
In the first line of this code block, Spark throws the following error: TypeError: Can not infer schema for type: <class 'str'>. This is because Spark expects to find row information, but instead finds strings. This is why you need to specify the data as tuples. In addition, the arguments of to_timestamp are swapped: the column comes first, followed by the format. Fortunately, the Spark documentation (linked below) shows a number of examples for creating DataFrames that should help you get on the right track here.
dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
dfDates = dfDates.withColumnRenamed("date", to_timestamp("date", "yyyy-MM-dd HH:mm:ss"))
The issue with this answer is that the operator withColumnRenamed is used. This operator simply renames a column; it cannot modify the column's content. This is why withColumn should be used instead. In addition, the date format yyyy-MM-dd HH:mm:ss does not reflect the format of the actual timestamp: "23/01/2022 11:28:12".
dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
dfDates = dfDates.withColumnRenamed("date", to_datetime("date", "yyyy-MM-dd HH:mm:ss"))
Here, withColumnRenamed is used instead of withColumn (see above). In addition, the rows are not expressed correctly: they should be written as tuples, using parentheses and a trailing comma. Finally, the date format is also off here (see above).
More info: pyspark.sql.functions.to_timestamp - PySpark 3.1.2 documentation and pyspark.sql.SparkSession.createDataFrame - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 2
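The two points this explanation relies on, the single-item tuple syntax and a format pattern that matches the data, can be checked in plain Python without a Spark session. Note the assumption: the sketch uses Python's datetime.strptime with its own directives (%d/%m/%Y %H:%M:%S) as a stand-in for the Spark pattern dd/MM/yyyy HH:mm:ss; it does not call to_timestamp, and how Spark itself handles a mismatched pattern is not shown here.

```python
from datetime import datetime

# Without the trailing comma, the parentheses are just grouping.
not_a_tuple = ("23/01/2022 11:28:12")
one_tuple = ("23/01/2022 11:28:12",)
print(type(not_a_tuple).__name__)  # str
print(type(one_tuple).__name__)    # tuple

# A pattern that matches the day/month/year layout of the data
# (Python directives, used here in place of the Spark pattern).
parsed = datetime.strptime(one_tuple[0], "%d/%m/%Y %H:%M:%S")
print(parsed.isoformat())  # 2022-01-23T11:28:12

# A year-first pattern, like the one in the incorrect options,
# does not describe this string. Python raises ValueError.
try:
    datetime.strptime(one_tuple[0], "%Y-%m-%d %H:%M:%S")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)  # True
```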
Question 55
The code block shown below should write DataFrame transactionsDf to disk at path csvPath as a single CSV file, using tabs (\t characters) as separators between columns, expressing missing values as string n/a, and omitting a header row with column names. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__.write.__2__(__3__, "\t").__4__.__5__(csvPath)
Correct Answer: C
Explanation Correct code block:
transactionsDf.repartition(1).write.option("sep", "\t").option("nullValue", "n/a").csv(csvPath)
It is important here to understand that the question specifically asks for writing the DataFrame as a single CSV file. This should trigger you to think about partitions. By default, every partition is written as a separate file, so you need to include repartition(1) in your call. coalesce(1) works here, too!
Secondly, the question is very much an invitation to search through the parameters in the Spark documentation that work with DataFrameWriter.csv (link below). You will also need to know that you need an option() statement to apply these parameters. No header option is needed, because DataFrameWriter.csv does not write a header row by default.
The final concern is about the general call structure. Once you have accessed write on your DataFrame, the options follow, and then you write the DataFrame with csv. Instead of csv(csvPath), you could also use save(csvPath, format='csv') here.
More info: pyspark.sql.DataFrameWriter.csv - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
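To picture the file layout the question asks for (tab separators, n/a for missing values, no header row), here is a small sketch using Python's standard csv module. This is an analogy only: it does not use Spark's DataFrameWriter, and the sample rows are made up for illustration.

```python
import csv
import io

# Made-up rows; None marks a missing value.
rows = [(1, None, 3.5), (2, "abc", None)]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
for row in rows:
    # Replace missing values with the string n/a; no header row is written.
    writer.writerow(["n/a" if value is None else value for value in row])

output = buffer.getvalue()
print(repr(output))  # '1\tn/a\t3.5\n2\tabc\tn/a\n'
```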