2022 Associate-Developer-Apache-Spark exam torrent Associate-Developer-Apache-Spark Study Guide [Q38-Q54]


Exam Format and Content

  • Exam Length: 60 questions

  • Passing score: 70%

  • Exam Duration: 120 minutes

  • Exam Format: Multiple choice questions

  • Language: This exam is only available in the Python or Scala language.


How to Register for the Databricks Associate-Developer-Apache-Spark Exam

  • Go to create an account.

  • You can register for the exam by clicking the Register button.

  • The on-screen steps will show you how to arrange an exam with our partner.

  • You can see all the available certificate exams by clicking on the Certifications tab.


Understand why Databricks experts work only for top consulting companies

Many business owners are excited about the idea of using Databricks for their business. However, the fact is that Databricks experts can only be found at the top consulting companies. But why? What makes these companies so special? And what can you do to get the best Databricks experts for your business?

These consulting companies are the best because they have the most experienced and skilled consultants. They know how to use Databricks to help their clients. They also know the best practices for using Databricks. They know how to make the most out of Databricks. Databricks Associate Developer Apache Spark exam dumps are the best way to pass this exam.

Databricks is an amazing tool for data scientists. It is a great tool for data engineers as well. They are using it to build powerful applications for different industries. They are also using it to solve real-world problems. This is a great way to solve real-world problems. Data scientists are the most important people in any company. They are the ones who can bring the most value to any organization.

 

NEW QUESTION 38
Which of the following statements about broadcast variables is correct?

  • A. Broadcast variables are local to the worker node and not shared across the cluster.
  • B. Broadcast variables are commonly used for tables that do not fit into memory.
  • C. Broadcast variables are serialized with every single task.
  • D. Broadcast variables are occasionally dynamically updated on a per-task basis.
  • E. Broadcast variables are immutable.

Answer: E

Explanation:
Broadcast variables are local to the worker node and not shared across the cluster.
This is wrong because broadcast variables are meant to be shared across the cluster. As such, they are never just local to the worker node, but available to all worker nodes.
Broadcast variables are commonly used for tables that do not fit into memory.
This is wrong because broadcast variables can only be broadcast because they are small and do fit into memory.
Broadcast variables are serialized with every single task.
This is wrong because they are cached on every machine in the cluster, precisely so that they do not have to be serialized with every single task.
Broadcast variables are occasionally dynamically updated on a per-task basis.
This is wrong because broadcast variables are immutable - they are never updated.
More info: Spark - The Definitive Guide, Chapter 14

 

NEW QUESTION 39
The code block shown below should add column transactionDateForm to DataFrame transactionsDf. The column should express the unix-format timestamps in column transactionDate as string type like Apr 26 (Sunday). Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__, from_unixtime(__3__, __4__))

  • A. 1. withColumn
    2. "transactionDateForm"
    3. "MMM d (EEEE)"
    4. "transactionDate"
  • B. 1. select
    2. "transactionDate"
    3. "transactionDateForm"
    4. "MMM d (EEEE)"
  • C. 1. withColumnRenamed
    2. "transactionDate"
    3. "transactionDateForm"
    4. "MM d (EEE)"
  • D. 1. withColumn
    2. "transactionDateForm"
    3. "transactionDate"
    4. "MM d (EEE)"
  • E. 1. withColumn
    2. "transactionDateForm"
    3. "transactionDate"
    4. "MMM d (EEEE)"

Answer: E

Explanation:
Correct code block:
transactionsDf.withColumn("transactionDateForm", from_unixtime("transactionDate", "MMM d (EEEE)"))
The question specifically asks about "adding" a column. In the context of all presented answers, DataFrame.withColumn() is the correct command for this. In theory, DataFrame.select() could also be used for this purpose, if all existing columns are selected and a new one is added.
DataFrame.withColumnRenamed() is not the appropriate command, since it can only rename existing columns, but cannot add a new column or change the value of a column.
Once DataFrame.withColumn() is chosen, you can read in the documentation (see below) that the first input argument to the method should be the column name of the new column.
The final difficulty is the date format. The question indicates that the date format Apr 26 (Sunday) is desired. The answers give "MMM d (EEEE)" and "MM d (EEE)" as options. It can be hard to know the details of the date format that is used in Spark. Specifically, knowing the differences between MMM and MM is probably not something you deal with every day. But, there is an easy way to remember the difference: M (one letter) is usually the shortest form: 4 for April. MM includes padding: 04 for April. MMM (three letters) is the three-letter month abbreviation: Apr for April. And MMMM is the longest possible form: April. Knowing this four-letter sequence helps you select the correct option here.
More info: pyspark.sql.DataFrame.withColumn - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
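
The four-step month sequence described above can be checked outside Spark. The following is a minimal plain-Python sketch that imitates how the pattern letters M, MM, MMM and MMMM render a month. Note the assumptions: render_month is a hypothetical helper written for this illustration and is not part of Spark, the timestamp is an assumed sample value, and it is interpreted in UTC.

```python
import calendar
from datetime import datetime, timezone


def render_month(pattern: str, moment: datetime) -> str:
    """Imitate how the month letters M, MM, MMM and MMMM are rendered."""
    renderers = {
        "M": str(moment.month),                     # shortest form: 4
        "MM": f"{moment.month:02d}",                # zero-padded: 04
        "MMM": calendar.month_abbr[moment.month],   # abbreviation: Apr
        "MMMM": calendar.month_name[moment.month],  # full name: April
    }
    return renderers[pattern]


# Assumed sample value: 1587915332 seconds after the epoch, read as UTC.
moment = datetime.fromtimestamp(1587915332, tz=timezone.utc)

for pattern in ("M", "MM", "MMM", "MMMM"):
    print(pattern, "->", render_month(pattern, moment))

# "MMM d (EEEE)" combines the month abbreviation, the day of the month
# and the full weekday name.
weekday = calendar.day_name[moment.weekday()]
formatted = f"{render_month('MMM', moment)} {moment.day} ({weekday})"
print(formatted)
```

The same reasoning applies to the weekday letters: the shorter pattern EEE gives an abbreviated weekday name, while EEEE gives the full name that the question asks for.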

 

NEW QUESTION 40
Which of the following statements about data skew is incorrect?

  • A. In skewed DataFrames, the largest and the smallest partition consume very different amounts of memory.
  • B. To mitigate skew, Spark automatically disregards null values in keys when joining.
  • C. Salting can resolve data skew.
  • D. Broadcast joins are a viable way to increase join performance for skewed data over sort-merge joins.
  • E. Spark will not automatically optimize skew joins by default.

Answer: B

Explanation:
To mitigate skew, Spark automatically disregards null values in keys when joining.
This statement is incorrect, and thus the correct answer to the question. Joining keys that contain null values is of particular concern with regard to data skew.
In real-world applications, a table may contain a great number of records that do not have a value assigned to the column used as a join key. During the join, the data is at risk of being heavily skewed. This is because all records with a null-value join key are then evaluated as a single large partition, standing in stark contrast to the potentially diverse key values (and therefore small partitions) of the non-null-key records.
Spark specifically does not handle this automatically. However, there are several strategies to mitigate this problem like discarding null values temporarily, only to merge them back later (see last link below).
In skewed DataFrames, the largest and the smallest partition consume very different amounts of memory.
This statement is correct. In fact, having very different partition sizes is the very definition of skew. Skew can degrade Spark performance because the largest partition occupies a single executor for a long time. This blocks a Spark job and is an inefficient use of resources, since other executors that processed smaller partitions need to idle until the large partition is processed.
Salting can resolve data skew.
This statement is correct. The purpose of salting is to provide Spark with an opportunity to repartition data into partitions of similar size, based on a salted partitioning key.
A salted partitioning key typically is a column that consists of uniformly distributed random numbers. The number of unique entries in the partitioning key column should match your desired number of partitions. After repartitioning by the salted key, all partitions should have roughly the same size.
Spark does not automatically optimize skew joins by default.
This statement is correct. Automatic skew join optimization is a feature of Adaptive Query Execution (AQE).
In Spark 3.0 and 3.1, the versions referenced by the documentation linked here, AQE is disabled by default (from Spark 3.2.0 onward it is enabled by default). To enable it, Spark's spark.sql.adaptive.enabled configuration option needs to be set to true instead of leaving it at false.
To automatically optimize skew joins, Spark's spark.sql.adaptive.skewJoin.enabled option also needs to be set to true, which it is by default.
When skew join optimization is enabled, Spark recognizes skew joins and optimizes them by splitting the bigger partitions into smaller partitions which leads to performance increases.
Broadcast joins are a viable way to increase join performance for skewed data over sort-merge joins.
This statement is correct. Broadcast joins can indeed help increase join performance for skewed data, under some conditions. One of the DataFrames to be joined needs to be small enough to fit into each executor's memory, along with a partition from the other DataFrame. If this is the case, a broadcast join increases join performance over a sort-merge join.
The reason is that a sort-merge join with skewed data involves excessive shuffling. During shuffling, data is sent around the cluster, ultimately slowing down the Spark application. For skewed data, the amount of data, and thus the slowdown, is particularly big.
Broadcast joins, however, help reduce shuffling data. The smaller table is directly stored on all executors, eliminating a great amount of network traffic, ultimately increasing join performance relative to the sort-merge join.
It is worth noting that for optimizing skew join behavior it may make sense to manually adjust Spark's spark.sql.autoBroadcastJoinThreshold configuration property if the smaller DataFrame is bigger than the 10 MB set by default.
More info:
- Performance Tuning - Spark 3.0.0 Documentation
- Data Skew and Garbage Collection to Improve Spark Performance
- Section 1.2 - Joins on Skewed Data * GitBook
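
The effect of salting can be illustrated with a small simulation in plain Python. This is only a sketch under stated assumptions, not Spark code: the toy key list, the partition count of 8, and the crc32-based partition_by_key function are all made up for the illustration, and rows are placed directly by their salt value, which simplifies what a real repartition by a salted column would do.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8

# Assumed toy data: one "hot" join key dominates, the rest are unique.
keys = ["hot"] * 900 + [f"k{i}" for i in range(100)]


def partition_by_key(key: str) -> int:
    """Stand-in for hash partitioning: equal keys always land together."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# Without salting, every "hot" row ends up in the same partition.
unsalted_sizes = Counter(partition_by_key(key) for key in keys)

# With salting, each row gets a uniformly distributed random salt in
# [0, NUM_PARTITIONS) and is placed by that salt instead of by its key.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
salted_sizes = Counter(rng.randrange(NUM_PARTITIONS) for _ in keys)

print("largest partition without salt:", max(unsalted_sizes.values()))
print("largest partition with salt:   ", max(salted_sizes.values()))
```

Without the salt, the largest partition holds at least the 900 rows of the hot key. With the salt, the 1,000 rows spread over the 8 partitions in roughly equal shares.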

 

NEW QUESTION 41
Which of the following code blocks returns a DataFrame that is an inner join of DataFrame itemsDf and DataFrame transactionsDf, on columns itemId and productId, respectively and in which every itemId just appears once?

  • A. itemsDf.join(transactionsDf, "itemsDf.itemId==transactionsDf.productId", how="inner").dropDuplicates(["itemId"])
  • B. itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId).dropDuplicates("itemId")
  • C. itemsDf.join(transactionsDf, "itemsDf.itemId==transactionsDf.productId").distinct("itemId")
  • D. itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId, how="inner").distinct(["itemId"])
  • E. itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId).dropDuplicates(["itemId"])

Answer: E

Explanation:
Filtering out distinct rows based on columns is achieved with the dropDuplicates method, not the distinct method which does not take any arguments.
The second argument of the join() method only accepts strings if they are column names. The SQL-like statement "itemsDf.itemId==transactionsDf.productId" is therefore invalid.
In addition, it is not necessary to specify how="inner", since the default join type for the join command is already inner.
More info: pyspark.sql.DataFrame.join - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 42
Which of the following statements about the differences between actions and transformations is correct?

  • A. Actions can be queued for delayed execution, while transformations can only be processed immediately.
  • B. Actions do not send results to the driver, while transformations do.
  • C. Actions are evaluated lazily, while transformations are not evaluated lazily.
  • D. Actions generate RDDs, while transformations do not.
  • E. Actions can trigger Adaptive Query Execution, while transformation cannot.

Answer: E

Explanation:
Actions can trigger Adaptive Query Execution, while transformation cannot.
Correct. Adaptive Query Execution optimizes queries at runtime. Since transformations are evaluated lazily, Spark does not have any runtime information to optimize the query until an action is called. If Adaptive Query Execution is enabled, Spark will then try to optimize the query based on the feedback it gathers while it is evaluating the query.
Actions can be queued for delayed execution, while transformations can only be processed immediately.
No, there is no such concept as "delayed execution" in Spark. Actions cannot be evaluated lazily, meaning that they are executed immediately.
Actions are evaluated lazily, while transformations are not evaluated lazily.
Incorrect, it is the other way around: Transformations are evaluated lazily and actions trigger their evaluation.
Actions generate RDDs, while transformations do not.
No. Transformations change the data and, since RDDs are immutable, generate new RDDs along the way.
Actions produce outputs as Python objects or other data types (integers, lists, text files, ...) based on the RDDs, but they do not generate RDDs.
Here is a great tip on how to differentiate actions from transformations: If an operation returns a DataFrame, Dataset, or an RDD, it is a transformation. Otherwise, it is an action.
Actions do not send results to the driver, while transformations do.
No. Actions send results to the driver. Think about running DataFrame.count(). The result of this command will return a number to the driver. Transformations, however, do not send results back to the driver. They produce RDDs that remain on the worker nodes.
More info: What is the difference between a transformation and an action in Apache Spark? | Bartosz Mikulski, How to Speed up SQL Queries with Adaptive Query Execution
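
Lazy evaluation can be made tangible with Python generators, which behave in a loosely similar way. The sketch below is an analogy only and does not use Spark: the functions double and keep_large are hypothetical stand-ins for transformations, and sum() plays the role of an action.

```python
from typing import Iterable, Iterator, List

log: List[str] = []  # records when work is actually performed


def double(values: Iterable[int]) -> Iterator[int]:
    """A lazy "transformation": returns a generator, computes nothing yet."""
    for value in values:
        log.append(f"double({value})")
        yield value * 2


def keep_large(values: Iterable[int]) -> Iterator[int]:
    """Another lazy "transformation" chained onto the first."""
    for value in values:
        log.append(f"keep_large({value})")
        if value > 2:
            yield value


# Chaining the steps only describes the computation.
pipeline = keep_large(double([1, 2, 3]))
work_before_action = list(log)  # still empty: nothing has run

# The "action" consumes the pipeline and returns a plain Python value.
total = sum(pipeline)

print(work_before_action)
print(total, len(log))
```

Until sum() is called, the log stays empty. This mirrors the tip above: the chained steps return something that can be chained further, while the final step returns a plain value.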

 

NEW QUESTION 43
Which of the following describes a shuffle?

  • A. A shuffle is a Spark operation that results from DataFrame.coalesce().
  • B. A shuffle is a process that compares data across partitions.
  • C. A shuffle is a process that is executed during a broadcast hash join.
  • D. A shuffle is a process that compares data across executors.
  • E. A shuffle is a process that allocates partitions to executors.

Answer: B

Explanation:
A shuffle is a Spark operation that results from DataFrame.coalesce().
No. DataFrame.coalesce() does not result in a shuffle.
A shuffle is a process that allocates partitions to executors.
This is incorrect.
A shuffle is a process that is executed during a broadcast hash join.
No, broadcast hash joins avoid shuffles and yield performance benefits if at least one of the two tables is small in size (<= 10 MB by default). Broadcast hash joins can avoid shuffles because instead of exchanging partitions between executors, they broadcast a small table to all executors that then perform the rest of the join operation locally.
A shuffle is a process that compares data across executors.
No, in a shuffle, data is compared across partitions, and not executors.
More info: Spark Repartition & Coalesce - Explained (https://bit.ly/32KF7zS)

 

NEW QUESTION 44
The code block displayed below contains one or more errors. The code block should load parquet files at location filePath into a DataFrame, only loading those files that have been modified before 2029-03-20 05:44:46. Spark should enforce a schema according to the schema shown below. Find the error.
Schema:
root
 |-- itemId: integer (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- supplier: string (nullable = true)
Code block:
schema = StructType([
    StructType("itemId", IntegerType(), True),
    StructType("attributes", ArrayType(StringType(), True), True),
    StructType("supplier", StringType(), True)
])

spark.read.options("modifiedBefore", "2029-03-20T05:44:46").schema(schema).load(filePath)

  • A. The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.
  • B. Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.
  • C. The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.
  • D. Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.
  • E. Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.

Answer: E

Explanation:
Correct code block:
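from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

# Each column is a StructField; the third argument marks it as nullable.
schema = StructType([
    StructField("itemId", IntegerType(), True),
    StructField("attributes", ArrayType(StringType(), True), True),
    StructField("supplier", StringType(), True)
])
# modifiedBefore is passed as a keyword argument, and parquet() names the file format.
spark.read.options(modifiedBefore="2029-03-20T05:44:46").schema(schema).parquet(filePath)
This question is more difficult than what you would encounter in the exam. In the exam, for this question type, only one error needs to be identified and not "one or multiple" as in the question.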
Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.
Correct! Columns in the schema definition should use the StructField type. Building a schema from pyspark.sql.types, as here using classes like StructType and StructField, is one of multiple ways of expressing a schema in Spark. A StructType always contains a list of StructFields (see documentation linked below). So, nesting StructType and StructType as shown in the question is wrong.
The modification date threshold should be specified by a keyword argument like options(modifiedBefore="2029-03-20T05:44:46") and not two consecutive non-keyword arguments as in the original code block (see documentation linked below).
Spark cannot identify the file format correctly, because either it has to be specified by using the DataFrameReader.format(), as an argument to DataFrameReader.load(), or directly by calling, for example, DataFrameReader.parquet().
Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.
No. If StructField were used for the columns instead of StructType (see above), the third argument would specify whether the column is nullable. The original schema shows that columns should be nullable, and this is specified correctly by the third argument being True in the schema in the code block.
It is correct, however, that the modification date threshold is specified incorrectly (see above).
The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.
Wrong. The attributes array is specified correctly, following the syntax for ArrayType (see linked documentation below). That Spark cannot identify the file format is correct, see correct answer above. In addition, the DataFrameReader is called correctly through the SparkSession spark.
Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.
Incorrect. The columns in the schema definition do use the wrong object type (see the correct answer above), but the syntax of the call to Spark's DataFrameReader is correct.
The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.
False. The data type of the schema is StructType and an accepted data type for the DataFrameReader.schema() method. It is correct however that the modification date threshold is specified incorrectly (see correct answer above).

 

NEW QUESTION 45
The code block shown below should return a copy of DataFrame transactionsDf without columns value and productId and with an additional column associateId that has the value 5. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__, __3__).__4__(__5__, 'value')

  • A. 1. withColumn
    2. 'associateId'
    3. lit(5)
    4. drop
    5. 'productId'
  • B. 1. withColumn
    2. 'associateId'
    3. 5
    4. remove
    5. 'productId'
  • C. 1. withNewColumn
    2. associateId
    3. lit(5)
    4. drop
    5. productId
  • D. 1. withColumn
    2. col(associateId)
    3. lit(5)
    4. drop
    5. col(productId)
  • E. 1. withColumnRenamed
    2. 'associateId'
    3. 5
    4. drop
    5. 'productId'

Answer: A

Explanation:
Correct code block:
transactionsDf.withColumn('associateId', lit(5)).drop('productId', 'value')
For solving this question it is important that you know the lit() function (link to documentation below). This function enables you to add a column of a constant value to a DataFrame.
More info: pyspark.sql.functions.lit - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1

 

NEW QUESTION 46
Which of the following describes a narrow transformation?

  • A. A narrow transformation is an operation in which data is exchanged across partitions.
  • B. A narrow transformation is an operation in which data is exchanged across the cluster.
  • C. A narrow transformation is a process in which data from multiple RDDs is used.
  • D. A narrow transformation is a process in which 32-bit float variables are cast to smaller float variables, like 16-bit or 8-bit float variables.
  • E. A narrow transformation is an operation in which no data is exchanged across the cluster.

Answer: E

Explanation:
A narrow transformation is an operation in which no data is exchanged across the cluster.
Correct! In narrow transformations, no data is exchanged across the cluster, since these transformations do not require any data from outside of the partition they are applied on. Typical narrow transformations include filter, drop, and coalesce.
A narrow transformation is an operation in which data is exchanged across partitions.
No, that would be one definition of a wide transformation, but not of a narrow transformation. Wide transformations typically cause a shuffle, in which data is exchanged across partitions, executors, and the cluster.
A narrow transformation is an operation in which data is exchanged across the cluster.
No, see explanation just above this one.
A narrow transformation is a process in which 32-bit float variables are cast to smaller float variables, like 16-bit or 8-bit float variables.
No, type conversion has nothing to do with narrow transformations in Spark.
A narrow transformation is a process in which data from multiple RDDs is used.
No. A resilient distributed dataset (RDD) can be described as a collection of partitions. In a narrow transformation, no data is exchanged between partitions. Thus, no data is exchanged between RDDs.
One could say though that a narrow transformation and, in fact, any transformation results in a new RDD being created. This is because a transformation results in a change to an existing RDD (RDDs are the foundation of other Spark data structures, like DataFrames). But, since RDDs are immutable, a new RDD needs to be created to reflect the change caused by the transformation.
More info: Spark Transformation and Action: A Deep Dive | by Misbah Uddin | CodeX | Medium
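
The difference between narrow and wide transformations can be sketched with plain Python lists standing in for partitions. This is a simplified model, not Spark code: the toy data, the function names narrow_filter and wide_sum_by_key, and the partitioner based on ord() are assumptions made for the illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, int]

# Assumed toy dataset: two partitions of (key, value) rows.
partitions: List[List[Row]] = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]


def narrow_filter(parts: List[List[Row]]) -> List[List[Row]]:
    """Narrow: each output partition depends on exactly one input partition."""
    return [[row for row in part if row[1] > 1] for part in parts]


def wide_sum_by_key(parts: List[List[Row]], num_out: int) -> List[Dict[str, int]]:
    """Wide: rows with the same key must be brought together, so rows move
    between partitions (a simplified stand-in for a shuffle)."""
    out: List[Dict[str, int]] = [defaultdict(int) for _ in range(num_out)]
    for part in parts:
        for key, value in part:
            target = ord(key) % num_out  # toy partitioner for single-letter keys
            out[target][key] += value
    return [dict(d) for d in out]


filtered = narrow_filter(partitions)
summed = wide_sum_by_key(partitions, num_out=2)
print(filtered)
print(summed)
```

The filter never looks outside the partition it is working on. The sum by key has to combine the two "a" rows, which start out in different partitions, so data has to move.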

 

NEW QUESTION 47
Which of the following statements about Spark's DataFrames is incorrect?

  • A. Spark's DataFrames are immutable.
  • B. The data in DataFrames may be split into multiple chunks.
  • C. Data in DataFrames is organized into named columns.
  • D. RDDs are at the core of DataFrames.
  • E. Spark's DataFrames are equal to Python's DataFrames.

Answer: E

Explanation:
Spark's DataFrames are equal to Python's or R's DataFrames.
No, they are not equal. They are only similar. A major difference between Spark and Python is that Spark's DataFrames are distributed, whereas Python's are not.

 

NEW QUESTION 48
The code block displayed below contains an error. The code block should use Python method find_most_freq_letter to find the letter present most in column itemName of DataFrame itemsDf and return it in a new column most_frequent_letter. Find the error.
Code block:
find_most_freq_letter_udf = udf(find_most_freq_letter)
itemsDf.withColumn("most_frequent_letter", find_most_freq_letter("itemName"))

  • A. UDFs do not exist in PySpark.
  • B. Spark is not adding a column.
  • C. The "itemName" expression should be wrapped in col().
  • D. The UDF method is not registered correctly, since the return type is missing.
  • E. Spark is not using the UDF method correctly.

Answer: E

Explanation:
Correct code block:
find_most_freq_letter_udf = udf(find_most_freq_letter)
itemsDf.withColumn("most_frequent_letter", find_most_freq_letter_udf("itemName"))
Spark should use the previously registered find_most_freq_letter_udf method here, but the original code block does not do that. There, it just uses the non-UDF version of the Python method.
Note that typically we would have to specify a return type for udf(). This case is an exception, since the default return type of udf() is a string, which is what we expect here. If we wanted to return an integer variable instead, we would have to register the Python function as UDF using find_most_freq_letter_udf = udf(find_most_freq_letter, IntegerType()).
More info: pyspark.sql.functions.udf - PySpark 3.1.1 documentation
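
The question does not show the body of find_most_freq_letter. The following is one hypothetical implementation in plain Python, given only to make the example concrete. The handling of case, non-letter characters and empty input is an assumption, not something the question specifies.

```python
from collections import Counter
from typing import Optional


def find_most_freq_letter(name: Optional[str]) -> str:
    """Hypothetical implementation: return the most common letter in name.

    Case is ignored and non-letters are skipped. Returns an empty string
    if the input is missing or contains no letters.
    """
    letters = [ch for ch in (name or "").lower() if ch.isalpha()]
    if not letters:
        return ""
    return Counter(letters).most_common(1)[0][0]


print(find_most_freq_letter("banana"))
```

A function like this returns a string, which is why wrapping it with udf() without an explicit return type is enough in the correct code block above.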

 

NEW QUESTION 49
Which of the following code blocks immediately removes the previously cached DataFrame transactionsDf from memory and disk?

  • A. transactionsDf.persist()
  • B. del transactionsDf
  • C. transactionsDf.clearCache()
  • D. array_remove(transactionsDf, "*")
  • E. transactionsDf.unpersist()

Answer: E

Explanation:
transactionsDf.unpersist()
Correct. The DataFrame.unpersist() command does exactly what the question asks for - it removes all cached parts of the DataFrame from memory and disk.
del transactionsDf
False. While this option can help remove the DataFrame from memory and disk, it does not do so immediately. The reason is that this command just notifies the Python garbage collector that the transactionsDf now may be deleted from memory. However, the garbage collector does not do so immediately and, if you wanted it to run immediately, would need to be specifically triggered to do so. Find more information linked below.
array_remove(transactionsDf, "*")
Incorrect. The array_remove method from pyspark.sql.functions is used for removing elements from arrays in columns that match a specific condition. Also, the first argument would be a column, and not a DataFrame as shown in the code block.
transactionsDf.persist()
No. This code block does exactly the opposite of what is asked for: It caches (writes) DataFrame transactionsDf to memory and disk. Note that even though you do not pass in a specific storage level here, Spark will use the default storage level (MEMORY_AND_DISK).
transactionsDf.clearCache()
Wrong. Spark's DataFrame does not have a clearCache() method.
More info: pyspark.sql.DataFrame.unpersist - PySpark 3.1.2 documentation, python - How to delete an RDD in PySpark for the purpose of releasing resources? - Stack Overflow
Static notebook | Dynamic notebook: See test 3
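
That del acts on a Python name rather than directly on the underlying data can be shown without Spark. The sketch below uses a hypothetical CachedTable class as a stand-in for any Python object. It says nothing about Spark's cache itself, for which unpersist() remains the call described above.

```python
import weakref


class CachedTable:
    """Hypothetical stand-in for any Python object holding resources."""


table = CachedTable()
alias = table               # a second name bound to the same object
probe = weakref.ref(table)  # lets us check whether the object still exists

del table                   # removes only the name "table"
still_alive = probe() is not None

del alias                   # the last strong reference is gone
# CPython frees the object right away via reference counting; other
# interpreters may wait for a garbage-collection pass.
print(still_alive)
```

After the first del, the object is still reachable through alias, so nothing has been freed yet.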

 

NEW QUESTION 50
Which of the following code blocks returns a DataFrame with an added column to DataFrame transactionsDf that shows the unix epoch timestamps in column transactionDate as strings in the format month/day/year in column transactionDateFormatted?
Excerpt of DataFrame transactionsDf:

  • A. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))
  • B. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))
  • C. transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))
  • D. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))
  • E. transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")

Answer: A

Explanation:
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))
Correct. This code block adds a new column with the name transactionDateFormatted to DataFrame transactionsDf, using Spark's from_unixtime method to transform values in column transactionDate into strings, following the format requested in the question.
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))
No. Although almost correct, this uses the wrong format for the timestamp to date conversion: day/month/year instead of month/day/year.
transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))
Incorrect. This answer uses the wrong syntax. DataFrame.withColumnRenamed() only renames an existing column and takes exactly two string parameters: the old and the new name of the column.
transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")
Wrong. Although this answer looks very tempting, it is actually incorrect Spark syntax. In Spark, there is no method DataFrame.apply(). Spark has an apply() method that can be used on grouped data, but this is irrelevant for this question, since we do not deal with grouped data here.
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))
No. Although this is valid Spark syntax, the strings in column transactionDateFormatted would look like 2020-04-26 15:35:32, which is the default format of from_unixtime and not what the question asks for.
More info: pyspark.sql.functions.from_unixtime - PySpark 3.1.1 documentation and pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
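
The three output styles discussed above can be compared side by side in plain Python. This sketch uses strftime codes in place of Spark's pattern letters, so it is an analogy rather than Spark code. The timestamp is an assumed sample value, and it is read as UTC, whereas Spark would apply the session time zone.

```python
from datetime import datetime, timezone

# Assumed sample value: 1587915332 seconds after the epoch, read as UTC.
moment = datetime.fromtimestamp(1587915332, tz=timezone.utc)

# strftime codes playing the role of the Spark patterns in the answers.
month_day_year = moment.strftime("%m/%d/%Y")          # like "MM/dd/yyyy"
day_month_year = moment.strftime("%d/%m/%Y")          # like "dd/MM/yyyy"
default_style = moment.strftime("%Y-%m-%d %H:%M:%S")  # like "yyyy-MM-dd HH:mm:ss"

print(month_day_year)
print(day_month_year)
print(default_style)
```

Only the first string matches the month/day/year layout that the question asks for.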

 

NEW QUESTION 51
Which of the following code blocks reads in the parquet file stored at location filePath, given that all columns in the parquet file contain only whole numbers and are stored in the most appropriate format for this kind of data?

  • A. spark.read.schema([
        StructField("transactionId", IntegerType(), True),
        StructField("predError", IntegerType(), True)
    ]).load(filePath, format="parquet")
  • B. spark.read.schema([
        StructField("transactionId", NumberType(), True),
        StructField("predError", IntegerType(), True)
    ]).load(filePath)
  • C. spark.read.schema(
        StructType([
            StructField("transactionId", IntegerType(), True),
            StructField("predError", IntegerType(), True)]
    )).format("parquet").load(filePath)
  • D. spark.read.schema(
        StructType(
            StructField("transactionId", IntegerType(), True),
            StructField("predError", IntegerType(), True)
    )).load(filePath)
  • E. spark.read.schema(
        StructType([
            StructField("transactionId", StringType(), True),
            StructField("predError", IntegerType(), True)]
    )).parquet(filePath)

Answer: C

Explanation:
The schema passed into schema should be of type StructType or a string, so all entries in which a list is passed are incorrect.
In addition, since all numbers are whole numbers, the IntegerType() data type is the correct option here.
NumberType() is not a valid data type and StringType() would fail, since the parquet file is stored in the "most appropriate format for this kind of data", meaning that it is most likely an IntegerType, and Spark does not convert data types if a schema is provided.
Also note that StructType accepts only a single argument (a list of StructFields). So, passing multiple arguments is invalid.
Finally, Spark needs to know which format the file is in. However, all of the options listed are valid here, since Spark assumes parquet as a default when no file format is specifically passed.
More info: pyspark.sql.DataFrameReader.schema - PySpark 3.1.2 documentation and StructType - PySpark 3.1.2 documentation

 

NEW QUESTION 52
In which order should the code blocks shown below be run in order to assign articlesDf a DataFrame that lists all items in column attributes ordered by the number of times these items occur, from most to least often?
Sample of DataFrame articlesDf:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+

  1. articlesDf = articlesDf.groupby("col")
  2. articlesDf = articlesDf.select(explode(col("attributes")))
  3. articlesDf = articlesDf.orderBy("count").select("col")
  4. articlesDf = articlesDf.sort("count",ascending=False).select("col")
  5. articlesDf = articlesDf.groupby("col").count()

  • A. 2, 5, 4
  • B. 2, 5, 3
  • C. 5, 2
  • D. 2, 3, 4
  • E. 4, 5

Answer: A

Explanation:
Correct code block:
articlesDf = articlesDf.select(explode(col('attributes')))
articlesDf = articlesDf.groupby('col').count()
articlesDf = articlesDf.sort('count',ascending=False).select('col')
Output of correct code block:
+-------+
|    col|
+-------+
| summer|
| winter|
|   blue|
|   cozy|
| travel|
|  fresh|
|    red|
|cooling|
|  green|
+-------+
Static notebook | Dynamic notebook: See test 2
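
The logic of the three steps (explode, group and count, sort in descending order) can be retraced in plain Python on the sample data. This is a sketch of the idea using collections.Counter, not Spark code, and the variable names are made up for the illustration. The order among items with equal counts may differ from Spark's output.

```python
from collections import Counter

# The attributes column from the sample of articlesDf.
attributes_column = [
    ["blue", "winter", "cozy"],
    ["red", "summer", "fresh", "cooling"],
    ["green", "summer", "travel"],
]

# Step 2 (explode): one entry per array element.
exploded = [item for attributes in attributes_column for item in attributes]

# Step 5 (groupby + count): occurrences per distinct item.
counts = Counter(exploded)

# Step 4 (sort descending, keep only the item): most frequent first.
ranked = [item for item, _ in counts.most_common()]

print(ranked)
```

Only summer occurs twice, so it comes first. The remaining eight items each occur once.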

 

NEW QUESTION 53
Which of the following code blocks reads in the JSON file stored at filePath as a DataFrame?

  • A. spark.read.path(filePath)
  • B. spark.read.json(filePath)
  • C. spark.read().path(filePath)
  • D. spark.read().json(filePath)
  • E. spark.read.path(filePath, source="json")

Answer: B

Explanation:
spark.read.json(filePath)
Correct. spark.read accesses Spark's DataFrameReader. Then, Spark identifies the file type to be read as JSON type by passing filePath into the DataFrameReader.json() method.
spark.read.path(filePath)
Incorrect. Spark's DataFrameReader does not have a path method. A universal way to read in files is provided by the DataFrameReader.load() method (link below).
spark.read.path(filePath, source="json")
Wrong. A DataFrameReader.path() method does not exist (see above).
spark.read().json(filePath)
Incorrect. spark.read is a way to access Spark's DataFrameReader. However, the DataFrameReader is not callable, so calling it via spark.read() will fail.
spark.read().path(filePath)
No, Spark's DataFrameReader is not callable (see above).
More info: pyspark.sql.DataFrameReader.json - PySpark 3.1.2 documentation, pyspark.sql.DataFrameReader.load - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3

 

NEW QUESTION 54
......
