All Questions
Tagged with scala, apache-spark · 24,266 questions
0 votes · 1 answer · 33 views
Spark-Scala vs Pyspark Dag is different?
I am converting a PySpark job to Scala; the jobs execute on EMR. The parameters, data, and code are the same. However, the run time is different, and so is the DAG that gets created. Here I ...
1 vote · 0 answers · 19 views
Encrypt Spark Libsvm Dataframe
I have a libsvm file that I want to load into Spark and then encrypt it. I want to iterate over every element in the features to apply my encrypt function, but there doesn't seem to be any way to ...
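A minimal sketch of one way to approach this, assuming a hypothetical element-wise encrypt function (standing in for the questioner's real routine): load the libsvm file, then apply a UDF that maps encrypt over every value of each feature vector.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("encrypt-libsvm").getOrCreate()
import spark.implicits._

// Hypothetical placeholder for the real encryption function.
def encrypt(x: Double): Double = x + 1.0

// UDF that applies encrypt to every element of a feature vector.
val encryptVec = udf { v: Vector => Vectors.dense(v.toArray.map(encrypt)) }

val df = spark.read.format("libsvm").load("data.libsvm")
val encrypted = df.withColumn("features", encryptVec($"features"))
```

Note that encrypted values generally break the sparsity and numeric semantics the ML vector type assumes, so storing the result as an array column may be more appropriate than a dense Vector.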
0 votes · 1 answer · 17 views
Adding new Rows to Spark Partition while using forEachPartition
I am trying to add a new Row to each Partition in my Spark Job. I am using the following code to achieve this:
StructType rowType = new StructType();
rowType.add(DataTypes.createStructField("...
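Two things are worth noting here. In the Java API, StructType.add returns a new StructType rather than mutating in place, so its result must be reassigned. More fundamentally, foreachPartition is an action and cannot add rows to the output; mapPartitions, a transformation, can. A sketch in Scala, assuming the goal is to emit one extra row per partition:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val ds = spark.range(0, 10, 1, numPartitions = 2).map(_.toString)

// Append a sentinel row at the end of each partition's iterator.
val withExtra = ds.mapPartitions(it => it ++ Iterator("EXTRA_ROW"))
```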
0 votes · 0 answers · 21 views
Scala Spark Dataframe creation from Seq of tuples doesn't work in Scala 3, but does in Scala 2
When trying to test something locally with Scala Spark, I noticed the following problem and was wondering what causes it, and whether there exists a workaround.
Consider the following build ...
-1 votes · 0 answers · 42 views
Using spark 3.4.1 lib in Java when extending StringRegexExpression to a java class
I am using Spark 3.4.1 in a Maven project where I have also configured Scala (2.13.8). I am trying to create a class Like.java in the project by extending Spark's StringRegexExpression
package com....
1 vote · 1 answer · 27 views
Can I use same SparkSession in different threads
In my Spark app I use many temp views to read datasets and then use them in a huge SQL expression, like this:
for (view <- cfg.views)
spark.read.format(view.format).load(view.path).createTempView(view....
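The short answer is yes: SparkSession is thread-safe, and temp views are scoped to the session, so views registered from one thread are visible to all others sharing that session. A sketch, with a hypothetical ViewCfg standing in for cfg.views:

```scala
import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

val spark = SparkSession.builder().getOrCreate()

// Hypothetical config type standing in for the questioner's cfg.views.
case class ViewCfg(name: String, format: String, path: String)
val views = Seq(ViewCfg("a", "parquet", "/data/a"),
                ViewCfg("b", "parquet", "/data/b"))

// Register views concurrently; the shared session sees them all.
val futures = views.map { v =>
  Future {
    spark.read.format(v.format).load(v.path).createOrReplaceTempView(v.name)
  }
}
Await.result(Future.sequence(futures), Duration.Inf)
```

createOrReplaceTempView avoids failures if two threads race on the same view name.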
1 vote · 0 answers · 27 views
Spark scala transformations
I have a Spark input dataframe like below.

Emp_ID  Cricket  Chess  Swim
11      Y        N      N
12      Y        Y      Y
13      N        N      Y

I need an output dataframe like below.

Hobbies  Emp_id_list
Cricket  11,12
Chess    12
Swim     12,13
Any way to ...
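One way to sketch this, assuming the three hobby columns shown: unpivot them into (Hobbies, flag) rows with the stack generator, keep the "Y" rows, and collect the ids per hobby.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, concat_ws, expr}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = Seq(
  (11, "Y", "N", "N"),
  (12, "Y", "Y", "Y"),
  (13, "N", "N", "Y")
).toDF("Emp_ID", "Cricket", "Chess", "Swim")

// Unpivot hobby columns to rows, filter to "Y", group by hobby,
// and join the employee ids into a comma-separated string.
val out = df
  .select($"Emp_ID", expr(
    "stack(3, 'Cricket', Cricket, 'Chess', Chess, 'Swim', Swim) as (Hobbies, flag)"))
  .filter($"flag" === "Y")
  .groupBy($"Hobbies")
  .agg(concat_ws(",", collect_list($"Emp_ID")).as("Emp_id_list"))
```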
-1 votes · 0 answers · 26 views
udf to transform a json string into multiple rows based on first level of nesting
I am trying to transform a df based on the first level of nesting in the JSON string.
input dataframe
+------+------------------------------------+---------------------------------------------------------...
0 votes · 1 answer · 51 views
spark.sql() giving error : org.apache.spark.sql.catalyst.parser.ParseException: Syntax error at or near '('(line 2, pos 52)
I have a class LowerCaseColumn.scala in which one function is defined as below:
override def registerSQL(): Unit = spark.sql(
"""
|CREATE OR REPLACE TEMPORARY ...
0 votes · 1 answer · 59 views · +50 bounty
How to create data-frame on rocks db (SST files)
We hold our documents in RocksDB and will be syncing the RocksDB SST files to S3. I would like to create a dataframe on the SST files and later run SQL. When I googled, I was not able to find any ...
0 votes · 0 answers · 22 views
Flattening nested json with back slash in apache spark scala Dataframe
{
"messageBody": "{\"task\":{\"taskId\":\"c6d9fb0e-42ba-4a3e-bd39-f2a32a6958c1\",\"serializedTaskData\":\"{\\\"clientId\\\":\\\&...
0 votes · 0 answers · 33 views
Spark : Read special characters from the content of dat file without corrupting it in scala
I have to read all the special characters in a dat file (e.g. testdata.dat) without corrupting them and load them into a dataframe in Scala using Spark.
I have one dat file (eg - testdata.dat),...
1 vote · 0 answers · 31 views
Creating a custom aggregator in spark with window rowsBetween?
What I'm trying to do is use a window function to get the last and current row and do some computation on a couple of the columns with a custom aggregator. I have time series data with points that are ...
1 vote · 0 answers · 34 views
More Parallelism Than Expected in Glue ETL Spark Job
I am using Glue ETL Spark jobs to run some tests. I am trying to understand why I am getting more parallel processing than the available cores on a single executor.
Here's my job config:
I am setting ...
0 votes · 2 answers · 38 views
Determine if a condition is ever true in an aggregated dataset with Scala spark sql library
I'm trying to aggregate a dataset and determine if a condition is ever true for a row in the dataset.
Suppose I have a dataset with these values
cust_id  travel_type  distance_travelled
1        car          10
1        ...