Consider the two data files (users.csv, transactions.csv):

The users file has the following fields:

a) UserID

b) EmailID

c) NativeLanguage

d) Location

The transactions file has the following fields:

a) Transaction_ID

b) Product_ID

c) UserID

d) Price

e) Product_Description
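
The order of these fields matters, because a line split on commas is addressed by position (index 0 is the first field). A minimal sketch in plain Python, with no Spark required, assuming both files are comma-separated, have no header row, and list the fields in the order given above. The names `User`, `Transaction`, `parse_user` and `parse_transaction` are hypothetical and only serve the illustration:

```python
from collections import namedtuple

# Field order follows the lists above; the type names are illustrative only.
User = namedtuple("User", ["user_id", "email_id", "native_language", "location"])
Transaction = namedtuple(
    "Transaction",
    ["transaction_id", "product_id", "user_id", "price", "product_description"],
)

def parse_user(line):
    """Split one comma-separated users line into a User record."""
    fields = [field.strip() for field in line.split(",")]
    return User(*fields)

def parse_transaction(line):
    """Split one comma-separated transactions line; Price becomes a float."""
    tid, pid, uid, price, desc = [field.strip() for field in line.split(",")]
    return Transaction(tid, pid, uid, float(price), desc)

user = parse_user("1,u..1@company.com,ES,MX")
txn = parse_transaction("1,1004,19,129,whatchamacallit")

print(user.location)            # same value as index 3 of the split line
print(txn.user_id, txn.price)   # index 2 and index 3 of the split line
```

Named fields make it harder to mix up `UserID` (index 0 in the users file, index 2 in the transactions file) when the two files are joined.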

Using Spark, find:

a) The count of unique locations in which each product is sold.

b) The products bought by each user.

c) The total spending by each user on each product.

Transactions (columns in order: Transaction_ID, Product_ID, UserID, Price, Product_Description)

1 1004 19 129 whatchamacallit

2 1001 10 99 thingamajig

3 1004 17 129 whatchamacallit

4 1001 9 99 thingamajig

5 1003 3 89 gadget

6 1002 19 149 gizmo

7 1002 30 149 gizmo

8 1002 26 149 gizmo

9 1001 22 99 thingamajig

10 1003 6 89 gadget

11 1004 1 129 whatchamacallit

12 1004 2 129 whatchamacallit

13 1005 5 199 doohickey

14 1004 7 129 whatchamacallit

15 1002 16 149 gizmo

Users (columns in order: UserID, EmailID, NativeLanguage, Location)

1 u..1@company.com ES MX

2 u..4@domain.com EN US

3 u..5@company.com FR FR

4 u..9@site.org HI IN

5 u..2@service.io EN CA

6 u..7@website.net FR FR

7 u..1@company.com FR FR

8 u..5@company.com FR FR

9 u..7@school.edu ES MX

10 u..1@website.net EN CA

11 u..6@website.net FR FR

12 u..9@domain.com FR FR

13 u..1@company.com ES MX

14 u..5@domain.com HI IN

15 u..8@site.org ES MX

16 u..3@school.edu EN US

17 u..7@school.edu ES MX

18 u..9@website.net HI IN

19 u..4@school.edu EN US

20 u..7@domain.com HI IN

21 u..8@site.org EN US

22 u..1@domain.com ES MX

23 u..4@service.io EN US

24 u..9@website.net ES MX

25 u..1@site.org EN US

26 u..5@service.io HI IN

27 u..9@service.io EN CA

28 u..1@company.com EN CA

29 u..6@site.org ES MX

30 u..9@website.net EN US
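
The listings above are shown with spaces between fields, while the files are named `.csv`. If the data is copied from this page, it needs to be converted to comma-separated text before code that splits on `","` can read it. A minimal sketch, assuming no field contains whitespace (true for the rows shown); `listing_to_csv` is a hypothetical helper name, and only the first two rows of each listing are used as an excerpt:

```python
import csv
import io

def listing_to_csv(listing):
    """Convert whitespace-separated rows into comma-separated text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in listing.splitlines():
        fields = line.split()
        if fields:  # skip the blank lines between rows
            writer.writerow(fields)
    return out.getvalue()

# Excerpts copied from the listings above (first two rows of each).
transactions_excerpt = """
1 1004 19 129 whatchamacallit

2 1001 10 99 thingamajig
"""

users_excerpt = """
1 u..1@company.com ES MX

2 u..4@domain.com EN US
"""

transactions_csv = listing_to_csv(transactions_excerpt)
users_csv = listing_to_csv(users_excerpt)

print(transactions_csv)
print(users_csv)
```

Writing the returned strings to `users.csv` and `transactions.csv` would give files with no header row, one record per line.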

Answer:

Below is the PySpark code for the three questions above:

# Establish the PySpark connection
from operator import add

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("Assignment1")
sc = SparkContext(conf=conf)

# Read the CSV files (assumed comma-separated, with no header row)
users = sc.textFile('file:///home/cloudera/users.csv')
transactions = sc.textFile('file:///home/cloudera/transactions.csv')

# Split each line into its fields once and reuse the result
user_fields = users.map(lambda l: l.split(","))
transaction_fields = transactions.map(lambda l: l.split(","))

# Select specific columns from the files
rdd_user = user_fields.map(lambda l: (l[0], l[3]))                # (UserID, Location)
rdd_transaction = transaction_fields.map(lambda l: (l[2], l[4]))  # (UserID, Product_Description)

# Swap a (location, product) pair into a (product, location) key-value pair
def generatekeyvalue(a):
    key = a[1]
    value = a[0]
    return (key, value)

# Question 1 a)
# Join the two RDDs on UserID, keep the distinct (product, location) pairs,
# then count the locations per product
merged_rdd = rdd_user.join(rdd_transaction).values().map(generatekeyvalue).distinct().countByKey()
print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
print(dict(merged_rdd))
print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")

# Question 1 b)
# Group the products by UserID; distinct() lists a repeat purchase only once
products = rdd_transaction.distinct().groupByKey()
print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
print([(user_id, list(items)) for user_id, items in products.collect()])
print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")

# Question 1 c)
# Key every transaction by (UserID, Product_ID) with its price as the value.
# The original version only listed (user, (product, price)) pairs and never
# summed them, so repeat purchases were not totalled.
def getmappings(l):
    product_id = l[1]
    user_id = l[2]
    product_price = float(l[3])  # assumes the Price field is numeric
    return ((user_id, product_id), product_price)

# Sum the prices per (UserID, Product_ID) to get the total spending.
# Keys are strings, so sortByKey() orders them lexicographically.
total_spendings = transaction_fields.map(getmappings).reduceByKey(add).sortByKey().collect()
print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
print(total_spendings)
print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
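
As a sanity check on what the three results should look like for the sample data in the question, here is a plain-Python sketch that does not use Spark. It hard-codes the 15 transactions and the locations of only those users who appear in a transaction, both copied from the listings. The variable names are illustrative, and products are keyed by Product_ID:

```python
from collections import defaultdict

# (Transaction_ID, Product_ID, UserID, Price) from the Transactions listing
transactions = [
    (1, 1004, 19, 129), (2, 1001, 10, 99), (3, 1004, 17, 129),
    (4, 1001, 9, 99), (5, 1003, 3, 89), (6, 1002, 19, 149),
    (7, 1002, 30, 149), (8, 1002, 26, 149), (9, 1001, 22, 99),
    (10, 1003, 6, 89), (11, 1004, 1, 129), (12, 1004, 2, 129),
    (13, 1005, 5, 199), (14, 1004, 7, 129), (15, 1002, 16, 149),
]

# UserID -> Location, restricted to users who appear in a transaction
locations = {
    1: "MX", 2: "US", 3: "FR", 5: "CA", 6: "FR", 7: "FR", 9: "MX",
    10: "CA", 16: "US", 17: "MX", 19: "US", 22: "MX", 26: "IN", 30: "US",
}

locations_per_product = defaultdict(set)  # question a)
products_per_user = defaultdict(set)      # question b)
spending = defaultdict(int)               # question c)

for _tid, product_id, user_id, price in transactions:
    locations_per_product[product_id].add(locations[user_id])
    products_per_user[user_id].add(product_id)
    spending[(user_id, product_id)] += price

location_counts = {p: len(locs) for p, locs in locations_per_product.items()}

print(sorted(location_counts.items()))
# [(1001, 2), (1002, 2), (1003, 1), (1004, 3), (1005, 1)]
print(sorted(products_per_user[19]))
# [1002, 1004]
print(spending[(19, 1002)], spending[(19, 1004)])
# 149 129
```

In this sample every (user, product) pair occurs in exactly one transaction, so each total equals a single price. With repeat purchases the totals would differ from the listed prices, which is why question c) calls for a sum per pair.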
