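Published: 2023-12-22 10:30
Requirement: count how many times each word occurs in a collection, then print the three most frequent words.
The collection has the following form: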
val stringList: List[String] = List("hello world", "beautiful city", "I a chinese", "you is a good man", "I is a kind woman", "thanks for you", "forgive you", "hello world,I am student in a beautiful city", "whatever I will keep happy")
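1. Split each line on spaces to get the individual words; each line yields an array of strings.
2. Flatten the arrays of strings into a single list of words.
3. Group identical words together.
4. Count the words in each group.
5. Sort by count in descending order and take the first three.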
// Simple version: word count. Count how many times each word occurs in the collection.
val stringList: List[String] = List(
  "hello world",
  "beautiful city",
  "I a chinese",
  "you is a good man",
  "I is a kind woman",
  "thanks for you",
  "forgive you",
  "hello world,I am student in a beautiful city",
  "whatever I will keep happy"
)
// 1. Split each line on spaces
val stringSplit: List[Array[String]] = stringList.map((strings) => {
  strings.split(" ")
})
// An Array prints as an object reference, so convert each one to a List for readable output
println(stringSplit.map(_.toList))
// 2. Flatten
val stringListFlatten: List[String] = stringSplit.flatten
println(stringListFlatten)
// 3. Put identical words together with groupBy
val stringGroupBy: Map[String, List[String]] = stringListFlatten.groupBy((word) => word)
println(stringGroupBy)
// 4. Count the elements in each group; the result is a Map from word to count
val stringCountList: Map[String, Int] = stringGroupBy.map((kv) => (kv._1, kv._2.size))
// 5. Sort by count in descending order and take the first three
val toList = stringCountList.toList
val resultList = toList.sortWith((x, y) => {
  x._2 > y._2
}).take(3)
println(resultList)
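One detail of the sample data is worth noting: the eighth string contains "world,I" with no space after the comma, so splitting on a single space keeps "world,I" as one token and undercounts both "world" and "I". The following is a minimal sketch, not the original author's approach, that writes the same five steps as one chain and splits on commas as well as whitespace. The names SimpleWordCountSketch and topWords, the pattern [\s,]+, and the alphabetical tie-break are assumptions introduced for illustration.

```scala
object SimpleWordCountSketch {
  val stringList: List[String] = List(
    "hello world",
    "beautiful city",
    "I a chinese",
    "you is a good man",
    "I is a kind woman",
    "thanks for you",
    "forgive you",
    "hello world,I am student in a beautiful city",
    "whatever I will keep happy"
  )

  // Hypothetical helper: tokenize on runs of whitespace or commas, then count.
  def topWords(lines: List[String], n: Int): List[(String, Int)] =
    lines
      .flatMap(_.split("[\\s,]+"))   // split and flatten in one step
      .filter(_.nonEmpty)            // drop an empty leading token, if any
      .groupBy(identity)             // word -> all of its occurrences
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toList
      // Highest count first; equal counts are ordered by the word itself
      .sortBy { case (word, count) => (-count, word) }
      .take(n)

  def main(args: Array[String]): Unit =
    topWords(stringList, 3).foreach(println)
}
```

With this tokenization the result is (I,4), (a,4), (you,3). Sorting on the pair (-count, word) makes the order of tied entries deterministic, whereas sortWith on the count alone leaves ties in whatever order the Map happened to produce them.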
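The complex version differs from the simple one in the format of its collection, shown below: the number in each tuple is how many times that string is repeated. Two solutions are given here.
val tupleList: List[(String, Int)] = List(("Hello Scala Spark World", 7), ("Hello Scala", 3), ("Hello china", 5))
The first solution converts the strings in the collection into the format below (the one used in the simple version); the remaining steps are the same as in the simple version:
val tupleList: List[String] = List("hello world", "beautiful city")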
// Method one (not general): repeat each string as many times as its count, then count as in the simple version
val tupleList: List[(String, Int)] = List(("Hello Scala Spark World", 7), ("Hello Scala", 3), ("Hello china", 5))
tupleList.map((elem) => (elem._1 + " ") * elem._2)
  .flatMap(_.split(" "))
  .groupBy(word => word)
  .map((kv) => (kv._1, kv._2.length))
  .toList.sortWith(_._2 > _._2)
  .take(3)
  .foreach(println)
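To see why method one does not generalize well, it helps to look at the intermediate strings it builds. In Scala, * on a String repeats it, so every sentence is copied as many times as its count before any counting happens, and the work grows with the counts themselves rather than with the number of distinct sentences. Below is a small sketch on a shortened input; the object name RepeatSketch and the sample values are illustrative assumptions.

```scala
object RepeatSketch {
  // Illustrative input, smaller than the article's list
  val sample: List[(String, Int)] = List(("Hello Scala", 3), ("Hello china", 2))

  // Same expansion as method one: append a space, then repeat `count` times
  val expanded: List[String] =
    sample.map { case (text, count) => (text + " ") * count }

  // split(" ") discards the empty string left after the final trailing space
  val words: List[String] = expanded.flatMap(_.split(" "))

  def main(args: Array[String]): Unit = {
    expanded.foreach(s => println("[" + s + "]"))
    println(words.size)
  }
}
```

Here the first expanded string is "Hello Scala Hello Scala Hello Scala " and the flattened list holds 10 words in total, one element per repeated occurrence.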
Method two instead computes the word counts within each tuple first and then adds them up:
// Method two: compute the word counts within each tuple first, then sum the values that share the same key
// ("Hello Scala Spark World", 7)
// becomes ("Hello", 7) ("Scala", 7) ("Spark", 7) ("World", 7)
val tupleList: List[(String, Int)] = List(("Hello Scala Spark World", 7),
  ("Hello Scala", 3), ("Hello china", 5))
val wordToCountList: List[(String, Int)] = tupleList.flatMap(t => {
  val strings: Array[String] = t._1.split(" ")
  strings.map(word => (word, t._2))
})
println(wordToCountList)
// Group by word
val wordGroupBy: Map[String, List[(String, Int)]] = wordToCountList.groupBy(_._1)
println(wordGroupBy)
// Collect the counts for each word into a list, e.g. "Hello" -> List(7, 3, 5)
val wordToCountMap = wordGroupBy.map(t => {
  (t._1, t._2.map(t1 => t1._2))
})
val wordToTotalCountMap: Map[String, Int] = wordToCountMap.map(t => (t._1, t._2.sum))
println(wordToTotalCountMap)
wordToTotalCountMap
  .toList
  .sortWith(_._2 > _._2)
  .take(3)
  .foreach(println)
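The group, collect, and sum steps of method two can also be folded into a single call. The sketch below assumes Scala 2.13 or later, where the standard collections provide groupMapReduce. The object and method names are illustrative, and this is an alternative formulation rather than the article's own code.

```scala
object WeightedWordCountSketch {
  val tupleList: List[(String, Int)] =
    List(("Hello Scala Spark World", 7), ("Hello Scala", 3), ("Hello china", 5))

  // Hypothetical helper: pair each word with its sentence's count, then sum per word.
  def topWords(data: List[(String, Int)], n: Int): List[(String, Int)] =
    data
      .flatMap { case (text, count) =>
        text.split(" ").toList.map(word => (word, count))
      }
      // key = the word, value = its count, merge values of equal keys by addition
      .groupMapReduce(_._1)(_._2)(_ + _)
      .toList
      // Highest total first; equal totals are ordered by the word itself
      .sortBy { case (word, total) => (-total, word) }
      .take(n)

  def main(args: Array[String]): Unit =
    topWords(tupleList, 3).foreach(println)
}
```

For the article's input this prints (Hello,15), (Scala,10), (Spark,7). "Spark" and "World" both total 7, so the explicit tie-break on the word is what decides which of them takes third place.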