Advanced Scala (Part 2): A WordCount Example

Published: 2023-12-22 10:30

        Contents

                        1. Basic WordCount

                        1.1 Requirements Analysis

                        1.2 Approach

                        1.3 Code

                        2. Advanced WordCount

                    2.1 Method 1

                    2.2 Method 2


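1. Basic WordCount

1.1 Requirements Analysis

Requirement: given a collection of strings, count how many times each word occurs and print the three most frequent words.

The collection looks like this: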

val stringList: List[String] = List(
  "hello world",
  "beautiful city",
  "I a chinese",
  "you is  a good man",
  "I is a kind woman",
  "thanks for you",
  "forgive you",
  "hello world,I am student in a beautiful city",
  "whatever I will keep happy"
)

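1.2 Approach

        1. Split each line on spaces to get the individual words; each line yields an array of strings.

        2. Flatten the arrays into a single list of words.

        3. Group identical words together.

        4. Count the words in each group.

        5. Sort by count in descending order and take the top three.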
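
As a rough illustration, the split, flatten, group, and count steps above can be traced by hand on a tiny made-up input. The object name `StepTrace`, the input `lines`, and the helper `countWords` are hypothetical and exist only for this sketch.

```scala
object StepTrace {
  // Hypothetical two-line input, small enough to follow by hand.
  val lines: List[String] = List("a b", "b c b")

  def countWords(input: List[String]): Map[String, Int] = {
    // 1. Split each line on a space:
    //    List(Array("a", "b"), Array("b", "c", "b"))
    val split: List[Array[String]] = input.map(_.split(" "))
    // 2. Flatten into one list of words:
    //    List("a", "b", "b", "c", "b")
    val words: List[String] = split.flatten
    // 3. Group identical words:
    //    Map("a" -> List("a"), "b" -> List("b", "b", "b"), "c" -> List("c"))
    val grouped: Map[String, List[String]] = words.groupBy(word => word)
    // 4. Replace each group by its size:
    //    Map("a" -> 1, "b" -> 3, "c" -> 1)
    grouped.map { case (word, occurrences) => (word, occurrences.size) }
  }

  def main(args: Array[String]): Unit =
    println(countWords(lines))
}
```

Each comment shows the value at that step, so the shape of the data can be checked before moving on to the full example.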

1.3 Code

    // Basic version: count how many times each word occurs in the collection
    val stringList: List[String] = List(
      "hello world",
      "beautiful city",
      "I a chinese",
      "you is  a good man",
      "I is a kind woman",
      "thanks for you",
      "forgive you",
      "hello world,I am student in a beautiful city",
      "whatever I will keep happy"
    )
    // 1. Split each line into words. Splitting on "\\s+" instead of " " keeps
    //    the double space in "you is  a good man" from producing an empty word.
    val stringSplit: List[Array[String]] = stringList.map(line => line.split("\\s+"))
    // Arrays do not print their contents by default, so format them first
    println(stringSplit.map(_.mkString("[", ", ", "]")))
    // 2. Flatten into a single list of words
    val stringListFlatten: List[String] = stringSplit.flatten
    println(stringListFlatten)

    // 3. Group identical words together with groupBy
    val stringGroupBy: Map[String, List[String]] = stringListFlatten.groupBy(word => word)
    println(stringGroupBy)

    // 4. Count the elements in each group and return a Map
    val stringCountList: Map[String, Int] = stringGroupBy.map(kv => (kv._1, kv._2.size))

    // 5. Sort by count in descending order and take the top three
    val toList = stringCountList.toList
    val resultList = toList.sortWith((x, y) => x._2 > y._2).take(3)
    println(resultList)
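
The same pipeline can be written more compactly. The following is only a sketch, not the original code: it assumes Scala 2.13 or later (for `groupMapReduce`), and the names `CompactWordCount` and `topWords` are hypothetical. Unlike the code above, it splits on runs of whitespace and commas, so `world,I` is counted as two words, and it breaks ties alphabetically so that the result is deterministic.

```scala
object CompactWordCount {
  val stringList: List[String] = List(
    "hello world",
    "beautiful city",
    "I a chinese",
    "you is  a good man",
    "I is a kind woman",
    "thanks for you",
    "forgive you",
    "hello world,I am student in a beautiful city",
    "whatever I will keep happy"
  )

  // Returns the n most frequent words with their counts.
  def topWords(lines: List[String], n: Int): List[(String, Int)] =
    lines
      .flatMap(_.split("[\\s,]+"))              // split on whitespace or commas
      .filter(_.nonEmpty)                       // drop empty tokens, if any
      .groupMapReduce(identity)(_ => 1)(_ + _)  // word -> number of occurrences
      .toList
      .sortBy { case (word, count) => (-count, word) } // count desc, then word
      .take(n)

  def main(args: Array[String]): Unit =
    topWords(stringList, 3).foreach(println)
}
```

With this tokenization the top three are `(I,4)`, `(a,4)`, and `(you,3)`. The step-by-step code above treats `world,I` as a single word, so its count for `I` is one lower.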

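2. Advanced WordCount

The advanced version differs from the basic one in its input format: each string is paired with a number giving how many times that string occurs. Two solutions are given here.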

val tupleList: List[(String, Int)] = List(("Hello Scala Spark World", 7), ("Hello Scala", 3), ("Hello china", 5))

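2.1 Method 1

This method first converts the strings in the collection into the format used by the basic version, shown below; the remaining steps are the same as in the basic version: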

val stringList: List[String] = List("hello world", "beautiful city")

    // Method 1 (not general-purpose)
    val tupleList: List[(String, Int)] = List(("Hello Scala Spark World", 7), ("Hello Scala", 3), ("Hello china", 5))
    tupleList.map(elem => (elem._1 + " ") * elem._2) // repeat each string elem._2 times
      .flatMap(_.split(" "))
      .groupBy(word => word)
      .map(kv => (kv._1, kv._2.length))
      .toList.sortWith(_._2 > _._2)
      .take(3)
      .foreach(println)
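
To make the first `map` step concrete, and to suggest one plausible reason this method is labelled as not general-purpose, here is a small sketch. The object name `RepeatSketch` and the helper `expand` are hypothetical. The intermediate string grows in proportion to the count, so a very large count means a very large temporary string.

```scala
object RepeatSketch {
  // Repeat the line `times` times, each copy followed by a space.
  // The `*` operator on strings comes from StringOps, as in the code above.
  def expand(line: String, times: Int): String = (line + " ") * times

  def main(args: Array[String]): Unit = {
    val expanded = expand("Hello Scala", 3)
    // Value: "Hello Scala Hello Scala Hello Scala " (note the trailing space)
    println(expanded)
    // split drops the trailing empty string, leaving 6 words
    println(expanded.split(" ").length)
    // Length is (line.length + 1) * times, here (11 + 1) * 3 = 36
    println(expanded.length)
  }
}
```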

2.2 Method 2

This method first attaches each tuple's count to every word in that tuple, then sums the counts for each word.

    // Method 2: pair every word in a tuple with that tuple's count,
    // then sum the values that share the same key
    //    ("Hello Scala Spark World", 7)
    //    ("Hello", 7) ("Scala", 7) ("Spark", 7) ("World", 7)
    val tupleList: List[(String, Int)] = List(("Hello Scala Spark World", 7),
      ("Hello Scala", 3), ("Hello china", 5))
    val wordToCountList: List[(String, Int)] = tupleList.flatMap(t => {
      val strings: Array[String] = t._1.split(" ")
      strings.map(word => (word, t._2))
    })
    println(wordToCountList)

    // Group by word
    val wordGroupBy: Map[String, List[(String, Int)]] = wordToCountList.groupBy(_._1)
    println(wordGroupBy)
    // Collect the counts into a list, e.g. "Hello" -> List(7, 3, 5)
    val wordToCountMap = wordGroupBy.map(t => {
      (t._1, t._2.map(t1 => t1._2))
    })

    // Sum the counts for each word
    val wordToTotalCountMap: Map[String, Int] = wordToCountMap.map(t => (t._1, t._2.sum))
    println(wordToTotalCountMap)
    wordToTotalCountMap
      .toList
      .sortWith(_._2 > _._2)
      .take(3)
      .foreach(println)
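
The group-then-sum steps of Method 2 can also be folded into a single pass over the input. This is a minimal sketch using only the standard library; the names `WeightedWordCount`, `totals`, and `top` are hypothetical, and ties are broken alphabetically so that the output is deterministic.

```scala
object WeightedWordCount {
  val tupleList: List[(String, Int)] = List(
    ("Hello Scala Spark World", 7),
    ("Hello Scala", 3),
    ("Hello china", 5)
  )

  // Add each line's count to the running total of every word on that line.
  def totals(lines: List[(String, Int)]): Map[String, Int] =
    lines.foldLeft(Map.empty[String, Int]) { case (acc, (line, count)) =>
      line.split(" ").foldLeft(acc) { (m, word) =>
        m.updated(word, m.getOrElse(word, 0) + count)
      }
    }

  // The n words with the highest totals: total descending, then word ascending.
  def top(lines: List[(String, Int)], n: Int): List[(String, Int)] =
    totals(lines).toList.sortBy { case (word, total) => (-total, word) }.take(n)

  def main(args: Array[String]): Unit =
    top(tupleList, 3).foreach(println)
}
```

For the sample input the totals are Hello 15, Scala 10, Spark 7, World 7, and china 5, so the top three are `(Hello,15)`, `(Scala,10)`, and `(Spark,7)`.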
