五个最难的SQL任务:数据缺失、透视、独热编码( 二 )

结果:

五个最难的SQL任务:数据缺失、透视、独热编码

文章插图
4. 市场篮子分析市场篮子分析是购物篮分析和挖掘关联规则的一种方法 。在这个过程中,首先需要对数据进行格式化,以便将每笔交易聚合到单个记录中 。对于个人电脑来说,这一步骤可能很具挑战性,因为它涉及数据处理和转换 。然而,数据仓库专为高效处理这些数据而设计,因此它更适合执行市场篮子分析所需的数据格式化任务 。数据仓库提供了便捷的功能,使数据格式化更加容易,从而支持购物篮分析和关联规则的挖掘 。
WITH order_detAIl as (SELECTSALESORDERNUMBER,listagg(ENGLISHPRODUCTNAME, ', ') WITHIN group (order byENGLISHPRODUCTNAME) as ENGLISHPRODUCTNAME_listagg,COUNT(ENGLISHPRODUCTNAME) as num_productsFROMtransactionsGROUP BYSALESORDERNUMBER) SELECTENGLISHPRODUCTNAME_listagg,count(SALESORDERNUMBER) as NumTransactions FROMorder_detail wherenum_products > 1 GROUP BYENGLISHPRODUCTNAME_listagg order bycount(SALESORDERNUMBER) desc;结果:
五个最难的SQL任务:数据缺失、透视、独热编码

文章插图
5. 时间序列聚合时间序列聚合是指将时间序列数据按照一定的时间间隔进行汇总和聚合,广泛用语分析数据 。然而,要正确执行时间序列聚合,关键因素之一是需要对数据进行适当的格式化,尤其是在使用窗口函数时 。这一步骤的正确处理对于获得准确且有意义的聚合结果至关重要 。
例如 , 假设我们想计算过去14天的平均销售额 。使用窗口函数的方法要求我们将所有销售数据转换为每天一行的格式 。然而 , 实际上,大多数销售数据是以交易级别存储的 , 这让格式转换变得困难 。这就是时间序列聚合发挥作用的地方 。通过时间序列聚合 , 我们可以创建历史指标的聚合结果,无需重新格式化整个数据集 。如果我们想一次添加多个指标,它也会派上用?。?
  • 过去14天内的平均销售额
  • 过去6个月中最大的购买
  • 统计过去 90 天内的不同产品类型
如果您想使用窗口函数,则需要通过几个步骤独立构建每个指标 。
处理此问题的更好方法是使用公共表表达式 (CTE) 来定义每个预先聚合的历史窗口 。
例如:
五个最难的SQL任务:数据缺失、透视、独热编码

文章插图
WITH BASIC_OFFSET_14DAY AS (SELECTA.CustomerID,A.TransactionDate,AVG(B.PurchaseAmount) as AVG_PURCHASEAMOUNT_PAST14DAY,MAX(B.PurchaseAmount) as MAX_PURCHASEAMOUNT_PAST14DAY,COUNT(DISTINCT B.TransactionID) as COUNT_DISTINCT_TRANSACTIONID_PAST14DAYFROMMy_First_Table AINNER JOIN My_First_Table B ON A.CustomerID = B.CustomerIDAND 1 = 1WHEREB.TransactionDate >= DATEADD(day, -14, A.TransactionDate)AND B.TransactionDate <= A.TransactionDateGROUP BYA.CustomerID,A.TransactionDate), BASIC_OFFSET_90DAY AS (SELECTA.CustomerID,A.TransactionDate,AVG(B.PurchaseAmount) as AVG_PURCHASEAMOUNT_PAST90DAY,MAX(B.PurchaseAmount) as MAX_PURCHASEAMOUNT_PAST90DAY,COUNT(DISTINCT B.TransactionID) as COUNT_DISTINCT_TRANSACTIONID_PAST90DAYFROMMy_First_Table AINNER JOIN My_First_Table B ON A.CustomerID = B.CustomerIDAND 1 = 1WHEREB.TransactionDate >= DATEADD(day, -90, A.TransactionDate)AND B.TransactionDate <= A.TransactionDateGROUP BYA.CustomerID,A.TransactionDate), BASIC_OFFSET_180DAY AS (SELECTA.CustomerID,A.TransactionDate,AVG(B.PurchaseAmount) as AVG_PURCHASEAMOUNT_PAST180DAY,MAX(B.PurchaseAmount) as MAX_PURCHASEAMOUNT_PAST180DAY,COUNT(DISTINCT B.TransactionID) as COUNT_DISTINCT_TRANSACTIONID_PAST180DAYFROMMy_First_Table AINNER JOIN My_First_Table B ON A.CustomerID = B.CustomerIDAND 1 = 1WHEREB.TransactionDate >= DATEADD(day, -180, A.TransactionDate)AND B.TransactionDate <= A.TransactionDateGROUP BYA.CustomerID,A.TransactionDate) SELECTsrc.*,BASIC_OFFSET_14DAY.AVG_PURCHASEAMOUNT_PAST14DAY,BASIC_OFFSET_14DAY.MAX_PURCHASEAMOUNT_PAST14DAY,BASIC_OFFSET_14DAY.COUNT_DISTINCT_TRANSACTIONID_PAST14DAY,BASIC_OFFSET_90DAY.AVG_PURCHASEAMOUNT_PAST90DAY,BASIC_OFFSET_90DAY.MAX_PURCHASEAMOUNT_PAST90DAY,BASIC_OFFSET_90DAY.COUNT_DISTINCT_TRANSACTIONID_PAST90DAY,BASIC_OFFSET_180DAY.AVG_PURCHASEAMOUNT_PAST180DAY,BASIC_OFFSET_180DAY.MAX_PURCHASEAMOUNT_PAST180DAY,BASIC_OFFSET_180DAY.COUNT_DISTINCT_TRANSACTIONID_PAST180DAY FROMMy_First_Table srcLEFT OUTER JOIN BASIC_OFFSET_14DAY ON BASIC_OFFSET_14DAY.TransactionDate = src.TransactionDateAND BASIC_OFFSET_14DAY.CustomerID = src.CustomerIDLEFT OUTER JOIN BASIC_OFFSET_90DAY ON BASIC_OFFSET_90DAY.TransactionDate = src.TransactionDateAND BASIC_OFFSET_90DAY.CustomerID = src.CustomerIDLEFT OUTER JOIN BASIC_OFFSET_180DAY ON BASIC_OFFSET_180DAY.TransactionDate = src.TransactionDateAND BASIC_OFFSET_180DAY.CustomerID = src.CustomerID;


推荐阅读