五个最难的SQL任务:数据缺失、透视、独热编码

许多人体验到了云数据仓库内集中计算的速度和效率优势,但同时大家也认识到这种方法存在一些缺点 。缺点之一是需要学习和执行不同语言(尤其是SQL)的查询,这增加了很多复杂性 。
为了解决这个问题 , 本文分享一些最难以在SQL中学习和执行的具体转换,并提供实际所需的SQL代码 。

五个最难的SQL任务:数据缺失、透视、独热编码

文章插图
1. DatespineDatespine是生成日期索引的转换 。
想象下,您正在分析每日销售数据,表格如下:
五个最难的SQL任务:数据缺失、透视、独热编码

文章插图
【五个最难的SQL任务:数据缺失、透视、独热编码】因16日和17日没有销售,对应的行缺失 。如果想计算平均每日销售额或构建时间序列预测模型 , 那这种格式是个问题,我们需要插入缺失日期的行 。
基本概念:
  • 生成或选择唯一日期
  • 生成或选择唯一产品
  • 交叉连接(笛卡尔积)1&2的所有组合
  • 将第3步的结果与原始数据进行外连接
WITH GLOBAL_SPINE AS (SELECTROW_NUMBER() OVER (ORDER BYNULL) as INTERVAL_ID,DATEADD('day',(INTERVAL_ID - 1),'2020-01-01T00:00' :: timestamp_ntz) as SPINE_START,DATEADD('day', INTERVAL_ID, '2020-01-01T00:00' :: timestamp_ntz) as SPINE_ENDFROMTABLE (GENERATOR(ROWCOUNT => 1097))), GROUPS AS (SELECTproduct,MIN(sales_date) AS LOCAL_START,MAX(sales_date) AS LOCAL_ENDFROMMy_First_TableGROUP BYproduct), GROUP_SPINE AS (SELECTproduct,SPINE_START AS GROUP_START,SPINE_END AS GROUP_ENDFROMGROUPS G CROSSJOIN LATERAL (SELECTSPINE_START,SPINE_ENDFROMGLOBAL_SPINE SWHERES.SPINE_START >= G.LOCAL_START)) SELECTG.product AS GROUP_BY_product,GROUP_START,GROUP_END,T.* FROMGROUP_SPINE GLEFT JOIN My_First_Table T ON sales_date >= G.GROUP_STARTAND sales_date < G.GROUP_ENDAND G.product = T.product;最终结果如下所示:
五个最难的SQL任务:数据缺失、透视、独热编码

文章插图
2. PivotPivot是一种将行数据转换为列数据的操作,以便更好地进行分析和可视化 。
有时,在进行分析时,您希望重新构造表格 。例如,有个罗列学生、科目和成绩表格,我们想把具体科目分解为每个列 。
之前:
五个最难的SQL任务:数据缺失、透视、独热编码

文章插图
SELECT Student, MATHEMATICS, GEOGRAPHY, PHYS_EDFROM ( SELECT Student, Grade, Subject FROM skool)PIVOT ( AVG ( Grade ) FOR Subject IN ( 'Mathematics', 'Geography', 'Phys Ed' ) ) as p( Student, MATHEMATICS, GEOGRAPHY, PHYS_ED );结果:
五个最难的SQL任务:数据缺失、透视、独热编码

文章插图
3. One-Hot编码(或“虚拟”变量)One-Hot编码是一种将分类变量转换为数字变量的方法 。
这里是一个使用STATE作为列进行独热编码的示例 。
之前:
五个最难的SQL任务:数据缺失、透视、独热编码

文章插图
SELECT *,CASE WHEN State = 'AL' THEN 1 ELSE 0 END as STATE_AL,CASE WHEN State = 'AK' THEN 1 ELSE 0 END as STATE_AK,CASE WHEN State = 'AZ' THEN 1 ELSE 0 END as STATE_AZ,CASE WHEN State = 'AR' THEN 1 ELSE 0 END as STATE_AR,CASE WHEN State = 'AS' THEN 1 ELSE 0 END as STATE_AS,CASE WHEN State = 'CA' THEN 1 ELSE 0 END as STATE_CA,CASE WHEN State = 'CO' THEN 1 ELSE 0 END as STATE_CO,CASE WHEN State = 'CT' THEN 1 ELSE 0 END as STATE_CT,CASE WHEN State = 'DC' THEN 1 ELSE 0 END as STATE_DC,CASE WHEN State = 'FL' THEN 1 ELSE 0 END as STATE_FL,CASE WHEN State = 'GA' THEN 1 ELSE 0 END as STATE_GA,CASE WHEN State = 'HI' THEN 1 ELSE 0 END as STATE_HI,CASE WHEN State = 'ID' THEN 1 ELSE 0 END as STATE_ID,CASE WHEN State = 'IL' THEN 1 ELSE 0 END as STATE_IL,CASE WHEN State = 'IN' THEN 1 ELSE 0 END as STATE_IN,CASE WHEN State = 'IA' THEN 1 ELSE 0 END as STATE_IA,CASE WHEN State = 'KS' THEN 1 ELSE 0 END as STATE_KS,CASE WHEN State = 'KY' THEN 1 ELSE 0 END as STATE_KY,CASE WHEN State = 'LA' THEN 1 ELSE 0 END as STATE_LA,CASE WHEN State = 'ME' THEN 1 ELSE 0 END as STATE_ME,CASE WHEN State = 'MD' THEN 1 ELSE 0 END as STATE_MD,CASE WHEN State = 'MA' THEN 1 ELSE 0 END as STATE_MA,CASE WHEN State = 'MI' THEN 1 ELSE 0 END as STATE_MI,CASE WHEN State = 'MN' THEN 1 ELSE 0 END as STATE_MN,CASE WHEN State = 'MS' THEN 1 ELSE 0 END as STATE_MS,CASE WHEN State = 'MO' THEN 1 ELSE 0 END as STATE_MO,CASE WHEN State = 'MT' THEN 1 ELSE 0 END as STATE_MT,CASE WHEN State = 'NE' THEN 1 ELSE 0 END as STATE_NE,CASE WHEN State = 'NV' THEN 1 ELSE 0 END as STATE_NV,CASE WHEN State = 'NH' THEN 1 ELSE 0 END as STATE_NH,CASE WHEN State = 'NJ' THEN 1 ELSE 0 END as STATE_NJ,CASE WHEN State = 'NM' THEN 1 ELSE 0 END as STATE_NM,CASE WHEN State = 'NY' THEN 1 ELSE 0 END as STATE_NY,CASE WHEN State = 'NC' THEN 1 ELSE 0 END as STATE_NC,CASE WHEN State = 'ND' THEN 1 ELSE 0 END as STATE_ND,CASE WHEN State = 'OH' THEN 1 ELSE 0 END as STATE_OH,CASE WHEN State = 'OK' THEN 1 ELSE 0 END as STATE_OK,CASE WHEN State = 'OR' THEN 1 ELSE 0 END as STATE_OR,CASE WHEN State = 'PA' THEN 1 ELSE 0 END as STATE_PA,CASE WHEN State = 'RI' THEN 1 ELSE 0 END as STATE_RI,CASE WHEN State = 'SC' THEN 1 ELSE 0 END as STATE_SC,CASE WHEN State = 'SD' THEN 1 ELSE 0 END as STATE_SD,CASE WHEN State = 'TN' THEN 1 ELSE 0 END as STATE_TN,CASE WHEN State = 'TX' THEN 1 ELSE 0 END as STATE_TX,CASE WHEN State = 'UT' THEN 1 ELSE 0 END as STATE_UT,CASE WHEN State = 'VT' THEN 1 ELSE 0 END as STATE_VT,CASE WHEN State = 'VA' THEN 1 ELSE 0 END as STATE_VA,CASE WHEN State = 'WA' THEN 1 ELSE 0 END as STATE_WA,CASE WHEN State = 'WV' THEN 1 ELSE 0 END as STATE_WV,CASE WHEN State = 'WI' THEN 1 ELSE 0 END as STATE_WI,CASE WHEN State = 'WY' THEN 1 ELSE 0 END as STATE_WYFROM BABYTABLE;


推荐阅读