SPL---强大的集算器

发布时间:2024-10-10 14:01

(仅用于学习,如侵犯著作权,请告知后删除)转载 程序it圈(公众号)

SQL必须基于RDB工作,而很多场景下并没有RDB,比如遇到csv\\restful json\\MongoDB等数据源,或进行这些数据源之间的混合计算,比如csv和xls之间。

在这些场景下,很多人会选择用JAVA或C#等高级语言硬写算法,这要从头编写冗长的底层函数,执行效率也很难保证,很容易堆积出人人痛恨的“代码屎山”。

也有人会把数据写入数据库,再用SQL进行计算,但入库的过程非常繁琐,实时性也很差,有时还要求助于ETL工具,架构重上加重,风险增了又增,遇到混合计算更是加倍的麻烦。现在好了,集算器SPL可以解决这些问题。

SPL是开源的计算技术,完全覆盖了SQL的计算能力,支持种类繁多的数据源,没有RDB也可以用SQL进行结构化数据计算了。

\"SPL---强大的集算器_第1张图片\"

 对于交互式的计算分析,SPL有专业的IDE,不仅具有完整的调试功能,还可以用表格直观观察每一步的中间结算结果。

SPL的本意是Structure Process Language,是一种专门用于结构化数据处理的语言,在前面的例子中也已经展示了部分SPL本身的语法(那些扩展函数)。SQL只是SPL顺带提供的一种功能, SPL本身还拥有比SQL更强大便捷的计算能力。有些计算逻辑比较复杂,用SQL甚至存储过程都很难写,而用SPL则可以用简单的代码完成计算。

比如这个任务,计算某支股票最长的连续上涨天数,SQL要用多层嵌套的子查询和窗口函数,代码冗长难懂:

select max(continuousDays)-1
from (select count(*) continuousDays
    from (select sum(changeSign) over(order by tradeDate) unRiseDays
        from (select tradeDate,
            case when price>lag(price) over(order by tradeDate) then 0 else 1 end changeSign
            from AAPL) )
        group by unRiseDays)

而SPL只需两行:

A B
1 =T(\"d:/AAPL.xlsx\") 读Excel文件,首行为列名
2 =a=0,A1.max(a=if(price>price[-1],a+1,0)) 求最长连续上涨天数

对于简单的运算,使用基本的SQL是很方便的,但运算需求变复杂时,SQL就不适用了,即使提供更多的功能(比如窗口函数)也不能简化计算。这种情况下,我们推荐用户直接使用代码简洁的SPL,而不必再写多层嵌套的复杂SQL了。基于这个原因,SPL中的SQL也只支持到SQL92标准,没有提供包括窗口函数在内的更多语法。

完善的SQL计算能力

SPL提供了相当于SQL92标准的语法,可以进行足够丰富多样的数据计算,包括过滤、计算列、选择部分列、改名等等,可以直接把文本、xls等文件当成数据表来执行SQL。下面以csv文件作为数据源为例说明:

1. 过滤

基本的比较运算:

$select * from d:/Orders.csv where Amount>=100

like:

$select * from d:/Orders.csv where Client like \'%bro%\'

空值判断:

$select * from d:/Orders.csv where Client is null

与、或、非这样的逻辑运算符可以把比较运算组合起来,实现组合过滤:

$select * from d:/Orders.csv where not Amount>=100 and Client like \'bro\' or OrderDate is null

in:

$select * from d:/Orders.csv where Client in (\'TAS\',\'KBRO\',\'PNS\')

多层括号:

$select * from d:/Orders.csv where (OrderDate=date(\'2020-12-31\') and Amount>100)

2.    计算列

SPL有丰富的数学函数、字符串函数、日期函数:

$select round(Amount,2), price*quantity from d:/Orders.csv

$select left(Client,4) from d:/Orders.csv

$select year(OrderDate) from d:/Orders.csv

case when:

$select case year(OrderDate) when 2021 then \'this year\' when 2020 then \'last year\' else \'previous years\' end from d:/Orders.csv

coalesce反显空值:

$select coalesce(Client,\'unknown\') from d:/Orders.csv

3.    SELECT

$select OrderId, Amount, OrderDate from d:/Orders.csv

4.    ORDER BY

$select * from d:/Orders.csv order by Client, Amount desc

5.    DISTINCT

$select distinct Client ,Sellerid from d:/Orders.csv

6.    GROUP BY … HAVING

$select year(OrderDate),Client ,sum(Amount),count(1) from d:/Orders.csv 
group by year(OrderDate),Client 
having sum(Amount)<=100

聚合函数包括sum、count、avg、max、min,不分组也可以直接汇总:

$select avg(Amount) from d:/Orders.csv

7.    JOIN

左关联:

$select o.OrderId,o.Client,e.Name e.Dept,e.EId from d:/Orders.txt o 
left join d:/Employees.txt e on o.SellerId=e.Eid

右关联:

$select o.OrderId,o.Client,e.Name e.Dept,e.EId from d:/Employees.txt e 
right join d:/Orders.txt o on o.SellerId=e.Eid

全关联:

$select o.OrderId,o.Client,e.Name e.Dept,e.EId from d:/Employees.txt e 
full join d:/Orders.txt o on o.SellerId=e.EId

内关联:

$select o.OrderId,o.Client,e.Name e.Dept from d:/Orders.csv o 
inner join d:/Employees.csv e on o.SellerId=e.Eid

内关联还可以写成WHERE形式:

$select o.OrderId,o.Client,e.Name e.Dept from d:/Orders.csv o ,d:/Employees.csv e 
where  o.SellerId=e.Eid

8.    子查询

$select t.Client, t.s, ct.Name, ct.address from 
(select Client ,sum(amount) s from d:/Orders.csv group by Client) t 
left join ClientTable ct on t.Client=ct.Client

with:

$with t as (select Client ,sum(amount) s from d:/Orders.csv group by Client)
select t.Client, t.s, ct.Name, ct.address from t 
left join ClientTable ct on t.Client=ct.Client

In中的子查询:

$select * from d:/Orders.txt o  where o.sellerid in (select eid from d:/Employees.txt)

9.    AS

使用as关键字,可对字段、计算列、物理表、子查询改名:

$select price*quantity as subtotal from d:/detail.csv

10. 集合运算

union、union all、intersect、minus都有,举一例:

Select * from Orders1.csv
Union all
Select * from Orders2.csv

11. into输出结果

查询结果可用into关键字写入文件:

$select dept,count(1) c,sum(salary) s into deptResult.xlsx from employee.txt group by dept having s>100000


丰富的数据源支持

SPL支持各种非数据库的数据源,包括各种非标准格式的文本,前面例子中已展示过csv。TAB分隔的txt也一样可以支持,SPL会根据扩展名自动处理:

$select * from d:/Orders.txt where Amount>=100 and Client like \'bro\' or OrderDate is null

如果分隔符不是逗号和tab,就要用SPL扩展函数处理了,比如分隔符是冒号:

$select * from {file(\"d:/Orders.txt\").import@t (;\":\")} 
where Amount>=100 and Client like \'bro\' or OrderDate is null

没有标题行的文件,可以用序号表示列名:

$select * from {file(\"d:/Orders.txt\").import()} where _4>=100 and _2 like \'bro\' or _5 is null

某些特殊格式的字符串也要用扩展函数解析,比如日期格式不是标准的yyyy-MM-dd:

$select year(OrderDate),sum(Amount) from 
{file(\"d:/Orders.txt\").import@t (orderid,client,sellerid,amount,orderdate:date:\"dd-MM-yyyy\")}
group by year(OrderDate)

Excel文件上也可以执行SQL,对于格式规范的Excel,只需直接引用文件名:


$select * from d:/Orders.xlsx where Amount>=100 and Client like \'bro\' or OrderDate is null

可以读取指定sheet:


$select * from {file(\"D:/Orders.xlsx\").xlsimport@t (;\"sheet3\")} 
where Amount>=100 and Client like \'bro\' or OrderDate is null 

从远程网站下载来的csv/xls文件

$select * from {httpfile(\"http://127.0.0.1:6868/Orders.csv).import@tc() } 
where Amount>=100 and Client like \'bro\' or OrderDate is null

HTTP协议的特性很多,比如字符集、端口号、post参数、header参数、登录认证等等,SPL扩展函数都可以支持。扩展函数还可以抓取网页中的表格数据,也支持从FTP服务器下载文件,这里不再赘述。

Json文件要先读为字符串再解析:

$select * from {json(file(\"d:\\\\data.json\").read())} 
where Amount>=100 and Client like \'bro\' or OrderDate is null

二维Json比较少,多层才是常态,SPL扩展函数可以把多层数据转为二维记录,再用SQL计算,这里就不展开了。

Restful json

$select * from {json(httpfile(\"http://127.0.0.1:6868/api/getData\").read())}
where Amount>=100 and Client like \'bro\' or OrderDate is null

类似csv/xls,SPL也可以读取HTTP网站上的json/xml文件。

XML

$select * from {xml(file(\"d:/data.xml\").read(),\"xml/row\")}
where Amount>=100 and Client like \'bro\' or OrderDate is null

Web Service


$select * 
from {ws_call(ws_client(\"http://。/entityWS.asmx?wsdl\"),\"entityWS \":\" entityWSSoap\":\"getData\")} 
where Amount>=100 and Client like\'bro\' or OrderDate is null

NoSQL也不在话下。

MongoDB


$select * from
{mongo_shell@x (mongo_open(\"mongodb://127.0.0.1:27017/mongo\"),\"main.find()\")}
where Amount>=100 and Client like \'bro\' or OrderDate is null

MongoDB经常是多层数据,包括前面的restful、web Service,用SPL扩展函数都可以转为二维。

Salesforce

$select * from {sf_query(sf_open(),\"/services/data/v51.0/query\",\"Select Id,CaseNumber,Subject From Case where Status=\'New\'\")} where Amount>=100 and Client like \'bro\' or OrderDate is null

Hadoop HDFS上的csv/xls/json/xml:

A
1 =hdfs_open(;\"hdfs://192.168.0.8:9000\")
2 =hdfs_file(A1,\"/user/Orders.csv\":\"GBK\")
3 =A2.import@t()
4 =hdfs_close(A1)
5 $select Client,sum(Amount)   from {A3} group by Client

HBase也支持:

A
1 =hbase_open(\"hdfs://192.168.0.8\",   \"192.168.0.8\")
2 =hbase_scan(A1,\"Orders\")
3 =hbase_close(A1)
4 $select Client,sum(Amount)   from {A2} group by Client

Hbase还有filter、cmp等取数方式,SPL都可以支持。

Hive有公共的JDBC接口,但性能较差,SPL提供了高性能接口:

A
1 =hive_client(\"hdfs://192.168.0.8:9000\",\"thrift://192.168.0.8:9083\",\"hive\",\"asus\")
2 =hive_query(A1,   \"select  *  from    table\")
3 =hive_close()
4 $select Client,sum(Amount)   from {A2} group by Client

Spark

A
1 =spark_client(\"hdfs://192.168.0.8:9000\",\"thrift://192.168.0.8:9083\",\"aa\")
2 =spark_query(A1,\"select   * from tablename\")
3 =spark_close(A1)
4 $select   Client,sum(Amount) from {A2} group by Client

阿里云

A
1 =ali_open(\"http://test.ots.aliyuncs.com\",\"LTAIXZNG5zzSPHTQ\",\"sa\",\"test\")
2 =ali_query@x(A1,\"test\",[\"id1\",\"id2\"],[1,\"10001\"]:[10,\"70001\"],   [\"id1\",\"id2\",\"f1\",\"f2\"],f1>=2000.0)
3 $select Client,sum(Amount)   from {A2} group by Client

Cassandra

A
1 =stax_open(\"127.0.0.1\":9042,\"mycasdb\",\"cassandra\":\"cassandra\")
2 =stax_query(A1,\"select   * from user where id=?\",1)
3 =stax_close(A1)
4 $select Client,sum(Amount)   from {A2} group by Client

ElasticSearch

A
1 =es_open(\"localhost:9200\",\"user\":\"un1234\")
2 =es_get(A1,\"/person/_mget\",\"{\\\"ids\\\":[\\\"1\\\",\\\"2\\\",\\\"5\\\"]}\")
3 =es_close(A1)
4 $select   Client,sum(Amount) from {A2} group by Client

Redis

A
1 =redis_open()
2 =redis_hscan(A1,   \"runoobkey\", \"v*\", 3)
3 =redis_close (A1)
4 $select key,value from   {A2} where value>=2000  and   value<3000

SAP BW

A
1 =sap_open(\"userName\",\"passWord\",\"192.168.0.188\",\"00\",\"000\",”E\")
2 =sap_cursor(A1,   \"Z_TEST1\",\"IT_ROOM\").fetch()
3 =sap_close(A1)
4 $select * from   {A2} where Vendor like \'%software%\'

InfluxDB

A
1 =influx_open(\"http://127.0.0.1:8086\",   \"mydb\", \"autogen\", \"admin\", \"admin\")
2 =influx_query(A1,   \"SELECT * FROM Orders\")
3 =influx_close(A1)
4 $select   Client,sum(Amount) from {A2} group by Client

Kafka

A
1 =kafka_open(\"D://kafka.properties\";\"topic-test\")
2 =kafka_poll(A1)
3 =kafka_close (A1)
4 $select   Client,sum(Amount) from {A2} group by Client

MDX多维数据库

A
1 =olap_open(\"http://192.168.0.178:8088/msmdpump.dll\",\"CubeTest\",\"Administrator\",\"admin\")
2 =olap_query(A1,\"with member [Measures].[AnnualInterestRate]   as\'[Measures].[SalesAmount]/[Measures].[StandardCost]-1\'select   {[Measures].[SalesAmount],[Measures].[StandardCost],[Measures].[  AnnualInterestRate]} on columns, {[Order Date].[Calendar Year].[Calendar   Year]} on rows from [DataSourceMulti]\")
3 =olap_close(A1)
4 $select * from {A2} where   SalesAmount>10000

SPL除了支持种类繁多的数据源,也可以进行数据源之间的混合计算。比如csv和RDB之间:

$select   o.OrderId,o.Client,e.Name e.Dept from d:/Orders.csv o inner join   d:/Employees.xls e on o.SellerId=e.Eid

Hadoop HDFS上的csv/xls/json/xml:

A
1 =hdfs_open(;\"hdfs://192.168.0.8:9000\")
2 =hdfs_file(A1,\"/user/Orders.csv\":\"GBK\")
3 =A2.import@t()
4 =hdfs_close(A1)
5 $select Client,sum(Amount)   from {A3} group by Client

HBase也支持:

A
1 =hbase_open(\"hdfs://192.168.0.8\",   \"192.168.0.8\")
2 =hbase_scan(A1,\"Orders\")
3 =hbase_close(A1)
4 $select Client,sum(Amount)   from {A2} group by Client

Hbase还有filter、cmp等取数方式,SPL都可以支持。

Hive有公共的JDBC接口,但性能较差,SPL提供了高性能接口:

A
1 =hive_client(\"hdfs://192.168.0.8:9000\",\"thrift://192.168.0.8:9083\",\"hive\",\"asus\")
2 =hive_query(A1,   \"select  *  from    table\")
3 =hive_close()
4 $select Client,sum(Amount)   from {A2} group by Client

Spark

A
1 =spark_client(\"hdfs://192.168.0.8:9000\",\"thrift://192.168.0.8:9083\",\"aa\")
2 =spark_query(A1,\"select   * from tablename\")
3 =spark_close(A1)
4 $select   Client,sum(Amount) from {A2} group by Client

阿里云

A
1 =ali_open(\"http://test.ots.aliyuncs.com\",\"LTAIXZNG5zzSPHTQ\",\"sa\",\"test\")
2 =ali_query@x(A1,\"test\",[\"id1\",\"id2\"],[1,\"10001\"]:[10,\"70001\"],   [\"id1\",\"id2\",\"f1\",\"f2\"],f1>=2000.0)
3 $select Client,sum(Amount)   from {A2} group by Client

Cassandra

A
1 =stax_open(\"127.0.0.1\":9042,\"mycasdb\",\"cassandra\":\"cassandra\")
2 =stax_query(A1,\"select   * from user where id=?\",1)
3 =stax_close(A1)
4 $select Client,sum(Amount)   from {A2} group by Client

ElasticSearch

A
1 =es_open(\"localhost:9200\",\"user\":\"un1234\")
2 =es_get(A1,\"/person/_mget\",\"{\\\"ids\\\":[\\\"1\\\",\\\"2\\\",\\\"5\\\"]}\")
3 =es_close(A1)
4 $select   Client,sum(Amount) from {A2} group by Client

Redis

A
1 =redis_open()
2 =redis_hscan(A1,   \"runoobkey\", \"v*\", 3)
3 =redis_close (A1)
4 $select key,value from   {A2} where value>=2000  and   value<3000

SAP BW

A
1 =sap_open(\"userName\",\"passWord\",\"192.168.0.188\",\"00\",\"000\",”E\")
2 =sap_cursor(A1,   \"Z_TEST1\",\"IT_ROOM\").fetch()
3 =sap_close(A1)
4 $select * from   {A2} where Vendor like \'%software%\'

InfluxDB

A
1 =influx_open(\"http://127.0.0.1:8086\",   \"mydb\", \"autogen\", \"admin\", \"admin\")
2 =influx_query(A1,   \"SELECT * FROM Orders\")
3 =influx_close(A1)
4 $select   Client,sum(Amount) from {A2} group by Client

Kafka

A
1 =kafka_open(\"D://kafka.properties\";\"topic-test\")
2 =kafka_poll(A1)
3 =kafka_close (A1)
4 $select   Client,sum(Amount) from {A2} group by Client

MDX多维数据库

A
1 =olap_open(\"http://192.168.0.178:8088/msmdpump.dll\",\"CubeTest\",\"Administrator\",\"admin\")
2 =olap_query(A1,\"with member [Measures].[AnnualInterestRate]   as\'[Measures].[SalesAmount]/[Measures].[StandardCost]-1\'select   {[Measures].[SalesAmount],[Measures].[StandardCost],[Measures].[  AnnualInterestRate]} on columns, {[Order Date].[Calendar Year].[Calendar   Year]} on rows from [DataSourceMulti]\")
3 =olap_close(A1)
4 $select * from {A2} where   SalesAmount>10000

SPL除了支持种类繁多的数据源,也可以进行数据源之间的混合计算。比如csv和RDB之间:

$select   o.OrderId,o.Client,e.Name e.Dept from d:/Orders.csv o inner join   d:/Employees.xls e on o.SellerId=e.Eid

MongoDB和数据库之间:

A B
1 =mongo_open(\"mongodb://127.0.0.1:27017/mongo\")
2 =mongo_shell@x(A1,\"detail.find()\").fetch() =connect(\"orcl\").query@x(\"select   * from main\")
3 $select d.title,   m.path,sum(d.amount) from {A2} as d left join {B2} as m on d.cat=m.cat group   by d.title, m.path

任意数据源之间都可以进行混合计算,而且SQL语法不受数据源的影响。

SPL还支持命令行执行,任何主流操作系统都可以,对于应用程序中的计算,SPL提供了标准的JDBC驱动,可以方便地集成进JAVA、SPL允许将代码外置于JAVA程序。(比如SPL代码可以先存为脚本文件,再在JAVA中以存储过程的形式调用)。

有了SPL这样的开源神器,没有RDB也能轻松使用SQL了。

ItVuer - 免责声明 - 关于我们 - 联系我们

本网站信息来源于互联网,如有侵权请联系:561261067@qq.com

桂ICP备16001015号