Regex-like string processing in SQL Server

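SQL Server provides simple fuzzy string matching, for example like and patindex, but this is not enough for some string-processing scenarios. The sections below cover several problems that come up in day-to-day work.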
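As a quick reminder of what the built-in matching can do, the sketch below uses LIKE and PATINDEX with the wildcard syntax they share (%, _, [ ] and [^ ]). The sample values are made up for illustration.

```sql
-- LIKE only answers match / no match
select case when 'abc123' like '%[0-9]' then 1 else 0 end as ends_with_digit   -- 1

-- PATINDEX returns the 1-based position of the first match, or 0 if there is none
select patindex('%[0-9]%', 'abc123') as first_digit_pos       -- 4
select patindex('%[^a-z0-9]%', 'abc123') as first_other_pos   -- 0
```

Each bracket expression matches exactly one character and there are no quantifiers such as + or {n}, which is where the limitations discussed below start.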

I. How many times the same character / string occurs

For a single character, replace it with an empty string and compare the lengths:

declare @text varchar(1000)
declare @str varchar(10)
set @text = 'ABCBDBE'
set @str = 'B'
-- length before minus length after removing @str = number of occurrences
select len(@text) - len(replace(@text, @str, ''))

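For a string the idea is still replacement. Because several characters are involved, method 1 needs a division after the replacement; method 2 appends one extra character during the replacement, so no division is needed.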

-- Method 1
declare @text varchar(1000)
declare @str varchar(10)
set @text = 'ABBBCBBBDBBBE'
set @str = 'BBB'
select (len(@text) - len(replace(@text, @str, ''))) / len(@str)
GO
-- Method 2 (separate batch, because the variables are declared again)
declare @text varchar(1000)
declare @str varchar(10)
set @text = 'ABBBCBBBDBBBE'
set @str = 'BBB'
select len(replace(@text, @str, @str + '_')) - len(@text)
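One caveat with the LEN-based counts above: LEN ignores trailing spaces, so the result can be off when the text ends in spaces or when the character being counted is itself a space. A minimal sketch of a variant based on DATALENGTH, which counts bytes; the sample value is made up, and for nvarchar data each character takes 2 bytes, which the division by datalength(@str) already accounts for as long as both variables have the same type.

```sql
declare @text varchar(1000)
declare @str varchar(10)
set @text = 'a b '   -- two spaces, one of them trailing
set @str = ' '

-- LEN drops the trailing space, so one occurrence is missed
select len(@text) - len(replace(@text, @str, '')) as count_with_len   -- 1

-- DATALENGTH counts every byte of the varchar value, trailing spaces included
select (datalength(@text) - datalength(replace(@text, @str, ''))) / datalength(@str)
       as count_with_datalength                                        -- 2
```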

II. Position of the Nth occurrence of the same character / string

The SQL Server function for locating a character position is CHARINDEX:

CHARINDEX ( expressionToFind , expressionToSearch [ , start_location ] )

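It can start searching from a given position, but it cannot return the position of the Nth occurrence. That has to be written in SQL yourself, and there are several ways to do it:

1. A user-defined function: loop over charindex, incrementing a counter each time, until the counter reaches N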

if object_id('dbo.NthChar', 'FN') is not null
    drop function dbo.NthChar
GO
create function dbo.NthChar
(
    @source_string as nvarchar(4000),
    @sub_string    as nvarchar(1024),
    @nth           as int
)
returns int
as
begin
    declare @position int
    declare @count int
    set @position = CHARINDEX(@sub_string, @source_string)
    set @count = 0
    -- keep searching from just past the previous hit until the Nth hit or no hit
    while @position > 0
    begin
        set @count = @count + 1
        if @count = @nth
        begin
            break
        end
        set @position = CHARINDEX(@sub_string, @source_string, @position + 1)
    end
    return @position
end
GO
--select dbo.NthChar('abcabc', 'abc', 2)
--4

2. A CTE that works on a whole table column: each recursion step calls charindex and increments a counter, until the counter reaches N

if object_id('tempdb..#T') is not null
    drop table #T
create table #T (source_string nvarchar(4000))
insert into #T values (N'我們我們')
insert into #T values (N'我我哦我')

declare @sub_string nvarchar(1024)
declare @nth int
set @sub_string = N'我們'
set @nth = 2

;with T(source_string, starts, pos, nth) as
(
    -- anchor: first occurrence in every row
    select source_string, 1, charindex(@sub_string, source_string), 1
    from #T
    union all
    -- recursion: next occurrence, searched from just past the previous one
    select source_string, pos + 1, charindex(@sub_string, source_string, pos + 1), nth + 1
    from T
    where pos > 0
)
select source_string, pos, nth
from T
where pos > 0 and nth = @nth
order by source_string, starts
--source_string  pos  nth
--我們我們        3    2

3. A numbers table (tally table): run charindex from every possible starting position. The numbers table has to be built first.

--numbers/tally table
IF EXISTS (select * from dbo.sysobjects where id = object_id(N'[dbo].[Numbers]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
    DROP TABLE dbo.Numbers

--===== Create and populate the Tally table on the fly
SELECT TOP 1000000 IDENTITY(int,1,1) AS number
INTO dbo.Numbers
FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2

--===== Add a Primary Key to maximize performance
ALTER TABLE dbo.Numbers
    ADD CONSTRAINT PK_numbers_number PRIMARY KEY CLUSTERED (number)

--===== Allow the general public to use it
GRANT SELECT ON dbo.Numbers TO PUBLIC

-- The numbers table above only needs to be created once, not on every run
DECLARE @source_string nvarchar(4000), @sub_string nvarchar(1024), @nth int
SET @source_string = 'abcabcvvvvabc'
SET @sub_string = 'abc'
SET @nth = 2

;WITH T AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY number) AS nth, number AS [Position In String]
    FROM dbo.Numbers n
    WHERE n.number <= LEN(@source_string)
      -- a match starts exactly at this position
      AND CHARINDEX(@sub_string, @source_string, n.number) - number = 0
      ----OR
      --AND SUBSTRING(@source_string, number, LEN(@sub_string)) = @sub_string
)
SELECT * FROM T WHERE nth = @nth

4. CROSS APPLY combined with charindex. This suits small values of N, because the number of CROSS APPLY steps grows with N and the statement has to be edited accordingly.

declare @T table (source_string nvarchar(4000))
insert into @T values ('abcabc'), ('abcabcvvvvabc')
declare @sub_string nvarchar(1024)
set @sub_string = 'abc'

select source_string, p1.pos as no1, p2.pos as no2, p3.pos as no3
from @T
cross apply (select charindex(@sub_string, source_string)) as P1(Pos)
-- once a search returns 0, stop; otherwise the next charindex would start over from position 1
cross apply (select case when P1.Pos > 0 then charindex(@sub_string, source_string, P1.Pos + 1) else 0 end) as P2(Pos)
cross apply (select case when P2.Pos > 0 then charindex(@sub_string, source_string, P2.Pos + 1) else 0 end) as P3(Pos)

5. SSIS has built-in functions for this, but T-SQL does not

--FINDSTRING in SQL Server 2005 SSIS
FINDSTRING([yourColumn], "|", 2)
--TOKEN in SQL Server 2012 SSIS
TOKEN(Col1, "|", 3)

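Note: these methods follow much the same logic as string splitting; one locates, the other extracts. To get one or more characters to the left or right of the Nth occurrence, take its position and combine it with substring.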
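As a sketch of that last point, once two neighbouring delimiter positions are known, SUBSTRING can cut out the piece between them. The sample string and the variable names @p1 and @p2 are made up for illustration; the code assumes the delimiter occurs at least twice and relies on the DECLARE-with-initial-value syntax of SQL Server 2008 and later.

```sql
declare @s varchar(100) = 'aa|bbb|cccc|dd'
declare @p1 int = charindex('|', @s)            -- position of the 1st delimiter: 3
declare @p2 int = charindex('|', @s, @p1 + 1)   -- position of the 2nd delimiter: 7

-- the 2nd field lies strictly between the two delimiters
select substring(@s, @p1 + 1, @p2 - @p1 - 1) as second_field   -- bbb
```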

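III. Collapsing a run of the same character into one character

The most common case is collapsing several consecutive spaces into a single space. There are two approaches:

1. The obvious one is to use replace repeatedly

However, the number of replace calls needed is not known in advance, so it has to run in a loop.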

-- Replace two consecutive spaces with one, and loop until charindex no longer finds two consecutive spaces
declare @str varchar(100)
set @str = 'abc   abc  kljlk    kljkl'
while (charindex('  ', @str) > 0)
begin
    select @str = replace(@str, '  ', ' ')
end
select @str

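2. Split the string on spaces

Trim (or replace) each piece after splitting, then join the pieces with a single space. This is somewhat tedious, so no code sample is given; for how to split a string, see "Position of the Nth occurrence" above.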
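Besides the two approaches above, a loop-free variant that is often used is three nested REPLACE calls with a two-character marker. This is a different technique from the loop shown earlier, and the sketch assumes that the characters < and > never occur in the data.

```sql
declare @str varchar(100)
set @str = 'abc   abc  kljlk    kljkl'

select replace(
         replace(
           replace(@str, ' ', '<>'),   -- every space becomes the marker pair <>
           '><', ''),                  -- a run such as <><><> collapses to a single <>
         '<>', ' ')                    -- the remaining marker becomes one space
-- abc abc kljlk kljkl
```

If the data may contain < or >, any other pair of characters that cannot appear in it would serve as the marker.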

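IV. Is it a valid IP address / ID card number / mobile number, etc.

Strings such as IP addresses, ID card numbers and mobile numbers usually follow their own specific rules. Checking them character by character or segment by segment with substring is possible, but this kind of SQL often performs poorly, so regular expression functions are worth trying instead; see section V.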
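For comparison, the segment-by-segment style of check can be sketched in plain T-SQL for IPv4 addresses. This is only an illustration under stated assumptions: the table variable @ip and its sample rows are made up, PARSENAME is borrowed to split the four dot-separated parts, the varchar(15) column keeps every part short enough to cast to int, and leading zeros such as 010 are accepted.

```sql
declare @ip table (ip_address varchar(15))
insert into @ip values ('10.20.30.40'), ('a.b.c.d'), ('256.123.0.254'), ('255.255.255.255')

select ip_address
from @ip
where case
        when ip_address is null then 0
        when ip_address like '%[^0-9.]%' then 0      -- only digits and dots are allowed
        when ip_address like '.%'
          or ip_address like '%.'
          or ip_address like '%..%' then 0           -- no empty segment
        when len(ip_address) - len(replace(ip_address, '.', '')) <> 3 then 0   -- exactly three dots
        when cast(parsename(ip_address, 4) as int) > 255 then 0
        when cast(parsename(ip_address, 3) as int) > 255 then 0
        when cast(parsename(ip_address, 2) as int) > 255 then 0
        when cast(parsename(ip_address, 1) as int) > 255 then 0
        else 1
      end = 1
-- returns 10.20.30.40 and 255.255.255.255
```

The format checks come first in the CASE so that the casts are only reached for values made of digits and dots.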

V. Regular expression functions

1. Oracle

Starting with 10g, regular expressions can be used in queries through a set of functions that support them:

Oracle 10g
REGEXP_LIKE
REGEXP_REPLACE
REGEXP_INSTR
REGEXP_SUBSTR

Oracle 11g (new)
REGEXP_COUNT

Handling the problems above with Oracle's REGEXP functions:

(1) How many times the same character / string occurs

select length(regexp_replace('123-345-566', '[^-]', '')) from dual;
select REGEXP_COUNT('123-345-566', '-') from dual; --Oracle 11g

(2) Position of the Nth occurrence of the same character / string

No regular expression is needed; Oracle's instr can find the position directly:

instr(source_string , sub_string  [,n][,m])

n means the search starts at the nth character and defaults to 1; m means the mth occurrence and defaults to 1.

select instr('abcdefghijkabc', 'abc', 1, 2) position from dual;

(3) Collapsing a run of the same character into one character

select regexp_replace(trim('agc    f   f  '), '\s+', ' ') from dual;

(4) Is it a valid IP address / ID card number / mobile number, etc.

-- Is it a valid IP address?
WITH IP AS
(
    SELECT '10.20.30.40' ip_address FROM dual UNION ALL
    SELECT 'a.b.c.d' ip_address FROM dual UNION ALL
    SELECT '256.123.0.254' ip_address FROM dual UNION ALL
    SELECT '255.255.255.255' ip_address FROM dual
)
SELECT *
FROM IP
WHERE REGEXP_LIKE(ip_address, '^(([0-9]{1}|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\.){3}([0-9]{1}|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$');
-- Valid ID card / mobile numbers: no example for now

2. SQL Server

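At the time of writing, the latest version is SQL Server 2017, which still has no built-in REGEXP functions; they have to be added through CLR. The following implements REG_REPLACE with CLR: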

--1. Enable CLR
EXEC sp_configure 'show advanced options', 1
GO
RECONFIGURE
GO
EXEC sp_configure 'clr enabled', 1
GO
RECONFIGURE
GO
EXEC sp_configure 'show advanced options', 0
GO

2. Create the assembly

--3. Create the CLR function
CREATE FUNCTION [dbo].[regex_replace](@input [nvarchar](4000), @pattern [nvarchar](4000), @replacement [nvarchar](4000))
RETURNS [nvarchar](4000)
WITH EXECUTE AS CALLER, RETURNS NULL ON NULL INPUT
AS EXTERNAL NAME [RegexUtility].[RegexUtility].[RegexReplaceDefault]
GO
--4. Use regex_replace to replace multiple spaces with a single space
select dbo.regex_replace('agc    f   f  ', '\s+', ' ')

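Note: more REGEXP functions can be implemented through CLR. If you can develop in a high-level language, you can write them yourself; alternatively, use an existing open-source contribution.

Summary:

1. The non-regex SQL approaches usually carry over to other databases;

2. Regular expression patterns share much of their syntax across programming languages, usually following the conventions of Perl or of tools such as sed in the Linux shell;

3. In terms of performance: plain SQL predicates > REGEXP functions > user-defined SQL functions.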
