We have been discussing options for my Serpent serializer on the Lua maillist and David Manura suggested natural sort order as one of possible choices for sorting table keys. I liked the idea and came up with my own version; you may find other implementation in various languages on Dave Koelle's page.
Most of the implementations of this alphanumeric sort first split the values being sorted into chunks of numbers and non-numbers and then sort them going over those chunks and comparing them pairwise. After reviewing options for split in Lua I decided to look for alternatives and came up with an algorithm that doesn't require splitting strings or storing intermediate chunks. I have update the post to offer you four versions so that you can pick the one that matches your case (see the updates at the bottom of the post).
1. The simplest one. Doesn't correctly sort numbers with decimal points ("0.11" comes after "0.2") and doesn't ignore leading zeroes ("60" comes before "050"), but should still cover the majority of cases:
function alphanumsort(o)
local function padnum(d) return ("%03d%s"):format(#d, d) end
table.sort(o, function(a,b)
return tostring(a):gsub("%d+",padnum) < tostring(b):gsub("%d+",padnum) end)
return o
end
2. Ignore leading zeros. This one sorts "060" and "60" together (so "050" will come before "60"), but the order in which "060" and "60" are sorted is undetermined:
function alphanumsort(o)
local function padnum(d) local r = string.match(d, "0*(.+)")
return ("%03d%s"):format(#r, r) end
table.sort(o, function(a,b)
return tostring(a):gsub("%d+",padnum) < tostring(b):gsub("%d+",padnum) end)
return o
end
3. Sort numbers with decimal points. This one puts "0.2" after "0.11":
function alphanumsort(o)
local function padnum(d) local dec, n = string.match(d, "(%.?)0*(.+)")
return #dec > 0 and ("%.12f"):format(d) or ("%s%03d%s"):format(dec, #n, n) end
table.sort(o, function(a,b)
return tostring(a):gsub("%.?%d+",padnum)
< tostring(b):gsub("%.?%d+",padnum) end)
return o
end
4. This one partially fixes leading zeros, such that "0050" is put before "050", which is in turn before "50"; if you want "050" to be after "50" then reverse #a and #b in the last comparison:
function alphanumsort(o)
local function padnum(d) local dec, n = string.match(d, "(%.?)0*(.+)")
return #dec > 0 and ("%.12f"):format(d) or ("%s%03d%s"):format(dec, #n, n) end
table.sort(o, function(a,b)
return tostring(a):gsub("%.?%d+",padnum)..("%3d"):format(#b)
< tostring(b):gsub("%.?%d+",padnum)..("%3d"):format(#a) end)
return o
end
This is how you would use it:
local o = {"1000X Rad Max","10X Rad","200X Rad","20X Rad","20X Rad Prime",
"30X Rad","40X Rad","Abc 012b","Abc 12a","Abc 0.11","Abc 0.2",
"Abc 50.02 Ult","Abc 50.01 Ult","Abc 50.1 Ult","Abc 060 Ult",
"Abc 0060 Ult", "Abc 050 Ult","Abc 0050 Ult","Abc 51 Ult","Abc 50 Ult",
"Abc 500 Ult","Abc 51 Ult", "Abc 51B Ult","Abc 52 Ult","Abc 60 Ult",
"Alpha 100","Alpha 2","Alpha 200","Alpha 2A","Alpha 2A-8000","Alpha 2A-900"}
for k,v in ipairs(alphanumsort(o)) do print(v) end
And these are the results that would be generated for each of the options:
#0. Default #1. Simplest #2. Leading 0 #3. Decimals #4. Best
------------- ------------- ------------- ------------- -------------
1000X Rad Max 10X Rad 10X Rad 10X Rad 10X Rad
10X Rad 20X Rad 20X Rad 20X Rad 20X Rad
200X Rad 20X Rad Prime 20X Rad Prime 20X Rad Prime 20X Rad Prime
20X Rad 30X Rad 30X Rad 30X Rad 30X Rad
20X Rad Prime 40X Rad 40X Rad 40X Rad 40X Rad
30X Rad 200X Rad 200X Rad 200X Rad 200X Rad
40X Rad 1000X Rad Max 1000X Rad Max 1000X Rad Max 1000X Rad Max
Abc 0.11 Abc 0.2 Abc 0.2 Abc 0.11 Abc 0.11
Abc 0.2 Abc 0.11 Abc 0.11 Abc 0.2 Abc 0.2
Abc 0050 Ult Abc 12a Abc 12a Abc 12a Abc 12a
Abc 0060 Ult Abc 50 Ult Abc 012b Abc 012b Abc 012b
Abc 012b Abc 50.1 Ult Abc 050 Ult Abc 050 Ult Abc 0050 Ult
Abc 050 Ult Abc 50.01 Ult Abc 0050 Ult Abc 0050 Ult Abc 050 Ult
Abc 060 Ult Abc 50.02 Ult Abc 50 Ult Abc 50 Ult Abc 50 Ult
Abc 12a Abc 51 Ult Abc 50.1 Ult Abc 50.01 Ult Abc 50.01 Ult
Abc 50 Ult Abc 51 Ult Abc 50.01 Ult Abc 50.02 Ult Abc 50.02 Ult
Abc 50.01 Ult Abc 51B Ult Abc 50.02 Ult Abc 50.1 Ult Abc 50.1 Ult
Abc 50.02 Ult Abc 52 Ult Abc 51 Ult Abc 51 Ult Abc 51 Ult
Abc 50.1 Ult Abc 60 Ult Abc 51 Ult Abc 51 Ult Abc 51 Ult
Abc 500 Ult Abc 012b Abc 51B Ult Abc 51B Ult Abc 51B Ult
Abc 51 Ult Abc 050 Ult Abc 52 Ult Abc 52 Ult Abc 52 Ult
Abc 51 Ult Abc 060 Ult Abc 60 Ult Abc 60 Ult Abc 0060 Ult
Abc 51B Ult Abc 500 Ult Abc 0060 Ult Abc 0060 Ult Abc 060 Ult
Abc 52 Ult Abc 0050 Ult Abc 060 Ult Abc 060 Ult Abc 60 Ult
Abc 60 Ult Abc 0060 Ult Abc 500 Ult Abc 500 Ult Abc 500 Ult
Alpha 100 Alpha 2 Alpha 2 Alpha 2 Alpha 2
Alpha 2 Alpha 2A Alpha 2A Alpha 2A Alpha 2A
Alpha 200 Alpha 2A-900 Alpha 2A-900 Alpha 2A-900 Alpha 2A-900
Alpha 2A Alpha 2A-8000 Alpha 2A-8000 Alpha 2A-8000 Alpha 2A-8000
Alpha 2A-8000 Alpha 100 Alpha 100 Alpha 100 Alpha 100
Alpha 2A-900 Alpha 200 Alpha 200 Alpha 200 Alpha 200
[Update 8/25/2016] 5. While the previous function already does what's needed in most of the cases, there is one special case where it fails to sort in a way that humans would sort it: when a number in one string is in a position where a number is absent in some other string. For example: when sorting {"dir\\file.lua", "dir2\\file.lua", "dir3\\file.lua", "dir3\\file2.lua"} using the algorithm above, you'd get {"dir2\\file.lua", "dir3\\file.lua", "dir3\\file2.lua", "dir\\file.lua"}, which is probably not what you'd expect.
Egor Skriptunoff came up with a clever fix for this problem:
function alphanumsort(o)
local function conv(s)
local res, dot = "", ""
for n, m, c in tostring(s):gmatch"(0*(%d*))(.?)" do
if n == "" then
dot, c = "", dot..c
else
res = res..(dot == "" and ("%03d%s"):format(#m, m)
or "."..n)
dot, c = c:match"(%.?)(.*)"
end
res = res..c:gsub(".", "\0%0")
end
return res
end
table.sort(o,
function (a, b)
local ca, cb = conv(a), conv(b)
return ca < cb or ca == cb and a < b
end)
return o
end
If you sort the same table using this function, you'll get the sorted table you'd likely expect: {"dir\\file.lua", "dir2\\file.lua", "dir3\\file.lua", "dir3\\file2.lua"}.
One caveat for those using UTF-8 strings: the converted text may not be a proper UTF-8 text after the conversion. The fix is to only prepend zeros to the starting bytes of UTF-8 symbols:
res = res..c:gsub("[\0-\127\192-\255]", "\0%0")
...
return utf8.lt(ca, cb) or ca == cb and utf8.lt(a, b)
[Update 8/25/2012] I rewrote the solution based on the feedback and discussion on the maillist. These notes no longer apply, as they were for an earlier version that suffered from not handling numbers larger than maxint on your platform. Here is the original code:
function alphanumsort(o)
local function padnum(d) return ("%012d"):format(d) end
table.sort(o, function(a,b)
return tostring(a):gsub("%d+",padnum) < tostring(b):gsub("%d+",padnum) end)
return o
end
If you are concerned that the hard-coded length (12) may be too small for your numeric fragments, you can replace it with something a bit more elaborate:
local maxl = 0
for n,v in ipairs(o) do tostring(v)
:gsub("%d+", function(d) if #d > maxl then maxl = #d end; return d end) end
local function padnum(d) return ("%0"..maxl.."d"):format(d) end
Keep in mind that this may fail badly if your text happened to have a long sequence of numbers as some libraries may not be able to take an arbitrarily large length in number formatting. For example, on Windows ("%099d"):format(1)
works, but ("%100d"):format(1)
fails with invalid format (width or precision too long)
message.
Leave a comment