Alphanum sorting for humans in Lua

We have been discussing options for my Serpent serializer on the Lua maillist and David Manura suggested natural sort order as one of possible choices for sorting table keys. I liked the idea and came up with my own version; you may find other implementation in various languages on Dave Koelle's page.

Most of the implementations of this alphanumeric sort first split the values being sorted into chunks of numbers and non-numbers and then sort them going over those chunks and comparing them pairwise. After reviewing options for split in Lua I decided to look for alternatives and came up with an algorithm that doesn't require splitting strings or storing intermediate chunks. I have update the post to offer you four versions so that you can pick the one that matches your case (see the updates at the bottom of the post).

1. The simplest one. Doesn't correctly sort numbers with decimal points ("0.11" comes after "0.2") and doesn't ignore leading zeroes ("60" comes before "050"), but should still cover the majority of cases:

function alphanumsort(o)
  local function padnum(d) return ("%03d%s"):format(#d, d) end
  table.sort(o, function(a,b)
    return tostring(a):gsub("%d+",padnum) < tostring(b):gsub("%d+",padnum) end)
  return o
end

2. Ignore leading zeros. This one sorts "060" and "60" together (so "050" will come before "60"), but the order in which "060" and "60" are sorted is undetermined:

function alphanumsort(o)
  local function padnum(d) local r = string.match(d, "0*(.+)")
    return ("%03d%s"):format(#r, r) end
  table.sort(o, function(a,b)
    return tostring(a):gsub("%d+",padnum) < tostring(b):gsub("%d+",padnum) end)
  return o
end

3. Sort numbers with decimal points. This one puts "0.2" after "0.11":

function alphanumsort(o)
  local function padnum(d) local dec, n = string.match(d, "(%.?)0*(.+)")
    return #dec > 0 and ("%.12f"):format(d) or ("%s%03d%s"):format(dec, #n, n) end
  table.sort(o, function(a,b)
    return tostring(a):gsub("%.?%d+",padnum)
         < tostring(b):gsub("%.?%d+",padnum) end)
  return o
end

4. This one partially fixes leading zeros, such that "0050" is put before "050", which is in turn before "50"; if you want "050" to be after "50" then reverse #a and #b in the last comparison:

function alphanumsort(o)
  local function padnum(d) local dec, n = string.match(d, "(%.?)0*(.+)")
    return #dec > 0 and ("%.12f"):format(d) or ("%s%03d%s"):format(dec, #n, n) end
  table.sort(o, function(a,b)
    return tostring(a):gsub("%.?%d+",padnum)..("%3d"):format(#b)
         < tostring(b):gsub("%.?%d+",padnum)..("%3d"):format(#a) end)
  return o
end

This is how you would use it:

local o = {"1000X Rad Max","10X Rad","200X Rad","20X Rad","20X Rad Prime",
   "30X Rad","40X Rad","Abc 012b","Abc 12a","Abc 0.11","Abc 0.2", 
   "Abc 50.02 Ult","Abc 50.01 Ult","Abc 50.1 Ult","Abc 060 Ult",
   "Abc 0060 Ult", "Abc 050 Ult","Abc 0050 Ult","Abc 51 Ult","Abc 50 Ult",
   "Abc 500 Ult","Abc 51 Ult", "Abc 51B Ult","Abc 52 Ult","Abc 60 Ult",
   "Alpha 100","Alpha 2","Alpha 200","Alpha 2A","Alpha 2A-8000","Alpha 2A-900"}
for k,v in ipairs(alphanumsort(o)) do print(v) end

And these are the results that would be generated for each of the options:

#0. Default      #1. Simplest     #2. Leading 0    #3. Decimals     #4. Best
-------------    -------------    -------------    -------------    -------------
1000X Rad Max    10X Rad          10X Rad          10X Rad          10X Rad
10X Rad          20X Rad          20X Rad          20X Rad          20X Rad
200X Rad         20X Rad Prime    20X Rad Prime    20X Rad Prime    20X Rad Prime
20X Rad          30X Rad          30X Rad          30X Rad          30X Rad
20X Rad Prime    40X Rad          40X Rad          40X Rad          40X Rad
30X Rad          200X Rad         200X Rad         200X Rad         200X Rad
40X Rad          1000X Rad Max    1000X Rad Max    1000X Rad Max    1000X Rad Max
Abc 0.11         Abc 0.2          Abc 0.2          Abc 0.11         Abc 0.11
Abc 0.2          Abc 0.11         Abc 0.11         Abc 0.2          Abc 0.2
Abc 0050 Ult     Abc 12a          Abc 12a          Abc 12a          Abc 12a
Abc 0060 Ult     Abc 50 Ult       Abc 012b         Abc 012b         Abc 012b
Abc 012b         Abc 50.1 Ult     Abc 050 Ult      Abc 050 Ult      Abc 0050 Ult
Abc 050 Ult      Abc 50.01 Ult    Abc 0050 Ult     Abc 0050 Ult     Abc 050 Ult
Abc 060 Ult      Abc 50.02 Ult    Abc 50 Ult       Abc 50 Ult       Abc 50 Ult
Abc 12a          Abc 51 Ult       Abc 50.1 Ult     Abc 50.01 Ult    Abc 50.01 Ult
Abc 50 Ult       Abc 51 Ult       Abc 50.01 Ult    Abc 50.02 Ult    Abc 50.02 Ult
Abc 50.01 Ult    Abc 51B Ult      Abc 50.02 Ult    Abc 50.1 Ult     Abc 50.1 Ult
Abc 50.02 Ult    Abc 52 Ult       Abc 51 Ult       Abc 51 Ult       Abc 51 Ult
Abc 50.1 Ult     Abc 60 Ult       Abc 51 Ult       Abc 51 Ult       Abc 51 Ult
Abc 500 Ult      Abc 012b         Abc 51B Ult      Abc 51B Ult      Abc 51B Ult
Abc 51 Ult       Abc 050 Ult      Abc 52 Ult       Abc 52 Ult       Abc 52 Ult
Abc 51 Ult       Abc 060 Ult      Abc 60 Ult       Abc 60 Ult       Abc 0060 Ult
Abc 51B Ult      Abc 500 Ult      Abc 0060 Ult     Abc 0060 Ult     Abc 060 Ult
Abc 52 Ult       Abc 0050 Ult     Abc 060 Ult      Abc 060 Ult      Abc 60 Ult
Abc 60 Ult       Abc 0060 Ult     Abc 500 Ult      Abc 500 Ult      Abc 500 Ult
Alpha 100        Alpha 2          Alpha 2          Alpha 2          Alpha 2
Alpha 2          Alpha 2A         Alpha 2A         Alpha 2A         Alpha 2A
Alpha 200        Alpha 2A-900     Alpha 2A-900     Alpha 2A-900     Alpha 2A-900
Alpha 2A         Alpha 2A-8000    Alpha 2A-8000    Alpha 2A-8000    Alpha 2A-8000
Alpha 2A-8000    Alpha 100        Alpha 100        Alpha 100        Alpha 100
Alpha 2A-900     Alpha 200        Alpha 200        Alpha 200        Alpha 200

[Update 8/25/2016] 5. While the previous function already does what's needed in most of the cases, there is one special case where it fails to sort in a way that humans would sort it: when a number in one string is in a position where a number is absent in some other string. For example: when sorting {"dir\\file.lua", "dir2\\file.lua", "dir3\\file.lua", "dir3\\file2.lua"} using the algorithm above, you'd get {"dir2\\file.lua", "dir3\\file.lua", "dir3\\file2.lua", "dir\\file.lua"}, which is probably not what you'd expect.

Egor Skriptunoff came up with a clever fix for this problem:

function alphanumsort(o)
   local function conv(s)
      local res, dot = "", ""
      for n, m, c in tostring(s):gmatch"(0*(%d*))(.?)" do
         if n == "" then
            dot, c = "", dot..c
         else
            res = res..(dot == "" and ("%03d%s"):format(#m, m)
                                  or "."..n)
            dot, c = c:match"(%.?)(.*)"
         end
         res = res..c:gsub(".", "\0%0")
      end
      return res
   end
   table.sort(o,
      function (a, b)
         local ca, cb = conv(a), conv(b)
         return ca < cb or ca == cb and a < b
      end)
   return o
end

If you sort the same table using this function, you'll get the sorted table you'd likely expect: {"dir\\file.lua", "dir2\\file.lua", "dir3\\file.lua", "dir3\\file2.lua"}.

One caveat for those using UTF-8 strings: the converted text may not be a proper UTF-8 text after the conversion. The fix is to only prepend zeros to the starting bytes of UTF-8 symbols:

res = res..c:gsub("[\0-\127\192-\255]", "\0%0")
...
return utf8.lt(ca, cb) or ca == cb and utf8.lt(a, b)

[Update 8/25/2012] I rewrote the solution based on the feedback and discussion on the maillist. These notes no longer apply, as they were for an earlier version that suffered from not handling numbers larger than maxint on your platform. Here is the original code:

function alphanumsort(o)
  local function padnum(d) return ("%012d"):format(d) end
  table.sort(o, function(a,b)
    return tostring(a):gsub("%d+",padnum) < tostring(b):gsub("%d+",padnum) end)
  return o
end

If you are concerned that the hard-coded length (12) may be too small for your numeric fragments, you can replace it with something a bit more elaborate:

  local maxl = 0
  for n,v in ipairs(o) do tostring(v)
    :gsub("%d+", function(d) if #d > maxl then maxl = #d end; return d end) end
  local function padnum(d) return ("%0"..maxl.."d"):format(d) end

Keep in mind that this may fail badly if your text happened to have a long sequence of numbers as some libraries may not be able to take an arbitrarily large length in number formatting. For example, on Windows ("%099d"):format(1) works, but ("%100d"):format(1) fails with invalid format (width or precision too long) message.

You should get a copy of my slick ZeroBrane Studio IDE.

Leave a comment

what will you say?
(required)
(required)
Close
OSZAR »