By: Helder Guerreiro.
In the last year and a half I've been programming (very slowly I admit) a web mail client written in Python. The objective is to have more or less the same feature set that Squirrelmail has.
You can check the result of my efforts here. Please be advised that this software is on its early stages and far from complete, it was only tested with Cyrus IMAP which is a very standard compliant IMAP server, so, if you test the program with other less compliant servers it may fail to work.
By now you've guessed that I'm using the IMAP server as the backend storage system for the web mail client. Since the IMAP server responses are symbolic expressions (s-exp for short) I needed a fast enough scanner to be able to parse the server responses. Note that this responses can be a bit lengthy and complex.
I'll try to show how I've solved this problem in the context described above.
For instance the response to a FETCH command from an IMAP server can be:
266 FETCH (FLAGS (\Seen) UID 31608 INTERNALDATE "30-Jan-2008 02:48:01 +0000" RFC822.SIZE 4509 ENVELOPE ("Tue, 29 Jan 2008 14:00:24 +0000" "Aprenda as tXcnicas e os truques da cozinha mais doce..." (("Ediclube" NIL "ediclube" "sigmathis.info")) (("Ediclube" NIL "ediclube" "sigmathis.info")) ((NIL NIL "ediclube" "sigmathis.info")) ((NIL NIL "helder" "example.com")) NIL NIL NIL "<64360f85d83238281a27b921fd3e7eb3@localhost.localdomain>"))
I want to convert this string to a python list:
[266, 'FETCH', ['FLAGS', ['\\Seen'], 'UID', 31608, 'INTERNALDATE', '30-Jan-2008 02:48:01 +0000', 'RFC822.SIZE', 4509, 'ENVELOPE', ['Tue, 29 Jan 2008 14:00:24 +0000', 'Aprenda as tXcnicas e os truques da cozinha mais doce...', [['Ediclube', None, 'ediclube', 'sigmathis.info']], [['Ediclube', None, 'ediclube', 'sigmathis.info']], [[None, None, 'ediclube', 'sigmathis.info']], [[None, None, 'helder', 'example.com']], None, None, None, '<64360f85d83238281a27b921fd3e7eb3@localhost.localdomain>']]]
This is a fairly common IMAP command. I'm requesting the message flag list (FLAGS), its unique identifier (UID), the message size (RFC822.SIZE) and the message ENVELOPE (from address, recipient list, date, subject, etc). This command may be used to extract the necessary message headers to display the message list of a given folder. This response would be repeated for each message on the folder, meaning that potentially this would be repeated thousand of times before displaying the folder message list. This is why the solution had to be "fast enough".
You can get my solution to this problem here.
The relevant function is:
# -*- coding: utf-8 -*-
#
# Helder Guerreiro <helder em paxjulia.com>
#
# $Id: sexp.py 364 2008-06-16 11:46:37Z helder $
#
# Global imports
import re, string
# Regexp
literal_re = re.compile(r'^{(\d+)}\r\n')
simple_re = re.compile(r'^([^ ()]+)')
quoted_re = re.compile(r'^"((?:[^"\\]|\\")*?)"')
# Errors
class SError(Exception): pass
def scan_sexp(text):
'''S-Expression scanner.
This is a non-recursive version. It uses the lists property of assigning
only by reference to assemble the s-exp.
@param text: text to be scanned.
@type text: s-exp string
@return result: s-exp in a python list.
'''
# Initialization
pos = 0
lenght = len(text)
current = ''
result = []
cur_result = result
level = [ cur_result ]
# Scanner
while pos < lenght:
# Quoted literal:
if text[pos] == '"':
quoted = quoted_re.match(text[pos:])
if quoted:
cur_result.append( quoted.groups()[0] )
pos += quoted.end() - 1
# Numbered literal:
elif text[pos] == '{':
lit = literal_re.match(text[pos:])
if lit:
start = pos+lit.end()
end = pos+lit.end()+int(lit.groups()[0])
pos = end - 1
cur_result.append( text[ start:end ] )
# Simple literal
elif text[pos] not in '() ':
simple = simple_re.match(text[pos:])
if simple:
tmp = simple.groups()[0]
if tmp.isdigit():
tmp = int(tmp)
elif tmp == 'NIL':
tmp = None
cur_result.append( tmp )
pos += simple.end() - 1
# Level handling, if we find a '(' we must add another list, if we
# find a ')' we must return to the previous list.
elif text[pos] == '(':
cur_result.append([])
cur_result = cur_result[-1]
level.append(cur_result)
elif text[pos] == ')':
try:
cur_result = level[-2]
del level[-1]
except IndexError:
raise SError('Unexpected parenthesis at pos %d' % pos)
pos += 1
return result
This scanner accepts three types of atoms (besides the possible nested expressions):
It's also useful to convert the keyword 'NIL' to None, and the integers to integers.
In order to transform a string onto a python list, the string is read character by character. If we have a very long string (for instance the entire body of a message) this can take a long time. To solve this, the type of each atom is guessed using a regular expression. When a regex matches, the final position of the match is extracted and, this way, we can jump to the end of the atom instead of going through all the bytes of the string. Using python slices the current atom is extracted and added to the current result (that is a list).
The possible nested sub expressions are extracted relying on the fact that in python the lists are assigned by reference (very much like the usage of pointers on lower level languages). As soon as a '(' is found, a new list is created and all the atoms found from this point on are appended to this new list. The previous list that was being used is stored on a list stack (simply a list of lists). When a ')' is found on the string, we make the previously used list the current one, remove the reference to it from the list stack and continue to add atoms to it.
Using python 2.5 on a Pentium D at 3.4GHz, this function takes about 0.3ms to parse the example line above (on average). This performance level is just about right for what I needed.