
``email``: Parsing email messages
*********************************

Message object structures can be created in one of two ways: they can
be created from whole cloth by instantiating ``Message`` objects and
stringing them together via ``attach()`` and ``set_payload()`` calls,
or they can be created by parsing a flat text representation of the
email message.

The ``email`` package provides a standard parser that understands most
email document structures, including MIME documents.  You can pass the
parser a string or a file object, and the parser will return to you
the root ``Message`` instance of the object structure.  For simple,
non-MIME messages the payload of this root object will likely be a
string containing the text of the message. For MIME messages, the root
object will return ``True`` from its ``is_multipart()`` method, and
the subparts can be accessed via the ``get_payload()`` and ``walk()``
methods.
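
As a minimal sketch of the two cases described above, the following
parses one plain message and one small MIME message.  The addresses,
the ``BOUND`` boundary string and the variable names are made up for
illustration; ``message_from_string()`` is the convenience function
described later in this section.

```python
import email

# A plain, non-MIME message: the root payload is the body text itself.
plain_text = "From: alice@example.com\nSubject: Hello\n\nJust a plain body.\n"
plain = email.message_from_string(plain_text)
print(plain.is_multipart())        # False
print(repr(plain.get_payload()))   # the body as a single string

# A small multipart message with two subparts (illustrative content).
mime_text = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="BOUND"\n'
    "\n"
    "--BOUND\n"
    "Content-Type: text/plain\n"
    "\n"
    "first part\n"
    "--BOUND\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>second part</p>\n"
    "--BOUND--\n"
)
mime = email.message_from_string(mime_text)
print(mime.is_multipart())         # True

# walk() visits the root first, then each subpart in document order.
content_types = [part.get_content_type() for part in mime.walk()]
print(content_types)
```

Here ``walk()`` is used only to list content types; ``get_payload()``
on the root would instead return the list of the two subpart objects.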

Two parser interfaces are available: the classic ``Parser`` API and
the incremental ``FeedParser`` API.  The classic ``Parser`` API is
fine if you have the entire text of the message in memory as a string,
or if the entire message lives in a file on the file system.
``FeedParser`` is more appropriate when you are reading the message
from a stream that might block waiting for more input (e.g. reading an
email message from a socket).  The ``FeedParser`` can consume and
parse the message incrementally, and only returns the root object when
you close the parser [1].

Note that the parser can be extended in limited ways, and of course
you can implement your own parser completely from scratch.  There is
no magical connection between the ``email`` package's bundled parser
and the ``Message`` class, so your custom parser can create message
object trees any way it finds necessary.


FeedParser API
==============

The ``FeedParser``, importable from the ``email.parser`` module (it is
defined in ``email.feedparser``), provides an API that is conducive to
incremental parsing of email
messages, such as would be necessary when reading the text of an email
message from a source that can block (e.g. a socket).  The
``FeedParser`` can of course be used to parse an email message fully
contained in a string or a file, but the classic ``Parser`` API may be
more convenient for such use cases.  The semantics and results of the
two parser APIs are identical.

The ``FeedParser``'s API is simple; you create an instance, feed it a
bunch of text until there's no more to feed it, then close the parser
to retrieve the root message object.  The ``FeedParser`` is extremely
accurate when parsing standards-compliant messages, and it does a very
good job of parsing non-compliant messages, providing information
about how a message was deemed broken.  It will populate a message
object's *defects* attribute with a list of any problems it found in a
message.  See the ``email.errors`` module for the list of defects that
it can find.

Here is the API for the ``FeedParser``:

class email.parser.FeedParser([_factory])

   Create a ``FeedParser`` instance.  Optional *_factory* is a no-
   argument callable that will be called whenever a new message object
   is needed.  It defaults to the ``email.message.Message`` class.

   feed(data)

      Feed the ``FeedParser`` some more data.  *data* should be a
      string containing one or more lines.  The lines can be partial
      and the ``FeedParser`` will stitch such partial lines together
      properly.  The lines in the string can have any of the common
      three line endings, carriage return, newline, or carriage return
      and newline (they can even be mixed).

   close()

      Closing a ``FeedParser`` completes the parsing of all previously
      fed data, and returns the root message object.  It is undefined
      what happens if you feed more data to a closed ``FeedParser``.


Parser class API
================

The ``Parser`` class, imported from the ``email.parser`` module,
provides an API that can be used to parse a message when the complete
contents of the message are available in a string or file.  The
``email.parser`` module also provides a second class, called
``HeaderParser``, which can be used if you're only interested in the
headers of the message. ``HeaderParser`` can be much faster in these
situations, since it does not attempt to parse the message body,
instead setting the payload to the raw body as a string.
``HeaderParser`` has the same API as the ``Parser`` class.

class email.parser.Parser([_class])

   The constructor for the ``Parser`` class takes an optional argument
   *_class*.  This must be a callable factory (such as a function or a
   class), and it is used whenever a sub-message object needs to be
   created.  It defaults to ``Message`` (see ``email.message``).  The
   factory will be called without arguments.

   An optional *strict* flag is also accepted, but it is ignored.

   Deprecated since version 2.4: Because the ``Parser`` class is a
   backward compatible API wrapper around the new-in-Python 2.4
   ``FeedParser``, *all* parsing is effectively non-strict.  You
   should simply stop passing a *strict* flag to the ``Parser``
   constructor.

   The other public ``Parser`` methods are:

   parse(fp[, headersonly])

      Read all the data from the file-like object *fp*, parse the
      resulting text, and return the root message object.  *fp* must
      support both the ``readline()`` and the ``read()`` methods on
      file-like objects.

      The text contained in *fp* must be formatted as a block of **RFC
      2822** style headers and header continuation lines, optionally
      preceded by an envelope header.  The header block is terminated
      either by the end of the data or by a blank line.  Following the
      header block is the body of the message (which may contain MIME-
      encoded subparts).

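      Optional *headersonly* is a flag specifying whether to stop
      parsing after reading the headers or not.  The default is
      ``False``, meaning it parses the entire contents of the file.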

   parsestr(text[, headersonly])

      Similar to the ``parse()`` method, except it takes a string
      object instead of a file-like object.  Calling this method on a
      string is exactly equivalent to wrapping *text* in a
      ``StringIO`` instance first and calling ``parse()``.

      Optional *headersonly* is a flag specifying whether to stop
      parsing after reading the headers or not.  The default is
      ``False``, meaning it parses the entire contents of the string.
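
To illustrate the difference between ``Parser`` and ``HeaderParser``,
the sketch below parses the same text with both.  The message text and
the ``BOUND`` boundary are invented for the example.

```python
from email.parser import HeaderParser, Parser

# Illustrative multipart message with a single text part.
text = (
    "Subject: Report\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="BOUND"\n'
    "\n"
    "--BOUND\n"
    "Content-Type: text/plain\n"
    "\n"
    "body of the only part\n"
    "--BOUND--\n"
)

# Parser builds the full tree: the root is a multipart container.
full = Parser().parsestr(text)
print(full.is_multipart())           # True

# HeaderParser stops after the headers and keeps the body as raw text.
headers_only = HeaderParser().parsestr(text)
print(headers_only["Subject"])       # Report
print(headers_only.is_multipart())   # False

# Passing headersonly=True to Parser.parsestr() has the same effect.
also_headers_only = Parser().parsestr(text, headersonly=True)
print(also_headers_only.is_multipart())   # False
```

In both header-only results the boundary lines are still present in the
string payload, because the body was never split into subparts.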

Since creating a message object structure from a string or a file
object is such a common task, two functions are provided as a
convenience.  They are available in the top-level ``email`` package
namespace.

email.message_from_string(s[, _class[, strict]])

   Return a message object structure from a string.  This is exactly
   equivalent to ``Parser().parsestr(s)``.  Optional *_class* and
   *strict* are interpreted as with the ``Parser`` class constructor.

email.message_from_file(fp[, _class[, strict]])

   Return a message object structure tree from an open file object.
   This is exactly equivalent to ``Parser().parse(fp)``.  Optional
   *_class* and *strict* are interpreted as with the ``Parser`` class
   constructor.

Here's an example of how you might use this at an interactive Python
prompt:

   >>> import email
   >>> myString = 'Subject: Hello\n\nThis is the body.\n'
   >>> msg = email.message_from_string(myString)
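
The file-based function and the *_class* argument can be sketched in
the same spirit.  ``TaggedMessage`` is a hypothetical subclass that
exists only for this example, and the temporary file stands in for a
message stored on disk.

```python
import email
import os
import tempfile
from email.message import Message


class TaggedMessage(Message):
    """Hypothetical Message subclass used only for illustration."""


text = "From: bob@example.com\nSubject: From a file\n\nBody read from disk.\n"

# Write the message to a temporary file so there is something to open.
fd, path = tempfile.mkstemp(suffix=".eml")
try:
    with os.fdopen(fd, "w") as fp:
        fp.write(text)

    # message_from_file() reads the open file and returns the root
    # object; the second argument is the factory for message objects.
    with open(path) as fp:
        msg = email.message_from_file(fp, TaggedMessage)
finally:
    os.remove(path)

print(type(msg).__name__)   # TaggedMessage
print(msg["Subject"])       # From a file
```

Passing a factory is optional; without it the root object is a plain
``Message`` instance.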


Additional notes
================

Here are some notes on the parsing semantics:

* Most non-*multipart* type messages are parsed as a single message
  object with a string payload.  These objects will return ``False``
  for ``is_multipart()``.  Their ``get_payload()`` method will return
  a string object.

* All *multipart* type messages will be parsed as a container message
  object with a list of sub-message objects for their payload.  The
  outer container message will return ``True`` for ``is_multipart()``
  and its ``get_payload()`` method will return the list of
  ``Message`` subparts.

* Most messages with a content type of *message/** (e.g.
  *message/delivery-status* and *message/rfc822*) will also be parsed
  as a container object containing a list payload of length 1.  Their
  ``is_multipart()`` method will return ``True``.  The single element
  in the list payload will be a sub-message object.

* Some non-standards compliant messages may not be internally
  consistent about their *multipart*-edness.  Such messages may have a
  *Content-Type* header of type *multipart*, but their
  ``is_multipart()`` method may return ``False``.  If such messages
  were parsed with the ``FeedParser``, they will have an instance of
  the ``MultipartInvariantViolationDefect`` class in their *defects*
  attribute list.  See ``email.errors`` for details.
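
The last two notes can be checked with a short sketch.  Both message
texts are invented for illustration: the first wraps one nested message
in *message/rfc822*, and the second claims to be *multipart* without
declaring a boundary, which is one way to produce an inconsistent
message.

```python
import email

# A message/rfc822 container holding exactly one nested message.
wrapped_text = (
    "Content-Type: message/rfc822\n"
    "\n"
    "Subject: inner message\n"
    "\n"
    "inner body\n"
)
wrapped = email.message_from_string(wrapped_text)
inner_list = wrapped.get_payload()
print(wrapped.is_multipart())       # True
print(len(inner_list))              # 1
print(inner_list[0]["Subject"])     # inner message

# A message that claims to be multipart but declares no boundary.
broken_text = (
    "Content-Type: multipart/mixed\n"
    "\n"
    "no boundary was declared, so this stays a plain string\n"
)
broken = email.message_from_string(broken_text)
print(broken.is_multipart())        # False

# The parser records what it found wrong in the defects list.
defect_names = [type(defect).__name__ for defect in broken.defects]
print(defect_names)
```

Inspecting ``defects`` after parsing is a cheap way to decide whether a
message should be handled with extra care.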

-[ Footnotes ]-

[1] As of email package version 3.0, introduced in Python 2.4, the
    classic ``Parser`` was re-implemented in terms of the
    ``FeedParser``, so the semantics and results are identical between
    the two parsers.
