Marshmallow serialization with MongoDB and Python
In programming, serialization is the process of turning an object into a new format that can be stored (e.g. files or databases) or transmitted (e.g. over the internet). Deserialization, therefore, is the process of turning something in that format into an object. Serialization is often called "marshalling", and deserialization, "unmarshalling".
That's where the name "marshmallow" comes from.
Marshmallow is a Python library developed to simplify the process of serialization and deserialization. It can take our Python objects and turn them into native Python data types such as dictionaries or strings, and also the other way round.
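As a rough illustration of the general idea, independent of marshmallow, here is a minimal sketch using the standard library's json module. It only handles native data types (marshmallow is what gets us from our own objects to those native types), and the store data is just sample data:

```python
import json

# A native Python data type: a dictionary describing a store.
store_data = {"name": "Walmart", "location": "Venice, CA"}

# Serialize: turn the dictionary into a string that can be stored or sent.
serialized = json.dumps(store_data)
print(serialized)  # {"name": "Walmart", "location": "Venice, CA"}

# Deserialize: turn the string back into a dictionary.
restored = json.loads(serialized)
print(restored == store_data)  # True
```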
Serialization with marshmallow and Python
First, we must install marshmallow:
pip install marshmallow
The code we write in this blog post is for marshmallow 3, so check which version you have installed.
Once it's installed, you can go ahead and create a Schema.
A Schema definition tells marshmallow what individual pieces of data it will deal with when serializing or deserializing. Just to repeat once more:
- Marshmallow will serialize our objects into native data types containing these individual pieces of data. For example, it can turn an object into a dictionary.
- Marshmallow will deserialize native data types containing these individual pieces of data into our objects.
So let's say we have this class:
class Store:
    def __init__(self, name: str, location: str):
        self.name = name
        self.location = location
This is a very simple class whose constructor has two parameters: two strings representing the store's name and its location.
As humans, we could easily identify that the following dictionary could represent an instance of that class:
{
"name": "Walmart",
"location": "Venice, CA"
}
But Python doesn't know how to take something like a Store object and turn it into a dictionary.
That's where marshmallow comes in, but we have to tell it what attributes of the object it needs to use in order to construct the dictionary: name and location.
We do that by creating a Schema:
from marshmallow import Schema, fields
class StoreSchema(Schema):
    name = fields.Str()
    location = fields.Str()
Here we've created the StoreSchema class, which inherits from marshmallow's Schema class. It contains two class attributes, name and location. The names are important, because they must match the attribute names on the Store object! The values are important too: fields.Str() tells marshmallow that each one is a string.
When we use marshmallow to create a dictionary out of a Python object, the result will be a dictionary with two keys: name and location. The values will be strings.
Now, to turn the object into a dictionary we need to do three things:
- Import our Store and StoreSchema classes.
- Create a StoreSchema object that is used to actually perform serialization.
- "Dump" the object through the StoreSchema object with .dump(). That gives us a dictionary.
from store import Store
from schema import StoreSchema
walmart = Store("Walmart", "Venice, CA")
store_schema = StoreSchema()
print(store_schema.dump(walmart))
# {'name': 'Walmart', 'location': 'Venice, CA'}
A typical question at this point is: "why do we need to create the StoreSchema object?" It's because we can pass some configuration options at that point to slightly modify or limit what the schema does[1].
Deserialization with marshmallow and Python
Before deserializing, marshmallow can validate the data to be deserialized.
We can add validation rules so that errors will be raised if the data does not agree with those rules.
At the moment the only validation rules we have are that the fields name and location must be strings, so let's double check that by:
- Importing our StoreSchema class.
- Getting our store data (which might be in a file, given by our users, or in this case just hard-coded).
- Creating our StoreSchema object.
- Using .load() to pass the data through the schema for validation.
from schema import StoreSchema
store_data = {"name": "Walmart", "location": "Venice, CA"}
store_schema = StoreSchema()
print(store_schema.load(store_data))
# {'name': 'Walmart', 'location': 'Venice, CA'}
No problem here, because the fields are indeed strings!
If we try this though, we'll get an error:
store_data = {"name": 5, "location": "Venice, CA"}
print(store_schema.load(store_data))
You should get an error like this one:
Traceback (most recent call last):
  File "main.py", line 11, in <module>
    print(store_schema.load(store_data))
  File "/home/runner/.local/share/virtualenvs/python3/lib/python3.8/site-packages/marshmallow/schema.py", line 722, in load
    return self._do_load(
  File "/home/runner/.local/share/virtualenvs/python3/lib/python3.8/site-packages/marshmallow/schema.py", line 904, in _do_load
    raise exc
marshmallow.exceptions.ValidationError: {'name': ['Not a valid string.']}
Beautiful, innit! Not a valid string is what it says at the end, which is accurate!
If we want to turn our validated dictionary into a Store object, we can pass each key of the dictionary as a named argument to the Store constructor:
from store import Store
from schema import StoreSchema
store_data = {"name": "Walmart", "location": "Venice, CA"}
store_schema = StoreSchema()
store = Store(**store_schema.load(store_data))
print(store.name, store.location)
# Walmart Venice, CA
By default, when we deserialize, marshmallow only performs validation. It doesn't create an object for us.
But now that we've got the validation out of the way, let's modify our schema slightly so that it does create a Store object when it's done validating. I'll show you how you can do this, but normally I wouldn't do this:
from marshmallow import Schema, fields, post_load
from store import Store
class StoreSchema(Schema):
    name = fields.Str()
    location = fields.Str()
    @post_load
    def make_store(self, data, **kwargs):
        return Store(**data)
We've now added the @post_load decorated method. This runs after the default loading operations conclude (i.e. after validation).
The make_store method receives data: the entire validated dictionary that marshmallow has processed. It also has some other keyword arguments that might be used, and you can find more on that in the official documentation[2].
Now that we've got this schema, it will no longer give us the validated dictionary after loading. It'll validate and immediately execute make_store, which gives us the Store object:
from schema import StoreSchema
store_data = {"name": "Walmart", "location": "Venice, CA"}
store_schema = StoreSchema()
print(store_schema.load(store_data))
# <store.Store object at 0x7ff2a18c>
We'll see in a moment that having our schema create objects for us can be a blessing, but it can also be a bit limiting at times! I would generally avoid making our schemas return objects, instead doing that in the code that uses the schema.
How to store data into a MongoDB database
If you've never used MongoDB before, this post isn't going to be a complete beginner's guide to MongoDB! You can check the official introduction if you're new to MongoDB.
MongoDB is a non-relational database where we store JSON-like documents (internally, MongoDB uses a binary format called BSON). These documents are searchable, but MongoDB doesn't require table definitions, so the documents in one collection don't all have to have the same structure.
In MongoDB, the equivalent of a table is called a "collection", since the concept of a "column" doesn't really apply when every row can have different columns.
The easiest way to start interacting with MongoDB in Python is to install the pymongo library:
pip install pymongo
Then, we can create a database.py file that will handle the interaction with MongoDB. It contains a Database class that has:
- An initialize() method that handles creating the MongoDB connection.
- A save_to_db() method that saves the data parameter to the stores collection in MongoDB.
- A load_from_db() method that uses the query parameter to find all matching elements in the stores collection.
Note that this is by no means the best way to interact with MongoDB (particularly in larger applications), but the purpose of this blog post is to teach you about marshmallow serialization and deserialization, not MongoDB best practices!
Here's the sample database.py file:
import pymongo
class Database:
    @classmethod
    def initialize(cls):
        client = pymongo.MongoClient("mongodb://localhost:27017/test_db")
        cls.database = client.get_default_database()
    @classmethod
    def save_to_db(cls, data):
        cls.database.stores.insert_one(data)
    @classmethod
    def load_from_db(cls, query):
        return cls.database.stores.find(query)
Let's go and save a dictionary to the database to test this out:
from database import Database
Database.initialize()
Database.save_to_db({"name": "Walmart", "location": "Venice, CA"})
loaded_objects = Database.load_from_db({"name": "Walmart"})
# find() returns a cursor, so convert it to a list before printing.
print(list(loaded_objects))
# [{'_id': ObjectId('5e7cea2c0d86c32f5a934f92'), 'name': 'Walmart', 'location': 'Venice, CA'}]
Note how when querying the database, MongoDB will find all elements in the stores collection that have a name of Walmart, and return them. At the moment we only have one, but if you ran the file multiple times you'd see we insert the same store multiple times into MongoDB. The returned list would therefore increase in size each time.
Note that MongoDB is generating the _id field, which is an ObjectId. In this post we'll generate our own ids instead of using the MongoDB default, since an ObjectId is not a string and would fail the fields.Str() validation we use for _id below. We'll be using uuid for this.
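To get a feel for what these ids look like, here is a minimal sketch using the standard library's uuid module. The variable name store_id is just for illustration:

```python
import uuid

# uuid4() generates a random UUID; .hex gives it as a plain string
# of 32 hexadecimal characters, with no dashes.
store_id = uuid.uuid4().hex

print(store_id)
print(type(store_id).__name__, len(store_id))  # str 32
```

Because the result is an ordinary string, it fits a string field in a schema and can be stored in MongoDB as the document's _id.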
This was a very quick primer of MongoDB. Now let's see how we could use marshmallow with this application. First, we'll change the Schema to have an _id field:
from marshmallow import Schema, fields, post_load
from store import Store
class StoreSchema(Schema):
    _id = fields.Str()
    name = fields.Str()
    location = fields.Str()
    @post_load
    def make_store(self, data, **kwargs):
        return Store(**data)
Then we'll change the model to accept it in the __init__ method. We'll give it a default value so that we'll generate a UUID if one is not passed in:
import uuid
class Store:
    def __init__(self, name: str, location: str, _id: str = None):
        self.name = name
        self.location = location
        self._id = _id or uuid.uuid4().hex
Finally, we can use the two together:
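from database import Database
from schema import StoreSchema
from store import Store
store_schema = StoreSchema()
Database.initialize()
# Dump a Store so the saved document carries our own string _id
# rather than an ObjectId generated by MongoDB.
walmart = Store("Walmart", "Venice, CA")
Database.save_to_db(store_schema.dump(walmart))
loaded_objects = Database.load_from_db({"name": "Walmart"})
# Documents saved earlier with an ObjectId _id would fail validation here.
for loaded_store in loaded_objects:
    store = store_schema.load(loaded_store)
    print(store.name)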
What we've done here is take the dictionaries MongoDB returns and pass each one through the .load() method of our StoreSchema object. Then, we can access that Store object's properties (or methods, if it had any!).
How to handle user data in this process
Instead of (or as well as) using marshmallow to handle serializing and deserializing from MongoDB, you could use marshmallow to handle user data.
Let's say a user gives you a dictionary as data, for you to save into MongoDB:
- First we'll get user input, usually as a string.
- We then convert it to a dictionary.
- We then pass it through our schema for validation. That gives us an object.
- We then use .dump() to get back the validated dictionary, and save that to MongoDB.
Here's a perfect example of where we could save a bit of work if our StoreSchema didn't give us Store objects.
import json
from database import Database
from schema import StoreSchema
store_schema = StoreSchema()
Database.initialize()
user_input = input("Enter a store dictionary: ")
user_dict = json.loads(user_input)
user_object = store_schema.load(user_dict)
Database.save_to_db(store_schema.dump(user_object))
loaded_objects = Database.load_from_db({"name": "Walmart"})
for loaded_store in loaded_objects:
    store = store_schema.load(loaded_store)
    print(store.name)
Below is the code we'd write if the schema didn't give us Store objects:
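import json
import uuid
from database import Database
from schema import StoreSchema
from store import Store
store_schema = StoreSchema()
Database.initialize()
user_input = input("Enter a store dictionary: ")
user_dict = json.loads(user_input)
validated_dict = store_schema.load(user_dict)
# Give the document a string _id if the user did not supply one, so that
# MongoDB does not add an ObjectId that fields.Str() would reject on load.
validated_dict.setdefault("_id", uuid.uuid4().hex)
Database.save_to_db(validated_dict)
loaded_objects = Database.load_from_db({"name": "Walmart"})
for loaded_store in loaded_objects:
    store = Store(**store_schema.load(loaded_store))
    print(store.name)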
The benefit of this is now we can use the schema to load the user's data and the database's data, in case it has been changed later on by another part of the application.
Wrapping Up
In this blog post we've learned about how to use Marshmallow to serialize and deserialize data, and how we can use that with MongoDB.
I hope this post has been useful, and thanks for reading!