Executing plain SQL over Pandas dataframes
Sometimes when using Pandas it's difficult to manipulate data the way we would do in a relational database. Some simple operations like joins or composite filters are not very intuitive or too painful to execute with Pandas library.
There is a package called pandasql
. It solves almost all our problems by allowing us to execute plain SQL over Pandas dataframes.
It's a very simple tool to use, just load your dataframe and query it with the library tools just like it was a table.
It uses the SQLite syntax, which is very complete and handles a lot of functions you would need in a normal relational database situation.
You can even use temporary tables with it!
Since SQL is a language people are more used to, and could be called "universal", you can expand the limits of your work with pandas
by now using this simple tip.
Obs.: All my examples were made using the Jupyter Notebooks, and I did not care about receiving the return of the sqldf
function to another variable for presentation purposes, but when working with pandasql
you can be free to continue working with pandas dataframes since the return of the function is a transformed dataframe.
The notebook used in this example can be found here.
Feel free to access my other repositories, I post a lot of snippets and personal projects that could help you!
Follow me here to receive weekly tips and contributions, mainly about data engineering.
Thanks for reading :)