An end-to-end exploratory data project using R and Python
“Let’s order Thai.”
“Great, what’s your go-to dish?”
“Pad Thai.”
This has bugged me for years and is the genesis for this project.
People need to know they have other choices aside from Pad Thai. Pad Thai is one of 53 individual dishes, and stopping there risks missing out on at least 201 shared Thai dishes (source: Wikipedia).
This project is an opportunity to build a data set of Thai dishes by scraping tables off Wikipedia. We will use Python for web scraping and R for visualization. The scraping is done with Beautiful Soup (Python); the data is then pre-processed further with dplyr and visualized with ggplot2.
Furthermore, we’ll use the tidytext package in R to explore the names of Thai dishes (in English) to see if we can learn some interesting things from text data.
Finally, there is an opportunity to make an open source contribution.
The project repo is here.
The purpose of this analysis is to generate questions.
Because exploratory analysis is iterative, these questions were generated in the process of manipulating and visualizing data. We can use these questions to structure the rest of the post:
Web Scraping
We scraped over 300 Thai dishes. For each dish, we got:
First, we’ll use the following Python libraries/modules:
We’ll use requests to send an HTTP request to the Wikipedia URL we need. We’ll access network sockets using ‘secure sockets layer’ (SSL). Then we’ll read in the HTML data and parse it with Beautiful Soup.
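A minimal sketch of this step might look like the following. The URL is an assumption (the post does not state the exact page), and the function names fetch_html and make_soup are hypothetical. The requests library negotiates TLS/SSL for https URLs on its own, so no explicit socket handling appears here.

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL: the post does not give the exact page it scraped.
URL = "https://en.wikipedia.org/wiki/List_of_Thai_dishes"


def fetch_html(url: str = URL) -> str:
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def make_soup(html: str) -> BeautifulSoup:
    """Parse HTML with Python's built-in parser (no extra parser needed)."""
    return BeautifulSoup(html, "html.parser")
```

Calling make_soup(fetch_html()) would then give a parsed page to search. Keeping the download and the parsing in separate functions makes it easy to test the parsing on a saved copy of the page.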
Before using Beautiful Soup, we want to understand the structure of the page (and tables) we want to scrape, using inspect element in the browser (note: I used Chrome). We can see that we want the table tag, along with the class wikitable sortable.
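As a rough illustration, the tables can be selected by tag and class like this. The HTML below is a tiny stand-in for the real page so the sketch runs offline.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the real page, so the sketch runs offline.
html = """
<table class="wikitable sortable"><tr><th>Thai name</th></tr></table>
<table class="infobox"><tr><th>Other</th></tr></table>
<table class="wikitable sortable"><tr><th>Thai name</th></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches tables whose class attribute is exactly "wikitable sortable".
tables = soup.find_all("table", class_="wikitable sortable")
print(len(tables))
```

Note that this matches the class attribute as an exact string. If a page lists the classes in a different order or adds more, a CSS selector such as soup.select("table.wikitable.sortable") is more forgiving.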
The main function we’ll use from Beautiful Soup is findAll(), and the three tags we’ll pass to it are th (header cell in an HTML table), tr (row in an HTML table) and td (standard data cell).
First, we’ll save the table headers in a list, which we’ll use when creating an empty dictionary to store the data we need.
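A sketch of that step, using a one-row stand-in table, could look like this. The six header names are assumptions about the Wikipedia tables; find_all() is the current spelling of findAll().

```python
from bs4 import BeautifulSoup

# Stand-in table; the six header names are assumed, not taken from the post.
html = """
<table class="wikitable sortable">
  <tr><th>Thai name</th><th>Thai script</th><th>English name</th>
      <th>Image</th><th>Region</th><th>Description</th></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

# Header cells (th) become the column names.
headers = [th.get_text(strip=True) for th in table.find_all("th")]

# One empty list per column, ready to be filled row by row.
data = {header: [] for header in headers}
```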
Initially, we want to scrape one table, knowing that we’ll need to repeat the process for all 16 tables. Therefore we’ll use a nested loop. Because all tables have 6 columns, we’ll want to create 6 empty lists.
We’ll loop through all table rows (tr) and check for 6 cells (which we should have for 6 columns), then we’ll append the data to each empty list we created.
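The nested loop can be sketched as follows. The list names a1 to a6 follow the post; the sample table and its contents are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample table: one header row, one dish, one malformed row.
html = """
<table class="wikitable sortable">
  <tr><th>A</th><th>B</th><th>C</th><th>D</th><th>E</th><th>F</th></tr>
  <tr><td>Khao phat</td><td>ข้าวผัด</td><td>Fried rice</td>
      <td></td><td>Central</td><td>Rice stir-fried with egg</td></tr>
  <tr><td colspan="6">Section note, not a dish</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Six empty lists, one per column.
a1, a2, a3, a4, a5, a6 = ([] for _ in range(6))

for table in soup.find_all("table", class_="wikitable sortable"):  # outer loop
    for row in table.find_all("tr"):  # inner loop
        cells = row.find_all("td")
        if len(cells) != 6:  # skips header rows and malformed rows
            continue
        for column, cell in zip((a1, a2, a3, a4, a5, a6), cells):
            column.append(cell.get_text(strip=True))
```

Checking the cell count first is what keeps the six lists the same length, which matters later when they become columns of one data frame.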
You’ll note the code for a1 and a6 is slightly different. In retrospect, I found that cells[0].find(text=True) did not yield certain text, particularly if it contained links, so a slight adjustment was made.
The strings attribute yields NavigableString objects, while text returns a single unicode string (see Stack Overflow explanation).
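The difference is easiest to see on a cell that mixes plain text with a link. This is a small sketch with a made-up cell; string=True is the current spelling of the older text=True argument.

```python
from bs4 import BeautifulSoup

# Made-up cell that mixes plain text with a link.
html = '<td>Fried rice with <a href="/wiki/Pork">pork</a> and egg</td>'
cell = BeautifulSoup(html, "html.parser").find("td")

# find(string=True) returns only the first text node in the cell.
first_piece = cell.find(string=True)

# .strings yields every text node, including the link text.
pieces = list(cell.strings)
full_text = "".join(pieces)
```

Here first_piece stops at the link, while joining the pieces recovers the whole description.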
After we’ve scraped the data, we’ll need to store the data in a dictionary before converting to a data frame:
For a1 and a6, we need to do an extra step of joining the strings together, so I’ve created two additional corresponding columns, Thai name 2 and Description2:
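A minimal sketch of both steps with pandas is below. The values are hypothetical; the assumption is that a1 and a6 hold a list of text fragments per dish, while the other columns hold plain strings.

```python
import pandas as pd

# Hypothetical scraped values: a1 and a6 hold lists of text fragments
# (one list per dish), while the other columns hold plain strings.
a1 = [["Khao ", "phat"], ["Phat ", "Thai"]]
a3 = ["Fried rice", "Stir-fried noodles"]
a6 = [["Rice stir-fried with ", "egg"], ["Noodles with ", "tamarind"]]

# Dictionary first, then data frame.
data = {"Thai name": a1, "English name": a3, "Description": a6}
df = pd.DataFrame(data)

# Join the fragments into one string per dish.
df["Thai name 2"] = df["Thai name"].apply(lambda parts: "".join(parts))
df["Description2"] = df["Description"].apply(lambda parts: "".join(parts))
```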
After we’ve scraped all the data and converted from dictionary to data frame, we’ll write to CSV to prepare for data cleaning in R (note: I saved the CSV as thai_dishes.csv, but you can choose a different name).
Data Cleaning
Data cleaning is typically non-linear.
We’ll manipulate the data to explore it, learn about it and find that certain things need cleaning or, in some cases, going back to Python to re-scrape. The columns a1 and a6 were scraped differently from the other columns due to missing data found during exploration and cleaning.
For certain links, using .find(text=True) did not work as intended, so a slight adjustment was made.
For this post, R is the tool of choice for cleaning the data.
Here are other data cleaning tasks:
Note: This was only necessary the first time round. After the changes to how I scraped a1 and a6, this step is no longer necessary:
Data Visualization
There are several ways to visualize the data. Because we want to communicate the diversity of Thai dishes, aside from Pad Thai, we want a visualization that captures the many, many options.
I opted for a dendrogram. This graph assumes hierarchy within the data, which fits our project because we can organize the dishes in grouping and sub-grouping.
How might we organize Thai dishes?
We first make a distinction between individual and shared dishes to show that Pad Thai is not even close to being the best individual dish. And, in fact, more dishes fall under the shared grouping.
To avoid cramming too much data into one visual, we’ll create two separate visualizations for individual vs. shared dishes.
Here is the first dendrogram representing 52 individual dish alternatives to Pad Thai.
Creating a dendrogram requires using the ggraph and igraph libraries. First, we’ll load the libraries and subset our data frame by filtering for Individual Dishes:
We create edges and nodes (i.e., from and to) to create the sub-groupings within Individual Dishes (i.e., Rice, Noodles and Misc):
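The post builds these in R. As a language-neutral illustration of the data structure, here is a Python sketch of the same idea, with a few hypothetical rows standing in for the data frame: each edge is a (from, to) pair, and the nodes are every name that appears in an edge.

```python
# Hypothetical rows: (sub-grouping, dish name) for Individual Dishes.
dishes = [
    ("Rice", "Khao phat"),
    ("Rice", "Khao man kai"),
    ("Noodles", "Phat Thai"),
    ("Misc", "Khai chiao"),
]
root = "Individual Dishes"

# Level 1: root -> each sub-grouping (deduplicated, order preserved).
groups = list(dict.fromkeys(group for group, _ in dishes))
edges = [(root, group) for group in groups]

# Level 2: sub-grouping -> dish.
edges += [(group, dish) for group, dish in dishes]

# Nodes are every name that appears at either end of an edge.
nodes = list(dict.fromkeys(name for edge in edges for name in edge))
```

The two levels of edges are what give the dendrogram its hierarchy: one branch per sub-grouping, then one leaf per dish.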
What is the best way to organize the different dishes?
There are approximately four times as many shared dishes as individual dishes, so the dendrogram should be circular to fit the names of all dishes in one graphic.
A wonderful resource I use regularly for these types of visuals is the R Graph Gallery. There was a slight issue in how the text angles were calculated so I submitted a PR to fix.
Perhaps distinguishing between individual and shared dishes is too crude. Within the dendrogram for 201 shared Thai dishes, we can see further sub-groupings, including Curries, Sauces/Pastes, Steamed, Grilled, Deep-Fried, Fried & Stir-Fried, Salads, Soups and other Misc:
Text Mining
Which raw material(s) are most popular?
One way to answer this question is to use text mining to tokenize by word and count the words by frequency as one measure of popularity.
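The post does this with tidytext in R. The same idea can be sketched in Python with the standard library; the dish names below are illustrative, not the scraped data set.

```python
from collections import Counter

# Illustrative dish names, not the scraped data set.
names = ["Khao phat mu", "Kaeng phet mu", "Phat kaphrao mu", "Kaeng khiao wan"]

# Tokenize: lowercase each name and split on whitespace.
tokens = [word for name in names for word in name.lower().split()]

# Count how often each word appears across all names.
word_counts = Counter(tokens)
print(word_counts.most_common(3))
```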
In the bar chart below, we see the frequency of words across all Thai dishes. Mu (หมู), which means pork in Thai, appears most frequently across all dish types and sub-groupings. Next we have kaeng (แกง), which means curry. Phat (ผัด) comes in third, suggesting “stir-fry” is a popular cooking mode.
As we can see not all words refer to raw materials, so we may not be able to answer this question directly.
We can also see words common to both Individual and Shared Dishes. We see other words like nuea (beef), phrik (chili) and kaphrao (basil leaves).
Which raw materials are most important?
We can only learn so much from frequency, so text mining practitioners have created term frequency-inverse document frequency (tf-idf) to better reflect how important a word is in a document or corpus (further details here).
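To make the measure concrete, here is a small Python sketch using a common textbook definition: term frequency is a word’s count divided by the total words in its document, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word. The two “documents” and their words are illustrative assumptions.

```python
import math
from collections import Counter

# Illustrative "documents": words from dish names, grouped by dish type.
docs = {
    "Individual": ["khao", "phat", "mu", "khao", "man", "kai"],
    "Shared": ["kaeng", "phet", "mu", "kaeng", "khiao", "wan"],
}


def tf_idf(docs):
    """Return {(doc, word): tf-idf}, with tf = n / total and idf = ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for words in docs.values() for word in set(words))
    scores = {}
    for doc, words in docs.items():
        counts = Counter(words)
        for word, n in counts.items():
            tf = n / len(words)
            idf = math.log(n_docs / doc_freq[word])
            scores[(doc, word)] = tf * idf
    return scores


scores = tf_idf(docs)
```

In this toy example “mu” appears in both documents, so its idf is zero and its score drops to zero, while “khao” and “kaeng” score highest within their own groups. That is the sense in which the measure rewards words that are distinctive rather than merely frequent.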
Again, the words don’t necessarily refer to raw materials, so this question can’t be fully answered directly here.
Could you learn about Thai food just from the names of the dishes?
The short answer is “yes”.
From frequency and term frequency-inverse document frequency alone, we learned not only the most frequent words, but also their relative importance within the current set of words that we tokenized with tidytext. This tells us not only about popular raw materials (pork), but also about dish types (curries) and other popular modes of preparation (stir-fry).
We can even examine the network of relationships between words. Darker arrows suggest a stronger relationship between pairs of words; for example, “nam phrik” is a strong pairing. This means “chili sauce” in Thai and suggests the important role that it plays across many types of dishes.
We learned above that “mu” (pork) appears frequently. Now we see that “mu” and “krop” are more related than other pairings (note: “mu krop” means “crispy pork”). We also saw above that “khao” appears frequently in Rice dishes. This alone is not surprising, as “khao” means rice in Thai, but we see here that “khao” and “phat” are strongly related, suggesting that fried rice (“khao phat”) is quite popular.
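One simple way to measure such pairings is to count adjacent word pairs (bigrams) across dish names. The sketch below assumes that approach and uses a handful of illustrative names; the helper name bigrams is hypothetical.

```python
from collections import Counter

# Illustrative dish names, not the scraped data set.
names = ["Nam phrik kapi", "Nam phrik ong", "Khao phat mu", "Mu krop", "Khao phat kung"]


def bigrams(name):
    """Return adjacent word pairs from one dish name."""
    words = name.lower().split()
    return list(zip(words, words[1:]))


# Count each (first word, second word) pair across all names.
pair_counts = Counter(pair for name in names for pair in bigrams(name))
```

The counts can then serve as edge weights in a word network, with each pair drawn as an arrow from the first word to the second.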
Finally, we may be interested in word relationships within individual dishes.
The graph below shows a network of word pairs with moderate-to-high correlations. We can see certain words clustered near each other with relatively dark lines: kaeng (curry), pet (spicy), wan (sweet), khiao (green curry), phrik (chili) and mu (pork). These words represent a collection of ingredients, modes of cooking and descriptions that are generally combined.
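The post does not say which correlation measure it used. The sketch below assumes the phi coefficient, a common choice when the data is simply whether a word is present in a dish name or not. The dish names are illustrative.

```python
import math

# Illustrative dish names as sets of words.
dishes = [
    {"kaeng", "khiao", "wan"},
    {"kaeng", "phet", "mu"},
    {"kaeng", "khiao", "wan", "mu"},
    {"khao", "phat", "mu"},
    {"khao", "phat"},
]


def phi(word_a, word_b, dishes):
    """Phi coefficient between the presence of two words across dishes."""
    n11 = sum(word_a in d and word_b in d for d in dishes)      # both present
    n10 = sum(word_a in d and word_b not in d for d in dishes)  # only word_a
    n01 = sum(word_a not in d and word_b in d for d in dishes)  # only word_b
    n00 = len(dishes) - n11 - n10 - n01                         # neither
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0
```

Unlike the bigram counts, this measure ignores word order and adjacency: two words correlate when they tend to show up in the same dish names, wherever they sit within the name.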
We have completed an exploratory data project where we scraped, cleaned, manipulated and visualized data using a combination of Python and R. We also used the tidytext package for basic text mining tasks to see if we could gain some insights into Thai cuisine using words from dish names scraped off Wikipedia.
For more content on data science, R, Python, SQL and more, find me on Twitter.