How Can You Extract Expedia Flight Data Using Python and LXML?
Collecting flight data manually is a huge task. There are thousands of combinations of routes and airports, with constantly changing prices and schedules. Ticket prices vary daily, and a huge number of flights are available each day. Web scraping is a practical way to keep track of this data. In this blog, you will learn how we extract Expedia data and about the Expedia Hotel & Flight Data Scraper tool we provide. Our web scraper collects the flight prices and schedules for a given source and destination.
Below is the list of data fields the Expedia scraper extracts:
- Arrival airport
- Arrival time
- Departure airport
- Departure time
- Plane name
- Airline
- Flight duration
- Plane code
- Ticket price
- Number of stops
Scraping Logic
- Build the URL for the Expedia search results. Here is one for the available flights from New York to Miami:
https://www.expedia.com/Flights-Search?trip=oneway&leg1=from:New%20York,%20NY%20(NYC-All%20Airports),to:Miami,%20Florida,departure:04/01/2017TANYT&passengers=children:0,adults:1,seniors:0,infantinlap:Y&mode=search
- Download the HTML of the search results page using Python Requests.
- Parse the page using LXML. LXML lets you navigate the HTML tree structure using XPaths. We have predefined the XPaths for the information we need in the code.
- Save the information in JSON format. You can later modify this step to write to a database.
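The four steps above can be sketched roughly as follows. This is only an illustration, not the code from the gist: the function names, the `SAMPLE_HTML` markup, and the XPaths are invented for the example, and Expedia's real page structure is different. The URL template is copied from the sample search URL shown above.

```python
import json
from urllib.parse import quote

# URL template taken from the sample search URL in this post.
SEARCH_URL = (
    "https://www.expedia.com/Flights-Search?trip=oneway"
    "&leg1=from:{source},to:{destination},departure:{date}TANYT"
    "&passengers=children:0,adults:1,seniors:0,infantinlap:Y&mode=search"
)


def build_search_url(source, destination, date):
    """Step 1: fill in the search URL (date is MM/DD/YYYY)."""
    return SEARCH_URL.format(
        # Spaces become %20; commas and parentheses are left as in the sample URL.
        source=quote(source, safe=",()"),
        destination=quote(destination, safe=",()"),
        date=date,
    )


def download_html(url):
    """Step 2: download the search result page with Requests."""
    import requests  # third-party; imported here so step 1 works without it

    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    return response.text


# Hypothetical markup for illustration only; the real page looks different.
SAMPLE_HTML = """
<ul>
  <li class="flight"><span class="airline">American Airlines</span>
      <span class="price">1144.21</span></li>
  <li class="flight"><span class="airline">Republic Airlines As American Eagle</span>
      <span class="price">2028.40</span></li>
</ul>
"""


def parse_flights(page_html):
    """Step 3: navigate the HTML tree with XPaths (invented for SAMPLE_HTML)."""
    from lxml import html  # third-party

    tree = html.fromstring(page_html)
    flights = []
    for node in tree.xpath('//li[@class="flight"]'):
        flights.append({
            "airline": node.xpath('string(.//span[@class="airline"])').strip(),
            "ticket price": node.xpath('string(.//span[@class="price"])').strip(),
        })
    return flights


def to_json(flights):
    """Step 4: serialize the extracted records as JSON text."""
    return json.dumps(flights, indent=4)


if __name__ == "__main__":
    print(build_search_url("New York, NY (NYC-All Airports)", "Miami, Florida", "04/01/2017"))
```

Calling `parse_flights(SAMPLE_HTML)` needs LXML installed; `download_html` makes a live request and is not called in the sketch.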
Installing Pip and Python 3
Here is a guide to installing Python 3 on Linux:
http://docs.python-guide.org/en/latest/starting/install3/linux/
Mac users can follow this guide:
http://docs.python-guide.org/en/latest/starting/install3/osx/
Windows users can contact us for more details:
http://www.websitescraper.com/contact-us/
Installing Packages
PIP, to install the following packages in Python
(https://pip.pypa.io/en/stable/installing/)
Python Requests, to make requests and download the HTML content of the given pages
(http://docs.python-requests.org/en/master/user/install/)
Python LXML, for parsing the HTML tree structure using XPaths
(Learn how to install it here: http://lxml.de/installation.html)
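Before running the scraper, it can help to confirm that both packages are importable. The check below is a small sketch using only the standard library; the helper name `missing_packages` is made up for this example.

```python
import importlib.util

# Top-level import names of the two packages this post relies on.
REQUIRED = ("requests", "lxml")


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All required packages are installed.")
```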
The Code
https://gist.github.com/websitescraper/c1374488ee8acff09e34ae2001ca9b3a
If you prefer Python 2, you can contact us for a different version of the code.
http://www.websitescraper.com/contact-us/
Run the Expedia Scraper
Assume the script is named expedia.py. If you type the script name in a terminal or command prompt followed by -h, you will see the usage:
usage: expedia.py [-h] source destination date

positional arguments:
  source       Source airport code
  destination  Destination airport code
  date         MM/DD/YYYY

optional arguments:
  -h, --help   show this help message and exit

The source and destination arguments are the airport codes for the source and destination airports. The date argument is in the format MM/DD/YYYY.
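A command-line interface with this usage can be written with argparse. The sketch below is one possible way to do it and is not taken from the gist; the `mmddyyyy` date check is an extra added for illustration and is not implied by the usage text.

```python
import argparse
from datetime import datetime


def mmddyyyy(value):
    """Reject values that do not parse as MM/DD/YYYY (illustrative addition)."""
    try:
        datetime.strptime(value, "%m/%d/%Y")
    except ValueError:
        raise argparse.ArgumentTypeError("date must be MM/DD/YYYY, got %r" % value)
    return value


def build_parser():
    """Build a parser with the three positional arguments from the usage text."""
    parser = argparse.ArgumentParser(prog="expedia.py")
    parser.add_argument("source", help="Source airport code")
    parser.add_argument("destination", help="Destination airport code")
    parser.add_argument("date", type=mmddyyyy, help="MM/DD/YYYY")
    return parser


if __name__ == "__main__":
    # Explicit argument list for demonstration; a real script would call
    # parse_args() with no arguments to read the command line.
    args = build_parser().parse_args(["nyc", "mia", "04/01/2017"])
    print(args.source, args.destination, args.date)
```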
python3 expedia.py nyc mia 04/01/2017
This will create a JSON output file called nyc-mia-flight-results.json in the same folder as the script.
The output will look like this:
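The file name follows the pattern source-destination-flight-results.json. A minimal sketch of writing such a file is shown here; the helper names are hypothetical, and the `folder` parameter defaults to the current working directory, which is an assumption made for this example.

```python
import json
from pathlib import Path


def results_filename(source, destination):
    """Build the output file name from the two airport codes."""
    return "{}-{}-flight-results.json".format(source, destination)


def write_results(flights, source, destination, folder="."):
    """Write the flight records as indented JSON and return the file path."""
    path = Path(folder) / results_filename(source, destination)
    with open(path, "w") as handle:
        json.dump(flights, handle, indent=4)
    return path


if __name__ == "__main__":
    print(results_filename("nyc", "mia"))
```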
{
    "arrival": "Miami Intl., Miami",
    "timings": [
        {
            "arrival_airport": "Miami, FL (MIA-Miami Intl.)",
            "arrival_time": "12:19a",
            "departure_airport": "New York, NY (LGA-LaGuardia)",
            "departure_time": "9:00p"
        }
    ],
    "airline": "American Airlines",
    "flight duration": "1 days 3 hours 19 minutes",
    "plane code": "738",
    "plane": "Boeing 737-800",
    "departure": "LaGuardia, New York",
    "stops": "Nonstop",
    "ticket price": "1144.21"
},
{
    "arrival": "Miami Intl., Miami",
    "timings": [
        {
            "arrival_airport": "St. Louis, MO (STL-Lambert-St. Louis Intl.)",
            "arrival_time": "11:15a",
            "departure_airport": "New York, NY (LGA-LaGuardia)",
            "departure_time": "9:11a"
        },
        {
            "arrival_airport": "Miami, FL (MIA-Miami Intl.)",
            "arrival_time": "8:44p",
            "departure_airport": "St. Louis, MO (STL-Lambert-St. Louis Intl.)",
            "departure_time": "4:54p"
        }
    ],
    "airline": "Republic Airlines As American Eagle",
    "flight duration": "0 days 11 hours 33 minutes",
    "plane code": "E75",
    "plane": "Embraer 175",
    "departure": "LaGuardia, New York",
    "stops": "1 Stop",
    "ticket price": "2028.40"
},
Conclusion
This scraper should work for most of the flight information available on Expedia unless the website structure changes drastically. If you need to extract data from millions of pages in a very short time, this Python scraper for Expedia is probably not going to work for you. In that case, read our articles on scalable do-it-yourself scraping (how to build and run scrapers at large scale) and on how to avoid getting blacklisted while scraping.
If you are looking for the best way to scrape flight details from Expedia.com, you can contact Scraping Intelligence with all your queries.