Need to speedup my code

Guru_Prasad · November 19, 2020, 6:43am

Hi All,
My name is Guru Prasad. I am new to jupyter notebook. I have got this code from my friend who helped me to give me this code, which is working but the speed of the code is very slow. Please help me if anyway I can speed up the code to get results faster. I tried everything from remote run to local run but no luck in speed.
Here is the code:

import requests
import csv
import re
import time
import os
from bs4 import BeautifulSoup as bs
from random import randint

results_path = “scripts_output/master/usa”
main_url = “https://www.mastersportal.com/countries/82/united-states.html”

Create results folder if it does not exist

if not os.path.exists(results_path):
os.makedirs(results_path)

Create results folder if it does not exist

if not os.path.exists(results_path):
os.makedirs(results_path)

helpers

def get_no_from_str(x):
result = “N/A”
s = re.search(r"[0-9]{1,5}([,.][0-9]{1,2})?", x)
if s:
result = s.group()
return result

get main url data

countries_res = requests.get(main_url)
universities_page = bs(countries_res.text, “lxml”)

Write csv output header

programs_headers = [
“university_name”, “study_name”, “deadline”, “tuition_amount”, “tuition_currency”, “tuition_price_specification”,
“language”, “Course_Link”, “Course_Description”, “city”, “structure”, “start_date”,
“duration”, “application_deadline”, “language”, “facts”, “ielts_score”, “cae_score”, “toefl_score”,
“University_Rank”, “academic_req”]
programs_file = open("{}/programs{}.csv".format(results_path, round(time.time())), “a”)
programs_csv = csv.DictWriter(programs_file, fieldnames=programs_headers)
programs_csv.writeheader()

counter = 1

for u in universities_page.select("#CountryStudies li a"):

    university_url = u['href']
    university_name = u['title']
    university_id = re.search("/\d+/", university_url).group()[1:-1]
    #university_country = country_id

    print("++", university_name, "(" + university_id + ")")

    studies_res = requests.get(university_url)
    studies_page = bs(studies_res.text, "lxml")

    for s in studies_page.select("#StudyListing .StudyInfo a"):

        study_url = s['href']
        study_name = s['title'].replace("/", "").replace(".", "")
        program_name = study_name.split(',')[:-1][0].strip()
        program_type = study_name.split(',')[-1].strip()
        study_id = re.search("/\d+/", study_url).group()[1:-1]
        study_university = university_id
        #study_country = country_id

        print("+++", counter, study_name, "(" + study_id + ")")
        counter += 1

        details_res = requests.get(study_url)

        page_text = details_res.text.strip()
        page = bs(page_text, "lxml")

        name_s = "#Hero h1"
        name = page.select(name_s)[0].text.strip() if page.select(name_s) else 'N/A'

        deadline_s = "#js-StartdateContainer .Deadline time"
        deadline = page.select(deadline_s)[0].text.strip() if page.select(deadline_s) else 'N/A'

        tuition_amount_s = "#TuitionFeeContainer .Amount span"
        tuition_amount = page.select(tuition_amount_s)[0]['data-amount']  if page.select(tuition_amount_s) else 'N/A'

        tuition_currency_s = "#TuitionFeeContainer .Amount span.CurrencyType"
        tuition_currency = page.select(tuition_currency_s)[0]['data-currency-text'] if page.select(tuition_currency_s) else 'N/A'

        tuition_spec_s = "#TuitionFeeContainer .Amount span.CurrencyType"
        tuition_price_specification = page.select(tuition_spec_s)[1].text.strip() if page.select(tuition_spec_s) else 'N/A'

        duration_s = "#StudyKeyFacts .FactListSubListItem span.Duration"
        duration = page.select(duration_s)[0].text.strip().replace(u'\xa0', u' ') if page.select(duration_s) else 'N/A'

        language_s = "#StudyKeyFacts .FactItem.LanguageFact.js-languageFact .Languages"
        language = page.select(language_s)[0].text.strip() if page.select(language_s) else 'N/A'

        university_s = "#StudyDescription .StudyLink.TrackingExternalLink.ProgrammeWebsiteLink"
        university = page.select(university_s)[0]['href'] if page.select(university_s) else 'N/A'

        university_rank_s = "#StudyDescription"
        university_rank_st = page.select(university_rank_s)[0].text.strip() if page.select(university_rank_s) else 'N/A'
        Course_Description = (university_rank_st)

        location_s = "#OrganisationInformation .js-locationInformationWrapper .LocationItems a"
        city = page.select(location_s)[0].text.strip() if page.select(location_s) else 'N/A'
        country = page.select(location_s)[1].text.strip() if page.select(location_s) else 'N/A'

        structure_s = "#StudyContents ul li"
        structure = [x.text.strip() for x in page.select(structure_s)] if page.select(structure_s) else 'N/A'

        start_date_s = "#js-StartdateContainer .StartDateItemTime.js-deadlineFact time"
        start_date = page.select(start_date_s)[0].text.strip() if page.select(start_date_s) else 'N/A'

        application_deadline_s = "#js-StartdateContainer .Deadline time"
        application_deadline = page.select(application_deadline_s)[0].text.strip() if page.select(application_deadline_s) else 'N/A'

        facts_s = "#StudyKeyFacts"
        #facts = page.select(facts_s)[0].text.strip() if page.select(facts_s) else 'N/A'
        facts = [x.text.strip() for x in page.select(facts_s)] if page.select(facts_s) else 'N/A'   

        ielts_score_s = "#EnglishRequirements .CardContents.IELTSCard .Score span"
        ielts_score_st = page.select(ielts_score_s)[0].text.strip() if page.select(ielts_score_s) else 'N/A'
        ielts_score = get_no_from_str(ielts_score_st)

        cae_score_s = "#AdmissionRequirements .Score"
        cae_score_st = page.select(cae_score_s)[0].text.strip() if page.select(cae_score_s) else 'N/A'
        cae_score = get_no_from_str(cae_score_st)

        toefl_score_s = "#EnglishRequirements .CardContents.TOEFLCard .Score span"
        toefl_score_st = page.select(toefl_score_s)[0].text.strip() if page.select(toefl_score_s) else 'N/A'
        toefl_score = (toefl_score_st)

        fees_s = "#js-worldRankingReadMoreButton"
        tuition_fees_containers = [x.text.strip().replace(u'\xa0', u' ') for x in page.select(fees_s)] if page.select(fees_s) else 'N/A'
        

        academic_req_s = "#OtherRequirements ul li"
        academic_req = [x.text.strip() for x in page.select(academic_req_s)] if page.select(academic_req_s) else 'N/A'

        new_row = {
            #"country_url": country_url.replace(",", ""),
            #"country_name": country_name.replace(",", ""),
            #"country_id": country_id.replace(",", ""),
            #"university_url": university_url.replace(",", ""),
            "university_name": university_name.replace(",", ""),
            #"university_id": university_id.replace(",", ""),
            #"university_country": university_country.replace(",", ""),
            #"study_url": study_url.replace(",", ""),
            "study_name": study_name,
            #"program_name": program_name,
            #"program_type": program_type,
            #"study_id": study_id.replace(",", ""),
            #"name": name,
            "deadline": deadline,
            "tuition_amount": tuition_amount,
            "tuition_currency": tuition_currency,
            "tuition_price_specification": tuition_price_specification,
            "duration": duration,
            "language": language,
            "Course_Link": university,
            "Course_Description": Course_Description,
            "city": city,
            #"country": country,
            "structure": structure,
            "start_date": start_date,
            "application_deadline": application_deadline,
            "facts": facts,
            "ielts_score": ielts_score,
            "cae_score": cae_score,
            "toefl_score": toefl_score,
            "University_Rank": tuition_fees_containers,
            "academic_req": academic_req
        }

        new_row = {k: str(v) for k, v in new_row.items()}
        programs_csv.writerow(new_row)

programs_file.close()

fomightez · November 19, 2020, 5:05pm

TL;DR:
I suspect it is a Python issue and belongs elsewhere.
You should provide what you’ve done to figure it out and make things easier for people to help you.

More details below.

Several things:

Is this a Jupyter issue or a Python issue? If it runs slow just as a Python script, then it doesn’t really fall under the Jupyter discourse forum purview. You might get lucky and someone would help you here but it would be best that you ask elsewhere if it is just a Python issue.
What have you done to try to track the step that takes the most time? Or is it multiple steps? Have you profiled it? Or at least broken down some parts and used Jupyter’s magic time commands to narrow down what is causing you to say it is slow?
Importantly, for asking others for help here or elsewhere, you want to do your best to provide the code in a way that it can be run. Did you notice the indentation has been lost for everything above the line for u in universities_page.select("#CountryStudies li a"):? I assume you put that block in triple ticks as a markdown code block? If you need help, see here to get some related tips.
Tip: What you posted isn’t a short snippet of code. It would be best to post it on Github or as a Github gist some other code snippet site, like pastebin (see here), and then link over to it once you have expanded your question to include answers all the other questions above and added them to the description of your issue.

Topic		Replies	Views
I enter the page to coding python , Why so slow？I used to nfs so that pvc mount pv Zero to JupyterHub on Kubernetes help-wanted	0	384	March 16, 2021
Jupyter notebook is getting slower and cpu usage shown in red ipynb	0	326	July 26, 2023
Getting pyscript to jupyter notebook Notebook	0	590	June 20, 2023
500 Error when downloading Jupyter notebook as ipynb file General	4	1181	June 10, 2021
Reading web based url data into data frame using Jupyter Notebook team how-to , help-wanted , notebook	5	9830	September 14, 2023

Need to speedup my code

Create results folder if it does not exist

Create results folder if it does not exist

helpers

get main url data

Write csv output header

Related topics