ASSIGNMENT - 09

Australian Housing Prices prediction

This dataset can be used to predict housing prices in Australia and to explore how those prices relate to location and to features such as size, number of bedrooms, and number of bathrooms.

Hint: use the RealEstateAU_1000_Samples.csv file.

In [1]:
# Step 1: Load and Explore the Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [2]:
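# Load the dataset (pandas can read a CSV straight from a .zip archive
# when the archive contains a single data file)
df = pd.read_csv(r"C:\Users\ahlad\Downloads\JNTUH_ML_DL_assignment_2(1).zip")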
In [3]:
df
Out[3]:
index TID breadcrumb category_name property_type building_size land_size preferred_size open_date listing_agency ... state zip_code phone latitude longitude product_depth bedroom_count bathroom_count parking_count RunDate
0 0 1350988 Buy>NT>DARWIN CITY Real Estate & Property for sale in DARWIN CITY... House NaN NaN NaN Added 2 hours ago Professionals - DARWIN CITY ... NT 800 08 8941 8289 NaN NaN premiere 2.0 1.0 1.0 2022-05-27 15:54:05
1 1 1350989 Buy>NT>DARWIN CITY Real Estate & Property for sale in DARWIN CITY... Apartment 171m² NaN 171m² Added 7 hours ago Nick Mousellis Real Estate - Eview Group Member ... NT 800 0411724000 NaN NaN premiere 3.0 2.0 2.0 2022-05-27 15:54:05
2 2 1350990 Buy>NT>DARWIN CITY Real Estate & Property for sale in DARWIN CITY... Unit NaN NaN NaN Added 22 hours ago Habitat Real Estate - THE GARDENS ... NT 800 08 8981 0080 NaN NaN premiere 2.0 1.0 1.0 2022-05-27 15:54:05
3 3 1350991 Buy>NT>DARWIN CITY Real Estate & Property for sale in DARWIN CITY... House NaN NaN NaN Added yesterday Ray White - NIGHTCLIFF ... NT 800 08 8982 2403 NaN NaN premiere 1.0 1.0 0.0 2022-05-27 15:54:05
4 4 1350992 Buy>NT>DARWIN CITY Real Estate & Property for sale in DARWIN CITY... Unit 201m² NaN 201m² Added yesterday Carol Need Real Estate - Fannie Bay ... NT 800 0418885966 NaN NaN premiere 3.0 2.0 2.0 2022-05-27 15:54:05
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 995 1351983 Buy>NT>DARWIN Real Estate & Property for sale in DARWIN, NT ... House NaN 9.17ha 9.17ha Under offer United Realty NT - Parap ... NT 834 08 8981 2666 NaN NaN feature 4.0 3.0 6.0 2022-05-27 15:54:05
996 996 1351984 Buy>NT>DARWIN Real Estate & Property for sale in DARWIN, NT ... House 203m² 600m² 600m² NaN Kassiou Constructions - HOWARD SPRINGS ... NT 836 08 89834326 NaN NaN standard 4.0 2.0 2.0 2022-05-27 15:54:05
997 997 1351985 Buy>NT>DARWIN Real Estate & Property for sale in DARWIN, NT ... House 209.6m² 800m² 800m² NaN Kassiou Constructions - HOWARD SPRINGS ... NT 836 08 89834326 NaN NaN standard 4.0 2.0 2.0 2022-05-27 15:54:05
998 998 1351986 Buy>NT>DARWIN Real Estate & Property for sale in DARWIN, NT ... House 180m² 450m² 450m² NaN Kassiou Constructions - HOWARD SPRINGS ... NT 810 08 89834326 NaN NaN standard 4.0 2.0 3.0 2022-05-27 15:54:05
999 999 1351987 Buy>NT>DARWIN Real Estate & Property for sale in DARWIN, NT ... Unit 120m² NaN 120m² NaN Home Zone NT - DARWIN ... NT 820 0418 895 345 NaN NaN feature 2.0 2.0 2.0 2022-05-27 15:54:05

1000 rows × 27 columns
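
The size columns in the preview above hold strings such as 171m² and 9.17ha rather than numbers, so they cannot be used as numeric features directly. The following is a minimal sketch of one way to convert them to square metres. The helper name `parse_size_m2` and the set of accepted units are assumptions made for illustration; other formats may exist in the full file.

```python
import re
from typing import Optional

# Conversion factors to square metres; 1 hectare = 10,000 m².
_UNIT_TO_M2 = {"m²": 1.0, "m2": 1.0, "ha": 10_000.0}

# A number (optionally with thousands separators or decimals) followed by a unit.
_SIZE_RE = re.compile(r"^\s*([\d,]*\.?\d+)\s*(m²|m2|ha)\s*$", re.IGNORECASE)


def parse_size_m2(text) -> Optional[float]:
    """Convert a size string such as '171m²' or '9.17ha' to square metres.

    Returns None for missing values (NaN, None) and unrecognised formats.
    """
    if not isinstance(text, str):
        return None  # NaN is a float, so missing sizes end up here
    match = _SIZE_RE.match(text)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    return number * _UNIT_TO_M2[match.group(2).lower()]
```

With the DataFrame loaded as `df`, the helper could be applied to a column with `df["land_size"].map(parse_size_m2)`.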

In [4]:
# Drop any duplicates
df.drop_duplicates(inplace=True)
In [5]:
# Count the missing values in each column
print(df.isnull().sum())
index                 0
TID                   0
breadcrumb            0
category_name         0
property_type         0
building_size       720
land_size           467
preferred_size      391
open_date           698
listing_agency        0
price                 0
location_number       0
location_type         0
location_name         0
address              12
address_1            12
city                  0
state                 0
zip_code              0
phone                 0
latitude           1000
longitude          1000
product_depth         0
bedroom_count        33
bathroom_count       33
parking_count        33
RunDate               0
dtype: int64
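
The counts above show two kinds of gaps: columns that are empty in all 1000 rows (latitude, longitude) and columns with a small number of missing entries (the bedroom, bathroom, and parking counts). A minimal sketch of one possible way to handle both is shown below on a small hypothetical frame. The sample values and the choice of median imputation are assumptions, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame mirroring a few columns of the listing data.
sample = pd.DataFrame({
    "bedroom_count": [2.0, 3.0, np.nan, 4.0],
    "bathroom_count": [1.0, 2.0, np.nan, 2.0],
    "latitude": [np.nan, np.nan, np.nan, np.nan],
    "address": ["1 Example St", None, "3 Example St", "4 Example St"],
})

# Drop columns that contain no data at all.
cleaned = sample.dropna(axis=1, how="all")

# Fill missing room counts with each column's median (one possible choice).
count_cols = ["bedroom_count", "bathroom_count"]
cleaned = cleaned.fillna({col: cleaned[col].median() for col in count_cols})

print(cleaned.isnull().sum())
```

Columns such as address are left untouched here, since a missing address cannot be sensibly imputed from the other rows.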
In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
In [16]:
# Explore the dataset
print(df.head())  # Display the first few rows
print(df.info())  # info() prints its own summary and returns None, so "None" is printed last
   bedroom_count  bathroom_count  location_type_Buy  land_size_1,000m²  \
0            2.0             1.0                  1                  0   
1            3.0             2.0                  1                  0   
2            2.0             1.0                  1                  0   
3            1.0             1.0                  1                  0   
4            3.0             2.0                  1                  0   

   land_size_1,010m²  land_size_1,020m²  land_size_1,030m²  land_size_1,038m²  \
0                  0                  0                  0                  0   
1                  0                  0                  0                  0   
2                  0                  0                  0                  0   
3                  0                  0                  0                  0   
4                  0                  0                  0                  0   

   land_size_1,040m²  land_size_1,050m²  ...  \
0                  0                  0  ...   
1                  0                  0  ...   
2                  0                  0  ...   
3                  0                  0  ...   
4                  0                  0  ...   

   price_UNDER CONTRACT... MORE PROPERTIES WANTED  price_UNDER OFFER  \
0                                               0                  0   
1                                               0                  0   
2                                               0                  0   
3                                               0                  0   
4                                               0                  0   

   price_Under  Contract  price_Under Contract  price_Under Offer  \
0                      0                     0                  0   
1                      0                     0                  0   
2                      0                     0                  0   
3                      0                     0                  0   
4                      0                     0                  0   

   price_Under contract  price_Under iContract  price_offers above $510,000  \
0                     0                      0                            0   
1                     0                      0                            0   
2                     0                      0                            0   
3                     0                      0                            0   
4                     0                      0                            0   

   price_offers over $1,250,000  price_offers over $399,000  
0                             0                           0  
1                             0                           0  
2                             0                           0  
3                             0                           0  
4                             0                           0  

[5 rows x 843 columns]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Columns: 843 entries, bedroom_count to price_offers over $399,000
dtypes: float64(2), uint8(841)
memory usage: 844.7 KB
None
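
The 843 columns above come from one-hot encoding free-text fields: each distinct price string (for example "offers above $510,000" or "Under Contract") becomes its own column, so price can no longer serve as a numeric regression target. The following is a minimal sketch of extracting a dollar amount from such strings instead. The helper name `parse_price` is hypothetical, and treating listings without a dollar figure as missing is an assumption.

```python
import re
from typing import Optional

# A dollar sign followed by digits, optional thousands separators and decimals.
_PRICE_RE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)")


def parse_price(text) -> Optional[float]:
    """Return the first dollar amount found in a listing price string.

    Returns None when the value is missing or contains no dollar figure
    (for example 'Under Contract').
    """
    if not isinstance(text, str):
        return None
    match = _PRICE_RE.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))
```

Rows where the helper returns None have no usable target value and would need to be excluded before fitting a regression model.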