This project is an asynchronous Python scraper for car listings. It collects detailed information from listings, including title, price, mileage, VIN, images, and seller contact info, and stores the data in a PostgreSQL database.
Additionally, it provides a scheduler to create daily database dumps for backup.
Key Features:
- Async scraping using `httpx` and `asyncio`
- Robust parsing with `lxml` and regex
- PostgreSQL storage via async SQLAlchemy
- Daily database dumps via `scheduler.py` and `pg_dump`
- Fully Dockerized for easy deployment
```
scrap-test-task/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── main.py
├── scheduler.py
├── dump_db.py
├── db.py
├── additional_func.py
└── README.md
```
- Docker >= 24.0
- Docker Compose >= 2.17
Create a `.env` file at the project root with the following variables (just copy and paste this):
```
# TARGET URL
URL=https://auto.ria.com/uk/search/?indexName=auto
# POSTGRES
POSTGRES_USER=user
POSTGRES_PASSWORD=some_password
POSTGRES_DB=autos
POSTGRES_HOST=db
POSTGRES_PORT=5432
```
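These variables are read at runtime to build the database connection string. The helper below is a minimal sketch of that step (the name `build_dsn` is hypothetical; the project's actual `db.py` may assemble the URL differently), with defaults mirroring the sample `.env`:

```python
import os

def build_dsn() -> str:
    """Assemble the async SQLAlchemy connection URL from POSTGRES_* env vars.

    Defaults mirror the sample .env above; inside the containers the real
    values are injected by docker-compose from the .env file.
    """
    user = os.environ.get("POSTGRES_USER", "user")
    password = os.environ.get("POSTGRES_PASSWORD", "some_password")
    host = os.environ.get("POSTGRES_HOST", "db")
    port = os.environ.get("POSTGRES_PORT", "5432")
    db = os.environ.get("POSTGRES_DB", "autos")
    return f"postgresql+asyncpg://{user}:{password}@{host}:{port}/{db}"
```

Note that `POSTGRES_HOST=db` is the Docker Compose service name, so this URL only resolves inside the Compose network.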
Build and start services:

```
docker-compose up --build
```

This will:
- Start a PostgreSQL container with persistent volume `db_data`
- Start a Python scraper container
Volume mapping for dumps:
- Database dumps are stored in `./dumps` on your host machine.
- `scheduler.py` creates daily dumps at 12:00 by default.
The scraper runs automatically in the container:
- It parses pages asynchronously with a maximum of `MAX_CONCURRENT=3` requests at a time.
- Pagination continues until no new listings are found.
- Extracted data is inserted into PostgreSQL with duplicate URLs ignored.
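The concurrency cap described above is the classic semaphore pattern. The sketch below shows one way it can look (function names are illustrative, not the project's actual code); the `client` argument is anything with an async `get()` method, such as an `httpx.AsyncClient`:

```python
import asyncio

MAX_CONCURRENT = 3  # mirrors the MAX_CONCURRENT setting mentioned above

async def fetch_page(client, url: str, sem: asyncio.Semaphore) -> str:
    # The semaphore caps how many requests are in flight at once;
    # the (MAX_CONCURRENT + 1)th request waits here until a slot frees up.
    async with sem:
        resp = await client.get(url)
        resp.raise_for_status()
        return resp.text

async def fetch_all(client, urls):
    # One shared semaphore for the whole batch; gather() preserves input order.
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(fetch_page(client, u, sem) for u in urls))
```

Passing the client in (rather than creating it inside) keeps the functions testable and lets one connection pool serve all requests.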
The scheduler (`scheduler.py`) uses the `schedule` library to run daily database backups:
- Dumps are stored in the `dumps/` folder with a timestamp: `dumps/POSTGRES_DB_YYYY-MM-DD_HH-MM-SS.sql`
- Uses `pg_dump` with custom format.
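A minimal sketch of how such a dump job can assemble its `pg_dump` invocation (the helper name `dump_command` is hypothetical; the project's actual `dump_db.py` may differ, e.g. by also passing host and user flags):

```python
import os
from datetime import datetime, timezone

def dump_command(dumps_dir: str = "dumps") -> list[str]:
    """Build a pg_dump command producing a timestamped backup file.

    The filename follows the pattern described above:
    dumps/<POSTGRES_DB>_YYYY-MM-DD_HH-MM-SS.sql
    """
    db = os.environ.get("POSTGRES_DB", "autos")
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")
    path = os.path.join(dumps_dir, f"{db}_{stamp}.sql")
    # --format=custom selects pg_dump's compressed custom archive format
    return ["pg_dump", "--format=custom", f"--file={path}", db]
```

With the `schedule` library, such a job is typically registered via `schedule.every().day.at("12:00").do(...)` and driven by a loop that calls `schedule.run_pending()` periodically.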
Important Docker Note:
Currently, both `scheduler.py` and `main.py` are started in the same container. For production, consider running the scheduler in a separate container to ensure proper PID management and reliability.
Database:
- Uses async SQLAlchemy with the `asyncpg` driver.
- Table `cars` with columns: `url` (primary key), `title`, `price_usd`, `odometer`, `username`, `phone_number`, `image_url`, `images_count`, `car_number`, `car_vin`, `created_at` (UTC timestamp).
- `insert_car` handles conflicts gracefully using `ON CONFLICT DO NOTHING`.
- Tables are automatically created on scraper startup via `init_db()`.
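The duplicate-URL handling mentioned above maps to SQLAlchemy's PostgreSQL-specific `insert().on_conflict_do_nothing()`. A cut-down sketch (only a few of the columns listed above, and a hypothetical `insert_car_stmt` helper rather than the project's real `insert_car`):

```python
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import insert as pg_insert

metadata = sa.MetaData()

# Cut-down sketch of the cars table described above; the project's
# real model in db.py carries the full column list.
cars = sa.Table(
    "cars",
    metadata,
    sa.Column("url", sa.String, primary_key=True),
    sa.Column("title", sa.String),
    sa.Column("price_usd", sa.Integer),
)

def insert_car_stmt(values: dict):
    """Build an INSERT that silently skips rows whose url already exists."""
    return (
        pg_insert(cars)
        .values(**values)
        .on_conflict_do_nothing(index_elements=["url"])
    )
```

The resulting statement compiles to `INSERT ... ON CONFLICT (url) DO NOTHING`, so re-scraping a listing never raises a duplicate-key error.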
- Environment Validation
  - Ensure `.env` is complete and correct.
  - Scraper and dump scripts rely on proper DB host and credentials.
- Docker Networking
  - Containers communicate via service names (`db` for PostgreSQL).
- Scheduler Reliability
  - Running scheduler and scraper in a single container using `&` may cause termination issues.
  - Recommended: separate the `scheduler` into its own service in Docker Compose.
- Logging
  - Scraper prints detailed logs for fetched pages and parsing errors.
  - Scheduler prints success/error messages for each database dump.
- Extensibility
  - Concurrency, dump schedule, and target URL can be configured via `.env`.
Start all services:

```
docker-compose up
```

Start only the scraper:

```
docker-compose run scraper python main.py
```

Create a manual dump:

```
docker-compose run scraper python dump_db.py
```

View logs:

```
docker-compose logs -f scraper
```