Collect, filter, and generate data from open-source transportation datasets
[2024/9/27] Add some generated QA data in `saves`.
[2024/9/26] Support generating QA pairs from the raw air travel title-content dataset.
[2024/9/16] Air dataset generated with gpt-4o-mini from the uncombined data (17,185 rows) using the evolution template; stored in `saves`.
[2024/9/15] Make the `src` folder more modular; you can add your own prompts in `src/template.py`.
[2024/9/12] Add the TransGPT dataset; add an evolution method for generation.
[2024/9/11] Support data augmentation on the air travel dataset.
- Air travel
  Refer to the data processing code; we combine the training and validation sets together.
- TransGPT
  The dataset for fine-tuning can be found here. It is not well-formatted JSON, so it needs preprocessing first; here I transformed it into Alpaca format (a conversion sketch follows).
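As a rough idea of what that conversion involves, here is a minimal sketch, not the actual `json_formatter.py`; the raw field names (`question`/`answer`) and file names are assumptions you should adjust to the downloaded file:

```python
import json

# Sketch only: the raw field names ("question"/"answer") and file names are
# assumptions -- check the downloaded file and adjust; the real conversion is
# done by transGPT/json_formatter.py.
def to_alpaca(record: dict) -> dict:
    return {
        "instruction": record.get("question", ""),
        "input": "",
        "output": record.get("answer", ""),
    }

records = []
with open("transgpt_raw.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON on their own

with open("transgpt_alpaca.json", "w", encoding="utf-8") as f:
    json.dump([to_alpaca(r) for r in records], f, ensure_ascii=False, indent=2)
```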
- air travel
  - QA data from official

    In the `air_travel` folder, run the code listed in `gen_air_travel.sh` to generate data from scratch. You can use the default method or the evolution method with breadth for generation.

    If you have already generated the `air_travel_combined.json` file, you can try to generate an augmented dataset: just run the code listed in `gen_augmented_air.sh`.
  - QA data generated from LLM

    Make sure you have installed this repo using:
    `pip install -e git+https://github.com/ElementQi/chroma-langchain.git#egg=chroma-langchain`

    First, generate small chunks from the large content: you can modify and use `chunk_content_multithread.py` in the `chunk_gen` folder to generate chunked title-content pairs through an embedding model (you can set any LangChain Embeddings); a chunking sketch is given after this list.

    Then you can register or use the `question_answer_gen` template in `src/template.py` to generate QA pairs with the code in `examples/qa_gen.sh` (a template sketch also follows this list). Make sure you are under the root directory.
- transGPT

  If you download the dataset from the TransGPT source, make sure you set the proper file path in `json_formatter.py`, then run it in the `transGPT` folder.
- general
  Check more info about the args in `src/datagen.py`; demo scripts are listed in `examples/air_gen.sh`.
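For the chunking step above, here is a minimal sketch using standard LangChain components rather than the repo's `chunk_content_multithread.py` (which also runs multithreaded); the embedding model, input format, and file names are assumptions:

```python
import json

from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

# Placeholder embedding model -- any LangChain Embeddings implementation works.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-zh-v1.5")
splitter = SemanticChunker(embeddings)

# Assumed input format: a JSON list of {"title": ..., "content": ...} pairs.
with open("air_travel_title_content.json", "r", encoding="utf-8") as f:
    pairs = json.load(f)

chunked = []
for pair in pairs:
    # Split one long content string into semantically coherent chunks,
    # keeping the original title attached to every chunk.
    for chunk in splitter.split_text(pair["content"]):
        chunked.append({"title": pair["title"], "content": chunk})

with open("chunked_title_content.json", "w", encoding="utf-8") as f:
    json.dump(chunked, f, ensure_ascii=False, indent=2)
```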
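And a hypothetical sketch of what registering your own prompt alongside `question_answer_gen` in `src/template.py` might look like; the registration mechanism and prompt wording here are assumptions, not the repo's actual code:

```python
# Hypothetical sketch only -- the real structure of src/template.py may differ.
# The idea: keep named prompt templates in one place and pick them by name
# from the generation scripts (e.g. examples/qa_gen.sh).
TEMPLATES: dict[str, str] = {}

def register_template(name: str, template: str) -> None:
    """Store a prompt template under a short name."""
    TEMPLATES[name] = template

# Illustrative wording -- not the repo's actual question_answer_gen prompt.
register_template(
    "question_answer_gen",
    "Given the following passage, write one question a traveler might ask and "
    "a faithful answer based only on the passage.\n\nPassage:\n{content}",
)

# Add your own prompt the same way, then reference it by name when generating.
register_template(
    "my_custom_prompt",
    "Rewrite the following Q&A pair into three harder variants:\n{content}",
)
```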
Known issues in the raw data (a mechanical filtering sketch follows this list):

- same content
e.g.
{"question": "儿童可不可以在网上值机选座啊", "context": "可以"}
{"question": "儿童可不可以在网上值机选座啊", "context": "可以"}
{"question": "儿童可不可以在网上值机选座啊", "context": "可以办理。"}
{"question": "儿童可不可以在网上值机选座啊", "context": "可以办理。"}
{"question": "儿童可不可以在网上值机选座啊", "context": "可以"}
{"question": "航空公司网上值机时间", "context": "可以"}
{"question": "航空公司网上值机时间", "context": "可以"}
- meaningless content
e.g.
{"question": "超规行李托运", "context": "说明:"}
{"question": "能否托运宠物?两只小狗狗", "context": ";"}
{"question": "能否托运宠物?两只小狗狗", "context": "五."}
{"question": "能否托运宠物?两只小狗狗", "context": "$$table/cz3/cz_checked_luggage_live_animals78.txt$$"}
{"question": "航空公司网上值机时间", "context": "可以"}
{"question": "航空公司网上值机时间", "context": "可以"}
- some content needs to be combined
e.g.
{"question": "机场信息", "context": "(2)国内值机区域:"}
{"question": "机场信息", "context": "国内经济舱值机人工柜台主要分布在T2航站楼D区,包括经济舱值机、中转联程、特殊旅客服务、补交运、补班、售票、行李逾重等柜台,其中特殊旅客服务柜台位于D20,军人依法优先/消防救援人员优先柜台位于D13,中转联程柜台位于D26。"}
{"question": "机场无陪儿童服务多少钱", "context": "2)无成人陪伴青少年票价及服务费:"}
- question and answer don't match
e.g.
{"question": "登飞机的行李箱是多大尺寸", "context": "2.在您的上述随身物品额度之外,您还可以免费携带以下物品:"}
{"question": "伟哥怎么过飞机安检", "context": "(7)国内/国际运输总条件规定的小动物、服务犬;"}
- redundant or useless tokens (e.g. extra spaces)
e.g.
{"question": "我 要 值机", "context": "登机牌可在机场办理,也可通过自助设备、南航网站等方式办理;如需托运行李必须在机场柜台截止办理时间前抵达人工办理柜台,各地规定不同,请咨询始发当地机场。持纸质客票的旅客请到机场柜台办理。"}
{"question": "航空公司 行李", "context": "国内航程 国际或地区航程 *此行李运输规则于2020年12月8日(以系统销售日期为准)起实施。 (1)韩国与中国境内(不含港澳台地区)双向之间的品牌运价产品的免费托运行李额 (2)使用非上述行程的其他品牌运价产品或无品牌运价产品的免费托运行李额"}
- The multi-worker code currently seems to have some problems (a timeout issue) and needs to be fixed.
- Need more precise processing and prompts.
For `air_travel_combined.json`, which has 2,814 (about 3k) rows, getting a roughly 10k dataset only requires having the LLM generate 4 related data points from each observation, i.e. logically tearing each seed apart (2,814 × 4 ≈ 11k; a minimal sketch is given after the to-do items below).
The Chinese character count statistics are shown below:
- Write up the data collection and processing process.
- Compute dataset statistics, such as the approximate number of each type of data.
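For the 10k expansion idea above, here is a minimal sketch assuming an OpenAI-compatible client and gpt-4o-mini (as in the earlier runs); the prompt wording, output parsing, and output file name are illustrative, not the repo's actual evolution template:

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt -- not the repo's actual evolution template.
PROMPT = (
    "Based on the seed Q&A pair below, write 4 new, related question-answer "
    "pairs, each covering one distinct aspect of the seed. Reply only with a "
    'JSON list of {{"question": ..., "answer": ...}} objects.\n\nSeed:\n{seed}'
)

def expand(seed: dict, n: int = 4) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": PROMPT.format(seed=json.dumps(seed, ensure_ascii=False)),
        }],
        temperature=0.7,
    )
    try:
        return json.loads(response.choices[0].message.content)[:n]
    except json.JSONDecodeError:
        return []  # skip seeds whose expansion is not valid JSON

with open("air_travel_combined.json", "r", encoding="utf-8") as f:
    seeds = json.load(f)

expanded = [pair for seed in seeds for pair in expand(seed)]
# Output file name is illustrative.
with open("air_travel_expanded.json", "w", encoding="utf-8") as f:
    json.dump(expanded, f, ensure_ascii=False, indent=2)
```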