羊驼系列大模型和ChatGPT差多少?详细测评后,我沉默了( 二 )


 ╔═══════════╦══════════╦║Model║ Accuracy ║╠═══════════╬══════════╬║ ChatGPT║100%║║ Vicuna║85%║ ║ MPT║30%║╚═══════════╩══════════╩下面是一个 MPT 答错的例子:

羊驼系列大模型和ChatGPT差多少?详细测评后,我沉默了

文章插图
结论在这个非常简单的测试中 , 测试者使用相同的问题、相同的 prompt 得出的结论是:ChatGPT 在准确性方面远远超过了 Vicuna 和 MPT 。
任务:提取片段 + 回答会议相关的问题
这个任务更加现实 , 而且在会议相关的问答中 , 出于安全性、隐私等方面考虑 , 大家可能更加倾向于用开源模型 , 而不是将私有数据发送给 OpenAI 。
以下是一段会议记录(翻译结果来自 DeepL , 仅供参考):
 
羊驼系列大模型和ChatGPT差多少?详细测评后,我沉默了

文章插图
 
 
羊驼系列大模型和ChatGPT差多少?详细测评后,我沉默了

文章插图
 
测试者给出的第一个测试问题是:「Steven 如何看待收购一事?」 , prompt 如下:
 qa_attempt1 = guidance ('''{{#system~}}{{llm.default_system_prompt}}{{~/system}}{{#user~}}You will read a meeting transcript, then extract the relevant segments to answer the following question:Question: {{query}}Here is a meeting transcript:----{{transcript}}----Please answer the following question:Question: {{query}}Extract from the transcript the most relevant segments for the answer, and then answer the question.{{/user}}{{#assistant~}}{{gen 'answer'}}{{~/assistant~}}''')ChatGPT 给出了如下答案:
 
羊驼系列大模型和ChatGPT差多少?详细测评后,我沉默了

文章插图
 
虽然这个回答是合理的 , 但 ChatGPT 并没有提取任何对话片段作为答案的支撑(因此不符合测试者设定的规范) 。测试者在 notebook 中迭代了 5 个不同的 prompt , 以下是一些例子:
qa_attempt3 = guidance ('''{{#system~}}{{llm.default_system_prompt}}{{~/system}}{{#user~}}You will read a meeting transcript, then extract the relevant segments to answer the following question:Question: {{query}}Here is a meeting transcript:----{{transcript}}----Based on the above, please answer the following question:Question: {{query}}Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.Note that conversation segments can be of any length, e.g. including multiple conversation turns.Please extract at most 3 segments. If you need less than three segments, you can leave the rest blank.As an example of output format, here is a fictitious answer to a question about another meeting transcript.CONVERSATION SEGMENTS:Segment 1: Peter and John discuss the weather.Peter: John, how is the weather today?John: It's raining.Segment 2: Peter insults JohnPeter: John, you are a bad person.Segment 3: BlankANSWER: Peter and John discussed the weather and Peter insulted John.{{/user}}{{#assistant~}}{{gen 'answer'}}{{~/assistant~}}''')在这个新的 prompt 中 , ChatGPT 确实提取了相关的片段 , 但它没有遵循测试者规定的输出格式(它没有总结每个片段 , 也没有给出对话者的名字) 。
不过 , 在构建出更复杂的 prompt 之后 , ChatGPT 终于听懂了指示:
qa_attempt5 = guidance ('''{{#system~}}{{llm.default_system_prompt}}{{~/system}}{{#user~}}You will read a meeting transcript, then extract the relevant segments to answer the following question:Question: What were the main things that hAppened in the meeting?Here is a meeting transcript:----Peter: HeyJohn: HeyPeter: John, how is the weather today?John: It's raining.Peter: That's too bad. I was hoping to go for a walk later.John: Yeah, it's a shame.Peter: John, you are a bad person.----Based on the above, please answer the following question:Question: {{query}}Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.Note that conversation segments can be of any length, e.g. including multiple conversation turns.Please extract at most 3 segments. If you need less than three segments, you can leave the rest blank.{{/user}}{{#assistant~}}CONVERSATION SEGMENTS:Segment 1: Peter and John discuss the weather.Peter: John, how is the weather today?John: It's raining.Segment 2: Peter insults JohnPeter: John, you are a bad person.Segment 3: BlankANSWER: Peter and John discussed the weather and Peter insulted John.{{~/assistant~}}{{#user~}}You will read a meeting transcript, then extract the relevant segments to answer the following question:Question: {{query}}Here is a meeting transcript:----{{transcript}}----Based on the above, please answer the following question:Question: {{query}}Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.Note that conversation segments can be of any length, e.g. including multiple conversation turns.Please extract at most 3 segments. If you need less than three segments, you can leave the rest blank.{{~/user}}{{#assistant~}}{{gen 'answer'}}{{~/assistant~}}''')


推荐阅读