Machine Learning Glossary: Metrics

This page contains Metrics glossary terms. For all glossary terms, see the complete glossary.

A

accuracy

#fundamentals

#Metric

The number of correct classification predictions divided by the total number of predictions. That is:

$$\text{Accuracy} = \frac{\text{correct predictions}} {\text{correct predictions + incorrect predictions }}$$

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

$$\text{Accuracy} = \frac{40}{40 + 10} = 80\%$$

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions. So, the accuracy formula for binary classification is as follows:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

where:

TP is the number of true positives (correct predictions).
TN is the number of true negatives (correct predictions).
FP is the number of false positives (incorrect predictions).
FN is the number of false negatives (incorrect predictions).

Compare and contrast accuracy with precision and recall.

Details about accuracy and class-imbalanced datasets:

Although a valuable metric for some situations, accuracy is highly misleading for others. Notably, accuracy is usually a poor metric for evaluating classification models that process class-imbalanced datasets.

For example, suppose snow falls only 25 days per century in a certain subtropical city. Since days without snow (the negative class) vastly outnumber days with snow (the positive class), the snow dataset for this city is class-imbalanced. Imagine a binary classification model that is supposed to predict either snow or no snow each day but simply predicts "no snow" every day. This model is highly accurate but has no predictive power. The following table summarizes the results for a century of predictions:

Category	Number
TP	0
TN	36499
FP	0
FN	25

The accuracy of this model is therefore:

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (0 + 36499) / (0 + 36499 + 0 + 25) = 0.9993 = 99.93%

Although 99.93% accuracy seems like a very impressive percentage, the model actually has no predictive power.
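The snow example can be checked with a few lines of Python. This is a minimal sketch, not part of the glossary: the function names are hypothetical, and recall is computed with the standard definition TP / (TP + FN), which this entry mentions but does not spell out.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives the model found (0.0 if there are none)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Counts from the "always predict no snow" table.
snow = {"tp": 0, "tn": 36499, "fp": 0, "fn": 25}

snow_accuracy = accuracy(**snow)
snow_recall = recall(snow["tp"], snow["fn"])

print(f"accuracy = {snow_accuracy:.4f}")  # 0.9993
print(f"recall   = {snow_recall:.4f}")    # 0.0000
```

The model never finds a single snowy day, so its recall is zero even though its accuracy is close to 100%.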

Precision and recall are usually more useful metrics than accuracy for evaluating models trained on class-imbalanced datasets.

See Classification: Accuracy, recall, precision, and related metrics in Machine Learning Crash Course for more information.

area under the PR curve

#Metric

See PR AUC (Area under the PR Curve).

area under the ROC curve

#Metric

See AUC (Area under the ROC curve).

AUC (Area under the ROC curve)

#fundamentals

#Metric

A number between 0.0 and 1.0 representing a binary classification model's ability to separate positive classes from negative classes. The closer the AUC is to 1.0, the better the model's ability to separate classes from each other.

For example, the following illustration shows a classification model that separates positive classes (green ovals) from negative classes (purple rectangles) perfectly. This unrealistically perfect model has an AUC of 1.0:

Illustration: A number line with 8 positive examples on one side and 9 negative examples on the other side.

Conversely, the following illustration shows the results for a classification model that generated random results. This model has an AUC of 0.5:

Illustration: A number line with 6 positive examples and 6 negative examples. The sequence of examples is positive, negative, positive, negative, positive, negative, positive, negative, positive, negative, positive, negative.

Yes, the preceding model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, the following model separates positives from negatives somewhat, and therefore has an AUC somewhere between 0.5 and 1.0:

Illustration: A number line with 6 positive examples and 6 negative examples. The sequence of examples is negative, negative, negative, negative, positive, negative, positive, positive, negative, positive, positive, positive.

AUC ignores any value you set for the classification threshold. Instead, AUC considers all possible classification thresholds.

The relationship between AUC and ROC curves:

AUC represents the area under an ROC curve. For example, the ROC curve for a model that perfectly separates positives from negatives looks as follows:

AUC is the area of the gray region in the preceding illustration. In this unusual case, the area is simply the length of the gray region (1.0) multiplied by the width of the gray region (1.0). So, the product of 1.0 and 1.0 yields an AUC of exactly 1.0, which is the highest possible AUC score.

Conversely, the ROC curve for a classifier that can't separate classes at all is as follows. The area of this gray region is 0.5.

A more typical ROC curve looks approximately like the following:

It would be painstaking to calculate the area under this curve manually, which is why a program typically calculates most AUC values.

A more formal definition of AUC:

AUC is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.
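The probabilistic definition can be turned directly into a brute-force calculation over every (positive, negative) pair. The sketch below is illustrative only: `pairwise_auc` is a hypothetical helper, ties are counted as half a win (a common convention, assumed here), and each example's score is assumed to be its left-to-right position in the twelve-example sequence described in this entry.

```python
from itertools import product


def pairwise_auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (an assumption, not stated in the glossary).
    """
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Assumed scores: each example's score is its position (1..12) in the sequence
# negative, negative, negative, negative, positive, negative,
# positive, positive, negative, positive, positive, positive.
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
scores = list(range(1, 13))

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

auc = pairwise_auc(pos, neg)
print(f"AUC = {auc:.3f}")  # 32 of 36 pairs are ordered correctly -> 0.889
```

The result falls between 0.5 and 1.0, as expected for a model that separates the classes only partially. For large datasets, a library routine is a better choice than this quadratic loop.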

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

average precision at k

#language

#Metric

A metric for summarizing a model's performance on a single prompt that generates ranked results, such as a numbered list of book recommendations. Average precision at k is, well, the average of the precision at k values for each relevant result. The formula for average precision at k is therefore:

\[{\text{average precision at k}} = \frac{1}{n} \sum_{i=1}^n {\text{precision at k for each relevant item} } \]

where:

$n$ is the number of relevant items in the list.

Contrast with recall at k.

Example:

Suppose a large language model is given the following query:

 List the 6 funniest movies of all time in order.

And the large language model returns the following list:

The General
Mean Girls
Platoon
Bridesmaids
Citizen Kane
This is Spinal Tap

Four of the movies in the returned list are very funny (that is, they are relevant) but two movies are dramas (not relevant). The following table details the results:

Position	Movie	Relevant?	Precision at k
1	The General	Yes	1.0
2	Mean Girls	Yes	1.0
3	Platoon	No	not relevant
4	Bridesmaids	Yes	0.75
5	Citizen Kane	No	not relevant
6	This is Spinal Tap	Yes	0.67

The number of relevant results is 4. Therefore, you can calculate the average precision at 6 as follows:

$${\text{average precision at 6}} = \frac{1}{4} {\text{(1.0 + 1.0 + 0.75 + 0.67)} } $$
$${\text{average precision at 6}} \approx 0.85$$
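The movie example can be reproduced with a short function. This is a sketch under one assumption: relevance is given as a list of booleans in ranked order, and the function name is hypothetical.

```python
def average_precision_at_k(relevant_flags):
    """Average of precision@i over the positions i that hold a relevant item."""
    precisions = []
    hits = 0
    for position, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Relevance of the six returned movies, in ranked order.
movie_relevance = [True, True, False, True, False, True]

ap_at_6 = average_precision_at_k(movie_relevance)
print(f"average precision at 6 = {ap_at_6:.2f}")  # 0.85
```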

B

baseline

#Metric

A model used as a reference point for comparing how well another model (typically, a more complex one) is performing. For example, a logistic regression model might serve as a good baseline for a deep model.

For a particular problem, the baseline helps model developers quantify the minimal expected performance that a new model must achieve for the new model to be useful.

C

cost

#Metric

Synonym for loss.

counterfactual fairness

#responsible

#Metric

A fairness metric that checks whether a classification model produces the same result for one individual as it does for another individual who is identical to the first, except with respect to one or more sensitive attributes. Evaluating a classification model for counterfactual fairness is one method for surfacing potential sources of bias in a model.

See either of the following for more information:

Fairness: Counterfactual fairness in Machine Learning Crash Course.
When Worlds Collide: Integrating Different Counterfactual Assumptions in Fairness

cross-entropy

#Metric

A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions. See also perplexity.

cumulative distribution function (CDF)

#Metric

A function that defines the frequency of samples less than or equal to a target value. For example, consider a normal distribution of continuous values. A CDF tells you that approximately 50% of samples should be less than or equal to the mean and that approximately 84% of samples should be less than or equal to one standard deviation above the mean.
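The 50% and 84% figures can be verified numerically. The sketch below uses the standard closed form of the normal CDF in terms of the error function; `normal_cdf` is a hypothetical helper name, not something the glossary defines.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of a normal distribution, written in terms of the error function."""
    z = (x - mean) / (std * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


at_mean = normal_cdf(0.0)        # value at the mean
one_std_above = normal_cdf(1.0)  # value one standard deviation above the mean

print(f"P(X <= mean)        = {at_mean:.2f}")        # 0.50
print(f"P(X <= mean + 1 sd) = {one_std_above:.2f}")  # 0.84
```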

D

demographic parity

#responsible

#Metric

A fairness metric that is satisfied if the results of a model's classification are not dependent on a given sensitive attribute.

For example, if both Lilliputians and Brobdingnagians apply to Glubbdubdrib University, demographic parity is achieved if the percentage of Lilliputians admitted is the same as the percentage of Brobdingnagians admitted, irrespective of whether one group is on average more qualified than the other.

Contrast with equalized odds and equality of opportunity, which permit classification results in aggregate to depend on sensitive attributes, but don't permit classification results for certain specified ground truth labels to depend on sensitive attributes. See "Attacking discrimination with smarter machine learning" for a visualization exploring the tradeoffs when optimizing for demographic parity.

See Fairness: demographic parity in Machine Learning Crash Course for more information.

E

earth mover's distance (EMD)

#Metric

A measure of the relative similarity of two distributions. The lower the earth mover's distance, the more similar the distributions.

edit distance

#language

#Metric

A measurement of how similar two text strings are to each other. In machine learning, edit distance is useful for the following reasons:

Edit distance is easy to compute.
Edit distance can compare two strings known to be similar to each other.
Edit distance can determine the degree to which different strings are similar to a given string.

There are several definitions of edit distance, each using different string operations. See Levenshtein distance for an example.
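As one concrete instance, Levenshtein distance counts single-character insertions, deletions, and substitutions. The sketch below is a standard dynamic-programming implementation, offered as an illustration rather than as the glossary's own code; the example strings are arbitrary.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("heart", "darts"))  # 3
```

Keeping only two rows of the table bounds memory by the length of the second string, which is why the function stores `previous` and `current` instead of a full matrix.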

empirical cumulative distribution function (eCDF or EDF)

#Metric

A cumulative distribution function based on empirical measurements from a real dataset. The value of the function at any point along the x-axis is the fraction of observations in the dataset that are less than or equal to the specified value.
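The definition translates almost word for word into code. In this sketch, the function name and the sample measurements are hypothetical and serve only as an illustration.

```python
def ecdf(observations, x):
    """Fraction of observations that are less than or equal to x."""
    return sum(1 for value in observations if value <= x) / len(observations)


# Hypothetical measurements, used only for illustration.
samples = [2.0, 3.5, 3.5, 4.0, 7.0]

print(ecdf(samples, 3.5))  # 0.6: three of the five observations are <= 3.5
print(ecdf(samples, 1.0))  # 0.0
print(ecdf(samples, 7.0))  # 1.0
```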

entropy

#df

#Metric

In information theory, a description of how unpredictable a probability distribution is. Alternatively, entropy is also defined as how much information each example contains. A distribution has the highest possible entropy when all values of a random variable are equally likely.

The entropy of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) has the following formula:

H = -p log p - q log q = -p log p - (1-p) * log (1-p)

where:

H is the entropy.
p is the fraction of "1" examples.
q is the fraction of "0" examples. Note that q = (1 - p)
log is generally log₂. In this case, the entropy unit is a bit.

For example, suppose the following:

100 examples contain the value "1"
300 examples contain the value "0"

Therefore, the entropy value is:

p = 0.25
q = 0.75
H = (-0.25)log₂(0.25) - (0.75)log₂(0.75) = 0.81 bits per example

A set that is perfectly balanced (for example, 200 "0"s and 200 "1"s) would have an entropy of 1.0 bit per example. As a set becomes more imbalanced, its entropy moves towards 0.0.
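The worked numbers can be reproduced with a small function. This is a sketch: `binary_entropy` is a hypothetical name, and a pure set (p equal to 0 or 1) is treated as having zero entropy, the usual convention for 0 log 0.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-valued set where a fraction p of labels is "1"."""
    if p in (0.0, 1.0):
        return 0.0  # a pure set is perfectly predictable
    q = 1.0 - p
    return -p * math.log2(p) - q * math.log2(q)


print(f"{binary_entropy(100 / 400):.2f}")  # 0.81
print(f"{binary_entropy(200 / 400):.2f}")  # 1.00
```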

In decision trees, entropy helps formulate information gain to help the splitter select the conditions during the growth of a classification decision tree.

Compare entropy with:

gini impurity
cross-entropy loss function

Entropy is often called Shannon's entropy.

See Exact splitter for binary classification with numerical features in the Decision Forests course for more information.

equality of opportunity

#responsible

#Metric

A fairness metric to assess whether a model is predicting the desirable outcome equally well for all values of a sensitive attribute. In other words, if the desirable outcome for a model is the positive class, the goal would be to have the true positive rate be the same for all groups.

Equality of opportunity is related to equalized odds, which requires that both the true positive rates and false positive rates are the same for all groups.

Suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians' secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians' secondary schools don't offer math classes at all, and as a result, far fewer of their students are qualified. Equality of opportunity is satisfied for the preferred label of "admitted" with respect to nationality (Lilliputian or Brobdingnagian) if qualified students are equally likely to be admitted irrespective of whether they're a Lilliputian or a Brobdingnagian.

For example, suppose 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 1. Lilliputian applicants (90% are qualified)

	Qualified	Unqualified
Admitted	45	3
Rejected	45	7
Total	90	10
Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 7/10 = 70%
Total percentage of Lilliputian students admitted: (45+3)/100 = 48%

Table 2. Brobdingnagian applicants (10% are qualified):

	Qualified	Unqualified
Admitted	5	9
Rejected	5	81
Total	10	90
Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 81/90 = 90%
Total percentage of Brobdingnagian students admitted: (5+9)/100 = 14%

The preceding examples satisfy equality of opportunity for acceptance of qualified students because qualified Lilliputians and Brobdingnagians both have a 50% chance of being admitted.

While equality of opportunity is satisfied, the following two fairness metrics are not satisfied:

demographic parity: Lilliputians and Brobdingnagians are admitted to the university at different rates; 48% of Lilliputian students are admitted, but only 14% of Brobdingnagian students are admitted.
equalized odds: While qualified Lilliputian and Brobdingnagian students both have the same chance of being admitted, the additional constraint that unqualified Lilliputians and Brobdingnagians both have the same chance of being rejected is not satisfied. Unqualified Lilliputians have a 70% rejection rate, whereas unqualified Brobdingnagians have a 90% rejection rate.
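The three fairness checks discussed for Tables 1 and 2 can be expressed as rate comparisons. The sketch below is illustrative: the `Admissions` class, the `same` helper, and the tolerance are hypothetical choices. It treats "admitted" as the positive class and "qualified" as the ground truth label.

```python
from dataclasses import dataclass


@dataclass
class Admissions:
    """Admission counts for one group, split by qualification."""
    qualified_admitted: int
    qualified_rejected: int
    unqualified_admitted: int
    unqualified_rejected: int

    @property
    def true_positive_rate(self) -> float:
        # Share of qualified applicants who were admitted.
        return self.qualified_admitted / (self.qualified_admitted + self.qualified_rejected)

    @property
    def false_positive_rate(self) -> float:
        # Share of unqualified applicants who were admitted.
        return self.unqualified_admitted / (self.unqualified_admitted + self.unqualified_rejected)

    @property
    def admission_rate(self) -> float:
        # Share of all applicants who were admitted.
        total = (self.qualified_admitted + self.qualified_rejected
                 + self.unqualified_admitted + self.unqualified_rejected)
        return (self.qualified_admitted + self.unqualified_admitted) / total


def same(a: float, b: float, tol: float = 1e-9) -> bool:
    """True when two rates are equal up to a small tolerance."""
    return abs(a - b) <= tol


lilliputians = Admissions(45, 45, 3, 7)    # Table 1
brobdingnagians = Admissions(5, 5, 9, 81)  # Table 2

equal_opportunity = same(lilliputians.true_positive_rate,
                         brobdingnagians.true_positive_rate)
demographic_parity = same(lilliputians.admission_rate,
                          brobdingnagians.admission_rate)
equalized_odds = equal_opportunity and same(lilliputians.false_positive_rate,
                                            brobdingnagians.false_positive_rate)

print(equal_opportunity, demographic_parity, equalized_odds)  # True False False
```

Comparing false positive rates is equivalent to comparing the rejection rates of unqualified applicants, because the two rates sum to 1.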

See Fairness: Equality of opportunity in Machine Learning Crash Course for more information.

equalized odds

#responsible

#Metric

A fairness metric to assess whether a model is predicting outcomes equally well for all values of a sensitive attribute with respect to both the positive class and negative class, not just one class or the other exclusively. In other words, both the true positive rate and false positive rate should be the same for all groups.

Equalized odds is related to equality of opportunity, which only focuses on error rates for a single class (positive or negative).

For example, suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians' secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians' secondary schools don't offer math classes at all, and as a result, far fewer of their students are qualified. Equalized odds is satisfied provided that no matter whether an applicant is a Lilliputian or a Brobdingnagian, if they are qualified, they are equally likely to get admitted to the program, and if they are not qualified, they are equally likely to get rejected.

Suppose 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 3. Lilliputian applicants (90% are qualified)

	Qualified	Unqualified
Admitted	45	2
Rejected	45	8
Total	90	10
Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 8/10 = 80%
Total percentage of Lilliputian students admitted: (45+2)/100 = 47%

Table 4. Brobdingnagian applicants (10% are qualified):

	Qualified	Unqualified
Admitted	5	18
Rejected	5	72
Total	10	90
Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 72/90 = 80%
Total percentage of Brobdingnagian students admitted: (5+18)/100 = 23%

Equalized odds is satisfied because qualified Lilliputian and Brobdingnagian students both have a 50% chance of being admitted, and unqualified Lilliputian and Brobdingnagian students both have an 80% chance of being rejected.

Equalized odds is formally defined in "Equality of Opportunity in Supervised Learning" as follows: "predictor Ŷ satisfies equalized odds with respect to protected attribute A and outcome Y if Ŷ and A are independent, conditional on Y."

evals

#language

#generativeAI

#Metric

Primarily used as an abbreviation for LLM evaluations. More broadly, evals is an abbreviation for any form of evaluation.

evaluation

#language

#generativeAI

#Metric

The process of measuring a model's quality or comparing different models against each other.

To evaluate a supervised machine learning model, you typically judge it against a validation set and a test set. Evaluating an LLM typically involves broader quality and safety assessments.

F

F₁

#Metric

A "roll-up" binary classification metric that relies on both precision and recall. Here is the formula:

$$F{_1} = \frac{\text{2 * precision * recall}} {\text{precision + recall}}$$

Examples:

Suppose precision and recall have the following values:

precision = 0.6
recall = 0.4

You calculate F₁ as follows:

$$F{_1} = \frac{\text{2 * 0.6 * 0.4}} {\text{0.6 + 0.4}} = 0.48$$

When precision and recall are fairly similar (as in the preceding example), F₁ is close to their mean. When precision and recall differ significantly, F₁ is closer to the lower value. For example:

precision = 0.9
recall = 0.1

$$F{_1} = \frac{\text{2 * 0.9 * 0.1}} {\text{0.9 + 0.1}} = 0.18$$
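Both worked examples follow from a one-line function. In this sketch the function name is hypothetical, and returning 0.0 when precision and recall are both zero is an assumed convention to avoid division by zero.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f"{f1_score(0.6, 0.4):.2f}")  # 0.48, close to the mean of 0.5
print(f"{f1_score(0.9, 0.1):.2f}")  # 0.18, pulled toward the lower value
```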

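fairness metric

#responsible

#Metric

A mathematical definition of "fairness" that is measurable. Some commonly used fairness metrics include the following, each defined elsewhere in this glossary:

counterfactual fairness
demographic parity
equalized odds

Many fairness metrics are mutually exclusive; see incompatibility of fairness metrics.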

false negative (FN)

#fundamentals

#Metric

An example in which the model mistakenly predicts the negative class. For example, the model predicts that a particular email message is not spam (the negative class), but that email message actually is spam.

false negative rate

#Metric

The proportion of actual positive examples for which the model mistakenly predicted the negative class. The following formula calculates the false negative rate:

$$\text{false negative rate} = \frac{\text{false negatives}}{\text{false negatives} + \text{true positives}}$$

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive (FP)

#fundamentals

#Metric

An example in which the model mistakenly predicts the positive class. For example, the model predicts that a particular email message is spam (the positive class), but that email message is actually not spam.

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive rate (FPR)

#fundamentals

#Metric

The proportion of actual negative examples for which the model mistakenly predicted the positive class. The following formula calculates the false positive rate:

$$\text{false positive rate} = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}$$

The false positive rate is the x-axis in an ROC curve.
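The false negative rate and false positive rate formulas can be sketched side by side. The confusion-matrix counts below are hypothetical spam-filter numbers chosen only for illustration.

```python
def false_negative_rate(fn: int, tp: int) -> float:
    """Share of actual positives that the model labeled negative."""
    return fn / (fn + tp)


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of actual negatives that the model labeled positive."""
    return fp / (fp + tn)


# Hypothetical spam-filter counts, chosen only for illustration.
tp, fn, fp, tn = 80, 20, 10, 890

print(false_negative_rate(fn, tp))            # 0.2
print(round(false_positive_rate(fp, tn), 4))  # 0.0111
```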

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

feature importances

#df

#Metric

Synonym for variable importances.

fraction of successes

#generativeAI

#Metric

A metric for evaluating an ML model's generated text. The fraction of successes is the number of "successful" generated text outputs divided by the total number of generated text outputs. For example, if a large language model generated 10 blocks of code, five of which were successful, then the fraction of successes would be 50%.

Although fraction of successes is broadly useful throughout statistics, within ML, this metric is primarily useful for measuring verifiable tasks like code generation or math problems.

G

gini impurity

#df

#Metric

A metric similar to entropy. Splitters use values derived from either gini impurity or entropy to compose conditions for classification decision trees. Information gain is derived from entropy. There is no universally accepted equivalent term for the metric derived from gini impurity; however, this unnamed metric is just as important as information gain.

Gini impurity is also called gini index, or simply gini.

Mathematical details about gini impurity:

Gini impurity is the probability of misclassifying a new piece of data taken from the same distribution. The gini impurity of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) is calculated from the following formula:

I = 1 - (p² + q²) = 1 - (p² + (1-p)²)

where:

I is the gini impurity.
p is the fraction of "1" examples.
q is the fraction of "0" examples. Note that q = 1-p

For example, consider the following dataset:

100 labels (0.25 of the dataset) contain the value "1".
300 labels (0.75 of the dataset) contain the value "0".

Therefore, the gini impurity is:

p = 0.25
q = 0.75
I = 1 - (0.25² + 0.75²) = 0.375

Consequently, a random label from the same dataset would have a 37.5% chance of being misclassified, and a 62.5% chance of being properly classified.

A perfectly balanced label (for example, 200 "0"s and 200 "1"s) would have a gini impurity of 0.5. A highly imbalanced label would have a gini impurity close to 0.0.
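The gini impurity formula for two classes fits in a single expression. In this sketch, `gini_impurity` is a hypothetical name, and the 1-in-400 case is an invented example of a highly imbalanced label.

```python
def gini_impurity(p: float) -> float:
    """Gini impurity of a two-valued set where a fraction p of labels is "1"."""
    q = 1.0 - p
    return 1.0 - (p ** 2 + q ** 2)


print(gini_impurity(100 / 400))  # 0.375
print(gini_impurity(200 / 400))  # 0.5
print(gini_impurity(1 / 400))    # close to 0.0 for a highly imbalanced set
```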

H

hinge loss

#Metric

A family of loss functions for classification designed to find the decision boundary as distant as possible from each training example, thus maximizing the margin between examples and the boundary. KSVMs use hinge loss (or a related function, such as squared hinge loss). For binary classification, the hinge loss function is defined as follows:

$$\text{loss} = \text{max}(0, 1 - (y * y'))$$

where y is the true label, either -1 or +1, and y' is the raw output of the classification model:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$

Consequently, a plot of hinge loss versus (y * y') looks as follows:

Illustration: A Cartesian plot consisting of two joined line segments. The first line segment starts at (-3, 4) and ends at (1, 0). The second line segment begins at (1, 0) and continues indefinitely with a slope of 0.
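The two line segments in the plot fall out of the formula directly. This is a minimal sketch with a hypothetical function name; the sample inputs are chosen to hit the plot's endpoints.

```python
def hinge_loss(y: int, y_prime: float) -> float:
    """Hinge loss for a true label y in {-1, +1} and raw model output y_prime."""
    return max(0.0, 1.0 - y * y_prime)


print(hinge_loss(+1, -3.0))  # 4.0: the point (-3, 4) on the plot
print(hinge_loss(+1, 1.0))   # 0.0: the point (1, 0) on the plot
print(hinge_loss(-1, -2.5))  # 0.0: confident and correct, so no loss
```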

I

incompatibility of fairness metrics

#responsible

#Metric

The idea that some notions of fairness are mutually incompatible and cannot be satisfied simultaneously. As a result, there is no single universal metric for quantifying fairness that can be applied to all ML problems.

While this may seem discouraging, incompatibility of fairness metrics doesn't imply that fairness efforts are fruitless. Instead, it suggests that fairness must be defined contextually for a given ML problem, with the goal of preventing harms specific to its use cases.

See "On the (im)possibility of fairness" for a more detailed discussion of the incompatibility of fairness metrics.

individual fairness

#responsible

#Metric

A fairness metric that checks whether similar individuals are classified similarly. For example, Brobdingnagian Academy might want to satisfy individual fairness by ensuring that two students with identical grades and standardized test scores are equally likely to gain admission.

Note that individual fairness relies entirely on how you define "similarity" (in this case, grades and test scores), and you can run the risk of introducing new fairness problems if your similarity metric misses important information (such as the rigor of a student's curriculum).

See "Fairness Through Awareness" for a more detailed discussion of individual fairness.

information gain

#df

#Metric

In decision forests, the difference between a node's entropy and the weighted (by number of examples) sum of the entropy of its children nodes. A node's entropy is the entropy of the examples in that node.

For example, consider the following entropy values:

entropy of parent node = 0.6
entropy of one child node with 16 relevant examples = 0.2
entropy of another child node with 24 relevant examples = 0.1

So 40% of the examples are in one child node and 60% are in the other child node. Therefore:

weighted entropy sum of child nodes = (0.4 * 0.2) + (0.6 * 0.1) = 0.14

So, the information gain is:

information gain = entropy of parent node - weighted entropy sum of child nodes
information gain = 0.6 - 0.14 = 0.46

Most splitters seek to create conditions that maximize information gain.
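The worked example generalizes to any number of child nodes. In this sketch, the function name and the (example_count, entropy) pair layout are assumptions made for illustration.

```python
def information_gain(parent_entropy, children):
    """children is a list of (example_count, entropy) pairs, one per child node."""
    total = sum(count for count, _ in children)
    weighted_sum = sum((count / total) * entropy for count, entropy in children)
    return parent_entropy - weighted_sum


gain = information_gain(0.6, [(16, 0.2), (24, 0.1)])
print(f"{gain:.2f}")  # 0.46
```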

inter-rater agreement

#Metric

A measurement of how often human raters agree when doing a task. If raters disagree, the task instructions may need to be improved. Also sometimes called inter-annotator agreement or inter-rater reliability. See also Cohen's kappa, which is one of the most popular inter-rater agreement measurements.

See Categorical data: Common issues in Machine Learning Crash Course for more information.

L

L₁ loss

#fundamentals

#Metric

A loss function that calculates the absolute value of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L₁ loss for a batch of five examples:

Actual value of example	Model's predicted value	Absolute value of delta
7	6	1
5	4	1
8	11	3
4	6	2
9	8	1
		8 = L₁ loss

L₁ loss is less sensitive to outliers than L₂ loss.

The Mean Absolute Error is the average L₁ loss per example.

The formal math:

$$ L_1 \text{ loss} = \sum_{i=1}^n | y_i - \hat{y}_i |$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.
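The batch of five examples can be checked in a couple of lines. This is a plain-Python sketch with a hypothetical function name; a numerical library would normally do this with arrays.

```python
def l1_loss(actual, predicted):
    """Sum of absolute differences between labels and predictions."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted))


actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]

print(l1_loss(actual, predicted))  # 8
```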

See Linear regression: Loss in Machine Learning Crash Course for more information.

L₂ loss

#fundamentals

#Metric

A loss function that calculates the square of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L₂ loss for a batch of five examples:

Actual value of example	Model's predicted value	Square of delta
7	6	1
5	4	1
8	11	9
4	6	4
9	8	1
		16 = L₂ loss

Due to squaring, L₂ loss amplifies the influence of outliers. That is, L₂ loss reacts more strongly to bad predictions than L₁ loss. For example, the L₁ loss for the preceding batch would be 8 rather than 16. Notice that a single outlier accounts for 9 of the 16.

Regression models typically use L₂ loss as the loss function.

The Mean Squared Error is the average L₂ loss per example. Squared loss is another name for L₂ loss.

The formal math:

$$ L_2 \text{ loss} = \sum_{i=1}^n {(y_i - \hat{y}_i)}^2$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.
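The same batch shows how squaring lets one outlier dominate the total. This is a plain-Python sketch; the function and variable names are hypothetical.

```python
def l2_loss(actual, predicted):
    """Sum of squared differences between labels and predictions."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))


actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]

total = l2_loss(actual, predicted)
largest_term = max((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))

print(total)         # 16
print(largest_term)  # 9, contributed by the single outlier (8 vs. 11)
```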

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

LLM evaluations (evals)

#language

#generativeAI

#Metric

A set of metrics and benchmarks for assessing the performance of large language models (LLMs). At a high level, LLM evaluations:

Help researchers identify areas where LLMs need improvement.
Are useful in comparing different LLMs and identifying the best LLM for a particular task.
Help ensure that LLMs are safe and ethical to use.

See Large language models (LLMs) in Machine Learning Crash Course for more information.

loss

#fundamentals

#Metric

During the training of a supervised model, a measure of how far a model's prediction is from its label.

A loss function calculates the loss.

See Linear regression: Loss in Machine Learning Crash Course for more information.

loss function

#fundamentals

#Metric

During training or testing, a mathematical function that calculates the loss on a batch of examples. A loss function returns a lower loss for models that make good predictions than for models that make bad predictions.

The goal of training is typically to minimize the loss that a loss function returns.

Many different kinds of loss functions exist. Pick the appropriate loss function for the kind of model you are building. For example:

L₂ loss (or Mean Squared Error) is the loss function for linear regression.
Log Loss is the loss function for logistic regression.

M

Mean Absolute Error (MAE)

#Metric

The average loss per example when L₁ loss is used. Calculate Mean Absolute Error as follows:

Calculate the L₁ loss for a batch.
Divide the L₁ loss by the number of examples in the batch.

The formal math:

$$\text{Mean Absolute Error} = \frac{1}{n}\sum_{i=1}^n | y_i - \hat{y}_i |$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.

For example, consider the calculation of L₁ loss on the following batch of five examples:

Actual value of example	Model's predicted value	Loss (difference between actual and predicted)
7	6	1
5	4	1
8	11	3
4	6	2
9	8	1
		8 = L₁ loss

So, L₁ loss is 8 and the number of examples is 5. Therefore, the Mean Absolute Error is:

 Mean Absolute Error = L₁ loss / Number of Examples
 Mean Absolute Error = 8/5 = 1.6

Contrast Mean Absolute Error with Mean Squared Error and Root Mean Squared Error.

mean average precision at k (mAP@k)

#language

#generativeAI

#Metric

The statistical mean of all average precision at k scores across a validation dataset. One use of mean average precision at k is to judge the quality of recommendations generated by a recommendation system.

Although the phrase "mean average" sounds redundant, the name of the metric is appropriate. After all, this metric finds the mean of multiple average precision at k values.

Example:

Suppose you build a recommendation system that generates a personalized list of recommended novels for each user. Based on feedback from selected users, you calculate the following five average precision at k scores (one score per user):

0.73
0.77
0.67
0.82
0.76

The mean average precision at k is therefore:

$$\text{mean } = \frac{\text{0.73 + 0.77 + 0.67 + 0.82 + 0.76}} {\text{5}} = \text{0.75}$$
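Once the per-user scores exist, the final step is an ordinary mean. The sketch below uses Python's standard `statistics.mean`; the variable names are hypothetical.

```python
from statistics import mean

# Average precision at k for each of the five users in the example.
per_user_ap = [0.73, 0.77, 0.67, 0.82, 0.76]

map_at_k = mean(per_user_ap)
print(f"{map_at_k:.2f}")  # 0.75
```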

Mean Squared Error (MSE)

#Metric

The average loss per example when L₂ loss is used. Calculate Mean Squared Error as follows:

Calculate the L₂ loss for a batch.
Divide the L₂ loss by the number of examples in the batch.

The formal math:

$$\text{Mean Squared Error} = \frac{1}{n}\sum_{i=1}^n {(y_i - \hat{y}_i)}^2$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the model's prediction for $y$.

For example, consider the loss on the following batch of five examples:

Actual value	Model's prediction	Loss	Squared loss
7	6	1	1
5	4	1	1
8	11	3	9
4	6	2	4
9	8	1	1
			16 = L₂ loss

Therefore, the Mean Squared Error is:

 Mean Squared Error = L₂ loss / Number of Examples
 Mean Squared Error = 16/5 = 3.2

Mean Squared Error is a popular training loss function, particularly for linear regression.

Contrast Mean Squared Error with Mean Absolute Error and Root Mean Squared Error.

TensorFlow Playground uses Mean Squared Error to calculate loss values.

More detail about outliers:

Outliers strongly influence Mean Squared Error. For example, a loss of 1 is a squared loss of 1, but a loss of 3 is a squared loss of 9. In the preceding table, the example with a loss of 3 accounts for ~56% of the Mean Squared Error, while each of the examples with a loss of 1 accounts for only 6% of the Mean Squared Error.

Outliers don't influence Mean Absolute Error as strongly as Mean Squared Error. For example, a loss of 3 accounts for only ~38% of the Mean Absolute Error.
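The outlier percentages can be recomputed from the batch of five examples. This is an illustrative sketch; the variable names are hypothetical.

```python
actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]

abs_errors = [abs(y - y_hat) for y, y_hat in zip(actual, predicted)]
sq_errors = [e ** 2 for e in abs_errors]

mae = sum(abs_errors) / len(abs_errors)
mse = sum(sq_errors) / len(sq_errors)

# Share of each total that comes from the largest error (the outlier).
outlier_share_mae = max(abs_errors) / sum(abs_errors)
outlier_share_mse = max(sq_errors) / sum(sq_errors)

print(mae, mse)                    # 1.6 3.2
print(f"{outlier_share_mae:.2%}")  # 37.50%
print(f"{outlier_share_mse:.2%}")  # 56.25%
```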

Clipping is one way to prevent extreme outliers from damaging your model's predictive ability.

metric

#TensorFlow

#Metric

A statistic that you care about.

An objective is a metric that a machine learning system tries to optimize.

Metrics API (tf.metrics)

#Metric

A TensorFlow API for evaluating models. For example, tf.metrics.accuracy determines how often a model's predictions match labels.

minimax loss

#Metric

A loss function for generative adversarial networks, based on the cross-entropy between the distribution of generated data and real data.

Minimax loss is used in the first paper to describe generative adversarial networks.

See Loss Functions in the Generative Adversarial Networks course for more information.

model capacity

#Metric

The complexity of problems that a model can learn. The more complex the problems that a model can learn, the higher the model's capacity. A model's capacity typically increases with the number of model parameters. For a formal definition of classification model capacity, see VC dimension.

N

negative class

#fundamentals

#Metric

In binary classification, one class is termed positive and the other is termed negative. The positive class is the thing or event that the model is testing for and the negative class is the other possibility. For example:

The negative class in a medical test might be "not tumor."
The negative class in an email classification model might be "not spam."

Contrast with positive class.

O

objective

#Metric

A metric that your algorithm is trying to optimize.

objective function

#Metric

The mathematical formula or metric that a model aims to optimize. For example, the objective function for linear regression is usually Mean Squared Loss. Therefore, when training a linear regression model, training aims to minimize Mean Squared Loss.

In some cases, the goal is to maximize the objective function. For example, if the objective function is accuracy, the goal is to maximize accuracy.

See also loss.

P

pass at k (pass@k)

#Metric

A metric for determining the quality of code (for example, Python) that a large language model generates. More specifically, pass at k tells you the likelihood that at least one of k generated blocks of code will pass all of its unit tests.

Large language models often struggle to generate good code for complex programming problems. Software engineers adapt to this problem by prompting the large language model to generate multiple (k) solutions for the same problem. Then, software engineers test each of the solutions against unit tests. The calculation of pass at k depends on the outcome of the unit tests:

If one or more of those solutions pass the unit tests, then the LLM passes that code generation challenge.
If none of the solutions pass the unit tests, then the LLM fails that code generation challenge.

The formula for pass at k is as follows:

\[\text{pass at k} = \frac{\text{total number of passes}} {\text{total number of challenges}}\]

In general, higher values of k produce higher pass at k scores. However, higher values of k require more large language model and unit testing resources.

Suppose a software engineer asks a large language model to generate k=10 solutions for n=50 challenging coding problems. Here are the results:

30 passes
20 failures

The pass at 10 score is therefore:

$$\text{pass at 10} = \frac{\text{30}} {\text{50}} = 0.6$$
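The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a standard implementation: the function name `pass_at_k` and the nested-list layout of test outcomes are assumptions made for this example, and the outcomes themselves are synthetic values chosen to match the 30 passes and 20 failures in the example.

```python
def pass_at_k(results):
    """Fraction of challenges with at least one passing solution.

    results[i][j] is True when solution j for challenge i passed all of
    its unit tests (a hypothetical data layout chosen for this sketch).
    """
    if not results:
        raise ValueError("need at least one challenge")
    passes = sum(1 for solutions in results if any(solutions))
    return passes / len(results)


# Synthetic outcomes matching the example: k=10 solutions for 50 challenges,
# 30 challenges with one passing solution and 20 challenges with none.
k = 10
results = [[False] * (k - 1) + [True] for _ in range(30)]
results += [[False] * k for _ in range(20)]

score = pass_at_k(results)
print(f"pass at {k} = {score:.2f}")
```

Note that a challenge counts as passed no matter how many of its k solutions succeed, which is why larger k tends to raise the score.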

performance

#Metric

Overloaded term with the following meanings:

The standard meaning within software engineering. Namely: How fast (or efficiently) does this piece of software run?
The meaning within machine learning. Here, performance answers the following question: How correct is this model? That is, how good are the model's predictions?

permutation variable importances

#df

#Metric

A type of variable importance that evaluates the increase in the prediction error of a model after permuting the feature's values. Permutation variable importance is a model-independent metric.

perplexity

#Metric

One measure of how well a model is accomplishing its task. For example, suppose your task is to read the first few letters of a word a user is typing on a phone keyboard, and to offer a list of possible completion words. Perplexity, P, for this task is approximately the number of guesses you need to offer in order for your list to contain the actual word the user is trying to type.

Perplexity is related to cross-entropy (measured in bits) as follows:

$$P= 2^{\text{cross entropy}}$$

positive class

#fundamentals

#Metric

The class you are testing for.

For example, the positive class in a cancer model might be "tumor." The positive class in an email classification model might be "spam."

Contrast with negative class.

The term positive class can be confusing because the "positive" outcome of many tests is often an undesirable result. For example, the positive class in many medical tests corresponds to tumors or diseases. In general, you want a doctor to tell you, "Congratulations! Your test results were negative." Regardless, the positive class is the event that the test is seeking to find.

Admittedly, you're simultaneously testing for both the positive and negative classes.

PR AUC (area under the PR curve)

#Metric

Area under the interpolated precision-recall curve, obtained by plotting (recall, precision) points for different values of the classification threshold.

precision

#Metric

A metric for classification models that answers the following question:

When the model predicted the positive class, what percentage of the predictions were correct?

Here is the formula:

$$\text{Precision} = \frac{\text{true positives}} {\text{true positives} + \text{false positives}}$$

where:

true positive means the model correctly predicted the positive class.
false positive means the model mistakenly predicted the positive class.

For example, suppose a model made 200 positive predictions. Of these 200 positive predictions:

150 were true positives.
50 were false positives.

In this case:

$$\text{Precision} = \frac{\text{150}} {\text{150} + \text{50}} = 0.75$$
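As a quick check of the arithmetic, the formula can be written as a small Python helper. The function name and the guard against a zero denominator are choices made for this sketch, not part of the glossary definition.

```python
def precision(true_positives, false_positives):
    """Share of positive predictions that were correct."""
    predicted_positives = true_positives + false_positives
    if predicted_positives == 0:
        raise ValueError("precision is undefined when nothing is predicted positive")
    return true_positives / predicted_positives


# Counts from the example: 200 positive predictions, 150 of them correct.
value = precision(true_positives=150, false_positives=50)
print(value)
```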

Contrast with accuracy and recall.

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

precision at k (precision@k)

#language

#Metric

A metric for evaluating a ranked (ordered) list of items. Precision at k identifies the fraction of the first k items in that list that are "relevant." That is:

\[\text{precision at k} = \frac{\text{relevant items in first k items of the list}} {\text{k}}\]

The value of k must be less than or equal to the length of the returned list. Note that the length of the returned list is not part of the calculation.

Relevance is often subjective; even expert human evaluators often disagree on which items are relevant.

Compare with recall at k and average precision at k.

Suppose a large language model is given the following query:

 List the 6 funniest movies of all time in order.

And the large language model returns the list shown in the first two columns of the following table:

Position	Movie	Relevant?
1	The General	Yes
2	Mean Girls	Yes
3	Platoon	No
4	Bridesmaids	Yes
5	Citizen Kane	No
6	This is Spinal Tap	Yes

Two of the first three movies are relevant, so precision at 3 is:

$$\text{precision at 3} = \frac{\text{2}} {\text{3}} = 0.67$$

Three of the first five movies are relevant, so precision at 5 is:

$$\text{precision at 5} = \frac{\text{3}} {\text{5}} = 0.6$$

precision-recall curve

#Metric

A curve of precision versus recall at different classification thresholds.

prediction bias

#Metric

A value indicating how far apart the average of predictions is from the average of labels in the dataset.

Not to be confused with the bias term in machine learning models or with bias in ethics and fairness.

predictive parity

#responsible

#Metric

A fairness metric that checks whether, for a given classifier, the precision rates are equivalent for subgroups under consideration.

For example, a model that predicts college acceptance would satisfy predictive parity for nationality if its precision rate is the same for Lilliputians and Brobdingnagians.

Predictive parity is sometimes also called predictive rate parity.

See "Fairness Definitions Explained" (section 3.2.1) for a more detailed discussion of predictive parity.

predictive rate parity

#responsible

#Metric

Another name for predictive parity.

probability density function

#Metric

A function that identifies the frequency of data samples having exactly a particular value. When a dataset's values are continuous floating-point numbers, exact matches rarely occur. However, integrating a probability density function from value x to value y yields the expected frequency of data samples between x and y.

For example, consider a normal distribution having a mean of 200 and a standard deviation of 30. To determine the expected frequency of data samples falling within the range 211.4 to 218.7, you can integrate the probability density function for a normal distribution from 211.4 to 218.7.
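The integration described above can be sketched numerically with only the standard library. The helper names (`normal_pdf`, `normal_cdf`, `integrate_trapezoid`) and the choice of the trapezoid rule with 1,000 steps are assumptions made for this illustration; the closed-form check relies on the identity between the normal CDF and `math.erf`.

```python
import math


def normal_pdf(x, mean, std):
    """Probability density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, std):
    """Cumulative probability of a normal distribution up to x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def integrate_trapezoid(f, a, b, steps=1000):
    """Approximate the integral of f from a to b with the trapezoid rule."""
    width = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * width)
    return total * width


mean, std = 200.0, 30.0
low, high = 211.4, 218.7

# Integrate the density over the range, then cross-check with the CDF.
numeric = integrate_trapezoid(lambda x: normal_pdf(x, mean, std), low, high)
closed_form = normal_cdf(high, mean, std) - normal_cdf(low, mean, std)
print(f"numeric={numeric:.4f} closed_form={closed_form:.4f}")
```

Both values give the expected fraction of samples in the range; multiplying by the number of samples gives an expected count.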

R

recall

#Metric

A metric for classification models that answers the following question:

When ground truth was the positive class, what percentage of predictions did the model correctly identify as the positive class?

Here is the formula:

\[\text{Recall} = \frac{\text{true positives}} {\text{true positives} + \text{false negatives}} \]

where:

true positive means the model correctly predicted the positive class.
false negative means that the model mistakenly predicted the negative class.

For instance, suppose your model made 200 predictions on examples for which ground truth was the positive class. Of these 200 predictions:

180 were true positives.
20 were false negatives.

In this case:

\[\text{Recall} = \frac{\text{180}} {\text{180} + \text{20}} = 0.9 \]

Recall is particularly useful for determining the predictive power of classification models in which the positive class is rare. For example, consider a class-imbalanced dataset in which the positive class for a certain disease occurs in only 10 patients out of a million. Suppose your model makes five million predictions that yield the following outcomes:

30 True Positives
20 False Negatives
4,999,000 True Negatives
950 False Positives

The recall of this model is therefore:

 recall = TP / (TP + FN)
 recall = 30 / (30 + 20) = 0.6 = 60%

By contrast, the accuracy of this model is:

 accuracy = (TP + TN) / (TP + TN + FP + FN)
 accuracy = (30 + 4,999,000) / (30 + 4,999,000 + 950 + 20) = 99.98%

That high value of accuracy looks impressive but is essentially meaningless. Recall is a much more useful metric for class-imbalanced datasets than accuracy.
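The contrast between the two metrics is easy to reproduce. The following sketch plugs the four counts from the example into the recall and accuracy formulas; the function names are illustrative only.

```python
def recall(tp, fn):
    """Share of actual positives that the model identified."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts from the rare-disease example (five million predictions).
tp, fn, tn, fp = 30, 20, 4_999_000, 950

model_recall = recall(tp, fn)
model_accuracy = accuracy(tp, tn, fp, fn)
print(f"recall={model_recall:.2%} accuracy={model_accuracy:.2%}")
```

Accuracy is dominated by the 4,999,000 true negatives, so it stays near 100% even though the model misses 20 of the 50 actual positives.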

See Classification: Accuracy, recall, precision and related metrics for more information.

recall at k (recall@k)

#language

#Metric

A metric for evaluating systems that output a ranked (ordered) list of items. Recall at k identifies the fraction of relevant items in the first k items in that list out of the total number of relevant items returned.

\[\text{recall at k} = \frac{\text{relevant items in first k items of the list}} {\text{total number of relevant items in the list}}\]

Contrast with precision at k.

Suppose a large language model is given the following query:

 List the 10 funniest movies of all time in order.

And the large language model returns the list shown in the first two columns:

Position	Movie	Relevant?
1	The General	Yes
2	Mean Girls	Yes
3	Platoon	No
4	Bridesmaids	Yes
5	This is Spinal Tap	Yes
6	Airplane!	Yes
7	Groundhog Day	Yes
8	Monty Python and the Holy Grail	Yes
9	Oppenheimer	No
10	Clueless	Yes

Eight of the movies in the preceding list are very funny, so they are the "relevant items in the list." Therefore, 8 is the denominator in all calculations of recall at k. What about the numerator? Well, 3 of the first 4 items are relevant, so recall at 4 is:

$$\text{recall at 4} = \frac{\text{3}} {\text{8}} = 0.375$$

7 of the first 8 movies are very funny, so recall at 8 is:

$$\text{recall at 8} = \frac{\text{7}} {\text{8}} = 0.875$$
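A minimal Python sketch of the same calculation follows. The list of booleans mirrors the relevance column of the ten-item table above (eight relevant items, with positions 3 and 9 not relevant); the function name and input validation are assumptions of this example.

```python
def recall_at_k(relevance, k):
    """Relevant items in the first k positions divided by all relevant items returned."""
    if not 1 <= k <= len(relevance):
        raise ValueError("k must be between 1 and the length of the list")
    total_relevant = sum(relevance)
    if total_relevant == 0:
        raise ValueError("recall at k is undefined without relevant items")
    return sum(relevance[:k]) / total_relevant


# Relevance column of the ten-item list, in ranked order.
relevance = [True, True, False, True, True, True, True, True, False, True]

print(recall_at_k(relevance, 4))
print(recall_at_k(relevance, 8))
```

Unlike precision at k, the denominator stays fixed as k grows, so recall at k never decreases with larger k.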

ROC (receiver operating characteristic) curve

#fundamentals

#Metric

A graph of true positive rate versus false positive rate for different classification thresholds in binary classification.

The shape of an ROC curve suggests a binary classification model's ability to separate positive classes from negative classes. Suppose, for example, that a binary classification model perfectly separates all the negative classes from all the positive classes:

A number line with 8 positive examples on the right side and 7 negative examples on the left.

The ROC curve for the preceding model looks as follows:

An ROC curve. The x-axis is false positive rate and the y-axis is true positive rate. The curve has an inverted L shape. The curve starts at (0.0,0.0) and goes straight up to (0.0,1.0). Then the curve goes from (0.0,1.0) to (1.0,1.0).

In contrast, the following illustration graphs the raw logistic regression values for a terrible model that can't separate negative classes from positive classes at all:

A number line with positive examples and negative examples completely intermixed.

The ROC curve for this model looks as follows:

An ROC curve, which is actually a straight line from (0.0,0.0) to (1.0,1.0).

Meanwhile, back in the real world, most binary classification models separate positive and negative classes to some degree, but usually not perfectly. So, a typical ROC curve falls somewhere between the two extremes:

An ROC curve. The x-axis is false positive rate and the y-axis is true positive rate. The ROC curve approximates a shaky arc traversing the compass points from west to north.

The point on an ROC curve closest to (0.0,1.0) theoretically identifies the ideal classification threshold. However, several other real-world issues influence the selection of the ideal classification threshold. For example, perhaps false negatives cause far more pain than false positives.

A numerical metric called AUC summarizes the ROC curve into a single floating-point value.

Root Mean Squared Error (RMSE)

#fundamentals

#Metric

The square root of the Mean Squared Error.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

#language

#Metric

A family of metrics that evaluate automatic summarization and machine translation models. ROUGE metrics determine the degree to which a reference text overlaps an ML model's generated text. Each member of the ROUGE family measures overlap in a different way. Higher ROUGE scores indicate more similarity between the reference text and generated text than lower ROUGE scores.

Each ROUGE family member typically generates the following metrics:

Precision
Recall
F₁

For details and examples, see ROUGE-L, ROUGE-N, and ROUGE-S.

ROUGE-L

#language

#Metric

A member of the ROUGE family focused on the length of the longest common subsequence in the reference text and generated text. The following formulas calculate recall and precision for ROUGE-L:

$$\text{ROUGE-L recall} = \frac{\text{longest common subsequence}} {\text{number of words in the reference text} }$$

$$\text{ROUGE-L precision} = \frac{\text{longest common subsequence}} {\text{number of words in the generated text} }$$

You can then use F₁ to roll up ROUGE-L recall and ROUGE-L precision into a single metric:

$$\text{ROUGE-L F} {_1} = \frac{\text{2} * \text{ROUGE-L recall} * \text{ROUGE-L precision}} {\text{ROUGE-L recall} + \text{ROUGE-L precision} }$$

Consider the following reference text and generated text.

Category	Who produced?	Text
Reference text	Human translator	I want to understand a wide variety of things.
Generated text	ML model	I want to learn plenty of things.

Therefore:

The longest common subsequence is 5 (I want to of things).
The number of words in the reference text is 9.
The number of words in the generated text is 7.

Consequently:

$$\text{ROUGE-L recall} = \frac{\text{5}} {\text{9} } = 0.56$$

$$\text{ROUGE-L precision} = \frac{\text{5}} {\text{7} } = 0.71$$

$$\text{ROUGE-L F} {_1} = \frac{\text{2} * \text{0.56} * \text{0.71}} {\text{0.56} + \text{0.71} } = 0.63$$
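The following sketch recomputes the example with a textbook dynamic-programming longest-common-subsequence routine. It assumes the English sentences "I want to understand a wide variety of things." and "I want to learn plenty of things." are the texts behind the worked example (they match the stated word counts of 9 and 7), and it uses a deliberately naive whitespace tokenizer; real ROUGE tooling tokenizes and normalizes more carefully. All function names are illustrative.

```python
def tokenize(text):
    """Lowercase whitespace tokens with trailing punctuation stripped (naive)."""
    return [word.strip(".,!?").lower() for word in text.split()]


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l(reference, generated):
    """Return (recall, precision, F1) based on the LCS length."""
    ref, gen = tokenize(reference), tokenize(generated)
    lcs = lcs_length(ref, gen)
    recall = lcs / len(ref)
    precision = lcs / len(gen)
    f1 = 0.0 if lcs == 0 else 2 * recall * precision / (recall + precision)
    return recall, precision, f1


recall, precision, f1 = rouge_l(
    "I want to understand a wide variety of things.",
    "I want to learn plenty of things.",
)
print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.3f}")
```

Because the sketch keeps full precision instead of rounding recall and precision to two decimals first, its F₁ can differ in the last digit from a hand calculation.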

ROUGE-L ignores any newlines in the reference text and generated text, so the longest common subsequence could cross multiple sentences. When the reference text and generated text involve multiple sentences, a variation of ROUGE-L called ROUGE-Lsum is generally a better metric. ROUGE-Lsum determines the longest common subsequence for each sentence in a passage and then calculates the mean of those longest common subsequences.

Consider the following reference text and generated text.

Category	Who produced?	Text
Reference text	Human translator	The surface of Mars is dry. Nearly all the water is deep underground.
Generated text	ML model	Mars has a dry surface. However, the vast majority of water is underground.

Therefore:

	First sentence	Second sentence
Longest common subsequence	2 (Mars dry)	3 (water is underground)
Sentence length of reference text	6	7
Sentence length of generated text	5	8

Consequently:

$$\text{recall of first sentence} = \frac{\text{2}} {\text{6}} = 0.33 $$

$$\text{recall of second sentence} = \frac{\text{3}} {\text{7}} = 0.43 $$

$$\text{ROUGE-Lsum recall} = \frac{\text{0.33} + \text{0.43}} {\text{2}} = 0.38 $$

$$\text{precision of first sentence} = \frac{\text{2}} {\text{5}} = 0.4 $$

$$\text{precision of second sentence} = \frac{\text{3}} {\text{8}} = 0.38 $$

$$\text{ROUGE-Lsum precision} = \frac{\text{0.4} + \text{0.38}} {\text{2}} = 0.39 $$

$$\text{ROUGE-Lsum F}{_1} = \frac{\text{2} * \text{0.38} * \text{0.39}} {\text{0.38} + \text{0.39}} = 0.38 $$

ROUGE-N

#language

#Metric

A set of metrics within the ROUGE family that compares the shared N-grams of a certain size in the reference text and generated text. For example:

ROUGE-1 measures the number of shared tokens in the reference text and generated text.
ROUGE-2 measures the number of shared bigrams (2-grams) in the reference text and generated text.
ROUGE-3 measures the number of shared trigrams (3-grams) in the reference text and generated text.

You can use the following formulas to calculate ROUGE-N recall and ROUGE-N precision for any member of the ROUGE-N family:

$$\text{ROUGE-N recall} = \frac{\text{number of matching N-grams}} {\text{number of N-grams in the reference text} }$$

$$\text{ROUGE-N precision} = \frac{\text{number of matching N-grams}} {\text{number of N-grams in the generated text} }$$

You can then use F₁ to roll up ROUGE-N recall and ROUGE-N precision into a single metric:

$$\text{ROUGE-N F}{_1} = \frac{\text{2} * \text{ROUGE-N recall} * \text{ROUGE-N precision}} {\text{ROUGE-N recall} + \text{ROUGE-N precision} }$$

Suppose you decide to use ROUGE-2 to measure the effectiveness of an ML model's translation compared to a human translator's.

Category	Who produced?	Text	Bigrams
Reference text	Human translator	I want to understand a wide variety of things.	I want, want to, to understand, understand a, a wide, wide variety, variety of, of things
Generated text	ML model	I want to learn plenty of things.	I want, want to, to learn, learn plenty, plenty of, of things

Therefore:

The number of matching 2-grams is 3 (I want, want to, and of things).
The number of 2-grams in the reference text is 8.
The number of 2-grams in the generated text is 6.

Consequently:

$$\text{ROUGE-2 recall} = \frac{\text{3}} {\text{8} } = 0.375$$

$$\text{ROUGE-2 precision} = \frac{\text{3}} {\text{6} } = 0.5$$

$$\text{ROUGE-2 F}{_1} = \frac{\text{2} * \text{0.375} * \text{0.5}} {\text{0.375} + \text{0.5} } = 0.43$$
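The bigram counts can be reproduced with `collections.Counter`. As with any short sketch, several things here are assumptions: the English sentences are taken to be the ones behind the worked example (they yield the stated 8 and 6 bigrams), the tokenizer is a naive whitespace split, and matches are counted as a clipped multiset intersection, which is one common convention.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokens with trailing punctuation stripped (naive)."""
    return [word.strip(".,!?").lower() for word in text.split()]


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, generated, n):
    """Return (recall, precision, F1) from matching n-gram counts."""
    ref = ngrams(tokenize(reference), n)
    gen = ngrams(tokenize(generated), n)
    matches = sum((ref & gen).values())  # clipped overlap of the two multisets
    recall = matches / sum(ref.values())
    precision = matches / sum(gen.values())
    f1 = 0.0 if matches == 0 else 2 * recall * precision / (recall + precision)
    return recall, precision, f1


recall, precision, f1 = rouge_n(
    "I want to understand a wide variety of things.",
    "I want to learn plenty of things.",
    n=2,
)
print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.2f}")
```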

ROUGE-S

#language

#Metric

A forgiving form of ROUGE-N that enables skip-gram matching. That is, ROUGE-N only counts N-grams that match exactly, but ROUGE-S also counts N-grams separated by one or more words. For example, consider the following:

Reference text: White clouds
Generated text: White billowing clouds

When calculating ROUGE-N, the 2-gram White clouds doesn't match White billowing clouds. However, when calculating ROUGE-S, White clouds does match White billowing clouds.

R-squared

#Metric

A regression metric indicating how much variation in a label is due to an individual feature or to a feature set. R-squared is a value between 0 and 1, which you can interpret as follows:

An R-squared of 0 means that none of a label's variation is due to the feature set.
An R-squared of 1 means that all of a label's variation is due to the feature set.
An R-squared between 0 and 1 indicates the extent to which the label's variation can be predicted from a particular feature or the feature set. For example, an R-squared of 0.10 means that 10 percent of the variance in the label is due to the feature set, an R-squared of 0.20 means that 20 percent is due to the feature set, and so on.

R-squared is the square of the Pearson correlation coefficient between the values that a model predicted and ground truth.

S

scoring

#recsystems

#Metric

The part of a recommendation system that provides a value or ranking for each item produced by the candidate generation phase.

similarity measure

#clustering

#Metric

In clustering algorithms, the metric used to determine how alike (how similar) any two examples are.

sparsity

#Metric

The number of elements set to zero (or null) in a vector or matrix divided by the total number of entries in that vector or matrix. For example, consider a 100-element matrix in which 98 cells contain zero. The calculation of sparsity is as follows:

$$ {\text{sparsity}} = \frac{\text{98}} {\text{100}} = {\text{0.98}} $$

Feature sparsity refers to the sparsity of a feature vector; model sparsity refers to the sparsity of the model weights.
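A small Python sketch of the sparsity calculation, using a plain list-of-lists matrix so that no third-party library is needed. The function name and the treatment of `None` as a null entry are choices made for this illustration.

```python
def sparsity(matrix):
    """Fraction of entries that are zero (or None) in a list-of-lists matrix."""
    total = sum(len(row) for row in matrix)
    if total == 0:
        raise ValueError("matrix has no entries")
    empty = sum(1 for row in matrix for value in row if value is None or value == 0)
    return empty / total


# A 10x10 matrix (100 elements) with exactly two nonzero cells.
matrix = [[0] * 10 for _ in range(10)]
matrix[0][0] = 3.5
matrix[4][7] = -1.0

value = sparsity(matrix)
print(value)
```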

squared hinge loss

#Metric

The square of the hinge loss. Squared hinge loss penalizes outliers more harshly than regular hinge loss.

squared loss

#fundamentals

#Metric

Synonym for L₂ loss.

T

test loss

#fundamentals

#Metric

A metric representing a model's loss against the test set. When building a model, you typically try to minimize test loss. That's because a low test loss is a stronger quality signal than a low training loss or low validation loss.

A large gap between test loss and training loss or validation loss sometimes suggests that you need to increase the regularization rate.

top-k accuracy

#language

#Metric

The percentage of times that a "target label" appears within the first k positions of generated lists. The lists could be personalized recommendations or a list of items ordered by softmax.

Top-k accuracy is also known as accuracy at k.

Consider a machine learning system that uses softmax to identify tree probabilities based on a picture of tree leaves. The following table shows output lists generated from five input tree pictures. Each row contains a target label and the five most likely trees. For example, when the target label was maple, the machine learning model identified elm as the most likely tree, oak as the second most likely tree, and so on.

Target label	1	2	3	4	5
maple	elm	oak	maple	beech	poplar
dogwood	oak	dogwood	poplar	hickory	maple
oak	oak	basswood	locust	alder	linden
linden	maple	paw-paw	oak	basswood	poplar
oak	locust	linden	oak	maple	paw-paw

The target label appears in the first position only once, so the top-1 accuracy is:

$$\text{top-1 accuracy} = \frac{\text{1}} {\text{5}} = 0.2$$

The target label appears in one of the top three positions four times, so the top-3 accuracy is:

$$\text{top-3 accuracy} = \frac{\text{4}} {\text{5}} = 0.8$$
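The two values can be checked with a short script. The rows below transcribe the five-row table using English tree names, which are this sketch's rendering of the table entries; the function name and the `(target, ranked_predictions)` tuple layout are likewise assumptions.

```python
def top_k_accuracy(rows, k):
    """Fraction of rows whose target label appears in the first k predictions."""
    if not rows:
        raise ValueError("need at least one row")
    hits = sum(1 for target, ranked in rows if target in ranked[:k])
    return hits / len(rows)


# (target label, predictions ordered from most to least likely)
rows = [
    ("maple", ["elm", "oak", "maple", "beech", "poplar"]),
    ("dogwood", ["oak", "dogwood", "poplar", "hickory", "maple"]),
    ("oak", ["oak", "basswood", "locust", "alder", "linden"]),
    ("linden", ["maple", "paw-paw", "oak", "basswood", "poplar"]),
    ("oak", ["locust", "linden", "oak", "maple", "paw-paw"]),
]

top1 = top_k_accuracy(rows, 1)
top3 = top_k_accuracy(rows, 3)
print(f"top-1={top1} top-3={top3}")
```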

toxicity

#language

#Metric

The degree to which content is abusive, threatening, or offensive. Many machine learning models can identify and measure toxicity. Most of these models identify toxicity along multiple parameters, such as the level of abusive language and the level of threatening language.

training loss

#fundamentals

#Metric

A metric representing a model's loss during a particular training iteration. For example, suppose the loss function is Mean Squared Error. Perhaps the training loss (the Mean Squared Error) for the 10th iteration is 2.2, and the training loss for the 100th iteration is 1.9.

A loss curve plots training loss versus the number of iterations. A loss curve provides the following hints about training:

A downward slope implies that the model is improving.
An upward slope implies that the model is getting worse.
A flat slope implies that the model has reached convergence.

For example, the following somewhat idealized loss curve shows:

A steep downward slope during the initial iterations, which implies rapid model improvement.
A gradually flattening (but still downward) slope until close to the end of training, which implies continued model improvement at a somewhat slower pace than during the initial iterations.
A flat slope towards the end of training, which suggests convergence.

The plot of training loss versus iterations. This loss curve starts with a steep downward slope. The slope gradually flattens until the slope becomes zero.

Although training loss is important, see also generalization.

true negative (TN)

#fundamentals

#Metric

An example in which the model correctly predicts the negative class. For example, the model infers that a particular email message is not spam, and that email message really is not spam.

true positive (TP)

#fundamentals

#Metric

An example in which the model correctly predicts the positive class. For example, the model infers that a particular email message is spam, and that email message really is spam.

true positive rate (TPR)

#fundamentals

#Metric

Synonym for recall. That is:

$$\text{true positive rate} = \frac {\text{true positives}} {\text{true positives} + \text{false negatives}}$$

True positive rate is the y-axis in an ROC curve.

V

validation loss

#fundamentals

#Metric

A metric representing a model's loss on the validation set during a particular iteration of training.

See also generalization curve.

variable importances

#df

#Metric

A set of scores that indicates the relative importance of each feature to the model.

For example, consider a decision tree that estimates house prices. Suppose this decision tree uses three features: size, age, and style. If a set of variable importances for the three features are calculated to be {size=5.8, age=2.5, style=4.7}, then size is more important to the decision tree than age or style.

Different variable importance metrics exist, which can inform ML experts about different aspects of models.

W

Wasserstein loss

#Metric

One of the loss functions commonly used in generative adversarial networks, based on the earth mover's distance between the distribution of generated data and real data.


A

accuracy

#fundamentals

#Metric

The number of correct classification predictions divided by the total number of predictions. That is:

$$\text{Accuracy} = \frac{\text{correct predictions}} {\text{correct predictions + incorrect predictions }}$$

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

$$\text{Accuracy} = \frac{\text{40}} {\text{40 + 10}} = \text{80%}$$

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions. So, the accuracy formula for binary classification is as follows:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

where:

TP is the number of true positives (correct predictions).
TN is the number of true negatives (correct predictions).
FP is the number of false positives (incorrect predictions).
FN is the number of false negatives (incorrect predictions).

Compare and contrast accuracy with precision and recall.

Although a valuable metric for some situations, accuracy is highly misleading for others. Notably, accuracy is usually a poor metric for evaluating classification models that process class-imbalanced datasets.

For example, suppose snow falls only 25 days per century in a certain subtropical city. Since days without snow (the negative class) vastly outnumber days with snow (the positive class), the snow dataset for this city is class-imbalanced. Imagine a binary classification model that is supposed to predict either snow or no snow each day but simply predicts "no snow" every day. This model is highly accurate but has no predictive power. The following table summarizes the results for a century of predictions:

Category	Number
TP	0
TN	36499
FP	0
FN	25

The accuracy of this model is therefore:

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (0 + 36499) / (0 + 36499 + 0 + 25) = 0.9993 = 99.93%

Although 99.93% accuracy seems very impressive, the model actually has no predictive power.

Precision and recall are usually more useful metrics than accuracy for evaluating models trained on class-imbalanced datasets.

area under the PR curve

#Metric

See PR AUC (Area under the PR Curve).

area under the ROC curve

#Metric

See AUC (Area under the ROC curve).

AUC (Area under the ROC curve)

#fundamentals

#Metric

A number between 0.0 and 1.0 representing a binary classification model's ability to separate positive classes from negative classes. The closer the AUC is to 1.0, the better the model's ability to separate classes from each other.

For example, the following illustration shows a classification model that separates positive classes (green ovals) from negative classes (purple rectangles) perfectly. This unrealistically perfect model has an AUC of 1.0:

A number line with 8 positive examples on one side and 9 negative examples on the other side.

Conversely, the following illustration shows the results for a classification model that generated random results. This model has an AUC of 0.5:

A number line with 6 positive examples and 6 negative examples. The sequence of examples is positive, negative, positive, negative, positive, negative, positive, negative, positive, negative, positive, negative.

Yes, the preceding model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, the following model separates positives from negatives somewhat, and therefore has an AUC somewhere between 0.5 and 1.0:

A number line with 6 positive examples and 6 negative examples. The sequence of examples is negative, negative, negative, negative, positive, negative, positive, positive, negative, positive, positive, positive.

AUC ignores any value you set for classification threshold. Instead, AUC considers all possible classification thresholds.

AUC represents the area under an ROC curve. For example, the ROC curve for a model that perfectly separates positives from negatives looks as follows:

AUC is the area of the gray region in the preceding illustration. In this unusual case, the area is simply the length of the gray region (1.0) multiplied by the width of the gray region (1.0). So, the product of 1.0 and 1.0 yields an AUC of exactly 1.0, which is the highest possible AUC score.

Conversely, the ROC curve for a classifier that can't separate classes at all is as follows. The area of this gray region is 0.5.

A more typical ROC curve looks approximately like the following:

It would be painstaking to calculate the area under this curve manually, which is why a program typically calculates most AUC values.

AUC is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.
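The probabilistic definition suggests a direct, if inefficient, way to compute AUC: compare every (positive, negative) pair of scores. The sketch below does that for the 12-example sequence described earlier in this entry (negative, negative, negative, negative, positive, negative, positive, positive, negative, positive, positive, positive). It assumes the examples are listed in order of increasing model score, which the figure description does not state explicitly, and it counts ties as half a win, a common convention. The function name is illustrative.

```python
def auc_pairwise(labels, scores):
    """Probability that a random positive outscores a random negative."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# 0 = negative, 1 = positive, assumed ordered from lowest to highest score.
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
scores = list(range(len(labels)))

partial = auc_pairwise(labels, scores)
perfect = auc_pairwise([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
print(f"partial={partial:.3f} perfect={perfect:.1f}")
```

The pairwise loop is quadratic in the number of examples; production code typically sorts the scores once instead.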

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

average precision at k

#language

#Metric

A metric for summarizing a model's performance on a single prompt that generates ranked results, such as a numbered list of book recommendations. Average precision at k is, well, the average of the precision at k values for each relevant result. The formula for average precision at k is therefore:

\[{\text{average precision at k}} = \frac{1}{n} \sum_{i=1}^n {\text{precision at k for each relevant item} } \]

where:

$n$ is the number of relevant items in the list.

Contrast with recall at k.

Suppose a large language model is given the following query:

 List the 6 funniest movies of all time in order.

And the large language model returns the following list:

The General
Mean Girls
Platoon
Bridesmaids
Citizen Kane
This is Spinal Tap

Four of the movies in the returned list are very funny (that is, they are relevant) but two movies are dramas (not relevant). The following table details the results:

Position	Movie	Relevant?	Precision at k
1	The General	Yes	1.0
2	Mean Girls	Yes	1.0
3	Platoon	No	not relevant
4	Bridesmaids	Yes	0.75
5	Citizen Kane	No	not relevant
6	This is Spinal Tap	Yes	0.67

The number of relevant results is 4. Therefore, you can calculate the average precision at 6 as follows:

$${\text{average precision at 6}} = \frac{1}{4} {\text{(1.0 + 1.0 + 0.75 + 0.67)} } $$

$${\text{average precision at 6}} = {\text{~0.85} } $$
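The same computation in Python: walk down the ranked list, record precision at each position that holds a relevant item, and average those values. The boolean list mirrors the relevance column of the six-item table; the function name and the choice to return 0.0 when nothing is relevant are assumptions of this sketch.

```python
def average_precision_at_k(relevance, k):
    """Mean of precision-at-position over the relevant items in the first k."""
    hits = 0
    precisions = []
    for position, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / position)
    if not precisions:
        return 0.0  # convention chosen for this sketch
    return sum(precisions) / len(precisions)


# Relevance column of the six-item list, in ranked order.
relevance = [True, True, False, True, False, True]

ap6 = average_precision_at_k(relevance, 6)
print(f"average precision at 6 = {ap6:.2f}")
```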

B

baseline

#Metric

A model used as a reference point for comparing how well another model (typically, a more complex one) is performing. For example, a logistic regression model might serve as a good baseline for a deep model.

For a particular problem, the baseline helps model developers quantify the minimal expected performance that a new model must achieve for the new model to be useful.

C

cost

#Metric

Synonym for loss.

counterfactual fairness

#responsible

#Metric

A fairness metric that checks whether a classification model produces the same result for one individual as it does for another individual who is identical to the first, except with respect to one or more sensitive attributes. Evaluating a classification model for counterfactual fairness is one method for surfacing potential sources of bias in a model.

See either of the following for more information:

Fairness: Counterfactual fairness in Machine Learning Crash Course.
When Worlds Collide: Integrating Different Counterfactual Assumptions in Fairness

cross-entropy

#Metric

A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions. See also perplexity.

cumulative distribution function (CDF)

#Metric

A function that defines the frequency of samples less than or equal to a target value. For example, consider a normal distribution of continuous values. A CDF tells you that approximately 50% of samples should be less than or equal to the mean and that approximately 84% of samples should be less than or equal to one standard deviation above the mean.
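The 50% and 84% figures can be checked with the closed-form CDF of the normal distribution, which the standard library exposes through `math.erf`. The function name `normal_cdf` and its default parameters are choices made for this sketch.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """Fraction of a normal distribution that is less than or equal to x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


at_mean = normal_cdf(0.0)
one_std_above = normal_cdf(1.0)
print(f"at mean: {at_mean:.2%}, one std above: {one_std_above:.2%}")
```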

D

demographic parity

#responsible

#Metric

A fairness metric that is satisfied if the results of a model's classification are not dependent on a given sensitive attribute.

For example, if both Lilliputians and Brobdingnagians apply to Glubbdubdrib University, demographic parity is achieved if the percentage of Lilliputians admitted is the same as the percentage of Brobdingnagians admitted, irrespective of whether one group is on average more qualified than the other.

Contrast with equalized odds and equality of opportunity, which permit classification results in aggregate to depend on sensitive attributes, but don't permit classification results for certain specified ground truth labels to depend on sensitive attributes. See "Attacking discrimination with smarter machine learning" for a visualization exploring the tradeoffs when optimizing for demographic parity.

See Fairness: demographic parity in Machine Learning Crash Course for more information.

E

earth mover's distance (EMD)

#Metric

A measure of the relative similarity of two distributions . The lower the earth mover's distance, the more similar the distributions.

edit distance

#language

#Metric

A measurement of how similar two text strings are to each other. In machine learning, edit distance is useful for the following reasons:

Edit distance is easy to compute.
Edit distance can compare two strings known to be similar to each other.
Edit distance can determine the degree to which different strings are similar to a given string.

There are several definitions of edit distance, each using different string operations. See Levenshtein distance for an example.

empirical cumulative distribution function (eCDF or EDF)

#Metric

A cumulative distribution function based on empirical measurements from a real dataset. The value of the function at any point along the x-axis is the fraction of observations in the dataset that are less than or equal to the specified value.

entropy

#df

#Metric

In information theory , a description of how unpredictable a probability distribution is. Alternatively, entropy is also defined as how much information each example contains. A distribution has the highest possible entropy when all values of a random variable are equally likely.

The entropy of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) has the following formula:

H = -p log p - q log q = -p log p - (1-p) * log (1-p)

where:

H is the entropy.
p is the fraction of "1" examples.
q is the fraction of "0" examples. Note that q = (1 - p)
log is generally log ₂ . In this case, the entropy unit is a bit.

For example, suppose the following:

100 examples contain the value "1"
300 examples contain the value "0"

Therefore, the entropy value is:

p = 0.25
q = 0.75
H = (-0.25)log ₂ (0.25) - (0.75)log ₂ (0.75) = 0.81 bits per example

A set that is perfectly balanced (for example, 200 "0"s and 200 "1"s) would have an entropy of 1.0 bit per example. As a set becomes more imbalanced , its entropy moves towards 0.0.
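The binary entropy formula translates directly to Python. This sketch uses log base 2 so that the result is in bits, and it treats p = 0 and p = 1 as zero entropy (the usual limiting convention); the function name is illustrative.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a two-valued set where p is the fraction of "1" examples."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if p in (0.0, 1.0):
        return 0.0  # limiting value: a set with a single value is fully predictable
    q = 1.0 - p
    return -p * math.log2(p) - q * math.log2(q)


h = binary_entropy(100 / 400)         # 100 "1" examples, 300 "0" examples
balanced = binary_entropy(200 / 400)  # perfectly balanced set
print(f"H={h:.2f} bits, balanced={balanced:.1f} bit")
```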

In decision trees , entropy helps formulate information gain to help the splitter select the conditions during the growth of a classification decision tree.

Compare entropy with:

gini impurity
cross-entropy loss function

Entropy is often called Shannon's entropy .

See Exact splitter for binary classification with numerical features in the Decision Forests course for more information.

equality of opportunity

#responsible

#Metric

A fairness metric to assess whether a model is predicting the desirable outcome equally well for all values of a sensitive attribute . In other words, if the desirable outcome for a model is the positive class , the goal would be to have the true positive rate be the same for all groups.

Equality of opportunity is related to equalized odds , which requires that both the true positive rates and false positive rates are the same for all groups.

Suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians' secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians' secondary schools don't offer math classes at all, and as a result, far fewer of their students are qualified. Equality of opportunity is satisfied for the preferred label of "admitted" with respect to nationality (Lilliputian or Brobdingnagian) if qualified students are equally likely to be admitted irrespective of whether they're a Lilliputian or a Brobdingnagian.

For example, suppose 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 1. Lilliputian applicants (90% are qualified)

	Qualified	Unqualified
Admitted	45	3
Rejected	45	7
Total	90	10

Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 7/10 = 70%
Total percentage of Lilliputian students admitted: (45+3)/100 = 48%

Table 2. Brobdingnagian applicants (10% are qualified):

	Qualified	Unqualified
Admitted	5	9
Rejected	5	81
Total	10	90

Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 81/90 = 90%
Total percentage of Brobdingnagian students admitted: (5+9)/100 = 14%

The preceding examples satisfy equality of opportunity for acceptance of qualified students because qualified Lilliputians and Brobdingnagians both have a 50% chance of being admitted.

While equality of opportunity is satisfied, the following two fairness metrics are not satisfied:

demographic parity: Lilliputians and Brobdingnagians are admitted to the university at different rates; 48% of Lilliputian students are admitted, but only 14% of Brobdingnagian students are admitted.
equalized odds : While qualified Lilliputian and Brobdingnagian students both have the same chance of being admitted, the additional constraint that unqualified Lilliputians and Brobdingnagians both have the same chance of being rejected is not satisfied. Unqualified Lilliputians have a 70% rejection rate, whereas unqualified Brobdingnagians have a 90% rejection rate.
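The three checks can be reproduced from the counts in Table 1 and Table 2. This is a minimal sketch: "admitted" is taken as the positive prediction and "qualified" as the ground truth, and the helper name `group_rates` is an assumption made for illustration.

```python
import math

def group_rates(admitted_qualified, rejected_qualified,
                admitted_unqualified, rejected_unqualified):
    """Return (true positive rate, false positive rate, admission rate) for one group."""
    qualified = admitted_qualified + rejected_qualified
    unqualified = admitted_unqualified + rejected_unqualified
    tpr = admitted_qualified / qualified
    fpr = admitted_unqualified / unqualified
    admission_rate = (admitted_qualified + admitted_unqualified) / (qualified + unqualified)
    return tpr, fpr, admission_rate

lilliputian = group_rates(45, 45, 3, 7)     # counts from Table 1
brobdingnagian = group_rates(5, 5, 9, 81)   # counts from Table 2

# Equality of opportunity: same true positive rate in both groups.
equality_of_opportunity = math.isclose(lilliputian[0], brobdingnagian[0])
# Equalized odds additionally needs the same false positive rate.
equalized_odds = equality_of_opportunity and math.isclose(lilliputian[1], brobdingnagian[1])
# Demographic parity: same overall admission rate.
demographic_parity = math.isclose(lilliputian[2], brobdingnagian[2])
```

Comparing rates with `math.isclose` rather than `==` is a design choice to avoid floating-point surprises; in practice a tolerance suited to the sample size would likely be used.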

See Fairness: Equality of opportunity in Machine Learning Crash Course for more information.

equalized odds

#responsible

#Metric

A fairness metric to assess whether a model is predicting outcomes equally well for all values of a sensitive attribute with respect to both the positive class and negative class, not just one class or the other exclusively. In other words, both the true positive rate and false positive rate should be the same for all groups.

Equalized odds is related to equality of opportunity , which only focuses on error rates for a single class (positive or negative).

For example, suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians' secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians' secondary schools don't offer math classes at all, and as a result, far fewer of their students are qualified. Equalized odds is satisfied provided that no matter whether an applicant is a Lilliputian or a Brobdingnagian, if they are qualified, they are equally as likely to get admitted to the program, and if they are not qualified, they are equally as likely to get rejected.

Suppose 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 3. Lilliputian applicants (90% are qualified)

	Qualified	Unqualified
Admitted	45	2
Rejected	45	8
Total	90	10

Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 8/10 = 80%
Total percentage of Lilliputian students admitted: (45+2)/100 = 47%

Table 4. Brobdingnagian applicants (10% are qualified):

	Qualified	Unqualified
Admitted	5	18
Rejected	5	72
Total	10	90

Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 72/90 = 80%
Total percentage of Brobdingnagian students admitted: (5+18)/100 = 23%

Equalized odds is satisfied because qualified Lilliputian and Brobdingnagian students both have a 50% chance of being admitted, and unqualified Lilliputian and Brobdingnagian students both have an 80% chance of being rejected.

Equalized odds is formally defined in "Equality of Opportunity in Supervised Learning" as follows: "predictor Ŷ satisfies equalized odds with respect to protected attribute A and outcome Y if Ŷ and A are independent, conditional on Y."

evals

#language

#generativeAI

#Metric

Primarily used as an abbreviation for LLM evaluations . More broadly, evals is an abbreviation for any form of evaluation .

evaluation

#language

#generativeAI

#Metric

The process of measuring a model's quality or comparing different models against each other.

To evaluate a supervised machine learning model, you typically judge it against a validation set and a test set. Evaluating an LLM typically involves broader quality and safety assessments.

F

F ₁

#Metric

A "roll-up" binary classification metric that relies on both precision and recall. Here is the formula:

$$F{_1} = \frac{\text{2 * precision * recall}} {\text{precision + recall}}$$

Suppose precision and recall have the following values:

precision = 0.6
recall = 0.4

You calculate F ₁ as follows:

$$F{_1} = \frac{\text{2 * 0.6 * 0.4}} {\text{0.6 + 0.4}} = 0.48$$

When precision and recall are fairly similar (as in the preceding example), F ₁ is close to their mean. When precision and recall differ significantly, F ₁ is closer to the lower value. For example:

precision = 0.9
recall = 0.1

$$F{_1} = \frac{\text{2 * 0.9 * 0.1}} {\text{0.9 + 0.1}} = 0.18$$
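Both worked examples can be reproduced with a small sketch. Returning 0 when precision and recall are both 0 is an assumption made here to avoid division by zero; the formula above does not define that case.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumption: define F1 as 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)

similar = f1_score(0.6, 0.4)    # close to the mean of 0.6 and 0.4
different = f1_score(0.9, 0.1)  # pulled toward the lower value
```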

fairness metric

#responsible

#Metric

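A mathematical definition of "fairness" that is measurable. Fairness metrics described in this glossary include:

demographic parity
equality of opportunity
equalized odds
individual fairness
predictive parity

Many fairness metrics are mutually exclusive; see incompatibility of fairness metrics.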

false negative (FN)

#fundamentals

#Metric

An example in which the model mistakenly predicts the negative class . For example, the model predicts that a particular email message is not spam (the negative class), but that email message actually is spam .

false negative rate

#Metric

The proportion of actual positive examples for which the model mistakenly predicted the negative class. The following formula calculates the false negative rate:

$$\text{false negative rate} = \frac{\text{false negatives}}{\text{false negatives} + \text{true positives}}$$

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive (FP)

#fundamentals

#Metric

An example in which the model mistakenly predicts the positive class . For example, the model predicts that a particular email message is spam (the positive class), but that email message is actually not spam .

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive rate (FPR)

#fundamentals

#Metric

The proportion of actual negative examples for which the model mistakenly predicted the positive class. The following formula calculates the false positive rate:

$$\text{false positive rate} = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}$$

The false positive rate is the x-axis in an ROC curve .

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

feature importances

#df

#Metric

Synonym for variable importances .

fraction of successes

#generativeAI

#Metric

A metric for evaluating an ML model's generated text . The fraction of successes is the number of "successful" generated text outputs divided by the total number of generated text outputs. For example, if a large language model generated 10 blocks of code, five of which were successful, then the fraction of successes would be 50%.

Although fraction of successes is broadly useful throughout statistics, within ML, this metric is primarily useful for measuring verifiable tasks like code generation or math problems.

G

gini impurity

#df

#Metric

A metric similar to entropy . Splitters use values derived from either gini impurity or entropy to compose conditions for classification decision trees . Information gain is derived from entropy. There is no universally accepted equivalent term for the metric derived from gini impurity; however, this unnamed metric is just as important as information gain.

Gini impurity is also called gini index , or simply gini .

Gini impurity is the probability of misclassifying a new piece of data taken from the same distribution. The gini impurity of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) is calculated from the following formula:

I = 1 - (p ² + q ² ) = 1 - (p ² + (1-p) ² )

where:

I is the gini impurity.
p is the fraction of "1" examples.
q is the fraction of "0" examples. Note that q = 1-p

For example, consider the following dataset:

100 labels (0.25 of the dataset) contain the value "1"
300 labels (0.75 of the dataset) contain the value "0"

Therefore, the gini impurity is:

p = 0.25
q = 0.75
I = 1 - (0.25 ² + 0.75 ² ) = 0.375

Consequently, a random label from the same dataset would have a 37.5% chance of being misclassified, and a 62.5% chance of being properly classified.

A perfectly balanced label (for example, 200 "0"s and 200 "1"s) would have a gini impurity of 0.5. A highly imbalanced label would have a gini impurity close to 0.0.
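A minimal sketch of the binary formula above follows; the function name `gini_impurity` is illustrative.

```python
def gini_impurity(p: float) -> float:
    """Gini impurity of a set whose fraction of "1" examples is p."""
    q = 1.0 - p
    return 1.0 - (p * p + q * q)

imbalanced = gini_impurity(100 / 400)  # 100 "1"s and 300 "0"s
balanced = gini_impurity(200 / 400)    # 200 "1"s and 200 "0"s
```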

H

hinge loss

#Metric

A family of loss functions for classification designed to find the decision boundary as distant as possible from each training example, thus maximizing the margin between examples and the boundary. KSVMs use hinge loss (or a related function, such as squared hinge loss). For binary classification, the hinge loss function is defined as follows:

$$\text{loss} = \text{max}(0, 1 - (y * y'))$$

where y is the true label, either -1 or +1, and y' is the raw output of the classification model :

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$

Consequently, a plot of hinge loss versus (y * y') looks as follows:

A Cartesian plot consisting of two joined line segments. The first line segment starts at (-3, 4) and ends at (1, 0). The second line segment begins at (1, 0) and continues indefinitely with a slope of 0.
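The binary hinge loss formula can be sketched directly. The points checked below correspond to the two segments of the plot described above.

```python
def hinge_loss(y: int, y_prime: float) -> float:
    """Binary hinge loss; y is -1 or +1 and y_prime is the raw model output."""
    return max(0.0, 1.0 - y * y_prime)

loss_far_wrong = hinge_loss(+1, -3.0)  # y * y' = -3, on the sloped segment
loss_at_margin = hinge_loss(+1, 1.0)   # y * y' = 1, where the segments join
loss_confident = hinge_loss(-1, -5.0)  # y * y' = 5, on the flat segment
```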

I

incompatibility of fairness metrics

#responsible

#Metric

The idea that some notions of fairness are mutually incompatible and cannot be satisfied simultaneously. As a result, there is no single universal metric for quantifying fairness that can be applied to all ML problems.

While this may seem discouraging, incompatibility of fairness metrics doesn't imply that fairness efforts are fruitless. Instead, it suggests that fairness must be defined contextually for a given ML problem, with the goal of preventing harms specific to its use cases.

See "On the (im)possibility of fairness" for a more detailed discussion of the incompatibility of fairness metrics.

individual fairness

#responsible

#Metric

A fairness metric that checks whether similar individuals are classified similarly. For example, Brobdingnagian Academy might want to satisfy individual fairness by ensuring that two students with identical grades and standardized test scores are equally likely to gain admission.

Note that individual fairness relies entirely on how you define "similarity" (in this case, grades and test scores), and you can run the risk of introducing new fairness problems if your similarity metric misses important information (such as the rigor of a student's curriculum).

See "Fairness Through Awareness" for a more detailed discussion of individual fairness.

information gain

#df

#Metric

In decision forests , the difference between a node's entropy and the weighted (by number of examples) sum of the entropy of its children nodes. A node's entropy is the entropy of the examples in that node.

For example, consider the following entropy values:

entropy of parent node = 0.6
entropy of one child node with 16 relevant examples = 0.2
entropy of another child node with 24 relevant examples = 0.1

So 40% of the examples are in one child node and 60% are in the other child node. Therefore:

weighted entropy sum of child nodes = (0.4 * 0.2) + (0.6 * 0.1) = 0.14

So, the information gain is:

information gain = entropy of parent node - weighted entropy sum of child nodes
information gain = 0.6 - 0.14 = 0.46
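The same calculation as a short sketch. The function name and the `(num_examples, entropy)` pair representation are assumptions made for illustration.

```python
def information_gain(parent_entropy, children):
    """children: list of (num_examples, entropy) pairs, one per child node."""
    total = sum(n for n, _ in children)
    # Weight each child's entropy by its share of the examples.
    weighted_sum = sum((n / total) * h for n, h in children)
    return parent_entropy - weighted_sum

gain = information_gain(0.6, [(16, 0.2), (24, 0.1)])
```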

Most splitters seek to create conditions that maximize information gain.

inter-rater agreement

#Metric

A measurement of how often human raters agree when doing a task. If raters disagree, the task instructions may need to be improved. Also sometimes called inter-annotator agreement or inter-rater reliability . See also Cohen's kappa , which is one of the most popular inter-rater agreement measurements.

See Categorical data: Common issues in Machine Learning Crash Course for more information.

L

L ₁ loss

#fundamentals

#Metric

A loss function that calculates the absolute value of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L ₁ loss for a batch of five examples :

Actual value of example	Model's predicted value	Absolute value of delta
7	6	1
5	4	1
8	11	3
4	6	2
9	8	1
		8 = L ₁ loss

L ₁ loss is less sensitive to outliers than L ₂ loss .

The Mean Absolute Error is the average L ₁ loss per example.

$$ L_1 \text{ loss} = \sum_{i=1}^n | y_i - \hat{y}_i |$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.
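A minimal sketch that reproduces the batch of five examples from the table above (function and variable names are illustrative):

```python
def l1_loss(actual, predicted):
    """Sum of absolute differences between labels and predictions."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted))

actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]

batch_l1 = l1_loss(actual, predicted)
mean_absolute_error = batch_l1 / len(actual)  # average L1 loss per example
```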

See Linear regression: Loss in Machine Learning Crash Course for more information.

L ₂ loss

#fundamentals

#Metric

A loss function that calculates the square of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L ₂ loss for a batch of five examples :

Actual value of example	Model's predicted value	Square of delta
7	6	1
5	4	1
8	11	9
4	6	4
9	8	1
		16 = L ₂ loss

Due to squaring, L ₂ loss amplifies the influence of outliers . That is, L ₂ loss reacts more strongly to bad predictions than L ₁ loss . For example, the L ₁ loss for the preceding batch would be 8 rather than 16. Notice that a single outlier accounts for 9 of the 16.

Regression models typically use L ₂ loss as the loss function.

The Mean Squared Error is the average L ₂ loss per example. Squared loss is another name for L ₂ loss.

$$ L_2 \text{ loss} = \sum_{i=1}^n {(y_i - \hat{y}_i)}^2$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.
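A minimal sketch that reproduces the batch of five examples from the table above, and shows how much of the total comes from the single largest error (names are illustrative):

```python
def l2_loss(actual, predicted):
    """Sum of squared differences between labels and predictions."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))

actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]

batch_l2 = l2_loss(actual, predicted)
mean_squared_error = batch_l2 / len(actual)  # average L2 loss per example
# Share of the L2 loss contributed by the worst prediction (8 vs. 11).
outlier_share = (8 - 11) ** 2 / batch_l2
```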

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

LLM evaluations (evals)

#language

#generativeAI

#Metric

A set of metrics and benchmarks for assessing the performance of large language models (LLMs). At a high level, LLM evaluations:

Help researchers identify areas where LLMs need improvement.
Are useful in comparing different LLMs and identifying the best LLM for a particular task.
Help ensure that LLMs are safe and ethical to use.

See Large language models (LLMs) in Machine Learning Crash Course for more information.

loss

#fundamentals

#Metric

During the training of a supervised model , a measure of how far a model's prediction is from its label .

A loss function calculates the loss.

See Linear regression: Loss in Machine Learning Crash Course for more information.

loss function

#fundamentals

#Metric

During training or testing, a mathematical function that calculates the loss on a batch of examples. A loss function returns a lower loss for models that make good predictions than for models that make bad predictions.

The goal of training is typically to minimize the loss that a loss function returns.

Many different kinds of loss functions exist. Pick the appropriate loss function for the kind of model you are building. For example:

L ₂ loss (or Mean Squared Error ) is the loss function for linear regression .
Log Loss is the loss function for logistic regression .

M

Mean Absolute Error (MAE)

#Metric

The average loss per example when L ₁ loss is used. Calculate Mean Absolute Error as follows:

Calculate the L ₁ loss for a batch.
Divide the L ₁ loss by the number of examples in the batch.

$$\text{Mean Absolute Error} = \frac{1}{n}\sum_{i=1}^n | y_i - \hat{y}_i |$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.

For example, consider the calculation of L ₁ loss on the following batch of five examples:

Actual value of example	Model's predicted value	Loss (difference between actual and predicted)
7	6	1
5	4	1
8	11	3
4	6	2
9	8	1
		8 = L ₁ loss

So, L ₁ loss is 8 and the number of examples is 5. Therefore, the Mean Absolute Error is:

Mean Absolute Error = L₁ loss / Number of Examples
Mean Absolute Error = 8/5 = 1.6

Contrast Mean Absolute Error with Mean Squared Error and Root Mean Squared Error .

mean average precision at k (mAP@k)

#language

#generativeAI

#Metric

The statistical mean of all average precision at k scores across a validation dataset. One use of mean average precision at k is to judge the quality of recommendations generated by a recommendation system .

Although the phrase "mean average" sounds redundant, the name of the metric is appropriate. After all, this metric finds the mean of multiple average precision at k values.

Suppose you build a recommendation system that generates a personalized list of recommended novels for each user. Based on feedback from selected users, you calculate the following five average precision at k scores (one score per user):

0.73
0.77
0.67
0.82
0.76

The mean Average Precision at K is therefore:

$$\text{mean } = \frac{\text{0.73 + 0.77 + 0.67 + 0.82 + 0.76}} {\text{5}} = \text{0.75}$$

Mean Squared Error (MSE)

#Metric

The average loss per example when L ₂ loss is used. Calculate Mean Squared Error as follows:

Calculate the L ₂ loss for a batch.
Divide the L ₂ loss by the number of examples in the batch.

$$\text{Mean Squared Error} = \frac{1}{n}\sum_{i=1}^n {(y_i - \hat{y}_i)}^2$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the model's prediction for $y$.

For example, consider the loss on the following batch of five examples:

Actual value	Model's prediction	Loss	Squared loss
7	6	1	1
5	4	1	1
8	11	3	9
4	6	2	4
9	8	1	1
			16 = L ₂ loss

Therefore, the Mean Squared Error is:

Mean Squared Error = L₂ loss / Number of Examples
Mean Squared Error = 16/5 = 3.2

Mean Squared Error is a popular training loss function, particularly for linear regression.

Contrast Mean Squared Error with Mean Absolute Error and Root Mean Squared Error .

TensorFlow Playground uses Mean Squared Error to calculate loss values.

Outliers strongly influence Mean Squared Error. For example, a loss of 1 is a squared loss of 1, but a loss of 3 is a squared loss of 9. In the preceding table, the example with a loss of 3 accounts for ~56% of the Mean Squared Error, while each of the examples with a loss of 1 accounts for only 6% of the Mean Squared Error.

Outliers don't influence Mean Absolute Error as strongly as Mean Squared Error. For example, a loss of 3 accounts for only ~38% of the Mean Absolute Error.

Clipping is one way to prevent extreme outliers from damaging your model's predictive ability.

metric

#TensorFlow

#Metric

A statistic that you care about.

An objective is a metric that a machine learning system tries to optimize.

Metrics API (tf.metrics)

#Metric

A TensorFlow API for evaluating models. For example, tf.metrics.accuracy determines how often a model's predictions match labels.

minimax loss

#Metric

A loss function for generative adversarial networks , based on the cross-entropy between the distribution of generated data and real data.

Minimax loss is used in the first paper to describe generative adversarial networks.

See Loss Functions in the Generative Adversarial Networks course for more information.

model capacity

#Metric

The complexity of problems that a model can learn. The more complex the problems that a model can learn, the higher the model's capacity. A model's capacity typically increases with the number of model parameters. For a formal definition of classification model capacity, see VC dimension .

N

negative class

#fundamentals

#Metric

In binary classification, one class is termed positive and the other is termed negative. The positive class is the thing or event that the model is testing for and the negative class is the other possibility. For example:

The negative class in a medical test might be "not tumor."
The negative class in an email classification model might be "not spam."

Contrast with positive class .

O

objective

#Metric

A metric that your algorithm is trying to optimize.

objective function

#Metric

The mathematical formula or metric that a model aims to optimize. For example, the objective function for linear regression is usually Mean Squared Loss . Therefore, when training a linear regression model, training aims to minimize Mean Squared Loss.

In some cases, the goal is to maximize the objective function. For example, if the objective function is accuracy, the goal is to maximize accuracy.

P

pass at k (pass@k)

#Metric

A metric to determine the quality of code (for example, Python) that a large language model generates. More specifically, pass at k tells you the likelihood that at least one generated block of code out of k generated blocks of code will pass all of its unit tests.

Large language models often struggle to generate good code for complex programming problems. Software engineers adapt to this problem by prompting the large language model to generate multiple ( k ) solutions for the same problem. Then, software engineers test each of the solutions against unit tests. The calculation of pass at k depends on the outcome of the unit tests:

If one or more of those solutions pass the unit test, then the LLM Passes that code generation challenge.
If none of the solutions pass the unit test, then the LLM Fails that code generation challenge.

The formula for pass at k is as follows:

\[\text{pass at k} = \frac{\text{total number of passes}} {\text{total number of challenges}}\]

In general, higher values of k produce higher pass at k scores; however, higher values of k require more large language model and unit testing resources.

Suppose a software engineer asks a large language model to generate k=10 solutions for n=50 challenging coding problems. Here are the results:

30 Passes
20 Fails

The pass at 10 score is therefore:

$$\text{pass at 10} = \frac{\text{30}} {\text{50}} = 0.6$$
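The pass/fail rule and the formula above can be sketched as follows. The input format (one list of k booleans per challenge, where `True` means that solution passed all of its unit tests) is an assumption made for illustration.

```python
def pass_at_k(results):
    """results: one list of k booleans per challenge (True = solution passed its tests)."""
    # A challenge counts as a pass if at least one of its k solutions passed.
    passes = sum(1 for attempts in results if any(attempts))
    return passes / len(results)

# Hypothetical outcomes for 3 challenges with k = 2 solutions each.
score = pass_at_k([[False, True], [False, False], [True, True]])
```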

performance

#Metric

Overloaded term with the following meanings:

The standard meaning within software engineering. Namely: How fast (or efficiently) does this piece of software run?
The meaning within machine learning. Here, performance answers the following question: How correct is this model ? That is, how good are the model's predictions?

permutation variable importances

#df

#Metric

A type of variable importance that evaluates the increase in the prediction error of a model after permuting the feature's values. Permutation variable importance is a model-independent metric.

perplexity

#Metric

One measure of how well a model is accomplishing its task. For example, suppose your task is to read the first few letters of a word a user is typing on a phone keyboard, and to offer a list of possible completion words. Perplexity, P, for this task is approximately the number of guesses you need to offer in order for your list to contain the actual word the user is trying to type.

Perplexity is related to cross-entropy as follows:

$$P= 2^{\text{cross entropy}}$$

positive class

#fundamentals

#Metric

The class you are testing for.

For example, the positive class in a cancer model might be "tumor." The positive class in an email classification model might be "spam."

Contrast with negative class .

The term positive class can be confusing because the "positive" outcome of many tests is often an undesirable result. For example, the positive class in many medical tests corresponds to tumors or diseases. In general, you want a doctor to tell you, "Congratulations! Your test results were negative." Regardless, the positive class is the event that the test is seeking to find.

Admittedly, you're simultaneously testing for both the positive and negative classes.

PR AUC (area under the PR curve)

#Metric

Area under the interpolated precision-recall curve , obtained by plotting (recall, precision) points for different values of the classification threshold .

precision

#Metric

A metric for classification models that answers the following question:

When the model predicted the positive class , what percentage of the predictions were correct?

Here is the formula:

$$\text{Precision} = \frac{\text{true positives}} {\text{true positives} + \text{false positives}}$$

where:

true positive means the model correctly predicted the positive class.
false positive means the model mistakenly predicted the positive class.

For example, suppose a model made 200 positive predictions. Of these 200 positive predictions:

150 were true positives.
50 were false positives.

In this case:

$$\text{Precision} = \frac{\text{150}} {\text{150} + \text{50}} = 0.75$$

Contrast with accuracy and recall .

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

precision at k (precision@k)

#language

#Metric

A metric for evaluating a ranked (ordered) list of items. Precision at k identifies the fraction of the first k items in that list that are "relevant." That is:

\[\text{precision at k} = \frac{\text{relevant items in first k items of the list}} {\text{k}}\]

The value of k must be less than or equal to the length of the returned list. Note that the length of the returned list is not part of the calculation.

Relevance is often subjective; even expert human evaluators often disagree on which items are relevant.

Compare with recall at k.

Suppose a large language model is given the following query:

 List the 6 funniest movies of all time in order.

And the large language model returns the list shown in the first two columns of the following table:

Position	Movie	Relevant?
1	The General	Yes
2	Mean Girls	Yes
3	Platoon	No
4	Bridesmaids	Yes
5	Citizen Kane	No
6	This is Spinal Tap	Yes

Two of the first three movies are relevant, so precision at 3 is:

$$\text{precision at 3} = \frac{\text{2}} {\text{3}} = 0.67$$

Three of the first five movies are relevant, so precision at 5 is:

$$\text{precision at 5} = \frac{\text{3}} {\text{5}} = 0.6$$
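A minimal sketch of the formula, using the relevance column of the table above as a list of booleans. The function name and the input representation are assumptions made for illustration.

```python
def precision_at_k(relevant_flags, k):
    """Fraction of the first k items that are relevant."""
    if not 1 <= k <= len(relevant_flags):
        raise ValueError("k must be between 1 and the length of the list")
    return sum(relevant_flags[:k]) / k

# Relevance of positions 1 through 6 in the table above.
relevant = [True, True, False, True, False, True]
p_at_3 = precision_at_k(relevant, 3)  # 2 relevant items among the first 3
```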

precision-recall curve

#Metric

A curve of precision versus recall at different classification thresholds .

prediction bias

#Metric

A value indicating how far apart the average of predictions is from the average of labels in the dataset.
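One way to sketch this is as a difference of means. The sign convention (mean prediction minus mean label) is an assumption; the definition above only says "how far apart."

```python
def prediction_bias(predictions, labels):
    """Mean prediction minus mean label (sign convention is an assumption)."""
    mean_prediction = sum(predictions) / len(predictions)
    mean_label = sum(labels) / len(labels)
    return mean_prediction - mean_label

# Hypothetical predicted probabilities and binary labels.
bias = prediction_bias([0.2, 0.4], [0, 1])  # negative: the model under-predicts on average
```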

Not to be confused with the bias term in machine learning models or with bias in ethics and fairness .

predictive parity

#responsible

#Metric

A fairness metric that checks whether, for a given classifier, the precision rates are equivalent for subgroups under consideration.

For example, a model that predicts college acceptance would satisfy predictive parity for nationality if its precision rate is the same for Lilliputians and Brobdingnagians.

Predictive parity is sometimes also called predictive rate parity.

See "Fairness Definitions Explained" (section 3.2.1) for a more detailed discussion of predictive parity.

predictive rate parity

#responsible

#Metric

Another name for predictive parity .

probability density function

#Metric

A function that identifies the frequency of data samples having exactly a particular value. When a dataset's values are continuous floating-point numbers, exact matches rarely occur. However, integrating a probability density function from value x to value y yields the expected frequency of data samples between x and y .

For example, consider a normal distribution having a mean of 200 and a standard deviation of 30. To determine the expected frequency of data samples falling within the range 211.4 to 218.7, you can integrate the probability density function for a normal distribution from 211.4 to 218.7.
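For a normal distribution, that integral has a closed form in terms of the error function, so it can be sketched with only the standard library. The helper names below are illustrative.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def expected_fraction(lo, hi, mean, std):
    """Integral of the normal probability density function from lo to hi."""
    return normal_cdf(hi, mean, std) - normal_cdf(lo, mean, std)

fraction = expected_fraction(211.4, 218.7, mean=200, std=30)
```

Multiplying `fraction` by the number of samples gives the expected count of samples in that range.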

R

recall

#Metric

A metric for classification models that answers the following question:

When ground truth was the positive class , what percentage of predictions did the model correctly identify as the positive class?

Here is the formula:

\[\text{Recall} = \frac{\text{true positives}} {\text{true positives} + \text{false negatives}} \]

where:

true positive means the model correctly predicted the positive class.
false negative means that the model mistakenly predicted the negative class .

For instance, suppose your model made 200 predictions on examples for which ground truth was the positive class. Of these 200 predictions:

180 were true positives.
20 were false negatives.

In this case:

\[\text{Recall} = \frac{\text{180}} {\text{180} + \text{20}} = 0.9 \]

Recall is particularly useful for determining the predictive power of classification models in which the positive class is rare. For example, consider a class-imbalanced dataset in which the positive class for a certain disease occurs in only 10 patients out of a million. Suppose your model makes five million predictions that yield the following outcomes:

30 True Positives
20 False Negatives
4,999,000 True Negatives
950 False Positives

The recall of this model is therefore:

recall = TP / (TP + FN)
recall = 30 / (30 + 20) = 0.6 = 60%

By contrast, the accuracy of this model is:

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (30 + 4,999,000) / (30 + 4,999,000 + 950 + 20) = 99.98%

That high value of accuracy looks impressive but is essentially meaningless. Recall is a much more useful metric for class-imbalanced datasets than accuracy.
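The contrast can be made concrete with a short sketch. The "always negative" baseline is a hypothetical added here for comparison; it is not part of the example above.

```python
def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

tp, fn, tn, fp = 30, 20, 4_999_000, 950
model_recall = recall(tp, fn)
model_accuracy = accuracy(tp, tn, fp, fn)

# Hypothetical baseline that predicts the negative class for every example:
# all 50 actual positives become false negatives, all negatives are correct.
baseline_recall = recall(0, tp + fn)
baseline_accuracy = accuracy(0, tn + fp, 0, tp + fn)
```

Under these counts the baseline scores slightly higher on accuracy than the model while finding none of the positive cases, which is why recall is the more informative number here.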

See Classification: Accuracy, recall, precision and related metrics for more information.

recall at k (recall@k)

#language

#Metric

A metric for evaluating systems that output a ranked (ordered) list of items. Recall at k identifies the fraction of relevant items in the first k items in that list out of the total number of relevant items returned.

\[\text{recall at k} = \frac{\text{relevant items in first k items of the list}} {\text{total number of relevant items in the list}}\]

Contrast with precision at k .

Suppose a large language model is given the following query:

 List the 10 funniest movies of all time in order.

And the large language model returns the list shown in the first two columns:

Position	Movie	Relevant?
1	The General	Yes
2	Mean Girls	Yes
3	Platoon	No
4	Bridesmaids	Yes
5	This Is Spinal Tap	Yes
6	Airplane!	Yes
7	Groundhog Day	Yes
8	Monty Python and the Holy Grail	Yes
9	Oppenheimer	No
10	Clueless	Yes

Eight of the movies in the preceding list are very funny, so they are "relevant items in the list." Therefore, 8 will be the denominator in all the calculations of recall at k . What about the numerator? Well, 3 of the first 4 items are relevant, so recall at 4 is:

$$\text{recall at 4} = \frac{\text{3}} {\text{8}} = 0.375$$

7 of the first 8 movies are very funny, so recall at 8 is:

$$\text{recall at 8} = \frac{\text{7}} {\text{8}} = 0.875$$
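The calculation can be sketched in Python as follows. The helper name `recall_at_k` and the boolean encoding of the "Relevant?" column are assumptions made for illustration.

```python
def recall_at_k(relevance: list[bool], k: int) -> float:
    """Relevant items in the first k positions divided by all relevant items in the list."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # Convention chosen here for a list with no relevant items.
    return sum(relevance[:k]) / total_relevant


# Relevance flags for positions 1 through 10 of the movie list.
relevant = [True, True, False, True, True, True, True, True, False, True]

print(recall_at_k(relevant, 4))  # 3 of 8 relevant items
print(recall_at_k(relevant, 8))  # 7 of 8 relevant items
```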

ROC (receiver operating characteristic) Curve

#fundamentals

#Metric

A graph of true positive rate versus false positive rate for different classification thresholds in binary classification.

The shape of an ROC curve suggests a binary classification model's ability to separate positive classes from negative classes. Suppose, for example, that a binary classification model perfectly separates all the negative classes from all the positive classes:

A number line with 8 positive examples on the right side and 7 negative examples on the left.

The ROC curve for the preceding model looks as follows:

An ROC curve. The x-axis is False Positive Rate and the y-axis is True Positive Rate. The curve has an inverted L shape. The curve starts at (0.0,0.0) and goes straight up to (0.0,1.0). Then the curve goes from (0.0,1.0) to (1.0,1.0).

In contrast, the following illustration graphs the raw logistic regression values for a terrible model that can't separate negative classes from positive classes at all:

A number line with positive examples and negative examples completely intermixed.

The ROC curve for this model looks as follows:

An ROC curve, which is actually a straight line from (0.0,0.0) to (1.0,1.0).

Meanwhile, back in the real world, most binary classification models separate positive and negative classes to some degree, but usually not perfectly. So, a typical ROC curve falls somewhere between the two extremes:

An ROC curve. The x-axis is False Positive Rate and the y-axis is True Positive Rate. The ROC curve approximates a shaky arc traversing the compass points from West to North.

The point on an ROC curve closest to (0.0,1.0) theoretically identifies the ideal classification threshold. However, several other real-world issues influence the selection of the ideal classification threshold. For example, perhaps false negatives cause far more pain than false positives.

A numerical metric called AUC summarizes the ROC curve into a single floating-point value.
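To make the threshold sweep concrete, here is a small Python sketch that computes one (false positive rate, true positive rate) point per candidate threshold. The function name `roc_points`, the toy scores, and the rule "predict positive when score >= threshold" are assumptions for illustration only.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # A threshold above every score predicts no positives, giving the point (0, 0).
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for threshold in thresholds:
        predicted = [score >= threshold for score in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# A toy model that separates the classes perfectly (1 = positive, 0 = negative).
labels = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]

for fpr, tpr in roc_points(labels, scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

For this perfectly separated toy data, the points climb the left edge to (0.0, 1.0) and then run along the top edge to (1.0, 1.0), which is the inverted L shape described above.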

Root Mean Squared Error (RMSE)

#fundamentals

#Metric

The square root of the Mean Squared Error .
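A minimal Python sketch of that definition follows. The function name `rmse` and the sample labels and predictions are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean of the squared prediction errors."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


labels = [3.0, 5.0, 2.0, 7.0]
predictions = [2.0, 5.0, 4.0, 8.0]

# Squared errors are 1, 0, 4, and 1, so the Mean Squared Error is 1.5.
print(rmse(labels, predictions))
```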

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

#language

#Metric

A family of metrics that evaluate automatic summarization and machine translation models. ROUGE metrics determine the degree to which a reference text overlaps an ML model's generated text . Each member of the ROUGE family measures overlap in a different way. Higher ROUGE scores indicate more similarity between the reference text and generated text than lower ROUGE scores.

Each ROUGE family member typically generates the following metrics:

Precision
Recall
F ₁

For details and examples, see the ROUGE-L, ROUGE-N, and ROUGE-S entries.

ROUGE-L

#language

#Metric

A member of the ROUGE family focused on the length of the longest common subsequence in the reference text and generated text . The following formulas calculate recall and precision for ROUGE-L:

$$\text{ROUGE-L recall} = \frac{\text{length of the longest common subsequence}} {\text{number of words in the reference text} }$$

$$\text{ROUGE-L precision} = \frac{\text{length of the longest common subsequence}} {\text{number of words in the generated text} }$$

You can then use F ₁ to roll up ROUGE-L recall and ROUGE-L precision into a single metric:

$$\text{ROUGE-L F} {_1} = \frac{\text{2} * \text{ROUGE-L recall} * \text{ROUGE-L precision}} {\text{ROUGE-L recall} + \text{ROUGE-L precision} }$$


Consider the following reference text and generated text.

Category	Who produced?	Text
Reference text	Human translator	I want to understand a wide variety of things.
Generated text	ML model	I want to learn plenty of things.

Therefore:

The length of the longest common subsequence is 5 (I want to of things).
The number of words in the reference text is 9.
The number of words in the generated text is 7.

Consequently:

$$\text{ROUGE-L recall} = \frac{\text{5}} {\text{9} } = 0.56$$

$$\text{ROUGE-L precision} = \frac{\text{5}} {\text{7} } = 0.71$$

$$\text{ROUGE-L F} {_1} = \frac{\text{2} * \text{0.56} * \text{0.71}} {\text{0.56} + \text{0.71} } = 0.63$$
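The worked example can be reproduced with a short Python sketch. The tokenization used here (lowercasing, splitting on whitespace, and stripping trailing punctuation) is an assumption; real ROUGE implementations may tokenize differently. The names `lcs_length`, `tokenize`, and `rouge_l` are illustrative.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation (an assumption)."""
    return [word.strip(".,!?").lower() for word in text.split()]


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(reference, generated):
    """Return (recall, precision, F1) for ROUGE-L."""
    ref, gen = tokenize(reference), tokenize(generated)
    lcs = lcs_length(ref, gen)
    recall = lcs / len(ref)
    precision = lcs / len(gen)
    f1 = 2 * recall * precision / (recall + precision) if lcs else 0.0
    return recall, precision, f1


recall, precision, f1 = rouge_l(
    "I want to understand a wide variety of things.",
    "I want to learn plenty of things.",
)
print(f"recall={recall:.3f} precision={precision:.3f} F1={f1:.3f}")
```

Computed from unrounded recall and precision, F ₁ comes out at 0.625; the worked value of 0.63 results from rounding the intermediate values first.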

ROUGE-L ignores any newlines in the reference text and generated text, so the longest common subsequence could cross multiple sentences. When the reference text and generated text involve multiple sentences, a variation of ROUGE-L called ROUGE-Lsum is generally a better metric. ROUGE-Lsum determines the longest common subsequence for each sentence in a passage and then calculates the mean of those longest common subsequences.


Consider the following reference text and generated text.

Category	Who produced?	Text
Reference text	Human translator	The surface of Mars is dry. Nearly all the water is deep underground.
Generated text	ML model	Mars has a dry surface. However, the vast majority of water is underground.

Therefore:

	First sentence	Second sentence
Longest common subsequence	2 (Mars dry)	4 (the water is underground)
Sentence length of reference text	6	7
Sentence length of generated text	5	8

Consequently:

$$\text{recall of first sentence} = \frac{\text{2}} {\text{6}} = 0.33 $$

$$\text{recall of second sentence} = \frac{\text{4}} {\text{7}} = 0.57 $$

$$\text{ROUGE-Lsum recall} = \frac{\text{0.33} + \text{0.57}} {\text{2}} = 0.45 $$

$$\text{precision of first sentence} = \frac{\text{2}} {\text{5}} = 0.4 $$

$$\text{precision of second sentence} = \frac{\text{4}} {\text{8}} = 0.5 $$

$$\text{ROUGE-Lsum precision} = \frac{\text{0.4} + \text{0.5}} {\text{2}} = 0.45 $$

$$\text{ROUGE-Lsum F}{_1} = \frac{\text{2} * \text{0.45} * \text{0.45}} {\text{0.45} + \text{0.45}} = 0.45 $$

ROUGE-N

#language

#Metric

A set of metrics within the ROUGE family that compares the shared N-grams of a certain size in the reference text and generated text. For example:

ROUGE-1 measures the number of shared tokens in the reference text and generated text.
ROUGE-2 measures the number of shared bigrams (2-grams) in the reference text and generated text.
ROUGE-3 measures the number of shared trigrams (3-grams) in the reference text and generated text.

You can use the following formulas to calculate ROUGE-N recall and ROUGE-N precision for any member of the ROUGE-N family:

$$\text{ROUGE-N recall} = \frac{\text{number of matching N-grams}} {\text{number of N-grams in the reference text} }$$

$$\text{ROUGE-N precision} = \frac{\text{number of matching N-grams}} {\text{number of N-grams in the generated text} }$$

You can then use F ₁ to roll up ROUGE-N recall and ROUGE-N precision into a single metric:

$$\text{ROUGE-N F}{_1} = \frac{\text{2} * \text{ROUGE-N recall} * \text{ROUGE-N precision}} {\text{ROUGE-N recall} + \text{ROUGE-N precision} }$$


Suppose you decide to use ROUGE-2 to measure the effectiveness of an ML model's translation compared to a human translator's.

Category	Who produced?	Text	Bigrams
Reference text	Human translator	I want to understand a wide variety of things.	I want, want to, to understand, understand a, a wide, wide variety, variety of, of things
Generated text	ML model	I want to learn plenty of things.	I want, want to, to learn, learn plenty, plenty of, of things

Therefore:

The number of matching 2-grams is 3 (I want, want to, and of things).
The number of 2-grams in the reference text is 8.
The number of 2-grams in the generated text is 6.

Consequently:

$$\text{ROUGE-2 recall} = \frac{\text{3}} {\text{8} } = 0.375$$

$$\text{ROUGE-2 precision} = \frac{\text{3}} {\text{6} } = 0.5$$

$$\text{ROUGE-2 F}{_1} = \frac{\text{2} * \text{0.375} * \text{0.5}} {\text{0.375} + \text{0.5} } = 0.43$$
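The same ROUGE-2 numbers can be reproduced with a brief Python sketch. The tokenization (lowercasing, whitespace splitting, stripping punctuation) and the helper names `tokenize`, `ngram_counts`, and `rouge_n` are assumptions for illustration; production ROUGE libraries may differ in detail.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation (an assumption)."""
    return [word.strip(".,!?").lower() for word in text.split()]


def ngram_counts(tokens, n):
    """Count each contiguous n-gram in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, generated, n):
    """Return (recall, precision, F1) for ROUGE-N."""
    ref = ngram_counts(tokenize(reference), n)
    gen = ngram_counts(tokenize(generated), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    matches = sum((ref & gen).values())
    recall = matches / sum(ref.values())
    precision = matches / sum(gen.values())
    f1 = 2 * recall * precision / (recall + precision) if matches else 0.0
    return recall, precision, f1


recall, precision, f1 = rouge_n(
    "I want to understand a wide variety of things.",
    "I want to learn plenty of things.",
    n=2,
)
print(f"recall={recall:.3f} precision={precision:.3f} F1={f1:.2f}")
```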

ROUGE-S

#language

#Metric

A forgiving form of ROUGE-N that enables skip-gram matching. That is, ROUGE-N only counts N-grams that match exactly, but ROUGE-S also counts N-grams separated by one or more words. For example, consider the following:

reference text : White clouds
generated text : White billowing clouds

When calculating ROUGE-N, the 2-gram, White clouds doesn't match White billowing clouds . However, when calculating ROUGE-S, White clouds does match White billowing clouds .
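The difference can be illustrated with a small Python sketch. It assumes that a skip-bigram is any ordered pair of words from the text with an unlimited gap between them; some implementations cap the gap, so treat this as one possible reading. The helper names are illustrative.

```python
from itertools import combinations


def bigrams(tokens):
    """Contiguous 2-grams, as counted by ROUGE-2."""
    return set(zip(tokens, tokens[1:]))


def skip_bigrams(tokens):
    """Ordered word pairs with any gap between them (an assumed ROUGE-S reading)."""
    return set(combinations(tokens, 2))


reference = ["white", "clouds"]
generated = ["white", "billowing", "clouds"]

print(bigrams(reference) & bigrams(generated))            # no exact 2-gram match
print(skip_bigrams(reference) & skip_bigrams(generated))  # the skip-bigram matches
```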

R-squared

#Metric

A regression metric indicating how much variation in a label is due to an individual feature or to a feature set. R-squared is a value between 0 and 1, which you can interpret as follows:

An R-squared of 0 means that none of a label's variation is due to the feature set.
An R-squared of 1 means that all of a label's variation is due to the feature set.
An R-squared between 0 and 1 indicates the extent to which the label's variation can be predicted from a particular feature or the feature set. For example, an R-squared of 0.10 means that 10 percent of the variance in the label is due to the feature set, an R-squared of 0.20 means that 20 percent is due to the feature set, and so on.

R-squared is the square of the Pearson correlation coefficient between the values that a model predicted and ground truth .
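Following that definition, a minimal Python sketch squares the Pearson correlation between predictions and ground truth. The function names and the four sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def r_squared(predictions, ground_truth):
    """Square of the Pearson correlation between predictions and ground truth."""
    return pearson(predictions, ground_truth) ** 2


ground_truth = [1.0, 2.0, 3.0, 4.0]
predictions = [1.0, 3.0, 2.0, 4.0]

# The correlation is 0.8 for these values, so R-squared is 0.64.
print(r_squared(predictions, ground_truth))
```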

S

scoring

#recsystems

#Metric

The part of a recommendation system that provides a value or ranking for each item produced by the candidate generation phase.

similarity measure

#clustering

#Metric

In clustering algorithms, the metric used to determine how alike (how similar) any two examples are.

sparsity

#Metric

The number of elements set to zero (or null) in a vector or matrix divided by the total number of entries in that vector or matrix. For example, consider a 100-element matrix in which 98 cells contain zero. The calculation of sparsity is as follows:

$$ {\text{sparsity}} = \frac{\text{98}} {\text{100}} = {\text{0.98}} $$

Feature sparsity refers to the sparsity of a feature vector; model sparsity refers to the sparsity of the model weights.
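As a rough sketch, sparsity can be computed for a matrix stored as a list of rows. The function name `sparsity` and the two nonzero values placed in the matrix are arbitrary illustrations.

```python
def sparsity(matrix):
    """Fraction of entries that are zero (or None) in a list-of-rows matrix."""
    entries = [value for row in matrix for value in row]
    empty = sum(1 for value in entries if value is None or value == 0)
    return empty / len(entries)


# A 10 x 10 matrix (100 elements) in which 98 cells contain zero.
matrix = [[0] * 10 for _ in range(10)]
matrix[2][3] = 7.5
matrix[8][1] = -1.0

print(sparsity(matrix))
```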

squared hinge loss

#Metric

The square of the hinge loss . Squared hinge loss penalizes outliers more harshly than regular hinge loss.

squared loss

#fundamentals

#Metric

Synonym for L ₂ loss .

T

test loss

#fundamentals

#Metric

A metric representing a model's loss against the test set . When building a model , you typically try to minimize test loss. That's because a low test loss is a stronger quality signal than a low training loss or low validation loss .

A large gap between test loss and training loss or validation loss sometimes suggests that you need to increase the regularization rate .

top-k accuracy

#language

#Metric

The percentage of times that a "target label" appears within the first k positions of generated lists. The lists could be personalized recommendations or a list of items ordered by softmax .

Top-k accuracy is also known as accuracy at k .


Consider a machine learning system that uses softmax to identify tree probabilities based on a picture of tree leaves. The following table shows output lists generated from five input tree pictures. Each row contains a target label and the five most likely trees. For example, when the target label was maple , the machine learning model identified elm as the most likely tree, oak as the second most likely tree, and so on.

Target label	1	2	3	4	5
maple	elm	oak	maple	beech	poplar
dogwood	oak	dogwood	poplar	hickory	maple
oak	oak	basswood	locust	alder	linden
linden	maple	paw-paw	oak	basswood	poplar
oak	locust	linden	oak	maple	paw-paw

The target label appears in the first position only once, so the top-1 accuracy is:

$$\text{top-1 accuracy} = \frac{\text{1}} {\text{5}} = 0.2$$

The target label appears in one of the top three positions four times, so the top-3 accuracy is:

$$\text{top-3 accuracy} = \frac{\text{4}} {\text{5}} = 0.8$$
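The tree example can be checked with a short Python sketch. Each row pairs a target label with the five ranked predictions (maple, dogwood, oak, linden, oak as targets); the function name `top_k_accuracy` is an illustrative choice.

```python
def top_k_accuracy(rows, k):
    """Fraction of rows whose target label appears in the first k ranked predictions."""
    hits = sum(1 for target, ranked in rows if target in ranked[:k])
    return hits / len(rows)


# (target label, five most likely trees in ranked order)
rows = [
    ("maple", ["elm", "oak", "maple", "beech", "poplar"]),
    ("dogwood", ["oak", "dogwood", "poplar", "hickory", "maple"]),
    ("oak", ["oak", "basswood", "locust", "alder", "linden"]),
    ("linden", ["maple", "paw-paw", "oak", "basswood", "poplar"]),
    ("oak", ["locust", "linden", "oak", "maple", "paw-paw"]),
]

print(top_k_accuracy(rows, 1))  # top-1 accuracy
print(top_k_accuracy(rows, 3))  # top-3 accuracy
```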

toxicity

#language

#Metric

The degree to which content is abusive, threatening, or offensive. Many machine learning models can identify and measure toxicity. Most of these models identify toxicity along multiple parameters, such as the level of abusive language and the level of threatening language.

training loss

#fundamentals

#Metric

A metric representing a model's loss during a particular training iteration. For example, suppose the loss function is Mean Squared Error . Perhaps the training loss (the Mean Squared Error) for the 10th iteration is 2.2, and the training loss for the 100th iteration is 1.9.

A loss curve plots training loss versus the number of iterations. A loss curve provides the following hints about training:

A downward slope implies that the model is improving.
An upward slope implies that the model is getting worse.
A flat slope implies that the model has reached convergence .

For example, the following somewhat idealized loss curve shows:

A steep downward slope during the initial iterations, which implies rapid model improvement.
A gradually flattening (but still downward) slope until close to the end of training, which implies continued model improvement at a somewhat slower pace than during the initial iterations.
A flat slope towards the end of training, which suggests convergence.

The plot of training loss versus iterations. This loss curve starts with a steep downward slope. The slope gradually flattens until the slope becomes zero.

Although training loss is important, see also generalization .

true negative (TN)

#fundamentals

#Metric

An example in which the model correctly predicts the negative class . For example, the model infers that a particular email message is not spam , and that email message really is not spam .

true positive (TP)

#fundamentals

#Metric

An example in which the model correctly predicts the positive class . For example, the model infers that a particular email message is spam, and that email message really is spam.

true positive rate (TPR)

#fundamentals

#Metric

Synonym for recall. That is:

$$\text{true positive rate} = \frac {\text{true positives}} {\text{true positives} + \text{false negatives}}$$

True positive rate is the y-axis in an ROC curve .

V

validation loss

#fundamentals

#Metric

A metric representing a model's loss on the validation set during a particular iteration of training.

variable importances

#df

#Metric

A set of scores that indicates the relative importance of each feature to the model.

For example, consider a decision tree that estimates house prices. Suppose this decision tree uses three features: size, age, and style. If a set of variable importances for the three features are calculated to be {size=5.8, age=2.5, style=4.7}, then size is more important to the decision tree than age or style.

Different variable importance metrics exist, which can inform ML experts about different aspects of models.

W

Wasserstein loss

#Metric

One of the loss functions commonly used in generative adversarial networks , based on the earth mover's distance between the distribution of generated data and real data.


A

accuracy

#fundamentals

#Metric

The number of correct classification predictions divided by the total number of predictions. That is:

$$\text{Accuracy} = \frac{\text{correct predictions}} {\text{correct predictions + incorrect predictions }}$$

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

$$\text{Accuracy} = \frac{\text{40}} {\text{40 + 10}} = \text{80%}$$

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions . So, the accuracy formula for binary classification is as follows:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

where:

TP is the number of true positives (correct predictions).
TN is the number of true negatives (correct predictions).
FP is the number of false positives (incorrect predictions).
FN is the number of false negatives (incorrect predictions).

Compare and contrast accuracy with precision and recall .


Although a valuable metric for some situations, accuracy is highly misleading for others. Notably, accuracy is usually a poor metric for evaluating classification models that process class-imbalanced datasets .

For example, suppose snow falls only 25 days per century in a certain subtropical city. Since days without snow (the negative class) vastly outnumber days with snow (the positive class), the snow dataset for this city is class-imbalanced. Imagine a binary classification model that is supposed to predict either snow or no snow each day but simply predicts "no snow" every day. This model is highly accurate but has no predictive power. The following table summarizes the results for a century of predictions:

Category	Number
TP	0
TN	36499
FP	0
FN	25

The accuracy of this model is therefore:

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (0 + 36499) / (0 + 36499 + 0 + 25) = 0.9993 = 99.93%

Although 99.93% accuracy seems like a very impressive percentage, the model actually has no predictive power.

Precision and recall are usually more useful metrics than accuracy for evaluating models trained on class-imbalanced datasets.

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

area under the PR curve

#Metric

See PR AUC (Area under the PR Curve) .

area under the ROC curve

#Metric

See AUC (Area under the ROC curve) .

AUC (Area under the ROC curve)

#fundamentals

#Metric

A number between 0.0 and 1.0 representing a binary classification model's ability to separate positive classes from negative classes . The closer the AUC is to 1.0, the better the model's ability to separate classes from each other.

For example, the following illustration shows a classification model that separates positive classes (green ovals) from negative classes (purple rectangles) perfectly. This unrealistically perfect model has an AUC of 1.0:

A number line with 8 positive examples on one side and 9 negative examples on the other side.

Conversely, the following illustration shows the results for a classification model that generated random results. This model has an AUC of 0.5:

A number line with 6 positive examples and 6 negative examples. The sequence of examples is positive, negative, positive, negative, positive, negative, positive, negative, positive, negative, positive, negative.

Yes, the preceding model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, the following model separates positives from negatives somewhat, and therefore has an AUC somewhere between 0.5 and 1.0:

A number line with 6 positive examples and 6 negative examples. The sequence of examples is negative, negative, negative, negative, positive, negative, positive, positive, negative, positive, positive, positive.

AUC ignores any value you set for classification threshold . Instead, AUC considers all possible classification thresholds.


AUC represents the area under an ROC curve . For example, the ROC curve for a model that perfectly separates positives from negatives looks as follows:

AUC is the area of the gray region in the preceding illustration. In this unusual case, the area is simply the length of the gray region (1.0) multiplied by the width of the gray region (1.0). So, the product of 1.0 and 1.0 yields an AUC of exactly 1.0, which is the highest possible AUC score.

Conversely, the ROC curve for a classification model that can't separate classes at all is as follows. The area of this gray region is 0.5.

A more typical ROC curve looks approximately like the following:

It would be painstaking to calculate the area under this curve manually, which is why a program typically calculates most AUC values.


AUC is the probability that a classification model will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.
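This probabilistic reading suggests a direct, if slow, way to compute AUC: compare every positive example with every negative example. The Python sketch below does that. It assumes each example's position on the number line is its score, counts ties as half a win, and uses the partially separated sequence described earlier in this entry (negative, negative, negative, negative, positive, negative, positive, positive, negative, positive, positive, positive). The function name `pairwise_auc` is hypothetical.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # Ties count as half, a common convention.
    return wins / (len(positives) * len(negatives))


# 1 = positive, 0 = negative, listed left to right along the number line.
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
scores = list(range(len(labels)))  # Assumed score: position on the number line.

print(pairwise_auc(labels, scores))
```

For this sequence, 32 of the 36 positive-negative pairs are ordered correctly, so the result is about 0.89, which lies between 0.5 and 1.0 as expected.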

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

average precision at k

#language

#Metric

A metric for summarizing a model's performance on a single prompt that generates ranked results, such as a numbered list of book recommendations. Average precision at k is, well, the average of the precision at k values for each relevant result. The formula for average precision at k is therefore:

\[{\text{average precision at k}} = \frac{1}{n} \sum_{i=1}^n {\text{precision at k for each relevant item} } \]

where:

$n$ is the number of relevant items in the list.

Contrast with recall at k .


Suppose a large language model is given the following query:

 List the 6 funniest movies of all time in order.

And the large language model returns the following list:

The General
Mean Girls
Platoon
Bridesmaids
Citizen Kane
This Is Spinal Tap

Four of the movies in the returned list are very funny (that is, they are relevant) but two movies are dramas (not relevant). The following table details the results:

Position	Movie	Relevant?	Precision at k
1	The General	Yes	1.0
2	Mean Girls	Yes	1.0
3	Platoon	No	not relevant
4	Bridesmaids	Yes	0.75
5	Citizen Kane	No	not relevant
6	This Is Spinal Tap	Yes	0.67

The number of relevant results is 4. Therefore, you can calculate the average precision at 6 as follows:

$${\text{average precision at 6}} = \frac{1}{4} {\text{(1.0 + 1.0 + 0.75 + 0.67)} } $$

$${\text{average precision at 6}} \approx 0.85 $$
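A compact Python sketch of the same calculation follows. The function name `average_precision_at_k` and the boolean encoding of relevance are assumptions for illustration.

```python
def average_precision_at_k(relevance, k):
    """Mean of precision-at-position over the relevant items in the first k results."""
    precisions = []
    hits = 0
    for position, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            # Precision at this position: relevant items so far / items so far.
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Relevance flags for the six returned movies, in ranked order.
relevance = [True, True, False, True, False, True]

print(round(average_precision_at_k(relevance, 6), 3))
```

Because the sketch keeps the unrounded value 4/6 for the last item rather than 0.67, it prints a value slightly above the rounded hand calculation.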

B

baseline

#Metric

A model used as a reference point for comparing how well another model (typically, a more complex one) is performing. For example, a logistic regression model might serve as a good baseline for a deep model .

For a particular problem, the baseline helps model developers quantify the minimal expected performance that a new model must achieve for the new model to be useful.

C

cost

#Metric

Synonym for loss .

counterfactual fairness

#responsible

#Metric

A fairness metric that checks whether a classification model produces the same result for one individual as it does for another individual who is identical to the first, except with respect to one or more sensitive attributes . Evaluating a classification model for counterfactual fairness is one method for surfacing potential sources of bias in a model.

See either of the following for more information:

Fairness: Counterfactual fairness in Machine Learning Crash Course.
When Worlds Collide: Integrating Different Counterfactual Assumptions in Fairness

cross-entropy

#Metric

A generalization of Log Loss to multi-class classification problems . Cross-entropy quantifies the difference between two probability distributions. See also perplexity .

cumulative distribution function (CDF)

#Metric

A function that defines the frequency of samples less than or equal to a target value. For example, consider a normal distribution of continuous values. A CDF tells you that approximately 50% of samples should be less than or equal to the mean and that approximately 84% of samples should be less than or equal to one standard deviation above the mean.
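The two percentages quoted above can be checked with Python's standard library. This is only a sketch for the standard normal distribution (mean 0, standard deviation 1); the variable names are illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

at_mean = standard_normal.cdf(0.0)       # Share of samples at or below the mean.
one_sd_above = standard_normal.cdf(1.0)  # Share at or below one standard deviation above.

print(f"{at_mean:.2%}")
print(f"{one_sd_above:.2%}")
```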

D

demographic parity

#responsible

#Metric

A fairness metric that is satisfied if the results of a model's classification are not dependent on a given sensitive attribute .

For example, if both Lilliputians and Brobdingnagians apply to Glubbdubdrib University, demographic parity is achieved if the percentage of Lilliputians admitted is the same as the percentage of Brobdingnagians admitted, irrespective of whether one group is on average more qualified than the other.

Contrast with equalized odds and equality of opportunity , which permit classification results in aggregate to depend on sensitive attributes, but don't permit classification results for certain specified ground truth labels to depend on sensitive attributes. See "Attacking discrimination with smarter machine learning" for a visualization exploring the tradeoffs when optimizing for demographic parity.

See Fairness: demographic parity in Machine Learning Crash Course for more information.

E

earth mover's distance (EMD)

#Metric

A measure of the relative similarity of two distributions . The lower the earth mover's distance, the more similar the distributions.

edit distance

#language

#Metric

A measurement of how similar two text strings are to each other. In machine learning, edit distance is useful for the following reasons:

Edit distance is easy to compute.
Edit distance can compare two strings known to be similar to each other.
Edit distance can determine the degree to which different strings are similar to a given string.

There are several definitions of edit distance, each using different string operations. See Levenshtein distance for an example.

empirical cumulative distribution function (eCDF or EDF)

#Metric

entropy

#df

#Metric

The entropy of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) has the following formula:

H = -p log p - q log q = -p log p - (1-p) * log (1-p)

where:

H is the entropy.
p is the fraction of "1" examples.
q is the fraction of "0" examples. Note that q = (1 - p)
log is generally log ₂ . In this case, the entropy unit is a bit.

For example, suppose the following:

100 examples contain the value "1"
300 examples contain the value "0"

Therefore, the entropy value is:

p = 0.25
q = 0.75
H = (-0.25)log ₂ (0.25) - (0.75)log ₂ (0.75) = 0.81 bits per example

A set that is perfectly balanced (for example, 200 "0"s and 200 "1"s) would have an entropy of 1.0 bit per example. As a set becomes more imbalanced , its entropy moves towards 0.0.
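The formula and both examples can be sketched in Python as follows. The function name `binary_entropy` is illustrative, and the treatment of p = 0 or p = 1 as zero entropy follows the usual convention that 0 log 0 = 0.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-valued set in which a fraction p of examples are "1"."""
    if p in (0.0, 1.0):
        return 0.0  # Convention: 0 * log(0) is treated as 0.
    q = 1.0 - p
    return -p * math.log2(p) - q * math.log2(q)


print(round(binary_entropy(100 / 400), 2))  # 100 "1"s and 300 "0"s
print(binary_entropy(200 / 400))            # perfectly balanced set
```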

In decision trees , entropy helps formulate information gain to help the splitter select the conditions during the growth of a classification decision tree.

Compare entropy with:

gini impurity
cross-entropy loss function

Entropy is often called Shannon's entropy .

See Exact splitter for binary classification with numerical features in the Decision Forests course for more information.

equality of opportunity

#responsible

#Metric

Equality of opportunity is related to equalized odds , which requires that both the true positive rates and false positive rates are the same for all groups.

For example, suppose 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 1. Lilliputian applicants (90% are qualified)

	Qualified	Unqualified
Admitted	45	3
Rejected	45	7
Total	90	10

Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 7/10 = 70%
Total percentage of Lilliputian students admitted: (45+3)/100 = 48%

Table 2. Brobdingnagian applicants (10% are qualified):

	Qualified	Unqualified
Admitted	5	9
Rejected	5	81
Total	10	90

Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 81/90 = 90%
Total percentage of Brobdingnagian students admitted: (5+9)/100 = 14%

The preceding examples satisfy equality of opportunity for acceptance of qualified students because qualified Lilliputians and Brobdingnagians both have a 50% chance of being admitted.

While equality of opportunity is satisfied, the following two fairness metrics are not satisfied:

demographic parity: Lilliputians and Brobdingnagians are admitted to the university at different rates; 48% of Lilliputian students are admitted, but only 14% of Brobdingnagian students are admitted.
equalized odds : While qualified Lilliputian and Brobdingnagian students both have the same chance of being admitted, the additional constraint that unqualified Lilliputians and Brobdingnagians both have the same chance of being rejected is not satisfied. Unqualified Lilliputians have a 70% rejection rate, whereas unqualified Brobdingnagians have a 90% rejection rate.
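The three checks above can be sketched in Python using the counts from Tables 1 and 2. The class name `GroupOutcomes`, its field names, and the use of `math.isclose` to compare rates are assumptions made for illustration; they are not a standard fairness API.

```python
import math
from dataclasses import dataclass


@dataclass
class GroupOutcomes:
    qualified_admitted: int
    qualified_rejected: int
    unqualified_admitted: int
    unqualified_rejected: int

    def true_positive_rate(self) -> float:
        """Share of qualified applicants who were admitted."""
        return self.qualified_admitted / (self.qualified_admitted + self.qualified_rejected)

    def false_positive_rate(self) -> float:
        """Share of unqualified applicants who were admitted."""
        return self.unqualified_admitted / (self.unqualified_admitted + self.unqualified_rejected)

    def admission_rate(self) -> float:
        """Share of all applicants who were admitted."""
        total = (self.qualified_admitted + self.qualified_rejected
                 + self.unqualified_admitted + self.unqualified_rejected)
        return (self.qualified_admitted + self.unqualified_admitted) / total


lilliputians = GroupOutcomes(45, 45, 3, 7)
brobdingnagians = GroupOutcomes(5, 5, 9, 81)

equality_of_opportunity = math.isclose(
    lilliputians.true_positive_rate(), brobdingnagians.true_positive_rate())
equalized_odds = equality_of_opportunity and math.isclose(
    lilliputians.false_positive_rate(), brobdingnagians.false_positive_rate())
demographic_parity = math.isclose(
    lilliputians.admission_rate(), brobdingnagians.admission_rate())

print(equality_of_opportunity, equalized_odds, demographic_parity)
```

Exact equality of rates is rarely achievable on real data, so a practical check would likely compare the rates against a tolerance chosen for the application.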

See Fairness: Equality of opportunity in Machine Learning Crash Course for more information.

equalized odds

#responsible

#Metric

Equalized odds is related to equality of opportunity , which only focuses on error rates for a single class (positive or negative).

Suppose 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 3. Lilliputian applicants (90% are qualified)

	Qualified	Unqualified
Admitted	45	2
Rejected	45	8
Total	90	10

Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 8/10 = 80%
Total percentage of Lilliputian students admitted: (45+2)/100 = 47%

Table 4. Brobdingnagian applicants (10% are qualified):

	Qualified	Unqualified
Admitted	5	18
Rejected	5	72
Total	10	90

Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 72/90 = 80%
Total percentage of Brobdingnagian students admitted: (5+18)/100 = 23%

evals

#language

#generativeAI

#Metric

Primarily used as an abbreviation for LLM evaluations . More broadly, evals is an abbreviation for any form of evaluation .

evaluation

#language

#generativeAI

#Metric

The process of measuring a model's quality or comparing different models against each other.

F

F ₁

#Metric

A "roll-up" binary classification metric that relies on both precision and recall. Here is the formula:

$$F{_1} = \frac{\text{2 * precision * recall}} {\text{precision + recall}}$$


Suppose precision and recall have the following values:

precision = 0.6
recall = 0.4

You calculate F ₁ as follows:

$$F{_1} = \frac{\text{2 * 0.6 * 0.4}} {\text{0.6 + 0.4}} = 0.48$$

precision = 0.9
recall = 0.1

$$F{_1} = \frac{\text{2 * 0.9 * 0.1}} {\text{0.9 + 0.1}} = 0.18$$
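Both examples can be reproduced with a few lines of Python. The function name `f1_score` is chosen for illustration, and returning 0.0 when precision and recall are both zero is an assumed convention.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # Assumed convention when both inputs are zero.
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.6, 0.4), 2))  # similar precision and recall
print(round(f1_score(0.9, 0.1), 2))  # very different precision and recall
```

In the second case the result sits much closer to the lower of the two inputs, which is the characteristic behavior of a harmonic mean.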

fairness metric

#responsible

#Metric

A mathematical definition of "fairness" that is measurable. Some commonly used fairness metrics include:

Many fairness metrics are mutually exclusive; see incompatibility of fairness metrics .

false negative (FN)

#fundamentals

#Metric

false negative rate

#Metric

The proportion of actual positive examples for which the model mistakenly predicted the negative class. The following formula calculates the false negative rate:

$$\text{false negative rate} = \frac{\text{false negatives}}{\text{false negatives} + \text{true positives}}$$

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive (FP)

#fundamentals

#Metric

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive rate (FPR)

#fundamentals

#Metric

The proportion of actual negative examples for which the model mistakenly predicted the positive class. The following formula calculates the false positive rate:

$$\text{false positive rate} = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}$$

The false positive rate is the x-axis in an ROC curve .

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

اهمیت ویژگی ها

#df

#Metric

Synonym for variable importances .

fraction of successes

#generativeAI

#Metric

Although fraction of successes is broadly useful throughout statistics, within ML, this metric is primarily useful for measuring verifiable tasks like code generation or math problems.

G

gini impurity

#df

#Metric

Gini impurity is also called gini index , or simply gini .

I = 1 - (p² + q²) = 1 - (p² + (1-p)²)

where:

I is the gini impurity.
p is the fraction of "1" examples.
q is the fraction of "0" examples. Note that q = 1-p

For example, consider the following dataset:

100 labels (0.25 of the dataset) contain the value "1"
300 labels (0.75 of the dataset) contain the value "0"

Therefore, the gini impurity is:

p = 0.25
q = 0.75
I = 1 - (0.25 ² + 0.75 ² ) = 0.375

Consequently, a random label from the same dataset would have a 37.5% chance of being misclassified, and a 62.5% chance of being properly classified.

A perfectly balanced label (for example, 200 "0"s and 200 "1"s) would have a gini impurity of 0.5. A highly imbalanced label would have a gini impurity close to 0.0.
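The gini impurity calculation above can be sketched in a few lines of Python. The function name and its count-based signature are illustrative choices, not part of any library.

```python
def gini_impurity(num_ones: int, num_zeros: int) -> float:
    """Gini impurity of a binary label set, I = 1 - (p^2 + q^2)."""
    total = num_ones + num_zeros
    p = num_ones / total  # fraction of "1" examples
    q = 1 - p             # fraction of "0" examples
    return 1 - (p ** 2 + q ** 2)


print(gini_impurity(100, 300))  # 0.375
print(gini_impurity(200, 200))  # 0.5 (perfectly balanced)
print(gini_impurity(1, 999))    # close to 0.0 (highly imbalanced)
```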

H

hinge loss

#Metric

$$\text{loss} = \text{max}(0, 1 - (y * y'))$$

where y is the true label, either -1 or +1, and y' is the raw output of the classification model :

$$y' = b + w_1x_1 + w_2x_2 + … w_nx_n$$

Consequently, a plot of hinge loss versus (y * y') looks as follows:

I

incompatibility of fairness metrics

#responsible

#Metric

See "On the (im)possibility of fairness" for a more detailed discussion of the incompatibility of fairness metrics.

individual fairness

#responsible

#Metric

See "Fairness Through Awareness" for a more detailed discussion of individual fairness.

information gain

#df

#Metric

For example, consider the following entropy values:

entropy of parent node = 0.6
entropy of one child node with 16 relevant examples = 0.2
entropy of another child node with 24 relevant examples = 0.1

So 40% of the examples are in one child node and 60% are in the other child node. Therefore:

weighted entropy sum of child nodes = (0.4 * 0.2) + (0.6 * 0.1) = 0.14

So, the information gain is:

information gain = entropy of parent node - weighted entropy sum of child nodes
information gain = 0.6 - 0.14 = 0.46

Most splitters seek to create conditions that maximize information gain.
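The weighted-entropy arithmetic above can be checked with a short sketch. The function name and the `(num_examples, entropy)` pair format are assumptions made for illustration.

```python
def information_gain(parent_entropy: float, children: list[tuple[int, float]]) -> float:
    """Parent entropy minus the example-weighted sum of child entropies.

    `children` holds one (num_examples, entropy) pair per child node.
    """
    total = sum(count for count, _ in children)
    weighted_sum = sum((count / total) * entropy for count, entropy in children)
    return parent_entropy - weighted_sum


gain = information_gain(0.6, [(16, 0.2), (24, 0.1)])
print(round(gain, 2))  # 0.46
```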

inter-rater agreement

#Metric

See Categorical data: Common issues in Machine Learning Crash Course for more information.

L

L ₁ loss

#fundamentals

#Metric

Actual value of example	Model's predicted value	Absolute value of delta
7	6	1
5	4	1
8	11	3
4	6	2
9	8	1
		8 = L ₁ loss

L ₁ loss is less sensitive to outliers than L ₂ loss .

The Mean Absolute Error is the average L ₁ loss per example.

$$ L_1 \text{ loss} = \sum_{i=0}^n | y_i - \hat{y}_i |$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.

See Linear regression: Loss in Machine Learning Crash Course for more information.

L ₂ loss

#fundamentals

#Metric

Actual value of example	Model's predicted value	Square of delta
7	6	1
5	4	1
8	11	9
4	6	4
9	8	1
		16 = L ₂ loss

Regression models typically use L ₂ loss as the loss function.

The Mean Squared Error is the average L ₂ loss per example. Squared loss is another name for L ₂ loss.

$$ L_2 \text{ loss} = \sum_{i=0}^n {(y_i - \hat{y}_i)}^2$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.
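The two five-example tables in the L ₁ loss and L ₂ loss entries can be reproduced with the following sketch. The helper names are illustrative, and the data is the batch shown in those tables.

```python
actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]


def l1_loss(y, y_hat):
    """Sum of absolute differences between labels and predictions."""
    return sum(abs(a - p) for a, p in zip(y, y_hat))


def l2_loss(y, y_hat):
    """Sum of squared differences between labels and predictions."""
    return sum((a - p) ** 2 for a, p in zip(y, y_hat))


print(l1_loss(actual, predicted))  # 8
print(l2_loss(actual, predicted))  # 16
```

The third example (actual 8, predicted 11) contributes 3 to the L ₁ loss but 9 to the L ₂ loss, which is why L ₂ loss reacts more strongly to outliers.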

LLM evaluations (evals)

#language

#generativeAI

#Metric

A set of metrics and benchmarks for assessing the performance of large language models (LLMs). At a high level, LLM evaluations:

Help researchers identify areas where LLMs need improvement.
Are useful in comparing different LLMs and identifying the best LLM for a particular task.
Help ensure that LLMs are safe and ethical to use.

See Large language models (LLMs) in Machine Learning Crash Course for more information.

loss

#fundamentals

#Metric

During the training of a supervised model , a measure of how far a model's prediction is from its label .

A loss function calculates the loss.

See Linear regression: Loss in Machine Learning Crash Course for more information.

loss function

#fundamentals

#Metric

The goal of training is typically to minimize the loss that a loss function returns.

Many different kinds of loss functions exist. Pick the appropriate loss function for the kind of model you are building. For example:

L ₂ loss (or Mean Squared Error ) is the loss function for linear regression .
Log Loss is the loss function for logistic regression .

M

Mean Absolute Error (MAE)

#Metric

The average loss per example when L ₁ loss is used. Calculate Mean Absolute Error as follows:

Calculate the L ₁ loss for a batch.
Divide the L ₁ loss by the number of examples in the batch.

$$\text{Mean Absolute Error} = \frac{1}{n}\sum_{i=0}^n | y_i - \hat{y}_i |$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.

For example, consider the calculation of L ₁ loss on the following batch of five examples:

Actual value of example	Model's predicted value	Loss (difference between actual and predicted)
7	6	1
5	4	1
8	11	3
4	6	2
9	8	1
		8 = L ₁ loss

So, L ₁ loss is 8 and the number of examples is 5. Therefore, the Mean Absolute Error is:

 Mean Absolute Error = L₁ loss / Number of Examples
 Mean Absolute Error = 8/5 = 1.6

Contrast Mean Absolute Error with Mean Squared Error and Root Mean Squared Error .

mean average precision at k (mAP@k)

#language

#generativeAI

#Metric

Although the phrase "mean average" sounds redundant, the name of the metric is appropriate. After all, this metric finds the mean of multiple average precision at k values.

For example, suppose the following are five average precision at k values:

0.73
0.77
0.67
0.82
0.76

The mean Average Precision at K is therefore:

$$\text{mean } = \frac{\text{0.73 + 0.77 + 0.67 + 0.82 + 0.76}} {\text{5}} = \text{0.75}$$

Mean Squared Error (MSE)

#Metric

The average loss per example when L ₂ loss is used. Calculate Mean Squared Error as follows:

Calculate the L ₂ loss for a batch.
Divide the L ₂ loss by the number of examples in the batch.

$$\text{Mean Squared Error} = \frac{1}{n}\sum_{i=0}^n {(y_i - \hat{y}_i)}^2$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the model's prediction for $y$.

For example, consider the loss on the following batch of five examples:

Actual value	Model's prediction	Loss	Squared loss
7	6	1	1
5	4	1	1
8	11	3	9
4	6	2	4
9	8	1	1
			16 = L ₂ loss

Therefore, the Mean Squared Error is:

 Mean Squared Error = L₂ loss / Number of Examples
 Mean Squared Error = 16/5 = 3.2

Mean Squared Error is a popular training loss function, particularly for linear regression.

Contrast Mean Squared Error with Mean Absolute Error and Root Mean Squared Error .

TensorFlow Playground uses Mean Squared Error to calculate loss values.

Outliers don't influence Mean Absolute Error as strongly as Mean Squared Error. For example, in the preceding batch, the example with a loss of 3 accounts for 9/16 (~56%) of the L ₂ loss but only 3/8 (~38%) of the L ₁ loss.

Clipping is one way to prevent extreme outliers from damaging your model's predictive ability.
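The following sketch recomputes Mean Absolute Error, Mean Squared Error, and Root Mean Squared Error for the five-example batch above, and shows how much of each total the largest error contributes. Variable names are illustrative.

```python
import math

actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]
errors = [a - p for a, p in zip(actual, predicted)]
n = len(errors)

l1 = sum(abs(e) for e in errors)  # 8
l2 = sum(e ** 2 for e in errors)  # 16

mae = l1 / n           # Mean Absolute Error
mse = l2 / n           # Mean Squared Error
rmse = math.sqrt(mse)  # Root Mean Squared Error

print(mae, mse, round(rmse, 2))  # 1.6 3.2 1.79

# Share of each total loss contributed by the largest error (|error| = 3).
largest = max(abs(e) for e in errors)
print(largest / l1)       # 0.375
print(largest ** 2 / l2)  # 0.5625
```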

metric

#TensorFlow

#Metric

A statistic that you care about.

An objective is a metric that a machine learning system tries to optimize.

Metrics API (tf.metrics)

#Metric

A TensorFlow API for evaluating models. For example, tf.metrics.accuracy determines how often a model's predictions match labels.

minimax loss

#Metric

A loss function for generative adversarial networks , based on the cross-entropy between the distribution of generated data and real data.

Minimax loss is used in the first paper to describe generative adversarial networks.

See Loss Functions in the Generative Adversarial Networks course for more information.

model capacity

#Metric

N

negative class

#fundamentals

#Metric

The negative class in a medical test might be "not tumor."
The negative class in an email classification model might be "not spam."

Contrast with positive class .

O

objective

#Metric

A metric that your algorithm is trying to optimize.

objective function

#Metric

In some cases, the goal is to maximize the objective function. For example, if the objective function is accuracy, the goal is to maximize accuracy.

P

pass at k (pass@k)

#Metric

If one or more of those solutions pass the unit test, then the LLM Passes that code generation challenge.
If none of the solutions pass the unit test, then the LLM Fails that code generation challenge.

The formula for pass at k is as follows:

\[\text{pass at k} = \frac{\text{total number of passes}} {\text{total number of challenges}}\]

In general, higher values of k produce higher pass at k scores; however, higher values of k require more large language model and unit testing resources.

Suppose a software engineer asks a large language model to generate k = 10 solutions for n = 50 challenging coding problems. Here are the results:

30 Passes
20 Fails

The pass at 10 score is therefore:

$$\text{pass at 10} = \frac{\text{30}} {\text{50}} = 0.6$$
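A minimal sketch of the pass at k calculation, assuming each challenge is represented by a list of k booleans (one unit-test outcome per generated solution). This representation and the synthetic data below are assumptions for illustration only.

```python
def pass_at_k(results: list[list[bool]]) -> float:
    """Fraction of challenges where at least one of the k solutions passed."""
    passes = sum(1 for attempts in results if any(attempts))
    return passes / len(results)


k = 10
# Synthetic outcomes: 30 challenges with one passing solution, 20 with none.
results = [[False] * (k - 1) + [True] for _ in range(30)]
results += [[False] * k for _ in range(20)]

print(pass_at_k(results))  # 0.6
```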

performance

#Metric

Overloaded term with the following meanings:

The standard meaning within software engineering. Namely: How fast (or efficiently) does this piece of software run?
The meaning within machine learning. Here, performance answers the following question: How correct is this model ? That is, how good are the model's predictions?

permutation variable importances

#df

#Metric

A type of variable importance that evaluates the increase in the prediction error of a model after permuting the feature's values. Permutation variable importance is a model-independent metric.

perplexity

#Metric

Perplexity is related to cross-entropy as follows:

$$P= 2^{\text{cross entropy}}$$

positive class

#fundamentals

#Metric

The class you are testing for.

For example, the positive class in a cancer model might be "tumor." The positive class in an email classification model might be "spam."

Contrast with negative class .

Admittedly, you're simultaneously testing for both the positive and negative classes.

PR AUC (area under the PR curve)

#Metric

Area under the interpolated precision-recall curve , obtained by plotting (recall, precision) points for different values of the classification threshold .

precision

#Metric

A metric for classification models that answers the following question:

When the model predicted the positive class , what percentage of the predictions were correct?

The formula is:

$$\text{Precision} = \frac{\text{true positives}} {\text{true positives} + \text{false positives}}$$

where:

true positive means the model correctly predicted the positive class.
false positive means the model mistakenly predicted the positive class.

For example, suppose a model made 200 positive predictions. Of these 200 positive predictions:

150 were true positives.
50 were false positives.

In this case:

$$\text{Precision} = \frac{\text{150}} {\text{150} + \text{50}} = 0.75$$

Contrast with accuracy and recall .

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

precision at k (precision@k)

#language

#Metric

A metric for evaluating a ranked (ordered) list of items. Precision at k identifies the fraction of the first k items in that list that are "relevant." That is:

\[\text{precision at k} = \frac{\text{relevant items in first k items of the list}} {\text{k}}\]

The value of k must be less than or equal to the length of the returned list. Note that the length of the returned list is not part of the calculation.

Relevance is often subjective; even expert human evaluators often disagree on which items are relevant.

Compare with recall at k.

Suppose a large language model is given the following query:

 List the 6 funniest movies of all time in order.

And the large language model returns the list shown in the first two columns of the following table:

Position	Movie	Relevant?
1	The General	Yes
2	Mean Girls	Yes
3	Platoon	No
4	Bridesmaids	Yes
5	Citizen Kane	No
6	This is Spinal Tap	Yes

Two of the first three movies are relevant, so precision at 3 is:

$$\text{precision at 3} = \frac{\text{2}} {\text{3}} = 0.67$$

Three of the first five movies are relevant, so precision at 5 is:

$$\text{precision at 5} = \frac{\text{3}} {\text{5}} = 0.6$$
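A short sketch of precision at k, using the "Relevant?" column of the table above encoded as booleans. The function name and the boolean encoding are illustrative assumptions.

```python
def precision_at_k(relevance: list[bool], k: int) -> float:
    """Fraction of the first k ranked items that are relevant."""
    if not 1 <= k <= len(relevance):
        raise ValueError("k must be between 1 and the length of the list")
    return sum(relevance[:k]) / k


# Relevance of positions 1 through 6 in the returned list.
relevance = [True, True, False, True, False, True]

print(round(precision_at_k(relevance, 3), 2))  # 0.67
```

The denominator is always k, so the length of the returned list matters only as an upper bound on k.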

precision-recall curve

#Metric

A curve of precision versus recall at different classification thresholds .

prediction bias

#Metric

A value indicating how far apart the average of predictions is from the average of labels in the dataset.

Not to be confused with the bias term in machine learning models or with bias in ethics and fairness .

predictive parity

#responsible

#Metric

A fairness metric that checks whether, for a given classifier, the precision rates are equivalent for subgroups under consideration.

For example, a model that predicts college acceptance would satisfy predictive parity for nationality if its precision rate is the same for Lilliputians and Brobdingnagians.

Predictive parity is sometimes also called predictive rate parity.

See "Fairness Definitions Explained" (section 3.2.1) for a more detailed discussion of predictive parity.

predictive rate parity

#responsible

#Metric

Another name for predictive parity .

probability density function

#Metric

R

recall

#Metric

A metric for classification models that answers the following question:

When ground truth was the positive class , what percentage of predictions did the model correctly identify as the positive class?

The formula is:

\[\text{Recall} = \frac{\text{true positives}} {\text{true positives} + \text{false negatives}} \]

where:

true positive means the model correctly predicted the positive class.
false negative means that the model mistakenly predicted the negative class .

For instance, suppose your model made 200 predictions on examples for which ground truth was the positive class. Of these 200 predictions:

180 were true positives.
20 were false negatives.

In this case:

\[\text{Recall} = \frac{\text{180}} {\text{180} + \text{20}} = 0.9 \]

For example, suppose a model evaluated on a class-imbalanced dataset produces the following outcomes:

30 True Positives
20 False Negatives
4,999,000 True Negatives
950 False Positives

The recall of this model is therefore:

 recall = TP / (TP + FN)
 recall = 30 / (30 + 20) = 0.6 = 60%

By contrast, the accuracy of this model is:

 accuracy = (TP + TN) / (TP + TN + FP + FN)
 accuracy = (30 + 4,999,000) / (30 + 4,999,000 + 950 + 20) = 99.98%

That high value of accuracy looks impressive but is essentially meaningless. Recall is a much more useful metric for class-imbalanced datasets than accuracy.

See Classification: Accuracy, recall, precision and related metrics for more information.
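The contrast between recall and accuracy on the class-imbalanced counts above can be reproduced with the following sketch. The helper name and the returned dictionary layout are illustrative; precision is included only for comparison.

```python
def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


m = confusion_metrics(tp=30, tn=4_999_000, fp=950, fn=20)
print(f"accuracy  = {m['accuracy']:.2%}")   # accuracy  = 99.98%
print(f"recall    = {m['recall']:.0%}")     # recall    = 60%
print(f"precision = {m['precision']:.1%}")  # precision = 3.1%
```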

recall at k (recall@k)

#language

#Metric

\[\text{recall at k} = \frac{\text{relevant items in first k items of the list}} {\text{total number of relevant items in the list}}\]

Contrast with precision at k .

Suppose a large language model is given the following query:

 List the 10 funniest movies of all time in order.

And the large language model returns the list shown in the first two columns:

Position	Movie	Relevant?
1	The General	Yes
2	Mean Girls	Yes
3	Platoon	No
4	Bridesmaids	Yes
5	This is Spinal Tap	Yes
6	Airplane!	Yes
7	Groundhog Day	Yes
8	Monty Python and the Holy Grail	Yes
9	Oppenheimer	No
10	Clueless	Yes

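Eight of the movies in the returned list are relevant, and three of them appear in the first four positions, so recall at 4 is:

$$\text{recall at 4} = \frac{\text{3}} {\text{8}} = 0.375$$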

7 of the first 8 movies are very funny, so recall at 8 is:

$$\text{recall at 8} = \frac{\text{7}} {\text{8}} = 0.875$$
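A minimal sketch of recall at k, with the "Relevant?" column of the ten-movie table encoded as booleans. The function name and encoding are assumptions for illustration.

```python
def recall_at_k(relevance: list[bool], k: int) -> float:
    """Relevant items in the first k positions divided by all relevant items in the list."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # assumed convention when the list has no relevant items
    return sum(relevance[:k]) / total_relevant


# Relevance of positions 1 through 10 in the returned list.
relevance = [True, True, False, True, True, True, True, True, False, True]

print(recall_at_k(relevance, 4))  # 0.375
print(recall_at_k(relevance, 8))  # 0.875
```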

ROC (receiver operating characteristic) Curve

#fundamentals

#Metric

A graph of true positive rate versus false positive rate for different classification thresholds in binary classification.

A number line with 8 positive examples on the right side and 7 negative examples on the left.

The ROC curve for the preceding model looks as follows:

In contrast, the following illustration graphs the raw logistic regression values for a terrible model that can't separate negative classes from positive classes at all:

A number line with positive examples and negative classes completely intermixed.

The ROC curve for this model looks as follows:

An ROC curve, which is actually a straight line from (0.0,0.0) to (1.0,1.0).

An ROC curve. The x-axis is False Positive Rate and the y-axis is True Positive Rate. The ROC curve approximates a shaky arc traversing the compass points from West to North.

A numerical metric called AUC summarizes the ROC curve into a single floating-point value.

Root Mean Squared Error (RMSE)

#fundamentals

#Metric

The square root of the Mean Squared Error .

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

#language

#Metric

Each ROUGE family member typically generates the following metrics:

Precision
Recall
F ₁

For details and examples, see ROUGE-L, ROUGE-N, and ROUGE-S.

ROUGE-L

#language

#Metric

$$\text{ROUGE-L recall} = \frac{\text{longest common subsequence}} {\text{number of words in the reference text} }$$

$$\text{ROUGE-L precision} = \frac{\text{longest common subsequence}} {\text{number of words in the generated text} }$$

You can then use F ₁ to roll up ROUGE-L recall and ROUGE-L precision into a single metric:

$$\text{ROUGE-L F} {_1} = \frac{\text{2} * \text{ROUGE-L recall} * \text{ROUGE-L precision}} {\text{ROUGE-L recall} + \text{ROUGE-L precision} }$$

Example calculation of ROUGE-L:

Consider the following reference text and generated text.

Category	Who produced?	Text
Reference text	Human translator	I want to understand a wide variety of things.
Generated text	ML model	I want to learn plenty of things.

Therefore:

The longest common subsequence is 5 ( I want to of things )
The number of words in the reference text is 9.
The number of words in the generated text is 7.

Consequently:

$$\text{ROUGE-L recall} = \frac{\text{5}} {\text{9} } = 0.56$$

$$\text{ROUGE-L precision} = \frac{\text{5}} {\text{7} } = 0.71$$

$$\text{ROUGE-L F} {_1} = \frac{\text{2} * \text{0.56} * \text{0.71}} {\text{0.56} + \text{0.71} } = 0.63$$
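The ROUGE-L example above can be reproduced with a small dynamic-programming sketch. The whitespace tokenizer that lowercases words and strips trailing punctuation is a simplifying assumption; real ROUGE implementations may tokenize differently.

```python
def tokenize(text: str) -> list[str]:
    """Naive tokenizer (assumption): split on whitespace, strip punctuation, lowercase."""
    return [word.strip(".,!?").lower() for word in text.split()]


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(reference: str, generated: str) -> tuple[float, float, float]:
    """Return (recall, precision, F1) based on the longest common subsequence."""
    ref, gen = tokenize(reference), tokenize(generated)
    lcs = lcs_length(ref, gen)
    recall, precision = lcs / len(ref), lcs / len(gen)
    f1 = 2 * recall * precision / (recall + precision) if lcs else 0.0
    return recall, precision, f1


recall, precision, f1 = rouge_l(
    "I want to understand a wide variety of things.",
    "I want to learn plenty of things.",
)
print(round(recall, 2), round(precision, 2))  # 0.56 0.71
```

Computed from the unrounded recall and precision, F ₁ is 0.625. The worked example rounds recall and precision to two decimals first, which gives 0.63.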

Example calculation of ROUGE-Lsum:

Consider the following reference text and generated text.

Category	Who produced?	Text
Reference text	Human translator	The surface of Mars is dry. Nearly all the water is deep underground.
Generated text	ML model	Mars has a dry surface. However, the vast majority of water is underground.

Therefore:

	First sentence	Second sentence
Longest common subsequence	2 (Mars dry)	3 (water is underground)
Sentence length of reference text	6	7
Sentence length of generated text	5	8

Consequently:

$$\text{recall of first sentence} = \frac{\text{2}} {\text{6}} = 0.33 $$

$$\text{recall of second sentence} = \frac{\text{3}} {\text{7}} = 0.43 $$

$$\text{ROUGE-Lsum recall} = \frac{\text{0.33} + \text{0.43}} {\text{2}} = 0.38 $$

$$\text{precision of first sentence} = \frac{\text{2}} {\text{5}} = 0.4 $$

$$\text{precision of second sentence} = \frac{\text{3}} {\text{8}} = 0.38 $$

$$\text{ROUGE-Lsum precision} = \frac{\text{0.4} + \text{0.38}} {\text{2}} = 0.39 $$

$$\text{ROUGE-Lsum F}{_1} = \frac{\text{2} * \text{0.38} * \text{0.39}} {\text{0.38} + \text{0.39}} = 0.38 $$

ROUGE-N

#language

#Metric

A set of metrics within the ROUGE family that compares the shared N-grams of a certain size in the reference text and generated text. For example:

ROUGE-1 measures the number of shared tokens in the reference text and generated text.
ROUGE-2 measures the number of shared bigrams (2-grams) in the reference text and generated text.
ROUGE-3 measures the number of shared trigrams (3-grams) in the reference text and generated text.

You can use the following formulas to calculate ROUGE-N recall and ROUGE-N precision for any member of the ROUGE-N family:

$$\text{ROUGE-N recall} = \frac{\text{number of matching N-grams}} {\text{number of N-grams in the reference text} }$$

$$\text{ROUGE-N precision} = \frac{\text{number of matching N-grams}} {\text{number of N-grams in the generated text} }$$

You can then use F ₁ to roll up ROUGE-N recall and ROUGE-N precision into a single metric:

$$\text{ROUGE-N F}{_1} = \frac{\text{2} * \text{ROUGE-N recall} * \text{ROUGE-N precision}} {\text{ROUGE-N recall} + \text{ROUGE-N precision} }$$

Suppose you decide to use ROUGE-2 to measure the effectiveness of an ML model's translation compared to a human translator's.

Category	Who produced?	Text	Bigrams
Reference text	Human translator	I want to understand a wide variety of things.	I want, want to, to understand, understand a, a wide, wide variety, variety of, of things
Generated text	ML model	I want to learn plenty of things.	I want, want to, to learn, learn plenty, plenty of, of things

Therefore:

The number of matching 2-grams is 3 ( I want , want to , and of things ).
The number of 2-grams in the reference text is 8.
The number of 2-grams in the generated text is 6.

Consequently:

$$\text{ROUGE-2 recall} = \frac{\text{3}} {\text{8} } = 0.375$$

$$\text{ROUGE-2 precision} = \frac{\text{3}} {\text{6} } = 0.5$$

$$\text{ROUGE-2 F}{_1} = \frac{\text{2} * \text{0.375} * \text{0.5}} {\text{0.375} + \text{0.5} } = 0.43$$
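The ROUGE-2 numbers above can be checked with the following sketch. The naive tokenizer and the use of `collections.Counter` to count overlapping N-grams are illustrative choices, not a reference implementation of ROUGE.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive tokenizer (assumption): split on whitespace, strip punctuation, lowercase."""
    return [word.strip(".,!?").lower() for word in text.split()]


def ngrams(tokens: list[str], n: int) -> Counter:
    """Multiset of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference: str, generated: str, n: int) -> tuple[float, float, float]:
    """Return (recall, precision, F1) based on matching n-grams."""
    ref, gen = ngrams(tokenize(reference), n), ngrams(tokenize(generated), n)
    matches = sum((ref & gen).values())  # multiset intersection
    recall = matches / sum(ref.values())
    precision = matches / sum(gen.values())
    f1 = 2 * recall * precision / (recall + precision) if matches else 0.0
    return recall, precision, f1


recall, precision, f1 = rouge_n(
    "I want to understand a wide variety of things.",
    "I want to learn plenty of things.",
    n=2,
)
print(recall, precision, round(f1, 2))  # 0.375 0.5 0.43
```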

ROUGE-S

#language

#Metric

reference text : White clouds
generated text : White billowing clouds

When calculating ROUGE-N, the 2-gram, White clouds doesn't match White billowing clouds . However, when calculating ROUGE-S, White clouds does match White billowing clouds .

R-squared

#Metric

A regression metric indicating how much variation in a label is due to an individual feature or to a feature set. R-squared is a value between 0 and 1, which you can interpret as follows:

An R-squared of 0 means that none of a label's variation is due to the feature set.
An R-squared of 1 means that all of a label's variation is due to the feature set.
An R-squared between 0 and 1 indicates the extent to which the label's variation can be predicted from a particular feature or the feature set. For example, an R-squared of 0.10 means that 10 percent of the variance in the label is due to the feature set, an R-squared of 0.20 means that 20 percent is due to the feature set, and so on.

R-squared is the square of the Pearson correlation coefficient between the values that a model predicted and ground truth .

S

scoring

#recsystems

#Metric

The part of a recommendation system that provides a value or ranking for each item produced by the candidate generation phase.

similarity measure

#clustering

#Metric

In clustering algorithms, the metric used to determine how alike (how similar) any two examples are.

sparsity

#Metric

$$ {\text{sparsity}} = \frac{\text{98}} {\text{100}} = {\text{0.98}} $$

Feature sparsity refers to the sparsity of a feature vector; model sparsity refers to the sparsity of the model weights.

squared hinge loss

#Metric

The square of the hinge loss . Squared hinge loss penalizes outliers more harshly than regular hinge loss.

squared loss

#fundamentals

#Metric

Synonym for L ₂ loss .

T

test loss

#fundamentals

#Metric

A large gap between test loss and training loss or validation loss sometimes suggests that you need to increase the regularization rate .

top-k accuracy

#language

#Metric

The percentage of times that a "target label" appears within the first k positions of generated lists. The lists could be personalized recommendations or a list of items ordered by softmax .

Top-k accuracy is also known as accuracy at k .

Target label	1	2	3	4	5
maple	elm	oak	maple	beech	poplar
dogwood	oak	dogwood	poplar	hickory	maple
oak	oak	basswood	locust	alder	linden
linden	maple	paw-paw	oak	basswood	poplar
oak	locust	linden	oak	maple	paw-paw

The target label appears in the first position only once, so the top-1 accuracy is:

$$\text{top-1 accuracy} = \frac{\text{1}} {\text{5}} = 0.2$$

The target label appears in one of the top three positions four times, so the top-3 accuracy is:

$$\text{top-3 accuracy} = \frac{\text{4}} {\text{5}} = 0.8$$
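The top-1 and top-3 results for the table above can be reproduced with a short sketch. Tree names are rendered in English here, and the `(target, ranked_list)` row format is an assumption made for illustration.

```python
rows = [
    ("maple",   ["elm", "oak", "maple", "beech", "poplar"]),
    ("dogwood", ["oak", "dogwood", "poplar", "hickory", "maple"]),
    ("oak",     ["oak", "basswood", "locust", "alder", "linden"]),
    ("linden",  ["maple", "paw-paw", "oak", "basswood", "poplar"]),
    ("oak",     ["locust", "linden", "oak", "maple", "paw-paw"]),
]


def top_k_accuracy(rows: list[tuple[str, list[str]]], k: int) -> float:
    """Fraction of rows whose target label appears in the first k positions."""
    hits = sum(1 for target, ranked in rows if target in ranked[:k])
    return hits / len(rows)


print(top_k_accuracy(rows, 1))  # 0.2
print(top_k_accuracy(rows, 3))  # 0.8
```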

toxicity

#language

#Metric

training loss

#fundamentals

#Metric

A loss curve plots training loss versus the number of iterations. A loss curve provides the following hints about training:

A downward slope implies that the model is improving.
An upward slope implies that the model is getting worse.
A flat slope implies that the model has reached convergence .

For example, the following somewhat idealized loss curve shows:

A steep downward slope during the initial iterations, which implies rapid model improvement.
A gradually flattening (but still downward) slope until close to the end of training, which implies continued model improvement at a somewhat slower pace than during the initial iterations.
A flat slope towards the end of training, which suggests convergence.

The plot of training loss versus iterations. This loss curve starts with a steep downward slope. The slope gradually flattens until the slope becomes zero.

Although training loss is important, see also generalization .

true negative (TN)

#fundamentals

#Metric

An example in which the model correctly predicts the negative class . For example, the model infers that a particular email message is not spam , and that email message really is not spam .

true positive (TP)

#fundamentals

#Metric

An example in which the model correctly predicts the positive class . For example, the model infers that a particular email message is spam, and that email message really is spam.

true positive rate (TPR)

#fundamentals

#Metric

Synonym for recall. That is:

$$\text{true positive rate} = \frac {\text{true positives}} {\text{true positives} + \text{false negatives}}$$

True positive rate is the y-axis in an ROC curve .

V

validation loss

#fundamentals

#Metric

A metric representing a model's loss on the validation set during a particular iteration of training.

variable importances

#df

#Metric

A set of scores that indicates the relative importance of each feature to the model.

Different variable importance metrics exist, which can inform ML experts about different aspects of models.

W

Wasserstein loss

#Metric

One of the loss functions commonly used in generative adversarial networks , based on the earth mover's distance between the distribution of generated data and real data.

A

accuracy

#fundamentals

#Metric

The number of correct classification predictions divided by the total number of predictions. That is:

$$\text{Accuracy} = \frac{\text{correct predictions}} {\text{correct predictions + incorrect predictions }}$$

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

$$\text{Accuracy} = \frac{\text{40}} {\text{40 + 10}} = \text{80%}$$

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions . So, the accuracy formula for binary classification is as follows:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

where:

TP is the number of true positives (correct predictions).
TN is the number of true negatives (correct predictions).
FP is the number of false positives (incorrect predictions).
FN is the number of false negatives (incorrect predictions).

Compare and contrast accuracy with precision and recall .

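Although accuracy is a valuable metric for some situations, it is highly misleading for others. Notably, accuracy is usually a poor metric for evaluating classification models that process class-imbalanced datasets.

For example, suppose snow falls only 25 days per century in a certain subtropical city. Since days without snow (the negative class) vastly outnumber days with snow (the positive class), the snow dataset for this city is class-imbalanced. Imagine a binary classification model that is supposed to predict either snow or no snow each day but simply predicts "no snow" every day. This model is highly accurate but has no predictive power. The following table summarizes the results for a century of predictions:

Category	Number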
TP	0
TN	36499
FP	0
FN	25

The accuracy of this model is therefore:

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (0 + 36499) / (0 + 36499 + 0 + 25) = 0.9993 = 99.93%

Although 99.93% accuracy seems like a very impressive percentage, the model actually has no predictive power.

Precision and recall are usually more useful metrics than accuracy for evaluating models trained on class-imbalanced datasets.

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.
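As a quick check of the accuracy arithmetic above, the following sketch plugs the table's counts into the binary accuracy formula and also computes recall for the same counts. The function and variable names are illustrative.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


counts = dict(tp=0, tn=36499, fp=0, fn=25)

print(f"{accuracy(**counts):.2%}")  # 99.93%

# Recall for the same counts: the model finds none of the positive examples.
recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(recall)  # 0.0
```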

area under the PR curve

#Metric

See PR AUC (Area under the PR Curve) .

area under the ROC curve

#Metric

See AUC (Area under the ROC curve) .

AUC (Area under the ROC curve)

#fundamentals

#Metric

A number line with 8 positive examples on one side and 9 negative examples on the other side.

Conversely, the following illustration shows the results for a classification model that generated random results. This model has an AUC of 0.5:

Yes, the preceding model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, the following model separates positives from negatives somewhat, and therefore has an AUC somewhere between 0.5 and 1.0:

AUC ignores any value you set for classification threshold . Instead, AUC considers all possible classification thresholds.

AUC represents the area under an ROC curve . For example, the ROC curve for a model that perfectly separates positives from negatives looks as follows:

Conversely, the ROC curve for a classification model that can't separate classes at all is as follows. The area of this gray region is 0.5.

A more typical ROC curve looks approximately like the following:

It would be painstaking to calculate the area under this curve manually, which is why a program typically calculates most AUC values.

More formally, AUC is the probability that a classification model will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.

See Classification: ROC and AUC in Machine Learning Crash Course for more information.
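The probabilistic definition of AUC can be sketched directly by comparing every positive example's score with every negative example's score. The scores below are hypothetical, and counting ties as half a win is a common convention assumed here rather than something the glossary states.

```python
def auc(positive_scores: list[float], negative_scores: list[float]) -> float:
    """Probability that a random positive is scored higher than a random negative."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed convention: a tie counts as half a win
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical model scores, for illustration only.
print(auc([0.9, 0.8, 0.7], [0.3, 0.2, 0.1]))            # 1.0 (perfect separation)
print(round(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]), 3))  # 0.889
```

This pairwise approach takes time proportional to the number of positive-negative pairs, so it is meant as an explanation rather than an efficient implementation.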

average precision at k

#language

#Metric

\[{\text{average precision at k}} = \frac{1}{n} \sum_{i=1}^n {\text{precision at k for each relevant item} } \]

where:

$n$ is the number of relevant items in the list.

Contrast with recall at k .

Suppose a large language model is given the following query:

 List the 6 funniest movies of all time in order.

And the large language model returns the following list:

The General
Mean Girls
Platoon
Bridesmaids
Citizen Kane
This is Spinal Tap

Four of the movies in the returned list are very funny (that is, they are relevant) but two movies are dramas (not relevant). The following table details the results:

Position	Movie	Relevant?	Precision at k
1	The General	Yes	1.0
2	Mean Girls	Yes	1.0
3	Platoon	No	not relevant
4	Bridesmaids	Yes	0.75
5	Citizen Kane	No	not relevant
6	This is Spinal Tap	Yes	0.67

The number of relevant results is 4. Therefore, you can calculate the average precision at 6 as follows:

$${\text{average precision at 6}} = \frac{1}{4} {\text{(1.0 + 1.0 + 0.75 + 0.67)} } $$

$${\text{average precision at 6}} = {\text{~0.85} } $$
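The table and calculation above can be reproduced with a brief sketch that records precision at each relevant position and then averages those values. The boolean encoding of the "Relevant?" column and the function name are illustrative assumptions.

```python
def average_precision_at_k(relevance: list[bool], k: int) -> float:
    """Mean of precision at i over every relevant position i in the first k items."""
    hits = 0
    precisions = []
    for position, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / position)  # precision at this position
    return sum(precisions) / len(precisions) if precisions else 0.0


# Relevance of positions 1 through 6 in the returned list.
relevance = [True, True, False, True, False, True]

print(round(average_precision_at_k(relevance, 6), 2))  # 0.85
```

Averaging this value over several queries gives mean average precision at k.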

B

baseline

#Metric

For a particular problem, the baseline helps model developers quantify the minimal expected performance that a new model must achieve for the new model to be useful.

C

cost

#Metric

Synonym for loss .

counterfactual fairness

#responsible

#Metric

See either of the following for more information:

Fairness: Counterfactual fairness in Machine Learning Crash Course.
When Worlds Collide: Integrating Different Counterfactual Assumptions in Fairness

cross-entropy

#Metric

A generalization of Log Loss to multi-class classification problems . Cross-entropy quantifies the difference between two probability distributions. See also perplexity .

cumulative distribution function (CDF)

#Metric

D

demographic parity

#responsible

#Metric

A fairness metric that is satisfied if the results of a model's classification are not dependent on a given sensitive attribute .

See Fairness: demographic parity in Machine Learning Crash Course for more information.

E

earth mover's distance (EMD)

#Metric

A measure of the relative similarity of two distributions . The lower the earth mover's distance, the more similar the distributions.

edit distance

#language

#Metric

A measurement of how similar two text strings are to each other. In machine learning, edit distance is useful for the following reasons:

Edit distance is easy to compute.
Edit distance can compare two strings known to be similar to each other.
Edit distance can determine the degree to which different strings are similar to a given string.

There are several definitions of edit distance, each using different string operations. See Levenshtein distance for an example.

empirical cumulative distribution function (eCDF or EDF)

#Metric

entropy

#df

#Metric

The entropy of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) has the following formula:

H = -p log p - q log q = -p log p - (1-p) * log (1-p)

where:

H is the entropy.
p is the fraction of "1" examples.
q is the fraction of "0" examples. Note that q = (1 - p)
log is generally log ₂ . In this case, the entropy unit is a bit.

For example, suppose the following:

100 examples contain the value "1"
300 examples contain the value "0"

Therefore, the entropy value is:

p = 0.25
q = 0.75
H = (-0.25)log ₂ (0.25) - (0.75)log ₂ (0.75) = 0.81 bits per example

A set that is perfectly balanced (for example, 200 "0"s and 200 "1"s) would have an entropy of 1.0 bit per example. As a set becomes more imbalanced , its entropy moves towards 0.0.
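The entropy formula above can be checked with a few lines of Python. This is a minimal sketch; the helper name `binary_entropy` is an assumption made for this example, and the logarithm is taken base 2 so that the result is in bits.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a set whose fraction of "1" examples is p."""
    if p in (0.0, 1.0):
        # A set with only one value has no uncertainty.
        return 0.0
    q = 1.0 - p
    return -p * math.log2(p) - q * math.log2(q)


# 100 examples with the value "1" and 300 examples with the value "0".
h = binary_entropy(100 / (100 + 300))
print(f"H = {h:.2f} bits per example")

# A perfectly balanced set (200 of each value).
h_balanced = binary_entropy(200 / 400)
```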

In decision trees , entropy helps formulate information gain to help the splitter select the conditions during the growth of a classification decision tree.

Compare entropy with:

gini impurity
cross-entropy loss function

Entropy is often called Shannon's entropy .

See Exact splitter for binary classification with numerical features in the Decision Forests course for more information.

equality of opportunity

#responsible

#Metric

Equality of opportunity is related to equalized odds , which requires that both the true positive rates and false positive rates are the same for all groups.

For example, suppose 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 1. Lilliputian applicants (90% are qualified)

	Qualified	Unqualified
Admitted	45	3
Rejected	45	7
Total	90	10
Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 7/10 = 70%
Total percentage of Lilliputian students admitted: (45+3)/100 = 48%

Table 2. Brobdingnagian applicants (10% are qualified):

	Qualified	Unqualified
Admitted	5	9
Rejected	5	81
Total	10	90
Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 81/90 = 90%
Total percentage of Brobdingnagian students admitted: (5+9)/100 = 14%

The preceding examples satisfy equality of opportunity for acceptance of qualified students because qualified Lilliputians and Brobdingnagians both have a 50% chance of being admitted.

While equality of opportunity is satisfied, the following two fairness metrics are not satisfied:

demographic parity : Lilliputians and Brobdingnagians are admitted to the university at different rates; 48% of Lilliputians students are admitted, but only 14% of Brobdingnagian students are admitted.
equalized odds : While qualified Lilliputian and Brobdingnagian students both have the same chance of being admitted, the additional constraint that unqualified Lilliputians and Brobdingnagians both have the same chance of being rejected is not satisfied. Unqualified Lilliputians have a 70% rejection rate, whereas unqualified Brobdingnagians have a 90% rejection rate.
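The three fairness checks discussed above can be reproduced from the counts in Tables 1 and 2. The following is a minimal sketch, not a fairness library; the function name `admission_rates` and the dictionary keys are assumptions made for this example, and exact equality of rates is tested with `math.isclose` purely for illustration.

```python
import math


def admission_rates(qualified_admitted, qualified_rejected,
                    unqualified_admitted, unqualified_rejected):
    """Return the three rates used in the admissions example."""
    qualified = qualified_admitted + qualified_rejected
    unqualified = unqualified_admitted + unqualified_rejected
    return {
        "qualified_admit_rate": qualified_admitted / qualified,
        "unqualified_reject_rate": unqualified_rejected / unqualified,
        "overall_admit_rate": (qualified_admitted + unqualified_admitted)
        / (qualified + unqualified),
    }


lilliputians = admission_rates(45, 45, 3, 7)      # Table 1
brobdingnagians = admission_rates(5, 5, 9, 81)    # Table 2


def same(key):
    """True if both groups have (numerically) the same rate for `key`."""
    return math.isclose(lilliputians[key], brobdingnagians[key])


equality_of_opportunity = same("qualified_admit_rate")
demographic_parity = same("overall_admit_rate")
equalized_odds = equality_of_opportunity and same("unqualified_reject_rate")

print(equality_of_opportunity, demographic_parity, equalized_odds)
```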

See Fairness: Equality of opportunity in Machine Learning Crash Course for more information.

equalized odds

#responsible

#Metric

Equalized odds is related to equality of opportunity , which only focuses on error rates for a single class (positive or negative).

Suppose 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 3. Lilliputian applicants (90% are qualified)

	Qualified	Unqualified
Admitted	45	2
Rejected	45	8
Total	90	10
Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 8/10 = 80%
Total percentage of Lilliputian students admitted: (45+2)/100 = 47%

Table 4. Brobdingnagian applicants (10% are qualified):

	Qualified	Unqualified
Admitted	5	18
Rejected	5	72
Total	10	90
Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 72/90 = 80%
Total percentage of Brobdingnagian students admitted: (5+18)/100 = 23%

evals

#language

#generativeAI

#Metric

Primarily used as an abbreviation for LLM evaluations . More broadly, evals is an abbreviation for any form of evaluation .

evaluation

#language

#generativeAI

#Metric

The process of measuring a model's quality or comparing different models against each other.

F

F ₁

#Metric

A "roll-up" binary classification metric that relies on both precision and recall. Here is the formula:

$$F{_1} = \frac{\text{2 * precision * recall}} {\text{precision + recall}}$$


Suppose precision and recall have the following values:

precision = 0.6
recall = 0.4

You calculate F ₁ as follows:

$$F{_1} = \frac{\text{2 * 0.6 * 0.4}} {\text{0.6 + 0.4}} = 0.48$$

precision = 0.9
recall = 0.1

$$F{_1} = \frac{\text{2 * 0.9 * 0.1}} {\text{0.9 + 0.1}} = 0.18$$
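Both worked examples can be verified with a short Python sketch. The function name `f1_score` is an assumption made for this example (it is not taken from any particular library), and returning 0.0 when both inputs are zero is a convention chosen here to avoid division by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.6, 0.4), 2))
print(round(f1_score(0.9, 0.1), 2))
```

The second pair shows that F ₁ stays low when precision and recall are far apart, even though one of them is high.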

fairness metric

#responsible

#Metric

A mathematical definition of "fairness" that is measurable. Some commonly used fairness metrics include:

Many fairness metrics are mutually exclusive; see incompatibility of fairness metrics .

false negative (FN)

#fundamentals

#Metric

false negative rate

#Metric

The proportion of actual positive examples for which the model mistakenly predicted the negative class. The following formula calculates the false negative rate:

$$\text{false negative rate} = \frac{\text{false negatives}}{\text{false negatives} + \text{true positives}}$$

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive (FP)

#fundamentals

#Metric

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive rate (FPR)

#fundamentals

#Metric

The proportion of actual negative examples for which the model mistakenly predicted the positive class. The following formula calculates the false positive rate:

$$\text{false positive rate} = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}$$

The false positive rate is the x-axis in an ROC curve .

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

feature importances

#df

#Metric

Synonym for variable importances .

fraction of successes

#generativeAI

#Metric

Although fraction of successes is broadly useful throughout statistics, within ML, this metric is primarily useful for measuring verifiable tasks like code generation or math problems.

G

gini impurity

#df

#Metric

Gini impurity is also called gini index , or simply gini .

Mathematical details about gini impurity:

I = 1 - (p ² + q ² ) = 1 - (p ² + (1-p) ² )

where:

I is the gini impurity.
p is the fraction of "1" examples.
q is the fraction of "0" examples. Note that q = 1-p

For example, consider the following dataset:

100 labels (0.25 of the dataset) contain the value "1"
300 labels (0.75 of the dataset) contain the value "0"

Therefore, the gini impurity is:

p = 0.25
q = 0.75
I = 1 - (0.25 ² + 0.75 ² ) = 0.375

Consequently, a random label from the same dataset would have a 37.5% chance of being misclassified, and a 62.5% chance of being properly classified.

A perfectly balanced label (for example, 200 "0"s and 200 "1"s) would have a gini impurity of 0.5. A highly imbalanced label would have a gini impurity close to 0.0.
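As a quick check of the numbers above, here is a minimal Python sketch of the binary gini impurity formula. The function name `gini_impurity` is an assumption made for this example.

```python
def gini_impurity(p):
    """Gini impurity of a binary label whose fraction of "1" examples is p."""
    q = 1.0 - p
    return 1.0 - (p ** 2 + q ** 2)


# 100 labels with the value "1" and 300 labels with the value "0".
print(gini_impurity(100 / 400))

# A perfectly balanced label (200 of each value).
print(gini_impurity(200 / 400))
```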

H

hinge loss

#Metric

$$\text{loss} = \text{max}(0, 1 - (y * y'))$$

where y is the true label, either -1 or +1, and y' is the raw output of the classification model :

$$y' = b + w_1x_1 + w_2x_2 + … w_nx_n$$

Consequently, a plot of hinge loss versus (y * y') looks as follows:

I

incompatibility of fairness metrics

#responsible

#Metric

See "On the (im)possibility of fairness" for a more detailed discussion of the incompatibility of fairness metrics.

individual fairness

#responsible

#Metric

See "Fairness Through Awareness" for a more detailed discussion of individual fairness.

information gain

#df

#Metric

For example, consider the following entropy values:

entropy of parent node = 0.6
entropy of one child node with 16 relevant examples = 0.2
entropy of another child node with 24 relevant examples = 0.1

So 40% of the examples (16 of 40) are in one child node and 60% (24 of 40) are in the other child node. Therefore:

weighted entropy sum of child nodes = (0.4 * 0.2) + (0.6 * 0.1) = 0.14

So, the information gain is:

information gain = entropy of parent node - weighted entropy sum of child nodes
information gain = 0.6 - 0.14 = 0.46

Most splitters seek to create conditions that maximize information gain.
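The worked example above can be expressed as a small Python function. This is a sketch under the assumption that each child node is described by its example count and its entropy; the function name `information_gain` and the `(count, entropy)` pair format are hypothetical.

```python
def information_gain(parent_entropy, children):
    """children is a list of (example_count, entropy) pairs, one per child node."""
    total = sum(count for count, _ in children)
    # Weight each child's entropy by its share of the examples.
    weighted_entropy = sum(count / total * entropy for count, entropy in children)
    return parent_entropy - weighted_entropy


gain = information_gain(0.6, [(16, 0.2), (24, 0.1)])
print(round(gain, 2))
```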

inter-rater agreement

#Metric

See Categorical data: Common issues in Machine Learning Crash Course for more information.

L

L ₁ loss

#fundamentals

#Metric

Actual value of example	Model's predicted value	Absolute value of delta
7	6	1
5	4	1
8	11	3
4	6	2
9	8	1
		8 = L ₁ loss

L ₁ loss is less sensitive to outliers than L ₂ loss .

The Mean Absolute Error is the average L ₁ loss per example.

$$ L_1 loss = \sum_{i=1}^n | y_i - \hat{y}_i |$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.

See Linear regression: Loss in Machine Learning Crash Course for more information.

L ₂ loss

#fundamentals

#Metric

Actual value of example	Model's predicted value	Square of delta
7	6	1
5	4	1
8	11	9
4	6	4
9	8	1
		16 = L ₂ loss

Regression models typically use L ₂ loss as the loss function.

The Mean Squared Error is the average L ₂ loss per example. Squared loss is another name for L ₂ loss.

$$ L_2 loss = \sum_{i=1}^n {(y_i - \hat{y}_i)}^2$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

LLM evaluations (evals)

#language

#generativeAI

#Metric

A set of metrics and benchmarks for assessing the performance of large language models (LLMs). At a high level, LLM evaluations:

Help researchers identify areas where LLMs need improvement.
Are useful in comparing different LLMs and identifying the best LLM for a particular task.
Help ensure that LLMs are safe and ethical to use.

See Large language models (LLMs) in Machine Learning Crash Course for more information.

loss

#fundamentals

#Metric

During the training of a supervised model , a measure of how far a model's prediction is from its label .

A loss function calculates the loss.

See Linear regression: Loss in Machine Learning Crash Course for more information.

loss function

#fundamentals

#Metric

The goal of training is typically to minimize the loss that a loss function returns.

Many different kinds of loss functions exist. Pick the appropriate loss function for the kind of model you are building. For example:

L ₂ loss (or Mean Squared Error ) is the loss function for linear regression .
Log Loss is the loss function for logistic regression .

M

Mean Absolute Error (MAE)

#Metric

The average loss per example when L ₁ loss is used. Calculate Mean Absolute Error as follows:

Calculate the L ₁ loss for a batch.
Divide the L ₁ loss by the number of examples in the batch.

$$\text{Mean Absolute Error} = \frac{1}{n}\sum_{i=1}^n | y_i - \hat{y}_i |$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.

For example, consider the calculation of L ₁ loss on the following batch of five examples:

Actual value of example	Model's predicted value	Loss (difference between actual and predicted)
7	6	1
5	4	1
8	11	3
4	6	2
9	8	1
		8 = L ₁ loss

So, L ₁ loss is 8 and the number of examples is 5. Therefore, the Mean Absolute Error is:

 Mean Absolute Error = L₁ loss / Number of Examples
 Mean Absolute Error = 8/5 = 1.6
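The same calculation in Python, as a minimal sketch using the five-example batch from the table. The variable names are assumptions made for this example.

```python
actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]

# L1 loss: sum of the absolute differences between labels and predictions.
l1_loss = sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted))

# Mean Absolute Error: L1 loss divided by the number of examples.
mae = l1_loss / len(actual)

print(l1_loss, mae)
```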

Contrast Mean Absolute Error with Mean Squared Error and Root Mean Squared Error .

mean average precision at k (mAP@k)

#language

#generativeAI

#Metric

Although the phrase "mean average" sounds redundant, the name of the metric is appropriate. After all, this metric finds the mean of multiple average precision at k values.


0.73
0.77
0.67
0.82
0.76

The mean Average Precision at K is therefore:

$$\text{mean } = \frac{\text{0.73 + 0.77 + 0.67 + 0.82 + 0.76}} {\text{5}} = \text{0.75}$$

Mean Squared Error (MSE)

#Metric

The average loss per example when L ₂ loss is used. Calculate Mean Squared Error as follows:

Calculate the L ₂ loss for a batch.
Divide the L ₂ loss by the number of examples in the batch.

$$\text{Mean Squared Error} = \frac{1}{n}\sum_{i=1}^n {(y_i - \hat{y}_i)}^2$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the model's prediction for $y$.

For example, consider the loss on the following batch of five examples:

Actual value	Model's prediction	Loss	Squared loss
7	6	1	1
5	4	1	1
8	11	3	9
4	6	2	4
9	8	1	1
			16 = L ₂ loss

Therefore, the Mean Squared Error is:

 Mean Squared Error = L₂ loss / Number of Examples
 Mean Squared Error = 16/5 = 3.2
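The same calculation in Python, as a minimal sketch using the five-example batch from the table. The variable names are assumptions made for this example.

```python
actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]

# L2 loss: sum of the squared differences between labels and predictions.
l2_loss = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))

# Mean Squared Error: L2 loss divided by the number of examples.
mse = l2_loss / len(actual)

print(l2_loss, mse)
```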

Mean Squared Error is a popular loss function for training, particularly for linear regression.

Contrast Mean Squared Error with Mean Absolute Error and Root Mean Squared Error .

TensorFlow Playground uses Mean Squared Error to calculate loss values.

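More details about outliers:

Outliers don't influence Mean Absolute Error as strongly as Mean Squared Error. For example, in the preceding batch, the example with a loss of 3 accounts for only ~38% of the Mean Absolute Error (3 of the total L ₁ loss of 8), whereas its squared loss of 9 accounts for ~56% of the Mean Squared Error (9 of the total L ₂ loss of 16).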

Clipping is one way to prevent extreme outliers from damaging your model's predictive ability.

metric

#TensorFlow

#Metric

A statistic that you care about.

An objective is a metric that a machine learning system tries to optimize.

Metrics API (tf.metrics)

#Metric

A TensorFlow API for evaluating models. For example, tf.metrics.accuracy determines how often a model's predictions match labels.

minimax loss

#Metric

A loss function for generative adversarial networks , based on the cross-entropy between the distribution of generated data and real data.

Minimax loss is used in the first paper to describe generative adversarial networks.

See Loss Functions in the Generative Adversarial Networks course for more information.

model capacity

#Metric

N

negative class

#fundamentals

#Metric

The negative class in a medical test might be "not tumor."
The negative class in an email classification model might be "not spam."

Contrast with positive class .

O

objective

#Metric

A metric that your algorithm is trying to optimize.

objective function

#Metric

In some cases, the goal is to maximize the objective function. For example, if the objective function is accuracy, the goal is to maximize accuracy.

P

pass at k (pass@k)

#Metric

If one or more of those solutions pass the unit test, then the LLM Passes that code generation challenge.
If none of the solutions pass the unit test, then the LLM Fails that code generation challenge.

The formula for pass at k is as follows:

\[\text{pass at k} = \frac{\text{total number of passes}} {\text{total number of challenges}}\]

In general, higher values of k produce higher pass at k scores; however, higher values of k require more large language model and unit testing resources.

Suppose a software engineer asks a large language model to generate k=10 solutions for each of n=50 challenging coding problems. Here are the results:

30 Passes
20 Fails

The pass at 10 score is therefore:

$$\text{pass at 10} = \frac{\text{30}} {\text{50}} = 0.6$$
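The scoring rule above can be sketched in Python. This is an illustration only: the function name `pass_at_k` and the representation of results as one list of booleans per challenge (one entry per generated solution) are assumptions, and the synthetic data simply reproduces the 30 passes and 20 fails from the example.

```python
def pass_at_k(results_per_challenge):
    """Fraction of challenges for which at least one of the k solutions passed."""
    passes = sum(1 for results in results_per_challenge if any(results))
    return passes / len(results_per_challenge)


k = 10
# Hypothetical outcomes: 30 challenges where one of the 10 solutions passes,
# and 20 challenges where none of the 10 solutions passes.
solved = [[False] * (k - 1) + [True] for _ in range(30)]
unsolved = [[False] * k for _ in range(20)]

score = pass_at_k(solved + unsolved)
print(score)
```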

performance

#Metric

Overloaded term with the following meanings:

The standard meaning within software engineering. Namely: How fast (or efficiently) does this piece of software run?
The meaning within machine learning. Here, performance answers the following question: How correct is this model ? That is, how good are the model's predictions?

permutation variable importances

#df

#Metric

A type of variable importance that evaluates the increase in the prediction error of a model after permuting the feature's values. Permutation variable importance is a model-independent metric.

perplexity

#Metric

Perplexity is related to cross-entropy as follows:

$$P= 2^{\text{cross entropy}}$$

positive class

#fundamentals

#Metric

The class you are testing for.

For example, the positive class in a cancer model might be "tumor." The positive class in an email classification model might be "spam."

Contrast with negative class .


Admittedly, you're simultaneously testing for both the positive and negative classes.

PR AUC (area under the PR curve)

#Metric

Area under the interpolated precision-recall curve , obtained by plotting (recall, precision) points for different values of the classification threshold .

precision

#Metric

A metric for classification models that answers the following question:

When the model predicted the positive class , what percentage of the predictions were correct?

Here is the formula:

$$\text{Precision} = \frac{\text{true positives}} {\text{true positives} + \text{false positives}}$$

where:

true positive means the model correctly predicted the positive class.
false positive means the model mistakenly predicted the positive class.

For example, suppose a model made 200 positive predictions. Of these 200 positive predictions:

150 were true positives.
50 were false positives.

In this case:

$$\text{Precision} = \frac{\text{150}} {\text{150} + \text{50}} = 0.75$$

Contrast with accuracy and recall .

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

precision at k (precision@k)

#language

#Metric

A metric for evaluating a ranked (ordered) list of items. Precision at k identifies the fraction of the first k items in that list that are "relevant." That is:

\[\text{precision at k} = \frac{\text{relevant items in first k items of the list}} {\text{k}}\]

The value of k must be less than or equal to the length of the returned list. Note that the length of the returned list is not part of the calculation.

Relevance is often subjective; even expert human evaluators often disagree on which items are relevant.

Contrast with recall at k.

Suppose a large language model is given the following query:

 List the 6 funniest movies of all time in order.

And the large language model returns the list shown in the first two columns of the following table:

Position	Movie	Relevant?
1	The General	Yes
2	Mean Girls	Yes
3	Platoon	No
4	Bridesmaids	Yes
5	Citizen Kane	No
6	This is Spinal Tap	Yes

Two of the first three movies are relevant, so precision at 3 is:

$$\text{precision at 3} = \frac{\text{2}} {\text{3}} = 0.67$$

Three of the first five movies are very funny, so precision at 5 is:

$$\text{precision at 5} = \frac{\text{3}} {\text{5}} = 0.6$$
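Precision at k is straightforward to compute once each returned item has been judged relevant or not. The following Python sketch assumes the relevance judgments are given as a list of booleans in ranked order; the function name `precision_at_k` is hypothetical.

```python
def precision_at_k(relevant_flags, k):
    """Fraction of the first k items that are relevant."""
    if not 1 <= k <= len(relevant_flags):
        raise ValueError("k must be between 1 and the length of the returned list")
    return sum(relevant_flags[:k]) / k


# Relevance of the six returned movies, in ranked order.
flags = [True, True, False, True, False, True]

print(round(precision_at_k(flags, 3), 2))
print(round(precision_at_k(flags, 4), 2))
```

The `ValueError` reflects the requirement that k be less than or equal to the length of the returned list.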

precision-recall curve

#Metric

A curve of precision versus recall at different classification thresholds .

prediction bias

#Metric

A value indicating how far apart the average of predictions is from the average of labels in the dataset.

Not to be confused with the bias term in machine learning models or with bias in ethics and fairness .

predictive parity

#responsible

#Metric

A fairness metric that checks whether, for a given classifier, the precision rates are equivalent for subgroups under consideration.

For example, a model that predicts college acceptance would satisfy predictive parity for nationality if its precision rate is the same for Lilliputians and Brobdingnagians.

Predictive parity is sometimes also called predictive rate parity.

See "Fairness Definitions Explained" (section 3.2.1) for a more detailed discussion of predictive parity.

predictive rate parity

#responsible

#Metric

Another name for predictive parity .

probability density function

#Metric

R

recall

#Metric

A metric for classification models that answers the following question:

When ground truth was the positive class , what percentage of predictions did the model correctly identify as the positive class?

Here is the formula:

\[\text{Recall} = \frac{\text{true positives}} {\text{true positives} + \text{false negatives}} \]

where:

true positive means the model correctly predicted the positive class.
false negative means that the model mistakenly predicted the negative class .

For instance, suppose your model made 200 predictions on examples for which ground truth was the positive class. Of these 200 predictions:

180 were true positives.
20 were false negatives.

In this case:

\[\text{Recall} = \frac{\text{180}} {\text{180} + \text{20}} = 0.9 \]

Click the icon for notes about class-imbalanced datasets.

30 True Positives
20 False Negatives
4,999,000 True Negatives
950 False Positives

The recall of this model is therefore:

 recall = TP / (TP + FN)
 recall = 30 / (30 + 20) = 0.6 = 60%

By contrast, the accuracy of this model is:

 accuracy = (TP + TN) / (TP + TN + FP + FN)
 accuracy = (30 + 4,999,000) / (30 + 4,999,000 + 950 + 20) = 99.98%

That high value of accuracy looks impressive but is essentially meaningless. Recall is a much more useful metric for class-imbalanced datasets than accuracy.
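The contrast between recall and accuracy on this class-imbalanced example can be reproduced with a short Python sketch. The function names are assumptions made for this example; precision is included only for comparison and is not part of the original calculation.

```python
def recall(tp, fn):
    return tp / (tp + fn)


def precision(tp, fp):
    return tp / (tp + fp)


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


# Counts from the class-imbalanced example above.
tp, fn, tn, fp = 30, 20, 4_999_000, 950

print(f"recall    = {recall(tp, fn):.2%}")
print(f"accuracy  = {accuracy(tp, tn, fp, fn):.2%}")
print(f"precision = {precision(tp, fp):.2%}")
```

Accuracy is dominated by the roughly five million true negatives, while recall considers only the 50 actual positives.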

See Classification: Accuracy, recall, precision and related metrics for more information.

recall at k (recall@k)

#language

#Metric

\[\text{recall at k} = \frac{\text{relevant items in first k items of the list}} {\text{total number of relevant items in the list}}\]

Contrast with precision at k .


Suppose a large language model is given the following query:

 List the 10 funniest movies of all time in order.

And the large language model returns the list shown in the first two columns:

Position	Movie	Relevant?
1	The General	Yes
2	Mean Girls	Yes
3	Platoon	No
4	Bridesmaids	Yes
5	This is Spinal Tap	Yes
6	Airplane!	Yes
7	Groundhog Day	Yes
8	Monty Python and the Holy Grail	Yes
9	Oppenheimer	No
10	Clueless	Yes

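Eight of the movies in the preceding list are relevant. Three of the first four movies are relevant, so recall at 4 is:

$$\text{recall at 4} = \frac{\text{3}} {\text{8}} = 0.375$$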

7 of the first 8 movies are very funny, so recall at 8 is:

$$\text{recall at 8} = \frac{\text{7}} {\text{8}} = 0.875$$
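Recall at k can be computed from the same kind of relevance list used for precision at k. The following Python sketch assumes relevance is given as booleans in ranked order; the function name `recall_at_k` is hypothetical, and returning 0.0 for a list with no relevant items is a convention chosen here.

```python
def recall_at_k(relevant_flags, k):
    """Relevant items in the first k positions / all relevant items in the list."""
    total_relevant = sum(relevant_flags)
    if total_relevant == 0:
        return 0.0
    return sum(relevant_flags[:k]) / total_relevant


# Relevance of the ten returned movies, in ranked order.
flags = [True, True, False, True, True, True, True, True, False, True]

print(recall_at_k(flags, 4))
print(recall_at_k(flags, 8))
```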

ROC (receiver operating characteristic) Curve

#fundamentals

#Metric

A graph of true positive rate versus false positive rate for different classification thresholds in binary classification.

A number line with 8 positive examples on the right side and 7 negative examples on the left.

The ROC curve for the preceding model looks as follows:

In contrast, the following illustration graphs the raw logistic regression values for a terrible model that can't separate negative classes from positive classes at all:

A number line with positive examples and negative classes completely intermixed.

The ROC curve for this model looks as follows:

An ROC curve, which is actually a straight line from (0.0,0.0) to (1.0,1.0).

An ROC curve. The x-axis is False Positive Rate and the y-axis is True Positive Rate. The ROC curve approximates a shaky arc traversing the compass points from West to North.

A numerical metric called AUC summarizes the ROC curve into a single floating-point value.

Root Mean Squared Error (RMSE)

#fundamentals

#Metric

The square root of the Mean Squared Error .

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

#language

#Metric

Each ROUGE family member typically generates the following metrics:

Precision
Recall
F ₁

For details and examples, see:

ROUGE-L

#language

#Metric

$$\text{ROUGE-L recall} = \frac{\text{length of the longest common subsequence}} {\text{number of words in the reference text} }$$

$$\text{ROUGE-L precision} = \frac{\text{length of the longest common subsequence}} {\text{number of words in the generated text} }$$

You can then use F ₁ to roll up ROUGE-L recall and ROUGE-L precision into a single metric:

$$\text{ROUGE-L F} {_1} = \frac{\text{2} * \text{ROUGE-L recall} * \text{ROUGE-L precision}} {\text{ROUGE-L recall} + \text{ROUGE-L precision} }$$

Example calculation of ROUGE-L:

Consider the following reference text and generated text.

Category	Who produced?	Text
Reference text	Human translator	I want to understand a wide variety of things.
Generated text	ML model	I want to learn plenty of things.

Therefore:

The longest common subsequence is 5 ( I want to of things )
The number of words in the reference text is 9.
The number of words in the generated text is 7.

Consequently:

$$\text{ROUGE-L recall} = \frac{\text{5}} {\text{9} } = 0.56$$

$$\text{ROUGE-L precision} = \frac{\text{5}} {\text{7} } = 0.71$$

$$\text{ROUGE-L F} {_1} = \frac{\text{2} * \text{0.56} * \text{0.71}} {\text{0.56} + \text{0.71} } = 0.63$$
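The ROUGE-L example can be reproduced with a standard dynamic-programming longest common subsequence. This is a minimal sketch, not the official ROUGE implementation: the tokenizer (lowercasing, dropping periods, splitting on whitespace) and the function names `tokenize`, `lcs_length`, and `rouge_l` are simplifying assumptions made for this example.

```python
def tokenize(text):
    """Very rough tokenizer: lowercase, drop periods, split on whitespace."""
    return text.lower().replace(".", "").split()


def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(reference, generated):
    ref, gen = tokenize(reference), tokenize(generated)
    lcs = lcs_length(ref, gen)
    recall = lcs / len(ref)
    precision = lcs / len(gen)
    f1 = 2 * recall * precision / (recall + precision) if lcs else 0.0
    return recall, precision, f1


recall, precision, f1 = rouge_l(
    "I want to understand a wide variety of things.",
    "I want to learn plenty of things.",
)
print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}")
```

Because the code does not round recall and precision before combining them, its F ₁ value can differ in the last digit from a hand calculation that uses the rounded 0.56 and 0.71.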

Example calculation of ROUGE-Lsum:

Consider the following reference text and generated text.

Category	Who produced?	Text
Reference text	Human translator	The surface of Mars is dry. Nearly all the water is deep underground.
Generated text	ML model	Mars has a dry surface. However, the vast majority of water is underground.

Therefore:

	First sentence	Second sentence
Longest common subsequence	2 (Mars dry)	3 (water is underground)
Sentence length of reference text	6	7
Sentence length of generated text	5	8

Consequently:

$$\text{recall of first sentence} = \frac{\text{2}} {\text{6}} = 0.33 $$

$$\text{recall of second sentence} = \frac{\text{3}} {\text{7}} = 0.43 $$

$$\text{ROUGE-Lsum recall} = \frac{\text{0.33} + \text{0.43}} {\text{2}} = 0.38 $$

$$\text{precision of first sentence} = \frac{\text{2}} {\text{5}} = 0.4 $$

$$\text{precision of second sentence} = \frac{\text{3}} {\text{8}} = 0.38 $$

$$\text{ROUGE-Lsum precision} = \frac{\text{0.4} + \text{0.38}} {\text{2}} = 0.39 $$

$$\text{ROUGE-Lsum F}{_1} = \frac{\text{2} * \text{0.38} * \text{0.39}} {\text{0.38} + \text{0.39}} = 0.38 $$

ROUGE-N

#language

#Metric

A set of metrics within the ROUGE family that compares the shared N-grams of a certain size in the reference text and generated text. For example:

ROUGE-1 measures the number of shared tokens in the reference text and generated text.
ROUGE-2 measures the number of shared bigrams (2-grams) in the reference text and generated text.
ROUGE-3 measures the number of shared trigrams (3-grams) in the reference text and generated text.

You can use the following formulas to calculate ROUGE-N recall and ROUGE-N precision for any member of the ROUGE-N family:

$$\text{ROUGE-N recall} = \frac{\text{number of matching N-grams}} {\text{number of N-grams in the reference text} }$$

$$\text{ROUGE-N precision} = \frac{\text{number of matching N-grams}} {\text{number of N-grams in the generated text} }$$

You can then use F ₁ to roll up ROUGE-N recall and ROUGE-N precision into a single metric:

$$\text{ROUGE-N F}{_1} = \frac{\text{2} * \text{ROUGE-N recall} * \text{ROUGE-N precision}} {\text{ROUGE-N recall} + \text{ROUGE-N precision} }$$


Suppose you decide to use ROUGE-2 to measure the effectiveness of an ML model's translation compared to a human translator's.

Category	Who produced?	Text	Bigrams
Reference text	Human translator	I want to understand a wide variety of things.	I want, want to, to understand, understand a, a wide, wide variety, variety of, of things
Generated text	ML model	I want to learn plenty of things.	I want, want to, to learn, learn plenty, plenty of, of things

Therefore:

The number of matching 2-grams is 3 ( I want , want to , and of things ).
The number of 2-grams in the reference text is 8.
The number of 2-grams in the generated text is 6.

Consequently:

$$\text{ROUGE-2 recall} = \frac{\text{3}} {\text{8} } = 0.375$$

$$\text{ROUGE-2 precision} = \frac{\text{3}} {\text{6} } = 0.5$$

$$\text{ROUGE-2 F}{_1} = \frac{\text{2} * \text{0.375} * \text{0.5}} {\text{0.375} + \text{0.5} } = 0.43$$
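The ROUGE-2 example can be reproduced by counting shared bigrams. This is a minimal sketch, not the official ROUGE implementation: the tokenizer and the function names `tokenize`, `ngrams`, and `rouge_n` are assumptions made for this example, and both texts are assumed to contain at least N tokens.

```python
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase, drop periods, split on whitespace."""
    return text.lower().replace(".", "").split()


def ngrams(tokens, n):
    """All contiguous n-token tuples in the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(reference, generated, n):
    ref = Counter(ngrams(tokenize(reference), n))
    gen = Counter(ngrams(tokenize(generated), n))
    # Multiset intersection counts each shared n-gram at most as often
    # as it appears in both texts.
    matches = sum((ref & gen).values())
    recall = matches / sum(ref.values())
    precision = matches / sum(gen.values())
    f1 = 2 * recall * precision / (recall + precision) if matches else 0.0
    return recall, precision, f1


recall, precision, f1 = rouge_n(
    "I want to understand a wide variety of things.",
    "I want to learn plenty of things.",
    n=2,
)
print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.2f}")
```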

ROUGE-S

#language

#Metric

reference text : White clouds
generated text : White billowing clouds

When calculating ROUGE-N, the 2-gram, White clouds doesn't match White billowing clouds . However, when calculating ROUGE-S, White clouds does match White billowing clouds .

R-squared

#Metric

A regression metric indicating how much variation in a label is due to an individual feature or to a feature set. R-squared is a value between 0 and 1, which you can interpret as follows:

An R-squared of 0 means that none of a label's variation is due to the feature set.
An R-squared of 1 means that all of a label's variation is due to the feature set.
An R-squared between 0 and 1 indicates the extent to which the label's variation can be predicted from a particular feature or the feature set. For example, an R-squared of 0.10 means that 10 percent of the variance in the label is due to the feature set, an R-squared of 0.20 means that 20 percent is due to the feature set, and so on.

R-squared is the square of the Pearson correlation coefficient between the values that a model predicted and ground truth .

S

scoring

#recsystems

#Metric

The part of a recommendation system that provides a value or ranking for each item produced by the candidate generation phase.

similarity measure

#clustering

#Metric

In clustering algorithms, the metric used to determine how alike (how similar) any two examples are.

sparsity

#Metric

$$ {\text{sparsity}} = \frac{\text{98}} {\text{100}} = {\text{0.98}} $$

Feature sparsity refers to the sparsity of a feature vector; model sparsity refers to the sparsity of the model weights.

squared hinge loss

#Metric

The square of the hinge loss . Squared hinge loss penalizes outliers more harshly than regular hinge loss.

squared loss

#fundamentals

#Metric

Synonym for L ₂ loss .

T

test loss

#fundamentals

#Metric

A large gap between test loss and training loss or validation loss sometimes suggests that you need to increase the regularization rate .

top-k accuracy

#language

#Metric

The percentage of times that a "target label" appears within the first k positions of generated lists. The lists could be personalized recommendations or a list of items ordered by softmax .

Top-k accuracy is also known as accuracy at k .


Target label	1	2	3	4	5
maple	elm	oak	maple	beech	poplar
dogwood	oak	dogwood	poplar	hickory	maple
oak	oak	basswood	locust	alder	linden
linden	maple	paw-paw	oak	basswood	poplar
oak	locust	linden	oak	maple	paw-paw

The target label appears in the first position only once, so the top-1 accuracy is:

$$\text{top-1 accuracy} = \frac{\text{1}} {\text{5}} = 0.2$$

The target label appears in one of the top three positions four times, so the top-3 accuracy is:

$$\text{top-3 accuracy} = \frac{\text{4}} {\text{5}} = 0.8$$
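Both values can be reproduced with a short Python sketch. The function name `top_k_accuracy` and the `(target, ranked_list)` row format are assumptions made for this example, and the tree labels are written in English.

```python
def top_k_accuracy(rows, k):
    """Fraction of rows whose target label appears in the first k ranked items."""
    hits = sum(1 for target, ranked in rows if target in ranked[:k])
    return hits / len(rows)


# Each row: (target label, ranked list of five generated labels).
rows = [
    ("maple",   ["elm", "oak", "maple", "beech", "poplar"]),
    ("dogwood", ["oak", "dogwood", "poplar", "hickory", "maple"]),
    ("oak",     ["oak", "basswood", "locust", "alder", "linden"]),
    ("linden",  ["maple", "paw-paw", "oak", "basswood", "poplar"]),
    ("oak",     ["locust", "linden", "oak", "maple", "paw-paw"]),
]

print(top_k_accuracy(rows, 1))
print(top_k_accuracy(rows, 3))
```

The fourth row never counts as a hit for any k, because its target label does not appear anywhere in its list.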

toxicity

#language

#Metric

training loss

#fundamentals

#Metric

A loss curve plots training loss versus the number of iterations. A loss curve provides the following hints about training:

A downward slope implies that the model is improving.
An upward slope implies that the model is getting worse.
A flat slope implies that the model has reached convergence .

For example, the following somewhat idealized loss curve shows:

A steep downward slope during the initial iterations, which implies rapid model improvement.
A gradually flattening (but still downward) slope until close to the end of training, which implies continued model improvement at a somewhat slower pace than during the initial iterations.
A flat slope towards the end of training, which suggests convergence.

The plot of training loss versus iterations. This loss curve starts with a steep downward slope. The slope gradually flattens until the slope becomes zero.

Although training loss is important, see also generalization.
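To make the shape of such a curve concrete, here is a minimal sketch that records training loss at each iteration. It assumes a toy one-weight model trained with gradient descent on Mean Squared Error; the data, learning rate, and iteration count are made up for illustration.

```python
# Toy data generated by y = 2x; the model is y_hat = w * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def mse(w):
    """Mean Squared Error of the one-weight model on the toy training data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


w = 0.0
learning_rate = 0.05
training_losses = []
for _ in range(10):
    # Record the loss before this iteration's weight update.
    training_losses.append(mse(w))
    # Gradient of MSE with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad

for iteration, loss in enumerate(training_losses):
    print(iteration, loss)
```

Under these assumptions the recorded losses fall steeply at first and then flatten toward zero, which is the pattern the idealized curve describes.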

true negative (TN)

#fundamentals

#Metric

An example in which the model correctly predicts the negative class. For example, the model infers that a particular email message is not spam, and that email message really is not spam.

true positive (TP)

#fundamentals

#Metric

An example in which the model correctly predicts the positive class. For example, the model infers that a particular email message is spam, and that email message really is spam.

true positive rate (TPR)

#fundamentals

#Metric

Synonym for recall. That is:

$$\text{true positive rate} = \frac {\text{true positives}} {\text{true positives} + \text{false negatives}}$$

True positive rate is the y-axis in an ROC curve.
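As a small sketch of the formula, assuming binary labels encoded as 1 (positive class) and 0 (negative class), true positive rate could be computed from paired labels and predictions like this. The function name and the sample data are illustrative.

```python
def true_positive_rate(y_true, y_pred):
    """TP / (TP + FN) for binary labels encoded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


# Four actual positives: three predicted correctly, one missed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

print(true_positive_rate(y_true, y_pred))  # 0.75
```

Note that this sketch divides by zero when the data contains no actual positives, so a real implementation would need to decide how to handle that case.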

V

validation loss

#fundamentals

#Metric

A metric representing a model's loss on the validation set during a particular iteration of training.

variable importances

#df

#Metric

A set of scores that indicates the relative importance of each feature to the model.

Different variable importance metrics exist, which can inform ML experts about different aspects of models.

W

Wasserstein loss

#Metric

One of the loss functions commonly used in generative adversarial networks, based on the earth mover's distance between the distribution of generated data and real data.
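The following is a rough sketch of one common way this loss is written, assuming a critic that outputs an unbounded score per example and that both losses are minimized. This sign convention is an assumption; other formulations state the same idea as quantities to maximize. The function names and scores are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def critic_loss(real_scores, fake_scores):
    """Minimizing this pushes critic scores up on real data and down on generated data."""
    return mean(fake_scores) - mean(real_scores)


def generator_loss(fake_scores):
    """Minimizing this pushes the critic's scores on generated data up."""
    return -mean(fake_scores)


# Hypothetical critic outputs for three real and three generated examples.
real_scores = [1.0, 2.0, 3.0]
fake_scores = [-2.0, -1.0, 0.0]

print(critic_loss(real_scores, fake_scores))  # -3.0
print(generator_loss(fake_scores))            # 1.0
```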
